# **Retrieving tennis player data from the ATP website**

The objective is to extract the ATP ranking points earned by all professional players during 2025. The ATP website does not offer this data as a download, so it is read directly from the rankings page.


## Load libraries

Load the necessary libraries for web table extraction.


In [13]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time


In [28]:
# Extract the table with all ATP professional players.
from io import StringIO

url = "https://www.atptour.com/en/rankings/singles?rankRange=0-5000&region=all&dateWeek=2025-12-15&SortField=null&SortAscending=null"
response = requests.get(url)
response.raise_for_status() # Raise an HTTPError for bad responses (4xx or 5xx)
html_content = response.text
soup = BeautifulSoup(html_content, 'html.parser')
# Wrap the HTML in StringIO: passing a literal HTML string to read_html is deprecated in recent pandas versions.
dfs_tables = pd.read_html(StringIO(html_content))



In [29]:
# Select the rankings table (the second table on the page) and check its column names.
player_data = dfs_tables[1]
player_data.columns

Index(['Hidden header', 'Rank', 'Player', 'Player.1', 'Player.2', 'Player.3',
       'Player.4', 'Player.5', 'Player.6', 'Age', 'Age.1', 'Official Points',
       'Official Points.1', '+/-', '+/-.1', 'Tourn Played', 'Tourn Played.1',
       'Dropping', 'Dropping.1', 'Next Best', 'Next Best.1'],
      dtype='object')
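
Picking the table by position (`dfs_tables[1]`) depends on the page layout staying the same. As a hedged alternative, the sketch below picks the first table that contains the expected column names. The helper `find_table` and the toy tables are hypothetical and only stand in for the list returned by `pd.read_html`.

```python
import pandas as pd


def find_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(map(str, table.columns)):
            return table
    raise ValueError(f"No table contains the columns {sorted(required)}")


# Toy stand-ins for the list of tables parsed from a page (hypothetical data).
toy_tables = [
    pd.DataFrame({"Menu": ["Home", "Rankings"]}),
    pd.DataFrame({"Player": ["Player A"], "Official Points": [100]}),
]

rankings = find_table(toy_tables, ["Player", "Official Points"])
print(list(rankings.columns))
```

With the real data, the call would be `find_table(dfs_tables, ["Player", "Official Points"])`, assuming those column names are still present on the page.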

In [30]:
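# Keep only the columns that are needed and rename them.
# .copy() makes an independent DataFrame, which avoids a SettingWithCopyWarning when columns are modified later.
player_data_clean = player_data[["Hidden header", "Player", "Age", "Official Points", "Tourn Played"]].copy()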
player_data_clean = player_data_clean.rename(columns={"Hidden header": "rank",
                                                      "Player": "player_name",
                                                      "Age": "age",
                                                      "Official Points": "points",
                                                      "Tourn Played": "n_tournaments"})

In [31]:
# Some player names start with digits or a "-" sign. Strip those leading characters.
player_data_clean['player_name'] = player_data_clean['player_name'].str.replace(r'^[-\d\s]+', '', regex=True)
player_data_clean["player_name"].head(11)

             player_name
0         Carlos Alcaraz
1          Jannik Sinner
2       Alexander Zverev
3         Novak Djokovic
4  Felix Auger-Aliassime
5           Taylor Fritz
6         Alex de Minaur
7        Lorenzo Musetti
8            Ben Shelton
9            Jack Draper
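
To see what the pattern `^[-\d\s]+` does, here is a small sketch on made-up strings. The raw values are hypothetical and only imitate names with leading digits, dashes, or spaces. Because the pattern is anchored at the start of the string, hyphens inside a name are kept.

```python
import pandas as pd

# Hypothetical raw values imitating names with a leading number or "-" sign.
raw_names = pd.Series([
    "- Carlos Alcaraz",
    "2 Jannik Sinner",
    "15  Felix Auger-Aliassime",
    "Taylor Fritz",
])

# Remove any run of dashes, digits, and whitespace at the start of each name.
clean_names = raw_names.str.replace(r"^[-\d\s]+", "", regex=True)
print(clean_names.tolist())
# ['Carlos Alcaraz', 'Jannik Sinner', 'Felix Auger-Aliassime', 'Taylor Fritz']
```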


In [32]:
# The row with index 10 is not a player, so it is dropped from the DataFrame.
player_data_clean = player_data_clean.drop(index=10)
player_data_clean.head(11)

   rank            player_name  age  points  n_tournaments
0     1         Carlos Alcaraz   22   12050             19
1     2          Jannik Sinner   24   11500             18
2     3       Alexander Zverev   28    5160             24
3     4         Novak Djokovic   38    4830             20
4     5  Felix Auger-Aliassime   25    4245             28
5     6           Taylor Fritz   28    4135             23
6     7         Alex de Minaur   26    4135             23
7     8        Lorenzo Musetti   23    4040             23
8     9            Ben Shelton   23    3970             23
9    10            Jack Draper   23    2990             17
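
Dropping a hard-coded index works for this snapshot, but the position of a non-player row could change between scrapes. A possible alternative, sketched on toy data, is to keep only the rows whose points value is numeric. This assumes that non-player rows carry no numeric points, which has not been verified against the real page.

```python
import pandas as pd

# Toy data (hypothetical): the third row imitates a non-player row.
toy = pd.DataFrame({
    "player_name": ["Player A", "Player B", None, "Player C"],
    "points": ["300", "200", None, "100"],
})

# Values that cannot be parsed as numbers become NaN and are filtered out.
points_numeric = pd.to_numeric(toy["points"], errors="coerce")
players_only = toy[points_numeric.notna()]
print(players_only.index.tolist())
# [0, 1, 3]
```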


In [33]:
# Convert the 'points' column to numeric
# 'errors="coerce"' will turn any values that cannot be converted into numbers into 'NaN' (Not a Number)
player_data_clean['points'] = pd.to_numeric(player_data_clean['points'], errors='coerce')

# Verify the data type has changed
player_data_clean['points'].info()

<class 'pandas.core.series.Series'>
Index: 2223 entries, 0 to 2223
Series name: points
Non-Null Count  Dtype
--------------  -----
2223 non-null   int64
dtypes: int64(1)
memory usage: 34.7 KB
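
The effect of `errors="coerce"` is easier to see on a few made-up values. In this sketch the strings are hypothetical: a value with a thousands separator is not parsed by `pd.to_numeric` and becomes `NaN` unless the comma is removed first. Counting the `NaN` values afterwards is a quick way to check whether anything was silently lost.

```python
import pandas as pd

# Hypothetical raw values: a plain number, a number with a comma, and text.
raw_points = pd.Series(["12050", "4,830", "n/a"])

# Direct conversion: both "4,830" and "n/a" become NaN.
direct = pd.to_numeric(raw_points, errors="coerce")
print(direct.isna().sum())  # 2

# Removing the thousands separator first keeps "4,830" as 4830.
cleaned = pd.to_numeric(raw_points.str.replace(",", "", regex=False), errors="coerce")
print(cleaned.isna().sum())  # 1
```

In the output above, all 2223 values are non-null, so no points were lost in the conversion.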


In [34]:
# Store the points of the third-ranked player (index label 2) as the reference value.
points_rank_three = player_data_clean.loc[2, 'points']
print(f"The points for the third row are: {points_rank_three}")

The points for the third row are: 5160


In [36]:
# Absolute difference in points between each player and the third-ranked player.
player_data_clean["diff_rank_three"] = abs(player_data_clean["points"] - points_rank_three)
player_data_clean.head()

   rank            player_name  age  points  n_tournaments  diff_rank_three
0     1         Carlos Alcaraz   22   12050             19             6890
1     2          Jannik Sinner   24   11500             18             6340
2     3       Alexander Zverev   28    5160             24                0
3     4         Novak Djokovic   38    4830             20              330
4     5  Felix Auger-Aliassime   25    4245             28              915
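
The reference value above is looked up by index label, which only matches rank 3 while the index and the ranking stay aligned. A minimal sketch of the same calculation that looks up the reference by the `rank` column instead is shown below, using the top five values from the table. It assumes `rank` holds integers.

```python
import pandas as pd

# Top five rows from the table above.
sample = pd.DataFrame({
    "rank": [1, 2, 3, 4, 5],
    "points": [12050, 11500, 5160, 4830, 4245],
})

# Look up the reference points by rank value rather than by index label.
reference_points = sample.loc[sample["rank"] == 3, "points"].iloc[0]

# Absolute gap to the third-ranked player.
sample["diff_rank_three"] = (sample["points"] - reference_points).abs()
print(sample["diff_rank_three"].tolist())
# [6890, 6340, 0, 330, 915]
```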


In [39]:
# Save the table as a CSV file. The file is then hosted on GitHub and used as the data source for a Vega-Lite chart.
save_path = 'player_points_2025.csv'
player_data_clean.to_csv(save_path, index=False)
