# **Retrieving Tennis player data from the ATP website**

El objetivo final es extraer la información del premio monetario obtenido por los jugadores entre los rankings 1 a 500.
El desafío es que no existe una tabla disponible con la información del price money por jugador. Por lo que se debe armar extrayendo información de dos fuentes de la ATP.


# **Extrayendo la tabla con los primeros 500 jugadores del ranking ATP**

## Load libraries

Load the necessary libraries for web table extraction.


In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time

print("Libraries loaded successfully.")

Libraries loaded successfully.


In [2]:
#Extracting the table with the first 500 atp playes
url = "https://www.atptour.com/en/rankings/singles?rankRange=0-500&region=all&dateWeek=2025-11-17&SortField=null&SortAscending=null"
response = requests.get(url)
response.raise_for_status() # Raise an HTTPError for bad responses (4xx or 5xx)
html_content = response.text
soup = BeautifulSoup(html_content, 'html.parser')
dfs_tables = pd.read_html(html_content)

  dfs_tables = pd.read_html(html_content)


In [3]:
player_data = dfs_tables[1]
player_data.columns

Index(['Hidden header', 'Rank', 'Player', 'Player.1', 'Player.2', 'Player.3',
       'Player.4', 'Player.5', 'Player.6', 'Age', 'Age.1', 'Official Points',
       'Official Points.1', '+/-', '+/-.1', 'Tourn Played', 'Tourn Played.1',
       'Dropping', 'Dropping.1', 'Next Best', 'Next Best.1'],
      dtype='object')

In [4]:
player_data_clean = player_data[["Hidden header", "Player", "Age", "Official Points", "Tourn Played"]]
player_data_clean = player_data_clean.rename(columns={"Hidden header": "rank",
                                                      "Player": "player_name",
                                                      "Age": "age",
                                                      "Official Points": "points",
                                                      "Tourn Played": "n_tournaments"})

In [5]:
player_data_clean['player_name'].to_string() #Hay nombres de jugadores con números o el signo -. Se eliminan
player_data_clean['player_name'] = player_data_clean['player_name'].str.replace(r'^[-\d\s]+', '', regex=True)
player_data_clean["player_name"].head(11)

Unnamed: 0,player_name
0,Carlos Alcaraz
1,Jannik Sinner
2,Alexander Zverev
3,Novak Djokovic
4,Felix Auger-Aliassime
5,Taylor Fritz
6,Alex de Minaur
7,Lorenzo Musetti
8,Ben Shelton
9,Jack Draper


In [6]:
#Eliminar la fila con índice 10 del DataFrame. Nos es un jugador
player_data_clean = player_data_clean.drop(index=10)
player_data_clean.head(11)

Unnamed: 0,rank,player_name,age,points,n_tournaments
0,1,Carlos Alcaraz,22,12050,19
1,2,Jannik Sinner,24,11500,18
2,3,Alexander Zverev,28,5160,24
3,4,Novak Djokovic,38,4830,20
4,5,Felix Auger-Aliassime,25,4245,28
5,6,Taylor Fritz,28,4135,23
6,7,Alex de Minaur,26,4135,23
7,8,Lorenzo Musetti,23,4040,23
8,9,Ben Shelton,23,3970,23
9,10,Jack Draper,23,2990,17


La primera tabla ya está ordenada y limpia. Ahora obtendremos la información del price money de cada uno de estos jugadores para el año 2025.

In [7]:
#Obtener el link con el overview de cada jugador
player_links = soup.find_all('a', href=lambda href: href and '/en/players/' in href and '/overview' in href)
player_overview_urls = []
for link in player_links:
          full_url = "https://www.atptour.com" + link['href']
          if full_url not in player_overview_urls:
            player_overview_urls.append(full_url)
player_overview_urls[0:5]

['https://www.atptour.com/en/players/carlos-alcaraz/a0e2/overview',
 'https://www.atptour.com/en/players/jannik-sinner/s0ag/overview',
 'https://www.atptour.com/en/players/alexander-zverev/z355/overview',
 'https://www.atptour.com/en/players/novak-djokovic/d643/overview',
 'https://www.atptour.com/en/players/felix-auger-aliassime/ag37/overview']

In [8]:
#Url de la actividad de cada jugador
player_activity_urls = []
activity_suffix = "player-activity?matchType=Singles&year=2025&tournament=all"

for url in player_overview_urls:
    # Reemplazar "overview" por la nueva cadena de actividad
    activity_url = url.replace("/overview", f"/{activity_suffix}")
    player_activity_urls.append(activity_url)
player_activity_urls[:5]


['https://www.atptour.com/en/players/carlos-alcaraz/a0e2/player-activity?matchType=Singles&year=2025&tournament=all',
 'https://www.atptour.com/en/players/jannik-sinner/s0ag/player-activity?matchType=Singles&year=2025&tournament=all',
 'https://www.atptour.com/en/players/alexander-zverev/z355/player-activity?matchType=Singles&year=2025&tournament=all',
 'https://www.atptour.com/en/players/novak-djokovic/d643/player-activity?matchType=Singles&year=2025&tournament=all',
 'https://www.atptour.com/en/players/felix-auger-aliassime/ag37/player-activity?matchType=Singles&year=2025&tournament=all']

In [9]:
#Incluir en el data frame de los jugadores las urls. Como la información se obtuve de manera ordenada, se pueden pegar fácilmente
player_data_clean['player_overview_url'] = player_overview_urls
player_data_clean['player_activity_url'] = player_activity_urls
player_data_clean.tail()

Unnamed: 0,rank,player_name,age,points,n_tournaments,player_overview_url,player_activity_url
496,496,Maxime Janvier,29,87,28,https://www.atptour.com/en/players/maxime-janv...,https://www.atptour.com/en/players/maxime-janv...
497,497,Florian Broska,27,86,18,https://www.atptour.com/en/players/florian-bro...,https://www.atptour.com/en/players/florian-bro...
498,498,Bor Artnak,21,86,18,https://www.atptour.com/en/players/bor-artnak/...,https://www.atptour.com/en/players/bor-artnak/...
499,499,Egor Agafonov,23,86,19,https://www.atptour.com/en/players/egor-agafon...,https://www.atptour.com/en/players/egor-agafon...
500,500,Oleksandr Ovcharenko,24,86,23,https://www.atptour.com/en/players/oleksandr-o...,https://www.atptour.com/en/players/oleksandr-o...


In [10]:
# Función para extraer el id de cada jugador y se guarda en la misma tabla.
def extract_player_id(url: str) -> str:
    """
    Extrae el id del jugador desde la URL player_overview_url.
    Asume formato /players/.../{player_id}/overview
    """
    if not isinstance(url, str):
        return None
    parts = url.strip("/").split("/")
    # Últimos elementos: [..., player_id, 'overview']
    if len(parts) >= 2:
        return parts[-2]
    return None

player_data_clean["player_id"] = player_data_clean["player_overview_url"].apply(extract_player_id)
player_data_clean.head()

Unnamed: 0,rank,player_name,age,points,n_tournaments,player_overview_url,player_activity_url,player_id
0,1,Carlos Alcaraz,22,12050,19,https://www.atptour.com/en/players/carlos-alca...,https://www.atptour.com/en/players/carlos-alca...,a0e2
1,2,Jannik Sinner,24,11500,18,https://www.atptour.com/en/players/jannik-sinn...,https://www.atptour.com/en/players/jannik-sinn...,s0ag
2,3,Alexander Zverev,28,5160,24,https://www.atptour.com/en/players/alexander-z...,https://www.atptour.com/en/players/alexander-z...,z355
3,4,Novak Djokovic,38,4830,20,https://www.atptour.com/en/players/novak-djoko...,https://www.atptour.com/en/players/novak-djoko...,d643
4,5,Felix Auger-Aliassime,25,4245,28,https://www.atptour.com/en/players/felix-auger...,https://www.atptour.com/en/players/felix-auger...,ag37


In [11]:
players_df = player_data_clean

In [12]:
players_df.to_csv("atp_ranking_2025.csv", index=False)

In [None]:
BASE_ACTIVITY_URL = "https://www.atptour.com/en/-/www/activity/sgl/{player_id}/?v=1"

#Función para descargar el JSON de activity de cada jugador

def fetch_activity_json(player_id: str, session: requests.Session = None, timeout: int = 20):
    """
    Descarga el JSON de activity para un player_id.
    Devuelve un dict (JSON parseado) o None si falla.
    """
    url = BASE_ACTIVITY_URL.format(player_id=player_id)
    sess = session or requests.Session()

    try:
        resp = sess.get(url, timeout=timeout)
        if resp.status_code != 200:
            print(f"[WARN] {player_id}: status {resp.status_code}")
            return None
        return resp.json()
    except Exception as e:
        print(f"[ERROR] {player_id}: {e}")
        return None

In [None]:
def activity_json_to_rows(player_id: str, data: dict) -> list[dict]:
    """
    Convierte el JSON de Activity del ATP en filas (año–evento),
    incluyendo el player_id en cada fila.
    """
    if data is None:
        return []

    activity = data.get("Activity", [])
    rows = []

    for year_block in activity:
        year = year_block.get("EventYear")

        tournaments = year_block.get("Tournaments", [])
        for t in tournaments:
            rows.append({
                "player_id": player_id,
                "year": year,
                "event_id": t.get("EventId"),
                "event_name": t.get("EventName"),
                "event_title": t.get("EventDisplayName"),
                "prize_raw": t.get("Prize"),          # premio en moneda local
                "currency": t.get("CurrSymbol"),      # "$", "€", "£", etc.
                "prize_usd": t.get("PrizeUsd"),       # premio convertido a USD
            })

    return rows

In [None]:
activity_rows = activity_json_to_rows(player_id="a0e2", data=resp)
df = pd.DataFrame(activity_rows)
print(df.head(10))

  player_id  year event_id                   event_name  \
0      a0e2  2025      605             Nitto ATP Finals   
1      a0e2  2025      352       ATP Masters 1000 Paris   
2      a0e2  2025      329                        Tokyo   
3      a0e2  2025     9210                    Laver Cup   
4      a0e2  2025      560                      US Open   
5      a0e2  2025      422  ATP Masters 1000 Cincinnati   
6      a0e2  2025      540                    Wimbledon   
7      a0e2  2025      311        London / Queen's Club   
8      a0e2  2025      520                Roland Garros   
9      a0e2  2025      416        ATP Masters 1000 Rome   

                                       event_title  prize_raw currency  \
0                                 Nitto ATP Finals    2704000        $   
1                              Rolex Paris Masters      44220        €   
2  Kinoshita Group Japan Open Tennis Championships     416365        $   
3                                        Laver Cup    

In [None]:
# Descargar la información de los jugadores que están en la primera tabla descargada.
session = requests.Session()

all_rows = []

for i, row in players_df.iterrows():
    player_id = row["player_id"]

    # Saltar si falta el id
    if pd.isna(player_id):
        continue

    print(f"[{i+1}/{len(players_df)}] Descargando activity de {player_id}...")

    resp = fetch_activity_json(player_id=player_id, session=session)

    # Si falló la descarga, seguimos con el siguiente
    if resp is None:
        continue

    player_rows = activity_json_to_rows(player_id=player_id, data=resp)

    # Si el jugador no tiene actividad, seguimos
    if not player_rows:
        continue

    # Agregamos las filas de este jugador a la lista global
    all_rows.extend(player_rows)

    # Pequeña pausa para no pegarle tan fuerte a la web
    time.sleep(0.3)

# Crear el DataFrame final con TODOS los jugadores
activity_all_df = pd.DataFrame(all_rows)

print(activity_all_df.head())
print(activity_all_df.shape)

[1/500] Descargando activity de a0e2...
[2/500] Descargando activity de s0ag...
[3/500] Descargando activity de z355...
[4/500] Descargando activity de d643...
[5/500] Descargando activity de ag37...
[6/500] Descargando activity de fb98...
[7/500] Descargando activity de dh58...
[8/500] Descargando activity de m0ej...
[9/500] Descargando activity de s0s1...
[10/500] Descargando activity de d0co...
[12/500] Descargando activity de bk92...
[13/500] Descargando activity de rh16...
[14/500] Descargando activity de mm58...
[15/500] Descargando activity de dh50...
[16/500] Descargando activity de r0dg...
[17/500] Descargando activity de re44...
[18/500] Descargando activity de l0bv...
[19/500] Descargando activity de ke29...
[20/500] Descargando activity de m0ni...
[21/500] Descargando activity de pl56...
[22/500] Descargando activity de c0au...
[23/500] Descargando activity de c0e9...
[24/500] Descargando activity de su55...
[25/500] Descargando activity de f0fv...
[26/500] Descargando acti

In [None]:
activity_all_df.head()

Unnamed: 0,player_id,year,event_id,event_name,event_title,prize_raw,currency,prize_usd
0,a0e2,2025,605,Nitto ATP Finals,Nitto ATP Finals,2704000,$,2704000
1,a0e2,2025,352,ATP Masters 1000 Paris,Rolex Paris Masters,44220,€,51410
2,a0e2,2025,329,Tokyo,Kinoshita Group Japan Open Tennis Championships,416365,$,416365
3,a0e2,2025,9210,Laver Cup,Laver Cup,0,$,0
4,a0e2,2025,560,US Open,US Open,5000000,$,5000000


In [None]:
import datetime
today = datetime.date.today().strftime("%Y-%m-%d")

activity_all_df.to_csv(f"atp_players_activity_{today}.csv", index=False)

In [None]:
prize_by_year = df_filtered.groupby("year")["prize_usd"].sum()
print(prize_by_year)

year
2018         438
2019       12212
2020       81932
2021     1617820
2022     7627612
2023    10753431
2024     9850338
2025    18803427
Name: prize_usd, dtype: int64


In [None]:
rows = []

for year_block in activity:          # cada bloque es un año
    year = year_block["EventYear"]
    for t in year_block["Tournaments"]:   # cada torneo dentro del año
        rows.append({
            "year": year,
            "event_id": t["EventId"],
            "event_name": t["EventName"],
            "event_title": t["EventDisplayName"],
            "prize_raw": t["Prize"],          # premio en la moneda local
            "currency": t["CurrSymbol"],      # "$", "€", "£", etc.
            "prize_usd": t["PrizeUsd"],       # premio convertido a USD
        })

# Ahora rows tiene un registro por evento-año
for r in rows[:5]:
    print(r)

{'year': 2025, 'event_id': '605', 'event_name': 'Nitto ATP Finals', 'event_title': 'Nitto ATP Finals', 'prize_raw': 5071000, 'currency': '$', 'prize_usd': 5071000}
{'year': 2025, 'event_id': '352', 'event_name': 'ATP Masters 1000 Paris', 'event_title': 'Rolex Paris Masters', 'prize_raw': 946610, 'currency': '€', 'prize_usd': 1100529}
{'year': 2025, 'event_id': '337', 'event_name': 'Vienna', 'event_title': 'Erste Bank Open', 'prize_raw': 511835, 'currency': '€', 'prize_usd': 596339}
{'year': 2025, 'event_id': '5014', 'event_name': 'ATP Masters 1000 Shanghai', 'event_title': 'Rolex Shanghai Masters', 'prize_raw': 60400, 'currency': '$', 'prize_usd': 60400}
{'year': 2025, 'event_id': '747', 'event_name': 'Beijing', 'event_title': 'China Open', 'prize_raw': 751075, 'currency': '$', 'prize_usd': 751075}


In [None]:
df = pd.DataFrame(rows)
print(df.head(40))

    year event_id                     event_name  \
0   2025      605               Nitto ATP Finals   
1   2025      352         ATP Masters 1000 Paris   
2   2025      337                         Vienna   
3   2025     5014      ATP Masters 1000 Shanghai   
4   2025      747                        Beijing   
5   2025      560                        US Open   
6   2025      422    ATP Masters 1000 Cincinnati   
7   2025      540                      Wimbledon   
8   2025      500                          Halle   
9   2025      520                  Roland Garros   
10  2025      416          ATP Masters 1000 Rome   
11  2025      580                Australian Open   
12  2024     4481                 500 Bonus Pool   
13  2024      607                1000 Bonus Pool   
14  2024      901               Davis Cup Finals   
15  2024      605               Nitto ATP Finals   
16  2024     5014      ATP Masters 1000 Shanghai   
17  2024      747                        Beijing   
18  2024    

In [None]:
df_filtered[df_filtered["year"] == 2021]

Unnamed: 0,year,event_id,event_name,event_title,prize_raw,currency,prize_usd
78,2021,607,Bonus Prize Money,Bonus Prize Money,145000,$,145000
79,2021,901,Davis Cup Finals,Davis Cup Finals,0,€,0
80,2021,605,Nitto ATP Finals,Nitto ATP Finals,266000,$,266000
81,2021,429,Stockholm,Stockholm,11230,€,12989
82,2021,352,ATP Masters 1000 Paris,ATP Masters 1000 Paris,39120,€,45544
83,2021,337,Vienna,Vienna,103000,€,119913
84,2021,7485,Antwerp,Antwerp,49885,€,57832
85,2021,404,ATP Masters 1000 Indian Wells,ATP Masters 1000 Indian Wells,92000,$,92000
86,2021,7434,Sofia,Sofia,41145,€,48218
87,2021,560,US Open,US Open,265000,$,265000


In [None]:
exclude = ["500 Bonus Pool", "1000 Bonus Pool", "Profit Sharing", "Profit Share", "Bonus Prize Money"]
df_filtered = activity_all_df[~activity_all_df["event_name"].isin(exclude)]

In [None]:
df_filtered.head()

Unnamed: 0,player_id,year,event_id,event_name,event_title,prize_raw,currency,prize_usd
0,a0e2,2025,605,Nitto ATP Finals,Nitto ATP Finals,2704000,$,2704000
1,a0e2,2025,352,ATP Masters 1000 Paris,Rolex Paris Masters,44220,€,51410
2,a0e2,2025,329,Tokyo,Kinoshita Group Japan Open Tennis Championships,416365,$,416365
3,a0e2,2025,9210,Laver Cup,Laver Cup,0,$,0
4,a0e2,2025,560,US Open,US Open,5000000,$,5000000


In [None]:
prize_by_year = df_filtered.groupby(["player_id", "year"])["prize_usd"].sum()
prize_by_year.tail(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,prize_usd
player_id,year,Unnamed: 2_level_1
z371,2019,148968
z371,2020,24899
z371,2021,125712
z371,2022,199673
z371,2023,1068483
z371,2024,1448942
z371,2025,378405
z419,2013,624
z419,2014,2738
z419,2015,6967
