In [1]:
import pandas as pd
import numpy as np

# An Introduction to the Data

The data is Jeff Sackmann's ATP match data. I have downloaded match data from 1998 to 2024, as this period covers the careers of Federer, Djokovic, Nadal and Murray. It is worth noting that Djokovic is still an active player, but there is not yet data for 2025. The data is stored in one csv file per year, but has been compiled into a single csv, "combined_ATP_results.csv". Let's start by loading the data as a dataframe.

In [2]:
combined_df = pd.read_csv("../data/processed/combined_ATP_results.csv")
combined_df.head()

Unnamed: 0.1,Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,...,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
0,0,1998-339,Adelaide,Hard,32,A,19980105,1,102035,1.0,...,29.0,18.0,11.0,9.0,2.0,5.0,4.0,2949.0,74.0,649.0
1,1,1998-339,Adelaide,Hard,32,A,19980105,2,101727,,...,32.0,19.0,11.0,8.0,9.0,13.0,79.0,617.0,87.0,537.0
2,2,1998-339,Adelaide,Hard,32,A,19980105,3,102765,,...,38.0,18.0,7.0,9.0,5.0,12.0,93.0,521.0,71.0,665.0
3,3,1998-339,Adelaide,Hard,32,A,19980105,4,102563,7.0,...,37.0,21.0,10.0,11.0,1.0,6.0,39.0,959.0,76.0,633.0
4,4,1998-339,Adelaide,Hard,32,A,19980105,5,102796,4.0,...,57.0,33.0,20.0,13.0,12.0,17.0,22.0,1450.0,65.0,708.0
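
The `Unnamed: 0` columns in the preview are what pandas produces when a csv that was saved with its row index is read back in. As a minimal sketch of how the per-year files could be compiled without that artifact: the function name, the directory arguments and the file-name pattern below are assumptions for illustration, not the code actually used to build "combined_ATP_results.csv".

```python
from pathlib import Path

import pandas as pd


def combine_yearly_csvs(raw_dir, out_path, pattern="atp_matches_[0-9][0-9][0-9][0-9].csv"):
    """Concatenate per-year match files into one csv.

    raw_dir, out_path and pattern are hypothetical; adjust them to the real layout.
    """
    files = sorted(Path(raw_dir).glob(pattern))
    frames = [pd.read_csv(f) for f in files]
    combined = pd.concat(frames, ignore_index=True)
    # index=False stops pandas writing the row index as an extra column,
    # which would otherwise come back as "Unnamed: 0" on the next read_csv
    combined.to_csv(out_path, index=False)
    return combined
```

For a file that already contains the stray columns, dropping every column whose name starts with `Unnamed` after loading would be an alternative.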


In [22]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81831 entries, 0 to 81830
Data columns (total 50 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          81831 non-null  int64  
 1   tourney_id          81831 non-null  object 
 2   tourney_name        81831 non-null  object 
 3   surface             81778 non-null  object 
 4   draw_size           81831 non-null  int64  
 5   tourney_level       81831 non-null  object 
 6   tourney_date        81831 non-null  int64  
 7   match_num           81831 non-null  int64  
 8   winner_id           81831 non-null  int64  
 9   winner_seed         33619 non-null  float64
 10  winner_entry        10389 non-null  object 
 11  winner_name         81831 non-null  object 
 12  winner_hand         81831 non-null  object 
 13  winner_ht           80219 non-null  float64
 14  winner_ioc          81831 non-null  object 
 15  winner_age          81826 non-null  float64
 16  lose

Many of these features are not important for our question, such as handedness, height or match number. These will likely be removed in cleaning. There are also features with a lot of missing data, such as seed (33619 non-null values out of 81831) or entry (10389 non-null). These can be ignored for our analysis and removed from the dataset, as they are incomplete and provide only secondary information for our question.
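
As a rough sketch of how this removal could be done in cleaning, the helper below drops any column whose share of missing values is above a threshold, plus any columns named explicitly. The function name and the 0.5 threshold are assumptions chosen for illustration.

```python
import pandas as pd


def drop_sparse_columns(df, max_missing=0.5, also_drop=()):
    """Drop columns with more than max_missing share of NaN, plus named extras.

    max_missing=0.5 is an assumed threshold, not a value taken from the analysis.
    """
    missing_share = df.isna().mean()  # fraction of missing values per column
    sparse = missing_share[missing_share > max_missing].index.tolist()
    extras = [c for c in also_drop if c in df.columns and c not in sparse]
    return df.drop(columns=sparse + extras)
```

With this threshold, a call such as `drop_sparse_columns(combined_df, also_drop=["winner_hand", "winner_ht", "match_num"])` would remove `winner_seed` and `winner_entry` among the columns listed above, since both are more than half missing.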

# Looking into Players
When we look at each matchup, we can also see that many somewhat 'irrelevant' matches are listed. As we are only interested in the 'elite' players, the dataset could be reduced by excluding lower-level players.

In [23]:
combined_df[["winner_name", "loser_name"]]

Unnamed: 0,winner_name,loser_name
0,Jonas Bjorkman,Grant Stafford
1,Jason Stoltenberg,Juan Antonio Marin
2,Nicolas Escude,Alex Radulescu
3,Thomas Johansson,Byron Black
4,Magnus Norman,Christian Ruud
...,...,...
81826,Joaquin Aguilar Cardozo,Ilya Snitari
81827,Nam Hoang Ly,Philip Henning
81828,Kris Van Wyk,Linh Giang Trinh
81829,Nam Hoang Ly,Kris Van Wyk


In [26]:
all_players = pd.concat([combined_df["winner_name"], combined_df["loser_name"]]).unique()
print(f"There are {len(all_players)} unique players in the dataset.")

There are 2840 unique players in the dataset.


In [36]:
matches_per_player = pd.concat([combined_df["winner_name"], combined_df["loser_name"]]).value_counts()
print(matches_per_player[matches_per_player > 800])

Roger Federer            1545
Novak Djokovic           1363
Rafael Nadal             1326
David Ferrer             1119
Richard Gasquet          1022
Andy Murray              1017
Fernando Verdasco        1007
Feliciano Lopez          1004
Tomas Berdych             988
Stan Wawrinka             948
Marin Cilic               933
Mikhail Youzhny           922
Gael Monfils              909
Gilles Simon              904
Tommy Robredo             896
Lleyton Hewitt            882
Philipp Kohlschreiber     882
Tommy Haas                873
Andy Roddick              827
Nikolay Davydenko         822
Fabio Fognini             819
Andreas Seppi             810
John Isner                809
Name: count, dtype: int64


In [42]:
matches_per_player = pd.concat([combined_df["winner_name"], combined_df["loser_name"]]).value_counts()
steps = [25, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
for step in steps:
    print(f"There are {len(matches_per_player[matches_per_player > step])} players with more than {step} matches.")

There are 715 players with more than 25 matches.
There are 553 players with more than 50 matches.
There are 398 players with more than 100 matches.
There are 253 players with more than 200 matches.
There are 172 players with more than 300 matches.
There are 119 players with more than 400 matches.
There are 78 players with more than 500 matches.
There are 53 players with more than 600 matches.
There are 31 players with more than 700 matches.
There are 23 players with more than 800 matches.
There are 14 players with more than 900 matches.
There are 8 players with more than 1000 matches.


In [35]:
player_wins = combined_df["winner_name"].value_counts()
print(player_wins[player_wins > 500])

winner_name
Roger Federer        1265
Novak Djokovic       1139
Rafael Nadal         1091
Andy Murray           748
David Ferrer          740
Tomas Berdych         643
Lleyton Hewitt        619
Andy Roddick          612
Richard Gasquet       609
Marin Cilic           592
Stan Wawrinka         581
Gael Monfils          567
Fernando Verdasco     559
Tommy Haas            546
Tommy Robredo         534
Feliciano Lopez       512
Gilles Simon          508
Name: count, dtype: int64


In [40]:
player_wins = combined_df["winner_name"].value_counts()
steps = [10, 20, 50, 100, 200, 300, 400, 500, 600, 700, 800]
for step in steps:
    print(f"There are {len(player_wins[player_wins > step])} players with more than {step} wins.")

There are 700 players with more than 10 wins.
There are 537 players with more than 20 wins.
There are 355 players with more than 50 wins.
There are 230 players with more than 100 wins.
There are 132 players with more than 200 wins.
There are 64 players with more than 300 wins.
There are 34 players with more than 400 wins.
There are 17 players with more than 500 wins.
There are 9 players with more than 600 wins.
There are 5 players with more than 700 wins.
There are 3 players with more than 800 wins.
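
The match and win counts above can also be held in one per-player table, which is closer to the kind of dataset that feature engineering would need. This is only a sketch: the function name and the `win_rate` column are my own additions for illustration.

```python
import pandas as pd


def player_summary(matches):
    """Build one row per player with wins, losses, matches and win rate."""
    wins = matches["winner_name"].value_counts()
    losses = matches["loser_name"].value_counts()
    # Aligning on the union of names; players missing from one side get 0
    summary = pd.DataFrame({"wins": wins, "losses": losses}).fillna(0).astype(int)
    summary["matches"] = summary["wins"] + summary["losses"]
    summary["win_rate"] = summary["wins"] / summary["matches"]
    return summary.sort_values("matches", ascending=False)
```

Calling `player_summary(combined_df).head(20)` would then show the most active players alongside their win rates.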


We can see our players of interest are, unsurprisingly, at or close to the top for both matches played and matches won. The elite group is also apparent, with fewer than 20 players having over 500 wins, out of the nearly 3000 unique players in the data. 
I believe an Elo system could be implemented well to distinguish high-level players, and a new dataset created in feature engineering to hold stats for each player, making later analysis and modelling easier. 
Furthermore, with only 715 of the 2840 unique players having more than 25 matches, and over 81000 match entries, the same players play each other many times. I therefore believe it would be foolish to reduce the dataset by player at this point; it would be better to ignore some players later during analysis, as required. These 'lower level' players may also have matches against our players of interest, giving further justification for not removing them at this stage. 
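
To make the Elo idea concrete, here is a minimal sketch of the standard Elo update applied to a list of (winner, loser) pairs. The K-factor of 32, the starting rating of 1500 and the function names are assumptions for illustration; a tennis-specific system might tune these or weight by tournament level.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def run_elo(matches, k=32, start=1500):
    """Rate players from (winner, loser) pairs given in chronological order.

    k and start are assumed defaults, not values chosen by the analysis.
    """
    ratings = defaultdict(lambda: float(start))
    for winner, loser in matches:
        exp_win = expected_score(ratings[winner], ratings[loser])
        delta = k * (1.0 - exp_win)  # winner scored 1, expected exp_win
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)
```

On this dataset the pairs could come from `combined_df.sort_values(["tourney_date", "match_num"])[["winner_name", "loser_name"]].itertuples(index=False)`, with the caveat that this ordering only approximates true chronological order, since every match in a tournament shares the same `tourney_date`.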

In [None]:
# code to show number of matchups between each pair of players


              winner_name           loser_name  match_count
3329         Andre Agassi  Jan Michael Gambill           11
4703          Andy Murray         David Ferrer           15
4724          Andy Murray      Feliciano Lopez           11
4726          Andy Murray    Fernando Verdasco           13
4739          Andy Murray         Gilles Simon           16
...                   ...                  ...          ...
47937       Stan Wawrinka          Marin Cilic           13
48021       Stan Wawrinka        Tomas Berdych           11
48262  Stefanos Tsitsipas       Alex De Minaur           11
50910       Tomas Berdych    Fernando Verdasco           11
50973       Tomas Berdych       Kevin Anderson           12

[95 rows x 3 columns]
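
The code for the cell above was not kept, so the following is a sketch of one way a table of this shape could be produced, not the original code. The function name is hypothetical, and the cutoff of 11 is inferred from the smallest `match_count` visible in the output, so it may not match the filter actually used.

```python
import pandas as pd


def matchup_counts(matches, min_count=11):
    """Count how often each winner beat each loser, keeping frequent pairs.

    min_count=11 is an assumption inferred from the visible output rows.
    """
    counts = (
        matches.groupby(["winner_name", "loser_name"])
        .size()
        .reset_index(name="match_count")
    )
    return counts[counts["match_count"] >= min_count]
```

Note that counting this way is directional: each row is the number of times the first player beat the second, not the total number of meetings between the pair.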


# A Summary Looking Toward Cleaning