# Features per player

These features were chosen based on a mix of features commonly used in tennis match analysis and my own long-term experience as a tennis player.

## Basic information
- player ID
- player name (first and last)
- age at match
- current ATP points
- dominant hand
- date of birth (dob)
- country code (ioc)
- height
- current rank

## Career information
- percentage of matches won
- Grand Slam win percentage
- Wimbledon win percentage
- years on tour
- average rank of opponents beaten
- common opponents difference


## Per Match Statistics
- average aces
- average double faults
- average first serve percentage
- average first serve points won
- first serve consistency score (percentage * won)
- average second serve points won
- average break points won
- average break points saved
- average winners
- average unforced errors
- average forced errors
- effectiveness score (winners - (unforced + forced errors))
- average total points/sets won
- average rally count
- average serve direction
- serve variation score
- average serve width
- average serve depth
- average return depth
- average serve speed
- average distance run
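
The two composite scores in this list could be computed roughly as follows. This is only a sketch on made-up numbers: the column names (`svpt`, `first_in`, `first_won`, `winners`, `unforced_errors`, `forced_errors`) are hypothetical, and reading "percentage * won" as first serve percentage times the share of first serve points won is an assumption.

```python
import pandas as pd

# Made-up per-match numbers for one player (hypothetical column names)
stats = pd.DataFrame({
    'svpt': [80, 100],           # service points played
    'first_in': [48, 65],        # first serves in
    'first_won': [36, 52],       # points won behind the first serve
    'winners': [30, 25],
    'unforced_errors': [20, 15],
    'forced_errors': [12, 8],
})

# First serve percentage and share of first serve points won
stats['first_serve_pct'] = stats['first_in'] / stats['svpt']
stats['first_serve_won_pct'] = stats['first_won'] / stats['first_in']

# Consistency score: percentage * won (assumed interpretation)
stats['first_serve_consistency'] = (
    stats['first_serve_pct'] * stats['first_serve_won_pct']
)

# Effectiveness score: winners - (unforced + forced errors)
stats['effectiveness'] = stats['winners'] - (
    stats['unforced_errors'] + stats['forced_errors']
)

# Per-player averages over all matches
print(stats[['first_serve_consistency', 'effectiveness']].mean())
```

With this reading, the consistency score is simply the share of all service points that were won behind a first serve.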

## Tourney information
- average level of tourney -> tourney score
- tourney round
- is home country
- match type (number of winning sets)
- average tourney location
- percentage of matches played per surface (hard, clay, grass, carpet)
- surface speed score
- win percentage last 5 games
- fatigue score (number of games played in the last 7 days)
- rest days since last match
- (consecutive tournaments played)
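
The scheduling features (rest days and fatigue score) could be derived from a per-player match log along these lines. This is a sketch under assumptions: the frame `log` and its columns `player_id` and `match_date` are made up for illustration, and the fatigue score here counts matches rather than individual games in the 7 days before each match.

```python
import pandas as pd

# Made-up match log for two hypothetical players
log = pd.DataFrame({
    'player_id': [1, 1, 1, 1, 2, 2],
    'match_date': pd.to_datetime([
        '2019-01-01', '2019-01-03', '2019-01-05', '2019-01-20',
        '2019-01-02', '2019-01-10',
    ]),
}).sort_values(['player_id', 'match_date']).reset_index(drop=True)

# Rest days: days since the player's previous match (NaN for the first one)
log['rest_days'] = log.groupby('player_id')['match_date'].diff().dt.days

# Fatigue score: matches the same player played in the 7 days before
# each match. Quadratic in the number of rows, fine for a small sketch.
window = pd.Timedelta(days=7)
log['fatigue'] = [
    (
        (log['player_id'] == pid)
        & (log['match_date'] < date)
        & (log['match_date'] >= date - window)
    ).sum()
    for pid, date in zip(log['player_id'], log['match_date'])
]

print(log)
```

Both columns only look backwards in time, so the features for a match do not leak information from that match or later ones.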

A first look at the data to get to know its structure:

In [12]:
import pandas as pd

# Load a test file
match = pd.read_csv('../data/atp_matches/atp_matches_2019.csv')

print(match.head())
match.columns

  tourney_id tourney_name surface  draw_size tourney_level  tourney_date  \
0  2019-M020     Brisbane    Hard         32             A      20181231   
1  2019-M020     Brisbane    Hard         32             A      20181231   
2  2019-M020     Brisbane    Hard         32             A      20181231   
3  2019-M020     Brisbane    Hard         32             A      20181231   
4  2019-M020     Brisbane    Hard         32             A      20181231   

   match_num  winner_id  winner_seed winner_entry  ... l_1stIn l_1stWon  \
0        300     105453          2.0          NaN  ...    54.0     34.0   
1        299     106421          4.0          NaN  ...    52.0     36.0   
2        298     105453          2.0          NaN  ...    27.0     15.0   
3        297     104542          NaN           PR  ...    60.0     38.0   
4        296     106421          4.0          NaN  ...    56.0     46.0   

   l_2ndWon l_SvGms  l_bpSaved  l_bpFaced  winner_rank winner_rank_points  \
0      20.0    

Index(['tourney_id', 'tourney_name', 'surface', 'draw_size', 'tourney_level',
       'tourney_date', 'match_num', 'winner_id', 'winner_seed', 'winner_entry',
       'winner_name', 'winner_hand', 'winner_ht', 'winner_ioc', 'winner_age',
       'loser_id', 'loser_seed', 'loser_entry', 'loser_name', 'loser_hand',
       'loser_ht', 'loser_ioc', 'loser_age', 'score', 'best_of', 'round',
       'minutes', 'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon', 'w_2ndWon',
       'w_SvGms', 'w_bpSaved', 'w_bpFaced', 'l_ace', 'l_df', 'l_svpt',
       'l_1stIn', 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved', 'l_bpFaced',
       'winner_rank', 'winner_rank_points', 'loser_rank', 'loser_rank_points'],
      dtype='object')

In [38]:
players = pd.read_csv('../data/atp_matches/atp_players.csv')

print(players.head())
print(players['hand'].unique())
players.columns

   player_id name_first name_last hand         dob  ioc  height wikidata_id
0     100001    Gardnar    Mulloy    R  19131122.0  USA   185.0      Q54544
1     100002     Pancho    Segura    R  19210620.0  ECU   168.0      Q54581
2     100003      Frank   Sedgman    R  19271002.0  AUS   180.0     Q962049
3     100004   Giuseppe     Merlo    R  19271011.0  ITA     NaN    Q1258752
4     100005    Richard  Gonzalez    R  19280509.0  USA   188.0      Q53554
['R' 'L' 'U' 'A' nan]




Index(['player_id', 'name_first', 'name_last', 'hand', 'dob', 'ioc', 'height',
       'wikidata_id'],
      dtype='object')

In [24]:
rankings = pd.read_csv('../data/atp_matches/atp_rankings_current.csv')
rankings = rankings.sort_values(by='rank')

print(rankings.head())
rankings.columns

       ranking_date  rank  player  points
0          20240101     1  104925   11245
62449      20240923     1  206173   11180
64575      20240930     1  206173   11010
12340      20240219     1  104925    9855
66704      20241014     1  206173   11920


Index(['ranking_date', 'rank', 'player', 'points'], dtype='object')

In [27]:
# Combine rankings with players
rankings = rankings.rename(columns={'player': 'player_id'})

ranked_players = pd.merge(rankings, players, on='player_id', how='left')
ranked_players = ranked_players.sort_values(by=['ranking_date', 'rank'], ascending=[False, True])
print(ranked_players.head())

     ranking_date  rank  player_id  points name_first name_last hand  \
42       20241230     1     206173   11830     Jannik    Sinner    R   
49       20241230     2     100644    7915  Alexander    Zverev    R   
105      20241230     3     207989    7010     Carlos   Alcaraz    R   
134      20241230     4     126203    5100     Taylor     Fritz    R   
209      20241230     5     106421    5030     Daniil  Medvedev    R   

            dob  ioc  height wikidata_id  
42   20010816.0  ITA   191.0   Q54812588  
49   19970420.0  GER   198.0   Q13990552  
105  20030505.0  ESP   183.0   Q85518537  
134  19971028.0  USA   196.0   Q17660516  
209  19960211.0  RUS   198.0   Q21622022  


In [34]:
# Keep the most recent entry for each rank
# (note: rows for different ranks can come from different ranking dates)
latest_rankings = ranked_players.drop_duplicates(subset=['rank'], keep='first')
print(latest_rankings)
latest_rankings.columns

       ranking_date  rank  player_id  points name_first    name_last hand  \
42         20241230     1     206173   11830     Jannik       Sinner    R   
49         20241230     2     100644    7915  Alexander       Zverev    R   
105        20241230     3     207989    7010     Carlos      Alcaraz    R   
134        20241230     4     126203    5100     Taylor        Fritz    R   
209        20241230     5     106421    5030     Daniil     Medvedev    R   
...             ...   ...        ...     ...        ...          ...  ...   
85419      20240108  1920     205831       1      Ammar    Alhogbani    U   
70075      20240101  1586     212263       2   Xing Dao         Chen    R   
82949      20240101  1853     211341       1      Tommy  Czaplinski     U   
85521      20240101  1923     208036       1      Maxim     Krapivin    U   
89758      20240101  2023     211722       1     Manish       Ganesh    U   

              dob  ioc  height wikidata_id  
42     20010816.0  ITA   191.0

Index(['ranking_date', 'rank', 'player_id', 'points', 'name_first',
       'name_last', 'hand', 'dob', 'ioc', 'height', 'wikidata_id'],
      dtype='object')
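
Because `drop_duplicates(subset=['rank'])` keeps the most recent row per rank, the result above mixes ranking dates (the tail of the output shows rows from 20240101 and 20240108). If the goal is the ranking list as of the latest date, or the latest entry per player, one possible alternative is sketched below on a tiny made-up frame that reuses the real column names. The player IDs and points are invented.

```python
import pandas as pd

# Tiny stand-in for ranked_players (made-up values, real column names)
ranked_players = pd.DataFrame({
    'ranking_date': [20241230, 20241230, 20240101, 20240101, 20240101],
    'rank':         [1, 2, 1, 2, 3],
    'player_id':    [10, 20, 30, 10, 20],
    'points':       [900, 800, 950, 700, 600],
})

# Option 1: the full ranking list as of the most recent ranking date
latest_date = ranked_players['ranking_date'].max()
latest_rankings = (
    ranked_players[ranked_players['ranking_date'] == latest_date]
    .sort_values('rank')
    .reset_index(drop=True)
)

# Option 2: the most recent ranking entry of every player,
# including players who are missing from the latest list
latest_per_player = (
    ranked_players
    .sort_values('ranking_date', ascending=False)
    .drop_duplicates(subset='player_id', keep='first')
    .sort_values('player_id')
    .reset_index(drop=True)
)

print(latest_rankings)
print(latest_per_player)
```

Option 1 guarantees that every rank belongs to exactly one player on one date, while option 2 is closer to a "last known rank" feature.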

In [37]:
# Getting to know the Grand Slam point-by-point data
# This file has one row per match, not one row per point
match_data = pd.read_csv('../data/slam_pointbypoint/2024-usopen-matches.csv')
print(match_data.head())
print(match_data.columns)

           match_id  year    slam  match_num                player1  \
0  2024-usopen-1101  2024  usopen       1101          Jannik Sinner   
1  2024-usopen-1102  2024  usopen       1102        Eliot Spizzirri   
2  2024-usopen-1103  2024  usopen       1103        Mattia Bellucci   
3  2024-usopen-1104  2024  usopen       1104  Christopher O'Connell   
4  2024-usopen-1105  2024  usopen       1105            Arthur Fils   

              player2  status  winner  event_name  round  court_name  \
0  Mackenzie McDonald     NaN     NaN         NaN    NaN         NaN   
1      Alex Michelsen     NaN     NaN         NaN    NaN         NaN   
2       Stan Wawrinka     NaN     NaN         NaN    NaN         NaN   
3       Nicolas Jarry     NaN     NaN         NaN    NaN         NaN   
4        Learner Tien     NaN     NaN         NaN    NaN         NaN   

   court_id  player1id  player2id  nation1  nation2  
0       NaN        NaN        NaN      NaN      NaN  
1       NaN        NaN        Na

In [33]:
# points data
points_data = pd.read_csv('../data/slam_pointbypoint/2024-usopen-points.csv')
print(points_data.head())
print(points_data.columns)

           match_id ElapsedTime  SetNo  P1GamesWon  P2GamesWon  SetWinner  \
0  2024-usopen-1101     0:00:00      1           0           0          0   
1  2024-usopen-1101     0:00:00      1           0           0          0   
2  2024-usopen-1101     0:00:00      1           0           0          0   
3  2024-usopen-1101     0:00:20      1           0           0          0   
4  2024-usopen-1101     0:01:06      1           0           0          0   

   GameNo  GameWinner PointNumber  PointWinner  ...  P2TurningPoint  \
0       1           0          0X            0  ...             NaN   
1       1           0          0Y            0  ...             NaN   
2       1           0           1            1  ...             NaN   
3       1           0           2            2  ...             NaN   
4       1           0           3            2  ...             NaN   

   ServeNumber  WinnerType WinnerShotType P1DistanceRun  P2DistanceRun  \
0            0           0          

In [36]:
# add match_id to atp_matches
match['match_id'] = match['tourney_id'] + '-' + match['match_num'].astype('str')
match

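| | tourney_id | tourney_name | surface | draw_size | tourney_level | tourney_date | match_num | winner_id | winner_seed | winner_entry | ... | l_1stWon | l_2ndWon | l_SvGms | l_bpSaved | l_bpFaced | winner_rank | winner_rank_points | loser_rank | loser_rank_points | match_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 300 | 105453 | 2.0 | NaN | ... | 34.0 | 20.0 | 14.0 | 10.0 | 15.0 | 9.0 | 3590.0 | 16.0 | 1977.0 | 2019-M020-300 |
| 1 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 299 | 106421 | 4.0 | NaN | ... | 36.0 | 7.0 | 10.0 | 10.0 | 13.0 | 16.0 | 1977.0 | 239.0 | 200.0 | 2019-M020-299 |
| 2 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 298 | 105453 | 2.0 | NaN | ... | 15.0 | 6.0 | 8.0 | 1.0 | 5.0 | 9.0 | 3590.0 | 40.0 | 1050.0 | 2019-M020-298 |
| 3 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 297 | 104542 | NaN | PR | ... | 38.0 | 9.0 | 11.0 | 4.0 | 6.0 | 239.0 | 200.0 | 31.0 | 1298.0 | 2019-M020-297 |
| 4 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 296 | 106421 | 4.0 | NaN | ... | 46.0 | 19.0 | 15.0 | 2.0 | 4.0 | 16.0 | 1977.0 | 18.0 | 1855.0 | 2019-M020-296 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2801 | 2019-9210 | Laver Cup | Hard | 8 | A | 20190920 | 105 | 104545 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 20.0 | 1805.0 | 6.0 | 4095.0 | 2019-9210-105 |
| 2802 | 2019-9210 | Laver Cup | Hard | 8 | A | 20190920 | 106 | 126774 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 7.0 | 3420.0 | 30.0 | 1385.0 | 2019-9210-106 |
| 2803 | 2019-9210 | Laver Cup | Hard | 8 | A | 20190920 | 107 | 104745 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 2.0 | 9225.0 | 24.0 | 1450.0 | 2019-9210-107 |
| 2804 | 2019-9210 | Laver Cup | Hard | 8 | A | 20190920 | 108 | 106233 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 5.0 | 4575.0 | 33.0 | 1310.0 | 2019-9210-108 |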


In [39]:
missing_ratio = players["dob"].isna().mean()
print(f"{missing_ratio:.2%} of dob values are missing.")

27.89% of dob values are missing.


We need to fill in the missing dob values.
One option is to estimate the dob from the date of a player's first professional match, assuming a debut age of 18 (taken here as the average debut age of tennis players).

This is not optimal: we do not know whether the data set actually contains a player's debut match, or whether he really was 18 when he first played professionally. However, it gives us a reasonable estimate.
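
A minimal sketch of this estimate, assuming the column names seen in the files above (`player_id`, `dob`, `tourney_date`, `winner_id`, `loser_id`). The frames below are made-up stand-ins, and `DEBUT_AGE`, `first_match`, `estimated_dob` and `dob_estimated` are names introduced here for illustration only.

```python
import pandas as pd

DEBUT_AGE = 18  # assumed debut age, as described above

# Made-up stand-ins that reuse the column names of the real files
players = pd.DataFrame({
    'player_id': [1, 2, 3],
    'dob': [19900315.0, None, None],
})
match = pd.DataFrame({
    'tourney_date': [20180108, 20190520, 20190107],
    'winner_id': [1, 2, 1],
    'loser_id': [2, 1, 2],
})

# One row per (player, match date), regardless of who won
appearances = pd.concat([
    match[['tourney_date', 'winner_id']].rename(columns={'winner_id': 'player_id'}),
    match[['tourney_date', 'loser_id']].rename(columns={'loser_id': 'player_id'}),
])

# Earliest match date per player, parsed from YYYYMMDD integers
first_match = pd.to_datetime(
    appearances.groupby('player_id')['tourney_date'].min().astype(str),
    format='%Y%m%d',
)

# Estimated dob = first match date minus the assumed debut age
estimated_dob = first_match - pd.DateOffset(years=DEBUT_AGE)

# Parse the known dob values; missing or malformed ones become NaT
players['dob'] = pd.to_datetime(
    players['dob'].astype('Int64').astype(str), format='%Y%m%d', errors='coerce'
)

# Fill only the gaps and flag which values are estimates
was_missing = players['dob'].isna()
players['dob'] = players['dob'].fillna(players['player_id'].map(estimated_dob))
players['dob_estimated'] = was_missing & players['dob'].notna()

print(players)
```

Players without any match in the data (player 3 here) keep a missing dob. For a real run, `match` would need to span all available seasons, since a single season file only yields the first match of that season.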