In [35]:
import pandas as pd
import numpy as np

# Adding ELO Features
When cleaning the data, we removed everything related to rankings. That data would have told us about general standings and consistency, and would have let us compare players who had rarely played each other. I decided that an ELO system, commonly used in games such as chess and in e-sports, would be a good alternative. It works as follows.

Each player is assigned an initial rating (1500 in our case). The players' ratings are compared before the match, and each player is assigned an expected score from the following formula:
$$ E_A = \frac{1}{1 + 10^{\frac{R_B - R_A}{400}}} $$
Where $ E_A $ is the expected score of Player A, and $ R_A $ and $ R_B $ are the ratings of Players A and B respectively. The expected score of Player B is calculated by:
$$ E_B = 1 - E_A $$

The ratings of the players are then updated by the formula:
$$ R_{new} = R_{old} + k \cdot (S - E) $$
Where $ E $ is the player's expected score from above, $ R_{old} $ and $ R_{new} $ are the player's ratings before and after the match, $ S $ is the score (1 for a win, 0 for a loss), and $ k $ is the k-factor (adjustment factor, 32 in our case).
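As a quick sanity check of the two formulas, here is a minimal sketch of a single rating update. The function names `expected_score` and `update_rating` are illustrative only and are not taken from the project's source code.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B (E_A in the formula above)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_rating(r_old: float, score: float, expected: float, k: float = 32.0) -> float:
    """Apply R_new = R_old + k * (S - E)."""
    return r_old + k * (score - expected)


# Two players on the initial rating of 1500; player A wins.
e_a = expected_score(1500.0, 1500.0)   # 0.5, equal ratings
e_b = 1.0 - e_a
new_a = update_rating(1500.0, 1.0, e_a)  # 1516.0
new_b = update_rating(1500.0, 0.0, e_b)  # 1484.0
```

With equal ratings each player is expected to score 0.5, so the winner gains half of the k-factor (16 points) and the loser gives up the same amount.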

I implemented this ELO system as a class object in my source code, which has been imported below.

In [36]:
import sys
if 'src.features.elo_system' in sys.modules:
    del sys.modules['src.features.elo_system']
from src.features.elo_system import EloSystem

We can run the 'process_match_df' method on our clean dataframe to create a new dataframe that includes each player's post-match ELO. It provides an overall rating, as well as a rating for each surface the players play on, since a player's playstyle and ability can determine how well they perform on different surfaces.
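To illustrate the idea of keeping an overall rating alongside one rating per surface, here is a rough sketch. It is not the `EloSystem` implementation: the helper names `play` and `add_elo_columns` are hypothetical, and the sketch assumes the rows are already in chronological order and that the dataframe has `winner_name`, `loser_name` and `surface` columns.

```python
from collections import defaultdict

import pandas as pd

INITIAL, K = 1500.0, 32.0  # initial rating and k-factor stated in the text


def play(ratings, winner, loser):
    """Update `ratings` in place for one match; return the two post-match ratings."""
    r_w, r_l = ratings[winner], ratings[loser]
    e_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400.0))  # winner's expected score
    ratings[winner] = r_w + K * (1.0 - e_w)
    ratings[loser] = r_l + K * (0.0 - (1.0 - e_w))
    return ratings[winner], ratings[loser]


def add_elo_columns(matches: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `matches` with post-match overall and per-surface ratings."""
    overall = defaultdict(lambda: INITIAL)
    # One independent rating table per surface.
    by_surface = defaultdict(lambda: defaultdict(lambda: INITIAL))
    rows = []
    for m in matches.itertuples(index=False):
        w, l = play(overall, m.winner_name, m.loser_name)
        ws, ls = play(by_surface[m.surface], m.winner_name, m.loser_name)
        rows.append((w, l, ws, ls))
    cols = ["winner_elo", "loser_elo", "winner_elo_surface", "loser_elo_surface"]
    return matches.join(pd.DataFrame(rows, columns=cols, index=matches.index))


# Toy data (made up for illustration): the same pairing on two surfaces.
toy = pd.DataFrame({
    "winner_name": ["A", "A"],
    "loser_name": ["B", "B"],
    "surface": ["Hard", "Clay"],
})
toy_elo = add_elo_columns(toy)
```

In the toy example the second match is the players' first on clay, so their clay ratings start from 1500 again, while the overall ratings carry over from the first match.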

In [37]:
clean_df = pd.read_csv("../data/processed/cleaned_ATP_results.csv")
elo = EloSystem()
elo_df = elo.process_match_df(clean_df)
elo_df.head()

Unnamed: 0,tourney_name,surface,draw_size,tourney_level,tourney_date,winner_name,winner_age,loser_name,loser_age,score,best_of,round,winner_elo,loser_elo,winner_elo_surface,loser_elo_surface
0,Adelaide,Hard,32,A,19980105,Jonas Bjorkman,25.7,Grant Stafford,26.6,6-4 6-2,3,R32,1516.0,1484.0,1516.0,1484.0
1,Adelaide,Hard,32,A,19980105,Jason Stoltenberg,27.7,Juan Antonio Marin,22.8,6-4 6-1,3,R32,1516.0,1484.0,1516.0,1484.0
2,Adelaide,Hard,32,A,19980105,Nicolas Escude,21.7,Alex Radulescu,23.0,6-0 7-5,3,R32,1516.0,1484.0,1516.0,1484.0
3,Adelaide,Hard,32,A,19980105,Thomas Johansson,22.7,Byron Black,28.2,7-5 6-3,3,R32,1516.0,1484.0,1516.0,1484.0
4,Adelaide,Hard,32,A,19980105,Magnus Norman,21.6,Christian Ruud,25.3,6-3 1-6 6-4,3,R32,1516.0,1484.0,1516.0,1484.0


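From the head of this dataframe, it appears that ELO does not change much: every row shows 1516 for the winner and 1484 for the loser, because these are each player's first recorded match and both players start on 1500. The description of the dataframe shows that the ratings do vary once more matches per player have been processed.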

In [38]:
elo_df.describe()

Unnamed: 0,draw_size,tourney_date,winner_age,loser_age,best_of,winner_elo,loser_elo,winner_elo_surface,loser_elo_surface
count,81771.0,81771.0,81771.0,81771.0,81771.0,81771.0,81771.0,81771.0,81771.0
mean,55.019237,20104410.0,26.184722,26.291018,3.448619,1717.529905,1619.733684,1665.537321,1575.435915
std,39.999847,78132.2,3.93878,4.054371,0.834258,174.362228,139.016744,158.458639,121.513016
min,2.0,19980100.0,14.9,14.5,3.0,1352.57536,1323.786008,1350.781773,1269.6194
25%,32.0,20040210.0,23.3,23.3,3.0,1584.393353,1507.514903,1541.891327,1485.225812
50%,32.0,20100610.0,26.0,26.1,3.0,1693.667263,1597.545545,1631.612134,1542.314372
75%,64.0,20170510.0,28.8,29.0,3.0,1813.241605,1702.049906,1746.969401,1640.659226
max,128.0,20241220.0,44.6,44.0,5.0,2449.792328,2418.40481,2377.945159,2347.864373


We can now see that overall post-match ELO ranges from about 1324 to 2450, with a standard deviation of about 139 for losers and 174 for winners. We will now save this dataframe to a CSV file for later feature engineering and analysis.

In [40]:
elo_df.to_csv("../data/processed/elo_match_results.csv", index=False)

To make it easier to perform ELO analysis later on, I will convert the ratings into a time-series dataframe, with a column for each player. I will also create a separate dataframe for each surface. Note that because players can play several times under the same date (the date is that of the tournament, not the match), each recorded rating is that player's closing rating for the given week.
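One possible way to build such a table is sketched below. The function name `closing_ratings` is hypothetical, and the sketch assumes that rows sharing a `tourney_date` appear in the order the matches were processed, so the last row for a player on a given date holds their closing rating. The toy values are made up for illustration.

```python
import pandas as pd


def closing_ratings(matches: pd.DataFrame,
                    winner_col: str = "winner_elo",
                    loser_col: str = "loser_elo") -> pd.DataFrame:
    """Wide table: one row per tourney_date, one column per player."""
    winners = matches[["tourney_date", "winner_name", winner_col]].rename(
        columns={"winner_name": "player", winner_col: "elo"})
    losers = matches[["tourney_date", "loser_name", loser_col]].rename(
        columns={"loser_name": "player", loser_col: "elo"})
    # Interleave both halves back into the original match order (mergesort is stable).
    long = pd.concat([winners, losers]).sort_index(kind="mergesort")
    # Keep each player's last rating per date, i.e. the closing rating.
    closing = long.groupby(["tourney_date", "player"])["elo"].last()
    # Pivot players into columns and carry ratings forward between appearances.
    return closing.unstack("player").ffill()


toy_matches = pd.DataFrame({
    "tourney_date": [19980105, 19980105, 19980112],
    "winner_name": ["A", "A", "B"],
    "loser_name": ["B", "C", "C"],
    "winner_elo": [1516.0, 1530.0, 1500.0],
    "loser_elo": [1484.0, 1485.0, 1470.0],
})
ts = closing_ratings(toy_matches)
```

Player A plays twice under the first date, so only the second rating is kept, and it is carried forward to the second date, where A does not play. For a per-surface table, the same function could be called on the rows for one surface with `winner_elo_surface` and `loser_elo_surface` as the column arguments.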

# Career Aggregated Statistics
For each player in our data, our question requires us to look at long-term career success. To do this effectively, a dataframe with the following features per player would prove useful: