# Machine Learning On Tennis Data
## Contents
>* [Setup](#0)
>* [Adding Features](#1)
>* [Pre-Processing](#2)
>* [Getting Ready For ML](#3)

## 0 - Setup <a class="anchor" id="0"></a>

In [1]:
import pandas as pd
import numpy as np
from collections import defaultdict
from sklearn.model_selection import train_test_split

In [2]:
mens_df = pd.read_csv('../data/mens.csv',header=0,parse_dates=["Date"])
womens_df = pd.read_csv('../data/womens.csv',header=0,parse_dates=["Date"])

# Remove walkovers
mens_df = mens_df[mens_df['Comment']!='Walkover']
womens_df = womens_df[womens_df['Comment']!='Walkover']

In [3]:
cols = ['WRank', 'LRank']
mens_df[cols] = mens_df[cols].apply(pd.to_numeric, errors='coerce', axis=1)

In [4]:
mens_df = mens_df[~mens_df['WRank'].isnull() & ~mens_df['LRank'].isnull()]

## 1 - Adding Features <a class="anchor" id="1"></a>

When we train this model the data will be shuffled, so we'll lose any notion of time. We can add a few easily computable features to capture some of this information. For the moment we can add:
* Streak - length of the current winning streak
* Preferred surface - boolean, true if the current surface is the one with the most wins (players play ~80 games/year, so maybe look at the last 30 games)
* Preferred court - same as preferred surface, but for court type
* Historical winner - if the players have played before, who won?

### Data Structures to do this
We'll need some additional data structures to do this:
* A dict for tracking player wins against each opponent
> player -> dict = {'player_name':boolean,...}
* A dict for tracking player preferences over time. Surfaces are grouped as Hard/Carpet/Greenset, Clay and Grass
> player -> dict = {'winning streak':int,'losing streak':int,'court wins':int[],'surface wins':string[],'surface losses':string[]}
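
A minimal sketch of how these two structures could behave, kept separate from the notebook's own code. It assumes a rolling window implemented with `deque(maxlen=...)`, which is a swap-in for the manual `pop()` calls used further down; the names `recent_form`, `head_to_head`, `record_match`, `head_to_head_rate` and the window size are all hypothetical.

```python
from collections import defaultdict, deque

N_FORM_GAMES = 3  # hypothetical window size, kept tiny for the demo

# player -> most recent results (1 = win, -1 = loss), newest first
recent_form = defaultdict(lambda: deque(maxlen=N_FORM_GAMES))
# player -> opponent -> results against that opponent (1 = win, 0 = loss)
head_to_head = defaultdict(lambda: defaultdict(list))

def record_match(winner, loser):
    """Update both structures after a match has been played."""
    recent_form[winner].appendleft(1)
    recent_form[loser].appendleft(-1)
    head_to_head[winner][loser].append(1)
    head_to_head[loser][winner].append(0)

def head_to_head_rate(player, opponent):
    """Fraction of past meetings won by `player`; 0.5 if they have never met."""
    results = head_to_head[player][opponent]
    return sum(results) / len(results) if results else 0.5

for winner, loser in [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C")]:
    record_match(winner, loser)

print(list(recent_form["A"]))       # [1, -1, 1]: the oldest win has dropped out
print(head_to_head_rate("A", "D"))  # 0.5: no meetings recorded
```

The important point is the order of operations: features for a match must be read before the match result is recorded, otherwise the result leaks into its own features.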

In [5]:
complete_player_list = set(list(mens_df['Winner'].values)+list(mens_df['Loser'].values)) # the set of all unique players

In [6]:
p_observables = {}
historical_wins = {}

for p in complete_player_list:
    p_observables[p] = {'form':[],'court_wins':[],'surface_wins':[],'surface_losses':[]}
    historical_wins[p] = defaultdict(lambda: 0.5,{}) # If have no data assume players evenly matched (hence the 0.5 default)

In [7]:
n_wins_in_feature_lists = 20 #number of previous wins/losses to consider
n_games_for_form_calc = 15 #number of games to consider for recent form

In [8]:
def match_to_new_features(x):
    data_to_return = []
    
    for i,p in enumerate([x.Winner,x.Loser]):
        ## Historical results
        hist_form = np.mean(historical_wins[p][[x.Winner,x.Loser][i-1]])
        
        ## Recent form
        form = np.mean(p_observables[p]['form']) if (len(p_observables[p]['form'])>0) else 0.5 # Assume 0.5 if have no data
        if (len(p_observables[p]['form'])>=n_games_for_form_calc): p_observables[p]['form'].pop()
        
        ## Court form
        desired_court = 'Outdoor' if sum(p_observables[p]['court_wins']) > 0 else 'Indoor' # The court type with most recent wins
        prefered_court = 1 if (x.Court == desired_court) else 0
        
        ## Surface form
        try:
            desired_surface = max(set(p_observables[p]['surface_wins']), key=p_observables[p]['surface_wins'].count)
        except ValueError: # max() of an empty set: no wins recorded yet
            desired_surface = 'None'
        try:
            undesired_surface = max(set(p_observables[p]['surface_losses']), key=p_observables[p]['surface_losses'].count)
        except ValueError: # max() of an empty set: no losses recorded yet
            undesired_surface = 'None'
            
        preferred_surface = 1 if x.Surface == desired_surface else (-1 if x.Surface == undesired_surface else 0)
        
        ## Updating the historical info for use next time
        if i == 0: # On 1st loop so p is the winner
            historical_wins[p][x.Loser] = [1] if historical_wins[p][x.Loser] == 0.5 else historical_wins[p][x.Loser]+[1] # Append a 1 if the list already exists, else start a new list
            
            p_observables[p]['form'].insert(0,1)
            
            if (len(p_observables[p]['surface_wins']) > n_wins_in_feature_lists): p_observables[p]['surface_wins'].pop() # remove last item from list
            p_observables[p]['surface_wins'].insert(0,x.Surface)
            
            if (len(p_observables[p]['court_wins']) > n_wins_in_feature_lists): p_observables[p]['court_wins'].pop() # if too long remove last one
            p_observables[p]['court_wins'].insert(0,1 if x.Court=='Outdoor' else -1)
        else:
            historical_wins[p][x.Winner] = [0] if historical_wins[p][x.Winner] == 0.5 else historical_wins[p][x.Winner]+[0] # Append a 0 if the list already exists, else start a new list
            
            p_observables[p]['form'].insert(0,-1)
            
            if (len(p_observables[p]['surface_losses']) > n_wins_in_feature_lists): p_observables[p]['surface_losses'].pop() # remove last item from list
            p_observables[p]['surface_losses'].insert(0,x.Surface)

        data_to_return += [hist_form,form,prefered_court,preferred_surface]
        
    return data_to_return

In [9]:
%%time
mens_df[['w_hist_form','w_form','w_court_form','w_surface_form','l_hist_form','l_form','l_court_form','l_surface_form']] = mens_df.apply(match_to_new_features,axis=1, result_type="expand")

Wall time: 44.7 s


### Adding alpha data

In [10]:
alpha_df = pd.read_pickle(r"K:\Code\Tennis_Betting\2. Model exploration\alphas\mens_alphas.pickle")
alpha_df.dropna(inplace=True)
alpha_df.head()

start,end,p_dict,x
2017-01-02,2017-04-01,"{'Choinski J.': 0, 'Broady L.': 1, 'Delbonis F...","[0.7239695125774777, 1.069924249476819, 1.0938..."
2002-05-27,2002-05-27,"{'Pretzsch A.': 0, 'Larsson M.': 1, 'Pavel A.'...","[0.800685292740611, 0.964876756885678, 1.52120..."
2012-05-22,2012-05-26,"{'Del Bonis F.': 0, 'Blake J.': 1, 'Delbonis F...","[1.6033600263722116, 0.9296558179990959, 1.334..."
2000-07-10,2000-07-10,"{'Pretzsch A.': 0, 'Larsson M.': 1, 'Pavel A.'...","[1.0469830051536664, 1.7652306826038533, 1.447..."
2013-09-16,2013-09-22,"{'Del Bonis F.': 0, 'Broady L.': 1, 'Blake J.'...","[1.3234143857328917, 0.7012684908314621, 1.191..."


In [11]:
dates_list = []
for i,df in mens_df.groupby([mens_df.Date.dt.year,'ATP']):
    dates_list.append((min(df['Date']),max(df['Date'])))
dates_list=list(set(dates_list))

In [12]:
for (start, end), row in alpha_df.iterrows():
    p_dict = row.p_dict
    alpha = row.x
    alpha[-1] = 1.1 # avg alpha value, used as the default for players missing from p_dict (they map to index -1)
    in_window = (mens_df['Date']>=start) & (mens_df['Date']<=end)
    df_slice = mens_df.loc[in_window,:]
    mens_df.loc[in_window,'w_alpha'] = alpha[df_slice['Winner'].map(p_dict).fillna(-1).astype('int')]
    mens_df.loc[in_window,'l_alpha'] = alpha[df_slice['Loser'].map(p_dict).fillna(-1).astype('int')]
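
The lookup in this loop relies on a small indexing trick: players missing from `p_dict` map to NaN, the NaN is filled with -1, and index -1 selects the last slot of the array, which has just been set to the default value. A minimal sketch with made-up names and values (none of them come from the real pickle):

```python
import numpy as np
import pandas as pd

# Hypothetical alpha values; the final slot is reserved for the default
alpha = np.array([0.9, 1.4, 1.7])
alpha[-1] = 1.1  # average alpha, used for unknown players
p_dict = {"Player A": 0, "Player B": 1}  # player name -> position in alpha

players = pd.Series(["Player A", "Unknown", "Player B"])
# Unknown names become NaN, then -1, which indexes the last (default) slot
idx = players.map(p_dict).fillna(-1).astype(int).to_numpy()
player_alpha = alpha[idx]
print(player_alpha)  # [0.9 1.1 1.4]
```

Note that the assignment overwrites whatever was stored in the last slot, so the trick is only safe if that slot does not hold a real player's value.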

In [13]:
mens_df = mens_df[mens_df['w_alpha'].notna() & mens_df['l_alpha'].notna()]

In [14]:
mens_df['ranking_p'] = -2*(mens_df[['WRank','LRank']].max(axis=1) / (mens_df['WRank'] + mens_df['LRank'])-0.5)
mens_df['alpha_p'] = mens_df[['w_alpha','l_alpha']].max(axis=1) / (mens_df['w_alpha'] + mens_df['l_alpha'])
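
As a quick sanity check of the two formulas, here is a sketch that applies them to the first row of the table shown in section 3 (ranks 112 and 1, alphas 2.011340 and 1.1). The scalar helper functions are hypothetical stand-ins for the column-wise expressions above.

```python
def ranking_p(rank_a, rank_b):
    """-2 * (worse rank / sum of ranks - 0.5): 0 for equal ranks, towards -1 for a lopsided pair."""
    return -2 * (max(rank_a, rank_b) / (rank_a + rank_b) - 0.5)

def alpha_p(alpha_a, alpha_b):
    """Larger alpha as a share of the pair's total: 0.5 for equal alphas, towards 1 otherwise."""
    return max(alpha_a, alpha_b) / (alpha_a + alpha_b)

print(round(ranking_p(112, 1), 6))       # -0.982301
print(round(alpha_p(2.011340, 1.1), 6))  # 0.646455
```

Both quantities are symmetric in the two players, so they measure how lopsided a match is rather than which player is favoured.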

## 2 - Pre-Processing <a class="anchor" id="2"></a>

First, remove the data that we think isn't important.
> It's really important that we remove the match results themselves, otherwise the target leaks into the features <br>
> For the moment we also remove the betting odds; these could be converted to implied probabilities and used later
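
A sketch of that conversion, assuming the odds columns hold decimal odds, so that the raw implied probability is 1/odds. The column names `B365W` and `B365L` come from the dataset; the function name and the sample odds are made up. Dividing by the sum of the two raw probabilities is one simple way to strip out the bookmaker's margin, not the only one.

```python
import pandas as pd

def implied_probabilities(odds_w, odds_l):
    """Turn a pair of decimal-odds columns into probabilities that sum to 1."""
    raw_w = 1.0 / odds_w
    raw_l = 1.0 / odds_l
    overround = raw_w + raw_l  # exceeds 1 by the bookmaker's margin
    return raw_w / overround, raw_l / overround

# Made-up odds for two matches
odds = pd.DataFrame({"B365W": [1.25, 2.0], "B365L": [4.0, 2.0]})
odds["p_w"], odds["p_l"] = implied_probabilities(odds["B365W"], odds["B365L"])
print(odds)
```

If these probabilities were used as features, they would need the same winner/loser swap as every other `W`/`L` column.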

In [15]:
cols_to_drop = ['ATP','Series','Location', 'Date','Best of','W1', 'L1','W2', 'L2', 'W3', 'L3', 'W4', 'L4', 'W5', 'L5', 'Wsets', 'Lsets','Comment', 'CBW', 'CBL', 'GBW', 'GBL', 'IWW', 'IWL', 'SBW', 'SBL','B365W', 'B365L', 'B&WW', 'B&WL', 'EXW', 'EXL', 'PSW', 'PSL', 'WPts','LPts', 'UBW', 'UBL', 'LBW', 'LBL', 'SJW', 'SJL', 'MaxW', 'MaxL','AvgW', 'AvgL']

filtered_mens_df = mens_df.drop(cols_to_drop,axis=1)

In [16]:
## Replace tournament names that make up less than 1% of matches with 'Other'
tournament_names_and_counts = filtered_mens_df['Tournament'].value_counts(normalize=True)
infrequent_tournaments = list(tournament_names_and_counts[tournament_names_and_counts<0.01].index)

filtered_mens_df.loc[filtered_mens_df['Tournament'].isin(infrequent_tournaments),'Tournament'] = 'Other'

In [17]:
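## Bundle surfaces into three categories
# Applied to filtered_mens_df, which is the frame used from here on (mens_df.drop returned a copy)
surface_dict = {'Clay':'Clay','Grass':'Grass','Hard':'Hard','Carpet':'Hard','Greenset':'Hard'}
filtered_mens_df.replace({"Surface": surface_dict},inplace=True)
filtered_mens_df.Surface.unique()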

array(['Hard', 'Clay', 'Grass'], dtype=object)

Now we need to randomly swap the winner and loser columns. Otherwise the winner always sits in the same columns and the ML model could learn to just pick that position.

In [18]:
winner_cols = [c for c in filtered_mens_df.columns if c[0].lower() == 'w']
loser_cols = [c for c in filtered_mens_df.columns if c[0].lower() == 'l']
other_cols = [c for c in filtered_mens_df.columns if not (c in winner_cols or c in loser_cols)]

In [19]:
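def random_transpose_of_data(x):
    # Swap the winner and loser columns with probability 0.5; y = 1 means the player left in the winner columns won
    if np.random.choice([True,False]):
        # .values makes the swap positional, so pandas cannot try to align the differing column labels
        x[winner_cols], x[loser_cols] = x[loser_cols].values, x[winner_cols].values
        x['y'] = 0
    else:
        x['y'] = 1
    return x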

In [20]:
col_rename_dict = {'Winner':'A','Loser':'B','WRank':'A_rank','LRank':'B_rank'}
for col in winner_cols:
    if col not in  ['Winner','WRank']:
        col_rename_dict[col]="A"+col[1:]
for col in loser_cols:
    if col not in  ['Loser','LRank']:
        col_rename_dict[col]="B"+col[1:]

In [21]:
formatted_mens_df = filtered_mens_df.apply(random_transpose_of_data,axis=1).rename(columns=col_rename_dict)
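
Row-wise `apply` can be slow on a frame of this size. As an alternative, here is a sketch of a vectorised swap built on `np.where`. The function `random_swap`, the toy frame and the fixed seed are hypothetical and only there to make the example reproducible.

```python
import numpy as np
import pandas as pd

def random_swap(df, first_cols, second_cols, seed=0):
    """Swap paired columns on randomly chosen rows (probability 0.5 each); y = 1 where no swap happened."""
    rng = np.random.default_rng(seed)
    swap = rng.random(len(df)) < 0.5  # boolean mask, True = swap this row
    out = df.copy()
    for a, b in zip(first_cols, second_cols):
        out[a] = np.where(swap, df[b], df[a])
        out[b] = np.where(swap, df[a], df[b])
    out["y"] = np.where(swap, 0, 1)
    return out

toy = pd.DataFrame({
    "Winner": ["P1", "P2", "P3", "P4"],
    "Loser": ["Q1", "Q2", "Q3", "Q4"],
    "WRank": [1.0, 5.0, 9.0, 20.0],
    "LRank": [3.0, 2.0, 40.0, 7.0],
})
swapped = random_swap(toy, ["Winner", "WRank"], ["Loser", "LRank"])
print(swapped)
```

The two column lists must be in matching order, as `winner_cols` and `loser_cols` are here, and the result can be renamed with `col_rename_dict` in the same way.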

## 3 - Getting Ready For ML <a class="anchor" id="3"></a>

First we add a few more quick features. The output of the next cell shows the format the men's data ends up in.

In [22]:
# quickly add more features, easier to do this after relabelling A/B
formatted_mens_df['d_alpha'] = formatted_mens_df['A_alpha']-formatted_mens_df['B_alpha']
formatted_mens_df['d_rank'] = formatted_mens_df['A_rank']-formatted_mens_df['B_rank']
formatted_mens_df.drop(['B_hist_form'],axis=1,inplace=True) # Drop B_hist_form as it's 1-A_hist_form
formatted_mens_df

,Tournament,Court,Surface,Round,A,B,A_rank,B_rank,A_hist_form,A_form,...,B_form,B_court_form,B_surface_form,A_alpha,B_alpha,ranking_p,alpha_p,y,d_alpha,d_rank
155,Australian Open,Outdoor,Hard,1st Round,Puerta M.,Agassi A.,112.0,1.0,0.500000,-1.000000,...,0.500000,0.0,0.0,2.011340,1.100000,-0.982301,0.646455,0,0.911340,111.0
156,Australian Open,Outdoor,Hard,1st Round,Manta L.,Alami K.,107.0,35.0,0.500000,0.333333,...,0.500000,0.0,0.0,1.504490,1.100000,-0.507042,0.577652,0,0.404490,72.0
157,Australian Open,Outdoor,Hard,1st Round,Alonso J.,Arazi H.,111.0,41.0,0.500000,0.500000,...,-0.333333,1.0,1.0,1.100000,1.133794,-0.460526,0.507564,0,-0.033794,70.0
158,Australian Open,Outdoor,Hard,1st Round,Meligeni F.,Behrend T.,28.0,106.0,0.500000,-1.000000,...,0.000000,1.0,1.0,2.179620,0.919816,-0.582090,0.703231,0,1.259804,-78.0
159,Australian Open,Outdoor,Hard,1st Round,Bjorkman J.,Stoltenberg J.,76.0,81.0,0.500000,-1.000000,...,0.500000,1.0,1.0,1.142161,1.738549,-0.031847,0.603514,1,-0.596389,-5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53169,French Open,Outdoor,Clay,Quarterfinals,Tsitsipas S.,Rublev A.,6.0,12.0,0.000000,0.466667,...,0.733333,1.0,0.0,1.475698,1.561433,-0.333333,0.514114,1,-0.085735,-6.0
53170,French Open,Outdoor,Clay,Quarterfinals,Djokovic N.,Carreno Busta P.,1.0,18.0,0.666667,0.866667,...,0.333333,1.0,0.0,1.651171,0.983119,-0.894737,0.626799,1,0.668051,-17.0
53171,French Open,Outdoor,Clay,Semifinals,Schwartzman D.,Nadal R.,14.0,2.0,0.111111,0.466667,...,0.733333,1.0,0.0,1.446942,2.011917,-0.750000,0.581671,0,-0.564975,12.0
53172,French Open,Outdoor,Clay,Semifinals,Djokovic N.,Tsitsipas S.,1.0,6.0,0.600000,0.866667,...,0.466667,1.0,0.0,1.651171,1.475698,-0.714286,0.528059,1,0.175473,-5.0


In [23]:
categorical_cols = ['Tournament', 'Court', 'Surface', 'Round', 'A', 'B']

In [24]:
dummy_mens_df = pd.get_dummies(formatted_mens_df,columns=categorical_cols)

In [25]:
corr = dummy_mens_df.iloc[:, dummy_mens_df.columns != 'y'].corrwith(dummy_mens_df['y'],axis=0).sort_values(ascending=False)

In [26]:
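# Keep only the numeric feature columns; use a separate list so categorical_cols is not mutated
excluded_cols = categorical_cols + ['y','B_hist_form']
corr_cols = [x for x in formatted_mens_df.columns if x not in excluded_cols]
corr.loc[corr_cols].sort_values()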

d_rank           -0.250506
B_alpha          -0.216070
B_form           -0.175283
A_rank           -0.169436
B_court_form     -0.015966
B_surface_form   -0.007665
ranking_p        -0.004666
A_surface_form    0.000628
alpha_p           0.002153
A_court_form      0.008652
A_hist_form       0.151949
B_rank            0.174229
A_form            0.183041
A_alpha           0.215040
d_alpha           0.328939
dtype: float64

In [27]:
X = dummy_mens_df.iloc[:, dummy_mens_df.columns != 'y']
y = dummy_mens_df['y']

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [29]:
import pickle

# Saving the objects:
with open('train_test_split.pkl', 'wb') as f:
    pickle.dump([X_train, X_test, y_train, y_test], f)