# Machine Learning On Tennis Data
## Contents
>* [Setup](#0)
>* [Adding Features](#1)
>* [Pre-Processing](#2)
>* [Feature Selection](#3)
>* [Exporting Data](#4)

## 0 - Setup <a class="anchor" id="0"></a>

In [1]:
import pandas as pd
import numpy as np
from collections import defaultdict
from sklearn.model_selection import train_test_split

In [2]:
mens_df = pd.read_csv('../data/mens.csv',header=0,parse_dates=["Date"])
womens_df = pd.read_csv('../data/womens.csv',header=0,parse_dates=["Date"])

# Remove walkovers
mens_df = mens_df[mens_df['Comment']!='Walkover']
womens_df = womens_df[womens_df['Comment']!='Walkover']



In [3]:
cols = ['WRank', 'LRank']
mens_df[cols] = mens_df[cols].apply(pd.to_numeric, errors='coerce', axis=1)

In [4]:
mens_df.shape

(53203, 54)

In [5]:
mens_df = mens_df[~mens_df['WRank'].isnull() & ~mens_df['LRank'].isnull()]

In [6]:
mens_df.shape

(53069, 54)

## 1 - Adding Features <a class="anchor" id="1"></a>

When we train this model the rows will be shuffled, so any notion of time is lost. We can add a few easily computable features that capture some of this information. For the moment we can add
* Streak - length of the current winning streak
* Preferred surface - boolean, true if the current surface is the one with the most wins (players play ~80 games/year, so maybe look at the last 30 games)
* Preferred court - same as preferred surface, but for court type
* Historical winner - if the players have played before, who won?

### Data Structures to do this
We'll need two additional data structures:
* A dict for tracking each player's wins against every opponent
> player -> dict = {'player_name':boolean,...}
* A dict for tracking player preferences over time (surfaces: Hardcourt/Carpet/Greenset, Clay and Grass)
> player -> dict = {'winning streak':int,'losing streak':int,'court wins':int[],'surface wins':string[],'surface losses':string[]}
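
The two structures above could be sketched as follows. This is an illustrative alternative rather than the code used in the cells below: `new_player_state`, `record_match`, `N_FORM` and `N_SURFACE` are hypothetical names, and `collections.deque(maxlen=...)` stands in for manual `insert`/`pop` bookkeeping on plain lists.

```python
from collections import defaultdict, deque

N_FORM = 15      # assumed window for recent form
N_SURFACE = 20   # assumed window for surface wins/losses

def new_player_state():
    """Rolling per-player observables; the oldest entries drop off automatically."""
    return {
        'form': deque(maxlen=N_FORM),
        'surface_wins': deque(maxlen=N_SURFACE),
        'surface_losses': deque(maxlen=N_SURFACE),
    }

p_state = defaultdict(new_player_state)
# player -> opponent -> list of results (1 = win, 0 = loss)
head_to_head = defaultdict(lambda: defaultdict(list))

def record_match(winner, loser, surface):
    """Update both players' state after a finished match."""
    p_state[winner]['form'].appendleft(1)
    p_state[winner]['surface_wins'].appendleft(surface)
    p_state[loser]['form'].appendleft(-1)
    p_state[loser]['surface_losses'].appendleft(surface)
    head_to_head[winner][loser].append(1)
    head_to_head[loser][winner].append(0)

record_match('Player A', 'Player B', 'Clay')
record_match('Player B', 'Player A', 'Hard')
print(head_to_head['Player A']['Player B'])   # [1, 0]
print(list(p_state['Player A']['form']))      # [-1, 1]
```

Features for a match should be read from these structures before `record_match` is called for it, so that a row never sees its own result.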

In [7]:
complete_player_list = set(list(mens_df['Winner'].values)+list(mens_df['Loser'].values)) # the set of all unique players

In [8]:
p_observables = {}
historical_wins = {}

for p in complete_player_list:
    p_observables[p] = {'form':[],'court_wins':[],'surface_wins':[],'surface_losses':[]}
    historical_wins[p] = defaultdict(lambda: 0.5,{}) # If have no data assume players evenly matched (hence the 0.5 default)

In [9]:
n_wins_in_feature_lists = 20 #number of previous wins/losses to consider
n_games_for_form_calc = 15 #number of games to consider for recent form

In [10]:
def match_to_new_features(x):
    data_to_return = []
    
    for i,p in enumerate([x.Winner,x.Loser]):
        ## Historical results
        hist_form = np.mean(historical_wins[p][[x.Winner,x.Loser][i-1]])
        
        ## Recent form
        form = np.mean(p_observables[p]['form']) if (len(p_observables[p]['form'])>0) else 0.5 # Assume 0.5 if have no data
        if (len(p_observables[p]['form'])>=n_games_for_form_calc): p_observables[p]['form'].pop()
        
        ## Court form
        desired_court = 'Outdoor' if sum(p_observables[p]['court_wins']) > 0 else 'Indoor' # The court type with most recent wins
        prefered_court = 1 if (x.Court == desired_court) else 0
        
        ## Surface form
        try:
            desired_surface = max(set(p_observables[p]['surface_wins']), key=p_observables[p]['surface_wins'].count)
        except ValueError: # max() of an empty set: no wins recorded yet
            desired_surface = 'None'
        try:
            undesired_surface = max(set(p_observables[p]['surface_losses']), key=p_observables[p]['surface_losses'].count)
        except ValueError: # max() of an empty set: no losses recorded yet
            undesired_surface = 'None'
            
        preferred_surface = 1 if x.Surface == desired_surface else (-1 if x.Surface == undesired_surface else 0)
        
        ## Updating the historical info for use next time
        if i == 0: # On 1st loop so p is the winner
            historical_wins[p][x.Loser] = [1] if historical_wins[p][x.Loser] == 0.5 else historical_wins[p][x.Loser]+[1] # Append a 1 if the list already exists, else make a new list
            
            p_observables[p]['form'].insert(0,1)
            
            if (len(p_observables[p]['surface_wins']) > n_wins_in_feature_lists): p_observables[p]['surface_wins'].pop() # remove last item from list
            p_observables[p]['surface_wins'].insert(0,x.Surface)
            
            if (len(p_observables[p]['court_wins']) > n_wins_in_feature_lists): p_observables[p]['court_wins'].pop() # if too long remove last one
            p_observables[p]['court_wins'].insert(0,1 if x.Court=='Outdoor' else -1)
        else:
            historical_wins[p][x.Winner] = [0] if historical_wins[p][x.Winner] == 0.5 else historical_wins[p][x.Winner]+[0] # Append a 0 if the list already exists, else make a new list
            
            p_observables[p]['form'].insert(0,-1)
            
            if (len(p_observables[p]['surface_losses']) > n_wins_in_feature_lists): p_observables[p]['surface_losses'].pop() # remove last item from list
            p_observables[p]['surface_losses'].insert(0,x.Surface)

        data_to_return += [hist_form,form,prefered_court,preferred_surface]
        
    return data_to_return
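
The surface-form step in the function above (1 on the most-won surface, -1 on the most-lost surface, 0 otherwise) can be isolated as a small helper. This is a sketch, not the notebook's implementation: `preferred_surface_flag` is a hypothetical name, and `collections.Counter` replaces the `max(set(...), key=....count)` idiom, so ties are broken by first occurrence rather than by set ordering.

```python
from collections import Counter

def preferred_surface_flag(surface, recent_wins, recent_losses):
    """Return 1 on the most-won surface, -1 on the most-lost surface, else 0."""
    best = Counter(recent_wins).most_common(1)     # [] when there is no history
    worst = Counter(recent_losses).most_common(1)
    if best and surface == best[0][0]:
        return 1
    if worst and surface == worst[0][0]:
        return -1
    return 0

wins = ['Clay', 'Clay', 'Hard']
losses = ['Grass']
print(preferred_surface_flag('Clay', wins, losses))   # 1
print(preferred_surface_flag('Grass', wins, losses))  # -1
print(preferred_surface_flag('Hard', [], []))         # 0
```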

In [11]:
%%time
mens_df[['w_hist_form','w_form','w_court_form','w_surface_form','l_hist_form','l_form','l_court_form','l_surface_form']] = mens_df.apply(match_to_new_features,axis=1, result_type="expand")

Wall time: 29.7 s


### Adding alpha data

In [12]:
alpha_df = pd.read_pickle(r"K:\Code\Tennis_Betting\2. Model exploration\alphas\mens_alphas.pickle")
alpha_df.dropna(inplace=True)
alpha_df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,p_dict,x
start,end,Unnamed: 2_level_1,Unnamed: 3_level_1
2017-01-02,2017-04-01,"{'Choinski J.': 0, 'Broady L.': 1, 'Delbonis F...","[0.7239695125774777, 1.069924249476819, 1.0938..."
2002-05-27,2002-05-27,"{'Pretzsch A.': 0, 'Larsson M.': 1, 'Pavel A.'...","[0.800685292740611, 0.964876756885678, 1.52120..."
2012-05-22,2012-05-26,"{'Del Bonis F.': 0, 'Blake J.': 1, 'Delbonis F...","[1.6033600263722116, 0.9296558179990959, 1.334..."
2000-07-10,2000-07-10,"{'Pretzsch A.': 0, 'Larsson M.': 1, 'Pavel A.'...","[1.0469830051536664, 1.7652306826038533, 1.447..."
2013-09-16,2013-09-22,"{'Del Bonis F.': 0, 'Broady L.': 1, 'Blake J.'...","[1.3234143857328917, 0.7012684908314621, 1.191..."


In [13]:
dates_list = []
for i,df in mens_df.groupby([mens_df.Date.dt.year,'ATP']):
    dates_list.append((min(df['Date']),max(df['Date'])))
dates_list=list(set(dates_list))

In [14]:
for (start, end), row in alpha_df.iterrows():
    p_dict = row.p_dict
    alpha = row.x
    alpha[-1] = 1.1 # avg alpha value; players missing from p_dict map to index -1
    in_window = (mens_df['Date']<=end) & (mens_df['Date']>=start)
    df_slice = mens_df.loc[in_window,:]
    mens_df.loc[in_window,'w_alpha'] = alpha[df_slice['Winner'].map(p_dict).fillna(-1).astype('int')]
    mens_df.loc[in_window,'l_alpha'] = alpha[df_slice['Loser'].map(p_dict).fillna(-1).astype('int')]

In [15]:
mens_df = mens_df[mens_df['w_alpha'].notna() & mens_df['l_alpha'].notna()]

In [16]:
mens_df['ranking_p'] = -2*(mens_df[['WRank','LRank']].max(axis=1) / (mens_df['WRank'] + mens_df['LRank'])-0.5)
mens_df['alpha_p'] = mens_df[['w_alpha','l_alpha']].max(axis=1) / (mens_df['w_alpha'] + mens_df['l_alpha'])
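
As a worked check of the two formulas in the cell above, the scalar versions below reproduce them for a single match. The function names mirror the column names; the sample inputs are the ranks and alphas of the first row shown in the Feature Selection output (211 vs 49, and 1.286733 vs 0.993303).

```python
def ranking_p(w_rank, l_rank):
    """-2 * (share of the worse rank - 0.5); 0 for equal ranks, approaching -1 for a large gap."""
    return -2 * (max(w_rank, l_rank) / (w_rank + l_rank) - 0.5)

def alpha_p(w_alpha, l_alpha):
    """Share of the larger alpha; 0.5 for equal (positive) alphas."""
    return max(w_alpha, l_alpha) / (w_alpha + l_alpha)

print(round(ranking_p(211, 49), 6))            # -0.623077
print(round(alpha_p(1.286733, 0.993303), 6))   # 0.564348
```

Both quantities are symmetric in the two players, so they describe how lopsided a match is rather than which player is favoured.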

In [17]:
mens_df.shape

(50587, 66)

## 2 - Pre-Processing <a class="anchor" id="2"></a>

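First remove the columns that we think aren't important
> It's essential that we remove the results of the match (set scores, sets won, comments), otherwise the model would be trained on the outcome it is meant to predict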

In [18]:
cols_to_drop = ['ATP','Series','Location','Best of','W1', 'L1','W2', 'L2', 'W3', 'L3', 'W4', 'L4', 'W5', 'L5', 'Wsets', 'Lsets','Comment','CBW','GBW','IWW','SBW','B365W','B&WW','EXW','PSW','UBW','LBW','SJW','CBL','GBL','IWL','SBL','B365L','B&WL','EXL','PSL','UBL','LBL','SJL']

filtered_mens_df = mens_df.drop(cols_to_drop,axis=1)

In [19]:
## Replace tournament names which appear infrequently (under 1% of matches) with 'Other'
tournament_names_and_counts = filtered_mens_df['Tournament'].value_counts(normalize=True)
infrequent_tournaments = list(tournament_names_and_counts[tournament_names_and_counts<0.01].index)

filtered_mens_df.loc[filtered_mens_df['Tournament'].isin(infrequent_tournaments),'Tournament'] = 'Other'

In [20]:
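## Bundle surfaces into three categories
surface_dict = {'Clay':'Clay','Grass':'Grass','Hard':'Hard','Carpet':'Hard','Greenset':'Hard'}
filtered_mens_df = filtered_mens_df.replace({"Surface": surface_dict}) # apply to filtered_mens_df (a copy of mens_df made above), since that is the frame used from here on
filtered_mens_df.Surface.unique()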

array(['Hard', 'Clay', 'Grass'], dtype=object)

Now we need to randomly swap the winner and loser columns, so that the ML model can't simply learn that the winner always sits in the same column

In [21]:
winner_cols = [c for c in filtered_mens_df.columns if c[0].lower() == 'w'] + ['MaxW','AvgW']
loser_cols = [c for c in filtered_mens_df.columns if (c[0].lower() == 'l' and c[0:2]!='LB')]  + ['MaxL','AvgL']
other_cols = [c for c in filtered_mens_df.columns if not (c in winner_cols or c in loser_cols)]

In [22]:
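def random_transpose_of_data(x):
    # With probability 0.5 swap the winner and loser columns; y = 1 if the winner is still in the winner columns
    if np.random.choice([True,False]):
        # Swap by position (.values) so pandas does not try to align on the differing column labels
        x[winner_cols], x[loser_cols] = x[loser_cols].values, x[winner_cols].values
        x['y'] = 0
    else:
        x['y'] = 1
    return x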

In [23]:
col_rename_dict = {'Winner':'A','Loser':'B','WRank':'A_rank','LRank':'B_rank'}
for col in winner_cols:
    if col not in  ['Winner','WRank'] + ['MaxW','AvgW']:
        col_rename_dict[col]="A"+col[1:]
    elif col not in ['Winner','WRank']:
        col_rename_dict[col] = "A_" + col[:-1] 
for col in loser_cols:
    if col not in  ['Loser','LRank'] + ['MaxL','AvgL']:
        col_rename_dict[col]="B"+col[1:]
    elif col not in ['Loser','LRank']:
        col_rename_dict[col] = "B_" + col[:-1] 
for col in other_cols:
    col_rename_dict[col] = col

In [24]:
%%time
formatted_mens_df = filtered_mens_df.apply(random_transpose_of_data,axis=1).rename(columns=col_rename_dict)

Wall time: 1min 30s
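
The row-wise `apply` above is the slow part of this section. A vectorised version of the same idea could look like the sketch below. It is an assumption-laden alternative, not the code that produced the outputs in this notebook: `swap_players_at_random` and its `seed` parameter are hypothetical, and it relies on `winner_cols` and `loser_cols` being listed in matching order.

```python
import numpy as np
import pandas as pd

def swap_players_at_random(df, winner_cols, loser_cols, seed=0):
    """Swap winner/loser columns on a random subset of rows.

    Adds y = 1 where the winner stayed in winner_cols and y = 0 where swapped.
    """
    rng = np.random.default_rng(seed)     # seeded so the result is reproducible
    swap = rng.random(len(df)) < 0.5      # one coin flip per match
    out = df.copy()
    for w_col, l_col in zip(winner_cols, loser_cols):
        w_vals = df[w_col].to_numpy()
        l_vals = df[l_col].to_numpy()
        out[w_col] = np.where(swap, l_vals, w_vals)
        out[l_col] = np.where(swap, w_vals, l_vals)
    out['y'] = (~swap).astype(int)
    return out

demo = pd.DataFrame({
    'Winner': ['P1', 'P2', 'P3', 'P4'],
    'Loser': ['Q1', 'Q2', 'Q3', 'Q4'],
    'WRank': [1.0, 2.0, 3.0, 4.0],
    'LRank': [10.0, 20.0, 30.0, 40.0],
})
swapped = swap_players_at_random(demo, ['Winner', 'WRank'], ['Loser', 'LRank'])
print(swapped)
```

Fixing the seed also makes the A/B assignment repeatable between runs, which the unseeded `np.random.choice` call does not guarantee.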


## 3 - Feature Selection <a class="anchor" id="3"></a>

Adding a few more quick features

In [25]:
# quickly add more features, easier to do this after relabelling A/B
formatted_mens_df['d_alpha'] = formatted_mens_df['A_alpha']-formatted_mens_df['B_alpha']
formatted_mens_df['d_rank'] = formatted_mens_df['A_rank']-formatted_mens_df['B_rank']
formatted_mens_df.drop(['B_hist_form'],axis=1,inplace=True) # Drop B_hist_form as it's 1-A_hist_form
formatted_mens_df

Unnamed: 0,Tournament,Date,Court,Surface,Round,A,B,A_rank,B_rank,APts,...,B_form,B_court_form,B_surface_form,A_alpha,B_alpha,ranking_p,alpha_p,y,d_alpha,d_rank
93,Other,2000-01-10,Outdoor,Hard,1st Round,Balcells J.,Squillari F.,211.0,49.0,,...,0.000000,1.0,1.0,1.286733,0.993303,-0.623077,0.564348,1,0.293430,162.0
94,Other,2000-01-10,Outdoor,Hard,1st Round,Hantschk M.,Behrend T.,95.0,115.0,,...,0.000000,1.0,1.0,1.192691,0.502757,-0.095238,0.703467,0,0.689934,-20.0
95,Other,2000-01-10,Outdoor,Hard,1st Round,Black B.,Chang M.,70.0,50.0,,...,0.500000,0.0,0.0,0.100000,1.100000,-0.166667,0.916667,0,-1.000000,20.0
96,Other,2000-01-10,Outdoor,Hard,1st Round,Ferrero J.C.,Federer R.,45.0,61.0,,...,0.000000,1.0,1.0,1.100000,1.857534,-0.150943,0.628068,1,-0.757534,-16.0
97,Other,2000-01-10,Outdoor,Hard,1st Round,Fromberg R.,Gambill J.M.,80.0,57.0,,...,0.000000,1.0,1.0,2.163354,2.084119,-0.167883,0.509327,0,0.079235,23.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53463,Other,2020-11-20,Indoor,Hard,Round Robin,Zverev A.,Djokovic N.,7.0,1.0,5525.0,...,0.600000,0.0,-1.0,1.605476,1.637105,-0.750000,0.504877,0,-0.031629,6.0
53464,Other,2020-11-20,Indoor,Hard,Round Robin,Medvedev D.,Schwartzman D.,4.0,9.0,6970.0,...,0.333333,0.0,-1.0,1.506192,1.321431,-0.384615,0.532671,1,0.184762,-5.0
53465,Other,2020-11-21,Indoor,Hard,Semifinals,Djokovic N.,Thiem D.,1.0,3.0,11830.0,...,0.600000,0.0,1.0,1.637105,1.530198,-0.500000,0.516877,0,0.106907,-2.0
53466,Other,2020-11-21,Indoor,Hard,Semifinals,Medvedev D.,Nadal R.,4.0,2.0,6970.0,...,0.600000,0.0,1.0,1.506192,1.662794,-0.333333,0.524708,1,-0.156602,2.0


In [26]:
formatted_mens_df.to_csv('../data/formatted_mens.csv',index=False)

In [27]:
cols_to_ignore = ['Date','Tournament', 'Court', 'Surface', 'Round', 'A', 'B','y']

X = formatted_mens_df.loc[:, [ x for x in formatted_mens_df.columns if x not in cols_to_ignore]]
y = formatted_mens_df['y']

In [28]:
corr = X.corrwith(y,axis=0).sort_values(ascending=False)

In [29]:
corr

B_Avg             0.336002
d_alpha           0.317410
APts              0.209286
A_alpha           0.206445
A_form            0.165922
B_rank            0.163554
A_hist_form       0.141114
A_court_form      0.018690
A_surface_form    0.012151
B_Max             0.010879
ranking_p         0.005108
alpha_p          -0.001236
B_surface_form   -0.005700
B_court_form     -0.006174
A_Max            -0.011053
A_rank           -0.169949
B_form           -0.172668
B_alpha          -0.210136
BPts             -0.210340
d_rank           -0.245055
A_Avg            -0.332216
dtype: float64

## 4 - Exporting Data <a class="anchor" id="4"></a>

In [30]:
formatted_mens_df.sort_values(by='Date',inplace=True)
formatted_mens_df = formatted_mens_df.dropna()

In [31]:
formatted_mens_df

Unnamed: 0,Tournament,Date,Court,Surface,Round,A,B,A_rank,B_rank,APts,...,B_form,B_court_form,B_surface_form,A_alpha,B_alpha,ranking_p,alpha_p,y,d_alpha,d_rank
26824,Other,2010-04-19,Outdoor,Clay,1st Round,Riba P.,Greul S.,98.0,60.0,549.0,...,-0.466667,1.0,-1.0,0.878284,0.885783,-0.240506,0.502125,0,-0.007499,38.0
26823,Other,2010-04-19,Outdoor,Clay,1st Round,Kubot L.,Granollers M.,41.0,93.0,977.0,...,-0.066667,1.0,1.0,1.065031,1.057037,-0.388060,0.501884,0,0.007994,-52.0
26822,Other,2010-04-19,Outdoor,Clay,1st Round,Zeballos H.,Cuevas P.,50.0,54.0,883.0,...,-0.200000,1.0,1.0,1.130595,1.092356,-0.038462,0.508601,0,0.038239,-4.0
26821,Other,2010-04-19,Outdoor,Clay,1st Round,Almagro N.,Ventura S.,34.0,129.0,1255.0,...,-0.333333,1.0,1.0,0.951979,1.110223,-0.582822,0.538368,1,-0.158244,-95.0
26825,Other,2010-04-19,Outdoor,Clay,1st Round,Andreev I.,Gasquet R.,48.0,78.0,885.0,...,-0.066667,1.0,0.0,1.246820,1.261685,-0.238095,0.502963,0,-0.014865,-30.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53463,Other,2020-11-20,Indoor,Hard,Round Robin,Zverev A.,Djokovic N.,7.0,1.0,5525.0,...,0.600000,0.0,-1.0,1.605476,1.637105,-0.750000,0.504877,0,-0.031629,6.0
53464,Other,2020-11-20,Indoor,Hard,Round Robin,Medvedev D.,Schwartzman D.,4.0,9.0,6970.0,...,0.333333,0.0,-1.0,1.506192,1.321431,-0.384615,0.532671,1,0.184762,-5.0
53465,Other,2020-11-21,Indoor,Hard,Semifinals,Djokovic N.,Thiem D.,1.0,3.0,11830.0,...,0.600000,0.0,1.0,1.637105,1.530198,-0.500000,0.516877,0,0.106907,-2.0
53466,Other,2020-11-21,Indoor,Hard,Semifinals,Medvedev D.,Nadal R.,4.0,2.0,6970.0,...,0.600000,0.0,1.0,1.506192,1.662794,-0.333333,0.524708,1,-0.156602,2.0


In [32]:
categorical_cols = ['Tournament', 'Court', 'Surface', 'Round', 'A', 'B']

In [33]:
dummy_mens_df = pd.get_dummies(formatted_mens_df,columns=categorical_cols,drop_first=True)

In [34]:
X = dummy_mens_df.iloc[:, dummy_mens_df.columns != 'y']
y = dummy_mens_df[['Date','y']]

In [35]:
split_time = pd.to_datetime('2018-01-01')

In [36]:
X_train = X.loc[X.Date<split_time,[col for col in X.columns if col != 'Date']]
X_test = X.loc[X.Date>=split_time,[col for col in X.columns if col != 'Date']]
y_train = y.loc[y.Date<split_time,'y']
y_test = y.loc[y.Date>=split_time,'y']            

In [37]:
for df in [X_train, X_test, y_train, y_test]:
    print(df.shape)

(19401, 1411)
(5922, 1411)
(19401,)
(5922,)


In [38]:
import pickle

# Saving the objects:
with open('train_test_split.pkl', 'wb') as f:
    pickle.dump([X_train, X_test, y_train, y_test], f)
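
For completeness, the saved split can be read back in the same order it was written. The sketch below is a minimal round trip using small stand-in lists instead of the real DataFrames; `load_split` and the temporary path are hypothetical.

```python
import os
import pickle
import tempfile

def load_split(path):
    """Read back the [X_train, X_test, y_train, y_test] list written with pickle.dump."""
    with open(path, 'rb') as f:
        X_train, X_test, y_train, y_test = pickle.load(f)
    return X_train, X_test, y_train, y_test

# Round trip with stand-in objects (the real file holds DataFrames and Series)
demo_path = os.path.join(tempfile.mkdtemp(), 'train_test_split.pkl')
with open(demo_path, 'wb') as f:
    pickle.dump([[1, 2], [3], [0, 1], [1]], f)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = load_split(demo_path)
print(X_train_demo, y_test_demo)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.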