# Machine Learning On Tennis Data
## Contents
>* [Setup](#0)
>* [Adding Features](#1)
>* [Pre-Processing](#2)
>* [Feature Selection](#3)
>* [Exporting Data](#4)

## 0 - Setup <a class="anchor" id="0"></a>

In [1]:
import pandas as pd
import numpy as np
from collections import defaultdict
from sklearn.model_selection import train_test_split

In [2]:
mens_df = pd.read_csv('../data/mens.csv',header=0,parse_dates=["Date"])
womens_df = pd.read_csv('../data/womens.csv',header=0,parse_dates=["Date"])

# Remove walkovers
mens_df = mens_df[mens_df['Comment']!='Walkover']
womens_df = womens_df[womens_df['Comment']!='Walkover']


In [3]:
cols = ['WRank', 'LRank']
mens_df[cols] = mens_df[cols].apply(pd.to_numeric, errors='coerce', axis=1)

In [4]:
mens_df.shape

(53203, 54)

In [5]:
mens_df = mens_df[~mens_df['WRank'].isnull() & ~mens_df['LRank'].isnull()]

In [6]:
mens_df.shape

(53069, 54)

## 1 - Adding Features <a class="anchor" id="1"></a>

When we train this model the data will be shuffled, so we'll lose any notion of time. We can add a few easily computable features to capture some of this information. For the moment we can add:
* Streak - length of the current winning streak
* Preferred surface - boolean, true if the current surface is the one with the most wins (players play ~80 games/year, so maybe look at the last 30 games)
* Preferred court - same as preferred surface, but for court type
* Historical winner - if the players have played before, who won?
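
The streak feature in the list above can be derived from a player's recent results. A minimal sketch, assuming a hypothetical most-recent-first list in which 1 is a win and -1 a loss (the function name is introduced here for illustration only):

```python
def current_win_streak(results):
    """Length of the current winning streak.

    `results` is a hypothetical most-recent-first list: 1 = win, -1 = loss.
    """
    streak = 0
    for r in results:
        if r != 1:
            break  # the first non-win ends the current streak
        streak += 1
    return streak

print(current_win_streak([1, 1, 1, -1, 1]))  # 3
print(current_win_streak([-1, 1, 1]))        # 0
print(current_win_streak([]))                # 0
```

A player with no recorded matches gets a streak of 0 here; whether that is the right default is a modelling choice.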

### Data Structures
We'll need two additional data structures to compute these features:
* A dict for tracking player wins
> player -> dict = {'player_name':boolean,...}
* A dict for tracking player preferences over time
Hardcourt/carpet weighting/Greenset, Clay and Grass
> player -> dict = {'winning streak':int,'losing streak':int,'court wins':int[],'surface wins':string[],'surface losses':string[]}

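As a rough illustration of these two structures, here is a minimal sketch on made-up players. It uses `collections.deque(maxlen=...)` and `Counter` in place of manual list bookkeeping; that substitution, the window size and all names below are assumptions made for brevity, not the notebook's implementation.

```python
from collections import Counter, defaultdict, deque

N_RECENT = 3  # hypothetical window size, kept small for the example

# player -> opponent -> list of results (1 = player won, 0 = player lost)
head_to_head = defaultdict(lambda: defaultdict(list))
# player -> surfaces of that player's most recent wins (oldest entries fall off)
surface_wins = defaultdict(lambda: deque(maxlen=N_RECENT))

def record_match(winner, loser, surface):
    head_to_head[winner][loser].append(1)
    head_to_head[loser][winner].append(0)
    surface_wins[winner].append(surface)

def win_rate(player, opponent):
    """Share of past meetings won by `player`; 0.5 when they have never met."""
    results = head_to_head[player][opponent]
    return sum(results) / len(results) if results else 0.5

def preferred_surface(player):
    """Most common surface among the player's recent wins, or None without wins."""
    wins = surface_wins[player]
    return Counter(wins).most_common(1)[0][0] if wins else None

matches = [("A", "B", "Clay"), ("A", "B", "Clay"), ("B", "A", "Hard"), ("A", "C", "Grass")]
for w, l, s in matches:
    record_match(w, l, s)

print(win_rate("A", "B"))      # A won 2 of the 3 meetings with B
print(win_rate("A", "D"))      # 0.5, no history
print(preferred_surface("A"))  # Clay
```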
In [7]:
complete_player_list = set(list(mens_df['Winner'].values)+list(mens_df['Loser'].values)) # a set of all unique players (despite the name)

In [8]:
p_observables = {}
historical_wins = {}

for p in complete_player_list:
    p_observables[p] = {'form':[],'court_wins':[],'surface_wins':[],'surface_losses':[]}
    historical_wins[p] = defaultdict(lambda: 0.5,{}) # If we have no data, assume the players are evenly matched (hence the 0.5 default)

In [9]:
n_wins_in_feature_lists = 20 #number of previous wins/losses to consider
n_games_for_form_calc = 15 #number of games to consider for recent form

In [10]:
def match_to_new_features(x):
    data_to_return = []
    
    for i,p in enumerate([x.Winner,x.Loser]):
        ## Historical results
        hist_form = np.mean(historical_wins[p][[x.Winner,x.Loser][i-1]])
        
        ## Recent form
        form = np.mean(p_observables[p]['form']) if (len(p_observables[p]['form'])>0) else 0.5 # Assume 0.5 if we have no data
        if (len(p_observables[p]['form'])>=n_games_for_form_calc): p_observables[p]['form'].pop()
        
        ## Court form
        desired_court = 'Outdoor' if sum(p_observables[p]['court_wins']) > 0 else 'Indoor' # The court type with most recent wins
        prefered_court = 1 if (x.Court == desired_court) else 0
        
        ## Surface form
        try:
            desired_surface = max(set(p_observables[p]['surface_wins']), key=p_observables[p]['surface_wins'].count)
        except ValueError: # max() of an empty set: no recorded wins yet
            desired_surface = 'None'
        try:
            undesired_surface = max(set(p_observables[p]['surface_losses']), key=p_observables[p]['surface_losses'].count)
        except ValueError: # max() of an empty set: no recorded losses yet
            undesired_surface = 'None'
            
        preferred_surface = 1 if x.Surface == desired_surface else (-1 if x.Surface == undesired_surface else 0)
        
        ## Updating the historical info for use next time
        if i == 0: # On 1st loop so p is the winner
            historical_wins[p][x.Loser] = [1] if historical_wins[p][x.Loser] == 0.5 else historical_wins[p][x.Loser]+[1] # Append a 1 if the list already exists, else start a new list
            
            p_observables[p]['form'].insert(0,1)
            
            if (len(p_observables[p]['surface_wins']) > n_wins_in_feature_lists): p_observables[p]['surface_wins'].pop() # remove last item from list
            p_observables[p]['surface_wins'].insert(0,x.Surface)
            
            if (len(p_observables[p]['court_wins']) > n_wins_in_feature_lists): p_observables[p]['court_wins'].pop() # if too long remove last one
            p_observables[p]['court_wins'].insert(0,1 if x.Court=='Outdoor' else -1)
        else:
            historical_wins[p][x.Winner] = [0] if historical_wins[p][x.Winner] == 0.5 else historical_wins[p][x.Winner]+[0] # Append a 0 if the list already exists, else start a new list
            
            p_observables[p]['form'].insert(0,-1)
            
            if (len(p_observables[p]['surface_losses']) > n_wins_in_feature_lists): p_observables[p]['surface_losses'].pop() # remove last item from list
            p_observables[p]['surface_losses'].insert(0,x.Surface)

        data_to_return += [hist_form,form,prefered_court,preferred_surface]
        
    return data_to_return

In [11]:
%%time
mens_df[['w_hist_form','w_form','w_court_form','w_surface_form','l_hist_form','l_form','l_court_form','l_surface_form']] = mens_df.apply(match_to_new_features,axis=1, result_type="expand")

Wall time: 28.1 s


### Adding alpha data

In [12]:
alpha_df = pd.read_pickle(r"K:\Code\Tennis_Betting\2. Model exploration\alphas\mens_alphas.pickle")
alpha_df.dropna(inplace=True)
alpha_df.head()

start,end,p_dict,x
2017-01-02,2017-04-01,"{'Choinski J.': 0, 'Broady L.': 1, 'Delbonis F...","[0.7239695125774777, 1.069924249476819, 1.0938..."
2002-05-27,2002-05-27,"{'Pretzsch A.': 0, 'Larsson M.': 1, 'Pavel A.'...","[0.800685292740611, 0.964876756885678, 1.52120..."
2012-05-22,2012-05-26,"{'Del Bonis F.': 0, 'Blake J.': 1, 'Delbonis F...","[1.6033600263722116, 0.9296558179990959, 1.334..."
2000-07-10,2000-07-10,"{'Pretzsch A.': 0, 'Larsson M.': 1, 'Pavel A.'...","[1.0469830051536664, 1.7652306826038533, 1.447..."
2013-09-16,2013-09-22,"{'Del Bonis F.': 0, 'Broady L.': 1, 'Blake J.'...","[1.3234143857328917, 0.7012684908314621, 1.191..."


In [13]:
dates_list = []
for i,df in mens_df.groupby([mens_df.Date.dt.year,'ATP']):
    dates_list.append((min(df['Date']),max(df['Date'])))
dates_list=list(set(dates_list))

In [14]:
for i,x in alpha_df.iterrows():
    start, end = i[0], i[1]
    p_dict = x.p_dict
    alpha = x.x
    alpha[-1] = 1.1 # avg alpha value, picked up by players missing from p_dict (they map to index -1)
    in_window = (mens_df['Date']<=end) & (mens_df['Date']>=start)
    df_slice = mens_df.loc[in_window,:]
    mens_df.loc[in_window,'w_alpha'] = alpha[df_slice['Winner'].map(p_dict).fillna(-1).astype('int')]
    mens_df.loc[in_window,'l_alpha'] = alpha[df_slice['Loser'].map(p_dict).fillna(-1).astype('int')]
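
The lookup in the loop above relies on NumPy integer-array indexing: each name is mapped to its position in the alpha array, and names missing from `p_dict` become `-1`, which selects the last element. A small sketch of that mechanism on made-up values (the names, alphas and variable names are illustrative only). Unlike the cell above, which overwrites the last element with the default, this variation appends an extra slot so that no player's value is replaced:

```python
import numpy as np
import pandas as pd

p_dict = {"Player A": 0, "Player B": 1}  # name -> position in the alpha array
alpha = np.array([0.8, 1.4])             # made-up alpha values
default_alpha = 1.1                      # value used for unknown players

# Extra final slot: index -1 now points at the default, not at a real player
alpha_with_default = np.append(alpha, default_alpha)

names = pd.Series(["Player B", "Unknown", "Player A"])
idx = names.map(p_dict).fillna(-1).astype("int").to_numpy()  # [1, -1, 0]
looked_up = alpha_with_default[idx]
print(looked_up)  # [1.4 1.1 0.8]
```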

In [15]:
mens_df = mens_df[mens_df['w_alpha'].notna() & mens_df['l_alpha'].notna()]

In [16]:
mens_df['ranking_p'] = -2*(mens_df[['WRank','LRank']].max(axis=1) / (mens_df['WRank'] + mens_df['LRank'])-0.5)
mens_df['alpha_p'] = mens_df[['w_alpha','l_alpha']].max(axis=1) / (mens_df['w_alpha'] + mens_df['l_alpha'])
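
Both quantities depend only on the larger of the two values and on their sum, so they do not change when winner and loser are swapped. A minimal sketch that restates the two formulas as plain functions (the function names are introduced here for illustration) and checks this on made-up numbers:

```python
def ranking_p(rank_1, rank_2):
    # Same expression as the ranking_p column, for a single pair of ranks
    return -2 * (max(rank_1, rank_2) / (rank_1 + rank_2) - 0.5)

def alpha_p(alpha_1, alpha_2):
    # Same expression as the alpha_p column, for a single pair of alphas
    return max(alpha_1, alpha_2) / (alpha_1 + alpha_2)

print(ranking_p(10, 30), ranking_p(30, 10))  # -0.5 -0.5
print(alpha_p(1.5, 0.5), alpha_p(0.5, 1.5))  # 0.75 0.75
```

Equal ranks give `ranking_p = 0` and a large rank gap pushes it towards -1, while `alpha_p` runs from 0.5 towards 1 for positive alphas. Neither value says which player holds the better rank or alpha.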

## 2 - Pre-Processing <a class="anchor" id=2></a>

First, remove the columns that we don't want the model to see
> It's really important that we remove the match results, since they would leak the outcome into the features <br>
> For the moment we also remove the betting odds; these could be converted to implied probabilities and used later

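For reference, converting odds to implied probabilities could look like the sketch below. It assumes the odds columns hold decimal odds (an assumption; the format is not stated here) and normalises the two reciprocals so that they sum to 1, which removes the bookmaker's margin. The function name and the example odds are made up.

```python
def implied_probabilities(odds_w, odds_l):
    """Convert a pair of decimal odds into probabilities that sum to 1."""
    raw_w, raw_l = 1.0 / odds_w, 1.0 / odds_l
    total = raw_w + raw_l  # exceeds 1 when the bookmaker takes a margin
    return raw_w / total, raw_l / total

p_w, p_l = implied_probabilities(1.5, 2.5)
print(round(p_w, 3), round(p_l, 3))  # 0.625 0.375
```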
In [17]:
cols_to_drop = ['ATP','Series','Location', 'Date','Best of','W1', 'L1','W2', 'L2', 'W3', 'L3', 'W4', 'L4', 'W5', 'L5', 'Wsets', 'Lsets','Comment', 'CBW', 'CBL', 'GBW', 'GBL', 'IWW', 'IWL', 'SBW', 'SBL','B365W', 'B365L', 'B&WW', 'B&WL', 'EXW', 'EXL', 'PSW', 'PSL', 'WPts','LPts', 'UBW', 'UBL', 'LBW', 'LBL', 'SJW', 'SJL', 'MaxW', 'MaxL','AvgW', 'AvgL']

filtered_mens_df = mens_df.drop(cols_to_drop,axis=1)

In [18]:
## Replace tournament names that make up less than 1% of matches with 'Other'
tournament_names_and_counts = filtered_mens_df['Tournament'].value_counts(normalize=True)
infrequent_tournaments = list(tournament_names_and_counts[tournament_names_and_counts<0.01].index)

filtered_mens_df.loc[filtered_mens_df['Tournament'].isin(infrequent_tournaments),'Tournament'] = 'Other'

In [19]:
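## Bundle surfaces into three categories
surface_dict = {'Clay':'Clay','Grass':'Grass','Hard':'Hard','Carpet':'Hard','Greenset':'Hard'}
# Apply to filtered_mens_df, the frame used from here on (it was copied from mens_df in In [17])
filtered_mens_df = filtered_mens_df.replace({"Surface": surface_dict})
filtered_mens_df.Surface.unique()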

array(['Hard', 'Clay', 'Grass'], dtype=object)

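Now we need to randomly swap the winner and loser in roughly half of the rows, so that the ML model can't learn to always just pick the player in the winner column. The new column `y` records the outcome: 1 if the first player won, 0 if the row was swapped.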

In [20]:
winner_cols = [c for c in filtered_mens_df.columns if c[0].lower() == 'w']
loser_cols = [c for c in filtered_mens_df.columns if c[0].lower() == 'l']
other_cols = [c for c in filtered_mens_df.columns if not (c in winner_cols or c in loser_cols)]

In [21]:
def random_transpose_of_data(x):
    if np.random.choice([True,False]):
        temp = x[winner_cols]
        x[winner_cols] = x[loser_cols]
        x[loser_cols] = temp
        x['y'] = 0
    else:
        x['y'] = 1
    return x

In [22]:
col_rename_dict = {'Winner':'A','Loser':'B','WRank':'A_rank','LRank':'B_rank'}
for col in winner_cols:
    if col not in  ['Winner','WRank']:
        col_rename_dict[col]="A"+col[1:]
for col in loser_cols:
    if col not in  ['Loser','LRank']:
        col_rename_dict[col]="B"+col[1:]

In [23]:
formatted_mens_df = filtered_mens_df.apply(random_transpose_of_data,axis=1).rename(columns=col_rename_dict)
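
Row-wise `apply` can be slow on a large frame. A vectorised sketch of the same random swap, shown on a toy frame, is given below. The helper name `random_swap`, its `seed` argument and the toy data are assumptions for illustration, and it expects `winner_cols` and `loser_cols` to list matching columns in the same order.

```python
import numpy as np
import pandas as pd

def random_swap(df, winner_cols, loser_cols, seed=None):
    """Swap winner/loser columns in each row with probability 0.5.

    Adds y = 1 where the row is unchanged and y = 0 where it was swapped.
    """
    rng = np.random.default_rng(seed)
    swap = rng.random(len(df)) < 0.5  # True -> swap this row
    out = df.copy()
    for w_col, l_col in zip(winner_cols, loser_cols):
        w_old = df[w_col].to_numpy(copy=True)
        l_old = df[l_col].to_numpy(copy=True)
        out[w_col] = np.where(swap, l_old, w_old)
        out[l_col] = np.where(swap, w_old, l_old)
    out["y"] = np.where(swap, 0, 1)
    return out

toy = pd.DataFrame({
    "Winner": ["P1", "P2", "P3", "P4"],
    "Loser": ["Q1", "Q2", "Q3", "Q4"],
    "WRank": [5.0, 20.0, 3.0, 50.0],
    "LRank": [9.0, 11.0, 40.0, 8.0],
})
swapped = random_swap(toy, ["Winner", "WRank"], ["Loser", "LRank"], seed=0)
print(swapped)
```

Passing a fixed `seed` makes the swap reproducible, which the `np.random.choice` version is not unless the global seed is set.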

## 3 - Feature Selection <a class="anchor" id=3></a>

Adding a few more quick features

In [24]:
# quickly add more features, easier to do this after relabelling A/B
formatted_mens_df['d_alpha'] = formatted_mens_df['A_alpha']-formatted_mens_df['B_alpha']
formatted_mens_df['d_rank'] = formatted_mens_df['A_rank']-formatted_mens_df['B_rank']
formatted_mens_df.drop(['B_hist_form'],axis=1,inplace=True) # Drop B_hist_form as it's 1-A_hist_form
formatted_mens_df

Unnamed: 0,Tournament,Court,Surface,Round,A,B,A_rank,B_rank,A_hist_form,A_form,...,B_form,B_court_form,B_surface_form,A_alpha,B_alpha,ranking_p,alpha_p,y,d_alpha,d_rank
93,Heineken Open,Outdoor,Hard,1st Round,Balcells J.,Squillari F.,211.0,49.0,0.5,-1.000000,...,0.000000,1.0,1.0,1.286733,0.993303,-0.623077,0.564348,1,0.293430,162.0
94,Heineken Open,Outdoor,Hard,1st Round,Behrend T.,Hantschk M.,115.0,95.0,0.0,0.000000,...,0.600000,1.0,1.0,0.502757,1.192691,-0.095238,0.703467,1,-0.689934,20.0
95,Heineken Open,Outdoor,Hard,1st Round,Black B.,Chang M.,70.0,50.0,0.5,-1.000000,...,0.500000,0.0,0.0,0.100000,1.100000,-0.166667,0.916667,0,-1.000000,20.0
96,Heineken Open,Outdoor,Hard,1st Round,Ferrero J.C.,Federer R.,45.0,61.0,0.5,0.500000,...,0.000000,1.0,1.0,1.100000,1.857534,-0.150943,0.628068,1,-0.757534,-16.0
97,Heineken Open,Outdoor,Hard,1st Round,Gambill J.M.,Fromberg R.,57.0,80.0,0.5,0.000000,...,0.000000,1.0,1.0,2.084119,2.163354,-0.167883,0.509327,1,-0.079235,-23.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53448,Other,Indoor,Hard,Quarterfinals,Millman J.,Pospisil V.,38.0,74.0,0.5,0.200000,...,0.200000,1.0,1.0,1.156403,1.448456,-0.321429,0.556059,0,-0.292053,-36.0
53449,Other,Indoor,Hard,Quarterfinals,Caruso S.,Gasquet R.,82.0,49.0,0.5,-0.200000,...,-0.066667,0.0,1.0,0.642802,1.296198,-0.251908,0.668488,0,-0.653396,33.0
53450,Other,Indoor,Hard,Semifinals,Sinner J.,Mannarino A.,44.0,35.0,0.5,0.466667,...,0.333333,1.0,1.0,1.259033,1.331611,-0.113924,0.514008,1,-0.072578,9.0
53451,Other,Indoor,Hard,Semifinals,Gasquet R.,Pospisil V.,49.0,74.0,0.4,0.066667,...,0.200000,1.0,1.0,1.296198,1.448456,-0.203252,0.527737,0,-0.152259,-25.0


In [39]:
cols_to_ignore = ['Tournament', 'Court', 'Surface', 'Round', 'A', 'B','y'] # categorical columns plus the target

X = formatted_mens_df.loc[:, [ x for x in formatted_mens_df.columns if x not in cols_to_ignore]]
y = formatted_mens_df['y']

In [58]:
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

# feature extraction
test = SelectKBest(score_func=f_classif, k=4)
fit = test.fit(X, y)
# summarize scores
set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)

[1.324e+03 1.303e+03 9.741e+02 1.343e+03 1.530e+00 1.736e+00 1.420e+03
 1.241e+01 1.529e+01 1.989e+03 2.176e+03 2.493e+00 7.732e-01 5.131e+03
 2.937e+03]


In [59]:
sorted_features = sorted(zip(fit.scores_,range(len(fit.scores_))), key = lambda t: -t[0])
for score,i in sorted_features:
    print(X.columns[i],score)

d_alpha 5131.298113354098
d_rank 2936.5236294215206
B_alpha 2176.4763007313327
A_alpha 1988.6608864969492
B_form 1420.2947821651303
A_form 1342.7726595482482
A_rank 1324.1614187082703
B_rank 1302.7096861521184
A_hist_form 974.0774295845296
B_surface_form 15.29448023267822
B_court_form 12.413580571809135
ranking_p 2.492521640788826
A_surface_form 1.7355006544696352
A_court_form 1.5297140237047466
alpha_p 0.7731801639501608


In [61]:
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier(n_estimators=10)
model.fit(X, y)
print(model.feature_importances_)

[0.088 0.086 0.039 0.077 0.018 0.031 0.079 0.017 0.031 0.089 0.092 0.088
 0.081 0.093 0.09 ]


In [62]:
sorted_features = sorted(zip(model.feature_importances_,range(len(model.feature_importances_))), key = lambda t: -t[0])
for score,i in sorted_features:
    print(X.columns[i],score)

d_alpha 0.09299527697682991
B_alpha 0.09245062158031882
d_rank 0.08974357458727464
A_alpha 0.08943135799202709
ranking_p 0.08821803119825614
A_rank 0.08795684256292742
B_rank 0.0859923520571512
alpha_p 0.08069728348367257
B_form 0.07927647838113626
A_form 0.07747163758984363
A_hist_form 0.03899810684954945
A_surface_form 0.031240122159098233
B_surface_form 0.03114419219304069
A_court_form 0.017652315470560023
B_court_form 0.01673180691831396


In [63]:
corr = X.corrwith(y,axis=0).sort_values(ascending=False)

In [64]:
corr

d_alpha           0.315996
A_alpha           0.203026
A_form            0.167957
B_rank            0.165502
A_hist_form       0.143609
ranking_p         0.007340
A_surface_form    0.006125
A_court_form      0.005751
alpha_p          -0.004088
B_court_form     -0.016380
B_surface_form   -0.018181
A_rank           -0.166822
B_form           -0.172597
B_alpha          -0.211985
d_rank           -0.244322
dtype: float64

## 4 - Exporting Data <a class="anchor" id=4></a>

In [25]:
categorical_cols = ['Tournament', 'Court', 'Surface', 'Round', 'A', 'B']

In [26]:
dummy_mens_df = pd.get_dummies(formatted_mens_df,columns=categorical_cols)

In [29]:
X = dummy_mens_df.iloc[:, dummy_mens_df.columns != 'y']
y = dummy_mens_df['y']

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [31]:
import pickle

# Saving the objects:
with open('train_test_split.pkl', 'wb') as f:
    pickle.dump([X_train, X_test, y_train, y_test], f)
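
To read the split back in another notebook, the list can be unpacked in the same order it was dumped. A minimal round-trip sketch, using small placeholder lists and a temporary file in place of the real DataFrames and `train_test_split.pkl`:

```python
import os
import pickle
import tempfile

# Placeholder objects standing in for X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = [[1, 2], [3, 4]], [[5, 6]], [1, 0], [1]

path = os.path.join(tempfile.mkdtemp(), "train_test_split.pkl")
with open(path, "wb") as f:
    pickle.dump([X_train, X_test, y_train, y_test], f)

# Loading: unpack in the same order that was used for pickle.dump
with open(path, "rb") as f:
    X_train_loaded, X_test_loaded, y_train_loaded, y_test_loaded = pickle.load(f)

print(X_train_loaded == X_train, y_test_loaded == y_test)  # True True
```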