# Machine Learning On Tennis Data
## Contents
>* [Setup](#0)
>* [Adding Features](#1)
>* [Pre-Processing](#2)
>* [Getting Ready For ML](#3)

## 0 - Setup <a class="anchor" id="0"></a>

In [1]:
import pandas as pd
import numpy as np
from collections import defaultdict
from sklearn.model_selection import train_test_split

In [2]:
mens_df = pd.read_csv('../data/mens.csv',header=0,parse_dates=["Date"])
womens_df = pd.read_csv('../data/womens.csv',header=0,parse_dates=["Date"])

# Remove walkovers
mens_df = mens_df[mens_df['Comment']!='Walkover']
womens_df = womens_df[womens_df['Comment']!='Walkover']


In [3]:
cols = ['WRank', 'LRank']
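# Convert rankings to numbers; anything non-numeric becomes NaN (column-wise, which avoids a slow row-by-row apply)
mens_df[cols] = mens_df[cols].apply(pd.to_numeric, errors='coerce')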

## 1 - Adding Features <a class="anchor" id="1"></a>

When we train this model, the data will be shuffled, so we'll lose any notion of time. We can add a few easily computed features to capture some of this information. For the moment we can add:
* Streak - length of the current winning streak
* Preferred surface - boolean, true if the current surface is the one with the most wins (players play ~80 matches/year, so maybe look at the last 30 matches)
* Preferred court - same as preferred surface, but for court type
* Historical winner - if the players have met before, who won?

### Data Structures to do this
We'll need some additional data structures:
* A dict for tracking head-to-head results
> player -> dict = {'opponent_name': [1, 0, ...], ...} (1 = win, 0 = loss; defaults to 0.5 if the players have never met)
* A dict for tracking player form and preferences over time. Surfaces are Hard (including Carpet and Greenset), Clay and Grass
> player -> dict = {'form': int[], 'court_wins': int[], 'surface_wins': string[], 'surface_losses': string[]}
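
As a rough sketch of how the head-to-head structure behaves, the toy example below uses made-up players (`'Player A'`, `'Player B'`) and hypothetical helpers `record_result` and `head_to_head_rate`. It mirrors the idea rather than the exact code used later: results are stored per opponent as a list of 1s (wins) and 0s (losses), and an opponent with no recorded meetings falls back to 0.5.

```python
from collections import defaultdict

# Hypothetical toy players, not taken from the dataset
players = ['Player A', 'Player B']

# player -> {opponent: [1, 0, ...]}, defaulting to 0.5 when the pair has never met
head_to_head = {p: defaultdict(lambda: 0.5) for p in players}

def record_result(winner, loser):
    """Append a win (1) for the winner and a loss (0) for the loser."""
    for player, opponent, outcome in [(winner, loser, 1), (loser, winner, 0)]:
        previous = head_to_head[player][opponent]
        # Replace the 0.5 placeholder with a real list on the first meeting
        head_to_head[player][opponent] = [outcome] if previous == 0.5 else previous + [outcome]

def head_to_head_rate(player, opponent):
    """Fraction of past meetings won by `player`, or 0.5 with no history."""
    record = head_to_head[player].get(opponent, 0.5)
    return sum(record) / len(record) if isinstance(record, list) else 0.5

# Player A wins twice, then Player B wins once
record_result('Player A', 'Player B')
record_result('Player A', 'Player B')
record_result('Player B', 'Player A')
rate_a = head_to_head_rate('Player A', 'Player B')
rate_b = head_to_head_rate('Player B', 'Player A')
```

Using `.get` in the lookup avoids creating a placeholder entry for every opponent that is merely queried.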

In [4]:
complete_player_list = set(mens_df['Winner']) | set(mens_df['Loser']) # a set of all unique players

In [5]:
p_observables = {}
historical_wins = {}

for p in complete_player_list:
    p_observables[p] = {'form':[],'court_wins':[],'surface_wins':[],'surface_losses':[]}
    historical_wins[p] = defaultdict(lambda: 0.5,{}) # If have no data assume players evenly matched (hence the 0.5 default)

In [6]:
n_wins_in_feature_lists = 20 #number of previous wins/losses to consider
n_games_for_form_calc = 15 #number of games to consider for recent form

In [7]:
def match_to_new_features(x):
    data_to_return = []
    
    for i,p in enumerate([x.Winner,x.Loser]):
        ## Historical results
        hist_form = np.mean(historical_wins[p][[x.Winner,x.Loser][i-1]])
        
        ## Recent form
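        form = np.mean(p_observables[p]['form']) if (len(p_observables[p]['form'])>0) else 0.5 # Mean of recent results (1 = win, -1 = loss); assume 0.5 if have no data
        if (len(p_observables[p]['form'])>=n_games_for_form_calc): p_observables[p]['form'].pop() # drop the oldest result so the window stays at n_games_for_form_calc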
        
        ## Court form
        desired_court = 'Outdoor' if sum(p_observables[p]['court_wins']) > 0 else 'Indoor' # The court type with most recent wins
        prefered_court = 1 if (x.Court == desired_court) else 0
        
        ## Surface form
        try:
            desired_surface = max(set(p_observables[p]['surface_wins']), key=p_observables[p]['surface_wins'].count)
        except ValueError: # max() of an empty set: no recorded wins yet
            desired_surface = 'None'
        try:
            undesired_surface = max(set(p_observables[p]['surface_losses']), key=p_observables[p]['surface_losses'].count)
        except ValueError: # no recorded losses yet
            undesired_surface = 'None'
            
        preferred_surface = 1 if x.Surface == desired_surface else (-1 if x.Surface == undesired_surface else 0)
        
        ## Updating the historical info for use next time
        if i == 0: # On 1st loop so p is the winner
            historical_wins[p][x.Loser] = [1] if historical_wins[p][x.Loser] == 0.5 else historical_wins[p][x.Loser]+[1] # Append a 1 if the list already exists, else start a new list
            
            p_observables[p]['form'].insert(0,1)
            
            if (len(p_observables[p]['surface_wins']) > n_wins_in_feature_lists): p_observables[p]['surface_wins'].pop() # remove last item from list
            p_observables[p]['surface_wins'].insert(0,x.Surface)
            
            if (len(p_observables[p]['court_wins']) > n_wins_in_feature_lists): p_observables[p]['court_wins'].pop() # if too long remove last one
            p_observables[p]['court_wins'].insert(0,1 if x.Court=='Outdoor' else -1)
        else:
            historical_wins[p][x.Winner] = [0] if historical_wins[p][x.Winner] == 0.5 else historical_wins[p][x.Winner]+[0] # Append a 0 if the list already exists, else start a new list
            
            p_observables[p]['form'].insert(0,-1)
            
            if (len(p_observables[p]['surface_losses']) > n_wins_in_feature_lists): p_observables[p]['surface_losses'].pop() # remove last item from list
            p_observables[p]['surface_losses'].insert(0,x.Surface)

        data_to_return += [hist_form,form,prefered_court,preferred_surface]
        
    return data_to_return

In [8]:
%%time
mens_df[['w_hist_form','w_form','w_court_form','w_surface_form','l_hist_form','l_form','l_court_form','l_surface_form']] = mens_df.apply(match_to_new_features,axis=1, result_type="expand")

Wall time: 38.3 s
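
The manual `pop`/`insert` bookkeeping for recent form could also be expressed with a bounded `collections.deque`, which discards the oldest result automatically. The sketch below is an alternative, not the code that produced the features above; `window` and `recent_form` are illustrative names, and it keeps the same conventions (1 for a win, -1 for a loss, 0.5 when there is no history).

```python
from collections import deque

window = 3  # illustrative; the notebook uses n_games_for_form_calc = 15

def recent_form(results):
    """Mean of the stored results, or 0.5 if the player has no history yet."""
    return sum(results) / len(results) if results else 0.5

form = deque(maxlen=window)      # oldest result drops off once the window is full
before_any_match = recent_form(form)
for outcome in [1, 1, -1, 1]:    # win, win, loss, win
    form.appendleft(outcome)     # newest result at the front, as in the notebook
after_four_matches = recent_form(form)
```

With `maxlen` set, no explicit length check is needed before each insert.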


## 2 - Pre-Processing <a class="anchor" id="2"></a>

First remove data that we think isn't important
> It's really important that we remove the match results, otherwise the outcome leaks into the features <br>
> For the moment remove the betting odds too; these could be converted to implied probabilities and used later
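
If the odds are used later, the conversion might look roughly like this. It is a minimal sketch, assuming the bookmaker columns (for example `B365W` and `B365L`) hold decimal odds; the helper name `implied_probabilities` and the example odds are made up for illustration. The raw implied probability is 1/odds, and dividing by the sum of the two removes the bookmaker's margin so the pair sums to 1.

```python
def implied_probabilities(winner_odds, loser_odds):
    """Convert a pair of decimal odds into probabilities that sum to 1."""
    raw_w = 1.0 / winner_odds
    raw_l = 1.0 / loser_odds
    total = raw_w + raw_l  # exceeds 1 by the bookmaker's margin (overround)
    return raw_w / total, raw_l / total

# Made-up odds: a clear favourite against an outsider
p_w, p_l = implied_probabilities(1.25, 4.0)
```

Any feature built this way would still need to go through the same winner/loser swap as the other columns, otherwise it would give away the result.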

In [9]:
cols_to_drop = ['ATP','Series','Location', 'Date','Best of','W1', 'L1','W2', 'L2', 'W3', 'L3', 'W4', 'L4', 'W5', 'L5', 'Wsets', 'Lsets','Comment', 'CBW', 'CBL', 'GBW', 'GBL', 'IWW', 'IWL', 'SBW', 'SBL','B365W', 'B365L', 'B&WW', 'B&WL', 'EXW', 'EXL', 'PSW', 'PSL', 'WPts','LPts', 'UBW', 'UBL', 'LBW', 'LBL', 'SJW', 'SJL', 'MaxW', 'MaxL','AvgW', 'AvgL']

filtered_mens_df = mens_df.drop(cols_to_drop,axis=1)

In [10]:
## Replace tournament names which appear infrequently (under 1% of matches) with 'Other'
tournament_names_and_counts = filtered_mens_df['Tournament'].value_counts(normalize=True)
infrequent_tournaments = list(tournament_names_and_counts[tournament_names_and_counts<0.01].index)

filtered_mens_df.loc[filtered_mens_df['Tournament'].isin(infrequent_tournaments),'Tournament'] = 'Other'
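
A toy illustration of the same thresholding idea, with made-up tournament names and a larger cutoff so the effect is visible on a handful of rows (the 20% cutoff and the names are assumptions for the example only):

```python
import pandas as pd

toy = pd.Series(['Open X'] * 6 + ['Open Y'] * 3 + ['Cup Z'], name='Tournament')

shares = toy.value_counts(normalize=True)        # fraction of rows per tournament
rare = shares[shares < 0.2].index                # hypothetical 20% cutoff for the toy data
bundled = toy.where(~toy.isin(rare), 'Other')    # keep frequent names, relabel the rest
```

Bundling rare categories keeps the number of one-hot columns manageable when the column is later passed to `pd.get_dummies`.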

In [11]:
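## Bundle surfaces into three categories
surface_dict = {'Clay':'Clay','Grass':'Grass','Hard':'Hard','Carpet':'Hard','Greenset':'Hard'}
# Apply to filtered_mens_df (a separate copy of mens_df), since that is the frame used from here on
filtered_mens_df['Surface'] = filtered_mens_df['Surface'].replace(surface_dict)
filtered_mens_df.Surface.unique()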

array(['Hard', 'Clay', 'Grass'], dtype=object)

Now we need to randomly swap the winner and loser columns so that the ML model can't simply learn to always pick the player in the winner column

In [12]:
winner_cols = [c for c in filtered_mens_df.columns if c[0].lower() == 'w']
loser_cols = [c for c in filtered_mens_df.columns if c[0].lower() == 'l']
other_cols = [c for c in filtered_mens_df.columns if not (c in winner_cols or c in loser_cols)]

In [13]:
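def random_transpose_of_data(x):
    x = x.copy() # don't mutate the row passed in by apply
    if np.random.choice([True,False]):
        # Use .values so the swap is positional rather than aligned on column labels
        winner_values = x[winner_cols].values
        x[winner_cols] = x[loser_cols].values
        x[loser_cols] = winner_values
        x['y'] = 0 # players swapped: the original winner is now in the loser columns
    else:
        x['y'] = 1 # unchanged: the winner is still in the winner columns
    return x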

In [14]:
col_rename_dict = {'Winner':'A','Loser':'B','WRank':'A_rank','LRank':'B_rank'}
for col in winner_cols:
    if col not in  ['Winner','WRank']:
        col_rename_dict[col]="A"+col[1:]
for col in loser_cols:
    if col not in  ['Loser','LRank']:
        col_rename_dict[col]="B"+col[1:]

In [15]:
formatted_mens_df = filtered_mens_df.apply(random_transpose_of_data,axis=1).rename(columns=col_rename_dict)
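
The row-wise `apply` above draws from an unseeded random generator, so the split between swapped and unswapped rows changes on every run, and row-wise `apply` is generally slower than column operations. A possible vectorized alternative is sketched below on a toy frame. The function name `random_swap`, the seed, and the toy data are assumptions for illustration, and it relies on the winner and loser column lists naming corresponding columns in the same order.

```python
import numpy as np
import pandas as pd

def random_swap(df, winner_cols, loser_cols, seed=0):
    """Swap winner/loser columns on a random subset of rows and add label y."""
    rng = np.random.default_rng(seed)
    swap = rng.random(len(df)) < 0.5           # True -> swap this row
    out = df.copy()
    for w, l in zip(winner_cols, loser_cols):  # one column pair at a time keeps dtypes intact
        out[w] = df[w].mask(swap, df[l])       # take the loser's value where swapping
        out[l] = df[l].mask(swap, df[w])       # and the winner's value in the loser column
    out['y'] = np.where(swap, 0, 1)            # 1 if the winner stayed in the winner columns
    return out

# Toy data with made-up players
toy = pd.DataFrame({
    'Winner': ['P1', 'P3', 'P5', 'P7'],
    'Loser':  ['P2', 'P4', 'P6', 'P8'],
    'WRank':  [1.0, 3.0, 5.0, 7.0],
    'LRank':  [2.0, 4.0, 6.0, 8.0],
})
swapped = random_swap(toy, ['Winner', 'WRank'], ['Loser', 'LRank'], seed=0)
```

The result can then be renamed with `col_rename_dict` in the same way as before.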

## 3 - Getting Ready For ML <a class="anchor" id="3"></a>

After adding a ranking-based feature, the men's data is in the following format

In [16]:
formatted_mens_df['ranking_p'] = -2*(formatted_mens_df['A_rank'] / (formatted_mens_df['A_rank'] + formatted_mens_df['B_rank'])-0.5)
formatted_mens_df.drop(['B_hist_form'],axis=1,inplace=True) # Drop B_hist_form as it's 1-A_hist_form
formatted_mens_df

| | Tournament | Court | Surface | Round | A | B | A_rank | B_rank | A_hist_form | A_form | A_court_form | A_surface_form | B_form | B_court_form | B_surface_form | y | ranking_p |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Other | Outdoor | Hard | 1st Round | Dosedel S. | Ljubicic I. | 63.0 | 77.0 | 0.50 | 0.500000 | 0.0 | 0.0 | 0.500000 | 0.0 | 0.0 | 1 | 0.100000 |
| 1 | Other | Outdoor | Hard | 1st Round | Enqvist T. | Clement A. | 5.0 | 56.0 | 0.50 | 0.500000 | 0.0 | 0.0 | 0.500000 | 0.0 | 0.0 | 1 | 0.836066 |
| 2 | Other | Outdoor | Hard | 1st Round | Escude N. | Baccanello P. | 40.0 | 655.0 | 0.50 | 0.500000 | 0.0 | 0.0 | 0.500000 | 0.0 | 0.0 | 1 | 0.884892 |
| 3 | Other | Outdoor | Hard | 1st Round | Federer R. | Knippschild J. | 65.0 | 87.0 | 0.50 | 0.500000 | 0.0 | 0.0 | 0.500000 | 0.0 | 0.0 | 1 | 0.144737 |
| 4 | Other | Outdoor | Hard | 1st Round | Woodbridge T. | Fromberg R. | 198.0 | 81.0 | 0.50 | 0.500000 | 0.0 | 0.0 | 0.500000 | 0.0 | 0.0 | 0 | -0.419355 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53463 | Other | Indoor | Hard | Round Robin | Djokovic N. | Zverev A. | 1.0 | 7.0 | 0.60 | 0.600000 | 0.0 | -1.0 | 0.733333 | 1.0 | 1.0 | 1 | 0.750000 |
| 53464 | Other | Indoor | Hard | Round Robin | Schwartzman D. | Medvedev D. | 9.0 | 4.0 | 0.00 | 0.333333 | 0.0 | -1.0 | 0.333333 | 1.0 | 1.0 | 0 | -0.384615 |
| 53465 | Other | Indoor | Hard | Semifinals | Djokovic N. | Thiem D. | 1.0 | 3.0 | 0.70 | 0.600000 | 0.0 | -1.0 | 0.600000 | 0.0 | 1.0 | 0 | 0.500000 |
| 53466 | Other | Indoor | Hard | Semifinals | Medvedev D. | Nadal R. | 4.0 | 2.0 | 0.00 | 0.466667 | 1.0 | 1.0 | 0.600000 | 0.0 | 1.0 | 1 | -0.333333 |
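
The `ranking_p` expression simplifies to `(B_rank - A_rank) / (A_rank + B_rank)`, so it lies between -1 and 1 and is positive when player A has the better (lower-numbered) ranking. The short check below uses a hypothetical single-pair helper, also called `ranking_p`, with rankings taken from rows 0 and 4 of the table above.

```python
def ranking_p(a_rank, b_rank):
    """Same expression as in the cell above, for a single pair of rankings."""
    return -2 * (a_rank / (a_rank + b_rank) - 0.5)

# Rankings taken from rows 0 and 4 of the table above
print(round(ranking_p(63.0, 77.0), 6))   # 0.1
print(round(ranking_p(198.0, 81.0), 6))  # -0.419355
```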


In [17]:
categorical_cols = ['Tournament', 'Court', 'Surface', 'Round', 'A', 'B']

In [18]:
dummy_mens_df = pd.get_dummies(formatted_mens_df,columns=categorical_cols)

In [19]:
corr = dummy_mens_df.iloc[:, dummy_mens_df.columns != 'y'].corrwith(dummy_mens_df['y'],axis=0).sort_values(ascending=False)

In [20]:
corr.loc[['A_rank','B_rank','A_hist_form','A_form','A_court_form','A_surface_form','B_form','B_court_form','B_surface_form','ranking_p']]

A_rank           -0.169207
B_rank            0.160918
A_hist_form       0.140511
A_form            0.162680
A_court_form      0.008145
A_surface_form    0.010384
B_form           -0.171558
B_court_form     -0.018442
B_surface_form   -0.010546
ranking_p         0.370038
dtype: float64

In [21]:
X = dummy_mens_df.iloc[:, dummy_mens_df.columns != 'y']
y = dummy_mens_df['y']

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [23]:
import pickle

# Saving the objects:
with open('train_test_split.pkl', 'wb') as f:
    pickle.dump([X_train, X_test, y_train, y_test], f)
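
To pick the split back up in a later notebook, the list can be unpickled and unpacked in the same order it was saved. The sketch below round-trips toy stand-in objects through a temporary file so that it runs on its own; with the real data, only the `pickle.load` step and the `train_test_split.pkl` path from above would be needed.

```python
import os
import pickle
import tempfile

# Toy stand-ins for X_train, X_test, y_train, y_test
saved = [[[1, 2], [3, 4]], [[5, 6]], [1, 0], [1]]

path = os.path.join(tempfile.mkdtemp(), 'train_test_split.pkl')
with open(path, 'wb') as f:
    pickle.dump(saved, f)

# Loading: unpack in the same order used when saving
with open(path, 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)
```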