## Machine Learning On Tennis Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
mens_df = pd.read_csv('../data/mens.csv',header=0,parse_dates=["Date"])
womens_df = pd.read_csv('../data/womens.csv',header=0,parse_dates=["Date"])

# Remove walkovers
mens_df = mens_df[mens_df['Comment']!='Walkover']
womens_df = womens_df[womens_df['Comment']!='Walkover']


## Data Dictionary
Key to results data:

* ATP = Tournament number (men)
* WTA = Tournament number (women)
* Location = Venue of tournament
* Tournament = Name of tournament (including sponsor if relevant)
* Date = Date of match (note: prior to 2003 the date shown for all matches played in a single tournament is the start date)
* Series = Name of ATP tennis series (Grand Slam, Masters, International or International Gold)
* Tier = Tier (tournament ranking) of WTA tennis series.
* Court = Type of court (outdoors or indoors)
* Surface = Type of surface (clay, hard, carpet or grass)
* Round = Round of match
* Best of = Maximum number of sets playable in match
* Winner = Match winner
* Loser = Match loser
* WRank = ATP Entry ranking of the match winner as of the start of the tournament
* LRank = ATP Entry ranking of the match loser as of the start of the tournament
* WPts = ATP Entry points of the match winner as of the start of the tournament
* LPts = ATP Entry points of the match loser as of the start of the tournament
* W1 = Number of games won in 1st set by match winner
* L1 = Number of games won in 1st set by match loser
* W2 = Number of games won in 2nd set by match winner
* L2 = Number of games won in 2nd set by match loser
* W3 = Number of games won in 3rd set by match winner
* L3 = Number of games won in 3rd set by match loser
* W4 = Number of games won in 4th set by match winner
* L4 = Number of games won in 4th set by match loser
* W5 = Number of games won in 5th set by match winner
* L5 = Number of games won in 5th set by match loser
* Wsets = Number of sets won by match winner
* Lsets = Number of sets won by match loser
* Comment = Comment on the match (Completed, won through retirement of loser, or via Walkover)


Key to match betting odds data:

* B365W = Bet365 odds of match winner
* B365L = Bet365 odds of match loser
* B&WW = Bet&Win odds of match winner
* B&WL = Bet&Win odds of match loser
* CBW = Centrebet odds of match winner
* CBL = Centrebet odds of match loser
* EXW = Expekt odds of match winner
* EXL = Expekt odds of match loser
* LBW = Ladbrokes odds of match winner
* LBL = Ladbrokes odds of match loser
* GBW = Gamebookers odds of match winner
* GBL = Gamebookers odds of match loser
* IWW = Interwetten odds of match winner
* IWL = Interwetten odds of match loser
* PSW = Pinnacle Sports odds of match winner
* PSL = Pinnacle Sports odds of match loser
* SBW = Sportingbet odds of match winner
* SBL = Sportingbet odds of match loser
* SJW = Stan James odds of match winner
* SJL = Stan James odds of match loser
* UBW = Unibet odds of match winner
* UBL = Unibet odds of match loser

* MaxW = Maximum odds of match winner (as shown by Oddsportal.com)
* MaxL = Maximum odds of match loser (as shown by Oddsportal.com)
* AvgW = Average odds of match winner (as shown by Oddsportal.com)
* AvgL = Average odds of match loser (as shown by Oddsportal.com)

## Adding Features

When we train this model, the data will be shuffled, so we'll lose any notion of time. We can add a few easily computable features to capture some of this information. For the moment we can add:
* Streak - length of the current winning streak
* Preferred surface - boolean, true if the current surface is the one with the most wins (players play ~80 matches/year, so maybe look at the last 30 matches)
* Preferred court - same as preferred surface, but for court type
* Historical winner - if the players have played each other before, who won?
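
The streak feature in the list above is not computed by the cells that follow. A minimal sketch of one way to derive it, assuming each player's results are stored most-recent-first as `1` (win) and `-1` (loss), which is the encoding the `form` list uses later in this notebook. The helper name `current_win_streak` is hypothetical:

```python
def current_win_streak(results):
    """Count consecutive wins at the front of a most-recent-first results list."""
    streak = 0
    for r in results:
        if r != 1:      # the first non-win ends the streak
            break
        streak += 1
    return streak

print(current_win_streak([1, 1, 1, -1, 1]))  # 3
print(current_win_streak([-1, 1, 1]))        # 0
print(current_win_streak([]))                # 0
```

As with the other features, the value would need to be read before the current match's result is recorded, otherwise the outcome leaks into the feature.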

### Data Structures to do this
We'll need some additional data structures to do this:
* A dict for tracking each player's results against previous opponents
> player -> dict = {'player_name':boolean,...}
* A dict for tracking each player's form and court/surface preferences over time
Hardcourt/carpet weighting/Greenset, Clay and Grass
> player -> dict = {'winning streak':int,'losing streak':int,'court wins':int[],'surface wins':string[],'surface losses':string[]}
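
The two dictionaries described above could also be built lazily with bounded histories. This is only a sketch, assuming the standard-library `collections.defaultdict` and `collections.deque`; the names `new_player_state`, `player_state` and `head_to_head`, the placeholder player names and the window sizes are all hypothetical:

```python
from collections import defaultdict, deque

FORM_WINDOW = 15     # illustrative window sizes, not tuned values
SURFACE_WINDOW = 3

def new_player_state():
    """Fresh per-player history; each deque discards its oldest entry when full."""
    return {
        'form': deque(maxlen=FORM_WINDOW),
        'court_wins': deque(maxlen=SURFACE_WINDOW),
        'surface_wins': deque(maxlen=SURFACE_WINDOW),
        'surface_losses': deque(maxlen=SURFACE_WINDOW),
    }

player_state = defaultdict(new_player_state)  # player -> history dict
head_to_head = defaultdict(dict)              # player -> {opponent: won last meeting?}

# Record one match: 'Player A' beats 'Player B'
player_state['Player A']['form'].appendleft(1)
player_state['Player B']['form'].appendleft(-1)
head_to_head['Player A']['Player B'] = True
head_to_head['Player B']['Player A'] = False

# The form window never grows past FORM_WINDOW entries
for _ in range(20):
    player_state['Player A']['form'].appendleft(1)
print(len(player_state['Player A']['form']))  # 15
```

With `maxlen` set, the manual length check and `pop()` before each insert are no longer needed, and players do not have to be enumerated up front.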

In [3]:
complete_player_list = set(list(mens_df['Winner'].values)+list(mens_df['Loser'].values))

In [4]:
p_observables = {}
historical_wins = {}

for p in complete_player_list:
    p_observables[p] = {'form':[],'court_wins':[],'surface_wins':[],'surface_losses':[]}
    historical_wins[p] = {}

In [5]:
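n_wins_in_feature_lists = 2  # threshold used when trimming the recent court/surface win and loss lists
n_games_for_form_calc = 15   # number of most recent matches averaged into 'form'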

In [6]:
def match_to_new_features(x):
    """Return [form, preferred_court, preferred_surface] for the winner, then for the loser.

    Features use only each player's history before this match; p_observables
    is then updated with the result, so rows must be processed in chronological order.
    """
    data_to_return = []

    for i, p in enumerate([x.Winner, x.Loser]):
        obs = p_observables[p]

        # Form: mean of recent results (+1 win, -1 loss), or 0 with no history
        form = np.mean(obs['form']) if len(obs['form']) > 0 else 0
        # Drop the oldest result so at most n_games_for_form_calc remain after the insert below
        if len(obs['form']) >= n_games_for_form_calc:
            obs['form'].pop()

        # Court type with the most recent wins (defaults to 'Indoor' on a tie or with no history)
        desired_court = 'Outdoor' if sum(obs['court_wins']) > 0 else 'Indoor'
        preferred_court = 1 if x.Court == desired_court else 0

        # max() raises ValueError on an empty sequence, i.e. when the player has no history yet
        try:
            desired_surface = max(set(obs['surface_wins']), key=obs['surface_wins'].count)
        except ValueError:
            desired_surface = 'None'
        try:
            undesired_surface = max(set(obs['surface_losses']), key=obs['surface_losses'].count)
        except ValueError:
            undesired_surface = 'None'

        # +1 on the most-won surface, -1 on the most-lost surface, 0 otherwise
        preferred_surface = 1 if x.Surface == desired_surface else (-1 if x.Surface == undesired_surface else 0)

        if i == 0:  # first iteration, so p is the winner
            obs['form'].insert(0, 1)

            # Trim before inserting; these lists hold at most n_wins_in_feature_lists + 1 entries
            if len(obs['surface_wins']) > n_wins_in_feature_lists:
                obs['surface_wins'].pop()
            obs['surface_wins'].insert(0, x.Surface)

            if len(obs['court_wins']) > n_wins_in_feature_lists:
                obs['court_wins'].pop()
            obs['court_wins'].insert(0, 1 if x.Court == 'Outdoor' else -1)
        else:  # second iteration, so p is the loser
            obs['form'].insert(0, -1)

            if len(obs['surface_losses']) > n_wins_in_feature_lists:
                obs['surface_losses'].pop()
            obs['surface_losses'].insert(0, x.Surface)

        data_to_return += [form, preferred_court, preferred_surface]

    return data_to_return
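
The `historical_wins` dict is initialised above but the function does not use it, so the "historical winner" feature is still missing. One possible shape for it is sketched below, under the assumption that a signed head-to-head ratio is an acceptable encoding. The names `wins` and `head_to_head_feature` are hypothetical and the players are placeholders:

```python
from collections import defaultdict

# wins[a][b] = number of earlier matches in which a beat b
wins = defaultdict(lambda: defaultdict(int))

def head_to_head_feature(winner, loser):
    """Return a value in [-1, 1] from the winner's point of view, then record the result."""
    w = wins[winner][loser]   # earlier wins of this match's winner over the loser
    l = wins[loser][winner]   # earlier wins of this match's loser over the winner
    total = w + l
    feature = (w - l) / total if total else 0.0  # 0.0 when the players have never met
    wins[winner][loser] += 1  # update only after the feature has been computed
    return feature

matches = [('A', 'B'), ('A', 'B'), ('B', 'A')]  # (winner, loser), in chronological order
features = [head_to_head_feature(w, l) for w, l in matches]
print(features)  # [0.0, 1.0, -1.0]
```

The loser's value is the negative of the winner's, so a single column is enough to carry this information.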

In [22]:
%%time
mens_df[['w_form','w_court_form','w_surface_form','l_form','l_court_form','l_surface_form']] = mens_df.apply(match_to_new_features,axis=1, result_type="expand")
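
Because `match_to_new_features` updates shared state as it goes, the resulting columns depend on the rows being visited in chronological order and on the state being empty when the cell starts. The sketch below shows that pattern on a toy frame. It is not the notebook's pipeline: the column `w_prior_wins` and the helper `add_prior_wins` are hypothetical, and an explicit loop is used so that the order of side effects is easy to follow:

```python
import pandas as pd

toy = pd.DataFrame({
    'Date': pd.to_datetime(['2000-01-05', '2000-01-03', '2000-01-04']),
    'Winner': ['A', 'A', 'B'],
    'Loser': ['B', 'C', 'A'],
})

def add_prior_wins(df):
    """Add each winner's number of earlier wins, visiting matches in date order."""
    prior_wins = {}  # created on every call, so re-running gives the same answer
    counts = {}      # original row index -> feature value

    # A stable sort keeps the file order for matches that share a date
    for row in df.sort_values('Date', kind='stable').itertuples():
        counts[row.Index] = prior_wins.get(row.Winner, 0)
        prior_wins[row.Winner] = counts[row.Index] + 1

    out = df.copy()
    out['w_prior_wins'] = pd.Series(counts)  # aligned back to the original row order
    return out

result = add_prior_wins(toy)
print(result['w_prior_wins'].tolist())  # [1, 0, 0]
```

In the notebook itself, the equivalent precaution would be to re-run the cell that initialises `p_observables` before re-running the `apply` cell.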

In [23]:
mens_df

| Unnamed: 0 | ATP | Location | Tournament | Date | Series | Court | Surface | Round | Best of | Winner | ... | MaxW | MaxL | AvgW | AvgL | w_form | w_court_form | w_surface_form | l_form | l_court_form | l_surface_form |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Adelaide | Australian Hardcourt Championships | 2000-01-03 | International | Outdoor | Hard | 1st Round | 3 | Dosedel S. | ... | | | | | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 |
| 1 | 1 | Adelaide | Australian Hardcourt Championships | 2000-01-03 | International | Outdoor | Hard | 1st Round | 3 | Enqvist T. | ... | | | | | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 |
| 2 | 1 | Adelaide | Australian Hardcourt Championships | 2000-01-03 | International | Outdoor | Hard | 1st Round | 3 | Escude N. | ... | | | | | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 |
| 3 | 1 | Adelaide | Australian Hardcourt Championships | 2000-01-03 | International | Outdoor | Hard | 1st Round | 3 | Federer R. | ... | | | | | 1.000000 | 1.0 | 1.0 | -1.000000 | 0.0 | -1.0 |
| 4 | 1 | Adelaide | Australian Hardcourt Championships | 2000-01-03 | International | Outdoor | Hard | 1st Round | 3 | Fromberg R. | ... | | | | | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53463 | 32 | London | Masters Cup | 2020-11-20 | Masters Cup | Indoor | Hard | Round Robin | 3 | Djokovic N. | ... | 1.35 | 3.92 | 1.31 | 3.52 | 0.600000 | 1.0 | 1.0 | 0.733333 | 1.0 | 1.0 |
| 53464 | 32 | London | Masters Cup | 2020-11-20 | Masters Cup | Indoor | Hard | Round Robin | 3 | Medvedev D. | ... | 1.40 | 4.00 | 1.29 | 3.60 | 0.333333 | 1.0 | 1.0 | 0.333333 | 1.0 | 1.0 |
| 53465 | 32 | London | Masters Cup | 2020-11-21 | Masters Cup | Indoor | Hard | Semifinals | 3 | Thiem D. | ... | 2.70 | 1.66 | 2.47 | 1.56 | 0.600000 | 1.0 | 1.0 | 0.600000 | 1.0 | 1.0 |
| 53466 | 32 | London | Masters Cup | 2020-11-21 | Masters Cup | Indoor | Hard | Semifinals | 3 | Medvedev D. | ... | 1.95 | 2.20 | 1.80 | 2.04 | 0.466667 | 1.0 | 1.0 | 0.600000 | 1.0 | 1.0 |
