# Making predictions for the 2020 Australian open

This notebook provides the code to produce my a priori predictions for all possible match ups in the 2020 Australian Open. My approach leans very heavily on Qile Tan's tutorial for Betfair's 2019 AO Datathon, you can check it out here: https://github.com/betfair-datascientists/aus-open-datathon/blob/master/python-machine-learning-walkthrough.ipynb.

The data used comes from the R package Deuce which is assembled by GitHub user Skoval (https://github.com/skoval/deuce). I believe he obtains a large part of the data from Jeff Sackman (https://github.com/JeffSackmann/tennis_atp). This is publicly available as far as I'm aware.

Disclaimer: I perform this work as an independent researcher, I am in no way sponsored or associated with Betfair (besides my participation in their 2019 AO Datathon). I do not endorse gambling in any way, shape or form. If you choose to gamble based on my predictions, please do so responsibly. I am not responsible for any winnings or losses you may incur.

In [1]:
import numpy as np
import pandas as pd
import os
import eli5
from eli5.sklearn import PermutationImportance
import re
import datetime
import itertools

pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

# Methodology overview

Before we begin training our model, we need to wrangle the date into a format that is more conducive for machine learning. To this end we will break up the work into four main steps:

1. Initial data cleaning
    - Subsetting to relevant matches
    - Parsing scores
    - Partial construction and imputation of features
    - Removing unnecessary colmns
    - Filling missing values
    
2. Conversion to long format and creating new features
    - Creating separate rows for both winner and loser
    - Will construct features such as percentages of games won and breakpoint conversion ratio etc., not all features will be used

3. Calculation of rolling averages:
    - For each player before a given tournament, take a rolling average of their match statistics for their previous x(=21) matches
    - For player rank, take the most recent value
    
4. Creation of target variable and merging of rolling features with match level data
    - Create target which indicates whether player_1 (the winner) wins, column of ones. Duplicate for the opposite case where player_1 is the loser, resulting in the target = 0.
    - Get rolling average of features from previous tournaments for both the winner and the loser
    - We will also take the difference of features between winner and loser to cut the number of features in half

In [2]:
def data_cleaning(df, 
                  tourneys_to_include = ['Grand Slams', 'Masters', '250 or 500', 'Tour Finals', 'Davis Cup'], 
                  start_year=2000 ):
    
    #Renaming columns
    new_cols = [
    'Unnamed: 0', 'tourney_id', 'tourney_name', 'surface', 'draw_size',
       'tourney_level', 'match_num', 'winner_id', 'winner_seed',
       'winner_entry', 'winner_name', 'winner_hand', 'winner_ht', 'winner_ioc',
       'winner_age', 'winner_rank', 'winner_rank_points', 'loser_id',
       'loser_seed', 'loser_entry', 'loser_name', 'loser_hand', 'loser_ht',
       'loser_ioc', 'loser_age', 'loser_rank', 'loser_rank_points', 'score',
       'best_of', 'round', 'minutes', 'winner_ace', 'winner_df', 'winner_svpt', 'winner_1stIn',
       'winner_1stWon', 'winner_2ndWon', 'winner_SvGms', 'winner_bpSaved', 'winner_bpFaced', 'loser_ace',
       'loser_df', 'loser_svpt', 'loser_1stIn', 'loser_1stWon', 'loser_2ndWon', 'loser_SvGms',
       'loser_bpSaved', 'loser_bpFaced', 'W1', 'W2', 'W3', 'W4', 'W5', 'L1', 'L2',
       'L3', 'L4', 'L5', 'retirement', 'WTB1', 'LTB1', 'WTB2', 'LTB2', 'WTB3',
       'LTB3', 'WTB4', 'LTB4', 'WTB5', 'LTB5', 'tourney_start_date', 'year',
       'match_id'
    ]
    
    df.columns = new_cols
    
    #You can change what matches to include. I've chosen to exclude Futures matches and the Challenger tour
    # tourney_levels = 'Grand Slams', '250 or 500', 'Davis Cup', 'Masters', 'Challenger', 'Tour Finals', 'Futures'
    df = df[(df['tourney_level'].isin(tourneys_to_include)) &\
            (df['year'] >= start_year) & (df['surface'] == 'Hard')&\
            (~df['round'].isin(['Q1', 'Q2', 'Q3', 'Q4']))
           ]

    #Converting dates to datetime
    df.loc[:,'tourney_start_date'] = pd.to_datetime(df['tourney_start_date'])
    df.loc[:,'year'] = pd.to_datetime(df['year'])
    
    #Parsing scores
    scores = df.loc[:,'score'].str.split(' ')
    scores = scores.fillna(0)     
    loser_total_games = []
    winner_total_games = []
    
    for index, value in scores.items():
        loser_game_score = 0
        winner_game_score = 0
        try:
            if value == 0 or value == ['W/O']:            
                loser_total_games.append(loser_game_score)
                winner_total_games.append(winner_game_score)

            else:
                loser_game_score = 0
                winner_game_score = 0

                for set_ in value:                
                    try:
                        text = re.match(r"(\d)\-(\d)",set_ )
                        loser_game_score += int(text.group(2))
                        winner_game_score += int(text.group(1))
                    except:
                        pass
                loser_total_games.append(loser_game_score)
                winner_total_games.append(winner_game_score)
        except:
            print(index, value)

    df.loc[:,'winner_total_games'] = winner_total_games
    df.loc[:,'loser_total_games'] = loser_total_games
    df.loc[:,'total_games'] = df['winner_total_games'] + df['loser_total_games']
    df.loc[:,'loser_RtGms'] = df['winner_SvGms']
    df.loc[:,'winner_RtGms'] = df['loser_SvGms']
    df.loc[:,'loser_bp'] = df['winner_bpFaced']
    df.loc[:,'winner_bp'] = df['loser_bpFaced']


    df.loc[:,'loser_bpWon'] = df['winner_bpFaced'] - df['winner_bpSaved'] 
    df.loc[:,'winner_bpWon'] = df['loser_bpFaced'] - df['loser_bpSaved'] 
    
    #Imputing returns data so we can construct features
    df.loc[:,'winner_2ndIn'] = df['winner_svpt'] - df['winner_1stIn'] - df['winner_df']
    df.loc[:,'loser_2ndIn'] = df['loser_svpt'] - df['loser_1stIn'] - df['loser_df']
    df.loc[:,'loser_rtpt'] = df['winner_svpt']
    df.loc[:,'winner_rtpt'] = df['loser_svpt']
    df.loc[:,'winner_rtptWon'] = df['loser_svpt'] -  df['loser_1stWon'] - df['loser_2ndWon']
    df.loc[:,'loser_rtptWon'] = df['winner_svpt'] -  df['winner_1stWon'] - df['winner_2ndWon']
    df.loc[:,'winner_svptWon'] = df['winner_1stWon'] + df['winner_2ndWon']
    df.loc[:,'loser_svptWon'] = df['loser_1stWon'] + df['loser_2ndWon']
    df.loc[:,'winner_total_points'] = df['winner_svptWon'] + df['winner_rtptWon']
    df.loc[:,'loser_total_points'] = df['loser_svptWon'] + df['loser_rtptWon']
    df.loc[:,'total_points'] = df['winner_total_points'] + df['loser_total_points']
    
    #Dropping columns
    cols_to_drop =[
        'draw_size',
        'winner_seed',
        'winner_entry',
        'loser_seed',
        'loser_entry',
        'score',
        'W1', 'W2', 'W3', 'W4', 'W5', 'L1', 'L2',
        'L3', 'L4', 'L5', 'WTB1', 'LTB1', 'WTB2', 'LTB2', 'WTB3',
        'LTB3', 'WTB4', 'LTB4', 'WTB5', 'LTB5'
        ]
    
    df.drop(cols_to_drop, axis=1, inplace=True)
    
    #Filling nans values
    df.loc[:,'loser_rank'] = df['loser_rank'].fillna(500)
    df.loc[:,'winner_rank'] = df['winner_rank'].fillna(500)
    df = df.fillna(df.mean())
    
    return(df)

def convert_long(df):
    
    #Separating features into winner and loser so we can create rolling averages for each major tournament
    winner_cols = [col for col in df.columns if col.startswith('w')]
    loser_cols = [col for col in df.columns if col.startswith('l')]
    common_cols = [
        'tourney_id', 'tourney_name', 'surface', 'tourney_level',
       'match_num','best_of', 'round',
       'minutes','retirement', 'tourney_start_date', 'year', 'match_id',
        'total_points', 'total_games'
    ]
    
    #Will also add opponent's rank
    df_winner = df[winner_cols + common_cols + ['loser_rank']]
    df_loser = df[loser_cols + common_cols + ['winner_rank']]
    
    df_winner['won'] = 1
    df_loser['won'] = 0
    
    #Renaming columns
    df_winner.columns = [col.replace('winner','player').replace('loser', 'opponent') for col in df_winner.columns]
    df_loser.columns = df_winner.columns
    
    df_long = df_winner.append(df_loser, ignore_index=True)
    
    return(df_long)

def get_new_features(df):
    
    #Creating new features we can play around with, note that not all features may be used
    df.loc[:,'player_serve_win_ratio'] = (df['player_1stWon'] + df['player_2ndWon'])/\
    (df['player_1stIn'] + df['player_2ndIn'] + df['player_df'] )
    
    df.loc[:,'player_return_win_ratio'] = df['player_rtptWon']/df['player_rtpt']
    
    df.loc[:,'player_bp_per_game'] = df['player_bp']/df['player_RtGms']
    
    df.loc[:,'player_bp_conversion_ratio'] = df['player_bpWon']/df['player_bp']
    
    #Setting nans to zero for breakpoint conversion ratio
    df.loc[:,'player_bp_conversion_ratio'].fillna(0, inplace=True)
    
    df.loc[:,'player_game_win_ratio'] = df['player_total_games']/df['total_games']
    
    df.loc[:,'player_point_win_ratio'] = df['player_total_points']/df['total_points']
    
    #df['player_set_Win_Ratio'] = df['Player_Sets_Won']/df['Total_Sets']
    
    df.loc[:,'player_clutch_factor'] = df['player_game_win_ratio'] - df['player_point_win_ratio']
    
    df.loc[:,'player_log_rank'] = np.log(df['player_rank'])
    
    df.loc[:,'player_win_weight'] = df['won'] * np.exp(-df['opponent_rank']/100)

    #Let's try weighting some of the features by the opponent's rank
    
    #df['Player_Set_Win_Ratio_Weighted'] = df['Player_Set_Win_Ratio']*np.exp((df['Player_Rank']-df['Opponent_Rank'])/500)
    df.loc[:,'player_game_win_ratio_weighted'] = df['player_game_win_ratio']*np.exp((df['player_rank']-df['opponent_rank'])/500)
    df.loc[:,'player_point_win_ratio_weighted'] = df['player_point_win_ratio']*np.exp((df['player_rank']-df['opponent_rank'])/500)
    
    return(df)

def get_rolling_features(df, date_df, rolling_cols, last_cols, window):
    
    #This code is basically copied straight from Qile Tan's notebook
    
    df = df.sort_values(['player_name', 'tourney_name', 'tourney_start_date'], ascending=True)
    
    for index, tournament_date in enumerate(date_df.tourney_start_date):
        print(index, tournament_date)
        
        #Subsetting to tournaments at most 1 year before tournament date to reduce computation time
        df_temp = df.loc[(df['tourney_start_date']< tournament_date) & (df['tourney_start_date'] > tournament_date - datetime.timedelta(days=365))]

        #Only taking the most recent value for the feature, if specified in last_cols
        if last_cols != None:
            df_temp_last = df_temp.groupby('player_name')[last_cols].last().reset_index()

        #Taking a rolling average of the x (window_length) most recent matches before specified tournament date,
        #for features specified in rolling_cols
        df_temp = df_temp.groupby('player_name')[rolling_cols].rolling(window,1).mean().reset_index()

        #Only taking the most recent rolling average
        df_temp = df_temp.groupby('player_name').tail(1)

        df_temp = df_temp.merge(df_temp_last, on = 'player_name', how='left')

        #Adding a column telling us what tournament the rolling average is for
        if index == 0:
            df_result = df_temp
            df_result['tournament_date_index'] = tournament_date

        else:
            df_temp['tournament_date_index'] = tournament_date
            df_result = df_result.append(df_temp)
        
    
    df_result.drop('level_1', axis=1, inplace=True)
    
    return(df_result)

def merge_data(df, df_rolling_atp):
    
    df_atp = df.copy()
    #Subsetting match data to Grand Slams and Masters
    df_atp = df_atp.loc[df_atp['tourney_level'].isin(['Grand Slams', 'Masters'])]

    #Removing unnecessary columns from match data
    cols_to_keep = ['winner_name','loser_name','tourney_name','tourney_start_date', 'tourney_level']

    df_atp = df_atp[cols_to_keep]
    df1 = df_atp.copy()
    df1.columns = ['player_1','player_2','tourney_name','tourney_start_date', 'tourney_level']
    df1['player_1_win'] = 1

    df2 = df_atp.copy()
    df2.columns = ['player_2','player_1','tourney_name','tourney_start_date', 'tourney_level']
    df2['player_1_win'] = 0

    df_atp = pd.concat([df1, df2], sort=False)
    df_atp.reset_index(drop=True, inplace=True)
    

    #Joining rolling features for p1 with match data
    df_atp = df_atp.merge(df_rolling_atp, how='left',
                         left_on = ['player_1', 'tourney_start_date'],
                         right_on = ['player_name', 'tournament_date_index'],
                         validate = 'm:1')


    df_atp = df_atp.merge(df_rolling_atp, how='left',
                         left_on = ['player_2', 'tourney_start_date'],
                         right_on = ['player_name', 'tournament_date_index'],
                         validate = 'm:1',
                         suffixes=('_p1', '_p2'))
    
    return(df_atp)

def get_player_difference(df, diff_cols = None):
    
    p1_cols = [i + '_p1' for i in diff_cols] # column names for player 1 stats
    p2_cols = [i + '_p2' for i in diff_cols] # column names for player 2 stats


    # Filling missing values
    df['player_rank_p1'] = df['player_rank_p1'].fillna(500)
    df['player_log_rank_p1'] = df['player_log_rank_p1'].fillna(np.log(500))
    df[p1_cols] = df[p1_cols].fillna(-1)
    
    df['player_rank_p2'] = df['player_rank_p2'].fillna(500)
    df['player_log_rank_p2'] = df['player_log_rank_p2'].fillna(np.log(500))
    df[p2_cols] = df[p2_cols].fillna(-1)

    
    new_column_name = [i + '_diff' for i in diff_cols]

    # Take the difference
    df_p1 = df[p1_cols]
    df_p2 = df[p2_cols]
    
    df_p1.columns=new_column_name
    df_p2.columns=new_column_name
    
    df_diff = df_p1 - df_p2
    df_diff.columns = new_column_name
    
    #Dropping spare columns
    df.drop(p1_cols + p2_cols, axis=1, inplace=True)
    
    # Concat the df_diff and raw_df
    df = pd.concat([df, df_diff], axis=1)
    
    return(df)



In [3]:
deuce_atp = pd.read_csv('Data/atp_matches_deuce.csv')
deuce_atp = data_cleaning(deuce_atp, ['Grand Slams', '250 or 500', 'Davis Cup', 'Masters', 'Challenger', 'Tour Finals'])
deuce_atp_long = convert_long(deuce_atp)

  interactivity=interactivity, compiler=compiler, result=result)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

S

In [4]:
deuce_atp.columns

Index(['Unnamed: 0', 'tourney_id', 'tourney_name', 'surface', 'tourney_level',
       'match_num', 'winner_id', 'winner_name', 'winner_hand', 'winner_ht',
       'winner_ioc', 'winner_age', 'winner_rank', 'winner_rank_points',
       'loser_id', 'loser_name', 'loser_hand', 'loser_ht', 'loser_ioc',
       'loser_age', 'loser_rank', 'loser_rank_points', 'best_of', 'round',
       'minutes', 'winner_ace', 'winner_df', 'winner_svpt', 'winner_1stIn',
       'winner_1stWon', 'winner_2ndWon', 'winner_SvGms', 'winner_bpSaved',
       'winner_bpFaced', 'loser_ace', 'loser_df', 'loser_svpt', 'loser_1stIn',
       'loser_1stWon', 'loser_2ndWon', 'loser_SvGms', 'loser_bpSaved',
       'loser_bpFaced', 'retirement', 'tourney_start_date', 'year', 'match_id',
       'winner_total_games', 'loser_total_games', 'total_games', 'loser_RtGms',
       'winner_RtGms', 'loser_bp', 'winner_bp', 'loser_bpWon', 'winner_bpWon',
       'winner_2ndIn', 'loser_2ndIn', 'loser_rtpt', 'winner_rtpt',
       'winner_rtpt

In [5]:
deuce_atp_long = get_new_features(deuce_atp_long)

In [6]:
# These are the tournaments we want to get the rolling average of features for, they will then be used for training
roll_dates = deuce_atp.loc[deuce_atp['tourney_level'].isin(['Grand Slams'])].groupby(['tourney_name', 'tourney_start_date'])\
.size().reset_index()[['tourney_name', 'tourney_start_date']]

# We also want to aggregate matches just before the 2020 AO
roll_dates.loc[-1] = ['Australian Open', pd.to_datetime('2020-01-20')]

last_cols = ['player_rank', 'player_log_rank']
rolling_cols = [
    'player_serve_win_ratio', 'player_return_win_ratio',
    'player_bp_per_game', 'player_bp_conversion_ratio',
    'player_game_win_ratio', 'player_point_win_ratio',
    'player_clutch_factor', 'player_win_weight',
    'player_game_win_ratio_weighted', 'player_point_win_ratio_weighted'
]

rolling_features = get_rolling_features(deuce_atp_long, roll_dates, rolling_cols, last_cols, 21  )

0 2000-01-17 00:00:00
1 2001-01-15 00:00:00
2 2002-01-14 00:00:00
3 2003-01-13 00:00:00
4 2004-01-19 00:00:00
5 2005-01-17 00:00:00
6 2006-01-16 00:00:00
7 2007-01-15 00:00:00
8 2008-01-14 00:00:00
9 2009-01-19 00:00:00
10 2010-01-18 00:00:00
11 2011-01-17 00:00:00
12 2012-01-16 00:00:00
13 2013-01-14 00:00:00
14 2014-01-13 00:00:00
15 2015-01-19 00:00:00
16 2016-01-18 00:00:00
17 2017-01-16 00:00:00
18 2018-01-15 00:00:00
19 2019-01-14 00:00:00
20 2000-08-28 00:00:00
21 2001-08-27 00:00:00
22 2002-08-26 00:00:00
23 2003-08-25 00:00:00
24 2004-08-30 00:00:00
25 2005-08-29 00:00:00
26 2006-08-28 00:00:00
27 2007-08-27 00:00:00
28 2008-08-25 00:00:00
29 2009-08-31 00:00:00
30 2010-08-30 00:00:00
31 2011-08-29 00:00:00
32 2012-08-27 00:00:00
33 2013-08-26 00:00:00
34 2014-08-25 00:00:00
35 2015-08-31 00:00:00
36 2016-08-29 00:00:00
37 2017-08-28 00:00:00
38 2018-08-27 00:00:00
39 2019-08-26 00:00:00
40 2020-01-20 00:00:00


In [7]:
deuce_atp_features = merge_data(deuce_atp, rolling_features)

In [8]:
# This is what the datafrmae should look like with the rolling features for p1 and p2 added
deuce_atp_features

Unnamed: 0,player_1,player_2,tourney_name,tourney_start_date,tourney_level,player_1_win,player_name_p1,player_serve_win_ratio_p1,player_return_win_ratio_p1,player_bp_per_game_p1,player_bp_conversion_ratio_p1,player_game_win_ratio_p1,player_point_win_ratio_p1,player_clutch_factor_p1,player_win_weight_p1,player_game_win_ratio_weighted_p1,player_point_win_ratio_weighted_p1,player_rank_p1,player_log_rank_p1,tournament_date_index_p1,player_name_p2,player_serve_win_ratio_p2,player_return_win_ratio_p2,player_bp_per_game_p2,player_bp_conversion_ratio_p2,player_game_win_ratio_p2,player_point_win_ratio_p2,player_clutch_factor_p2,player_win_weight_p2,player_game_win_ratio_weighted_p2,player_point_win_ratio_weighted_p2,player_rank_p2,player_log_rank_p2,tournament_date_index_p2
0,Andre Agassi,Mariano Puerta,Australian Open,2000-01-17,Grand Slams,1,,,,,,,,,,,,,,NaT,Mariano Puerta,0.603448,0.287879,0.400000,0.000000,0.350000,0.435484,-0.085484,0.0,0.369420,0.459647,101.0,4.615121,2000-01-17
1,Sjeng Schalken,Galo Blanco,Australian Open,2000-01-17,Grand Slams,1,Sjeng Schalken,0.598205,0.439965,0.927790,0.329615,0.472162,0.509979,-0.037817,0.302716,0.437212,0.480205,44.0,3.784190,2000-01-17,Galo Blanco,0.657895,0.166667,0.000000,0.000000,0.454545,0.467742,-0.013196,0.0,0.460033,0.473389,75.0,4.317488,2000-01-17
2,Mariano Zabaleta,Felix Mantilla,Australian Open,2000-01-17,Grand Slams,1,Mariano Zabaleta,0.559322,0.366972,0.466667,0.285714,0.448276,0.466960,-0.018684,0.000000,0.428122,0.445967,32.0,3.465736,2000-01-17,Felix Mantilla,0.490196,0.306150,0.300000,0.000000,0.255639,0.388664,-0.133025,0.0,0.242943,0.366497,25.0,3.218876,2000-01-17
3,Todd Woodbridge,Jan Siemerink,Australian Open,2000-01-17,Grand Slams,1,Todd Woodbridge,0.577586,0.378378,0.705882,0.416667,0.485714,0.480176,0.005538,0.000000,0.616230,0.609204,197.0,5.283204,2000-01-17,Jan Siemerink,0.539683,0.350877,0.333333,0.333333,0.333333,0.450000,-0.116667,0.0,0.358218,0.483595,88.0,4.477337,2000-01-17
4,Andrew Ilie,Jeff Tarango,Australian Open,2000-01-17,Grand Slams,1,Andrew Ilie,0.528821,0.385436,0.507143,0.416667,0.360317,0.452724,-0.092406,0.278423,0.368503,0.463249,47.0,3.850148,2000-01-17,Jeff Tarango,0.640032,0.331765,0.272059,0.533333,0.466063,0.490099,-0.024036,0.0,0.515612,0.542155,55.0,4.007333,2000-01-17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25385,Gael Monfils,Denis Shapovalov,Paris Masters,2019-10-28,Masters,0,,,,,,,,,,,,,,NaT,,,,,,,,,,,,,,NaT
25386,Jo Wilfried Tsonga,Rafael Nadal,Paris Masters,2019-10-28,Masters,0,,,,,,,,,,,,,,NaT,,,,,,,,,,,,,,NaT
25387,Grigor Dimitrov,Novak Djokovic,Paris Masters,2019-10-28,Masters,0,,,,,,,,,,,,,,NaT,,,,,,,,,,,,,,NaT
25388,Rafael Nadal,Denis Shapovalov,Paris Masters,2019-10-28,Masters,0,,,,,,,,,,,,,,NaT,,,,,,,,,,,,,,NaT


In [9]:
diff_cols = [
    'player_rank', 'player_log_rank',
    'player_serve_win_ratio', 'player_return_win_ratio',
    'player_bp_per_game', 'player_bp_conversion_ratio',
    'player_game_win_ratio', 'player_point_win_ratio',
    'player_clutch_factor', 'player_win_weight',
    'player_game_win_ratio_weighted', 'player_point_win_ratio_weighted'
]

deuce_atp_final = get_player_difference(deuce_atp_features, diff_cols)

In [10]:
deuce_atp_final.columns

Index(['player_1', 'player_2', 'tourney_name', 'tourney_start_date',
       'tourney_level', 'player_1_win', 'player_name_p1',
       'tournament_date_index_p1', 'player_name_p2',
       'tournament_date_index_p2', 'player_rank_diff', 'player_log_rank_diff',
       'player_serve_win_ratio_diff', 'player_return_win_ratio_diff',
       'player_bp_per_game_diff', 'player_bp_conversion_ratio_diff',
       'player_game_win_ratio_diff', 'player_point_win_ratio_diff',
       'player_clutch_factor_diff', 'player_win_weight_diff',
       'player_game_win_ratio_weighted_diff',
       'player_point_win_ratio_weighted_diff'],
      dtype='object')

# Model training and validation

Validating your model in the time-series setting is somewhat difficult, I've opted to follow Qile Tan's method and do a simple train-val split. I will train on the Australian and US Opens from between 2000 to 2017, and validate on the Australian and US Opens for 2018 and 2019. 

In future I might consider using the forward chaining technique. I actually wanted to implement it before making my ex-ante predictions, but I ran out of time...

In [11]:
def train_val_split(df_atp_final, ML_cols):
    
    df_atp_train = df_atp_final.loc[(df_atp_final.tourney_start_date<'2018-01-15') &\
                                    (df_atp_final.tourney_level.isin(['Grand Slams'])), ML_cols]
    #df_atp_val = df_atp_final.loc[df_atp_final.tourney_start_date=='2018-01-15', ML_cols]
    df_atp_val = df_atp_final.loc[('2018-01-15' <= df_atp_final.tourney_start_date) &\
                                  (df_atp_final.tourney_start_date < '2019-12-31') &\
                                  (df_atp_final.tourney_level.isin(['Grand Slams'])), ML_cols]

    X_train = df_atp_train.drop('player_1_win', axis =1)
    y_train = df_atp_train['player_1_win']
    X_val = df_atp_val.drop('player_1_win', axis =1)
    y_val = df_atp_val['player_1_win']

    return(X_train, X_val, y_train, y_val)


In [12]:
ML_cols = [
       'player_1_win', 'player_rank_diff', 'player_log_rank_diff',
       'player_serve_win_ratio_diff', 'player_return_win_ratio_diff',
       'player_bp_per_game_diff', 'player_bp_conversion_ratio_diff',
       'player_game_win_ratio_diff', 'player_point_win_ratio_diff',
       'player_clutch_factor_diff', 'player_win_weight_diff',
       'player_game_win_ratio_weighted_diff',
       'player_point_win_ratio_weighted_diff'
]

ML_cols_subset = ['player_log_rank_diff',
 'player_rank_diff',
 'player_serve_win_ratio_diff',
 'player_return_win_ratio_diff',
 'player_game_win_ratio_diff',
 'player_point_win_ratio_weighted_diff',
 'player_1_win']

X_train, X_val, y_train, y_val = train_val_split(deuce_atp_final, ML_cols_subset)

In [13]:
from xgboost import XGBClassifier
#Changing some settings to prevent xgboost from killing the kernal
#see https://stackoverflow.com/questions/51164771/python-xgboost-kernel-died
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

In [14]:
model = XGBClassifier(
    objective = "binary:logistic",
    n_estimators = 300,
    learning_rate = 0.02,
    max_depth = 6
)

eval_set = [(X_val, y_val)]
model.fit(X_train,
          y_train,
         eval_set = eval_set,
         eval_metric="auc",
         early_stopping_rounds = 20)

[0]	validation_0-auc:0.769662
Will train until validation_0-auc hasn't improved in 20 rounds.
[1]	validation_0-auc:0.774186
[2]	validation_0-auc:0.776387
[3]	validation_0-auc:0.774788
[4]	validation_0-auc:0.775064
[5]	validation_0-auc:0.775395
[6]	validation_0-auc:0.776732
[7]	validation_0-auc:0.776776
[8]	validation_0-auc:0.776519
[9]	validation_0-auc:0.776708
[10]	validation_0-auc:0.776862
[11]	validation_0-auc:0.777086
[12]	validation_0-auc:0.776929
[13]	validation_0-auc:0.777156
[14]	validation_0-auc:0.777528
[15]	validation_0-auc:0.777497
[16]	validation_0-auc:0.777199
[17]	validation_0-auc:0.777247
[18]	validation_0-auc:0.777296
[19]	validation_0-auc:0.777228
[20]	validation_0-auc:0.777179
[21]	validation_0-auc:0.776931
[22]	validation_0-auc:0.776976
[23]	validation_0-auc:0.776581
[24]	validation_0-auc:0.77685
[25]	validation_0-auc:0.776689
[26]	validation_0-auc:0.776718
[27]	validation_0-auc:0.776896
[28]	validation_0-auc:0.776986
[29]	validation_0-auc:0.777185
[30]	validation_0

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.02, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=None, n_estimators=300, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [15]:
pd.Series(model.feature_importances_, index=X_train.columns).sort_values(ascending=False)

player_log_rank_diff                    0.617868
player_game_win_ratio_diff              0.109231
player_point_win_ratio_weighted_diff    0.080940
player_serve_win_ratio_diff             0.075152
player_rank_diff                        0.060499
player_return_win_ratio_diff            0.056310
dtype: float32

In [16]:
perm = PermutationImportance(model).fit(X_val, y_val)
eli5.show_weights(perm, feature_names = X_val.columns.tolist())

Weight,Feature
0.1748  ± 0.0233,player_log_rank_diff
0.0148  ± 0.0134,player_game_win_ratio_diff
0.0138  ± 0.0109,player_rank_diff
0.0041  ± 0.0055,player_return_win_ratio_diff
0.0031  ± 0.0049,player_point_win_ratio_weighted_diff
-0.0053  ± 0.0073,player_serve_win_ratio_diff


# Making predictions 

In [17]:
#Creating our dummy submission file based on the AO draw
players = [
    'Rafael Nadal',
    'Hugo Dellien',
    'Federico Delbonis',
    'Joao Sousa',
    'Christopher Eubanks',
    'Peter Gojowczyk',
    'Jozef Kovalik',
    'Pablo Carreno Busta',
    'Nick Kyrgios',
    'Lorenzo Sonego',
    'Pablo Cuevas',
    'Gilles Simon',
    'Yasutaka Uchiyama',
    'Mikael Ymer',
    'Mario Vilella Martinez',
    'Karen Khachanov',
    'Gael Monfils',
    'Yen hsun Lu',
    'Ivo Karlovic',
    'Vasek Pospisil',
    'James Duckworth',
    'Aljaz Bedene',
    'Ernests Gulbis',
    'Felix Auger Aliassime',
    'Taylor Harry Fritz',
    'Tallon Griekspoor',
    'Ilya Ivashka',
    'Kevin Anderson',
    'Alex Bolt',
    'Albert Ramos',
    'Adrian Mannarino',
    'Dominic Thiem',
    'Daniil Medvedev',
    'Francis Tiafoe',
    'Dominik Koepfer',
    'Pedro Martinez',
    'Hugo Gaston',
    'Jaume Munar',
    'Alexei Popyrin',
    'Jo Wilfried Tsonga',
    'John Isner',
    'Thiago Monteiro',
    'Alejandro Tabilo',
    'Daniel Elahi Galan',
    'Miomir Kecmanovic',
    'Andreas Seppi',
    'Damir Dzumhur',
    'Stan Wawrinka',
    'David Goffin',
    'Jeremy Chardy',
    'Pierre Hugues Herbert',
    'Cameron Norrie',
    'Yuichi Sugita',
    "Christopher OConnell",
    'Andrey Rublev',
    'Nikoloz Basilashvili',
    'Soon woo Kwon',
    'Fernando Verdasco',
    'Evgeny Donskoy',
    'Casper Ruud',
    'Egor Gerasimov',
    'Marco Cecchinato',
    'Alexander Zverev',
    'Matteo Berrettini',
    'Andrew Harris',
    'Tennys Sandgren',
    'Marco Trungelliti',
    'Roberto Carballes Baena',
    'Ricardas Berankis',
    'Sam Querrey',
    'Borna Coric',
    'Guido Pella',
    'John Patrick Smith',
    'Mohamed Safwat',
    'Gregoire Barrere',
    'Jordan Thompson',
    'Alexander Bublik',
    'Reilly Opelka',
    'Fabio Fognini',
    'Denis Shapovalov',
    'Marton Fucsovics',
    'Jannik Sinner',
    'Max Purcell',
    'Leonardo Mayer',
    'Tommy Paul',
    'Juan Ignacio Londero',
    'Grigor Dimitrov',
    'Hubert Hurkacz',
    'Dennis Novak',
    'John Millman',
    'Ugo Humbert',
    'Quentin Halys',
    'Filip Krajinovic',
    'Steve Johnson',
    'Roger Federer',
    'Stefanos Tsitsipas',
    'Salvatore Caruso',
    'Philipp Kohlschreiber',
    'Marcos Giron',
    'Christian Garin',
    'Stefano Travaglia',
    'Radu Albot',
    'Milos Raonic',
    'Benoit Paire',
    'Cedrik Marcel Stebe',
    'Marin Cilic',
    'Corentin Moutet',
    'Pablo Andujar',
    'Michael Mmoh',
    'Feliciano Lopez',
    'Roberto Bautista Agut',
    'Diego Sebastian Schwartzman',
    'Lloyd George Muirhead Harris',
    'Alejandro Fokina',
    'Norbert Gombos',
    'Marc Polmans',
    'Mikhail Kukushkin',
    'Kyle Edmund',
    'Dusan Lajovic',
    'Daniel Evans',
    'Mackenzie Mcdonald',
    'Yoshihito Nishioka',
    'Laslo Djere',
    'Tatsuma Ito',
    'Prajnesh Gunneswaran',
    'Jan Lennard Struff',
    'Novak Djokovic'
]

players_df = pd.DataFrame(players)
players_df.to_csv('Data/players_temp.csv')

player_permutations = list(itertools.permutations(players, 2))
dummy_submission_df = pd.DataFrame(player_permutations, columns=['player_1','player_2'])
dummy_submission_df.loc[:,'player_1_win_probability'] = 0.5
dummy_submission_df.to_csv('Data/dummy_submission_2020.csv')

In [18]:
dummy_submission_df

Unnamed: 0,player_1,player_2,player_1_win_probability
0,Rafael Nadal,Hugo Dellien,0.5
1,Rafael Nadal,Federico Delbonis,0.5
2,Rafael Nadal,Joao Sousa,0.5
3,Rafael Nadal,Christopher Eubanks,0.5
4,Rafael Nadal,Peter Gojowczyk,0.5
...,...,...,...
15997,Novak Djokovic,Yoshihito Nishioka,0.5
15998,Novak Djokovic,Laslo Djere,0.5
15999,Novak Djokovic,Tatsuma Ito,0.5
16000,Novak Djokovic,Prajnesh Gunneswaran,0.5


In [19]:
df_predict_atp = pd.read_csv('Data/dummy_submission_2020.csv')
df_predict_atp['player_1'] = df_predict_atp['player_1'].str.lower() 
df_predict_atp['player_2'] = df_predict_atp['player_2'].str.lower()

rolling_features['player_name'] = rolling_features['player_name'].str.lower() 

#Adding tournament date to prediction df
df_predict_atp['tourney_start_date'] = pd.to_datetime('2020-01-20')


df_predict_atp = df_predict_atp.merge(rolling_features, how='left',
                                     left_on = ['player_1', 'tourney_start_date'],
                                     right_on = ['player_name', 'tournament_date_index'],
                                     validate = 'm:1')




In [20]:
#Used to check for players who do not have rolling features to match or players whose names are spelt incorrectly
missing_names = df_predict_atp[df_predict_atp.isnull().any(axis=1)]['player_1'].unique()
missing_names.sort()
missing_names

array(['alejandro fokina', 'daniel elahi galan', 'pedro martinez',
       'yen hsun lu'], dtype=object)

In [21]:
all_players = deuce_atp_final.player_1.unique()
all_players.sort()
all_players

array(['Adam Pavlasek', 'Adrian Mannarino', 'Adrian Menendez Maceiras',
       'Adrian Ungur', 'Adrian Voinea', 'Agustin Calleri',
       'Aisam Ul Haq Qureshi', 'Alan Mackin', 'Albano Olivetti',
       'Albert Costa', 'Albert Montanes', 'Albert Portas', 'Albert Ramos',
       'Alberto Berasategui', 'Alberto Martin', 'Alejandro Falla',
       'Alejandro Gonzalez', 'Aleksandr Nedovyesov',
       'Alessandro Giannessi', 'Alessio Di Mauro', 'Alex Bogdanovic',
       'Alex Bogomolov Jr', 'Alex Bolt', 'Alex Calatrava',
       'Alex Corretja', 'Alex De Minaur', 'Alex Kim', 'Alex Kuznetsov',
       'Alex Lopez Moron', 'Alex Obrien', 'Alexander Bublik',
       'Alexander Kudryavtsev', 'Alexander Peya', 'Alexander Popp',
       'Alexander Sarkissian', 'Alexander Waske', 'Alexander Zverev',
       'Alexandr Dolgopolov', 'Alexandre Simoni', 'Alexei Popyrin',
       'Aljaz Bedene', 'Alun Jones', 'Amer Delic', 'Amir Weintraub',
       'Andre Agassi', 'Andre Sa', 'Andrea Gaudenzi', 'Andrea Stoppini'

In [22]:
df_predict_atp = df_predict_atp.merge(rolling_features, how='left',
                                     left_on = ['player_2', 'tourney_start_date'],
                                     right_on = ['player_name', 'tournament_date_index'],
                                     validate = 'm:1',
                                     suffixes = ('_p1','_p2'))


In [23]:
#Filling in missing values for players with no aggregates prior to 2020 AO
df_predict_atp.loc[:,'player_rank_p1'].fillna(500, inplace=True)
df_predict_atp.loc[:,'player_rank_p2'].fillna(500, inplace=True)
df_predict_atp.loc[:,'player_log_rank_p1'].fillna(np.log(500), inplace=True)
df_predict_atp.loc[:,'player_log_rank_p2'].fillna(np.log(500), inplace=True)
df_predict_atp.loc[:,['player_log_rank_p1','player_log_rank_p2']].fillna(np.log(500), inplace=True)
df_predict_atp.fillna(-1, inplace=True) ##### <- important

#These players with no previous match history are slightly suspect, good to check if we have filled in their values correctly
df_predict_atp[(df_predict_atp.player_1.isin(['alejandro fokina', 'andrew harris', 'christopher oconnell',
       'hugo gaston', 'james duckworth', 'john patrick smith',
       'michael mmoh', 'yen hsun lu']))]

Unnamed: 0.1,Unnamed: 0,player_1,player_2,player_1_win_probability,tourney_start_date,player_name_p1,player_serve_win_ratio_p1,player_return_win_ratio_p1,player_bp_per_game_p1,player_bp_conversion_ratio_p1,player_game_win_ratio_p1,player_point_win_ratio_p1,player_clutch_factor_p1,player_win_weight_p1,player_game_win_ratio_weighted_p1,player_point_win_ratio_weighted_p1,player_rank_p1,player_log_rank_p1,tournament_date_index_p1,player_name_p2,player_serve_win_ratio_p2,player_return_win_ratio_p2,player_bp_per_game_p2,player_bp_conversion_ratio_p2,player_game_win_ratio_p2,player_point_win_ratio_p2,player_clutch_factor_p2,player_win_weight_p2,player_game_win_ratio_weighted_p2,player_point_win_ratio_weighted_p2,player_rank_p2,player_log_rank_p2,tournament_date_index_p2
2142,2142,yen hsun lu,rafael nadal,0.5,2020-01-20,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,500.0,6.214608,-1,rafael nadal,0.701927,0.395480,0.639543,0.442393,0.622003,0.542192,0.070236,0.567970,0.567058,0.492981,2.0,0.693147,2020-01-20 00:00:00
2143,2143,yen hsun lu,hugo dellien,0.5,2020-01-20,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,500.0,6.214608,-1,hugo dellien,0.580818,0.377711,0.623797,0.366715,0.530976,0.487694,0.043283,0.068890,0.372648,0.383517,84.0,4.430817,2020-01-20 00:00:00
2144,2144,yen hsun lu,federico delbonis,0.5,2020-01-20,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,500.0,6.214608,-1,federico delbonis,0.601685,0.358896,0.564886,0.525092,0.471781,0.480563,-0.008782,0.156573,0.484047,0.493144,67.0,4.204693,2020-01-20 00:00:00
2145,2145,yen hsun lu,joao sousa,0.5,2020-01-20,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,500.0,6.214608,-1,joao sousa,0.626402,0.337025,0.441870,0.329192,0.475559,0.484985,-0.009426,0.148395,0.440014,0.455121,43.0,3.761200,2020-01-20 00:00:00
2146,2146,yen hsun lu,christopher eubanks,0.5,2020-01-20,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,500.0,6.214608,-1,christopher eubanks,0.674508,0.337458,0.461262,0.410166,0.506715,0.505611,0.001105,0.042914,0.455516,0.458638,193.0,5.262690,2020-01-20 00:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14359,14359,alejandro fokina,laslo djere,0.5,2020-01-20,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,500.0,6.214608,-1,laslo djere,0.555545,0.333069,0.398504,0.520089,0.390236,0.448281,-0.058045,0.051848,0.378729,0.436881,39.0,3.663562,2020-01-20 00:00:00
14360,14360,alejandro fokina,tatsuma ito,0.5,2020-01-20,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,500.0,6.214608,-1,tatsuma ito,0.651605,0.403785,0.831916,0.371623,0.567402,0.523978,0.043424,0.054782,0.402794,0.381343,141.0,4.948760,2020-01-20 00:00:00
14361,14361,alejandro fokina,prajnesh gunneswaran,0.5,2020-01-20,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,500.0,6.214608,-1,prajnesh gunneswaran,0.670919,0.365199,0.576871,0.383220,0.535219,0.514290,0.017927,0.052260,0.379621,0.379610,89.0,4.488636,2020-01-20 00:00:00
14362,14362,alejandro fokina,jan lennard struff,0.5,2020-01-20,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,500.0,6.214608,-1,jan lennard struff,0.674155,0.337331,0.476576,0.317752,0.506278,0.502858,0.003420,0.287657,0.506298,0.503182,35.0,3.555348,2020-01-20 00:00:00


In [24]:
df_predict_atp = get_player_difference(df_predict_atp, diff_cols=diff_cols)

df_predict_atp

atp_preds = model.predict_proba(df_predict_atp[X_train.columns])
df_predict_atp['player_1_win_probability'] = atp_preds[:,1]

atp_pred_submission = df_predict_atp[['player_1', 'player_2', 'player_1_win_probability']]
atp_pred_submission.to_csv('Data/test_submission.csv')
atp_pred_submission

Unnamed: 0,player_1,player_2,player_1_win_probability
0,rafael nadal,hugo dellien,0.963657
1,rafael nadal,federico delbonis,0.944729
2,rafael nadal,joao sousa,0.913498
3,rafael nadal,christopher eubanks,0.951366
4,rafael nadal,peter gojowczyk,0.950504
...,...,...,...
15997,novak djokovic,yoshihito nishioka,0.950504
15998,novak djokovic,laslo djere,0.947398
15999,novak djokovic,tatsuma ito,0.961233
16000,novak djokovic,prajnesh gunneswaran,0.961183


In [25]:
# Average win rates for the top 30 players
atp_pred_submission.groupby('player_1')['player_1_win_probability'].agg('mean').sort_values(ascending=False).head(10)

player_1
novak djokovic                 0.923752
roger federer                  0.901148
rafael nadal                   0.898214
dominic thiem                  0.835166
daniil medvedev                0.804219
stefanos tsitsipas             0.795601
alexander zverev               0.786144
gael monfils                   0.760937
diego sebastian schwartzman    0.748177
roberto bautista agut          0.745087
Name: player_1_win_probability, dtype: float32

In [26]:
#Average win rates for the bottom 30 players, note that players with no previous data rank close to the very bottom, as to be expected

atp_pred_submission.groupby('player_1')['player_1_win_probability'].agg('mean').sort_values(ascending=False).tail(30)

player_1
tatsuma ito               0.381479
corentin moutet           0.380138
michael mmoh              0.375453
stefano travaglia         0.374110
damir dzumhur             0.371015
cedrik marcel stebe       0.363749
tennys sandgren           0.361557
alexei popyrin            0.356213
alejandro tabilo          0.352996
christopher eubanks       0.352710
john patrick smith        0.345291
marc polmans              0.344128
evgeny donskoy            0.340649
yuichi sugita             0.340443
alex bolt                 0.339419
ilya ivashka              0.332820
max purcell               0.330556
hugo gaston               0.323172
ernests gulbis            0.321412
thiago monteiro           0.317520
quentin halys             0.311640
daniel elahi galan        0.295165
pedro martinez            0.295165
alejandro fokina          0.295165
yen hsun lu               0.295165
mohamed safwat            0.287472
tallon griekspoor         0.266955
marco trungelliti         0.237825
mario vilel