### Logistic Regression
> Most published ML-based models make use of logistic regression. Clarke and Dyte [6] fit a logistic
regression model to the difference in the ATP rating points of the two players for predicting the outcome
of a set. In other words, they used a 1-dimensional feature space x = (rankdiff), and optimised β1 so
that the function σ(β1 · rankdiff) gave the best predictions for the training data. The parameter β0 was
omitted from the model on the basis that a rankdiff of 0 should result in a match-winning probability of
0.5. Instead of predicting the match outcome directly, Clarke and Dyte opted to predict the set-winning
probability and run a simulation to find the match-winning probability, thereby increasing the size of
the dataset. The model was used to predict the result of several men’s tournaments in 1998 and 1999,
producing reasonable results (no precise figures on the accuracy of the prediction are given).
>
> Ma, Liu and Tan [15] used a larger feature space of 16 variables belonging to three categories: player
skills and performance, player characteristics and match characteristics. The model was trained with
matches occurring between 1991 and 2008 and was used to make training recommendations to players
(e.g., “more training in returning skills”).
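
The single-feature model described in the quote can be sketched as follows. This is a minimal illustration on synthetic data, not the data or code used by Clarke and Dyte: the simulated rating differences (expressed here in hundreds of points), the slope `true_beta1` and the sample size are all assumptions made up for the example. Setting `fit_intercept=False` in scikit-learn's `LogisticRegression` omits β0, and a large `C` weakens the default regularisation so the fit stays close to plain maximum likelihood.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic rating differences (player 1 minus player 2), in hundreds of points.
# These values are invented for illustration; they are not real ATP data.
rating_diff = rng.normal(loc=0.0, scale=3.0, size=2000)

# Simulate outcomes from an assumed "true" slope so there is something to recover.
true_beta1 = 0.4
p_win = 1.0 / (1.0 + np.exp(-true_beta1 * rating_diff))
won = (rng.random(2000) < p_win).astype(int)

# fit_intercept=False drops beta_0, so sigma(beta_1 * 0) = 0.5 by construction.
model = LogisticRegression(fit_intercept=False, C=1e6)
model.fit(rating_diff.reshape(-1, 1), won)

beta1 = model.coef_[0, 0]
p_equal = model.predict_proba([[0.0]])[0, 1]  # probability at a difference of 0

print(f"estimated beta1: {beta1:.3f}")
print(f"P(win | difference = 0): {p_equal:.3f}")
```

Because there is no intercept, two equally rated players always receive a winning probability of 0.5, whatever value the fitted slope takes.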

### Using Rank Difference

In [249]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

### Visiting Untouched Dataset

In [250]:
all_ATP_matches = pd.read_csv("./dataset/ATP_matches.csv", engine='python')
all_ATP_matches

Unnamed: 0,ATP,Tournament,Tournament_Int,Date,Series,Series_Int,Court,Court_Int,Surface,Surface_Int,...,Player1,Player1_Int,Player2,Player2_Int,Player1_Rank,Player2_Rank,Player1_Odds,Player2_Odds,Player1_Implied_Prob,Player2_Implied_Prob
0,1,BrisbaneInternational,1.2757,42737,ATP250,2.9693,Outdoor,3.6494,Hard,4.4983,...,YmerE.,6.7633,ThompsonJ.,6.7926,160.0,79.0,3.50,1.29,0.2857,0.7752
1,1,BrisbaneInternational,1.2757,42737,ATP250,2.9693,Outdoor,3.6494,Hard,4.4983,...,MahutN.,6.9297,RobertS.,6.9686,39.0,54.0,1.54,2.43,0.6494,0.4115
2,1,BrisbaneInternational,1.2757,42737,ATP250,2.9693,Outdoor,3.6494,Hard,4.4983,...,TomicB.,6.5792,FerrerD.,6.3881,26.0,21.0,1.77,2.01,0.5650,0.4975
3,1,BrisbaneInternational,1.2757,42737,ATP250,2.9693,Outdoor,3.6494,Hard,4.4983,...,EdmundK.,6.8384,EscobedoE.,6.0929,45.0,141.0,1.37,3.01,0.7299,0.3322
4,1,BrisbaneInternational,1.2757,42737,ATP250,2.9693,Outdoor,3.6494,Hard,4.4983,...,JohnsonS.,6.7032,DimitrovG.,6.5157,33.0,17.0,2.85,1.41,0.3509,0.7092
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14730,65,MastersCup,1.7532,41223,MastersCup,2.7079,Indoor,3.2579,Hard,4.4983,...,FedererR.,6.9997,DelPotroJ.M.,6.0310,2.0,7.0,1.42,2.82,0.7042,0.3546
14731,65,MastersCup,1.7532,41223,MastersCup,2.7079,Indoor,3.2579,Hard,4.4983,...,FerrerD.,6.3881,TipsarevicJ.,6.5461,5.0,9.0,1.20,4.55,0.8333,0.2198
14732,65,MastersCup,1.7532,41224,MastersCup,2.7079,Indoor,3.2579,Hard,4.4983,...,DelPotroJ.M.,6.0310,DjokovicN.,6.9457,7.0,1.0,4.28,1.22,0.2336,0.8197
14733,65,MastersCup,1.7532,41224,MastersCup,2.7079,Indoor,3.2579,Hard,4.4983,...,FedererR.,6.9997,MurrayA.,6.2537,2.0,3.0,2.16,1.68,0.4630,0.5952


In [251]:
all_ATP_matches.columns

Index(['ATP', 'Tournament', 'Tournament_Int', 'Date', 'Series', 'Series_Int',
       'Court', 'Court_Int', 'Surface', 'Surface_Int', 'Round', 'Round_Int',
       'Best_of', 'Winner', 'Winner_Int', 'Player1', 'Player1_Int', 'Player2',
       'Player2_Int', 'Player1_Rank', 'Player2_Rank', 'Player1_Odds',
       'Player2_Odds', 'Player1_Implied_Prob', 'Player2_Implied_Prob'],
      dtype='object')

In [252]:
all_ATP_matches.Series.unique()

array(['ATP250', 'GrandSlam', 'ATP500', 'Masters1000', 'MastersCup',
       'Masters'], dtype=object)

In [253]:
all_ATP_matches.Court.unique()

array(['Outdoor', 'Indoor'], dtype=object)

In [254]:
all_ATP_matches.Surface.unique()

array(['Hard', 'Clay', 'Grass'], dtype=object)

### Add a new column named "Won"
This column indicates whether Player1 won the match: 1 for a win, 0 for a loss.

In [255]:
won_column = np.where(all_ATP_matches.Winner == all_ATP_matches.Player1, 1, 0)
all_ATP_matches["Won"] = won_column

In [256]:
all_ATP_matches["Won"]

0        0
1        1
2        0
3        1
4        0
        ..
14730    0
14731    1
14732    0
14733    1
14734    0
Name: Won, Length: 14735, dtype: int32

We have now created the target values that the logistic regression will predict.

### Add a new column with the difference in rank between the two players

In [257]:
all_ATP_matches.Player1_Rank.dtypes

dtype('float64')

In [258]:
all_ATP_matches.Player2_Rank.dtypes

dtype('float64')

The column data types look fine for computing the difference.

In [259]:
all_ATP_matches['rank_difference'] = all_ATP_matches.Player1_Rank - all_ATP_matches.Player2_Rank
all_ATP_matches.dropna()  # display only: the result is not assigned, so null values remain in all_ATP_matches

Unnamed: 0,ATP,Tournament,Tournament_Int,Date,Series,Series_Int,Court,Court_Int,Surface,Surface_Int,...,Player2,Player2_Int,Player1_Rank,Player2_Rank,Player1_Odds,Player2_Odds,Player1_Implied_Prob,Player2_Implied_Prob,Won,rank_difference
0,1,BrisbaneInternational,1.2757,42737,ATP250,2.9693,Outdoor,3.6494,Hard,4.4983,...,ThompsonJ.,6.7926,160.0,79.0,3.50,1.29,0.2857,0.7752,0,81.0
1,1,BrisbaneInternational,1.2757,42737,ATP250,2.9693,Outdoor,3.6494,Hard,4.4983,...,RobertS.,6.9686,39.0,54.0,1.54,2.43,0.6494,0.4115,1,-15.0
2,1,BrisbaneInternational,1.2757,42737,ATP250,2.9693,Outdoor,3.6494,Hard,4.4983,...,FerrerD.,6.3881,26.0,21.0,1.77,2.01,0.5650,0.4975,0,5.0
3,1,BrisbaneInternational,1.2757,42737,ATP250,2.9693,Outdoor,3.6494,Hard,4.4983,...,EscobedoE.,6.0929,45.0,141.0,1.37,3.01,0.7299,0.3322,1,-96.0
4,1,BrisbaneInternational,1.2757,42737,ATP250,2.9693,Outdoor,3.6494,Hard,4.4983,...,DimitrovG.,6.5157,33.0,17.0,2.85,1.41,0.3509,0.7092,0,16.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14730,65,MastersCup,1.7532,41223,MastersCup,2.7079,Indoor,3.2579,Hard,4.4983,...,DelPotroJ.M.,6.0310,2.0,7.0,1.42,2.82,0.7042,0.3546,0,-5.0
14731,65,MastersCup,1.7532,41223,MastersCup,2.7079,Indoor,3.2579,Hard,4.4983,...,TipsarevicJ.,6.5461,5.0,9.0,1.20,4.55,0.8333,0.2198,1,-4.0
14732,65,MastersCup,1.7532,41224,MastersCup,2.7079,Indoor,3.2579,Hard,4.4983,...,DjokovicN.,6.9457,7.0,1.0,4.28,1.22,0.2336,0.8197,0,6.0
14733,65,MastersCup,1.7532,41224,MastersCup,2.7079,Indoor,3.2579,Hard,4.4983,...,MurrayA.,6.2537,2.0,3.0,2.16,1.68,0.4630,0.5952,1,-1.0


Ready to create the dataset.

### Create the dataset

We separate the Series, Court and Surface values into individual indicator columns to check their impact on match outcomes.

In [260]:
tennis_dataset = pd.get_dummies(all_ATP_matches, columns=['Series', 'Court', 'Surface'])

In [261]:
tennis_dataset.columns

Index(['ATP', 'Tournament', 'Tournament_Int', 'Date', 'Series_Int',
       'Court_Int', 'Surface_Int', 'Round', 'Round_Int', 'Best_of', 'Winner',
       'Winner_Int', 'Player1', 'Player1_Int', 'Player2', 'Player2_Int',
       'Player1_Rank', 'Player2_Rank', 'Player1_Odds', 'Player2_Odds',
       'Player1_Implied_Prob', 'Player2_Implied_Prob', 'Won',
       'rank_difference', 'Series_ATP250', 'Series_ATP500', 'Series_GrandSlam',
       'Series_Masters', 'Series_Masters1000', 'Series_MastersCup',
       'Court_Indoor', 'Court_Outdoor', 'Surface_Clay', 'Surface_Grass',
       'Surface_Hard'],
      dtype='object')

The column list now shows the indicator columns extracted from each of the original categorical columns.
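
To make the effect of `pd.get_dummies` easier to see, here is a small sketch on a toy frame. The four rows are invented for illustration and are not taken from the ATP dataset; `dtype=int` is passed only so that the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy frame with made-up rows; only the column names mirror the notebook.
toy = pd.DataFrame({
    "Surface": ["Hard", "Clay", "Grass", "Hard"],
    "Court": ["Outdoor", "Outdoor", "Outdoor", "Indoor"],
    "rank_difference": [81.0, -15.0, 5.0, -96.0],
})

# Each listed column is replaced by one 0/1 indicator column per category.
encoded = pd.get_dummies(toy, columns=["Surface", "Court"], dtype=int)
encoded_columns = list(encoded.columns)

print(encoded)
```

The untouched column (`rank_difference`) comes first, followed by the indicator columns for each encoded column, and every row has exactly one 1 within each group of indicators.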

In [262]:
tennis_dataset = tennis_dataset.iloc[:, [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]]
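
Selecting columns by position, as in the cell above, depends on the column order staying exactly the same. A possible alternative is to select by name. The sketch below uses a toy frame with made-up values, and the helper names `indicator_prefixes` and `keep` are hypothetical. Note that columns such as `Surface_Int` share the indicator prefix, so they are excluded explicitly.

```python
import pandas as pd

# Toy frame standing in for the one-hot encoded dataset; values are made up.
toy = pd.DataFrame({
    "Player1_Rank": [160.0, 39.0],
    "Surface_Int": [4.4983, 4.4983],
    "Won": [0, 1],
    "rank_difference": [81.0, -15.0],
    "Surface_Clay": [0, 0],
    "Surface_Hard": [1, 1],
})

# Keep the target, the rank difference, and every indicator column by name.
# The "_Int" columns start with the same prefixes, so they are filtered out.
indicator_prefixes = ("Series_", "Court_", "Surface_")
keep = ["Won", "rank_difference"] + [
    c for c in toy.columns
    if c.startswith(indicator_prefixes) and not c.endswith("_Int")
]
selected = toy[keep]

print(list(selected.columns))
```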

In [263]:
tennis_dataset.isnull().values.any()

True

This confirms there are still some null values inside our pandas DataFrame.

In [264]:
tennis_dataset = tennis_dataset.dropna()

In [265]:
tennis_dataset.isnull().values.any()

False

This confirms that the DataFrame no longer contains any null values.

In [266]:
tennis_dataset

Unnamed: 0,Won,rank_difference,Series_ATP250,Series_ATP500,Series_GrandSlam,Series_Masters,Series_Masters1000,Series_MastersCup,Court_Indoor,Court_Outdoor,Surface_Clay,Surface_Grass,Surface_Hard
0,0,81.0,1,0,0,0,0,0,0,1,0,0,1
1,1,-15.0,1,0,0,0,0,0,0,1,0,0,1
2,0,5.0,1,0,0,0,0,0,0,1,0,0,1
3,1,-96.0,1,0,0,0,0,0,0,1,0,0,1
4,0,16.0,1,0,0,0,0,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
14730,0,-5.0,0,0,0,0,0,1,1,0,0,0,1
14731,1,-4.0,0,0,0,0,0,1,1,0,0,0,1
14732,0,6.0,0,0,0,0,0,1,1,0,0,0,1
14733,1,-1.0,0,0,0,0,0,1,1,0,0,0,1


In [267]:
tennis_features = tennis_dataset.iloc[:, 1:]
true_outcomes = tennis_dataset.iloc[:, 0]

In [268]:
tennis_features

Unnamed: 0,rank_difference,Series_ATP250,Series_ATP500,Series_GrandSlam,Series_Masters,Series_Masters1000,Series_MastersCup,Court_Indoor,Court_Outdoor,Surface_Clay,Surface_Grass,Surface_Hard
0,81.0,1,0,0,0,0,0,0,1,0,0,1
1,-15.0,1,0,0,0,0,0,0,1,0,0,1
2,5.0,1,0,0,0,0,0,0,1,0,0,1
3,-96.0,1,0,0,0,0,0,0,1,0,0,1
4,16.0,1,0,0,0,0,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...
14730,-5.0,0,0,0,0,0,1,1,0,0,0,1
14731,-4.0,0,0,0,0,0,1,1,0,0,0,1
14732,6.0,0,0,0,0,0,1,1,0,0,0,1
14733,-1.0,0,0,0,0,0,1,1,0,0,0,1


In [269]:
true_outcomes

0        0
1        1
2        0
3        1
4        0
        ..
14730    0
14731    1
14732    0
14733    1
14734    0
Name: Won, Length: 14697, dtype: int32

### Running the Logistic Regression

In [270]:
clf = LogisticRegression()

In [271]:
clf = clf.fit(tennis_features, true_outcomes)

In [272]:
predicted_outcomes = clf.predict(tennis_features)

In [275]:
accuracy_score(true_outcomes, predicted_outcomes)

0.6709532557664829
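
The accuracy above is computed on the same rows the model was fitted on, so it may be somewhat optimistic. A common check is to hold out part of the data and score the model on matches it has not seen. The following is a minimal sketch on synthetic data: the simulated `rank_diff`, the slope of 0.01, the 75/25 split and the `max_iter` value are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic rank differences (Player1_Rank - Player2_Rank); not real ATP data.
rank_diff = rng.integers(-150, 151, size=3000).astype(float)

# A negative difference means player 1 is better ranked, hence more likely to win.
p_win = 1.0 / (1.0 + np.exp(0.01 * rank_diff))
won = (rng.random(3000) < p_win).astype(int)

X = rank_diff.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, won, test_size=0.25, random_state=0
)

# Fit on the training split only, then score both splits.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

If the two scores are close, the training accuracy is a fair summary; a large gap would suggest the model is fitting noise in the training rows.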

### Conclusion


As we can see, the rank difference (together with the series, court and surface indicators) predicts the results reasonably well, with an accuracy of about 67% on the same data the model was trained on. This makes sense, since higher-ranked players generally have a better chance of winning. However, it is still not the best feature on which to base our predictions, because knowing the rank alone doesn't give us the full picture of a player. A better rank doesn't guarantee victory: upsets happen, otherwise the best player or team in every sport would have an undefeated record.

### Using Player statistics

This logistic regression on the other 