### Logistic Regression
> Most published ML-based models make use of logistic regression. Clarke and Dyte [6] fit a logistic
regression model to the difference in the ATP rating points of the two players for predicting the outcome
of a set. In other words, they used a 1-dimensional feature space x = (rankdiff ), and optimised β1 so
that the function σ(β1 · rankdiff ) gave the best predictions for the training data. The parameter β0 was
omitted from the model on the basis that a rankdiff of 0 should result in a match-winning probability of
0.5. Instead of predicting the match outcome directly, Clark and Dyte opted to predict the set-winning
probability and run a simulation to find the match-winning probability, thereby increasing the size of
the dataset. The model was used to predict the result of several men’s tournaments in 1998 and 1999,
producing reasonable results (no precise figures on the accuracy of the prediction are given).
Ma, Liu and Tan [15] used a larger feature space of 16 variables belonging to three categories: player
skills and performance, player characteristics and match characteristics. The model was trained with
matches occurring between 1991 and 2008 and was used to make training recommendations to players
(e.g., “more training in returning skills”).

### Using Rank Difference

In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

### Visiting Untouched Dataset

In [3]:
all_ATP_matches = pd.read_csv("./dataset/ATP_matches.csv", engine='python')
all_ATP_matches

Unnamed: 0,ATP,Tournament,Tournament_Int,Date,Series,Series_Int,Court,Court_Int,Surface,Surface_Int,...,Player1,Player1_Int,Player2,Player2_Int,Player1_Rank,Player2_Rank,Player1_Odds,Player2_Odds,Player1_Implied_Prob,Player2_Implied_Prob
0,1,BrisbaneInternational,1.2757,42737,ATP250,2.9693,Outdoor,3.6494,Hard,4.4983,...,YmerE.,6.7633,ThompsonJ.,6.7926,160.0,79.0,3.50,1.29,0.2857,0.7752
1,1,BrisbaneInternational,1.2757,42737,ATP250,2.9693,Outdoor,3.6494,Hard,4.4983,...,MahutN.,6.9297,RobertS.,6.9686,39.0,54.0,1.54,2.43,0.6494,0.4115
2,1,BrisbaneInternational,1.2757,42737,ATP250,2.9693,Outdoor,3.6494,Hard,4.4983,...,TomicB.,6.5792,FerrerD.,6.3881,26.0,21.0,1.77,2.01,0.5650,0.4975
3,1,BrisbaneInternational,1.2757,42737,ATP250,2.9693,Outdoor,3.6494,Hard,4.4983,...,EdmundK.,6.8384,EscobedoE.,6.0929,45.0,141.0,1.37,3.01,0.7299,0.3322
4,1,BrisbaneInternational,1.2757,42737,ATP250,2.9693,Outdoor,3.6494,Hard,4.4983,...,JohnsonS.,6.7032,DimitrovG.,6.5157,33.0,17.0,2.85,1.41,0.3509,0.7092
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14730,65,MastersCup,1.7532,41223,MastersCup,2.7079,Indoor,3.2579,Hard,4.4983,...,FedererR.,6.9997,DelPotroJ.M.,6.0310,2.0,7.0,1.42,2.82,0.7042,0.3546
14731,65,MastersCup,1.7532,41223,MastersCup,2.7079,Indoor,3.2579,Hard,4.4983,...,FerrerD.,6.3881,TipsarevicJ.,6.5461,5.0,9.0,1.20,4.55,0.8333,0.2198
14732,65,MastersCup,1.7532,41224,MastersCup,2.7079,Indoor,3.2579,Hard,4.4983,...,DelPotroJ.M.,6.0310,DjokovicN.,6.9457,7.0,1.0,4.28,1.22,0.2336,0.8197
14733,65,MastersCup,1.7532,41224,MastersCup,2.7079,Indoor,3.2579,Hard,4.4983,...,FedererR.,6.9997,MurrayA.,6.2537,2.0,3.0,2.16,1.68,0.4630,0.5952


In [4]:
all_ATP_matches.columns

Index(['ATP', 'Tournament', 'Tournament_Int', 'Date', 'Series', 'Series_Int',
       'Court', 'Court_Int', 'Surface', 'Surface_Int', 'Round', 'Round_Int',
       'Best_of', 'Winner', 'Winner_Int', 'Player1', 'Player1_Int', 'Player2',
       'Player2_Int', 'Player1_Rank', 'Player2_Rank', 'Player1_Odds',
       'Player2_Odds', 'Player1_Implied_Prob', 'Player2_Implied_Prob'],
      dtype='object')

In [5]:
all_ATP_matches.Series.unique()

array(['ATP250', 'GrandSlam', 'ATP500', 'Masters1000', 'MastersCup',
       'Masters'], dtype=object)

In [6]:
all_ATP_matches.Court.unique()

array(['Outdoor', 'Indoor'], dtype=object)

In [7]:
all_ATP_matches.Surface.unique()

array(['Hard', 'Clay', 'Grass'], dtype=object)

### Add a new column named "Won". 
Dictates if player1 won the match, 1 for win, 0 for loss

In [8]:
won_column = np.where(all_ATP_matches.Winner == all_ATP_matches.Player1, 1, 0)
all_ATP_matches["Won"] = won_column

In [9]:
all_ATP_matches["Won"]

0        0
1        1
2        0
3        1
4        0
        ..
14730    0
14731    1
14732    0
14733    1
14734    0
Name: Won, Length: 14735, dtype: int32

We have now created the outputs that will be predicted through a Logistic Regression.

### Add a new columns being the difference in ranks of both player

In [10]:
all_ATP_matches.Player1_Rank.dtypes

dtype('float64')

In [11]:
all_ATP_matches.Player2_Rank.dtypes

dtype('float64')

Columns data types looks good to calculate the difference.

In [12]:
all_ATP_matches['rank_difference'] = all_ATP_matches.Player1_Rank - all_ATP_matches.Player2_Rank
all_ATP_matches.dropna()

Unnamed: 0,ATP,Tournament,Tournament_Int,Date,Series,Series_Int,Court,Court_Int,Surface,Surface_Int,...,Player2,Player2_Int,Player1_Rank,Player2_Rank,Player1_Odds,Player2_Odds,Player1_Implied_Prob,Player2_Implied_Prob,Won,rank_difference
0,1,BrisbaneInternational,1.2757,42737,ATP250,2.9693,Outdoor,3.6494,Hard,4.4983,...,ThompsonJ.,6.7926,160.0,79.0,3.50,1.29,0.2857,0.7752,0,81.0
1,1,BrisbaneInternational,1.2757,42737,ATP250,2.9693,Outdoor,3.6494,Hard,4.4983,...,RobertS.,6.9686,39.0,54.0,1.54,2.43,0.6494,0.4115,1,-15.0
2,1,BrisbaneInternational,1.2757,42737,ATP250,2.9693,Outdoor,3.6494,Hard,4.4983,...,FerrerD.,6.3881,26.0,21.0,1.77,2.01,0.5650,0.4975,0,5.0
3,1,BrisbaneInternational,1.2757,42737,ATP250,2.9693,Outdoor,3.6494,Hard,4.4983,...,EscobedoE.,6.0929,45.0,141.0,1.37,3.01,0.7299,0.3322,1,-96.0
4,1,BrisbaneInternational,1.2757,42737,ATP250,2.9693,Outdoor,3.6494,Hard,4.4983,...,DimitrovG.,6.5157,33.0,17.0,2.85,1.41,0.3509,0.7092,0,16.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14730,65,MastersCup,1.7532,41223,MastersCup,2.7079,Indoor,3.2579,Hard,4.4983,...,DelPotroJ.M.,6.0310,2.0,7.0,1.42,2.82,0.7042,0.3546,0,-5.0
14731,65,MastersCup,1.7532,41223,MastersCup,2.7079,Indoor,3.2579,Hard,4.4983,...,TipsarevicJ.,6.5461,5.0,9.0,1.20,4.55,0.8333,0.2198,1,-4.0
14732,65,MastersCup,1.7532,41224,MastersCup,2.7079,Indoor,3.2579,Hard,4.4983,...,DjokovicN.,6.9457,7.0,1.0,4.28,1.22,0.2336,0.8197,0,6.0
14733,65,MastersCup,1.7532,41224,MastersCup,2.7079,Indoor,3.2579,Hard,4.4983,...,MurrayA.,6.2537,2.0,3.0,2.16,1.68,0.4630,0.5952,1,-1.0


Ready to create the dataset.

### Create the dataset

Separating the Surfaces, Series and Court values in individual columns to check the impact on victories.

In [13]:
tennis_dataset = pd.get_dummies(all_ATP_matches, columns=['Series', 'Court', 'Surface'])

In [14]:
tennis_dataset.columns

Index(['ATP', 'Tournament', 'Tournament_Int', 'Date', 'Series_Int',
       'Court_Int', 'Surface_Int', 'Round', 'Round_Int', 'Best_of', 'Winner',
       'Winner_Int', 'Player1', 'Player1_Int', 'Player2', 'Player2_Int',
       'Player1_Rank', 'Player2_Rank', 'Player1_Odds', 'Player2_Odds',
       'Player1_Implied_Prob', 'Player2_Implied_Prob', 'Won',
       'rank_difference', 'Series_ATP250', 'Series_ATP500', 'Series_GrandSlam',
       'Series_Masters', 'Series_Masters1000', 'Series_MastersCup',
       'Court_Indoor', 'Court_Outdoor', 'Surface_Clay', 'Surface_Grass',
       'Surface_Hard'],
      dtype='object')

As we can see now, we can see all the extracted columns from each of the former columns.

In [15]:
tennis_dataset = tennis_dataset.iloc[:, [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]]

In [16]:
tennis_dataset.isnull().values.any()

True

This confirms there are still some null values inside our panda Dataframe.

In [17]:
tennis_dataset = tennis_dataset.dropna()

In [18]:
tennis_dataset.isnull().values.any()

False

This confirms the entire dataframe don't contain any nulll values.

In [19]:
tennis_dataset

Unnamed: 0,Won,rank_difference,Series_ATP250,Series_ATP500,Series_GrandSlam,Series_Masters,Series_Masters1000,Series_MastersCup,Court_Indoor,Court_Outdoor,Surface_Clay,Surface_Grass,Surface_Hard
0,0,81.0,1,0,0,0,0,0,0,1,0,0,1
1,1,-15.0,1,0,0,0,0,0,0,1,0,0,1
2,0,5.0,1,0,0,0,0,0,0,1,0,0,1
3,1,-96.0,1,0,0,0,0,0,0,1,0,0,1
4,0,16.0,1,0,0,0,0,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
14730,0,-5.0,0,0,0,0,0,1,1,0,0,0,1
14731,1,-4.0,0,0,0,0,0,1,1,0,0,0,1
14732,0,6.0,0,0,0,0,0,1,1,0,0,0,1
14733,1,-1.0,0,0,0,0,0,1,1,0,0,0,1


In [20]:
tennis_features = tennis_dataset.iloc[:, 1:]
true_outcomes = tennis_dataset.iloc[:, 0]

In [21]:
tennis_features

Unnamed: 0,rank_difference,Series_ATP250,Series_ATP500,Series_GrandSlam,Series_Masters,Series_Masters1000,Series_MastersCup,Court_Indoor,Court_Outdoor,Surface_Clay,Surface_Grass,Surface_Hard
0,81.0,1,0,0,0,0,0,0,1,0,0,1
1,-15.0,1,0,0,0,0,0,0,1,0,0,1
2,5.0,1,0,0,0,0,0,0,1,0,0,1
3,-96.0,1,0,0,0,0,0,0,1,0,0,1
4,16.0,1,0,0,0,0,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...
14730,-5.0,0,0,0,0,0,1,1,0,0,0,1
14731,-4.0,0,0,0,0,0,1,1,0,0,0,1
14732,6.0,0,0,0,0,0,1,1,0,0,0,1
14733,-1.0,0,0,0,0,0,1,1,0,0,0,1


In [22]:
true_outcomes

0        0
1        1
2        0
3        1
4        0
        ..
14730    0
14731    1
14732    0
14733    1
14734    0
Name: Won, Length: 14697, dtype: int32

### Running the Logistic Regression

In [23]:
clf = LogisticRegression()

In [24]:
clf = clf.fit(tennis_features, true_outcomes)

In [25]:
predicted_outcomes = clf.predict(tennis_features)

In [26]:
accuracy_score(true_outcomes, predicted_outcomes)

0.6709532557664829

### Conclusion


As we can see, the rank difference does predict the results relatively precisely as it makes sense that higher ranked players have objectively a better chance at winning. However, it's still not be the best feature to base upon our predictions because simply having knowing the rank doesn't give us the full picture of a player. As we know, having the better rank doesn't necessarily guarantee the victory as upsets can happen or else every best player/team in sports would only have undefeated records.

### Logistic Regression using Player statistics

This logistic regression will be using player match statistics such as aces, break points, etc. 

### Visiting Dataset

In [27]:
ATP_Matches = pd.read_csv("./dataset/atp_matches_2000.csv")
ATP_Matches

Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,winner_entry,...,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced
0,2000-717,Orlando,Clay,32,A,20000501,1,102179,,,...,15.0,13.0,4.0,110.0,59.0,49.0,31.0,17.0,4.0,4.0
1,2000-717,Orlando,Clay,32,A,20000501,2,103602,,Q,...,6.0,0.0,0.0,57.0,24.0,13.0,17.0,10.0,4.0,9.0
2,2000-717,Orlando,Clay,32,A,20000501,3,103387,,,...,0.0,2.0,2.0,65.0,39.0,22.0,10.0,8.0,6.0,10.0
3,2000-717,Orlando,Clay,32,A,20000501,4,101733,,,...,12.0,4.0,6.0,104.0,57.0,35.0,24.0,15.0,6.0,11.0
4,2000-717,Orlando,Clay,32,A,20000501,5,101727,4.0,,...,1.0,0.0,3.0,47.0,28.0,17.0,10.0,8.0,3.0,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3359,2000-D082,Davis Cup G2 QF: VEN vs URU,Hard,4,D,20000204,4,103286,,,...,,,,,,,,,,
3360,2000-D083,Davis Cup WG R1: ZIM vs USA,Hard,4,D,20000204,1,101736,,,...,,,,,,,,,,
3361,2000-D083,Davis Cup WG R1: ZIM vs USA,Hard,4,D,20000204,2,101647,,,...,,,,,,,,,,
3362,2000-D083,Davis Cup WG R1: ZIM vs USA,Hard,4,D,20000204,3,101736,,,...,,,,,,,,,,


In [28]:
ATP_Matches.columns

Index(['tourney_id', 'tourney_name', 'surface', 'draw_size', 'tourney_level',
       'tourney_date', 'match_num', 'winner_id', 'winner_seed', 'winner_entry',
       'winner_name', 'winner_hand', 'winner_ht', 'winner_ioc', 'winner_age',
       'winner_rank', 'winner_rank_points', 'loser_id', 'loser_seed',
       'loser_entry', 'loser_name', 'loser_hand', 'loser_ht', 'loser_ioc',
       'loser_age', 'loser_rank', 'loser_rank_points', 'score', 'best_of',
       'round', 'minutes', 'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon',
       'w_2ndWon', 'w_SvGms', 'w_bpSaved', 'w_bpFaced', 'l_ace', 'l_df',
       'l_svpt', 'l_1stIn', 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved',
       'l_bpFaced'],
      dtype='object')

In [29]:
ATP_Matches.isnull().sum()

tourney_id               0
tourney_name             0
surface                  0
draw_size                0
tourney_level            0
tourney_date             0
match_num                0
winner_id                0
winner_seed           2138
winner_entry          2969
winner_name              0
winner_hand              1
winner_ht              102
winner_ioc               0
winner_age               1
winner_rank            164
winner_rank_points     164
loser_id                 0
loser_seed            2691
loser_entry           2702
loser_name               0
loser_hand               0
loser_ht               190
loser_ioc                0
loser_age                0
loser_rank             206
loser_rank_points      206
score                    0
best_of                  0
round                    0
minutes                422
w_ace                  423
w_df                   423
w_svpt                 423
w_1stIn                423
w_1stWon               423
w_2ndWon               423
w

As there are a lot of empty values, select the columns used first.

In [30]:
won_column = np.where(True, 1, 0)
ATP_Matches["Won"] = won_column
ATP_Matches["Won"]

0       1
1       1
2       1
3       1
4       1
       ..
3359    1
3360    1
3361    1
3362    1
3363    1
Name: Won, Length: 3364, dtype: int32

In [31]:
tennis_dataset = pd.get_dummies(ATP_Matches)

In [32]:
tennis_dataset = ATP_Matches.iloc[:, [10, 20, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]]
tennis_dataset = tennis_dataset.dropna()
tennis_dataset

Unnamed: 0,winner_name,loser_name,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,...,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,Won
0,Antony Dupuis,Andrew Ilie,8.0,1.0,126.0,76.0,56.0,29.0,16.0,14.0,...,13.0,4.0,110.0,59.0,49.0,31.0,17.0,4.0,4.0,1
1,Fernando Gonzalez,Cecil Mamiit,4.0,2.0,67.0,35.0,25.0,16.0,10.0,4.0,...,0.0,0.0,57.0,24.0,13.0,17.0,10.0,4.0,9.0,1
2,Paradorn Srichaphan,Sebastien Lareau,4.0,1.0,46.0,29.0,23.0,11.0,8.0,0.0,...,2.0,2.0,65.0,39.0,22.0,10.0,8.0,6.0,10.0,1
3,Jan Siemerink,Justin Gimelstob,8.0,6.0,109.0,56.0,43.0,21.0,15.0,9.0,...,4.0,6.0,104.0,57.0,35.0,24.0,15.0,6.0,11.0,1
4,Jason Stoltenberg,Alex Lopez Moron,3.0,0.0,50.0,27.0,22.0,16.0,9.0,1.0,...,0.0,3.0,47.0,28.0,17.0,10.0,8.0,3.0,6.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3034,Gustavo Kuerten,Yevgeny Kafelnikov,6.0,0.0,48.0,15.0,13.0,22.0,9.0,0.0,...,3.0,1.0,60.0,36.0,22.0,11.0,10.0,2.0,6.0,1
3035,Gustavo Kuerten,Andre Agassi,19.0,1.0,98.0,56.0,46.0,22.0,15.0,7.0,...,7.0,1.0,88.0,60.0,41.0,15.0,15.0,6.0,9.0,1
3036,Andre Agassi,Yevgeny Kafelnikov,6.0,1.0,51.0,31.0,27.0,12.0,9.0,1.0,...,5.0,4.0,64.0,32.0,22.0,12.0,8.0,5.0,8.0,1
3037,Andre Agassi,Magnus Norman,6.0,3.0,62.0,43.0,35.0,7.0,9.0,0.0,...,3.0,2.0,59.0,34.0,23.0,10.0,8.0,1.0,4.0,1


In [33]:
tennis_dataset.loc[tennis_dataset['winner_name'] == 'Antony Dupuis']

Unnamed: 0,winner_name,loser_name,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,...,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,Won
0,Antony Dupuis,Andrew Ilie,8.0,1.0,126.0,76.0,56.0,29.0,16.0,14.0,...,13.0,4.0,110.0,59.0,49.0,31.0,17.0,4.0,4.0,1
512,Antony Dupuis,Martin Rodriguez,8.0,0.0,94.0,58.0,44.0,26.0,16.0,1.0,...,13.0,4.0,102.0,50.0,44.0,21.0,16.0,3.0,6.0,1
1089,Antony Dupuis,Fredrik Jonsson,16.0,0.0,48.0,26.0,22.0,16.0,9.0,1.0,...,7.0,1.0,56.0,39.0,22.0,10.0,9.0,3.0,6.0,1
2261,Antony Dupuis,Magnus Gustafsson,5.0,0.0,102.0,63.0,45.0,22.0,15.0,10.0,...,6.0,7.0,78.0,44.0,34.0,17.0,14.0,4.0,7.0,1


In [34]:
winner = tennis_dataset.iloc[:, [0, 2, 3, 4, 5, 6, 7, 8, 9, 10,]]
winner = winner.groupby('winner_name').mean()
winner = winner.reset_index()
winner

Unnamed: 0,winner_name,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced
0,Adrian Garcia,1.500000,0.500000,51.000000,33.500000,27.000000,10.000000,8.500000,0.000000,0.500000
1,Adrian Voinea,9.888889,3.222222,92.111111,53.777778,39.888889,20.222222,14.222222,4.333333,6.444444
2,Agustin Calleri,10.454545,3.727273,85.272727,40.727273,33.272727,24.636364,13.636364,3.000000,4.818182
3,Albert Costa,4.655172,1.379310,76.551724,47.758621,34.103448,17.413793,12.206897,3.586207,5.206897
4,Albert Portas,5.227273,3.818182,80.272727,44.227273,33.318182,18.636364,12.363636,3.727273,5.772727
...,...,...,...,...,...,...,...,...,...,...
230,Wolfgang Schranz,2.000000,8.000000,92.000000,52.000000,36.000000,19.000000,14.000000,2.000000,6.000000
231,Xavier Malisse,5.857143,2.428571,66.428571,38.857143,27.857143,15.428571,10.571429,3.714286,5.285714
232,Yevgeny Kafelnikov,5.457627,4.033898,82.000000,45.881356,34.423729,19.508475,12.949153,4.491525,6.457627
233,Yong Il Yoon,6.000000,3.000000,92.500000,54.000000,35.000000,16.500000,13.500000,7.000000,11.500000


In [35]:
winner.loc[winner['winner_name'] == 'Antony Dupuis']

Unnamed: 0,winner_name,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced
23,Antony Dupuis,9.25,0.25,92.5,55.75,41.75,23.25,14.0,6.5,7.5


In [36]:
loser = tennis_dataset.iloc[:, [1, 11, 12, 13, 14, 15, 16, 17, 18, 19]]
loser = loser.groupby('loser_name').mean().reset_index()
loser

Unnamed: 0,loser_name,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced
0,Adrian Garcia,1.500000,1.500000,59.500000,39.500000,24.000000,10.000000,9.000000,3.500000,7.000000
1,Adrian Voinea,4.050000,3.600000,64.550000,36.900000,24.900000,11.500000,10.050000,3.300000,6.850000
2,Agustin Calleri,9.200000,3.600000,94.400000,46.400000,35.600000,23.800000,14.400000,4.600000,8.400000
3,Albert Costa,3.217391,1.956522,75.565217,45.478261,29.086957,14.347826,11.521739,4.956522,8.608696
4,Albert Portas,4.347826,4.217391,74.739130,37.913043,25.347826,15.826087,11.434783,5.304348,9.739130
...,...,...,...,...,...,...,...,...,...,...
333,Yevgeny Kafelnikov,4.515152,4.606061,80.272727,43.757576,28.878788,15.484848,12.030303,4.757576,9.272727
334,Yong Il Yoon,2.000000,0.000000,76.000000,41.000000,33.000000,18.000000,11.000000,4.000000,5.000000
335,Younes El Aynaoui,6.600000,3.080000,82.360000,51.280000,34.800000,14.320000,12.360000,6.040000,9.440000
336,Yu Zhang,2.000000,1.000000,65.000000,33.000000,19.000000,13.000000,11.000000,4.000000,10.000000


In [37]:
tennis_dataset

Unnamed: 0,winner_name,loser_name,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,...,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,Won
0,Antony Dupuis,Andrew Ilie,8.0,1.0,126.0,76.0,56.0,29.0,16.0,14.0,...,13.0,4.0,110.0,59.0,49.0,31.0,17.0,4.0,4.0,1
1,Fernando Gonzalez,Cecil Mamiit,4.0,2.0,67.0,35.0,25.0,16.0,10.0,4.0,...,0.0,0.0,57.0,24.0,13.0,17.0,10.0,4.0,9.0,1
2,Paradorn Srichaphan,Sebastien Lareau,4.0,1.0,46.0,29.0,23.0,11.0,8.0,0.0,...,2.0,2.0,65.0,39.0,22.0,10.0,8.0,6.0,10.0,1
3,Jan Siemerink,Justin Gimelstob,8.0,6.0,109.0,56.0,43.0,21.0,15.0,9.0,...,4.0,6.0,104.0,57.0,35.0,24.0,15.0,6.0,11.0,1
4,Jason Stoltenberg,Alex Lopez Moron,3.0,0.0,50.0,27.0,22.0,16.0,9.0,1.0,...,0.0,3.0,47.0,28.0,17.0,10.0,8.0,3.0,6.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3034,Gustavo Kuerten,Yevgeny Kafelnikov,6.0,0.0,48.0,15.0,13.0,22.0,9.0,0.0,...,3.0,1.0,60.0,36.0,22.0,11.0,10.0,2.0,6.0,1
3035,Gustavo Kuerten,Andre Agassi,19.0,1.0,98.0,56.0,46.0,22.0,15.0,7.0,...,7.0,1.0,88.0,60.0,41.0,15.0,15.0,6.0,9.0,1
3036,Andre Agassi,Yevgeny Kafelnikov,6.0,1.0,51.0,31.0,27.0,12.0,9.0,1.0,...,5.0,4.0,64.0,32.0,22.0,12.0,8.0,5.0,8.0,1
3037,Andre Agassi,Magnus Norman,6.0,3.0,62.0,43.0,35.0,7.0,9.0,0.0,...,3.0,2.0,59.0,34.0,23.0,10.0,8.0,1.0,4.0,1


In [38]:
for i, row in tennis_dataset.iterrows():
    if(i > 500):
        temp = tennis_dataset.at[i, 'winner_name']
        tennis_dataset.at[i, 'winner_name'] = tennis_dataset.at[i, 'loser_name']
        tennis_dataset.at[i, 'loser_name'] = temp
        temp = tennis_dataset.at[i, 'w_ace']
        tennis_dataset.at[i,'w_ace'] = tennis_dataset.at[i, 'l_ace']
        tennis_dataset.at[i, 'l_ace'] = temp
        temp = tennis_dataset.at[i, 'w_df']
        tennis_dataset.at[i,'w_df'] = tennis_dataset.at[i, 'l_df']       
        tennis_dataset.at[i,'l_df'] = temp
        temp = tennis_dataset.at[i, 'w_svpt']
        tennis_dataset.at[i,'w_svpt'] = tennis_dataset.at[i, 'l_svpt']
        tennis_dataset.at[i, 'l_svpt'] = temp
        temp = tennis_dataset.at[i,'w_1stIn'] 
        tennis_dataset.at[i,'w_1stIn'] = tennis_dataset.at[i, 'l_1stIn']
        tennis_dataset.at[i,'w_1stWon'] = temp
        temp = tennis_dataset.at[i,'w_2ndWon']
        tennis_dataset.at[i,'w_2ndWon'] = tennis_dataset.at[i, 'l_2ndWon']
        tennis_dataset.at[i,'l_2ndWon'] = temp
        temp = tennis_dataset.at[i,'w_SvGms']
        tennis_dataset.at[i,'w_SvGms'] = tennis_dataset.at[i, 'l_SvGms']
        tennis_dataset.at[i,'l_SvGms'] = temp
        temp = tennis_dataset.at[i,'w_bpSaved']
        tennis_dataset.at[i,'w_bpSaved'] = tennis_dataset.at[i, 'l_bpSaved']
        tennis_dataset.at[i,'l_bpSaved'] = temp
        temp = tennis_dataset.at[i,'w_bpFaced']
        tennis_dataset.at[i,'w_bpFaced'] = tennis_dataset.at[i, 'l_bpFaced']
        tennis_dataset.at[i,'l_bpFaced'] = temp
        tennis_dataset.at[i,'Won'] = 0
tennis_dataset

Unnamed: 0,winner_name,loser_name,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,...,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,Won
0,Antony Dupuis,Andrew Ilie,8.0,1.0,126.0,76.0,56.0,29.0,16.0,14.0,...,13.0,4.0,110.0,59.0,49.0,31.0,17.0,4.0,4.0,1
1,Fernando Gonzalez,Cecil Mamiit,4.0,2.0,67.0,35.0,25.0,16.0,10.0,4.0,...,0.0,0.0,57.0,24.0,13.0,17.0,10.0,4.0,9.0,1
2,Paradorn Srichaphan,Sebastien Lareau,4.0,1.0,46.0,29.0,23.0,11.0,8.0,0.0,...,2.0,2.0,65.0,39.0,22.0,10.0,8.0,6.0,10.0,1
3,Jan Siemerink,Justin Gimelstob,8.0,6.0,109.0,56.0,43.0,21.0,15.0,9.0,...,4.0,6.0,104.0,57.0,35.0,24.0,15.0,6.0,11.0,1
4,Jason Stoltenberg,Alex Lopez Moron,3.0,0.0,50.0,27.0,22.0,16.0,9.0,1.0,...,0.0,3.0,47.0,28.0,17.0,10.0,8.0,3.0,6.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3034,Yevgeny Kafelnikov,Gustavo Kuerten,3.0,1.0,60.0,36.0,15.0,11.0,10.0,2.0,...,6.0,0.0,48.0,36.0,22.0,22.0,9.0,0.0,1.0,0
3035,Andre Agassi,Gustavo Kuerten,7.0,1.0,88.0,60.0,56.0,15.0,15.0,6.0,...,19.0,1.0,98.0,60.0,41.0,22.0,15.0,7.0,7.0,0
3036,Yevgeny Kafelnikov,Andre Agassi,5.0,4.0,64.0,32.0,31.0,12.0,8.0,5.0,...,6.0,1.0,51.0,32.0,22.0,12.0,9.0,1.0,1.0,0
3037,Magnus Norman,Andre Agassi,3.0,2.0,59.0,34.0,43.0,10.0,8.0,1.0,...,6.0,3.0,62.0,34.0,23.0,7.0,9.0,0.0,0.0,0


In [39]:
for i, row in tennis_dataset.iterrows():
    for y, row in winner.iterrows():
        if tennis_dataset.at[i,'winner_name'] == winner.at[y,'winner_name']:
            tennis_dataset.at[i,'w_ace'] = winner.at[y, 'w_ace']
            tennis_dataset.at[i,'w_df'] = winner.at[y, 'w_df']
            tennis_dataset.at[i,'w_svpt'] = winner.at[y, 'w_svpt']
            tennis_dataset.at[i,'w_1stIn'] = winner.at[y, 'w_1stIn']
            tennis_dataset.at[i,'w_1stWon'] = winner.at[y, 'w_1stWon']
            tennis_dataset.at[i,'w_2ndWon'] = winner.at[y, 'w_2ndWon']
            tennis_dataset.at[i,'w_SvGms'] = winner.at[y, 'w_SvGms']
            tennis_dataset.at[i,'w_bpSaved'] = winner.at[y, 'w_bpSaved']
            tennis_dataset.at[i,'w_bpFaced'] = winner.at[y, 'w_bpFaced']
    for x, row in loser.iterrows():
        if tennis_dataset.at[i,'winner_name'] == loser.at[x,'loser_name']:
            tennis_dataset.at[i,'w_ace'] = loser.at[x, 'l_ace']
            tennis_dataset.at[i,'w_df'] = loser.at[x, 'l_df']
            tennis_dataset.at[i,'w_svpt'] = loser.at[x, 'l_svpt']
            tennis_dataset.at[i,'w_1stIn'] = loser.at[x, 'l_1stIn']
            tennis_dataset.at[i,'w_1stWon'] = loser.at[x, 'l_1stWon']
            tennis_dataset.at[i,'w_2ndWon'] = loser.at[x, 'l_2ndWon']
            tennis_dataset.at[i,'w_SvGms'] = loser.at[x, 'l_SvGms']
            tennis_dataset.at[i,'w_bpSaved'] = loser.at[x, 'l_bpSaved']
            tennis_dataset.at[i,'w_bpFaced'] = loser.at[x, 'l_bpFaced']
tennis_dataset

Unnamed: 0,winner_name,loser_name,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,...,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,Won
0,Antony Dupuis,Andrew Ilie,6.461538,3.153846,88.153846,50.923077,34.692308,17.538462,12.384615,6.076923,...,13.0,4.0,110.0,59.0,49.0,31.0,17.0,4.0,4.0,1
1,Fernando Gonzalez,Cecil Mamiit,5.600000,10.400000,70.000000,39.000000,28.200000,12.800000,11.200000,2.600000,...,0.0,0.0,57.0,24.0,13.0,17.0,10.0,4.0,9.0,1
2,Paradorn Srichaphan,Sebastien Lareau,5.562500,4.687500,84.250000,45.125000,29.562500,16.625000,12.437500,5.250000,...,2.0,2.0,65.0,39.0,22.0,10.0,8.0,6.0,10.0,1
3,Jan Siemerink,Justin Gimelstob,5.214286,6.285714,80.214286,43.500000,30.714286,16.857143,12.214286,5.214286,...,4.0,6.0,104.0,57.0,35.0,24.0,15.0,6.0,11.0,1
4,Jason Stoltenberg,Alex Lopez Moron,6.285714,4.000000,83.142857,46.642857,33.857143,15.714286,12.928571,5.000000,...,0.0,3.0,47.0,28.0,17.0,10.0,8.0,3.0,6.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3034,Yevgeny Kafelnikov,Gustavo Kuerten,4.515152,4.606061,80.272727,43.757576,28.878788,15.484848,12.030303,4.757576,...,6.0,0.0,48.0,36.0,22.0,22.0,9.0,0.0,1.0,0
3035,Andre Agassi,Gustavo Kuerten,4.000000,3.466667,68.533333,40.800000,27.266667,12.800000,11.000000,4.400000,...,19.0,1.0,98.0,60.0,41.0,22.0,15.0,7.0,7.0,0
3036,Yevgeny Kafelnikov,Andre Agassi,4.515152,4.606061,80.272727,43.757576,28.878788,15.484848,12.030303,4.757576,...,6.0,1.0,51.0,32.0,22.0,12.0,9.0,1.0,1.0,0
3037,Magnus Norman,Andre Agassi,5.958333,3.291667,85.791667,52.833333,34.666667,15.458333,12.875000,5.666667,...,6.0,3.0,62.0,34.0,23.0,7.0,9.0,0.0,0.0,0


In [40]:
for i, row in tennis_dataset.iterrows():
    for y, row in loser.iterrows():
        if tennis_dataset.at[i,'loser_name'] == loser.at[y,'loser_name']:
            tennis_dataset.at[i,'l_ace'] = loser.at[y, 'l_ace']
            tennis_dataset.at[i,'l_df'] = loser.at[y, 'l_df']
            tennis_dataset.at[i,'l_svpt'] = loser.at[y, 'l_svpt']
            tennis_dataset.at[i,'l_1stIn'] = loser.at[y, 'l_1stIn']
            tennis_dataset.at[i,'l_1stWon'] = loser.at[y, 'l_1stWon']
            tennis_dataset.at[i,'l_2ndWon'] = loser.at[y, 'l_2ndWon']
            tennis_dataset.at[i,'l_SvGms'] = loser.at[y, 'l_SvGms']
            tennis_dataset.at[i,'l_bpSaved'] = loser.at[y, 'l_bpSaved']
            tennis_dataset.at[i,'l_bpFaced'] = loser.at[y, 'l_bpFaced']
    for x, row in winner.iterrows():
        if tennis_dataset.at[i,'loser_name'] == winner.at[x,'winner_name']:
            tennis_dataset.at[i,'l_ace'] = winner.at[x, 'w_ace']
            tennis_dataset.at[i,'l_df'] = winner.at[x, 'w_df']
            tennis_dataset.at[i,'l_svpt'] = winner.at[x, 'w_svpt']
            tennis_dataset.at[i,'l_1stIn'] = winner.at[x, 'w_1stIn']
            tennis_dataset.at[i,'l_1stWon'] = winner.at[x, 'w_1stWon']
            tennis_dataset.at[i,'l_2ndWon'] = winner.at[x, 'w_2ndWon']
            tennis_dataset.at[i,'l_SvGms'] = winner.at[x, 'w_SvGms']
            tennis_dataset.at[i,'l_bpSaved'] = winner.at[x, 'w_bpSaved']
            tennis_dataset.at[i,'l_bpFaced'] = winner.at[x, 'w_bpFaced']   

tennis_dataset

Unnamed: 0,winner_name,loser_name,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,...,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,Won
0,Antony Dupuis,Andrew Ilie,6.461538,3.153846,88.153846,50.923077,34.692308,17.538462,12.384615,6.076923,...,6.333333,2.904762,84.190476,43.714286,32.238095,22.714286,12.761905,4.142857,6.000000,1
1,Fernando Gonzalez,Cecil Mamiit,5.600000,10.400000,70.000000,39.000000,28.200000,12.800000,11.200000,2.600000,...,3.666667,2.000000,64.666667,35.000000,27.666667,16.833333,10.666667,3.166667,4.666667,1
2,Paradorn Srichaphan,Sebastien Lareau,5.562500,4.687500,84.250000,45.125000,29.562500,16.625000,12.437500,5.250000,...,9.250000,3.750000,82.000000,48.750000,36.500000,16.000000,12.000000,3.250000,5.500000,1
3,Jan Siemerink,Justin Gimelstob,5.214286,6.285714,80.214286,43.500000,30.714286,16.857143,12.214286,5.214286,...,10.428571,4.928571,82.285714,47.285714,37.071429,18.214286,12.785714,3.500000,5.285714,1
4,Jason Stoltenberg,Alex Lopez Moron,6.285714,4.000000,83.142857,46.642857,33.857143,15.714286,12.928571,5.000000,...,0.000000,2.000000,44.000000,26.000000,21.000000,11.000000,8.000000,0.000000,0.000000,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3034,Yevgeny Kafelnikov,Gustavo Kuerten,4.515152,4.606061,80.272727,43.757576,28.878788,15.484848,12.030303,4.757576,...,9.280702,2.070175,77.456140,41.403509,32.771930,19.877193,12.385965,3.210526,4.771930,0
3035,Andre Agassi,Gustavo Kuerten,4.000000,3.466667,68.533333,40.800000,27.266667,12.800000,11.000000,4.400000,...,9.280702,2.070175,77.456140,41.403509,32.771930,19.877193,12.385965,3.210526,4.771930,0
3036,Yevgeny Kafelnikov,Andre Agassi,4.515152,4.606061,80.272727,43.757576,28.878788,15.484848,12.030303,4.757576,...,5.857143,2.971429,76.800000,47.857143,36.342857,16.942857,12.571429,3.257143,4.428571,0
3037,Magnus Norman,Andre Agassi,5.958333,3.291667,85.791667,52.833333,34.666667,15.458333,12.875000,5.666667,...,5.857143,2.971429,76.800000,47.857143,36.342857,16.942857,12.571429,3.257143,4.428571,0


In [41]:
tennis_features = tennis_dataset.iloc[:, 2:-1]
true_outcomes = tennis_dataset.iloc[:, -1]

In [42]:
tennis_features

Unnamed: 0,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,w_bpFaced,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced
0,6.461538,3.153846,88.153846,50.923077,34.692308,17.538462,12.384615,6.076923,9.307692,6.333333,2.904762,84.190476,43.714286,32.238095,22.714286,12.761905,4.142857,6.000000
1,5.600000,10.400000,70.000000,39.000000,28.200000,12.800000,11.200000,2.600000,6.000000,3.666667,2.000000,64.666667,35.000000,27.666667,16.833333,10.666667,3.166667,4.666667
2,5.562500,4.687500,84.250000,45.125000,29.562500,16.625000,12.437500,5.250000,9.875000,9.250000,3.750000,82.000000,48.750000,36.500000,16.000000,12.000000,3.250000,5.500000
3,5.214286,6.285714,80.214286,43.500000,30.714286,16.857143,12.214286,5.214286,8.785714,10.428571,4.928571,82.285714,47.285714,37.071429,18.214286,12.785714,3.500000,5.285714
4,6.285714,4.000000,83.142857,46.642857,33.857143,15.714286,12.928571,5.000000,8.857143,0.000000,2.000000,44.000000,26.000000,21.000000,11.000000,8.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3034,4.515152,4.606061,80.272727,43.757576,28.878788,15.484848,12.030303,4.757576,9.272727,9.280702,2.070175,77.456140,41.403509,32.771930,19.877193,12.385965,3.210526,4.771930
3035,4.000000,3.466667,68.533333,40.800000,27.266667,12.800000,11.000000,4.400000,7.933333,9.280702,2.070175,77.456140,41.403509,32.771930,19.877193,12.385965,3.210526,4.771930
3036,4.515152,4.606061,80.272727,43.757576,28.878788,15.484848,12.030303,4.757576,9.272727,5.857143,2.971429,76.800000,47.857143,36.342857,16.942857,12.571429,3.257143,4.428571
3037,5.958333,3.291667,85.791667,52.833333,34.666667,15.458333,12.875000,5.666667,9.916667,5.857143,2.971429,76.800000,47.857143,36.342857,16.942857,12.571429,3.257143,4.428571


In [43]:
true_outcomes

0       1
1       1
2       1
3       1
4       1
       ..
3034    0
3035    0
3036    0
3037    0
3038    0
Name: Won, Length: 2941, dtype: int32

In [44]:
clf = LogisticRegression()

In [48]:
clf = clf.fit(tennis_features, true_outcomes)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [49]:
predicted_outcomes = clf.predict(tennis_features)

In [50]:
accuracy_score(true_outcomes, predicted_outcomes)

0.8361101666099966