### Logistic Regression
> Most published ML-based models make use of logistic regression. Clarke and Dyte [6] fit a logistic
regression model to the difference in the ATP rating points of the two players for predicting the outcome
of a set. In other words, they used a 1-dimensional feature space x = (rankdiff ), and optimised β1 so
that the function σ(β1 · rankdiff ) gave the best predictions for the training data. The parameter β0 was
omitted from the model on the basis that a rankdiff of 0 should result in a match-winning probability of
0.5. Instead of predicting the match outcome directly, Clark and Dyte opted to predict the set-winning
probability and run a simulation to find the match-winning probability, thereby increasing the size of
the dataset. The model was used to predict the result of several men’s tournaments in 1998 and 1999,
producing reasonable results (no precise figures on the accuracy of the prediction are given).
Ma, Liu and Tan [15] used a larger feature space of 16 variables belonging to three categories: player
skills and performance, player characteristics and match characteristics. The model was trained with
matches occurring between 1991 and 2008 and was used to make training recommendations to players
(e.g., “more training in returning skills”).

### Using Rank Difference

In [178]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_fscore_support

### Visiting Untouched Dataset

In [179]:
ATP_Matches_2000 = pd.read_csv("./dataset/atp_matches_2000.csv")
ATP_Matches_2001 = pd.read_csv("./dataset/atp_matches_2001.csv")
ATP_Matches_2002 = pd.read_csv("./dataset/atp_matches_2002.csv")
ATP_Matches_2003 = pd.read_csv("./dataset/atp_matches_2003.csv")
ATP_Matches_2004 = pd.read_csv("./dataset/atp_matches_2004.csv")
ATP_Matches_2005 = pd.read_csv("./dataset/atp_matches_2005.csv")
ATP_Matches_2006 = pd.read_csv("./dataset/atp_matches_2006.csv")
ATP_Matches_2007 = pd.read_csv("./dataset/atp_matches_2007.csv")
ATP_Matches_2008 = pd.read_csv("./dataset/atp_matches_2008.csv")
ATP_Matches_2009 = pd.read_csv("./dataset/atp_matches_2009.csv")
ATP_Matches_2010 = pd.read_csv("./dataset/atp_matches_2010.csv")
ATP_Matches_2011 = pd.read_csv("./dataset/atp_matches_2011.csv")
ATP_Matches_2012 = pd.read_csv("./dataset/atp_matches_2012.csv")
ATP_Matches_2013 = pd.read_csv("./dataset/atp_matches_2013.csv")
ATP_Matches_2014 = pd.read_csv("./dataset/atp_matches_2014.csv")
ATP_Matches_2015 = pd.read_csv("./dataset/atp_matches_2015.csv")
ATP_Matches_2016 = pd.read_csv("./dataset/atp_matches_2016.csv")
ATP_Matches_2017 = pd.read_csv("./dataset/atp_matches_2017.csv")

This procedure is done to prepare to merge all the different dataframes from the all the years together.

### Merging all the dataframes

In [192]:
ATP_Matches = pd.concat([ATP_Matches_2000,ATP_Matches_2001, ATP_Matches_2002, ATP_Matches_2003, ATP_Matches_2003, ATP_Matches_2004, ATP_Matches_2005, ATP_Matches_2006, ATP_Matches_2007, ATP_Matches_2008, ATP_Matches_2009, ATP_Matches_2010, ATP_Matches_2011, ATP_Matches_2012, ATP_Matches_2013, ATP_Matches_2014, ATP_Matches_2015, ATP_Matches_2016, ATP_Matches_2017])

In [193]:
ATP_Matches.columns

Index(['tourney_id', 'tourney_name', 'surface', 'draw_size', 'tourney_level',
       'tourney_date', 'match_num', 'winner_id', 'winner_seed', 'winner_entry',
       'winner_name', 'winner_hand', 'winner_ht', 'winner_ioc', 'winner_age',
       'winner_rank', 'winner_rank_points', 'loser_id', 'loser_seed',
       'loser_entry', 'loser_name', 'loser_hand', 'loser_ht', 'loser_ioc',
       'loser_age', 'loser_rank', 'loser_rank_points', 'score', 'best_of',
       'round', 'minutes', 'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon',
       'w_2ndWon', 'w_SvGms', 'w_bpSaved', 'w_bpFaced', 'l_ace', 'l_df',
       'l_svpt', 'l_1stIn', 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved',
       'l_bpFaced'],
      dtype='object')

In [194]:
ATP_Matches.isnull().sum()

tourney_id               63
tourney_name             77
surface                1321
draw_size                63
tourney_level            71
tourney_date            131
match_num               131
winner_id                63
winner_seed           31418
winner_entry          47210
winner_name              99
winner_hand             113
winner_ht              2820
winner_ioc              103
winner_age              116
winner_rank            1287
winner_rank_points     1287
loser_id                103
loser_seed            41611
loser_entry           42997
loser_name              103
loser_hand              118
loser_ht               4863
loser_ioc               103
loser_age               126
loser_rank             2009
loser_rank_points      2009
score                   104
best_of                 103
round                  3392
minutes               10719
w_ace                  9525
w_df                   9525
w_svpt                 9525
w_1stIn                9525
w_1stWon            

### Creating the true outcome column

In [195]:
won = np.where(True, 1, 0)
ATP_Matches["Won"] = won

This will be a bit of a problem as the predicted output column only contains 1 class for now because this dataset already contains the winner/loser column. We will have to tweak the dataset for it to have 2 classes. 

In [196]:
tennis_dataset = pd.get_dummies(ATP_Matches)

### Creating the dataset without null values


In [199]:
tennis_dataset = ATP_Matches.iloc[:, [15, 16, 25, 26, 49]]
tennis_dataset = tennis_dataset.dropna()
tennis_dataset = tennis_dataset.reset_index(drop=True)
tennis_dataset

Unnamed: 0,winner_rank,winner_rank_points,loser_rank,loser_rank_points,Won
0,113.0,351.0,50.0,762.0,1
1,352.0,76.0,139.0,280.0,1
2,103.0,380.0,133.0,293.0,1
3,107.0,371.0,95.0,408.0,1
4,74.0,543.0,111.0,357.0,1
...,...,...,...,...,...
54539,42.0,16.0,20.0,13.0,1
54540,37.0,22.0,9.0,13.0,1
54541,68.0,25.0,17.0,19.0,1
54542,31.0,14.0,14.0,10.0,1


### Add a new column being the difference in ranks of both player

Ready to create the dataset.

In [201]:
tennis_dataset['rank_difference'] = tennis_dataset['winner_rank'] - tennis_dataset['loser_rank']
tennis_dataset['rank_difference_points'] = tennis_dataset['winner_rank_points'] - tennis_dataset['loser_rank_points']
tennis_dataset

Unnamed: 0,winner_rank,winner_rank_points,loser_rank,loser_rank_points,Won,rank_difference,rank_difference_points
0,113.0,351.0,50.0,762.0,1,63.0,-411.0
1,352.0,76.0,139.0,280.0,1,213.0,-204.0
2,103.0,380.0,133.0,293.0,1,-30.0,87.0
3,107.0,371.0,95.0,408.0,1,12.0,-37.0
4,74.0,543.0,111.0,357.0,1,-37.0,186.0
...,...,...,...,...,...,...,...
54539,42.0,16.0,20.0,13.0,1,22.0,3.0
54540,37.0,22.0,9.0,13.0,1,28.0,9.0
54541,68.0,25.0,17.0,19.0,1,51.0,6.0
54542,31.0,14.0,14.0,10.0,1,17.0,4.0


### Preprocessing for two classed in predicted outcomes

In [208]:
for i, row in tennis_dataset.iterrows():
    if(i > (54544/2)):
        tennis_dataset.at[i,'rank_difference'] = -(tennis_dataset.at[i, 'rank_difference'])
        tennis_dataset.at[i,'rank_difference_points'] = -(tennis_dataset.at[i, 'rank_difference_points'])
        tennis_dataset.at[i,'Won'] = 0
tennis_dataset

Unnamed: 0,winner_rank,winner_rank_points,loser_rank,loser_rank_points,Won,rank_difference,rank_difference_points
0,113.0,351.0,50.0,762.0,1,63.0,-411.0
1,352.0,76.0,139.0,280.0,1,213.0,-204.0
2,103.0,380.0,133.0,293.0,1,-30.0,87.0
3,107.0,371.0,95.0,408.0,1,12.0,-37.0
4,74.0,543.0,111.0,357.0,1,-37.0,186.0
...,...,...,...,...,...,...,...
54539,42.0,16.0,20.0,13.0,0,-22.0,-3.0
54540,37.0,22.0,9.0,13.0,0,-28.0,-9.0
54541,68.0,25.0,17.0,19.0,0,-51.0,-6.0
54542,31.0,14.0,14.0,10.0,0,-17.0,-4.0


### Create dataset

In [209]:
tennis_features = tennis_dataset.iloc[:, -2:]
true_outcomes = tennis_dataset.iloc[:, -3]

In [210]:
tennis_features

Unnamed: 0,rank_difference,rank_difference_points
0,63.0,-411.0
1,213.0,-204.0
2,-30.0,87.0
3,12.0,-37.0
4,-37.0,186.0
...,...,...
54539,-22.0,-3.0
54540,-28.0,-9.0
54541,-51.0,-6.0
54542,-17.0,-4.0


In [212]:
true_outcomes.value_counts()

1    27273
0    27271
Name: Won, dtype: int64

### Running the Logistic Regression

In [213]:
clf = LogisticRegression()

In [214]:
clf = clf.fit(tennis_features, true_outcomes)

In [215]:
predicted_outcomes = clf.predict(tennis_features)

In [216]:
accuracy_score(true_outcomes, predicted_outcomes)

0.6209482252860077

In [219]:
precision_recall_fscore_support(true_outcomes, predicted_outcomes)

(array([0.62750329, 0.61503592]),
 array([0.59517436, 0.6467202 ]),
 array([0.61091142, 0.63048024]),
 array([27271, 27273], dtype=int64))

In [None]:
f1_score(true_outcomes, predicted_outcomes)

### Conclusion


As we can see, the rank difference does predict the results relatively precisely as it makes sense that higher ranked players have objectively a better chance at winning. However, it's still not be the best feature to base upon our predictions because simply having knowing the rank doesn't give us the full picture of a player. As we know, having the better rank doesn't necessarily guarantee the victory as upsets can happen or else every best player/team in sports would only have undefeated records.

### Logistic Regression using Player statistics

This logistic regression will be using player match statistics such as aces, break points, etc. We think using player statistics as features to predict the outcome will be more precise as they actually are a direct representation of the player's profile.

### Visiting Dataset

In [20]:
ATP_Matches_2000 = pd.read_csv("./dataset/atp_matches_2000.csv")
ATP_Matches_2001 = pd.read_csv("./dataset/atp_matches_2001.csv")
ATP_Matches_2002 = pd.read_csv("./dataset/atp_matches_2002.csv")
ATP_Matches_2003 = pd.read_csv("./dataset/atp_matches_2003.csv")
ATP_Matches_2004 = pd.read_csv("./dataset/atp_matches_2004.csv")
ATP_Matches_2005 = pd.read_csv("./dataset/atp_matches_2005.csv")
ATP_Matches_2006 = pd.read_csv("./dataset/atp_matches_2006.csv")
ATP_Matches_2007 = pd.read_csv("./dataset/atp_matches_2007.csv")
ATP_Matches_2008 = pd.read_csv("./dataset/atp_matches_2008.csv")
ATP_Matches_2009 = pd.read_csv("./dataset/atp_matches_2009.csv")
ATP_Matches_2010 = pd.read_csv("./dataset/atp_matches_2010.csv")
ATP_Matches_2011 = pd.read_csv("./dataset/atp_matches_2011.csv")
ATP_Matches_2012 = pd.read_csv("./dataset/atp_matches_2012.csv")
ATP_Matches_2013 = pd.read_csv("./dataset/atp_matches_2013.csv")
ATP_Matches_2014 = pd.read_csv("./dataset/atp_matches_2014.csv")
ATP_Matches_2015 = pd.read_csv("./dataset/atp_matches_2015.csv")
ATP_Matches_2016 = pd.read_csv("./dataset/atp_matches_2016.csv")
ATP_Matches_2017 = pd.read_csv("./dataset/atp_matches_2017.csv")

This procedure is done to prepare to merge all the different dataframes from the all the years together.

### Merging all the dataframes

In [21]:
ATP_Matches = pd.concat([ATP_Matches_2000,ATP_Matches_2001, ATP_Matches_2002, ATP_Matches_2003, ATP_Matches_2003, ATP_Matches_2004, ATP_Matches_2005, ATP_Matches_2006, ATP_Matches_2007, ATP_Matches_2008, ATP_Matches_2009, ATP_Matches_2010, ATP_Matches_2011, ATP_Matches_2012, ATP_Matches_2013, ATP_Matches_2014, ATP_Matches_2015, ATP_Matches_2016, ATP_Matches_2017])

In [22]:
ATP_Matches.columns

Index(['tourney_id', 'tourney_name', 'surface', 'draw_size', 'tourney_level',
       'tourney_date', 'match_num', 'winner_id', 'winner_seed', 'winner_entry',
       'winner_name', 'winner_hand', 'winner_ht', 'winner_ioc', 'winner_age',
       'winner_rank', 'winner_rank_points', 'loser_id', 'loser_seed',
       'loser_entry', 'loser_name', 'loser_hand', 'loser_ht', 'loser_ioc',
       'loser_age', 'loser_rank', 'loser_rank_points', 'score', 'best_of',
       'round', 'minutes', 'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 'w_1stWon',
       'w_2ndWon', 'w_SvGms', 'w_bpSaved', 'w_bpFaced', 'l_ace', 'l_df',
       'l_svpt', 'l_1stIn', 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved',
       'l_bpFaced'],
      dtype='object')

In [23]:
ATP_Matches.isnull().sum()

tourney_id               63
tourney_name             77
surface                1321
draw_size                63
tourney_level            71
tourney_date            131
match_num               131
winner_id                63
winner_seed           31418
winner_entry          47210
winner_name              99
winner_hand             113
winner_ht              2820
winner_ioc              103
winner_age              116
winner_rank            1287
winner_rank_points     1287
loser_id                103
loser_seed            41611
loser_entry           42997
loser_name              103
loser_hand              118
loser_ht               4863
loser_ioc               103
loser_age               126
loser_rank             2009
loser_rank_points      2009
score                   104
best_of                 103
round                  3392
minutes               10719
w_ace                  9525
w_df                   9525
w_svpt                 9525
w_1stIn                9525
w_1stWon            

As there are a lot of empty values, select the columns used first.

### Creating the true outcome column

In [24]:
won = np.where(True, 1, 0)
ATP_Matches["Won"] = won

This will be a bit of a problem as the predicted output column only contains 1 class for now because this dataset already contains the winner/loser column. We will have to tweak the dataset for it to have 2 classes. 

In [25]:
tennis_dataset = pd.get_dummies(ATP_Matches)

### Creating the dataset without null values

In [26]:
tennis_dataset = ATP_Matches.iloc[:, [10, 20, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]]
tennis_dataset = tennis_dataset.dropna()
tennis_dataset = tennis_dataset.reset_index(drop=True)
tennis_dataset

Unnamed: 0,winner_name,loser_name,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,...,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,Won
0,Antony Dupuis,Andrew Ilie,8.0,1.0,126.0,76.0,56.0,29.0,16.0,14.0,...,13.0,4.0,110.0,59.0,49.0,31.0,17.0,4.0,4.0,1
1,Fernando Gonzalez,Cecil Mamiit,4.0,2.0,67.0,35.0,25.0,16.0,10.0,4.0,...,0.0,0.0,57.0,24.0,13.0,17.0,10.0,4.0,9.0,1
2,Paradorn Srichaphan,Sebastien Lareau,4.0,1.0,46.0,29.0,23.0,11.0,8.0,0.0,...,2.0,2.0,65.0,39.0,22.0,10.0,8.0,6.0,10.0,1
3,Jan Siemerink,Justin Gimelstob,8.0,6.0,109.0,56.0,43.0,21.0,15.0,9.0,...,4.0,6.0,104.0,57.0,35.0,24.0,15.0,6.0,11.0,1
4,Jason Stoltenberg,Alex Lopez Moron,3.0,0.0,50.0,27.0,22.0,16.0,9.0,1.0,...,0.0,3.0,47.0,28.0,17.0,10.0,8.0,3.0,6.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47255,Rafael Nadal,David Ferrer,1.0,2.0,93.0,59.0,38.0,23.0,16.0,4.0,...,6.0,3.0,106.0,67.0,43.0,17.0,15.0,7.0,12.0,1
47256,Stanislas Wawrinka,Andy Murray,4.0,4.0,80.0,36.0,27.0,26.0,11.0,8.0,...,2.0,1.0,64.0,37.0,28.0,13.0,11.0,2.0,5.0,1
47257,Novak Djokovic,Rafael Nadal,4.0,1.0,46.0,28.0,25.0,11.0,9.0,0.0,...,3.0,0.0,53.0,37.0,20.0,11.0,9.0,2.0,5.0,1
47258,Roger Federer,Stanislas Wawrinka,6.0,3.0,64.0,41.0,31.0,15.0,11.0,1.0,...,6.0,2.0,63.0,39.0,27.0,10.0,10.0,3.0,6.0,1


Even after removing all the null values, we still have more than 40000 rows which is enough for our predicting model.

### Calculating the average for every single player's statistics column

In [27]:
winner = tennis_dataset.iloc[:, [0, 2, 3, 4, 5, 6, 7, 8, 9, 10]]
winner.columns = ['name', 'ace', 'df', 'svpt', '1stIn', '1stWon', '2ndWon', 'SvGms', 'bpSaved', 'bpFaced']
winner

Unnamed: 0,name,ace,df,svpt,1stIn,1stWon,2ndWon,SvGms,bpSaved,bpFaced
0,Antony Dupuis,8.0,1.0,126.0,76.0,56.0,29.0,16.0,14.0,15.0
1,Fernando Gonzalez,4.0,2.0,67.0,35.0,25.0,16.0,10.0,4.0,6.0
2,Paradorn Srichaphan,4.0,1.0,46.0,29.0,23.0,11.0,8.0,0.0,0.0
3,Jan Siemerink,8.0,6.0,109.0,56.0,43.0,21.0,15.0,9.0,12.0
4,Jason Stoltenberg,3.0,0.0,50.0,27.0,22.0,16.0,9.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...
47255,Rafael Nadal,1.0,2.0,93.0,59.0,38.0,23.0,16.0,4.0,7.0
47256,Stanislas Wawrinka,4.0,4.0,80.0,36.0,27.0,26.0,11.0,8.0,10.0
47257,Novak Djokovic,4.0,1.0,46.0,28.0,25.0,11.0,9.0,0.0,0.0
47258,Roger Federer,6.0,3.0,64.0,41.0,31.0,15.0,11.0,1.0,2.0


#### Sample of a player's average statistics

In [28]:
winner.loc[winner['name'] == 'Antony Dupuis']

Unnamed: 0,name,ace,df,svpt,1stIn,1stWon,2ndWon,SvGms,bpSaved,bpFaced
0,Antony Dupuis,8.0,1.0,126.0,76.0,56.0,29.0,16.0,14.0,15.0
507,Antony Dupuis,8.0,0.0,94.0,58.0,44.0,26.0,16.0,1.0,2.0
1078,Antony Dupuis,16.0,0.0,48.0,26.0,22.0,16.0,9.0,1.0,1.0
2176,Antony Dupuis,5.0,0.0,102.0,63.0,45.0,22.0,15.0,10.0,12.0
3844,Antony Dupuis,8.0,3.0,143.0,71.0,43.0,41.0,17.0,13.0,17.0
...,...,...,...,...,...,...,...,...,...,...
22836,Antony Dupuis,7.0,0.0,71.0,42.0,31.0,19.0,13.0,1.0,3.0
23133,Antony Dupuis,6.0,0.0,69.0,52.0,39.0,13.0,12.0,0.0,1.0
23146,Antony Dupuis,8.0,0.0,57.0,46.0,38.0,10.0,11.0,0.0,0.0
23152,Antony Dupuis,11.0,0.0,79.0,59.0,43.0,13.0,11.0,4.0,4.0


In [29]:
loser = tennis_dataset.iloc[:, [1, 11, 12, 13, 14, 15, 16, 17, 18, 19]]
loser.columns = ['name', 'ace', 'df', 'svpt', '1stIn', '1stWon', '2ndWon', 'SvGms', 'bpSaved', 'bpFaced']
loser

Unnamed: 0,name,ace,df,svpt,1stIn,1stWon,2ndWon,SvGms,bpSaved,bpFaced
0,Andrew Ilie,13.0,4.0,110.0,59.0,49.0,31.0,17.0,4.0,4.0
1,Cecil Mamiit,0.0,0.0,57.0,24.0,13.0,17.0,10.0,4.0,9.0
2,Sebastien Lareau,2.0,2.0,65.0,39.0,22.0,10.0,8.0,6.0,10.0
3,Justin Gimelstob,4.0,6.0,104.0,57.0,35.0,24.0,15.0,6.0,11.0
4,Alex Lopez Moron,0.0,3.0,47.0,28.0,17.0,10.0,8.0,3.0,6.0
...,...,...,...,...,...,...,...,...,...,...
47255,David Ferrer,6.0,3.0,106.0,67.0,43.0,17.0,15.0,7.0,12.0
47256,Andy Murray,2.0,1.0,64.0,37.0,28.0,13.0,11.0,2.0,5.0
47257,Rafael Nadal,3.0,0.0,53.0,37.0,20.0,11.0,9.0,2.0,5.0
47258,Stanislas Wawrinka,6.0,2.0,63.0,39.0,27.0,10.0,10.0,3.0,6.0


In [30]:
loser.loc[loser['name'] == 'Ze Zhang']

Unnamed: 0,name,ace,df,svpt,1stIn,1stWon,2ndWon,SvGms,bpSaved,bpFaced
32198,Ze Zhang,11.0,4.0,105.0,80.0,57.0,7.0,15.0,7.0,10.0
36522,Ze Zhang,4.0,5.0,60.0,37.0,24.0,9.0,9.0,4.0,7.0
36604,Ze Zhang,3.0,4.0,53.0,35.0,22.0,6.0,9.0,4.0,8.0
38833,Ze Zhang,1.0,9.0,49.0,28.0,16.0,6.0,7.0,7.0,12.0
38893,Ze Zhang,0.0,3.0,51.0,35.0,23.0,8.0,9.0,4.0,7.0
39352,Ze Zhang,3.0,2.0,39.0,22.0,10.0,5.0,7.0,4.0,9.0
39718,Ze Zhang,4.0,2.0,60.0,44.0,27.0,7.0,10.0,3.0,6.0
41032,Ze Zhang,0.0,7.0,65.0,47.0,29.0,10.0,11.0,3.0,6.0
43042,Ze Zhang,7.0,6.0,111.0,64.0,44.0,25.0,19.0,9.0,14.0
43384,Ze Zhang,3.0,2.0,96.0,67.0,40.0,14.0,15.0,8.0,13.0


### Merge Winner and Loser data

In [31]:
player_stats = winner.append(loser)
player_stats = player_stats.groupby('name').mean()
player_stats = player_stats.reset_index()
player_stats.loc[player_stats['name'] == 'Rafael Nadal']

Unnamed: 0,name,ace,df,svpt,1stIn,1stWon,2ndWon,SvGms,bpSaved,bpFaced
942,Rafael Nadal,2.901834,1.550162,72.434736,49.845739,35.725998,12.85329,11.829558,3.363538,5.105717


These are the features we will be using to predict the outcomes.

### Preprocessing for two classed in predicted outcomes

In [32]:
for i, row in tennis_dataset.iterrows():
    if(i > (47260/2)):
        temp = tennis_dataset.at[i, 'winner_name']
        tennis_dataset.at[i, 'winner_name'] = tennis_dataset.at[i, 'loser_name']
        tennis_dataset.at[i, 'loser_name'] = temp
        temp = tennis_dataset.at[i, 'w_ace']
        tennis_dataset.at[i,'w_ace'] = tennis_dataset.at[i, 'l_ace']
        tennis_dataset.at[i, 'l_ace'] = temp
        temp = tennis_dataset.at[i, 'w_df']
        tennis_dataset.at[i,'w_df'] = tennis_dataset.at[i, 'l_df']       
        tennis_dataset.at[i,'l_df'] = temp
        temp = tennis_dataset.at[i, 'w_svpt']
        tennis_dataset.at[i,'w_svpt'] = tennis_dataset.at[i, 'l_svpt']
        tennis_dataset.at[i, 'l_svpt'] = temp
        temp = tennis_dataset.at[i,'w_1stIn'] 
        tennis_dataset.at[i,'w_1stIn'] = tennis_dataset.at[i, 'l_1stIn']
        tennis_dataset.at[i,'w_1stWon'] = temp
        temp = tennis_dataset.at[i,'w_2ndWon']
        tennis_dataset.at[i,'w_2ndWon'] = tennis_dataset.at[i, 'l_2ndWon']
        tennis_dataset.at[i,'l_2ndWon'] = temp
        temp = tennis_dataset.at[i,'w_SvGms']
        tennis_dataset.at[i,'w_SvGms'] = tennis_dataset.at[i, 'l_SvGms']
        tennis_dataset.at[i,'l_SvGms'] = temp
        temp = tennis_dataset.at[i,'w_bpSaved']
        tennis_dataset.at[i,'w_bpSaved'] = tennis_dataset.at[i, 'l_bpSaved']
        tennis_dataset.at[i,'l_bpSaved'] = temp
        temp = tennis_dataset.at[i,'w_bpFaced']
        tennis_dataset.at[i,'w_bpFaced'] = tennis_dataset.at[i, 'l_bpFaced']
        tennis_dataset.at[i,'l_bpFaced'] = temp
        tennis_dataset.at[i,'Won'] = 0
tennis_dataset

Unnamed: 0,winner_name,loser_name,w_ace,w_df,w_svpt,w_1stIn,w_1stWon,w_2ndWon,w_SvGms,w_bpSaved,...,l_ace,l_df,l_svpt,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,Won
0,Antony Dupuis,Andrew Ilie,8.0,1.0,126.0,76.0,56.0,29.0,16.0,14.0,...,13.0,4.0,110.0,59.0,49.0,31.0,17.0,4.0,4.0,1
1,Fernando Gonzalez,Cecil Mamiit,4.0,2.0,67.0,35.0,25.0,16.0,10.0,4.0,...,0.0,0.0,57.0,24.0,13.0,17.0,10.0,4.0,9.0,1
2,Paradorn Srichaphan,Sebastien Lareau,4.0,1.0,46.0,29.0,23.0,11.0,8.0,0.0,...,2.0,2.0,65.0,39.0,22.0,10.0,8.0,6.0,10.0,1
3,Jan Siemerink,Justin Gimelstob,8.0,6.0,109.0,56.0,43.0,21.0,15.0,9.0,...,4.0,6.0,104.0,57.0,35.0,24.0,15.0,6.0,11.0,1
4,Jason Stoltenberg,Alex Lopez Moron,3.0,0.0,50.0,27.0,22.0,16.0,9.0,1.0,...,0.0,3.0,47.0,28.0,17.0,10.0,8.0,3.0,6.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47255,David Ferrer,Rafael Nadal,6.0,3.0,106.0,67.0,59.0,17.0,15.0,7.0,...,1.0,2.0,93.0,67.0,43.0,23.0,16.0,4.0,7.0,0
47256,Andy Murray,Stanislas Wawrinka,2.0,1.0,64.0,37.0,36.0,13.0,11.0,2.0,...,4.0,4.0,80.0,37.0,28.0,26.0,11.0,8.0,10.0,0
47257,Rafael Nadal,Novak Djokovic,3.0,0.0,53.0,37.0,28.0,11.0,9.0,2.0,...,4.0,1.0,46.0,37.0,20.0,11.0,9.0,0.0,0.0,0
47258,Stanislas Wawrinka,Roger Federer,6.0,2.0,63.0,39.0,41.0,10.0,10.0,3.0,...,6.0,3.0,64.0,39.0,27.0,15.0,11.0,1.0,2.0,0


We are swapping the winner and losers place to assign 0 to the "Won" column. Decided to cut through in the middle to avoid any bias.

In [33]:
tennis_dataset['Won'].value_counts()

1    23631
0    23629
Name: Won, dtype: int64

In [None]:
test1 = False
test2 = False

for i, row1 in tennis_dataset.iterrows():
    for y, row2 in player_stats.iterrows():
        if(tennis_dataset.at[i,'winner_name'] == player_stats.at[y,'name']):
            test1 = True
            tennis_dataset.at[i,'w_ace'] = player_stats.at[y, 'ace']
            tennis_dataset.at[i,'w_df'] = player_stats.at[y, 'df']
            tennis_dataset.at[i,'w_svpt'] = player_stats.at[y, 'svpt']
            tennis_dataset.at[i,'w_1stIn'] = player_stats.at[y, '1stIn']
            tennis_dataset.at[i,'w_1stWon'] = player_stats.at[y, '1stWon']
            tennis_dataset.at[i,'w_2ndWon'] = player_stats.at[y, '2ndWon']
            tennis_dataset.at[i,'w_SvGms'] = player_stats.at[y, 'SvGms']
            tennis_dataset.at[i,'w_bpSaved'] = player_stats.at[y, 'bpSaved']
            tennis_dataset.at[i,'w_bpFaced'] = player_stats.at[y, 'bpFaced']
            
        if tennis_dataset.at[i,'loser_name'] == player_stats.at[y,'name']:
            test2 = True
            tennis_dataset.at[i,'l_ace'] = player_stats.at[y, 'ace']
            tennis_dataset.at[i,'l_df'] = player_stats.at[y, 'df']
            tennis_dataset.at[i,'l_svpt'] = player_stats.at[y, 'svpt']
            tennis_dataset.at[i,'l_1stIn'] = player_stats.at[y, '1stIn']
            tennis_dataset.at[i,'l_1stWon'] = player_stats.at[y, '1stWon']
            tennis_dataset.at[i,'l_2ndWon'] = player_stats.at[y, '2ndWon']
            tennis_dataset.at[i,'l_SvGms'] = player_stats.at[y, 'SvGms']
            tennis_dataset.at[i,'l_bpSaved'] = player_stats.at[y, 'bpSaved']
            tennis_dataset.at[i,'l_bpFaced'] = player_stats.at[y, 'bpFaced']
        
        if(test1 and test2):
            test1 = False
            test2 = False
            break
            
tennis_dataset

In [None]:
tennis_features = tennis_dataset.iloc[:, 2:-1]
true_outcomes = tennis_dataset.iloc[:, -1]

In [None]:
tennis_features

In [None]:
true_outcomes

In [None]:
clf = LogisticRegression(solver='lbfgs', max_iter=10000)

In [None]:
clf = clf.fit(tennis_features, true_outcomes)

In [None]:
predicted_outcomes = clf.predict(tennis_features)

In [None]:
accuracy_score(true_outcomes, predicted_outcomes)

In [None]:
from sklearn.metrics import precision_recall_fscore_support

In [None]:
precision_recall_fscore_support(true_outcomes, predicted_outcomes)

In [None]:
f1_score(true_outcomes, predicted_outcomes)