<img src="https://cdn.freelogovectors.net/wp-content/uploads/2020/06/australian_open_logo.png" style="float: left; margin: 25px; height: 55px">

# Predicting the 2021 Australian Open
_Modelling_

**Data Dictionary**

- `year`: Year the match took place
- `score`: Set scores
- `round`: Round within tournament (e.g. R128, R32, QF, SF, F)
- `minutes`: How many minutes the match lasted
- `p1name`: Player 1 name
- `p1age`: Player 1 age
- `p1ace`: Number of aces scored by player 1
- `p1rank`: Player 1's rank at the time of the match 
- `p2name`: Player 2 name
- `p2age`: Player 2 age
- `p2ace`: Number of aces scored by player 2
- `p2rank`: Player 2's rank at the time of the match 
- `p1_win_pct`: Player 1 win percentage over the past 3 years
- `p1_win_pct_hsf`: Player 1 win percentage on hard surface over the past 3 years
- `p2_win_pct`: Player 2 win percentage over the past 3 years 
- `p2_win_pct_hsf`: Player 2 win percentage on hard surface over the past 3 years 
- `p1won`: Whether player 1 won (1 = yes, 0 = no)
- `p2won`: Whether player 2 won (1 = yes, 0 = no)

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
from datetime import datetime
import seaborn as sns
import numpy as np 

%matplotlib inline

# setting default figure and font sizes
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

In [2]:
pwd

'/Users/kbhunjan/datr-1116/final_project/AO2021/model'

In [3]:
ao = pd.read_csv('/Users/kbhunjan/datr-1116/final_project/AO2021/data/final_ao.csv')
ao.head()

Unnamed: 0.1,Unnamed: 0,year,score,round,minutes,p1name,p1age,p1ace,p1rank,p1_win_pct,p1_win_pct_hsf,p2name,p2age,p2ace,p2rank,p2_win_pct,p2_win_pct_hsf,p1won,p2won
0,0,2014,6-4 7-6(5) 6-7(9) 6-2,R128,228.0,Carlos Berlocq,30.94319,9.0,41.0,0.111111,0.47,Edouard Roger Vasselin,30.12731,6.0,40.0,0.55,0.56,0,1
1,1,2016,7-5 6-3 6-2,R32,104.0,Stephane Robert,35.671458,3.0,225.0,0.25,0.47,Gael Monfils,29.379877,10.0,25.0,0.642857,0.710843,0,1
2,2,2018,6-3 6-2 6-1,R128,98.0,Dennis Novak,24.383299,3.0,226.0,0.394737,0.318182,Grigor Dimitrov,26.669405,7.0,3.0,0.550459,0.571429,0,1
3,3,2013,6-4 6-2 6-4,R128,106.0,Grigor Dimitrov,21.667351,5.0,41.0,0.550459,0.571429,Julien Benneteau,31.069131,6.0,38.0,0.421053,0.454545,0,1
4,4,2015,6-3 7-6(6) 6-1,R32,118.0,Malek Jaziri,30.997947,6.0,75.0,0.405797,0.388889,Nick Kyrgios,19.731691,27.0,53.0,0.642857,0.671875,0,1


In [4]:
ao.year.value_counts()

2019    127
2018    127
2017    127
2016    127
2015    127
2013    127
2011    127
2020    126
2014    126
2012    126
2010    126
Name: year, dtype: int64

In [5]:
ao.drop(['Unnamed: 0'], axis=1, inplace=True)

In [6]:
ao.head()

Unnamed: 0,year,score,round,minutes,p1name,p1age,p1ace,p1rank,p1_win_pct,p1_win_pct_hsf,p2name,p2age,p2ace,p2rank,p2_win_pct,p2_win_pct_hsf,p1won,p2won
0,2014,6-4 7-6(5) 6-7(9) 6-2,R128,228.0,Carlos Berlocq,30.94319,9.0,41.0,0.111111,0.47,Edouard Roger Vasselin,30.12731,6.0,40.0,0.55,0.56,0,1
1,2016,7-5 6-3 6-2,R32,104.0,Stephane Robert,35.671458,3.0,225.0,0.25,0.47,Gael Monfils,29.379877,10.0,25.0,0.642857,0.710843,0,1
2,2018,6-3 6-2 6-1,R128,98.0,Dennis Novak,24.383299,3.0,226.0,0.394737,0.318182,Grigor Dimitrov,26.669405,7.0,3.0,0.550459,0.571429,0,1
3,2013,6-4 6-2 6-4,R128,106.0,Grigor Dimitrov,21.667351,5.0,41.0,0.550459,0.571429,Julien Benneteau,31.069131,6.0,38.0,0.421053,0.454545,0,1
4,2015,6-3 7-6(6) 6-1,R32,118.0,Malek Jaziri,30.997947,6.0,75.0,0.405797,0.388889,Nick Kyrgios,19.731691,27.0,53.0,0.642857,0.671875,0,1


In [7]:
ao.columns

Index(['year', 'score', 'round', 'minutes', 'p1name', 'p1age', 'p1ace',
       'p1rank', 'p1_win_pct', 'p1_win_pct_hsf', 'p2name', 'p2age', 'p2ace',
       'p2rank', 'p2_win_pct', 'p2_win_pct_hsf', 'p1won', 'p2won'],
      dtype='object')

## 1. KNN classification

In [8]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

X = ao[['p1rank', 'p2rank', 'p1_win_pct', 'p2_win_pct', 'p1_win_pct_hsf', 'p2_win_pct_hsf']]
y = ao['p1won']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=123)

In [9]:
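# n_neighbors=i sets K for each iteration; without it every fit uses the
# default K=5 (the identical scores recorded below come from such a run).
for i in [1, 5, 15, 50]:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    print("the score for", i, "neighbors:")
    print(knn.score(X_test, y_test))
    print()
    print("---" * 20)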

the score for 1 neighbors:
0.509325681492109

------------------------------------------------------------
the score for 5 neighbors:
0.509325681492109

------------------------------------------------------------
the score for 15 neighbors:
0.509325681492109

------------------------------------------------------------
the score for 50 neighbors:
0.509325681492109

------------------------------------------------------------
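
One thing worth checking before tuning K: `KNeighborsClassifier` uses Euclidean distance by default, and the rank columns span hundreds of places while the win percentages sit between 0 and 1. The sketch below takes the feature values of the first two rows shown in `ao.head()` and measures how much of their squared distance comes from the two rank columns. The helper name `rank_share` is made up for illustration, and whether rescaling the features would actually improve the score on this data is untested here.

```python
import math

# Feature order: p1rank, p2rank, p1_win_pct, p2_win_pct, p1_win_pct_hsf, p2_win_pct_hsf
# Values copied from rows 0 and 1 of ao.head().
berlocq_match = [41.0, 40.0, 0.111111, 0.55, 0.47, 0.56]
robert_match = [225.0, 25.0, 0.25, 0.642857, 0.47, 0.710843]


def rank_share(a, b, n_rank_cols=2):
    """Fraction of the squared Euclidean distance contributed by the rank columns."""
    squared = [(x - y) ** 2 for x, y in zip(a, b)]
    return sum(squared[:n_rank_cols]) / sum(squared)


distance = math.dist(berlocq_match, robert_match)
share = rank_share(berlocq_match, robert_match)
print(f"Euclidean distance: {distance:.2f}")
print(f"Share of squared distance from rank columns: {share:.6f}")
```

For this pair, nearly all of the distance comes from the rank difference, so the neighbours KNN picks are decided almost entirely by `p1rank` and `p2rank`.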


In [10]:
# Instantiating the model (using the value K=5).
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the model with data.
knn.fit(X_train, y_train)

# Store the predicted response values.
y_pred_class = knn.predict(X_test)

In [11]:
knn.score(X_test, y_test)

0.509325681492109

In [12]:
knn.predict_proba(X)

array([[0.6, 0.4],
       [0.4, 0.6],
       [1. , 0. ],
       ...,
       [0.4, 0.6],
       [0. , 1. ],
       [0.4, 0.6]])

In [13]:
# Storing KNN's predicted class (1 = player 1 wins) for every row, training rows included
ao['p1_win_pred_knn'] = knn.predict(X)

In [14]:
ao.head()

Unnamed: 0,year,score,round,minutes,p1name,p1age,p1ace,p1rank,p1_win_pct,p1_win_pct_hsf,p2name,p2age,p2ace,p2rank,p2_win_pct,p2_win_pct_hsf,p1won,p2won,p1_win_pred_knn
0,2014,6-4 7-6(5) 6-7(9) 6-2,R128,228.0,Carlos Berlocq,30.94319,9.0,41.0,0.111111,0.47,Edouard Roger Vasselin,30.12731,6.0,40.0,0.55,0.56,0,1,0
1,2016,7-5 6-3 6-2,R32,104.0,Stephane Robert,35.671458,3.0,225.0,0.25,0.47,Gael Monfils,29.379877,10.0,25.0,0.642857,0.710843,0,1,1
2,2018,6-3 6-2 6-1,R128,98.0,Dennis Novak,24.383299,3.0,226.0,0.394737,0.318182,Grigor Dimitrov,26.669405,7.0,3.0,0.550459,0.571429,0,1,0
3,2013,6-4 6-2 6-4,R128,106.0,Grigor Dimitrov,21.667351,5.0,41.0,0.550459,0.571429,Julien Benneteau,31.069131,6.0,38.0,0.421053,0.454545,0,1,0
4,2015,6-3 7-6(6) 6-1,R32,118.0,Malek Jaziri,30.997947,6.0,75.0,0.405797,0.388889,Nick Kyrgios,19.731691,27.0,53.0,0.642857,0.671875,0,1,0


## 2. Logistic Regression

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Instantiate the model
logreg = LogisticRegression()

# Define the features and the target
X = ao[['p1rank', 'p2rank', 'p1_win_pct', 'p2_win_pct', 'p1_win_pct_hsf', 'p2_win_pct_hsf']]
y = ao.p1won

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=123)

# Fit the model on the training data
logreg.fit(X_train, y_train)

# Store the test-set predictions
pred = logreg.predict(X_test)

# Score the model on the test set
logreg.score(X_test, y_test)

0.5107604017216643

In [16]:
logreg.predict_proba(X_test)[0:10]

array([[0.55338533, 0.44661467],
       [0.46343782, 0.53656218],
       [0.50797374, 0.49202626],
       [0.46491336, 0.53508664],
       [0.40692692, 0.59307308],
       [0.52519997, 0.47480003],
       [0.43597979, 0.56402021],
       [0.53718635, 0.46281365],
       [0.49875655, 0.50124345],
       [0.4571768 , 0.5428232 ]])

In [17]:
# Storing the logistic regression's predicted class (1 = player 1 wins) for every row, training rows included
ao['p1_win_pred_log'] = logreg.predict(X)

In [18]:
# Store the predicted probabilities of player 1 winning.
ao['p1_win_pred_prob_log'] = logreg.predict_proba(X)[:,1]

In [19]:
ao.head()

Unnamed: 0,year,score,round,minutes,p1name,p1age,p1ace,p1rank,p1_win_pct,p1_win_pct_hsf,...,p2age,p2ace,p2rank,p2_win_pct,p2_win_pct_hsf,p1won,p2won,p1_win_pred_knn,p1_win_pred_log,p1_win_pred_prob_log
0,2014,6-4 7-6(5) 6-7(9) 6-2,R128,228.0,Carlos Berlocq,30.94319,9.0,41.0,0.111111,0.47,...,30.12731,6.0,40.0,0.55,0.56,0,1,0,0,0.488739
1,2016,7-5 6-3 6-2,R32,104.0,Stephane Robert,35.671458,3.0,225.0,0.25,0.47,...,29.379877,10.0,25.0,0.642857,0.710843,0,1,1,1,0.534095
2,2018,6-3 6-2 6-1,R128,98.0,Dennis Novak,24.383299,3.0,226.0,0.394737,0.318182,...,26.669405,7.0,3.0,0.550459,0.571429,0,1,0,1,0.515815
3,2013,6-4 6-2 6-4,R128,106.0,Grigor Dimitrov,21.667351,5.0,41.0,0.550459,0.571429,...,31.069131,6.0,38.0,0.421053,0.454545,0,1,0,1,0.520832
4,2015,6-3 7-6(6) 6-1,R32,118.0,Malek Jaziri,30.997947,6.0,75.0,0.405797,0.388889,...,19.731691,27.0,53.0,0.642857,0.671875,0,1,0,1,0.504559


In [20]:
ao[(ao.p1won == 1) & (ao.p1_win_pred_knn ==1)]

Unnamed: 0,year,score,round,minutes,p1name,p1age,p1ace,p1rank,p1_win_pct,p1_win_pct_hsf,...,p2age,p2ace,p2rank,p2_win_pct,p2_win_pct_hsf,p1won,p2won,p1_win_pred_knn,p1_win_pred_log,p1_win_pred_prob_log
694,2010,4-6 6-2 7-6(2) 6-0,R128,164.0,Igor Andreev,26.516085,2.0,37.0,0.460000,0.470000,...,28.446270,9.0,1.0,0.832000,0.818182,1,0,1,1,0.540501
696,2010,6-1 6-2 6-3,R128,100.0,Ricardo Hocevar,24.706366,5.0,194.0,0.460000,0.470000,...,28.898015,10.0,22.0,0.550000,0.560000,1,0,1,1,0.530656
697,2010,6-0 6-0 2-0 RET,R128,61.0,Frederico Gil,24.821355,0.0,68.0,0.460000,0.470000,...,27.797399,2.0,18.0,0.414634,0.434783,1,0,1,1,0.508174
698,2010,6-3 7-6(3) 4-6 7-6(8),R128,202.0,Dudi Sela,24.791239,7.0,41.0,0.411765,0.454545,...,21.995893,10.0,209.0,0.550000,0.560000,1,0,1,0,0.457616
699,2010,7-6(2) 7-5 6-3,R128,156.0,Carlos Moya,33.393566,9.0,509.0,0.460000,0.470000,...,22.362765,7.0,119.0,0.333333,0.333333,1,0,1,1,0.518672
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1388,2020,1-6 6-3 6-4 6-2,QF,139.0,Stan Wawrinka,34.814511,4.0,15.0,0.594340,0.632911,...,22.751540,13.0,7.0,0.703911,0.702479,1,0,1,1,0.557279
1389,2020,6-3 2-6 2-6 7-6(8) 6-3,QF,211.0,Tennys Sandgren,28.498289,27.0,100.0,0.423529,0.428571,...,38.450376,5.0,3.0,0.832000,0.818182,1,0,1,1,0.538860
1390,2020,6-4 6-3 7-6(1),QF,169.0,Milos Raonic,29.065024,18.0,35.0,0.675676,0.638554,...,32.665298,4.0,2.0,0.843023,0.842105,1,0,1,1,0.577509
1391,2020,7-6(1) 6-4 6-3,SF,138.0,Roger Federer,38.450376,15.0,3.0,0.832000,0.818182,...,32.665298,11.0,2.0,0.843023,0.842105,1,0,1,1,0.606916


In [21]:
# Share of actual player 1 wins (699) that KNN also predicted (440), over all rows
440/699

0.6294706723891274

In [22]:
ao[(ao.p1won == 1) & (ao.p1_win_pred_log ==1)]

Unnamed: 0,year,score,round,minutes,p1name,p1age,p1ace,p1rank,p1_win_pct,p1_win_pct_hsf,...,p2age,p2ace,p2rank,p2_win_pct,p2_win_pct_hsf,p1won,p2won,p1_win_pred_knn,p1_win_pred_log,p1_win_pred_prob_log
694,2010,4-6 6-2 7-6(2) 6-0,R128,164.0,Igor Andreev,26.516085,2.0,37.0,0.460000,0.470000,...,28.446270,9.0,1.0,0.832000,0.818182,1,0,1,1,0.540501
695,2010,6-4 6-3 7-6(2),R128,137.0,Juan Ignacio Chela,30.387406,6.0,76.0,0.460000,0.470000,...,28.495551,6.0,47.0,0.550000,0.560000,1,0,0,1,0.511136
696,2010,6-1 6-2 6-3,R128,100.0,Ricardo Hocevar,24.706366,5.0,194.0,0.460000,0.470000,...,28.898015,10.0,22.0,0.550000,0.560000,1,0,1,1,0.530656
697,2010,6-0 6-0 2-0 RET,R128,61.0,Frederico Gil,24.821355,0.0,68.0,0.460000,0.470000,...,27.797399,2.0,18.0,0.414634,0.434783,1,0,1,1,0.508174
699,2010,7-6(2) 7-5 6-3,R128,156.0,Carlos Moya,33.393566,9.0,509.0,0.460000,0.470000,...,22.362765,7.0,119.0,0.333333,0.333333,1,0,1,1,0.518672
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1388,2020,1-6 6-3 6-4 6-2,QF,139.0,Stan Wawrinka,34.814511,4.0,15.0,0.594340,0.632911,...,22.751540,13.0,7.0,0.703911,0.702479,1,0,1,1,0.557279
1389,2020,6-3 2-6 2-6 7-6(8) 6-3,QF,211.0,Tennys Sandgren,28.498289,27.0,100.0,0.423529,0.428571,...,38.450376,5.0,3.0,0.832000,0.818182,1,0,1,1,0.538860
1390,2020,6-4 6-3 7-6(1),QF,169.0,Milos Raonic,29.065024,18.0,35.0,0.675676,0.638554,...,32.665298,4.0,2.0,0.843023,0.842105,1,0,1,1,0.577509
1391,2020,7-6(1) 6-4 6-3,SF,138.0,Roger Federer,38.450376,15.0,3.0,0.832000,0.818182,...,32.665298,11.0,2.0,0.843023,0.842105,1,0,1,1,0.606916


In [23]:
# Share of actual player 1 wins (699) that logistic regression also predicted (488), over all rows
488/699

0.698140200286123

In [24]:
ao.p1won.value_counts()

1    699
0    694
Name: p1won, dtype: int64
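
Because the two classes are almost perfectly balanced, these counts also give a useful yardstick: always predicting the more common class would already be right about half the time. The arithmetic below compares that baseline with the two held-out scores reported earlier in this notebook. The variable names are illustrative only, and the comparison is approximate because the baseline is computed on all rows rather than on the 50% test split.

```python
# Class counts from ao.p1won.value_counts()
p1_wins, p1_losses = 699, 694
total = p1_wins + p1_losses

# Accuracy of a "model" that always predicts the majority class
baseline = max(p1_wins, p1_losses) / total

# Held-out accuracies reported earlier in the notebook
test_scores = {"knn": 0.509325681492109, "logreg": 0.5107604017216643}

print(f"majority-class baseline: {baseline:.4f}")
for name, score in test_scores.items():
    print(f"{name}: {score:.4f} ({score - baseline:+.4f} vs baseline)")
```

Both models land within about one percentage point of the baseline, which suggests the current features carry little signal as they are used here.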

In [25]:
ao.p1_win_pred_log.value_counts()

1    948
0    445
Name: p1_win_pred_log, dtype: int64

In [26]:
ao.p1_win_pred_knn.value_counts()

1    749
0    644
Name: p1_win_pred_knn, dtype: int64

In [27]:
logreg.score(X,y)

0.5183058147882269
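
The hand-counted fractions above (440/699 and 488/699) are recall on actual player 1 wins, measured over all rows, including the ones the models were trained on. Combined with the `value_counts()` outputs, they are enough to rebuild each model's full confusion matrix. The sketch below does that arithmetic; `confusion_from_counts` is a hypothetical helper, and the true-positive counts of 440 and 488 are taken from the numerators used in the cells above.

```python
def confusion_from_counts(true_pos, predicted_pos, actual_pos, actual_neg):
    """Rebuild a 2x2 confusion matrix and summary metrics from aggregate counts."""
    false_pos = predicted_pos - true_pos
    false_neg = actual_pos - true_pos
    true_neg = actual_neg - false_pos
    total = actual_pos + actual_neg
    return {
        "tp": true_pos, "fp": false_pos, "fn": false_neg, "tn": true_neg,
        "recall": true_pos / actual_pos,
        "precision": true_pos / predicted_pos,
        "accuracy": (true_pos + true_neg) / total,
    }


# actual_pos / actual_neg come from ao.p1won.value_counts() (699 / 694);
# predicted_pos comes from the value_counts() of each prediction column.
knn_cm = confusion_from_counts(true_pos=440, predicted_pos=749, actual_pos=699, actual_neg=694)
log_cm = confusion_from_counts(true_pos=488, predicted_pos=948, actual_pos=699, actual_neg=694)

for name, cm in [("knn", knn_cm), ("logreg", log_cm)]:
    print(name, {k: round(v, 4) for k, v in cm.items()})
```

For logistic regression the derived accuracy, 722/1393, agrees with the `logreg.score(X, y)` value above. Its precision of roughly 0.51 shows that the high recall comes mostly from predicting a player 1 win for about 68% of all matches.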

## 3. Random Forest

In [28]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Instantiate the model
rf = RandomForestClassifier()

# Define the features and the target
X = ao[['p1rank', 'p2rank', 'p1_win_pct', 'p2_win_pct', 'p1_win_pct_hsf', 'p2_win_pct_hsf']]
y = ao.p1won

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=123)

# Fit the model on the training data
rf.fit(X_train, y_train)

# Store the test-set predictions
pred = rf.predict(X_test)

# Score the model on the test set
rf.score(X_test, y_test)


0.48206599713055953

In [30]:
# Accuracy on all rows; half of them were used for training, so this overstates performance
rf.score(X, y)

0.7401292175161522
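
The gap between 0.48 on the test split and 0.74 on the full data is what overfitting looks like when a score mixes training and test rows. A rough decomposition is sketched below, assuming `train_test_split` rounded the 50% test share up to 697 of the 1,393 rows (the reported test score equals 336/697, which supports this). The variable names are illustrative.

```python
n_total = 699 + 694            # rows in ao, from p1won.value_counts()
n_test = 697                   # assumed size of the 50% test split (rounded up)
n_train = n_total - n_test

test_score = 0.48206599713055953   # rf.score(X_test, y_test)
full_score = 0.7401292175161522    # rf.score(X, y)

# Convert the two accuracies back into counts of correct predictions
correct_test = round(test_score * n_test)
correct_full = round(full_score * n_total)
correct_train = correct_full - correct_test

print(f"test:  {correct_test}/{n_test} = {correct_test / n_test:.4f}")
print(f"train: {correct_train}/{n_train} = {correct_train / n_train:.4f}")
```

Under that assumption the forest gets nearly every training row right and does no better than a coin flip on unseen matches, so the full-data score says little about how it would fare on the 2021 draw.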

In [31]:
# Store the random forest's predicted class (1 = player 1 wins) for every row.
ao['rf_pred_win'] = rf.predict(X)

In [32]:
ao.head()

Unnamed: 0,year,score,round,minutes,p1name,p1age,p1ace,p1rank,p1_win_pct,p1_win_pct_hsf,...,p2ace,p2rank,p2_win_pct,p2_win_pct_hsf,p1won,p2won,p1_win_pred_knn,p1_win_pred_log,p1_win_pred_prob_log,rf_pred_win
0,2014,6-4 7-6(5) 6-7(9) 6-2,R128,228.0,Carlos Berlocq,30.94319,9.0,41.0,0.111111,0.47,...,6.0,40.0,0.55,0.56,0,1,0,0,0.488739,0
1,2016,7-5 6-3 6-2,R32,104.0,Stephane Robert,35.671458,3.0,225.0,0.25,0.47,...,10.0,25.0,0.642857,0.710843,0,1,1,1,0.534095,0
2,2018,6-3 6-2 6-1,R128,98.0,Dennis Novak,24.383299,3.0,226.0,0.394737,0.318182,...,7.0,3.0,0.550459,0.571429,0,1,0,1,0.515815,0
3,2013,6-4 6-2 6-4,R128,106.0,Grigor Dimitrov,21.667351,5.0,41.0,0.550459,0.571429,...,6.0,38.0,0.421053,0.454545,0,1,0,1,0.520832,0
4,2015,6-3 7-6(6) 6-1,R32,118.0,Malek Jaziri,30.997947,6.0,75.0,0.405797,0.388889,...,27.0,53.0,0.642857,0.671875,0,1,0,1,0.504559,0


In [33]:
ao.rf_pred_win.value_counts()

1    699
0    694
Name: rf_pred_win, dtype: int64

In [34]:
ao[(ao.p1won == 1) & (ao.rf_pred_win ==1)]

Unnamed: 0,year,score,round,minutes,p1name,p1age,p1ace,p1rank,p1_win_pct,p1_win_pct_hsf,...,p2ace,p2rank,p2_win_pct,p2_win_pct_hsf,p1won,p2won,p1_win_pred_knn,p1_win_pred_log,p1_win_pred_prob_log,rf_pred_win
694,2010,4-6 6-2 7-6(2) 6-0,R128,164.0,Igor Andreev,26.516085,2.0,37.0,0.460000,0.470000,...,9.0,1.0,0.832000,0.818182,1,0,1,1,0.540501,1
695,2010,6-4 6-3 7-6(2),R128,137.0,Juan Ignacio Chela,30.387406,6.0,76.0,0.460000,0.470000,...,6.0,47.0,0.550000,0.560000,1,0,0,1,0.511136,1
696,2010,6-1 6-2 6-3,R128,100.0,Ricardo Hocevar,24.706366,5.0,194.0,0.460000,0.470000,...,10.0,22.0,0.550000,0.560000,1,0,1,1,0.530656,1
699,2010,7-6(2) 7-5 6-3,R128,156.0,Carlos Moya,33.393566,9.0,509.0,0.460000,0.470000,...,7.0,119.0,0.333333,0.333333,1,0,1,1,0.518672,1
700,2010,6-1 6-0 6-3,R128,91.0,Dieter Kindlmann,27.627652,0.0,175.0,0.460000,0.470000,...,6.0,6.0,0.550000,0.560000,1,0,1,1,0.533049,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1388,2020,1-6 6-3 6-4 6-2,QF,139.0,Stan Wawrinka,34.814511,4.0,15.0,0.594340,0.632911,...,13.0,7.0,0.703911,0.702479,1,0,1,1,0.557279,1
1389,2020,6-3 2-6 2-6 7-6(8) 6-3,QF,211.0,Tennys Sandgren,28.498289,27.0,100.0,0.423529,0.428571,...,5.0,3.0,0.832000,0.818182,1,0,1,1,0.538860,1
1390,2020,6-4 6-3 7-6(1),QF,169.0,Milos Raonic,29.065024,18.0,35.0,0.675676,0.638554,...,4.0,2.0,0.843023,0.842105,1,0,1,1,0.577509,1
1391,2020,7-6(1) 6-4 6-3,SF,138.0,Roger Federer,38.450376,15.0,3.0,0.832000,0.818182,...,11.0,2.0,0.843023,0.842105,1,0,1,1,0.606916,1


In [35]:
# Share of actual player 1 wins (699) that the random forest also predicted (518), over all rows
518/699

0.7410586552217453

### Next steps:
Run either the KNN or logistic regression model on the 2021 Australian Open draw _(still pending!)_ to generate actual predictions.