<h3>Question 2: Please estimate the UAE win probability (win/draw/lose).</h3>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

<h2>Processing data in Excel</h2>

<p>For question 2, the dataset "datapro1" keeps all the columns of "datapro" from question 1, with a few changes to suit this question's analysis:</p>

<ul>
<li>The "year" and "date" columns were not removed.</li>
<li>A "result" column was added, indicating whether the team won, drew, or lost (encoded as 2, 1, and 0, respectively).</li>
</ul>

In [2]:
data = "/Users/dobaophuc/Documents/học data analyst/dataset/World Football Results 2018 to 2023.xlsx"
df = pd.read_excel(data, sheet_name ='datapro1')
df.head()

Unnamed: 0,year,date,tournament,team,rating,opp_rating,goal_score,goal_conceded,home_advantage,rating_diff,result
0,2018,January 2,Gulf Cup,Iraq,1570,1560,0,0,0,10,1
1,2018,January 2,Gulf Cup,Oman,1511,1418,1,0,0,93,2
2,2018,January 5,Gulf Cup,United Arab Emirates,1561,1526,0,0,0,35,1
3,2018,January 7,Friendly,Sweden,1825,1508,1,1,0,317,1
4,2018,January 11,Friendly,Finland,1595,1480,2,1,0,115,2


<p>Here, I combine the day, month, and year into a single datetime column named "date".</p>

In [3]:
df[['month','day']]=df['date'].str.split(' ', expand=True)

In [4]:
df['day']=df['day'].astype(int)

In [5]:
df['month']=df['month'].map({'January': 1,
            'February': 2,
            'March': 3,
            'April': 4,
            'May': 5,
            'June': 6,
            'July': 7,
            'August': 8,
            'September': 9, 
            'October': 10,
            'November': 11,
            'December': 12})

In [6]:
df['date'] = pd.to_datetime(df[['year','month','day']])

In [7]:
df.drop(['year','month','day'],axis =1, inplace = True)
df.head()

Unnamed: 0,date,tournament,team,rating,opp_rating,goal_score,goal_conceded,home_advantage,rating_diff,result
0,2018-01-02,Gulf Cup,Iraq,1570,1560,0,0,0,10,1
1,2018-01-02,Gulf Cup,Oman,1511,1418,1,0,0,93,2
2,2018-01-05,Gulf Cup,United Arab Emirates,1561,1526,0,0,0,35,1
3,2018-01-07,Friendly,Sweden,1825,1508,1,1,0,317,1
4,2018-01-11,Friendly,Finland,1595,1480,2,1,0,115,2
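<p>As a side note, the split-and-map steps above could probably be collapsed into a single parsing call. The sketch below is only an illustration: the two-row "sample" frame is hypothetical, and it assumes English month names, which the %B directive expects under an English or C locale.</p>

```python
import pandas as pd

# Hypothetical two-row sample mirroring the "year" and "date" columns above.
sample = pd.DataFrame({"year": [2018, 2018],
                       "date": ["January 2", "January 11"]})

# Join year and "Month day" into one string, then parse with an explicit format.
# %B matches the full month name (assumes an English/C locale).
parsed = pd.to_datetime(sample["year"].astype(str) + " " + sample["date"],
                        format="%Y %B %d")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

<p>This avoids the intermediate "month" and "day" columns, so there is nothing to drop afterwards.</p>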


In [8]:
df.dtypes

date              datetime64[ns]
tournament                object
team                      object
rating                     int64
opp_rating                 int64
goal_score                 int64
goal_conceded              int64
home_advantage             int64
rating_diff                int64
result                     int64
dtype: object

<p>Now, the task is to predict the outcome probabilities for the UAE, in other words <strong>the probability that "result" equals 2, 1, or 0</strong> in the match UAE vs Thailand. For this, I use a <strong>Random Forest classification model</strong> (RandomForestClassifier), because the target is a class label and the model can return a probability for each class.</p>

In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

In [11]:
from sklearn.tree import export_graphviz
from IPython.display import Image
import graphviz

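<p>I split the data into training and test sets by date: matches before 2022-06-01 are used for training and matches after it for testing. The names of the predictor columns are stored in the "predictor" list.</p>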

In [24]:
rf = RandomForestClassifier(n_estimators = 50, min_samples_split = 10, random_state = 1)
train = df[df['date']<'2022-6-1']
test = df[df['date']>'2022-6-1']
predictor = ['opp_rating','home_advantage','rating','rating_diff']
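<p>One detail of a date split with "&lt;" and "&gt;" is that any row dated exactly on the cutoff lands in neither set. The toy frame below is made up purely to show the effect; whether the real data has matches on 2022-06-01 is not something this sketch checks.</p>

```python
import pandas as pd

# Hypothetical three-row frame: one match before, on, and after the cutoff.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2022-05-31", "2022-06-01", "2022-06-02"]),
    "result": [2, 1, 0],
})
cutoff = pd.Timestamp("2022-06-01")

train_toy = toy[toy["date"] < cutoff]
strict_test = toy[toy["date"] > cutoff]      # drops the row dated on the cutoff
inclusive_test = toy[toy["date"] >= cutoff]  # keeps it in the test set

print(len(train_toy), len(strict_test), len(inclusive_test))
```

<p>Using "&gt;=" on one side keeps the two sets disjoint while covering every row.</p>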

In [25]:
rf.fit(train[predictor],train['result'])

In [26]:
pred = rf.predict(test[predictor])
accu = accuracy_score(test['result'],pred)
accu

0.5476190476190477

In [27]:
#combined = pd.DataFrame(dict(actual = test['result'],prediction = pred))
#pd.crosstab(index = combined['actual'], columns = combined['prediction'])

# quicker alternative: rows are actual classes, columns are predicted classes
confusion_matrix(test['result'],pred)

array([[343,  66, 118],
       [126,  74, 132],
       [123,  62, 342]])
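<p>The confusion matrix above already contains enough information to recompute the accuracy, the recall for each class, and a majority-class baseline. The sketch below only re-uses the printed numbers and relies on scikit-learn's convention that rows are actual classes and columns are predicted classes, in the order 0 (lose), 1 (draw), 2 (win).</p>

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows = actual class, columns = predicted class; order 0 (lose), 1 (draw), 2 (win).
cm = np.array([[343, 66, 118],
               [126, 74, 132],
               [123, 62, 342]])

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions divided by the actual count of that class.
recall_per_class = np.diag(cm) / cm.sum(axis=1)

# Score of a model that always predicts the most frequent actual class.
majority_baseline = cm.sum(axis=1).max() / cm.sum()

print(round(accuracy, 4))
print(recall_per_class.round(3))
print(round(majority_baseline, 4))
```

<p>Read this way, most of the errors come from draws, which the model recovers far less often than wins or losses.</p>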

In [28]:
pred[0:5]

array([1, 2, 2, 2, 2])

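<h4>Comment:</h4><p>After fitting the initial Random Forest classifier, the model gives an accuracy score of <strong>~54.76%</strong>. For a three-class problem this is a reasonable start: always predicting the most frequent class would score only ~38% on this test set (527 of 1386 matches). I'll try to improve it by hyperparameter tuning.</p>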

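<p>Checking the precision score. Note that with average = 'micro' on a single-label multiclass problem, precision is identical to accuracy, so it adds no new information here.</p>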

In [29]:
precision_score(test['result'],pred, average = 'micro')

0.5476190476190477
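<p>The identical value is expected: micro-averaged precision pools all classes, so for single-label multiclass predictions it reduces to accuracy. The toy labels below are invented for illustration and show how the macro average and the per-class values differ from it.</p>

```python
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical labels using the same 0/1/2 encoding as "result".
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 2, 1, 0, 2, 2, 0, 2]

acc = accuracy_score(y_true, y_pred)
micro = precision_score(y_true, y_pred, average="micro")    # pooled over classes
macro = precision_score(y_true, y_pred, average="macro")    # unweighted class mean
per_class = precision_score(y_true, y_pred, average=None)   # one value per class

print(acc, micro)
print(round(macro, 3), per_class.round(3))
```

<p>Passing average = None (or 'macro') on the real predictions would show how precise the model is for each outcome separately.</p>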

<p>Just like finding the best order for the polynomial model in question 1, I now search for the best hyperparameters for the Random Forest classifier, namely <strong>max_depth, min_samples_leaf, and n_estimators</strong>. This process is known as <strong>hyperparameter tuning</strong>.</p>

In [33]:
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
params = {
    'max_depth': [2,3,5,10,20],
    'min_samples_leaf': [5,10,20,50,100,200],
    'n_estimators': [10,25,30,50,100,200]
}
from sklearn.model_selection import GridSearchCV
# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf,
                           param_grid=params,
                           cv = 4,
                           n_jobs=-1, verbose=1, scoring="accuracy")

In [34]:
grid_search.fit(train[predictor], train['result'])
grid_search.best_score_

Fitting 4 folds for each of 180 candidates, totalling 720 fits


0.5589613107469997

In [35]:
rf_best = grid_search.best_estimator_
rf_best

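<p>The grid search has now found the best parameter combination and stored the fitted model in rf_best. These parameter values are used in the retraining step.</p>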
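<p>One way to read the selected values off a fitted search is its best_params_ attribute. The sketch below runs on synthetic data from make_classification with a deliberately small grid, so both the data and the grid are assumptions made for illustration, not the notebook's dataset or its 180-candidate grid.</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data with 4 features (stand-in for the real predictors).
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)

# Deliberately small grid so the example runs in a few seconds.
small_grid = {
    "max_depth": [2, 5],
    "min_samples_leaf": [5, 50],
    "n_estimators": [10, 25],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid=small_grid, cv=4, scoring="accuracy")
search.fit(X, y)

# The chosen combination and its mean cross-validated accuracy.
print(search.best_params_)
print(round(search.best_score_, 3))
```

<p>Because best_estimator_ is already refit on the full training data by default, it can also be used directly for prediction instead of rebuilding the model by hand.</p>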

<h3>Retraining the Random Forest Model</h3>

In [36]:
rf = RandomForestClassifier(max_depth =5, n_estimators = 10, min_samples_leaf= 50, random_state = 42)
train = df[df['date']<'2022-6-1']
test = df[df['date']>'2022-6-1']
predictor = ['opp_rating','home_advantage','rating','rating_diff']

In [37]:
rf.fit(train[predictor],train['result'])
pred = rf.predict(test[predictor])
accu = accuracy_score(test['result'],pred)
accu

0.5634920634920635

<h4>The accuracy score has improved slightly, to 56.35%. Now on to the answer to the question...</h4>

In [38]:
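# Feature values in the same order as `predictor`:
# opp_rating, home_advantage, rating, rating_diff.
# A DataFrame with matching column names avoids scikit-learn's feature-name warning.
var = pd.DataFrame([[1173.4, 1, 1338.48, 165.08]], columns=predictor)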

In [39]:
rf.predict_proba(var)



array([[0.18411471, 0.23496287, 0.58092242]])
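<p>The columns of the predict_proba output follow the order of the fitted model's classes_ attribute, which for the 0/1/2 encoding used here means lose, draw, win. A minimal sketch of attaching those names, using random toy data rather than the real matches (the "labels" dictionary is a helper introduced only for this example):</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Random toy data: 4 features, target encoded 0 (lose), 1 (draw), 2 (win).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(150, 4))
y_toy = rng.integers(0, 3, size=150)

clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_toy, y_toy)

# Hypothetical helper mapping the numeric encoding to readable names.
labels = {0: "lose", 1: "draw", 2: "win"}

# predict_proba columns are ordered like clf.classes_.
proba = clf.predict_proba(X_toy[:1])[0]
named = {labels[int(c)]: float(p) for c, p in zip(clf.classes_, proba)}

print(named)
```

<p>Pairing the probabilities with classes_ this way guards against misreading the columns if the encoding ever changes.</p>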

<h2>Answer:</h2>
<h4>In the next football match vs Thailand, the model estimates the UAE's chances at ~58.09% WIN, ~23.50% DRAW, and ~18.41% LOSE.</h4>

<h3>Thanks for watching!</h3>