### Fit Baseline Model

In [25]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
import pandas as pd
import os
import dtale
import numpy as np

Load the dataset.

In [26]:
project_dir = os.path.dirname(os.path.abspath(''))
df = pd.read_json(os.path.join(project_dir, 'model_prepped_dataset.json'))
dtale.show(df).open_browser()

Assign dataset to X matrix and y vector and apply any necessary transformations.

In [33]:
X = df.loc[:, ~df.columns.isin(['Outcome', 'Outcome_Bin_H'])]
y = df['Outcome']
X.tail()

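| | Season | Capacity | Elo_home | Elo_away | Day | Home_Team_Streak | Away_Team_Streak | Home_Team_Home_Streak | Away_Team_Away_Streak | Home_Team_Form | Away_Team_Form | Home_Team_Home_Form | Away_Team_Away_Form | Home_Team_Goals | Away_Team_Goals | Home_Team_Home_Goals | Away_Team_Away_Goals | Match_Relevance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 105459 | 2021 | 41841 | 90 | 91 | 6 | 0 | 1 | 2 | 0 | 4 | 2 | 1 | 2 | 1.1 | 2.2 | 1.9 | 1.8 | 6e-06 |
| 105460 | 2021 | 62062 | 89 | 73 | 6 | 0 | 0 | 1 | 1 | -3 | 0 | 0 | -1 | 1.2 | 0.9 | 1.6 | 0.5 | 6e-06 |
| 105461 | 2021 | 32500 | 84 | 89 | 6 | 2 | 0 | 1 | 0 | 2 | -2 | 2 | 1 | 1.6 | 1.6 | 1.5 | 1.1 | 6e-06 |
| 105462 | 2021 | 18482 | 60 | 72 | 6 | 3 | 0 | 2 | 1 | 1 | -1 | -1 | -1 | 1.2 | 0.8 | 0.8 | 1.2 | 6e-06 |
| 105463 | 2021 | 21628 | 70 | 74 | 6 | 0 | 3 | 0 | 1 | 0 | 2 | 2 | -1 | 1.1 | 1.6 | 1.1 | 1.4 | 6e-06 |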


Split into training set and test set.

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
print(len(X_train))
print(len(X_test))
print(len(X))

83509
14738
98247
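
The reported sizes can be checked by hand. The sketch below assumes scikit-learn's usual behaviour for a float `test_size` with no `train_size` given: the test count is rounded up and the remainder goes to the training set. The variable names are illustrative only.

```python
import math

n_samples = 98247   # len(X) reported above
test_size = 0.15    # fraction passed to train_test_split

# Assumption: the test count is rounded up, and the training set
# receives whatever rows are left over.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under that assumption the arithmetic reproduces the 83509 / 14738 split printed by the cell.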


Scale X by standardising it. Use the mean and standard deviation of the training set to standardise the test set.

In [35]:
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
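
As a minimal sketch of what this step does, the example below uses small synthetic arrays (`toy_train` and `toy_test` are hypothetical and not part of the project data) to show that `transform` reuses the mean and standard deviation learned from the training set rather than recomputing them on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data, for illustration only
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
toy_test = rng.normal(loc=50.0, scale=10.0, size=(40, 3))

# Fit on the training data only, then apply to the test data
scaler = StandardScaler().fit(toy_train)
toy_test_scaled = scaler.transform(toy_test)

# The same result computed by hand from the stored training statistics
manual = (toy_test - scaler.mean_) / scaler.scale_
matches_manual = np.allclose(toy_test_scaled, manual)

# The scaled training columns have mean ~0; the test columns generally do not
train_means_zero = np.allclose(scaler.transform(toy_train).mean(axis=0), 0.0)

print(matches_manual, train_means_zero)
```

Fitting the scaler on the training set alone keeps information about the test set out of the model inputs.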

#### Option 1) Logistic Regression

Train model.

In [36]:
lgr = LogisticRegression()
lgr.fit(X_train, y_train)

LogisticRegression()

Test the accuracy of the model by checking against the test set. This consists of calculating the MSE and accuracy of the predicted classification.

In [37]:
y_pred_log = lgr.predict(X_test)
accu_log = accuracy_score(y_test, y_pred_log) * 100
mse_log = mean_squared_error(y_test, y_pred_log)
# R2 score is not a good measure for classification, so it is not reported.
print(f'Logistic Regression Model\nAccuracy: {accu_log:.2f}%.\nMSE: {mse_log:.2f}')

Logistic Regression Model
Accuracy: 49.83%.
MSE: 1.16


#### Option 2) Linear Regression

Train model.

In [38]:
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression()

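Test the accuracy of the model by checking against the test set. The continuous predictions are first rounded to the nearest integer so that they can be compared with the class labels; the MSE, accuracy and R2 score of the rounded predictions are then calculated.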

In [39]:
y_pred_lin = np.rint(lr.predict(X_test))
mse_lin = mean_squared_error(y_test, y_pred_lin)
accu_lin = accuracy_score(y_test, y_pred_lin) * 100
r2_lin = r2_score(y_test, y_pred_lin)
print(f'Linear Regression Model\nAccuracy: {accu_lin:.2f}%.\nMSE: {mse_lin:.2f}\nR2 score: {r2_lin:.2f}')

Linear Regression Model
Accuracy: 34.46%.
MSE: 0.69
R2 score: -0.03
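
To illustrate the rounding step, the sketch below applies `np.rint` to a handful of made-up regression outputs (the values are hypothetical). It assumes, purely for illustration, that the three outcomes are coded as -1, 0 and 1; the actual coding of `Outcome` is not shown in this notebook.

```python
import numpy as np

# Hypothetical continuous outputs from a regression model
raw = np.array([-1.6, -0.4, 0.2, 0.5, 0.51, 1.7])

# np.rint rounds to the nearest integer (exact halves go to the even integer)
rounded = np.rint(raw)

# Assumed label set; anything outside it can never match a true label
valid_labels = np.array([-1.0, 0.0, 1.0])
out_of_range = ~np.isin(rounded, valid_labels)

print(rounded)
print(out_of_range.sum())
```

Rounded values that fall outside the label set are always counted as errors by `accuracy_score`, which is one reason a rounded regression can score poorly as a classifier.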


#### Option 3) Binary Logistic Regression

A final baseline model is trialled using the binary outcome (i.e. predicting home wins only).

In [40]:
X = df.loc[:, ~df.columns.isin(['Outcome', 'Outcome_Bin_H'])]
y = df['Outcome_Bin_H']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
lgr = LogisticRegression()
lgr.fit(X_train, y_train)
y_pred_log = lgr.predict(X_test)
accu_log = accuracy_score(y_test, y_pred_log) * 100
mse_log = mean_squared_error(y_test, y_pred_log)
print(f'Logistic Regression Model\nAccuracy: {accu_log:.2f}%.\nMSE: {mse_log:.2f}')

Logistic Regression Model
Accuracy: 61.09%.
MSE: 0.39
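
If `Outcome_Bin_H` is coded as 0/1 (an assumption; the coding is not shown here), the MSE of a hard classification is simply the misclassification rate, so it carries no information beyond accuracy. That is consistent with the figures above: 1 - 0.6109 = 0.3891, which rounds to 0.39. A minimal sketch with made-up labels:

```python
import numpy as np

# Hypothetical 0/1 labels and predictions, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 1, 1, 0, 0, 0, 0, 0])

accuracy = np.mean(y_true == y_hat)
mse = np.mean((y_true - y_hat) ** 2)

# Each wrong prediction contributes a squared error of exactly 1,
# so the two quantities always sum to 1 for 0/1 labels.
print(accuracy, mse, accuracy + mse)
```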


#### Conclusion:
The multiclass logistic regression model has an accuracy of 49.83%, which is roughly 50% higher in relative terms (about 16.5 percentage points) than random choice among three classes (33.3%).

The linear regression model is not considered an appropriate fit for this problem, as its continuous predictions have to be rounded in post-processing to turn them into class labels.

The binary logistic regression model has an accuracy of 61.09%, which is about 22% higher in relative terms (roughly 11 percentage points) than random choice between two classes (50%).

Both the multiclass and binary classification approaches will be developed going forward so that the best model for each can be compared.
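
The comparison with random choice can be worked through as follows. `relative_lift` is a hypothetical helper written for this sketch, and the baselines assume uniform random guessing (1/3 for three classes, 1/2 for two), which ignores any class imbalance in the data.

```python
def relative_lift(accuracy, baseline):
    """Relative improvement of an accuracy over a baseline, as a fraction."""
    return accuracy / baseline - 1.0


# Accuracies reported above; baselines assume uniform random guessing
multi_lift = relative_lift(0.4983, 1 / 3)
binary_lift = relative_lift(0.6109, 1 / 2)

# Absolute gaps in percentage points
multi_gap = (0.4983 - 1 / 3) * 100
binary_gap = (0.6109 - 1 / 2) * 100

print(f"Multiclass: {multi_lift:.1%} relative, {multi_gap:.1f} points")
print(f"Binary:     {binary_lift:.1%} relative, {binary_gap:.1f} points")
```

Reporting both the relative lift and the percentage-point gap avoids ambiguity about what "more than random choice" means.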