Notebook to explore other approaches, such as using a neural network to predict the target variable.

# Testing Random Forest

In [1]:
%load_ext autoreload
%autoreload 2

# Control figure size
figsize=(14, 4)

import pandas as pd
from util import util
import numpy as np
import os
data_folder = os.path.join('..', 'data')
file_name = "DataForModel"

In [2]:
data = util.load_data(data_folder, file_name)
data

Unnamed: 0,Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,Season,ELO diff,Home_prob_ELO,Draw_prob_ELO,...,Diff_shots_on_target_attempted,Diff_shots_on_target_allowed,Diff_shots_attempted,Diff_shots_allowed,Diff_corners_awarded,Diff_corners_conceded,Diff_fouls_commited,Diff_fouls_suffered,Diff_yellow_cards,Diff_red_cards
0,E0,2005-09-17,Aston Villa,Tottenham,1.0,1.0,0506,-25.173204,0.412832,0.245673,...,-9,10,-14,16,20,18,-13,9,-6,0
1,E0,2005-09-17,Portsmouth,Birmingham,1.0,1.0,0506,6.045620,0.468846,0.222236,...,4,-2,4,-4,0,13,6,17,1,0
2,E0,2005-09-17,Sunderland,West Brom,1.0,1.0,0506,-32.751187,0.399092,0.251422,...,9,-1,-4,8,5,0,-1,-21,-3,1
3,E0,2005-09-18,Blackburn,Newcastle,0.0,3.0,0506,34.014412,0.517707,0.201792,...,1,-13,7,-15,5,-14,0,-2,1,0
4,E0,2005-09-18,Man City,Bolton,0.0,1.0,0506,33.333649,0.516538,0.202282,...,3,3,-8,18,-4,2,-6,-17,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33998,E3,2024-04-27,Gillingham,Doncaster,2.0,2.0,2324,-111.203962,0.303611,0.218419,...,-16,12,-18,28,1,6,6,-19,-4,0
33999,E3,2024-04-27,Milton Keynes Dons,Sutton,4.0,4.0,2324,147.385429,0.692702,0.128574,...,2,-9,13,-23,-1,5,-29,10,-6,0
34000,E3,2024-04-27,Salford,Harrogate,2.0,2.0,2324,-47.482218,0.372310,0.262627,...,-7,-7,7,-18,-8,-4,17,6,9,2
34001,E3,2024-04-27,Swindon,Morecambe,3.0,3.0,2324,-33.608246,0.397536,0.252073,...,12,9,13,17,-13,-8,7,6,-5,1


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34003 entries, 0 to 34002
Data columns (total 27 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   Div                             34003 non-null  object        
 1   Date                            34003 non-null  datetime64[ns]
 2   HomeTeam                        34003 non-null  object        
 3   AwayTeam                        34003 non-null  object        
 4   FTHG                            34003 non-null  float64       
 5   FTAG                            34003 non-null  float64       
 6   Season                          34003 non-null  object        
 7   ELO diff                        34003 non-null  float64       
 8   Home_prob_ELO                   34003 non-null  float64       
 9   Draw_prob_ELO                   34003 non-null  float64       
 10  Away_prob_ELO                   34003 non-null  float64       
 11  Di

### Handling Non-numeric Data
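The first step is to convert the non-numeric data into numeric data. One option is the `LabelEncoder` class from the `sklearn.preprocessing` module; in the cells below, the date is split into numeric components and the team names are one-hot encoded instead. For this first test I will also remove the `Div` column, although the preview above shows that the data spans more than one division (both E0 and E3 rows appear).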

In [4]:
# Removing div column:
data.drop(columns="Div", inplace=True)


In [5]:
from sklearn.preprocessing import LabelEncoder
data = data.copy()
label_encoder = LabelEncoder()

# Convert Date to numerical values
data["Year"] = data["Date"].dt.year
data["Month"] = data["Date"].dt.month
data["Day"] = data["Date"].dt.day
data["DayOfWeek"] = data["Date"].dt.dayofweek  # Optional

# Drop the original Date column as it’s no longer needed
data = data.drop(columns=["Date"])

In [6]:
# One hot encoding of hometeam and awayteam
data = pd.get_dummies(
    data, columns=["HomeTeam", "AwayTeam"]
)

In [7]:
print(data.dtypes)

FTHG                float64
FTAG                float64
Season               object
ELO diff            float64
Home_prob_ELO       float64
                     ...   
AwayTeam_Wolves        bool
AwayTeam_Wrexham       bool
AwayTeam_Wycombe       bool
AwayTeam_Yeovil        bool
AwayTeam_York          bool
Length: 249, dtype: object
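The difference between label encoding and one-hot encoding can be seen on a small example. This is only a sketch on made-up fixtures (the `toy` frame is hypothetical, not taken from the dataset): label encoding assigns one integer per team, which implies an arbitrary ordering, while `pd.get_dummies` creates one boolean column per team and role.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy fixtures for illustration only (not rows from the real dataset)
toy = pd.DataFrame(
    {
        "HomeTeam": ["Arsenal", "Chelsea", "Arsenal"],
        "AwayTeam": ["Chelsea", "Arsenal", "Everton"],
    }
)

# Label encoding: one integer per team, in alphabetical order of the names
label_codes = LabelEncoder().fit_transform(toy["HomeTeam"])

# One-hot encoding: one boolean column per (role, team) pair, no ordering implied
one_hot = pd.get_dummies(toy, columns=["HomeTeam", "AwayTeam"])

print(label_codes)
print(one_hot.columns.tolist())
```

With many teams, one-hot encoding produces a wide frame, which is consistent with the 249 columns reported above.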


### Defining target variable

We want to predict the outcome. To create a numeric representation for this, we create a column called `Outcome` which has a value of 1 for home wins, 0 for draws and -1 for away wins. This mirrors the way a strength difference is represented by a positive or negative number depending on whether it favours the home or the away team.

In [8]:
# Add a new column Outcome which is 1 if HomeTeam wins, 0 if draw, -1 if AwayTeam wins
data["Outcome"] = data.apply(
    lambda row: (
        1 if row["FTHG"] > row["FTAG"] else (0 if row["FTHG"] == row["FTAG"] else -1)
    ),
    axis=1,
)
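
The row-wise `apply` above can be slow on tens of thousands of rows. A possible vectorised equivalent, shown here as a sketch on a hypothetical `toy` frame, takes the sign of the goal difference. It assumes `FTHG` and `FTAG` contain no missing values, as `data.info()` reports for this dataset.

```python
import numpy as np
import pandas as pd

# Toy scores for illustration only: a home win, a draw and an away win
toy = pd.DataFrame({"FTHG": [2.0, 1.0, 0.0], "FTAG": [1.0, 1.0, 3.0]})

# Sign of the goal difference: 1 = home win, 0 = draw, -1 = away win
toy["Outcome"] = np.sign(toy["FTHG"] - toy["FTAG"]).astype(int)

print(toy["Outcome"].tolist())
```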

In [9]:
data

Unnamed: 0,FTHG,FTAG,Season,ELO diff,Home_prob_ELO,Draw_prob_ELO,Away_prob_ELO,Diff_goals_scored,Diff_goals_conceded,Matchrating,...,AwayTeam_Watford,AwayTeam_West Brom,AwayTeam_West Ham,AwayTeam_Wigan,AwayTeam_Wolves,AwayTeam_Wrexham,AwayTeam_Wycombe,AwayTeam_Yeovil,AwayTeam_York,Outcome
0,1.0,1.0,0506,-25.173204,0.412832,0.245673,0.341496,0,6,-6,...,False,False,False,False,False,False,False,False,False,0
1,1.0,1.0,0506,6.045620,0.468846,0.222236,0.308918,0,-1,1,...,False,False,False,False,False,False,False,False,False,0
2,1.0,1.0,0506,-32.751187,0.399092,0.251422,0.349487,-3,-1,-2,...,False,True,False,False,False,False,False,False,False,0
3,0.0,3.0,0506,34.014412,0.517707,0.201792,0.280500,2,-2,4,...,False,False,False,False,False,False,False,False,False,-1
4,0.0,1.0,0506,33.333649,0.516538,0.202282,0.281180,1,0,1,...,False,False,False,False,False,False,False,False,False,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33998,2.0,2.0,2324,-111.203962,0.303611,0.218419,0.477970,-12,4,-16,...,False,False,False,False,False,False,False,False,False,0
33999,4.0,4.0,2324,147.385429,0.692702,0.128574,0.178724,6,1,5,...,False,False,False,False,False,False,False,False,False,0
34000,2.0,2.0,2324,-47.482218,0.372310,0.262627,0.365063,-7,-4,-3,...,False,False,False,False,False,False,False,False,False,0
34001,3.0,3.0,2324,-33.608246,0.397536,0.252073,0.350392,4,0,4,...,False,False,False,False,False,False,False,False,False,0


In [10]:
print("Outcome distribution:")
print(data["Outcome"].value_counts())

Outcome distribution:
Outcome
 1    14814
-1    10241
 0     8948
Name: count, dtype: int64


### Split data into training and testing sets

Now we split the data into training and test sets to properly test our model. The X dataset contains all our engineered features that describe the match before it is played. The y dataset contains the outcome. We use an 80/20 split, which is a common default. Football has probably evolved over the 20 years covered by the data, so rather than training only on "old" matches and testing only on "new" ones, we use randomized sampling.

In [11]:
X = data.copy().drop(
    columns=["Outcome", "FTHG", "FTAG", "Season"]
)  # Drop columns not needed for prediction
y = data.copy()["Outcome"]

In [12]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets (e.g., 80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
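
For comparison, the two splitting strategies discussed above can be sketched side by side on synthetic data. Everything in this block is an assumption made for illustration: the `toy` frame, the `cutoff` year, and the use of `stratify`, which the notebook itself does not use. The random split mixes all years, while the chronological split trains only on earlier years and tests on later ones.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic matches: 50 per year over 20 years (illustrative, not real data)
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "Year": np.repeat(np.arange(2005, 2025), 50),
        "ELO diff": rng.normal(0, 50, size=1000),
        "Outcome": rng.choice([-1, 0, 1], size=1000),
    }
)
X_toy = toy.drop(columns=["Outcome"])
y_toy = toy["Outcome"]

# Random 80/20 split; stratify keeps the class proportions similar in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Chronological split: train on years before the cutoff, test on the rest
cutoff = 2021  # hypothetical cutoff, leaves the last 4 of 20 years for testing
train_mask = toy["Year"] < cutoff
X_tr_time, X_te_time = X_toy[train_mask], X_toy[~train_mask]
y_tr_time, y_te_time = y_toy[train_mask], y_toy[~train_mask]

print(len(X_te), len(X_te_time))
```

A chronological split is closer to how a model would be used on future fixtures, at the cost of testing only on the most recent seasons.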

We use a Random Forest classifier to predict the outcomes.

In [13]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the model
model.fit(X_train, y_train)

In [14]:
from sklearn.metrics import classification_report, accuracy_score

# Make predictions on the test set
y_pred = model.predict(X_test)

# Print evaluation metrics
print("Accuracy:", accuracy_score(y_test, y_pred))
print(
    "Classification Report:\n",
    classification_report(
        y_test, y_pred, target_names=["Away Win", "Draw", "Home Win"]
    ),
)

Accuracy: 0.45875606528451696
Classification Report:
               precision    recall  f1-score   support

    Away Win       0.42      0.39      0.40      2026
        Draw       0.34      0.05      0.08      1866
    Home Win       0.48      0.77      0.59      2909

    accuracy                           0.46      6801
   macro avg       0.41      0.40      0.36      6801
weighted avg       0.42      0.46      0.40      6801



Even with a more sophisticated model, the accuracy is still low at about 0.46. The per-class recall shows that the model has a very hard time predicting draws: it recovers only 0.05 of them, compared with 0.77 of home wins. This is a recurring pattern in many papers on predicting the outcomes of football matches. One reason is the randomness involved in sport and the small margin between a draw and a win: a team needs only one more goal to win, and goals are often the result of chance, mistakes, or other human factors.
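
To put the accuracy in context, it helps to compare it with a trivial baseline that always predicts the most frequent class. The sketch below rebuilds a label vector from the class supports printed in the classification report (2026 away wins, 1866 draws, 2909 home wins) and scores scikit-learn's `DummyClassifier` on it. Fitting and scoring on the same labels is a simplification for illustration; in the notebook one would fit on `X_train, y_train` and score on `X_test, y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Labels rebuilt from the test-set supports in the classification report
y_test_like = np.repeat([-1, 0, 1], [2026, 1866, 2909])
X_dummy = np.zeros((len(y_test_like), 1))  # features are ignored by this baseline

# Always predict the most frequent class (here: home win)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_test_like)  # simplification: same labels used for fit and score
baseline_acc = accuracy_score(y_test_like, baseline.predict(X_dummy))

print(round(baseline_acc, 3))
```

Always predicting a home win would be right for 2909 of the 6801 test matches, so the Random Forest's 0.46 is only a few points above this baseline.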

With this in mind, we change our focus. Instead, we now try to predict an expected goal difference for each match. This will be used to predict home or away wins only where we are sufficiently certain. High uncertainty is associated with draws, which we will no longer try to predict.
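
A minimal sketch of this idea, on synthetic data, is shown below. The regressor choice (`RandomForestRegressor`), the helper `to_call`, the `threshold` value and the synthetic relation between rating difference and goal difference are all assumptions for illustration, not the method used later in the project. The model predicts a goal difference, and a call is made only when the prediction is far enough from zero.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def to_call(pred_goal_diff, threshold):
    """Map predicted goal differences to 1 (home win), -1 (away win) or 0 (no call)."""
    pred_goal_diff = np.asarray(pred_goal_diff)
    return np.where(
        pred_goal_diff > threshold, 1, np.where(pred_goal_diff < -threshold, -1, 0)
    )


# Synthetic data: goal difference loosely driven by a rating difference
rng = np.random.default_rng(42)
elo_diff = rng.normal(0, 100, size=500)
goal_diff = np.round(elo_diff / 100 + rng.normal(0, 1.5, size=500))
X_syn = elo_diff.reshape(-1, 1)

# Fit on the first 400 matches, predict the remaining 100
reg = RandomForestRegressor(n_estimators=50, random_state=42)
reg.fit(X_syn[:400], goal_diff[:400])
pred = reg.predict(X_syn[400:])

threshold = 0.5  # hypothetical cut-off; would need tuning on real data
calls = to_call(pred, threshold)
print("Matches with a call:", int((calls != 0).sum()), "of", len(calls))
```

Here 0 means that no prediction is made for the match, not that a draw is predicted.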