## Author's notes
### 28.10.2025
I am starting the model creation process. The files created in the prerocessing will be used, however, I do have some issues with them.
1) The data has not been split into train and test set during the preprocessing phase. I am aware that label encoding of categorical variables has been done and we should know all the categories for the teams, but it is still best practice to do encoding AFTER the split in order to avoid data leakage and it may be a problem in the assessment of our work by the teachers. Therefore, it would probably be for the best if this was corrected. All the preprocessing steps should be fitted to only the train data and only then used to transform the test data.
2) Only the market values were used in the bigger dataset. I am not saying this is wrong or right and I will trust Vojta on this one, BUT there should be a very detailed explanation for why exactly we didn't use the rest of the data.
3) There are missing values in the first observations in the derived column. I have dropped them for now, but I think a KNN imputer might do the trick, we could do some imputations and use everything.

Because of the issues with the train test split, I will split the data here, but I would like to change it once it is fixed.

## Libraries import

In [12]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree 
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error 

## Data import, train test split and shenanigans
The train test split should later be replaced by just loading the already split data after it has been done in preprocessing.
After this step, the models for A0 dataset will be made in the first chunk and A1 on the second chunk of this notebook.

In [22]:
#data import
FILE_PATH_a0 = "ready_data\data_a0_encoded.csv"
FILE_PATH_a1 = "ready_data\data_a1_encoded.csv"

data_a0=pd.read_csv(FILE_PATH_a0)
data_a1=pd.read_csv(FILE_PATH_a1)

In [14]:
#check that everything checks out
print(data_a0.info())
print(data_a0.head())

print(data_a1.info())
print(data_a1.head())

'''
The averaged data of the last 5 games contains missing values. 
This is because for the first 5 matches, it is always impossible to compute the average.
Because this is a derived column, this makes sense and should not be an issue for the data.
'''

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42593 entries, 0 to 42592
Data columns (total 26 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Time                           42593 non-null  int64  
 1   Target                         42593 non-null  int64  
 2   HomeTeam_enc                   42593 non-null  int64  
 3   avg_goals_in_last5_home        42115 non-null  float64
 4   avg_goals_conceded_last5_home  42115 non-null  float64
 5   AwayTeam_enc                   42593 non-null  int64  
 6   avg_goals_in_last5_away        42115 non-null  float64
 7   avg_goals_conceded_last5_away  42115 non-null  float64
 8   Year                           42593 non-null  int64  
 9   Month                          42593 non-null  int64  
 10  Dayofweek                      42593 non-null  int64  
 11  Is_weekend                     42593 non-null  int64  
 12  Season_of_year                 42593 non-null 

'\nThe averaged data of the last 5 games contains missing values. \nThis is because for the first 5 matches, it is always impossible to compute the average.\nBecause this is a derived column, this makes sense and should not be an issue for the data.\n'

The country and division dummies are booleans, change that into numerical.

In [23]:
data_a0[data_a0.select_dtypes(include='bool').columns]=data_a0[data_a0.select_dtypes(include='bool').columns].astype(int)
data_a1[data_a1.select_dtypes(include='bool').columns]=data_a1[data_a1.select_dtypes(include='bool').columns].astype(int)

In [24]:
#check that all columns have only numerical values
non_numeric_cols0 = data_a0.select_dtypes(exclude=[np.number]).columns
non_numeric_cols1 = data_a1.select_dtypes(exclude=[np.number]).columns

assert len(non_numeric_cols0)==0
assert len(non_numeric_cols1)==0

The data has been sorted chronologically in the preprocessing phase. Because this data is a time series and is likely time dependend, we will not be doing a random split of the data, but rather, a chronological one.

In [25]:
#just a check to see that we are good to keep working with the data and it's in the form we want
assert type(data_a0)==pd.core.frame.DataFrame
assert type(data_a1)==pd.core.frame.DataFrame

In [27]:
#train test split, 80/20 ratio
#for A0
split_index_0 = int(0.8 * len(data_a0))

train0 = data_a0.iloc[:split_index_0]
test0  = data_a0.iloc[split_index_0:]

X_train_0, y_train_0 = train0.drop(columns='Target'), train0['Target']
X_test_0,  y_test_0  = test0.drop(columns='Target'),  test0['Target']

#for A1
split_index_1 = int(0.8 * len(data_a1))

train1 = data_a1.iloc[:split_index_1]
test1  = data_a1.iloc[split_index_1:]

X_train_1, y_train_1 = train1.drop(columns='Target'), train1['Target']
X_test_1,  y_test_1  = test1.drop(columns='Target'),  test1['Target']

## Model creation A0

### RandomForestClassifier
Because this is a classification task and we aren't looking at a continuous target variable, we will use the RandomForestClassifier and not the RandomForestRegressor we have used in the lectures.

Because we have derived the avg variables, there is some missingness in the data. I will drop the observations with missing values for now but I think it can be fixed (check author's notes at the top of this markdown)

In [32]:
#temporary solution to missing values
X_train_0 = X_train_0.dropna()
y_train_0 = y_train_0.loc[X_train_0.index]

X_test_0 = X_test_0.dropna()
y_test_0 = y_test_0.loc[X_test_0.index]

In [34]:
#initiate the model
rf0_1 = RandomForestClassifier(random_state=42)

In [30]:
#because the data is chronological, we cannot do a randomized cross-validation when choosing the model
#we will use a rolling cross-validation instead

from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5) 

#this will split out data into 5 folds and we will always evaluate only based on the past, preventing leakage

In [None]:
#random search for some good values to use in gridsearch (hyperparameter tuning)
#parameters to go through
param_grid= {
    'max_depth':[i for i in range(1, 30)],
    'min_samples_split':[i for i in range(1,300)],
    'min_samples_leaf':[i for i in range(1, 200)],
    'criterion' :['gini', 'entropy', 'log_loss']
}


In [None]:
#the random search
params={
    'max_depth':[],
    'min_samples_split':[],
    'min_samples_leaf':[],
    'criterion' :[]
} #empty parameter grid to input the results of the random search
for state in [1, 20, 42, 200]:
    random_search = RandomizedSearchCV(
        estimator=rf0_1,
        param_distributions=param_grid,
        cv=tscv,
        n_iter=100,
        random_state=state,
        n_jobs=-1
    )
    random_search.fit(X_train_0, y_train_0)
    new_params=random_search.best_params_
    for key, value in new_params.items():
        params[key].append(value)

print(params)

{'max_depth': [18, 16, 14, 18], 'min_samples_split': [268, 239, 291, 282], 'min_samples_leaf': [58, 43, 61, 66], 'criterion': ['entropy', 'entropy', 'gini', 'log_loss']}


In [37]:
#keeping only unique values in the parameter grid
for key in params:
    params[key] = list(set(params[key]))
print(params)

{'max_depth': [16, 18, 14], 'min_samples_split': [282, 291, 268, 239], 'min_samples_leaf': [66, 58, 43, 61], 'criterion': ['log_loss', 'gini', 'entropy']}


The random search is run multiple times to find some values that could be used in the grid search. As this takes some time, here are the values that it gave me when I ran the code (so they can be used immediately and without running the code):

params={'max_depth': [16, 18, 14],'min_samples_split': [282, 291, 268, 239], 'min_samples_leaf': [66, 58, 43, 61], 'criterion': ['log_loss', 'gini', 'entropy']}

In [38]:
#for convenience, the params output can be loaded here
'''
params={'max_depth': [16, 18, 14],
        'min_samples_split': [282, 291, 268, 239],
        'min_samples_leaf': [66, 58, 43, 61],
        'criterion': ['log_loss', 'gini', 'entropy']}
'''

"\nparams={'max_depth': [16, 18, 14],\n        'min_samples_split': [282, 291, 268, 239],\n        'min_samples_leaf': [66, 58, 43, 61],\n        'criterion': ['log_loss', 'gini', 'entropy']}\n"

In [39]:
#grid search
grid=GridSearchCV(estimator=rf0_1,
                  param_grid=params,
                  n_jobs=-1, 
                  cv=tscv)
grid.fit(X_train_0, y_train_0)

best_params=grid.best_params_
print(best_params)

rf_0=grid.best_estimator_

{'criterion': 'log_loss', 'max_depth': 14, 'min_samples_leaf': 61, 'min_samples_split': 282}


As grid search takes a lot of time to be computed, here are the hyperparameter values for the best estimator for future convenience.

best_params={'criterion': 'log_loss', 'max_depth': 14, 'min_samples_leaf': 61, 'min_samples_split': 282}

In [None]:
#for convenience, the best estimator can be loaded here
'''
best_params={'criterion': 'log_loss', 'max_depth': 14, 'min_samples_leaf': 61, 'min_samples_split': 282}
rf_0=RandomForestClassifier(best_params, random_state=42)
'''

In [42]:
#Random forest model predictions
pred_rf_0=rf_0.predict(X_test_0)

### Tree boosting