# Introduction

In this notebook we train a Random Forest to predict the scoreline of EPL games using the existing pipeline, pipeline_ex_RandomForest.

# Step 1 - Import and Data Prep

We import the necessary libraries and load the training, validation and test datasets. These sets have already undergone cleaning, preparation and feature engineering in the original pipeline file, so they are ready to use.

In [1]:
# Random Forest Evaluation Notebook

# Imports
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV
import sys, os

# Import pipeline and load data
sys.path.append(os.path.abspath('..'))  # Adjust if needed
from pipeline_ex_RandomForest import get_train_val_test_data

X_train, X_val, X_test, y_train, y_val, y_test = get_train_val_test_data()

--- NaN Check before scaling ---
NaNs in X_train: 0
NaNs in X_val: 0
NaNs in X_test: 0
--- NaN Check after scaling ---
NaNs in X_train_scaled: 0
NaNs in X_val_scaled: 0
NaNs in X_test_scaled: 0


  db['odds_hw'] = db[home_win_cols].mean(axis=1)
  db['odds_d']  = db[draw_cols].mean(axis=1)
  db['odds_aw'] = db[away_win_cols].mean(axis=1)


In [3]:
# Check data types and shapes
print("X_train type:", type(X_train))
print("X_train shape:", X_train.shape)
print("y_train type:", type(y_train))
print("y_train shape:", y_train.shape)
# Note: despite the "First 5" labels, these two lines print the full column lists
print("First 5 y_train columns:", y_train.columns.tolist())
print("First 5 X_train columns:", X_train.columns.tolist())

X_train type: <class 'pandas.core.frame.DataFrame'>
X_train shape: (2410, 181)
y_train type: <class 'pandas.core.frame.DataFrame'>
y_train shape: (2410, 2)
First 5 y_train columns: ['FTHG', 'FTAG']
First 5 X_train columns: ['HTHG', 'HTAG', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR', 'Bb1X2', 'BbMxH', 'BbAvH', 'BbMxD', 'BbAvD', 'BbMxA', 'BbAvA', 'BbOU', 'BbMx_2.5', 'BbAv_2.5', 'BbMx_2.5', 'BbAv_2.5', 'BbAH', 'BbAHh', 'BbMxAHH', 'BbAvAHH', 'BbMxAHA', 'BbAvAHA', 'B365_2.5', 'B365_2.5', 'P_2.5', 'P_2.5', 'Max_2.5', 'Max_2.5', 'Avg_2.5', 'Avg_2.5', 'AHh', 'B365AHH', 'B365AHA', 'PAHH', 'PAHA', 'MaxAHH', 'MaxAHA', 'AvgAHH', 'AvgAHA', 'B365C_2.5', 'B365C_2.5', 'PC_2.5', 'PC_2.5', 'MaxC_2.5', 'MaxC_2.5', 'AvgC_2.5', 'AvgC_2.5', 'AHCh', 'B365CAHH', 'B365CAHA', 'PCAHH', 'PCAHA', 'MaxCAHH', 'MaxCAHA', 'AvgCAHH', 'AvgCAHA', 'BFEH', 'BFED', 'BFEA', 'BFE_2.5', 'BFE_2.5', 'BFEAHH', 'BFEAHA', 'BFECH', 'BFECD', 'BFECA', 'BFEC_2.5', 'BFEC_2.5', 'BFECAHH', 'BFECAHA', 'odds_hw
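
The column list printed above contains repeated names (for example, 'BbMx_2.5' and 'B365_2.5' each appear twice in a row). A minimal sketch of how such repeats could be detected is shown below. It works on a short excerpt of the printed names rather than on the real X_train, and the helper name find_duplicates is hypothetical.

```python
from collections import Counter

# A short excerpt of the column names printed above (not the full list).
columns = [
    'BbOU', 'BbMx_2.5', 'BbAv_2.5', 'BbMx_2.5', 'BbAv_2.5', 'BbAH',
    'B365_2.5', 'B365_2.5', 'P_2.5', 'P_2.5',
]


def find_duplicates(names):
    """Return {name: count} for every name that occurs more than once."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


duplicates = find_duplicates(columns)
print(duplicates)
```

On the actual DataFrame, `X_train.columns[X_train.columns.duplicated()]` should give the same information. Duplicate names do not stop the model from training here, because the notebook passes `X_train.values` to scikit-learn, but selecting such a column by name returns a DataFrame rather than a Series.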

# Step 2 - Hyperparameter Tuning

We tune the hyperparameters with GridSearchCV, using 3-fold cross-validation on the training set, to find the best-performing configuration.

In [4]:
# Hyperparameter tuning (3-fold cross-validation on the training set; the validation set is not used here)
param_grid = {
    'estimator__n_estimators': [50, 100, 200],
    'estimator__max_depth': [3, 5, 7],
    'estimator__min_samples_split': [2, 5, 10],
    'estimator__min_samples_leaf': [1, 2, 4],
    'estimator__max_features': ['sqrt', 'log2']
}

base_model = MultiOutputRegressor(RandomForestRegressor(random_state=42, n_jobs=-1))
grid = GridSearchCV(base_model, param_grid, cv=3, scoring='neg_root_mean_squared_error', verbose=2, n_jobs=-1)
grid.fit(X_train.values, y_train.values)

print("Best parameters:", grid.best_params_)
print("Best CV score (neg RMSE):", grid.best_score_)

Fitting 3 folds for each of 162 candidates, totalling 486 fits
Best parameters: {'estimator__max_depth': 7, 'estimator__max_features': 'sqrt', 'estimator__min_samples_leaf': 4, 'estimator__min_samples_split': 5, 'estimator__n_estimators': 200}
Best CV score (neg RMSE): -1.0017834169136337
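
The "162 candidates, totalling 486 fits" line in the output follows directly from the size of the grid. The sketch below reproduces that arithmetic with the standard library only, as a quick way to estimate the cost of a grid before running it. The variable names are illustrative.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    'estimator__n_estimators': [50, 100, 200],
    'estimator__max_depth': [3, 5, 7],
    'estimator__min_samples_split': [2, 5, 10],
    'estimator__min_samples_leaf': [1, 2, 4],
    'estimator__max_features': ['sqrt', 'log2'],
}

# Every combination of one value per hyperparameter is one candidate.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)

cv_folds = 3  # matches cv=3 in the GridSearchCV call
total_fits = n_candidates * cv_folds

print(n_candidates, total_fits)  # 162 486
```

Because the base model is wrapped in MultiOutputRegressor, each of these fits trains one forest per target, and GridSearchCV performs one further fit at the end when refit is left at its default of True.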


# Step 4 - Validation Evaluation

We now check how the tuned model performs on the validation set, which was not used during the grid search. This gives an indication of how well the model generalizes and whether the hyperparameter optimization was successful.

In [5]:
# Validation set evaluation with rounded predictions
y_val_pred = grid.predict(X_val.values)
y_val_pred_rounded = np.round(y_val_pred)  # Round predictions to nearest integer

# mean_squared_error returns the MSE (no square root is taken), so the values are labelled MSE
print("Validation MSE (raw):", mean_squared_error(y_val, y_val_pred))
print("Validation MSE (rounded):", mean_squared_error(y_val, y_val_pred_rounded))
print("Validation MAE (raw):", mean_absolute_error(y_val, y_val_pred))
print("Validation MAE (rounded):", mean_absolute_error(y_val, y_val_pred_rounded))
print("Validation R2 (raw):", r2_score(y_val, y_val_pred))
print("Validation R2 (rounded):", r2_score(y_val, y_val_pred_rounded))

Validation MSE (raw): 1.0117995682891507
Validation MSE (rounded): 1.1029900332225915
Validation MAE (raw): 0.8054240133174153
Validation MAE (rounded): 0.7774086378737541
Validation R2 (raw): 0.41083261125668613
Validation R2 (rounded): 0.3568044172563198
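
The first two values printed above come from mean_squared_error, which returns the mean squared error. Taking the square root puts them on the same scale (goals) as the neg_root_mean_squared_error score used in the grid search. The sketch below illustrates the relationship on a small made-up example and then applies it to the first reported value. The toy data and the helper name mse are assumptions for illustration only.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Toy data (made up): squared errors are 1, 0, 4, 0.
y_true = [2, 0, 1, 3]
y_pred = [1.0, 0.0, 3.0, 3.0]

toy_mse = mse(y_true, y_pred)   # 5 / 4 = 1.25
toy_rmse = math.sqrt(toy_mse)   # square root of the MSE

# Applying the same step to the first validation value reported above.
reported_value = 1.0117995682891507
reported_rmse = math.sqrt(reported_value)

print(toy_mse, round(toy_rmse, 4), round(reported_rmse, 4))
```

The square root of the reported value is close to 1.006, which is in line with the cross-validation score of about -1.002. The two numbers may still differ slightly for multi-target data because of how the per-target errors are averaged.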


# Step 5 - Retrain and Evaluate

We can now retrain the model on the combined train+val sets so that it sees as many examples as possible, while avoiding leakage by leaving the test set untouched. The test set is then used to evaluate performance.

In [6]:
# Retrain on train+val, test on test set
X_trainval = pd.concat([X_train, X_val])
y_trainval = pd.concat([y_train, y_val])
final_model = MultiOutputRegressor(
    RandomForestRegressor(
        random_state=42,
        n_jobs=-1,
        **{k.replace('estimator__', ''): v for k, v in grid.best_params_.items()}
    )
)
final_model.fit(X_trainval.values, y_trainval.values)
y_test_pred = final_model.predict(X_test.values)
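
The dictionary comprehension in the cell above turns the grid-search keys (which carry an estimator__ prefix because of the MultiOutputRegressor wrapper) into plain RandomForestRegressor keyword arguments. The sketch below shows that transformation on the best parameters reported in Step 2. It strips the prefix only at the start of each key, a slightly stricter variant than the replace call, and the helper name strip_prefix is hypothetical.

```python
# Best parameters as printed by the grid search in Step 2.
best_params = {
    'estimator__max_depth': 7,
    'estimator__max_features': 'sqrt',
    'estimator__min_samples_leaf': 4,
    'estimator__min_samples_split': 5,
    'estimator__n_estimators': 200,
}

PREFIX = 'estimator__'


def strip_prefix(params, prefix=PREFIX):
    """Remove a leading prefix from each key; keys without it are kept as-is."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in params.items()
    }


rf_kwargs = strip_prefix(best_params)
print(rf_kwargs)
```

An alternative that avoids handling the keys by hand would be to pass grid.best_estimator_ to sklearn.base.clone, which returns an unfitted copy with the same hyperparameters that can then be fitted on the combined data.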

# Step 6 - Evaluation Results

Here we see how the retrained model performs on the held-out test set.

In [7]:
# Test set evaluation with rounded predictions
y_test_pred = final_model.predict(X_test.values)
y_test_pred_rounded = np.round(y_test_pred)  # Round predictions to nearest integer

# mean_squared_error returns the MSE (no square root is taken), so the values are labelled MSE
print("Test MSE (raw):", mean_squared_error(y_test, y_test_pred))
print("Test MSE (rounded):", mean_squared_error(y_test, y_test_pred_rounded))
print("Test MAE (raw):", mean_absolute_error(y_test, y_test_pred))
print("Test MAE (rounded):", mean_absolute_error(y_test, y_test_pred_rounded))
print("Test R2 (raw):", r2_score(y_test, y_test_pred))
print("Test R2 (rounded):", r2_score(y_test, y_test_pred_rounded))

Test MSE (raw): 0.9523432055437859
Test MSE (rounded): 1.0231788079470199
Test MAE (raw): 0.7643898637404292
Test MAE (rounded): 0.7317880794701986
Test R2 (raw): 0.40087755157296473
Test R2 (rounded): 0.35596463345828355


In [8]:
# Show predictions vs actuals with rounded predictions
results = pd.DataFrame({
    'FTHG_true': y_test['FTHG'].values,
    'FTHG_pred': y_test_pred[:, 0],
    'FTHG_pred_rounded': y_test_pred_rounded[:, 0],
    'FTAG_true': y_test['FTAG'].values,
    'FTAG_pred': y_test_pred[:, 1],
    'FTAG_pred_rounded': y_test_pred_rounded[:, 1]
})
print(results.head())

   FTHG_true  FTHG_pred  FTHG_pred_rounded  FTAG_true  FTAG_pred  \
0          2   2.064010                2.0          2   1.223948   
1          2   1.928646                2.0          2   1.386642   
2          1   0.953295                1.0          1   2.237210   
3          4   2.991334                3.0          2   0.767785   
4          1   1.392233                1.0          1   0.963454   

   FTAG_pred_rounded  
0                1.0  
1                1.0  
2                2.0  
3                1.0  
4                1.0  
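
Since the goal is to predict scorelines, it can also be useful to look at how often the rounded prediction matches the exact score, and how often it at least gets the match outcome (home win, draw, away win) right. The sketch below computes both for the five rows shown above. Five rows are far too few to draw conclusions, so this is only an illustration of the calculation, and the helper names are hypothetical.

```python
def outcome(home_goals, away_goals):
    """Map a scoreline to 'H' (home win), 'D' (draw) or 'A' (away win)."""
    if home_goals > away_goals:
        return 'H'
    if home_goals < away_goals:
        return 'A'
    return 'D'


# (FTHG, FTAG) pairs taken from the five rows printed above.
true_scores = [(2, 2), (2, 2), (1, 1), (4, 2), (1, 1)]
pred_scores = [(2, 1), (2, 1), (1, 2), (3, 1), (1, 1)]  # rounded predictions

n = len(true_scores)

# Share of matches where both goal counts are exactly right.
exact_hits = sum(t == p for t, p in zip(true_scores, pred_scores))
exact_accuracy = exact_hits / n

# Share of matches where the predicted outcome matches the true outcome.
outcome_hits = sum(
    outcome(*t) == outcome(*p) for t, p in zip(true_scores, pred_scores)
)
outcome_accuracy = outcome_hits / n

print(exact_accuracy, outcome_accuracy)  # 0.2 0.4
```

The same calculation could be applied to the full results DataFrame by building the score pairs from its _true and _pred_rounded columns.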


# Additional Visuals - Feature Importance

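We can see which features were the most important, i.e. carried the most weight in the model's predictions. Note that final_model.estimators_[0] is the forest fitted for the first target, FTHG (home goals), so the ranking below reflects that target only.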

In [8]:
# Feature importance analysis
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': final_model.estimators_[0].feature_importances_
})
feature_importance = feature_importance.sort_values('importance', ascending=False)
print("\nTop 10 most important features:")
print(feature_importance.head(10))


Top 10 most important features:
    feature  importance
0      HTHG    0.220516
4       HST    0.161437
80  odds_hw    0.066009
2        HS    0.051032
82  odds_aw    0.047949
57     AHCh    0.038952
81   odds_d    0.028487
40      AHh    0.024314
3        AS    0.017409
35    P_2.5    0.011196
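
A MultiOutputRegressor keeps one fitted forest per target in estimators_, so importances can be inspected for each target separately or averaged across targets. The sketch below shows one way this might be done. It uses small synthetic data instead of the EPL features, and the feature and target names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic data (an assumption for this sketch): the first target depends
# on feature 0 and the second target depends on feature 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Y = np.column_stack([
    3 * X[:, 0] + rng.normal(scale=0.1, size=200),
    3 * X[:, 1] + rng.normal(scale=0.1, size=200),
])
feature_names = ['f0', 'f1', 'f2', 'f3']    # hypothetical feature names
target_names = ['target_a', 'target_b']     # hypothetical target names

model = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=20, random_state=0)
)
model.fit(X, Y)

# One row of importances per target, in the same order as the columns of Y.
per_target = np.vstack([est.feature_importances_ for est in model.estimators_])

# Simple average across targets; each row sums to 1, so the mean does too.
mean_importance = per_target.mean(axis=0)

for name, row in zip(target_names, per_target):
    top_feature = feature_names[int(np.argmax(row))]
    print(f"{name}: top feature = {top_feature}")
```

In this notebook the equivalent would be to read final_model.estimators_[1].feature_importances_ for FTAG alongside the FTHG ranking shown above.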
