# Machine Learning Model Application

In this part, we merge the `GENERAL_STATS` and `EMA_data` datasets and clean the result to obtain a single dataset that contains all the features and the target variable.

## Features Used for Training the Machine Learning Models:

- **B365H**: Betting odds for a home win (Bet365).
- **B365D**: Betting odds for a draw (Bet365).
- **B365A**: Betting odds for an away win (Bet365).
- **HTGD**: Goal difference for the home team up to the match.
- **ATGD**: Goal difference for the away team up to the match.
- **DiffPts**: Difference in total points between the two teams.
- **DiffFormPts**: Difference in recent form points between the two teams.
- **f_cornersAgainstHome**: Corners conceded by the home team (EMA feature).
- **f_cornersForHome**: Corners earned by the home team (EMA feature).
- **f_freesAgainstHome**: Fouls conceded by the home team (EMA feature).
- **f_freesForHome**: Fouls earned by the home team (EMA feature).
- **f_goalsAgainstHome**: Goals conceded by the home team (EMA feature).
- **f_goalsForHome**: Goals scored by the home team (EMA feature).
- **f_halfTimeGoalsAgainstHome**: Half-time goals conceded by the home team (EMA feature).
- **f_halfTimeGoalsForHome**: Half-time goals scored by the home team (EMA feature).
- **f_redsAgainstHome**: Red cards received by opponents of the home team (EMA feature).
- **f_redsForHome**: Red cards received by the home team (EMA feature).
- **f_shotsAgainstHome**: Shots taken by opponents of the home team (EMA feature).
- **f_shotsForHome**: Shots taken by the home team (EMA feature).
- **f_shotsOnTargetAgainstHome**: Shots on target by opponents of the home team (EMA feature).
- **f_shotsOnTargetForHome**: Shots on target by the home team (EMA feature).
- **f_yellowsAgainstHome**: Yellow cards received by opponents of the home team (EMA feature).
- **f_yellowsForHome**: Yellow cards received by the home team (EMA feature).
- **f_cornersAgainstAway**: Corners conceded by the away team (EMA feature).
- **f_cornersForAway**: Corners earned by the away team (EMA feature).
- **f_freesAgainstAway**: Fouls conceded by the away team (EMA feature).
- **f_freesForAway**: Fouls earned by the away team (EMA feature).
- **f_goalsAgainstAway**: Goals conceded by the away team (EMA feature).
- **f_goalsForAway**: Goals scored by the away team (EMA feature).
- **f_halfTimeGoalsAgainstAway**: Half-time goals conceded by the away team (EMA feature).
- **f_halfTimeGoalsForAway**: Half-time goals scored by the away team (EMA feature).
- **f_redsAgainstAway**: Red cards received by opponents of the away team (EMA feature).
- **f_redsForAway**: Red cards received by the away team (EMA feature).
- **f_shotsAgainstAway**: Shots taken by opponents of the away team (EMA feature).
- **f_shotsForAway**: Shots taken by the away team (EMA feature).
- **f_shotsOnTargetAgainstAway**: Shots on target by opponents of the away team (EMA feature).
- **f_shotsOnTargetForAway**: Shots on target by the away team (EMA feature).
- **f_yellowsAgainstAway**: Yellow cards received by opponents of the away team (EMA feature).
- **f_yellowsForAway**: Yellow cards received by the away team (EMA feature).

## Target Variable:

- **FTR**: 
  - `H`: Home Win  
  - `D`: Draw  
  - `A`: Away Win



In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV, KFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_curve, roc_auc_score, classification_report, accuracy_score, confusion_matrix 
from xgboost import XGBClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
import warnings
import lightgbm as lgb

import os
import seaborn as sns
import matplotlib.pyplot as plt
from MachineLearningModels import *  # project helpers: train_* functions and evaluate_models
warnings.filterwarnings('ignore')

%load_ext autoreload
%autoreload 2

DATA_PATH='data'

First, we discard the variables that we are certain not to use in the analysis. We then merge the datasets and keep only the features used in our machine learning models.

From the GENERAL_STATS dataset we drop:
- 'FTHG', 'FTAG': goals scored in the match. They determine the result we want to predict, so they are not known before the match.
- 'HTGS', 'ATGS', 'HTGC', 'ATGC': goals scored and conceded so far, already captured by the features HTGD and ATGD.
- 'HTFormPts', 'ATFormPts': points of the home team and away team in the last 5 matches, already captured by DiffFormPts.
- 'HTP', 'ATP': points so far for the home and away team, already captured by DiffPts.
- 'MW', 'HTFormPtsStr', 'ATFormPtsStr': variables not needed by the models.

From the EMA_data dataset we discard 'Unnamed: 0', 'f_DateHome', 'f_SeasonHome', 'HomeTeam', 'homeGame_x', 'f_DateAway', 'f_SeasonAway', 'AwayTeam', 'homeGame_y', 'gameId_y', 'gameId_x', which are either repetitions of a variable in GENERAL_STATS or not needed.

Then we merge the two datasets to obtain a single dataset that contains all the features used to predict the outcome of each match.
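As a hedged illustration of the merge step, the sketch below joins two toy frames in the same way (a `gameId` column on the left against the row index on the right) and adds two sanity checks. The toy values are made up, and the assumption that the row index of `EMA_data` lines up with `gameId` is ours, not something the datasets guarantee.

```python
import pandas as pd

# Toy stand-ins for GENERAL_STATS and EMA_data (made-up values).
general_stats_toy = pd.DataFrame({
    "gameId": [0, 1, 2],
    "FTR": ["H", "D", "A"],
    "DiffPts": [0.0, 1.5, -2.0],
})
ema_toy = pd.DataFrame(
    {"f_goalsForHome": [1.2, 0.8, 1.9], "f_goalsForAway": [1.0, 1.4, 0.7]},
    index=[0, 1, 2],  # assumption: the row index corresponds to gameId
)

# validate="one_to_one" raises a MergeError if a key is duplicated on either side.
merged = pd.merge(
    general_stats_toy, ema_toy,
    left_on="gameId", right_index=True,
    how="inner", validate="one_to_one",
)

# An inner join silently drops unmatched rows, so compare the row counts
# and check that no missing values were introduced.
print(len(general_stats_toy), len(merged))
print(merged.isna().sum().sum())
```

If the row counts differ, some matches had no counterpart in the other frame and are worth inspecting before training.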


In [None]:
general_stats = pd.read_csv('data/GENERAL_STATS.csv')

general_stats.columns

In [None]:
general_stats.drop(['FTHG', 'FTAG',
                 'HTGS','ATGS', 'HTGC', 'ATGC',  'HTFormPts', 
                 'ATFormPts', 'MW', 'HTFormPtsStr', 'ATFormPtsStr', 'HTP', 'ATP'], axis =1, inplace=True)
general_stats.columns

In [None]:
ema_dataset = pd.read_csv("data/EMA_data.csv")

In [None]:
ema_dataset.drop(['Unnamed: 0', 'f_DateHome', 'f_SeasonHome', 'HomeTeam',
               'homeGame_x', 'f_DateAway', 'f_SeasonAway', 
               'AwayTeam', 'homeGame_y'], axis=1, inplace=True)


In [None]:
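# Merge on gameId: general_stats['gameId'] is matched against the row index of ema_dataset
df = pd.merge(general_stats, ema_dataset, left_on='gameId', right_index=True)
# Drop the duplicate id columns carried over from the EMA dataset
df.drop(['gameId_y', 'gameId_x'], axis=1, inplace=True)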

df.head()

In [None]:
df.columns

In [None]:
df.to_csv(os.path.join(DATA_PATH, 'ML_data.csv'), index=False)

We save this dataset as `ML_data.csv` in the `data` directory. It is used later to analyze a feasible betting strategy (`Betting_Strategy.ipynb`).

In [None]:
df.tail()

## Machine Learning Model Application

Now we prepare the dataset to correctly apply the machine learning models.

Because the matches are ordered in time, a classic random 70/30 or 80/20 split would let the models train on matches played after the ones they are tested on. Instead, we set a cutoff date to divide the data into training and test sets. We decided to use the following split:

- **Training data**: From the 16-17 season to the 22-23 season (inclusive - approximately 75% of the data).
- **Test data**: The 23-24 season and the 24-25 season (up to the last available match - approximately 25% of the data).

Next, all features are standardized with `StandardScaler`, and the categories of the target variable (FTR) are encoded with `LabelEncoder` as follows:

- **FTR**:  
  - `H`: Home Win == 2  
  - `D`: Draw == 1  
  - `A`: Away Win == 0
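
A minimal sketch of the season-based split and the preprocessing, on made-up toy data with a single feature. It follows the common precaution of fitting the scaler and the encoder on the training seasons only and reusing them on the test seasons, so that no statistics from the test period leak into training. The names `df_toy`, `X_train_toy` and similar are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy data (made-up values); "Season" uses the same integer coding as the notebook.
df_toy = pd.DataFrame({
    "Season": [2223, 2223, 2223, 2324, 2425, 2425],
    "B365H": [1.5, 2.8, 4.0, 2.1, 1.9, 3.3],
    "FTR": ["H", "D", "A", "H", "A", "D"],
})

# Seasons 2324 and 2425 form the test set, everything earlier the training set.
test_mask = df_toy["Season"].isin([2324, 2425])
train_toy, test_toy = df_toy[~test_mask], df_toy[test_mask]

# Fit on the training seasons only, then reuse the fitted objects on the test seasons.
scaler_toy = StandardScaler().fit(train_toy[["B365H"]])
encoder_toy = LabelEncoder().fit(train_toy["FTR"])

X_train_toy = scaler_toy.transform(train_toy[["B365H"]])
X_test_toy = scaler_toy.transform(test_toy[["B365H"]])
y_train_toy = encoder_toy.transform(train_toy["FTR"])
y_test_toy = encoder_toy.transform(test_toy["FTR"])

# LabelEncoder sorts the labels alphabetically: A -> 0, D -> 1, H -> 2.
print(list(encoder_toy.classes_))
print(y_train_toy, y_test_toy)
```

The alphabetical ordering used by `LabelEncoder` is what produces the `A = 0`, `D = 1`, `H = 2` mapping listed above.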

In [None]:
# Create training data: Include all seasons except 2324 and 2425
training_data = df.loc[~(df['Season'] == 2324) & ~(df['Season'] == 2425)].reset_index(drop=True)
# Create testing data: Include only seasons 2324 and 2425
testing_data = df.loc[(df['Season'] == 2324) | (df['Season'] == 2425)]

X = training_data.drop(['gameId', 'Date', 'Season', 'HomeTeam', 'AwayTeam',  'FTR'], axis=1)
Y = training_data['FTR']

X_test = testing_data.drop(['gameId', 'Date', 'Season', 'HomeTeam', 'AwayTeam', 'FTR'], axis=1)
y_test = testing_data['FTR']

print(f"Training features shape: {X.shape}")
print(f"Test features shape: {X_test.shape}")
print(f"Training labels shape: {Y.shape}")
print(f"Test labels shape: {y_test.shape}")

In [None]:
# Standardize the features and encode the target variable as integers
le = LabelEncoder()
scaler = StandardScaler()

Y = le.fit_transform(Y)        # Away Win = 0, Draw = 1, Home Win = 2
y_test = le.transform(y_test)  # reuse the mapping learned on the training labels

X = scaler.fit_transform(X)
X_test = scaler.transform(X_test)  # reuse the training mean and variance to avoid test-set leakage

# MachineLearningModels.py - Automated Pipeline for Model Training

The `MachineLearningModels.py` file implements an automated pipeline for training, validating, and optimizing machine learning models. Its goals are to reduce overfitting, maximize performance, and compare key metrics across models. The pipeline also saves the trained models in a dedicated directory (`ML_models`), so they can be reused later, for example in the next stage, `Betting_Strategy`.

The models considered include:

- **Logistic Regression (Simple)**: A baseline linear classification model.  
- **Elastic Net Logistic Regression**: An advanced variant of logistic regression that combines L1 regularization (sparsity) and L2 regularization (shrinkage), particularly useful when dealing with numerous correlated features, as in the given dataset.  
- **Random Forest**: A tree-based model ideal for exploring complex relationships between variables without requiring significant feature scaling.  
- **XGBoost (Extreme Gradient Boosting Classifier)**: A highly optimized and powerful gradient-boosting algorithm, designed for complex problems.  
- **LightGBM (Light Gradient Boosting Machine)**: A lightweight and faster alternative to XGBoost, optimized for handling large datasets.  

The primary goal is to identify the model with the best **weighted F1-score**, which averages the per-class F1-scores using the number of samples in each class as weights and is therefore suited to datasets with imbalanced classes.
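
To make the metric concrete, here is a small pure-Python sketch of how a weighted F1-score can be computed. It is meant to mirror what `sklearn.metrics.f1_score(..., average="weighted")` returns; the function name `weighted_f1` and the labels are made up for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += f1 * n_true
    return total / len(y_true)


# Made-up labels using the notebook's encoding: 0 = away win, 1 = draw, 2 = home win.
y_true_demo = [2, 2, 2, 2, 1, 1, 0, 0, 0, 0]
y_pred_demo = [2, 2, 2, 0, 2, 1, 0, 0, 2, 0]
print(round(weighted_f1(y_true_demo, y_pred_demo), 3))  # 0.7
```

Because each class contributes in proportion to its support, a poorly predicted minority class such as `draw` lowers the score less than it would under an unweighted (macro) average.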

---

# Process Details

The `MachineLearningModels.py` script functions as follows:

1. **Model Initialization**  
   Each model is initialized with default or pre-defined parameters.

2. **Hyperparameter Optimization**  
   - **RandomizedSearchCV** is used to perform hyperparameter tuning by testing multiple random combinations of parameters.
   - For **Random Forest**, **XGBoost**, and **LightGBM**, this optimization is conducted in two phases:
     - **Broader Search**: An initial exploration of a wide range of parameter values.
     - **Deeper Search**: A more focused search around the best parameters from the broader search.

3. **Cross-Validation**  
   - Stratified 5-Fold Cross-Validation (`StratifiedKFold`) is employed to ensure robust evaluation of the models by maintaining the proportion of classes across folds.

4. **Metric Evaluation**  
   - Each model is evaluated using the **weighted F1-score**. The model with the highest weighted F1-score is selected.

5. **Model Saving**  
   - Trained models are saved as `.pkl` files in the `ML_models` directory to facilitate reuse without retraining.

6. **Comparison of Models**  
   - Models are compared based on their weighted F1-scores and other performance metrics (e.g., accuracy, precision, recall). The best model is selected and highlighted.

This systematic approach ensures the selection of an optimal model, balancing computational efficiency and predictive performance.
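
The tuning steps above can be sketched as follows. This is not the code in `MachineLearningModels.py`; it is a minimal illustration on synthetic data, and the parameter ranges, the helper `random_search`, and the way the second search is narrowed are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic three-class data standing in for the match features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
# Stratified folds keep the class proportions in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


def random_search(param_distributions, n_iter):
    """Run one randomized search scored by weighted F1 (hypothetical helper)."""
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        param_distributions=param_distributions,
        n_iter=n_iter,
        scoring="f1_weighted",
        cv=cv,
        random_state=0,
    )
    return search.fit(X_demo, y_demo)


# Phase 1: broader search over a coarse grid (illustrative ranges).
broad = random_search(
    {"n_estimators": [20, 50, 100], "max_depth": [2, 4, 8, None]}, n_iter=4
)

# Phase 2: deeper search in a narrow range around the phase-1 winner.
best_n = broad.best_params_["n_estimators"]
deep = random_search(
    {
        "n_estimators": [max(10, best_n - 10), best_n, best_n + 10],
        "max_depth": [broad.best_params_["max_depth"]],
    },
    n_iter=3,
)
print(deep.best_params_, round(deep.best_score_, 3))
```

Sampling a few random combinations first and refining afterwards keeps the number of model fits small compared with an exhaustive grid over the full range.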



In [None]:
models = {
    "Logistic Regression": train_simple_logistic(X, Y, X_test, y_test),
    "Elastic Net Logistic": train_elastic_net(X, Y, X_test, y_test),
    "Random Forest": train_random_forest_improved(X, Y, X_test, y_test),
    "XGBoost": train_xgboost_improved(X, Y, X_test, y_test),
    "LightGBM": train_lightgbm_improved(X, Y, X_test, y_test),
}

In [None]:
best_model_name, best_model = evaluate_models(models, X_test, y_test)


# Model Performance Analysis

## Logistic Regression
- **Precision**: High for `away win` (0.62) and `home win` (0.60), but low for `draw` (0.35).
- **Recall**: Very high for `home win` (0.82), but extremely low for `draw` (0.11), indicating the model struggles to correctly identify this outcome.
- **F1-Score**: Balanced for `away win` and `home win`, but very low for `draw` (0.17).
- **Conclusion**: Overall performance is acceptable, but the model struggles with imbalanced outcomes, particularly predicting `draw`.


## Elastic Net Logistic Regression
- Results are similar to the basic Logistic Regression:
  - Slight improvement in **Precision** and **Recall** for `draw` (`precision`: 0.35 → 0.37, `recall`: 0.11 → 0.12).
  - No significant improvement for the other outcomes.
- **Conclusion**: Adding L1 and L2 regularization provides minor gains but does not significantly outperform the base model.


## Random Forest
- **Precision**: Improved compared to Logistic Regression, especially for `draw` (0.42), and decent for the other two outcomes.
- **Recall**: Balanced overall (`away win`: 0.71, `draw`: 0.26, `home win`: 0.68).
- **F1-Score**: Higher across all outcomes compared to Logistic Regression. 
  - Best performance for `home win` (0.65).
- **Conclusion**: Random Forest handles the imbalanced outcomes better, plausibly because it can capture non-linear relationships and interactions between features.


## XGBoost
- **Precision** and **Recall**: Similar to Random Forest, with a slight drop in precision for `away win` and recall for `draw`.
- **F1-Score**: Comparable to Random Forest but slightly lower for `draw` (0.29 vs. 0.32).
- **Conclusion**: Although powerful, XGBoost does not outperform Random Forest on this dataset, likely due to limited hyperparameter tuning.


## LightGBM
- **Precision**: Similar to XGBoost, but slightly lower across all outcomes.
- **Recall**: Good for `home win` (0.72), but low for `draw` (0.22).
- **F1-Score**: Comparable to XGBoost and Random Forest but worse for `draw`.
- **Conclusion**: While LightGBM offers excellent training speed, its overall performance is weaker compared to Random Forest.


## Confusion Matrix Insights
1. **Away Win**: Identified with high precision and recall across all models.
2. **Draw**: All models struggle to identify this outcome correctly, with high false negative rates.
3. **Home Win**: Consistently well-classified, with high recall across all models.
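
As a hedged illustration of how such observations are read off a confusion matrix, the sketch below builds one from made-up labels and derives the false-negative rate for `draw`. The numbers are not the notebook's results, and `confusion_matrix_counts` is a hypothetical helper.

```python
def confusion_matrix_counts(y_true, y_pred, n_classes=3):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Made-up labels: 0 = away win, 1 = draw, 2 = home win.
y_true_cm = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2]
y_pred_cm = [0, 0, 2, 2, 0, 1, 2, 2, 2, 2, 0, 2]

cm = confusion_matrix_counts(y_true_cm, y_pred_cm)
draw_row = cm[1]
# False-negative rate for draws: share of true draws predicted as another outcome.
draw_fnr = (sum(draw_row) - draw_row[1]) / sum(draw_row)
print(cm)        # [[2, 0, 1], [1, 1, 2], [1, 0, 4]]
print(draw_fnr)  # 0.75
```

A high value in the off-diagonal cells of the `draw` row is the pattern described in point 2: true draws are mostly assigned to one of the other two outcomes.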


## General Observations
1. **Weighted F1-Score**: Random Forest (0.544) emerges as the best model in terms of balancing precision and recall, delivering the highest overall performance.
2. **Class Imbalance**: The `draw` outcome poses a significant challenge for all models.
3. **Model Selection**: Random Forest offers the best balance between simplicity and performance. While XGBoost and LightGBM may benefit from more aggressive hyperparameter tuning, it's uncertain whether the additional computational resources and time required would significantly enhance their predictive power.




--> Proceed to the "Betting_Strategy" section.