# Model Training
This notebook covers the model training and prediction stage of the project's data science pipeline. The pipeline takes the training data and fits three types of model to it: Random Forest, Gradient Boosted Tree, and XGBoost. Categorical variables are encoded first, and Recursive Feature Elimination (RFE) is available for feature selection. Hyperparameters are tuned with a genetic algorithm that runs inside a cross-validation routine. The best model is then selected by the highest F1 score, which balances precision and recall, and is used to predict the outcomes of the current week's round of NRL matches. The process is iterative: earlier stages can be revisited depending on the model's performance.

## Set up Environment
This cell sets up the environment for the model training pipeline. It imports `os`, `sys`, and `pathlib`, the standard-library modules used here for working-directory lookups, the module search path, and file paths, respectively.

The code then adds the parent of the current working directory to the system path. This allows the `pipeline` package to be imported, including the custom modules `training_config`, `modelling_functions`, and `model_properties` from `pipeline.common.model_training` (aliased as `tc`, `mf`, and `mp`) and `prediction_functions` (aliased as `pf`). These modules contain the functions and configuration settings used in the later stages of data preprocessing, model training, and prediction.

The `project_root` variable is defined with `pathlib` as the parent of the current working directory, so the notebook is expected to run from a directory one level below the project root.

Later cells build the path to the SQLite database "footy-tipper-db.sqlite", located in the "data" directory under the project root, from `project_root`. This path is used for database connectivity throughout the pipeline.

In [1]:
# import libraries
import os
import sys
import pathlib

cwd = os.getcwd()

# get the parent directory
parent_dir = os.path.dirname(cwd)

# add the parent directory to the system path
sys.path.insert(0, parent_dir)

# Get to the root directory
project_root = pathlib.Path().absolute().parent

# import functions from common like this:
from pipeline.common.model_training import (
    training_config as tc,
    modelling_functions as mf,
    model_properties as mp
)

from pipeline.common.model_prediciton import prediction_functions as pf

## Get data
The data comes from the 'footy-tipper-db.sqlite' database in the 'data' directory under the project root. `mf.get_training_data` connects to this SQLite database and runs the query stored in 'training_data.sql', found in the 'pipeline/common/sql' directory, to extract the required data. The result is loaded into a pandas DataFrame, `training_data`, which is the basis for the subsequent modelling. Once the data has been extracted, the database connection is closed to release the resource.

In [None]:
training_data = mf.get_training_data(
    db_path = project_root / "data" / "footy-tipper-db.sqlite", 
    sql_file = project_root / 'pipeline/common/sql/training_data.sql')

training_data
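
The body of `mf.get_training_data` is not shown in this notebook. As a minimal sketch of the read-query-then-close pattern described above, the steps might look like the following. The helper name `read_sql_file_to_df` is hypothetical and is not part of the project's modules.

```python
import sqlite3
from contextlib import closing
from pathlib import Path

import pandas as pd


def read_sql_file_to_df(db_path, sql_file):
    """Hypothetical helper: run the query stored in sql_file against a SQLite database."""
    query = Path(sql_file).read_text()
    # closing() ensures the connection is closed even if the query raises
    with closing(sqlite3.connect(str(db_path))) as con:
        return pd.read_sql_query(query, con)
```

Keeping the SQL in its own file means the query can be changed without touching the Python code.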

## Modelling
During the modelling phase, the `train_and_select_best_model` function, part of our `modelling_functions` module, is invoked. This function initiates the training of three distinct models: XGBoost, Random Forest, and Gradient Boosting Classifier. It takes as input the footy tipping data, predictor variables, the outcome variable, and several configuration settings like whether to use Recursive Feature Elimination (RFE), the number of cross-validation folds, and the optimization metric, all sourced from the `training_config` module.

The function first identifies categorical columns in the feature set for one-hot encoding, creating dummy variables for categorical features. Depending on the choice of using RFE, a feature elimination step may be included in the pipeline. Each model subsequently undergoes hyperparameter tuning using a genetic algorithm, facilitated by the `GASearchCV` function.

All models are then trained and evaluated through cross-validation. The best model, `footy_tipper`, is the one that scores highest on the chosen optimization metric. The function also returns the `LabelEncoder` (`label_encoder`) used to encode the categorical target variable; it is specific to the best-performing model. The selected model, wrapped in a pipeline with its pre-processing steps and tuned hyperparameters, is then ready for the prediction phase.
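
The internals of `train_and_select_best_model` are not shown here. The sketch below illustrates the general pattern under simplifying assumptions: synthetic data with made-up column names, two scikit-learn candidates instead of three, and plain cross-validation in place of the genetic search. The names `make_pipeline`, `candidates`, and `scores` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Synthetic stand-in for the training data (column names are assumptions)
rng = np.random.default_rng(0)
n = 120
X = pd.DataFrame({
    "venue": rng.choice(["A", "B", "C"], size=n),
    "odds_home": rng.uniform(1.1, 4.0, size=n),
    "position_diff": rng.integers(-15, 16, size=n),
})
y_raw = np.where(X["odds_home"] + rng.normal(0, 0.5, size=n) < 2.0, "Win", "Loss")

# Encode the categorical target; classes are sorted, so "Loss" -> 0 and "Win" -> 1
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y_raw)

cat_cols = ["venue"]


def make_pipeline(estimator, use_rfe):
    """One-hot encode categoricals, optionally apply RFE, then fit the estimator."""
    steps = [("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
        remainder="passthrough",
    ))]
    if use_rfe:
        selector = RFE(RandomForestClassifier(n_estimators=25, random_state=0),
                       n_features_to_select=3)
        steps.append(("rfe", selector))
    steps.append(("model", estimator))
    return Pipeline(steps)


candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Mean cross-validated F1 score per candidate
scores = {
    name: cross_val_score(make_pipeline(est, use_rfe=True), X, y, cv=3, scoring="f1").mean()
    for name, est in candidates.items()
}
best_name = max(scores, key=scores.get)
best_pipeline = make_pipeline(candidates[best_name], use_rfe=True).fit(X, y)
print(scores, "->", best_name)
```

Because the encoder and the RFE step live inside the pipeline, they are refitted on each training fold, which keeps the cross-validated score free of leakage from the held-out fold.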

### Basic Model

In [None]:
# footy_tipper, label_encoder = mf.train_and_select_best_model(
#     training_data, tc.predictors, tc.outcome_var,
#     tc.use_rfe, tc.num_folds, tc.opt_metric
# )
# footy_tipper

### Stacking Model - no pretrained models

In [None]:
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn_genetic.space import Integer, Continuous, Categorical
import xgboost as xgb
from sklearn.utils.multiclass import type_of_target

def setup_base_models(data, predictors, base_outcomes, use_rfe, num_folds):
    base_models = []
    for outcome_var in base_outcomes:
        filtered_data = data[predictors]

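        # type_of_target reports integer-valued targets with more than two distinct
        # values (such as a points margin) as 'multiclass', so 'multiclass' is routed
        # to the regressor together with 'continuous'. A label target with three or
        # more classes would be routed the same way.
        target_type = type_of_target(data[outcome_var])
        is_regression = target_type in ['continuous', 'multiclass']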

        if is_regression:
            estimator = xgb.XGBRegressor()
            param_grid = {
                'n_estimators': Integer(20, 500),
                'learning_rate': Continuous(0.01, 0.9),
                'max_depth': Integer(2, 20),
                'subsample': Continuous(0.1, 1.0),
                'colsample_bytree': Continuous(0.1, 0.99),
                'gamma': Continuous(0, 0.9)
            }
            opt_metric = 'neg_mean_squared_error'
        else:
            estimator = xgb.XGBClassifier()
            param_grid = {
                'n_estimators': Integer(50, 300),
                'learning_rate': Continuous(0.05, 0.95),
                'max_depth': Integer(3, 15),
                'subsample': Continuous(0.3, 1.0),
                'colsample_bytree': Continuous(0.3, 0.95),
                'gamma': Continuous(0.1, 0.5)
            }
            opt_metric = 'accuracy'

        cat_cols = filtered_data.select_dtypes(include=['object']).columns.tolist()
        pipeline = mf.create_pipeline(estimator, param_grid, use_rfe, num_folds, opt_metric, cat_cols)
        base_models.append((outcome_var, pipeline))

    return base_models

# Training data setup
X, y = training_data[tc.predictors], training_data[tc.main_outcome]

# Setup base models for stacking without pre-training them
base_models = setup_base_models(training_data, tc.predictors, tc.base_outcomes, True, 5)

# Setup the Stacking Classifier with RandomForest as the meta-model
meta_model = RandomForestClassifier(n_estimators=250, random_state=69)
stack = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=3)

# Fit the Stacking Classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=69)
stack.fit(X_train, y_train)

# Predict and evaluate
y_pred = stack.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Stacking Classifier: {accuracy}")
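
For reference, here is a minimal, self-contained sketch of how `StackingClassifier` combines its base estimators, using synthetic data and plain scikit-learn models rather than the project's pipelines. One detail worth keeping in mind: `StackingClassifier` fits every base estimator on the same `y` passed to `fit`, regardless of the name given to each estimator.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (stand-in for the match data)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("gb", GradientBoostingClassifier(random_state=0)),
]
meta_model = RandomForestClassifier(n_estimators=50, random_state=0)

# cv=3: the meta-model is trained on 3-fold out-of-fold predictions of the base models
stack = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=3)
stack.fit(X_train, y_train)

# transform() exposes the base-model outputs that the meta-model receives;
# for a binary target each base model contributes one probability column
meta_features = stack.transform(X_test)
print(meta_features.shape)
print(accuracy_score(y_test, stack.predict(X_test)))
```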

### Stacking Model - with pretrained models

In [None]:
import numpy as np
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

# List of outcome variables - this can be dynamically extended
outcome_variables = ["home_team_result", "match_points_difference"]

# Function to train and retrieve models for each outcome variable
def get_trained_models(data, predictors, outcome_variables, use_rfe, num_folds, opt_metric):
    model_pipelines = {}
    for outcome_var in outcome_variables:
        print(f"Training model for {outcome_var}")
        best_pipeline, _ = mf.train_and_select_best_model(
            data, predictors, outcome_var, use_rfe, num_folds, opt_metric
        )
        model_pipelines[outcome_var] = best_pipeline
    return model_pipelines

# Training the base models
X, y = training_data[tc.predictors], training_data['home_team_result']  # Ensure this is set correctly
model_pipelines = get_trained_models(training_data, tc.predictors, outcome_variables, True, 5, 'accuracy')  # Assuming use_rfe, num_folds, opt_metric are set as such

# Preparing the base models list for StackingClassifier
base_models = [(outcome_var, model_pipelines[outcome_var]) for outcome_var in outcome_variables]

# Setting up the Stacking Classifier with RandomForest as the meta-model
print("Setting up the Stacking Classifier with RandomForest as the meta-model")
meta_model = RandomForestClassifier(n_estimators=250, random_state=42)
stack = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=3)

# Fit the Stacking Classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
stack.fit(X_train, y_train)

# Evaluate the model
y_pred = stack.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Stacking Classifier: {accuracy}")


### Custom Stacking Classifier - OOF Prediction Analysis

In [None]:
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn_genetic.space import Integer, Continuous, Categorical
import xgboost as xgb
from sklearn.utils.multiclass import type_of_target
from sklearn.preprocessing import LabelEncoder

def setup_base_models(data, predictors, base_outcomes, use_rfe, num_folds):

    filtered_data = data[predictors]

    base_models = []
    for outcome_var in base_outcomes:
        
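        # type_of_target reports integer-valued targets with more than two distinct
        # values (such as a points margin) as 'multiclass', so 'multiclass' is routed
        # to the regressor together with 'continuous'. A label target with three or
        # more classes would be routed the same way.
        target_type = type_of_target(data[outcome_var])
        is_regression = target_type in ['continuous', 'multiclass']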

        if is_regression:
            estimator = xgb.XGBRegressor()
            param_grid = {
                'n_estimators': Integer(20, 500),
                'learning_rate': Continuous(0.01, 0.9),
                'max_depth': Integer(2, 20),
                'subsample': Continuous(0.1, 1.0),
                'colsample_bytree': Continuous(0.1, 0.99),
                'gamma': Continuous(0, 0.9)
            }
            opt_metric = 'neg_mean_squared_error'
        else:
            estimator = xgb.XGBClassifier()
            param_grid = {
                'n_estimators': Integer(50, 300),
                'learning_rate': Continuous(0.05, 0.95),
                'max_depth': Integer(3, 15),
                'subsample': Continuous(0.3, 1.0),
                'colsample_bytree': Continuous(0.3, 0.95),
                'gamma': Continuous(0.1, 0.5)
            }
            opt_metric = 'accuracy'

        cat_cols = filtered_data.select_dtypes(include=['object']).columns.tolist()
        pipeline = mf.create_pipeline(estimator, param_grid, use_rfe, num_folds, opt_metric, cat_cols)
        base_models.append((outcome_var, pipeline))

    return base_models

import numpy as np
from sklearn.model_selection import KFold
from sklearn.base import clone

# Assuming setup_base_models returns a list of (name, estimator) tuples
# Initialize the data
X, y = training_data[tc.predictors], training_data[tc.main_outcome]
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)  # Encode the labels

kf = KFold(n_splits=3, shuffle=True, random_state=69)

oof_predictions = np.zeros((X.shape[0], len(tc.base_outcomes)))  # For storing OOF predictions from each base model

# Base models setup
base_models = setup_base_models(training_data, tc.predictors, tc.base_outcomes, True, 3)
meta_model = RandomForestClassifier(n_estimators=250, random_state=69)

# Iterate over each fold
for train_index, test_index in kf.split(X):
    X_train, X_val = X.iloc[train_index], X.iloc[test_index]
    y_train, y_val = y_encoded[train_index], y_encoded[test_index]  # Use encoded labels
    
    # Train each base model and generate OOF predictions
    for i, (name, base_model) in enumerate(base_models):
        print(f"Training base model {name} iteration {i+1}")
        cloned_model = clone(base_model)
        cloned_model.fit(X_train, y_train)
        oof_pred = cloned_model.predict_proba(X_val)[:, 1] if hasattr(cloned_model, 'predict_proba') else cloned_model.predict(X_val)
        oof_predictions[test_index, i] = oof_pred

# oof_predictions now holds one column of OOF predictions per base model
# Train the meta-model on the OOF predictions
print("Training the meta-model on OOF predictions")
meta_model.fit(oof_predictions, y_encoded)

# Scoring the fitted meta-model on the rows it was trained on would give an
# in-sample figure, so cross-validate the meta-model over the OOF features
# to obtain an out-of-fold accuracy estimate
from sklearn.model_selection import cross_val_predict
oof_pred_final = cross_val_predict(clone(meta_model), oof_predictions, y_encoded, cv=kf)
accuracy = accuracy_score(y_encoded, oof_pred_final)
print(f"OOF Accuracy of the Stacking Classifier: {accuracy}")
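
The manual fold loop can also be expressed with `cross_val_predict`, which returns, for every row, a prediction from a model that did not see that row during fitting. The sketch below uses synthetic data and plain scikit-learn estimators as stand-ins for the project's pipelines; it assumes a binary target, so column 1 of `predict_proba` is the positive-class probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
kf = KFold(n_splits=3, shuffle=True, random_state=0)

base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("gb", GradientBoostingClassifier(random_state=0)),
]

# One out-of-fold probability column per base model
oof = np.column_stack([
    cross_val_predict(model, X, y, cv=kf, method="predict_proba")[:, 1]
    for _, model in base_models
])

# Evaluate the meta-model out of fold as well, so the score is not in-sample
meta_pred = cross_val_predict(LogisticRegression(), oof, y, cv=kf)
oof_accuracy = float((meta_pred == y).mean())
print(f"OOF accuracy: {oof_accuracy:.3f}")
```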

In [None]:
inference_data = pf.get_inference_data(
    db_path = project_root / "data" / "footy-tipper-db.sqlite", 
    sql_file = project_root / 'pipeline/common/sql/inference_data.sql')
inference_data

# base_models is a list of (name, model) tuples
def get_model_for_outcome(base_models, outcome):
    """Return the model registered under `outcome`, or None if there is none."""
    for name, model in base_models:
        if name == outcome:
            return model
    return None

# Extract the model registered for "match_points_difference"
model_for_difference = get_model_for_outcome(base_models, "match_points_difference")

# inference_data must be prepared in the same way as training_data
X_new = inference_data[tc.predictors]

if model_for_difference is not None:
    # The entries in base_models are unfitted templates (only their clones were
    # fitted in the OOF loop), so fit on the full training set before predicting
    model_for_difference.fit(X, training_data["match_points_difference"])
    # The predictions are numeric margins and are printed as-is; `encoder` was
    # fitted on the tc.main_outcome labels and does not apply to them
    predictions = model_for_difference.predict(X_new)
    print(predictions)
else:
    print("No model found for this outcome.")

### Display feature importance
The `get_feature_importances_from_pipeline` function, from the `model_properties` module, retrieves feature importances from a trained scikit-learn pipeline. It accounts for transformations such as one-hot encoding and recursive feature elimination. The function returns a sorted DataFrame listing each feature alongside its importance, which helps in understanding the model's decision-making process.

In [None]:
# feature_importance_df = mp.get_feature_importances_from_pipeline(footy_tipper, tc.predictors)
# feature_importance_df

## Save Model
The `save_models` function stores the trained LabelEncoder and Pipeline objects to the disk. This allows for easy retrieval and reuse in future model prediction tasks, without the need to retrain these components. The objects are stored in a designated 'models' directory under the project root path, ensuring organized and consistent storage.

In [None]:
mf.save_models(label_encoder, footy_tipper, project_root)
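
How `mf.save_models` serializes the objects is not shown in this notebook. A minimal sketch of the save-and-reload round trip, assuming `joblib` and hypothetical file names (`label_encoder.pkl`, `footy_tipper.pkl`), could look like this:

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder


def save_artifacts(label_encoder, model, root):
    """Hypothetical helper: persist both objects under <root>/models."""
    model_dir = Path(root) / "models"
    model_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(label_encoder, model_dir / "label_encoder.pkl")
    joblib.dump(model, model_dir / "footy_tipper.pkl")


def load_artifacts(root):
    """Hypothetical helper: reload the objects written by save_artifacts."""
    model_dir = Path(root) / "models"
    return (
        joblib.load(model_dir / "label_encoder.pkl"),
        joblib.load(model_dir / "footy_tipper.pkl"),
    )


# Small demonstration in a temporary directory
encoder = LabelEncoder().fit(["Loss", "Win"])
model = RandomForestClassifier(n_estimators=10, random_state=0).fit([[0.0], [1.0]], [0, 1])
root = tempfile.mkdtemp()
save_artifacts(encoder, model, root)
loaded_encoder, loaded_model = load_artifacts(root)
print(loaded_encoder.inverse_transform(loaded_model.predict([[1.0]])))
```

Saving the encoder together with the model matters because the model only emits integer class codes; the encoder is what maps them back to the original labels.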

## Predict
The final stage of the pipeline involves predicting the outcomes of the current week's NRL matches. This is achieved by connecting to the SQLite database and extracting the required data. The trained model and LabelEncoder are then loaded from the disk, and the prediction is performed using the `model_predictions` function. The predictions are stored in the 'predictions' table of the database, allowing for easy retrieval and analysis.

In [None]:
label_encoder, footy_tipper = pf.load_models(project_root)

In [None]:
inference_data = pf.get_inference_data(
    db_path = project_root / "data" / "footy-tipper-db.sqlite", 
    sql_file = project_root / 'pipeline/common/sql/inference_data.sql')
inference_data

In [None]:
predictions_df = pf.model_predictions(footy_tipper, inference_data, label_encoder)
predictions_df

In [None]:
pf.save_predictions_to_db(
    predictions_df, 
    project_root / "data" / "footy-tipper-db.sqlite", 
    project_root / 'pipeline/common/sql/create_table.sql', 
    project_root / 'pipeline/common/sql/insert_into_table.sql'
)
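
The contents of `create_table.sql` and `insert_into_table.sql` are not shown, so the sketch below only illustrates the create-then-insert pattern with made-up SQL and a reduced set of columns. The helper names `save_rows` and `count_rows` are hypothetical, and `INSERT OR REPLACE` is an assumption chosen so that re-running the cell does not duplicate rows.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path

# Illustrative statements only; the project keeps its real SQL in separate files
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS predictions (
    game_id INTEGER PRIMARY KEY,
    home_team_result TEXT,
    home_team_win_prob REAL
)
"""
INSERT_SQL = """
INSERT OR REPLACE INTO predictions (game_id, home_team_result, home_team_win_prob)
VALUES (?, ?, ?)
"""


def save_rows(db_path, rows):
    """Create the table if needed, then write all rows in one transaction."""
    with closing(sqlite3.connect(str(db_path))) as con:
        with con:  # commits on success, rolls back on error
            con.execute(CREATE_SQL)
            con.executemany(INSERT_SQL, rows)


def count_rows(db_path):
    """Return the number of rows currently in the predictions table."""
    with closing(sqlite3.connect(str(db_path))) as con:
        return con.execute("SELECT COUNT(*) FROM predictions").fetchone()[0]


rows = [(20241111410, "Win", 0.708454), (20241111420, "Win", 0.656749)]
demo_db = Path(tempfile.mkdtemp()) / "demo.sqlite"
save_rows(demo_db, rows)
save_rows(demo_db, rows)  # a second run replaces the rows instead of duplicating them
print(count_rows(demo_db))
```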

# Send Predictions

In [2]:
from dotenv import load_dotenv
from pipeline.common.model_prediciton import prediction_functions as pf
from pipeline.common.use_predictions import sending_functions as sf

# Build the paths to the SQLite database, the secrets file, and the service account token
db_path = project_root / "data" / "footy-tipper-db.sqlite"
secrets_path = project_root / "secrets.env"
json_path = project_root / "service-account-token.json"

load_dotenv(dotenv_path=secrets_path)

True

In [3]:
import sqlite3
import pandas as pd
# Connect to the SQLite database
con = sqlite3.connect(str(db_path))

# Read SQL query from external SQL file
with open(project_root / 'pipeline/common' / 'sql/prediction_table.sql', 'r') as file:
    query = file.read()

# Execute the query and fetch the results into a data frame
predictions = pd.read_sql_query(query, con)

# Disconnect from the SQLite database
con.close()

predictions

Unnamed: 0,game_id,home_team_result,team_home,position_home,team_head_to_head_odds_home,team_away,position_away,team_head_to_head_odds_away,home_team_win_prob,home_team_lose_prob,round_id,competition_year,round_name
0,20241111410,Win,St. George Illawarra Dragons,12,1.39,Wests Tigers,16,3.04,0.708454,0.291546,14,2024,Round 14
1,20241111420,Win,Gold Coast Titans,14,1.8,South Sydney Rabbitohs,17,2.05,0.656749,0.343251,14,2024,Round 14
2,20241111430,Win,North Queensland Cowboys,10,1.66,New Zealand Warriors,13,2.23,0.602685,0.397315,14,2024,Round 14
3,20241111440,Win,Brisbane Broncos,5,1.62,Cronulla-Sutherland Sharks,2,2.31,0.612076,0.387924,14,2024,Round 14
4,20241111450,Win,Melbourne Storm,1,1.21,Newcastle Knights,11,4.45,0.67376,0.32624,14,2024,Round 14
5,20241111460,Win,Penrith Panthers,3,1.4,Manly-Warringah Sea Eagles,7,2.96,0.662642,0.337358,14,2024,Round 14
6,20241111470,Win,Canterbury-Bankstown Bulldogs,9,2.12,Parramatta Eels,15,1.73,0.614463,0.385537,14,2024,Round 14


In [4]:
tipper_picks = sf.get_tipper_picks(predictions)
tipper_picks

Unnamed: 0,team,price,price_min
1,Gold Coast Titans,1.8,1.522651
6,Canterbury-Bankstown Bulldogs,2.12,1.627438


In [None]:
sf.upload_df_to_drive(
    predictions, 
    json_path, 
    os.getenv('FOLDER_ID'), 
    "predictions.csv"
)

In [5]:
reg_reagan = sf.generate_reg_regan_email(
    predictions, 
    tipper_picks, 
    os.getenv('OPENAI_KEY'), 
    os.getenv('FOLDER_URL'),
    1
)

print(reg_reagan)

InvalidRequestError: max_tokens is too large: 100000. This model supports at most 4096 completion tokens, whereas you provided 100000.

In [None]:
sf.send_emails(
    "footy-tipper-email-list", 
    f"Footy Tipper Predictions for {predictions['round_name'].unique()[0]}", 
    reg_reagan, 
    os.getenv('MY_EMAIL'), 
    os.getenv('EMAIL_PASSWORD'), 
    json_path
)