# Model Training
This notebook covers the model training and prediction stage of our project's data science pipeline. The pipeline takes the training data and fits three types of model to it: Random Forest, Gradient Boosted Tree, and XGBoost. It first encodes categorical variables and, when enabled in the configuration, runs Recursive Feature Elimination (RFE) for feature selection. Hyperparameters are then tuned with genetic algorithms inside a cross-validation routine. The model with the highest F1 score, the harmonic mean of precision and recall, is selected and used to predict the outcomes of the current week's round of NRL matches. The process is iterative: earlier stages can be revisited depending on the model's performance.
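
For reference, the F1 score used for model selection can be computed directly from confusion-matrix counts. The block below is a minimal sketch, not the pipeline's own code; the function name `f1_from_counts` and the example counts are illustrative assumptions.

```python
def f1_from_counts(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Return the F1 score, the harmonic mean of precision and recall."""
    if true_pos == 0:
        # No correct positive predictions: define F1 as 0 to avoid dividing by zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct positive calls, 2 false alarms, 4 misses.
example_f1 = f1_from_counts(true_pos=8, false_pos=2, false_neg=4)
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a model cannot reach a high F1 by maximizing precision alone or recall alone.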

## Set up Environment
The first cell prepares the environment for our model training pipeline. It imports sys, pandas, sqlite3, pathlib, and numpy, which handle system parameters, data operations, database connectivity, file paths, and numerical operations respectively. It then appends the 'model-training/functions' directory, which holds our custom modules, to the system path so that modelling_functions and training_config can be imported. These modules contain the custom functions and configuration settings used in the data preprocessing, model training, and prediction steps that follow.

In [1]:
import sys
import pandas as pd
import sqlite3
import pathlib
import numpy as np

sys.path.append("model-training/functions")
import modelling_functions as mf
import training_config as tc
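
The relative path passed to sys.path.append above is resolved against the current working directory, so the imports only succeed when the notebook is started from the expected location. A possible way to make this explicit is sketched below; the helper name `add_to_sys_path` is hypothetical and not part of the project.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory: str) -> str:
    """Append the absolute form of `directory` to sys.path once and return it."""
    # resolve() turns the relative path into an absolute one based on the
    # current working directory; the directory does not need to exist yet.
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        sys.path.append(resolved)
    return resolved


functions_dir = add_to_sys_path("model-training/functions")
```

Printing `functions_dir` after the call shows exactly where Python will look for modelling_functions and training_config, which helps when an import fails.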

## Get data
We first establish the project's root directory and build the path to the 'footy-tipper-db.sqlite' database in its 'data' directory. We then connect to this SQLite database and run the SQL query stored in 'footy_tipping_data.sql', in the 'sql' directory, to extract the required data. The result is loaded into a pandas DataFrame, footy_tipping_data, which is the basis for the modeling that follows. Once the data has been read, the database connection is closed to release the resource.

In [2]:
# Get to the root directory
project_root = pathlib.Path().absolute().parent.parent

# Now construct the relative path to your SQLite database
db_path = project_root / "data" / "footy-tipper-db.sqlite"

# Connect to the SQLite database
con = sqlite3.connect(str(db_path))

# Read SQL query from external SQL file
with open('sql/footy_tipping_data.sql', 'r') as file:
    query = file.read()

footy_tipping_data = pd.read_sql_query(query, con)

# Don't forget to close the connection
con.close()

footy_tipping_data

Unnamed: 0,game_id,round_id,round_name,game_number,game_state_name,start_time,start_time_utc,venue_name,city,crowd,...,turn_around_away,turn_around_diff,matchup_form,state_of_origin,home_elo,away_elo,home_elo_prob,away_elo_prob,draw_prob,home_ground_advantage
0,2.012111e+10,1.0,Round 1,1.0,Final,1.330600e+09,1.330560e+09,McDonald Jones Stadium,Newcastle,29189.0,...,14.186239,0.000000,0.0,0.0,1500.000000,1500.000000,0.488541,0.481319,0.030140,
1,2.012111e+10,1.0,Round 1,2.0,Final,1.330686e+09,1.330646e+09,Bankwest Stadium,Sydney,11399.0,...,14.186239,0.000000,0.0,0.0,1500.000000,1500.000000,0.488541,0.481319,0.030140,
2,2.012111e+10,1.0,Round 1,3.0,Final,1.330772e+09,1.330733e+09,Canberra Stadium,Canberra,7862.0,...,14.186239,0.000000,0.0,0.0,1500.000000,1500.000000,0.488541,0.481319,0.030140,
3,2.012111e+10,1.0,Round 1,4.0,Final,1.330772e+09,1.330733e+09,Panthers Stadium,Penrith,9585.0,...,14.186239,0.000000,0.0,0.0,1500.000000,1500.000000,0.488541,0.481319,0.030140,
4,2.012111e+10,1.0,Round 1,5.0,Final,1.330769e+09,1.330733e+09,1300SMILES Stadium,Townsville,16311.0,...,14.186239,0.000000,0.0,0.0,1500.000000,1500.000000,0.488541,0.481319,0.030140,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2378,2.023111e+10,27.0,Round 27,4.0,Pre Game,1.693667e+09,1.693631e+09,Suncorp Stadium,Brisbane,,...,7.791667,0.000000,0.0,0.0,1485.749403,1501.872182,0.466047,0.503813,0.030140,
2379,2.023111e+10,27.0,Round 27,5.0,Pre Game,1.693676e+09,1.693640e+09,BlueBet Stadium,Penrith,,...,7.895833,1.006944,3.0,0.0,1522.491971,1522.504724,0.488523,0.481337,0.030140,
2380,2.023111e+10,27.0,Round 27,6.0,Pre Game,1.693683e+09,1.693647e+09,Netstrata Jubilee Stadium,Sydney,,...,6.145833,1.836806,3.0,0.0,1484.806612,1501.488641,0.465268,0.504592,0.030140,
2381,2.023111e+10,27.0,Round 27,7.0,Pre Game,1.693750e+09,1.693714e+09,Cbus Super Stadium,Gold Coast,,...,7.000000,0.958333,1.0,0.0,1503.265483,1467.119452,0.536004,0.428879,0.035117,
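
In the cell above, the connection is closed only if every preceding statement succeeds. A minimal sketch of a pattern that closes it even when the query raises is shown here, using contextlib.closing from the standard library; the helper name `read_rows` and the in-memory database are assumptions made for illustration.

```python
import sqlite3
from contextlib import closing


def read_rows(db_path: str, query: str, params: tuple = ()) -> list:
    """Run `query` and return all rows, always closing the connection."""
    # closing() calls con.close() on exit from the block, even on error.
    with closing(sqlite3.connect(db_path)) as con:
        return con.execute(query, params).fetchall()


# Illustrative query against a throwaway in-memory database.
rows = read_rows(":memory:", "SELECT ? AS round_id, ? AS round_name", (27, "Round 27"))
```

In the notebook, the same `with closing(...)` block could wrap the existing `pd.read_sql_query(query, con)` call instead of `con.execute`.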


## Modelling
In the modeling phase, we call the train_and_select_best_model function from our modelling_functions module. It takes the footy tipping data, the predictor variables, the outcome variable, and three settings from the training_config module: whether to use Recursive Feature Elimination (RFE), the number of cross-validation folds, and the optimization metric. The function trains three models: Random Forest, Gradient Boosted Tree, and XGBoost. Each model's hyperparameters are tuned via genetic algorithms and evaluated through cross-validation. The model that performs best on the chosen optimization metric is returned as best_model, together with X_inference (the transformed feature matrix), label_encoder (for encoding categorical variables), and game_id_inference (for associating predictions with specific games), ready for the prediction phase.

In [3]:
best_model, X_inference, label_encoder, game_id_inference = mf.train_and_select_best_model(
    footy_tipping_data, tc.predictors, tc.outcome_var,
    tc.use_rfe, tc.num_folds, tc.opt_metric
)

best_model

Fitting 5 folds for each of 729 candidates, totalling 3645 fits
{'colsample_bytree': 0.7, 'gamma': 0.2, 'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.9}
0.7788349142053186
Fitting 5 folds for each of 432 candidates, totalling 2160 fits
{'bootstrap': True, 'max_depth': 20, 'max_features': 'log2', 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 100}
0.7693469307766454
Fitting 5 folds for each of 1458 candidates, totalling 7290 fits
{'learning_rate': 0.01, 'max_depth': 3, 'max_features': 'log2', 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 200, 'subsample': 0.8}
0.7863682702327219
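
The log above can be read as one entry per model: a fit count, the best hyperparameters, and a score. The sketch below checks the fit counts and picks the highest score. It assumes that each printed number is that model's best cross-validated score on the optimization metric, and the model labels are inferred from the parameter names in each dictionary rather than stated in the log.

```python
# Values copied from the training log; the model labels are an inference
# from the hyperparameter names (e.g. 'colsample_bytree', 'bootstrap').
log_entries = [
    {"model": "XGBoost", "candidates": 729, "folds": 5,
     "fits": 3645, "score": 0.7788349142053186},
    {"model": "Random Forest", "candidates": 432, "folds": 5,
     "fits": 2160, "score": 0.7693469307766454},
    {"model": "Gradient Boosted Tree", "candidates": 1458, "folds": 5,
     "fits": 7290, "score": 0.7863682702327219},
]

# Each candidate is fitted once per fold, so fits = candidates * folds.
fit_counts_match = all(
    entry["candidates"] * entry["folds"] == entry["fits"] for entry in log_entries
)

# Selecting the best model amounts to taking the entry with the highest score.
best_entry = max(log_entries, key=lambda entry: entry["score"])
```

Under these assumptions, the Gradient Boosted Tree entry has the highest score of the three.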


## Make Predictions
In the prediction phase, we call the model_predictions function from the modelling_functions module with the best_model selected above. The model generates predictions from X_inference, the transformed feature matrix. label_encoder decodes the predicted labels back to their original form, and game_id_inference links each prediction to its game. The results are collected in a DataFrame, predictions_df, which holds the predicted outcome and the win and loss probabilities for each game in the week's round of NRL matches.

In [4]:
predictions_df = mf.model_predictions(best_model, X_inference, label_encoder, game_id_inference)
predictions_df

Unnamed: 0,game_id,home_team_result,home_team_win_prob,home_team_lose_prob
0,20231110000.0,Loss,0.264914,0.735086
1,20231110000.0,Loss,0.335308,0.664692
2,20231110000.0,Win,0.526304,0.473696
3,20231110000.0,Win,0.663941,0.336059
4,20231110000.0,Win,0.65517,0.34483
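
To illustrate how encoded predictions map back to labels such as 'Win' and 'Loss', the sketch below uses a small pure-Python stand-in for a label encoder. It is not the project's label_encoder or the model_predictions function; the class `TinyLabelEncoder` is hypothetical, only the two labels seen in the output are assumed, and the probabilities are taken from the first and third rows of the table.

```python
class TinyLabelEncoder:
    """Minimal stand-in: maps sorted unique labels to integer codes and back."""

    def fit(self, labels):
        self.classes_ = sorted(set(labels))
        self._index = {label: code for code, label in enumerate(self.classes_)}
        return self

    def transform(self, labels):
        return [self._index[label] for label in labels]

    def inverse_transform(self, codes):
        return [self.classes_[code] for code in codes]


encoder = TinyLabelEncoder().fit(["Win", "Loss", "Win"])

# Class probabilities ordered like encoder.classes_, i.e. [P(Loss), P(Win)].
probabilities = [[0.735086, 0.264914], [0.473696, 0.526304]]

# Take the most probable class code per game, then decode it to a label.
codes = [max(range(len(row)), key=row.__getitem__) for row in probabilities]
decoded = encoder.inverse_transform(codes)
```

The decoded labels agree with the home_team_result values shown for those two rows.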


## Write predictions to the database
The final step stores the predictions in the database. We reconnect to the SQLite database, create the target table using the statement in 'sql/create_table.sql', and insert each row of predictions_df with the parameterized statement in 'sql/insert_into_table.sql'. The transaction is then committed and the connection closed, leaving the predictions for the current week's NRL matches available for later analysis and use.

In [5]:
# Connect to the SQLite database
con = sqlite3.connect(str(db_path))

# Read SQL query from external SQL file and create table
with open('sql/create_table.sql', 'r') as file:
    create_table_query = file.read()
con.execute(create_table_query)

# Read SQL query from external SQL file for insertion
with open('sql/insert_into_table.sql', 'r') as file:
    insert_into_table_query = file.read()

# Write each row in the DataFrame to the database
for index, row in predictions_df.iterrows():
    con.execute(insert_into_table_query, (
        row['game_id'], 
        row['home_team_result'],
        row['home_team_win_prob'],
        row['home_team_lose_prob']
    ))

# Commit the transaction
con.commit()

# Close the connection
con.close()
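
The row-by-row loop above can also be written as a single batched insert. The following is a minimal sketch under stated assumptions: the table name `predictions` and both SQL statements are hypothetical stand-ins for the contents of 'sql/create_table.sql' and 'sql/insert_into_table.sql', which are not shown in this notebook, and an in-memory database is used so the example has no side effects.

```python
import sqlite3
from contextlib import closing

# Two rows copied from the predictions table above.
predictions = [
    (20231110000.0, "Loss", 0.264914, 0.735086),
    (20231110000.0, "Win", 0.526304, 0.473696),
]

# Hypothetical statements standing in for the external SQL files.
create_sql = """
CREATE TABLE IF NOT EXISTS predictions (
    game_id REAL,
    home_team_result TEXT,
    home_team_win_prob REAL,
    home_team_lose_prob REAL
)
"""
insert_sql = "INSERT INTO predictions VALUES (?, ?, ?, ?)"

with closing(sqlite3.connect(":memory:")) as con:
    # 'with con' commits on success and rolls back if an exception is raised.
    with con:
        con.execute(create_sql)
        con.executemany(insert_sql, predictions)
    stored = con.execute("SELECT COUNT(*) FROM predictions").fetchone()[0]
```

With a DataFrame, the rows could be supplied as `predictions_df.itertuples(index=False, name=None)`, provided the column order matches the placeholders in the insert statement.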