# Model Training
This notebook covers the model training and prediction stage of our project's data science pipeline. The pipeline takes the training data, one-hot encodes categorical variables, optionally applies Recursive Feature Elimination (RFE) for feature selection, and fits candidate models to predict the home and away team scores. Hyperparameters are tuned with a genetic algorithm running inside a cross-validation routine, and the best model for each target is selected on its cross-validated score for the configured optimization metric. The selected models are then used to predict the outcomes for the current week's round of NRL matches. The process is iterative: earlier stages can be revisited depending on the model's performance.

## Set up Environment
This cell sets up the environment for the model training pipeline. It imports `os`, `sys`, and `pathlib`, which are used to work with the current working directory, the module search path, and file paths.

The parent of the current working directory is then inserted at the front of `sys.path`. This makes the `pipeline` package importable, so that the custom modules `training_config`, `modelling_functions`, and `model_properties` can be imported from `pipeline.common.model_training`, along with `prediction_functions` from `pipeline.common.model_prediciton`. These modules contain the functions and configuration settings used in the later preprocessing, training, and prediction stages.

The `project_root` variable is defined with `pathlib` as the parent of the current directory. Paths such as the SQLite database "footy-tipper-db.sqlite" in the "data" directory are built from `project_root` wherever they are needed.

In [1]:
# import libraries
import os
import sys
import pathlib

cwd = os.getcwd()

# get the parent directory
parent_dir = os.path.dirname(cwd)

# add the parent directory to the system path
sys.path.insert(0, parent_dir)

# Get to the root directory
project_root = pathlib.Path().absolute().parent

# import functions from common like this:
from pipeline.common.model_training import (
    training_config as tc,
    modelling_functions as mf,
    model_properties as mp
)

from pipeline.common.model_prediciton import prediction_functions as pf

## Get data
We start by connecting to the SQLite database 'footy-tipper-db.sqlite' in the 'data' directory under the project root. The `get_training_data` function runs the SQL query stored in 'pipeline/common/sql/training_data.sql' and loads the result into a pandas DataFrame, `data`, which is the basis for the modeling steps that follow. Once the data has been extracted, the database connection is closed to release the resource.

In [2]:
data = mf.get_training_data(
    db_path = project_root / "data" / "footy-tipper-db.sqlite", 
    sql_file = project_root / 'pipeline/common/sql/training_data.sql')

data

Unnamed: 0,game_id,round_id,round_name,game_number,game_state_name,start_time,start_time_utc,venue_name,city,crowd,...,home_prev_result_diff,away_prev_result_diff,prev_result_diff,home_elo,away_elo,elo_diff,home_elo_prob,away_elo_prob,elo_draw_prob,elo_prob_diff
0,2.019111e+10,1.0,Round 1,1.0,Final,1.552594e+09,1.552554e+09,AAMI Park,Melbourne,16239.0,...,0.0,0.0,0.0,1512.815547,1506.871980,5.943567,0.491704,0.468054,0.040241,0.023650
1,2.019111e+10,1.0,Round 1,2.0,Final,1.552673e+09,1.552633e+09,McDonald Jones Stadium,Newcastle,21813.0,...,0.0,0.0,0.0,1480.476662,1508.614594,-28.137933,0.443330,0.513480,0.043189,-0.070150
2,2.019111e+10,1.0,Round 1,3.0,Final,1.552680e+09,1.552641e+09,Sydney Cricket Ground,Sydney,24527.0,...,0.0,0.0,0.0,1522.482547,1509.260995,13.221552,0.501744,0.458015,0.040241,0.043729
3,2.019111e+10,1.0,Round 1,4.0,Final,1.552756e+09,1.552709e+09,Go Media Stadium,Auckland,18795.0,...,0.0,0.0,0.0,1499.459581,1501.711892,-2.252310,0.480386,0.479372,0.040241,0.001014
4,2.019111e+10,1.0,Round 1,5.0,Final,1.552757e+09,1.552718e+09,Leichhardt Oval,Sydney,13159.0,...,0.0,0.0,0.0,1484.877388,1486.813420,-1.936032,0.480823,0.478935,0.040241,0.001888
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1136,2.024111e+10,21.0,Round 21,4.0,Final,1.722101e+09,1.722065e+09,Queensland Country Bank Stadium,Townsville,17907.0,...,2.0,1.0,1.0,1503.189788,1526.466464,-23.276675,0.449995,0.506816,0.043189,-0.056822
1137,2.024111e+10,21.0,Round 21,5.0,Final,1.722109e+09,1.722073e+09,Allianz Stadium,Sydney,25155.0,...,30.0,-1.0,31.0,1549.661396,1532.057945,17.603451,0.509804,0.453781,0.036415,0.056023
1138,2.024111e+10,21.0,Round 21,6.0,Final,1.722175e+09,1.722139e+09,WIN Stadium,Wollongong,18988.0,...,20.0,-8.0,28.0,1496.264541,1526.843739,-30.579198,0.439989,0.516822,0.043189,-0.076833
1139,2.024111e+10,21.0,Round 21,7.0,Final,1.722183e+09,1.722147e+09,Suncorp Stadium,Brisbane,19533.0,...,8.0,30.0,-22.0,1502.252019,1485.780092,16.471927,0.508240,0.455345,0.036415,0.052895
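
The data-loading step described in this section can be sketched as follows. This is a minimal illustration, not the project's `get_training_data` implementation: the helper name `run_sql_file`, the `games` table, and its columns are hypothetical, and rows are returned as plain dictionaries rather than a DataFrame so that the example needs only the standard library.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path


def run_sql_file(db_path, sql_file):
    """Run the query stored in sql_file against the SQLite database at db_path."""
    query = Path(sql_file).read_text()
    # closing() releases the connection even if the query raises
    with closing(sqlite3.connect(str(db_path))) as con:
        con.row_factory = sqlite3.Row
        return [dict(row) for row in con.execute(query)]


# Build a throwaway database and SQL file (hypothetical schema) to exercise the helper
with tempfile.TemporaryDirectory() as tmp:
    demo_db = Path(tmp) / "demo.sqlite"
    demo_sql = Path(tmp) / "training_data.sql"
    with closing(sqlite3.connect(str(demo_db))) as con:
        con.execute("CREATE TABLE games (game_id INTEGER, venue_name TEXT)")
        con.execute("INSERT INTO games VALUES (1, 'AAMI Park')")
        con.commit()
    demo_sql.write_text("SELECT game_id, venue_name FROM games")
    rows = run_sql_file(demo_db, demo_sql)
```

Passing an open connection and the query text to `pd.read_sql_query`, as a later cell in this notebook does, would return a DataFrame instead of a list of dictionaries.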


## Modelling

During the modeling phase, the `train_and_select_best_model` function, part of our `modelling_functions` module, is invoked. This function initiates the training of Poisson models specifically designed for predicting the scores of the home and away teams. It takes as input the footy tipping data, predictor variables, the outcome variable, and several configuration settings like whether to use Recursive Feature Elimination (RFE), the number of cross-validation folds, and the optimization metric, all sourced from the `training_config` module.

The function first identifies the categorical columns in the feature set and one-hot encodes them, creating dummy variables for each category. If RFE is enabled, a feature elimination step is added to the pipeline. Each model then undergoes hyperparameter tuning with a genetic algorithm, facilitated by `GASearchCV`.

All the models are then trained and evaluated through cross-validation. The best models for predicting home and away scores are selected based on superior performance on the chosen optimization metric. The selected models, encapsulated in pipelines with pre-processing steps and hyperparameter tuning, are now ready for the prediction phase.

The trained models are saved using the `save_models` function for future use, ensuring that the prediction process can be efficiently replicated and scaled.

### Test Modelling Data Preparation
First, because we want to assess how well the match simulations work without adding load to the model cross-validation, we split the data into training and testing sets. The held-out test set lets us evaluate how the home and away models work together on unseen data.

In [3]:
# Set the random seed for reproducibility
random_seed = 42

# Define the test size proportion
test_size = 0.2

# Randomly shuffle the DataFrame and split
training_data = data.sample(frac=1 - test_size, random_state=random_seed)
test_data = data.drop(training_data.index)

### Poisson Modelling
The Poisson model is a statistical model that is used to predict the number of events occurring within a fixed interval of time or space. In the context of sports, the Poisson model can be used to predict the number of goals or points scored by each team in a match. In this section, we will implement a Poisson model to predict the number of points scored by each team in a match. We will then use these predictions to calculate the expected match outcome.
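
To make the last step concrete, the sketch below turns two expected scores into win, draw, and loss probabilities. It assumes the home and away scores are independent Poisson variables, which is a simplification. The function names, the `max_score` cut-off, and the example rates are illustrative and are not taken from the project code.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed in log space for numerical stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def outcome_probabilities(lam_home, lam_away, max_score=120):
    """Sum the joint probability of every scoreline up to max_score points per team."""
    home_pmf = [poisson_pmf(k, lam_home) for k in range(max_score + 1)]
    away_pmf = [poisson_pmf(k, lam_away) for k in range(max_score + 1)]
    home_win = draw = away_win = 0.0
    for h, p_home in enumerate(home_pmf):
        for a, p_away in enumerate(away_pmf):
            joint = p_home * p_away  # independence assumption
            if h > a:
                home_win += joint
            elif h == a:
                draw += joint
            else:
                away_win += joint
    return {"home_win": home_win, "draw": draw, "away_win": away_win}


# Hypothetical expected scores of 24 (home) and 18 (away)
example = outcome_probabilities(24.0, 18.0)
```

With rates in the range of typical match scores, truncating at 120 points leaves a negligible amount of probability mass uncounted, so the three values sum to very nearly 1.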

#### Home Modelling

In [4]:
home_model = mf.train_and_select_best_model(
    training_data, tc.predictors, 'team_final_score_home',
    tc.use_rfe, tc.num_folds, tc.opt_metric
)


Model training: LGBMRegressor
gen	nevals	fitness 	fitness_std	fitness_max	fitness_min
0  	10    	-6.60211	0.711143   	-5.57177   	-7.39003   
1  	15    	-6.21606	0.667187   	-5.57177   	-7.12931   
2  	15    	-5.84556	0.488286   	-5.57177   	-6.81748   
Best parameters: {'n_estimators': 140, 'learning_rate': 0.14985754742636262, 'max_depth': 2, 'num_leaves': 90, 'subsample': 0.824460066745631, 'colsample_bytree': 0.8973152570628308, 'reg_alpha': 0.34572096626064575, 'reg_lambda': 0.01905604127755156}
Best score: -5.539452282545402

Best overall model: LGBMRegressor
Best overall score: -5.539452282545402
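
The training log above reports one row per generation, with columns that appear to summarize the population's fitness (mean, standard deviation, maximum, and minimum). The sketch below is a toy genetic search in plain Python that produces the same kind of per-generation summary. It is not `GASearchCV`: the search space, the `toy_fitness` function (a stand-in for a cross-validated score), and every other name here are hypothetical.

```python
import random
import statistics

# Hypothetical search space: (low, high) bounds for each hyperparameter
SEARCH_SPACE = {"learning_rate": (0.01, 0.3), "subsample": (0.2, 1.0)}


def toy_fitness(params):
    """Stand-in for a cross-validated score: higher is better, 0 is the optimum."""
    return -((params["learning_rate"] - 0.1) ** 2 + (params["subsample"] - 0.8) ** 2)


def random_individual(rng):
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in SEARCH_SPACE.items()}


def crossover(a, b, rng):
    # Uniform crossover: each gene is copied from one of the two parents
    return {name: rng.choice((a[name], b[name])) for name in SEARCH_SPACE}


def mutate(individual, rng, rate=0.3):
    # Resample each gene within its bounds with probability `rate`
    mutated = dict(individual)
    for name, (lo, hi) in SEARCH_SPACE.items():
        if rng.random() < rate:
            mutated[name] = rng.uniform(lo, hi)
    return mutated


def genetic_search(pop_size=10, generations=3, seed=42):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    history = []
    for gen in range(generations):
        ranked = sorted(population, key=toy_fitness, reverse=True)
        fits = [toy_fitness(p) for p in ranked]
        history.append({
            "gen": gen,
            "fitness": statistics.mean(fits),
            "fitness_std": statistics.pstdev(fits),
            "fitness_max": fits[0],
            "fitness_min": fits[-1],
        })
        parents = ranked[: pop_size // 2]  # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - 1)
        ]
        population = [ranked[0]] + children  # elitism keeps the best individual
    return max(population, key=toy_fitness), history


best_params, history = genetic_search()
```

Because the best individual is carried over unchanged, `fitness_max` never decreases from one generation to the next in this sketch.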


##### Feature Importance

In [5]:
preprocessor = home_model.named_steps['one_hot_encoder']
encoded_feature_names = preprocessor.get_feature_names_out(tc.predictors)

# Get the top 20 feature importances
top_20_features = mp.get_feature_importances(home_model, encoded_feature_names)
print(top_20_features)

                                             feature  importance
227               remainder__season_form_away_ladder          32
182               remainder__season_form_home_ladder          18
162           remainder__team_head_to_head_odds_away          16
161                 remainder__team_line_amount_home          15
432           remainder__dummy_pass_away_performance          13
363          remainder__tackle_made_away_performance          11
437  remainder__post_contact_metres_away_performance           9
159           remainder__team_head_to_head_odds_home           9
302    remainder__one_on_one_tackle_home_performance           9
202            remainder__home_loss_rate_home_ladder           8
293         remainder__charge_downs_home_performance           8
382               remainder__dh_run_away_performance           7
328           remainder__dummy_pass_home_performance           7
400   remainder__ineffective_tackle_away_performance           7
251             remainder

#### Away Team Modelling

In [6]:
away_model = mf.train_and_select_best_model(
    training_data, tc.predictors, 'team_final_score_away',
    tc.use_rfe, tc.num_folds, tc.opt_metric
)


Model training: LGBMRegressor
gen	nevals	fitness 	fitness_std	fitness_max	fitness_min
0  	10    	-6.44675	0.573474   	-5.36741   	-7.4685    
1  	15    	-6.07647	0.400402   	-5.36627   	-6.50456   
2  	17    	-5.72586	0.387462   	-5.35963   	-6.41282   
Best parameters: {'n_estimators': 30, 'learning_rate': 0.19441815010715624, 'max_depth': 2, 'num_leaves': 84, 'subsample': 0.23063139918755848, 'colsample_bytree': 0.2626078404659108, 'reg_alpha': 0.801489029516365, 'reg_lambda': 0.9494799726468875}
Best score: -5.35963155099117

Best overall model: LGBMRegressor
Best overall score: -5.35963155099117


##### Feature Importance

In [7]:
# Get the names of the features after preprocessing (one-hot encoding)
preprocessor = away_model.named_steps['one_hot_encoder']
encoded_feature_names = preprocessor.get_feature_names_out(tc.predictors)

# Get the top 20 feature importances
top_20_features = mp.get_feature_importances(away_model, encoded_feature_names)
print(top_20_features)

                                           feature  importance
182             remainder__season_form_home_ladder           7
159         remainder__team_head_to_head_odds_home           7
227             remainder__season_form_away_ladder           5
164               remainder__team_line_amount_away           5
161               remainder__team_line_amount_home           5
366          remainder__territory_away_performance           4
181             remainder__recent_form_home_ladder           4
162         remainder__team_head_to_head_odds_away           3
207      remainder__avg_tries_conceded_home_ladder           2
217              remainder__points_for_away_ladder           2
200           remainder__home_win_rate_home_ladder           2
261         remainder__possession_home_performance           2
412           remainder__supports_away_performance           2
360            remainder__sin_bin_away_performance           2
132            encoder__team_home_Penrith Panthers     

### Example Match Simulation
In this section, we will simulate a match between two teams using the Poisson model. We will generate the expected number of points scored by each team and use these predictions to determine the match outcome. We will then compare the predicted outcome with the actual outcome to evaluate the accuracy of the Poisson model.

In [8]:
# Example usage with test_data
# mp.plot_sampling_distributions(home_model, away_model, test_data, tc.predictors)
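
Since the plotting call above is commented out, here is a minimal sketch of what simulating a single match could look like: draw many home and away scores from Poisson distributions and count the outcomes. It assumes independent Poisson scores and uses hypothetical expected scores. It is not the implementation behind `plot_sampling_distributions`, and the function names are illustrative.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def simulate_match(lam_home, lam_away, n_sims=10_000, seed=42):
    """Simulate n_sims matches and summarize the outcome frequencies."""
    rng = random.Random(seed)
    wins = draws = 0
    total_margin = 0
    for _ in range(n_sims):
        home = sample_poisson(lam_home, rng)
        away = sample_poisson(lam_away, rng)
        total_margin += home - away
        if home > away:
            wins += 1
        elif home == away:
            draws += 1
    return {
        "home_win": wins / n_sims,
        "draw": draws / n_sims,
        "away_win": (n_sims - wins - draws) / n_sims,
        "mean_margin": total_margin / n_sims,
    }


# Hypothetical expected scores of 30 (home) and 15 (away)
summary = simulate_match(30.0, 15.0)
```

Knuth's method is adequate for rates the size of match scores; for very large rates `exp(-lam)` underflows and a different sampler would be needed.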

### Evaluation
The evaluation phase involves predicting the outcomes of an unseen set of random NRL matches using the selected home and away models.

This section is currently used only in development and testing, so the cells below are commented out.

#### Classification Report

In [9]:
# Evaluate the models on the test data
# result_df = mp.evaluate_models(home_model, away_model, test_data, tc.predictors)

#### Expected Scores

In [10]:
# mp.evaluate_score_predictions(result_df)

## Save Models
The `save_models` function writes each fitted Pipeline object to disk so that it can be reloaded for later prediction tasks without retraining. The files are stored in a 'models' directory under the project root, which keeps storage organized and consistent.

In [11]:
mf.save_models(home_model, 'home_model', project_root)
mf.save_models(away_model, 'away_model', project_root)

Pipeline saved to models/home_model.pkl
Pipeline saved to models/away_model.pkl
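
A save-and-reload round trip of this kind can be sketched with the standard library's `pickle` module. Whether `save_models` uses `pickle`, `joblib`, or something else is not shown in this notebook; the `.pkl` extension only suggests a pickle-compatible format. The helper names `save_pipeline` and `load_pipeline` are hypothetical, and a plain dictionary stands in for a fitted pipeline.

```python
import pickle
import tempfile
from pathlib import Path


def save_pipeline(obj, name, root):
    """Pickle obj to <root>/models/<name>.pkl and return the file path."""
    models_dir = Path(root) / "models"
    models_dir.mkdir(parents=True, exist_ok=True)
    path = models_dir / f"{name}.pkl"
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)
    return path


def load_pipeline(name, root):
    """Load the object previously saved under <root>/models/<name>.pkl."""
    with open(Path(root) / "models" / f"{name}.pkl", "rb") as fh:
        return pickle.load(fh)


# Round trip using a temporary directory as a stand-in for the project root
with tempfile.TemporaryDirectory() as tmp_root:
    fake_model = {"model": "placeholder", "params": {"max_depth": 2}}
    saved_path = save_pipeline(fake_model, "home_model", tmp_root)
    restored = load_pipeline("home_model", tmp_root)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.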


## Match Simulation and Prediction
The final stage of the pipeline involves predicting the outcomes of the current week's NRL matches. This is achieved by loading the saved models from the 'models' directory and utilizing them to simulate the matches. The predictions are then stored in a DataFrame, which is subsequently written back to the SQLite database. This data can be accessed by the front-end application to display the predicted outcomes to the users.

In [12]:
# Load the models
home_model = pf.load_models('home_model', project_root)
away_model = pf.load_models('away_model', project_root)

# Load this week's game data
inference_data = pf.get_inference_data(
    db_path = project_root / "data" / "footy-tipper-db.sqlite", 
    sql_file = project_root / 'pipeline/common/sql/inference_data.sql')
inference_data

home_model model pipeline loaded successfully.
away_model model pipeline loaded successfully.
Getting inference data...


Unnamed: 0,game_id,round_id,round_name,game_number,game_state_name,start_time,start_time_utc,venue_name,city,crowd,...,home_prev_result_diff,away_prev_result_diff,prev_result_diff,home_elo,away_elo,elo_diff,home_elo_prob,away_elo_prob,elo_draw_prob,elo_prob_diff
0,20241110000.0,22.0,Round 22,1.0,Pre Game,1722542000.0,1722506000.0,Leichhardt Oval,Sydney,,...,-12.0,-10.0,-2.0,1440.905032,1508.768312,-67.86328,0.39406,0.573682,0.032258,-0.179623
1,20241110000.0,22.0,Round 22,2.0,Pre Game,1722629000.0,1722586000.0,Go Media Stadium,Auckland,,...,12.0,8.0,4.0,1488.10143,1465.790416,22.311014,0.516305,0.447281,0.036415,0.069024
2,20241110000.0,22.0,Round 22,3.0,Pre Game,1722622000.0,1722593000.0,Other,Perth,,...,-7.0,16.0,-23.0,1497.268226,1549.37648,-52.108254,0.410718,0.546092,0.043189,-0.135374
3,20241110000.0,22.0,Round 22,4.0,Pre Game,1722697000.0,1722661000.0,Cbus Super Stadium,Gold Coast,,...,8.0,-16.0,24.0,1490.763885,1491.132897,-0.369012,0.482988,0.476771,0.040241,0.006216
4,20241110000.0,22.0,Round 22,5.0,Pre Game,1722706000.0,1722670000.0,AAMI Park,Melbourne,,...,16.0,-4.0,20.0,1537.921513,1483.325091,54.596422,0.564498,0.40609,0.029412,0.158407
5,20241110000.0,22.0,Round 22,6.0,Pre Game,1722714000.0,1722678000.0,PointsBet Stadium,Sydney,,...,52.0,20.0,32.0,1520.88794,1483.482743,37.405197,0.537051,0.426534,0.036415,0.110517
6,20241110000.0,22.0,Round 22,7.0,Pre Game,1722780000.0,1722744000.0,BlueBet Stadium,Penrith,,...,2.0,38.0,-36.0,1539.78319,1470.932726,68.850464,0.583738,0.38685,0.029412,0.196887
7,20241110000.0,22.0,Round 22,8.0,Pre Game,1722788000.0,1722752000.0,Belmore Sports Ground,Sydney,,...,1.0,10.0,-9.0,1518.473729,1480.743529,37.7302,0.537496,0.42609,0.036415,0.111406


### Prediction Simulation
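The `predict_match_outcome_and_scoreline_with_bayes` function in `prediction_functions` predicts the outcomes of the current week's NRL matches. It takes the loaded home and away models, the inference data, and the predictor list, and returns two DataFrames: one with the predicted result, the win, loss, and draw probabilities, a Bayes factor, and an evidence-strength label, and one with the predicted scores and margin. The two are merged on `game_id`, and the combined DataFrame is then written back to the SQLite database.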

In [13]:
import pandas as pd

# Predict match outcomes and scorelines for the inference data
outcomes, margins = pf.predict_match_outcome_and_scoreline_with_bayes(home_model, away_model, inference_data, tc.predictors)
outcome_df = pd.merge(outcomes, margins, on='game_id')
outcome_df

Unnamed: 0,game_id,home_team_result,home_team_win_prob,home_team_lose_prob,draw_prob,bayes_factor,evidence_strength,predicted_home_score,predicted_away_score,predicted_margin
0,20241110000.0,Loss,0.10648,0.86459,0.02893,0.123157,Negative evidence,20,28,-8
1,20241110000.0,Win,0.96326,0.02662,0.01012,36.185575,Very strong evidence,30,18,12
2,20241110000.0,Loss,0.1403,0.82509,0.03461,0.170042,Negative evidence,20,26,-6
3,20241110000.0,Loss,0.31439,0.63228,0.05333,0.497232,Negative evidence,22,24,-2
4,20241110000.0,Win,0.98577,0.00995,0.00428,99.072362,Very strong evidence,30,15,15
5,20241110000.0,Win,0.96725,0.02353,0.00922,41.107097,Very strong evidence,32,18,14
6,20241110000.0,Win,0.99043,0.0061,0.00347,162.365574,Decisive evidence,33,15,18
7,20241110000.0,Win,0.80724,0.15333,0.03943,5.264723,Moderate evidence,25,18,7
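
The `bayes_factor` values printed above are consistent with the ratio `home_team_win_prob / home_team_lose_prob` (for example, 0.96326 / 0.02662 ≈ 36.19). The sketch below reproduces that ratio and the labels that appear in the output. The cut-offs of 3, 10, 30, and 100 follow a Jeffreys-style scale and are an assumption, as are the labels for the 1 to 3 and 10 to 30 bands, which do not occur in this output. The project's actual rules may differ.

```python
def bayes_factor(win_prob, lose_prob):
    """Ratio of the home-win probability to the home-loss probability."""
    return win_prob / lose_prob


def evidence_strength(bf):
    """Map a Bayes factor to a label (thresholds are assumed, not confirmed)."""
    if bf < 1:
        return "Negative evidence"
    if bf < 3:
        return "Weak evidence"  # assumed label: band not present in the output
    if bf < 10:
        return "Moderate evidence"
    if bf < 30:
        return "Strong evidence"  # assumed label: band not present in the output
    if bf < 100:
        return "Very strong evidence"
    return "Decisive evidence"


# (win probability, loss probability) pairs copied from the table above
examples = [(0.10648, 0.86459), (0.96326, 0.02662), (0.99043, 0.0061), (0.80724, 0.15333)]
labels = [evidence_strength(bayes_factor(w, l)) for w, l in examples]
```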


In [14]:
# Save the predictions to the database
pf.save_predictions_to_db(
    outcome_df, 
    project_root / "data" / "footy-tipper-db.sqlite", 
    project_root / 'pipeline/common/sql/create_table.sql', 
    project_root / 'pipeline/common/sql/insert_into_table.sql'
)

Saving predictions to database...


## Sending Predictions via Email using ChatGPT
In this section, we will use the OpenAI ChatGPT model to generate an email template for sending the predictions to the users. We will use the predictions generated in the previous section and the ChatGPT model to create a personalized email template for each user. The email template will contain the predicted outcomes of the NRL matches for the current week.

In [15]:
from dotenv import load_dotenv
from pipeline.common.model_prediciton import prediction_functions as pf
from pipeline.common.use_predictions import sending_functions as sf

# Now construct the relative path to your SQLite database
db_path = project_root / "data" / "footy-tipper-db.sqlite"
secrets_path = project_root / "secrets.env"
json_path = project_root / "service-account-token.json"

load_dotenv(dotenv_path=secrets_path)

True

In [16]:
import sqlite3
import pandas as pd
# Connect to the SQLite database
con = sqlite3.connect(str(db_path))

# Read SQL query from external SQL file
with open(project_root / 'pipeline/common' / 'sql/prediction_table.sql', 'r') as file:
    query = file.read()

# Execute the query and fetch the results into a data frame
predictions = pd.read_sql_query(query, con)

# Disconnect from the SQLite database
con.close()

predictions

Unnamed: 0,game_id,home_team_result,team_home,position_home,team_head_to_head_odds_home,team_away,position_away,team_head_to_head_odds_away,home_team_win_prob,home_team_lose_prob,round_id,competition_year,round_name
0,20241112210,Loss,Wests Tigers,17,3.43,North Queensland Cowboys,6,1.32,0.10648,0.86459,22,2024,Round 22
1,20241112220,Win,New Zealand Warriors,12,1.23,Parramatta Eels,16,4.2,0.96326,0.02662,22,2024,Round 22
2,20241112230,Loss,Dolphins,8,3.54,Sydney Roosters,3,1.3,0.1403,0.82509,22,2024,Round 22
3,20241112240,Loss,Gold Coast Titans,14,2.44,Brisbane Broncos,13,1.56,0.31439,0.63228,22,2024,Round 22
4,20241112250,Win,Melbourne Storm,1,1.1,St. George Illawarra Dragons,10,7.0,0.98577,0.00995,22,2024,Round 22
5,20241112260,Win,Cronulla-Sutherland Sharks,4,1.28,South Sydney Rabbitohs,15,3.7,0.96725,0.02353,22,2024,Round 22
6,20241112270,Win,Penrith Panthers,2,1.13,Newcastle Knights,11,6.05,0.99043,0.0061,22,2024,Round 22
7,20241112280,Win,Canterbury-Bankstown Bulldogs,5,1.38,Canberra Raiders,9,3.07,0.80724,0.15333,22,2024,Round 22


### Tipper Picks
The Tipper Picks are selected games that offer high value to the tipper. The selection is based on the predicted match outcomes and the bookmaker odds for each game.

This shows how predictions can be used to further enrich the user experience and provide valuable insights to the users.

In [17]:
tipper_picks = sf.get_tipper_picks(predictions)
tipper_picks

Unnamed: 0,team,price,price_min
1,New Zealand Warriors,1.23,1.038141
4,Melbourne Storm,1.1,1.014435
5,Cronulla-Sutherland Sharks,1.28,1.033859
6,Penrith Panthers,1.13,1.009662
7,Canterbury-Bankstown Bulldogs,1.38,1.238789
0,North Queensland Cowboys,1.32,1.156618
2,Sydney Roosters,1.3,1.211989
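
In the output above, `price_min` matches the reciprocal of the model's probability for the picked team (for example, 1 / 0.96326 ≈ 1.0381 for the Warriors), and every listed team has a bookmaker `price` above that value. The sketch below applies that rule to three rows from the predictions table. It is inferred from the printed numbers rather than taken from `get_tipper_picks`, and the function name `value_picks` and the plain-dictionary input format are assumptions.

```python
def value_picks(games):
    """Return teams whose bookmaker price exceeds the model's fair price (1 / probability)."""
    picks = []
    for game in games:
        candidates = [
            (game["team_home"], game["odds_home"], game["home_win_prob"]),
            (game["team_away"], game["odds_away"], game["home_lose_prob"]),
        ]
        for team, price, prob in candidates:
            if prob <= 0:
                continue  # no fair price can be computed for a zero probability
            price_min = 1.0 / prob
            if price > price_min:
                picks.append({"team": team, "price": price, "price_min": round(price_min, 6)})
    return picks


# Three games copied from the predictions table above
games = [
    {"team_home": "Wests Tigers", "odds_home": 3.43,
     "team_away": "North Queensland Cowboys", "odds_away": 1.32,
     "home_win_prob": 0.10648, "home_lose_prob": 0.86459},
    {"team_home": "Gold Coast Titans", "odds_home": 2.44,
     "team_away": "Brisbane Broncos", "odds_away": 1.56,
     "home_win_prob": 0.31439, "home_lose_prob": 0.63228},
    {"team_home": "Melbourne Storm", "odds_home": 1.1,
     "team_away": "St. George Illawarra Dragons", "odds_away": 7.0,
     "home_win_prob": 0.98577, "home_lose_prob": 0.00995},
]
picks = value_picks(games)
```

Under this rule neither team in the Titans and Broncos game qualifies, which agrees with that game being absent from the printed picks.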


In [18]:
# sf.upload_df_to_drive(
#     predictions, 
#     json_path, 
#     os.getenv('FOLDER_ID'), 
#     "predictions.csv"
# )

### Reg Reagan's Email
In this section, we will generate an email from Reg Reagan, a fictional character, using the OpenAI ChatGPT model. The email will contain the predicted outcomes of the NRL matches for the current week, along with some humorous and engaging content. The email will be sent to the users to provide them with the predictions and entertain them at the same time.

In [19]:
reg_reagan = sf.generate_reg_regan_email(
    predictions, 
    tipper_picks, 
    os.getenv('OPENAI_KEY'), 
    os.getenv('FOLDER_URL'),
    1
)

print(reg_reagan)

Subject: Reg Reagan's Rational Round 22 Revelations and Rants

G'day Footy Fanatics,

Hope you're ready for some proper pearl-picking predictions from your favourite footy forecaster. Our magical machine Footy Tipper has churned out predictions for the upcoming Round 22 showdowns, so get ready to make some cash. 

And no, this isn't an invitation for arguments about artificial intelligence and its place in tipping - catch me outside in a footy ring for that one. 

Dissecting the round, it seems like our mate the Footy Tipper has gone with the flow for this one, lads. The machine reckons the Wests Tigers, despite playing on their turf, will be left licking their wounds against the fearsome North Queensland Cowboys who are resting at 6th position. No major shocks yet.

What's a bit tasty though is that Penrith Panthers are expected to continue their rule supreme against our very own Newcastle Knights. You can bet your bottom dollar I hate to see that, but that's what the CPU oracle is pr

In [20]:
# sf.send_emails(
#     "footy-tipper-email-list", 
#     f"Footy Tipper Predictions for {predictions['round_name'].unique()[0]}", 
#     reg_reagan, 
#     os.getenv('MY_EMAIL'), 
#     os.getenv('EMAIL_PASSWORD'), 
#     json_path
# )