# Model Training
This 'Model Training and Prediction' notebook covers the model development stage of the project's data science pipeline. The pipeline takes the training data and fits three types of model to it: Random Forest, Gradient Boosted Tree, and XGBoost. Categorical variables are encoded first, and Recursive Feature Elimination (RFE) is applied for feature selection. Hyperparameters are then tuned with a genetic algorithm, run in tandem with cross-validation. The best model is the one with the highest F1 score, a metric that balances precision and recall. That model is then used to predict the outcomes of the current week's round of NRL matches. The process is iterative: earlier stages may be revisited depending on how the model performs.

## Set up Environment
The first cell sets up the environment for the model training pipeline. It begins by importing `sys` and `pathlib`, the standard-library modules used here to extend the module search path and to handle file paths, respectively.

The code then updates the system path to include the "functions" directory. This allows for the import of custom modules `modelling_functions`, `model_properties`, and `training_config` which are stored in this directory. These modules contain custom functions and configuration settings that are critical for the later stages of data preprocessing, model training, and prediction.

Following this, the `project_root` variable is defined. It uses `pathlib` to resolve the project's root directory, taken to be two levels above the current working directory.

Finally, the `db_path` variable is constructed. It points to the SQLite database "footy-tipper-db.sqlite" in the "data" directory under the project root. Because `project_root` is an absolute path, `db_path` is absolute as well. This path is used for database connectivity throughout the pipeline.

In [1]:
import sys
import pathlib

sys.path.append("functions") 
import modelling_functions as mf
import model_properties as mp
import training_config as tc

# Project root: two levels above the current working directory
project_root = pathlib.Path().absolute().parent.parent

# Path to the SQLite database under the project root
db_path = project_root / "data" / "footy-tipper-db.sqlite"

## Get data
We connect to the SQLite database at `db_path` and run the SQL query stored in 'sql/training_data.sql' to extract the required data. The result is loaded into a pandas DataFrame, `training_data`, which is the basis for the modelling that follows. Once the data has been extracted, the database connection is closed so that the resource is released.
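
The internals of `mf.get_training_data` are not shown in this notebook. As a rough sketch of the steps described above (read the query from a file, run it against SQLite, return a DataFrame, close the connection), something like the following could work. The function name `load_query_results` is hypothetical and is not part of `modelling_functions`.

```python
import sqlite3
from contextlib import closing
from pathlib import Path

import pandas as pd


def load_query_results(db_path, sql_file):
    """Run the query stored in sql_file against the SQLite database at db_path.

    Hypothetical helper for illustration; the real get_training_data may differ.
    """
    query = Path(sql_file).read_text()
    # closing() releases the connection even if the query raises an error
    with closing(sqlite3.connect(str(db_path))) as connection:
        return pd.read_sql_query(query, connection)
```

Wrapping the connection in `closing` keeps the cleanup in one place, so a failing query cannot leave the database file locked.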

In [2]:
training_data = mf.get_training_data(db_path, 'sql/training_data.sql')
training_data

| | game_id | round_id | round_name | game_number | game_state_name | start_time | start_time_utc | venue_name | city | crowd | ... | away_prev_result_diff | prev_result_diff | home_elo | away_elo | elo_diff | home_elo_prob | away_elo_prob | elo_draw_prob | elo_prob_diff | home_ground_advantage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.020111e+10 | 1.0 | Round 1 | 1.0 | Final | 1.584044e+09 | 1.584004e+09 | CommBank Stadium | Sydney | 21363.0 | ... | | | 1510.543309 | 1496.559437 | 13.983872 | 0.515806 | 0.468960 | 0.015234 | 0.046846 | 8.766223 |
| 1 | 2.020111e+10 | 1.0 | Round 1 | 2.0 | Final | 1.584122e+09 | 1.584083e+09 | GIO Stadium | Canberra | 10610.0 | ... | | | 1515.710707 | 1466.529375 | 49.181332 | 0.553042 | 0.410595 | 0.036364 | 0.142447 | 9.452456 |
| 2 | 2.020111e+10 | 1.0 | Round 1 | 3.0 | Final | 1.584126e+09 | 1.584090e+09 | Queensland Country Bank Stadium | Townsville | 22459.0 | ... | | | 1490.406173 | 1483.396900 | 7.009273 | 0.505936 | 0.478830 | 0.015234 | 0.027106 | 2.220902 |
| 3 | 2.020111e+10 | 1.0 | Round 1 | 4.0 | Final | 1.584198e+09 | 1.584158e+09 | McDonald Jones Stadium | Newcastle | 10239.0 | ... | | | 1484.839646 | 1486.546635 | -1.706988 | 0.493587 | 0.491179 | 0.015234 | 0.002408 | 6.906021 |
| 4 | 2.020111e+10 | 1.0 | Round 1 | 5.0 | Final | 1.584207e+09 | 1.584167e+09 | Accor Stadium | Sydney | | ... | | | 1505.190449 | 1506.316339 | -1.125890 | 0.494410 | 0.490356 | 0.015234 | 0.004055 | 3.330532 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 722 | 2.023111e+10 | 21.0 | Round 21 | 4.0 | Final | 1.690038e+09 | 1.690002e+09 | Cbus Super Stadium | Gold Coast | 15362.0 | ... | 2.0 | -4.0 | 1489.984964 | 1479.009557 | 10.975407 | 0.511551 | 0.473215 | 0.015234 | 0.038335 | 1.319097 |
| 723 | 2.023111e+10 | 21.0 | Round 21 | 5.0 | Final | 1.690047e+09 | 1.690011e+09 | McDonald Jones Stadium | Newcastle | 20392.0 | ... | -14.0 | 30.0 | 1514.952402 | 1528.123609 | -13.171207 | 0.477345 | 0.507422 | 0.015234 | -0.030077 | -6.718729 |
| 724 | 2.023111e+10 | 21.0 | Round 21 | 6.0 | Final | 1.690054e+09 | 1.690018e+09 | Queensland Country Bank Stadium | Townsville | 20710.0 | ... | -28.0 | 102.0 | 1541.048829 | 1521.493601 | 19.555229 | 0.512441 | 0.451195 | 0.036364 | 0.061246 | -0.360000 |
| 725 | 2.023111e+10 | 21.0 | Round 21 | 7.0 | Final | 1.690121e+09 | 1.690085e+09 | BlueBet Stadium | Penrith | 21525.0 | ... | -4.0 | 12.0 | 1545.682868 | 1435.666580 | 110.016288 | 0.656564 | 0.343436 | 0.000000 | 0.313129 | 20.278597 |


## Modelling
During the modelling phase, the `train_and_select_best_model` function from our `modelling_functions` module is invoked. It trains three models: XGBoost, Random Forest, and Gradient Boosting Classifier. Its inputs are the training data, the predictor variables, the outcome variable, and several configuration settings: whether to use Recursive Feature Elimination (RFE), the number of cross-validation folds, and the optimization metric. All of these settings come from the `training_config` module.

The function first identifies categorical columns in the feature set for one-hot encoding, creating dummy variables for categorical features. Depending on the choice of using RFE, a feature elimination step may be included in the pipeline. Each model subsequently undergoes hyperparameter tuning using a genetic algorithm, facilitated by the `GASearchCV` function.
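
To make the preprocessing steps concrete, here is a minimal sketch of a pipeline that one-hot encodes the categorical columns and optionally adds a feature elimination step. It uses only scikit-learn and leaves out the genetic-algorithm tuning. The function name `build_pipeline`, the step names, and the rule "every non-numeric column is categorical" are assumptions for illustration; the actual code in `modelling_functions` may be organised differently.

```python
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import RFECV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_pipeline(X, estimator, use_rfe=True, num_folds=3, scoring="f1_weighted"):
    """Assemble one-hot encoding, optional RFE, and the estimator (illustrative)."""
    # Assumption: every non-numeric column is a categorical feature
    categorical_cols = X.select_dtypes(exclude="number").columns.tolist()
    one_hot = ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
        remainder="passthrough",  # numeric columns pass through unchanged
        sparse_threshold=0,       # always return a dense array
    )
    steps = [("one_hot", one_hot)]
    if use_rfe:
        # RFECV picks the number of features to keep by cross-validation
        steps.append(("rfe", RFECV(clone(estimator), cv=num_folds, scoring=scoring)))
    steps.append(("model", estimator))
    return Pipeline(steps)
```

Keeping the encoder and the elimination step inside the pipeline means both are refitted on each training fold during cross-validation, which avoids leaking information from the validation fold.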

All the models are then trained and evaluated through cross-validation. The best model, `footy_tipper`, is the one with the best score on the chosen optimization metric. The function also returns a `LabelEncoder` (`label_encoder`), which is used to encode the categorical target variable and is specific to the model that performed best. The selected model, wrapped in a pipeline together with its pre-processing steps and tuned hyperparameters, is then ready for the prediction phase.
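
The selection step can be sketched as follows: score each candidate by cross-validation, keep the one with the highest mean score, and use a `LabelEncoder` to move between string outcomes and integer classes. This is a simplified illustration on synthetic data, without XGBoost or hyperparameter tuning. The function name `select_best_model` and the "Win"/"Loss" labels are assumptions, not taken from `modelling_functions`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder


def select_best_model(candidates, X, y, num_folds=5, opt_metric="f1_weighted"):
    """Return the best candidate's name, the fitted model, and all mean CV scores."""
    scores = {
        name: cross_val_score(model, X, y, cv=num_folds, scoring=opt_metric).mean()
        for name, model in candidates.items()
    }
    best_name = max(scores, key=scores.get)
    # Refit the winner on all of the data before using it for prediction
    best_model = candidates[best_name].fit(X, y)
    return best_name, best_model, scores


# Synthetic stand-in for the real training data
X, y_raw = make_classification(n_samples=200, n_features=8, random_state=0)
labels = np.where(y_raw == 1, "Win", "Loss")  # hypothetical outcome labels

# Encode string outcomes as integers; inverse_transform recovers them later
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)

candidates = {
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
}
best_name, best_model, scores = select_best_model(candidates, X, y)
predicted_labels = label_encoder.inverse_transform(best_model.predict(X[:5]))
```

Returning the encoder alongside the model matters at prediction time, because the integer classes the model outputs only become readable outcomes after `inverse_transform`.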

In [3]:
footy_tipper, label_encoder = mf.train_and_select_best_model(
    training_data, tc.predictors, tc.outcome_var,
    tc.use_rfe, tc.num_folds, tc.opt_metric
)

footy_tipper


Model training: XGBClassifier
gen	nevals	fitness 	fitness_std	fitness_max	fitness_min
0  	100   	0.704979	0.0117729  	0.730439   	0.669806   
1  	148   	0.715339	0.00733655 	0.730439   	0.697345   
2  	136   	0.72095 	0.00407514 	0.730439   	0.709787   
3  	150   	0.723566	0.00321149 	0.730439   	0.715257   
4  	136   	0.725935	0.00266092 	0.7318     	0.720775   
5  	143   	0.728014	0.00201165 	0.733198   	0.723533   
6  	136   	0.72946 	0.0020776  	0.733198   	0.723533   
7  	137   	0.730813	0.00179901 	0.733198   	0.726282   
8  	136   	0.731806	0.00147417 	0.733198   	0.727662   
9  	146   	0.732124	0.00133971 	0.733198   	0.726273   
10 	146   	0.732611	0.0011622  	0.733198   	0.724913   
11 	135   	0.733126	0.000660595	0.735947   	0.7318     
12 	143   	0.733225	0.000435388	0.735947   	0.7318     
13 	137   	0.733266	0.000354551	0.735947   	0.733198   
14 	128   	0.733348	0.000578293	0.735947   	0.733198   
15 	146   	0.733526	0.000973637	0.735947   	0.729051   
16 	137   	0.7339

ValueError: Input X contains NaN.
RFECV does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
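
The error message itself suggests one remedy: impute missing values inside the pipeline, before the feature elimination step sees them. The sketch below shows that idea on synthetic numeric data, using `SimpleImputer` with a median strategy. Both the strategy and the step names are assumptions, and whether median imputation suits columns such as `prev_result_diff`, which is empty for Round 1 games, is a modelling decision this example does not settle.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Synthetic numeric features with roughly 10% of the values set to NaN
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

pipeline = Pipeline([
    # Replace each NaN with its column median before feature elimination runs
    ("imputer", SimpleImputer(strategy="median")),
    ("rfe", RFECV(RandomForestClassifier(n_estimators=20, random_state=0), cv=3)),
    ("model", RandomForestClassifier(n_estimators=20, random_state=0)),
])
pipeline.fit(X, y)
predictions = pipeline.predict(X)
```

Because the imputer is a pipeline step, its medians are learned from the training folds only, and the same fill values are reused when the fitted pipeline predicts on new matches.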

### Display feature importance
The `get_feature_importances_from_pipeline` function, from the `model_properties` module, retrieves feature importances from a trained scikit-learn pipeline. It accounts for the transformations applied along the way, such as one-hot encoding and recursive feature elimination. It returns a sorted DataFrame listing each feature alongside its importance, which helps in understanding how the model makes its decisions.

In [4]:
# feature_importance_df = mp.get_feature_importances_from_pipeline(footy_tipper, tc.predictors)
# feature_importance_df

## Save Model
The `save_models` function writes the trained LabelEncoder and Pipeline objects to disk, so that they can be loaded and reused for future predictions without retraining. The objects are stored in a designated 'models' directory under the project root, which keeps storage organized and consistent.

In [None]:
mf.save_models(label_encoder, footy_tipper, project_root)
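
For reference, persisting the two objects could look roughly like the sketch below, which uses `joblib`. The function name `save_objects` and the file names are hypothetical, and the serialization format that `save_models` actually uses is not shown in this notebook.

```python
from pathlib import Path

import joblib


def save_objects(label_encoder, pipeline, project_root):
    """Persist both objects under <project_root>/models and return their paths."""
    model_dir = Path(project_root) / "models"
    model_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical file names, chosen for illustration only
    encoder_path = model_dir / "label_encoder.pkl"
    pipeline_path = model_dir / "footy_tipper.pkl"
    joblib.dump(label_encoder, encoder_path)
    joblib.dump(pipeline, pipeline_path)
    return encoder_path, pipeline_path
```

The saved objects can later be restored with `joblib.load(path)`, ideally under the same library versions that were used for training.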