## Reference Implementation

### E2E Architecture

![use_case_flow](assets/e2e-workflow.png)

### Solution setup
Use the following cell to switch to the correct kernel, then check that you are in the `defaultrisk_stock` kernel. If not, navigate to `Kernel > Change kernel > Python [conda env:defaultrisk_stock]`. Note that the cell's execution indicator will stay at `*`, but you can continue running the following cells.

In [None]:
%%javascript
Jupyter.notebook.session.restart({kernel_name: 'conda-env-defaultrisk_stock-py'})
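
To confirm which environment the kernel is actually running in, a quick check like the one below can help. This is a minimal sketch: the helper name `active_env_name` is our own, and it assumes the conda environment directory is named after the environment (here `defaultrisk_stock`, the name used in the commands of this notebook).

```python
import os
import sys


def active_env_name() -> str:
    """Return the directory name of the environment backing this interpreter.

    For a conda environment, sys.prefix points at the environment directory,
    so its last path component is normally the environment name.
    """
    return os.path.basename(os.path.normpath(sys.prefix))


env = active_env_name()
print(f"Interpreter environment: {env}")
if env != "defaultrisk_stock":
    # Assumption: the stock environment is named defaultrisk_stock.
    print("Not in the stock environment; change the kernel before continuing.")
```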

We can view a few samples of our data.

In [None]:
import pandas as pd
data = pd.read_csv("data/credit_risk_dataset.csv")
data.tail()


For demonstrative purposes, we make 2 modifications to the original dataset before experimentation using the [`data/prepare_data.py`](data/prepare_data.py) script:

1. Adding a synthetic bias_variable
    
    For the purpose of demonstrating fairness in an ML model later, we will add a bias value for each loan default prediction. This value will be generated randomly using a simple binary probability distribution as follows:
    ```

    If the loan is defaulted i.e. prediction class 1:
      assign bias_variable = 0 or 1 with the probability of 0 being 0.65

    if the loan is not defaulted i.e. prediction class 0:
      assign bias_variable = 0 or 1 with the probability of 0 being 0.35
      
    ```
    |**Feature** | **Description** |
    | :---: | :---: |
    | bias_variable | synthetic biased variable |

    For fairness quantification, we will define that this variable should belong to a [protected class](https://en.wikipedia.org/wiki/Fairness_(machine_learning)) and `bias_variable = 1` is the privileged group.

    This variable is NOT used to train the model as the expectation is that it should not be used to make decisions for fairness purposes.

2.  Splitting the dataset into 1 initial batch for training the model from scratch, and 3 additional equally sized batches for incrementally updating the trained model 
    
    To simulate incremental learning, where the model is updated on new datasets, the original training set is split into 1 batch for initially training the model from scratch and 3 more equally sized batches for incremental learning. When running incremental learning, each batch represents a new dataset that is used to update the model.

The full splitting process is as follows: first, the dataset is split into 70% for training and 30% for holdout testing. The 70% is then split as described above into 1 batch for initial training and 3 batches for incremental training.
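
The two modifications above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the contents of `prepare_data.py`: the function names, the `bias_prob` parameter semantics, the fixed seeds, and the choice to make all four batches the same size are ours.

```python
import random


def assign_bias_variable(labels, bias_prob=0.65, seed=0):
    """Draw a synthetic binary bias_variable for each label.

    For a defaulted loan (label 1), P(bias_variable = 0) = bias_prob.
    For a non-defaulted loan (label 0), P(bias_variable = 0) = 1 - bias_prob.
    """
    rng = random.Random(seed)
    values = []
    for label in labels:
        p_zero = bias_prob if label == 1 else 1.0 - bias_prob
        values.append(0 if rng.random() < p_zero else 1)
    return values


def split_rows(rows, num_batches=4, test_fraction=0.3, seed=0):
    """Shuffle rows, hold out test_fraction, then cut the rest into batches."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    batch_size = len(train) // num_batches
    batches = [train[i * batch_size:(i + 1) * batch_size] for i in range(num_batches)]
    # Put any remainder rows into the last batch so no training data is dropped.
    batches[-1].extend(train[num_batches * batch_size:])
    return batches, test


# Toy data: 5000 defaulted and 5000 non-defaulted loans.
labels = [1] * 5000 + [0] * 5000
bias = assign_bias_variable(labels)
batches, test = split_rows(list(zip(labels, bias)))
print([len(b) for b in batches], len(test))
```

Setting `bias_prob=0.5` in this sketch makes the synthetic variable independent of the outcome, since both classes then draw 0 and 1 with equal probability.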

We will now run the data processing described above.

In [None]:
!cd data && python prepare_data.py --num_batches 4

### Model Building Process

The `run_training.py` script *reads the data*, *trains a preprocessor*, *trains an XGBoost Classifier*, and *saves the model*, which can be used for future inference.

The script takes the following arguments:

```shell
usage: run_training.py [-h] [--intel] [--num_cpu NUM_CPU] [--size SIZE]
                       [--trained_model TRAINED_MODEL] [--save_model_path SAVE_MODEL_PATH]
                       --train_file TRAIN_FILE --test_file TEST_FILE
                       [--logfile LOGFILE] [--estimators ESTIMATORS]

optional arguments:
  -h, --help            show this help message and exit
  --intel               use intel daal4py for model optimization
  --num_cpu NUM_CPU     number of cpu cores to use
  --size SIZE           number of data entries to duplicate data for training and benchmarking. -1 uses the original data size. Default is -1.
  --trained_model TRAINED_MODEL
                        saved trained model to incrementally update. If not provided, trains a new model.
  --save_model_path SAVE_MODEL_PATH
                        path to save a trained model. If not provided, does not save.
  --train_file TRAIN_FILE
                        data file for training
  --test_file TEST_FILE
                        data file for testing
  --logfile LOGFILE     log file to output benchmarking results to.
  --estimators ESTIMATORS
                        number of estimators to use.
```

#### Training the Initial Model

Assuming the structure is set up, we can use this script with the following command to generate and save a brand new trained XGBoost Classifier ready to be used for inference.

```shell
cd src
conda activate defaultrisk_stock
python run_training.py --train_file ../data/batches/credit_risk_train_1.csv --test_file ../data/credit_risk_test.csv --save_model_path ../saved_models/stock/model_1.pkl
```

The output of this script is a saved model, `../saved_models/stock/model_1.pkl`. In addition, the fairness metrics on the holdout test set are printed.

In [None]:
!cd src && python run_training.py --train_file ../data/batches/credit_risk_train_1.csv --test_file ../data/credit_risk_test.csv --save_model_path ../saved_models/stock/model_1.pkl

For the `bias_variable` generative process described above, we can see that certain fairness metrics deviate strongly from 1, indicating that the model may have picked up some bias and does not appear to make equitable predictions between the two groups.

In comparison, we can adjust the generative process so that the `bias_variable` is explicitly fair, that is, independent of the outcome:

```

    If the loan is defaulted i.e. prediction class 1:
      assign bias_variable = 0 or 1 with the probability of 0 being 0.5

    if the loan is not defaulted i.e. prediction class 0:
      assign bias_variable = 0 or 1 with the probability of 0 being 0.5
      
```

We can do this by running our data preparation and training scripts again:

In [None]:
!cd data && python prepare_data.py --bias_prob=0.5 --num_batches 4

In [None]:
!cd src && python run_training.py --train_file ../data/batches/credit_risk_train_1.csv --test_file ../data/credit_risk_test.csv --save_model_path ../saved_models/stock/model_1.pkl

With this setting, the fairness metrics should all be close to 1, indicating that the model is not biased along this protected variable.

A thorough investigation of fairness and mitigation of bias is a complex process that *may require multiple iterations of training and retraining the model*, potentially excluding some variables, reweighting samples, and investigation into sources of potential sampling bias.  A few further resources on fairness for ML models, as well as techniques for mitigation include [this guide](https://afraenkel.github.io/fairness-book/intro.html) and [the `shap` package](https://shap.readthedocs.io/en/latest/example_notebooks/overviews/Explaining%20quantitative%20measures%20of%20fairness.html).
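
To make the "close to 1" reading of the fairness metrics concrete, the sketch below computes two common ratio-style metrics between the unprivileged and privileged groups. The specific metrics reported by `run_training.py` are not listed here, so the choice of metrics, the function names, and the toy data are all assumptions for illustration.

```python
def rate(values):
    """Fraction of entries equal to 1 (NaN for an empty list)."""
    return sum(values) / len(values) if values else float("nan")


def ratio(a, b):
    """Divide a by b, returning NaN instead of raising on a zero denominator."""
    return a / b if b else float("nan")


def fairness_ratios(y_true, y_pred, group, privileged=1):
    """Compare the unprivileged group to the privileged group as ratios.

    Returns the ratio of positive prediction rates and the ratio of true
    positive rates. A value of 1.0 means both groups are treated alike
    on that measure.
    """
    stats = {}
    for is_priv in (True, False):
        idx = [i for i, g in enumerate(group) if (g == privileged) == is_priv]
        preds = [y_pred[i] for i in idx]
        preds_on_pos = [y_pred[i] for i in idx if y_true[i] == 1]
        stats[is_priv] = (rate(preds), rate(preds_on_pos))
    priv, unpriv = stats[True], stats[False]
    return {
        "positive_rate_ratio": ratio(unpriv[0], priv[0]),
        "true_positive_rate_ratio": ratio(unpriv[1], priv[1]),
    }


# Toy example: the unprivileged group (0) is predicted to default more often.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group = [1, 1, 1, 1, 0, 0, 0, 0]
result = fairness_ratios(y_true, y_pred, group)
print(result)
```

In this toy data both ratios come out well above 1, which is the kind of deviation that would prompt a closer look at the model and its training data.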


#### Updating the Initial Model with New Data (Incremental Learning)

The same script can be used to update the trained XGBoost Classifier with new data. We can pass in the previously trained model file from above (`../saved_models/stock/model_1.pkl`) and a new dataset file (`../data/batches/credit_risk_train_2.csv`), in the same format as the original dataset, to apply an incremental update to the existing model and output a new model.

```shell
cd src
conda activate defaultrisk_stock
python run_training.py --train_file ../data/batches/credit_risk_train_2.csv --test_file ../data/credit_risk_test.csv --trained_model ../saved_models/stock/model_1.pkl --save_model_path ../saved_models/stock/model_2.pkl
```

The output of this script is a newly saved model, `../saved_models/stock/model_2.pkl`, as well as new fairness metrics/plots for this model. The updated model can be deployed in the same environment as before, replacing the previous one.

***The accuracy of this model, trained on the original dataset as described in the instructions above, reaches ~90% on a holdout test set with an AUC of ~0.87. Incremental updates for this particular dataset maintain the accuracy on the holdout test set at ~90% with an AUC of ~0.87. This suggests that the model has saturated and that the data distribution is not changing across batches.***

> **Implementation Note:** For an XGBoost Classifier, updating the model using XGBoost's built-in functionality simply adds additional boosting rounds/estimators to the model, constructed using only the new data. This does **not** update existing estimators. As a result, after every incremental round, the model grows more complex while retaining the old estimators.

In [None]:
!cd src && python run_training.py --train_file ../data/batches/credit_risk_train_2.csv --test_file ../data/credit_risk_test.csv --trained_model ../saved_models/stock/model_1.pkl --save_model_path ../saved_models/stock/model_2.pkl

### Model Inference

The saved model from each model iteration can be used on new data with the same features to infer/predict the probability of a default. This can be deployed in any number of ways. When the model is updated on new data, the deployed model can be switched over to the new model to make updated inferences, provided that performance is better and that the model meets the standards of the organization at hand.

### Running Inference

To use this model to make predictions on new data, we can use the `run_inference.py` script, which takes in a saved model and a dataset to predict on, and prints the predictions to the console as JSON.

The `run_inference.py` script takes the following arguments:

```shell
usage: run_inference.py [-h] [--is_daal_model] [--silent] [--size SIZE]
                        [--trained_model TRAINED_MODEL] --input_file
                        INPUT_FILE [--logfile LOGFILE]

optional arguments:
  -h, --help            show this help message and exit
  --is_daal_model       toggle if file is daal4py optimized
  --silent              don't print predictions. used for benchmarking.
  --size SIZE           number of data entries for eval, used for
                        benchmarking. -1 is default.
  --trained_model TRAINED_MODEL
                        Saved trained model to incrementally update. If None,
                        trains a new model.
  --input_file INPUT_FILE
                        input file for inference
  --logfile LOGFILE     Log file to output benchmarking results to.
```

To run inference with one of the saved models on the holdout file `../data/credit_risk_test.csv`, which the data preparation step above set aside as 30% of the full dataset, we can run the command:

```shell
cd src
conda activate defaultrisk_stock
python run_inference.py --trained_model ../saved_models/stock/model_1.pkl --input_file ../data/credit_risk_test.csv
```

which outputs a json representation of the predicted probability of default for each row.

In [None]:
!cd src && python run_inference.py --trained_model ../saved_models/stock/model_1.pkl --input_file ../data/credit_risk_test.csv

Running inference on an incrementally updated model can be done using the same script, only specifying the updated model:

```shell
cd src
conda activate defaultrisk_stock
python run_inference.py --trained_model ../saved_models/stock/model_2.pkl --input_file ../data/credit_risk_test.csv
```

In [None]:
!cd src && python run_inference.py --trained_model ../saved_models/stock/model_2.pkl --input_file ../data/credit_risk_test.csv

## Optimizing the E2E Reference Solution with Intel® oneAPI

### Optimized E2E Architecture with Intel® oneAPI Components

![Use_case_flow](assets/e2e-workflow-optimized.png)



### Optimized Reference Solution Implementation 

#### Model Building Process with Intel® Optimizations

As Intel® optimizations are directly enabled by using XGBoost versions newer than v0.81, and the environment setup for the optimized version installs XGBoost v1.4.2, the `run_training.py` script can be run without any code changes to obtain a saved XGBoost v1.4.2 model. The same training process, optimized with Intel® oneAPI, can be run as follows:

```shell
cd src
conda activate defaultrisk_intel
python run_training.py --train_file ../data/batches/credit_risk_train_1.csv --test_file ../data/credit_risk_test.csv --save_model_path ../saved_models/stock/model_1.pkl
```

By toggling the `--intel` flag, the same process can also be used to save a **oneDAL optimized model**.  For example, the following command creates 2 saved models:

```shell
cd src
conda activate defaultrisk_intel
python run_training.py --train_file ../data/batches/credit_risk_train_1.csv --test_file ../data/credit_risk_test.csv --save_model_path ../saved_models/intel/model_1.pkl --intel
```

1. `../saved_models/intel/model_1.pkl`
    
    A saved XGBoost v1.4.2 model.

2. `../saved_models/intel/model_1_daal.pkl`

    The same model as above, but optimized using oneDAL.

**Change the kernel to `Python [conda env:defaultrisk_intel]`.**

In [None]:
%%javascript
Jupyter.notebook.session.restart({kernel_name: 'conda-env-defaultrisk_intel-py'})

In [None]:
!cd src && python run_training.py --train_file ../data/batches/credit_risk_train_1.csv --test_file ../data/credit_risk_test.csv --save_model_path ../saved_models/stock/model_1.pkl

In [None]:
!cd src && python run_training.py --train_file ../data/batches/credit_risk_train_1.csv --test_file ../data/credit_risk_test.csv --save_model_path ../saved_models/intel/model_1.pkl --intel

#### Model Inference with Intel® Optimizations

Inference with Intel® optimizations for v1.4.2 can also be enabled simply by using XGBoost >v0.81 as mentioned above.  To run inference on the v1.4.2 model, we can use the same `run_inference.py` script with no modifications to the call, passing in the desired v1.4.2 model:

```shell
cd src
conda activate defaultrisk_intel
python run_inference.py --trained_model ../saved_models/intel/model_1.pkl --input_file ../data/credit_risk_test.csv
```

To run inference on an Intel® oneDAL optimized model, the same `run_inference.py` script can be used, but the passed in model needs to be the saved daal4py version from training, and the `--is_daal_model` flag should be toggled:

```shell
cd src
conda activate defaultrisk_intel
python run_inference.py --trained_model ../saved_models/intel/model_1_daal.pkl --input_file ../data/credit_risk_test.csv --is_daal_model
```

In [None]:
!cd src && python run_inference.py --trained_model ../saved_models/intel/model_1.pkl --input_file ../data/credit_risk_test.csv

In [None]:
!cd src && python run_inference.py --trained_model ../saved_models/intel/model_1_daal.pkl --input_file ../data/credit_risk_test.csv --is_daal_model

## Performance Observations

In the following, we benchmark the Intel® technologies against the stock alternative on the following tasks:

### ***1. Benchmarking Incremental Training with Intel® oneAPI Optimizations for XGBoost***

Training is conducted using Intel® oneAPI XGBoost v1.4.2, whose optimizations pay off most for larger datasets and more complex models. The same optimizations apply when incrementally updating an existing model with new data. For XGBoost, as incremental learning naturally increases the complexity of the model, later iterations may benefit more strongly from Intel® optimizations.

As fairness and bias can be a major component in deploying a model for default risk prediction, in order to mitigate detected bias, many techniques must be explored such as dropping columns and rows, reweighting, resampling, and collecting new data.  Each of these new techniques requires a new model to be trained/incrementally updated, allowing for Intel® optimizations to continuously accelerate the discovery and training process beyond a single training iteration.

### ***2. Batch Inference with Intel® oneAPI Optimizations for XGBoost and Intel® oneDAL***

Once a model is trained, it can be deployed for inference on large data loads to predict the default risk across many different clients and many different potential loan attributes.  For other realistic scenarios, this can be used across a lot of different term structures and for scenario testing and evaluation.  

We benchmark batch inference using a v0.81 XGBoost model, a v1.4.2 XGBoost model, and a v1.4.2 XGBoost model optimized with Intel® oneDAL.

### Training Experiments

To explore performance across different dataset sizes, we replicate the original dataset to a larger size and add noise to ensure that no two data points are exactly the same.  Then we perform training and inference tasks on the following experimental configurations:

  **Experiment:**
    Model is initially trained on 3M data points.  Following this, the model is *incrementally updated* using 1M data points and used for **inference** on 1M data points.  This *incremental update* and *inference* process is repeated for 3 update rounds.
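
The replication step can be pictured with the sketch below, which cycles through the original rows and adds a small amount of Gaussian noise to each numeric value. The exact noise mechanism used by the benchmark scripts is not described here, so the additive Gaussian noise, the function name, and the `noise_scale` parameter are assumptions; only the `-1` convention mirrors the documented `--size` default.

```python
import random


def replicate_with_noise(rows, size, noise_scale=0.01, seed=0):
    """Cycle through `rows` until `size` entries exist, jittering every value.

    rows: list of lists of floats. size=-1 returns a copy of the original
    rows unchanged, mirroring the documented default of --size.
    """
    if size == -1:
        return [list(r) for r in rows]
    rng = random.Random(seed)
    out = []
    for i in range(size):
        base = rows[i % len(rows)]
        # Additive Gaussian noise makes exact duplicates vanishingly unlikely.
        out.append([v + rng.gauss(0.0, noise_scale) for v in base])
    return out


original = [[25.0, 50000.0], [40.0, 82000.0]]
enlarged = replicate_with_noise(original, size=6)
print(len(enlarged), len({tuple(r) for r in enlarged}))
```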

Before benchmarking, we remove any logs left over from previous runs.

In [None]:
!rm -rf logs

We will now run the training benchmarks, starting by changing to our stock environment.

In [None]:
%%javascript
Jupyter.notebook.session.restart({kernel_name: 'conda-env-defaultrisk_stock-py'})

In [None]:
!conda list | grep xgboost

In [None]:
!cd src && bash benchmark_incremental_training_stock.sh

In [None]:
!cd src && bash benchmark_inference_stock.sh

We now change to our Intel environment.

In [None]:
%%javascript
Jupyter.notebook.session.restart({kernel_name: 'conda-env-defaultrisk_intel-py'})

In [None]:
!conda list | grep xgboost

In [None]:
!cd src && bash benchmark_incremental_training_intel.sh

In [None]:
!cd src && bash benchmark_inference_intel.sh

Now we can create tables and graphs to illustrate the performance benefits in training and inference.

In [None]:
import os
os.chdir(os.getenv('workdir'))

from notebooks.utils import benchmarking_utils
benchmarking_utils.print_training_benchmark_bargraph()

In [None]:
import os
os.chdir(os.getenv('workdir'))

from notebooks.utils import benchmarking_utils
benchmarking_utils.print_inference_benchmark_bargraph()