# **Fraud Detection using an ensemble technique - Intel optimized DBSCAN clustering followed by Light Gradient Boosted Model (LGBM)**

To run the Stock instructions from the Notebook, set the kernel to **[conda env:FraudDetection_stock]**. You can either run the following cell (it will keep showing `*`, but you can continue with the process) or switch manually by selecting **Kernel > Change kernel > Python [conda env:FraudDetection_stock]** in the Notebook’s top menu bar. Once the correct kernel is selected, its name appears on the right side of the Notebook menu bar.

In [None]:
%%javascript
Jupyter.notebook.session.restart({kernel_name: "conda-env-FraudDetection_stock-py"}) 
element.text('Stock kernel loaded. You can continue executing the next cells.')

To run the Stock instructions from a terminal instead of the Notebook, follow the commands in the **Setting up Stock Environment** section of the [README.md](README.md#setting-up-stock-environment) file and activate the **FraudDetection_stock** environment. Then copy into the terminal the commands from this Notebook's markdown sections that run the scripts with the Python interpreter, e.g. `python ./src/run_benchmarks_train.py -l ./logs/stock_training.log`.

### **Data Ingestion**

Please download the data using the instructions provided in the [/data](data/data_download_instructions.txt) folder and save it as `creditcard.csv` in the same location. The dataset contains more than 280,000 credit card transactions, with 30 columns that serve as the features for model building and a "Class" label of 0 (legitimate transaction) or 1 (fraudulent transaction). The data is read as a pandas dataframe and split into train/test portions in the training and hyperparameter tuning scripts. The training set is used for clustering and LGBM training, whereas the test set is used as "new" data for inference when evaluating accuracy.

In [None]:
#Once downloaded
import pandas as pd
data = pd.read_csv("data/creditcard.csv")
data.tail()

### **Stock Clustering + Training/Hyperparameter Tuning**

The clustering and training portion of the benchmarking can be run using the Python script `run_benchmarks_train.py`. The script **reads data**, **performs DBSCAN clustering** and keeps only the data belonging to the cluster with the highest proportion of fraudulent transactions.

The script then **trains an LGBM model** on the full dataset as well as the clustered data. Both trained models are saved for inference - doing so will help us quantify the benefit of using clustering as opposed to using the full dataset directly for model training. This script will also report on the execution time for these steps. 
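The cluster-selection step described above can be sketched in plain Python. This is a minimal illustration, not the code in `run_benchmarks_train.py`: the function name `select_fraud_cluster` and the toy inputs are hypothetical, and whether the real script treats the DBSCAN noise label (-1) as a candidate cluster is an assumption here.

```python
from collections import defaultdict


def select_fraud_cluster(cluster_ids, labels):
    """Return (best_cluster_id, row_indices) for the cluster whose
    share of fraudulent rows (label == 1) is highest.

    Assumption: the noise label -1 is treated like any other cluster id.
    """
    totals = defaultdict(int)   # rows per cluster
    frauds = defaultdict(int)   # fraudulent rows per cluster
    for cid, label in zip(cluster_ids, labels):
        totals[cid] += 1
        frauds[cid] += label
    best = max(totals, key=lambda cid: frauds[cid] / totals[cid])
    rows = [i for i, cid in enumerate(cluster_ids) if cid == best]
    return best, rows


# Toy data: cluster ids as DBSCAN would assign them, plus the "Class" label.
cluster_ids = [0, 0, 0, 1, 1, -1, 1, 0]
labels = [0, 0, 1, 1, 1, 0, 0, 0]

best, rows = select_fraud_cluster(cluster_ids, labels)
print(best, rows)
```

In this toy input, cluster 1 holds two fraudulent rows out of three, so its rows are the ones that would be passed on to the clustered-data model.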

The run benchmark script takes the following arguments:

```shell
usage: run_benchmarks_train.py [-l LOGFILE] [-i]

optional arguments:
  -l LOGFILE, --logfile LOGFILE           log file to output benchmarking results to
  -i, --intel                             use intel accelerated technologies where available
```
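For readers who want to see how such a command line maps to code, here is a minimal `argparse` sketch that accepts the same two options. It is an assumption about how the interface could be declared, not a copy of the script; the `build_parser` helper and the empty-string default for the log file are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the usage text above (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="run_benchmarks_train.py")
    parser.add_argument(
        "-l", "--logfile", type=str, default="",
        help="log file to output benchmarking results to",
    )
    parser.add_argument(
        "-i", "--intel", action="store_true",
        help="use intel accelerated technologies where available",
    )
    return parser


# Parse the stock training command line used later in this Notebook.
args = build_parser().parse_args(["-l", "./logs/stock_training.log"])
print(args.logfile, args.intel)
```

Because `-i` is a `store_true` flag, omitting it selects the stock code path and adding it selects the Intel one.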

In [None]:
#Run to initialize libs and files
import sys
import os
sys.path.insert(0, './src')
from src import benchmark_tools as b_tools

jsonfile = './jsons/stock.json'
clusteredmodel="Clustered_LGBM_Classifier.pkl"
fullmodel="Full_LGBM_Classifier.pkl"

#Erase time values from stock json
b_tools.init_json(jsonfile)

#Remove files
if os.path.exists(clusteredmodel):
    os.remove(clusteredmodel)

if os.path.exists(fullmodel):
    os.remove(fullmodel)

To run with stock technologies, logging the performance to `logs`, we would run (after creating the appropriate environment as above):
```shell
python ./src/run_benchmarks_train.py -l ./logs/stock_training.log
```

In [None]:
log='./logs/stock_training.log'
!rm $log 2>/dev/null
%run './src/run_benchmarks_train.py' '-l' $log

b_tools.parse_logs(log, jsonfile)

The hyperparameter tuning exercise can be run by following the same procedure as described for training. It goes through the same steps prior to the supervised ML portion of the pipeline (ingestion and clustering). After these steps, instead of training, the script performs hyperparameter tuning over a predefined parameter dictionary. The only substitution is to execute the script `run_benchmarks_hyper.py` instead of `run_benchmarks_train.py`. It expects the same arguments as the training script.

Once again, the execution times will be reported and the trained models (using full and clustered data) will be saved for use by the prediction benchmarking script. The following are examples of how the hyperparameter tuning jobs can be triggered.
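To make "tuning over a predefined parameter dictionary" concrete, the sketch below enumerates every combination in a small grid. It assumes an exhaustive grid search, which the text above does not confirm, and the dictionary itself is hypothetical: the keys are common LightGBM options, the values are illustrative only.

```python
from itertools import product

# Hypothetical grid; not the dictionary used by run_benchmarks_hyper.py.
param_grid = {
    "num_leaves": [15, 31],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 200],
}


def iter_param_combinations(grid):
    """Yield one dict per point of the Cartesian product of the grid values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


combos = list(iter_param_combinations(param_grid))
print(len(combos))   # number of candidate settings to evaluate
print(combos[0])     # first candidate
```

Each candidate would be fitted and scored once per dataset (full and clustered), which is why tuning takes noticeably longer than a single training run.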

For stock technologies
```shell
python ./src/run_benchmarks_hyper.py -l ./logs/stock_hyper.log
```

In [None]:
log='./logs/stock_hyper.log'
!rm $log 2>/dev/null
%run './src/run_benchmarks_hyper.py' '-l' $log

b_tools.parse_logs(log, jsonfile)

The following is a brief description of the outputs of clustering/training/hyperparameter tuning portion of the pipeline:

#### **Expected Input Output for Training/Hyperparameter Tuning**

**Input:**

| **Section**                   | **Expected Input**                   
| :---                          | :---                                  
| Clustering                    | Portion of the Feature data which is dedicated to the training component of the ML pipeline                            
| Training/Hyperparameter tuning | Feature data post clustering as well as the full training feature data along with the respective labels.

<br>

**Output:**

| **Section**                   | **Expected Output**                   | **Comment**
| :---                          | :---                                  | :--- 
| Clustering                    | Cluster id to which each data row is assigned (-1, 0, 1...)                                 | The cluster output is not saved to an output file but appended to the dataframe as a column. The cluster column is then used to filter the data to maximize the proportion of fraudulent data. The filtered data is subsequently used for model training
| Training/Hyperparameter tuning | Model pkl files pertaining to training using full/clustered data <br> **Clustered_LGBM_Classifier.pkl** <br> **Full_LGBM_Classifier.pkl**     | The pkl files are saved as output in the parent directory

### **Model Inference - Batch**

The saved models can then be used for batch inference. For this purpose, we will execute the `run_benchmarks_predict.py` script. It takes the following arguments:

```shell
usage: run_benchmarks_predict.py [-l LOGFILE] [-i] [-mc clusteredmodel] [-mf fullmodel] [-s]

optional arguments:
  -l  --logfile             log file to output benchmarking results to
  -i, --intel               use Intel optimized libraries where available
  -mc --clusteredmodel      pkl file of model created using clustered data
  -mf --fullmodel           pkl file of model created using full data
  -s  --streaming           run streaming inference if true
```
To run with stock technologies, we would run:

```shell
python ./src/run_benchmarks_predict.py -mc Clustered_LGBM_Classifier.pkl -mf Full_LGBM_Classifier.pkl -l ./logs/stock_batch.log
```

In [None]:
log='./logs/stock_batch.log'
!rm $log 2>/dev/null
%run './src/run_benchmarks_predict.py' '-mc' 'Clustered_LGBM_Classifier.pkl' '-mf' 'Full_LGBM_Classifier.pkl' '-l' $log

b_tools.parse_logs(log, jsonfile)

For streaming inference execute the following command:
```shell
python ./src/run_benchmarks_predict.py -s -mc Clustered_LGBM_Classifier.pkl -mf Full_LGBM_Classifier.pkl -l ./logs/stock_streaming.log
```
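The streaming benchmark times repeated single-row predictions and reports the average. A minimal sketch of that measurement follows; the helper `average_single_row_latency`, the stand-in `predict` function and the choice to reuse one randomly selected row for all calls are assumptions for illustration, not the script's implementation.

```python
import random
import time


def average_single_row_latency(predict, rows, n_calls=1000, seed=0):
    """Average wall-clock seconds per call when predicting one row at a time.

    Assumption: a single row is picked at random and reused for every call.
    """
    rng = random.Random(seed)
    row = rng.choice(rows)
    start = time.perf_counter()
    for _ in range(n_calls):
        predict([row])  # batch of size one simulates a streaming request
    return (time.perf_counter() - start) / n_calls


# Stand-in model: flags a row as fraudulent when its feature sum exceeds 1.0.
def predict(batch):
    return [int(sum(features) > 1.0) for features in batch]


rows = [[0.1, 0.2], [0.9, 0.8], [0.3, 0.3]]
avg_seconds = average_single_row_latency(predict, rows)
print(f"average seconds per prediction: {avg_seconds:.2e}")
```

Averaging over many calls smooths out timer resolution and one-off startup costs, which matter when a single prediction takes microseconds.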

In [None]:
log='./logs/stock_streaming.log'
!rm $log 2>/dev/null
%run './src/run_benchmarks_predict.py' '-s' '-mc' 'Clustered_LGBM_Classifier.pkl' '-mf' 'Full_LGBM_Classifier.pkl' '-l' $log

b_tools.parse_logs(log, jsonfile)

#### **Expected Input and Output for Inference**

**Input:**

| **Section**                   | **Expected Input**                   
| :---                          | :---                                  
| Batch Prediction                   | Portion of the data which is dedicated to testing. The dataset is also duplicated & shuffled to investigate if behavior changes with size. Corresponding labels are also passed as input.
| Streaming Prediction               | Similar to batch prediction, but inference is run over a randomly selected single row multiple times to simulate inference on streaming data.

<br>

**Output:**

| **Section**                   | **Expected Output**                   | **Comment**
| :---                          | :---                                  | :--- 
| Batch  Prediction                       | Array of prediction classes of whether a transaction is legitimate or fraudulent (0 for legitimate and 1 for fraudulent). Inference is run for both models, i.e. trained using clustered data as well as full data.                                 | The array is used to calculate f1_scores as well as confusion matrices for the respective models. This will help us compare the performance of the two models. The f1_score is written to the log file and the confusion matrix is saved as a png file in the working directory
| Streaming Prediction | Prediction class for a single transaction (0 for legitimate and 1 for fraudulent)     | Primary objective of running streaming inference is to benchmark time taken for prediction. Average time for a single prediction (over 1000 rows) is written to the log file.
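
As a reminder of how the f1_score and confusion matrix follow from the prediction array, here is a dependency-free sketch for the binary case (0 = legitimate, 1 = fraudulent). The helper names and the toy arrays are hypothetical; the benchmarking script may rely on a library implementation instead.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 means fraudulent."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def f1_from_predictions(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    _, fp, fn, tp = confusion_counts(y_true, y_pred)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy labels and predictions, for illustration only.
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]

counts = confusion_counts(y_true, y_pred)
score = f1_from_predictions(y_true, y_pred)
print(counts, round(score, 4))
```

F1 ignores true negatives, which makes it more informative than plain accuracy on a dataset where legitimate transactions vastly outnumber fraudulent ones.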

### **Intel® Clustering + Training/Hyperparameter Tuning**

Before running the Intel® instructions from the Notebook, set the kernel to **[conda env:FraudDetection_intel]**. You can either run the following cell (it will keep showing `*`, but you can continue with the process) or switch manually by selecting **Kernel > Change kernel > Python [conda env:FraudDetection_intel]** in the Notebook’s top menu bar. Once the correct kernel is selected, its name appears on the right side of the Notebook menu bar.

In [None]:
%%javascript
Jupyter.notebook.session.restart({kernel_name: "conda-env-FraudDetection_intel-py"})
element.text('Intel kernel loaded. You can continue executing the next cells.') 

To run the Intel® instructions from a terminal instead of the Notebook, follow the commands in the **Setting up Intel Environment** section of the [README.md](README.md#setting-up-intel-environment) file and activate the **FraudDetection_intel** environment. Then copy into the terminal the commands from this Notebook's markdown sections that run the scripts with the Python interpreter, e.g. `python ./src/run_benchmarks_train.py -i -l ./logs/intel_training.log`.

In [None]:
#Run to initialize libs and files
import sys
import os
sys.path.insert(0, './src')
from src import benchmark_tools as b_tools

jsonfile = './jsons/intel.json'
clusteredmodel="Clustered_LGBM_Classifier.pkl"
fullmodel="Full_LGBM_Classifier.pkl"

#Erase time values from intel json
b_tools.init_json(jsonfile)

#Remove files
if os.path.exists(clusteredmodel):
    os.remove(clusteredmodel)

if os.path.exists(fullmodel):
    os.remove(fullmodel)

The only change from the stock training command is the added `-i` argument, which enables the Intel-optimized packages; for training/hyperparameter tuning this is Intel® Extension for Scikit-learn. To run with Intel technologies, logging the performance to `logs`, we would run (after activating the Intel environment):

```shell
python ./src/run_benchmarks_train.py -i -l ./logs/intel_training.log
```
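One common way a flag such as `-i` can switch on Intel® Extension for Scikit-learn is to call its `patch_sklearn()` function before any scikit-learn estimator is imported. The sketch below shows that pattern with a graceful fallback; the helper name `enable_intel_extensions` is hypothetical, and whether the benchmark scripts do it this way is an assumption.

```python
def enable_intel_extensions(use_intel):
    """Patch scikit-learn with Intel Extension for Scikit-learn if requested.

    Returns True when the patch was applied, False when it was not requested
    or the sklearnex package is not installed in the active environment.
    """
    if not use_intel:
        return False
    try:
        from sklearnex import patch_sklearn
    except ImportError:
        return False
    # Must run before importing estimators such as sklearn.cluster.DBSCAN,
    # so that the accelerated implementations are picked up.
    patch_sklearn()
    return True


patched = enable_intel_extensions(use_intel=False)
print("Intel extension active:", patched)
```

With this pattern the rest of the pipeline stays unchanged, because the patched estimators keep the regular scikit-learn interface.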

In [None]:
log='./logs/intel_training.log'
!rm $log 2>/dev/null
%run './src/run_benchmarks_train.py' '-i' '-l' $log

b_tools.parse_logs(log, jsonfile)

For hyperparameter tuning, execute the following command:

```shell
python ./src/run_benchmarks_hyper.py -i -l ./logs/intel_hyper.log
```

In [None]:
log='./logs/intel_hyper.log'
!rm $log 2>/dev/null
%run './src/run_benchmarks_hyper.py' '-i' '-l' $log

b_tools.parse_logs(log, jsonfile)

### **Model Inference**

In the Intel® environment, model inference uses the daal4py module to convert the existing LGBM model into an optimized version. The optimized model is then used for batch/streaming prediction. This daal4py version of the model is key to the solution because it gives faster inference times, as the plots in the results section show.

For batch inference, execute the following command:
```shell
python ./src/run_benchmarks_predict.py -i -mc Clustered_LGBM_Classifier.pkl -mf Full_LGBM_Classifier.pkl -l ./logs/intel_batch.log
```

In [None]:
log='./logs/intel_batch.log'
!rm $log 2>/dev/null
%run './src/run_benchmarks_predict.py' '-i' '-mc' $clusteredmodel '-mf' $fullmodel '-l' $log

b_tools.parse_logs(log, jsonfile)

For streaming inference, execute the following command:
```shell
python ./src/run_benchmarks_predict.py -s -i -mc Clustered_LGBM_Classifier.pkl -mf Full_LGBM_Classifier.pkl -l ./logs/intel_streaming.log
```

In [None]:
log='./logs/intel_streaming.log'
!rm $log 2>/dev/null
%run './src/run_benchmarks_predict.py' '-s' '-i' '-mc' $clusteredmodel '-mf' $fullmodel '-l' $log

b_tools.parse_logs(log, jsonfile)

## **Comparing Performance Benefits**

In this section, we illustrate the benchmarking results comparing the Intel-optimized libraries with the stock alternative, as well as the performance of the two LGBM models (one trained using post-clustering data and one trained using the full dataset). The data for the comparison is obtained from the log files generated when executing the previous cells of the Notebook.
Note that both the Intel® and Stock instructions must already have been executed to obtain the results in the final cell.
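
A speed comparison of this kind usually reduces to the ratio of stock time to Intel time per pipeline stage. The sketch below shows that arithmetic only: the flat JSON layout, the stage names and the numbers are made-up placeholders, not measurements and not the actual structure of `stock.json` or `intel.json`.

```python
import json


def speedups(stock_times, intel_times):
    """Return stock/intel time ratios for stages present in both mappings."""
    return {
        stage: stock_times[stage] / intel_times[stage]
        for stage in stock_times
        if stage in intel_times and intel_times[stage] > 0
    }


# Placeholder timings in seconds (illustrative values, not benchmark results).
stock = json.loads('{"training": 12.0, "batch_inference": 3.0}')
intel = json.loads('{"training": 4.0, "batch_inference": 1.5}')

ratios = speedups(stock, intel)
print(ratios)
```

A ratio above 1 means the Intel-optimized run was faster for that stage; stages missing from either file are skipped rather than guessed.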

In [None]:
#To see the speed comparisons between the Stock and Intel versions,
#run this cell once both the Stock and Intel sections have been run.
import sys

sys.path.insert(0, './src')
from src import benchmark_tools as b_tools
b_tools.plot_log_times('./jsons/stock.json', './jsons/intel.json')