## **Reference Implementation**

### ***E2E Architecture***

![Use_case_flow](assets/e2e_flow.png)


### ***Solution setup***

Run the following cell to switch to the correct kernel, then check that you are in the `networkintrusiondetection_stock` kernel. If not, navigate to `Kernel > Change kernel > Python [conda env:networkintrusiondetection_stock]`. The cell will keep showing `*` as its execution indicator, but you can continue running the cells that follow.

In [None]:
%%javascript
Jupyter.notebook.session.restart({kernel_name: 'conda-env-networkintrusiondetection_stock-py'})

Download the `2021.02.17.csv` file from https://www.kaggle.com/datasets/mryanm/luflow-network-intrusion-detection-data-set
and save it in the `data` folder.

Once we have downloaded the data, we can view a few samples.

In [None]:
import pandas as pd
data = pd.read_csv("data/2021/02/2021.02.17/2021.02.17.csv")
data.tail()

### ***Solution implementation***

#### **Dataset Preprocessing**

To remove rows with empty values from the downloaded CSV file, run the script below:

```shell
python src/data_prep.py [-i inputfile] [-o outputfilepath]  
```
For example:
```shell
conda activate networkintrusiondetection_stock
python src/data_prep.py -i data/2021.02.17.csv
```
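
The internals of `src/data_prep.py` are not shown here. As a rough sketch of the cleaning step, assuming it amounts to dropping every row that has a missing value with pandas, the logic could look like the following. The function name `drop_empty_rows` and the sample columns are hypothetical.

```python
import io

import pandas as pd


def drop_empty_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without rows that contain any missing value."""
    return df.dropna(axis=0, how="any").reset_index(drop=True)


# Tiny in-memory CSV standing in for the downloaded file (made-up columns).
raw_csv = io.StringIO(
    "bytes_in,bytes_out,label\n"
    "10,20,benign\n"
    ",5,malicious\n"
    "7,,benign\n"
    "3,4,outlier\n"
)
raw = pd.read_csv(raw_csv)
clean = drop_empty_rows(raw)
print(f"rows before: {len(raw)}, rows after: {len(clean)}")
```

Resetting the index keeps row positions contiguous after the drop, which avoids surprises if later steps slice by position.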

In [None]:
%%javascript
Jupyter.notebook.session.restart({kernel_name: 'conda-env-networkintrusiondetection_stock-py'})

In [None]:
!python src/data_prep.py -i data/2021/02/2021.02.17/2021.02.17.csv

Inspect generated data.

In [None]:
import pandas as pd
data = pd.read_csv("data/data.csv")
data.tail()

#### **Model building process**

This Network Intrusion Detection System uses NuSVC from the scikit-learn library to train an AI model and classify the data passed to it. This process is captured in the `run_benchmarks.py` script, which *reads and preprocesses the data* and *performs training, prediction, and hyperparameter tuning analysis on NuSVC*, while also reporting the execution time of each of these steps (we will use this information later when optimizing the implementation for Intel® architecture). The script can also save each of the intermediate models for an in-depth analysis of the quality of fit.

The script takes the following arguments:

```shell
usage: 
python src/run_benchmarks.py [-d DATASET] [-a algorithm] [-l logfile]
optional arguments:
  -l, --logfile,           Log file to output benchmarking results

  -i , --intel,            Use intel accelerated technologies where available
                        
  -t , --hp tuning,         If hyperparameter tuning to be done  

  -a , --algo,             Name of the algorithm to be used (svc,nusvc,lr)  
                      
  -d , --datasetsize,      Dataset size for training

  -c , --inputcsvpath,     Path to the input csv file
```

As an example, the following commands train and save NuSVC models with stock Python and stock technologies for a dataset size of 300K:
```shell
conda activate networkintrusiondetection_stock
python src/run_benchmarks.py -d 300000 --algo nusvc -c data/data.csv
```
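
What the training step roughly involves can be sketched on synthetic data. This is not the code of `run_benchmarks.py`: the dataset from `make_classification`, the train/test split, and the timing approach are all assumptions made for illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import NuSVC

# Synthetic stand-in for the prepared CSV (the real features are not used here).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = NuSVC()  # default hyperparameters; the script's settings may differ

# Time the fit and predict steps separately, as the script is described to do.
start = time.perf_counter()
model.fit(X_train, y_train)
train_seconds = time.perf_counter() - start

start = time.perf_counter()
predictions = model.predict(X_test)
predict_seconds = time.perf_counter() - start

print(f"training time:   {train_seconds:.4f} s")
print(f"prediction time: {predict_seconds:.4f} s")
print(f"accuracy:        {model.score(X_test, y_test):.3f}")
```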

In [None]:
%%javascript
Jupyter.notebook.session.restart({kernel_name: 'conda-env-networkintrusiondetection_stock-py'})

In [None]:
!mkdir -p ./models

In [None]:
!python src/run_benchmarks.py -d 300000  --algo nusvc -c data/data.csv

In a realistic pipeline, this training process would follow the above diagram, adding a human in the loop to judge the quality of the classification solution from each of the saved models/predictions in the `saved_models` directory or, better, while tuning the model. The quality of a classification solution depends heavily on the human analyst, who can not only tune hyperparameters but also modify the features used in order to find better solutions.

#### **Running classification Analysis/Predictions**

To run batch and real-time inference with the stock environment (after creating the appropriate environment as above), we would run:
```shell
python src/inference.py -m models/NuSVC_model.sav -c data/data.csv -d 10000
```
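
The difference between batch and real-time inference can be illustrated with a small sketch. It assumes the `.sav` file is a pickled scikit-learn estimator, which the source does not confirm; the temporary file path and the synthetic data are placeholders.

```python
import os
import pickle
import tempfile
import time

from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
model = NuSVC().fit(X[:200], y[:200])

# Save and reload the model, assuming plain pickle serialization.
model_path = os.path.join(tempfile.mkdtemp(), "example_model.sav")
with open(model_path, "wb") as f:
    pickle.dump(model, f)
with open(model_path, "rb") as f:
    loaded = pickle.load(f)

X_new = X[200:]

# Batch inference: one predict call over all rows.
start = time.perf_counter()
batch_pred = loaded.predict(X_new)
batch_seconds = time.perf_counter() - start

# Real-time style inference: one predict call per incoming row.
start = time.perf_counter()
stream_pred = [loaded.predict(row.reshape(1, -1))[0] for row in X_new]
stream_seconds = time.perf_counter() - start

print(f"batch:   {batch_seconds:.4f} s for {len(X_new)} rows")
print(f"per-row: {stream_seconds:.4f} s for {len(X_new)} rows")
```

Both paths produce the same labels; what differs is the per-call overhead, which is why the two modes are timed separately.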

In [None]:
!python src/inference.py -m models/NuSVC_model.sav -c data/data.csv -d 10000

**Hyperparameter tuning**

***Loop Based Hyperparameter Tuning***

Loop-based tuning calls the fit method repeatedly, each time with a different combination of parameter values, to find the combination that yields the best Silhouette score and thereby a better-performing model.

**Parameters Considered**

| **Parameter** | **Description** | **Values**
| :-- | :-- | :-- 
| `kernel` | Kernel type | poly, rbf
| `gamma` | Kernel coefficient | 1e-4
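
A loop-based search over the values in the table above might be sketched as follows. The scoring here uses held-out accuracy as a stand-in, since how the script computes its score is not shown; the data is synthetic and the variable names are hypothetical.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Values taken from the parameter table.
kernels = ["poly", "rbf"]
gammas = [1e-4]

best_score, best_params, best_model = -1.0, None, None
for kernel, gamma in product(kernels, gammas):
    candidate = NuSVC(kernel=kernel, gamma=gamma)
    candidate.fit(X_train, y_train)
    score = candidate.score(X_val, y_val)  # accuracy, used as a stand-in metric
    print(f"kernel={kernel}, gamma={gamma}: score={score:.3f}")
    if score > best_score:
        best_score, best_params, best_model = score, (kernel, gamma), candidate

print(f"best parameters: kernel={best_params[0]}, gamma={best_params[1]}")
```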

To run hyperparameter tuning with stock Python and stock technologies (after creating the appropriate environment), we would run:
```shell
python src/run_benchmarks.py -t 1 -d 300000 --algo nusvc  -c data/data.csv
```

In [None]:
!python src/run_benchmarks.py -t 1 -d 300000 --algo nusvc  -c data/data.csv

To run batch and real-time inference with the stock environment, using the model saved by hyperparameter tuning in the stock environment (after creating the appropriate environment as above), we would run:
```shell
python src/inference.py --modelpath models/NUSVC_model_hp.sav -c data/data.csv -d 10000
```

In [None]:
!python src/inference.py --modelpath models/NUSVC_model_hp.sav -c data/data.csv -d 10000

### ***Optimized E2E architecture with Intel® oneAPI components***

![Use_case_flow](assets/e2e_flow_optimized.png)

### ***Optimized Solution implementation***

Optimizing the NuSVC solution with Intel® oneAPI is as simple as adding the following lines of code prior to calling the sklearn algorithms:

```python
from sklearnex import patch_sklearn
patch_sklearn()
```
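
One detail worth illustrating is ordering: the patch has to be applied before the scikit-learn estimators are imported. Below is a minimal sketch with a fallback to stock scikit-learn when the extension is not installed. The fallback and the `patched` flag are additions for illustration and are not part of the reference scripts.

```python
# Apply the patch first so later sklearn imports pick up the accelerated versions.
try:
    from sklearnex import patch_sklearn

    patch_sklearn()
    patched = True
except ImportError:
    # Intel Extension for Scikit-learn is not installed; use stock scikit-learn.
    patched = False

# Import estimators only after the patch attempt.
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = NuSVC().fit(X, y)
print(f"patched: {patched}, estimator module: {type(model).__module__}")
```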

**Change kernel to Python [conda env:networkintrusiondetection_intel]**

In [None]:
%%javascript
Jupyter.notebook.session.restart({kernel_name: 'conda-env-networkintrusiondetection_intel-py'})

#### **Model building process with Intel® optimizations**

Passing the `--intel` flag to the `run_benchmarks.py` script enables the Intel® optimizations, so the same training process can be run optimized with Intel® oneAPI. To run training with Intel® Python and Intel® technologies for a dataset size of 300K (after creating the appropriate environment as above), we would run:
```shell
conda activate networkintrusiondetection_intel
python -m sklearnex src/run_benchmarks.py -i 1 -d 300000 --algo nusvc -c data/data.csv
```

In [None]:
!python -m sklearnex src/run_benchmarks.py -i 1 -d 300000 --algo nusvc -c data/data.csv

To run batch and real-time inference with the Intel® environment, using the saved model trained in the Intel® environment (after creating the appropriate environment as above), we would run:
```shell
python -m sklearnex src/inference.py --i 1 --modelpath models/NuSVC_model.sav -c data/data.csv -d 10000
```

In [None]:
!python -m sklearnex src/inference.py --i 1 --modelpath models/NuSVC_model.sav -c data/data.csv -d 10000

**Hyperparameter tuning**

**Parameters Considered**

| **Parameter** | **Description** | **Values**
| :-- | :-- | :-- 
| `kernel` | Kernel type | rbf, poly
| `gamma` | Kernel coefficient | 1e-4

To run hyperparameter tuning with Intel® Python and Intel® technologies (after creating the appropriate environment as above), we would run:
```shell
python -m sklearnex src/run_benchmarks.py -i 1 -t 1 -d 300000 --algo nusvc -c data/data.csv
```

In [None]:
!python -m sklearnex src/run_benchmarks.py -i 1 -t 1 -d 300000 --algo nusvc -c data/data.csv

To run batch and real-time inference with the Intel® environment, using the model saved by hyperparameter tuning in the Intel® environment (after creating the appropriate environment), we would run:
```shell
python -m sklearnex src/inference.py --i 1 --modelpath models/NUSVC_model_hp.sav -c data/data.csv -d 10000
```

In [None]:
!python -m sklearnex src/inference.py --i 1 --modelpath models/NUSVC_model_hp.sav -c data/data.csv -d 10000

### **Performance Experiments**

**Experiment**: The model is trained using `datasetsize=10000` and is then used for inference. This training and inference process is repeated for 3 rounds.
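
The repeat-and-time pattern behind this experiment can be sketched with the standard library alone. The helper `time_rounds` and the placeholder workload are hypothetical; `benchmarking_utils` may be implemented differently.

```python
import statistics
import time
from typing import Callable, List


def time_rounds(task: Callable[[], object], rounds: int = 3) -> List[float]:
    """Run task `rounds` times and return the wall-clock duration of each run."""
    durations = []
    for _ in range(rounds):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations


def placeholder_workload() -> int:
    # Stand-in for a training or inference call.
    return sum(i * i for i in range(100_000))


durations = time_rounds(placeholder_workload, rounds=3)
print(f"per-round seconds: {[round(d, 4) for d in durations]}")
print(f"mean seconds:      {statistics.mean(durations):.4f}")
```

Repeating the run and averaging smooths out one-off effects such as cold caches on the first round.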

**Change kernel to Python [conda env:networkintrusiondetection_stock]**

In [None]:
%%javascript
Jupyter.notebook.session.restart({kernel_name: 'conda-env-networkintrusiondetection_stock-py'})

First, we make sure there are no logs left over from previous runs.

In [None]:
!rm -rf ./logs

Run training with stock technologies

In [None]:
import os
import sys

# Remember the notebook's working directory the first time a cell like this runs.
if 'workbookDir' not in globals():
    workbookDir = os.getcwd()

from notebooks.utils import benchmarking_utils
benchmarking_utils.run_training_benchmark(intel=False, iterations=3)

Run inference with stock technologies

In [None]:
import os
import sys

# Remember the notebook's working directory the first time a cell like this runs.
if 'workbookDir' not in globals():
    workbookDir = os.getcwd()

from notebooks.utils import benchmarking_utils
benchmarking_utils.run_inference_benchmark(intel=False, iterations=3)

**Change kernel to Python [conda env:networkintrusiondetection_intel]**

In [None]:
%%javascript
Jupyter.notebook.session.restart({kernel_name: 'conda-env-networkintrusiondetection_intel-py'})

Run training with Intel® technologies

In [None]:
import os
import sys

# Remember the notebook's working directory the first time a cell like this runs.
if 'workbookDir' not in globals():
    workbookDir = os.getcwd()

from notebooks.utils import benchmarking_utils
benchmarking_utils.run_training_benchmark(intel=True, iterations=3)

Run inference with Intel® technologies

In [None]:
import os
import sys

# Remember the notebook's working directory the first time a cell like this runs.
if 'workbookDir' not in globals():
    workbookDir = os.getcwd()

from notebooks.utils import benchmarking_utils
benchmarking_utils.run_inference_benchmark(intel=True, iterations=3)

Now, we can create tables and graphs to illustrate the performance benefits in training and inference.
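
A common way to summarize such a comparison is a speedup ratio. The sketch below shows the arithmetic only: the helper `speedup` is hypothetical, and the timings are placeholder values, not measurements from this workload.

```python
from statistics import mean
from typing import Sequence


def speedup(stock_seconds: Sequence[float], intel_seconds: Sequence[float]) -> float:
    """Return mean stock time divided by mean Intel time (above 1 means faster)."""
    return mean(stock_seconds) / mean(intel_seconds)


# Placeholder timings for three rounds; these are NOT measured results.
example_stock = [2.0, 2.0, 2.0]
example_intel = [1.0, 1.0, 1.0]
print(f"speedup: {speedup(example_stock, example_intel):.2f}x")
```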

Training performance

In [None]:
import os
import sys
if 'workbookDir' not in globals():
    workbookDir = os.getcwd()
from notebooks.utils import benchmarking_utils
benchmarking_utils.print_training_benchmark_table()

In [None]:
import os
import sys
if 'workbookDir' not in globals():
    workbookDir = os.getcwd()
from notebooks.utils import benchmarking_utils
benchmarking_utils.print_training_benchmark_bargraph()

Inference performance

In [None]:
import os
import sys
if 'workbookDir' not in globals():
    workbookDir = os.getcwd()
from notebooks.utils import benchmarking_utils
benchmarking_utils.print_inference_benchmark_table()

In [None]:
import os
import sys
if 'workbookDir' not in globals():
    workbookDir = os.getcwd()
from notebooks.utils import benchmarking_utils
benchmarking_utils.print_inference_benchmark_bargraph()