## Reference Implementation

### E2E Architecture

![Use_case_flow](assets/e2e-flow-orig.png)


### Solution setup
Use the following cell to change to the correct kernel. Then check that you are in the `stock` kernel. If not, navigate to `Kernel > Change kernel > Python [conda env:stock]`. Note that the cell will keep showing `*` (as if still running), but you can continue running the following cells.

In [None]:
%%javascript
Jupyter.notebook.session.restart({kernel_name: 'conda-env-cust_seg_stock-py'})

### Reference Implementation

#### Model Building Process

This customer segmentation approach uses KMeans and DBSCAN from the scikit-learn library to train an AI model and generate cluster labels for the input data. This process is captured within the `hyperparameter_cluster_analysis.py` script. The script *reads and preprocesses the data* and *performs hyperparameter cluster analysis on either KMeans or DBSCAN*, while also reporting the execution time of the preprocessing and hyperparameter cluster analysis steps (we will use this information later when optimizing the implementation for Intel® architecture). The script can also save each of the intermediate models and cluster labels for an in-depth analysis of the quality of fit.

The script takes the following arguments:

```shell
usage: hyperparameter_cluster_analysis.py [-h] [-l LOGFILE] [-i] [--use_small_features] [-r REPEATS] [-a {kmeans,dbscan}]
                         [--save_model_dir SAVE_MODEL_DIR]

optional arguments:
  -h, --help            show this help message and exit
  -l LOGFILE, --logfile LOGFILE
                        log file to output benchmarking results to
  -i, --intel           use intel technologies where available
  --use_small_features  use 3 features instead of 21
  -r REPEATS, --repeats REPEATS
                        number of times to clone the data
  -a {kmeans,dbscan}, --algo {kmeans,dbscan}
                        clustering algorithm to use
  --save_model_dir SAVE_MODEL_DIR
                        directory to save ALL models if desired
```
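The usage text above corresponds to an `argparse` parser along the following lines. This is a minimal reconstruction from the help output, not the script's actual source; the function name `build_parser` and the default values (`None`, `1`, `"kmeans"`) are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the CLI from the usage text; all defaults are assumptions."""
    parser = argparse.ArgumentParser(prog="hyperparameter_cluster_analysis.py")
    parser.add_argument("-l", "--logfile", type=str, default=None,
                        help="log file to output benchmarking results to")
    parser.add_argument("-i", "--intel", action="store_true",
                        help="use intel technologies where available")
    parser.add_argument("--use_small_features", action="store_true",
                        help="use 3 features instead of 21")
    parser.add_argument("-r", "--repeats", type=int, default=1,
                        help="number of times to clone the data")
    parser.add_argument("-a", "--algo", choices=["kmeans", "dbscan"],
                        default="kmeans", help="clustering algorithm to use")
    parser.add_argument("--save_model_dir", type=str, default=None,
                        help="directory to save ALL models if desired")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--logfile", "logs/stock.log", "--algo", "kmeans", "--save_model_dir", "saved_models"]
)
print(args.algo, args.intel, args.save_model_dir)
```

Boolean switches such as `--intel` are modeled with `action="store_true"`, so they are off unless passed explicitly.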

As an example of using the script, we can run the following commands:

```shell
conda activate cust_seg_stock
cd src 
python hyperparameter_cluster_analysis.py --logfile logs/stock.log --algo kmeans --save_model_dir saved_models
```

This performs hyperparameter cluster analysis using KMeans on the provided data, saves the resulting models to the `saved_models` directory, and writes performance logs to the `logs/stock.log` file. More generally, the script creates KMeans or DBSCAN models on 21 features, scanning across various hyperparameters (such as `n_clusters` for KMeans and `min_samples` for DBSCAN) and saving a model at each hyperparameter setting to the provided `--save_model_dir` directory. These saved models can be further analyzed as described below in [Running Cluster Analysis/Predictions](#running-cluster-analysis-predictions).

In a realistic pipeline, this process would follow the above diagram, either adding a human in the loop to judge the quality of the clustering solution at each hyperparameter setting, or adding a heuristic measure to quantify cluster quality. Here, we do not implement a clustering quality metric; instead, we save the trained models and predictions in the `--save_model_dir` directory at each hyperparameter setting for future analysis and cluster comparisons.

As an example of a possible clustering metric, Silhouette analysis is often used for KMeans to help select the number of clusters. See [here](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html) for further implementation details. This could be added to the above script as a rough heuristic that only saves models whose score exceeds a chosen threshold.
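A silhouette-based filter of this kind might look like the sketch below. It is not part of `hyperparameter_cluster_analysis.py`: the synthetic data, the threshold `MIN_SCORE`, and the `kept` dictionary are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the preprocessed customer data (assumption):
# three tight, well-separated groups in two dimensions.
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5.0, -5.0), (0.0, 0.0), (5.0, 5.0)],
    cluster_std=0.5,
    random_state=0,
)

MIN_SCORE = 0.5  # hypothetical threshold; tune it for real data

kept = {}
for n_clusters in (2, 3, 4, 5):
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    score = silhouette_score(X, model.labels_)
    if score >= MIN_SCORE:
        # A real pipeline would save the model here instead of keeping it in memory.
        kept[n_clusters] = (model, score)

# Pick the retained setting with the highest silhouette score.
best_k = max(kept, key=lambda k: kept[k][1])
print(f"kept n_clusters={sorted(kept)}, best={best_k}")
```

Silhouette scores range from -1 to 1, with higher values indicating tighter, better-separated clusters, so a threshold only makes sense relative to scores observed on your own data.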

In [None]:
!cd src && python hyperparameter_cluster_analysis.py --logfile logs/stock.log --algo kmeans --save_model_dir saved_models

In [None]:
!cd src && cat logs/stock.log

#### Running Cluster Analysis/Predictions

The above script trains and saves models at different hyperparameter configurations for KMeans or DBSCAN. In addition to saving the models to `save_model_dir`, the script also saves the following files for each hyperparameter configuration:

1. `save_model_dir/data.csv` - preprocessed data file 
2. `save_model_dir/{algo}/model_{hyperparameters}.pkl` - trained model file 
3. `save_model_dir/{algo}/pred_{hyperparameters}.txt` - cluster labels for each datapoint in the data file

These files can be used to analyze each of the clustering solutions generated from the hyperparameter cluster analysis.  An example snippet of how you can load the saved model files and the predictions for further analysis is:

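```python
import joblib
import pandas as pd

# Replace {hyperparameters} with the suffix of the configuration you want to inspect.
model = joblib.load("saved_models/kmeans/model_{hyperparameters}.pkl")
data = pd.read_csv("saved_models/kmeans/data.csv")
# read_csv takes `header` (not `headers`); None means the file has no header row.
cluster_labels = pd.read_csv("saved_models/kmeans/preds_{hyperparameters}.txt", header=None)
```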
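One simple first analysis is to count how many points fall into each cluster. The sketch below uses only the standard library and assumes the predictions file holds one integer label per line; the helper `cluster_sizes` and the demo file name are hypothetical.

```python
import tempfile
from collections import Counter
from pathlib import Path


def cluster_sizes(pred_path):
    """Count points per cluster label, assuming one integer label per line."""
    labels = [int(token) for token in Path(pred_path).read_text().split()]
    return Counter(labels)


# Demo with a throwaway file standing in for a saved predictions file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "preds_demo.txt"
    demo.write_text("0\n1\n1\n2\n1\n0\n")
    sizes = cluster_sizes(demo)

print(sizes.most_common())
```

For DBSCAN output from scikit-learn, the label `-1` marks noise points, so its count shows how much of the data was left unclustered.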

For KMeans, the saved model can be loaded using the `joblib` module and used to predict the cluster label of a new data point.  As an example, this may look like:

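```python
import joblib
kmeans_model = joblib.load("saved_models/kmeans/model_{hyperparameters}.pkl")
# new_X: array-like of shape (n_samples, n_features), preprocessed the same way
# as the training data and with the same feature columns.
kmeans_model.predict(new_X)
```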
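To see the whole round trip in one place, the following sketch fits a small KMeans model on synthetic data, saves it with `joblib`, reloads it, and predicts labels for new points. The data, the file name `model_demo.pkl`, and the choice of two clusters are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated synthetic groups stand in for the preprocessed data.
X = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 3)),
    rng.normal(5.0, 0.1, size=(50, 3)),
])

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model_demo.pkl"  # hypothetical file name
    joblib.dump(KMeans(n_clusters=2, n_init=10, random_state=0).fit(X), model_path)
    kmeans_model = joblib.load(model_path)

# New points must use the same feature columns, in the same order, as the training data.
new_X = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
labels = kmeans_model.predict(new_X)
print(labels)
```

This pattern applies to KMeans only: scikit-learn's `DBSCAN` provides `fit` and `fit_predict` but no `predict` method for unseen points.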

### Optimized E2E Architecture with Intel® oneAPI Components

![Use_case_flow](assets/e2e-flow-optimized.png)



### Solution setup
Use the following cell to change to the correct kernel. Then check that you are in the `intel` kernel. If not, navigate to `Kernel > Change kernel > Python [conda env:intel]`. Note that the cell will keep showing `*` (as if still running), but you can continue running the following cells.

In [None]:
%%javascript
Jupyter.notebook.session.restart({kernel_name: 'conda-env-cust_seg_intel-py'})

#### Model Building Process with Intel® Optimizations

The Intel® optimizations are built into the `hyperparameter_cluster_analysis.py` script and are enabled by adding the `--intel` flag when running the hyperparameter cluster analysis. The same training process can be run, optimized with Intel® oneAPI, as follows:

```shell
conda activate cust_seg_intel
cd src
python hyperparameter_cluster_analysis.py --logfile logs/intel.log --algo kmeans --save_model_dir saved_models --intel
```
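How the script wires up `--intel` internally is not shown here. A common pattern with Intel® Extension for Scikit-learn is to call `patch_sklearn()` from the `sklearnex` package before importing any scikit-learn estimators. The sketch below assumes that pattern; `enable_intel` is a hypothetical helper, and it simply reports `False` when `sklearnex` is not installed.

```python
import importlib.util


def enable_intel(use_intel: bool) -> bool:
    """Patch scikit-learn with Intel optimizations if requested and available.

    Returns True only when the patch was actually applied.
    """
    if not use_intel:
        return False
    if importlib.util.find_spec("sklearnex") is None:
        # Extension not installed: fall back to stock scikit-learn.
        return False
    from sklearnex import patch_sklearn

    patch_sklearn()
    return True


patched = enable_intel(use_intel=True)

# Import estimators only after the patching decision has been made.
print("Intel optimizations active:", patched)
```

The ordering matters: estimators imported before `patch_sklearn()` runs keep their stock implementations.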

In [None]:
!cd src && python hyperparameter_cluster_analysis.py --logfile logs/intel.log --algo kmeans --save_model_dir saved_models --intel

In [None]:
!cd src && cat logs/intel.log

## Performance Observations

To demonstrate how Intel® Extension for Scikit-learn scales, we benchmark a **hyperparameter cluster analysis** under the following data-augmentation transformations:

1. Using 21 features
2. Replicating and jittering the data with noise to have up to 400k rows (depending on algorithm)
   1. KMeans - 40k, 400k samples
   2. DBSCAN - 40k, 60k samples
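The replicate-and-jitter step can be sketched as follows. This is an illustration under assumptions, not the script's code: the helper name `replicate_with_jitter`, the Gaussian noise, and the `noise_scale` value are all hypothetical.

```python
import numpy as np


def replicate_with_jitter(X, repeats, noise_scale=1e-3, seed=0):
    """Stack `repeats` copies of X and add small Gaussian noise to every row."""
    rng = np.random.default_rng(seed)
    tiled = np.tile(X, (repeats, 1))  # shape: (repeats * n_rows, n_features)
    return tiled + rng.normal(0.0, noise_scale, size=tiled.shape)


X = np.arange(12, dtype=float).reshape(4, 3)  # toy data set: 4 rows, 3 features
X_big = replicate_with_jitter(X, repeats=5)
print(X_big.shape)
```

Keeping the noise small relative to the feature scale grows the row count without materially changing the cluster structure.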

We summarize the benchmarking results comparing the Intel® technologies vs the stock alternative on the following tasks:

  1. hyperparameter cluster analysis via KMeans with 21 Features
  2. hyperparameter cluster analysis via DBSCAN with 21 Features

where hyperparameter cluster analysis in this case measures the **total time to generate cluster solutions for the given data at each point on the hyperparameter grid for a given algorithm and data size**.  
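A measurement of this kind can be sketched with `time.perf_counter`. The names `time_grid` and `fake_fit` are hypothetical, and the stand-in fit function sleeps instead of clustering so that the example stays dependency-free.

```python
import time


def time_grid(fit_fn, grid):
    """Return total wall-clock seconds spent calling fit_fn once per grid point."""
    total = 0.0
    for params in grid:
        start = time.perf_counter()
        fit_fn(**params)
        total += time.perf_counter() - start
    return total


def fake_fit(n_clusters, tol):
    # Stand-in for fitting a clustering model; sleeps briefly instead.
    time.sleep(0.01)


grid = [{"n_clusters": k, "tol": 1e-4} for k in (2, 3, 4)]
elapsed = time_grid(fake_fit, grid)
print(f"total: {elapsed:.3f} s")
```

Only the time inside each fit call is accumulated, so data loading and preprocessing can be timed and reported separately.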

The hyperparameters for each algorithm include:

### KMeans
|**n_clusters**  | **tol** |                             
| :---: | :---: |
|2, 3, 4, 5, 10, 15, 20, 25, 30 | 1e-3, 1e-4, 1e-5 |

### DBSCAN
|**min_samples**  | **eps** |                              
| :---: | :---: |
|10, 50, 100             | 0.3, 0.5, 0.7 |
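Assuming the analysis sweeps the full Cartesian product of the values in each table (the tables suggest this, but it is not stated explicitly), the grids can be enumerated as below. The names `KMEANS_GRID`, `DBSCAN_GRID`, and `expand` are hypothetical.

```python
from itertools import product

KMEANS_GRID = {
    "n_clusters": [2, 3, 4, 5, 10, 15, 20, 25, 30],
    "tol": [1e-3, 1e-4, 1e-5],
}
DBSCAN_GRID = {
    "min_samples": [10, 50, 100],
    "eps": [0.3, 0.5, 0.7],
}


def expand(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


kmeans_configs = list(expand(KMEANS_GRID))
dbscan_configs = list(expand(DBSCAN_GRID))
print(len(kmeans_configs), len(dbscan_configs))
```

Under that assumption, each KMeans run fits 9 × 3 = 27 models and each DBSCAN run fits 3 × 3 = 9.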

Noise is added to ensure that no two rows are exactly the same after replication. DBSCAN testing is limited to 60k samples because of memory constraints on the machines used for testing.

We start by removing any previous logs.

In [None]:
!rm -rf logs

We then run scripts to execute the benchmarks. We start with the stock solution, so we first need to change the kernel to the stock environment.

In [None]:
%%javascript
Jupyter.notebook.session.restart({kernel_name: 'conda-env-cust_seg_stock-py'})

In [None]:
!cd src && bash run_exp_stock.sh

Our `run_benchmark` function has two parameters. `intel` toggles the use of the Intel-optimized libraries in the benchmark. `long_test` toggles between the sample sizes mentioned above and a smaller sample size that runs the benchmark much faster.

We now change to our Intel environment and use the `intel` parameter to run the benchmark with Intel optimizations.

In [None]:
%%javascript
Jupyter.notebook.session.restart({kernel_name: 'conda-env-cust_seg_intel-py'})

In [None]:
!cd src && bash run_exp_intel.sh

The performance logs have now been created, and we can use them to create graphs comparing performance.

In [None]:
%%javascript
Jupyter.notebook.session.restart({kernel_name: 'conda-env-cust_seg_stock-py'})

In [None]:
import os
os.chdir(os.path.join(os.getenv('workdir'),'src'))
from notebooks.utils import benchmarking_utils

benchmarking_utils.print_kmeans_plot()

In [None]:
import os
os.chdir(os.path.join(os.getenv('workdir'),'src'))
from notebooks.utils import benchmarking_utils

benchmarking_utils.print_dbscan_plot()

### Notes
***Please see this data set's applicable license for terms and conditions. Intel® does not own the rights to this data set and does not confer any rights to it.***
