## **Reference Implementation**

### ***E2E Architecture***

![Use_case_flow](assets/workflow.png)


### Set Up Environment

Use the following cell to change to the correct kernel. Then check that you are in the `intelligent_indexing_intel` kernel. If not, navigate to `Kernel > Change kernel > Python [conda env:intelligent_indexing_intel]`.

### Run Workflow

#### Setting up the data

The benchmarking script expects two files to be present in `"$DATA_DIR"/huffpost`:

* `"$DATA_DIR"/huffpost/train_all.csv`: training data
* `"$DATA_DIR"/huffpost/test.csv`: testing data

After downloading the data for benchmarking according to these requirements, do the following:

* Use the `process_data.py` script to generate the `huffpost/train_all.csv` and `huffpost/test.csv` files for benchmarking. This script expects `News_Category_Dataset_v3.json` to be present in the same directory.

In [None]:
!cd "$DATA_DIR" && python process_data.py
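
Once the script has finished, it can be useful to confirm that both expected files exist before moving on. The following is a minimal sketch, not part of the original workflow: the helper name `missing_benchmark_files` is hypothetical, and it assumes `DATA_DIR` is exported in the notebook's environment (it falls back to the current directory otherwise).

```python
import os
from pathlib import Path

# File names taken from the requirements above; the helper itself is hypothetical.
EXPECTED_FILES = ("train_all.csv", "test.csv")


def missing_benchmark_files(data_dir):
    """Return the expected huffpost files that are absent under data_dir."""
    huffpost_dir = Path(data_dir) / "huffpost"
    return [name for name in EXPECTED_FILES if not (huffpost_dir / name).is_file()]


# Assumption: DATA_DIR is set in the environment; fall back to "." if it is not.
data_dir_to_check = os.environ.get("DATA_DIR", ".")
missing = missing_benchmark_files(data_dir_to_check)
print("missing files:", missing if missing else "none")
```

An empty result means both CSV files are in place; any listed name suggests that `process_data.py` did not complete or ran in a different directory.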

Read the environment variables for use in the Python code.

In [None]:
import os

# WORKSPACE must be set in the environment; os.environ raises KeyError if it is missing.
workspace = os.environ["WORKSPACE"]
data_dir = os.path.join(workspace, "data")
output_dir = os.path.join(workspace, "output")
print(f"workspace path: {workspace}")
print(f"data dir path: {data_dir}")
print(f"output dir path: {output_dir}")

View a few samples of the created data.

In [None]:
import pandas as pd
train_all = pd.read_csv(f"{data_dir}/huffpost/train_all.csv")
train_all.head(10)

In [None]:
test = pd.read_csv(f"{data_dir}/huffpost/test.csv")
test.head(10)

All of the benchmarking can be run using the Python script `src/run_benchmarks.py`.

The script **reads and preprocesses the data**, **trains an SVC model**, and **predicts on unseen test data** using the trained model, while also reporting on the execution time for these 3 steps.
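
To illustrate how per-step execution times like these can be collected, here is a minimal sketch using only the standard library. It is not the code from `run_benchmarks.py`: the `timed` helper is hypothetical, and the three workloads are trivial placeholders standing in for preprocessing, SVC training, and prediction.

```python
import time
from contextlib import contextmanager

timings = {}  # step name -> elapsed wall-clock seconds


@contextmanager
def timed(step):
    """Record the wall-clock duration of the enclosed block under `step`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = time.perf_counter() - start


# Placeholder workloads (assumptions for illustration, not the real pipeline).
with timed("preprocessing"):
    rows = [text.lower().split() for text in ["A headline", "Another headline"]]

with timed("training"):
    vocabulary = sorted({token for row in rows for token in row})

with timed("inference"):
    hits = [token in vocabulary for token in "a new headline".split()]

# Sum the three step timings before adding the total to the same dict.
timings["total"] = sum(timings.values())
for step, seconds in timings.items():
    print(f"{step}: {seconds:.6f} s")
```

Using `time.perf_counter()` rather than `time.time()` gives a monotonic, high-resolution clock, which is the usual choice for benchmarking short steps.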

> Before running the script, we need to ensure that the appropriate conda environment is activated.

The `run_benchmarks.py` script takes the following arguments:

```shell
usage: run_benchmarks.py [-h] [-l LOGFILE] [-p] [-s SAVE_MODEL_DIR]

optional arguments:
  -h, --help            show this help message and exit
  -l LOGFILE, --logfile LOGFILE
                        log file to output benchmarking results to
  -p, --preprocessing_only
                        only perform preprocessing step
  -s SAVE_MODEL_DIR, --save_model_dir SAVE_MODEL_DIR
                        directory to save model to
```
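
For readers who want to see how a command-line interface like this is typically declared, the following `argparse` sketch reproduces the flags from the help text above. It is a reconstruction from the usage message only: the defaults (`None` for the two path options) and the function name `build_parser` are assumptions, and the actual script may define its parser differently.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented run_benchmarks.py flags."""
    parser = argparse.ArgumentParser(prog="run_benchmarks.py")
    parser.add_argument(
        "-l", "--logfile", type=str, default=None,  # default is an assumption
        help="log file to output benchmarking results to",
    )
    parser.add_argument(
        "-p", "--preprocessing_only", action="store_true",
        help="only perform preprocessing step",
    )
    parser.add_argument(
        "-s", "--save_model_dir", type=str, default=None,  # default is an assumption
        help="directory to save model to",
    )
    return parser


# Parse an explicit argument list so the example also runs inside a notebook.
args = build_parser().parse_args(["-l", "intel.log", "-p"])
print(args.logfile, args.preprocessing_only, args.save_model_dir)
# prints: intel.log True None
```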

To run with Intel® technologies and log the performance to `"$OUTPUT_DIR"/logs/intel.log`, first create the logs directory:

In [None]:
!mkdir -p "$OUTPUT_DIR"/logs/

Then execute the Python script `src/run_benchmarks.py` and save the logs to the `"$OUTPUT_DIR"/logs/intel.log` file.

In [None]:
!cd src && python run_benchmarks.py -l "$OUTPUT_DIR"/logs/intel.log

Inspect the generated log file and check the `Test Accuracy`, `Training Time`, `Inference Time`, and `Total time` of the workflow:



In [None]:
!tail "$OUTPUT_DIR"/logs/intel.log
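
If the log is long, the metric lines can also be filtered from Python. The sketch below makes no assumption about the exact log format: it simply returns every line that mentions one of the four labels named above, matched case-insensitively. The helper name `metric_lines` is hypothetical, and the path assumes `OUTPUT_DIR` is exported in the environment.

```python
import os

# Labels taken from the text above; matching is case-insensitive.
METRIC_LABELS = ("Test Accuracy", "Training Time", "Inference Time", "Total time")


def metric_lines(log_path, labels=METRIC_LABELS):
    """Return the log lines that mention any of the given labels."""
    wanted = [label.lower() for label in labels]
    with open(log_path, encoding="utf-8") as handle:
        return [
            line.rstrip("\n")
            for line in handle
            if any(label in line.lower() for label in wanted)
        ]


# Assumption: OUTPUT_DIR is set; the block is skipped if the log does not exist.
log_path = os.path.join(os.environ.get("OUTPUT_DIR", "."), "logs", "intel.log")
if os.path.exists(log_path):
    for line in metric_lines(log_path):
        print(line)
```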

#### Clean Up Workspace

Follow these steps to restore your ``$WORKSPACE`` directory to its initial state. Note that all downloaded datasets, workflow files, and logs created by this Jupyter Notebook will be deleted. Back up your important files before executing the next cell.

In [None]:
!cd "$DATA_DIR" && rm -r huffpost News_Category_Dataset_v3.json && rm -r "$OUTPUT_DIR/logs"