# WnD Demo

Recommendation systems drive engagement on many of the most popular online platforms. As the volume of data available to power these systems grows exponentially, users are increasingly turning from traditional machine learning methods to highly expressive deep learning models to improve the quality of recommendations. Google's Wide and Deep (WnD) recommender system is a popular choice for recommendation problems because of its robustness to signal sparsity.
This notebook contains a step-by-step guide to optimizing the WnD model with the Intel® End-to-End AI Optimization Kit, along with a detailed performance analysis.

# Content
* [Model Architecture](#Model-Architecture)
* [Optimizations](#Optimizations)
* [DEMO](#DEMO)

## Model Architecture
<img src="./img/wnd.png" width="800"/>

The Wide and Deep model was published by Google in 2016. It jointly trains wide linear models and deep neural networks, combining the benefits of memorization and generalization for recommender systems. It was among the first models to bring neural networks to click-through-rate (CTR) prediction.

* The wide component is a generalized linear model. Its feature set includes raw input features and transformed features.
* The deep component is a feed-forward neural network. The sparse, high-dimensional categorical features are first converted into embedding vectors and then fed into the hidden layers of the network in the forward pass.
* The wide and deep components are combined using a weighted sum of their output log odds as the prediction, which is fed to a logistic loss function for joint training.
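
The combination described above can be sketched in a few lines of plain Python. This is a toy illustration, not the e2eAIOK or TensorFlow implementation: the function names, the single ReLU hidden layer, and the hand-written weights are all assumptions chosen for brevity.

```python
import math

def wide_logit(features, weights):
    """Generalized linear model over raw and transformed features."""
    return sum(w * x for w, x in zip(weights, features))

def deep_logit(embedding, hidden_weights, output_weights):
    """Feed-forward pass: embedding -> one ReLU hidden layer -> scalar logit."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, embedding)))
              for row in hidden_weights]
    return sum(w * h for w, h in zip(output_weights, hidden))

def predict(wide_x, wide_w, embedding, hidden_w, output_w, bias=0.0):
    """Add the log odds of both components and squash with a sigmoid."""
    logit = (wide_logit(wide_x, wide_w)
             + deep_logit(embedding, hidden_w, output_w)
             + bias)
    return 1.0 / (1.0 + math.exp(-logit))

def logistic_loss(prob, label):
    """Binary cross-entropy for one example; both components share this loss."""
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

# Hypothetical inputs: two wide features and a 2-dimensional embedding.
p = predict(wide_x=[1.0, 0.0], wide_w=[0.5, -0.3],
            embedding=[0.2, 0.4], hidden_w=[[1.0, 0.0], [0.0, 1.0]],
            output_w=[0.5, 0.5])
print(round(p, 4), round(logistic_loss(p, 1), 4))
```

Because a single loss is computed on the summed logit, its gradient reaches the wide weights and the deep weights in the same backward pass, which is what joint training means here.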

## Optimizations

### Distributed Training

Use horovod for distributed training and `mpirun` to launch the training script.

### Model Optimization

Each training step spends a long time idle waiting on horovod communication: parameter synchronization consumes much of the step during distributed training, which causes poor scaling performance. The overhead is mainly caused by the large embedding table.

<img src="./img/wnd_profile.png" width="600"/><figure>Distributed training profiling</figure>

Replacing the custom layer (which contains the embedding layer) with a TensorFlow dense layer reduces the embedding parameter size, and with it the volume of parameters horovod has to synchronize, which fixes the poor scaling. Per-step training time drops from 5.16s to 2.71s, a speedup of about 1.9x.

<img src="./img/wnd_traintime_custom_emd.png" width="600"/><figure>custom layer</figure>
<img src="./img/wnd_traintime_tf_emd.png" width="600"/><figure>TensorFlow built-in layer</figure>
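
The speedup quoted for this change follows directly from the two per-step times. A small sketch of the arithmetic, where the helper name `speedup` is only illustrative:

```python
def speedup(before_s, after_s):
    """Ratio of per-step times; values above 1 mean the new run is faster."""
    return before_s / after_s

step_custom = 5.16   # seconds per step with the custom embedding layer
step_builtin = 2.71  # seconds per step with the TensorFlow built-in layer

ratio = speedup(step_custom, step_builtin)
saved = 1.0 - step_builtin / step_custom
print(f"speedup: {ratio:.2f}x, step time reduced by {saved:.1%}")
```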

### Horovod Optimization With OneCCL

The deep part's embedding table incurs long horovod communication time, and Allgather is the most time-consuming operation. Enabling Intel OneCCL in horovod reduces the time spent in Allgather and delivers a 1.2x speedup.

<img src="./img/wnd_woccl.png" width="600"/><figure>horovod timeline profiling w/o OneCCL</figure>
<img src="./img/wnd_wccl.png" width="600"/><figure>horovod timeline profiling w/ OneCCL</figure>

### Framework Related Optimization

Set the CCL affinity, horovod thread affinity, MPI socket binding, KMP affinity, and `OMP_NUM_THREADS`:

```bash
export CCL_WORKER_COUNT=2 # number of CCL worker threads
export CCL_WORKER_AFFINITY="16,17,34,35" # cores the CCL worker threads are pinned to
export HOROVOD_THREAD_AFFINITY="53,71" # cores the horovod threads are pinned to
export I_MPI_PIN_DOMAIN=socket # socket binding for MPI ranks
export I_MPI_PIN_PROCESSOR_EXCLUDE_LIST="16,17,34,35,52,53,70,71" # cores excluded from MPI rank pinning, including the CCL and horovod cores above

# launch 2 ranks on localhost with socket binding; append the training command to this line
mpirun -genv OMP_NUM_THREADS=16 -map-by socket -n 2 -ppn 2 -hosts localhost -genv I_MPI_PIN_DOMAIN=socket -genv OMP_PROC_BIND=true -genv KMP_BLOCKTIME=1 -genv KMP_AFFINITY=granularity=fine,compact,1,0
```
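
When the launch is scripted, the same `mpirun` line can be assembled programmatically. The sketch below is a hypothetical helper, not part of e2eAIOK: the name `build_mpirun_command` and its parameters are assumptions, it assumes the total rank count is `ppn` times the number of hosts, and the training command passed in is a placeholder.

```python
import shlex

def build_mpirun_command(train_cmd, omp_threads=16, ppn=2, hosts=("localhost",)):
    """Assemble the mpirun argument list shown above for a training command."""
    n_ranks = ppn * len(hosts)  # assumption: every host runs ppn ranks
    return [
        "mpirun",
        "-genv", f"OMP_NUM_THREADS={omp_threads}",
        "-map-by", "socket",
        "-n", str(n_ranks),
        "-ppn", str(ppn),
        "-hosts", ",".join(hosts),
        "-genv", "I_MPI_PIN_DOMAIN=socket",
        "-genv", "OMP_PROC_BIND=true",
        "-genv", "KMP_BLOCKTIME=1",
        "-genv", "KMP_AFFINITY=granularity=fine,compact,1,0",
    ] + list(train_cmd)

# Placeholder training command; real runs pass the script's own arguments.
cmd = build_mpirun_command(["python", "-u", "main.py"])
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run` together with an environment that carries the `CCL_*`, `HOROVOD_THREAD_AFFINITY`, and `I_MPI_PIN_*` variables exported above.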

### Early Stop

The baseline training run finishes at a MAP of 0.6553. With the optimizations to the training process, the model converges faster and reaches 0.6553 MAP at about 1.5K steps, so there is no need to train for 9K steps. Early stopping is therefore enabled at 0.6553 MAP.

<img src="./img/wnd_map_GPU.png"/><figure>baseline metric curve</figure>
<img src="./img/wnd_early_stop_cpu.png"/><figure>optimized metric curve</figure>
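
The early-stop rule amounts to halting at the first evaluation whose MAP reaches the target. A minimal sketch, assuming evaluations arrive as `(step, MAP)` pairs; the function name and the sample history are hypothetical and are not measured results:

```python
def first_step_reaching(history, target_map=0.6553):
    """Return the first step whose validation MAP meets the target, else None.

    `history` is an iterable of (step, map_value) pairs in training order.
    """
    for step, map_value in history:
        if map_value >= target_map:
            return step
    return None

# Hypothetical evaluation history, for illustration only.
history = [(500, 0.6310), (1000, 0.6498), (1500, 0.6553), (2000, 0.6560)]
stop_step = first_step_reaching(history)
print(stop_step)
```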

### Input Pipeline Optimization

Training needs far more system resources than the input pipeline does, and contention between the two for those resources adds overhead. Reducing the resources allocated to the input pipeline frees more for training, and the input pipeline's share of total training time drops from 8.2% to 3.2%.

<img src="./img/wnd_input_pipeline_orig.png" width="600"/><figure>original profiling</figure>
<img src="./img/wnd_input_pipeline_opt.png" width="600"/><figure>optimized profiling</figure>

### HPO With SDA (Smart Democratization Advisor)

SDA configuration:

```
Parameters for SDA auto optimization:
- dnn_hidden_unit1: [64, 128, 256, 512] #layer width of dnn_hidden_unit1
- dnn_hidden_unit2: [64, 128, 256, 512] #layer width of dnn_hidden_unit2
- dnn_hidden_unit3: [64, 128, 256, 512] #layer width of dnn_hidden_unit3
- deep_learning_rate: 0.0001~0.1 #deep part learning rate
- linear_learning_rate: 0.01~1.0 #linear part learning rate
- deep_warmup_epochs: 1~8 #deep part warmup epochs
- deep_dropout: 0~0.5 #deep part dropout
metrics:
- name: training_time # training time threshold
  objective: minimize
  threshold: 1800
- name: MAP # training metric threshold
  objective: maximize
  threshold: 0.6553
metric:
- name: MAP
  threshold: 0.6553
```
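
To make the search space concrete, the sketch below draws one random configuration from the ranges listed above and checks a result against the two thresholds. It is not how SDA chooses candidates: uniform random sampling, the log-uniform treatment of the learning rates, the integer treatment of `deep_warmup_epochs`, and all helper names are assumptions made for illustration.

```python
import math
import random

SEARCH_SPACE = {
    "dnn_hidden_unit1": ("choice", [64, 128, 256, 512]),
    "dnn_hidden_unit2": ("choice", [64, 128, 256, 512]),
    "dnn_hidden_unit3": ("choice", [64, 128, 256, 512]),
    "deep_learning_rate": ("log_uniform", 0.0001, 0.1),
    "linear_learning_rate": ("log_uniform", 0.01, 1.0),
    "deep_warmup_epochs": ("int", 1, 8),
    "deep_dropout": ("uniform", 0.0, 0.5),
}

def sample_config(space, rng):
    """Draw one value per parameter according to its declared kind."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "choice":
            config[name] = rng.choice(spec[1])
        elif kind == "int":
            config[name] = rng.randint(spec[1], spec[2])  # inclusive bounds
        elif kind == "log_uniform":
            low, high = math.log(spec[1]), math.log(spec[2])
            config[name] = math.exp(rng.uniform(low, high))
        else:  # "uniform"
            config[name] = rng.uniform(spec[1], spec[2])
    return config

def meets_thresholds(training_time, map_value, max_time=1800, min_map=0.6553):
    """True when a run satisfies both metric thresholds from the config."""
    return training_time <= max_time and map_value >= min_map

print(sample_config(SEARCH_SPACE, random.Random(0)))
```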

Request suggestions from SDA:

```python
suggestion = self.conn.experiments(self.experiment.id).suggestions().create()
```

<img src="./img/wnd_sda.png" width="600"/><figure>wnd sda</figure>
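
A suggestion is only half of the optimization loop: after training with the suggested parameters, the measured metrics are reported back so the next suggestion can take them into account. The sketch below shows that round trip under stated assumptions: `conn` is assumed to behave like a SigOpt `Connection`, `train_fn` is a hypothetical callable that trains with the suggested assignments and returns `(training_time, MAP)`, and the function name is illustrative rather than taken from e2eAIOK.

```python
def run_suggestion_loop(conn, experiment_id, train_fn, budget=1):
    """Ask for a suggestion, train with it, and report the metrics back.

    Repeats `budget` times and returns (suggestion id, training_time, MAP)
    for each round.
    """
    results = []
    for _ in range(budget):
        # Same call as in the snippet above, without the class wrapper.
        suggestion = conn.experiments(experiment_id).suggestions().create()

        # Hypothetical training hook: returns the two metrics from the config.
        training_time, map_value = train_fn(suggestion.assignments)

        # Report both metrics against the suggestion that produced them.
        conn.experiments(experiment_id).observations().create(
            suggestion=suggestion.id,
            values=[
                {"name": "training_time", "value": training_time},
                {"name": "MAP", "value": map_value},
            ],
        )
        results.append((suggestion.id, training_time, map_value))
    return results
```

Here `budget` plays the role that `observation_budget` has in the e2eAIOK configuration file.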

# DEMO
* [Environment Setup](#Environment-Setup)
* [Enter Docker](#Enter-Docker)
* [Workflow Prepare](#Workflow-Prepare)
* [Data Process](#Data-Process)
* [Launch Training](#Launch-Training)

## Environment Setup
``` bash
# Setup ENV
git clone https://github.com/intel/e2eAIOK.git
cd e2eAIOK
git submodule update --init --recursive
python3 scripts/start_e2eaiok_docker.py -b tensorflow -w ${host0} ${host1} ${host2} ${host3} --proxy ""
```

## Enter Docker
```
sshpass -p docker ssh ${host0} -p 12344
```

## Workflow Prepare

``` bash
# prepare model codes
cd /home/vmagent/app/e2eaiok/modelzoo/WnD/TensorFlow2
bash patch_wnd.patch

# Download Dataset
# download and unzip dataset from https://www.kaggle.com/c/outbrain-click-prediction/data to /home/vmagent/app/dataset/outbrain/orig

# source spark env
source /home/spark-env.sh

# Start services
# only needed if no spark service is running yet; check ${localhost}:8080 to confirm
/home/start_spark_service.sh
```

## Data Process

``` bash
cd /home/vmagent/app/e2eaiok/modelzoo/WnD/TensorFlow2; sh scripts/spark_preproc.sh
```

```
22/10/31 22:02:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/31 22:02:30 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
Drop rows with empty "geo_location"...
Drop rows with empty "platform"...
valid_set_df time: 38.694966077804565
train_set_df time: 42.35809636116028
train/test dataset generation time: 95.60888910293579
22/10/31 22:04:18 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.deb
```

## Launch Training

Edit `conf/e2eaiok_defaults_wnd_example.conf`:

```
### GLOBAL SETTINGS ###
observation_budget: 1
save_path: /home/vmagent/app/e2eaiok/result/
ppn: 2
ccl_worker_num: 2
global_batch_size: 524288
num_epochs: 20
cores: 104
iface: lo
hosts:
- localhost

```

``` bash
cd /home/vmagent/app/e2eaiok; python run_e2eaiok.py --data_path /home/vmagent/app/dataset/outbrain/ --model_name wnd --conf conf/e2eaiok_defaults_wnd_example.conf --no_sigopt
```

```
data format is tfrecords
2022-10-31 22:20:33,833 - E2EAIOK.SDA - INFO - ### Ready to submit current task  ###
2022-10-31 22:20:33,833 - E2EAIOK.SDA - INFO - Model Advisor created
2022-10-31 22:20:33,833 - E2EAIOK.SDA - INFO - model parameter initialized
2022-10-31 22:20:33,833 - E2EAIOK.SDA - INFO - start to launch training
2022-10-31 22:20:33,833 - sigopt - INFO - training launch command: mpirun -genv OMP_NUM_THREADS=24 -map-by socket -n 2 -ppn 2 -hosts localhost -print-rank-map -genv I_MPI_PIN_DOMAIN=socket -genv OMP_PROC_BIND=true -genv KMP_BLOCKTIME=1 -genv KMP_AFFINITY=granularity=fine,compact,1,0 /opt/intel/oneapi/intelpython/latest/envs/tensorflow/bin/python -u /home/vmagent/app/e2eaiok/modelzoo/WnD/TensorFlow2/main.py --results_dir /home/vmagent/app/e2eaiok/result --model_dir /home/vmagent/app/e2eaiok/result/61fab909cb1e8fb00e45984efd42565c --train_data_pattern '/home/vmagent/app/dataset/outbrain/train/part*' --eval_data_pattern '/home/vmagent/app/dataset/outbrain/valid/part*' 

INFO:tensorflow:step: 10, {'binary_accuracy': '0.7393', 'auc': '0.5222', 'loss': '0.6132', 'time': '106.3577'}
INFO:tensorflow:step: 12, {'binary_accuracy_val': 0.80657005, 'auc_val': 0.6279781, 'loss_val': 0.489207, 'map_val': 0.5816698235747075}
INFO:tensorflow:step: 12, {'binary_accuracy_val': 0.80657005, 'auc_val': 0.624491, 'loss_val': 0.4853109, 'map_val': 0.5727072315939128}
INFO:tensorflow:step: 16, {'binary_accuracy_val': 0.80657005, 'auc_val': 0.6489798, 'loss_val': 0.4800695, 'map_val': 0.5976102011193667}
INFO:tensorflow:step: 16, {'binary_accuracy_val': 0.80657005, 'auc_val': 0.6449559, 'loss_val': 0.48311973, 'map_val': 0.5906039729331615}
INFO:tensorflow:step: 20, {'binary_accuracy_val': 0.80656815, 'auc_val': 0.6591209, 'loss_val': 0.47905082, 'map_val': 0.6038914937133115}
INFO:tensorflow:step: 20, {'binary_accuracy_val': 0.80657005, 'auc_val': 0.65709734, 'loss_val': 0.48283547, 'map_val': 0.6003330487114618}
INFO:tensorflow:step: 20, {'binary_accuracy': '0.7852', 'au

FOR DEVS: If you are overwriting _tracking_metadata in your class, this property has been used to save metadata in the SavedModel. The metadta field will be deprecated soon, so please move the metadata to a different file.
INFO:tensorflow:Assets written to: /home/vmagent/app/e2eaiok/result/61fab909cb1e8fb00e45984efd42565c/assets
INFO:tensorflow:Final eval result: {'binary_accuracy_val': 0.80706024, 'auc_val': 0.69316906, 'loss_val': 0.46163183, 'map_val': 0.6270663191442578}
  [n for n in tensors.keys() if n not in ref_input_names])
  [n for n in tensors.keys() if n not in ref_input_names])

FOR DEVS: If you are overwriting _tracking_metadata in your class, this property has been used to save metadata in the SavedModel. The metadta field will be deprecated soon, so please move the metadata to a different file.
INFO:tensorflow:Assets written to: /home/vmagent/app/e2eaiok/result/61fab909cb1e8fb00e45984efd42565c/assets
INFO:tensorflow:Final eval result: {'binary_accuracy_val': 0.8066311,
```
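
To follow how validation MAP evolves, the `step:` lines in the training log can be parsed into `(step, map_val)` pairs. The sketch below is an assumption about how one might do this and is not part of e2eAIOK; the regular expression simply mirrors the line format printed above, and lines that are truncated or carry no `map_val` are skipped.

```python
import ast
import re

# Matches complete "INFO:tensorflow:step: N, {...}" lines only.
STEP_LINE = re.compile(r"^INFO:tensorflow:step: (\d+), (\{.*\})\s*$")

def parse_validation_map(lines):
    """Yield (step, map_val) for each complete validation line."""
    for line in lines:
        match = STEP_LINE.match(line)
        if not match:
            continue  # other messages and truncated lines
        metrics = ast.literal_eval(match.group(2))
        if "map_val" in metrics:
            yield int(match.group(1)), float(metrics["map_val"])

# Three lines copied from the output above: training, validation, truncated.
log = [
    "INFO:tensorflow:step: 10, {'binary_accuracy': '0.7393', 'auc': '0.5222', 'loss': '0.6132', 'time': '106.3577'}",
    "INFO:tensorflow:step: 12, {'binary_accuracy_val': 0.80657005, 'auc_val': 0.6279781, 'loss_val': 0.489207, 'map_val': 0.5816698235747075}",
    "INFO:tensorflow:step: 20, {'binary_accuracy': '0.7852', 'au",
]
points = list(parse_validation_map(log))
print(points)
```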