[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/intel/e2eAIOK/blob/main/demo/builtin/wnd/WND_DEMO.ipynb)

# WnD Demo

Recommendation systems drive engagement on many of the most popular online platforms. As the volume of data available to power these systems grows exponentially, users are increasingly turning from more traditional machine learning methods to highly expressive deep learning models to improve the quality of recommendations. Google's Wide and Deep recommender system is a popular model for recommendation problems for its robustness to signal sparsity.
This notebook contains a step-by-step guide to optimizing the WnD model with the Intel® End-to-End AI Optimization Kit, along with a detailed performance analysis.

# Content
* [Overview](#Overview)
    * [Model Architecture](#Model-Architecture)
    * [Optimizations](#Optimizations)
    * [Performance](#Performance)
* [Getting Started](#Getting-Started)
    * [1. Environment Setup](#1.-Environment-Setup)
    * [2. Workflow Prepare](#2.-Workflow-Prepare)
    * [3. Data Prepare](#3.-Data-Prepare)
    * [4. Train](#4.-Train)

# Overview

## Model Architecture
<img src="./img/wnd.png" width="800"/>

The Wide and Deep model was published by Google in 2016. It jointly trains a wide linear model and a deep neural network, combining the benefits of memorization and generalization for recommender systems. It was among the first models to apply neural networks to click-through rate (CTR) prediction.

* The wide component is a generalized linear model. Its feature set includes raw input features and transformed features.
* The deep component is a feed-forward neural network. The sparse, high-dimensional categorical features are first converted into embedding vectors, which are fed into the hidden layers of the network in the forward pass.
* The wide and deep components are combined using a weighted sum of their output log odds as the prediction, which is fed to a logistic loss function for joint training.
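
As a minimal sketch of how the two components combine, the pure-Python example below adds a wide (linear) logit to a deep (single hidden layer) logit, applies a sigmoid, and evaluates the logistic loss. All weights, feature values, and function names are made up for illustration and are not taken from the e2eAIOK model code; the weighting of the two log odds is folded into each component's output weights.

```python
import math

def sigmoid(z):
    """Map a log-odds value (logit) to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def wide_logit(features, weights, bias):
    """Wide part: a linear model over raw and transformed features."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def deep_logit(category_id, embeddings, hidden_w, hidden_b, out_w, out_b):
    """Deep part: embedding lookup, one ReLU hidden layer, linear output."""
    dense = embeddings[category_id]  # sparse id -> dense vector
    hidden = [
        max(0.0, sum(w * v for w, v in zip(row, dense)) + b)
        for row, b in zip(hidden_w, hidden_b)
    ]
    return sum(w * h for w, h in zip(out_w, hidden)) + out_b

def log_loss(label, prob):
    """Logistic loss for one example with a 0/1 label."""
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

# Made-up parameters, for illustration only.
wide = wide_logit([1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
deep = deep_logit(
    category_id=1,
    embeddings={0: [1.0, 0.0], 1: [0.0, 1.0]},
    hidden_w=[[1.0, -1.0], [0.5, 0.5]],
    hidden_b=[0.0, 0.0],
    out_w=[1.0, 2.0],
    out_b=-0.1,
)
prob = sigmoid(wide + deep)  # both logits feed a single sigmoid
loss = log_loss(label=1, prob=prob)
print(f"wide={wide:.2f} deep={deep:.2f} p(click)={prob:.4f} loss={loss:.4f}")
```

Because the loss is computed on the summed logit, its gradient flows into both parts at once, which is what joint training means here.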

## Optimizations

### Distributed Training

Horovod is used for distributed training, and `mpirun` launches the training script.
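
Conceptually, data-parallel training has every worker compute gradients on its own mini-batch and then average them so that all workers apply the same update. The single-process sketch below only imitates that arithmetic; it does not call Horovod, and the function names and numbers are hypothetical.

```python
def allreduce_average(per_worker_grads):
    """Element-wise mean of the gradient vectors produced by each worker."""
    n_workers = len(per_worker_grads)
    return [sum(values) / n_workers for values in zip(*per_worker_grads)]

def sgd_step(weights, grad, lr):
    """Plain SGD update applied identically on every worker."""
    return [w - lr * g for w, g in zip(weights, grad)]

# Two simulated workers share the same weights but see different mini-batches.
weights = [1.0, -2.0]
worker_grads = [[0.2, 0.4], [0.6, 0.0]]

avg_grad = allreduce_average(worker_grads)
weights = sgd_step(weights, avg_grad, lr=0.5)
print(avg_grad, weights)
```

The cost of this averaging grows with the number of parameters exchanged, which is why the embedding table size matters in the sections that follow.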

### Model Optimization

Each training step shows a long idle period spent in Horovod communication: parameter synchronization consumes much of the step time during distributed training, which causes poor scaling performance. The overhead is mainly caused by the large embedding table.

<img src="./img/wnd_profile.png" width="600"/><figure>Distributed training profiling</figure>

Replacing the custom layer (which contains the embedding layer) with a TensorFlow dense layer reduces the embedding parameter size, and therefore the amount of data Horovod must synchronize, which fixes the poor scaling. Per-step training time dropped from 5.16s to 2.71s, about a 1.9x speedup.

<img src="./img/wnd_traintime_custom_emd.png" width="600"/><figure>custom layer</figure>
<img src="./img/wnd_traintime_tf_emd.png" width="600"/><figure>TensorFlow built-in layer</figure>
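
The speedup quoted for this change follows directly from the two per-step times. The short calculation below reproduces it and also expresses the saving as a fraction of the original step time; only the two timings come from this notebook.

```python
baseline_step_s = 5.16   # per-step time with the custom layer
optimized_step_s = 2.71  # per-step time with the TensorFlow dense layer

speedup = baseline_step_s / optimized_step_s
time_removed = 1.0 - optimized_step_s / baseline_step_s

print(f"speedup: {speedup:.2f}x")               # speedup: 1.90x
print(f"step time removed: {time_removed:.1%}")  # step time removed: 47.5%
```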

### Horovod Optimization With OneCCL

The embedding table in the deep part spends a long time in Horovod communication, and Allgather is the most time-consuming operation. Enabling Intel OneCCL in Horovod reduces the Allgather time and delivers a 1.2x speedup.

<img src="./img/wnd_woccl.png" width="600"/><figure>horovod timeline profiling w/o OneCCL</figure>
<img src="./img/wnd_wccl.png" width="600"/><figure>horovod timeline profiling w/ OneCCL</figure>

### Framework Related Optimization

Set the CCL worker affinity, Horovod thread affinity, MPI socket binding, KMP affinity, and OMP_NUM_THREADS:

```bash
export CCL_WORKER_COUNT=2 # set CCL thread number
export CCL_WORKER_AFFINITY="16,17,34,35" # set CCL thread affinity
export HOROVOD_THREAD_AFFINITY="53,71" # set horovod thread affinity
export I_MPI_PIN_DOMAIN=socket # set socket binding for MPI
export I_MPI_PIN_PROCESSOR_EXCLUDE_LIST="16,17,34,35,52,53,70,71" # exclude CCL threads

mpirun -genv OMP_NUM_THREADS=16 -map-by socket -n 2 -ppn 2 -hosts localhost -genv I_MPI_PIN_DOMAIN=socket -genv OMP_PROC_BIND=true -genv KMP_BLOCKTIME=1 -genv KMP_AFFINITY=granularity=fine,compact,1,0
```
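
The `OMP_NUM_THREADS` values that appear in this notebook (16 in the command above with 72 CPUs, 24 in the training log in the Train section with 104 CPUs) are consistent with a simple rule: take the physical cores available to each rank and subtract the cores reserved for CCL workers. The sketch below encodes that rule and rebuilds the launcher arguments. The rule is inferred from those two data points and assumes two hardware threads per core, so treat it as a guess rather than the formula SDA actually uses; both function names are hypothetical.

```python
def omp_threads_per_rank(logical_cpus, ppn, ccl_worker_num, threads_per_core=2):
    """Physical cores per rank minus the cores reserved for CCL workers (inferred rule)."""
    physical_cores = logical_cpus // threads_per_core
    return physical_cores // ppn - ccl_worker_num

def build_mpirun_prefix(omp_threads, ppn, hosts):
    """Assemble the launcher arguments; the caller appends the training script."""
    total_ranks = ppn * len(hosts)  # assumption: -n equals ppn times the host count
    return [
        "mpirun",
        "-genv", f"OMP_NUM_THREADS={omp_threads}",
        "-map-by", "socket",
        "-n", str(total_ranks),
        "-ppn", str(ppn),
        "-hosts", ",".join(hosts),
        "-genv", "I_MPI_PIN_DOMAIN=socket",
        "-genv", "OMP_PROC_BIND=true",
        "-genv", "KMP_BLOCKTIME=1",
        "-genv", "KMP_AFFINITY=granularity=fine,compact,1,0",
    ]

threads = omp_threads_per_rank(logical_cpus=72, ppn=2, ccl_worker_num=2)
print(" ".join(build_mpirun_prefix(threads, ppn=2, hosts=["localhost"])))
```

Keeping the arguments in a list makes it straightforward to pass them to `subprocess.run` once the Python interpreter and script path are appended.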

### Early Stop

The baseline training MAP plateaued at 0.6553. With the training optimizations, the model converges faster and reaches 0.6553 MAP at 1.5K steps, so there is no need to train for 9K steps. Early stop is therefore enabled at 0.6553 MAP.

<img src="./img/wnd_map_GPU.png"/><figure>baseline metric curve</figure>
<img src="./img/wnd_early_stop_cpu.png"/><figure>optimized metric curve</figure>
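
A minimal sketch of the early-stop idea: compute MAP@12 over the evaluation displays and stop at the first evaluation that reaches the 0.6553 target. It assumes each display has exactly one clicked ad, so average precision reduces to the reciprocal rank of that ad. The function names and the evaluation history are hypothetical and are not taken from the training script.

```python
def average_precision_at_k(ranked_ad_ids, clicked_ad_id, k=12):
    """AP@k with a single relevant item: 1/rank if it is in the top k, else 0."""
    for rank, ad_id in enumerate(ranked_ad_ids[:k], start=1):
        if ad_id == clicked_ad_id:
            return 1.0 / rank
    return 0.0

def map_at_k(displays, k=12):
    """Mean AP@k over (ranked_ad_ids, clicked_ad_id) pairs, one per display."""
    return sum(average_precision_at_k(r, c, k) for r, c in displays) / len(displays)

def first_step_reaching_target(eval_history, target_map=0.6553):
    """Return the first (step, MAP) at or above the target, or None if never reached."""
    for step, map_value in eval_history:
        if map_value >= target_map:
            return step, map_value
    return None

# Two toy displays: the clicked ad is ranked first in one and second in the other.
toy_map = map_at_k([(["a", "b", "c"], "a"), (["a", "b", "c"], "b")])

# Made-up evaluation history, for illustration only.
eval_history = [(500, 0.6401), (1000, 0.6512), (1500, 0.6553), (2000, 0.6560)]
print(toy_map, first_step_reaching_target(eval_history))
```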

### Input Pipeline Optimization

Training needs more system resources than the input pipeline does, and contention for resources between the two caused performance overhead. Reducing the resources allocated to the input pipeline frees more for training; the input pipeline's share of total training time dropped from 8.2% to 3.2%.

<img src="./img/wnd_input_pipeline_orig.png" width="600"/><figure>original profiling</figure>
<img src="./img/wnd_input_pipeline_opt.png" width="600"/><figure>optimized profiling</figure>

### HPO With SDA (Smart Democratization Advisor)

SDA configuration:

```
Parameters for SDA auto optimization:
- dnn_hidden_unit1: [64, 128, 256, 512] #layer width of dnn_hidden_unit1
- dnn_hidden_unit2: [64, 128, 256, 512] #layer width of dnn_hidden_unit2
- dnn_hidden_unit3: [64, 128, 256, 512] #layer width of dnn_hidden_unit3
- deep_learning_rate: 0.0001~0.1 #deep part learning rate
- linear_learning_rate: 0.01~1.0 #linear part learning rate
- deep_warmup_epochs: 1~8 #deep part warmup epochs
- deep_dropout: 0~0.5 #deep part dropout
metrics:
- name: training_time # training time threshold
  objective: minimize
  threshold: 1800
- name: MAP # training metric threshold
  objective: maximize
  threshold: 0.6553
metric:
- name: MAP
  threshold: 0.6553
```
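
To make the search space concrete, the sketch below draws one candidate configuration from the ranges listed above using plain random sampling. This is not the optimizer that SDA or SigOpt uses; it only shows what a single suggestion could look like. Treating `deep_warmup_epochs` as an integer is an assumption, and the dictionary layout and function name are hypothetical.

```python
import random

SEARCH_SPACE = {
    "dnn_hidden_unit1": {"grid": [64, 128, 256, 512]},
    "dnn_hidden_unit2": {"grid": [64, 128, 256, 512]},
    "dnn_hidden_unit3": {"grid": [64, 128, 256, 512]},
    "deep_learning_rate": {"bounds": (0.0001, 0.1)},
    "linear_learning_rate": {"bounds": (0.01, 1.0)},
    "deep_warmup_epochs": {"int_bounds": (1, 8)},  # assumed to be an integer
    "deep_dropout": {"bounds": (0.0, 0.5)},
}

def sample_config(space, rng):
    """Draw one value per parameter: a grid choice, an integer, or a uniform float."""
    config = {}
    for name, spec in space.items():
        if "grid" in spec:
            config[name] = rng.choice(spec["grid"])
        elif "int_bounds" in spec:
            low, high = spec["int_bounds"]
            config[name] = rng.randint(low, high)  # inclusive on both ends
        else:
            low, high = spec["bounds"]
            config[name] = rng.uniform(low, high)
    return config

rng = random.Random(0)  # fixed seed so the draw is repeatable
candidate = sample_config(SEARCH_SPACE, rng)
print(candidate)
```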

Request suggestions from SDA:

```python
suggestion = self.conn.experiments(self.experiment.id).suggestions().create()
```

<img src="./img/wnd_sda.png" width="600"/>

## Performance

<img src="./img/wnd_perf.png" width="900"/>

* Intel optimized TensorFlow: applies OpenMP and KMP optimizations (AFFINITY, NUM_THREADS, etc.) for CPU
* Distributed training: Horovod scaling delivered only a 1.93x speedup from 1 node to 4 nodes, which is poor scaling performance
* Model optimization: reducing the sparse embedding size reduced the Horovod communication volume and improved scaling; 4-node training delivered a 2.7x speedup over 1 node
* Lighter model: reducing the deep hidden units from [1024, 1024, 1024, 1024, 1024] to [1024, 512, 256] delivered a 1.14x speedup
* Early stop: training stops when MAP@12 reaches the predefined value (0.6553); training took 904 steps and delivered a 4.14x speedup
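
One way to read the two scaling figures in the list above is as parallel efficiency, the measured speedup divided by the node count. The calculation below uses only the 1.93x and 2.7x numbers from the list; the function name is hypothetical.

```python
def scaling_efficiency(speedup, nodes):
    """Fraction of ideal linear scaling that was achieved."""
    return speedup / nodes

before = scaling_efficiency(1.93, nodes=4)  # before model optimization
after = scaling_efficiency(2.7, nodes=4)    # after model optimization

print(f"before: {before:.3f}, after: {after:.3f}")
```

Efficiency rises from just under half of ideal to about two thirds, which matches the description of the first result as poor scaling.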

# Getting Started
* [1. Environment Setup](#1.-Environment-Setup)
* [2. Workflow Prepare](#2.-Workflow-Prepare)
* [3. Data Prepare](#3.-Data-Prepare)
* [4. Train](#4.-Train)

Note: to run this demo, first follow the `Environment Setup`, `Workflow Prepare`, and `Data Prepare` sections to satisfy the prerequisites.

## 1. Environment Setup

### Option 1 Setup Environment with Pip
Prerequisites: move the e2eAIOK source code to /home/vmagent/app/e2eaiok, then install Spark and start the Spark services for data preprocessing.

In [None]:
%%bash
pip install sigopt future pydot dill pyyaml
pip install --no-cache-dir intel-tensorflow==2.10
HOROVOD_WITHOUT_MPI=1 HOROVOD_CPU_OPERATIONS=CCL \
    HOROVOD_WITHOUT_MXNET=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITH_TENSORFLOW=1 \
    pip install --no-cache-dir horovod
pip install --no-cache-dir --no-deps tensorflow-transform==0.24.1 tensorflow-metadata==0.14.0
pip install "git+https://github.com/mlperf/logging.git@1.0.0"
pip install e2eAIOK-sda --pre --no-deps --ignore-installed

### Option 2 Setup Environment with Docker
``` bash
# Setup ENV
git clone https://github.com/intel/e2eAIOK.git
cd e2eAIOK
git submodule update --init --recursive
python3 scripts/start_e2eaiok_docker.py -b tensorflow -w ${host0} ${host1} ${host2} ${host3} --proxy ""
# Enter Docker
sshpass -p docker ssh ${host0} -p 12344
```

## 2. Workflow Prepare

``` bash
# prepare model codes
bash workflow_prepare_wnd.sh

# source spark env
source /home/spark-env.sh

# Start services
# Only needed if no Spark service is running; check ${localhost}:8080 to confirm
/home/start_spark_service.sh
```

A simple example of the config file is shown below; refer to conf/e2eaiok_defaults_wnd_example.conf for the full config file.

```yaml
### GLOBAL SETTINGS ###
observation_budget: 1
save_path: /home/vmagent/app/e2eaiok/result/
ppn: 2
ccl_worker_num: 2
global_batch_size: 524288
num_epochs: 20
cores: 72
iface: lo
hosts:
- localhost

```
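
A few quantities can be derived from these settings before launching. The sketch below computes the number of training processes and the batch size each one would handle, assuming the global batch is divided evenly among ranks; that split is an assumption about the training script, not something this config states. The helper names are hypothetical.

```python
# Values copied from the example config above.
config = {
    "ppn": 2,
    "ccl_worker_num": 2,
    "global_batch_size": 524288,
    "num_epochs": 20,
    "cores": 72,
    "hosts": ["localhost"],
}

def total_ranks(cfg):
    """Number of training processes: ppn on every listed host."""
    return cfg["ppn"] * len(cfg["hosts"])

def per_rank_batch_size(cfg):
    """Assumption: the global batch is divided evenly among ranks."""
    ranks = total_ranks(cfg)
    if cfg["global_batch_size"] % ranks != 0:
        raise ValueError("global_batch_size is not divisible by the number of ranks")
    return cfg["global_batch_size"] // ranks

print(total_ranks(config), per_rank_batch_size(config))  # 2 262144
```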

## 3. Data Prepare

```bash
# Download Dataset
# download and unzip dataset from https://www.kaggle.com/c/outbrain-click-prediction/data to /home/vmagent/app/dataset/outbrain/orig
```

In [4]:
!cd /home/vmagent/app/e2eaiok/modelzoo/WnD/TensorFlow2; sh scripts/spark_preproc.sh

22/10/31 22:02:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/31 22:02:30 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
Drop rows with empty "geo_location"...
Drop rows with empty "platform"...
valid_set_df time: 38.694966077804565
train_set_df time: 42.35809636116028
train/test dataset generation time: 95.60888910293579
22/10/31 22:04:18 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.deb

## 4. Train

Edit the config file to control the SDA process.

In [2]:
from e2eAIOK.SDA.SDA import SDA
import yaml

# create SDA settings
settings = {}
settings["data_path"] = "/home/vmagent/app/dataset/outbrain/"
settings["enable_sigopt"] = False
settings["python_path"] = "/opt/intel/oneapi/intelpython/latest/bin/python"
settings["train_path"] = "e2eaiok/modelzoo/WnD/TensorFlow2/main.py"
# load WnD settings
with open("e2eaiok/tests/cicd/conf/e2eaiok_defaults_wnd_example.conf") as f:
    conf = yaml.load(f, Loader=yaml.FullLoader)
settings.update(conf)

sda = SDA(model="wnd", settings=settings)
sda.launch()

2023-03-23 06:47:56,462 - E2EAIOK.SDA - INFO - ### Ready to submit current task  ###
Exception ignored in: <function SDA.__del__ at 0x7fb9b4048310>
Traceback (most recent call last):
  File "/home/vmagent/app/e2eaiok/e2eAIOK/SDA/SDA.py", line 79, in __del__
    with open(f"{self.custom_result_path}/latest_hydro_model", 'w') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'None/latest_hydro_model'
2023-03-23 06:47:56,465 - E2EAIOK.SDA - INFO - Model Advisor created
2023-03-23 06:47:56,466 - E2EAIOK.SDA - INFO - model parameter initialized
2023-03-23 06:47:56,467 - E2EAIOK.SDA - INFO - start to launch training
2023-03-23 06:47:56,469 - sigopt - INFO - training launch command: mpirun -genv OMP_NUM_THREADS=24 -map-by socket -n 2 -ppn 2 -hosts localhost -print-rank-map -genv I_MPI_PIN_DOMAIN=socket -genv OMP_PROC_BIND=true -genv KMP_BLOCKTIME=1 -genv KMP_AFFINITY=granularity=fine,compact,1,0 /opt/intel/oneapi/intelpython/latest/bin/python -u /home/vmagent/app/e2eaiok/modelzoo/

data format is tfrecords
params: {'ppn': 2, 'cores': 104, 'hosts': ['localhost'], 'ccl_worker_num': 2, 'python_executable': None, 'global_batch_size': 524288, 'num_epochs': 20, 'model_dir': './', 'observation_budget': 1, 'save_path': '/home/vmagent/app/e2eaiok/result/', 'dataset_meta_path': '/home/vmagent/app/dataset/outbrain/outbrain_meta.yaml', 'train_dataset_path': '/home/vmagent/app/dataset/outbrain/train', 'eval_dataset_path': '/home/vmagent/app/dataset/outbrain/valid', 'data_path': '/home/vmagent/app/dataset/outbrain/', 'enable_sigopt': False, 'python_path': '/opt/intel/oneapi/intelpython/latest/bin/python', 'iface': 'enp24s0f0', 'model_parameter': {'project': 'sda', 'experiment': 'wnd', 'parameters': [{'grid': [64, 128, 256, 512, 1024, 2048], 'name': 'dnn_hidden_unit1', 'type': 'int'}, {'grid': [64, 128, 256, 512, 1024, 2048], 'name': 'dnn_hidden_unit2', 'type': 'int'}, {'grid': [64, 128, 256, 512, 1024, 2048], 'name': 'dnn_hidden_unit3', 'type': 'int'}, {'bounds': {'max': 0.1, 

2023-03-23 06:47:56.712457: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-23 06:47:56.713833: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


rank: 0
rank: 0
All feature columns: ['doc_event_days_since_published_log_01scaled', 'doc_ad_days_since_published_log_01scaled', 'doc_event_doc_ad_sim_categories', 'doc_event_doc_ad_sim_topics', 'doc_event_doc_ad_sim_entities', 'pop_document_id', 'pop_publisher_id', 'pop_source_id', 'pop_ad_id', 'pop_advertiser_id', 'pop_campain_id', 'doc_views_log_01scaled', 'ad_views_log_01scaled', 'ad_id', 'campaign_id', 'doc_event_id', 'event_platform', 'doc_id', 'ad_advertiser', 'doc_event_source_id', 'doc_event_publisher_id', 'doc_ad_source_id', 'doc_ad_publisher_id', 'event_geo_location', 'event_country', 'event_country_state', 'display_id']
All feature columns: ['doc_event_days_since_published_log_01scaled', 'doc_ad_days_since_published_log_01scaled', 'doc_event_doc_ad_sim_categories', 'doc_event_doc_ad_sim_topics', 'doc_event_doc_ad_sim_entities', 'pop_document_id', 'pop_publisher_id', 'pop_source_id', 'pop_ad_id', 'pop_advertiser_id', 'pop_campain_id', 'doc_views_log_01scaled', 'ad_views_log_

2023-03-23 06:47:58.510237: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-23 06:47:58.510395: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:tensorflow:Steps per epoch: 113
INFO:tensorflow:Steps per epoch: 113
  inputs = self._flatten_to_reference_inputs(inputs)
  inputs = self._flatten_to_reference_inputs(inputs)
INFO:tensorflow:step: 0, {'binary_accuracy': '0.4316', 'auc': '0.4966', 'loss': '1.0370', 'time'

INFO:tensorflow:step: 52, {'binary_accuracy_val': 0.8068619, 'auc_val': 0.6773412, 'loss_val': 0.4630782, 'map_val': 0.6269048126310016}
INFO:tensorflow:step: 56, {'binary_accuracy_val': 0.80662346, 'auc_val': 0.6912416, 'loss_val': 0.46513453, 'map_val': 0.6289117440123722}
INFO:tensorflow:step: 56, {'binary_accuracy_val': 0.80693436, 'auc_val': 0.6789451, 'loss_val': 0.4619302, 'map_val': 0.6276270785147853}
INFO:tensorflow:step: 60, {'binary_accuracy_val': 0.80662155, 'auc_val': 0.6914401, 'loss_val': 0.46614832, 'map_val': 0.6290294068335345}
INFO:tensorflow:step: 60, {'binary_accuracy_val': 0.80698013, 'auc_val': 0.6806805, 'loss_val': 0.46096572, 'map_val': 0.6284364907265818}
INFO:tensorflow:step: 60, {'binary_accuracy': '0.7960', 'auc': '0.6625', 'loss': '0.4891', 'time': '83.8929'}
INFO:tensorflow:step: 60, {'binary_accuracy': '0.7930', 'auc': '0.6588', 'loss': '0.4944', 'time': '84.8086'}
INFO:tensorflow:step: 64, {'binary_accuracy_val': 0.80664825, 'auc_val': 0.6928923, 'los

('/home/vmagent/app/e2eaiok/result/6c37b08d7939bb0d5de5f1e4c4303884f82b1c47d3d61173d1caf86893e9a845',
 [{'name': 'MAP', 'value': 0.6309602310691915},
  {'name': 'training_time', 'value': 710.9380166530609}])
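
The tuple printed at the end of the run holds the result path and the final metrics. The sketch below compares those metrics with the thresholds from the SDA configuration in the Optimizations section. Reading a `maximize` threshold as "at least" and a `minimize` threshold as "at most" is an assumption, and `check_metrics` is a hypothetical helper, not part of e2eAIOK.

```python
# Result copied from the run output above.
result = (
    "/home/vmagent/app/e2eaiok/result/6c37b08d7939bb0d5de5f1e4c4303884f82b1c47d3d61173d1caf86893e9a845",
    [
        {"name": "MAP", "value": 0.6309602310691915},
        {"name": "training_time", "value": 710.9380166530609},
    ],
)

# Thresholds from the SDA configuration: (objective, threshold).
THRESHOLDS = {
    "MAP": ("maximize", 0.6553),
    "training_time": ("minimize", 1800),
}

def check_metrics(metrics, thresholds):
    """Return {metric name: True if the threshold is met} for each reported metric."""
    status = {}
    for metric in metrics:
        objective, threshold = thresholds[metric["name"]]
        if objective == "maximize":
            status[metric["name"]] = metric["value"] >= threshold
        else:
            status[metric["name"]] = metric["value"] <= threshold
    return status

model_path, metrics = result
status = check_metrics(metrics, THRESHOLDS)
print(status)  # {'MAP': False, 'training_time': True}
```

For this particular run the training time is within the 1800-second limit, while MAP (about 0.631) is still below the 0.6553 target.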