# DIEN DEMO

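DIEN (Deep Interest Evolution Network) is a click-through rate (CTR) prediction model for online display advertising. It evolved from Alibaba's DIN (Deep Interest Network) and uses a sequence model to simulate how user interests evolve over time.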

* Original source
    * Source repo: https://github.com/alibaba/ai-matrix


# Content
* [Model Architecture](#model-architecture)
* [Optimizations](#optimizations)
* [Performance](#performance-overview)
* [Demo](#demo)


------

# Model Architecture

<div><img src="./img/dien-arch.png" alt="DIEN Model Architecture" width="600"></div>

* DIEN layers (from bottom to top)
    * Behavior layer: converts the user's behavior sequence into embedding vectors
    * Interest extractor layer: extracts an interest sequence from the behavior sequence
    * Interest evolving layer: AUGRU (GRU with attentional update gate) models the interest evolution process relative to the target item
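
As a rough illustration of the interest evolving layer, the attentional update step of AUGRU can be sketched as below. This is a minimal pure-Python sketch, not the repository's TensorFlow implementation: the function name `augru_update` is hypothetical, and the update gate and candidate state are assumed to have been computed already by the usual GRU equations. The attention score for the target item scales the update gate, so behaviors that are less related to the target item change the hidden state less.

```python
from typing import List


def augru_update(h_prev: List[float], h_candidate: List[float],
                 update_gate: List[float], attention: float) -> List[float]:
    """One AUGRU state update (hypothetical helper, for illustration only).

    update_gate and h_candidate are assumed to come from the standard GRU
    gate equations; attention is the scalar attention score in [0, 1]
    between this time step's interest state and the target item.
    """
    # Scale the update gate by the attention score.
    scaled_gate = [attention * u for u in update_gate]
    # Interpolate between the previous state and the candidate state.
    return [(1.0 - g) * hp + g * hc
            for g, hp, hc in zip(scaled_gate, h_prev, h_candidate)]
```

With `attention=0.0` the hidden state is carried over unchanged; with `attention=1.0` the step behaves like a plain GRU update.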


# Optimizations

* Motivation
    * The original ETL was implemented in pure Python
    * The original training on a single CPU node showed a 2.55x gap to GPU
    * Inference on CPU nodes can be run with 8x parallelism
    
* Data Process
    * Overall speedup: 15x
    * Data ingestion: rewritten with Spark so that data is loaded directly for preprocessing, 35x speedup
    * Preprocessing: re-implemented with RecDP on Spark, 12.27x speedup

<div><img src="./img/dien-dataprocessing.png" alt="DIEN Training" width="900"></div>  

* Training
    * Overall speedup: 8.12x
    * TensorFlow optimization: switched to Intel-optimized TensorFlow 2
    * Optimized DataLoader: data categorification is completed during ETL
    * Scaling out: training scaled from a single node to 4 CLX-8535 nodes

<div><img src="./img/dien-scaling.png" alt="DIEN Inference" width="900"></div>
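
To illustrate what completing categorification during ETL means, the sketch below maps raw string IDs to integer indices once, ahead of training, so that the data loader only has to read integers. It is a minimal pure-Python sketch under stated assumptions, not the RecDP implementation: `build_vocab` and `categorify` are hypothetical names, indices are assigned by descending frequency, and index 0 is reserved for unseen values.

```python
from collections import Counter
from typing import Dict, Iterable, List


def build_vocab(values: Iterable[str]) -> Dict[str, int]:
    """Assign integer ids by descending frequency (assumed ordering)."""
    counts = Counter(values)
    # Ties are broken alphabetically so the mapping is deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    # Index 0 is reserved for values not present in the vocabulary.
    return {value: idx for idx, value in enumerate(ordered, start=1)}


def categorify(values: Iterable[str], vocab: Dict[str, int],
               unknown: int = 0) -> List[int]:
    """Replace each raw value with its integer id."""
    return [vocab.get(v, unknown) for v in values]


vocab = build_vocab(["b", "a", "b", "c"])
encoded = categorify(["a", "z", "b"], vocab)
```

Doing this lookup once in ETL keeps dictionary lookups out of the per-batch path of the data loader.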
    
* Inference
    * Overall improvement: 882x
    * Scaling out: inference scaled from a single process on one node to 64 processes on 4 nodes
    * Optimized DataLoader: data categorification is completed during ETL
    * Multi-instance inference
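
Multi-instance inference usually means running several independent inference processes on one node, each pinned to its own slice of CPU cores. The sketch below shows one way to compute such slices. It is an assumption for illustration only: the helper name `core_ranges` is hypothetical, and the actual `infer.sh` script may partition cores differently.

```python
from typing import List, Tuple


def core_ranges(total_cores: int, instances: int) -> List[Tuple[int, int]]:
    """Split cores 0..total_cores-1 into contiguous (first, last) ranges."""
    if instances < 1 or instances > total_cores:
        raise ValueError("instances must be between 1 and total_cores")
    base, extra = divmod(total_cores, instances)
    ranges = []
    start = 0
    for i in range(instances):
        # The first `extra` instances get one additional core.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges
```

For example, `core_ranges(8, 4)` yields four two-core slices that could each be handed to one inference process.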


# Performance Overview

* For Training
    * End-to-end training time of our optimized DIEN on CPU shows a 2.14x gap to an AWS P4D A100 on a single CLX node; after scaling out to 4 CLX nodes, the gap is reduced to 1.05x.
<div><img src="./img/dien-training-perf.png" alt="DIEN Training" width="500"></div>

* For Inference
    * Inference throughput of our optimized DIEN on a single CLX node is 2.24x better than on an AWS P4D A100, and it can scale out linearly to multiple nodes.
<div><img src="./img/dien-infer-perf.png" alt="DIEN Inference" width="500"></div>
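
Taking the two training gap figures above (2.14x and 1.05x) at face value, and assuming both are measured against the same A100 baseline, the implied speedup from 1 to 4 CLX nodes and the corresponding scaling efficiency can be worked out as follows. The function name is hypothetical and the derived numbers are not measured results.

```python
from typing import Tuple


def scaling_summary(gap_single: float, gap_multi: float,
                    nodes: int) -> Tuple[float, float]:
    """Return (speedup, efficiency) implied by two gap-to-baseline ratios."""
    # Ratio of single-node time to multi-node time, assuming a shared baseline.
    speedup = gap_single / gap_multi
    # 1.0 would correspond to perfectly linear scaling.
    efficiency = speedup / nodes
    return speedup, efficiency


speedup, efficiency = scaling_summary(2.14, 1.05, 4)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

Under these assumptions the 4-node run is roughly 2.04x faster than the single-node run, which is about 51% scaling efficiency.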


------

# DEMO

* [Environment Setup](#environment-setup)
* [Data Process](#data-process)
* [Train](#train)
* [Inference](#inference)

## Environment Setup
``` bash
# Setup ENV
git clone https://github.com/intel/e2eAIOK.git
cd e2eAIOK
git submodule update --init --recursive
python3 scripts/start_e2eaiok_docker.py -b tensorflow -w ${host0} ${host1} ${host2} ${host3} --proxy ""
```

## Enter Docker

``` bash
sshpass -p docker ssh ${host0} -p 12344
```

## Workflow Prepare

``` bash
# Prepare model code
cd /home/vmagent/app/e2eaiok/modelzoo/dien/train
sh patch_dien.sh

# Download Dataset
cd /home/vmagent/app/e2eaiok/modelzoo/dien/feature_engineering/
./download_dataset /home/vmagent/app/dataset/

# Source the Spark environment
source /home/spark-env.sh

# Start services (only needed if no Spark service is running;
# check ${localhost}:8080 to confirm)
/home/start_spark_service.sh
```

## Data Process

Process the training data:

    cd /home/vmagent/app/e2eaiok/modelzoo/dien/feature_engineering/; python preprocessing.py --train --dataset_path /home/vmagent/app/dataset/amazon_reviews_proc/

Example output (truncated):

    sr140
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    22/10/29 00:53:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    22/10/29 00:53:59 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
    recdp-scala-extension is enabled
    per core memory size is 3.750 GB and shuffle_disk maximum capacity is 1200.000 GB
    start spark process took 4.397668769001029 secs
    save data to file:////home/vmagent/app/dataset/amazon_reviews_small//output//reviews-info
    parse reviews-info with spark took 10.765656954958104 secs
    save data to file:////home/vmagent/app/dataset/amazon_reviews_small//output//item-info
    parse item-info with sp

Process the test data:

    cd /home/vmagent/app/e2eaiok/modelzoo/dien/feature_engineering/; python preprocessing.py --test --dataset_path /home/vmagent/app/dataset/amazon_reviews_proc/

Example output (truncated):

    sr140
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    22/10/27 16:21:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    22/10/27 16:21:01 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
    recdp-scala-extension is enabled
    per core memory size is 3.750 GB and shuffle_disk maximum capacity is 1200.000 GB
    start spark process took 4.422215551021509 secs
    save data to file:////home/vmagent/app/dataset/amazon_reviews_small/output//reviews-info
    parse reviews-info with spark took 10.50614679302089 secs
    save data to file:////home/vmagent/app/dataset/amazon_reviews_small/output//item-info
    parse item-info with spar

## Train

### DEMO Single Node Train

Copy `meta.yaml` into the dataset directory:

    cp /home/vmagent/app/e2eaiok/modelzoo/dien/meta.yaml /home/vmagent/app/dataset/amazon_reviews/

Train on a single node:

    cd /home/vmagent/app/e2eaiok/; python -u run_e2eaiok.py --data_path /home/vmagent/app/dataset/amazon_reviews --model_name dien 2>dien_train.log

Example output (truncated):

    ***    Best Trained Model    ***
      Model Type: dien
      Model Saved Path: 
      Sigopt Experiment id is None
      === Result Metrics ===
    2022:10:31-21:11:00:(85796) |WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
    local_rank=0, rank=0, size=1
    one_ccl is_enabled:  True
    Advanced train
    {'uid_voc': '/home/vmagent/app/dataset/amazon_reviews/uid_voc.pkl', 'mid_voc': '/home/vmagent/app/dataset/amazon_reviews/mid_voc.pkl', 'cat_voc': '/home/vmagent/app/dataset/amazon_reviews/cat_voc.pkl', 'train_file': '/home/vmagent/app/dataset/amazon_reviews_small/train/local_train_splitByUser', 'test_file': '/home/vmagent/app/dataset/amazon_reviews_small/valid/local_test_splitByUser', 'model_type': 'DIEN', 'seed': 3, 'batch_size': 256, 'data_type': 'FP32'}
    batch_size:  256
    model:  DIEN
    embedding_device cpu
    best model will be saved to /home/vmagent/app/e2eaiok/result/3c3c066abc64c20f027199f6cdfe8ab8/dnn_best_model
    /home/vmagent/app/e2eaiok/res

### DEMO Distributed Train

Print the distributed configuration, then train on 4 nodes:

    cat /home/vmagent/app/e2eaiok/conf/e2eaiok_defaults_dien_example.conf
    cd /home/vmagent/app/e2eaiok/; python -u run_e2eaiok.py --data_path /home/vmagent/app/dataset/amazon_reviews_distributed --model_name dien --conf conf/e2eaiok_defaults_dien_example.conf 2>dien_distributed_train.log

Example output (truncated; the first lines are the configuration printed by `cat`):

    ppn: 4
    iface: ens5f1
    hosts:
    - 10.112.228.19
    - 10.112.228.24
    - 10.112.228.27
    - 10.112.228.30

    ***    Best Trained Model    ***
      Model Type: dien
      Model Saved Path: /home/vmagent/app/e2eaiok/result/dien/20221012_224124/3c3c066abc64c20f027199f6cdfe8ab8
      Sigopt Experiment id is None
      === Result Metrics ===
        AUC: 0.8244640665336023
        training_time: 1219.0652213096619
    Filtering local host names.
    Remote host found: 10.112.228.24 10.112.228.27 10.112.228.30
    Checking ssh on all remote hosts.
    SSH was successful into all the remote hosts.
    [2]<stdout>:local_rank=0, rank=2, size=4
    [3]<stdout>:local_rank=0, rank=3, size=4
    [1]<stdout>:local_rank=0, rank=1, size=4
    [0]<stdout>:local_rank=0, rank=0, size=4
    [2]<stdout>:one_ccl is_enabled:  True
    [3]<stdout>:one_ccl is_enabled:  True
    [1]<stdout>:one_ccl is_enabled:  True
    [0]<stdout>:one_ccl is_enabled:  True
    [2]<stdout>:batch_size:  256
    [3]<stdout>:batch_size:  256
    [1]<stdout>:batch_size:  256
    [0]<stdout>:batch_size:  256
    [0]<stdout>:model:  DIEN


## Inference

Run inference on a single node with 64 instances:

    cd /home/vmagent/app/e2eaiok/modelzoo/dien; sh infer.sh