[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/intel/e2eAIOK/blob/main/demo/builtin/dien/DIEN_DEMO.ipynb)


# DIEN DEMO

Click-through rate (CTR) prediction for online display ads. DIEN evolved from DIN (Alibaba) and uses a sequence model to simulate how user interest evolves over time.

* original source
    * Source repo: https://github.com/alibaba/ai-matrix


# Content
* [Overview](#Overview)
    * [Model Architecture](#Model-Architecture)
    * [Optimizations](#Optimizations)
    * [Performance](#Performance)
* [Getting Started](#Getting-Started)
    * [1. Environment Setup](#1.-Environment-Setup)
    * [2. Workflow Prepare](#2.-Workflow-Prepare)
    * [3. Data Prepare](#3.-Data-Prepare)
    * [4. Train](#4.-Train)

------

# Overview

## Model Architecture

<div><img src="./img/dien-arch.png" alt="DIEN Model Architecture" width="600"></div>

* DIEN (from bottom to top)
    * Behavior Layer: converts the behavior sequence into embedding vectors
    * Interest Extractor Layer: extracts the interest sequence from the behavior sequence
    * Interest Evolving Layer: AUGRU (GRU with attentional update gate) models how interest evolves relative to the target item

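As a rough illustration of the Interest Evolving Layer, the sketch below implements a single AUGRU step in plain Python, following the formulation in the DIEN paper: the GRU update gate is scaled by the attention score of the current step before the hidden state is blended. The function names, the parameter dictionary layout, and the toy weights are assumptions made for this example; the demo's actual TensorFlow implementation may differ.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def augru_step(x, h_prev, att, p):
    """One AUGRU step for input x, previous state h_prev, attention score att."""
    wu_x, uu_h = matvec(p["Wu"], x), matvec(p["Uu"], h_prev)
    wr_x, ur_h = matvec(p["Wr"], x), matvec(p["Ur"], h_prev)
    wh_x, uh_h = matvec(p["Wh"], x), matvec(p["Uh"], h_prev)
    # Standard GRU update and reset gates.
    u = [sigmoid(a + b + c) for a, b, c in zip(wu_x, uu_h, p["bu"])]
    r = [sigmoid(a + b + c) for a, b, c in zip(wr_x, ur_h, p["br"])]
    # Candidate state, with the reset gate applied to the recurrent term.
    h_cand = [math.tanh(a + ri * b + c)
              for a, ri, b, c in zip(wh_x, r, uh_h, p["bh"])]
    # AUGRU: scale the update gate by the attention score of this step.
    u_att = [att * ui for ui in u]
    return [(1.0 - ua) * hp + ua * hc
            for ua, hp, hc in zip(u_att, h_prev, h_cand)]

# Toy weights (hypothetical values): input size 2, hidden size 2.
half_identity = [[0.5, 0.0], [0.0, 0.5]]
params = {
    "Wu": [[0.1, -0.2], [0.3, 0.4]], "Uu": half_identity, "bu": [0.0, 0.0],
    "Wr": [[-0.1, 0.2], [0.2, 0.1]], "Ur": half_identity, "br": [0.0, 0.0],
    "Wh": [[0.2, 0.1], [-0.3, 0.2]], "Uh": half_identity, "bh": [0.0, 0.0],
}
x, h0 = [1.0, 0.5], [0.2, -0.1]
h_ignored = augru_step(x, h0, att=0.0, p=params)   # attention 0: state is carried over
h_attended = augru_step(x, h0, att=1.0, p=params)  # attention 1: plain GRU step
print(h_ignored, h_attended)
```

With an attention score of 0 the hidden state passes through unchanged, so behaviors unrelated to the target item have little effect on the evolving interest.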

## Optimizations

* Motivation
    * The original ETL was implemented in pure Python
    * The original training on a single CPU node showed a 2.55x gap to GPU
    * Inference on CPU nodes can run with 8x parallelism
    
* Data Process
    * Overall speedup: 15x
    * Data Ingestion: rewritten with Spark, loading data directly for preprocessing; 35x speedup
    * Preprocessing: re-implemented with RecDP on Spark; 12.27x speedup

<div><img src="./img/dien-dataprocessing.png" alt="DIEN Training" width="900"></div>  

* Training
    * Overall speedup: 8.12x
    * TensorFlow optimization: switched to Intel-optimized TensorFlow 2
    * Optimized DataLoader: data categorification is completed in ETL
    * Scaling out: training scaled from a single node to 4 CLX-8535 nodes

<div><img src="./img/dien-scaling.png" alt="DIEN Inference" width="900"></div>
    
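The "categorify in ETL" step mentioned above can be pictured as mapping each distinct user, item, or category string to an integer id once, ahead of training, so the data loader no longer does dictionary lookups per batch. The following is a minimal plain-Python sketch of that idea; it is not the RecDP API, and the function names, the frequency-based ordering, and the choice to reserve id 0 for unseen values are assumptions made for illustration.

```python
from collections import Counter

def build_vocab(values):
    """Map each distinct value to an integer id, most frequent first.

    Ids start at 1 so that 0 can be reserved for unseen values (assumption).
    """
    counts = Counter(values)
    # Sort by descending count, then by value so ties are deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: idx for idx, value in enumerate(ordered, start=1)}

def categorify(values, vocab, unknown_id=0):
    """Replace each value with its id, falling back to unknown_id."""
    return [vocab.get(v, unknown_id) for v in values]

# Hypothetical category column.
categories = ["Books", "Fiction", "Books", "History", "Books", "Fiction"]
cat_vocab = build_vocab(categories)
encoded = categorify(categories, cat_vocab)
unseen = categorify(["Poetry"], cat_vocab)
print(cat_vocab, encoded, unseen)
```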
* Inference
    * Overall improvement: 882x
    * Scaling out: inference scaled from a single process on one node to 64 processes on 4 nodes
    * Optimized DataLoader: data categorification is completed in ETL
    * Multi-instance inference

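One simple way to think about multi-instance inference is that every process scores its own contiguous slice of the test set. The sketch below computes such slices for 64 workers; the even split of 16 processes per node, the record count, and the function name are assumptions for this example, and the demo's own launcher may partition the data differently.

```python
def shard_bounds(num_records, num_workers, worker_id):
    """Return the [start, end) slice of records handled by one worker.

    Records are split as evenly as possible; the first `extra` workers
    each take one additional record.
    """
    base, extra = divmod(num_records, num_workers)
    start = worker_id * base + min(worker_id, extra)
    end = start + base + (1 if worker_id < extra else 0)
    return start, end

NUM_NODES = 4
PROCS_PER_NODE = 16  # assumption: 64 processes spread evenly over 4 nodes
num_workers = NUM_NODES * PROCS_PER_NODE

# Hypothetical test set of 1000 records.
shards = [shard_bounds(1000, num_workers, w) for w in range(num_workers)]
print(shards[0], shards[-1])
```

Because the slices are disjoint and cover the whole range, per-process results can be concatenated in worker order to reproduce the single-process output order.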

## Performance

* For Training
    * Our optimized DIEN end-to-end training time on CPU shows a 2.14x gap to AWS P4D A100 on a single CLX node; after scaling out to 4 CLX nodes, the gap is reduced to 1.05x.
<div><img src="./img/dien-training-perf.png" alt="DIEN Training" width="500"></div>

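The two training gaps quoted above also imply how well training scales from one to four nodes. The short calculation below assumes that "gap" means CPU time divided by A100 time and that both figures use the same A100 baseline; under those assumptions the ratio of the two gaps is the 4-node speedup.

```python
# Figures taken from the text above.
gap_single_node = 2.14  # CPU time / A100 time on one CLX node (assumed meaning)
gap_four_nodes = 1.05   # same ratio after scaling out to 4 CLX nodes
num_nodes = 4

# With a shared GPU baseline, the baseline cancels out of the ratio.
speedup = gap_single_node / gap_four_nodes
efficiency = speedup / num_nodes
print(f"speedup on {num_nodes} nodes: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")
```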
* For Inference
    * Our optimized DIEN inference throughput on a single CLX node is 2.24x better than on AWS P4D A100, and it can scale out linearly to multiple nodes.
<div><img src="./img/dien-infer-perf.png" alt="DIEN Inference" width="500"></div>


------

# Getting Started

* [1. Environment Setup](#1.-Environment-Setup)
* [2. Workflow Prepare](#2.-Workflow-Prepare)
* [3. Data Prepare](#3.-Data-Prepare)
* [4. Train](#4.-Train)

## 1. Environment Setup

(Option 1) Use pip install (recommended)

In [None]:
! pip install e2eaiok-sda --pre
! pip install pyrecdp
! pip install intel-tensorflow==2.10 tqdm psutil
! HOROVOD_WITH_TENSORFLOW=1 python -m pip install horovod

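After the pip installs finish, a quick check that the expected modules are importable can save a failed run later. The sketch below uses only the standard library; the module list is inferred from the install commands above and the import used later in this notebook, and the helper name is hypothetical.

```python
import importlib.util

# Import names assumed to correspond to the packages installed above.
REQUIRED_MODULES = ["tensorflow", "horovod", "pyrecdp", "e2eAIOK", "tqdm", "psutil"]

def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in modules:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        if spec is None:
            missing.append(name)
    return missing

missing = find_missing(REQUIRED_MODULES)
print("missing modules:", missing or "none")
```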
(Option 2) Use Docker
``` bash
# 1. git clone codes
git clone https://github.com/intel/e2eAIOK.git
cd e2eAIOK
git submodule update --init --recursive

# 2. build docker image
python3 scripts/start_e2eaiok_docker.py -b tensorflow -w ${host0} ${host1} ${host2} ${host3} --proxy "http://addr:ip"

# 3. Enter Docker
sshpass -p docker ssh ${host0} -p 12344

# 4. start jupyter notebook
nohup jupyter notebook --notebook-dir=/home/vmagent/app/e2eaiok --ip=${hostname} --port=8899 --allow-root &
# Now you can visit the demos at http://${hostname}:8899/.
```


## 2. Workflow Prepare

In [None]:
! git clone https://github.com/intel/e2eAIOK.git; cd e2eAIOK; git submodule update --init --recursive
! mv e2eAIOK e2eaiok
! cd e2eaiok/modelzoo/dien/train; sh patch_dien.sh

## 3. Data Prepare

We use the Amazon product review dataset (Books category). The cell below uses wget to download the original compressed files from Zenodo and the Stanford SNAP repository.

In [None]:
! wget https://zenodo.org/record/3463683/files/data.tar.gz
! wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Books.json.gz
! wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/meta_Books.json.gz
! tar -jxvf data.tar.gz
! gunzip reviews_Books.json.gz
! gunzip meta_Books.json.gz

! mkdir -p data/raw_data
! mkdir -p data/train
! mkdir -p data/valid

! mv *json data/raw_data/; cp data/local_test_splitByUser data/raw_data/
! mv data/local_test_splitByUser data/valid/

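Before launching preprocessing, it can help to confirm that the commands above left the files where the scripts expect them. The sketch below checks the layout implied by those commands; the helper name and the exact file list are assumptions derived from the `mv` and `cp` steps, not something the preprocessing script itself defines.

```python
from pathlib import Path

# Layout implied by the download and move commands above.
EXPECTED_DIRS = ["raw_data", "train", "valid"]
EXPECTED_FILES = [
    "raw_data/reviews_Books.json",
    "raw_data/meta_Books.json",
    "raw_data/local_test_splitByUser",
    "valid/local_test_splitByUser",
]

def missing_paths(data_root):
    """Return expected directories and files that are absent under data_root."""
    root = Path(data_root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing

# An empty list means the layout matches what the steps above produce.
print(missing_paths("data"))
```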
Launch data processing on the downloaded files.
The original data consists of two JSON files; the script below performs data ingestion, data processing, and feature engineering to produce ready-to-train data.

In [4]:
# Data Processing for train
! python -u e2eaiok/modelzoo/dien/feature_engineering/preprocessing.py --train --dataset_path `pwd`/data/

/usr/local/lib/python3.10/dist-packages/pyrecdp
sr414
Will assign 48 cores and 308502 M memory for spark
23/03/21 03:45:18 WARN Utils: Your hostname, sr414 resolves to a loopback address: 127.0.1.1; using 10.1.2.14 instead (on interface enp134s0f1)
23/03/21 03:45:18 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/03/21 03:45:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
recdp-scala-extension is enabled
per core memory size is 6.276 GB and shuffle_disk maximum capacity is 1200.000 GB
start spark process took 4.007592746987939 secs
save data to file:////home/vmagent/app/e2eAIOK/demo/builtin/dien/data//output//reviews-info
parse reviews-info with spark took 15.613626144011505 secs                      
save data to file:////home/vmagent/app/e2eAIOK/demo/bu

Launch data processing on the downloaded files.
The original test data is a CSV file; the script below performs data processing and feature engineering to produce ready-to-inference data.

In [5]:
# Data Processing for test
! python -u e2eaiok/modelzoo/dien/feature_engineering/preprocessing.py --test --dataset_path `pwd`/data/

/usr/local/lib/python3.10/dist-packages/pyrecdp
sr414
Will assign 48 cores and 308502 M memory for spark
23/03/21 03:48:41 WARN Utils: Your hostname, sr414 resolves to a loopback address: 127.0.1.1; using 10.1.2.14 instead (on interface enp134s0f1)
23/03/21 03:48:41 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/03/21 03:48:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
recdp-scala-extension is enabled
per core memory size is 6.276 GB and shuffle_disk maximum capacity is 1200.000 GB
start spark process took 4.669647359987721 secs
save data to file:////home/vmagent/app/e2eAIOK/demo/builtin/dien/data//output//reviews-info
parse reviews-info with spark took 15.729018479934894 secs                      
save data to file:////home/vmagent/app/e2eAIOK/demo/bu

## 4. Train

SDA is an AutoHPO (automatic hyperparameter optimization) component; we use SDA to trigger DIEN training and validation.
Note: when enable_sigopt is set to True, SDA explores hyperparameters for this model. In this demo, we use the best parameters found by our earlier search.

In [None]:
from e2eAIOK.SDA.SDA import SDA
import os

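# Locate the directory of the active Python interpreter (IPython shell capture).
python_path = !which python
python_dir_path = str(os.path.dirname(python_path[0]))
settings = dict()
settings["data_path"] = "data/"
# False: skip the hyperparameter search and use the pre-searched best parameters.
settings["enable_sigopt"] = False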
settings["python_path"] = python_dir_path
settings["train_script"] = "e2eaiok/modelzoo/dien/train/ai-matrix/script/train.py"

sda = SDA(model="DIEN", settings=settings) # default settings
sda.launch()

hydro_model = sda.snapshot()
hydro_model.explain()

2023-03-21 04:02:38,341 - E2EAIOK.SDA - INFO - ### Ready to submit current task  ###
2023-03-21 04:02:38,349 - E2EAIOK.SDA - INFO - Model Advisor created
2023-03-21 04:02:38,350 - E2EAIOK.SDA - INFO - model parameter initialized
2023-03-21 04:02:38,351 - E2EAIOK.SDA - INFO - start to launch training
2023-03-21 04:02:38,353 - sigopt - INFO - training launch command: /usr/bin//python -u e2eaiok/modelzoo/dien/train/ai-matrix/script/train.py --train_path data/train/local_train_splitByUser --test_path data/valid/local_test_splitByUser --meta_path data/meta.yaml --saved_path /home/vmagent/app/e2eaiok/result/DIEN/20230321_040238/74ee8e1d3e5b4458a4b60da27e1b4540e0503a691670915b44a6d643f933ed2f --num-intra-threads 48 --num-inter-threads 2 --mode train --embedding_device cpu --model DIEN --slice_id 0 --advanced true --seed 3 --data_type FP32 --batch_size 256
2023-03-21 04:02:38.508460: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neura

Advanced train
{'uid_voc': '/home/vmagent/app/e2eAIOK/demo/builtin/dien/data//uid_voc.pkl', 'mid_voc': '/home/vmagent/app/e2eAIOK/demo/builtin/dien/data//mid_voc.pkl', 'cat_voc': '/home/vmagent/app/e2eAIOK/demo/builtin/dien/data//cat_voc.pkl', 'train_file': 'data/train/local_train_splitByUser', 'test_file': 'data/valid/local_test_splitByUser', 'model_type': 'DIEN', 'seed': 3, 'batch_size': 256, 'data_type': 'FP32'}
batch_size:  256
model:  DIEN
embedding_device cpu
best model will be saved to /home/vmagent/app/e2eaiok/result/DIEN/20230321_040238/74ee8e1d3e5b4458a4b60da27e1b4540e0503a691670915b44a6d643f933ed2f/dnn_best_model
/home/vmagent/app/e2eaiok/result/DIEN/20230321_040238/74ee8e1d3e5b4458a4b60da27e1b4540e0503a691670915b44a6d643f933ed2f/dnn_best_model/ckpt_noshuffDIEN3
Number of uid = 8026324, mid = 2330066, cat = 2752
embedding on cpu
-----------------------------------


  query = tf.compat.v1.layers.dense(query, facts_size, activation=None, name='f1' + stag)
  d_layer_1_all = tf.compat.v1.layers.dense(din_all, 80, activation=tf.nn.sigmoid, name='f1_att' + stag)
  d_layer_2_all = tf.compat.v1.layers.dense(d_layer_1_all, 40, activation=tf.nn.sigmoid, name='f2_att' + stag)
  d_layer_3_all = tf.compat.v1.layers.dense(d_layer_2_all, 1, activation=None, name='f3_att' + stag)
  bn1 = tf.compat.v1.layers.batch_normalization(inputs=inp, name='bn1')
  dnn1 = tf.compat.v1.layers.dense(bn1, 200, activation=None, name='f1')
  dnn2 = tf.compat.v1.layers.dense(dnn1, 80, activation=None, name='f2')
  dnn3 = tf.compat.v1.layers.dense(dnn2, 2, activation=None, name='f3')
2023-03-21 04:02:41.637716: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled


Start to load Data from disk
Loading Data from disk is completed with 114.22454190254211 secs, start to train


OMP: Info #211: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #209: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0-95
OMP: Info #156: KMP_AFFINITY: 96 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 24 cores/pkg x 2 threads/core (48 total cores)
OMP: Info #213: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 48 maps to package 0 core 0 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 49 maps to package 0 core 1 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 50 maps to package 0 core 2 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3 thread 0 
OMP: Info #171: KMP_AFFI

iter: 500 ----> train_loss: 1.7206 ---- train_accuracy: 0.5585 ---- train_aux_loss: 1.3809 ---- train_time: 50.919
 test_auc: 0.6161 ----test_loss: 1.7136 ---- test_accuracy: 0.5743 ---- test_aux_loss: 1.3869 ---- eval_time: 14.828 ---- num_iters: 474
current auc is 0.6160514105711412, target auc is 0.82
iter: 1000 ----> train_loss: 1.7038 ---- train_accuracy: 0.5858 ---- train_aux_loss: 1.3713 ---- train_time: 45.815
 test_auc: 0.6471 ----test_loss: 1.6312 ---- test_accuracy: 0.6043 ---- test_aux_loss: 1.3854 ---- eval_time: 13.945 ---- num_iters: 474
current auc is 0.6470734103433915, target auc is 0.82
iter: 1500 ----> train_loss: 1.6985 ---- train_accuracy: 0.6034 ---- train_aux_loss: 1.3719 ---- train_time: 46.629
 test_auc: 0.6736 ----test_loss: 1.6197 ---- test_accuracy: 0.6221 ---- test_aux_loss: 1.3804 ---- eval_time: 13.999 ---- num_iters: 474
current auc is 0.6736113423878108, target auc is 0.82
iter: 2000 ----> train_loss: 1.6944 ---- train_accuracy: 0.6122 ---- train_aux_l