# DIEN DEMO

Click-through rate (CTR) prediction for online display ads. DIEN (Deep Interest Evolution Network) evolved from Alibaba's DIN and uses a sequence model to simulate how user interest evolves.

* Original source
    * Source repo: https://github.com/alibaba/ai-matrix


# Content
* [Model Architecture](#Model-Architecture)
* [Optimizations](#Optimizations)
* [Performance](#Performance-Overview)
* [Demo](#DEMO)


------

# Model Architecture

<div><img src="./img/dien-arch.png" alt="DIEN Model Architecture" width="600"></div>

* DIEN (from bottom to top)
    * Behavior layer: converts the behavior sequence into embedding vectors
    * Interest extractor layer: extracts an interest sequence from the behavior sequence
    * Interest evolving layer: an AUGRU (GRU with attentional update gate) models how interest evolves relative to the target item
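
The AUGRU update in the interest evolving layer can be illustrated with a small NumPy sketch. It follows the AUGRU formulation from the DIEN paper, in which the attention score scales the update gate. The helper names (`init_params`, `augru_step`), the dimensions, the random weights, and the attention scores are assumptions for illustration, not the code used in this demo.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(input_dim, hidden_dim, seed=0):
    """Random weights for one AUGRU cell (illustrative values only)."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("u", "r", "h"):
        params["W" + gate] = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
        params["U" + gate] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        params["b" + gate] = np.zeros(hidden_dim)
    return params


def augru_step(x, h_prev, att, p):
    """One AUGRU step: a GRU step whose update gate is scaled by attention."""
    u = sigmoid(x @ p["Wu"] + h_prev @ p["Uu"] + p["bu"])  # update gate
    r = sigmoid(x @ p["Wr"] + h_prev @ p["Ur"] + p["br"])  # reset gate
    h_cand = np.tanh(x @ p["Wh"] + r * (h_prev @ p["Uh"]) + p["bh"])
    u_att = att * u  # attention score (scalar in [0, 1]) scales the update gate
    return (1.0 - u_att) * h_prev + u_att * h_cand


params = init_params(input_dim=4, hidden_dim=3)
h = np.zeros(3)
interest_states = np.random.default_rng(1).normal(size=(5, 4))  # 5 time steps
scores = [0.9, 0.1, 0.0, 0.5, 0.7]  # hypothetical attention w.r.t. target item
for x_t, a_t in zip(interest_states, scores):
    h = augru_step(x_t, h, a_t, params)
print(h.shape)
```

With an attention score of 0 the hidden state passes through unchanged, so behaviors unrelated to the target item do not disturb the evolving interest.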


# Optimizations

* Motivation
    * The original ETL was implemented in pure Python
    * The original training on a single CPU node showed a 2.55x gap to GPU
    * Inference on CPU nodes can be run with 8x parallelism
    
* Data Process
    * Overall speedup: 15x
    * Data ingestion: rewritten with Spark, loading data directly for preprocessing; 35x speedup
    * Preprocessing: reimplemented with RecDP on Spark; 12.27x speedup

<div><img src="./img/dien-dataprocessing.png" alt="DIEN Training" width="900"></div>  

* Training
    * Overall speedup: 8.12x
    * TensorFlow optimization: switched to Intel-optimized TensorFlow 2
    * Optimized DataLoader: the categorify step is completed in ETL
    * Scaling out: training scaled from a single node to 4 CLX-8535 nodes

<div><img src="./img/dien-scaling.png" alt="DIEN Inference" width="900"></div>
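
"Categorify" here means replacing each categorical value (user id, item id, category) with an integer index, so the data loader no longer needs dictionary lookups during training. The following pure-Python sketch shows the idea only; according to the list above, the demo performs its preprocessing with RecDP on Spark. The helper names, the frequency ordering, and the reserved index 0 for unseen values are assumptions for illustration.

```python
from collections import Counter


def build_vocab(values):
    """Map each distinct value to an index, most frequent first (0 = unknown)."""
    counts = Counter(values)
    # Sort by descending frequency, then by value for a deterministic order.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: idx for idx, value in enumerate(ordered, start=1)}


def categorify(values, vocab):
    """Replace values with their indices; unseen values map to 0."""
    return [vocab.get(v, 0) for v in values]


categories = ["Books", "Fiction", "Books", "History", "Books", "Fiction"]
vocab = build_vocab(categories)
encoded = categorify(categories + ["Cooking"], vocab)
print(vocab)    # {'Books': 1, 'Fiction': 2, 'History': 3}
print(encoded)  # [1, 2, 1, 3, 1, 2, 0]
```

Doing this once in ETL means the training and inference loaders read integer ids directly.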
    
* Inference
    * Overall improvement: 882x
    * Scaling out: inference scaled from a single process on one node to 64 processes on 4 nodes
    * Optimized DataLoader: the categorify step is completed in ETL
    * Multi-instance inference
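
Multi-instance inference typically gives each process its own group of cores and its own shard of the data. The sketch below only plans that split. The function names and the 48-core node are assumptions for illustration, and launching and pinning the processes is left out.

```python
def plan_core_ranges(num_cores, num_instances):
    """Split cores 0..num_cores-1 into contiguous, near-equal ranges."""
    if not 0 < num_instances <= num_cores:
        raise ValueError("need between 1 and num_cores instances")
    base, extra = divmod(num_cores, num_instances)
    ranges, start = [], 0
    for i in range(num_instances):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))  # inclusive core range
        start += size
    return ranges


def shard_indices(num_samples, num_instances, instance_id):
    """Indices of the samples handled by one instance (round-robin)."""
    return range(instance_id, num_samples, num_instances)


# 64 processes on 4 nodes -> 16 instances per node; 48 cores is an assumption.
core_ranges = plan_core_ranges(num_cores=48, num_instances=16)
print(core_ranges[:3])  # [(0, 2), (3, 5), (6, 8)]
print(list(shard_indices(num_samples=10, num_instances=4, instance_id=1)))  # [1, 5, 9]
```

Round-robin sharding keeps the shards within one sample of each other in size, so no instance finishes much later than the rest.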


# Performance Overview

* For Training
    * End-to-end training time of our optimized DIEN on CPU vs. AWS P4D (A100): the gap is 2.14x on a single CLX node and shrinks to 1.05x after scaling out to 4 CLX nodes
<div><img src="./img/dien-training-perf.png" alt="DIEN Training" width="500"></div>
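
The two training gaps also imply how well training scales across nodes. A back-of-the-envelope calculation, assuming both gaps are measured against the same A100 run:

```python
# Gap = CPU training time / GPU training time, taken from the bullet above.
gap_single_node = 2.14
gap_four_nodes = 1.05
num_nodes = 4

# With the same GPU baseline, the ratio of gaps is the CPU-side speedup.
speedup = gap_single_node / gap_four_nodes
efficiency = speedup / num_nodes

print(f"speedup from 1 to {num_nodes} nodes: {speedup:.2f}x")  # 2.04x
print(f"scaling efficiency: {efficiency:.0%}")                 # 51%
```

Under that assumption, four nodes give about a 2.04x speedup over one node, or roughly 51% scaling efficiency.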

* For Inference
    * Inference throughput of our optimized DIEN on a single CLX node is 2.24x that of AWS P4D (A100), and it can scale out linearly to multiple nodes.
<div><img src="./img/dien-infer-perf.png" alt="DIEN Inference" width="500"></div>


------

# DEMO

* [Environment Setup](#Environment-Setup)
* [Data Process](#Data-Process)
* [Train](#Train)
* [Inference](#Inference)

## Environment Setup

In [7]:
! git clone https://github.com/intel/e2eAIOK.git; cd e2eAIOK; git submodule update --init --recursive

fatal: destination path 'e2eAIOK' already exists and is not an empty directory.
Submodule 'modelzoo/third_party/DeepLearningExamples' (https://github.com/NVIDIA/DeepLearningExamples.git) registered for path 'modelzoo/third_party/DeepLearningExamples'
Submodule 'modelzoo/third_party/IntelAI_models' (https://github.com/IntelAI/models.git) registered for path 'modelzoo/third_party/IntelAI_models'
Submodule 'modelzoo/third_party/alibaba-ai-matrix' (https://github.com/alibaba/ai-matrix.git) registered for path 'modelzoo/third_party/alibaba-ai-matrix'
Submodule 'modelzoo/third_party/dlrm' (https://github.com/facebookresearch/dlrm.git) registered for path 'modelzoo/third_party/dlrm'
Submodule 'modelzoo/third_party/mlperf_v1.0' (https://github.com/mlcommons/training_results_v1.0.git) registered for path 'modelzoo/third_party/mlperf_v1.0'
Submodule 'modelzoo/third_party/nnUNet' (https://github.com/MIC-DKFZ/nnUNet.git) registered for path 'modelzoo/third_party/nnUNet'
Submodule 'tests/cicd/bats'

In [1]:
! pip install e2eaiok-sda

Collecting e2eaiok-sda
  Downloading e2eAIOK_sda-1.0.0-py3-none-any.whl (84 kB)
Installing collected packages: e2eaiok-sda
Successfully installed e2eaiok-sda-1.0.0

In [3]:
! pip install intel-tensorflow==2.10 tqdm psutil horovod

Collecting intel-tensorflow==2.10
  Downloading intel_tensorflow-2.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (237.8 MB)
Collecting tqdm
  Downloading tqdm-4.65.0-py3-none-any.whl (77 kB)
Collecting tensorflow-estimator<2.11,>=2.10.0
  Downloading tensorflow_estimator-2.10.0-py2.py3-none-any.whl (438 kB)
Collecting numpy>=1.20
  Downloading numpy-1.24.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)

  Downloading pyasn1-0.4.8-py2.py3-none-any.whl (77 kB)
Collecting oauthlib>=3.0.0
  Downloading oauthlib-3.2.2-py3-none-any.whl (151 kB)
Installing collected packages: tensorboard-plugin-wit, pyasn1, libclang, keras, flatbuffers, wrapt, werkzeug, urllib3, typing-extensions, tqdm, termcolor, tensorflow-io-gcs-filesystem, tensorflow-estimator, tensorboard-data-server, rsa, pyasn1-modules, protobuf, oauthlib, numpy, markdown, grpcio, google-pasta, gast, charset-normalizer, certifi, cachetools, astunparse, absl-py, requests, opt-einsum, keras-preprocessing, h5py, google-auth, requests-oauthlib, google-auth-oauthlib, tensorboard, intel-tensorflow
Successfully installed absl-py-1.4.0 astunparse-1.6.3 cachetools-5.3.0 certif

## Prepare Workflow

In [9]:
! cd e2eAIOK/modelzoo/dien/train; sh patch_dien.sh

## Prepare data

In [None]:
! wget https://zenodo.org/record/3463683/files/data.tar.gz
! wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Books.json.gz
! wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/meta_Books.json.gz
! tar -xvf data.tar.gz
! gunzip reviews_Books.json.gz
! gunzip meta_Books.json.gz

! mkdir -p data/train
! mkdir -p data/valid

! mv *json data/train
! mv data/local_test_splitByUser data/valid
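
Before starting the data processing, it can help to confirm that the files ended up where the later steps expect them. The check below is a sketch: the expected paths are inferred from the commands above, and `missing_inputs` is a hypothetical helper.

```python
from pathlib import Path

# Paths inferred from the download and move commands above.
EXPECTED = [
    "train/reviews_Books.json",
    "train/meta_Books.json",
    "valid/local_test_splitByUser",
]


def missing_inputs(data_dir="data", expected=EXPECTED):
    """Return the expected input files that are not present under data_dir."""
    root = Path(data_dir)
    return [rel for rel in expected if not (root / rel).exists()]


missing = missing_inputs("data")
if missing:
    print("missing inputs:", ", ".join(missing))
else:
    print("all inputs are in place")
```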

## Data Process

In [5]:
# Data Processing
! cd /home/vmagent/app/e2eaiok/modelzoo/dien/feature_engineering/; python preprocessing.py --train --dataset_path /home/vmagent/app/dataset/amazon_reviews_proc/

sr140
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/29 00:53:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/29 00:53:59 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
recdp-scala-extension is enabled
per core memory size is 3.750 GB and shuffle_disk maximum capacity is 1200.000 GB
start spark process took 4.397668769001029 secs
save data to file:////home/vmagent/app/dataset/amazon_reviews_small//output//reviews-info
parse reviews-info with spark took 10.765656954958104 secs                      
save data to file:////home/vmagent/app/dataset/amazon_reviews_small//output//item-info
parse item-info with sp

In [2]:
# Data Processing for test
! cd /home/vmagent/app/e2eaiok/modelzoo/dien/feature_engineering/; python preprocessing.py --test --dataset_path /home/vmagent/app/dataset/amazon_reviews_proc/

sr140
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/27 16:21:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/27 16:21:01 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
recdp-scala-extension is enabled
per core memory size is 3.750 GB and shuffle_disk maximum capacity is 1200.000 GB
start spark process took 4.422215551021509 secs
save data to file:////home/vmagent/app/dataset/amazon_reviews_small/output//reviews-info
parse reviews-info with spark took 10.50614679302089 secs                       
save data to file:////home/vmagent/app/dataset/amazon_reviews_small/output//item-info
parse item-info with spar

In [11]:
!ls data

cat_voc.pkl	 reviews_Books.json  test_uid_voc.pkl  valid
meta_Books.json  test_cat_voc.pkl    train
mid_voc.pkl	 test_mid_voc.pkl    uid_voc.pkl


## Train

Write the vocabulary paths to `data/meta.yaml`, which is passed to the training script as `--meta_path`.

In [14]:
config = [
    "uid_voc: data/uid_voc.pkl" + "\n",
    "mid_voc: data/mid_voc.pkl" + "\n",
    "cat_voc: data/cat_voc.pkl" + "\n"
]
with open("data/meta.yaml", "w") as f:
    f.writelines(config)
! cat data/meta.yaml

uid_voc: data/uid_voc.pkl
mid_voc: data/mid_voc.pkl
cat_voc: data/cat_voc.pkl
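
The resulting file is a flat mapping from vocabulary name to pickle path. To sanity-check it, one could read it back and report the vocabulary sizes. This sketch assumes each `.pkl` file holds a pickled Python mapping (or another sized container) and parses the file by hand rather than with a YAML library; `read_meta` and `vocab_sizes` are hypothetical helpers.

```python
import pickle
from pathlib import Path


def read_meta(meta_path):
    """Parse the flat 'key: value' lines written to meta.yaml."""
    meta = {}
    for line in Path(meta_path).read_text().splitlines():
        if line.strip():
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    return meta


def vocab_sizes(meta):
    """Load each pickled vocabulary and return its number of entries."""
    sizes = {}
    for name, path in meta.items():
        with open(path, "rb") as f:
            sizes[name] = len(pickle.load(f))
    return sizes


meta_file = Path("data/meta.yaml")
if meta_file.exists():
    print(vocab_sizes(read_meta(meta_file)))
```

The sizes reported this way should match the `Number of uid = ..., mid = ..., cat = ...` line that the training script prints at startup, if the assumption about the pickle contents holds.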


In [10]:
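from e2eAIOK.SDA.SDA import SDA

settings = dict()
settings["data_path"] = "data/"  # directory holding train/, valid/ and meta.yaml
settings["enable_sigopt"] = False  # run without SigOpt
settings["python_path"] = "/usr/bin/"  # directory containing the python executable
settings["train_script"] = "e2eAIOK/modelzoo/dien/train/ai-matrix/script/train.py"

sda = SDA(model="DIEN", settings=settings)  # all other settings keep their defaults
sda.launch()  # launch training

hydro_model = sda.snapshot()
hydro_model.explain()  # print a summary of the best trained model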

2023-03-20 19:17:04,468 - E2EAIOK.SDA - INFO - ### Ready to submit current task  ###
2023-03-20 19:17:04,471 - E2EAIOK.SDA - INFO - Model Advisor created
2023-03-20 19:17:04,472 - E2EAIOK.SDA - INFO - model parameter initialized
2023-03-20 19:17:04,473 - E2EAIOK.SDA - INFO - start to launch training
2023-03-20 19:17:04,474 - sigopt - INFO - training launch command: /usr/bin//python -u e2eaiok/modelzoo/dien/train/ai-matrix/script/train.py --train_path data/train/local_train_splitByUser --test_path data/valid/local_test_splitByUser --meta_path data/meta.yaml --saved_path /home/vmagent/app/e2eaiok/result/DIEN/20230320_191704/74ee8e1d3e5b4458a4b60da27e1b4540e0503a691670915b44a6d643f933ed2f --num-intra-threads 32 --num-inter-threads 4 --mode train --embedding_device cpu --model DIEN --slice_id 0 --advanced true --seed 3 --data_type FP32 --batch_size 256
2023-03-20 19:17:04.579343: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neura

Advanced train
{'uid_voc': 'data/uid_voc.pkl', 'mid_voc': 'data/mid_voc.pkl', 'cat_voc': 'data/cat_voc.pkl', 'train_file': 'data/train/local_train_splitByUser', 'test_file': 'data/valid/local_test_splitByUser', 'model_type': 'DIEN', 'seed': 3, 'batch_size': 256, 'data_type': 'FP32'}
batch_size:  256
model:  DIEN
embedding_device cpu
best model will be saved to /home/vmagent/app/e2eaiok/result/DIEN/20230320_191704/74ee8e1d3e5b4458a4b60da27e1b4540e0503a691670915b44a6d643f933ed2f/dnn_best_model
/home/vmagent/app/e2eaiok/result/DIEN/20230320_191704/74ee8e1d3e5b4458a4b60da27e1b4540e0503a691670915b44a6d643f933ed2f/dnn_best_model/ckpt_noshuffDIEN3
Number of uid = 8026324, mid = 2330066, cat = 2752
embedding on cpu
-----------------------------------


  query = tf.compat.v1.layers.dense(query, facts_size, activation=None, name='f1' + stag)
  d_layer_1_all = tf.compat.v1.layers.dense(din_all, 80, activation=tf.nn.sigmoid, name='f1_att' + stag)
  d_layer_2_all = tf.compat.v1.layers.dense(d_layer_1_all, 40, activation=tf.nn.sigmoid, name='f2_att' + stag)
  d_layer_3_all = tf.compat.v1.layers.dense(d_layer_2_all, 1, activation=None, name='f3_att' + stag)
  bn1 = tf.compat.v1.layers.batch_normalization(inputs=inp, name='bn1')
  dnn1 = tf.compat.v1.layers.dense(bn1, 200, activation=None, name='f1')
  dnn2 = tf.compat.v1.layers.dense(dnn1, 80, activation=None, name='f2')
  dnn3 = tf.compat.v1.layers.dense(dnn2, 2, activation=None, name='f3')
2023-03-20 19:17:07.389910: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled


Start to load Data from disk
Loading Data from disk is completed with 114.68978548049927 secs, start to train


OMP: Info #211: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #209: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0-95
OMP: Info #156: KMP_AFFINITY: 96 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 24 cores/pkg x 2 threads/core (48 total cores)
OMP: Info #213: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 48 maps to package 0 core 0 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 49 maps to package 0 core 1 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 50 maps to package 0 core 2 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3 thread 0 
OMP: Info #171: KMP_AFFI

iter: 500 ----> train_loss: 1.6785 ---- train_accuracy: 0.5571 ---- train_aux_loss: 1.3397 ---- train_time: 53.210
 test_auc: 0.6325 ----test_loss: 1.6000 ---- test_accuracy: 0.5880 ---- test_aux_loss: 1.3862 ---- eval_time: 25.403 ---- num_iters: 474
current auc is 0.6325365220473758, target auc is 0.82
iter: 1000 ----> train_loss: 1.6727 ---- train_accuracy: 0.5816 ---- train_aux_loss: 1.3396 ---- train_time: 45.282
 test_auc: 0.6463 ----test_loss: 1.6037 ---- test_accuracy: 0.5998 ---- test_aux_loss: 1.3828 ---- eval_time: 25.293 ---- num_iters: 474
current auc is 0.6463273969847515, target auc is 0.82
iter: 1500 ----> train_loss: 1.6660 ---- train_accuracy: 0.5960 ---- train_aux_loss: 1.3367 ---- train_time: 45.676
 test_auc: 0.6833 ----test_loss: 1.5731 ---- test_accuracy: 0.6273 ---- test_aux_loss: 1.3780 ---- eval_time: 25.203 ---- num_iters: 474
current auc is 0.6832974324626151, target auc is 0.82
iter: 2000 ----> train_loss: 1.6579 ---- train_accuracy: 0.6105 ---- train_aux_l

2023-03-20 19:48:20,540 - sigopt - INFO - Training completed based in sigopt suggestion, took 1078.3184671401978 secs
2023-03-20 19:48:20,541 - E2EAIOK.SDA - INFO - training script completed



***    Best Trained Model    ***
  Model Type: DIEN
  Model Saved Path: /home/vmagent/app/e2eaiok/result/DIEN/20230320_191704/74ee8e1d3e5b4458a4b60da27e1b4540e0503a691670915b44a6d643f933ed2f
  Sigopt Experiment id is None
  === Result Metrics ===
    AUC: 0.8276467909469867
    training_time: 1078.3184671401978
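
The evaluation lines in the training log have a regular shape, so the AUC curve can be recovered from captured output. A sketch, assuming the log text is available as a string and the line format matches the output above; `parse_eval_lines` is a hypothetical helper, not part of e2eAIOK.

```python
import re

ITER_RE = re.compile(r"iter:\s*(\d+)\s*---->\s*train_loss:\s*([\d.]+)")
AUC_RE = re.compile(r"test_auc:\s*([\d.]+)")


def parse_eval_lines(log_text):
    """Return (iteration, train_loss, test_auc) for each evaluation step."""
    results, current = [], None
    for line in log_text.splitlines():
        iter_match = ITER_RE.search(line)
        if iter_match:
            current = (int(iter_match.group(1)), float(iter_match.group(2)))
            continue
        auc_match = AUC_RE.search(line)
        if auc_match and current is not None:
            # Pair the evaluation line with the preceding training line.
            results.append((*current, float(auc_match.group(1))))
            current = None
    return results


sample = """\
iter: 500 ----> train_loss: 1.6785 ---- train_accuracy: 0.5571 ---- train_aux_loss: 1.3397 ---- train_time: 53.210
 test_auc: 0.6325 ----test_loss: 1.6000 ---- test_accuracy: 0.5880 ---- test_aux_loss: 1.3862 ---- eval_time: 25.403 ---- num_iters: 474
iter: 1000 ----> train_loss: 1.6727 ---- train_accuracy: 0.5816 ---- train_aux_loss: 1.3396 ---- train_time: 45.282
 test_auc: 0.6463 ----test_loss: 1.6037 ---- test_accuracy: 0.5998 ---- test_aux_loss: 1.3828 ---- eval_time: 25.293 ---- num_iters: 474
"""
for step, loss, auc in parse_eval_lines(sample):
    print(step, loss, auc)
```

The parsed tuples can then be compared against the target AUC of 0.82 reported in the log or plotted to inspect convergence.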
