# AIDK DE-NAS BERT Demo

This demo shows how to apply DE-NAS, a train-free and hardware-aware neural architecture search (NAS) method, to BERT in order to derive a lighter and faster BERT-structured model.

# Content
* [Background and Motivation](#1)
* [DE-NAS Introduction](#2)
* [DE-NAS on BERT Experiment](#3)
* [Summary](#4)

<p id="1"></p>

## Background and Motivation

Automated approaches to democratizing deep neural networks are becoming increasingly popular: model compression, hyperparameter optimization (HPO), and the more recently emerging NAS all aim to improve deep learning (DL) efficiency. However, compression and HPO cannot cover all the optimizations required for efficient DL, so NAS, with its ability to design neural networks automatically, has attracted growing attention.

NAS is becoming an increasingly important technique for automatic model design, and it is often capable of outperforming hand-designed architectures. However, conventional NAS mostly targets a single domain and generalizes poorly across domains. It is also extremely computation-intensive because of the large search space and the iterative, training-based evaluation of candidate networks. Moreover, finding a suitable architecture for each target hardware platform requires a task-specific search, which exacerbates the challenge.

<center>
<img src="./img/NAS.png" width="400"/><figure>Conventional NAS</figure>
</center>

<p id="2"></p>

## DE-NAS Introduction

DE-NAS constructs compact neural architectures directly from carefully designed search spaces for multiple domains. It uses a hardware-aware search strategy that works within a given budget to determine the best network, and it evaluates each candidate with a hardware-aware, train-free scoring method rather than training the candidate and measuring its accuracy.



### DE-NAS on BERT Search Space
The transformer-based search space consists of the number of transformer layers, the number of attention heads, the query/key/value size, the MLP size, and the embedding dimension. The supernet of DE-NAS on BERT is a BERT-based structure, as shown in the figure below.

<center>
<img src="./img/NLP_Search_Space.png" width="800"/><figure>DE-NAS on BERT search space</figure>
</center>
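
As a rough illustration of the search space described above, the sketch below draws one candidate configuration at random. The value ranges, the `BertCandidate` class, the `sample_candidate` helper, and the rule that the query/key/value size equals 64 times the number of heads are all assumptions made for this example; the actual ranges are defined by the DE-NAS configuration and may differ.

```python
import random
from dataclasses import dataclass

# Hypothetical ranges for illustration only; the real search space is set in
# the DE-NAS configuration file and may differ.
SEARCH_SPACE = {
    "layer_num": range(6, 13),                   # number of transformer layers
    "head_num": range(8, 13),                    # number of attention heads
    "hidden_size": range(512, 769, 16),          # embedding dimension
    "intermediate_size": range(1024, 3073, 32),  # MLP size
}
HEAD_DIM = 64  # assumption: query/key/value size = HEAD_DIM * head_num


@dataclass(frozen=True)
class BertCandidate:
    layer_num: int
    head_num: int
    qkv_size: int
    hidden_size: int
    intermediate_size: int


def sample_candidate(rng: random.Random) -> BertCandidate:
    """Draw one sub-network configuration uniformly from the search space."""
    head_num = rng.choice(SEARCH_SPACE["head_num"])
    return BertCandidate(
        layer_num=rng.choice(SEARCH_SPACE["layer_num"]),
        head_num=head_num,
        qkv_size=head_num * HEAD_DIM,
        hidden_size=rng.choice(SEARCH_SPACE["hidden_size"]),
        intermediate_size=rng.choice(SEARCH_SPACE["intermediate_size"]),
    )


if __name__ == "__main__":
    print(sample_candidate(random.Random(0)))
```

Tying the query/key/value size to the head count keeps every sampled candidate internally consistent, which is one simple way to avoid invalid attention shapes.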

### DE-NAS Search Engine on BERT
The search strategy in the DE-NAS search engine adaptively generates candidate architectures from the search space based on the target hardware and maximizes the DE-Score to determine the best architecture. The search strategy is pluggable, and latency is integrated into the train-free DE-Score as an indicator. Currently, the DE-NAS search engine supports random search, evolutionary algorithm (EA) search, and Bayesian optimization.

The DE-Score is a train-free score used as a proxy for model accuracy in place of full training and validation. It is a zero-cost metric that combines a Gaussian complexity score based on network expressivity, a neural tangent kernel (NTK) score based on network complexity, a nuclear norm score based on network diversity, a Synflow score based on network saliency, and a latency score. Computing the DE-Score takes only a few forward inferences rather than iterative training, which makes it extremely fast, lightweight, and data-free. The figure below shows the hardware-aware search strategy with the EA algorithm.

<center>
<img src="./img/EA_Search_Algorithm.png" width="600"/><figure>Hardware-aware EA Search Algorithm</figure>
</center>
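
The hardware-aware EA search can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the DE-NAS implementation: `toy_score` is a stand-in for the DE-Score, `within_budget` uses a crude parameter estimate as a hypothetical hardware budget, and the value ranges, population size, and mutation probability are chosen only for illustration.

```python
import random
from typing import Callable, List, Tuple

Arch = Tuple[int, ...]  # (layers, heads, hidden size, MLP size)

# Hypothetical per-dimension choices, for illustration only.
CHOICES: List[List[int]] = [
    list(range(6, 13)),
    list(range(8, 13)),
    list(range(512, 769, 16)),
    list(range(1024, 3073, 32)),
]


def approx_params_m(arch: Arch) -> float:
    """Crude parameter estimate in millions (assumption, not the DE-NAS count)."""
    layers, _heads, hidden, mlp = arch
    return layers * (4 * hidden * hidden + 2 * hidden * mlp) / 1e6


def within_budget(arch: Arch, max_params_m: float = 40.0) -> bool:
    return approx_params_m(arch) <= max_params_m


def toy_score(arch: Arch) -> float:
    """Stand-in for the train-free DE-Score; higher is better."""
    layers, heads, hidden, mlp = arch
    return layers * heads + hidden / 64 + mlp / 256


def random_arch(rng: random.Random) -> Arch:
    return tuple(rng.choice(c) for c in CHOICES)


def mutate(arch: Arch, rng: random.Random, prob: float = 0.3) -> Arch:
    """Resample each dimension independently with probability `prob`."""
    return tuple(rng.choice(c) if rng.random() < prob else v
                 for v, c in zip(arch, CHOICES))


def crossover(a: Arch, b: Arch, rng: random.Random) -> Arch:
    """Take each dimension from one of the two parents."""
    return tuple(rng.choice(pair) for pair in zip(a, b))


def evolution_search(score_fn: Callable[[Arch], float],
                     budget_fn: Callable[[Arch], bool],
                     rng: random.Random,
                     population: int = 50, top_k: int = 10,
                     children: int = 25, generations: int = 5) -> Arch:
    # Seed the pool with random candidates that satisfy the budget.
    pool: List[Arch] = []
    while len(pool) < population:
        cand = random_arch(rng)
        if budget_fn(cand):
            pool.append(cand)
    for _ in range(generations):
        # Keep the best candidates; they are carried over unchanged,
        # so the best score never decreases between generations.
        parents = sorted(pool, key=score_fn, reverse=True)[:top_k]
        offspring: List[Arch] = []
        while len(offspring) < 2 * children:
            if len(offspring) < children:
                child = mutate(rng.choice(parents), rng)
            else:
                child = crossover(rng.choice(parents), rng.choice(parents), rng)
            if budget_fn(child):  # discard candidates that exceed the budget
                offspring.append(child)
        pool = parents + offspring
    return max(pool, key=score_fn)


if __name__ == "__main__":
    best = evolution_search(toy_score, within_budget, random.Random(0))
    print(best, toy_score(best), approx_params_m(best))
```

Rejecting over-budget candidates before scoring is one simple way to make the search hardware-aware; folding a latency term into the score itself, as the DE-Score does, is a complementary approach.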

The overall DE-Score is calculated with the following equation:

$$DE_{score}=(\alpha_1 D_{EXP}+\alpha_2 D_{COM}+\alpha_3 D_{DIV}+\alpha_4 D_{SAL})\,D_{LAT}$$
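
To make the structure of this equation concrete, the sketch below combines four component scores with weights and then scales the result by the latency score. The `ComponentScores` class, the `de_score` function, the default weights, and the sample values are hypothetical; how each component is actually computed and weighted is determined by DE-NAS and is not reproduced here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ComponentScores:
    expressivity: float  # D_EXP, Gaussian complexity score
    complexity: float    # D_COM, NTK score
    diversity: float     # D_DIV, nuclear norm score
    saliency: float      # saliency term, Synflow score
    latency: float       # D_LAT, latency score


def de_score(s: ComponentScores,
             alphas: Tuple[float, float, float, float] = (1.0, 1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the four quality terms, multiplied by the latency term."""
    a1, a2, a3, a4 = alphas  # hypothetical weights, not the DE-NAS defaults
    weighted = (a1 * s.expressivity + a2 * s.complexity
                + a3 * s.diversity + a4 * s.saliency)
    return weighted * s.latency


if __name__ == "__main__":
    # Made-up component values, used only to exercise the formula.
    scores = ComponentScores(expressivity=2.0, complexity=1.0,
                             diversity=0.5, saliency=0.5, latency=0.5)
    print(de_score(scores))
```

Because the latency term multiplies the whole sum rather than being added to it, a candidate with a poor latency score is penalized regardless of how well it does on the other four terms.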

<p id="3"></p>

## DE-NAS on BERT Experiment

### Environment Setup

* Build the Docker image

```
cd Dockerfile-ubuntu18.04
docker build -t aidk-pytorch110 . -f DockerfilePytorch110 --build-arg http_proxy --build-arg https_proxy
```

* Start the container

```
docker run -itd --name aidk-denas-bert --privileged --network host --device=/dev/dri -v ${dataset_path}:/home/vmagent/app/dataset -v ${aidk_code_path}:/home/vmagent/app/aidk -w /home/vmagent/app/ aidk-pytorch110 /bin/bash
```
* Enter the container with `docker exec -it aidk-denas-bert bash`

* Install Jupyter

```
source /opt/intel/oneapi/setvars.sh --ccl-configuration=cpu_icc --force
conda activate pytorch-1.10.0
pip install jupyter
```

### Prepare Dataset and Pre-trained BERT

In [2]:
!cd /home/vmagent/app/dataset && mkdir -p bert-base-uncased && cd bert-base-uncased && wget https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt -O vocab.txt && wget https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin -O pytorch_model.bin && wget https://huggingface.co/bert-base-uncased/resolve/main/config.json -O bert_config.json

--2022-10-14 03:44:51--  https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt
Resolving child-prc.intel.com (child-prc.intel.com)... 10.239.120.56
Connecting to child-prc.intel.com (child-prc.intel.com)|10.239.120.56|:913... connected.
Proxy request sent, awaiting response... 200 OK
Length: 231508 (226K) [text/plain]
Saving to: ‘vocab.txt’


2022-10-14 03:44:53 (321 KB/s) - ‘vocab.txt’ saved [231508/231508]

--2022-10-14 03:44:53--  https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin
Resolving child-prc.intel.com (child-prc.intel.com)... 10.239.120.56
Connecting to child-prc.intel.com (child-prc.intel.com)|10.239.120.56|:913... connected.
Proxy request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/bert-base-uncased/097417381d6c7230bd9e3557456d726de6e83245ec8b24f529f60198a67b203a?response-content-disposition=attachment%3B%20filename%3D%22pytorch_model.bin%22&Expires=1665976343&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0c

### Launch DE-NAS Search Process

In [None]:
!cd /home/vmagent/app/aidk/DeNas && python -u search.py --domain bert --conf ../conf/denas/nlp/aidk_denas_bert.conf

paths: /home/vmagent/app/aidk/DeNas/asr/utils, /home/vmagent/app/aidk/DeNas/asr
['/home/vmagent/app/aidk/DeNas', '/opt/intel/oneapi/advisor/2022.1.0/pythonapi', '/opt/intel/oneapi/intelpython/latest/envs/pytorch-1.10.0/lib/python39.zip', '/opt/intel/oneapi/intelpython/latest/envs/pytorch-1.10.0/lib/python3.9', '/opt/intel/oneapi/intelpython/latest/envs/pytorch-1.10.0/lib/python3.9/lib-dynload', '/opt/intel/oneapi/intelpython/latest/envs/pytorch-1.10.0/lib/python3.9/site-packages', '/opt/intel/oneapi/intelpython/latest/envs/pytorch-1.10.0/lib/python3.9/site-packages/warprnnt_pytorch-0.1-py3.9-linux-x86_64.egg', '', '..', '/home/vmagent/app/aidk/DeNas', '/home/vmagent/app/aidk/DeNas', '/home/vmagent/app/aidk/DeNas', '/home/vmagent/app/aidk/DeNas', '/home/vmagent/app/aidk/DeNas', '/home/vmagent/app/aidk/DeNas/asr']
loading archive file /home/vmagent/app/dataset/bert-base-uncased
10/14/2022 05:26:19 - INFO - nlp.supernet_bert -   Model config {
  "architectures": [
    "BertForMaskedLM"
  

10/14/2022 05:29:48 - INFO - DENAS -   random 48/50 structure (8, 10, 640, 656, 1984) nas_score 157.05245971679688 params 55.113584
10/14/2022 05:29:51 - INFO - DENAS -   random 49/50 structure (9, 9, 576, 608, 2400) nas_score 201.1036834716797 params 58.184448
10/14/2022 05:29:54 - INFO - DENAS -   random 50/50 structure (11, 10, 640, 512, 2208) nas_score 287.7393493652344 params 55.522144
10/14/2022 05:29:54 - INFO - DENAS -   random_num = 50
10/14/2022 05:29:57 - INFO - DENAS -   mutation 1/25 structure (6, 9, 576, 768, 2912) nas_score 179.87596130371094 params 61.937088
10/14/2022 05:29:59 - INFO - DENAS -   mutation 2/25 structure (10, 10, 640, 768, 1088) nas_score 301.97967529296875 params 60.876416
10/14/2022 05:30:03 - INFO - DENAS -   mutation 3/25 structure (9, 10, 640, 608, 2400) nas_score 194.6211395263672 params 59.587008
10/14/2022 05:30:06 - INFO - DENAS -   mutation 4/25 structure (9, 11, 704, 768, 1728) nas_score 211.99317932128906 params 67.855872
10/14/2022 05:30:10 

### Launch DE-NAS Training Process

In [None]:
!cd /home/vmagent/app/aidk/DeNas && python -m intel_extension_for_pytorch.cpu.launch --distributed --nproc_per_node=1 --nnodes=1 ./trainer/train.py --domain bert --conf /home/vmagent/app/aidk/conf/denas/nlp/aidk_denas_train_bert.conf --do_lower_case

2022-10-14 05:49:14,458 - __main__ - INFO - MASTER_ADDR=127.0.0.1
2022-10-14 05:49:14,458 - __main__ - INFO - MASTER_PORT=29500
2022-10-14 05:49:14,458 - __main__ - INFO - I_MPI_PIN_DOMAIN=[0xfffffffffff0,]
2022-10-14 05:49:14,459 - __main__ - INFO - OMP_NUM_THREADS=44
2022-10-14 05:49:14,459 - __main__ - INFO - Using Intel OpenMP
2022-10-14 05:49:14,459 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
2022-10-14 05:49:14,459 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-14 05:49:14,459 - __main__ - INFO - LD_PRELOAD=/opt/intel/oneapi/intelpython/latest/envs/pytorch-1.10.0/lib/libiomp5.so
2022-10-14 05:49:14,459 - __main__ - INFO - CCL_WORKER_COUNT=4
2022-10-14 05:49:14,459 - __main__ - INFO - CCL_WORKER_AFFINITY=0,1,2,3
2022-10-14 05:49:14,459 - __main__ - INFO - ['mpiexec.hydra', '-l', '-np', '1', '-ppn', '1', '-genv', 'I_MPI_PIN_DOMAIN=[0xfffffffffff0,]', '-genv', 'OMP_NUM_THREADS=44', '/opt/intel/oneapi/intelpython/latest/envs/pytorch-1.10.0/bin/python', '-u', './trainer

Iteration:  40%|###9      | 35/88 [00:51<01:07,  1.27s/it]
Iteration:  41%|####      | 36/88 [00:52<01:08,  1.31s/it]
Iteration:  42%|####2     | 37/88 [00:54<01:05,  1.29s/it]
Iteration:  43%|####3     | 38/88 [00:55<01:05,  1.32s/it]
Iteration:  44%|####4     | 39/88 [00:56<01:04,  1.32s/it]
Iteration:  45%|####5     | 40/88 [00:58<01:05,  1.37s/it]
Iteration:  47%|####6     | 41/88 [00:59<01:02,  1.32s/it]
Iteration:  48%|####7     | 42/88 [01:00<01:01,  1.34s/it]
Iteration:  49%|####8     | 43/88 [01:02<00:58,  1.31s/it]
Iteration:  50%|#####     | 44/88 [01:03<00:57,  1.30s/it]
Iteration:  51%|#####1    | 45/88 [01:04<00:54,  1.27s/it]
Iteration:  52%|#####2    | 46/88 [01:05<00:52,  1.25s/it]
Iteration:  53%|#####3    | 47/88 [01:07<00:51,  1.27s/it]
Iteration:  55%|#####4    | 48/88 [01:08<00:49,  1.25s/it]
10/14/2022 05:50:31 - INFO - model.nlp.bert_trainer -   ***** Run

Iteration:  44%|####4     | 39/88 [02:14<00:49,  1.00s/it]
Iteration:  45%|####5     | 40/88 [02:15<00:47,  1.01it/s]
Iteration:  47%|####6     | 41/88 [02:16<00:46,  1.01it/s]
Iteration:  48%|####7     | 42/88 [02:16<00:44,  1.02it/s]
Iteration:  49%|####8     | 43/88 [02:17<00:43,  1.03it/s]
Iteration:  50%|#####     | 44/88 [02:18<00:42,  1.03it/s]
Iteration:  51%|#####1    | 45/88 [02:19<00:41,  1.03it/s]
Iteration:  52%|#####2    | 46/88 [02:20<00:41,  1.02it/s]
Iteration:  53%|#####3    | 47/88 [02:21<00:39,  1.03it/s]
Iteration:  55%|#####4    | 48/88 [02:22<00:38,  1.03it/s]
Iteration:  56%|#####5    | 49/88 [02:23<00:37,  1.04it/s]
Iteration:  57%|#####6    | 50/88 [02:24<00:38,  1.02s/it]
Iteration:  58%|#####7    | 51/88 [02:25<00:37,  1.02s/it]
Iteration:  59%|#####9    | 52/88 [02:26<00:36,  1.01s/it]
Iteration:  60%|######    | 53/88 [02:27<00:34,  1.00it/s]
It

### DE-NAS Performance on BERT
* Overall DE-NAS performance on BERT

<center>
<img src="./img/Overall_Performance.png" width="600"/><figure>Overall Performance</figure>
</center>

With the same training settings as BERT-base except for early stopping, the DE-NAS-optimized BERT delivers a 1.64x parameter reduction, a 7.15x training speedup, and a 1.01 F1 score improvement.

* Training Optimization

    * DE-NAS helps BERT deliver a 1.56x speedup with full-epoch training.
    * With the early stop optimization, DE-BERT achieves a further 4.59x speedup, for a total of 7.15x.
    * With the distributed training optimization (2 processes on 1 SKX node), DE-BERT delivers a further 1.38x speedup, for a total of 9.84x.

<center>
<img src="./img/Training_Performance.png" width="600"/><figure>Training Optimization Performance</figure>
</center>
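
The totals in the training optimization list are cumulative: each stage multiplies the speedup reached by the previous ones. The short sketch below recomputes the running product from the per-stage factors quoted above; the `cumulative_speedups` helper and the stage labels are introduced here for illustration. The products come out close to, but not exactly equal to, the reported 7.15x and 9.84x, presumably because the published per-stage factors are rounded.

```python
from typing import Dict, List

# Per-stage speedups as quoted in the list above.
STAGE_SPEEDUPS: Dict[str, float] = {
    "DE-NAS architecture (full-epoch training)": 1.56,
    "early stop": 4.59,
    "distributed training (2 processes)": 1.38,
}


def cumulative_speedups(stages: Dict[str, float]) -> List[float]:
    """Return the running product of the per-stage speedups, in order."""
    totals: List[float] = []
    running = 1.0
    for factor in stages.values():
        running *= factor
        totals.append(running)
    return totals


if __name__ == "__main__":
    for name, total in zip(STAGE_SPEEDUPS, cumulative_speedups(STAGE_SPEEDUPS)):
        print(f"after {name}: {total:.2f}x")
```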

<p id="4"></p>

## Summary

* DE-NAS applied to BERT delivers a lighter (1.64x parameter reduction) and faster (7.15x training speedup) model with similar accuracy (1.01 F1 score improvement).
* With the training script optimizations, DE-NAS helps BERT deliver a further speedup.