# AIOK DE-NAS BERT Demo

This demo shows how to apply DE-NAS, a train-free and hardware-aware neural architecture search (NAS) method, to BERT in order to derive a lighter and faster BERT-structured model.

# Content
* [Architecture](#1)
* [Performance Overview](#2)
* [Demo](#3)

<p id="1"></p>

## Architecture

DE-NAS constructs compact neural architectures directly from carefully designed search spaces for multiple domains. It uses a hardware-aware search strategy to determine the best network within a given budget, and a hardware-aware, train-free scoring method to evaluate each candidate network instead of training it and measuring its accuracy.



### DE-NAS on BERT Search Space
The transformer-based search space consists of the number of transformer layers, the number of attention heads, the query/key/value size, the MLP size, and the embedding dimension. The supernet of DE-NAS on BERT is a BERT-based structure. Both are shown in the figure below.

<center>
<img src="./img/NLP_Search_Space.png" width="800"/><figure>DE-NAS on BERT search space</figure>
</center>
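
As a rough illustration of how one candidate could be drawn from such a search space, the sketch below samples each dimension on its `min`/`max`/`step` grid. The bounds are copied from the search space configuration in the Configuration section of this demo. The helper names `sample_candidate` and `is_valid` are hypothetical, and the divisibility rule between `QKV_SIZE` and `HEAD_NUM` is an assumption made for illustration, not a documented DE-NAS constraint.

```python
import random

# (min, max, step) per dimension, copied from the search space configuration
# shown in the Configuration section of this demo.
SEARCH_SPACE = {
    "LAYER_NUM": (4, 12, 1),
    "HIDDEN_SIZE": (128, 768, 16),
    "QKV_SIZE": (180, 768, 12),
    "HEAD_NUM": (8, 12, 1),
    "INTERMEDIATE_SIZE": (128, 3072, 32),
}


def sample_candidate(space, rng):
    """Draw one value per dimension from the grid min, min+step, ..., max."""
    candidate = {}
    for name, (lo, hi, step) in space.items():
        n_steps = (hi - lo) // step
        candidate[name] = lo + step * rng.randint(0, n_steps)
    return candidate


def is_valid(candidate):
    # Assumption for this sketch: the Q/K/V size must split evenly across heads.
    return candidate["QKV_SIZE"] % candidate["HEAD_NUM"] == 0


rng = random.Random(0)
cand = sample_candidate(SEARCH_SPACE, rng)
while not is_valid(cand):  # resample until the assumed constraint holds
    cand = sample_candidate(SEARCH_SPACE, rng)
print(cand)
```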

### DE-NAS Search Engine on BERT
The DE-NAS search engine adaptively generates candidate architectures from the search space based on the target hardware, and uses a pluggable search strategy to find the architecture that maximizes the DE-Score. Latency is integrated into the train-free DE-Score as an indicator. Currently, the search engine supports random search, evolutionary algorithm (EA) search, and Bayesian optimization. The figure below shows the EA search engine as an example.

<center>
<img src="./img/EA_Search_Algorithm.png" width="600"/><figure>Hardware-aware EA Search Algorithm</figure>
</center>
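
The loop below is a minimal sketch of a generic evolutionary search over this kind of search space, not the e2eAIOK implementation. The parameter names `max_epochs`, `population_num`, `select_num`, `mutation_num`, `crossover_num`, and `m_prob` mirror the search configuration in this demo, but their exact semantics in DE-NAS are assumed. `toy_score` is a hypothetical stand-in for the DE-Score, and the hardware budget check is omitted.

```python
import random

# (min, max, step) per dimension, as in the search space configuration.
SPACE = {
    "LAYER_NUM": (4, 12, 1),
    "HIDDEN_SIZE": (128, 768, 16),
    "QKV_SIZE": (180, 768, 12),
    "HEAD_NUM": (8, 12, 1),
    "INTERMEDIATE_SIZE": (128, 3072, 32),
}


def grid(lo, hi, step):
    return list(range(lo, hi + 1, step))


def random_candidate(rng):
    return {k: rng.choice(grid(*b)) for k, b in SPACE.items()}


def mutate(parent, rng, m_prob):
    """Resample each dimension independently with probability m_prob."""
    child = dict(parent)
    for k, b in SPACE.items():
        if rng.random() < m_prob:
            child[k] = rng.choice(grid(*b))
    return child


def crossover(a, b, rng):
    """Pick each dimension from one of the two parents at random."""
    return {k: rng.choice((a[k], b[k])) for k in SPACE}


def toy_score(c):
    # Hypothetical placeholder for the train-free DE-Score.
    return -(abs(c["LAYER_NUM"] - 8) + abs(c["HIDDEN_SIZE"] - 512) / 64)


def evolutionary_search(score_fn, rng, max_epochs=10, population_num=50,
                        select_num=50, mutation_num=25, crossover_num=25,
                        m_prob=0.2):
    population = [random_candidate(rng) for _ in range(population_num)]
    top = []
    for _ in range(max_epochs):
        # Keep the best select_num candidates seen so far.
        top = sorted(top + population, key=score_fn, reverse=True)[:select_num]
        mutated = [mutate(rng.choice(top), rng, m_prob)
                   for _ in range(mutation_num)]
        crossed = [crossover(rng.choice(top), rng.choice(top), rng)
                   for _ in range(crossover_num)]
        population = mutated + crossed
    top = sorted(top + population, key=score_fn, reverse=True)[:select_num]
    return top[0]


best = evolutionary_search(toy_score, random.Random(0))
print(best)
```

Because `top` accumulates the best candidates across generations, the returned architecture is the highest-scoring one evaluated during the whole run. A hardware-aware variant would additionally reject candidates that violate the parameter or latency budget before scoring them.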

The DE-Score is a train-free score used as a proxy for model accuracy in place of full training and validation. It is a zero-cost metric that combines a Gaussian complexity score based on network expressivity, an NTK score based on network complexity, a nuclear norm score based on network diversity, a Synflow score based on network saliency, and a latency score. Computing the DE-Score takes only a few forward inferences rather than iterative training, which makes it extremely fast, lightweight, and data-free.

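$$DE_{score}=(\alpha_1D_{EXP}+\alpha_2D_{COM}+\alpha_3D_{DIV}+\alpha_4D_{SAL})D_{LAT}$$

Here $D_{EXP}$, $D_{COM}$, $D_{DIV}$, and $D_{SAL}$ are the expressivity, complexity, diversity, and saliency scores, $D_{LAT}$ is the latency score, and $\alpha_1,\dots,\alpha_4$ are the corresponding weights.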
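
The formula can be sketched as a plain function. This is an illustration of the weighted combination only: the component scores passed in are made-up numbers, the function name `de_score` is hypothetical, and how DE-NAS derives the latency term from measured latency and `latency_weight` is not specified here, so it is passed in as an already computed factor. The weights are the `expressivity_weight`, `complexity_weight`, `diversity_weight`, and `saliency_weight` values from the search configuration in this demo.

```python
def de_score(expressivity, complexity, diversity, saliency, latency, weights):
    """Weighted sum of the four train-free scores, scaled by the latency term."""
    a1, a2, a3, a4 = weights
    weighted_sum = (a1 * expressivity + a2 * complexity
                    + a3 * diversity + a4 * saliency)
    return weighted_sum * latency


# Weights taken from the search configuration in this demo.
WEIGHTS = (0, 0, 0.00001, 1)

# Made-up component scores, for illustration only.
score = de_score(expressivity=1.0, complexity=2.0, diversity=3.0,
                 saliency=4.0, latency=0.5, weights=WEIGHTS)
print(score)
```

With these weights the expressivity and complexity terms drop out, so the score is driven by saliency, with a small diversity contribution, and scaled by the latency term.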

### DE-NAS BERT Architecture
Deploying the train-free EA search engine on the DE-NAS BERT search space and supernet yields the architecture shown in the figure below:

<center>
<img src="./img/DENAS BERT Architecture.png" width="400"/><figure>DE-NAS BERT Architecture</figure>
</center>

<p id="2"></p>

## Performance Overview

With the same training settings as BERT-base, apart from early stopping, the model searched by DE-NAS reduces the parameter count, speeds up training, and improves the F1 score.

Training Optimization

* DE-NAS gives BERT a training speedup even with full-epoch training.
* With the early stop optimization, DE-BERT achieves a further speedup.
* With the distributed training optimization, DE-BERT delivers the best speedup.

<p id="3"></p>

## Demo

* [Environment Setup](#4)
* [Configuration](#5)
* [Launch Search](#6)
* [Train Best Searched Model](#7)

<p id="4"></p>

### Environment Setup

* Build docker image

``` shell
# clone the e2eaiok repo
git clone https://github.com/intel/e2eAIOK.git
cd e2eAIOK
git submodule update --init --recursive

# build the docker
python3 scripts/start_e2eaiok_docker.py -b pytorch120 -w ${host0} ${host1} ${host2} ${host3} --proxy ""
# connect the docker
sshpass -p docker ssh ${host0} -p 12347
```

<p id="5"></p>

### Configuration

* Conf for BERT DE-NAS Search
```yaml
model_type: bert
search_engine: EvolutionarySearchEngine
batch_size: 32
random_max_epochs: 1000
max_epochs: 10
select_num: 50
population_num: 50
m_prob: 0.2
s_prob: 0.4
crossover_num: 25
mutation_num: 25
supernet_cfg: ../../conf/denas/nlp/supernet-bert-base.yaml
pretrained_bert: /home/vmagent/app/dataset/bert-base-uncased
pretrained_bert_config: /home/vmagent/app/dataset/bert-base-uncased/bert_config.json
img_size: 128
max_param_limits: 110
min_param_limits: 55
seed: 0
expressivity_weight: 0
complexity_weight: 0
diversity_weight: 0.00001
saliency_weight: 1
latency_weight: 0.01
```
* Conf for BERT Supernet and Search Space
```yaml
SUPERNET:
  LAYER_NUM: 12
  NUM_ATTENTION_HEADS: 12
  HIDDEN_SIZE: 768
  INTERMEDIATE_SIZE: 3072
  QKV_SIZE: 768
SEARCH_SPACE:
  LAYER_NUM:
    bounds:
      min: 4
      max: 12
      step: 1
    type: int
  HIDDEN_SIZE:
    bounds:
      min: 128
      max: 768
      step: 16
    type: int
  QKV_SIZE:
    bounds:
      min: 180
      max: 768
      step: 12
    type: int
  HEAD_NUM:
    bounds:
      min: 8
      max: 12
      step: 1
    type: int
  INTERMEDIATE_SIZE:
    bounds:
      min: 128
      max: 3072
      step: 32
    type: int
```
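
To relate these sizes to the `min_param_limits` and `max_param_limits` values in the search configuration, the sketch below estimates the parameter count of a BERT encoder from the searched dimensions. It is an approximation under stated assumptions: the function `bert_param_count` is hypothetical, the vocabulary size (30522), maximum position (512), and token type count (2) are those of `bert-base-uncased`, the Q/K/V projections map `HIDDEN_SIZE` to `QKV_SIZE`, and the limits are assumed to be expressed in millions of parameters.

```python
def bert_param_count(layer_num, hidden_size, intermediate_size, qkv_size,
                     vocab_size=30522, max_position=512, type_vocab=2):
    """Estimate parameters of a BERT encoder with pooler (no task head)."""
    layer_norm = 2 * hidden_size  # scale and bias
    embeddings = (vocab_size + max_position + type_vocab) * hidden_size + layer_norm
    attention = (
        3 * (hidden_size * qkv_size + qkv_size)    # query, key, value
        + (qkv_size * hidden_size + hidden_size)   # attention output projection
        + layer_norm
    )
    feed_forward = (
        (hidden_size * intermediate_size + intermediate_size)
        + (intermediate_size * hidden_size + hidden_size)
        + layer_norm
    )
    pooler = hidden_size * hidden_size + hidden_size
    return embeddings + layer_num * (attention + feed_forward) + pooler


# Supernet sizes from the configuration above (BERT-base).
supernet_params = bert_param_count(layer_num=12, hidden_size=768,
                                   intermediate_size=3072, qkv_size=768)
params_in_millions = supernet_params / 1e6

# Assumption: the limits in the search configuration are in millions.
within_budget = 55 <= params_in_millions <= 110
print(supernet_params, within_budget)
```

The number of attention heads does not appear in the estimate because splitting the Q/K/V projection across heads does not change its parameter count.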
* Conf for BERT DE-NAS Train
```yaml
domain: bert
task_name: squad1
data_set: SQuADv1.1
num_train_examples: 87599
best_model_structure: /home/vmagent/app/e2eaiok/e2eAIOK/DeNas/best_model_structure.txt
model: /home/vmagent/app/dataset/bert-base-uncased/
model_dir: /home/vmagent/app/dataset/bert-base-uncased/
data_dir: /home/vmagent/app/dataset/SQuAD/
output_dir: /home/vmagent/app/e2eaiok/e2eAIOK/DeNas/nlp/
dist_backend: gloo
gradient_accumulation_steps: 1
warmup_proportion: 0.1
learning_rate: 0.00003
weight_decay: 0.0001
train_epochs: 4
max_seq_length: 384
doc_stride: 128
train_batch_size: 12
eval_batch_size: 32
eval_step: 200
n_best_size: 20
max_answer_length: 30
max_query_length: 64
criterion: "CrossEntropyQALoss"
optimizer: "BertAdam"
lr_scheduler: "warmup_linear"
version_2_with_negative: 0
null_score_diff_threshold: 0.0
num_labels: 2
num_workers: 10
pin_mem: True
verbose_logging: False
no_cuda: True
do_lower_case: True
metric_threshold: 81.5
eval_metric: "qa_f1"
```
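
To show how the training values interact, the sketch below derives the number of optimizer steps from `num_train_examples`, `train_batch_size`, `gradient_accumulation_steps`, and `train_epochs`, then evaluates a linear warmup followed by linear decay using `learning_rate` and `warmup_proportion`. This is one common form of a warmup-linear schedule and is assumed here; the scheduler that e2eAIOK selects for `warmup_linear` may differ in detail, and the step count ignores the extra features produced by `doc_stride`. Both function names are hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, epochs, grad_accum=1):
    """Optimizer steps for a run, assuming one sample per training example."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum))
    return steps_per_epoch * epochs


def warmup_linear_lr(step, total_steps, base_lr, warmup_proportion):
    """Linear warmup to base_lr, then linear decay to zero (assumed form)."""
    progress = step / total_steps
    if progress < warmup_proportion:
        return base_lr * progress / warmup_proportion
    return base_lr * max(0.0, (1.0 - progress) / (1.0 - warmup_proportion))


# Values taken from the training configuration above.
total = total_training_steps(num_examples=87599, batch_size=12, epochs=4)
warmup_end = int(total * 0.1)

for step in (0, warmup_end // 2, warmup_end, total):
    print(step, warmup_linear_lr(step, total, base_lr=0.00003,
                                 warmup_proportion=0.1))
```

Under these assumptions the learning rate peaks at `learning_rate` once the first 10% of the steps have completed and reaches zero at the final step.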

<p id="6"></p>

### Launch Search

!cd /home/vmagent/app/e2eaiok/e2eAIOK/DeNas && /opt/intel/oneapi/intelpython/latest/envs/pytorch-1.12.0/bin/python -u search.py --domain bert --conf /home/vmagent/app/e2eaiok/conf/denas/nlp/e2eaiok_denas_bert.conf

paths: /home/vmagent/app/e2eaiok/e2eAIOK/DeNas/asr/utils, /home/vmagent/app/e2eaiok/e2eAIOK/DeNas/asr
['/home/vmagent/app/e2eaiok/e2eAIOK/DeNas', '/opt/intel/oneapi/advisor/2022.3.0/pythonapi', '/opt/intel/oneapi/intelpython/latest/envs/pytorch-1.12.0/lib/python39.zip', '/opt/intel/oneapi/intelpython/latest/envs/pytorch-1.12.0/lib/python3.9', '/opt/intel/oneapi/intelpython/latest/envs/pytorch-1.12.0/lib/python3.9/lib-dynload', '/opt/intel/oneapi/intelpython/latest/envs/pytorch-1.12.0/lib/python3.9/site-packages', '/opt/intel/oneapi/intelpython/latest/envs/pytorch-1.12.0/lib/python3.9/site-packages/e2eAIOK-0.2.1-py3.9.egg', '', '/home/vmagent/app/e2eaiok/e2eAIOK/DeNas', '/home/vmagent/app/e2eaiok/e2eAIOK/DeNas', '/home/vmagent/app/e2eaiok/e2eAIOK/DeNas', '/home/vmagent/app/e2eaiok/e2eAIOK/DeNas', '/home/vmagent/app/e2eaiok/e2eAIOK/DeNas', '/home/vmagent/app/e2eaiok/e2eAIOK/DeNas/asr']
loading archive file /home/vmagent/app/dataset/bert-base-uncased
12/01/2022 13:43:12 - INFO - nlp.super

<p id="7"></p>

### Train Best Searched Model

!cd /home/vmagent/app/e2eaiok/e2eAIOK/DeNas && /opt/intel/oneapi/intelpython/latest/envs/pytorch-1.12.0/bin/python -m intel_extension_for_pytorch.cpu.launch --distributed --nproc_per_node=1 --nnodes=1 train.py --domain bert --conf /home/vmagent/app/e2eaiok/conf/denas/nlp/e2eaiok_denas_train_bert.conf

2022-12-01 13:52:41,414 - __main__ - INFO - MASTER_ADDR=127.0.0.1
2022-12-01 13:52:41,414 - __main__ - INFO - MASTER_PORT=29500
2022-12-01 13:52:41,415 - __main__ - INFO - I_MPI_PIN_DOMAIN=[0xffffffffffff0,]
2022-12-01 13:52:41,415 - __main__ - INFO - OMP_NUM_THREADS=48
2022-12-01 13:52:41,415 - __main__ - INFO - Using Intel OpenMP
2022-12-01 13:52:41,416 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
2022-12-01 13:52:41,416 - __main__ - INFO - KMP_BLOCKTIME=1
2022-12-01 13:52:41,416 - __main__ - INFO - LD_PRELOAD=/opt/intel/oneapi/intelpython/latest/lib/libiomp5.so
2022-12-01 13:52:41,416 - __main__ - INFO - CCL_WORKER_COUNT=4
2022-12-01 13:52:41,416 - __main__ - INFO - CCL_WORKER_AFFINITY=0,1,2,3
2022-12-01 13:52:41,416 - __main__ - INFO - mpiexec.hydra -l -np 1 -ppn 1 -genv I_MPI_PIN_DOMAIN=[0xffffffffffff0,] -genv OMP_NUM_THREADS=48 /opt/intel/oneapi/intelpython/latest/envs/pytorch-1.12.0/bin/python -u train.py --domain bert --conf /home/vmagent/app/e2eaiok/conf/dena

[0] 12/01/2022 13:52:44 - INFO - e2eAIOK.common.trainer.data.data_builder_squad -   load 1027 examples!
[0] 12/01/2022 13:52:46 - INFO - e2eAIOK.DeNas.module.nlp.tokenization -   loading vocabulary file
[0] 12/01/2022 13:52:47 - INFO - e2eAIOK.common.trainer.data.data_builder_squad -   load 1680 examples!
[0] 12/01/2022 13:52:49 - INFO - Trainer -   Trainer config: {'domain': 'bert', 'task_name': 'squad1', 'data_set': 'SQuADv1.1', 'num_train_examples': 87599, 'best_model_structure': '/home/vmagent/app/e2eaiok/e2eAIOK/DeNas/best_model_structure.txt', 'model': '/home/vmagent/app/dataset/bert-base-uncased/', 'model_dir': '/home/vmagent/app/dataset/bert-base-uncased/', 'data_dir': '/home/vmagent/app/dataset/SQuAD/', 'output_dir': '/home/vmagent/app/e2eaiok/e2eAIOK/DeNas/nlp/', 'dist_backend': 'gloo', 'gradient_accumulation_steps': 1, 'warmup_proportion': 0.1, 'learning_rate': 3e-05, 'weight_decay': 0.0001, 'train_epochs': 4, 'max_seq_length': 384, 'doc_stride': 128, 'train_batch_size': 12,