# DE-NAS & TLK on BERT with Hugging Face Demo
This demo shows how Hugging Face, TLK (Transfer Learning Kit), and DE-NAS can be used together on BERT. A BERT-structured model from Hugging Face is optimized into a lighter and faster model through DE-NAS and TLK, and the optimized model can then be uploaded to a Hugging Face repo for broader use.

# Content
* [Background](#1)
* [Motivation](#2)
* [Hugging Face](#3)
* [Experiment](#4)
* [Summary](#5)

<p id="1"></p>

## Background

* Conventional NAS is extremely computation-intensive and scales poorly across diverse models. As shown in the figure, CNN-based and RNN-based models are commonly used in CV and NLP respectively, and conventional NAS work in both domains is computation-intensive.

<center>
<img src="./img/Nas_problem.png" width="400"/><figure>Conventional NAS</figure>
</center>

* NAS combined with transfer learning (knowledge distillation) has proved useful in almost all recent NAS work on BERT.

<style>
table {
margin: auto;
}
</style>

| Institution | Representative Work | Hugging Face Repo |
|:----|:----|:----|
| Huawei | [DynaBERT (NeurIPS-20)](https://arxiv.org/abs/2004.04037), [TinyBERT (EMNLP-20)](https://arxiv.org/abs/1909.10351), [AutoTinyBERT (ACL-21)](https://arxiv.org/abs/2107.13686), [EfficientBERT (EMNLP-21)](https://arxiv.org/abs/2109.07222), [AutoBERT-Zero (AAAI-22)](https://arxiv.org/abs/2107.07445) | Almost |
| Alibaba | [AdaBERT (IJCAI-20)](https://arxiv.org/abs/2001.04246) | No |
| Microsoft | [NAS-BERT (KDD-21)](https://arxiv.org/abs/2105.14444), [AutoDistil (NeurIPS-22)](https://arxiv.org/abs/2201.08539) | Others |

* Note:
    * "Almost" means that almost all of the above works have a Hugging Face repo.
    * "Others" means that models other than BERT (e.g., Transformer) or other light BERT variants (e.g., obtained by compression) have a Hugging Face repo.

<p id="2"></p> 

## Motivation

### Language Model
The latest major innovation in NLP is undoubtedly the large pre-trained language model. A language model benefits from a good pre-trained model followed by fine-tuning on the target task, which is in effect a transfer learning process.

The most representative language model, BERT, is pre-trained on large unannotated text corpora and then fine-tuned on 11 NLP tasks, where it achieved state-of-the-art results when it was published. Apart from the output layers, the same architecture is used in both pre-training and fine-tuning. The same pre-trained parameters are used to initialize the models for different downstream tasks, and all parameters are updated during fine-tuning.

<center>
<img src="./img/BERT.png" width="500"/><figure>Overall pre-training and fine-tuning procedures for BERT</figure>
</center>

### DE-NAS with Hugging Face
* DE-NAS is a train-free, cross-domain (unified <u>transformer</u>) NAS.
* Hugging Face is <u>transformer-based</u>.

<center>
<img src="./img/DENAS_Huggingface.png" width="500"/><figure>DE-NAS with Hugging Face</figure>
</center>


### DE-NAS with Transfer Learning
* DE-NAS can provide a lighter and faster model, but that model can be improved further.
* Transfer learning can take the light model as the target model and inject knowledge from other models into it.

<center>
<img src="./img/DENAS_TLK.png" width="500"/><figure>DE-NAS with TLK</figure>
</center>

<p id="3"></p> 

## Hugging Face

Hugging Face started as a chatbot start-up in New York. It later open-sourced its Transformers library on GitHub, and the library has kept growing since.

<center>
<img src="./img/Huggingface.png" width="600"/><figure>Hugging Face</figure>
</center>

* Database: [Datasets](https://huggingface.co/datasets), [Models](https://huggingface.co/models)
* API: [Transformers](https://huggingface.co/docs/transformers/index)...
* Community: [Intel Page](https://huggingface.co/Intel?sort_models=downloads#models), [Forum](https://discuss.huggingface.co/), [Course](https://huggingface.co/course/chapter1/1)...

<p id="4"></p> 

## Experiment

### Environment Setup

* Build docker image

```
cd Dockerfile-ubuntu18.04
docker build -t aidk-pytorch110 . -f DockerfilePytorch110 --build-arg http_proxy --build-arg https_proxy
```

```
docker run -itd --name aidk-denas-bert --privileged --network host --device=/dev/dri -v ${dataset_path}:/home/vmagent/app/dataset -v ${aidk_code_path}:/home/vmagent/app/aidk -w /home/vmagent/app/ aidk-pytorch110 /bin/bash
```
* Enter container with `docker exec -it aidk-denas-bert bash`

* Install Jupyter and the Hugging Face Transformers API

```
source /opt/intel/oneapi/setvars.sh --ccl-configuration=cpu_icc --force
conda activate pytorch-1.10.0
pip install jupyter
pip install transformers[torch]
```

### Launch DE-NAS with TLK 

* Prepare dataset and pre-trained BERT from Hugging Face

Run in a Jupyter cell: `!cd /home/vmagent/app/dataset && mkdir -p bert-base-uncased && cd bert-base-uncased && wget https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt -O vocab.txt && wget https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin -O pytorch_model.bin && wget https://huggingface.co/bert-base-uncased/resolve/main/config.json -O bert_config.json`


* Load the pre-trained model from Hugging Face

After downloading the pre-trained model from Hugging Face, we can load it into the DE-NAS supernet and the searched candidate for further optimization, as follows:

``` python
# Use the pytorch_model.bin and bert_config.json downloaded from Hugging Face
# to construct and initialize the DE-NAS model.
# SuperBertForQuestionAnswering.from_pretrained parameters:
#     pretrained_model_name_or_path: directory that holds "pytorch_model.bin" and "bert_config.json"
#     config: path to "bert_config.json"
model = SuperBertForQuestionAnswering.from_pretrained(pretrained_model_name_or_path, config)
```
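
Before calling `from_pretrained`, it can help to confirm that the download step left the expected files in place. The sketch below uses only the standard library. The helper name `missing_model_files` is hypothetical, and the file list simply mirrors the three `wget` targets above (note that `config.json` was saved as `bert_config.json`).

```python
from pathlib import Path

# File names written by the wget step above.
EXPECTED_FILES = ("vocab.txt", "pytorch_model.bin", "bert_config.json")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected files that are absent or empty in model_dir."""
    root = Path(model_dir)
    return [
        name
        for name in expected
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    # Directory used by the download command in this demo.
    missing = missing_model_files("/home/vmagent/app/dataset/bert-base-uncased")
    print("missing files:", missing or "none")
```

An empty result means all three files exist and are non-empty; it does not validate their contents.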

* Launch DE-NAS search process

Run in a Jupyter cell: `!cd /home/vmagent/app/aidk/DeNas && python -u search.py --domain bert --conf ../conf/denas/nlp/aidk_denas_bert.conf`

The search log (abridged) first loads the archive from `/home/vmagent/app/dataset/bert-base-uncased` and prints the model config, then scores 50 random structures followed by 25 mutated ones. The last random candidates and the first mutations are:

| Phase | Index | Structure | nas_score | params |
|:----|:----|:----|:----|:----|
| random | 48/50 | (8, 10, 640, 656, 1984) | 155.35948181152344 | 55.113584 |
| random | 49/50 | (9, 9, 576, 608, 2400) | 200.03761291503906 | 58.184448 |
| random | 50/50 | (11, 10, 640, 512, 2208) | 233.29034423828125 | 55.522144 |
| mutation | 1/25 | (6, 9, 576, 768, 2912) | 173.75631713867188 | 61.937088 |
| mutation | 2/25 | (10, 10, 640, 768, 1088) | 278.1615295410156 | 60.876416 |
| mutation | 3/25 | (9, 11, 704, 768, 1728) | 170.14674377441406 | 67.855872 |
| mutation | 4/25 | (11, 12, 768, 640, 1632) | 212.39134216308594 | 64.965536 |
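
The `params` value reported for each candidate can be reproduced with a few lines of arithmetic, which also clarifies what the structure tuple means. The sketch below assumes the tuple order is (layers, attention heads, QKV size, hidden size, intermediate size) and that the model follows the standard BERT layout with the `bert-base-uncased` vocabulary (30522 tokens, 512 positions, 2 token types) plus a pooler. These are assumptions rather than something the log states, and `count_bert_params` is a hypothetical helper, not part of DE-NAS. Under them, the count matches the logged value for the `random 48/50` candidate, in millions.

```python
def count_bert_params(layers, qkv, hidden, intermediate,
                      vocab=30522, positions=512, token_types=2):
    """Count weights and biases of a BERT encoder with a pooler."""
    layer_norm = 2 * hidden  # scale and shift vectors
    embeddings = (vocab + positions + token_types) * hidden + layer_norm
    # Query, key, and value projections: hidden -> qkv, each with a bias.
    attention = 3 * (hidden * qkv + qkv)
    # Attention output projection (qkv -> hidden) followed by LayerNorm.
    attention += qkv * hidden + hidden + layer_norm
    # Feed-forward block: hidden -> intermediate -> hidden, then LayerNorm.
    feed_forward = hidden * intermediate + intermediate
    feed_forward += intermediate * hidden + hidden + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


# Structure (8, 10, 640, 656, 1984) from the log. The head count does not
# change the parameter count, so it is not an argument here.
print(count_bert_params(8, 640, 656, 1984) / 1e6)  # 55.113584
```

With the same assumptions, the function returns 72096320 for the trained architecture reported in the training summary (11 layers, QKV size 704, hidden size 768, intermediate size 1408), which equals the logged `parameter size`.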

* Launch DE-NAS Training with TLK

1. DE-NAS model wrapped with TLK

``` python
# ---bert_trainer.py---

# import tlk
import sys
sys.path.append("/home/vmagent/app/aidk/AIDK/")
sys.path.append("/home/vmagent/app/aidk/AIDK/TransferLearningKit")
from TransferLearningKit.src.engine_core import transferrable_model
from TransferLearningKit.src.engine_core.distiller import kd
...
...
...
class BertTrainer(BaseTrainer):
    def __init__(self, args):
        ...
        ...
        # construct teacher model builder
        if self.args.is_transfer_learning:
            self.teacher_model_builder = BertModelBuilder(self.args)
            self.teacher_model_builder.model_dir = self.args.teacher_model_dir
        ...
        ...
    def fit(self):
        ...
        ...
        if self.args.is_transfer_learning:
            # construct teacher model
            self.teacher_model = self.teacher_model_builder.init_model()
            self.teacher_model_config = self.teacher_model_builder.decode_arch(filename="best_model_structure_bert.txt")
            # unwrap a wrapper that exposes `.module` (e.g. a distributed wrapper) before setting the sampled architecture
            teacher = self.teacher_model.module if hasattr(self.teacher_model, 'module') else self.teacher_model
            teacher.set_sample_config(self.teacher_model_config)
            # wrap the DE-NAS model with knowledge distillation
            self.teacher_distiller = kd.KD(pretrained_model=self.teacher_model, is_frozen=True, temperature=4)
            model = transferrable_model.make_transferrable_with_knowledge_distillation(model, model.loss, self.teacher_distiller, None, "x", True, 0.1, 0.9)
        ...
        ...
```
2. DE-NAS with TLK training script

Run in a Jupyter cell: `!cd /home/vmagent/app/aidk/DeNas && python -m intel_extension_for_pytorch.cpu.launch --distributed --nproc_per_node=1 --nnodes=1 ./trainer/train.py --domain bert --conf /home/vmagent/app/aidk/conf/denas/nlp/aidk_denas_train_bert.conf --do_lower_case --is_transfer_learning`

The launcher log (abridged) reports `MASTER_ADDR=127.0.0.1`, `MASTER_PORT=29500`, `OMP_NUM_THREADS=44`, Intel OpenMP with `KMP_AFFINITY=granularity=fine,compact,1,0` and `KMP_BLOCKTIME=1`, and `CCL_WORKER_COUNT=4`, then starts a single process through `mpiexec.hydra`. Training runs for 2 epochs of 88 iterations each (12:25 in total) and ends with the following summary:

| Field | Value |
|:----|:----|
| task_name | squad1 |
| architecture | `{'sample_layer_num': 11, 'sample_num_attention_heads': [11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11], 'sample_qkv_sizes': [704, 704, 704, 704, 704, 704, 704, 704, 704, 704, 704], 'sample_hidden_size': 768, 'sample_intermediate_sizes': [1408, 1408, 1408, 1408, 1408, 1408, 1408, 1408, 1408, 1408, 1408]}` |
| parameter size | 72096320 |
| total training time | 745.3035025596619 |
| best_acc | f1: 14.333607124473287; em: 10.178571428571429 |
| time_per_batch_infer | 807.586 ms |
| infer_cnt | 162 |
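
For readers new to knowledge distillation, the following standard-library sketch shows the kind of objective that a wrapper such as the one in `bert_trainer.py` typically optimizes: a weighted sum of the student's task loss and a temperature-softened KL term against the frozen teacher. It is a generic formulation, not TLK's actual implementation. Mapping `0.1` and `0.9` to the task and distillation weights, and scaling the KL term by the squared temperature, are assumptions; only the temperature of 4 and the two weight values come from the trainer code.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    return kl * temperature ** 2


def combined_loss(task_loss, student_logits, teacher_logits,
                  task_weight=0.1, kd_weight=0.9, temperature=4.0):
    """Weighted sum of the student's own loss and the distillation term."""
    distill = kd_loss(student_logits, teacher_logits, temperature)
    return task_weight * task_loss + kd_weight * distill


# When the student already matches the teacher, only the task loss remains.
print(combined_loss(2.0, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

A higher temperature flattens both distributions, so the student also learns from the teacher's relative scores for non-top answers rather than only from its top prediction.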


### DE-NAS with TLK performance

<center>
<img src="./img/DENAS_performance.png" width="500"/><figure>DE-NAS Performance</figure>
</center>

<center>
<img src="./img/DENAS_W_TLK_performance.png" width="500"/><figure>DE-NAS Performance with TLK</figure>
</center>

* As shown in the two figures above:
    * Models found by DE-NAS are lighter and train faster.
    * Furthermore, DE-NAS with TLK delivers a higher F1 score at almost every step within one epoch, which indicates that TLK helps DE-NAS models converge faster and reach better performance.

### Upload the DE-NAS w/wo TLK Model to Hugging Face
* With DE-NAS, with or without TLK, we can optimize a model from Hugging Face into a lighter and faster DE-NAS model with a similar or higher F1 score. The result can be uploaded to Hugging Face, which is expected to make deployment onto hardware easier.
* The figures below show the upload process to a personal Hugging Face repo. Contributing to the Intel open repo works like the usual GitHub operations (e.g., submitting a PR).
    * Step 1: create the model repo

    <center>
    <img src="./img/Create_HuggingFace_Model.png" width="500"/><figure>Create Model Repo in Hugging Face</figure>
    </center>

    * Step 2: upload the model files
    <center>
    <img src="./img/Upload_HuggingFace_Model.png" width="800"/><figure>Upload Model into Hugging Face Repo</figure>
    </center>
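
The same two steps can also be scripted instead of done through the web UI. The sketch below assumes the `huggingface_hub` package is installed and that you have already authenticated (for example with `huggingface-cli login`). The function name `upload_model_dir`, the `dry_run` switch, and the directory and repo id in the usage comment are hypothetical.

```python
from pathlib import Path


def upload_model_dir(model_dir, repo_id, dry_run=True):
    """List the files in model_dir and, unless dry_run, push them to the Hub."""
    files = sorted(p.name for p in Path(model_dir).iterdir() if p.is_file())
    if dry_run:
        return files  # nothing is sent; shows what would be uploaded

    # Imported lazily so the dry run works without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi()
    # Step 1: create the model repo (no error if it already exists).
    api.create_repo(repo_id=repo_id, exist_ok=True)
    # Step 2: upload every file in the directory to the repo root.
    api.upload_folder(folder_path=str(model_dir), repo_id=repo_id)
    return files


# Example with a hypothetical directory and repo id:
# upload_model_dir("./denas_bert", "your-username/denas-bert", dry_run=False)
```

Running with `dry_run=True` first is a cheap way to check that only the intended model files are in the directory before anything is published.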

<p id="5"></p> 

## Summary

* DE-NAS automatically designs a well-performing, compact BERT.
* TLK helps the DE-NAS BERT in the fine-tuning stage to further improve its performance.
* Hugging Face serves both as the source of models for DE-NAS and TLK to optimize and as the repo that hosts the optimized model for broader use.

<center>
    <img src="./img/Overall_Workflow.png" width="600"/><figure>Overall Workflow</figure>
</center>