[![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/intel/e2eAIOK/blob/main/demo/builtin/bert/BERT_DEMO.ipynb)

# BERT Demo
This demo shows how to apply the Intel® End-to-End AI Optimization Kit to BERT. The optimizations covered are distributed training, early stopping with the LAMB optimizer, and SDA (Smart Democratization Advisor); together they are expected to improve the end-to-end (E2E) performance of BERT.

# Content
* [Overview](#overview)
    * [Model Architecture](#model-architecture)
    * [Optimizations](#optimizations)
    * [Performance](#performance)
* [Getting Started](#getting-started)
    * [1. Environment Setup](#1-environment-setup)
    * [2. Workflow Prepare](#2-workflow-prepare)
    * [3. Data Prepare](#3-data-prepare)
    * [4. Train](#4-train)

# Overview

## Model Architecture

### Natural Language Processing

<img src="https://github.com/intel/e2eAIOK/blob/main/demo/builtin/bert/img/QA.png?raw=1" width="600"/><figure>Question Answer Task</figure>

* Natural language processing (NLP) sits at the intersection of computer science, linguistics, and machine learning. The pre-trained language model BERT is one of the most representative models across a wide range of NLP tasks, such as question answering.
* The end-to-end NLP system is a BERT-based network that uses the pre-trained model weights for the downstream question-answering task SQuAD (v1.1). Given an input question and a paragraph/context sequence, BERT predicts the start and end indices in the paragraph that mark the answer span.
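
To make the start/end prediction concrete, here is a minimal sketch of how an answer span could be chosen from per-token start and end scores. The function name `best_answer_span`, the `max_answer_len` limit, and the toy scores are assumptions for illustration; they are not taken from the demo's `run_squad.py`.

``` python
from typing import List, Tuple


def best_answer_span(start_logits: List[float], end_logits: List[float],
                     max_answer_len: int = 30) -> Tuple[int, int]:
    """Return the (start, end) token indices with the highest combined score."""
    best_score = float("-inf")
    best_span = (0, 0)
    for start, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last_end = min(len(end_logits), start + max_answer_len)
        for end in range(start, last_end):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best_span = (start, end)
    return best_span


# Toy example: hypothetical scores for a 7-token paragraph.
tokens = ["the", "demo", "uses", "squad", "v1.1", "for", "evaluation"]
start_logits = [0.1, 0.2, 0.1, 2.5, 0.3, 0.0, 0.1]
end_logits = [0.0, 0.1, 0.2, 0.4, 3.0, 0.1, 0.2]

start, end = best_answer_span(start_logits, end_logits)
print(" ".join(tokens[start:end + 1]))
```

Restricting the search to `end >= start` avoids selecting an end position that precedes the start, which independent argmax over the two score vectors could otherwise produce.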

### Model

<img src="https://github.com/intel/e2eAIOK/blob/main/demo/builtin/bert/img/TransformerEncoder.png?raw=1" width="400"/><figure>Transformer Architecture</figure>

[Transformer](https://arxiv.org/pdf/1706.03762.pdf) includes two separate mechanisms — an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary. 

BERT’s model architecture is a multi-layer bidirectional Transformer encoder, in which an attention mechanism learns contextual relations between words (or sub-words) in a text. The input is a sequence of word tokens, which are first embedded into vectors and then processed by each Transformer layer (a multi-head attention layer followed by a feed-forward layer).
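
As a rough illustration of the attention mechanism, the sketch below computes single-head scaled dot-product attention in plain Python. It omits the learned query/key/value projections, multiple heads, and masking that a real BERT layer uses, so treat it as a simplified teaching example rather than the model's implementation.

``` python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    out = []
    for q_i in q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in k]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v_j[c] for w, v_j in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


# Two toy token embeddings attending to each other (self-attention).
x = [[1.0, 0.0], [0.0, 1.0]]
context = attention(x, x, x)
print(context)
```

Each output row mixes information from every token in the sequence, which is what lets the encoder use context on both sides of a word.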

In the question-answer task, the output hidden state of the last Transformer layer in BERT is fed into a task-specific output layer that learns the start and end indices in the paragraph that mark the answer span.


## Optimizations

To democratize the BERT model, we enabled distributed training with Horovod and oneCCL to scale training out across multiple nodes, added early stopping once the target F1 score is reached to avoid over-training, added the LAMB optimizer to support larger batch sizes, and used SDA to tune the hyper-parameters.

### Distributed training
Data parallelism splits a large batch of data into smaller pieces and sends one piece to each node, which reduces the computation cost per node. oneCCL and Horovod provide a straightforward way to distribute model training across nodes.
<img src="https://github.com/intel/e2eAIOK/blob/main/demo/builtin/bert/img/Horovod.png?raw=1" width="600"/><figure>Data Parallel</figure>

* cmd shell

``` shell
HOROVOD_CPU_OPERATIONS=CCL CCL_ATL_TRANSPORT=mpi horovodrun -np {num_mpi_processes} -H {host_addr} --network-interface {network_interface} --binding-args='-map-by socket' python run_squad.py
```
* python script

``` python
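import horovod.tensorflow as hvd

# Data parallel: assumes hvd.init() has already been called.
# Broadcast the initial variables from rank 0 so every worker starts from the same state.
if hvd.size():
    hooks = [hvd.BroadcastGlobalVariablesHook(0)]
...
# Wrap the optimizer so that gradients are averaged across workers.
if hvd.size():
    optimizer = hvd.DistributedOptimizer(optimizer, sparse_as_dense=True)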
```
<img src="https://github.com/intel/e2eAIOK/blob/main/demo/builtin/bert/img/SingleProcess.png?raw=1" width="400"/><figure>w/o Distributed Training</figure>
<img src="https://github.com/intel/e2eAIOK/blob/main/demo/builtin/bert/img/DistributedTraining.png?raw=1" width="400"/><figure>w/ Distributed Training</figure>

As shown in the figures above, distributed training ("w/ Distributed Training") reduces the load on any single node compared with single-process training ("w/o Distributed Training").
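
The idea behind data parallelism can be sketched without any framework: each worker computes a gradient on its own shard of the batch, and the gradients are averaged across workers. The toy model `y_hat = w * x` and the helper names below are assumptions for illustration; Horovod and oneCCL perform the averaging step with a real allreduce across processes.

``` python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


def grad_mse(w: float, batch: List[Sample]) -> float:
    """Gradient of the mean squared error for the toy model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def shard(batch: List[Sample], num_workers: int) -> List[List[Sample]]:
    """Give every worker an interleaved slice of the batch."""
    return [batch[i::num_workers] for i in range(num_workers)]


def allreduce_mean(values: List[float]) -> float:
    """Stand-in for an allreduce: average the per-worker values."""
    return sum(values) / len(values)


batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

local_grads = [grad_mse(w, part) for part in shard(batch, num_workers=2)]
global_grad = allreduce_mean(local_grads)

print(global_grad, grad_mse(w, batch))
```

With equally sized shards, the averaged gradient matches the gradient of the full batch, so each worker does a fraction of the computation while the update stays the same.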

### Add early stop and validation mechanism with Lamb optimizer

1. Early Stop and Validation

Adding an early stop and validation mechanism helps reduce over-training:
* Subsample 10% of the test data as validation data (to reduce validation time)
* Stop early once the target F1 score of 90.874 is reached

``` python
def should_stop_fn(predictions_results):
    # Stop once the step budget is spent or the target F1 score is exceeded.
    global_step = estimator.get_variable_value("global_step")
    global_step_int = int(global_step)
    if global_step_int >= FLAGS.step_threshold or float(predictions_results["f1"]) > FLAGS.f1_threshold:
        return True
    return False
...
...
...
early_stopping_hook = tf.compat.v1.estimator.experimental.make_early_stopping_hook(
        estimator=estimator,
        should_stop_fn=should_stop_fn,
        run_every_secs=None,
        run_every_steps=FLAGS.num_to_evaluate)
```
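
The 10% validation subsample mentioned above could be drawn as in the following sketch. The helper name `subsample`, the fixed seed, and the use of Python's `random` module are assumptions for illustration; the demo's own sampling code is not shown in this document.

``` python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample(examples: Sequence[T], fraction: float = 0.1, seed: int = 0) -> List[T]:
    """Return a reproducible random subset containing `fraction` of the examples."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not examples:
        return []
    # Keep at least one example so validation never runs on an empty set.
    k = max(1, int(len(examples) * fraction))
    rng = random.Random(seed)
    return rng.sample(list(examples), k)


test_examples = list(range(1000))          # placeholder for the test examples
validation_examples = subsample(test_examples)
print(len(validation_examples))
```

A fixed seed keeps the validation subset identical between runs, so F1 scores measured at different steps stay comparable.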

2. Add LAMB optimization

Training with a large batch size uses the LAMB optimizer (ref: [Large Batch Optimization for Deep Learning: Training BERT in 76 minutes](https://arxiv.org/abs/1904.00962)); the algorithm is shown below:

<img src="https://github.com/intel/e2eAIOK/blob/main/demo/builtin/bert/img/Lamb.png?raw=1" width="400"/><figure>Lamb optimization</figure>
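
For readers who prefer code to pseudocode, here is a minimal plain-Python sketch of one LAMB step for a single layer, following the structure of the referenced paper: an Adam-style update with weight decay, rescaled by a layer-wise trust ratio. The function name `lamb_step`, the default hyper-parameter values, and the choice of the identity for the paper's scaling function are assumptions for illustration, not the demo's optimizer code.

``` python
import math
from typing import List, Tuple


def lamb_step(w: List[float], g: List[float], m: List[float], v: List[float],
              step: int, lr: float = 1e-3, beta1: float = 0.9, beta2: float = 0.999,
              eps: float = 1e-6, weight_decay: float = 0.01
              ) -> Tuple[List[float], List[float], List[float]]:
    """Apply one LAMB update to the weights `w` of a single layer."""
    # Adam-style first and second moment estimates.
    m = [beta1 * m_i + (1 - beta1) * g_i for m_i, g_i in zip(m, g)]
    v = [beta2 * v_i + (1 - beta2) * g_i * g_i for v_i, g_i in zip(v, g)]
    # Bias correction (step counts from 1).
    m_hat = [m_i / (1 - beta1 ** step) for m_i in m]
    v_hat = [v_i / (1 - beta2 ** step) for v_i in v]
    # Adam direction plus decoupled weight decay.
    update = [mh / (math.sqrt(vh) + eps) + weight_decay * w_i
              for mh, vh, w_i in zip(m_hat, v_hat, w)]
    # Layer-wise trust ratio: ||w|| / ||update||.
    w_norm = math.sqrt(sum(x * x for x in w))
    u_norm = math.sqrt(sum(x * x for x in update))
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    new_w = [w_i - lr * trust * u_i for w_i, u_i in zip(w, update)]
    return new_w, m, v


w0 = [3.0, 4.0]
w1, m1, v1 = lamb_step(w0, g=[0.1, -0.2], m=[0.0, 0.0], v=[0.0, 0.0], step=1)
print(w1)
```

Because of the trust ratio, the length of each layer's step is `lr` times the norm of that layer's weights, regardless of the gradient scale. This per-layer normalization is what allows the larger batch sizes mentioned above.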

As shown in the figures below, "w/ Early Stop" lets BERT stop training at step 600, compared with step 7299 for "w/o Early Stop".

<img src="https://github.com/intel/e2eAIOK/blob/main/demo/builtin/bert/img/WOEarlystop.png?raw=1" width="1000"/><figure>w/o Early Stop</figure>
<img src="https://github.com/intel/e2eAIOK/blob/main/demo/builtin/bert/img/WEarlystop.png?raw=1" width="1000"/><figure>w/ Early Stop</figure>


### HPO with SDA (Smart Democratization Advisor)

SDA can assist BERT with hyper-parameter optimization (HPO), for example "learning_rate" ranging from 3e-5 to 3e-4, "warmup_rate" ranging from 0.1 to 0.3, and "batch_size" selected from \[12, 16, 24, 96, 128\]. This helps select hyper-parameters that improve BERT performance.

SDA config

```
Parameters for SDA auto optimization:
  - learning_rate: 3.0e-5~3.0e-4 # learning rate for optimizer
  - warmup_rate: 0.1~0.3 # warmup rate for learning
  - batch_size: [12, 16, 24, 96, 128] # batch size for training
metrics:
- name: training_time
  objective: minimize
  threshold: 10000
- name: F1
  objective: maximize
  threshold: 90.87
```

Request suggestions from SDA:

```python
suggestion = self.conn.experiments(self.experiment.id).suggestions().create()
```
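
To show what a single suggestion over this search space might look like, the sketch below draws one configuration at random. It is a plain random-search stand-in, not the SDA or SigOpt API: the names `SEARCH_SPACE` and `sample_suggestion`, and the log-uniform sampling of the learning rate, are assumptions for illustration.

``` python
import math
import random
from typing import Dict, Union

# Search space copied from the SDA config above.
SEARCH_SPACE = {
    "learning_rate": (3.0e-5, 3.0e-4),
    "warmup_rate": (0.1, 0.3),
    "batch_size": [12, 16, 24, 96, 128],
}


def sample_suggestion(rng: random.Random) -> Dict[str, Union[int, float]]:
    """Draw one hyper-parameter configuration from SEARCH_SPACE."""
    lr_lo, lr_hi = SEARCH_SPACE["learning_rate"]
    # Sample the learning rate on a log scale, then clamp against rounding error.
    lr = math.exp(rng.uniform(math.log(lr_lo), math.log(lr_hi)))
    lr = min(max(lr, lr_lo), lr_hi)
    wu_lo, wu_hi = SEARCH_SPACE["warmup_rate"]
    return {
        "learning_rate": lr,
        "warmup_rate": rng.uniform(wu_lo, wu_hi),
        "batch_size": rng.choice(SEARCH_SPACE["batch_size"]),
    }


rng = random.Random(0)
print(sample_suggestion(rng))
```

An advisor such as SDA replaces the random draw with suggestions informed by the metrics reported from earlier trials (here, training time and F1).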


<img src="https://github.com/intel/e2eAIOK/blob/main/demo/builtin/bert/img/WOSDA.png?raw=1" width="800"/><figure>w/o SDA</figure>
<img src="https://github.com/intel/e2eAIOK/blob/main/demo/builtin/bert/img/WSDA.png?raw=1" width="800"/><figure>w/ SDA</figure>

As shown in the above figure, SDA helps BERT handle more samples per second (3.1 examples/sec of "w/ SDA" vs 2.85 examples/sec of "w/o SDA").

## Performance
<center>
<img src="./img/Performance.png" width="800"/><figure>SDA BERT Performance</figure>
</center>

* Distributed training with HW scaling delivered 3.42x speedup from 1 node to 4 nodes
    * Within 1% F1 score gap (90.48 vs. 90.874)
* Early stop and sampled validation dataset delivered 1.37x speedup, and 4.70x speedup over baseline
    * Within 1% F1 score gap (90.85 vs. 90.874)
* HPO with SDA (Smart Democratization Advisor, a component of AIOK) delivered 2.15x speedup, and 10.10x speedup over baseline
    * Within 1% F1 score gap (90.14 vs. 90.874)
* Both the baseline and the optimized model converged and stopped at 2 epochs
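
The figures in this list can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are chosen for illustration, and the assumption that the stage speedups multiply to give the "over baseline" figures is an interpretation of the list, not something the source states explicitly.

``` python
nodes = 4
scaling_speedup = 3.42      # 1 node -> 4 nodes
early_stop_speedup = 1.37   # on top of distributed training
sda_speedup = 2.15          # on top of early stop
reference_f1 = 90.874

# Scaling efficiency: fraction of the ideal 4x speedup that was achieved.
scaling_efficiency = scaling_speedup / nodes

# Cumulative speedups, assuming the stage speedups multiply.
cumulative_early_stop = scaling_speedup * early_stop_speedup
cumulative_sda = 4.70 * sda_speedup  # starts from the reported 4.70x


def f1_gap_percent(f1: float) -> float:
    """Relative F1 gap to the reference score, in percent."""
    return (reference_f1 - f1) / reference_f1 * 100.0


print(f"scaling efficiency: {scaling_efficiency:.1%}")
print(f"cumulative after early stop: {cumulative_early_stop:.2f}x")
print(f"cumulative after SDA: {cumulative_sda:.2f}x")
for f1 in (90.48, 90.85, 90.14):
    print(f"F1 {f1}: gap {f1_gap_percent(f1):.2f}%")
```

Under that assumption, the products agree with the reported 4.70x and 10.10x to within rounding of the published stage speedups, and all three F1 gaps come out below 1%.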


# Getting Started

## 1. Environment Setup

### Option 1 Setup Environment with Pip

``` bash
pip install e2eAIOK-sda --pre
pip install intel-tensorflow==2.10
```

### Option 2 Setup Environment with Docker

Step1. prepare code
``` bash
git clone https://github.com/intel/e2eAIOK.git
cd e2eAIOK
git submodule update --init --recursive
```

Step2. build docker image
```bash
python3 scripts/start_e2eaiok_docker.py -b tensorflow -w ${host0} ${host1} ${host2} ${host3} --proxy ""
```

Step3. run docker and start conda env
```bash
sshpass -p docker ssh ${host0} -p 12344
```

## 2. Workflow Prepare

* Prepare model codes
``` bash
cd /home/vmagent/app/e2eaiok/modelzoo/bert
sh patch_bert.sh
```

* Download pre-trained model

Download and extract one of the BERT-Large, Uncased pre-trained models from the [Google BERT repository](https://github.com/google-research/bert#pre-trained-models) to /home/vmagent/app/dataset/SQuAD/pre-trained-model/bert-large-uncased/
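
After extraction, a quick sanity check can confirm that the checkpoint files are in place. The sketch below assumes the archive contains `bert_config.json`, `vocab.txt`, and `bert_model.ckpt.index`, as the Google BERT releases do; the helper name `missing_checkpoint_files` is hypothetical, and the search is recursive because the archive may unpack into a subdirectory.

``` python
from pathlib import Path
from typing import List

# Files expected in an extracted Google BERT release (assumed subset).
EXPECTED_FILES = ("bert_config.json", "vocab.txt", "bert_model.ckpt.index")


def missing_checkpoint_files(model_dir: str) -> List[str]:
    """Return the expected file names that cannot be found under model_dir."""
    root = Path(model_dir)
    if not root.is_dir():
        return list(EXPECTED_FILES)
    # Search recursively in case the archive unpacked into a subdirectory.
    return [name for name in EXPECTED_FILES if not any(root.rglob(name))]
```

Calling `missing_checkpoint_files("/home/vmagent/app/dataset/SQuAD/pre-trained-model/bert-large-uncased/")` should return an empty list once the model has been extracted correctly.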

## 3. Data Prepare

* Download Dataset

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles.

* Download the files below to /home/vmagent/app/dataset/SQuAD

    * Train Data: [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
    * Test Data: [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
    * Data Format:

``` json
{
    "answers": {
        "answer_start": [1],
        "text": ["This is a test text"]
    },
    "context": "This is a test context.",
    "id": "1",
    "question": "Is this a test?",
    "title": "train test"
}
```
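
The downloaded `train-v1.1.json` and `dev-v1.1.json` files store the data in a nested layout (articles, then paragraphs, then question-answer pairs). The sketch below shows one way to flatten that layout into records shaped like the example above. The function name `flatten_squad` and the in-memory sample are assumptions for illustration; the demo's own preprocessing may differ.

``` python
from typing import Any, Dict, Iterator


def flatten_squad(raw: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per question from a SQuAD v1.1 style dictionary."""
    for article in raw["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield {
                    "answers": {
                        "answer_start": [a["answer_start"] for a in qa["answers"]],
                        "text": [a["text"] for a in qa["answers"]],
                    },
                    "context": paragraph["context"],
                    "id": qa["id"],
                    "question": qa["question"],
                    "title": article["title"],
                }


# Tiny in-memory sample in the nested SQuAD v1.1 layout.
raw_sample = {
    "version": "1.1",
    "data": [{
        "title": "train test",
        "paragraphs": [{
            "context": "This is a test context.",
            "qas": [{
                "id": "1",
                "question": "What is this?",
                "answers": [{"text": "a test context", "answer_start": 8}],
            }],
        }],
    }],
}

records = list(flatten_squad(raw_sample))
print(records[0])
```

To process a real file, load it with `json.load` and pass the resulting dictionary to `flatten_squad`. Each `answer_start` is a character offset into `context`.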

## 4. Train

``` python
from e2eAIOK.SDA.SDA import SDA

settings = dict()
settings["data_path"] = "/home/vmagent/app/dataset/SQuAD/"
settings["train_dataset_path"] = "/home/vmagent/app/dataset/SQuAD/"
settings["mpi_num_processes"] = 1
settings["enable_sigopt"] = False
settings["python_path"] = "/opt/intel/oneapi/intelpython/latest/envs/tensorflow/bin"
settings["num_epochs"] = 2
settings["metric"] = "f1"
settings["metric_objective"] = "maximize"
settings["metric_threshold"] = 90.874
settings["step_threshold"] = 100000

sda = SDA(model="BERT", settings=settings) # default settings
sda.launch()

hydro_model = sda.snapshot()
hydro_model.explain()
```

data format is binary

***    Best Trained Model    ***
  Model Type: bert
  Model Saved Path: 
  Sigopt Experiment id is None
  === Result Metrics ===
2022-10-31 18:57:44,290 - E2EAIOK - INFO - Above info is history record of this model
2022-10-31 18:57:44,290 - E2EAIOK.SDA - INFO - ### Ready to submit current task  ###
2022-10-31 18:57:44,290 - E2EAIOK.SDA - INFO - Model Advisor created
2022-10-31 18:57:44,290 - E2EAIOK.SDA - INFO - model parameter initialized
2022-10-31 18:57:44,290 - E2EAIOK.SDA - INFO - start to launch training
training launch command: PYTHONPATH=$PYTHONPATH:/home/vmagent/app/e2eaiok/modelzoo/bert/benchmarks/ /opt/intel/oneapi/intelpython/latest/envs/tensorflow/bin//python /home/vmagent/app/e2eaiok/modelzoo/bert/benchmarks/launch_benchmark.py --model-name=bert_large --precision=fp32 --mode=training --framework=tensorflow --batch-size=24 --output-dir /home/vmagent/app/e2eaiok/result/1526042db72aeb23ba4ab6ddd4d4074d --host_file=$MODEL_DIR/hosts --num-intra-threads 3

INFO:tensorflow:Calling model_fn.
I1031 18:57:47.453577 140537919225600 estimator.py:1162] Calling model_fn.
INFO:tensorflow:Running train on CPU/GPU
I1031 18:57:47.453701 140537919225600 tpu_estimator.py:3198] Running train on CPU/GPU
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
W1031 18:57:47.459250 140537919225600 deprecation.py:534] From /opt/intel/oneapi/intelpython/latest/envs/tensorflow/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py:5049: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
INFO:tensorflow:Done calling model_fn.
I1031 18:58:04.846139 140537919225600 estimator.py:1164] Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
I1031 18:

INFO:tensorflow:global_step/sec: 0.0750579
I1031 19:01:43.320894 140537919225600 tpu_estimator.py:2402] global_step/sec: 0.0750579
INFO:tensorflow:examples/sec: 1.80139
I1031 19:01:43.322280 140537919225600 tpu_estimator.py:2403] examples/sec: 1.80139
W1031 19:01:43.322602 140537919225600 basic_session_run_hooks.py:734] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 7 vs previous value: 7. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
INFO:tensorflow:global_step/sec: 0.0735472
I1031 19:01:56.916933 140537919225600 tpu_estimator.py:2402] global_step/sec: 0.0735472
INFO:tensorflow:examples/sec: 1.76513
I1031 19:01:56.917140 140537919225600 tpu_estimator.py:2403] examples/sec: 1.76513
INFO:tensorflow:	 Loss  = 5.868999, 	 Step  = 10 (40.672 sec)
I1031 19:02:10.668398 140537919225600 basic_session_run_hooks.py:260] 	 Loss  = 5.868999, 	 Step  = 10 (4

INFO:tensorflow:global_step/sec: 0.0738676
I1031 19:05:58.614260 140537919225600 tpu_estimator.py:2402] global_step/sec: 0.0738676
INFO:tensorflow:examples/sec: 1.77282
I1031 19:05:58.614543 140537919225600 tpu_estimator.py:2403] examples/sec: 1.77282
INFO:tensorflow:	 Loss  = 5.538436, 	 Step  = 28 (40.215 sec)
I1031 19:06:12.021176 140537919225600 basic_session_run_hooks.py:260] 	 Loss  = 5.538436, 	 Step  = 28 (40.215 sec)
INFO:tensorflow:global_step/sec: 0.0745841
I1031 19:06:12.021905 140537919225600 tpu_estimator.py:2402] global_step/sec: 0.0745841
INFO:tensorflow:examples/sec: 1.79002
I1031 19:06:12.022136 140537919225600 tpu_estimator.py:2403] examples/sec: 1.79002
INFO:tensorflow:global_step/sec: 0.074834
I1031 19:06:25.385033 140537919225600 tpu_estimator.py:2402] global_step/sec: 0.074834
INFO:tensorflow:examples/sec: 1.79602
I1031 19:06:25.385593 140537919225600 tpu_estimator.py:2403] examples/sec: 1.79602
INFO:tensorflow:global_step/sec: 0.0744085
I1031 19:06:38.824194 140

INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 50...
I1031 19:11:12.664677 140537919225600 basic_session_run_hooks.py:626] Calling checkpoint listeners after saving checkpoint 50...
INFO:tensorflow:global_step/sec: 0.0507644
I1031 19:11:12.677465 140537919225600 tpu_estimator.py:2402] global_step/sec: 0.0507644
INFO:tensorflow:examples/sec: 1.21835
I1031 19:11:12.677664 140537919225600 tpu_estimator.py:2403] examples/sec: 1.21835
INFO:tensorflow:global_step/sec: 0.0738109
I1031 19:11:26.214749 140537919225600 tpu_estimator.py:2402] global_step/sec: 0.0738109
INFO:tensorflow:examples/sec: 1.77146
I1031 19:11:26.214953 140537919225600 tpu_estimator.py:2403] examples/sec: 1.77146
INFO:tensorflow:	 Loss  = 5.108763, 	 Step  = 52 (46.314 sec)
I1031 19:11:39.280578 140537919225600 basic_session_run_hooks.py:260] 	 Loss  = 5.108763, 	 Step  = 52 (46.314 sec)
INFO:tensorflow:global_step/sec: 0.0765258
I1031 19:11:39.282430 140537919225600 tpu_estimator.py:2402] global_ste

INFO:tensorflow:global_step/sec: 0.0746215
I1031 19:17:13.511189 140537919225600 tpu_estimator.py:2402] global_step/sec: 0.0746215
INFO:tensorflow:examples/sec: 1.79092
I1031 19:17:13.511489 140537919225600 tpu_estimator.py:2403] examples/sec: 1.79092
INFO:tensorflow:global_step/sec: 0.0745354
I1031 19:17:26.927915 140537919225600 tpu_estimator.py:2402] global_step/sec: 0.0745354
INFO:tensorflow:examples/sec: 1.78885
I1031 19:17:26.928602 140537919225600 tpu_estimator.py:2403] examples/sec: 1.78885
INFO:tensorflow:	 Loss  = 4.8659067, 	 Step  = 79 (40.402 sec)
I1031 19:17:40.511939 140537919225600 basic_session_run_hooks.py:260] 	 Loss  = 4.8659067, 	 Step  = 79 (40.402 sec)
INFO:tensorflow:global_step/sec: 0.0736104
I1031 19:17:40.512654 140537919225600 tpu_estimator.py:2402] global_step/sec: 0.0736104
INFO:tensorflow:examples/sec: 1.76665
I1031 19:17:40.512849 140537919225600 tpu_estimator.py:2403] examples/sec: 1.76665
INFO:tensorflow:global_step/sec: 0.0751762
I1031 19:17:53.814895

2022-10-31 19:25:57,309 - sigopt - INFO - Training completed based in sigopt suggestion, took 1687.907425403595 secs
2022-10-31 19:25:57,310 - E2EAIOK.SDA - INFO - training script completed

We found the best model! Here is the model explaination

***    Best Trained Model    ***
  Model Type: bert
  Model Saved Path: /home/vmagent/app/e2eaiok/result/1526042db72aeb23ba4ab6ddd4d4074d
  Sigopt Experiment id is None
  === Result Metrics ===
    f1: 9.091953263048236
    training_time: 1687.907425403595
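
The throughput lines in the log above can be related to each other with simple arithmetic: with the batch size of 24 shown in the launch command, `examples/sec` should equal `global_step/sec` times the batch size. The sketch below checks this against three value pairs copied from the log; the variable names are chosen for illustration.

``` python
BATCH_SIZE = 24  # from --batch-size=24 in the launch command

# (global_step/sec, examples/sec) pairs copied from the log output.
logged = [
    (0.0750579, 1.80139),
    (0.0735472, 1.76513),
    (0.0507644, 1.21835),
]

for steps_per_sec, examples_per_sec in logged:
    derived = steps_per_sec * BATCH_SIZE
    seconds_per_step = 1.0 / steps_per_sec
    print(f"{steps_per_sec} step/s -> {derived:.5f} examples/s "
          f"(logged {examples_per_sec}), {seconds_per_step:.1f} s per step")
```

The derived values agree with the logged `examples/sec` figures to the printed precision.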
