# RNN-T Demo

Automatic speech recognition (ASR) systems convert audio into text. RNN-T is an end-to-end, RNN-based ASR model that directly outputs word transcripts from the input audio. This notebook is a step-by-step guide to optimizing the RNN-T model with the Intel® End-to-End AI Optimization Kit, followed by a detailed performance analysis.

# Content
* [Model Architecture](#Model-Architecture)
* [Optimizations](#Optimizations)
* [DEMO](#DEMO)

## ASR
<img src="./img/asr.png" width="800"/>

* The traditional ASR system (top picture) contains acoustic, phonetic, and language components that work together as a pipeline.
* The end-to-end ASR system is a single neural network that takes the raw audio signal as input and outputs a sequence of words.

## Model Architecture
<img src="./img/rnnt_structure.png"/>

RNN-T is an end-to-end ASR model that directly converts audio into text representation.

The encoder network is an RNN that maps input acoustic frames into a higher-level representation.
The prediction network is an RNN that is explicitly conditioned on the history of previous non-blank targets predicted by the model.
The joint network is a feed-forward network that combines the outputs of the prediction network and the encoder to produce logits, which a softmax layer then turns into a distribution over the next output symbol.
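
As a rough illustration of how the joint network combines the two streams, the sketch below takes one encoder vector and one prediction vector, applies a single linear layer, and normalizes the logits with softmax. The weights, the sizes, and the choice of an element-wise sum followed by ReLU are assumptions made for illustration; they are not the model's actual configuration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint(enc_t, pred_u, weight, bias):
    """Combine one encoder frame and one prediction state into a
    distribution over output symbols (hypothetical toy joint network)."""
    # Combine the two representations (assumed: element-wise sum + ReLU).
    hidden = [max(0.0, e + p) for e, p in zip(enc_t, pred_u)]
    # Feed-forward projection to one logit per output symbol.
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weight, bias)]
    return softmax(logits)


# Toy example: hidden size 2, vocabulary of 3 symbols (index 0 = blank).
enc_t = [0.5, -1.0]
pred_u = [0.5, 0.2]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
probs = joint(enc_t, pred_u, weight, bias)
print(probs)
```

In the real model this computation is repeated for every pair of encoder time step and prediction step, which is why shortening the encoder output sequence reduces the cost of the joint network as well.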

## Optimizations

### Model architecture Intro

To democratize the RNN-T model, we enabled distributed training with PyTorch DDP to scale out training across multiple nodes, added a time stack layer and increased the time stack factor to reduce the input sequence length, added layer and batch normalization to speed up training convergence, and decreased the layer size to get a lighter model.

<img src="./img/model_base.png" width="600"/><figure>base model</figure>
<img src="./img/model_opt.png" width="600"/><figure>democratized model</figure>


### Distributed training

``` python
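from torch.nn.parallel import DistributedDataParallel as DDP

# Data parallel: wrap the model only when training with more than one process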
if world_size > 1:
    model = DDP(model, find_unused_parameters=True)
```

### Add time stack layer

For ASR systems, the number of time frames in an audio input sequence is significantly higher than the number of output text labels. An LSTM is a sequential model, so processing long sequences such as audio is slow. The StackTime layer stacks adjacent audio frames to shorten the sequence and form a higher-dimensional input, which speeds up training.

```python
import torch
import torch.nn as nn


class StackTime(nn.Module):
    def __init__(self, factor):
        super().__init__()
        self.factor = int(factor)

    def stack(self, x):
        x = x.transpose(0, 1)
        T = x.size(1)
        padded = torch.nn.functional.pad(x, (0, 0, 0, (self.factor - (T % self.factor)) % self.factor))
        B, T, H = padded.size()
        x = padded.reshape(B, T // self.factor, -1)
        x = x.transpose(0, 1)
        return x

    def forward(self, x, x_lens):
        if type(x) is not list:
            x = self.stack(x)
            x_lens = (x_lens.int() + self.factor - 1) // self.factor
            return x, x_lens
        else:
            if len(x) != 2:
                raise NotImplementedError("Only number of seq segments equal to 2 is supported")
            assert x[0].size(1) % self.factor == 0, "The length of the 1st seq segment should be multiple of stack factor"
            y0 = self.stack(x[0])
            y1 = self.stack(x[1])
            x_lens = (x_lens.int() + self.factor - 1) // self.factor
            return [y0, y1], x_lens
```
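
To make the shape arithmetic concrete, here is a small pure-Python sketch of the same idea for a single utterance: pad the frame list to a multiple of the factor, then concatenate each group of `factor` consecutive frames into one wider frame. The helper names `stacked_length` and `stack_frames` are hypothetical and are not part of the model code.

```python
def stacked_length(num_frames, factor):
    """Sequence length after stacking (ceiling division, as for x_lens above)."""
    return (num_frames + factor - 1) // factor


def stack_frames(frames, factor):
    """Stack `factor` consecutive frames into one wider frame.

    `frames` is a list of T feature vectors of size H; the result has
    ceil(T / factor) vectors of size H * factor, zero-padded at the end.
    """
    feat_dim = len(frames[0])
    pad = (factor - len(frames) % factor) % factor
    padded = frames + [[0.0] * feat_dim for _ in range(pad)]
    return [
        [value for frame in padded[i:i + factor] for value in frame]
        for i in range(0, len(padded), factor)
    ]


# 5 frames with 2 features each, stacked with factor 2.
frames = [[float(t), float(t) + 0.5] for t in range(5)]
stacked = stack_frames(frames, factor=2)
print(len(stacked), len(stacked[0]))
print(stacked_length(1000, 2), stacked_length(1000, 8))
```

With 1000 input frames, a factor of 2 leaves 500 time steps for the LSTM, while a factor of 8 leaves 125, so the recurrent layers after the stack process a sequence four times shorter.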

Increasing the time stack factor from 2 to 8 gives about a 4x speedup.

<img src="./img/time_stack_2.PNG" width="600"/><figure>time_stack = 2</figure>
<img src="./img/time_stack_8.PNG" width="600"/><figure>time_stack = 8</figure>

Profiling data shows that the forward and backward passes take less time, because the time stack layer shortens the input sequence.

<img src="./img/stack_profile_base.png" width="600"/><figure>base model profiling</figure>
<img src="./img/stack_profile_democratize.png" width="600"/><figure>democratized model profiling</figure>


### Add layer normalization and batch normalization

Layer normalization for the LSTM is important to the success of RNN-T modeling. Adding layer normalization to the LSTM and batch normalization to the input features helps training converge faster: it takes 52 epochs to converge without normalization, but only 49 epochs with normalization.

```python
enc_mod["batch_norm"] = nn.BatchNorm1d(pre_rnn_input_size)
```

```python
self.layer_norm = torch.nn.LayerNorm(hidden_size)
```
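
For intuition, layer normalization rescales each feature vector to zero mean and unit variance across its features, before a learned scale and shift are applied. The sketch below is a minimal pure-Python version with the scale and shift left at their initial values (1 and 0). The function name is hypothetical, and `eps=1e-5` mirrors the default of `torch.nn.LayerNorm`.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    # Biased variance (divide by N), as layer normalization does.
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]


hidden = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(hidden)
print([round(v, 3) for v in normed])
```

Because the statistics are computed per vector rather than per batch, the result does not depend on the batch size, which is convenient for recurrent layers that see variable-length sequences.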

<img src="./img/no_norm.PNG" width="600"/><figure>without normalization</figure>
<img src="./img/norm.PNG" width="600"/><figure>with normalization</figure>


### HPO with SDA (Smart Democratization Advisor)

SDA configuration:

```
Parameters for SDA auto optimization:
- learning_rate: 1.0e-3~1.0e-2 #training learning rate
- warmup_epochs: 1~10 #epoch to warmup learning rate
metrics:
- name: training_time # training time threshold
  objective: minimize
  threshold: 43200
- name: WER # training metric threshold
  objective: minimize
  threshold: 0.25
```
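
One way to read this configuration: SDA searches `learning_rate` and `warmup_epochs` within the given ranges, and a trial is only acceptable if both metrics stay within their thresholds. The sketch below encodes that reading in plain Python. The dictionary layout and the helpers `in_search_space` and `meets_thresholds` are hypothetical and are not the SDA API, the training-time threshold is assumed to be in seconds, and the metric values in the example calls are made up.

```python
# Search space and metric thresholds copied from the config above.
SEARCH_SPACE = {
    "learning_rate": (1.0e-3, 1.0e-2),
    "warmup_epochs": (1, 10),
}
THRESHOLDS = {"training_time": 43200, "WER": 0.25}


def in_search_space(params):
    """True if every tuned parameter lies inside its configured range."""
    return all(lo <= params[name] <= hi
               for name, (lo, hi) in SEARCH_SPACE.items())


def meets_thresholds(metrics):
    """True if every metric (all minimized) is at or below its threshold."""
    return all(metrics[name] <= limit for name, limit in THRESHOLDS.items())


# Parameter values as they appear in the training launch command of the demo run.
trial = {"learning_rate": 0.007, "warmup_epochs": 6}
print(in_search_space(trial))
# Made-up metric values, for illustration only.
print(meets_thresholds({"training_time": 40000, "WER": 0.24}))
print(meets_thresholds({"training_time": 40000, "WER": 0.30}))
```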

Request suggestions from SDA:

```python
suggestion = self.conn.experiments(self.experiment.id).suggestions().create()
```


### Framework-related optimization

Use IPEX for distributed training and enable socket binding when training on a two-socket system.

```bash
# Use IPEX launch to launch training, enable NUMA binding in two socket system.
${CONDA_PREFIX}/bin/python -m intel_extension_for_pytorch.cpu.launch --distributed --nproc_per_node=2 --nnodes=4 --hostfile hosts train.py ${ARGS}
```

<img src="./img/no_numa_binding.png" width="600"/><figure>without numa binding</figure>
<img src="./img/numa_binding.png" width="600"/><figure>enable numa binding</figure>


# DEMO
* [Environment Setup](#Environment-setup)
* [Launch training](#Launch-training)

## Environment setup
``` bash
# Setup ENV
git clone https://github.com/intel/e2eAIOK.git
cd e2eAIOK
git submodule update --init --recursive
python3 scripts/start_e2eaiok_docker.py -b pytorch -w ${host0} ${host1} ${host2} ${host3} --proxy ""
```

Note: RNN-T is trained on LibriSpeech train-clean-100 and evaluated on dev-clean. We evaluated the stock model (based on the MLPerf submission) on the train-clean-100 dataset and obtained a final WER of 0.25; all of the following optimizations preserve a WER of 0.25. For reference, the MLPerf submission took 38.7 min with 8x A100 on the LibriSpeech train-960h dataset.

Public references on train-clean-100: https://arxiv.org/pdf/1807.10893.pdf, https://arxiv.org/pdf/1811.00787.pdf
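
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference transcript and the hypothesis, divided by the number of reference words. The sketch below is a generic textbook implementation that assumes a non-empty reference; it is not the evaluation code used by the training script.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
print(word_error_rate("the cat sat on the mat", "the cat sat on the mat"))
```

Because insertions count as errors, WER can exceed 1 when the hypothesis contains many more words than the reference, which is common in the first epochs of training.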

## Enter Docker

```
sshpass -p docker ssh ${host0} -p 12345
```

## Workflow preparation

``` bash
# Prepare model code
cd /home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch
bash patch_rnnt.sh

# Download Dataset
# Download and unzip dataset from https://www.openslr.org/12 to /home/vmagent/app/dataset/LibriSpeech

# Generate tokenizer and tokenize text
cd /home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch
bash scripts/preprocess_librispeech.sh
```

## Launch training

Edit `conf/e2eaiok_defaults_rnnt_example.conf`:

```
### GLOBAL SETTINGS ###
observation_budget: 1
save_path: /home/vmagent/app/e2eaiok/result/
ppn: 2
train_batch_size: 8
eval_batch_size: 8
iface: lo
hosts:
- localhost
epochs: 2
```

Then launch the workflow from a notebook cell:
!cd /home/vmagent/app/e2eaiok && python run_e2eaiok.py --data_path /home/vmagent/app/dataset/LibriSpeech --model_name rnnt --conf conf/e2eaiok_defaults_rnnt_example.conf 

2022-10-31 23:21:36,263 - E2EAIOK.SDA - INFO - ### Ready to submit current task  ###
{'dataset_dir': '/home/vmagent/app/dataset/LibriSpeech', 'train_manifests': ['/home/vmagent/app/dataset/LibriSpeech/metadata/train-test.json'], 'val_manifests': ['/home/vmagent/app/dataset/LibriSpeech/metadata/dev-test.json']}
2022-10-31 23:21:36,264 - E2EAIOK.SDA - INFO - Model Advisor created
2022-10-31 23:21:36,264 - E2EAIOK.SDA - INFO - model parameter initialized
2022-10-31 23:21:36,264 - E2EAIOK.SDA - INFO - start to launch training
2022-10-31 23:21:36,264 - sigopt - INFO - training launch command: /opt/intel/oneapi/intelpython/latest/envs/pytorch-1.10.0/bin/python -m intel_extension_for_pytorch.cpu.launch --distributed --nproc_per_node=2 --nnodes=1 --hostfile hosts /home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py --output_dir /home/vmagent/app/e2eaiok/result/3a43b6cbb0b39444d905130fa1a0b679 --dist --dist_backend ccl --batch_size 8 --val_batch_size 8 --lr 0.007 --warmup_epochs 6 --beta1 0

[0] :::MLLOG {"namespace": "", "time_ms": 1667258501551, "event_type": "POINT_IN_TIME", "key": "gradient_accumulation_steps", "value": 1, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 378}}
[0] :::MLLOG {"namespace": "", "time_ms": 1667258501551, "event_type": "POINT_IN_TIME", "key": "submission_benchmark", "value": "rnnt", "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 384}}
[0] :::MLLOG {"namespace": "", "time_ms": 1667258501551, "event_type": "POINT_IN_TIME", "key": "submission_org", "value": "Intel", "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 385}}
[0] :::MLLOG {"namespace": "", "time_ms": 1667258501551, "event_type": "POINT_IN_TIME", "key": "submission_division", "value": "closed", "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 386}}
[0] :::MLLOG {"namespace": "", "time_ms": 1667258501551, "event_type":

[0] Dataset read by DALI. Number of samples: 73
[0] Initializing DALI with parameters:
[0] 	           __class__ : <class 'common.data.dali.pipeline.DaliPipeline'>
[0] 	          batch_size : 8
[0] 	           device_id : None
[0] 	        dither_coeff : 1e-05
[0] 	       dont_use_mmap : False
[0] 	           file_root : /home/vmagent/app/dataset/LibriSpeech/valid
[0] 	    in_mem_file_list : False
[0] 	        max_duration : inf
[0] 	           nfeatures : 80
[0] 	                nfft : 512
[0] 	         num_threads : 4
[0] 	       pipeline_type : val
[0] 	            pre_sort : False
[0] 	       preemph_coeff : 0.97
[0] 	preprocessing_device : cpu
[0] 	      resample_range : None
[0] 	         sample_rate : 16000
[0] 	             sampler : <common.data.dali.sampler.SimpleSampler object at 0x7f8da80aa0a0>
[0] 	                seed : 2021
[0] 	                self : <common.data.dali.pipeline.DaliPipeline object at 0x7f8da01b6910>
[0] 	   silence_threshold : -60
[0] 	   synthetic_seq_l

[0]   pivot_len = (audio_shape_sorted[self.split_batch_size] + stack_factor-1) // stack_factor * stack_factor
[0]   batch_offset = torch.cumsum(g_len * ((feat_lens+self.enc_stack_time_factor-1)//self.enc_stack_time_factor), dim=0)
[0]   x_lens = (x_lens.int() + self.factor - 1) // self.factor
[1]   x_lens = (x_lens.int() + self.factor - 1) // self.factor
[0] DLL 2022-10-31 23:21:47.870652 - epoch    1 | iter    1/6 | loss  967.40 | utts/s     3 | took  5.13 s | lrate 3.78e-04
[0] DLL 2022-10-31 23:21:51.735583 - epoch    1 | iter    2/6 | loss  906.09 | utts/s     4 | took  3.87 s | lrate 5.68e-04
[0] DLL 2022-10-31 23:21:55.114815 - epoch    1 | iter    3/6 | loss  788.63 | utts/s     5 | took  3.38 s | lrate 7.57e-04
[0] DLL 2022-10-31 23:21:57.761047 - epoch    1 | iter    4/6 | loss  516.45 | utts/s     6 | took  2.65 s | lrate 9.46e-04
[0] DLL 2022-10-31 23:22:02.049904 - epoch    1 | iter    5/6 | loss  955.55 | utts/s     4 | took  4.29 s | lrate 1.14e-03
[0] DLL 2022-10-31 23:2

[0] -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
[0]                                                    Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
[0] -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
[0]                                                aten::mm        20.34%        7.033s        20.41%        7.059s     171.976us         41047  
[0]                                               aten::add        19.96%        6.902s        19.96%        6.903s     287.130us         24043  
[0]                                           ProfilerStep*        12.83%        4.435s       100.00%       34.578s       17.289s             2  
[0]                                             aten::addmm        10.13%        3.502s        11.62%        4.020s      95.

[0] :::MLLOG {"namespace": "", "time_ms": 1667258808913, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 20.484347826086957, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 259, "epoch_num": 2}}
[0] :::MLLOG {"namespace": "", "time_ms": 1667258808914, "event_type": "INTERVAL_END", "key": "eval_stop", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 260, "epoch_num": 2}}
[0] DLL 2022-10-31 23:26:48.914943 - epoch    2 |   dev ema wer 2048.43 | took 19.00 s
[0] Saving /home/vmagent/app/e2eaiok/result/3a43b6cbb0b39444d905130fa1a0b679/RNN-T_epoch2_checkpoint.pt...
2022-10-31 23:27:05,976 - sigopt - INFO - Training completed based in sigopt suggestion, took 329.71129155158997 secs
2022-10-31 23:27:05,976 - E2EAIOK.SDA - INFO - training script completed

We found the best model! Here is the model explaination

***    Best Trained Model    ***
  Model Type: rnnt
  Model Saved Pa