[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/intel/e2eAIOK/blob/main/demo/builtin/rnnt/RNNT_DEMO.ipynb)

# RNN-T Demo

Automatic speech recognition (ASR) systems convert audio into text. RNN-T is an end-to-end, RNN-based ASR model that directly outputs word transcripts from the input audio. This notebook is a step-by-step guide to optimizing the RNN-T model with the Intel® End-to-End AI Optimization Kit, and it includes a detailed performance analysis.

# Content
* [Overview](#Overview)
    * [Model Architecture](#Model-Architecture)
    * [Optimizations](#Optimizations)
    * [Performance](#Performance)
* [Getting Started](#Getting-Started)
    * [1. Environment Setup](#1.-Environment-Setup)
    * [2. Workflow Prepare](#2.-Workflow-Prepare)
    * [3. Data Prepare](#3.-Data-Prepare)
    * [4. Train](#4.-Train)

# Overview
<img src="./img/asr.png" width="800"/>

* The traditional ASR system (top picture) contains acoustic, phonetic, and language components that work together as a pipeline.
* The end-to-end ASR system is a single neural network that takes the raw audio signal as input and outputs a sequence of words.

## Model Architecture
<img src="./img/rnnt_structure.png"/>

RNN-T is an end-to-end ASR model that directly converts audio into text representation.

The encoder network is an RNN that maps input acoustic frames into a higher-level representation.
The prediction network is an RNN that is explicitly conditioned on the history of previous non-blank targets predicted by the model.
The joint network is a feed-forward network that combines the outputs of the prediction network and the encoder to produce logits, which a softmax layer then turns into a distribution over the next output symbol.
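
In symbols, the three components can be summarized as follows. This is the usual transducer formulation rather than something taken from the demo code; the symbols $f_t$, $g_u$, and $z_{t,u}$ are notation introduced here for illustration.

$$
\begin{aligned}
f_t &= \mathrm{Encoder}(x_{1:T})_t \\
g_u &= \mathrm{Prediction}(y_{<u}) \\
z_{t,u} &= \mathrm{Joint}(f_t, g_u) \\
P(k \mid t, u) &= \mathrm{softmax}(z_{t,u})_k
\end{aligned}
$$

Here $x_{1:T}$ are the acoustic frames, $y_{<u}$ are the non-blank labels emitted so far, and $k$ ranges over the output vocabulary plus the blank symbol.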

## Optimizations

### Model architecture Intro

To democratize the RNN-T model, we enabled distributed training with PyTorch DDP to scale out training across multiple nodes, added a time stack layer and increased the time stack factor to reduce the input sequence length, added layer and batch normalization to speed up training convergence, and decreased the layer size to get a lighter model.

<img src="./img/model_base.png" width="600"/><figure>base model</figure>
<img src="./img/model_opt.png" width="600"/><figure>democratized model</figure>


### Distributed training

``` python
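from torch.nn.parallel import DistributedDataParallel as DDP

# Wrap the model for data-parallel training when more than one process is running.
if world_size > 1:
    model = DDP(model, find_unused_parameters=True)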
```

### Add time stack layer

In ASR systems, the number of time frames in an audio input sequence is significantly higher than the number of output text labels. An LSTM is a sequential model, so processing long sequences such as audio is time-consuming. The StackTime layer stacks adjacent audio frames, which shortens the sequence and produces a higher-dimensional input, and this helps speed up training.

```python
import torch
import torch.nn as nn


class StackTime(nn.Module):
    def __init__(self, factor):
        super().__init__()
        self.factor = int(factor)

    def stack(self, x):
        x = x.transpose(0, 1)
        T = x.size(1)
        padded = torch.nn.functional.pad(x, (0, 0, 0, (self.factor - (T % self.factor)) % self.factor))
        B, T, H = padded.size()
        x = padded.reshape(B, T // self.factor, -1)
        x = x.transpose(0, 1)
        return x

    def forward(self, x, x_lens):
        if type(x) is not list:
            x = self.stack(x)
            x_lens = (x_lens.int() + self.factor - 1) // self.factor
            return x, x_lens
        else:
            if len(x) != 2:
                raise NotImplementedError("Only number of seq segments equal to 2 is supported")
            assert x[0].size(1) % self.factor == 0, "The length of the 1st seq segment should be multiple of stack factor"
            y0 = self.stack(x[0])
            y1 = self.stack(x[1])
            x_lens = (x_lens.int() + self.factor - 1) // self.factor
            return [y0, y1], x_lens
```
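
To make the shape arithmetic in `StackTime` concrete, the sketch below reproduces the same padding and grouping on plain Python lists for a single utterance. It is an illustration only, not the code used in training; `stacked_length` and `stack_frames` are names chosen here, and the frame values are toy data.

```python
def stacked_length(num_frames, factor):
    """Sequence length after stacking: the same ceil-division used for x_lens."""
    return (num_frames + factor - 1) // factor


def stack_frames(frames, factor):
    """Zero-pad to a multiple of `factor`, then concatenate each group of `factor` frames."""
    feat_dim = len(frames[0])
    pad = (factor - len(frames) % factor) % factor
    padded = frames + [[0.0] * feat_dim for _ in range(pad)]
    return [
        [value for frame in padded[i:i + factor] for value in frame]
        for i in range(0, len(padded), factor)
    ]


# 10 toy frames with 3 features each, stacked with factor 4.
frames = [[float(t)] * 3 for t in range(10)]
stacked = stack_frames(frames, 4)
print(len(stacked), len(stacked[0]))  # 3 steps, 12 features per step
print(stacked_length(1000, 2), stacked_length(1000, 8))  # 500 vs. 125 steps
```

The sequence gets `factor` times shorter while each step gets `factor` times wider, so the LSTM runs fewer sequential steps over the same audio.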

Increasing the time stack factor from 2 to 8 yields about a 4x speedup.

<img src="./img/time_stack_2.PNG" width="600"/><figure>time_stack = 2</figure>
<img src="./img/time_stack_8.PNG" width="600"/><figure>time_stack = 8</figure>

Profiling data shows that the forward and backward passes take less time because the time stack layer shortens the input sequence.

<img src="./img/stack_profile_base.png" width="600"/><figure>base model profiling</figure>
<img src="./img/stack_profile_democratize.png" width="600"/><figure>democratized model profiling</figure>


### Add layer normalization and batch normalization

Layer normalization for the LSTM is important to the success of RNN-T modeling. Adding layer normalization to the LSTM and batch normalization to the input features helps speed up training convergence: training takes 52 epochs to converge without normalization, but only 49 epochs with it.

```python
enc_mod["batch_norm"] = nn.BatchNorm1d(pre_rnn_input_size)
```

```python
self.layer_norm = torch.nn.LayerNorm(hidden_size)
```
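
What layer normalization computes for a single hidden vector can be sketched in plain Python. This shows the arithmetic only: it omits the learnable scale and shift that `torch.nn.LayerNorm` applies by default, and `eps=1e-5` mirrors the PyTorch default. The function name and the toy vector are illustrative.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in vector]


hidden = [2.0, 4.0, 6.0, 8.0]
normalized = layer_norm(hidden)
print([round(v, 3) for v in normalized])
```

Layer normalization uses statistics from within one sample, whereas batch normalization normalizes each feature using statistics gathered across the batch.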

<img src="./img/no_norm.PNG" width="600"/><figure>without normalization</figure>
<img src="./img/norm.PNG" width="600"/><figure>with normalization</figure>


### HPO with SDA (Smart Democratization Advisor)

SDA configuration:

```
Parameters for SDA auto optimization:
- learning_rate: 1.0e-3~1.0e-2 #training learning rate
- warmup_epochs: 1~10 #epoch to warmup learning rate
metrics:
- name: training_time # training time threshold
  objective: minimize
  threshold: 43200
- name: WER # training metric threshold
  objective: minimize
  threshold: 0.25
 ```
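
One way to read this configuration: each parameter defines a closed range, and every suggested value should fall inside it. A minimal sketch, assuming a suggestion is available as a plain dict; `SEARCH_SPACE` and `in_search_space` are hypothetical names for illustration and are not part of the SDA API.

```python
# Ranges copied from the SDA configuration above.
SEARCH_SPACE = {
    "learning_rate": (1.0e-3, 1.0e-2),
    "warmup_epochs": (1, 10),
}


def in_search_space(suggestion, space=SEARCH_SPACE):
    """Return True if every tuned parameter is present and within its range."""
    return all(
        name in suggestion and low <= suggestion[name] <= high
        for name, (low, high) in space.items()
    )


# Values from the launch command in the training log further down
# (--lr 0.007 --warmup_epochs 6).
print(in_search_space({"learning_rate": 0.007, "warmup_epochs": 6}))  # True
print(in_search_space({"learning_rate": 0.05, "warmup_epochs": 6}))   # False
```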

Request suggestions from SDA:

```python
suggestion = self.conn.experiments(self.experiment.id).suggestions().create()
```


### Framework related optimization

Leverage IPEX for distributed training and enable socket binding when training on a two-socket system.

```bash
# Use IPEX launch to launch training, enable NUMA binding in two socket system.
${CONDA_PREFIX}/bin/python -m intel_extension_for_pytorch.cpu.launch --distributed --nproc_per_node=2 --nnodes=4 --hostfile hosts train.py ${ARGS}
```
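
The same launch line can be assembled programmatically, which makes the rank arithmetic explicit: 2 processes per node across 4 nodes gives 8 ranks. A minimal sketch; `build_launch_cmd` is a hypothetical helper, and it only uses the launcher flags already shown in the command above.

```python
import shlex


def build_launch_cmd(python_bin, train_script, nproc_per_node, nnodes,
                     hostfile="hosts", extra_args=()):
    """Build the IPEX launcher command as an argument list."""
    return [
        python_bin, "-m", "intel_extension_for_pytorch.cpu.launch",
        "--distributed",
        f"--nproc_per_node={nproc_per_node}",
        f"--nnodes={nnodes}",
        "--hostfile", hostfile,
        train_script, *extra_args,
    ]


nproc_per_node, nnodes = 2, 4
cmd = build_launch_cmd("python", "train.py", nproc_per_node, nnodes)
world_size = nproc_per_node * nnodes  # total number of training processes
print(shlex.join(cmd))
print("world size:", world_size)
```

Keeping the command as a list avoids shell-quoting issues if it is later passed to `subprocess.run`.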

<img src="./img/no_numa_binding.png" width="600"/><figure>without numa binding</figure>
<img src="./img/numa_binding.png" width="600"/><figure>enable numa binding</figure>


## Performance

<img src="./img/rnnt_perf.png" width="900"/>

* Distributed training with hardware scaling delivered a 3.83x speedup from 1 node to 4 nodes.
* HPO delivered a 1.35x speedup, for 5.16x over baseline.
* Time stacking plus a reduced LSTM layer size delivered a 1.86x speedup, for 9.63x over baseline.
* Adding layer normalization in the encoder and decoder and batch normalization for the input features delivered a 1.07x speedup, for 10.31x over baseline.
* Reducing the number of CCL workers delivered a 1.07x speedup, for 11.06x over baseline.
* Thread affinity optimization delivered a 1.28x speedup, for 14.19x over baseline.
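
The cumulative figures in this list are approximately the running product of the per-step speedups. A quick check using only the numbers reported above; small differences are expected because the reported values are rounded.

```python
# (step, per-step speedup, reported cumulative speedup over baseline)
steps = [
    ("distributed training, 1 -> 4 nodes", 3.83, 3.83),
    ("HPO", 1.35, 5.16),
    ("time stacking + smaller LSTM", 1.86, 9.63),
    ("layer/batch normalization", 1.07, 10.31),
    ("fewer CCL workers", 1.07, 11.06),
    ("thread affinity", 1.28, 14.19),
]

running = 1.0
products = []
for name, step_speedup, reported in steps:
    running *= step_speedup  # speedups compose multiplicatively
    products.append(running)
    print(f"{name:36s} product={running:6.2f}  reported={reported:6.2f}")
```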

# Getting Started
* [1. Environment Setup](#1.-Environment-Setup)
* [2. Workflow Prepare](#2.-Workflow-Prepare)
* [3. Data Prepare](#3.-Data-Prepare)
* [4. Train](#4.-Train)

Note: to run this demo, first complete the `Environment Setup`, `Workflow Prepare`, and `Data Prepare` sections.

## 1. Environment Setup

### Option1 Setup Environment with Pip
Pre-work: move the e2eAIOK source code to /home/vmagent/app/e2eaiok.

```bash
pip install torchaudio==0.12.1 torch==1.12.1 --extra-index-url https://download.pytorch.org/whl/cpu
pip install oneccl_bind_pt==1.12.100 intel-extension-for-pytorch==1.12.100 -f https://developer.intel.com/ipex-whl-stable
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda110==1.9.0
pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger
pip install "git+https://github.com/mlperf/logging.git@1.0.0"
pip install sentencepiece Unidecode tensorboard inflect soundfile librosa sox pandas pyyaml sigopt
git clone https://github.com/HawkAaron/warp-transducer && cd warp-transducer \
    && mkdir build && cd build \
    && cmake .. && make && cd ../pytorch_binding \
    && python setup.py install
pip install e2eAIOK-sda --pre
apt install -y numactl
```

### Option2 Setup Environment with Docker
``` bash
# Setup ENV
git clone https://github.com/intel/e2eAIOK.git
cd e2eAIOK
git submodule update --init --recursive
python3 scripts/start_e2eaiok_docker.py -b pytorch112 -w ${host0} ${host1} ${host2} ${host3} --proxy ""
# Enter Docker
sshpass -p docker ssh ${host0} -p 12347
```

## 2. Workflow Prepare

``` bash
# Prepare the model code
bash workflow_prepare_rnnt.sh
```

A simple example config file is shown below; refer to conf/e2eaiok_defaults_rnnt_example.conf for the complete file.

```yaml
### GLOBAL SETTINGS ###
observation_budget: 1
save_path: /home/vmagent/app/e2eaiok/result/
ppn: 2
train_batch_size: 8
eval_batch_size: 8
iface: lo
hosts:
- localhost
epochs: 2
```

## 3. Data Prepare

```bash
# Download Dataset
# Download and unzip dataset from https://www.openslr.org/12 to /home/vmagent/app/dataset/LibriSpeech

cd /home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch
bash scripts/preprocess_librispeech.sh
```

Note: RNN-T is trained on LibriSpeech train-clean-100 and evaluated on dev-clean. We evaluated the stock model (based on the MLPerf submission) on the train-clean-100 dataset and obtained a final WER of 0.25; all optimizations in this demo preserve that 0.25 WER. For comparison, the MLPerf submission took 38.7 min with 8x A100 on the LibriSpeech train-960h dataset.

Public references on train-clean-100: https://arxiv.org/pdf/1807.10893.pdf, https://arxiv.org/pdf/1811.00787.pdf
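
For reference, WER (word error rate) is commonly computed as the word-level edit distance between the hypothesis and the reference, divided by the number of reference words. The sketch below implements that standard definition; the demo's own evaluation code may normalize text differently, and `word_error_rate` is a name chosen here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # an empty hypothesis needs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the): 2 errors over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1 when the hypothesis contains many spurious words, which can happen with a model trained for only a few epochs.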

## 4. Train

Edit the config file to control the SDA process.

```python
from e2eAIOK.SDA.SDA import SDA
import yaml

# create SDA settings
settings = {}
settings["data_path"] = "/home/vmagent/app/dataset/LibriSpeech/"
settings["enable_sigopt"] = False
settings["python_path"] = "/opt/intel/oneapi/intelpython/latest/bin/python"
settings["train_path"] = "e2eaiok/modelzoo/rnnt/pytorch/train.py"
settings["model_config"] = "e2eaiok/modelzoo/rnnt/pytorch/configs/baseline_v3-1023sp.yaml"
# load RNN-T settings
with open("e2eaiok/tests/cicd/conf/e2eaiok_defaults_rnnt_example.conf") as f:
    conf = yaml.load(f, Loader=yaml.FullLoader)
settings.update(conf)

sda = SDA(model="rnnt", settings=settings)
sda.launch()
```

2023-03-23 07:37:53,954 - E2EAIOK.SDA - INFO - ### Ready to submit current task  ###
2023-03-23 07:37:53,956 - E2EAIOK.SDA - INFO - Model Advisor created
2023-03-23 07:37:53,957 - E2EAIOK.SDA - INFO - model parameter initialized
2023-03-23 07:37:53,958 - E2EAIOK.SDA - INFO - start to launch training
2023-03-23 07:37:53,959 - sigopt - INFO - training launch command: /opt/intel/oneapi/intelpython/latest/bin/python -m intel_extension_for_pytorch.cpu.launch --distributed --nproc_per_node=2 --nnodes=1 --hostfile hosts /home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py --output_dir /home/vmagent/app/e2eaiok/result/357dc3f8a3dfe894b3a3fcdd15fd1129f95f71cf887c8475679b1ff5b50674d8 --dist --dist_backend gloo --batch_size 8 --val_batch_size 8 --lr 0.007 --warmup_epochs 6 --beta1 0.9 --beta2 0.999 --max_duration 16.7 --target 0.25 --min_lr 1e-05 --lr_exp_gamma 0.939 --epochs 2 --epochs_this_job 0 --ema 0.999 --model_config modelzoo/rnnt/pytorch/configs/baseline_v3-1023sp.yaml --train_dataset

{'dataset_dir': '/home/vmagent/app/dataset/LibriSpeech', 'train_manifests': ['/home/vmagent/app/dataset/LibriSpeech/metadata/train-test.json'], 'val_manifests': ['/home/vmagent/app/dataset/LibriSpeech/metadata/dev-test.json']}


2023-03-23 07:37:54,865 - __main__ - INFO - MASTER_ADDR=127.0.0.1
2023-03-23 07:37:54,865 - __main__ - INFO - MASTER_PORT=29500
2023-03-23 07:37:54,866 - __main__ - INFO - I_MPI_PIN_DOMAIN=[0x3fff0,0xfffc00000,]
2023-03-23 07:37:54,866 - __main__ - INFO - OMP_NUM_THREADS=14
2023-03-23 07:37:54,866 - __main__ - INFO - Using Intel OpenMP
2023-03-23 07:37:54,867 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
2023-03-23 07:37:54,867 - __main__ - INFO - KMP_BLOCKTIME=1
2023-03-23 07:37:54,867 - __main__ - INFO - LD_PRELOAD=/opt/intel/oneapi/intelpython/latest/lib/libiomp5.so
2023-03-23 07:37:54,867 - __main__ - INFO - CCL_WORKER_COUNT=4
2023-03-23 07:37:54,867 - __main__ - INFO - CCL_WORKER_AFFINITY=0,1,2,3,18,19,20,21
2023-03-23 07:37:54,867 - __main__ - INFO - mpiexec.hydra -l -np 2 -ppn 2 -genv I_MPI_PIN_DOMAIN=[0x3fff0,0xfffc00000,] -genv OMP_NUM_THREADS=14 /opt/intel/oneapi/intelpython/latest/bin/python -u /home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py --output

[0] No module named 'torch_ccl'
[1] No module named 'torch_ccl'
[0] world_size:2,rank:0
[1] world_size:2,rank:1
[0] :::MLLOG {"namespace": "", "time_ms": 1679557076109, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 357}}
[0] :::MLLOG {"namespace": "", "time_ms": 1679557076192, "event_type": "POINT_IN_TIME", "key": "seed", "value": 2021, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 362}}
[0] DLL 2023-03-23 07:37:56.194502 - PARAMETER | epochs :  2
[0] DLL 2023-03-23 07:37:56.194556 - PARAMETER | warmup_epochs :  6
[0] DLL 2023-03-23 07:37:56.194581 - PARAMETER | hold_epochs :  40
[0] DLL 2023-03-23 07:37:56.194609 - PARAMETER | epochs_this_job :  0
[0] DLL 2023-03-23 07:37:56.194632 - PARAMETER | cudnn_benchmark :  True
[0] DLL 2023-03-23 07:37:56.194656 - PARAMETER | amp_level :  1
[0] DLL 2023-03-23 07:37:56.194675 - PARAMETER |

[0] 2023-03-23 07:37:56,098 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
[1] 2023-03-23 07:37:56,098 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
[0] 2023-03-23 07:37:56,098 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[1] 2023-03-23 07:37:56,098 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.


[0] :::MLLOG {"namespace": "", "time_ms": 1679557076296, "event_type": "POINT_IN_TIME", "key": "weights_initialization", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/common/rnn.py", "lineno": 89, "tensor": "pre_rnn"}}
[0] :::MLLOG {"namespace": "", "time_ms": 1679557076552, "event_type": "POINT_IN_TIME", "key": "weights_initialization", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/common/rnn.py", "lineno": 89, "tensor": "post_rnn"}}
[0] :::MLLOG {"namespace": "", "time_ms": 1679557076556, "event_type": "POINT_IN_TIME", "key": "weights_initialization", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/rnnt/model.py", "lineno": 159, "tensor": "pred_embed"}}
[0] :::MLLOG {"namespace": "", "time_ms": 1679557076582, "event_type": "POINT_IN_TIME", "key": "weights_initialization", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/common/rnn.py",



[0] Dataset read by DALI. Number of samples: 73
[0] Initializing DALI with parameters:
[0] 	           __class__ : <class 'common.data.dali.pipeline.DaliPipeline'>
[0] 	          batch_size : 8
[0] 	           device_id : None
[0] 	        dither_coeff : 1e-05
[0] 	       dont_use_mmap : False
[0] 	           file_root : /home/vmagent/app/dataset/LibriSpeech/valid
[0] 	    in_mem_file_list : False
[0] 	        max_duration : inf
[0] 	           nfeatures : 80
[0] 	                nfft : 512
[0] 	         num_threads : 4
[0] 	       pipeline_type : val
[0] 	            pre_sort : False
[0] 	       preemph_coeff : 0.97
[0] 	preprocessing_device : cpu
[0] 	      resample_range : None
[0] 	         sample_rate : 16000
[0] 	             sampler : <common.data.dali.sampler.SimpleSampler object at 0x7f39eec970d0>
[0] 	                seed : 2021
[0] 	                self : <common.data.dali.pipeline.DaliPipeline object at 0x7f39eec97be0>
[0] 	   silence_threshold : -60
[0] 	  

[1]   warn("Profiler won't be using warmup, this can skew profiler results")
[1]   x_lens = (x_lens.int() + stacking - 1) // stacking
[1]   pivot_len = (audio_shape_sorted[self.split_batch_size] + stack_factor-1) // stack_factor * stack_factor
[1]   batch_offset = torch.cumsum(g_len * ((feat_lens+self.enc_stack_time_factor-1)//self.enc_stack_time_factor), dim=0)
[0]   warn("Profiler won't be using warmup, this can skew profiler results")
[0]   x_lens = (x_lens.int() + stacking - 1) // stacking
[0]   pivot_len = (audio_shape_sorted[self.split_batch_size] + stack_factor-1) // stack_factor * stack_factor
[0]   batch_offset = torch.cumsum(g_len * ((feat_lens+self.enc_stack_time_factor-1)//self.enc_stack_time_factor), dim=0)
[0]   x_lens = (x_lens.int() + self.factor - 1) // self.factor
[1]   x_lens = (x_lens.int() + self.factor - 1) // self.factor
[1] [W kineto_shim.cpp:337] Profiler is not initialized: skipping step() invocation
[0] [W kineto_shim.cpp:337] Profiler is not initialized: ski

[0] DLL 2023-03-23 07:38:02.765200 - epoch    1 | iter    1/6 | loss  958.22 | utts/s     3 | took  5.08 s | lrate 3.78e-04
[0] DLL 2023-03-23 07:38:06.465558 - epoch    1 | iter    2/6 | loss  906.90 | utts/s     4 | took  3.70 s | lrate 5.68e-04


[0] [W kineto_shim.cpp:337] Profiler is not initialized: skipping step() invocation
[1] [W kineto_shim.cpp:337] Profiler is not initialized: skipping step() invocation


[0] DLL 2023-03-23 07:38:09.720707 - epoch    1 | iter    3/6 | loss  801.17 | utts/s     5 | took  3.26 s | lrate 7.57e-04


[0] [W kineto_shim.cpp:337] Profiler is not initialized: skipping step() invocation
[1] [W kineto_shim.cpp:337] Profiler is not initialized: skipping step() invocation
[1] [W kineto_shim.cpp:337] Profiler is not initialized: skipping step() invocation
[0] [W kineto_shim.cpp:337] Profiler is not initialized: skipping step() invocation


[0] DLL 2023-03-23 07:38:12.573296 - epoch    1 | iter    4/6 | loss  535.04 | utts/s     6 | took  2.85 s | lrate 9.46e-04
[0] DLL 2023-03-23 07:38:16.913737 - epoch    1 | iter    5/6 | loss 1013.65 | utts/s     4 | took  4.34 s | lrate 1.14e-03
[0] DLL 2023-03-23 07:38:22.809516 - epoch    1 | iter    6/6 | loss  896.75 | utts/s     3 | took  5.90 s | lrate 1.32e-03
[0] :::MLLOG {"namespace": "", "time_ms": 1679557102810, "event_type": "INTERVAL_END", "key": "epoch_stop", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 786, "epoch_num": 1}}
[0] DLL 2023-03-23 07:38:22.810912 - epoch    1 | avg train utts/s     4 | took 25.16 s
[0] :::MLLOG {"namespace": "", "time_ms": 1679557102810, "event_type": "POINT_IN_TIME", "key": "throughput", "value": 3.816295986713841, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 793}}
[0] :::MLLOG {"namespace": "", "time_ms": 1679557102811, "event_type":

[1]   x_lens = (x_lens.int() + self.factor - 1) // self.factor
[0]   x_lens = (x_lens.int() + self.factor - 1) // self.factor


[0] :::MLLOG {"namespace": "", "time_ms": 1679557114384, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 20.484347826086957, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 260, "epoch_num": 1}}
[0] :::MLLOG {"namespace": "", "time_ms": 1679557114384, "event_type": "INTERVAL_END", "key": "eval_stop", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 261, "epoch_num": 1}}
[0] DLL 2023-03-23 07:38:34.385417 - epoch    1 |   dev ema wer 2048.43 | took 11.57 s
[0] :::MLLOG {"namespace": "", "time_ms": 1679557114386, "event_type": "INTERVAL_END", "key": "block_stop", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 811, "first_epoch_num": 1}}
[0] :::MLLOG {"namespace": "", "time_ms": 1679557114386, "event_type": "INTERVAL_START", "key": "block_start", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo

[0] :::MLLOG {"namespace": "", "time_ms": 1679557337509, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 20.484347826086957, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 260, "epoch_num": 2}}
[0] :::MLLOG {"namespace": "", "time_ms": 1679557337510, "event_type": "INTERVAL_END", "key": "eval_stop", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 261, "epoch_num": 2}}
[0] DLL 2023-03-23 07:42:17.510989 - epoch    2 |   dev ema wer 2048.43 | took 11.45 s
[0] Saving /home/vmagent/app/e2eaiok/result/357dc3f8a3dfe894b3a3fcdd15fd1129f95f71cf887c8475679b1ff5b50674d8/RNN-T_epoch2_checkpoint.pt...


2023-03-23 07:42:28,613 - sigopt - INFO - Training completed based in sigopt suggestion, took 274.6536593437195 secs
2023-03-23 07:42:28,616 - E2EAIOK.SDA - INFO - training script completed


('/home/vmagent/app/e2eaiok/result/357dc3f8a3dfe894b3a3fcdd15fd1129f95f71cf887c8475679b1ff5b50674d8',
 [{'name': 'WER', 'value': 20.484347826086957},
  {'name': 'training_time', 'value': 274.6536593437195}])
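
The tuple returned by `sda.launch()` pairs the result directory with a list of metric records. A minimal sketch of checking those records against the thresholds from the SDA configuration (WER 0.25, training time 43200 s); `meets_thresholds` is a hypothetical helper, not part of the SDA API, and it assumes both metrics are minimized, as in the configuration.

```python
# Thresholds copied from the SDA configuration earlier in this notebook.
THRESHOLDS = {"WER": 0.25, "training_time": 43200}


def meets_thresholds(metrics, thresholds=THRESHOLDS):
    """Map each metric name to True if its value is at or below the threshold."""
    values = {record["name"]: record["value"] for record in metrics}
    return {name: values[name] <= limit for name, limit in thresholds.items()}


# Metric records copied from the demo output above.
metrics = [
    {"name": "WER", "value": 20.484347826086957},
    {"name": "training_time", "value": 274.6536593437195},
]
print(meets_thresholds(metrics))  # {'WER': False, 'training_time': True}
```

The WER threshold is not met here, which is unsurprising for a run limited to 2 epochs by the example config.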