[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/intel/e2eAIOK/blob/main/demo/builtin/rnnt/RNNT_DEMO.ipynb)

# RNN-T Demo

Automatic speech recognition (ASR) systems convert audio into text. RNN-T is an end-to-end, RNN-based ASR model that directly outputs word transcripts from the input audio. This notebook is a step-by-step guide to optimizing the RNN-T model with Intel® End-to-End AI Optimization Kit, together with a detailed performance analysis.

# Content
* [Model Architecture](#Model-Architecture)
* [Optimizations](#Optimizations)
* [DEMO](#DEMO)

## ASR
<img src="./img/asr.png" width="800"/>

* The traditional ASR system (top picture) contains acoustic, phonetic, and language components that work together as a pipeline.
* The end-to-end ASR system is a single neural network that receives the raw audio signal as input and outputs a sequence of words.

## Model Architecture
<img src="./img/rnnt_structure.png"/>

RNN-T is an end-to-end ASR model that directly converts audio into text representation.

The encoder network is an RNN that maps input acoustic frames into a higher-level representation.
The prediction network is an RNN that is explicitly conditioned on the history of previous non-blank targets predicted by the model.
The joint network is a feed-forward network that combines the outputs of the prediction network and the encoder to produce logits, followed by a softmax layer that produces a distribution over the next output symbol.
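
To make the role of the joint network concrete, here is a minimal pure-Python sketch for a single (frame, label) position. It assumes an additive combination followed by ReLU and one linear layer, which is one common choice and is not confirmed by this notebook; the function names, weights, and the 4-symbol vocabulary are all hypothetical.

```python
import math


def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint(enc_t, pred_u, weight, bias):
    """Combine one encoder frame with one prediction-network state.

    enc_t, pred_u: lists of length H; weight: V x H matrix; bias: length V.
    Assumption: the two inputs are added and passed through ReLU.
    """
    hidden = [max(0.0, e + p) for e, p in zip(enc_t, pred_u)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weight, bias)]
    return softmax(logits)


# Hypothetical toy values: hidden size 3, vocabulary of 4 symbols.
enc_t = [0.5, -1.0, 2.0]
pred_u = [0.1, 0.3, -0.5]
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0],
          [0.5, 0.5, 0.5]]
bias = [0.0, 0.0, 0.0, 0.0]

probs = joint(enc_t, pred_u, weight, bias)
best_symbol = probs.index(max(probs))
print(probs, best_symbol)
```

In the real model this computation runs for every pair of encoder time step and prediction step, which is why shortening the encoder sequence matters for training cost.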

## Optimizations

### Model architecture Intro

To democratize the RNN-T model, we made four changes: enabled distributed training with PyTorch DDP to scale training out across multiple nodes; added a time stack layer and increased the time stack factor to reduce the input sequence length; added layer and batch normalization to speed up training convergence; and decreased the layer size to get a lighter model.

<img src="./img/model_base.png" width="600"/><figure>base model</figure>
<img src="./img/model_opt.png" width="600"/><figure>democratized model</figure>


### Distributed training

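``` python
from torch.nn.parallel import DistributedDataParallel as DDP

# Data parallel: wrap the model only when more than one process is training.
if world_size > 1:
    model = DDP(model, find_unused_parameters=True)
```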
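
Conceptually, DDP keeps one model replica per process and averages the gradients across processes after each backward pass, so every replica applies the same update. The sketch below simulates that averaging in plain Python; `all_reduce_mean` and the numbers are hypothetical illustrations, not part of the PyTorch API.

```python
def all_reduce_mean(per_worker_grads):
    """Average gradients element-wise across workers (simulated all-reduce)."""
    world_size = len(per_worker_grads)
    return [sum(vals) / world_size for vals in zip(*per_worker_grads)]


# Two workers, each holding gradients computed on its own mini-batch.
grads = [[0.2, -0.4],
         [0.6, 0.0]]
avg = all_reduce_mean(grads)

# Every worker applies the same averaged gradient, so replicas stay in sync.
weights = [1.0, 1.0]
lr = 0.1
new_weights = [w - lr * g for w, g in zip(weights, avg)]
print(avg, new_weights)
```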

### Add time stack layer

In ASR systems, an audio input sequence has far more time frames than the output text has labels. An LSTM processes its input sequentially, so long sequences such as audio are expensive to process. The StackTime layer stacks adjacent audio frames, which shortens the sequence and produces a higher-dimensional input, and this speeds up training.

```python
import torch
import torch.nn as nn


class StackTime(nn.Module):
    def __init__(self, factor):
        super().__init__()
        self.factor = int(factor)

    def stack(self, x):
        x = x.transpose(0, 1)
        T = x.size(1)
        padded = torch.nn.functional.pad(x, (0, 0, 0, (self.factor - (T % self.factor)) % self.factor))
        B, T, H = padded.size()
        x = padded.reshape(B, T // self.factor, -1)
        x = x.transpose(0, 1)
        return x

    def forward(self, x, x_lens):
        if type(x) is not list:
            x = self.stack(x)
            x_lens = (x_lens.int() + self.factor - 1) // self.factor
            return x, x_lens
        else:
            if len(x) != 2:
                raise NotImplementedError("Only number of seq segments equal to 2 is supported")
            assert x[0].size(1) % self.factor == 0, "The length of the 1st seq segment should be multiple of stack factor"
            y0 = self.stack(x[0])
            y1 = self.stack(x[1])
            x_lens = (x_lens.int() + self.factor - 1) // self.factor
            return [y0, y1], x_lens
```
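
The shape arithmetic of the layer can be checked without PyTorch. The sketch below mirrors the pad-then-reshape logic of `stack()` on plain lists for a single utterance; `stack_time` and `stacked_len` are hypothetical helper names used only for illustration.

```python
def stack_time(frames, factor):
    """Stack `factor` consecutive frames into one wider frame.

    frames: list of T feature vectors, each a list of H floats.
    The sequence is zero-padded so that T becomes a multiple of `factor`.
    """
    pad = (factor - len(frames) % factor) % factor
    width = len(frames[0])
    padded = frames + [[0.0] * width for _ in range(pad)]
    return [sum(padded[i:i + factor], [])
            for i in range(0, len(padded), factor)]


def stacked_len(length, factor):
    """Sequence length after stacking (ceiling division), as in forward()."""
    return (length + factor - 1) // factor


frames = [[float(t), float(t) + 0.5] for t in range(10)]  # T = 10, H = 2

s2 = stack_time(frames, 2)  # 5 steps, 4 features each
s8 = stack_time(frames, 8)  # 2 steps, 16 features each
print(len(s2), len(s2[0]), len(s8), len(s8[0]))
print(stacked_len(1000, 2), stacked_len(1000, 8))
```

For a 1000-frame utterance the LSTM sees 500 steps with a factor of 2 and 125 steps with a factor of 8, a 4x reduction in sequential steps.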

Increasing the time stack factor from 2 to 8 gives about a 4x speedup.

<img src="./img/time_stack_2.PNG" width="600"/><figure>time_stack = 2</figure>
<img src="./img/time_stack_8.PNG" width="600"/><figure>time_stack = 8</figure>

Profiling data shows that the forward and backward passes take less time, because the time stack layer shortens the input sequence.

<img src="./img/stack_profile_base.png" width="600"/><figure>base model profiling</figure>
<img src="./img/stack_profile_democratize.png" width="600"/><figure>democratized model profiling</figure>


### Add layer normalization and batch normalization

Layer normalization for the LSTM is important to the success of RNN-T modeling. Adding layer normalization to the LSTM and batch normalization to the input features helps training converge faster: convergence takes 52 epochs without normalization and 49 epochs with it.

```python
enc_mod["batch_norm"] = nn.BatchNorm1d(pre_rnn_input_size)
```

```python
self.layer_norm = torch.nn.LayerNorm(hidden_size)
```
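
As a reminder of what `LayerNorm` computes, the sketch below normalizes one hidden vector to zero mean and unit variance in plain Python. It omits the learnable scale and shift, which PyTorch initializes to 1 and 0, and the input values are made up. `BatchNorm1d` applies the same idea per feature across the batch instead of across the features of one sample.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one vector across its features (no learnable affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in x]


hidden = [2.0, 4.0, 6.0, 8.0]
normed = layer_norm(hidden)

out_mean = sum(normed) / len(normed)
out_var = sum((v - out_mean) ** 2 for v in normed) / len(normed)
print(normed, out_mean, out_var)
```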

<img src="./img/no_norm.PNG" width="600"/><figure>without normalization</figure>
<img src="./img/norm.PNG" width="600"/><figure>with normalization</figure>


### HPO with SDA (Smart Democratization Advisor)

SDA configuration:

```
Parameters for SDA auto optimization:
- learning_rate: 1.0e-3~1.0e-2 #training learning rate
- warmup_epochs: 1~10 #epoch to warmup learning rate
metrics:
- name: training_time # training time threshold
  objective: minimize
  threshold: 43200
- name: WER # training metric threshold
  objective: minimize
  threshold: 0.25
 ```
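
To illustrate how such a configuration can be read, the sketch below draws one candidate from the search space and checks a result against the two thresholds. It is only an illustration: random sampling stands in for SDA's actual suggestion strategy, which is not described here, and treating `training_time` as seconds (43200 s = 12 h) is an assumption. All names are hypothetical.

```python
import random

# Search space and thresholds copied from the configuration above.
SEARCH_SPACE = {"learning_rate": (1.0e-3, 1.0e-2), "warmup_epochs": (1, 10)}
THRESHOLDS = {"training_time": 43200, "WER": 0.25}  # both are minimized


def sample_trial(rng):
    """Draw one candidate (random search, a stand-in for the advisor)."""
    lr_low, lr_high = SEARCH_SPACE["learning_rate"]
    ep_low, ep_high = SEARCH_SPACE["warmup_epochs"]
    return {
        "learning_rate": rng.uniform(lr_low, lr_high),
        "warmup_epochs": rng.randint(ep_low, ep_high),
    }


def meets_thresholds(metrics):
    """A trial is acceptable when every minimized metric is within its limit."""
    return all(metrics[name] <= limit for name, limit in THRESHOLDS.items())


rng = random.Random(0)
trial = sample_trial(rng)
print(trial)
print(meets_thresholds({"training_time": 40000, "WER": 0.24}))
print(meets_thresholds({"training_time": 50000, "WER": 0.24}))
```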

Request suggestions from SDA:

```python
suggestion = self.conn.experiments(self.experiment.id).suggestions().create()
```


### Framework-related optimization

Leverage IPEX for distributed training and enable socket binding when training on a two-socket system.

```bash
# Use IPEX launch to launch training, enable NUMA binding in two socket system.
${CONDA_PREFIX}/bin/python -m intel_extension_for_pytorch.cpu.launch --distributed --nproc_per_node=2 --nnodes=4 --hostfile hosts train.py ${ARGS}
```
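
With `--nproc_per_node=2` and `--nnodes=4`, the job runs 8 processes in total. The sketch below enumerates them, assuming the common convention that the global rank is `node_index * nproc_per_node + local_rank` and that each local rank maps to one socket; both are assumptions about the launcher, not something this notebook states.

```python
def global_rank(node_index, local_rank, nproc_per_node):
    """Assumed convention: ranks are numbered node by node."""
    return node_index * nproc_per_node + local_rank


nnodes, nproc_per_node = 4, 2
world_size = nnodes * nproc_per_node

# (node, local rank / assumed socket, global rank) for every process.
ranks = [(node, local, global_rank(node, local, nproc_per_node))
         for node in range(nnodes)
         for local in range(nproc_per_node)]

print(world_size)
for node, local, rank in ranks:
    print(f"node {node} socket {local} -> rank {rank}")
```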

<img src="./img/no_numa_binding.png" width="600"/><figure>without numa binding</figure>
<img src="./img/numa_binding.png" width="600"/><figure>enable numa binding</figure>


# DEMO
* [Environment Setup](#Environment-setup)
* [Launch training](#Launch-training)

## Environment setup

### Option 1: Set up the environment with Docker
``` bash
# Setup ENV
git clone https://github.com/intel/e2eAIOK.git
cd e2eAIOK
git submodule update --init --recursive
python3 scripts/start_e2eaiok_docker.py -b pytorch110 -w ${host0} ${host1} ${host2} ${host3} --proxy ""
# Enter Docker
sshpass -p docker ssh ${host0} -p 12345
```

### Option 2: Set up the environment with pip
Prerequisite: move the e2eAIOK source code to `/home/vmagent/app/e2eaiok`.

``` bash
pip install torchaudio==0.12.1 torch==1.12.1 --extra-index-url https://download.pytorch.org/whl/cpu
pip install oneccl_bind_pt==1.12.100 intel-extension-for-pytorch==1.12.100 -f https://developer.intel.com/ipex-whl-stable
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda110==1.9.0
pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger
pip install "git+https://github.com/mlperf/logging.git@1.0.0"
pip install sentencepiece unidecode tensorboard inflect soundfile librosa sox pandas
git clone https://github.com/HawkAaron/warp-transducer && cd warp-transducer \
    && mkdir build && cd build \
    && cmake .. && make && cd ../pytorch_binding \
    && python setup.py install
pip install e2eAIOK-sda --pre
apt install -y numactl
```


## Workflow preparation

``` bash
# prepare model codes
cd /home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch
bash patch_rnnt.sh

# Download Dataset
# Download and unzip dataset from https://www.openslr.org/12 to /home/vmagent/app/dataset/LibriSpeech

# Generate tokenizer and tokenize text
cd /home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch
bash scripts/preprocess_librispeech.sh
```

Notes: RNN-T is trained on LibriSpeech train-clean-100 and evaluated on dev-clean. With the stock model (based on the MLPerf submission) trained on train-clean-100, the final WER is 0.25, and all of the optimizations in this notebook preserve that 0.25 WER. For reference, the MLPerf submission took 38.7 minutes with 8x A100 on the LibriSpeech train-960h dataset.

Public references on train-clean-100: https://arxiv.org/pdf/1807.10893.pdf, https://arxiv.org/pdf/1811.00787.pdf
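
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a minimal standalone implementation for illustration; it is not the evaluation code used by the training script, and the sentences are made up.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = cur
    return prev[-1] / len(ref)


print(wer("the cat sat on the mat", "the cat sit on mat"))
print(wer("hello", "hello there big world"))
```

The first call has one substitution and one deletion over six reference words. The second shows that WER can exceed 1 when the hypothesis contains many insertions, which is typical of a model that has barely started training.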

## Launch training

Edit `conf/e2eaiok_defaults_rnnt_example.conf`:

```
### GLOBAL SETTINGS ###
observation_budget: 1
save_path: /home/vmagent/app/e2eaiok/result/
ppn: 2
train_batch_size: 8
eval_batch_size: 8
iface: lo
hosts:
- localhost
epochs: 2
```
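
These settings can be cross-checked against the training log further down, which reports `train_samples` of 96, `iter 1/6` through `6/6`, and a launch command with `--lr 0.007 --warmup_epochs 6`. The sketch below reproduces the iteration count and the logged learning rates for the first epoch. It assumes the batch size is per process, and the warmup formula is inferred from the logged values rather than taken from `train.py`; it covers only the warmup phase.

```python
def iters_per_epoch(train_samples, batch_size, world_size):
    """Iterations per epoch, assuming batch_size is per process."""
    return train_samples // (batch_size * world_size)


def warmup_lr(base_lr, step, warmup_steps):
    """Linear warmup (inferred from the log); step is the 1-based iteration."""
    return base_lr * min(1.0, (step + 1) / (warmup_steps + 1))


# ppn: 2 and train_batch_size: 8 come from the configuration above.
steps = iters_per_epoch(train_samples=96, batch_size=8, world_size=2)
warmup_steps = 6 * steps  # warmup_epochs = 6 in the logged launch command

rates = [f"{warmup_lr(0.007, k, warmup_steps):.2e}" for k in range(1, steps + 1)]
print(steps)   # 6
print(rates)   # ['3.78e-04', '5.68e-04', '7.57e-04', '9.46e-04', '1.14e-03', '1.32e-03']
```

The six values match the `lrate` column logged for epoch 1.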

``` bash
cd /home/vmagent/app/e2eaiok
# sed -i '/ppn:/ s/:.*/: 1/' tests/cicd/conf/e2eaiok_defaults_rnnt_example.conf
python run_e2eaiok.py --data_path /home/vmagent/app/dataset/LibriSpeech --model_name rnnt --conf tests/cicd/conf/e2eaiok_defaults_rnnt_example.conf
```

2023-03-20 02:41:47,824 - E2EAIOK - INFO - Above info is history record of this model
2023-03-20 02:41:47,824 - E2EAIOK.SDA - INFO - ### Ready to submit current task  ###
2023-03-20 02:41:47,825 - E2EAIOK.SDA - INFO - Model Advisor created
2023-03-20 02:41:47,825 - E2EAIOK.SDA - INFO - model parameter initialized
2023-03-20 02:41:47,825 - E2EAIOK.SDA - INFO - start to launch training
2023-03-20 02:41:47,825 - sigopt - INFO - training launch command: /opt/intel/oneapi/intelpython/latest/envs/pytorch/bin/python -m intel_extension_for_pytorch.cpu.launch --distributed --nproc_per_node=2 --nnodes=1 --hostfile hosts /home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py --output_dir /home/vmagent/app/e2eaiok/result/357dc3f8a3dfe894b3a3fcdd15fd1129f95f71cf887c8475679b1ff5b50674d8 --dist --dist_backend gloo --batch_size 8 --val_batch_size 8 --lr 0.007 --warmup_epochs 6 --beta1 0.9 --beta2 0.999 --max_duration 16.7 --target 0.25 --min_lr 1e-05 --lr_exp_gamma 0.939 --epochs 2 --epochs_this_job

[0] No module named 'torch_ccl'
[1] No module named 'torch_ccl'
[0] world_size:2,rank:0
[1] world_size:2,rank:1


[0] 2023-03-20 02:41:49,890 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
[1] 2023-03-20 02:41:49,890 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
[0] 2023-03-20 02:41:49,890 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[1] 2023-03-20 02:41:49,890 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.


[0] :::MLLOG {"namespace": "", "time_ms": 1679280109904, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 357}}
[0] :::MLLOG {"namespace": "", "time_ms": 1679280110003, "event_type": "POINT_IN_TIME", "key": "seed", "value": 2021, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 362}}
[0] DLL 2023-03-20 02:41:50.006883 - PARAMETER | epochs :  2
[0] DLL 2023-03-20 02:41:50.006950 - PARAMETER | warmup_epochs :  6
[0] DLL 2023-03-20 02:41:50.007040 - PARAMETER | hold_epochs :  40
[0] DLL 2023-03-20 02:41:50.007164 - PARAMETER | epochs_this_job :  0
[0] DLL 2023-03-20 02:41:50.007195 - PARAMETER | cudnn_benchmark :  True
[0] DLL 2023-03-20 02:41:50.007238 - PARAMETER | amp_level :  1
[0] DLL 2023-03-20 02:41:50.007282 - PARAMETER | seed :  2021
[0] DLL 2023-03-20 02:41:50.007326 - PARAMETER | local_rank :  0
[0] DLL 2023-03-20 02:41:50.00737

[0] :::MLLOG {"namespace": "", "time_ms": 1679280110031, "event_type": "POINT_IN_TIME", "key": "model_weights_initialization_scale", "value": 0.5, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 397}}
[0] :::MLLOG {"namespace": "", "time_ms": 1679280110115, "event_type": "POINT_IN_TIME", "key": "weights_initialization", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/common/rnn.py", "lineno": 89, "tensor": "pre_rnn"}}
[0] :::MLLOG {"namespace": "", "time_ms": 1679280110371, "event_type": "POINT_IN_TIME", "key": "weights_initialization", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/common/rnn.py", "lineno": 89, "tensor": "post_rnn"}}
[0] :::MLLOG {"namespace": "", "time_ms": 1679280110375, "event_type": "POINT_IN_TIME", "key": "weights_initialization", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/rnnt/model.py", "lineno": 159, "t



[0] Dataset read by DALI. Number of samples: 73
[0] Initializing DALI with parameters:
[0] 	           __class__ : <class 'common.data.dali.pipeline.DaliPipeline'>
[0] 	          batch_size : 8[0] 
[0] 	           device_id : None[0] 
[0] 	        dither_coeff : 1e-05
[0] 	       dont_use_mmap : False
[0] 	           file_root : /home/vmagent/app/dataset/LibriSpeech/valid
[0] 	    in_mem_file_list : False[0] 
[0] 	        max_duration : inf[0] 
[0] 	           nfeatures : 80
[0] 	                nfft : 512
[0] 	         num_threads : 4
[0] 	       pipeline_type : val[0] 
[0] 	            pre_sort : False
[0] 	       preemph_coeff : 0.97
[0] 	preprocessing_device : cpu
[0] 	      resample_range : None[0] 
[0] 	         sample_rate : 16000[0] 
[0] 	             sampler : <common.data.dali.sampler.SimpleSampler object at 0x7fd12130ad00>
[0] 	                seed : 2021
[0] 	                self : <common.data.dali.pipeline.DaliPipeline object at 0x7fd1213153d0>[0] 
[0] 	   silence_thresho

[1]   warn("Profiler won't be using warmup, this can skew profiler results")


[0] :::MLLOG {"namespace": "", "time_ms": 1679280111170, "event_type": "POINT_IN_TIME", "key": "train_samples", "value": 96, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 651}}
[0] :::MLLOG {"namespace": "", "time_ms": 1679280111170, "event_type": "POINT_IN_TIME", "key": "eval_samples", "value": 73, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 652}}
[0] :::MLLOG {"namespace": "", "time_ms": 1679280111171, "event_type": "POINT_IN_TIME", "key": "opt_name", "value": "lamb", "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 654}}
[0] :::MLLOG {"namespace": "", "time_ms": 1679280111171, "event_type": "POINT_IN_TIME", "key": "opt_base_learning_rate", "value": 0.007, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 655}}
[0] :::MLLOG {"namespace": "", "time_ms": 1679280111171, "event_type": "POINT_IN_TIME", "key": "opt_la

[0]   warn("Profiler won't be using warmup, this can skew profiler results")
[1]   x_lens = (x_lens.int() + stacking - 1) // stacking
[0]   x_lens = (x_lens.int() + stacking - 1) // stacking
[0]   pivot_len = (audio_shape_sorted[self.split_batch_size] + stack_factor-1) // stack_factor * stack_factor
[0]   batch_offset = torch.cumsum(g_len * ((feat_lens+self.enc_stack_time_factor-1)//self.enc_stack_time_factor), dim=0)
[1]   pivot_len = (audio_shape_sorted[self.split_batch_size] + stack_factor-1) // stack_factor * stack_factor
[1]   batch_offset = torch.cumsum(g_len * ((feat_lens+self.enc_stack_time_factor-1)//self.enc_stack_time_factor), dim=0)
[1]   x_lens = (x_lens.int() + self.factor - 1) // self.factor
[0]   x_lens = (x_lens.int() + self.factor - 1) // self.factor


[0] DLL 2023-03-20 02:41:55.755945 - epoch    1 | iter    1/6 | loss  958.22 | utts/s     4 | took  4.49 s | lrate 3.78e-04[0] 


[0] [W kineto_shim.cpp:337] Profiler is not initialized: skipping step() invocation
[1] [W kineto_shim.cpp:337] Profiler is not initialized: skipping step() invocation
[1] [W kineto_shim.cpp:337] Profiler is not initialized: skipping step() invocation


[0] DLL 2023-03-20 02:41:59.148334 - epoch    1 | iter    2/6 | loss  906.90 | utts/s     5 | took  3.39 s | lrate 5.68e-04[0] 


[0] [W kineto_shim.cpp:337] Profiler is not initialized: skipping step() invocation


[0] DLL 2023-03-20 02:42:02.231807 - epoch    1 | iter    3/6 | loss  801.17 | utts/s     5 | took  3.08 s | lrate 7.57e-04[0] 


[0] [W kineto_shim.cpp:337] Profiler is not initialized: skipping step() invocation
[1] [W kineto_shim.cpp:337] Profiler is not initialized: skipping step() invocation


[0] DLL 2023-03-20 02:42:04.772209 - epoch    1 | iter    4/6 | loss  535.04 | utts/s     6 | took  2.54 s | lrate 9.46e-04[0] 


[0] [W kineto_shim.cpp:337] Profiler is not initialized: skipping step() invocation
[1] [W kineto_shim.cpp:337] Profiler is not initialized: skipping step() invocation


[0] DLL 2023-03-20 02:42:08.976775 - epoch    1 | iter    5/6 | loss 1013.65 | utts/s     4 | took  4.20 s | lrate 1.14e-03[0] 
[0] DLL 2023-03-20 02:42:14.720571 - epoch    1 | iter    6/6 | loss  896.75 | utts/s     3 | took  5.74 s | lrate 1.32e-03[0] 
[0] :::MLLOG {"namespace": "", "time_ms": 1679280134720, "event_type": "INTERVAL_END", "key": "epoch_stop", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 786, "epoch_num": 1}}
[0] DLL 2023-03-20 02:42:14.721592 - epoch    1 | avg train utts/s     4 | took 23.54 s
[0] :::MLLOG {"namespace": "", "time_ms": 1679280134721, "event_type": "POINT_IN_TIME", "key": "throughput", "value": 4.077337338816833, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 793}}
[0] :::MLLOG {"namespace": "", "time_ms": 1679280134721, "event_type": "INTERVAL_START", "key": "eval_start", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt

[1]   x_lens = (x_lens.int() + self.factor - 1) // self.factor
[0]   x_lens = (x_lens.int() + self.factor - 1) // self.factor


[0] :::MLLOG {"namespace": "", "time_ms": 1679280146604, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 20.484347826086957, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 260, "epoch_num": 1}}
[0] :::MLLOG {"namespace": "", "time_ms": 1679280146604, "event_type": "INTERVAL_END", "key": "eval_stop", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 261, "epoch_num": 1}}
[0] DLL 2023-03-20 02:42:26.605304 - epoch    1 |   dev ema wer 2048.43 | took 11.88 s
[0] :::MLLOG {"namespace": "", "time_ms": 1679280146605, "event_type": "INTERVAL_END", "key": "block_stop", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 811, "first_epoch_num": 1}}
[0] :::MLLOG {"namespace": "", "time_ms": 1679280146605, "event_type": "INTERVAL_START", "key": "block_start", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo

[0]                                            aten::narrow         0.35%      79.656ms         1.00%     225.783ms       1.700us        132830  
[0]                                               aten::div         0.32%      71.734ms         0.32%      72.809ms     271.675us           268  
[0]                                            aten::expand         0.28%      62.077ms         0.40%      90.029ms       1.173us         76774  
[0]                                  aten::sigmoid_backward         0.27%      61.088ms         0.27%      61.088ms       6.646us          9192  
[0]                                   aten::constant_pad_nd         0.25%      57.246ms         0.84%     190.251ms      16.336us         11646  
[0]     autograd::engine::evaluate_function: AddmmBackward0         0.24%      53.343ms        16.44%        3.702s       1.203ms          3078  
[0]                                              aten::mul_         0.23%      52.620ms         0.26%      57.917ms     304.

[0] autograd::engine::evaluate_function: torch::autograd...         0.01%       2.367ms         0.20%      44.614ms     719.581us            62  
[0]     autograd::engine::evaluate_function: StackBackward0         0.01%       2.090ms         0.04%       8.795ms     314.107us            28  
[0]                                           <backward op>         0.01%       1.373ms         2.22%     499.017ms     249.508ms             2  
[0] autograd::engine::evaluate_function: LogSoftmaxBackw...         0.01%       1.328ms         0.15%      33.563ms      16.782ms             2  
[0]                                          StackBackward0         0.01%       1.265ms         0.03%       6.572ms     234.714us            28  
[0] autograd::engine::evaluate_function: torch::jit::(an...         0.00%     971.000us         2.22%     500.044ms     250.022ms             2  
[0]                                             aten::zero_         0.00%     967.000us         0.12%      28.103ms     121.

2023-03-20 02:46:21,206 - sigopt - INFO - Training completed based in sigopt suggestion, took 273.3797791004181 secs
2023-03-20 02:46:21,206 - E2EAIOK.SDA - INFO - training script completed



***    Best Trained Model    ***
  Model Type: rnnt
  Model Saved Path: 
  Sigopt Experiment id is None
  === Result Metrics ===
{'dataset_dir': '/home/vmagent/app/dataset/LibriSpeech', 'train_manifests': ['/home/vmagent/app/dataset/LibriSpeech/metadata/train-test.json'], 'val_manifests': ['/home/vmagent/app/dataset/LibriSpeech/metadata/dev-test.json']}

We found the best model! Here is the model explaination

***    Best Trained Model    ***
  Model Type: rnnt
  Model Saved Path: /home/vmagent/app/e2eaiok/result/357dc3f8a3dfe894b3a3fcdd15fd1129f95f71cf887c8475679b1ff5b50674d8
  Sigopt Experiment id is None
  === Result Metrics ===
    WER: 20.484347826086957
    training_time: 273.3797791004181
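
The `:::MLLOG` lines in the log above carry a JSON payload after the prefix, so metrics such as `throughput` and `eval_accuracy` can be pulled out of a captured log programmatically. The following is a minimal sketch, not part of the e2eAIOK code base: the helper names `parse_mllog_lines` and `latest_value` are hypothetical, and the sample lines are copied from the output above (one of them is truncated in the captured output and is deliberately skipped rather than repaired).

```python
import json

MLLOG_PREFIX = ":::MLLOG"


def parse_mllog_lines(lines):
    """Return the decoded JSON payload of every complete :::MLLOG line.

    Hypothetical helper for illustration; lines whose payload is not
    valid JSON (for example, truncated output) are skipped.
    """
    events = []
    for line in lines:
        # Everything after the prefix is expected to be a JSON object.
        _, sep, payload = line.partition(MLLOG_PREFIX)
        if not sep:
            continue  # not an MLLOG line (e.g. a DLL or profiler line)
        try:
            events.append(json.loads(payload.strip()))
        except json.JSONDecodeError:
            continue  # incomplete payload, leave it out
    return events


def latest_value(events, key):
    """Return the value of the most recent event with the given key, or None."""
    for event in reversed(events):
        if event.get("key") == key:
            return event.get("value")
    return None


# Sample lines copied from the log above; the second one is cut off there.
sample_log = [
    '[0] :::MLLOG {"namespace": "", "time_ms": 1679280134721, "event_type": "POINT_IN_TIME", "key": "throughput", "value": 4.077337338816833, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 793}}',
    '[0] :::MLLOG {"namespace": "", "time_ms": 1679280134721, "event_type": "INTERVAL_START", "key": "eval_start", "value": null, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt',
    '[0] DLL 2023-03-20 02:42:14.721592 - epoch    1 | avg train utts/s     4 | took 23.54 s',
    '[0] :::MLLOG {"namespace": "", "time_ms": 1679280146604, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 20.484347826086957, "metadata": {"file": "/home/vmagent/app/e2eaiok/modelzoo/rnnt/pytorch/train.py", "lineno": 260, "epoch_num": 1}}',
]

events = parse_mllog_lines(sample_log)
print("throughput:", latest_value(events, "throughput"))
print("eval_accuracy:", latest_value(events, "eval_accuracy"))
```

The `eval_accuracy` value extracted this way matches the `WER` reported in the result metrics of this run.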

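
The log above repeatedly echoes the expression `x_lens = (x_lens.int() + self.factor - 1) // self.factor`. This is an integer ceiling division: when `factor` consecutive frames are stacked into one, a sequence of `n` frames shrinks to `ceil(n / factor)` frames. The sketch below illustrates that arithmetic with plain Python integers; the function name `stacked_length` is hypothetical and is not taken from the training script.

```python
def stacked_length(num_frames: int, factor: int) -> int:
    """Length of a sequence after stacking `factor` consecutive frames.

    Uses the same ceiling-division idiom as the expression echoed in the
    log: adding (factor - 1) before floor division rounds the result up.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return (num_frames + factor - 1) // factor


# Compare a small and a larger stacking factor on a few sequence lengths.
for n in (1, 7, 8, 9, 16):
    print(f"frames={n:2d}  factor=2 -> {stacked_length(n, 2)}  factor=8 -> {stacked_length(n, 8)}")
```

A larger factor gives a shorter sequence for the recurrent layers to process, which is the motivation for increasing the time stack factor described in the Optimizations section.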