<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/rivaasrasr-finetuning-conformer-ctc-nemo/nvidia_logo.png" style="width: 90px; float: right;">

# How to Fine-Tune a Character Based Riva ASR Acoustic Model with NVIDIA NeMo
# Take Mandarin (ZH) ASR as an example
Characters are commonly used as modeling units for East Asian languages such as Mandarin, Cantonese, and Japanese in speech and language processing tasks. In this tutorial, we take Mandarin ASR as an example and walk through how to fine-tune a character-based ASR acoustic model with NVIDIA NeMo.

## 0. NVIDIA Riva Overview

NVIDIA Riva is a GPU-accelerated SDK for building speech AI applications that are customized for your use case and deliver real-time performance. <br/>
Riva offers a rich set of speech processing and natural language understanding (NLU) services such as:

- Automated speech recognition (ASR). 
- Text-to-Speech synthesis (TTS). 
- A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will fine-tune a Riva ASR acoustic model with NeMo. <br> 
To understand the basics of Riva ASR APIs, refer to [Getting started with Riva ASR in Python](https://github.com/nvidia-riva/tutorials/blob/stable/asr-basics.ipynb). <br>

For more information about Riva, refer to the [Riva developer documentation](https://developer.nvidia.com/riva).

**Hint: We highly recommend running this Jupyter notebook in the pre-built NeMo Docker image `nvcr.io/nvidia/nemo:23.04` to save setup time.**

## 1. NeMo (Neural Modules)
[NVIDIA NeMo](https://developer.nvidia.com/nvidia-nemo) is an open-source framework for building, training, and fine-tuning GPU-accelerated speech AI and NLU models with a simple Python interface. For information about how to set up NeMo, refer to the [NeMo GitHub](https://github.com/NVIDIA/NeMo) instructions. 

In [None]:
"""
You can run this tutorial either locally (if you have all the dependencies and a GPU) or on Google Colab.

Perform the following steps to set up Google Colab:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub.
   a. Click **File** > **Upload Notebook** > **GITHUB** tab > copy/paste the GitHub URL.
3. Connect to an instance with a GPU.
   a. Click **Runtime** > Change the runtime type > select **GPU** for the hardware accelerator.
4. Run this cell to set up the dependencies.
5. Restart the runtime.
   a. Click **Runtime** > **Restart Runtime** for any upgraded packages to take effect.
"""

# Install Dependencies
!pip install wget
!apt-get update && apt-get install -y sox libsndfile1 ffmpeg libsox-fmt-mp3
!pip install text-unidecode
!pip install "matplotlib>=3.3.2"
!pip install Cython

# ## Install NeMo
!pip3 install nemo_toolkit['all']==1.12.0

"""
Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!
Alternatively, in the case where you want to use the "Run All Cells" (or similar) option, 
uncomment `exit()` below to crash and restart the kernel.
"""
# exit()

---
## 2. Fine-Tuning an ASR model with NeMo
### 2.1 Pre-trained models
First, browse the pre-trained Riva Mandarin ASR models on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/riva/models/speechtotext_zh_cn_conformer). 

Log in to NGC with your account and download the desired model. For fine-tuning, you need one of the `trainable_*` versions. You can download the `.nemo` file from the [website](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/riva/models/speechtotext_zh_cn_conformer/files?version=trainable_v5.0) or with the [NGC CLI](https://ngc.nvidia.com/setup/installers/cli), for example: 

In [None]:
!ngc registry model download-version "nvidia/riva/speechtotext_zh_cn_conformer:trainable_v5.0"

The pre-trained model is then available at `speechtotext_zh_cn_conformer_vtrainable_v5.0/Conformer-CTC-L_char_zh-CN_5.0.nemo`. 

### 2.2 Download Mandarin Data

In this tutorial, we use the open-source `AISHELL-1` dataset as an example. Let's download the AISHELL-1 data. 

In [None]:
!mkdir -p datasets/aishell

You will use the `get_aishell_data.py` script, which is located in the `scripts/dataset_processing` directory of the NeMo repository. If the script is not present locally, the next cell downloads it.

In [None]:
import os

if not os.path.exists("get_aishell_data.py"):
    !wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/dataset_processing/get_aishell_data.py

Then run the following command to download and preprocess AISHELL-1. The download time depends on your network speed. 

In [None]:
!python get_aishell_data.py --data_root "datasets/aishell"

In [None]:
!ls datasets/aishell/data_aishell/

You will get `train.json`, `dev.json` and `test.json` in the directory. To save time, you can create smaller subsets of the Mandarin dataset for training and validation. 

In [None]:
!head -n 1000 datasets/aishell/data_aishell/train.json > datasets/aishell/data_aishell/train_1000.json
!head -n 1000 datasets/aishell/data_aishell/dev.json > datasets/aishell/data_aishell/dev_1000.json
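
Note that `head` keeps the first lines of each manifest. If a manifest happens to be ordered by speaker, such a subset may cover only a few speakers. As an alternative, the following minimal sketch draws a reproducible random subset of manifest lines. The helper name `sample_lines` and the toy entries are made up for illustration; in practice you would read the lines from `train.json`.

```python
import random

def sample_lines(lines, k, seed=0):
    """Return up to k lines drawn without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(lines, min(k, len(lines)))

# Toy manifest lines standing in for the contents of train.json
toy_manifest = [
    f'{{"audio_filepath": "utt{i}.wav", "duration": 1.0, "text": "字"}}'
    for i in range(10)
]
subset = sample_lines(toy_manifest, 3)
print(len(subset))
```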

In [None]:
!head -n 3 datasets/aishell/data_aishell/train_1000.json
!head -n 3 datasets/aishell/data_aishell/dev_1000.json

Each line of the manifest has three required fields: `audio_filepath` (the path to the audio file), `duration` (the length of the audio in seconds), and `text` (the transcription of the audio).
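
As a quick sanity check on the manifest format, the sketch below parses one manifest line and reports missing or invalid required fields. The helper `check_manifest_line` is a hypothetical name introduced here and is not part of NeMo; the sample line is an AISHELL-1 dev entry.

```python
import json

REQUIRED_FIELDS = ("audio_filepath", "duration", "text")

def check_manifest_line(line):
    """Parse one JSON-lines manifest entry and return the names of missing or invalid fields."""
    entry = json.loads(line)
    problems = [name for name in REQUIRED_FIELDS if name not in entry]
    duration = entry.get("duration")
    # A present duration must be a positive number
    if "duration" in entry and not (isinstance(duration, (int, float)) and duration > 0):
        problems.append("duration")
    return problems

sample = '{"audio_filepath": "datasets/aishell/data_aishell/wav/dev/S0728/BAC009S0728W0126.wav", "duration": 3.758, "text": "必然先行抛售二三线城市的房产"}'
print(check_manifest_line(sample))
```

An empty list means the entry has all three fields and a positive duration.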

Let's listen to a sample audio file.

In [None]:
# change path of the file here
import os
import IPython.display as ipd
path = 'datasets/aishell/data_aishell/wav/train/S0062/BAC009S0062W0157.wav'
ipd.Audio(path)

### 2.3 Use your own data
NeMo provides several data processing scripts for Mandarin including AISHELL-1 and AISHELL-2. You can check all supported datasets [here](https://github.com/NVIDIA/NeMo/tree/main/scripts/dataset_processing). 

If you want to train the model on your own data, prepare a training manifest in the same format as the AISHELL-1 manifest shown above. Fill `audio_filepath`, `duration`, and `text` with your own data, for example:  

 * {"audio_filepath": "datasets/aishell/data_aishell/wav/dev/S0728/BAC009S0728W0126.wav", "duration": 3.758, "text": "必然先行抛售二三线城市的房产"}
 
You can evaluate on the AISHELL-1 dev set or on any other test set, as long as its manifest uses the same format. 
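
One possible way to build such a manifest from your own recordings is sketched below, using only the Python standard library. It assumes uncompressed PCM WAV files; the helper names `wav_duration` and `write_manifest`, the file names, and the synthetic silent clip are all made up for this demonstration.

```python
import json
import os
import tempfile
import wave

def wav_duration(path):
    """Return the duration in seconds of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def write_manifest(pairs, manifest_path):
    """Write one JSON object per line for each (wav_path, transcript) pair."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for wav_path, transcript in pairs:
            entry = {
                "audio_filepath": wav_path,
                "duration": round(wav_duration(wav_path), 3),
                "text": transcript,
            }
            # ensure_ascii=False keeps Chinese characters readable in the file
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")

# Demo with a synthetic one-second silent clip (16 kHz, mono, 16-bit)
tmp_dir = tempfile.mkdtemp()
demo_wav = os.path.join(tmp_dir, "demo.wav")
with wave.open(demo_wav, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

demo_manifest = os.path.join(tmp_dir, "my_train.json")
write_manifest([(demo_wav, "你好")], demo_manifest)
with open(demo_manifest, encoding="utf-8") as f:
    print(f.read())
```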

### 2.4 Training 
With the data prepared, we can move on to training. 

#### 2.4.0 Load a checkpoint
First, load the pre-trained checkpoint that you downloaded from NGC.

You can keep the existing vocabulary for training and skip Section 2.4.1, since the pre-trained model already covers most Chinese characters. If you want to customize the vocabulary, the next section shows the detailed steps.

In [None]:
import nemo.collections.asr as nemo_asr
from nemo.collections.asr.models import ASRModel
from nemo.utils import model_utils

# Path to the .nemo file downloaded from NGC; adjust it if you saved the file elsewhere
model_path = 'speechtotext_zh_cn_conformer_vtrainable_v5.0/Conformer-CTC-L_char_zh-CN_5.0.nemo'

model_cfg = ASRModel.restore_from(restore_path=model_path, return_config=True)
classpath = model_cfg.target  # original class path
imported_class = model_utils.import_class_by_path(classpath)  # type: ASRModel

asr_model = imported_class.restore_from(restore_path=model_path)

#### 2.4.1 Change vocabulary

Character-based models do not need a separately trained tokenizer, because each single character is an element of the vocabulary. 

In [None]:
import json
from collections import Counter

# Get frequency of each character and get their orders
def get_occ(filename):
    char_set = Counter()
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            data = json.loads(line.strip())
            txt = data['text']
            for c in txt:
                char_set[c] += 1
    char_set = dict(sorted(char_set.items(), key=lambda x: x[1], reverse=True))
    return char_set

train_charset = get_occ('datasets/aishell/data_aishell/train_1000.json')
print(len(train_charset))

vocab_train = train_charset.keys()
print(vocab_train)
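
Before replacing the vocabulary, it can be useful to check how well a candidate vocabulary covers the transcripts you plan to evaluate on, since a character-based model cannot output characters that are missing from its vocabulary. The following is a minimal sketch with toy data; the function name `coverage` and the sample strings are illustrative assumptions, not part of NeMo.

```python
from collections import Counter

def coverage(vocab, transcripts):
    """Return (fraction of character occurrences covered by vocab, Counter of uncovered characters)."""
    vocab = set(vocab)
    total = 0
    missing = Counter()
    for text in transcripts:
        for ch in text:
            total += 1
            if ch not in vocab:
                missing[ch] += 1
    covered = total - sum(missing.values())
    return (covered / total if total else 1.0), missing

# Toy data for illustration only
toy_vocab = ["房", "产", "城", "市"]
toy_transcripts = ["城市房产", "房价"]
ratio, missing = coverage(toy_vocab, toy_transcripts)
print(f"{ratio:.2%} covered, missing: {dict(missing)}")
```

With real data, you could pass `vocab_train` and the `text` fields of the dev manifest to see which characters the new vocabulary would not cover.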

Then call `change_vocabulary` to apply the new vocabulary to the model. 

In [None]:
asr_model.change_vocabulary(new_vocabulary=list(vocab_train))

print(len(asr_model.cfg.labels))

In [None]:
print(asr_model.cfg.labels)

In [None]:
asr_model.cfg.labels = asr_model.cfg.train_ds.labels
asr_model.save_to("asr_customize_vocab.nemo")

In [None]:
asr_model = nemo_asr.models.EncDecCTCModel.restore_from(restore_path="asr_customize_vocab.nemo")

#### 2.4.2 Change configurations
NeMo uses `.yaml` files to configure training. There is an [example yaml](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/conformer/conformer_ctc_char.yaml) file in NeMo for the char-based ASR model. You can change parameters either by editing the configuration file or by overriding them on the command line. For example, to change the number of epochs and the learning rate, add `trainer.max_epochs=100` and `model.optim.lr=0.02` to the training command. 

You can also use `model.cfg` to modify the configurations since each NeMo model has a config embedded in it. 

In [None]:
print(len(asr_model.cfg.labels))
print(asr_model.cfg.use_cer)

For Mandarin ASR models, we use the character error rate (CER) to evaluate performance, so we must set `use_cer=True` in NeMo:

In [None]:
asr_model.cfg.use_cer = True
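
Written Mandarin has no whitespace between words, so a word-level error rate is not meaningful without a separate word segmentation step. To make the metric concrete, here is a minimal pure-Python sketch of a corpus-level CER based on the Levenshtein distance. It is an illustration of the definition, not NeMo's implementation, and the function names are chosen for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(references, hypotheses):
    """Corpus-level CER: total character edits divided by total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return edits / chars

refs = ["必然先行抛售二三线城市的房产"]
hyps = ["必然先行抛售二三线城市房产"]  # one character ("的") is missing
print(cer(refs, hyps))
```

Here the reference has 14 characters and the hypothesis needs one edit, so the CER is 1/14.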

#### 2.4.3 Initialize a trainer

In [None]:
import torch
import pytorch_lightning as ptl

GRAD_ACCUM=1
MAX_EPOCHS=5
GPUS=[0]
LOG_EVERY_N_STEPS=10

trainer = ptl.Trainer(devices=GPUS, 
                      accelerator="gpu",
                      max_epochs=MAX_EPOCHS, 
                      accumulate_grad_batches=GRAD_ACCUM,
                      precision=32,
                      enable_checkpointing=False,
                      logger=False,
                      log_every_n_steps=LOG_EVERY_N_STEPS,
                      enable_progress_bar=True,
                      check_val_every_n_epoch=1)



In [None]:
asr_model.set_trainer(trainer)

#### 2.4.4 Specify the training and validation manifests

In [None]:
train_ds = {}
train_ds['manifest_filepath'] = ['datasets/aishell/data_aishell/train_1000.json']
train_ds['sample_rate'] = 16000
train_ds['labels'] = asr_model.cfg.labels
train_ds['batch_size'] = 16
train_ds['fused_batch_size'] = 16
train_ds['shuffle'] = True
train_ds['max_duration'] = 20.0
train_ds['pin_memory'] = True
train_ds['is_tarred'] = False
train_ds['num_workers'] = 4

In [None]:
asr_model.setup_training_data(train_data_config=train_ds)  
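
The `max_duration` setting above limits the length of training utterances. As a rough illustration, assuming that utterances longer than `max_duration` seconds are dropped from the training set, the sketch below applies the same rule to manifest lines and reports how much audio remains. The helper `filter_by_duration` and the toy entries are hypothetical.

```python
import json

def filter_by_duration(manifest_lines, max_duration=20.0):
    """Split manifest entries into those at most max_duration seconds long and the rest."""
    kept, dropped = [], []
    for line in manifest_lines:
        entry = json.loads(line)
        (kept if entry["duration"] <= max_duration else dropped).append(entry)
    return kept, dropped

# Toy manifest lines for illustration only
toy_lines = [
    '{"audio_filepath": "a.wav", "duration": 3.758, "text": "甲"}',
    '{"audio_filepath": "b.wav", "duration": 25.0, "text": "乙"}',
]
kept, dropped = filter_by_duration(toy_lines)
hours = sum(e["duration"] for e in kept) / 3600
print(len(kept), len(dropped), f"{hours:.5f} h kept")
```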

In [None]:
validation_ds = {}
validation_ds['sample_rate'] = 16000
validation_ds['manifest_filepath'] = ['datasets/aishell/data_aishell/dev_1000.json']
validation_ds['labels'] = asr_model.cfg.labels
validation_ds['batch_size'] = 32
validation_ds['shuffle'] = False
validation_ds['num_workers'] = 4

In [None]:
asr_model.setup_multiple_validation_data(val_data_config=validation_ds) 

#### 2.4.5 Set Optimizer

In [None]:
optimizer_conf = {}

optimizer_conf['name'] = 'adamw'
optimizer_conf['lr'] = 0.01
optimizer_conf['betas'] =  [0.9, 0.98]
optimizer_conf['weight_decay'] = 0

sched = {}
sched['name'] = 'CosineAnnealing'
sched['warmup_steps'] = None
sched['warmup_ratio'] = 0.10
sched['min_lr'] = 1e-6
optimizer_conf['sched'] = sched

In [None]:
asr_model.setup_optimization(optimizer_conf)
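
To get a feel for what these settings do to the learning rate, the sketch below evaluates a generic linear-warmup plus cosine-decay schedule with the values configured above. This is the textbook formula, not NeMo's exact `CosineAnnealing` code, so individual values may differ slightly. The step count assumes that all 1000 training utterances pass the duration filter and that training runs on one GPU without gradient accumulation.

```python
import math

def lr_at_step(step, max_steps, base_lr=0.01, min_lr=1e-6, warmup_ratio=0.10):
    """Generic linear-warmup + cosine-decay schedule (illustrative, not NeMo's exact code)."""
    warmup_steps = int(warmup_ratio * max_steps)
    if step < warmup_steps:
        # Linear ramp from near zero up to base_lr
        return base_lr * (step + 1) / (warmup_steps + 1)
    # Cosine decay from base_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# 1000 utterances at batch size 16 give 63 steps per epoch, for 5 epochs
max_steps = math.ceil(1000 / 16) * 5
for step in (0, 15, 31, 150, max_steps):
    print(step, f"{lr_at_step(step, max_steps):.6f}")
```

The learning rate rises during the first 10% of the steps, peaks at `lr`, and then decays towards `min_lr`.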

#### 2.4.6 Set exp manager

In [None]:
from nemo.utils import exp_manager
from omegaconf import OmegaConf
import os

config = exp_manager.ExpManagerConfig(
    exp_dir="experiments/zh/",
    name="Conformer-CTC",
    checkpoint_callback_params=exp_manager.CallbackParams(
        monitor="val_wer",
        mode="min",
        always_save_nemo=True,
        save_best_model=True,
    )
)

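config = OmegaConf.structured(config)

# Attach logging and checkpointing to the trainer using the config above
logdir = exp_manager.exp_manager(trainer, config)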

In [None]:
asr_model.log_predictions = False
# set to True if you would like to track the evaluation loss
asr_model.compute_eval_loss = False

Next, freeze the encoder for easier initial convergence and faster training. This is often a good idea when retraining the decoder on a small dataset. The batch normalization layers in the encoder are kept trainable.

In [None]:
import torch
import torch.nn as nn

def enable_bn_se(m):
    # Re-enable training for batch normalization layers inside the frozen encoder
    if isinstance(m, nn.BatchNorm1d):
        m.train()
        for param in m.parameters():
            param.requires_grad_(True)

asr_model.encoder.freeze()
asr_model.encoder.apply(enable_bn_se)

#### 2.4.7 Start the training

In [None]:
# asr_model.cfg.labels = asr_model.cfg.train_ds.labels
trainer.fit(asr_model)

#### 2.4.8 Save the model

In [None]:
asr_model.save_to("train_asr_customize_vocab.nemo")

### 2.5 Use Training Scripts
You can also use the [speech_to_text_ctc.py](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_ctc/speech_to_text_ctc.py) script directly to start the training. The script accepts the parameters described above as command-line overrides. For example, you can start a training run with:
```bash
    python speech_to_text_ctc.py \
        model.train_ds.manifest_filepath="datasets/aishell/data_aishell/train_1000.json" \
        model.validation_ds.manifest_filepath="datasets/aishell/data_aishell/dev_1000.json" \
        trainer.devices=1 \
        trainer.accelerator='gpu' \
        trainer.max_epochs=5
```

## 3. ASR Evaluation

Now that we have a model trained, we need to check how well it performs.

In [None]:
test_ds = {}
test_ds['sample_rate'] = 16000
test_ds['manifest_filepath'] = ['datasets/aishell/data_aishell/dev_1000.json']
test_ds['batch_size'] = 32
test_ds['num_workers'] = 4
test_ds['labels'] = asr_model.cfg.labels
test_ds['use_cer'] = True

asr_model.setup_test_data(test_data_config=test_ds)
trainer.test(asr_model)


## 4. ASR Model Export 

With NeMo, you can also export your model in a format that can be deployed using NVIDIA Riva, a highly performant application framework for multi-modal conversational AI services using GPUs. The conversion is done with the `nemo2riva` command-line tool.

### 4.1 Install the Packages

We will now install the NeMo and `nemo2riva` packages. `nemo2riva` is available on [NVIDIA NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/riva/resources/riva_quickstart/files?version=2.8.1). Make sure you install `NGC CLI` first before running the following commands.

In [None]:
!pip install nvidia-pyindex
!pip install nemo2riva
!pip install protobuf==3.20.0

### 4.2 Convert to Riva

Convert the fine-tuned model to the `.riva` format. We will set the encryption key with `--key=nemotoriva`. Choose a different encryption key value when generating `.riva` models for production.

In [None]:
!nemo2riva --out "train_asr_customize_vocab.riva" --key=nemotoriva "train_asr_customize_vocab.nemo"

## More Resources
You can find more information about working with NeMo's ASR models in the [ASR section](https://github.com/NVIDIA/NeMo/tree/main/tutorials/asr) of the NeMo tutorials.

## What's Next?

You can use NeMo to build custom models for your own applications, and deploy them with NVIDIA Riva! Refer to the [Conformer-CTC deployment tutorial](https://github.com/nvidia-riva/tutorials/blob/main/asr-deployment-conformer-ctc.ipynb).