<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/rivaasrasr-finetuning-conformer-ctc-nemo/nvidia_logo.png" style="width: 90px; float: right;">

# How to Fine-Tune a Riva ASR Acoustic Model with NVIDIA NeMo
This tutorial walks you through how to fine-tune an NVIDIA Riva ASR acoustic model with NVIDIA NeMo.

**Important**: If you plan to fine-tune an ASR acoustic model using the same tokenizer with which the model was trained, skip this tutorial and refer to the "Sub-word Encoding CTC Model" section (starting with the "Load pre-trained model" subsection) of the [NeMo ASR Language Finetuning tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/ASR_CTC_Language_Finetuning.ipynb).

## NVIDIA Riva Overview

NVIDIA Riva is a GPU-accelerated SDK for building speech AI applications that are customized for your use case and deliver real-time performance. <br/>
Riva offers a rich set of speech and natural language understanding (NLU) services such as:

- Automated speech recognition (ASR). 
- Text-to-Speech synthesis (TTS). 
- A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will fine-tune a Riva ASR acoustic model with NeMo. <br> 
To understand the basics of Riva ASR APIs, refer to [Getting started with Riva ASR in Python](https://github.com/nvidia-riva/tutorials/blob/stable/asr-python-basics.ipynb). <br>

For more information about Riva, refer to the [Riva developer documentation](https://developer.nvidia.com/riva).

## NeMo (Neural Modules)
[NVIDIA NeMo](https://developer.nvidia.com/nvidia-nemo) is an open-source framework for building, training, and fine-tuning GPU-accelerated speech AI and NLU models with a simple Python interface. For information about how to set up NeMo, refer to the [NeMo GitHub](https://github.com/NVIDIA/NeMo) instructions.

In [None]:
"""
You can run this tutorial either locally (if you have all the dependencies and a GPU) or on Google Colab.

Perform the following steps to set up Google Colab:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub.
   a. Click **File** > **Upload Notebook** > **GITHUB** tab > copy/paste the GitHub URL.
3. Connect to an instance with a GPU.
   a. Click **Runtime** > Change the runtime type > select **GPU** for the hardware accelerator.
4. Run this cell to set up the dependencies.
5. Restart the runtime.
   a. Click **Runtime** > **Restart Runtime** for any upgraded packages to take effect.
"""

# Install Dependencies
!pip install wget
!apt-get install sox libsndfile1 ffmpeg libsox-fmt-mp3
!pip install text-unidecode
# Quote the requirement so the shell does not treat ">" as an output redirect
!pip install "matplotlib>=3.3.2"
!pip install Cython

## Install NeMo
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

"""
Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!
Alternatively, in the case where you want to use the "Run All Cells" (or similar) option, 
uncomment `exit()` below to crash and restart the kernel.
"""
# exit()

---
## Fine-Tuning an ASR model with NeMo

---
<a id='isc-prepare-data'></a>
### Preparing the Dataset
#### LibriSpeech ASR train-clean-100 Dataset
For this tutorial, we use the clean, 100-hour version of the LibriSpeech ASR training dataset to train our Conformer-CTC acoustic model, and the clean development split to validate the model. The LibriSpeech ASR dataset is available [here](https://www.openslr.org/12/).

#### Crowdsourced High-Quality Nigerian English Speech Dataset
For this tutorial, we also use the Nigerian English speech dataset to evaluate and fine-tune our Conformer-CTC acoustic model. The Nigerian English speech dataset is available [here](https://www.openslr.org/70/).

### Downloading and Preprocessing the Datasets
#### LibriSpeech ASR Dataset
The `train-clean-100` split of the LibriSpeech ASR dataset, which we'll use as the training set, is publicly available [here](https://www.openslr.org/resources/12/train-clean-100.tar.gz) and can be downloaded directly. The `dev-clean` split, which we'll use as the validation set, is publicly available [here](https://www.openslr.org/resources/12/dev-clean.tar.gz) and can also be downloaded directly. We've provided a script that downloads the splits for you. The preprocessing step converts the audio files from their native `.flac` format to `.wav` and generates a manifest file containing metadata for each audio file. NeMo needs both the `.wav` files and the manifest to train the model.
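
As a rough illustration of what such a manifest contains, each line is a standalone JSON object describing one utterance. The sketch below assumes the same field names that this tutorial uses for the Nigerian English manifests (`audio_filepath`, `duration`, `text`); the helper name `make_manifest_line` and the sample values are made up for illustration.

```python
import json

def make_manifest_line(audio_filepath, duration, text):
    """Return one JSON-lines manifest record for a single utterance."""
    entry = {
        "audio_filepath": audio_filepath,  # path to the converted .wav file
        "duration": float(duration),       # length of the clip in seconds
        "text": text.lower().strip(),      # normalized transcript
    }
    return json.dumps(entry)

# Hypothetical values, for illustration only
line = make_manifest_line("/data/sample.wav", 3.5, "  Hello World ")
print(line)
```

Writing one such line per audio file produces a manifest in the shape that the later cells pass to `manifest_filepath`.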

Install the modules that the download and preprocessing script requires but that aren't part of the Python standard library.

In [None]:
! sudo apt install -y sox
! pip install sox
! pip install tqdm

Set some environment variables for convenience.

In [None]:
import os
MODEL_DIR = os.path.abspath("asr-models")
DATA_DIR  = os.path.abspath("asr-models/datasets")
os.environ["MODEL_DIR"] = MODEL_DIR
os.environ["DATA_DIR"]  = DATA_DIR

In [None]:
! python ./get_librispeech_data.py --data_root=$DATA_DIR --data_sets='train_clean_100,dev_clean'

Remove the `.tar.gz` archive files to save space.

In [None]:
! rm $DATA_DIR/train_clean_100.tar.gz
! rm $DATA_DIR/dev_clean.tar.gz

Let's listen to a sample audio file from the preprocessed training dataset.

In [None]:
# change path of the file here
import os
import IPython.display as ipd
path = os.path.join(DATA_DIR, 'LibriSpeech/train-clean-100-processed/163-121908-0000.wav')
ipd.Audio(path)

#### Crowdsourced High-Quality Nigerian English Speech Dataset
The evaluation/fine-tuning data is publicly available in several files [here](https://www.openslr.org/resources/70/).

In [None]:
# Download the audio data
!wget 'https://www.openslr.org/resources/70/en_ng_female.zip' -P $DATA_DIR
!wget 'https://www.openslr.org/resources/70/en_ng_male.zip'   -P $DATA_DIR

In [None]:
# Extract the evaluation/finetuning data
# Ensure that the unzip utility is available. If not, install it.
!unzip -nq $DATA_DIR/en_ng_female.zip -d $DATA_DIR/en_ng_female
!mv $DATA_DIR/en_ng_female/line_index.tsv $DATA_DIR/en_ng_female/line_index_female.tsv
!unzip -nq $DATA_DIR/en_ng_male.zip -d $DATA_DIR/en_ng_male
!mv $DATA_DIR/en_ng_male/line_index.tsv $DATA_DIR/en_ng_male/line_index_male.tsv

Remove the `.zip` files to save space.

In [None]:
! rm $DATA_DIR/en_ng_*.zip

Define a function to extract the relevant information from the `.tsv` metadata files included with this dataset.

In [None]:
import os
import subprocess

def process_en_ng_tsvs(data_dir):
    genders = ['female', 'male']
    entries = []
    # Extract the relevant information from the tsv files
    for gender in genders:
        dataset  = f'en_ng_{gender}'
        tsv_name = f'line_index_{gender}.tsv'
        tsv_file = os.path.join(data_dir, dataset, tsv_name)
        with open(tsv_file, encoding='utf-8') as fin:
            for line in fin:
                # Skip blank or malformed lines that have no tab separator
                if '\t' not in line:
                    continue
                # Each line has the form "<label>\t<transcript>"
                label, text = line.split('\t', 1)
                # The second underscore-separated field of the label is the speaker ID
                speaker_id = label.split('_')[1]
                wav_file = os.path.join(data_dir, dataset, label + '.wav')
                transcript_text = text.lower().strip()

                # Get the duration in seconds from SoX; passing an argument list
                # avoids shell quoting problems with unusual paths
                duration = subprocess.check_output(['soxi', '-D', wav_file])

                entry = {
                    'audio_filepath': wav_file,
                    'duration': float(duration),
                    'text': transcript_text,
                    'gender': gender,
                    'speaker_id': speaker_id,
                }
                entries.append(entry)
    return entries
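
To make the parsing step concrete, here is a minimal, standalone sketch of how a single `label<TAB>transcript` row is split. The label reuses a file name that appears in this tutorial; the transcript and the helper name `parse_line_index_row` are made up for illustration.

```python
def parse_line_index_row(line):
    """Split one `label<TAB>transcript` row into (label, speaker_id, transcript)."""
    label, text = line.rstrip("\n").split("\t", 1)
    # The second underscore-separated field of the label is taken as the speaker ID
    speaker_id = label.split("_")[1]
    return label, speaker_id, text.lower().strip()

# Hypothetical row in the same shape as the dataset's line index files
row = "ngm_02436_00539200207\tAn Example Sentence\n"
label, speaker_id, transcript = parse_line_index_row(row)
print(label, speaker_id, transcript)
```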

Define a function to generate `*manifest.json` metadata files from the `.tsv` metadata files included with this dataset.

In [None]:
import json
import os
import random

def generate_en_ng_manifest(data_dir, random_seed=0, val_split=0.1, test_split=0.1):
    # Extract the relevant information from the tsv files
    entries = process_en_ng_tsvs(data_dir)
    # Shuffle with a fixed seed so that the splits are reproducible
    random.seed(random_seed)
    random.shuffle(entries)
    num_val_entries  = int(val_split  * len(entries))
    num_test_entries = int(test_split * len(entries))
    # Use explicit start indices so that a zero-sized split yields an empty
    # manifest (a negative slice such as entries[-0:] would return every entry)
    num_ft_entries = len(entries) - num_val_entries - num_test_entries
    val_end = num_ft_entries + num_val_entries
    splits = {
        'en_ng_ft_manifest.json':   entries[:num_ft_entries],
        'en_ng_val_manifest.json':  entries[num_ft_entries:val_end],
        'en_ng_test_manifest.json': entries[val_end:],
    }
    # Write each split as a JSON-lines manifest file
    for manifest_name, split_entries in splits.items():
        with open(os.path.join(data_dir, manifest_name), 'w') as fout:
            for m in split_entries:
                fout.write(json.dumps(m) + '\n')
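
The split arithmetic can be checked on a toy list without touching any audio. The sketch below is a standalone illustration that uses the same default fractions; the helper name `split_entries` is hypothetical and it does not write any files.

```python
import random

def split_entries(entries, random_seed=0, val_split=0.1, test_split=0.1):
    """Shuffle a copy of `entries` and return (fine-tune, validation, test) lists."""
    shuffled = list(entries)
    random.Random(random_seed).shuffle(shuffled)
    num_val = int(val_split * len(shuffled))
    num_test = int(test_split * len(shuffled))
    num_ft = len(shuffled) - num_val - num_test
    return (
        shuffled[:num_ft],
        shuffled[num_ft:num_ft + num_val],
        shuffled[num_ft + num_val:],
    )

ft, val, test = split_entries(range(25))
print(len(ft), len(val), len(test))  # 21 2 2
```

Because the split sizes are truncated with `int`, any remainder goes to the fine-tuning split.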

Generate the manifest files for the Nigerian English Speech dataset.

In [None]:
generate_en_ng_manifest(DATA_DIR)
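
A quick sanity check after generating the manifests is to count the utterances and total audio duration in each file. The following is a minimal sketch; `summarize_manifest` is a hypothetical helper, and the demo runs on a throwaway manifest with made-up entries rather than on the real data.

```python
import json
import os
import tempfile

def summarize_manifest(manifest_path):
    """Return (number of utterances, total duration in hours) for a manifest."""
    count, seconds = 0, 0.0
    with open(manifest_path, encoding="utf-8") as fin:
        for line in fin:
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            count += 1
            seconds += entry["duration"]
    return count, seconds / 3600.0

# Self-contained demo on a temporary manifest with made-up entries
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_manifest.json")
    with open(demo_path, "w", encoding="utf-8") as fout:
        fout.write(json.dumps({"audio_filepath": "a.wav", "duration": 1800.0, "text": "a"}) + "\n")
        fout.write(json.dumps({"audio_filepath": "b.wav", "duration": 1800.0, "text": "b"}) + "\n")
    demo_summary = summarize_manifest(demo_path)
print(demo_summary)  # (2, 1.0)
```

To inspect the real splits, you could call the helper on, for example, `os.path.join(DATA_DIR, 'en_ng_ft_manifest.json')`.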

Let's listen to an audio file from the Nigerian English dataset.

In [None]:
# change path of the file here
import os
import IPython.display as ipd
path = os.path.join(DATA_DIR, 'en_ng_male/ngm_02436_00539200207.wav')
ipd.Audio(path)

### Training 

#### Create Tokenizer

Before training, we need to create a tokenizer because this ASR model uses sub-word encoding. Character-based models don't need this step, since their vocabulary consists of single characters only. NeMo's `process_asr_text_tokenizer.py` script builds the tokenizer and the sub-word vocabulary used in training. The vocabulary size (`vocab_size`) passed to the script should match the vocabulary size of the ASR model. We will clone the NeMo GitHub repository to use the scripts and examples available there.


In [None]:
# Clone NeMo locally
# Change this path if you don't want to clone NeMo to the directory containing this tutorial
# NEMO_DIR = os.path.join(os.getcwd(), "NeMo")
NEMO_DIR = os.path.join(os.path.dirname(os.path.dirname(os.getcwd())), "NeMo")
! git clone https://github.com/NVIDIA/NeMo $NEMO_DIR

# create the tokenizer
!python $NEMO_DIR/scripts/tokenizers/process_asr_text_tokenizer.py \
         --manifest=$DATA_DIR/en_ng_ft_manifest.json \
         --data_root=$DATA_DIR \
         --vocab_size=128 \
         --tokenizer=spe \
         --spe_type=unigram
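
The training command in this tutorial points `model.tokenizer.dir` at `tokenizer_spe_unigram_v128` under `$DATA_DIR`, which suggests that the output directory is named after the `--tokenizer`, `--spe_type`, and `--vocab_size` flags. The helper below only reproduces that naming pattern as inferred from this tutorial; treat it as an assumption, not a documented guarantee of the script. The name `expected_tokenizer_dir` is hypothetical.

```python
import os

def expected_tokenizer_dir(data_root, tokenizer="spe", spe_type="unigram", vocab_size=128):
    """Build the tokenizer directory path using the naming pattern seen in this tutorial."""
    # Assumed pattern: tokenizer_<tokenizer>_<spe_type>_v<vocab_size>
    return os.path.join(data_root, f"tokenizer_{tokenizer}_{spe_type}_v{vocab_size}")

tokenizer_dir = expected_tokenizer_dir("asr-models/datasets")
print(tokenizer_dir)
if not os.path.isdir(tokenizer_dir):
    print("Tokenizer directory not found; list the data root to see what the script produced.")
```

If you change any of the three flags, remember to update `model.tokenizer.dir` in the training command to match.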

#### Training Conformer-CTC

NeMo uses `.yaml` files to configure the training parameters. You can change a parameter either by editing the configuration file or by overriding it on the command line. For example, to change the number of epochs and the learning rate, add `trainer.max_epochs=100` and `model.optim.lr=0.02` to the training command.
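
Conceptually, each dotted override addresses one entry in a nested configuration. The sketch below is a simplified, pure-Python illustration of that idea and is not how the configuration system is actually implemented; the function name `apply_overrides` and the default values in the toy config are made up. It uses the `model.optim.lr` key that the training command in this tutorial passes.

```python
def apply_overrides(config, overrides):
    """Apply `a.b.c=value` style overrides to a nested dict, in place."""
    for override in overrides:
        dotted_key, raw_value = override.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = config
        for key in parents:
            node = node.setdefault(key, {})
        # Interpret numbers where possible; leave everything else as a string
        try:
            value = int(raw_value)
        except ValueError:
            try:
                value = float(raw_value)
            except ValueError:
                value = raw_value
        node[leaf] = value
    return config

# Toy config with made-up default values
config = {"trainer": {"max_epochs": 1}, "model": {"optim": {"lr": 2.0}}}
apply_overrides(config, ["trainer.max_epochs=100", "model.optim.lr=0.02"])
print(config)
```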

The following sample command uses the `speech_to_text_ctc_bpe.py` script in the `examples` folder to train or fine-tune a Conformer-CTC ASR model for the number of epochs set by `trainer.max_epochs`. For other ASR models, such as Citrinet, you can find the appropriate config files in the NeMo GitHub repo under [examples/asr/conf/](https://github.com/NVIDIA/NeMo/tree/main/examples/asr/conf).

To fully train the model from scratch, you'll need a much larger value of `trainer.max_epochs`; empirical evidence suggests that around 200 epochs should suffice. Fine-tuning a pretrained model will likewise typically require more than 1 epoch.

By default, `speech_to_text_ctc_bpe.py` trains an ASR acoustic model from scratch. 

To fine-tune a pretrained model, add the parameter `+init_from_pretrained_model=<model_name>`. Refer to [this table](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/results.html#speech-recognition-languages) in the NeMo documentation for a list of pretrained speech recognition model checkpoints. 

To fine-tune from a local `.nemo` checkpoint instead, for example to continue an earlier fine-tuning run, add the parameter `+init_from_nemo_model=<path/to/model_name.nemo>`.

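To restrict NeMo to a particular GPU, pass its index in square brackets to `trainer.devices`. For example, `trainer.devices=[1]` selects the GPU with index 1, whereas `trainer.devices=1` requests one GPU without naming a specific one.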

In [None]:
# Fine-tune the pretrained Conformer-CTC model on the Nigerian English manifests
!python $NEMO_DIR/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \
    --config-path=../conf/conformer/ --config-name=conformer_ctc_bpe \
    +init_from_pretrained_model=stt_en_conformer_ctc_large \
    model.train_ds.manifest_filepath=$DATA_DIR/en_ng_ft_manifest.json \
    model.validation_ds.manifest_filepath=$DATA_DIR/en_ng_val_manifest.json \
    model.tokenizer.dir=$DATA_DIR/tokenizer_spe_unigram_v128 \
    model.train_ds.batch_size=4 \
    model.validation_ds.batch_size=4 \
    trainer.devices=1 \
    trainer.max_epochs=10 \
    trainer.val_check_interval=0.1 \
    model.optim.name="adamw" \
    model.optim.lr=1.0 \
    model.optim.weight_decay=0.001 \
    model.optim.sched.warmup_steps=2000 \
    ++exp_manager.exp_dir=$MODEL_DIR/checkpoints \
    ++exp_manager.version=en_ng \
    ++exp_manager.use_datetime_version=False

In [None]:
!ls $MODEL_DIR/checkpoints/Conformer-CTC-BPE/en_ng/checkpoints/

In [None]:
nemo_file_path = os.path.join(MODEL_DIR, 'checkpoints/Conformer-CTC-BPE/en_ng/checkpoints/Conformer-CTC-BPE.nemo')
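
If the expected `.nemo` file is missing, it can help to search the checkpoint directory rather than hard-code the file name. The following is a small sketch under the directory layout used in this tutorial; `find_nemo_checkpoints` is a hypothetical helper, not part of NeMo.

```python
import glob
import os

def find_nemo_checkpoints(checkpoint_dir):
    """Return all `.nemo` files in `checkpoint_dir`, newest first."""
    candidates = glob.glob(os.path.join(checkpoint_dir, "*.nemo"))
    return sorted(candidates, key=os.path.getmtime, reverse=True)

# Directory layout as used in this tutorial, relative to the working directory
checkpoint_dir = os.path.join("asr-models", "checkpoints", "Conformer-CTC-BPE", "en_ng", "checkpoints")
nemo_files = find_nemo_checkpoints(checkpoint_dir)
if nemo_files:
    print("Most recent .nemo file:", nemo_files[0])
else:
    print(f"No .nemo file found in {checkpoint_dir}; check that training finished.")
```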

### ASR Evaluation

Now that we have a model trained, we need to check how well it performs.

In [None]:
! ls $NEMO_DIR/examples/asr/speech_to_text_eval.py

In [None]:
!python $NEMO_DIR/examples/asr/speech_to_text_eval.py \
    model_path=$nemo_file_path \
    dataset_manifest=$DATA_DIR/en_ng_test_manifest.json \
    output_filename=$DATA_DIR/en_ng_test_manifest_predictions.json \
    batch_size=4 \
    amp=True

In [None]:
# The predictions manifest exists only after the evaluation script has run
! ls $DATA_DIR/en_ng_test_manifest_predictions.json
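
For reference, word error rate (WER) is the word-level edit distance between the reference and the predicted transcripts, divided by the number of reference words. The sketch below computes it in plain Python. It assumes that each line of the predictions manifest stores the reference under `text` and the hypothesis under `pred_text`; that second key is an assumption to verify against your output file. All function names are illustrative and the demo transcripts are made up.

```python
import json

def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two lists of words."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]

def corpus_wer(pairs):
    """WER over (reference, hypothesis) string pairs: total edits / total reference words."""
    edits = sum(word_edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return edits / words if words else 0.0

def load_pairs(manifest_path, ref_key="text", hyp_key="pred_text"):
    """Read (reference, hypothesis) pairs from a JSON-lines predictions manifest."""
    with open(manifest_path, encoding="utf-8") as fin:
        return [(e[ref_key], e[hyp_key]) for e in map(json.loads, fin)]

# Made-up transcripts: one substitution and one deletion over six reference words
demo_pairs = [("the cat sat", "the cat sat"), ("on the mat", "on a")]
print(corpus_wer(demo_pairs))
```

On real output, `corpus_wer(load_pairs(path_to_predictions))` gives a number you can compare against the value that the evaluation script reports.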

### ASR Model Export

With NeMo, you can also export your model in a format that can be deployed with NVIDIA Riva, a high-performance application framework for multi-modal conversational AI services on GPUs. The `nemo2riva` command-line tool converts a `.nemo` checkpoint into a `.riva` file.

#### Install `nemo2riva` with `pip`

In [None]:
!pip install nemo2riva

#### Convert to Riva

Convert the fine-tuned model to the `.riva` format. We will set the encryption key with `--key=tlt_encode`. Choose a different encryption key value when generating `.riva` models for production.

In [None]:
# Derive the .riva file name from the .nemo file name
nemo_file_name = os.path.basename(nemo_file_path)
riva_file_name = os.path.splitext(nemo_file_name)[0] + ".riva"
riva_file_path = os.path.join(MODEL_DIR, "custom-models", "riva", riva_file_name)

!mkdir -p $MODEL_DIR/custom-models/riva

!nemo2riva --out $riva_file_path --key=tlt_encode --onnx-opset 18 $nemo_file_path 

## More Resources
You can find more information about working with NeMo's ASR models in the [ASR section](https://github.com/NVIDIA/NeMo/tree/main/tutorials/asr) of the NeMo tutorials.

## What's Next?

You can use NeMo to build custom models for your own applications, and deploy them with NVIDIA Riva! Refer to the [Conformer-CTC deployment tutorial](https://github.com/nvidia-riva/tutorials/blob/main/asr-deployment-conformer-ctc.ipynb).