<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/riva_asr_asr-python-advanced-finetune-am-conformer-ctc-for-noisy-audio-withtao/nvidia_logo.png" style="width: 90px; float: right;">

# How to improve accuracy on specific speech patterns by fine-tuning the Acoustic Model (Conformer-CTC) in the Riva ASR pipeline 

This tutorial walks you through some of the advanced customization features of the Riva ASR pipeline by fine-tuning the Acoustic Model (Conformer-CTC). These customization features improve accuracy on specific speech patterns, like background noise and different acoustic environments.

## NVIDIA Riva Overview

NVIDIA Riva is a GPU-accelerated SDK for building Speech AI applications that are customized for your use case and deliver real-time performance. <br/>
Riva offers a rich set of speech and natural language understanding services such as:

- Automated speech recognition (ASR)
- Text-to-Speech synthesis (TTS)
- A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will customize the Riva ASR pipeline by fine-tuning the Acoustic Model (Conformer-CTC) with NVIDIA's TAO Toolkit to improve accuracy on audio with background noise.  
To understand the basics of Riva ASR APIs, refer to [Getting started with Riva ASR in Python](https://github.com/nvidia-riva/tutorials/blob/stable/asr-python-basics.ipynb). <br>

For more information about Riva, refer to the Riva [product page](https://developer.nvidia.com/riva) and [documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/overview.html).

## Fine-tuning Riva Acoustic Model (Conformer-CTC) with NVIDIA TAO

The following flow diagram shows the Riva speech recognition pipeline along with all the possible customizations. 

Raw temporal audio signals first pass through a feature extraction block which segments the data into blocks (for example, of 80 ms each), then converts the blocks from temporal domain to frequency domain (MFCC). This data is then fed into an acoustic model which outputs probabilities over text tokens at each time step. A decoder converts this matrix of probabilities into a sequence of text tokens which is then `detokenized` into an actual sentence (or character sequence). An advanced decoder can also do beam search and score multiple possible hypotheses (i.e. sentences) in conjunction with a language model. The decoder output comes without punctuation and capitalization, which is the job of the Punctuation and Capitalization model. Finally, Inverse Text Normalization (ITN) rules are applied to transform the text in verbal format into a desired written format.
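
To make the decoder step more concrete, here is a minimal sketch of greedy CTC decoding, the simplest way to turn a matrix of per-time-step probabilities into text. It is an illustration only, not the decoder Riva ships: the function name `ctc_greedy_decode`, the toy vocabulary, and the position of the blank symbol are all assumptions made for this example.

```python
def ctc_greedy_decode(probs, vocab, blank_id):
    """Pick the most likely token per time step, merge repeats, drop blanks."""
    best_ids = [max(range(len(step)), key=step.__getitem__) for step in probs]
    tokens = []
    previous = None
    for token_id in best_ids:
        # Emit a token only when it differs from the previous step and is not blank.
        if token_id != previous and token_id != blank_id:
            tokens.append(vocab[token_id])
        previous = token_id
    return "".join(tokens)


# Toy vocabulary: index 3 is the CTC blank symbol (an assumption for this sketch).
vocab = ["y", "e", "s", "<blank>"]
probs = [
    [0.7, 0.1, 0.1, 0.1],  # y
    [0.6, 0.2, 0.1, 0.1],  # y again (repeat, merged)
    [0.1, 0.1, 0.1, 0.7],  # blank
    [0.1, 0.7, 0.1, 0.1],  # e
    [0.1, 0.1, 0.7, 0.1],  # s
    [0.1, 0.1, 0.6, 0.2],  # s again (repeat, merged)
]
decoded = ctc_greedy_decode(probs, vocab, blank_id=3)
print(decoded)
```

A beam search decoder with a language model replaces the per-step `max` with a search over several competing hypotheses, but the merge-repeats and drop-blanks rules stay the same.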

<img src="./imgs/riva-asr-customizations-amfinetuning.PNG" style="float: center;">

For this tutorial, we need to fine-tune the pre-trained Riva acoustic model. 

Riva offers several acoustic models: Conformer-CTC, Citrinet, Jasper, and QuartzNet. In this tutorial, we use the Conformer-CTC model and demonstrate how it can be fine-tuned.  
For more information about these acoustic models and when to use them, refer to the Riva documentation [here](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/models/asr.html).

You can use NVIDIA TAO Toolkit to fine-tune the Conformer-CTC acoustic model in the Riva ASR pipeline.

#### NVIDIA TAO Toolkit Overview

NVIDIA Train Adapt Optimize (TAO) Toolkit is a Python-based AI toolkit for transfer learning that takes purpose-built pre-trained AI models and customizes them on your own data. TAO enables developers with limited AI expertise to create highly accurate AI models for production deployments.  
TAO follows a zero-coding paradigm. There is no need to write any code to train models with TAO. Training can be done by running a few commands with the TAO command-line interface.  

Riva supports fine-tuning with TAO. The fine-tuned TAO model can easily be deployed for real-time inference on the Riva Speech Skills server.

For more information about the NVIDIA TAO framework, refer to the documentation [here](https://docs.nvidia.com/tao/tao-toolkit/text/overview.html).

### Fine-tune the Conformer-CTC model with NVIDIA TAO:

The process of fine-tuning a Riva Conformer-CTC acoustic model with NVIDIA TAO can be split into three steps:
1. Data preprocessing.
2. Fine-tuning the Conformer-CTC model with TAO.
3. Deploying the fine-tuned Conformer-CTC TAO model on the Riva Speech Skills server.

Let's walk through each of these steps in detail.

### Step 1. Data preprocessing

For fine-tuning we need audio data with background noise. If you already have such data, then you can use it directly.  
In this tutorial, we will take the AN4 dataset and augment it with noise data from the Room Impulse Response and Noise Database from the [openslr database](https://www.openslr.org/28/).
NVIDIA TAO Toolkit does not currently support audio data augmentation. This support will be added in a future release.
In this tutorial, we will be using NVIDIA NeMo for the data preprocessing step.

#### NVIDIA NeMo Overview

NVIDIA NeMo is a toolkit for building new state-of-the-art conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) models. Each collection consists of prebuilt modules that include everything needed to train on your data. Every module can easily be customized, extended, and composed to create new conversational AI model architectures.
For more information about NeMo, refer to the [NeMo product page](https://developer.nvidia.com/nvidia-nemo) and [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/starthere/intro.html). The open-source NeMo repository can be found [here](https://github.com/NVIDIA/NeMo).

NVIDIA NeMo and NVIDIA TAO are both training toolkits. TAO abstracts the training details from the user, whereas NeMo exposes them. Because TAO follows the zero-coding paradigm, it is better suited for users who want to quickly fine-tune models on their custom dataset. NeMo is the preferred option for researchers.  
TAO is the preferred training toolkit for Riva because of its ease of use.

In this tutorial, we will be using NeMo only for data preprocessing. We will use the TAO Toolkit for the actual training.

#### Requirements and setup for data preprocessing:

##### Requirements:

We will be using [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) for this data preprocessing step. The easiest way to install and run NeMo is through NVIDIA's PyTorch Docker container. If you are not already running this notebook in the NVIDIA PyTorch Docker container, follow the instructions [here](./README.md#running-the-nvidia-riva-tutorial-how-to-improve-accuracy-on-specific-speech-patterns-by-fine-tuning-the-acoustic-model-citrinet-in-the-riva-asr-pipeline) to re-run this tutorial from the PyTorch Docker container.

We will be using the NVIDIA NeMo Docker container, so access to NGC is a must. As of this writing, the latest (22.08) version of the container does not work for the data preprocessing, but the previous version (22.05) does. 

#### Download and process the AN4 dataset
AN4 is a small dataset recorded and distributed by Carnegie Mellon University (CMU). It consists of recordings of people spelling out addresses, names, etc. Information about this dataset can be found on the official CMU site.

Let's download the AN4 dataset tar file.

In [None]:
# This is the working directory for this part of the tutorial. 
working_dir = 'am_finetuning/'
!mkdir -p $working_dir

# Import the necessary dependencies.
import wget
import glob
import os
import subprocess
import tarfile

# The AN4 directory will be created in `data_dir`. It is currently set to the `working_dir`.
data_dir = os.path.abspath(working_dir)

# Download the AN4 dataset if it doesn't already exist in `data_dir`. 
# This will take a few moments...
# We also set `an4_path` which points to the downloaded an4 dataset
if not os.path.exists(data_dir + '/an4_sphere.tar.gz'):
    an4_url = 'https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz'
    an4_path = wget.download(an4_url, data_dir)
    print(f"AN4 dataset downloaded at: {an4_path}")
else:
    print("AN4 dataset tarfile already exists. Proceed to the next step.")
    an4_path = data_dir + '/an4_sphere.tar.gz'

Now, let's untar the tar file to give us the dataset audio files in `.sph` format. Then, we'll convert the `.sph` files to 16kHz `.wav` files using the SoX library.

In [None]:
if not os.path.exists(data_dir + '/an4/'):
    # Untar
    tar = tarfile.open(an4_path)
    tar.extractall(path=data_dir)
    print("Completed untarring the an4 tarfile")
    # Convert .sph to .wav (using sox)
    print("Converting .sph to .wav...")
    sph_list = glob.glob(data_dir + '/an4/**/*.sph', recursive=True)
    for sph_path in sph_list:
        wav_path = sph_path[:-4] + '.wav'
        #converting to 16kHz wav
        cmd = f"sox {sph_path} -r 16000 {wav_path}"
        subprocess.call(cmd, shell=True)
    print("Finished converting the .sph files to .wav files")
else:
    print("an4 dataset directory already exists. Proceed to the next step.")

Next, let's build the manifest files for the AN4 dataset. The manifest file is a `.json` file that maps the `.wav` clip to its corresponding text.

Each entry in the AN4 dataset's manifest `.json` file follows the template:  
`{"audio_filepath": "<.wav file location>", "duration": <duration of the .wav file>, "text": "<text from the .wav file>"}`  
Example: `{"audio_filepath": "/tutorials/am_finetuning/an4/wav/an4_clstk/fash/an251-fash-b.wav", "duration": 1.0, "text": "yes"}`
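
As a small illustration of how one line of an AN4 transcription file maps onto a manifest entry, the sketch below parses a single line and prints the resulting JSON. The helper name `parse_transcript_line` and the hard-coded `duration` are assumptions made for this example; in the tutorial the duration is read from the `.wav` file itself.

```python
import json
import os


def parse_transcript_line(line, wav_root):
    """Split an AN4 transcription line into transcript text and a .wav path."""
    paren = line.rfind("(")
    transcript = line[:paren].replace("<s>", "").replace("</s>", "").strip().lower()
    file_id = line[paren + 1 : line.rfind(")")]
    # The speaker directory is the middle part of the file ID, e.g. "fash".
    speaker = file_id[file_id.find("-") + 1 : file_id.rfind("-")]
    return transcript, os.path.join(wav_root, speaker, file_id + ".wav")


line = "<s> YES </s> (an251-fash-b)\n"
text, path = parse_transcript_line(line, "an4/wav/an4_clstk")
# The duration is a placeholder here; the tutorial measures it from the audio.
entry = {"audio_filepath": path, "duration": 1.0, "text": text}
print(json.dumps(entry))
```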

In [None]:
!pip install mutagen

In [None]:
# Import the necessary libraries.
import json
from mutagen.wave import WAVE

# Method to build a manifest.
def build_manifest(transcripts_path, manifest_path, wav_path):
    with open(transcripts_path, 'r') as fin:
        with open(manifest_path, 'w') as fout:
            for line in fin:
                # Lines look like this:
                # <s> transcript </s> (fileID)
                transcript = line[: line.find('(')-1].lower()
                transcript = transcript.replace('<s>', '').replace('</s>', '')
                transcript = transcript.strip()

                file_id = line[line.find('(')+1 : -2]  # e.g. "cen4-fash-b"
                audio_path = os.path.join(
                    data_dir, wav_path,
                    file_id[file_id.find('-')+1 : file_id.rfind('-')],
                    file_id + '.wav')

                duration = WAVE(filename=audio_path).info.length

                # Write the metadata to the manifest
                metadata = {
                    "audio_filepath": audio_path,
                    "duration": duration,
                    "text": transcript
                }
                json.dump(metadata, fout)
                fout.write('\n')
                
# Building the manifest files.
print("***Building manifest files***")

# Building manifest files for the training data
train_transcripts = data_dir + '/an4/etc/an4_train.transcription'
train_manifest = data_dir + '/an4/train_manifest.json'
if not os.path.isfile(train_manifest):
    build_manifest(train_transcripts, train_manifest, 'an4/wav/an4_clstk')
    print("Training manifest created at", train_manifest)

# Building manifest files for the test data
test_transcripts = data_dir + '/an4/etc/an4_test.transcription'
test_manifest = data_dir + '/an4/test_manifest.json'
if not os.path.isfile(test_manifest):
    build_manifest(test_transcripts, test_manifest, 'an4/wav/an4test_clstk')
    print("Test manifest created at", test_manifest)

print("***Done***")

#### Download and process the background noise dataset

For background noise, we will use the background noise samples from the Room Impulse Response and Noise Database from the openslr database. For each 30-second isotropic noise sample in the dataset, we use the first 15 seconds for training and the last 15 seconds for evaluation.

Let's first download this dataset.

In [None]:
# Download the background noise dataset if it doesn't already exist in `data_dir`. 
# This will take a few moments...
# We also set `noise_path` which points to the downloaded background noise dataset.

if not os.path.exists(data_dir + '/rirs_noises.zip'):
    slr28_url = 'https://www.openslr.org/resources/28/rirs_noises.zip'
    noise_path = wget.download(slr28_url, data_dir)
    print("Background noise dataset download complete.")
else:
    print("Background noise dataset already exists. Please proceed to the next step.")
    noise_path = data_dir + '/rirs_noises.zip'

Now, we are going to unzip the `.zip` file, which gives us the dataset audio files as 8-channel `.wav` files, sampled at 16kHz. The format and sample rate suit our purposes, but we need to convert these files to mono-channel to match the files in the AN4 dataset. Fortunately, the SoX library provides tools for that as well. 

Note: The conversion will take several minutes.
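
For intuition, `remix 1` keeps only the first input channel. The sketch below shows the same idea on raw interleaved 16-bit PCM frames. It is an illustration of the concept, not how SoX is implemented, and the function name `first_channel` is made up for this example.

```python
import struct


def first_channel(frames, n_channels, sample_width=2):
    """Return the bytes of the first channel from interleaved PCM frames."""
    frame_size = n_channels * sample_width
    out = bytearray()
    for start in range(0, len(frames) - frame_size + 1, frame_size):
        # Each frame stores one sample per channel; keep only the first one.
        out += frames[start : start + sample_width]
    return bytes(out)


# Two frames of 3-channel, 16-bit audio: (1, 2, 3) and (4, 5, 6).
interleaved = struct.pack("<6h", 1, 2, 3, 4, 5, 6)
mono = first_channel(interleaved, n_channels=3)
print(struct.unpack("<2h", mono))
```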

In [None]:
# Extract noise data
from zipfile import ZipFile
if not os.path.exists(data_dir + '/RIRS_NOISES'):
    try:
        with ZipFile(noise_path, "r") as zipObj:
            zipObj.extractall(data_dir)
            print("Extracting noise data complete")
        # Convert 8-channel audio files to mono-channel
        wav_list = glob.glob(data_dir + '/RIRS_NOISES/**/*.wav', recursive=True)
        for wav_path in wav_list:
            mono_wav_path = wav_path[:-4] + '_mono.wav'
            cmd = f"sox {wav_path} {mono_wav_path} remix 1"
            subprocess.call(cmd, shell=True)
        print("Finished converting the 8-channel noise data .wav files to mono-channel")
    except Exception:
        print("Not extracting. Extracted noise data might already exist.")
else: 
    print("Extracted noise data already exists. Please proceed to the next step.")

Next, let's build the manifest files for the noise data. The manifest file is a `.json` file that maps the `.wav` clip to its corresponding text.

Each entry in the noise data's manifest `.json` file follows the template:  
`{"audio_filepath": "<.wav file location>", "duration": <duration of the .wav file>, "offset": <offset value>, "text": "-"}`  
Example: `{"audio_filepath": "/tutorials/am_finetuning/RIRS_NOISES/real_rirs_isotropic_noises/RVB2014_type1_noise_largeroom1_1_mono.wav", "duration": 30.0, "offset": 0, "text": "-"}`

In [None]:
import json
iso_path = os.path.join(data_dir,"RIRS_NOISES/real_rirs_isotropic_noises")
iso_noise_list = os.path.join(iso_path, "noise_list")

# Edit the noise_list file so that it lists the *_mono.wav files instead of the original *.wav files
with open(iso_noise_list) as f:
    if '_mono.wav' in f.read():
        print(f"{iso_noise_list} has already been processed")
    else:
        cmd = f"sed -i 's|\\.wav|_mono.wav|g' {iso_noise_list}"
        subprocess.call(cmd, shell=True)
        print(f"Finished processing {iso_noise_list}")

In [None]:
# Create the manifest files from noise files
def process_row(row, offset, duration):
    # Note: the `duration` argument is overwritten below with the full
    # length of the file as reported by `soxi -D`.
    wav_f = row['wav_filename']
    try:
        duration = subprocess.check_output('soxi -D {0}'.format(wav_f), shell=True)
        entry = {}
        entry['audio_filepath'] = wav_f
        entry['duration'] = float(duration)
        entry['offset'] = offset
        entry['text'] = row['transcript']
        return entry
    except Exception as e:
        print(f"Error processing {wav_f} file!!! ({e})")
        return None

train_rows = []
test_rows = []

with open(iso_noise_list,"r") as in_f:
    for line in in_f:
        row = {}
        data = line.rstrip().split()
        row['wav_filename'] = os.path.join(data_dir,data[-1])
        row['transcript'] = "-"
        # Skip files that could not be processed so that no `null` entries reach the manifests.
        for rows, offset in ((train_rows, 0), (test_rows, 15)):
            entry = process_row(row, offset, 15)
            if entry is not None:
                rows.append(entry)

# Writing manifest files
def write_manifest(manifest_file, manifest_lines):
    with open(manifest_file, 'w') as fout:
      for m in manifest_lines:
        fout.write(json.dumps(m) + '\n')
      print("Writing manifest file to", manifest_file, "complete")

# Writing training and test manifest files
test_noise_manifest = os.path.join(data_dir, "test_noise.json")
train_noise_manifest = os.path.join(data_dir, "train_noise.json")
if not os.path.exists(test_noise_manifest):
    write_manifest(test_noise_manifest, test_rows)
else:
    print('Test noise manifest file already exists. Please proceed to the next step.')
if not os.path.exists(train_noise_manifest):
    write_manifest(train_noise_manifest, train_rows)
else:
    print('Train noise manifest file already exists. Please proceed to the next step.')

#### Create the noise-augmented dataset

Finally, let's create a noise-augmented dataset by adding noise to the AN4 dataset with the `add_noise.py` NeMo script. This script generates the noise-augmented audio clips as well as the manifest files. 

Each entry in the noise-augmented data's manifest file follows the template:  
`{"audio_filepath": "<.wav file location>", "duration": <duration of the .wav file>, "text": "<text from the .wav file>"}`
Example: `{"audio_filepath": "/tutorials/am_finetuning/noise_data/train_manifest/train_noise_0db/an251-fash-b.wav", "duration": 1.0, "text": "yes"}`

##### Setup:

Configure and run the NeMo Docker container. We'll keep it active for the remainder of the data preprocessing section, so as to avoid having to update it with the latest version of the NeMo repo every time we need the container. 

In [None]:
# nemo_container = '<add container name>'
nemo_container = 'nvcr.io/nvidia/nemo:22.08'
# Don't use --rm flag 
# Using the --user=$(id -u):$(id -g) flag appears to cause the following error: 
# ERROR: Could not install packages due to an OSError: [Errno 13] Permission denied: '/.local'
# Check the permissions.
! docker run --gpus=all -it -d -v $data_dir:$data_dir --net=host --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864 --name=nemo_container $nemo_container bash

Ensure that the container is running the latest version of NeMo.

In [None]:
BRANCH = 'main'
run = f"python -m pip install git+https://github.com/NVIDIA/NeMo.git@{BRANCH}#egg=nemo_toolkit[all]"
# --user=$(id -u):$(id -g) 
! docker exec -it nemo_container $run

##### Training dataset
Let's create a noise-augmented training dataset using the AN4 training dataset. We'll add noise at different SNRs (Signal-to-Noise Ratios) ranging from 0 to 15 dB SNR using a NeMo script. Note that a 0 dB SNR means that the noise and signal in the given audio file have equal power. 
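
To illustrate what mixing at a target SNR involves, the sketch below computes the gain that scales a noise clip so that the signal-to-noise ratio matches a requested value in dB. It is a simplified illustration under the usual RMS-based definition of SNR and is not necessarily how `add_noise.py` implements the step. The helper names and the toy sample values are assumptions.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def noise_gain_for_snr(signal, noise, snr_db):
    """Gain to apply to `noise` so that the signal-to-noise ratio equals snr_db."""
    return rms(signal) / (rms(noise) * 10 ** (snr_db / 20))


def mix(signal, noise, snr_db):
    """Add the scaled noise to the signal, sample by sample."""
    gain = noise_gain_for_snr(signal, noise, snr_db)
    return [s + gain * n for s, n in zip(signal, noise)]


# Toy clips; real audio would have thousands of samples per second.
signal = [0.5, -0.5, 0.5, -0.5]
noise = [0.1, 0.1, -0.1, -0.1]
for snr_db in (0, 5, 10, 15):
    gain = noise_gain_for_snr(signal, noise, snr_db)
    scaled = [gain * n for n in noise]
    measured = 20 * math.log10(rms(signal) / rms(scaled))
    print(snr_db, round(measured, 6))
```

Lower SNR values mean louder noise relative to the speech, so the 0 dB clips are the hardest ones in the augmented set.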

In [None]:
final_data_dir = data_dir + '/noise_data'

run = f"python /workspace/nemo/scripts/dataset_processing/add_noise.py \
    --input_manifest={train_manifest} \
    --noise_manifest={train_noise_manifest} \
    --snrs 0 5 10 15 \
    --out_dir={final_data_dir}"

! docker exec -it nemo_container $run

The above script generates one `.json` manifest file per SNR value, i.e., one each for 0, 5, 10, and 15 dB SNR.  
Let's combine all the manifests into a single manifest for training.
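
Because each manifest holds one JSON object per line, combining them amounts to concatenation. For readers who prefer to stay in Python, here is a hedged sketch of the same step that also checks each line is valid JSON. The function name `combine_manifests` and the demo file names are assumptions made for this example.

```python
import glob
import json
import os
import tempfile


def combine_manifests(pattern, out_path):
    """Concatenate JSON-lines manifests matching `pattern` into `out_path`."""
    count = 0
    with open(out_path, "w") as fout:
        for path in sorted(glob.glob(pattern)):
            if os.path.abspath(path) == os.path.abspath(out_path):
                continue  # never read the output file back in
            with open(path) as fin:
                for line in fin:
                    if line.strip():
                        json.loads(line)  # fail early on a malformed entry
                        fout.write(line.rstrip("\n") + "\n")
                        count += 1
    return count


# Self-contained demo with two hypothetical per-SNR manifests.
demo_dir = tempfile.mkdtemp()
for snr in (0, 5):
    with open(os.path.join(demo_dir, f"train_noise_{snr}db.json"), "w") as f:
        entry = {"audio_filepath": f"a_{snr}.wav", "duration": 1.0, "text": "yes"}
        f.write(json.dumps(entry) + "\n")
combined_path = os.path.join(demo_dir, "noisy_train.json")
total = combine_manifests(os.path.join(demo_dir, "train*"), combined_path)
print(total)
```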

In [None]:
run = f"cat {final_data_dir}/manifests/train* >{final_data_dir}/manifests/noisy_train.json"
!{run}

print("Noise-augmented training dataset created at", final_data_dir + "/manifests/noisy_train.json")

##### Test dataset

Let's create a noise-augmented evaluation dataset from the AN4 test dataset by adding noise at 5 dB SNR, using a NeMo script.

In [None]:
# Data augmentation - Add noise to test set.
run = f"python /workspace/nemo/scripts/dataset_processing/add_noise.py \
    --input_manifest={test_manifest} \
    --noise_manifest={test_noise_manifest} \
    --snrs=5 \
    --out_dir={final_data_dir}"

# For some reason, adding --user=$(id -u):$(id -g) breaks the script
! docker exec -it nemo_container $run

print("Noise-augmented testing dataset created at", final_data_dir+"/test_manifest")

**Noise-augmented training manifest and data are created at `{working_dir}/noise_data/manifests/noisy_train.json` and `{working_dir}/noise_data/train_manifest` respectively.**  
**Noise-augmented testing manifest and data are created at `{working_dir}/noise_data/manifests/test_manifest_test_noise_5db.json` and `{working_dir}/noise_data/test_manifest` respectively.**  

With that, step 1 of 3, the data preprocessing step, is complete.  

Now onto the next steps.

##### We don't need the NeMo container anymore, so let's get rid of it

In [None]:
! docker container stop nemo_container
! docker container rm nemo_container

### Step 2. Fine-tuning the Conformer-CTC model with TAO.
Proceed to [this tutorial](https://github.com/nvidia-riva/tutorials/blob/stable/asr-python-advanced-finetune-am-conformer-ctc-tao-finetuning.ipynb) to fine-tune the Conformer-CTC model with TAO.

### Installing and Setting up TAO

Install TAO inside a Python virtual environment. We recommend performing this step first and then launching the tutorial from the virtual environment.

In addition to installing the TAO Python package, ensure you meet the following software requirements:

1. `python` >= 3.6.9
2. `docker-ce` > 19.03.5
3. `docker-API` 1.40
4. `nvidia-container-toolkit` > 1.3.0
5. `nvidia-container-runtime` > 3.4.0
6. `nvidia-docker2` > 2.5.0
7. `nvidia-driver` >= 455.23

Installing TAO is a simple `pip` install.

In [None]:
! pip install nvidia-pyindex
! pip install nvidia-tao

In [None]:
# please define these paths on your local host machine
HOST_DATA_DIR = '/path/to/your/host/data'
HOST_SPECS_DIR = '/path/to/your/host/specs'
HOST_RESULTS_DIR = '/path/to/your/host/results'

%env HOST_DATA_DIR=$HOST_DATA_DIR
%env HOST_SPECS_DIR=$HOST_SPECS_DIR
%env HOST_RESULTS_DIR=$HOST_RESULTS_DIR

In [None]:
! mkdir -p $HOST_DATA_DIR
! mkdir -p $HOST_SPECS_DIR
! mkdir -p $HOST_RESULTS_DIR

In [None]:
# Mapping the Local Directories to the TAO Docker.
import json
import os
mounts_file = os.path.expanduser("~/.tao_mounts.json")
tlt_configs = {
   "Mounts":[
       {
           "source": os.environ["HOST_DATA_DIR"],
           "destination": "/data"
       },
       {
           "source": os.environ["HOST_SPECS_DIR"],
           "destination": "/specs"
       },
       {
           "source": os.environ["HOST_RESULTS_DIR"],
           "destination": "/results"
       },
       {
           "source": os.path.expanduser("~/.cache"),
           "destination": "/root/.cache"
       }
   ],
   "DockerOptions": {
       "shm_size": "16G", 
       "ulimits": {
           "memlock": -1,
           "stack": 67108864
       }
   }
}
# Writing the mounts file.
with open(mounts_file, "w") as mfile:
    json.dump(tlt_configs, mfile, indent=4)

In [None]:
!cat ~/.tao_mounts.json

You can check the Docker image versions and the tasks they perform by issuing `tao --help` or the following command:

In [None]:
! tao info --verbose

### Set Relevant Paths

In [None]:
# NOTE: The following paths are set from the perspective of the TAO Docker.

# The data is saved here:
DATA_DIR = "/data"
SPECS_DIR = "/specs"
RESULTS_DIR = "/results"

# Set your encryption key and use the same key for all commands.
KEY = 'tlt_encode'

The command structure for the TAO interface can be broken down as follows: `tao <task name> <subcommand>` <br> 

Let's see this in further detail.

#### Make copies of the noisy data manifest files for use with TAO

The paths in the manifest files are currently set relative to the host system running this notebook. TAO requires manifest files whose paths are set relative to the interior of the TAO container. Thus, we'll make copies and edit them accordingly. 
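
The same path rewrite can be sketched in Python. Unlike a global text substitution, this version touches only the `audio_filepath` field and only when it starts with the host prefix. The function name `remap_manifest_line` and the example host directory are assumptions made for this illustration; `/data` is the container mount point configured earlier.

```python
import json


def remap_manifest_line(line, host_prefix, container_prefix):
    """Rewrite the audio path in one manifest line from host to container."""
    entry = json.loads(line)
    path = entry["audio_filepath"]
    if path.startswith(host_prefix):
        entry["audio_filepath"] = container_prefix + path[len(host_prefix):]
    return json.dumps(entry)


# Hypothetical host directory, standing in for HOST_DATA_DIR.
host_dir = "/home/user/am_finetuning"
line = json.dumps({
    "audio_filepath": host_dir + "/noise_data/train_manifest/train_noise_0db/an251-fash-b.wav",
    "duration": 1.0,
    "text": "yes",
})
remapped = json.loads(remap_manifest_line(line, host_dir, "/data"))
print(remapped["audio_filepath"])
```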

In [None]:
%%bash 
cd $HOST_DATA_DIR/noise_data/manifests
cp noisy_train.json tao_noisy_train.json
cp test_manifest_test_noise_5db.json tao_noisy_test_5db.json
sed -i "s|$HOST_DATA_DIR|/data|g" tao_noisy_train.json
sed -i "s|$HOST_DATA_DIR|/data|g" tao_noisy_test_5db.json


### Downloading Specs
TAO's conversational AI toolkit works off of spec files which make it easy to edit hyperparameters on the fly. We can proceed to downloading the spec files. You may choose to modify/rewrite these specs or even individually override them through the launcher. You can download the default spec files by using the `download_specs` command.<br>

The `-o` argument indicates the folder where the default specification files will be downloaded. The `-r` argument instructs the script on where to save the logs. **Ensure the `-o` points to an empty folder.**

In [None]:
# If the specs directory already exists, delete it before running this command to avoid errors.
! tao speech_to_text_conformer download_specs \
    -r $RESULTS_DIR/conformer \
    -o $SPECS_DIR/conformer

### ASR Fine-Tuning

If a model that has been trained and evaluated still needs fine-tuning, the following command can be used to fine-tune the ASR model. This step can also be used for transfer learning by making changes in the `train.json` and `dev.json` files to add new data.

The list for customizations is the same as the training parameters with the exception for parameters which affect the model architecture. Also, instead of `training_ds` we have `finetuning_ds`.

Note: If you want to proceed with a trained dataset for better inference results, you can find a `.nemo` model [here](https://ngc.nvidia.com/catalog/collections/nvidia:nemotrainingframework).

Simply rename the `.nemo` file to `.tlt` and pass it through the fine-tune pipeline.

Note: The fine-tune spec files contain specifics to fine-tune the English model we just trained to Russian. If you want to proceed with English, ensure the changes are in the spec file `finetune.yaml` which you can find in the `SPEC_DIR` folder you mapped. Ensure to delete older fine-tuning checkpoints if you choose to change the language after fine-tuning it as-is.

#### Create Tokenizer

Before we can fine-tune the model, we need to pre-process the text. This step, called subword tokenization, creates a subword vocabulary for the text. It differs from Jasper and QuartzNet, whose vocabularies contain only single characters, whereas in Conformer-CTC a subword can be one or more characters. We can use the `create_tokenizer` command to create the tokenizer that generates the subword vocabulary for use in training.
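
To show the difference between a character vocabulary and a subword vocabulary, the toy sketch below splits a word with a greedy longest-match rule. This is not the algorithm that `create_tokenizer` runs, which learns its vocabulary from the training text. The function name and the hand-written `toy_vocab` are assumptions made purely for illustration.

```python
def greedy_subword_tokenize(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Fall back to a single character if no vocabulary piece matches.
            pieces.append(word[i])
            i += 1
    return pieces


toy_vocab = {"th", "ir", "ty", "t", "h", "i", "r", "y", "e", "n"}
char_tokens = list("thirty")
subword_tokens = greedy_subword_tokenize("thirty", toy_vocab)
print(char_tokens)
print(subword_tokens)
```

With subwords, the same word is covered by fewer tokens than with single characters, which is the property a subword vocabulary is meant to provide.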

In [None]:
! tao speech_to_text_conformer create_tokenizer \
    -e $SPECS_DIR/conformer/create_tokenizer.yaml \
    -r $RESULTS_DIR/conformer_noisy_audio/tokenizer \
    manifests=$DATA_DIR/noise_data/manifests/noisy_train.json \
    output_root=$RESULTS_DIR/conformer_noisy_audio/tokenizer \
    vocab_size=1024

#### Download a Pre-Trained Conformer-CTC Acoustic Model

In [None]:
%%bash
cd $HOST_RESULTS_DIR
mkdir -p conformer_noisy_audio/pretrained
ngc registry model download-version "nvidia/tao/speechtotext_en_us_conformer:trainable_v4.0"
mv speechtotext_en_us_conformer_vtrainable_v4.0/* conformer_noisy_audio/pretrained/.
rmdir speechtotext_en_us_conformer_vtrainable_v4.0

#### Write a Fine-Tuning Spec File for the Noisy Audio Dataset

Experiment: `change_vocabulary: false`, no tokenizer in `finetune` command

In [None]:
%%bash
tee $HOST_SPECS_DIR/conformer/finetune_noisy_audio.yaml <<'EOF'
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
# TLT spec file for fine-tuning a previously trained ASR model.

trainer:
  max_epochs: 3   # This is low for demo purposes

tlt_checkpoint_interval: 1

# Whether or not to change the decoder vocabulary.
# Note that this MUST be set if the labels change, e.g. to a different language's character set
# or if additional punctuation characters are added.
change_vocabulary: false

tokenizer:
  dir: ???
  type: "bpe"  # Can be either bpe or wpe

# Fine-tuning settings: training dataset
finetuning_ds:
  manifest_filepath: ???
  batch_size: 4
  trim_silence: true
  shuffle: true
  is_tarred: false
  tarred_audio_filepaths: null

# Fine-tuning settings: validation dataset
validation_ds:
  manifest_filepath: ???
  batch_size: 4
  shuffle: false

# Fine-tuning settings: optimizer
optim:
  name: novograd
  lr: 0.001
EOF

#### Fine-tune Conformer-CTC
Fine-tuning and validating for even a single epoch on the noise-augmented AN4 dataset will take a considerable amount of time. For good model performance, dozens of training epochs may be required. 

Experiment: `change_vocabulary: false` in the `.yaml` file and no tokenizer in the `finetune` command, that is, the override `tokenizer.dir=$RESULTS_DIR/conformer_noisy_audio/tokenizer/tokenizer_spe_unigram_v1024` is omitted.

With this configuration, fine-tuning successfully produced a `.tlt` model. ONNX inference with that model could not be verified, however: there turned out to be a bug in `tao speech_to_text_conformer infer_onnx`, and the entire subtask was removed from the most recent TAO release (as of November 2022). 

In [None]:
!tao speech_to_text_conformer finetune \
     -e $SPECS_DIR/conformer/finetune_noisy_audio.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/conformer_noisy_audio/pretrained/speechtotext_en_us_conformer.tlt \
     -r $RESULTS_DIR/conformer_noisy_audio/finetune \
     finetuning_ds.manifest_filepath=$DATA_DIR/noise_data/manifests/tao_noisy_train.json \
     validation_ds.manifest_filepath=$DATA_DIR/noise_data/manifests/tao_noisy_test_5db.json \
     trainer.max_epochs=25 \
     finetuning_ds.num_workers=20 \
     validation_ds.num_workers=20 \
     trainer.gpus=1 

### ASR Evaluation

Now that we have fine-tuned our model, we need to check how well it performs.

In [None]:
# Replaced 
# -m $RESULTS_DIR/conformer_noisy_audio/finetune/checkpoints/finetuned-model.tlt
# with 
# -m $RESULTS_DIR/conformer_noisy_audio/finetune/finetuned-model_epoch_24.tlt
# and later with 
# -m $RESULTS_DIR/conformer_noisy_audio/finetune/finetuned-model_epoch_9.tlt
!tao speech_to_text_conformer evaluate \
     -e $SPECS_DIR/conformer/evaluate.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/conformer_noisy_audio/finetune/finetuned-model_epoch_9.tlt \
     -r $RESULTS_DIR/conformer_noisy_audio/evaluate \
     test_ds.manifest_filepath=$DATA_DIR/noise_data/manifests/tao_noisy_test_5db.json

In developing this tutorial, we observed that fine-tuning Conformer-CTC for 10 epochs yielded a WER of approximately 0.52% on our noisy test dataset.
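
For reference, word error rate (WER) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal implementation of that definition. It assumes a non-empty reference and applies no text normalization, so it may not match the numbers reported by `tao speech_to_text_conformer evaluate` exactly. The function name and the example strings are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# AN4 transcripts are often spelled-out letters, so each letter counts as a word.
wer = word_error_rate("p i t t s b u r g h", "p i t s b u r g h")
print(f"{wer:.2%}")
```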

#### Export to Riva

In [None]:
# Replaced 
# -m $RESULTS_DIR/conformer_noisy_audio/finetune/checkpoints/finetuned-model.tlt
# with 
# -m $RESULTS_DIR/conformer_noisy_audio/finetune/finetuned-model_epoch_9.tlt
!tao speech_to_text_conformer export \
     -e $SPECS_DIR/conformer/export.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/conformer_noisy_audio/finetune/finetuned-model_epoch_9.tlt \
     -r $RESULTS_DIR/conformer_noisy_audio/riva \
     export_format=RIVA \
     export_to=conformer-ctc-noisy-audio.riva

#### Export to ONNX
Note: Export to ONNX is not needed for Riva.

In [None]:
# Replaced 
# -m $RESULTS_DIR/conformer_noisy_audio/finetune/checkpoints/finetuned-model.tlt
# with 
# -m $RESULTS_DIR/conformer_noisy_audio/finetune/finetuned-model_epoch_9.tlt
!tao speech_to_text_conformer export \
     -e $SPECS_DIR/conformer/export.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/conformer_noisy_audio/finetune/finetuned-model_epoch_9.tlt \
     -r $RESULTS_DIR/conformer_noisy_audio/onnx \
     export_format=ONNX \
     export_to=conformer-ctc-noisy-audio.eonnx

### ASR Inference using TLT Checkpoint

#### ASR Inference with TAO Toolkit

In this section, we run inference on the TLT checkpoint with TAO Toolkit. 
For real-time inference with the best latency, deploy this model on Riva. Refer to the [How to Deploy a Custom Acoustic Model (Conformer-CTC) Trained with TAO Toolkit on Riva](https://github.com/nvidia-riva/tutorials/blob/stable/asr-python-advanced-finetune-am-conformer-ctc-tao-deployment.ipynb) tutorial. 
You might need to edit the `infer.yaml` file to select the files you want to run inference on.

In [None]:
# Replaced 
# -m $RESULTS_DIR/conformer_noisy_audio/finetune/checkpoints/finetuned-model.tlt
# with 
# -m $RESULTS_DIR/conformer_noisy_audio/finetune/finetuned-model_epoch_9.tlt
!tao speech_to_text_conformer infer \
     -e $SPECS_DIR/conformer/infer.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/conformer_noisy_audio/finetune/finetuned-model_epoch_9.tlt \
     -r $RESULTS_DIR/conformer_noisy_audio/infer \
     file_paths=[$DATA_DIR/noise_data/test_manifest/test_noise_5db/an420-fjlp-b.wav]

### Step 3. Deploying the fine-tuned Conformer-CTC TAO model on the Riva Speech Skills server.
Proceed to [this tutorial](https://github.com/nvidia-riva/tutorials/blob/stable/asr-python-advanced-finetune-am-conformer-ctc-tao-deployment.ipynb) to deploy the fine-tuned Conformer-CTC TAO model on the Riva Speech Skills server for inference.

## NVIDIA Riva Overview

NVIDIA Riva is a GPU-accelerated SDK for building speech AI applications that are customized for your use case and deliver real-time performance. <br/>
Riva offers a rich set of speech and natural language understanding services such as:

- Automated speech recognition (ASR).
- Text-to-Speech synthesis (TTS).
- A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will deploy a custom acoustic model (Conformer-CTC) trained with TAO Toolkit on Riva. <br> 
To understand the basics of Riva ASR APIs, refer to [Getting started with Riva ASR in Python](https://github.com/nvidia-riva/tutorials/blob/stable/asr-python-basics.ipynb). <br>

For more information about Riva, refer to the [Riva developer documentation](https://developer.nvidia.com/riva).

## Train, Adapt, and Optimize TAO Toolkit
[Train Adapt Optimize (TAO) Toolkit](https://developer.nvidia.com/tao-toolkit) provides the capability to export your model in a format that can be deployed using [NVIDIA Riva](https://developer.nvidia.com/riva), a highly performant application framework for multi-modal conversational AI services using GPUs. 

This tutorial explores taking a `.riva` model, the result of the `tao speech_to_text_conformer export` command (refer to the [fine-tuning tutorial](https://github.com/nvidia-riva/tutorials/blob/stable/sven-asr-python-advanced-finetune-am-conformer-ctc-tao-finetuning.ipynb)), and leveraging the Riva ServiceMaker framework to aggregate all the necessary artifacts for Riva deployment to a target environment. After the model is deployed in Riva, you can issue inference requests to the server. We will demonstrate how quick and straightforward this whole process is.

In this tutorial, you will learn how to:  
- Use Riva ServiceMaker to take a TAO exported `.riva` file and convert it to `.rmir`.
- Deploy the model locally on the Riva server.
- Send inference requests from a demo client using Riva API bindings.

---
## Prerequisites

Before we get started, ensure you have:
- Access to NVIDIA NGC and are able to download the Riva Quick Start [resources](https://ngc.nvidia.com/catalog/resources/nvidia:riva:riva_quickstart).
- A `.riva` model file that you want to deploy. You can obtain this from `tao <task> export` (with `export_format=RIVA`) or download a pre-trained version from the [US English Conformer NGC model page](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_en_us_conformer). For more information on training and exporting a `.riva` Conformer-CTC acoustic model, refer to the [Speech Recognition with Conformer](https://docs.nvidia.com/tao/tao-toolkit/text/asr/speech_recognition_with_conformer.html) pages in the [TAO Toolkit Documentation](https://docs.nvidia.com/tao/tao-toolkit/index.html).

---
## Riva ServiceMaker
Riva ServiceMaker is a set of tools that aggregates all the necessary artifacts (models, files, configurations, and user settings) for Riva deployment to a target environment. It has two main components:

### Riva-Build

This step helps build a Riva-ready version of the model. Its only output is an intermediate format (called an RMIR) of an end-to-end pipeline for the supported services within Riva. Let's consider a Conformer-CTC ASR model. <br>

`riva-build` is responsible for the combination of one or more exported models (`.riva` files) into a single file containing an intermediate format called Riva Model Intermediate Representation (`.rmir`). This file contains a deployment-agnostic specification of the whole end-to-end pipeline along with all the assets required for the final deployment and inference. For more information, refer to the [documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-pipeline-configuration.html?highlight=pipeline%20configuration).

In [None]:
!mkdir -p models
!cp $HOST_RESULTS_DIR/conformer_noisy_audio/riva/conformer-ctc-noisy-audio.riva models/.

In [None]:
# IMPORTANT: UPDATE THESE PATHS 

# os is used later to build the path to the test audio file
import os

# ServiceMaker Docker
RIVA_SM_CONTAINER = "<add container name>"

# Directory where the .riva model is stored $MODEL_LOC/*.riva
MODEL_LOC = "<add path to model location>"

# Name of the .riva file
MODEL_NAME = "<add model name>"

# Key that the model was encrypted with when it was exported with TAO
KEY = "<add encryption key used for trained model>"
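
Because the later `docker` commands fail in confusing ways when a placeholder is left unedited, it can help to check the values first. The following is a minimal sketch; the helper name `unfilled_settings` and the example paths are assumptions for illustration, not part of Riva or TAO.

```python
def unfilled_settings(settings: dict) -> list:
    """Return the names of settings that are empty or still look like '<add ...>'."""
    missing = []
    for name, value in settings.items():
        text = value.strip()
        if not text or (text.startswith("<") and text.endswith(">")):
            missing.append(name)
    return missing


# Example values only; in the notebook, pass the variables defined above.
example = {
    "RIVA_SM_CONTAINER": "<add container name>",
    "MODEL_LOC": "/home/user/riva_models",
    "MODEL_NAME": "conformer-ctc-noisy-audio.riva",
    "KEY": "<add encryption key used for trained model>",
}
print(unfilled_settings(example))
```

An empty list means every setting has been filled in.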

In [None]:
# Get the ServiceMaker Docker container
! docker pull $RIVA_SM_CONTAINER

If it doesn't already exist, create a sub-directory inside `MODEL_LOC` to store your `.rmir` files.

In [None]:
! mkdir -p $MODEL_LOC/rmir

#### Build the `.rmir` file.

**Notes** 
1. If you obtained your `.riva`-formatted acoustic model file from `tao <task> export`, you may need to replace `--nn.fp16_needs_obey_precision_pass` with `--nn.use_trt_fp32` when invoking `riva-build`. 
2. Refer to the [Riva ASR Pipeline Configuration](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-pipeline-configuration.html) documentation page if you wish to build an ASR pipeline for a supported language other than US English. To obtain the proper `riva-build` parameters for your particular application, select the acoustic model, language, and pipeline type (offline for the purposes of this tutorial) from the interactive web menu at the bottom of the first section of the page.

In [None]:
# Syntax: riva-build <task-name> output-dir-for-rmir/model.rmir:key dir-for-riva/model.riva:key
! docker run --rm --gpus 0 -v $MODEL_LOC:/data $RIVA_SM_CONTAINER -- \
    riva-build speech_recognition \
        /data/rmir/asr_offline_conformer_ctc_noisy_audio.rmir:$KEY \
        /data/$MODEL_NAME:$KEY \
        --offline \
        --name=asr_offline_conformer_ctc_noisy_audio_pipeline \
        --decoder_type=greedy \
        --ms_per_timestep=40 \
        --chunk_size=4.8 \
        --left_padding_size=1.6 \
        --right_padding_size=1.6 \
        --max_batch_size=16 \
        --nn.use_trt_fp32 \
        --featurizer.use_utterance_norm_params=False \
        --featurizer.precalc_norm_time_steps=0 \
        --featurizer.precalc_norm_params=False \
        --featurizer.max_batch_size=512 \
        --featurizer.max_execution_batch_size=512 \
        --language_code=en-US

### Riva-Deploy

The deployment tool takes as input one or more RMIR files and a target model repository directory. It creates an ensemble configuration specifying the pipeline for the execution and finally writes all those assets to the output model repository directory.

In [None]:
# Syntax: riva-deploy -f dir-for-rmir/model.rmir:key output-dir-for-repository
! docker run --rm --gpus 0 -v $MODEL_LOC:/data $RIVA_SM_CONTAINER -- \
    riva-deploy -f  \
        /data/rmir/asr_offline_conformer_ctc_noisy_audio.rmir:$KEY \
        /data/models/
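
To confirm that `riva-deploy` wrote something to the output repository, you can list its contents from Python. This is a minimal sketch that assumes the repository holds one sub-directory per deployed model or pipeline component; the helper name `list_model_dirs` and the example path are hypothetical.

```python
from pathlib import Path


def list_model_dirs(repository: str) -> list:
    """Return the sorted names of the sub-directories in a model repository."""
    repo = Path(repository)
    if not repo.is_dir():
        # Nothing has been deployed to this location yet.
        return []
    return sorted(entry.name for entry in repo.iterdir() if entry.is_dir())


# Hypothetical path; in the notebook, use os.path.join(MODEL_LOC, "models").
print(list_model_dirs("/path/to/model_loc/models"))
```

An empty list suggests that the deploy step did not complete or that the path is wrong.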

---
## Start the Riva Server
After the model repository is generated, we are ready to start the Riva server. First, download the Riva Quick Start resource from NGC. 
Set the path to the directory here:

In [None]:
# Set the Riva Quick Start directory
RIVA_DIR = "<Path to the uncompressed folder downloaded from quickstart(include the folder name)>"

Next, we modify the `config.sh` file to enable the relevant Riva services (ASR for the Conformer-CTC model) and to provide, among other settings, the encryption key and the path to the model repository (`riva_model_loc`) generated in the previous step. 

For example, if the model repository was generated at `$MODEL_LOC/models` above, set `riva_model_loc` to the same directory as `MODEL_LOC`. <br>

Pretrained versions of the models listed in `models_asr/nlp/tts` are fetched from NGC. Since we are using our custom model, we can comment out the pretrained entries in `models_asr` (and any others that are not relevant to your use case). <br>

#### config.sh snippet
```
# Enable or Disable Riva Services 
service_enabled_asr=true                                                      ## MAKE CHANGES HERE
service_enabled_nlp=false                                                      ## MAKE CHANGES HERE
service_enabled_tts=false                                                     ## MAKE CHANGES HERE

# Specify one or more GPUs to use
# specifying more than one GPU is currently an experimental feature, and may result in undefined behaviours.
gpus_to_use="device=0"

# Specify the encryption key to use to deploy models
MODEL_DEPLOY_KEY="tlt_encode"                                                  ## MAKE CHANGES HERE

# Locations to use for storing models artifacts
#
# If an absolute path is specified, the data will be written to that location
# Otherwise, a docker volume will be used (default).
#
# riva_init.sh will create a `rmir` and `models` directory in the volume or
# path specified. 
#
# RMIR ($riva_model_loc/rmir)
# Riva uses an intermediate representation (RMIR) for models
# that are ready to deploy but not yet fully optimized for deployment. Pretrained
# versions can be obtained from NGC (by specifying NGC models below) and will be
# downloaded to $riva_model_loc/rmir by `riva_init.sh`
# 
# Custom models produced by NeMo or TAO and prepared using riva-build
# may also be copied manually to this location ($riva_model_loc/rmir).
#
# Models ($riva_model_loc/models)
# During the riva_init process, the RMIR files in $riva_model_loc/rmir
# are inspected and optimized for deployment. The optimized versions are
# stored in $riva_model_loc/models. The riva server exclusively uses these
# optimized versions.
riva_model_loc="<add path>"                              ## MAKE CHANGES HERE (Replace with MODEL_LOC)                      
```
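
The `config.sh` changes described above can also be applied programmatically instead of by hand. The following is a minimal sketch that rewrites a single `name=value` assignment in the text of a shell script; the helper name `set_shell_variable` and the example path are assumptions, and any trailing comment on the rewritten line is dropped.

```python
import re


def set_shell_variable(script_text: str, name: str, value: str) -> str:
    """Replace the first line that starts with `name=` by `name="value"`."""
    pattern = re.compile(rf"^{re.escape(name)}=.*$", flags=re.MULTILINE)
    # A function is used as the replacement so that backslashes in `value`
    # are inserted literally rather than treated as regex escapes.
    new_text, count = pattern.subn(lambda _m: f'{name}="{value}"', script_text, count=1)
    if count == 0:
        raise KeyError(f"no assignment to {name!r} found")
    return new_text


snippet = 'service_enabled_asr=true\nriva_model_loc="<add path>"  ## MAKE CHANGES HERE\n'
print(set_shell_variable(snippet, "riva_model_loc", "/home/user/riva_models"))
```

To apply it to the real file, read `config.sh` from the Quick Start directory, pass its contents through the function, and write the result back.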

In [None]:
# Ensure you have permission to execute these scripts
! cd $RIVA_DIR && chmod +x ./riva_init.sh && chmod +x ./riva_start.sh

In [None]:
# Run Riva Init. This will fetch the containers/models
# YOU CAN SKIP THIS STEP IF YOU DID RIVA DEPLOY
# ! cd $RIVA_DIR && ./riva_init.sh config.sh

In [None]:
# Run Riva Start. This will deploy your model(s).
! cd $RIVA_DIR && ./riva_start.sh config.sh
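
Before sending requests, it can be useful to confirm that something is listening on the server's gRPC port. The following is a minimal sketch using only the Python standard library; the helper name `wait_for_port` is an assumption, and the port `50051` matches the address used by the client later in this tutorial. An open port shows only that the server process accepts connections, not that every model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 120.0, interval_s: float = 2.0) -> bool:
    """Poll a TCP port until a connection succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


# Short timeout for demonstration; use a longer one while models are loading.
print(wait_for_port("localhost", 50051, timeout_s=2.0, interval_s=0.5))
```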

---
## Run Inference
After the Riva server is up and running with your models, you can send inference requests querying the server. 

To send gRPC requests, install the Riva Python client API bindings. They are available from PyPI as the `nvidia-riva-client` package.

In [None]:
# Install the Client API Bindings
! pip install nvidia-riva-client

### Connect to the Riva Server and Run Inference
Now we can actually query the Riva server. The following cell queries the Riva server (using gRPC) to yield a result.

In [None]:
import os

import riva.client

# audio_file = "<add path to .wav file>"
audio_file = os.path.join(HOST_DATA_DIR, "noise_data/test_manifest/test_noise_5db/an420-fjlp-b.wav")
server = "localhost:50051"

with open(audio_file, 'rb') as fh:
    data = fh.read()

auth = riva.client.Auth(uri=server)
client = riva.client.ASRService(auth)
config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=False,
)
riva.client.add_audio_file_specs_to_config(config, audio_file)

response = client.offline_recognize(data, config)
print(response)
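
Printing the whole response shows every field. To keep only the recognized text, you can pull the top alternative out of each result. This is a minimal sketch that assumes the response exposes `results`, each with an `alternatives` list whose items carry a `transcript` field; check these names against your installed client version. The helper name `best_transcripts` is hypothetical, and the stand-in object below only mimics that shape so the example runs without a server.

```python
from types import SimpleNamespace


def best_transcripts(response) -> list:
    """Collect the top alternative's transcript from each result."""
    return [
        result.alternatives[0].transcript
        for result in response.results
        if len(result.alternatives) > 0
    ]


# Stand-in with the assumed shape; in the notebook, pass the real `response`.
fake_response = SimpleNamespace(
    results=[
        SimpleNamespace(alternatives=[SimpleNamespace(transcript="hello world ")]),
        SimpleNamespace(alternatives=[]),
    ]
)
print("".join(best_transcripts(fake_response)).strip())
```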

### Stop the Riva Server

In [None]:
! cd $RIVA_DIR && bash riva_stop.sh