<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# How to improve accuracy on specific speech patterns by fine-tuning the Acoustic Model (Citrinet) in the Riva ASR pipeline 

This tutorial walks you through some of the advanced customization features of the Riva ASR pipeline by fine-tuning the Acoustic Model (Citrinet). These customization features improve accuracy on specific speech patterns, like background noise and different acoustic environments.

## NVIDIA Riva Overview

NVIDIA Riva is a GPU-accelerated SDK for building Speech AI applications that are customized for your use case and deliver real-time performance. <br/>
Riva offers a rich set of speech and natural language understanding services such as:

- Automated speech recognition (ASR)
- Text-to-Speech synthesis (TTS)
- A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will customize the Riva ASR pipeline by fine-tuning the Acoustic Model (Citrinet) with NVIDIA's TAO Toolkit to improve accuracy on audio with background noise.  
To understand the basics of Riva ASR APIs, refer to [Getting started with Riva ASR in Python](https://github.com/nvidia-riva/tutorials/blob/dev/22.04/asr-python-basics.ipynb). <br>

For more information about Riva, refer to the [Riva developer documentation](https://developer.nvidia.com/riva).

## Fine-tuning Riva Acoustic Model (Citrinet) with NVIDIA TAO

The following flow diagram shows the Riva speech recognition pipeline along with all the possible customizations. 

Raw temporal audio signals first pass through a feature extraction block which segments the data into blocks (for example, of 80 ms each), then converts the blocks from temporal domain to frequency domain (MFCC). This data is then fed into an acoustic model which outputs probabilities over text tokens at each time step. A decoder converts this matrix of probabilities into a sequence of text tokens which is then `detokenized` into an actual sentence (or character sequence). An advanced decoder can also do beam search and score multiple possible hypotheses (i.e. sentences) in conjunction with a language model. The decoder output comes without punctuation and capitalization, which is the job of the Punctuation and Capitalization model. Finally, Inverse Text Normalization (ITN) rules are applied to transform the text in verbal format into a desired written format.

<img src="./imgs/riva-asr-customizations-amfinetuning.PNG" style="float: center;">

For this tutorial, we need to fine-tune the pre-trained Riva acoustic model. 

Riva offers several options for the acoustic model: Conformer-CTC, Citrinet, Jasper, and Quartznet. In this tutorial, we use the Citrinet model and demonstrate how to fine-tune it.  
Fine-tuning a Conformer-CTC model is not yet supported. Support for this is planned in a future release.    
For more information about these acoustic models and when to use them, refer to the Riva documentation [here](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/models/asr.html).

You can use NVIDIA TAO Toolkit to fine-tune the Citrinet acoustic model in the Riva ASR pipeline.

#### NVIDIA TAO Toolkit Overview

NVIDIA Train Adapt Optimize (TAO) Toolkit is a Python-based AI toolkit for transfer learning that takes purpose-built pre-trained AI models and customizes them on your own data. TAO enables developers with limited AI expertise to create highly accurate AI models for production deployments.  
TAO follows a zero-coding paradigm: there is no need to write any code to train models with TAO. Training requires only a few commands in the TAO command-line interface.  

Riva supports fine-tuning with TAO. The fine-tuned TAO model can easily be deployed for real-time inference on the Riva Speech Skills server.

For more information about the NVIDIA TAO framework, refer to the documentation [here](https://docs.nvidia.com/tao/tao-toolkit/text/overview.html).

### Fine-tune the Citrinet model with NVIDIA TAO:

The process of fine-tuning a Riva Citrinet acoustic model with NVIDIA TAO can be split into three steps:
1. Data preprocessing.
2. Fine-tuning the Citrinet model with TAO.
3. Deploying the fine-tuned Citrinet TAO model on the Riva Speech Skills server.

Let's walk through each of these steps in detail.

### Step 1. Data preprocessing

For fine-tuning we need audio data with background noise. If you already have such data, then you can use it directly.  
In this tutorial, we will take the AN4 dataset and augment it with noise data from the Room Impulse Response and Noise Database from the [openslr database](https://www.openslr.org/28/).
NVIDIA TAO Toolkit does not currently support audio data augmentation. This support will be added in a future release.
In this tutorial, we will be using NVIDIA NeMo for the data preprocessing step.

#### NVIDIA NeMo Overview

NVIDIA NeMo is a toolkit for building new state-of-the-art conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) models. Each collection consists of prebuilt modules that include everything needed to train on your data. Every module can easily be customized, extended, and composed to create new conversational AI model architectures.
For more information about NeMo, refer to the [NeMo product page](https://developer.nvidia.com/nvidia-nemo) and [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/starthere/intro.html). The open-source NeMo repository can be found [here](https://github.com/NVIDIA/NeMo).

NVIDIA NeMo and NVIDIA TAO are both training toolkits. TAO abstracts the training details from the user, whereas NeMo exposes them. Because TAO follows the zero-coding paradigm, it is better suited for users who want to quickly fine-tune models on their custom dataset. NeMo is the preferred option for researchers.  
TAO is the preferred training toolkit for Riva because of its ease of use.

In this tutorial, we will be using NeMo only for data preprocessing. We will use the TAO Toolkit for the actual training.

#### Requirements and setup for data preprocessing:

##### Requirements:

We will be using [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) for this data preprocessing step. The easiest way to install and run NeMo is through NVIDIA's PyTorch docker container. If you are not already running this notebook in the NVIDIA PyTorch docker container, follow the instructions [here](./README.md#running-the-nvidia-riva-tutorial-how-to-improve-accuracy-on-specific-speech-patterns-by-fine-tuning-the-acoustic-model-citrinet-in-the-riva-asr-pipeline) to re-run this tutorial from the PyTorch docker container.

##### Setup:

In [None]:
# 1. Install unidecode and wget. We will need wget to download the datasets.
!pip install unidecode
!pip install wget

# 2. Clone and install NeMo.
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@{BRANCH}#egg=nemo_toolkit[all]
"""
Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!
Alternatively, you can uncomment the exit() below to crash and restart the kernel, in the case
that you want to use the "Run All Cells" (or similar) option.
"""
# exit()

For alternate options to install NeMo, refer to the NeMo repository [here](https://github.com/NVIDIA/NeMo#installation).  

#### Download and process the AN4 dataset
AN4 is a small dataset recorded and distributed by Carnegie Mellon University (CMU). It consists of recordings of people spelling out addresses, names, etc. Information about this dataset can be found on the official CMU site.

Let's download the AN4 dataset tar file.

In [1]:
# This is the working directory for this tutorial. 
working_dir = 'am_finetuning/'
!mkdir -p $working_dir

# Import the necessary dependencies.
import wget
import glob
import os
import subprocess
import tarfile

# The AN4 directory will be created in `data_dir`. It is currently set to the `working_dir`.
data_dir = os.path.abspath(working_dir)

# Download the AN4 dataset if it doesn't already exist in `data_dir`. 
# This will take a few moments...
# We also set `an4_path` which points to the downloaded an4 dataset
if not os.path.exists(data_dir + '/an4_sphere.tar.gz'):
    an4_url = 'https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz'
    an4_path = wget.download(an4_url, data_dir)
    print(f"AN4 dataset downloaded at: {an4_path}")
else:
    print("AN4 dataset tarfile already exists. Proceed to the next step.")
    an4_path = data_dir + '/an4_sphere.tar.gz'

100% [........................................................................] 64327561 / 64327561AN4 dataset downloaded at: /tutorials/am_finetuning/an4_sphere.tar.gz


Now, let's untar the tar file to give us the dataset audio files in `.sph` format. Then, we'll convert the `.sph` files to 16kHz `.wav` files using the SoX library.

In [2]:
if not os.path.exists(data_dir + '/an4/'):
    # Untar
    tar = tarfile.open(an4_path)
    tar.extractall(path=data_dir)
    print("Completed untarring the an4 tarfile")
    # Convert .sph to .wav (using sox)
    print("Converting .sph to .wav...")
    sph_list = glob.glob(data_dir + '/an4/**/*.sph', recursive=True)
    for sph_path in sph_list:
        wav_path = sph_path[:-4] + '.wav'
        #converting to 16kHz wav
        cmd = ["sox", sph_path, "-r", "16000", wav_path]
        subprocess.run(cmd)
    print("Finished converting the .sph files to .wav files")
else:
    # This branch runs when the an4 directory is already present.
    print("The an4 dataset directory already exists. Skipping extraction and conversion.")

Completed untarring the an4 tarfile
Converting .sph to .wav...
Finished converting the .sph files to .wav files


Next, let's build the manifest files for the AN4 dataset. The manifest file is a `.json` file that maps the `.wav` clip to its corresponding text.

Each entry in the AN4 dataset's manifest `.json` file follows the template:  
`{"audio_filepath": "<.wav file location>", "duration": <duration of the .wav file>, "text": "<text from the .wav file>"}`  
Example: `{"audio_filepath": "/tutorials/am_finetuning/an4/wav/an4_clstk/fash/an251-fash-b.wav", "duration": 1.0, "text": "yes"}`
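
As a worked example of how one transcription line turns into manifest fields, the sketch below applies the same slicing logic as the manifest-building cell to a single line. The sample line is an assumption modeled on the `<s> transcript </s> (fileID)` layout and the `an251-fash-b` example above, and the helper name `parse_transcript_line` is hypothetical.

```python
def parse_transcript_line(line):
    """Split '<s> TEXT </s> (fileID)\\n' into (text, file_id, speaker)."""
    paren = line.find('(')
    # Everything before the '(' is the transcript, wrapped in <s> ... </s>.
    transcript = line[: paren - 1].lower()
    transcript = transcript.replace('<s>', '').replace('</s>', '').strip()
    # Drop the trailing ')' and newline to get the file ID.
    file_id = line[paren + 1 : -2]
    # The speaker directory is the part between the first and last hyphen.
    speaker = file_id[file_id.find('-') + 1 : file_id.rfind('-')]
    return transcript, file_id, speaker


sample = "<s> YES </s> (an251-fash-b)\n"  # assumed sample line
print(parse_transcript_line(sample))  # ('yes', 'an251-fash-b', 'fash')
```

Note that the slicing relies on every line ending with `)` followed by a newline. A final line without a trailing newline would lose the last character of its file ID.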

In [3]:
# Import the necessary libraries.
import json
import librosa

# Method to build a manifest.
def build_manifest(transcripts_path, manifest_path, wav_path):
    with open(transcripts_path, 'r') as fin:
        with open(manifest_path, 'w') as fout:
            for line in fin:
                # Lines look like this:
                # <s> transcript </s> (fileID)
                transcript = line[: line.find('(')-1].lower()
                transcript = transcript.replace('<s>', '').replace('</s>', '')
                transcript = transcript.strip()

                file_id = line[line.find('(')+1 : -2]  # e.g. "cen4-fash-b"
                audio_path = os.path.join(
                    data_dir, wav_path,
                    file_id[file_id.find('-')+1 : file_id.rfind('-')],
                    file_id + '.wav')

                duration = librosa.core.get_duration(filename=audio_path)

                # Write the metadata to the manifest
                metadata = {
                    "audio_filepath": audio_path,
                    "duration": duration,
                    "text": transcript
                }
                json.dump(metadata, fout)
                fout.write('\n')
                
# Building the manifest files.
print("***Building manifest files***")

# Building manifest files for the training data
train_transcripts = data_dir + '/an4/etc/an4_train.transcription'
train_manifest = data_dir + '/an4/train_manifest.json'
if not os.path.isfile(train_manifest):
    build_manifest(train_transcripts, train_manifest, 'an4/wav/an4_clstk')
    print("Training manifest created at", train_manifest)

# Building manifest files for the test data
test_transcripts = data_dir + '/an4/etc/an4_test.transcription'
test_manifest = data_dir + '/an4/test_manifest.json'
if not os.path.isfile(test_manifest):
    build_manifest(test_transcripts, test_manifest, 'an4/wav/an4test_clstk')
    print("Test manifest created at", test_manifest)

print("***Done***")

***Building manifest files***
Training manifest created at /tutorials/am_finetuning/an4/train_manifest.json
Test manifest created at /tutorials/am_finetuning/an4/test_manifest.json
***Done***
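
Before moving on, it can be useful to sanity-check a manifest. The following is a minimal sketch that assumes only the one-JSON-object-per-line layout described above. The helper names `load_manifest` and `summarize_manifest` are hypothetical and are not part of NeMo or TAO.

```python
import json


def load_manifest(path):
    """Read a JSON-lines manifest into a list of dicts, skipping blank lines."""
    entries = []
    with open(path, "r", encoding="utf-8") as fin:
        for line in fin:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


def summarize_manifest(entries):
    """Return the clip count and the total audio duration in hours."""
    total_seconds = sum(entry["duration"] for entry in entries)
    return {"clips": len(entries), "hours": total_seconds / 3600.0}


# Example usage with the manifests built above:
# print(summarize_manifest(load_manifest(train_manifest)))
# print(summarize_manifest(load_manifest(test_manifest)))
```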


#### Download and process the background noise dataset

For background noise, we will use the noise samples from the Room Impulse Response and Noise database from the openslr database. For each 30-second isotropic noise sample in the dataset, we use the first 15 seconds for training and the last 15 seconds for evaluation.

Let's first download this dataset.

In [4]:
# Download the background noise dataset if it doesn't already exist in `data_dir`. 
# This will take a few moments...
# We also set `noise_path` which points to the downloaded background noise dataset.

if not os.path.exists(data_dir + '/rirs_noises.zip'):
    slr28_url = 'https://www.openslr.org/resources/28/rirs_noises.zip'
    noise_path = wget.download(slr28_url, data_dir)
    print("Background noise dataset download complete.")
else:
    print("Background noise dataset already exists. Please proceed to the next step.")
    noise_path = data_dir + '/rirs_noises.zip'

100% [....................................................................] 1311166223 / 1311166223Background noise dataset download complete.


Now, let's extract the zip file, which gives us the noise audio files in `.wav` format. Unlike the AN4 dataset, no `.sph` conversion is needed here.

In [5]:
# Extract noise data
from zipfile import ZipFile
try:
    with ZipFile(noise_path, "r") as zipObj:
        zipObj.extractall(data_dir)
        print("Extracting noise data complete")
except Exception:
    print("Not extracting. Extracted noise data might already exist.")

Extracting noise data complete


Next, let's build the manifest files for the noise data. The manifest file is a `.json` file with one entry per `.wav` clip. Because noise clips have no transcript, the `text` field is set to the placeholder `-`.

Each entry in the noise data's manifest `.json` file follows the template:  
`{"audio_filepath": "<.wav file location>", "duration": <duration of the .wav file>, "offset": <offset value>, "text": "-"}`  
Example: `{"audio_filepath": "/tutorials/am_finetuning/RIRS_NOISES/real_rirs_isotropic_noises/RVB2014_type1_noise_largeroom1_1.wav", "duration": 30.0, "offset": 0, "text": "-"}`

In [6]:
import json
iso_path = os.path.join(data_dir,"RIRS_NOISES/real_rirs_isotropic_noises")
iso_noise_list = os.path.join(iso_path, "noise_list")

# Create the manifest files from noise files
def process_row(row, offset, duration):
    # Note: the `duration` argument is overwritten below by the full file
    # duration reported by `soxi -D`; only `offset` differs between the
    # training and test entries.
    wav_f = row['wav_filename']
    try:
        # `soxi -D` prints the duration of the audio file in seconds.
        duration = subprocess.check_output(['soxi', '-D', wav_f])
        entry = {}
        entry['audio_filepath'] = wav_f
        entry['duration'] = float(duration)
        entry['offset'] = offset
        entry['text'] = row['transcript']
        return entry
    except Exception:
        # On failure, report the file; the function then returns None.
        print(f"Error processing {wav_f} file!!!")
    
train_rows = []
test_rows = []

with open(iso_noise_list,"r") as in_f:
    for line in in_f:
        row = {}
        data = line.rstrip().split()
        row['wav_filename'] = os.path.join(data_dir,data[-1])
        row['transcript'] = "-"
        train_rows.append(process_row(row, 0 , 15))
        test_rows.append(process_row(row, 15 , 15))

# Writing manifest files
def write_manifest(manifest_file, manifest_lines):
    with open(manifest_file, 'w') as fout:
      for m in manifest_lines:
        fout.write(json.dumps(m) + '\n')
      print("Writing manifest file to", manifest_file, "complete")

# Writing training and test manifest files
test_noise_manifest = os.path.join(data_dir, "test_noise.json")
train_noise_manifest = os.path.join(data_dir, "train_noise.json")
write_manifest(test_noise_manifest, test_rows)
write_manifest(train_noise_manifest, train_rows)

Writing manifest file to /tutorials/am_finetuning/test_noise.json complete
Writing manifest file to /tutorials/am_finetuning/train_noise.json complete


#### Create the noise-augmented dataset

Finally, let's create a noise-augmented dataset by adding noise to the AN4 dataset with the `add_noise.py` NeMo script. This script generates the noise-augmented audio clips as well as the manifest files. 

Each entry in the noise-augmented data's manifest file follows the template:  
`{"audio_filepath": "<.wav file location>", "duration": <duration of the .wav file>, "text": "<text from the .wav file>"}`  
Example: `{"audio_filepath": "/tutorials/am_finetuning/noise_data/train_manifest/train_noise_0db/an251-fash-b.wav", "duration": 1.0, "text": "yes"}`
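
To illustrate what mixing at a given signal-to-noise ratio (SNR) means, here is a minimal sketch that scales a noise signal so that the speech-to-noise RMS ratio matches a target SNR in dB, then adds it to the speech. This is an illustration of the general idea under simple assumptions (equal-length lists of float samples, RMS-based levels). It is not the implementation used by `add_noise.py`, and the function names are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mix has the requested SNR in dB."""
    noise_rms = rms(noise)
    if noise_rms == 0:
        raise ValueError("noise is silent; cannot scale it to a target SNR")
    # SNR_dB = 20 * log10(rms(speech) / rms(scaled_noise)), solved for the gain.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals: a 440 Hz tone as "speech" and a 1 kHz tone as "noise" at 16 kHz.
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noise = [0.3 * math.sin(2 * math.pi * 1000 * t / 16000) for t in range(1600)]
noisy = mix_at_snr(speech, noise, snr_db=5)
```

A lower SNR value means louder noise relative to the speech: at 0 dB the noise is as loud as the speech, while at 15 dB it is much quieter.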

##### Training dataset
Let's create a noise-augmented training dataset from the AN4 training dataset. We'll use a NeMo script to add noise at SNRs of 0, 5, 10, and 15 dB.

In [None]:
final_data_dir = data_dir + '/noise_data'

run = f"python /NeMo/scripts/dataset_processing/add_noise.py \
    --input_manifest={train_manifest} \
    --noise_manifest={train_noise_manifest} \
    --snrs 0 5 10 15 \
    --out_dir={final_data_dir}"

!{run}

The above script generates one `.json` manifest file per SNR value, that is, one manifest file each for 0, 5, 10, and 15 dB SNR.  
Let's combine all the manifests into a single manifest for training.

In [8]:
run = f"cat {data_dir}/noise_data/manifests/train* >{final_data_dir}/noisy_train.json"
!{run}

print("Noise-augmented training dataset created at", final_data_dir + "/noisy_train.json")

Noise-augmented training dataset created at /tutorials/am_finetuning/noise_data/noisy_train.json
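
If `cat` is not available in your environment (for example, outside a Unix-like shell), the same merge can be done in plain Python. This is a minimal sketch, and the helper name `concat_manifests` is hypothetical. It assumes the output file does not itself match the input pattern.

```python
import glob


def concat_manifests(pattern, out_path):
    """Concatenate all manifest files matching `pattern` into `out_path`."""
    paths = sorted(glob.glob(pattern))
    with open(out_path, "w", encoding="utf-8") as fout:
        for path in paths:
            with open(path, "r", encoding="utf-8") as fin:
                for line in fin:
                    # Skip blank lines and make sure every entry ends
                    # with exactly one newline.
                    if line.strip():
                        fout.write(line.rstrip("\n") + "\n")
    return paths


# Example usage, equivalent to the `cat` command above:
# concat_manifests(f"{data_dir}/noise_data/manifests/train*",
#                  f"{final_data_dir}/noisy_train.json")
```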


##### Test dataset

Let's create a noise-augmented evaluation dataset from the AN4 test dataset by adding noise at 5 dB SNR with the same NeMo script.

In [None]:
# Data augmentation - Add noise to test set.
run = f"python /NeMo/scripts/dataset_processing/add_noise.py \
    --input_manifest={test_manifest} \
    --noise_manifest={test_noise_manifest} \
    --snrs=5 \
    --out_dir={final_data_dir}"

!{run}

print("Noise-augmented testing dataset created at", final_data_dir+"/test_manifest")

**Noise-augmented training manifest and data are created at `{working_dir}/noise_data/noisy_train.json` and `{working_dir}/noise_data/train_manifest` respectively.**  
**Noise-augmented testing manifest and data are created at `{working_dir}/noise_data/manifests/test_manifest_test_noise_5db.json` and `{working_dir}/noise_data/test_manifest` respectively.**  

With that, step 1 of 3, the data preprocessing step, is complete.  

Now onto the next steps.

### Step 2. Fine-tuning the Citrinet model with TAO.
Proceed to [this tutorial](./asr-python-advanced-finetune-am-citrinet-tao-finetuning.ipynb) to fine-tune the Citrinet model with TAO.

### Step 3. Deploying the fine-tuned Citrinet TAO model on the Riva Speech Skills server.
Proceed to [this tutorial](./asr-python-advanced-finetune-am-citrinet-tao-deployment.ipynb) to deploy the fine-tuned Citrinet TAO model on the Riva Speech Skills server for inference.