<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/riva_asr_asr-python-advanced-finetune-am-citrinet-tao-finetuning/nvidia_logo.png" style="width: 90px; float: right;">

# How to fine-tune a Riva ASR Acoustic Model (Citrinet) with TAO Toolkit
This tutorial walks you through how to fine-tune a Riva ASR acoustic model (Citrinet) with TAO Toolkit.

## Overview

In this tutorial, we are going to discuss the Citrinet model, which is an end-to-end ASR model that takes in audio and produces text.

Citrinet is a descendant of QuartzNet that adds squeeze-and-excitation (SE) blocks and sub-word tokenization, and it achieves better accuracy and performance than QuartzNet.

![CitriNet with CTC](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/_images/citrinet_vertical.png)

---
## ASR using TAO

The TAO launcher uses Docker containers under the hood, and **for our data and results directory to be visible to Docker, they need to be mapped**. The launcher can be configured using the config file `~/.tao_mounts.json`. Apart from the mounts, you can also configure additional options like the environment variables and the amount of shared memory available to the TAO launcher. <br>

`IMPORTANT NOTE:` The following code creates a sample `~/.tao_mounts.json`  file. Here, we can map directories in which we save the data, specs, results, and cache. You should configure it for your specific use case so these directories are correctly visible to the Docker container.

In [None]:
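# Working directory for this tutorial.
# An absolute path is used because Docker bind mounts expect absolute host paths.
import os
WORKING_DIR = os.path.abspath('asr_am_finetuning')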

# Defining paths on the local host machine
%env HOST_DATA_DIR = {WORKING_DIR}/data
%env HOST_SPECS_DIR = {WORKING_DIR}/specs
%env HOST_RESULTS_DIR = {WORKING_DIR}/results

In [None]:
# Creating directories on the local host machine
! mkdir -p $WORKING_DIR
! mkdir -p $HOST_DATA_DIR
! mkdir -p $HOST_SPECS_DIR
! mkdir -p $HOST_RESULTS_DIR

In [None]:
# Mapping up the local directories to the TAO docker.
import json
import os
mounts_file = os.path.expanduser("~/.tao_mounts.json")
tao_configs = {
   "Mounts":[
       {
           "source": os.environ["HOST_DATA_DIR"],
           "destination": "/data"
       },
       {
           "source": os.environ["HOST_SPECS_DIR"],
           "destination": "/specs"
       },
       {
           "source": os.environ["HOST_RESULTS_DIR"],
           "destination": "/results"
       },
       {
           "source": os.path.expanduser("~/.cache"),
           "destination": "/root/.cache"
       }
   ],
   "DockerOptions": {
        "shm_size": "128G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
         }
   }
}
# Writing the mounts file.
with open(mounts_file, "w") as mfile:
    json.dump(tao_configs, mfile, indent=4)

In [None]:
!cat ~/.tao_mounts.json
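
Before launching any task, it can help to sanity-check the mounts configuration. The following is a minimal sketch and not part of the TAO launcher: `check_mounts` is a hypothetical helper, and it assumes that every mount `source` should be an absolute path to an existing host directory.

```python
import os
import tempfile

def check_mounts(config):
    """Return a list of human-readable problems found in a mounts config dict."""
    problems = []
    for mount in config.get("Mounts", []):
        source = mount.get("source", "")
        if not os.path.isabs(source):
            problems.append(f"source is not an absolute path: {source!r}")
        elif not os.path.isdir(source):
            problems.append(f"source directory does not exist: {source!r}")
    return problems

# Demonstration on a throwaway config (the paths here are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    sample = {"Mounts": [
        {"source": tmp, "destination": "/data"},
        {"source": "relative/specs", "destination": "/specs"},
    ]}
    sample_problems = check_mounts(sample)
    for problem in sample_problems:
        print(problem)
```

To check the real file, load `~/.tao_mounts.json` with `json.load` and pass the resulting dictionary to the helper.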

You can check the versions of the Docker images and the tasks that each one performs by issuing `tao --help` or:

In [None]:
! tao info --verbose

### Set Relevant Paths

In [None]:
# NOTE: The following paths are set from the perspective of the TAO Docker.

# The data is saved here:
DATA_DIR = "/data"
SPECS_DIR = "/specs"
RESULTS_DIR = "/results"

# Set the encryption key and use the same key for all commands.
KEY = 'tlt_encode'

The command structure for the TAO interface can be broken down as follows: `tao <task name> <subcommand>` <br> 

Let's see this in further detail.

---
### Downloading Specs
TAO's conversational AI toolkit works off of spec files, which make it easy to edit hyperparameters on the fly. You can modify or rewrite these specs, or override individual values through the launcher. To download the default spec files, use the `download_specs` command.<br>

The `-o` argument indicates the folder where the default specification files will be downloaded. The `-r` argument instructs the script on where to save the logs. **Ensure the `-o` points to an empty folder.**

In [None]:
# Delete the specs directory if it is already there to avoid errors
! rm -rf $HOST_SPECS_DIR/speech_to_text_citrinet
! tao speech_to_text_citrinet download_specs \
    -r $RESULTS_DIR/speech_to_text_citrinet \
    -o $SPECS_DIR/speech_to_text_citrinet

---
### Download Data

In this tutorial, we will use the Nigerian English speech dataset to evaluate and fine-tune our acoustic model. The dataset is available [here](https://www.openslr.org/70/). It contains transcribed high-quality audio of Nigerian English sentences recorded by volunteers in Lagos, Nigeria, and in London.

Let's download it.

In [None]:
# Download each archive only if it is not already present
for archive in ['en_ng_female.zip', 'en_ng_male.zip']:
    if os.path.exists(os.path.join(os.environ["HOST_DATA_DIR"], archive)):
        print(f"{archive} exists, skipping download")
    else:
        print(f"{archive} does not exist, downloading")
        !wget 'https://www.openslr.org/resources/70/{archive}' -P $HOST_DATA_DIR

Unzip the dataset.

In [None]:
# Extract the finetuning data
# Ensure that the unzip utility is available. If not, install it.
!unzip -nq $HOST_DATA_DIR/en_ng_female.zip -d $HOST_DATA_DIR/en_ng_female
!mv $HOST_DATA_DIR/en_ng_female/line_index.tsv $HOST_DATA_DIR/en_ng_female/line_index_female.tsv
!unzip -nq $HOST_DATA_DIR/en_ng_male.zip -d $HOST_DATA_DIR/en_ng_male
!mv $HOST_DATA_DIR/en_ng_male/line_index.tsv $HOST_DATA_DIR/en_ng_male/line_index_male.tsv
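
If the `unzip` utility is not available, the archives can also be extracted from Python. The sketch below uses the standard-library `zipfile` module and mimics the "never overwrite" behavior of `unzip -n`; `extract_archive` is a hypothetical helper, and the demonstration runs on a tiny throwaway archive rather than on the real dataset.

```python
import os
import tempfile
import zipfile

def extract_archive(zip_path, dest_dir):
    """Extract zip_path into dest_dir, skipping members that already exist."""
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            target = os.path.join(dest_dir, member)
            if not os.path.exists(target):
                archive.extract(member, dest_dir)
                extracted.append(member)
    return extracted

# Demonstration with a tiny throwaway archive (the contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("line_index.tsv", "clip_0001\tsample transcript\n")
    first_run = extract_archive(demo_zip, os.path.join(tmp, "out"))
    second_run = extract_archive(demo_zip, os.path.join(tmp, "out"))
    print(first_run, second_run)
```

The second call extracts nothing because the file already exists, so the helper is safe to re-run.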

---
### Pre-Processing

The Nigerian-English speech dataset contains transcripts in the format `(fileID) <t> <s> transcript </s>`, where:
1. `(fileID)` - denotes the name of the .wav file corresponding to this transcript
2. `<t>` - denotes a tab character
3. `<s>` - denotes the start of the transcript
4. `</s>` - denotes the end of the transcript

The audio files are in `.wav` format. The dataset also needs to be split into a fine-tuning set and a validation set. We'll use a 90:10 split.

Let's define a function to extract the relevant information from the `.tsv` metadata files included with this dataset.

In [None]:
import os
import wave

def process_en_ng_tsvs(host_data_dir, data_dir):
    genders = ['female','male']
    entries = []
    # Extract the relevant information from the tsv files
    for gender in genders: 
        dataset  = f'en_ng_{gender}'
        tsv_name = f'line_index_{gender}.tsv'
        tsv_file = os.path.join(host_data_dir, dataset, tsv_name)
        with open(tsv_file, encoding='utf-8') as fin:
            for line in fin:
                label, text = line[: line.index("\t")], line[line.index("\t") + 1 :]
                speaker_id  = label.split('_')[1]
                host_wav_file = os.path.join(host_data_dir, dataset, label + '.wav')
                wav_file = os.path.join(data_dir, dataset, label + '.wav')
                transcript_text = text.lower().strip()

                # Read the duration (in seconds) from the WAV header
                with wave.open(host_wav_file, 'r') as wf:
                    frames, rate = wf.getnframes(), wf.getframerate()
                duration = round(frames / float(rate), 4)
                
                entry = {}
                entry['audio_filepath'] = wav_file
                entry['duration'] = float(duration)
                entry['text'] = transcript_text
                entry['gender'] = gender
                entry['speaker_id'] = speaker_id
                entries.append(entry)
    return entries

In TAO/NeMo format, the dataset consists of a set of utterances in individual audio files (.wav) and a manifest that describes the dataset, with information about one utterance per line.<br>
Each line of the manifest should be in the following format:

```
{"audio_filepath": "/path/to/audio.wav", "text": "the transcription of the utterance", "duration": 23.147}
```

The `audio_filepath` field should provide an absolute path to the .wav file corresponding to the utterance. The `text` field should contain the full transcript for the utterance, and the `duration` field should reflect the duration of the utterance in seconds.

Other metadata fields, such as `gender` and `speaker_id`, can be added to the manifest, but they are not used when fine-tuning the acoustic model.
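
As a quick check of the format described above, the following sketch validates a single manifest line. It is an illustration only: `validate_manifest_line` and `REQUIRED_FIELDS` are hypothetical names, and the check covers just the three fields named in the text.

```python
import json

# Fields described above, with the Python types we expect after JSON decoding.
REQUIRED_FIELDS = {"audio_filepath": str, "text": str, "duration": (int, float)}

def validate_manifest_line(line):
    """Return a list of problems for one manifest line (empty if it looks fine)."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg}"]
    if not isinstance(entry, dict):
        return ["line is not a JSON object"]
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return problems

good = '{"audio_filepath": "/path/to/audio.wav", "text": "the transcription of the utterance", "duration": 23.147}'
bad = '{"audio_filepath": "/path/to/audio.wav", "duration": "23.147"}'
print(validate_manifest_line(good))
print(validate_manifest_line(bad))
```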

We will define a function to generate the manifest files (`en_ng_ft_manifest.json` and `en_ng_val_manifest.json`) from the `.tsv` metadata files included with this dataset.

In [None]:
import json
import random

def generate_en_ng_manifest(host_data_dir, data_dir, random_seed=0, val_split=0.1):
    # Extract the relevant information from the tsv files
    entries = process_en_ng_tsvs(host_data_dir, data_dir)
    # Generate the manifest files
    # Set the random seed for reproducibility
    random.seed(random_seed)
    random.shuffle(entries)
    # Hold out the last val_split fraction of the shuffled entries for validation.
    # An explicit split index also behaves correctly when num_val_entries is 0.
    num_val_entries = int(val_split * len(entries))
    split_index = len(entries) - num_val_entries
    ft_manifest_file  = os.path.join(host_data_dir, 'en_ng_ft_manifest.json')
    val_manifest_file = os.path.join(host_data_dir, 'en_ng_val_manifest.json')
    with open(ft_manifest_file, 'w') as fout:
        for m in entries[:split_index]:
            fout.write(json.dumps(m) + '\n')
    with open(val_manifest_file, 'w') as fout:
        for m in entries[split_index:]:
            fout.write(json.dumps(m) + '\n')

Generate the manifest files for the Nigerian English Speech dataset.

In [None]:
generate_en_ng_manifest(os.environ["HOST_DATA_DIR"], DATA_DIR)
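
To confirm that the split looks reasonable, you can count the utterances and total audio duration in each manifest. The sketch below is an illustration, not part of TAO: `summarize_manifest` is a hypothetical helper, and the demonstration uses a small made-up manifest rather than the real files.

```python
import json
import os
import tempfile

def summarize_manifest(path):
    """Return (number of utterances, total duration in hours) for a manifest file."""
    count, seconds = 0, 0.0
    with open(path, encoding="utf-8") as fin:
        for line in fin:
            if line.strip():
                entry = json.loads(line)
                count += 1
                seconds += entry["duration"]
    return count, seconds / 3600.0

# Demonstration on a made-up manifest with three utterances.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_manifest.json")
    with open(demo_path, "w", encoding="utf-8") as fout:
        for duration in (1800.0, 900.0, 900.0):
            entry = {"audio_filepath": "/data/x.wav", "text": "hi", "duration": duration}
            fout.write(json.dumps(entry) + "\n")
    demo_count, demo_hours = summarize_manifest(demo_path)
    print(demo_count, demo_hours)
```

Calling the helper on the two generated manifest files in `HOST_DATA_DIR` should show roughly a 90:10 ratio of utterance counts.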

Let's listen to a sample audio file.

In [None]:
# Change path of the file here to listen to some other audio
import os
import IPython.display as ipd
path = os.environ["HOST_DATA_DIR"] + '/en_ng_female/ngf_05223_00457923143.wav'
ipd.Audio(path)

---
### Finetuning 

#### Create Tokenizer

Before we can fine-tune, we need to tokenize the text.
TAO implements two subword tokenization techniques: WordPiece (WP) and SentencePiece (SP).<br>
For SentencePiece, TAO also lets you choose among four types: unigram, bpe, char, and word.

Subword tokenization creates a subword vocabulary for the text. The core idea is that frequently occurring words should be in the vocabulary, whereas rare words should be split into frequent subwords. For example, the word "refactoring" can be split into "re", "factor", and "ing".
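
The "refactoring" split can be reproduced with a toy segmenter. This is only a sketch: it uses greedy longest-match lookup against a hand-written vocabulary, which is a simplification and not how the unigram SentencePiece model scores segmentations. Both `greedy_segment` and `toy_vocab` are hypothetical names.

```python
def greedy_segment(word, vocab):
    """Split word into the longest vocabulary pieces found from left to right."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary piece matches here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

toy_vocab = {"re", "factor", "ing"}
print(greedy_segment("refactoring", toy_vocab))  # ['re', 'factor', 'ing']
```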

For training Citrinet, we use the `create_tokenizer` command to create the tokenizer that generates the unigram SP subword vocabulary. <br>
The `create_tokenizer.yaml` contains the following specifications for tokenization:
```
vocab_size: 1024
tokenizer:
    tokenizer_type: "spe"
    spe_type: "unigram"
    spe_character_coverage: 1.0
    lower_case: False
```

BPE relies on a pre-tokenizer that splits the training data into words. Pre-tokenization can be as simple as splitting on spaces.<br>
WordPiece is similar to BPE in that it first adds all the characters and symbols to its base vocabulary. The difference between BPE and WordPiece lies in how the symbol pairs are chosen for addition to the vocabulary: BPE merges the most frequent pair, whereas WordPiece merges the pair that most increases the likelihood of the training data.
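
To make the BPE pair-selection criterion concrete, the sketch below counts adjacent symbol pairs in a toy corpus and picks the most frequent one, which is the pair BPE would merge first. It shows a single selection step only, not a full BPE trainer; the word counts and the name `count_pairs` are assumptions for illustration.

```python
from collections import Counter

def count_pairs(word_freqs):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

# Toy corpus: each word is a tuple of characters mapped to its frequency.
toy_corpus = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}
pair_counts = count_pairs(toy_corpus)
best_pair, best_count = pair_counts.most_common(1)[0]
print(best_pair, best_count)  # ('u', 'g') 20
```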

Unigram tokenization also starts with a desired vocabulary size. However, unlike the previous two approaches, it does not start from a base vocabulary of characters only. Instead, the base vocabulary contains all the words and symbols, and it is progressively trimmed down until it reaches the desired size.

All the tokenizers above assume that spaces separate words. This holds for most languages, but not for languages such as Chinese and Japanese. SentencePiece does not treat the space as a separator; instead, it takes the string as input in its original raw form, including all spaces. It then uses BPE or unigram as its tokenizer to construct the vocabulary.
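
SentencePiece keeps spaces in the raw input by replacing them with the meta symbol "▁" (U+2581), which makes tokenization reversible. The sketch below imitates only that whitespace handling in plain Python; it does not call the SentencePiece library, and the helper names and the way the string is cut into pieces are arbitrary choices for illustration.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" meta symbol that SentencePiece uses for whitespace

def mark_whitespace(text):
    """Prefix the text with the boundary symbol and replace every space with it."""
    return WORD_BOUNDARY + text.replace(" ", WORD_BOUNDARY)

def restore_text(pieces):
    """Join pieces, turn boundary symbols back into spaces, and drop the prefix."""
    joined = "".join(pieces)
    return joined.replace(WORD_BOUNDARY, " ")[1:]

marked = mark_whitespace("hello world")
pieces = [marked[:4], marked[4:6], marked[6:]]  # an arbitrary split, for illustration
print(pieces)
print(restore_text(pieces))
```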

Feel free to read [Hugging Face's tokenizer summary](https://huggingface.co/docs/transformers/tokenizer_summary) to learn more about tokenization algorithms.

In [None]:
!tao speech_to_text_citrinet create_tokenizer \
-e $SPECS_DIR/speech_to_text_citrinet/create_tokenizer.yaml \
-r $RESULTS_DIR/citrinet/create_tokenizer \
manifests=$DATA_DIR/en_ng_ft_manifest.json \
output_root=$RESULTS_DIR/ \
vocab_size=55 # to create an apt vocab for acoustic model training

Now that we have the data and the tokenizer ready, let's download the pre-trained Citrinet checkpoint that we will use for fine-tuning. We will download the ASR model [Citrinet-1024](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_en_us_citrinet), which is used in the Riva ASR speech skill.


In [None]:
# Checking if the checkpoint exists, otherwise download it
if os.path.exists(os.environ["HOST_RESULTS_DIR"] + '/speechtotext_en_us_citrinet_vtrainable_v3.0/'):
    print("Checkpoint exists, skipping download")
else:
    print("Checkpoint does not exist, downloading")
    ! ngc registry model download-version "nvidia/tao/speechtotext_en_us_citrinet:trainable_v3.0"
    ! mv speechtotext_en_us_citrinet_vtrainable_v3.0/ $HOST_RESULTS_DIR/

Note: The default fine-tune spec file (`$SPECS_DIR/speech_to_text_citrinet/finetune.yaml`) contains settings for fine-tuning the English acoustic model that we just downloaded to the Russian language (also known as language adaptation). Because the Nigerian English speech dataset is an English ASR dataset, we will use a modified spec file to fine-tune the model for English.

Here is the minimal spec file that we will use for finetuning.

In [None]:
%%writefile finetune_en.yaml

# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
# TLT spec file for fine-tuning a previously trained CTC-based ASR model on the Nigerian English dataset.

trainer:
  max_epochs: 3   # This is low for demo purposes

tlt_checkpoint_interval: 1

# Whether or not to change the decoder vocabulary.
# Note that this MUST be set if the labels change, e.g. to a different language's character set
# or if additional punctuation characters are added.
change_vocabulary: false   # CHANGED TO FALSE

tokenizer:
  dir: ???
  type: "bpe"  # Can be either bpe or wpe

# Fine-tuning settings: training dataset
finetuning_ds:
  manifest_filepath: ???
  sample_rate: 16000
  batch_size: 16
  trim_silence: true
  max_duration: 16.7
  shuffle: true
  is_tarred: false
  tarred_audio_filepaths: null

# Fine-tuning settings: validation dataset
validation_ds:
  manifest_filepath: ???
  sample_rate: 16000
  batch_size: 16
  shuffle: false

# Fine-tuning settings: optimizer
optim:
  name: novograd
  lr: 0.001

In [None]:
# Moving the above created specs file
!mv finetune_en.yaml $HOST_SPECS_DIR/
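
The `???` entries in the spec file are placeholders for values that must be supplied, for example as command-line overrides. As a rough aid, the sketch below lists them as dotted keys. It is a naive indentation-based scan, not a YAML parser, and `find_missing_values` is a hypothetical helper; it assumes simple `key: value` lines with space indentation, as in the spec above.

```python
def find_missing_values(yaml_text):
    """Return dotted keys whose value is the '???' placeholder (naive scan)."""
    missing, stack = [], []  # stack holds (indent, key) for enclosing sections
    for raw in yaml_text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments
        if not line.strip() or ":" not in line:
            continue
        indent = len(line) - len(line.lstrip())
        key, _, value = line.strip().partition(":")
        # Leave any sections that are at the same or a deeper indentation level.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        if value.strip() == "???":
            missing.append(".".join([k for _, k in stack] + [key]))
        elif not value.strip():
            stack.append((indent, key))
    return missing

spec_excerpt = """
tokenizer:
  dir: ???
  type: "bpe"
finetuning_ds:
  manifest_filepath: ???
  sample_rate: 16000
validation_ds:
  manifest_filepath: ???
"""
print(find_missing_values(spec_excerpt))
```

The dotted keys it prints are the names to use as overrides when launching the fine-tuning command.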

#### Acoustic model

For finetuning an ASR Citrinet model in TAO, we use the `tao speech_to_text_citrinet finetune` command with the following arguments:

- `-e`: Path to the spec file
- `-g`: Number of GPUs to use
- `-r`: Path to the results folder
- `-m`: Path to the model
- `-k`: User-specified encryption key to use while saving/loading the model
- Any overrides to the spec file. For example, `trainer.max_epochs`.

In [None]:
!tao speech_to_text_citrinet finetune \
     -e $SPECS_DIR/finetune_en.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/speechtotext_en_us_citrinet_vtrainable_v3.0/speechtotext_en_us_citrinet.tlt \
     -r $RESULTS_DIR/citrinet/finetune \
     finetuning_ds.manifest_filepath=$DATA_DIR/en_ng_ft_manifest.json \
     validation_ds.manifest_filepath=$DATA_DIR/en_ng_val_manifest.json \
     trainer.max_epochs=5 \
     finetuning_ds.num_workers=20 \
     validation_ds.num_workers=20 \
     tokenizer.dir=$RESULTS_DIR/tokenizer_spe_unigram_v55
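
The flags-plus-overrides pattern of the command above can also be assembled programmatically, which is convenient when sweeping over settings. The following is a minimal sketch under stated assumptions: `build_tao_command` is a hypothetical helper, it uses only the flags listed above (`-e`, `-g`, `-r`, `-m`, `-k`), and it only builds and prints the command rather than running it.

```python
import shlex

def build_tao_command(task, subcommand, spec, results_dir,
                      model=None, key=None, gpus=None, overrides=None):
    """Build the argument list for a `tao <task> <subcommand>` invocation."""
    argv = ["tao", task, subcommand, "-e", spec, "-r", results_dir]
    if gpus is not None:
        argv += ["-g", str(gpus)]
    if key is not None:
        argv += ["-k", key]
    if model is not None:
        argv += ["-m", model]
    # Spec overrides are passed as trailing key=value arguments.
    for name, value in (overrides or {}).items():
        argv.append(f"{name}={value}")
    return argv

finetune_argv = build_tao_command(
    "speech_to_text_citrinet", "finetune",
    spec="/specs/finetune_en.yaml",
    results_dir="/results/citrinet/finetune",
    model="/results/speechtotext_en_us_citrinet_vtrainable_v3.0/speechtotext_en_us_citrinet.tlt",
    key="tlt_encode",
    gpus=1,
    overrides={
        "trainer.max_epochs": 5,
        "tokenizer.dir": "/results/tokenizer_spe_unigram_v55",
    },
)
print(shlex.join(finetune_argv))
```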

---
### ASR evaluation

Now that we have a model trained, we need to check how well it performs. Let's first evaluate the pre-trained model on this validation set to check the WER.

In [None]:
!tao speech_to_text_citrinet evaluate \
     -e $SPECS_DIR/speech_to_text_citrinet/evaluate.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/speechtotext_en_us_citrinet_vtrainable_v3.0/speechtotext_en_us_citrinet.tlt \
     -r $RESULTS_DIR/citrinet/evaluate-pretrained \
     test_ds.manifest_filepath=$DATA_DIR/en_ng_val_manifest.json

The pre-trained model scores a WER of **20.01** on the validation set.

Word error rate (WER) measures how accurately an ASR system performs. It counts the errors (substituted, deleted, and inserted words) in the transcript produced by the ASR system relative to a human reference transcript, divided by the number of words in the reference. The lower the number, the better.
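
For intuition, WER is commonly computed as (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words and N is the number of words in the reference. The sketch below implements that definition with a word-level edit distance. It is an illustration only: `word_error_rate` is a hypothetical helper, it splits on whitespace without any text normalization, and TAO's own evaluation may differ in such details. The example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the transcripts, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # previous[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

print(word_error_rate("a b c d", "a x c"))  # one substitution + one deletion -> 0.5
```

The function returns a fraction; multiply by 100 to express it as a percentage.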

Now, let's evaluate the finetuned model.

In [None]:
!tao speech_to_text_citrinet evaluate \
     -e $SPECS_DIR/speech_to_text_citrinet/evaluate.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/citrinet/finetune/checkpoints/finetuned-model.tlt \
     -r $RESULTS_DIR/citrinet/evaluate \
     test_ds.manifest_filepath=$DATA_DIR/en_ng_val_manifest.json

You will observe that the fine-tuned model scores a WER of **18.89** on the Nigerian English validation set.<br>
Fine-tuning for just 5 epochs reduced the WER from 20.01 to 18.89, a relative improvement of approximately **5.6%**.

Feel free to fine-tune for more than 5 epochs, which may improve accuracy further.
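
The size of the improvement follows directly from the two WER figures reported above. A short worked calculation, with variable names chosen for illustration:

```python
baseline_wer = 20.01   # pre-trained model, from the evaluation above
finetuned_wer = 18.89  # fine-tuned model, from the evaluation above

absolute_reduction = baseline_wer - finetuned_wer
relative_reduction = absolute_reduction / baseline_wer
print(f"absolute: {absolute_reduction:.2f}, relative: {relative_reduction:.1%}")
```

The absolute reduction is 1.12 WER points, which corresponds to a relative reduction of about 5.6%.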

---
### ASR model export

With TAO, you can also export your model in a format that can be deployed using NVIDIA Riva, a highly performant application framework for multi-modal conversational AI services using GPUs. The same command used for exporting to ONNX can be used here; the only difference is the `export_format` setting in the spec file.

This exported `.riva` model will be used in the next notebook to deploy the ASR pipeline with this customized acoustic model.

#### Export to Riva

In [None]:
!tao speech_to_text_citrinet export \
     -e $SPECS_DIR/speech_to_text_citrinet/export.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/citrinet/finetune/checkpoints/finetuned-model.tlt \
     -r $RESULTS_DIR/citrinet/riva \
     export_format=RIVA \
     export_to=asr-model.riva
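
After the export finishes, you may want to confirm on the host that the `.riva` file was written. The sketch below assumes the file ends up at `citrinet/riva/asr-model.riva` under the host results directory, matching the path given at the end of this tutorial; `describe_export` is a hypothetical helper, and the demonstration uses a temporary directory with a placeholder file.

```python
import os
import tempfile

def describe_export(results_dir,
                    relative_path=os.path.join("citrinet", "riva", "asr-model.riva")):
    """Report whether the exported model file exists under results_dir."""
    path = os.path.join(results_dir, relative_path)
    if not os.path.isfile(path):
        return f"missing: {path}"
    return f"found: {path} ({os.path.getsize(path)} bytes)"

# Demonstration with a temporary directory standing in for the results folder.
with tempfile.TemporaryDirectory() as tmp:
    before = describe_export(tmp)
    os.makedirs(os.path.join(tmp, "citrinet", "riva"))
    with open(os.path.join(tmp, "citrinet", "riva", "asr-model.riva"), "wb") as fout:
        fout.write(b"placeholder")
    after = describe_export(tmp)
    print(before)
    print(after)
```

In the notebook, calling `describe_export(os.environ["HOST_RESULTS_DIR"])` performs the same check against the real results folder.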

#### Export to ONNX (Note: Export to ONNX is not needed for Riva)

In [None]:
!tao speech_to_text_citrinet export \
     -e $SPECS_DIR/speech_to_text_citrinet/export.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/citrinet/finetune/checkpoints/finetuned-model.tlt \
     -r $RESULTS_DIR/citrinet/export \
     export_format=ONNX

---
### ASR Inference using the checkpoint

#### ASR Inference with TAO Toolkit

In this section, we run inference on the `.tlt` checkpoint with TAO Toolkit.
For real-time inference and the best latency, we need to deploy this model on Riva, which is covered in the next tutorial.

In [None]:
# Let's listen to the audio first
path = os.environ["HOST_DATA_DIR"] + '/en_ng_male/ngm_09697_00751039644.wav'
ipd.Audio(path)

In [None]:
# Let's get the ground truth transcript for this sample
import json

def read_manifest(path):
    manifest = []
    with open(path, 'r') as f:
        for line in f:
            line = line.replace("\n", "")
            data = json.loads(line)
            manifest.append(data)
    return manifest

path = os.environ["HOST_DATA_DIR"] + '/en_ng_val_manifest.json'
path_in_manifest = DATA_DIR + '/en_ng_male/ngm_09697_00751039644.wav'

manifest = read_manifest(path)
transcript = [x['text'] for x in manifest if x["audio_filepath"] == path_in_manifest]

print("Ground truth transcript: ", transcript)    

In [None]:
# Predictions using the pre-trained model
!tao speech_to_text_citrinet infer \
     -e $SPECS_DIR/speech_to_text_citrinet/infer.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/speechtotext_en_us_citrinet_vtrainable_v3.0/speechtotext_en_us_citrinet.tlt \
     -r $RESULTS_DIR/citrinet/infer-pretrained \
     file_paths=[$DATA_DIR/en_ng_male/ngm_09697_00751039644.wav]

In [None]:
# Predictions using the finetuned model
!tao speech_to_text_citrinet infer \
     -e $SPECS_DIR/speech_to_text_citrinet/infer.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/citrinet/finetune/checkpoints/finetuned-model.tlt \
     -r $RESULTS_DIR/citrinet/infer \
     file_paths=[$DATA_DIR/en_ng_male/ngm_09697_00751039644.wav]

As you can observe, the transcript predicted by the fine-tuned model is closer to the ground truth transcript. The pre-trained model wrongly predicts the words "shows" and "all" as "shoes" and "old".

You can upload your recorded `.wav` file and provide its path to the `file_paths` argument in the cell above to get the transcribed speech.

---
## What's Next?

Now that we've fine-tuned the Citrinet acoustic model, we can deploy this custom model to NVIDIA Riva.

Make sure to keep the path of `asr-model.riva` handy for deployment, i.e., `asr_am_finetuning/results/citrinet/riva/asr-model.riva`.