<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">

# How to fine-tune a Riva ASR Acoustic Model (Citrinet) with TAO Toolkit
This tutorial walks you through how to fine-tune a Riva ASR acoustic model (Citrinet) with TAO Toolkit. Note that a different tutorial, available [here](https://github.com/nvidia-riva/tutorials/blob/dev/22.04/asr-python-advanced-finetune-am-citrinet-for-noisy-audio-withtao.ipynb), walks through the fine-tuning steps (data preprocessing, fine-tuning, and deployment).

## NVIDIA Riva Overview

NVIDIA Riva is a GPU-accelerated SDK for building speech AI applications that are customized for your use case and deliver real-time performance. <br/>
Riva offers a rich set of speech and natural language understanding services such as:

- Automatic speech recognition (ASR)
- Text-to-Speech synthesis (TTS)
- A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will fine-tune a Riva ASR acoustic model (Citrinet) with TAO Toolkit. <br> 
To understand the basics of Riva ASR APIs, refer to [Getting started with Riva ASR in Python](https://github.com/nvidia-riva/tutorials/blob/dev/22.04/asr-python-basics.ipynb). <br>

For more information about Riva, refer to the [Riva developer documentation](https://developer.nvidia.com/riva).

## NVIDIA TAO Toolkit
[Train Adapt Optimize (TAO) Toolkit](https://developer.nvidia.com/tao-toolkit) is a Python-based AI toolkit for transfer learning that takes purpose-built pre-trained AI models and customizes them with your own data. TAO enables developers, researchers, and software partners with limited AI expertise to create highly accurate AI models for production deployments. TAO follows a zero-coding paradigm: there is no need to write any code to train models. Training can be done by simply running a few commands with the TAO command-line interface.

![Train Adapt Optimize (TAO) Toolkit](https://developer.nvidia.com/sites/default/files/akamai/embedded-transfer-learning-toolkit-software-stack-1200x670px.png)

Transfer learning extracts learned features from an existing neural network and transfers them to a new one. It is often used when creating a large training dataset is not feasible. The goal of this toolkit is to reduce an 80-hour workload to an 8-hour workload, which lets data scientists run considerably more train-test iterations in the same time frame.

Let us see this in action with the use case for the ASR acoustic model.

## Automatic Speech Recognition (ASR)

Automatic Speech Recognition (ASR) is often the first step in building a conversational AI model. An ASR model converts audible speech into text. The main metric for these models is the Word Error Rate (WER) of the transcription, and the aim is to make it as low as possible. Simply put, the goal is to take an audio file and transcribe it accurately.
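
To make the WER metric concrete, here is a minimal sketch of how it is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a predicted transcript, divided by the number of reference words. The function name `word_error_rate` is chosen for illustration only; it is not part of TAO or Riva, and the toolkit's own evaluation may differ in details such as text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("five ten", "five ten"))  # identical transcripts
print(word_error_rate("five ten", "five"))      # one of two words missing
```

A WER of 0 means a perfect transcript. Values of 1 or more are possible when the prediction has little in common with the reference.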

In this tutorial, we discuss the Citrinet model, an end-to-end convolutional Connectionist Temporal Classification (CTC) ASR model that takes in audio and produces text. Citrinet is a descendant of QuartzNet that features 1D time-channel separable convolutions, squeeze-and-excitation (SE) blocks, and sub-word tokenization, and it offers better accuracy and performance than QuartzNet. The following diagram portrays the Citrinet architecture, which consists of a prolog layer, mega-blocks built from residual blocks, and epilog layers. More information about ASR models can be found [here](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/models/asr.html).

![CitriNet with CTC](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/_images/citrinet_vertical.png)

---
## ASR using TAO

### Installing and setting up TAO

Install TAO inside a Python virtual environment. We recommend performing this step first and then launching the tutorial from the virtual environment.

In addition to installing the TAO Python package, ensure you meet the following software requirements:

1. `python` 3.6.9
2. `docker-ce` > 19.03.5
3. `docker-API` 1.40
4. `nvidia-container-toolkit` > 1.3.0-1
5. `nvidia-container-runtime` > 3.4.0-1
6. `nvidia-docker2` > 2.5.0-1
7. `nvidia-driver` >= 455.23
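
When checking these requirements by hand, comparing dotted version strings can be error-prone (for example, `19.03.5` versus `20.10.7`). The following is a rough sketch of a numeric comparison helper. The names `version_tuple` and `satisfies` are hypothetical, the installed versions in the example are made up, and the parsing is deliberately simple (it only looks at the digit groups), so treat it as an aid and not as a full version parser.

```python
import re


def version_tuple(version: str) -> tuple:
    """Turn '19.03.5' or '1.3.0-1' into a tuple of integers, e.g. (19, 3, 5)."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def satisfies(installed: str, operator: str, required: str) -> bool:
    """Check an installed version against a requirement such as '> 19.03.5'."""
    have, need = version_tuple(installed), version_tuple(required)
    if operator == ">":
        return have > need
    if operator == ">=":
        return have >= need
    if operator == "==":
        return have == need
    raise ValueError(f"unsupported operator: {operator}")


# Hypothetical installed versions, checked against two entries of the list above.
print(satisfies("20.10.7", ">", "19.03.5"))   # docker-ce
print(satisfies("455.23", ">=", "455.23"))    # nvidia-driver
```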

Installing TAO is a simple `pip` install.

In [1]:
! pip install nvidia-pyindex
! pip install nvidia-tao

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
You should consider upgrading via the '/home/ck/miniconda3/envs/py38/bin/python3.8 -m pip install --upgrade pip' command.
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
You should consider upgrading via the '/home/ck/miniconda3/envs/py38/bin/python3.8 -m pip install --upgrade pip' command.

After installing TAO, the next step is to set up the mounts for TAO. The TAO launcher uses Docker containers under the hood, and **for our data and results directories to be visible to Docker, they need to be mapped**. The launcher can be configured using the config file `~/.tao_mounts.json`. Apart from the mounts, you can also configure additional options, such as environment variables and the amount of shared memory available to the TAO launcher. <br>

`IMPORTANT NOTE:` The following code creates a sample `~/.tao_mounts.json`  file. Here, we can map directories in which we save the data, specs, results, and cache. You should configure it for your specific use case so these directories are correctly visible to the Docker container.

In [2]:
# please define these paths on your local host machine
%env HOST_DATA_DIR=data
%env HOST_SPECS_DIR=specs
%env HOST_RESULTS_DIR=results

env: HOST_DATA_DIR=data
env: HOST_SPECS_DIR=specs
env: HOST_RESULTS_DIR=results


In [3]:
! mkdir -p $HOST_DATA_DIR
! mkdir -p $HOST_SPECS_DIR
! mkdir -p $HOST_RESULTS_DIR

In [4]:
# Mapping up the local directories to the TAO docker.
import json
import os
mounts_file = os.path.expanduser("~/.tao_mounts.json")
tlt_configs = {
   "Mounts":[
       {
           "source": os.environ["HOST_DATA_DIR"],
           "destination": "/data"
       },
       {
           "source": os.environ["HOST_SPECS_DIR"],
           "destination": "/specs"
       },
       {
           "source": os.environ["HOST_RESULTS_DIR"],
           "destination": "/results"
       },
       {
           "source": os.path.expanduser("~/.cache"),
           "destination": "/root/.cache"
       }
   ],
   "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
         }
   }
}
# Writing the mounts file.
with open(mounts_file, "w") as mfile:
    json.dump(tlt_configs, mfile, indent=4)

In [5]:
!cat ~/.tao_mounts.json

{
    "Mounts": [
        {
            "source": "data",
            "destination": "/data"
        },
        {
            "source": "specs",
            "destination": "/specs"
        },
        {
            "source": "results",
            "destination": "/results"
        },
        {
            "source": "/home/ck/.cache",
            "destination": "/root/.cache"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        }
    }
}
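
Note that the file above records the `source` entries exactly as given in the environment variables, which in this run are relative paths (`data`, `specs`, `results`). Docker bind mounts are usually specified with absolute host paths, so a variant that resolves the paths first may be more robust; this is an assumption and not something the run above demonstrates. The sketch below also shows the optional `"user": "UID:GID"` entry that the launcher's own log message (visible in the command outputs further down) suggests adding to `DockerOptions`. The helper name `build_tao_mounts` and its `keep_host_user` flag are hypothetical.

```python
import json
import os


def build_tao_mounts(host_dirs, keep_host_user=False):
    """Build a mounts configuration with absolute host paths.

    host_dirs maps a host directory to its destination inside the container.
    """
    mounts = [
        {
            # Expand "~" and resolve relative paths against the current directory.
            "source": os.path.abspath(os.path.expanduser(source)),
            "destination": destination,
        }
        for source, destination in host_dirs.items()
    ]
    docker_options = {
        "shm_size": "16G",
        "ulimits": {"memlock": -1, "stack": 67108864},
    }
    if keep_host_user:
        # Run the container as the current host user (Unix only).
        docker_options["user"] = f"{os.getuid()}:{os.getgid()}"
    return {"Mounts": mounts, "DockerOptions": docker_options}


config = build_tao_mounts(
    {
        "data": "/data",
        "specs": "/specs",
        "results": "/results",
        "~/.cache": "/root/.cache",
    }
)
print(json.dumps(config, indent=4))
```

The sketch only prints the configuration. Writing it to `~/.tao_mounts.json` works the same way as in the cell above, with `json.dump`.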

You can check the Docker image versions and the tasks each image supports by issuing `tao --help` or:

In [6]:
! tao info --verbose

Configuration of the TAO Toolkit Instance

dockers: 		
	nvstaging/tao/tao-toolkit-tf: 			
		v3.22.04-tf1.15.4-60-dev-cuda11.4: 				
			docker_registry: nvcr.io
			tasks: 
				1. detectnet_v2
		v3.22.04-tf1.15.5-382-dev-cuda11.6: 				
			docker_registry: nvcr.io
			tasks: 
				1. augment
				2. bpnet
				3. classification
				4. dssd
				5. faster_rcnn
				6. emotionnet
				7. efficientdet
				8. fpenet
				9. gazenet
				10. gesturenet
				11. heartratenet
				12. lprnet
				13. mask_rcnn
				14. multitask_classification
				15. retinanet
				16. ssd
				17. unet
				18. yolo_v3
				19. yolo_v4
				20. yolo_v4_tiny
				21. converter
	nvstaging/tao/tao-toolkit-pyt: 			
		v4.22.03-1263-dev-cuda11.4: 				
			docker_registry: nvcr.io
			tasks: 
				1. speech_to_text
				2. speech_to_text_citrinet
				3. speech_to_text_conformer
				4. action_recognition
				5. pointpillars
				6. pose_classification
				7. spectro_gen
				8. vocoder
	nvidia/tao/t

### Set Relevant Paths

In [7]:
# NOTE: The following paths are set from the perspective of the TAO Docker.

# The data is saved here:
DATA_DIR = "/data"
SPECS_DIR = "/specs"
RESULTS_DIR = "/results"

# Set your encryption key and use the same key for all commands.
KEY = 'tlt_encode'

The command structure for the TAO interface can be broken down as follows: `tao <task name> <subcommand>` <br> 
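
As an illustration of this structure, the sketch below assembles a command line from a task name, a subcommand, and a set of flags. The helper `build_tao_command` is hypothetical and only builds the argument list; it does not run anything. The task, subcommand, and flag values are taken from the `download_specs` call used later in this tutorial.

```python
import shlex


def build_tao_command(task, subcommand, flags=None):
    """Return the argument list for `tao <task name> <subcommand> [flags]`."""
    command = ["tao", task, subcommand]
    for flag, value in (flags or {}).items():
        command.extend([flag, str(value)])
    return command


cmd = build_tao_command(
    "speech_to_text_citrinet",
    "download_specs",
    {
        "-r": "/results/speech_to_text_citrinet",
        "-o": "/specs/speech_to_text_citrinet",
    },
)
# Quote each part so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(part) for part in cmd))
```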

Let's see this in further detail.


### Downloading Specs
TAO's conversational AI toolkit works off of spec files, which make it easy to edit hyperparameters on the fly. You can modify or rewrite these specs, or override individual values through the launcher. Download the default spec files with the `download_specs` command.<br>

The `-o` argument indicates the folder where the default specification files will be downloaded. The `-r` argument instructs the script on where to save the logs. **Ensure the `-o` points to an empty folder.**
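
Because `-o` should point to an empty folder, it can help to check the host-side folder before running the command. The following is a small sketch of such a check. The helper name `is_missing_or_empty` is hypothetical, and the path in the example assumes the host directory layout used in this tutorial (`specs` mapped to `/specs`).

```python
from pathlib import Path


def is_missing_or_empty(folder) -> bool:
    """Return True if the folder does not exist yet or contains no entries."""
    path = Path(folder)
    if not path.exists():
        return True
    if not path.is_dir():
        raise NotADirectoryError(f"{path} exists but is not a directory")
    return not any(path.iterdir())


# Host-side counterpart of /specs/speech_to_text_citrinet in this tutorial.
print(is_missing_or_empty("specs/speech_to_text_citrinet"))
```

If the check returns `False`, clear or remove the folder by hand before downloading the specs again.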

In [8]:
# NOTE: if the specs directory already exists, delete it first to avoid errors
! tao speech_to_text_citrinet download_specs \
    -r $RESULTS_DIR/speech_to_text_citrinet \
    -o $SPECS_DIR/speech_to_text_citrinet

2022-05-18 15:44:32,027 [INFO] root: Registry: ['nvcr.io']
2022-05-18 15:44:32,142 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvstaging/tao/tao-toolkit-pyt:v4.22.03-1263-dev-cuda11.4
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ck/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
[NeMo I 2022-05-18 20:44:40 tlt_logging:20] Experiment configuration:
    exp_manager:
      task_name: download_specs
      explicit_log_dir: /results/speech_to_text_citrinet
    source_data_dir: /opt/conda/lib/python3.8/site-packages/conv_ai/asr/speech_to_text_ctc/experiment_specs
    target_data_dir: /specs/speech_to_text_citrinet
    workflow: conv_ai
    
Downloading default specs for conv_ai
[NeMo I 2022-05-18 20:44:40 download_specs:82] Default specification fil

### Download Data

In this tutorial we will use the popular AN4 dataset. Let's download it.

In [9]:
! wget https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz  # for the original source, please visit http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz

Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/ck/.wget-hsts'. HSTS will be disabled.
--2022-05-18 15:44:52--  https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz
Resolving dldata-public.s3.us-east-2.amazonaws.com (dldata-public.s3.us-east-2.amazonaws.com)... 52.219.108.50
Connecting to dldata-public.s3.us-east-2.amazonaws.com (dldata-public.s3.us-east-2.amazonaws.com)|52.219.108.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 64327561 (61M) [application/x-gzip]
Saving to: ‘an4_sphere.tar.gz.3’


2022-05-18 15:45:18 (2.40 MB/s) - ‘an4_sphere.tar.gz.3’ saved [64327561/64327561]



After downloading, untar the dataset and move it to the correct directory.

In [10]:
! tar -xvf an4_sphere.tar.gz 
! mv an4 $HOST_DATA_DIR

an4/
an4/README
an4/etc/
an4/etc/an4_test.fileids
an4/etc/an4.ug.lm
an4/etc/an4.ug.lm.DMP
an4/etc/an4_train.fileids
an4/etc/an4_train.transcription
an4/etc/an4_test.transcription
an4/etc/an4.dic
an4/etc/an4.phone
an4/etc/an4.filler
an4/wav/
an4/wav/an4_clstk/
an4/wav/an4_clstk/fash/
an4/wav/an4_clstk/fash/an251-fash-b.sph
an4/wav/an4_clstk/fash/an253-fash-b.sph
an4/wav/an4_clstk/fash/an254-fash-b.sph
an4/wav/an4_clstk/fash/an255-fash-b.sph
an4/wav/an4_clstk/fash/cen1-fash-b.sph
an4/wav/an4_clstk/fash/cen2-fash-b.sph
an4/wav/an4_clstk/fash/cen4-fash-b.sph
an4/wav/an4_clstk/fash/cen5-fash-b.sph
an4/wav/an4_clstk/fash/cen7-fash-b.sph
an4/wav/an4_clstk/fbbh/
an4/wav/an4_clstk/fbbh/an86-fbbh-b.sph
an4/wav/an4_clstk/fbbh/an87-fbbh-b.sph
an4/wav/an4_clstk/fbbh/an88-fbbh-b.sph
an4/wav/an4_clstk/fbbh/an89-fbbh-b.sph
an4/wav/an4_clstk/fbbh/an90-fbbh-b.sph
an4/wav/an4_clstk/fbbh/cen1-fbbh-b.sph
an4/wav/an4_clstk/fbbh/cen2-fbbh-b.sph
an4/wav/an4_clstk/fbbh/cen3-fbbh-b.sph
an4/wav/an4_clstk/fbbh/ce

an4/wav/an4_clstk/fplp/cen4-fplp-b.sph
an4/wav/an4_clstk/fplp/cen5-fplp-b.sph
an4/wav/an4_clstk/fplp/cen6-fplp-b.sph
an4/wav/an4_clstk/fplp/cen7-fplp-b.sph
an4/wav/an4_clstk/fplp/cen8-fplp-b.sph
an4/wav/an4_clstk/fsaf2/
an4/wav/an4_clstk/fsaf2/an296-fsaf2-b.sph
an4/wav/an4_clstk/fsaf2/an297-fsaf2-b.sph
an4/wav/an4_clstk/fsaf2/an298-fsaf2-b.sph
an4/wav/an4_clstk/fsaf2/an299-fsaf2-b.sph
an4/wav/an4_clstk/fsaf2/an300-fsaf2-b.sph
an4/wav/an4_clstk/fsaf2/cen1-fsaf2-b.sph
an4/wav/an4_clstk/fsaf2/cen2-fsaf2-b.sph
an4/wav/an4_clstk/fsaf2/cen3-fsaf2-b.sph
an4/wav/an4_clstk/fsaf2/cen4-fsaf2-b.sph
an4/wav/an4_clstk/fsaf2/cen5-fsaf2-b.sph
an4/wav/an4_clstk/fsaf2/cen6-fsaf2-b.sph
an4/wav/an4_clstk/fsaf2/cen7-fsaf2-b.sph
an4/wav/an4_clstk/fsaf2/cen8-fsaf2-b.sph
an4/wav/an4_clstk/fsrb/
an4/wav/an4_clstk/fsrb/an166-fsrb-b.sph
an4/wav/an4_clstk/fsrb/an167-fsrb-b.sph
an4/wav/an4_clstk/fsrb/an168-fsrb-b.sph
an4/wav/an4_clstk/fsrb/an169-fsrb-b.sph
an4/wav/an4_clstk/fsrb/an170-fsrb-b.sph
an4/wav/an4_clstk/

an4/wav/an4_clstk/meht/cen3-meht-b.sph
an4/wav/an4_clstk/meht/cen4-meht-b.sph
an4/wav/an4_clstk/meht/cen5-meht-b.sph
an4/wav/an4_clstk/meht/cen6-meht-b.sph
an4/wav/an4_clstk/meht/cen7-meht-b.sph
an4/wav/an4_clstk/meht/cen8-meht-b.sph
an4/wav/an4_clstk/mema/
an4/wav/an4_clstk/mema/an286-mema-b.sph
an4/wav/an4_clstk/mema/an287-mema-b.sph
an4/wav/an4_clstk/mema/an288-mema-b.sph
an4/wav/an4_clstk/mema/an289-mema-b.sph
an4/wav/an4_clstk/mema/an290-mema-b.sph
an4/wav/an4_clstk/mema/cen1-mema-b.sph
an4/wav/an4_clstk/mema/cen2-mema-b.sph
an4/wav/an4_clstk/mema/cen3-mema-b.sph
an4/wav/an4_clstk/mema/cen4-mema-b.sph
an4/wav/an4_clstk/mema/cen5-mema-b.sph
an4/wav/an4_clstk/mema/cen6-mema-b.sph
an4/wav/an4_clstk/mema/cen7-mema-b.sph
an4/wav/an4_clstk/mema/cen8-mema-b.sph
an4/wav/an4_clstk/mewl/
an4/wav/an4_clstk/mewl/an256-mewl-b.sph
an4/wav/an4_clstk/mewl/an257-mewl-b.sph
an4/wav/an4_clstk/mewl/an258-mewl-b.sph
an4/wav/an4_clstk/mewl/an259-mewl-b.sph
an4/wav/an4_clstk/mewl/an260-mewl-b.sph
an4/wa

an4/wav/an4_clstk/mnfe/cen1-mnfe-b.sph
an4/wav/an4_clstk/mnfe/cen2-mnfe-b.sph
an4/wav/an4_clstk/mnfe/cen3-mnfe-b.sph
an4/wav/an4_clstk/mnfe/cen4-mnfe-b.sph
an4/wav/an4_clstk/mnfe/cen5-mnfe-b.sph
an4/wav/an4_clstk/mnfe/cen6-mnfe-b.sph
an4/wav/an4_clstk/mnfe/cen7-mnfe-b.sph
an4/wav/an4_clstk/mnfe/cen8-mnfe-b.sph
an4/wav/an4_clstk/mnjl/
an4/wav/an4_clstk/mnjl/an81-mnjl-b.sph
an4/wav/an4_clstk/mnjl/an82-mnjl-b.sph
an4/wav/an4_clstk/mnjl/an83-mnjl-b.sph
an4/wav/an4_clstk/mnjl/an84-mnjl-b.sph
an4/wav/an4_clstk/mnjl/an85-mnjl-b.sph
an4/wav/an4_clstk/mnjl/cen1-mnjl-b.sph
an4/wav/an4_clstk/mnjl/cen2-mnjl-b.sph
an4/wav/an4_clstk/mnjl/cen3-mnjl-b.sph
an4/wav/an4_clstk/mnjl/cen5-mnjl-b.sph
an4/wav/an4_clstk/mnjl/cen6-mnjl-b.sph
an4/wav/an4_clstk/mnjl/cen7-mnjl-b.sph
an4/wav/an4_clstk/mnjl/cen8-mnjl-b.sph
an4/wav/an4_clstk/mrab/
an4/wav/an4_clstk/mrab/an71-mrab-b.sph
an4/wav/an4_clstk/mrab/an72-mrab-b.sph
an4/wav/an4_clstk/mrab/an73-mrab-b.sph
an4/wav/an4_clstk/mrab/an74-mrab-b.sph
an4/wav/an4_clst

an4/wav/an4test_clstk/fvap/cen6-fvap-b.sph
an4/wav/an4test_clstk/fvap/cen7-fvap-b.sph
an4/wav/an4test_clstk/fvap/cen8-fvap-b.sph
an4/wav/an4test_clstk/marh/
an4/wav/an4test_clstk/marh/an431-marh-b.sph
an4/wav/an4test_clstk/marh/an432-marh-b.sph
an4/wav/an4test_clstk/marh/an433-marh-b.sph
an4/wav/an4test_clstk/marh/an434-marh-b.sph
an4/wav/an4test_clstk/marh/an435-marh-b.sph
an4/wav/an4test_clstk/marh/cen1-marh-b.sph
an4/wav/an4test_clstk/marh/cen2-marh-b.sph
an4/wav/an4test_clstk/marh/cen3-marh-b.sph
an4/wav/an4test_clstk/marh/cen4-marh-b.sph
an4/wav/an4test_clstk/marh/cen5-marh-b.sph
an4/wav/an4test_clstk/marh/cen6-marh-b.sph
an4/wav/an4test_clstk/marh/cen7-marh-b.sph
an4/wav/an4test_clstk/marh/cen8-marh-b.sph
an4/wav/an4test_clstk/mdms2/
an4/wav/an4test_clstk/mdms2/an401-mdms2-b.sph
an4/wav/an4test_clstk/mdms2/an402-mdms2-b.sph
an4/wav/an4test_clstk/mdms2/an403-mdms2-b.sph
an4/wav/an4test_clstk/mdms2/an404-mdms2-b.sph
an4/wav/an4test_clstk/mdms2/an405-mdms2-b.sph
an4/wav/an4test_clst

### Pre-Processing

This step converts the `.sph` files into `.wav` files and splits the data into training and testing sets. It also generates a metadata (manifest) file for each set, to be consumed by the data loader for training and testing.
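
The generated manifest files are not shown in this tutorial. Assuming they follow the JSON-lines convention used by NeMo (one JSON object per line with `audio_filepath`, `duration`, and `text` fields), the sketch below summarizes a manifest in the same spirit as the "Dataset loaded with N files totalling H hours" messages in the training logs. The function name `summarize_manifest` and the sample entries (paths and durations) are made up for illustration.

```python
import json


def summarize_manifest(lines):
    """Count entries and total audio duration (in hours) of a JSON-lines manifest."""
    count = 0
    seconds = 0.0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        count += 1
        seconds += float(entry["duration"])
    return count, seconds / 3600.0


# Made-up sample entries; real manifests are written by dataset_convert.
sample = [
    '{"audio_filepath": "/data/an4_converted/wavs/example-1.wav", "duration": 2.5, "text": "five ten"}',
    '{"audio_filepath": "/data/an4_converted/wavs/example-2.wav", "duration": 3.5, "text": "c h r i s"}',
]
files, hours = summarize_manifest(sample)
print(f"{files} files totalling {hours:.4f} hours")
```

To inspect a real manifest, pass an open file object instead of the sample list, for example `summarize_manifest(open(path))`.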

In [11]:
! tao speech_to_text_citrinet dataset_convert \
    -e $SPECS_DIR/speech_to_text_citrinet/dataset_convert_an4.yaml \
    -r $RESULTS_DIR/citrinet/dataset_convert \
    source_data_dir=$DATA_DIR/an4 \
    target_data_dir=$DATA_DIR/an4_converted

2022-05-18 15:45:19,792 [INFO] root: Registry: ['nvcr.io']
2022-05-18 15:45:19,905 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvstaging/tao/tao-toolkit-pyt:v4.22.03-1263-dev-cuda11.4
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ck/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
    'dataset_convert_an4.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
[NeMo I 2022-05-18 20:45:27 tlt_logging:20] Experiment configuration:
    exp_manager:
      task_name: dataset_convert
      explicit_log_dir: /results/citrinet/dataset_convert
    dataset

Let's listen to a sample audio file.

In [12]:
# Change the file name below to listen to a different sample.
import os
import IPython.display as ipd
path = os.path.join(os.environ["HOST_DATA_DIR"], "an4_converted", "wavs", "an268-mbmg-b.wav")
ipd.Audio(path)

Training commands for Citrinet are similar to those of QuartzNet.

### Training 

#### Create Tokenizer

Before we can do the actual training, we need to pre-process the text. This step, called subword tokenization, creates a subword vocabulary for the text. This differs from Jasper/QuartzNet, where only single characters are elements of the vocabulary; in Citrinet, a subword can be one or multiple characters. We use the `create_tokenizer` command to create the tokenizer that generates the subword vocabulary used in training.
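
To illustrate the difference between character-level and subword vocabularies, here is a toy sketch. It uses a hand-written vocabulary and a greedy longest-match rule, which is not the algorithm that `create_tokenizer` uses (the training logs show a SentencePiece tokenizer); the point is only to show that subword tokens can span several characters and therefore yield shorter token sequences. All names and the vocabulary are hypothetical.

```python
def char_tokenize(text):
    """Character-level tokenization: every character is its own token."""
    return list(text)


def subword_tokenize(text, vocab):
    """Greedy longest-match tokenization against a fixed subword vocabulary.

    Characters not covered by the vocabulary fall back to single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens


toy_vocab = {"th", "ree", "ni", "ne", " "}
print(char_tokenize("three nine"))
print(subword_tokenize("three nine", toy_vocab))
```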

In [13]:
!tao speech_to_text_citrinet create_tokenizer \
-e $SPECS_DIR/speech_to_text_citrinet/create_tokenizer.yaml \
-r $RESULTS_DIR/citrinet/create_tokenizer \
manifests=$DATA_DIR/an4_converted/train_manifest.json \
output_root=$DATA_DIR/an4 \
vocab_size=32

2022-05-18 15:45:58,387 [INFO] root: Registry: ['nvcr.io']
2022-05-18 15:45:58,501 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvstaging/tao/tao-toolkit-pyt:v4.22.03-1263-dev-cuda11.4
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ck/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
    'create_tokenizer.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
[NeMo I 2022-05-18 20:46:05 tlt_logging:20] Experiment configuration:
    exp_manager:
      task_name: create_token
      explicit_log_dir: /results/citrinet/create_tokenizer
    manifests: /

The TAO interface enables you to configure the training parameters from the command-line interface. <br>

TAO replaces the process of opening the training script, finding the parameters of interest (which might be spread across multiple files), and making the needed changes with a simple command-line interface.

For example, if you need to change both the number of epochs and the learning rate, you can add `trainer.max_epochs=10` and `optim.lr=0.02` to the command and train the model. Sample commands are given below.


<b>A list of some of the customizable parameters along with their default values is as follows:</b>

trainer:
<ul>  
  <li>gpus: 1 </li>
  <li>num_nodes: 1 </li>
  <li>max_epochs: 5 </li>
  <li>max_steps: null </li>
  <li>checkpoint_callback: false </li>
</ul>

training_ds:
<ul>  
  <li>sample_rate: 16000 </li>
  <li>batch_size: 32 </li>
  <li>trim_silence: true </li>
  <li>max_duration: 16.7 </li>
  <li>shuffle: true </li>
  <li>is_tarred: false </li>
  <li>tarred_audio_filepaths: null </li>
</ul>  

validation_ds:
<ul>  
  <li>sample_rate: 16000 </li>
  <li>batch_size: 32 </li>
  <li>shuffle: false </li>
</ul>

optim:
<ul>
  <li>name: adam </li>
  <li>lr: 0.1 </li>
  <li>betas: [0.9, 0.999] </li>
  <li>weight_decay: 0.0001 </li>
</ul>
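
The dotted `key=value` overrides described above can also be generated from a nested Python dictionary, which may be convenient when sweeping over several settings. The following is a minimal sketch; the helper names `format_value` and `flatten_overrides` are hypothetical, and the output format (`null`, lowercase booleans, lists without spaces) is an assumption based on the YAML-style values shown in the list above.

```python
def format_value(value):
    """Render a Python value in the YAML-like form used on the command line."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must be checked before other types
        return "true" if value else "false"
    if isinstance(value, (list, tuple)):
        # No spaces, so the override stays a single shell argument.
        return "[" + ",".join(format_value(item) for item in value) + "]"
    return str(value)


def flatten_overrides(config, prefix=""):
    """Flatten a nested dict into a list of dotted `key=value` strings."""
    overrides = []
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            overrides.extend(flatten_overrides(value, name))
        else:
            overrides.append(f"{name}={format_value(value)}")
    return overrides


print(flatten_overrides({"trainer": {"max_epochs": 10}, "optim": {"lr": 0.02}}))
```

Each resulting string can be appended to a `tao ... train` command after the regular flags.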

The following steps may take a considerable amount of time depending on the GPU being used. For the best experience, we recommend using an A100 GPU.

For training an ASR Citrinet model in TAO, we use the `tao speech_to_text_citrinet train` command with the following arguments:
<ul>
    <li><code>-e</code>: Path to the spec file </li>
    <li><code>-g</code>: Number of GPUs to use </li>
    <li><code>-r</code>: Path to the results folder </li>
    <li><code>-m</code>: Path to the model </li>
    <li><code>-k</code>: User-specified encryption key to use while saving/loading the model </li>
    <li>Any overrides to the spec file, for example, <code>trainer.max_epochs</code>. </li>
</ul>

#### Training Citrinet

In [14]:
!tao speech_to_text_citrinet train \
     -e $SPECS_DIR/speech_to_text_citrinet/train_citrinet_bpe.yaml \
     -g 1 \
     -k $KEY \
     -r $RESULTS_DIR/citrinet/train \
     training_ds.manifest_filepath=$DATA_DIR/an4_converted/train_manifest.json \
     validation_ds.manifest_filepath=$DATA_DIR/an4_converted/test_manifest.json \
     trainer.max_epochs=1 \
     training_ds.num_workers=4 \
     validation_ds.num_workers=4 \
     model.tokenizer.dir=$DATA_DIR/an4/tokenizer_spe_unigram_v32

2022-05-18 15:46:17,060 [INFO] root: Registry: ['nvcr.io']
2022-05-18 15:46:17,175 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvstaging/tao/tao-toolkit-pyt:v4.22.03-1263-dev-cuda11.4
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ck/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
    'train_citrinet_bpe.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
[NeMo I 2022-05-18 20:46:25 tlt_logging:20] Experiment configuration:
    exp_manager:
      explicit_log_dir: /results/citrinet/train
      exp_dir: null
      name: trained-model
      ver

[NeMo I 2022-05-18 20:46:25 mixins:146] Tokenizer SentencePieceTokenizer initialized with 32 tokens
[NeMo I 2022-05-18 20:46:25 ctc_bpe_models:206] 
    Replacing placeholder number of classes (-1) with actual number of classes - 32
[NeMo I 2022-05-18 20:46:25 features:255] PADDING: 16
[NeMo I 2022-05-18 20:46:25 features:272] STFT using torch
[NeMo I 2022-05-18 20:46:25 collections:173] Dataset loaded with 948 files totalling 0.71 hours
[NeMo I 2022-05-18 20:46:25 collections:174] 0 files were filtered totalling 0.00 hours
[NeMo I 2022-05-18 20:46:25 collections:173] Dataset loaded with 130 files totalling 0.10 hours
[NeMo I 2022-05-18 20:46:25 collections:174] 0 files were filtered totalling 0.00 hours
[NeMo W 2022-05-18 20:46:25 modelPT:496] The lightning trainer received accelerator: <pytorch_lightning.accelerators.gpu.GPUAccelerator object at 0x7f973a36c4f0>. We recommend to use 'ddp' instead.
[NeMo I 2022-05-18 20:46:25 modelPT:587] Optimizer config = Adam (
    Parameter Group 0

Validation sanity check:   0%|                            | 0/2 [00:00<?, ?it/s][NeMo I 2022-05-18 20:46:31 wer_bpe:203] 
    
[NeMo I 2022-05-18 20:46:31 wer_bpe:204] reference:rubout g m e f three nine
[NeMo I 2022-05-18 20:46:31 wer_bpe:205] predicted:p
Validation sanity check:  50%|██████████          | 1/2 [00:03<00:03,  3.82s/it][NeMo I 2022-05-18 20:46:31 wer_bpe:203] 
    
[NeMo I 2022-05-18 20:46:31 wer_bpe:204] reference:v a n e s s a
[NeMo I 2022-05-18 20:46:31 wer_bpe:205] predicted:p
Epoch 0:  86%|███████████████████▋   | 30/35 [00:02<00:00, 12.52it/s, loss=50.1]
Validating: 0it [00:00, ?it/s]
Validating:   0%|                                         | 0/5 [00:00<?, ?it/s][NeMo I 2022-05-18 20:46:34 wer_bpe:203] 
    
[NeMo I 2022-05-18 20:46:34 wer_bpe:204] reference:rubout g m e f three nine
[NeMo I 2022-05-18 20:46:34 wer_bpe:205] predicted:

Validating:  20%|██████▌                          | 1/5 [00:00<00:01,  3.92it/s][NeMo I 2022-05-18 20:46:34 wer_bpe:203]

### ASR evaluation

Now that we have a model trained, we need to check how well it performs.

In [15]:
!tao speech_to_text_citrinet evaluate \
     -e $SPECS_DIR/speech_to_text_citrinet/evaluate.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/citrinet/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/citrinet/evaluate \
     test_ds.manifest_filepath=$DATA_DIR/an4_converted/test_manifest.json

2022-05-18 15:46:48,124 [INFO] root: Registry: ['nvcr.io']
2022-05-18 15:46:48,238 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvstaging/tao/tao-toolkit-pyt:v4.22.03-1263-dev-cuda11.4
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ck/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
    'evaluate.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
[NeMo I 2022-05-18 20:46:56 tlt_logging:20] Experiment configuration:
    restore_from: /results/citrinet/train/checkpoints/trained-model.tlt
    exp_manager:
      explicit_log_dir: /results/citrinet

[NeMo I 2022-05-18 20:46:58 collections:173] Dataset loaded with 130 files totalling 0.10 hours
[NeMo I 2022-05-18 20:46:58 collections:174] 0 files were filtered totalling 0.00 hours
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
      rank_zero_warn(
    
Testing: 0it [00:00, ?it/s][NeMo I 2022-05-18 20:47:01 wer_bpe:203] 
    
[NeMo I 2022-05-18 20:47:01 wer_bpe:204] reference:rubout g m e f three nine
[NeMo I 2022-05-18 20:47:01 wer_bpe:205] predicted:
Testing:  20%|███████▏                            | 1/

### ASR finetuning

After the model is trained and evaluated, you can fine-tune it with the following command if needed. This step can also be used for transfer learning by making changes in the `train.json` and `dev.json` files to add new data.

The list of customizable parameters is the same as for training, except for parameters that affect the model architecture. Also, instead of `training_ds`, we have `finetuning_ds`.

Note: If you want to start from a pretrained model for better inference results, you can find a `.nemo` model [here](https://ngc.nvidia.com/catalog/collections/nvidia:nemotrainingframework).

Simply rename the `.nemo` file to `.tlt` and pass it through the fine-tune pipeline.
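
The renaming step can be done in a file manager or shell, or with a few lines of Python. The sketch below copies the file under the new extension instead of renaming it in place, so the original download stays untouched; that is a design choice of this example and not a requirement. The helper name `copy_nemo_as_tlt` and the file name in the usage comment are hypothetical.

```python
import shutil
from pathlib import Path


def copy_nemo_as_tlt(nemo_path):
    """Copy `<name>.nemo` to `<name>.tlt` in the same folder and return the new path."""
    source = Path(nemo_path)
    if source.suffix != ".nemo":
        raise ValueError(f"expected a .nemo file, got: {source.name}")
    target = source.with_suffix(".tlt")
    shutil.copyfile(source, target)
    return target


# Example usage (hypothetical file name):
# tlt_path = copy_nemo_as_tlt("results/pretrained/model.nemo")
```

The resulting `.tlt` path, as seen from inside the container, can then be passed to the fine-tune command through `-m`.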

Note: The fine-tune spec files contain settings for fine-tuning the English model we just trained to Russian. If you want to proceed with English, make the corresponding changes in the spec file `finetune.yaml`, which you can find in the specs folder you mapped (`HOST_SPECS_DIR`). If you change the language after fine-tuning as-is, be sure to delete the older fine-tuning checkpoints.

In [16]:
!tao speech_to_text_citrinet finetune \
     -e $SPECS_DIR/speech_to_text_citrinet/finetune.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/citrinet/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/citrinet/finetune \
     finetuning_ds.manifest_filepath=$DATA_DIR/an4_converted/train_manifest.json \
     validation_ds.manifest_filepath=$DATA_DIR/an4_converted/test_manifest.json \
     trainer.max_epochs=1 \
     finetuning_ds.num_workers=20 \
     validation_ds.num_workers=20 \
     trainer.gpus=1 \
     tokenizer.dir=$DATA_DIR/an4/tokenizer_spe_unigram_v32

2022-05-18 15:47:14,019 [INFO] root: Registry: ['nvcr.io']
2022-05-18 15:47:14,128 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvstaging/tao/tao-toolkit-pyt:v4.22.03-1263-dev-cuda11.4
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ck/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
    'finetune.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
[NeMo I 2022-05-18 20:47:22 tlt_logging:20] Experiment configuration:
    restore_from: /results/citrinet/train/checkpoints/trained-model.tlt
    save_to: ???
    exp_manager:
      explicit_log_dir: 

[NeMo I 2022-05-18 20:47:22 mixins:146] Tokenizer SentencePieceTokenizer initialized with 32 tokens
[NeMo W 2022-05-18 20:47:22 modelPT:148] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /data/an4_converted/train_manifest.json
    batch_size: 32
    sample_rate: 16000
    labels: null
    num_workers: 4
    pin_memory: true
    trim_silence: true
    shuffle: true
    max_duration: 16.7
    min_duration: null
    is_tarred: false
    tarred_audio_filepaths: null
    use_start_end_token: false
    shuffle_n: null
    bucketing_strategy: synced_randomized
    bucketing_batch_size: null
    
[NeMo W 2022-05-18 20:47:22 modelPT:155] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation dat

Validation sanity check:   0%|                            | 0/2 [00:00<?, ?it/s][NeMo I 2022-05-18 20:47:27 wer_bpe:203] 
    
[NeMo I 2022-05-18 20:47:27 wer_bpe:204] reference:rubout g m e f three nine
[NeMo I 2022-05-18 20:47:27 wer_bpe:205] predicted:fivea twoza fivea twoz sixz sixz twoa fivea twoza fivea twoz sixz sixza five
Validation sanity check:  50%|██████████          | 1/2 [00:03<00:03,  3.20s/it][NeMo I 2022-05-18 20:47:27 wer_bpe:203] 
    
[NeMo I 2022-05-18 20:47:27 wer_bpe:204] reference:v a n e s s a
[NeMo I 2022-05-18 20:47:27 wer_bpe:205] predicted:a fiveaza fiveaz twoa fivea twoz twoa five
Epoch 0:   0%|                                           | 0/35 [00:00<?, ?it/s][NeMo I 2022-05-18 20:47:28 wer_bpe:203] 
    
[NeMo I 2022-05-18 20:47:28 wer_bpe:204] reference:five ten
[NeMo I 2022-05-18 20:47:28 wer_bpe:205] predicted:a fivea twoz sixzra twoz sixz twoa
Epoch 0:   3%|▋                        | 1/35 [00:01<00:41,  1.22s/it, loss=870][NeMo I 2022-05-18 20:47:29 w

Epoch 0:  74%|█████████████████▊      | 26/35 [00:02<00:00,  9.80it/s, loss=799][NeMo I 2022-05-18 20:47:30 wer_bpe:203] 
    
[NeMo I 2022-05-18 20:47:30 wer_bpe:204] reference:m o h n k e r n
[NeMo I 2022-05-18 20:47:30 wer_bpe:205] predicted:haz sixzahaz sixzah
Epoch 0:  77%|██████████████████▌     | 27/35 [00:02<00:00,  9.96it/s, loss=799][NeMo I 2022-05-18 20:47:30 wer_bpe:203] 
    
[NeMo I 2022-05-18 20:47:30 wer_bpe:204] reference:c h r i s
[NeMo I 2022-05-18 20:47:30 wer_bpe:205] predicted:haz sixzaz sixzahaz sixzah
Epoch 0:  80%|███████████████████▏    | 28/35 [00:02<00:00, 10.12it/s, loss=793][NeMo I 2022-05-18 20:47:30 wer_bpe:203] 
    
[NeMo I 2022-05-18 20:47:30 wer_bpe:204] reference:september twenty seventh nineteen sixty seven
[NeMo I 2022-05-18 20:47:30 wer_bpe:205] predicted:haz sixzahaz sixzezaz sixzah
Epoch 0:  83%|███████████████████▉    | 29/35 [00:02<00:00, 10.27it/s, loss=789][NeMo I 2022-05-18 20:47:30 wer_bpe:203] 
    
[NeMo I 2022-05-18 20:47:30 wer_bpe:20
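
The `reference:` and `predicted:` lines in the log above can be used for a quick check of how fine-tuning is progressing. The following is a minimal sketch, assuming the log has been saved as text and that its lines keep the `wer_bpe:<line>] reference:` / `predicted:` layout shown above. The helper names (`extract_pairs`, `word_edit_distance`, `corpus_wer`) are hypothetical, and the word error rate here is a plain word-level edit distance, which may not match NeMo's own computation exactly.

```python
import re

# Matches the "reference:" / "predicted:" lines printed by the wer_bpe logger.
# The layout is taken from the log in this section; other versions may differ.
PAIR_LINE = re.compile(r"wer_bpe:\d+\]\s*(reference|predicted):(.*)$")


def extract_pairs(log_text):
    """Return (reference, predicted) transcript pairs found in a training log."""
    pairs, reference = [], None
    for line in log_text.splitlines():
        match = PAIR_LINE.search(line)
        if not match:
            continue
        kind, text = match.group(1), match.group(2).strip()
        if kind == "reference":
            reference = text
        elif reference is not None:
            pairs.append((reference, text))
            reference = None
    return pairs


def word_edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,           # deletion
                               current[j - 1] + 1,        # insertion
                               previous[j - 1] + cost))   # substitution or match
        previous = current
    return previous[-1]


def corpus_wer(pairs):
    """Total word errors divided by total reference words."""
    errors = sum(word_edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words if words else 0.0


# Two lines copied from the log above.
sample_log = (
    "[NeMo I 2022-05-18 20:47:30 wer_bpe:204] reference:c h r i s\n"
    "[NeMo I 2022-05-18 20:47:30 wer_bpe:205] predicted:haz sixzaz sixzahaz sixzah\n"
)
sample_pairs = extract_pairs(sample_log)
print(sample_pairs, corpus_wer(sample_pairs))
```

Early in epoch 0 the predictions share no words with the references, so a value close to or above 1.0 is expected at this point; the number becomes informative only when compared across later epochs.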

### ASR model export

With TAO, you can also export your model in a format that can be deployed using NVIDIA Riva, a highly performant application framework for multi-modal conversational AI services using GPUs. The same command is used for exporting to Riva and to ONNX. The only variation is the value of `export_format`, which can be set in the spec file or overridden on the command line.
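To make that variation concrete, here is a small sketch that assembles the two export invocations used in this section and lists the arguments that differ. The helper `build_export_command` is hypothetical and only builds strings; it does not run TAO, and the `$SPECS_DIR`, `$RESULTS_DIR`, and `$KEY` placeholders are left unexpanded, as in the notebook cells.

```python
def build_export_command(export_format, results_subdir, export_to=None):
    """Assemble the `tao speech_to_text_citrinet export` arguments as a list.

    Only flags and overrides that appear in this tutorial's cells are used.
    """
    command = [
        "tao", "speech_to_text_citrinet", "export",
        "-e", "$SPECS_DIR/speech_to_text_citrinet/export.yaml",
        "-g", "1",
        "-k", "$KEY",
        "-m", "$RESULTS_DIR/citrinet/train/checkpoints/trained-model.tlt",
        "-r", f"$RESULTS_DIR/citrinet/{results_subdir}",
        f"export_format={export_format}",
    ]
    if export_to is not None:
        # Optional override of the output file name.
        command.append(f"export_to={export_to}")
    return command


riva_cmd = build_export_command("RIVA", "riva", export_to="asr-model.riva")
onnx_cmd = build_export_command("ONNX", "export")

# Compare the shared positions; the trailing export_to override exists
# only in the Riva command, so zip() leaves it out of the comparison.
differing = [(a, b) for a, b in zip(riva_cmd, onnx_cmd) if a != b]
print(" ".join(riva_cmd))
print(" ".join(onnx_cmd))
print(differing)
```

Apart from the results directory and the optional `export_to` file name, the only argument that changes between the two commands is `export_format`.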

#### Export to Riva

In [17]:
!tao speech_to_text_citrinet export \
     -e $SPECS_DIR/speech_to_text_citrinet/export.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/citrinet/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/citrinet/riva \
     export_format=RIVA \
     export_to=asr-model.riva

2022-05-18 15:47:45,594 [INFO] root: Registry: ['nvcr.io']
2022-05-18 15:47:45,712 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvstaging/tao/tao-toolkit-pyt:v4.22.03-1263-dev-cuda11.4
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ck/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
    'export.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
[NeMo I 2022-05-18 20:47:53 tlt_logging:20] Experiment configuration:
    restore_from: /results/citrinet/train/checkpoints/trained-model.tlt
    export_to: asr-model.riva
    export_format: RIVA
    ex

#### Export to ONNX (Note: Export to ONNX is not needed for Riva)

In [18]:
!tao speech_to_text_citrinet export \
     -e $SPECS_DIR/speech_to_text_citrinet/export.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/citrinet/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/citrinet/export \
     export_format=ONNX

2022-05-18 15:48:12,063 [INFO] root: Registry: ['nvcr.io']
2022-05-18 15:48:12,179 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvstaging/tao/tao-toolkit-pyt:v4.22.03-1263-dev-cuda11.4
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ck/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
    'export.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
[NeMo I 2022-05-18 20:48:20 tlt_logging:20] Experiment configuration:
    restore_from: /results/citrinet/train/checkpoints/trained-model.tlt
    export_to: exported-model.riva
    export_format: ONNX
 

### ASR Inference using TLT checkpoint

#### ASR Inference with TAO Toolkit

In this section, we run inference on the `.tlt` checkpoint with TAO Toolkit.
For real-time inference and the best latency, deploy the model on Riva instead; refer to the [How to deploy custom Acoustic Model (Citrinet) trained with TAO Toolkit on Riva](https://github.com/nvidia-riva/tutorials/blob/dev/22.04-citrinet/asr-python-advanced-finetune-am-citrinet-tao-deployment.ipynb) tutorial.
To select the files for inference, either edit the `infer.yaml` spec file or override `file_paths` on the command line, as the command in this section does.
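
The `file_paths` override used in this section takes a bracketed list. As a rough sketch, assuming the usual Hydra list-override syntax of comma-separated items inside `[...]`, the helper below formats several audio paths into one override string. The function name `hydra_file_paths_override` and the second file name are hypothetical and serve only as an illustration.

```python
def hydra_file_paths_override(paths):
    """Format audio paths as a single `file_paths=[...]` command-line override."""
    if not paths:
        raise ValueError("at least one audio file is required")
    for path in paths:
        # Commas and spaces would need quoting, which this sketch does not handle.
        if "," in path or " " in path:
            raise ValueError(f"path needs quoting, not handled here: {path!r}")
    return "file_paths=[" + ",".join(paths) + "]"


override = hydra_file_paths_override([
    "$DATA_DIR/an4_converted/wavs/an268-mbmg-b.wav",  # file used in this tutorial
    "$DATA_DIR/an4_converted/wavs/another-file.wav",  # hypothetical second file
])
print(override)
```

The resulting string can be appended as the last argument of the `tao speech_to_text_citrinet infer` command. Some shells treat square brackets as glob characters, so quoting the whole override may be necessary.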

In [19]:
!tao speech_to_text_citrinet infer \
     -e $SPECS_DIR/speech_to_text_citrinet/infer.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/citrinet/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/citrinet/infer \
     file_paths=[$DATA_DIR/an4_converted/wavs/an268-mbmg-b.wav]

2022-05-18 15:48:38,376 [INFO] root: Registry: ['nvcr.io']
2022-05-18 15:48:38,491 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvstaging/tao/tao-toolkit-pyt:v4.22.03-1263-dev-cuda11.4
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ck/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
    'infer.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
[NeMo I 2022-05-18 20:48:46 tlt_logging:20] Experiment configuration:
    restore_from: /results/citrinet/train/checkpoints/trained-model.tlt
    exp_manager:
      task_name: infer
      explicit_log_di

#### ASR Inference using ONNX

TAO can also run inference with the exported `.eonnx` model. The command `tao speech_to_text_citrinet infer_onnx` is very similar to the inference command for `.tlt` models. Again, the inputs in the spec file are for demo purposes only; you can substitute your own audio files.

In [20]:
!tao speech_to_text_citrinet infer_onnx \
     -e $SPECS_DIR/speech_to_text_citrinet/infer_onnx_citrinet.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/citrinet/export/exported-model.eonnx \
     -r $RESULTS_DIR/infer_onnx \
     file_paths=[$DATA_DIR/an4_converted/wavs/an268-mbmg-b.wav]

2022-05-18 15:49:03,636 [INFO] root: Registry: ['nvcr.io']
2022-05-18 15:49:03,751 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvstaging/tao/tao-toolkit-pyt:v4.22.03-1263-dev-cuda11.4
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ck/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
    'infer_onnx_citrinet.yaml' is validated against ConfigStore schema with the same name.
    This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
    
[NeMo I 2022-05-18 20:49:11 tlt_logging:20] Experiment configuration:
    restore_from: /results/citrinet/export/exported-model.eonnx
    file_paths:
    - /data/an4_converted/wavs/an268-mb

## What's Next?

You can use TAO to build custom models for your own applications, or you can [deploy the custom model to NVIDIA Riva](https://github.com/nvidia-riva/tutorials/blob/dev/22.04-citrinet/asr-python-advanced-finetune-am-citrinet-tao-deployment.ipynb).