<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# 4.0 ASR with NVIDIA TAO Toolkit
## (part of Lab 1)

In this notebook, you'll work with the TAO (Train, Adapt, and Optimize) Toolkit to run speech recognition inference using a pretrained Automatic Speech Recognition (ASR) model and export the model to other formats.

**[4.1 ASR (Automatic Speech Recognition)](#4.1-ASR-(Automatic-Speech-Recognition))<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[4.1.1 QuartzNet](#4.1.1-QuartzNet)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.1.2 Citrinet](#4.1.2-Citrinet)<br>
**[4.2 TAO Toolkit](#4.2-TAO-Toolkit)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[4.2.1 TAO Launcher](#4.2.1-TAO-Launcher)<br>
**[4.3 TAO `speech_to_text` Task](#4.3-TAO-speech_to_text-Task)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[4.3.1 Path Setup](#4.3.1-Path-Setup)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.3.2 Specification Files](#4.3.2-Specification-Files)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.3.3 QuartzNet Inference](#4.3.3-QuartzNet-Inference)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.3.4 Citrinet Inference](#4.3.4-Citrinet-Inference)<br>
**[4.4 Exporting Models with TAO Toolkit](#4.4-Exporting-Models-with-TAO-Toolkit)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[4.4.1 ASR Model Export to ONNX](#4.4.1-ASR-Model-Export-to-ONNX)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.4.2 Inference with an ONNX Model](#4.4.2-Inference-with-an-ONNX-Model)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.4.3 Exercise: Export the Model to Riva Format](#4.4.3-Exercise:-Export-the-Model-to-Riva-Format)<br>
**[4.5 (Optional) Create Your Own Audio Samples](#4.5-(Optional)-Create-Your-Own-Audio-Samples)**<br>

### Notebook Dependencies
The steps in this notebook assume that you have:

1. **NGC Credentials**<br>Be sure you have added your NGC credentials as described in the [NGC Setup notebook](003_Intro_NGC_Setup.ipynb).

---
# 4.1 ASR (Automatic Speech Recognition)

Building an ASR model is often the first step in building a conversational AI application. An ASR model converts audio speech into readable text.
The main metric used to evaluate ASR models is the Word Error Rate (WER).
In this lab, we will focus on two models: [QuartzNet](https://arxiv.org/pdf/1910.10261.pdf) and [Citrinet](https://arxiv.org/pdf/2104.01721.pdf), both of which are end-to-end ASR models, meaning we can simply input audio to produce text output.
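
WER is commonly defined as the word-level edit distance between the reference transcript and the model's transcript (substitutions, deletions, and insertions), divided by the number of words in the reference. The following is a minimal sketch of that calculation; the function name `word_error_rate` and the example sentences are illustrative and are not part of TAO.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of four reference words
print(word_error_rate("the cat sat down", "the cat sat town"))  # 0.25
```

A lower WER is better, and a WER of 0 means the transcript matches the reference word for word.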

## 4.1.1 QuartzNet
_QuartzNet_ is a deep neural model for speech recognition developed by NVIDIA Research. The network is divided into:
- Encoder - learns the acoustic feature representation
- Decoder - maps those features to the vocabulary (characters or phonemes)

QuartzNet is a variant of the _NVIDIA Jasper_ model [(Just Another Speech Recognizer)](https://arxiv.org/pdf/1904.03288.pdf). However, QuartzNet replaces Jasper's 1D convolutions with 1D time-channel separable convolutions, which use fewer parameters while keeping similar accuracy. QuartzNet uses a non-autoregressive decoding scheme based on CTC (Connectionist Temporal Classification), so training does not require manually aligned input and output pairs. Learn more about the [CTC loss](https://www.cs.toronto.edu/~graves/icml_2006.pdf).
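
To make the CTC idea concrete, here is a small sketch of the collapse step used in greedy CTC decoding: the model emits one label per audio frame (including a special blank), repeated labels are merged, and blanks are then dropped. The `_` blank symbol and the frame sequence are assumptions chosen for illustration; a real model works with class indices over its vocabulary rather than characters.

```python
from itertools import groupby

BLANK = "_"  # illustrative blank symbol; real models reserve a vocabulary index


def ctc_greedy_collapse(frame_labels):
    """Collapse a per-frame label sequence: merge repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


# Twelve frames collapse to a five-letter word with no hand-made alignment.
# The blank between the two "l" runs is what keeps the double letter.
print(ctc_greedy_collapse(list("hh_e_ll_l_oo")))  # hello
```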

<img src="images/asr/quartz_vertical.png">

## 4.1.2 Citrinet

_Citrinet_ is a variant of QuartzNet, developed by NVIDIA Research. Unlike QuartzNet, which predicts characters or phonemes, Citrinet uses subword encoding via WordPiece tokenization. This improves the accuracy of the resulting transcripts.
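
As a rough illustration of subword encoding, the sketch below splits a word with a greedy longest-match-first rule, using the `##` prefix that WordPiece-style tokenizers conventionally put on word-internal pieces. The vocabulary here is invented for the example; an actual Citrinet tokenizer is learned from training text and will produce different pieces.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword units."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example
toy_vocab = {"trans", "##cri", "##pt", "##ion", "speech"}
print(wordpiece_tokenize("transcription", toy_vocab))
# ['trans', '##cri', '##pt', '##ion']
```

Predicting units like these lets a model emit several characters per output step, compared with a character-level vocabulary.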

<img src="images/asr/citrinet_vertical.png">

---
# 4.2 TAO Toolkit

NVIDIA TAO Toolkit, formerly Transfer Learning Toolkit, is a low-code solution that simplifies the task of transfer learning and fine-tuning on computer vision and conversational AI models. It simplifies and accelerates development of AI applications through an easy-to-use workflow-driven toolkit (see the [TAO User's Guide](https://docs.nvidia.com/tao/tao-toolkit)). TAO Toolkit is also capable of optimizing models for inference to achieve the highest throughput for deployment.

The technique of transfer learning transfers learned features from an existing neural network to a new one. Transfer learning is often used in cases where creating a large training dataset is not feasible.

Developers, researchers, and software partners building intelligent vision and conversational AI apps and services can bring their own data to fine-tune pretrained models instead of going through the time-consuming process of training from scratch.

With a basic understanding of deep learning and minimal coding, TAO Toolkit users can:

- Fine-tune models for CV use cases such as object detection, image classification, segmentation, CR, and keypoint estimation using NVIDIA pretrained CV models
- Fine-tune models for conversational AI use cases, such as ASR or NLP, using NVIDIA pretrained conversational AI models
- Add new classes to an existing pretrained model
- Retrain models to adapt them for different use cases
- Use the model pruning capability on CV models to reduce the overall size of the model

<img src="images/asr/tao_components.png">

The goal of this toolkit is to reduce the workload, which can enable data scientists to have considerably more train-test iterations in the same time frame.

You'll see this in action with a use case for ASR!

## 4.2.1 TAO Launcher

TAO is packaged with a [launcher CLI](https://docs.nvidia.com/tao/tao-toolkit/text/tao_launcher.html). This enables users to build AI models using a simple specification file and one of the NVIDIA pretrained models. 
When you use the launcher, relevant TAO docker containers are executed, and mappings are managed implicitly for you.  The available _tasks_ and _subtasks_ follow a cascaded structure:

```text
tao <task> <subtask> <args>
```

You can see a list of tasks with the command `tao --help` or `tao info --verbose`. Try it by executing the following cell.

In [1]:
!tao info --verbose

Configuration of the TAO Toolkit Instance

dockers: 		
	nvidia/tao/tao-toolkit-tf: 			
		docker_registry: nvcr.io
		docker_tag: v3.21.08-py3
		tasks: 
			1. augment
			2. bpnet
			3. classification
			4. detectnet_v2
			5. dssd
			6. emotionnet
			7. faster_rcnn
			8. fpenet
			9. gazenet
			10. gesturenet
			11. heartratenet
			12. lprnet
			13. mask_rcnn
			14. multitask_classification
			15. retinanet
			16. ssd
			17. unet
			18. yolo_v3
			19. yolo_v4
			20. converter
	nvidia/tao/tao-toolkit-pyt: 			
		docker_registry: nvcr.io
		docker_tag: v3.21.08-py3
		tasks: 
			1. speech_to_text
			2. speech_to_text_citrinet
			3. text_classification
			4. question_answering
			5. token_classification
			6. intent_slot_classification
			7. punctuation_and_capitalization
	nvidia/tao/tao-toolkit-lm: 			
		docker_registry: nvcr.io
		docker_tag: v3.21.08-py3
		tasks: 
			1. n_gram
format_version: 1.0
toolkit_version: 3.21.08
published_date: 08/17/2021


---
# 4.3 TAO `speech_to_text` Task

The TAO tasks are broadly divided into computer vision and conversational AI. 

For example, `speech_to_text` is a conversational AI task for [speech recognition](https://docs.nvidia.com/tao/tao-toolkit/text/asr/speech_recognition.html), which supports subtasks such as `train`, `infer`, `evaluate`, `export`, and so on. When the user executes a command, for example `tao speech_to_text infer --help`, the TAO launcher does the following:

1. Pulls the required TAO container with the entrypoint for `speech_to_text`
1. Creates an instance of the container
1. Runs the `speech_to_text` entrypoint with the `infer` subtask

A visualization of the user interaction is shown below (in this case with the `train` and `finetune` subtasks):

<img src="images/asr/tao_pt_user_interaction.png" width=800>

The `tao speech_to_text --help` usage information output is as follows:

```
usage: speech_to_text [-h] -r RESULTS_DIR [-k KEY] [-e EXPERIMENT_SPEC_FILE]
                      [-g GPUS] [-m RESUME_MODEL_WEIGHTS]
                      [-o OUTPUT_SPECS_DIR]
                      {dataset_convert,evaluate,export,finetune,infer,infer_onnx,train,download_specs}

Train Adapt Optimize Toolkit

positional arguments:
  {dataset_convert,evaluate,export,finetune,infer,infer_onnx,train,download_specs}
                        Subtask for a given task/model.

optional arguments:
  -h, --help            show this help message and exit
  -r RESULTS_DIR, --results_dir RESULTS_DIR
                        Path to a folder where the experiment outputs should
                        be written. (DEFAULT: ./)
  -k KEY, --key KEY     User specific encoding key to save or load a .tlt
                        model.
  -e EXPERIMENT_SPEC_FILE, --experiment_spec_file EXPERIMENT_SPEC_FILE
                        Path to the experiment spec file.
  -g GPUS, --gpus GPUS  Number of GPUs to use. The default value is 1.
  -m RESUME_MODEL_WEIGHTS, --resume_model_weights RESUME_MODEL_WEIGHTS
                        Path to a pre-trained model or model to continue
                        training.
  -o OUTPUT_SPECS_DIR, --output_specs_dir OUTPUT_SPECS_DIR
                        Path to a target folder where experiment spec files
                        will be downloaded.
```

We are going to run inference on a pretrained model, QuartzNet, and export the model to other formats.  The options for the `speech_to_text` task apply to our test project as follows:
- **subtask**: choose from the list as needed; we will use a few of these for our test project (`infer`, `export`, etc.)
- **`-r` argument**: results directory path; we'll put our results in our workspace
- **`-k` argument**: key; so that we can save and load our work. We will always use the same encoding key (KEY='tlt_encode')
- **`-e` argument**: specification file for a particular subtask such as `infer.yaml` or `export.yaml`; we will fetch starter examples with the `download_specs` subtask into our workspace
- **`-g` argument**: we can be explicit or use the default because in our case, we only have a single GPU available
- **`-m` argument**: pretrained ASR model path; we will locate this in our workspace
- **`-o` argument**: this is only required for downloading spec files with `download_specs` to the `specs` directory in our workspace
- **additional arguments**: additional arguments can be tacked on to the command to override values in the specification file
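
The way these options combine into a single `tao <task> <subtask> <args>` command can be sketched in Python. The helper `build_tao_command` is hypothetical and exists only for illustration; it uses just the flags listed in the usage output above and the container paths used in this lab.

```python
import shlex


def build_tao_command(task, subtask, results_dir, key=None, spec_file=None,
                      gpus=None, model=None, overrides=None):
    """Assemble a `tao <task> <subtask> <args>` command as a list of tokens."""
    cmd = ["tao", task, subtask, "-r", results_dir]
    if key is not None:
        cmd += ["-k", key]
    if spec_file is not None:
        cmd += ["-e", spec_file]
    if gpus is not None:
        cmd += ["-g", str(gpus)]
    if model is not None:
        cmd += ["-m", model]
    # Trailing key=value pairs override values in the specification file
    for name, value in (overrides or {}).items():
        cmd.append(f"{name}={value}")
    return cmd


cmd = build_tao_command(
    "speech_to_text", "export",
    results_dir="/workspace/mount/results/quartznet/export",
    key="tlt_encode",
    spec_file="/workspace/mount/specs/speech_to_text/export.yaml",
    gpus=1,
    model="/workspace/mount/models/speechtotext_english_quartznet.tlt",
    overrides={"export_format": "ONNX", "export_to": "asr-model.eonnx"},
)
print(shlex.join(cmd))
```

The sketch only builds and prints the command line. In this notebook the commands themselves are run directly with the `!tao ...` cell syntax.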

## 4.3.1 Path Setup
In order for our various data and results directories to be visible when TAO runs its docker container, these directories must be mapped from our workspace to a location inside the container.  We can do this by using the launcher config file `~/.tao_mounts.json`. 

In addition to defining the mounts, we can also configure other docker options, such as the environment variables and the amount of shared memory available to the TAO launcher.

Execute the following cell to look at the configuration set up for this lab.

In [28]:
!cat ~/.tao_mounts.json

{
    "Mounts":[
        {
            "source": "/dli/task/tao",
            "destination": "/workspace/mount"
        }
    ],
    "DockerOptions":{
        "shm_size": "16G",
        "network": "host",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        }
    }
}
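
Because commands run inside the container, any path passed to TAO has to be the container-side path, not the host path. The sketch below shows that translation using the same `Mounts` structure as the file above; the helper name `to_container_path` is hypothetical and is not part of the TAO launcher.

```python
import json
import os.path

# Same structure as the "Mounts" section of ~/.tao_mounts.json shown above
config = json.loads("""
{
    "Mounts": [
        {"source": "/dli/task/tao", "destination": "/workspace/mount"}
    ]
}
""")


def to_container_path(host_path, mounts):
    """Translate a host path into the path the TAO container will see."""
    for mount in mounts:
        source = mount["source"].rstrip("/")
        if host_path == source or host_path.startswith(source + "/"):
            relative = os.path.relpath(host_path, source)
            return os.path.normpath(os.path.join(mount["destination"], relative))
    raise ValueError(f"{host_path} is not under any mounted source directory")


print(to_container_path("/dli/task/tao/data/audio_sample.wav", config["Mounts"]))
# /workspace/mount/data/audio_sample.wav
```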


Define some folder locations and an encryption key.

In [29]:
import os.path
from shutil import rmtree

# The source mount is our workspace on the host (this lab instance)
source_mount = "/dli/task/tao"
# The destination mount is our mapped workspace within the TAO docker container's file structure
destination_mount = "/workspace/mount"

# The following paths are set relative to the TAO docker container
# The path to the specification yaml files
SPECS_DIR = os.path.join(destination_mount, 'specs')

# The results are saved at this path by default
RESULTS_DIR = os.path.join(destination_mount, 'results')

# The data are located at this path by default
DATA_DIR = os.path.join(destination_mount, 'data')

# The models are located at this path by default
MODELS_DIR = os.path.join(destination_mount, 'models')

# Set your encryption key, and use the same key for all commands. Please use "tlt_encode" if you'd like to deploy the models later with NVIDIA Riva.
KEY = 'tlt_encode'

In [30]:
# Check to see what we already have in our source directory
# Note: some temporary ".tar" files may appear as the DLI environment pre-loads large files
!ls $source_mount

backup_riva  data  models  results  specs


## 4.3.2 Specification Files
We want specification `.yaml` files in a `specs` directory to run the subtasks. We can load example files with the [download_specs subtask](https://docs.nvidia.com/tao/tao-toolkit/text/asr/speech_recognition.html#downloading-sample-spec-files), then modify them or override them later:

In [31]:
%%time
# The first time, TAO takes about 3 minutes to load and run

# Delete the speech_to_text specification directory if it already exists
folder = source_mount + '/specs/speech_to_text'
if os.path.exists(folder):
    rmtree(folder)
    
# Download specification files for speech_to_text 
!tao speech_to_text download_specs \
    -o $SPECS_DIR/speech_to_text \
    -r $RESULTS_DIR

2022-03-13 19:06:19,376 [INFO] root: Registry: ['nvcr.io']
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/root/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
[NeMo W 2022-03-13 19:06:22 experimental:27] Module <class 'nemo.collections.asr.losses.ctc.CTCLoss'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.
[NeMo W 2022-03-13 19:06:22 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_text.AudioToCharDataset'> is experimental, not ready for production and is not fully supporte

In [32]:
# Verify that the specs directory has been populated
!ls $source_mount/specs/speech_to_text

dataset_convert_an4.yaml  export.yaml	   train_jasper.yaml
dataset_convert_en.yaml   finetune.yaml    train_quartznet.yaml
dataset_convert_ru.yaml   infer.yaml
evaluate.yaml		  infer_onnx.yaml


In [33]:
%%time
# Delete the speech_to_text_citrinet specification directory if it already exists
folder = source_mount + '/specs/speech_to_text_citrinet'
if os.path.exists(folder):
    rmtree(folder)
    
# Download specification files for speech_to_text_citrinet
!tao speech_to_text_citrinet download_specs \
    -o $SPECS_DIR/speech_to_text_citrinet \
    -r $RESULTS_DIR

2022-03-13 19:06:44,535 [INFO] root: Registry: ['nvcr.io']
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/root/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
[NeMo W 2022-03-13 19:06:47 experimental:27] Module <class 'nemo.collections.asr.losses.ctc.CTCLoss'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.
[NeMo W 2022-03-13 19:06:47 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_text.AudioToCharDataset'> is experimental, not ready for production and is not fully supporte

In [34]:
# Verify that the specs directory has been populated
!ls $source_mount/specs/speech_to_text_citrinet

create_tokenizer.yaml	  evaluate.yaml  infer_onnx.yaml
dataset_convert_an4.yaml  export.yaml	 train_citrinet_256.yaml
dataset_convert_en.yaml   finetune.yaml  train_citrinet_bpe.yaml
dataset_convert_ru.yaml   infer.yaml


## 4.3.3 QuartzNet Inference 
For this lab, the speech-to-text English QuartzNet model, `speechtotext_english_quartznet.tlt`, has been downloaded in advance from the [NGC QuartzNet catalog location](https://ngc.nvidia.com/catalog/models/nvidia:tlt-riva:speechtotext_english_quartznet). This model was trained on a combination of seven datasets of English speech, with a total of 7,057 hours of audio samples. Samples were limited to a minimum duration of 0.1 s and a maximum duration of 16.7 s. The model was trained for 300 epochs with [Automatic Mixed Precision (AMP)](https://developer.nvidia.com/automatic-mixed-precision).
It achieves a Word Error Rate (WER) of 4.38% on the [LibriSpeech](https://www.openslr.org/12) dev-clean dataset, and a WER of 11.30% on the LibriSpeech dev-other dataset.

In [35]:
# Verify that the "speechtotext_english_quartznet.tlt" model is in the models directory
!ls $source_mount/models/speechtotext_english_quartznet.tlt

/dli/task/tao/models/speechtotext_english_quartznet.tlt


Listen to the audio files we are going to work with. The following provides a player for each in the notebook.

In [36]:
import librosa
import IPython.display as ipd

SAMPLE_01 = 'baby_audio_1.wav'
SAMPLE_02 = 'baby_audio_2.wav'
SAMPLE_03 = 'audio_sample.wav'

# Import audio sample paths
paths2audio_files=[source_mount + '/data/' + SAMPLE_01,
                   source_mount + '/data/' + SAMPLE_02,
                   source_mount + '/data/' + SAMPLE_03,
                  ]

# Load the audio files into the players
for wav_file in paths2audio_files:
    print(os.path.basename(wav_file))
    audio, sample_rate = librosa.load(wav_file, sr=None)  # sr=None keeps the file's native sample rate
    ipd.display(ipd.Audio(audio, rate=sample_rate))

baby_audio_1.wav


baby_audio_2.wav


audio_sample.wav


Now we can query the model. We'll use the [infer subtask](https://docs.nvidia.com/tao/tao-toolkit/text/asr/speech_recognition.html#using-inference-on-a-model), and provide the audio file locations (`file_paths=`) as an argument.  We could have also provided this argument in the [infer.yaml](tao/specs/speech_to_text/infer.yaml) spec file.
You should be able to see results toward the end of the execution output.

In [37]:
# You must have your NGC API key configured
!tao speech_to_text infer \
     -r $RESULTS_DIR/quartznet/infer \
     -k $KEY \
     -e $SPECS_DIR/speech_to_text/infer.yaml \
     -g 1 \
     -m $MODELS_DIR/speechtotext_english_quartznet.tlt \
     file_paths=[$DATA_DIR/$SAMPLE_01,$DATA_DIR/$SAMPLE_02,$DATA_DIR/$SAMPLE_03]

2022-03-13 19:07:44,650 [INFO] root: Registry: ['nvcr.io']
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/root/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
[NeMo W 2022-03-13 19:07:47 experimental:27] Module <class 'nemo.collections.asr.losses.ctc.CTCLoss'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.
[NeMo W 2022-03-13 19:07:47 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_text.AudioToCharDataset'> is experimental, not ready for production and is not fully supporte

Inference results are also saved in the `results` folder.

In [38]:
!grep transcript $source_mount/results/quartznet/infer/infer.log|  tail -3

[NeMo I 2022-03-13 17:55:32 infer:70] Predicted transcript: june eighteenth nineteen sixty eight
[NeMo I 2022-03-13 17:55:32 infer:70] Predicted transcript: october first nineteen sixty nine
[NeMo I 2022-03-13 17:55:32 infer:70] Predicted transcript: hi my name is dana and i work for in vidia
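
If you want the transcripts as Python strings rather than raw log lines, a small parser along these lines would work. This is a sketch that assumes the `Predicted transcript:` marker seen in the log output above; the function name `extract_transcripts` is illustrative.

```python
import re

TRANSCRIPT_PATTERN = re.compile(r"Predicted transcript: (.*)$")


def extract_transcripts(log_lines):
    """Return the text after 'Predicted transcript:' for each matching line."""
    transcripts = []
    for line in log_lines:
        match = TRANSCRIPT_PATTERN.search(line.rstrip("\n"))
        if match:
            transcripts.append(match.group(1))
    return transcripts


# Two lines copied from the log output above
sample_log = [
    "[NeMo I 2022-03-13 17:55:32 infer:70] Predicted transcript: june eighteenth nineteen sixty eight",
    "[NeMo I 2022-03-13 17:55:32 infer:70] Predicted transcript: october first nineteen sixty nine",
]
print(extract_transcripts(sample_log))
# ['june eighteenth nineteen sixty eight', 'october first nineteen sixty nine']
```

Since the function accepts any iterable of lines, an open file handle for `infer.log` can be passed in directly.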


## 4.3.4 Citrinet Inference 
The speech-to-text English Citrinet model, `speechtotext_english_citrinet_1024.tlt`, has been downloaded in advance from the [NGC Citrinet catalog location](https://ngc.nvidia.com/catalog/models/nvidia:tlt-riva:speechtotext_english_citrinet). 
This model was trained on various NVIDIA proprietary and open-source datasets, including noisy data, multiple sampling rates (including 8 kHz for call centers), a variety of accents, various domain-specific data, and spontaneous speech and dialogue, all of which contribute to the model’s accuracy.

In [39]:
# You must have your NGC API key configured
!tao speech_to_text_citrinet infer \
     -r $RESULTS_DIR/citrinet/infer \
     -k $KEY \
     -e $SPECS_DIR/speech_to_text_citrinet/infer.yaml \
     -g 1 \
     -m $MODELS_DIR/speechtotext_english_citrinet_1024.tlt \
     file_paths=[$DATA_DIR/$SAMPLE_01,$DATA_DIR/$SAMPLE_02,$DATA_DIR/$SAMPLE_03]

2022-03-13 19:09:27,620 [INFO] root: Registry: ['nvcr.io']
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/root/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
[NeMo W 2022-03-13 19:09:30 experimental:27] Module <class 'nemo.collections.asr.losses.ctc.CTCLoss'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.
[NeMo W 2022-03-13 19:09:30 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_text.AudioToCharDataset'> is experimental, not ready for production and is not fully supporte

In [40]:
!grep transcript $source_mount/results/citrinet/infer/infer.log|  tail -3

[NeMo I 2022-03-13 17:57:13 infer:70] Predicted transcript: june eighteenth nineteen sixty eight
[NeMo I 2022-03-13 17:57:13 infer:70] Predicted transcript: october first nineteen sixty nine
[NeMo I 2022-03-13 17:57:13 infer:70] Predicted transcript: hi my name is dana and i work for nvidia
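
One simple way to see where the two models disagree is a word-level diff of their transcripts. The sketch below uses `difflib.SequenceMatcher` from the Python standard library on the third transcript from each model, as printed in this notebook; the helper name `word_differences` is illustrative.

```python
from difflib import SequenceMatcher


def word_differences(first, second):
    """List (first_words, second_words) pairs where two transcripts disagree."""
    a, b = first.split(), second.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Third transcript from each model, copied from the log output in this notebook
quartznet = "hi my name is dana and i work for in vidia"
citrinet = "hi my name is dana and i work for nvidia"
print(word_differences(quartznet, citrinet))
# [('in vidia', 'nvidia')]
```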


How do you find the ASR performance of:
- QuartzNet model?
- Citrinet model?

Discuss the performance with the instructor.

To learn more about fine-tuning the QuartzNet model with your custom data, check out the [TAO documentation](https://docs.nvidia.com/tao/tao-toolkit/text/asr/speech_recognition.html#fine-tuning-the-model) and the [speech to text notebook on NGC](https://ngc.nvidia.com/catalog/resources/nvidia:tlt-riva:speechtotext_notebook).

---
# 4.4 Exporting Models with TAO Toolkit

Models trained with NVIDIA TAO have the format suffix `.tlt`.

<img src="images/asr/tao_export_riva.png"><br>

With TAO, you can export those models into the [ONNX](https://onnx.ai/) (Open Neural Network Exchange) `.eonnx` format, or into the `.riva` format, which can be deployed using [NVIDIA Riva](https://developer.nvidia.com/riva).  NVIDIA Riva is a highly performant application framework for multimodal conversational AI services using GPUs! 

Both types of export use the same `export` subtask command. The only small variation is the configuration for the `export_format` setting in the [export.yaml](tao/specs/speech_to_text/export.yaml) spec file.  Alternatively, we can simply override the value in the spec file by adding an additional argument such as `export_format=RIVA` or `export_format=ONNX` to the end of the TAO launcher command. The documentation for the [export subtask](https://docs.nvidia.com/tao/tao-toolkit/text/asr/speech_recognition.html#model-export) provides syntax and parameter details.

## 4.4.1 ASR Model Export to ONNX

ONNX format is an open standard for machine learning and enables AI developers to use models with a variety of frameworks, tools, runtimes, and compilers.  TAO can accept ONNX format in addition to the `.tlt` format we have already worked with.  

We can test this by exporting our model to ONNX format and then running inference again on the ONNX version of the model.

In [15]:
# Export to ONNX (this takes up to 40 sec)
!tao speech_to_text export \
     -r $RESULTS_DIR/quartznet/export \
     -k $KEY \
     -e $SPECS_DIR/speech_to_text/export.yaml \
     -g 1 \
     -m $MODELS_DIR/speechtotext_english_quartznet.tlt \
     export_format=ONNX \
     export_to=asr-model.eonnx

2022-03-13 17:57:58,893 [INFO] root: Registry: ['nvcr.io']
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/root/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
[NeMo W 2022-03-13 17:58:01 experimental:27] Module <class 'nemo.collections.asr.losses.ctc.CTCLoss'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.
[NeMo W 2022-03-13 17:58:02 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_text.AudioToCharDataset'> is experimental, not ready for production and is not fully supporte

In [16]:
# Verify that our `.eonnx` model was exported as expected.
!ls $source_mount/results/quartznet/export/asr-model.eonnx

/dli/task/tao/results/quartznet/export/asr-model.eonnx


## 4.4.2 Inference with an ONNX Model
TAO can use the exported `.eonnx` model for inference using the `infer_onnx` subtask. This is very similar to the `infer` subtask for `.tlt` models. Try it! The results from this inference are listed toward the end of the execution output.

In [17]:
# Run inference with the ONNX model
!tao speech_to_text infer_onnx \
     -r $RESULTS_DIR/quartznet/infer_onnx \
     -k $KEY \
     -e $SPECS_DIR/speech_to_text/infer_onnx.yaml \
     -g 1 \
     -m $RESULTS_DIR/quartznet/export/asr-model.eonnx \
     file_paths=[$DATA_DIR/$SAMPLE_01,$DATA_DIR/$SAMPLE_02,$DATA_DIR/$SAMPLE_03]

2022-03-13 17:59:12,936 [INFO] root: Registry: ['nvcr.io']
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/root/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
[NeMo W 2022-03-13 17:59:15 experimental:27] Module <class 'nemo.collections.asr.losses.ctc.CTCLoss'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.
[NeMo W 2022-03-13 17:59:16 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_text.AudioToCharDataset'> is experimental, not ready for production and is not fully supporte

## 4.4.3 Exercise: Export the Model to Riva Format
Using what you've learned, export the ASR model to a Riva-compatible format.  Name the final model `asr-model.riva` using the `export_to` argument. If you get stuck, you can take a look at the [solution](solutions/ex4.4.3.ipynb).

In [18]:
# TODO Export model to Riva format
!tao speech_to_text export \
     -r $RESULTS_DIR/quartznet/export \
     -k $KEY \
     -e $SPECS_DIR/speech_to_text/export.yaml \
     -g 1 \
     -m $MODELS_DIR/speechtotext_english_quartznet.tlt \
     export_format=RIVA \
     export_to=asr-model.riva

2022-03-13 18:05:56,381 [INFO] root: Registry: ['nvcr.io']
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/root/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
[NeMo W 2022-03-13 18:05:59 experimental:27] Module <class 'nemo.collections.asr.losses.ctc.CTCLoss'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.
[NeMo W 2022-03-13 18:05:59 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_text.AudioToCharDataset'> is experimental, not ready for production and is not fully supporte

Verify that your model was exported as expected.

In [19]:
if os.path.exists('/dli/task/tao/results/quartznet/export/asr-model.riva'):
    print("You did it!")
else:
    print("Sorry, the model isn't there.")

You did it!


---
# 4.5 (Optional) Create Your Own Audio Samples

You can upload your own audio samples to try the ASR performance. 
The files should be in `.wav` format, resampled to 16 kHz. 
Here is a [torchaudio](https://pytorch.org/audio/stable/index.html)-based example for `.wav` file resampling.

```
import torchaudio

input_wav_file = '/path/to/my_audio.wav'
output_wav_file = '/path/to/my_audio_resampled.wav'
target_sr = 16000

y, sr = torchaudio.load(input_wav_file)  # y has shape (channels, samples)
y = y.mean(dim=0, keepdim=True)  # if there are multiple channels, average them to a single channel
if sr != target_sr:
    y = torchaudio.transforms.Resample(sr, target_sr)(y)
# Always save, so the output file exists even when the input was already 16 kHz
torchaudio.save(output_wav_file, y, target_sr)
```

---
<h2 style="color:green;">Congratulations!</h2>

In this notebook, you have:
- Gained an understanding of the QuartzNet and Citrinet ASR models
- Launched TAO with an implicit docker container to run inference on audio files with QuartzNet and Citrinet
- Exported the model to both ONNX and NVIDIA Riva formats
- Launched TAO to run inference using the ONNX model you exported

Next, you'll deploy the model on Riva. Move on to [Deployment with Riva](005_ASR_Riva_Deployment.ipynb).
