# NVIDIA NeMo - Building a Custom Speech Recognition Model

1. [Introduction](Introduction)

2. [Setup](Setup)

3. [Data](Data)

4. [Model Training](Model-Trainining)

5. [Next Steps](Next-Steps)

## Introduction
NVIDIA NeMo is a conversational AI toolkit. It acts as an accelerator that helps researchers and practitioners experiment with complex neural network architectures. Its main capabilities are speech processing (recognition and synthesis) and natural language processing. Because it comes from NVIDIA, it has full GPU support. The framework relies on PyTorch as its deep learning framework.

In this notebook, we will build an Automatic Speech Recognition (ASR) model using the LibriSpeech dataset.

## Setup

The following software was used for this experiment:

- Ubuntu 16.04
- Anaconda 4.7.11
- NeMo - https://github.com/NVIDIA/NeMo
- Kaldi - https://github.com/kaldi-asr/kaldi

Follow the instructions in each project's README to install it. Make sure that PyTorch is installed with GPU support.

#### Hardware Specification

A GPU with at least 6 GB of memory is required.
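
Before going further, it can help to confirm that PyTorch actually sees a GPU and how much memory it has. The snippet below is a minimal sketch: the helper name `total_gpu_memory_gib` is ours (not part of NeMo or PyTorch), and it only relies on `torch.cuda.is_available()` and `torch.cuda.get_device_properties()`.

```python
import importlib.util


def total_gpu_memory_gib(device_index=0):
    """Return the total memory of a CUDA device in GiB, or None if unavailable.

    Hypothetical helper for this tutorial; not part of NeMo or PyTorch.
    """
    if importlib.util.find_spec("torch") is None:
        return None  # PyTorch is not installed in this environment
    import torch

    if not torch.cuda.is_available():
        return None  # PyTorch is installed, but without a usable GPU
    props = torch.cuda.get_device_properties(device_index)
    return props.total_memory / (1024 ** 3)


memory_gib = total_gpu_memory_gib()
if memory_gib is None:
    print("No CUDA-capable PyTorch installation was found.")
else:
    print(f"GPU 0 reports {memory_gib:.1f} GiB of memory.")
```

If the reported value is well below the roughly 6 GB mentioned above, training will probably run out of memory unless the batch size in the model configuration is reduced.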

## Data
LibriSpeech is an open-domain speech recognition dataset.
It can be downloaded from http://www.openslr.org/12. For this tutorial, we use the dev-clean subset - http://www.openslr.org/resources/12/dev-clean.tar.gz . To keep training feasible on a small GPU, we selected data from only two folders: 'dev-clean/84/121123/' and 'dev-clean/84/121550/'.

The speech files are stored in '.flac' format and must be converted to '.wav' format for NeMo to work. NeMo training also requires a 'manifest' file, which contains the path to each '.wav' recording, its duration, and its transcript.

To make life easy, we created a utility that converts the '.flac' files to '.wav' and generates the manifest file.
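
For reference, a NeMo ASR manifest is a JSON-lines file: one JSON object per line with the keys `audio_filepath`, `duration` (in seconds), and `text`. The sketch below shows how such a file could be written for audio that is already in '.wav' format, using only the Python standard library. It is not the source of the `wavconvert` utility; the function names are hypothetical, and lowercasing the transcripts is our assumption (LibriSpeech transcripts are uppercase, and the character set expected by the model configuration should be checked).

```python
import json
import wave
from pathlib import Path


def wav_duration_seconds(wav_path):
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(str(wav_path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def read_librispeech_transcripts(trans_path):
    """Parse a LibriSpeech '*.trans.txt' file into {utterance_id: text}."""
    transcripts = {}
    for line in Path(trans_path).read_text().splitlines():
        if not line.strip():
            continue
        # Each line looks like: "84-121550-0000 SOME TRANSCRIPT TEXT"
        utt_id, _, text = line.strip().partition(" ")
        transcripts[utt_id] = text.lower()  # assumption: lowercase vocabulary
    return transcripts


def write_manifest(wav_dir, trans_path, manifest_path):
    """Write one JSON object per line for every .wav that has a transcript."""
    transcripts = read_librispeech_transcripts(trans_path)
    count = 0
    with open(manifest_path, "w") as out:
        for wav_path in sorted(Path(wav_dir).glob("*.wav")):
            text = transcripts.get(wav_path.stem)
            if text is None:
                continue  # skip audio files without a transcript
            entry = {
                "audio_filepath": str(wav_path),
                "duration": wav_duration_seconds(wav_path),
                "text": text,
            }
            out.write(json.dumps(entry) + "\n")
            count += 1
    return count
```

The '.flac' to '.wav' conversion itself is left out of this sketch because it needs an audio library or an external tool.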


In [None]:
from wavconvert import create_nemo_manifest

#### Create Training Manifest File

In [None]:
flac_path = "/home/jaganadhg/AI_RND/nvidianemo/LibriSpeech/dev-clean/84/121550/"
meta_path = "meta_train.json"

# Convert the .flac files and write the training manifest
create_nemo_manifest(flac_path,
    meta_path)

#### Create Validation Manifest File

In [None]:
flac_path = "/home/jaganadhg/AI_RND/nvidianemo/LibriSpeech/dev-clean/84/121123/"
meta_path = "meta_val.json"

# Convert the .flac files and write the validation manifest
create_nemo_manifest(flac_path,
    meta_path)

## Model Training

Let's jump straight into building a model. We will talk about FFTs, spectra, and language models later. A utility script, `asrtrainer`, abstracts the training process. QuartzNet15x5 is used as the base model. Speech recognition results are evaluated with Word Error Rate (WER), and the utility script implements a WER calculator.
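
To make the metric concrete: WER is the number of word substitutions, deletions, and insertions needed to turn the recognized text into the reference transcript, divided by the number of words in the reference. The function below is a minimal sketch of that definition using a word-level edit distance. It is not the implementation inside the utility script, and the name `word_error_rate` is ours.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word is missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Here one word is substituted ("sat" becomes "sit") and one is dropped ("the"), so the result is 2 errors over 6 reference words. Note that WER can exceed 1.0 when the hypothesis contains many extra words.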

#### Note: Adjust the number of epochs as needed to get a decent model.

 

In [None]:
from asrtrainer import (train_model,
        computer_wer)
from ruamel.yaml import YAML

In [None]:
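config_path = 'quartznet_15x5.yaml'
# Manifest files created in the Data section
train_manifest = "meta_train.json"
val_manifest = "meta_val.json"

# Load the model configuration
yaml = YAML(typ='safe')
with open(config_path) as f:
    model_params = yaml.load(f)

# Train the model with the asrtrainer utility
my_asr_model = train_model(model_params,
                            train_manifest,
                            val_manifest,
                            5,
                            False)

# Evaluate the trained model with Word Error Rate
wer = computer_wer(model_params,
                    my_asr_model)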


#### The trained model can be saved in the '.nemo' format.


In [None]:
my_asr_model.save_to("tutorial.nemo")

## Next Steps

In this tutorial, we created a very simple model, which may not perform well at all. To improve it, we can train on a larger dataset, for example the entire LibriSpeech dev-clean set, and increase the number of epochs (I tried 1000 epochs and the transcriptions looked good!).

If you are interested in experimenting further, the model configuration is available in the 'quartznet_15x5.yaml' file.