<img src="https://www.nyp.edu.sg/content/dam/nyp/logo.png" width='200'/>

Welcome to the lab! Before we get started, here are a few pointers on Jupyter notebooks.

1. The notebook is composed of cells; cells can contain code which you can run, or they can hold text and/or images which are there for you to read.

2. You can execute code cells by clicking the ```Run``` icon in the menu, or via the following keyboard shortcuts: ```Shift-Enter``` (run and advance) or ```Ctrl-Enter``` (run and stay in the current cell).

3. To interrupt cell execution, click the ```Stop``` button on the toolbar, or navigate to the ```Kernel``` menu and select ```Interrupt```.


# Lab 3 - Using NVIDIA NeMo for Speech Recognition

In this exercise, we will learn how to use NVIDIA's NeMo library to help create our speech recognition models.

## Section 3.1 - Download / Import The Necessary Modules

Run the following cell to download the datasets, import the necessary modules and set up the folders.

In [None]:
!wget https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/day2-pm/lab3/lab3.zip
!unzip lab3.zip

!pip install ipywebrtc==0.5.0
!pip install ffmpeg-python==0.2.0

# Install all libraries necessary for NeMo
!pip install wget
!apt-get install sox libsndfile1 ffmpeg
!pip install unidecode
# Quote the requirement so the shell does not treat ">=" as an output redirect
!pip install "matplotlib>=3.3.2"

## Install NeMo
BRANCH = 'v1.0.2'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

## Grab the config we'll use in this example
!mkdir configs
!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/config.yaml




## IMPORTANT - Restart Runtime

You must restart your Colab session to allow the latest libraries that we just downloaded to take effect. 

To do so, either:

1. Click on the "Restart Runtime" button in the output cell above, OR
2. Click on the menu "Runtime" > "Restart runtime"

In [None]:
from helpers import *
print("Import helpers complete.")

## Section 3.2 - Using a Pre-Trained Nemo Model

Nvidia provides many pre-trained models for English and Chinese. You can find the list available here: https://ngc.nvidia.com/catalog/models/nvidia:nemospeechmodels

The model that we will be using is the 'QuartzNet15x5Base-En' model. The QuartzNet model architecture looks like the following:

<img src="https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/resources/day2-pm/quartznet.png" />

https://arxiv.org/pdf/1910.10261.pdf


According to NeMo's website above, this model has been trained on a large number of speech datasets, including the Singapore National Speech Corpus (English). In theory, this model should work well in recognizing local accents, so go ahead and put it to the test!

Update the following cell to pass the config file's path **"configs/config.yaml"** as the first parameter, and the pre-trained model name **"QuartzNet15x5Base-En"** as the second parameter.

Run the cell to download the pre-trained model from NVidia.

In [None]:
# TODO:
# Set the path to the configuration file and the model name
# NOTE: The details of the neural network specified in the config file are
#       ignored in this case, because we will be downloading the pre-trained neural network.
#
create_speech_recognition_nemo_model_pretrained("configs/config.yaml", "???")

## Section 3.3 - Using the Pre-Trained Speech Recognition Model for Transcription

Take a look at the "video.mp4" file in the data folder. It is a video available in the public domain at: https://archive.org/details/GoodEati1951

What we will do now is to extract part of the audio clip from the movie file and send it to our pre-trained speech recognition model to transcribe.

To do so, update the following code to:

1. Set the first parameter for the movie file path to the **"data/video.mp4"** file, 
2. Set the second parameter for the start time (in seconds) of the audio to extract, for example, **0.0**
3. Set the third parameter for the end time (in seconds) of the audio to extract, for example, **120.0**.

See if the transcription from NVIDIA's speech recognition model looks reasonable based on what you heard in the video.

In [None]:
# TODO:
# Update the parameters to specify the movie file, and the audio start/end to
# transcribe.
#
perform_transcription_on_file(extract_audio_clip(
    "???",            # Path to the movie file.
    0.0,              # Start time of audio (in seconds)
    0.0))             # End time of audio (in seconds)
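To make the clip-extraction step more concrete, here is a minimal sketch of how an ffmpeg command for cutting a time range out of a movie file could be assembled. It is not the code in helpers.py: the function name `build_ffmpeg_clip_command` and the output file name `clip.wav` are hypothetical, and the 16 kHz mono output is an assumption about what the speech model expects.

```python
def build_ffmpeg_clip_command(video_path, start_s, end_s, wav_path):
    """Return the ffmpeg argument list that cuts [start_s, end_s] into a WAV file."""
    if end_s <= start_s:
        raise ValueError("end time must be greater than start time")
    return [
        "ffmpeg", "-y",          # overwrite the output file if it already exists
        "-i", video_path,        # input movie file
        "-ss", str(start_s),     # start time in seconds
        "-to", str(end_s),       # end time in seconds
        "-vn",                   # drop the video stream, keep audio only
        "-ac", "1",              # mono (assumed model input format)
        "-ar", "16000",          # 16 kHz sample rate (assumed model input format)
        wav_path,
    ]


command = build_ffmpeg_clip_command("data/video.mp4", 0.0, 120.0, "clip.wav")
print(" ".join(command))
# To actually run the extraction: subprocess.run(command, check=True)
```

Building the argument list separately from running it makes it easy to inspect the command before any file is written.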

Now, we are going to try using a recording to test our pre-trained Speech Recognition model. 

Run the following cell to display an audio recorder panel. 

In [None]:
display_audio_recorder()

Then, update the cell below to pass the result of calling **save_recorded_audio()** into the perform_transcription_on_file() function.

This converts the recorded audio into WAV format and then sends it to the NeMo speech recognition model.

Take a look at the output transcription and see if it matches what you said.

In [None]:
# TODO:
# Pass the function 'save_recorded_audio()' into the parameter but without
# the single quotation marks "'"
#
perform_transcription_on_file(???)

## Section 3.4 - Processing Our Custom AN4 Dataset

Let's assume that we want to train our own custom speech recognition model that can help us recognize people spelling out words and saying numbers. There is an existing dataset, called the AN4 dataset, that contains a collection of audio files of people spelling persons' names, cities and states, or speaking out numbers, or simple words. This dataset was developed by Carnegie Mellon University for research purposes, and is available here: http://www.speech.cs.cmu.edu/databases/an4/

We have already downloaded the AN4 dataset from the link above and set it up in your Virtual Machine in the **"data/an4"** folder. 

CMU's AN4 dataset contains audio files recorded in the SPH format. NeMo only recognizes the WAV format, so we first need to load all the SPH files and convert each of them to WAV.

Update the following code by passing in the path of the folder that contains our SPH files (the AN4 dataset folder described above).

In [None]:
# TODO:
# Update the code below to pass in the folder path containing 
# all our SPH files.
#
convert_sph_files_to_wav("???")
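As a rough illustration of what a per-file conversion could look like, the sketch below builds a `sox` command that reads one SPH file and writes a WAV file next to it. The function name `build_sox_command` is hypothetical and the example path is only illustrative; the actual conversion logic lives in helpers.py and may differ.

```python
import os


def build_sox_command(sph_path):
    """Return (command, wav_path) for converting one SPH file with sox."""
    base, _ = os.path.splitext(sph_path)
    wav_path = base + ".wav"          # same folder and name, new extension
    return ["sox", sph_path, wav_path], wav_path


command, wav_path = build_sox_command("data/an4/wav/an4_clstk/fash/an251-fash-b.sph")
print(command)
# To actually convert the file: subprocess.run(command, check=True)
```

Looping this over every `.sph` file found under the dataset folder would convert the whole dataset.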

## Section 3.5 - Exploring Our Data

Let's take a look at one of the files that we have converted and see what an audio waveform looks like. This is the sound of a person saying the letters "G L E N N".

In [None]:
audio_file = 'data/an4/wav/an4_clstk/mgah/cen2-mgah-b.wav'
playback_audio(audio_file)


The following shows what the waveform actually looks like when plotted on a graph. If the audio recording contains distinct words or letters with pauses in between, you will be able to see large segments of "perturbations" where the speaker is speaking, and flat lines where the speaker makes short pauses. The "perturbations" represent the vibrations of the sound wave as it travels through the air and gets picked up by the microphone or our ears.

In [None]:
display_audio_waveform(audio_file)
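The contrast between "perturbations" and flat lines described above can be reproduced with a synthetic signal. The sketch below is a toy example rather than part of the lab helpers: it builds two tones separated by silence and measures the root-mean-square (RMS) amplitude of each 50 ms frame. The sample rate, tone frequency, and threshold are arbitrary illustrative values.

```python
import math

SAMPLE_RATE = 8000  # samples per second (illustrative value)


def tone(duration_s, freq_hz=220.0, amplitude=0.5):
    """A sine wave standing in for a stretch of speech."""
    n = round(duration_s * SAMPLE_RATE)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


def silence(duration_s):
    """A flat line standing in for a pause."""
    return [0.0] * round(duration_s * SAMPLE_RATE)


def frame_rms(samples, frame_len):
    """Root-mean-square amplitude of each non-overlapping frame."""
    rms = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return rms


# Two "spoken" segments separated by a short pause.
signal = tone(0.2) + silence(0.1) + tone(0.2)
energies = frame_rms(signal, frame_len=400)      # 400 samples = 50 ms frames
is_speech = [e > 0.05 for e in energies]         # simple fixed threshold
print(is_speech)
```

Frames that fall inside the pause have near-zero energy, which corresponds to the flat parts of the plotted waveform.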

A spectrogram is a representation, produced with signal processing techniques, that breaks a signal such as the audio waveform above into its time-frequency components. Let's take a look at what that piece of audio looks like as a spectrogram.

It tells you, at each time step, which frequencies (or pitches) are "playing".

In [None]:
display_audio_spectrogram(audio_file)
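To see how a spectrogram reports which frequencies are present at each time step, here is a minimal pure-Python sketch. It is not how helpers.py draws the plot: it applies a naive discrete Fourier transform to non-overlapping frames without any window function, and the signal (a 500 Hz tone followed by a 1500 Hz tone) is synthetic.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitude of DFT bins 0..N/2 of one frame (naive O(N^2) version)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(total))
    return mags


def spectrogram(samples, frame_len, hop):
    """Per-frame magnitude spectra: spec[frame_index][frequency_bin]."""
    return [dft_magnitudes(samples[s:s + frame_len])
            for s in range(0, len(samples) - frame_len + 1, hop)]


SAMPLE_RATE = 8000
FRAME_LEN = 64

# First half of the signal is a 500 Hz tone, second half a 1500 Hz tone.
samples = [math.sin(2 * math.pi * 500 * i / SAMPLE_RATE) for i in range(256)]
samples += [math.sin(2 * math.pi * 1500 * i / SAMPLE_RATE) for i in range(256)]

spec = spectrogram(samples, FRAME_LEN, hop=64)
bin_hz = SAMPLE_RATE / FRAME_LEN  # width of one frequency bin in Hz
peak_hz = [max(range(len(row)), key=row.__getitem__) * bin_hz for row in spec]
print(peak_hz)
```

Each entry of `peak_hz` is the strongest frequency in one frame, so the list switches from the first tone to the second halfway through. Practical implementations use a fast Fourier transform with overlapping, windowed frames.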

Then, see what the same piece of audio looks like when the y-axis (the frequency axis) is mel-scaled. The mel scale is a scale of pitches that human listeners perceive to be "equally" spaced from one another. Interesting fact: the word "mel" actually comes from the word "melody".

In [None]:
display_audio_mel_spectrogram(audio_file)
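The mel scale mentioned above can be sketched with one commonly used conversion formula, m = 2595 · log10(1 + f / 700). Other variants of the formula exist, and the plotting helper may use a different one; this example only illustrates how the scale compresses high frequencies.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels using one common formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


for hz in (250, 500, 1000, 2000, 4000, 8000):
    print(f"{hz:>5} Hz -> {hz_to_mel(hz):7.1f} mel")
```

With this formula, a 1000 Hz step near the bottom of the range covers far more mels than the same 1000 Hz step near 8000 Hz, which is why a mel-scaled spectrogram gives more vertical space to low frequencies.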

## Section 3.6 - Generating Our Manifest Files From AN4

To train the NeMo model, we must prepare our training and test datasets with manifest files that indicate where to find the audio files and what the correct transcription for each file is.

The cell below lists all the audio files, loads all the transcripts, and then generates the manifest file in the format required by the NeMo toolkit.

The original AN4 transcriptions look like this:

```
    <s> YES </s> (an251-fash-b)
    <s> GO </s> (an253-fash-b)
    <s> YES </s> (an254-fash-b)
```

We will have to convert it to the manifest required by Nemo in the following format:

```
    {"audio_filepath": "data/an4/wav/an4_clstk/fash/an251-fash-b.wav", "duration": 1.0, "text": "yes"}
    {"audio_filepath": "data/an4/wav/an4_clstk/fash/an253-fash-b.wav", "duration": 0.7, "text": "go"}
    {"audio_filepath": "data/an4/wav/an4_clstk/fash/an254-fash-b.wav", "duration": 0.9, "text": "yes"}
```
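The conversion from a transcript line to a manifest entry can be sketched as follows. This is an illustrative version, not the implementation in helpers.py: the function name `transcript_line_to_manifest_entry` is hypothetical, the duration is passed in as an argument (in practice it would be measured from the WAV file), and the path is built with forward slashes.

```python
import json


def transcript_line_to_manifest_entry(line, wav_root, duration_s):
    """Convert one AN4 transcript line into a manifest dictionary."""
    line = line.strip()
    # The text sits between the <s> and </s> markers.
    text = line[line.index("<s>") + len("<s>"):line.index("</s>")].strip().lower()
    # The file ID sits inside the trailing parentheses, e.g. an251-fash-b.
    file_id = line[line.rindex("(") + 1:line.rindex(")")]
    # The speaker folder is the middle part of the file ID, e.g. fash.
    speaker = file_id[file_id.index("-") + 1:file_id.rindex("-")]
    return {
        "audio_filepath": f"{wav_root}/{speaker}/{file_id}.wav",
        "duration": duration_s,
        "text": text,
    }


entry = transcript_line_to_manifest_entry(
    "<s> YES </s> (an251-fash-b)", "data/an4/wav/an4_clstk", 1.0)
print(json.dumps(entry))  # the manifest file holds one JSON object per line
```

Writing one `json.dumps(entry)` line per audio file yields a manifest in the format shown above.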

Update the cell below with the paths to the original AN4 transcript files, the folders containing the audio files, and the output NeMo manifest files.

*Training File:*
- File is at **'data/train_manifest'**

*Test File:*
- File is at **'data/test_manifest'**

When you are done, run the cell.

In [None]:
# This constructs the manifest for the training data set.
#
# TODO: Update the code below to indicate where our transcripts and WAV files
#       are, and where we should save the manifest file.
#
build_manifest_for_an4_dataset('data/an4/etc/an4_train.transcription','data/an4/wav/an4_clstk', '???')
print("Training manifest created.")

# This constructs the manifest for the test data set.
#
# TODO: Update the code below to indicate where our transcripts and WAV files
#       are, and where we should save the manifest file.
#
build_manifest_for_an4_dataset('data/an4/etc/an4_test.transcription','data/an4/wav/an4test_clstk', '???')
print("Test manifest created.")

print("Completed.")

## Section 3.7 - Training a Model from Scratch

Training an Automatic Speech Recognition model takes a lot of time. Fortunately, the AN4 dataset is small enough for us to train our model quickly for demonstration purposes.

Update the parameters to indicate where to load the model's configuration file (configured to use a Jasper architecture), the training data and validation data manifest files.

The Jasper architecture looks like the following:

<img src="https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/resources/day2-pm/jasper.png" />

https://arxiv.org/pdf/1904.03288.pdf

Update the cell below to specify the training and test manifest files.

*Training File:*
- File is at **'data/train_manifest'**

*Test File:*
- File is at **'data/test_manifest'**


In [None]:
# Update the paths to the training and test manifest files.
#
create_speech_recognition_nemo_model('configs/config.yaml', '???', '???')

Update the batch size below to **8** and the number of epochs to **30**. 

Smaller batch sizes tend to let your model converge in fewer epochs, but each epoch takes slightly longer to train. Larger batch sizes run each epoch more quickly, but your model tends to converge more slowly.

Then, run the cell below to see how the training happens. 

If you run this with Jupyter Notebook (not Jupyter Lab), you will be able to see the training loss as the training progresses. Pay attention to how the loss decreases as the training continues. Typically, for a good speech recognition model in NeMo on this AN4 dataset, the loss should be about 10 or below. The lower the loss, the better the prediction confidence.

If you are in Jupyter Lab, you may not be able to see the training loss, so just wait for all the training epochs to complete.

In [None]:
# TODO:
# Set the batch size to 4, 8, or 16, and set the number of epochs to 30.
# 
train_speech_recognition_nemo_model(
    0,                                   # Batch size
    1,                                   # Number of GPUs
    0)                                   # Max number of epochs
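One way to see why the batch size affects the time per epoch is to count the number of weight updates in one pass over the data. The sketch below uses a hypothetical training-set size of 1000 utterances, which is not the real AN4 count.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of batches (and weight updates) in one pass over the data."""
    return math.ceil(num_examples / batch_size)


NUM_EXAMPLES = 1000  # hypothetical training-set size, not the real AN4 count
for batch_size in (4, 8, 16):
    print(batch_size, steps_per_epoch(NUM_EXAMPLES, batch_size))
```

A smaller batch size means more updates per epoch, so each epoch takes longer, while the model has more chances to adjust its weights within that epoch.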

You can save the trained model to a file and reload it later!

In [None]:
save_speech_recognition_nemo_model('models/first_asr_model.nemo')

You can then load the model from that saved file at any time later, without retraining it.

In [None]:
load_speech_recognition_nemo_model('models/first_asr_model.nemo')

To determine how well our model is working, we compute the Word Error Rate (WER): we get our new speech recognition model to transcribe all the audio files in our validation dataset, and then compare its output against the correct transcriptions.

In [None]:
# TODO:
# Compute the Word Error Rate based on the testing data.
# 
compute_wer('data/test_manifest')
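For reference, the Word Error Rate of a single utterance is the word-level edit distance (substitutions, insertions, and deletions) between the model's output and the reference transcription, divided by the number of words in the reference. The sketch below is a standalone illustration; the name `word_error_rate` is hypothetical, and the `compute_wer` helper may aggregate errors over the whole dataset differently.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# The model dropped one of the five spelled letters.
print(word_error_rate("g l e n n", "g l e n"))
```

Because insertions also count as errors, the WER can exceed 100% when the model outputs many more words than the reference contains.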

## Section 3.8 - Use the Pre-Trained Model

We can improve the model by changing the model parameters or training the model for more epochs.

We can also try to use the pre-trained model (the one we used above) to see if the results are any good on this custom AN4 dataset. After all, the pre-trained model provided by NVIDIA has been trained on a much larger dataset and is able to capture more variations of the same words.

Update the following cell to: 
1. Use the configuration file at **"configs/config.yaml"**.
2. Download the pre-trained model **"QuartzNet15x5Base-En"**.
3. Call the compute_wer function with the new model and the params.

NOTE: Once again, NVidia provides a list of models that you can use. Refer to the link to find out more: https://ngc.nvidia.com/catalog/models/nvidia:nemospeechmodels


In [None]:
# TODO:
# Set the path to the configuration file and the model name
# NOTE: The details of the neural network specified in the config file are
#       ignored in this case, because we will be downloading the pre-trained neural network.
#
create_speech_recognition_nemo_model_pretrained('configs/config.yaml', '???')


Run the cell below to see what happens when we use the pre-trained model on our AN4 validation dataset.

The Word Error Rate for the pre-trained model may average about 13%, which is not as good as we might expect given that the model has been pre-trained. This could well be because the AN4 "spelling" task is very different from the conversational audio datasets used to train that model.

In [None]:
# TODO:
# Update the following line to compute the WER on our test dataset
#
compute_wer('data/test_manifest')

## Section 3.9 - Using Transfer Learning on a Pre-Trained Model

Since the downloaded model has been previously trained to transcribe English text using a very large speech corpus, it would have learned to model a lot of features about the English language. 

We can take advantage of that to improve the Word Error Rate of our custom AN4 recognition task by further training the downloaded QuartzNet model with the custom AN4 dataset.

In the following cell, set the learning rate to 0.001. This value is actually lower than the default learning rate of 0.01 we used to train a new model from scratch. The rationale for this is that since the model is already pre-trained, we do not need the model to make big adjustments during training.


In [None]:
# TODO:
# Update the learning rate to 0.001
#
set_learning_rate(0)
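The effect of the learning rate on a model that already starts close to a good solution can be illustrated with a toy one-parameter example. This is not the NeMo training loop: the loss function, the starting weight, and the curvature are all made up, and the curvature is deliberately chosen so that the larger rate overshoots.

```python
def gradient_descent(w_start, learning_rate, steps):
    """Minimize the toy loss 250 * (w - 1)^2, whose minimum is at w = 1."""
    w = w_start
    for _ in range(steps):
        grad = 500.0 * (w - 1.0)      # derivative of the toy loss
        w -= learning_rate * grad     # one gradient descent update
    return w


W_PRETRAINED = 1.1  # hypothetical weight that already sits close to the optimum
small = gradient_descent(W_PRETRAINED, learning_rate=0.001, steps=5)
large = gradient_descent(W_PRETRAINED, learning_rate=0.01, steps=5)
print(abs(small - 1.0), abs(large - 1.0))
```

In this toy setting, the smaller rate moves the weight steadily toward the optimum, while the larger rate takes steps big enough to push it further away than where it started. Real networks behave less simply, but the same intuition motivates using a lower learning rate for fine-tuning.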

Update the following cell to set the batch size to **8**, and the number of epochs to **3**, and then run the cell.

You will notice that the training takes a little longer per epoch compared to our previous model, because this model is larger and has more trainable parameters.

In [None]:
set_train_test_manifest('data/train_manifest', 'data/test_manifest')

# TODO:
# Update the batch size and the number of epochs.
#
train_speech_recognition_nemo_model(
    0,                                   # Batch size
    1,                                   # Number of GPUs
    0)                                   # Max number of epochs

Once again, update and run the following cell to use our newly-trained QuartzNet model to compute the Word Error Rate.

With the custom training that we have done, the Word Error Rate on your test data drops dramatically. You can try to run the cell above to train the model for a few more epochs to reduce the Word Error Rate. You may find that the Word Error Rate can drop down to about 2%!

Try reducing the learning rate to an even lower value than 0.001 and train your model again. You may see your Word Error Rate fall to near 0%!

In [None]:
# TODO:
# Update the following line to compute the WER on our test dataset
#
compute_wer('data/test_manifest')

Run the following cell to save the model.

In [None]:
save_speech_recognition_nemo_model('models/improved_quartznet_model.nemo')

Once again, you can load up the improved QuartzNet model trained on the AN4 dataset at any time later.

In [None]:
load_speech_recognition_nemo_model('models/improved_quartznet_model.nemo')

## Section 3.10 - Testing Your Model

This is how we can perform a transcription on a WAV file using our newly trained QuartzNet model.


In [None]:
perform_transcription_on_file('data/an4/wav/an4_clstk/mgah/cen2-mgah-b.wav')

Now, try to record some audio of yourself spelling out words or saying numbers.  

Run the following cell to display an audio recorder panel.

In [None]:
display_audio_recorder()

Then, update the cell below to pass the result of calling **save_recorded_audio()** into the perform_transcription_on_file() function.

Then run the cell to transcribe what you have recorded.

In [None]:
# TODO:
# Pass the function 'save_recorded_audio()' into the parameter but without
# the single quotation marks "'"
#
perform_transcription_on_file(save_recorded_audio())

Try also to record conversational speech.

Discuss what happened when your newest model tried to transcribe conversational speech, compared to the original unaltered pre-trained model.


## Section 3.11 - Explore helpers.py

Do take a look at the helpers.py file to see how we processed files and used Nemo.


## Section 3.12 - Challenge

Try to use your model to transcribe speech, and then perform sentiment analysis on the transcription using the Text Classification models that we created in the earlier labs.


## Section 3.13 - Further Reading

The NVIDIA NeMo library provides many neural network models for Natural Language Processing, Speaker Recognition, and Text-to-Speech. You can find out more about NeMo here:
- https://github.com/NVIDIA/NeMo

This exercise was also heavily adapted from the tutorial provided by Nemo that you can find here:
- https://colab.research.google.com/github/NVIDIA/NeMo/blob/main/tutorials/asr/01_ASR_with_NeMo.ipynb
