<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# 4.0 ASR Pipeline with NVIDIA NeMo
## (part of Lab 1)

In this notebook, you'll work with the NeMo library to run the speech recognition pipeline and zoom into each of its steps, including preprocessing modules (spectrograms), acoustic models (AMs), predictions, and post-processing steps.

**[4.1 Speech Representation](#4.1-Speech-Representation)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[4.1.1 Speech in the Temporal Domain](#4.1.1-Speech-in-the-Temporal-Domain)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.1.2 Speech in the Frequency Domain](#4.1.2-Speech-in-the-Frequency-Domain)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.1.3 Mel Spectrograms](#4.1.3-Mel-Spectrograms)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.1.4 Exercise: Reduce the Mel Filter Bands](#4.1.4-Exercise:-Reduce-the-Mel-Filter-Bands)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.1.5 Go Further With Speech Representation (Cepstrum and MFC)](#4.1.5-Go-Further-With-Speech-Representation-(Cepstrum-and-MFC))<br>
**[4.2 Acoustic Model Architectures](#4.2-Acoustic-Model-Architectures)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[4.2.1 QuartzNet](#4.2.1-QuartzNet)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.2.2 Citrinet](#4.2.2-Citrinet)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.2.3 Conformer-CTC](#4.2.3-Conformer-CTC)<br>
**[4.3 Acoustic Models with NeMo](#4.3-Acoustic-Models-with-NeMo)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[4.3.1 Load QuartzNet](#4.3.1-Load-QuartzNet)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.3.2 Load Citrinet](#4.3.2-Load-Citrinet)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.3.3 Load Conformer-CTC](#4.3.3-Load-Conformer-CTC)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.3.4 ASR Inference with Greedy Decoder](#4.3.4-ASR-Inference-with-Greedy-Decoder)<br>
**[4.4 Transcript Decoders](#4.4-Transcript-Decoders)<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[4.4.1 Beam Search Decoder](#4.4.1-Beam-Search-Decoder)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[4.4.2 Beam Search Decoder with a Language Model](#4.4.2-Beam-Search-Decoder-with-a-Language-Model)<br>
**[4.5 Punctuation and Capitalization](#4.5-Punctuation-and-Capitalization)<br>**
**[4.6 Inverse Text Normalization (ITN)](#4.6-Inverse-Text-Normalization-(ITN))<br>**
&nbsp;&nbsp;&nbsp;&nbsp;[4.6.1 ITN on Other Languages](#4.6.1-ITN-on-Other-Languages)<br>
**[4.7 Language Identification](#4.7-Language-Identification)<br>**
**[4.8 (Optional) Create Your Own Audio Samples](#4.8-(Optional)-Create-Your-Own-Audio-Samples)<br>**
**[4.9 Shut Down the Kernel](#4.9-Shut-Down-the-Kernel)<br>**

<img src="images/asr/ASR_pipeline.PNG">

Building an Automatic Speech Recognition (ASR) pipeline is often the first step in building a conversational AI application. An ASR model converts audio speech into readable text. The main metric used to evaluate ASR models is the Word Error Rate (WER). The pipeline shown above consists of the following steps:

- **Feature Extraction:** This step converts the temporal audio form of speech to the frequency domain and generates a spectrogram or mel spectrogram.
- **Acoustic Model (AM):** The AM is a neural network that outputs probabilities over characters, phonemes, or tokens for each time step. In this lab, we will use state-of-the-art AMs: QuartzNet, Citrinet, and Conformer-CTC.
- **Decoder and Language Model (LM):** A decoder converts the probability matrix output by the AM into text. A language model is typically used to rescore the candidate transcriptions according to how likely they are under the text corpus it was trained on.
- **Punctuation:** Add capitalization and punctuation marks.   
- **Inverse Text Normalization (ITN):** Transform the text into readable format using grammar rules.
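
The WER mentioned above is the word-level edit distance (substitutions, deletions, and insertions) between a reference and a hypothesis, divided by the number of reference words. The following is a minimal sketch of that definition in plain Python; the function name `word_error_rate` and the example sentences are illustrative assumptions, and the reference is assumed to be non-empty.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # previous[j] = edit distance between the reference words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current[j] = min(substitution, deletion, insertion)
        previous = current
    return previous[-1] / len(ref)


toy_reference = "the cat sat on the mat"
toy_hypothesis = "the cat sit on mat"   # one substitution, one deletion
toy_wer = word_error_rate(toy_reference, toy_hypothesis)
print(f"WER: {toy_wer:.2%}")
```

Because insertions are counted, the WER can exceed 100% when the hypothesis is much longer than the reference.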

---
# 4.1 Speech Representation
To process audio with Machine Learning techniques, we need to represent it in a numerical format. We will see two forms of speech representation: the temporal domain and the frequency domain.

## 4.1.1 Speech in the Temporal Domain


The most common representation of speech data uses the temporal domain, which represents amplitude changes (vibrations) through time.

<img src="images/asr/time_domain.png">

Let's have a look at a wave file represented in the temporal domain.

In [None]:
# import relevant libraries
import os

import numpy as np
# Import audio processing library
import librosa
# We'll use this to listen to audio
from IPython.display import Audio, display
from plotly import graph_objects as go
import ipywidgets

Let's now load and listen to an audio sample. We will be using the `librosa.load` function. The sampling rate is the number of sampled amplitude values per second. To preserve the sampling rate of the file, we will use `sr=None`.

In [None]:
AUDIO_FILENAME = 'dli_workspace/data/audio_sample.wav'

# load audio signal with librosa
signal, sample_rate = librosa.load(AUDIO_FILENAME, sr=None)
duration=librosa.get_duration(y=signal, sr=sample_rate)
print("Duration:", duration)
print("Native sample rate:", sample_rate)

In [None]:
# Display audio player for the signal
display(Audio(data=signal, rate=sample_rate))

We can check the shape of the loaded signal. It is a one-dimensional tensor with the amplitude values measured through time. The tensor size is $duration \times sample\_rate$

In [None]:
# look at the signal shape
print("Signal shape:", signal.shape)
print("duration x sample_rate =", duration*sample_rate)
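
To make the relation between duration, sampling rate, and array length concrete without depending on the audio file, here is a small sketch that builds a synthetic tone with NumPy. The values of `toy_sample_rate`, `toy_duration`, and `tone_hz` are arbitrary assumptions chosen for illustration, and the `toy_` names are used so that the notebook's own `signal` and `sample_rate` are left untouched.

```python
import numpy as np

toy_sample_rate = 8000      # samples per second (assumed value)
toy_duration = 0.25         # seconds (assumed value)
tone_hz = 440.0             # frequency of the synthetic tone (assumed value)

# One timestamp per sample: 0, 1/sr, 2/sr, ...
n_samples = int(toy_duration * toy_sample_rate)
t = np.arange(n_samples) / toy_sample_rate
toy_signal = 0.5 * np.sin(2 * np.pi * tone_hz * t)

print("Signal shape:", toy_signal.shape)
print("duration x sample_rate =", toy_duration * toy_sample_rate)
print("Recovered duration (s):", toy_signal.shape[0] / toy_sample_rate)
```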

Let's now visualize the signal of the previous wave file in the temporal domain.

In [None]:
# define the temporal display layout
temporal_layout={
        'height': 300,
        'xaxis': {'title': 'Time (s)'},
        'yaxis': {'title': 'Amplitude'},
        'title': 'Audio Signal',
        'margin': dict(l=0, r=0, t=40, b=0, pad=0),
    }
line={'color': 'green'}


# Plot the signal in time domain
fig_signal = go.Figure(go.Scatter(x=np.arange(signal.shape[0])/sample_rate, y=signal, line=line,name='Waveform',
               hovertemplate='Time: %{x:.2f} s<br>Amplitude: %{y:.2f}<br><extra></extra>'),
                layout=temporal_layout)
fig_signal.show()

## 4.1.2 Speech in the Frequency Domain


Another common way of representing speech is the frequency domain. This consists of transforming the temporal-domain representation into the set of frequencies present in the signal using a [Fourier Transform (FT)](https://en.wikipedia.org/wiki/Fourier_transform). The result of a [Discrete Fourier Transform (DFT)](https://en.wikipedia.org/wiki/Discrete_Fourier_transform), the discrete version of the FT, is a spectrum. A [Short-Time Fourier Transform (STFT)](https://en.wikipedia.org/wiki/Short-time_Fourier_transform) represents a signal in the time-frequency domain by computing DFTs over short overlapping windows; the magnitude of the STFT displayed over time is called a spectrogram.

Let's see the Short-Time Fourier Transform in action:
- `n_fft`: length of each short segment (window), in samples
- `time_stride`: time between the starts of two consecutive segments, in seconds (converted to `hop_length` in samples); segments overlap when the stride is shorter than the window

In [None]:
# Calculate the STFT
time_stride=0.01
hop_length = int(sample_rate*time_stride)
n_fft = 512

# linear scale spectrogram
s_stft = librosa.stft(y=signal, n_fft=n_fft,hop_length=hop_length)

print("hop_length is {}".format(hop_length))
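
The shape of an STFT follows directly from `n_fft` and the hop length. The sketch below is a simplified NumPy-only STFT, not librosa's implementation: it uses no padding, whereas `librosa.stft` centers frames by padding the signal by default, so its frame count is slightly higher. The helper `toy_stft` and all `toy_` values are assumptions for illustration.

```python
import numpy as np

def toy_stft(x, n_fft, hop):
    """Hann-windowed DFT of each frame; returns an array of shape (1 + n_fft // 2, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T


toy_sr = 16000                           # assumed sampling rate
toy_hop = int(toy_sr * 0.01)             # 10 ms stride -> 160 samples
toy_n_fft = 512                          # 32 ms window at 16 kHz
t = np.arange(toy_sr) / toy_sr           # one second of audio
toy_x = np.sin(2 * np.pi * 1000.0 * t)   # 1 kHz tone

toy_spec = toy_stft(toy_x, toy_n_fft, toy_hop)
peak_bin = int(np.abs(toy_spec).mean(axis=1).argmax())
print("STFT shape (frequency bins, frames):", toy_spec.shape)
print("Overlap between consecutive windows (samples):", toy_n_fft - toy_hop)
print("Peak frequency (Hz):", peak_bin * toy_sr / toy_n_fft)
```

Each frame yields `1 + n_fft // 2` frequency bins because the DFT of a real signal is symmetric, so only the non-negative frequencies are kept.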

Visualize the spectrogram generated by the STFT operation.

In [None]:
# Define the frequency display layout
colorscale=[ [0, 'rgb(30,62,62)'], [0.5, 'rgb(30,128,128)'], [1, 'rgb(30,255,30)'],]
colorbar=dict(ticksuffix=' dB')
frequency_layout={     'height': 300,
        'xaxis': {'title': 'Time (s)'},
        'yaxis': {'title': 'Frequency (kHz)'},
        'title': 'Spectrogram',
        'margin': dict(l=0, r=0, t=40, b=0, pad=0),
    }

# Convert a power spectrogram (amplitude squared) to decibel (dB) units
s_stft_db = librosa.power_to_db(np.abs(s_stft)**2, ref=np.max, top_db=100)

# Plot the spectrogram
fig_spectrum = go.Figure(go.Heatmap(z=s_stft_db, colorscale=colorscale,colorbar=colorbar, name='Spectrogram',
               hovertemplate='Time: %{x:.2f} s<br>Frequency: %{y:.2f} kHz<br>Magnitude: %{z:.2f} dB<extra></extra>'),layout=frequency_layout)
fig_spectrum.show()
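
The decibel conversion used above can be written out in a few lines. This is a minimal sketch of the usual formula, $10 \log_{10}(S / ref)$ with the reference set to the maximum of $S$ and a floor `top_db` below the peak; the helper `toy_power_to_db` is an illustrative assumption and only approximates what `librosa.power_to_db` does with `ref=np.max`.

```python
import numpy as np

def toy_power_to_db(power, top_db=100.0, amin=1e-10):
    """10 * log10(power / max(power)), floored at `top_db` below the peak."""
    power = np.maximum(power, amin)                       # avoid log10(0)
    db = 10.0 * np.log10(power) - 10.0 * np.log10(power.max())
    return np.maximum(db, db.max() - top_db)


toy_power = np.array([1.0, 0.1, 0.01, 1e-15])
toy_db = toy_power_to_db(toy_power)
print(toy_db)
```

Every factor of 10 in power corresponds to 10 dB, and very small values are clipped so that silence does not dominate the color scale.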

From the frequency domain, it is possible to go back to the temporal domain using the inverse STFT.
Let's regenerate the temporal-domain representation of the audio from the STFT and listen to it.

In [None]:
# Inverse STFT
signal_hat = librosa.istft(s_stft, n_fft=n_fft, hop_length=hop_length)

# Plot the converted signal in time domain
fig_signal = go.Figure( go.Scatter(x=np.arange(signal_hat.shape[0])/sample_rate, y=signal_hat, line=line,
               name='Waveform',hovertemplate='Time: %{x:.2f} s<br>Amplitude: %{y:.2f}<br><extra></extra>'),layout=temporal_layout)
fig_signal.show()

# Listen to the converted signal  
display(Audio(data=signal_hat, rate=sample_rate))

## 4.1.3 Mel Spectrograms

The human ear perceives differences between frequencies in a non-linear way: a difference between two low frequencies is more perceptible than the same difference between two high frequencies. The mel scale reproduces this effect by making the representation more discriminative for low frequencies and less discriminative for high frequencies. In speech processing, mel filter banks are applied to spectrograms so that they match the human perception of the distances between frequencies.
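
As an illustration of this non-linearity, the sketch below uses one widely used mel formula, $m = 2595 \log_{10}(1 + f / 700)$. This is an assumption made for the example: librosa's default mel scale is a slightly different variant, so the exact numbers will not match its filter banks. The helpers `hz_to_mel` and `mel_to_hz` are defined here and are not the librosa functions.

```python
import numpy as np

def hz_to_mel(freq_hz):
    """Convert Hz to mel with the 2595 * log10(1 + f / 700) formula."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel) / 2595.0) - 1.0)


# The same 100 Hz step at a low and at a high frequency
low_step = hz_to_mel(200.0) - hz_to_mel(100.0)
high_step = hz_to_mel(7100.0) - hz_to_mel(7000.0)
print(f"100 -> 200 Hz   : {low_step:.1f} mel")
print(f"7000 -> 7100 Hz : {high_step:.1f} mel")

# Band edges equally spaced in mel are increasingly far apart in Hz
edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 10))
print(np.round(edges_hz))
```

The same 100 Hz step covers far more mels at low frequency than at high frequency, which is why mel filters are narrow at the bottom of the spectrum and wide at the top.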

Let's create a mel filter bank with 16 mel bands (filters) using the function `librosa.filters.mel` and visualize them.

In [None]:
# Define the mel spectrogram display layout

mel_colorbar=dict(title='Weight')
mel_colorscale=[
                   [0, 'rgb(30,62,62)'],
                   [0.5, 'rgb(30,128,128)'],
                   [1, 'rgb(30,255,30)'],
               ]
mel_layout={
        'height': 500,
        'xaxis': {'title': 'Frequency (kHz)'},
        'yaxis': {'title': 'Mel Filters'},
        'title': 'Mel Filter Bank',
        'margin': dict(l=0, r=0, t=40, b=0, pad=0),
    }

In [None]:
# Number of mel bands to generate
n_mels = 16

# Create a mel filter bank.
mel_16 = librosa.filters.mel(sr=sample_rate, n_fft=n_fft, n_mels=n_mels)

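# Plot the mel filter bank (x axis: FFT bin frequencies in kHz, y axis: filter index)
fft_freqs_khz = librosa.fft_frequencies(sr=sample_rate, n_fft=n_fft) / 1000
fig_spectrum = go.Figure(go.Heatmap(z=mel_16, x=fft_freqs_khz, colorscale=mel_colorscale, colorbar=mel_colorbar, ygap=0.1, name='Mel Filter Bank',
               hovertemplate='Frequency: %{x:.2f} kHz<br>Mel filter: %{y}<br>Weight: %{z:.4f}<extra></extra>'), layout=mel_layout)
fig_spectrum.show()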


Next, convert the spectrogram to the mel scale with 128 filters. We will use the function `librosa.feature.melspectrogram`.

In [None]:
# Mel spectrogram
n_mels = 128
n_fft = 512

# Mel scale spectrogram
S = librosa.feature.melspectrogram(y=signal, sr=sample_rate, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)

# melspectrogram() already returns a power spectrogram (power=2.0 by default),
# so convert it to decibel (dB) units directly, without squaring it again
melspectrogram_DB = librosa.power_to_db(S, ref=np.max, top_db=100)

# plot the signal in frequency domain
fig_spectrum = go.Figure(go.Heatmap(z=melspectrogram_DB, colorscale=colorscale,colorbar=dict(ticksuffix=' dB'),name='Mel-Spectrogram',
               hovertemplate='Time: %{x:.2f} s<br>Frequency: %{y:.2f} kHz<br>Magnitude: %{z:.2f} dB<extra></extra>'),layout=frequency_layout)
fig_spectrum.show()

# Plot the linear-frequency spectrogram again for comparison
fig_spectrum = go.Figure(go.Heatmap(z=s_stft_db, colorscale=colorscale,colorbar=colorbar, name='Spectrogram',
               hovertemplate='Time: %{x:.2f} s<br>Frequency: %{y:.2f} kHz<br>Magnitude: %{z:.2f} dB<extra></extra>'),layout=frequency_layout)
fig_spectrum.show()

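Compared with the linear-frequency spectrogram, the mel spectrogram stretches the distance between low frequencies and compresses the high frequencies.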

## 4.1.4 Exercise: Reduce the Mel Filter Bands
Try to apply a mel scale to the previous spectrogram with only 16 mel bands. Convert to decibels and visualize the results. If you get stuck, refer to the [solution](solutions/ex4.1.4.ipynb).


In [None]:
# Mel spectrogram

n_fft = #FIXME
n_mels = #FIXME

# Mel scale spectrogram
melspectrogram = #FIXME

# Convert a power mel spectrogram (amplitude squared) to decibel (dB) units
melspectrogram_DB = #FIXME

# plot the signal in the frequency domain
fig_spectrum = go.Figure(go.Heatmap(z=melspectrogram_DB, colorscale=colorscale,colorbar=dict(ticksuffix=' dB'),name='Mel-Spectrogram',
               hovertemplate='Time: %{x:.2f} s<br>Frequency: %{y:.2f} kHz<br>Magnitude: %{z:.2f} dB<extra></extra>'),layout=frequency_layout)
fig_spectrum.show()


## 4.1.5 Go Further With Speech Representation (Cepstrum and MFC)
By applying a second (inverse) FT to the logarithm of the spectrum, we get the _cepstrum_. It represents the rate of change across the frequency bands.

Features derived from the cepstrum describe speech better than features taken directly from the frequency spectrum.

The _mel frequency cepstrum (MFC)_ applies a [cosine transform (CT)](https://en.wikipedia.org/wiki/Discrete_cosine_transform) instead of an FT to the log mel spectrogram. The CT is well suited for compression because it produces only real-valued coefficients. An MFC is therefore a cepstrum, that is, a spectrum of a log spectrum, computed on the mel scale.
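
The coefficients of the MFC (MFCCs) can be sketched with NumPy alone. The example below applies an unnormalized DCT-II along the mel axis of a made-up log mel spectrogram and keeps the lowest-order coefficients. The helper `dct_ii`, the random input, and the choice of 13 coefficients are assumptions for illustration; in practice a library routine such as `librosa.feature.mfcc` would be used, and library implementations typically apply a different normalization.

```python
import numpy as np

def dct_ii(x):
    """Unnormalized DCT-II along axis 0 of a real-valued array."""
    n = x.shape[0]
    k = np.arange(n)[:, None]            # output coefficient index
    m = np.arange(n)[None, :]            # input sample index
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return basis @ x


# Hypothetical log mel spectrogram: 40 mel bands x 5 frames
rng = np.random.default_rng(0)
toy_log_mel = rng.normal(size=(40, 5))

toy_mfcc = dct_ii(toy_log_mel)[:13]      # keep the 13 lowest-order coefficients
print("MFCC shape:", toy_mfcc.shape)
print("Real-valued:", np.isrealobj(toy_mfcc))
```

Keeping only the first coefficients retains the smooth envelope of the log mel spectrum and discards fine detail, which is the compression effect described above.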


---
# 4.2 Acoustic Model Architectures

In this lab, we will experiment with three acoustic models: [QuartzNet](https://arxiv.org/pdf/1910.10261.pdf), [Citrinet](https://arxiv.org/pdf/2104.01721.pdf), and Conformer-CTC.  Let's begin by exploring the architecture of each.

## 4.2.1 QuartzNet
_QuartzNet_ is a deep neural model for speech recognition developed by NVIDIA Research. The network is divided into:
- Encoder - learns a representation of the acoustic features
- Decoder - maps those features to the vocabulary (characters or phonemes).  

QuartzNet is a variant of the _NVIDIA Jasper_ model [(Just Another Speech Recognizer)](https://arxiv.org/pdf/1904.03288.pdf).  However, QuartzNet replaces Jasper's 1D convolutions with 1D time-channel separable convolutions, which use fewer parameters while keeping a similar accuracy. QuartzNet uses a non-autoregressive CTC-based (Connectionist Temporal Classification) decoding scheme, which means that it does not require manual alignment between the input and output pairs. Learn more about the [CTC loss](https://www.cs.toronto.edu/~graves/icml_2006.pdf).

<img src="images/asr/quartz_vertical.png">

## 4.2.2 Citrinet

_Citrinet_ is a variant of QuartzNet, developed by NVIDIA Research. Unlike QuartzNet, which predicts characters or phonemes, Citrinet uses subword encoding via WordPiece tokenization, which improves the accuracy of the transcripts.

<img src="images/asr/citrinet_vertical.png">

## 4.2.3 Conformer-CTC

The _Conformer_ model uses the combination of self-attention and convolution modules to achieve the best of the two approaches.  The self-attention layers can learn the global interaction while the convolutions efficiently capture the local correlations. The self-attention modules support both regular self-attention with absolute positional encoding, and also Transformer-XL’s self-attention with relative positional encodings. 

_Conformer-CTC_ has a similar encoder to the original Conformer but uses CTC loss and decoding instead of RNNT/Transducer loss, which makes it a non-autoregressive model. It also replaces the LSTM decoder with a linear decoder on top of the encoder.

Here is the overall architecture of the encoder of Conformer-CTC:

<img src="images/asr/conformer-encoder.png"> 

---
# 4.3 Acoustic Models with NeMo

In the NeMo library, ASR models are defined under the `nemo_asr.models.ASRModel` class.

To load an acoustic model, it is possible to restore the parameters from a local `.nemo` model. 

Alternatively, pretrained models can be loaded from the NVIDIA Repository, NGC, using the `from_pretrained(...)` method that downloads and initializes models directly from the cloud. To check the list of available pretrained models, please use the `list_available_models` method.


In [None]:
import nemo
import nemo.collections.asr as nemo_asr

def display_list_available_models(model):
    print("list of available models:")
    for m in model.list_available_models():
        # print the model name in bold blue, then reset the terminal color
        print("   ", "\033[1;34m" + m.pretrained_model_name + "\033[0m")


In [None]:
display_list_available_models(nemo_asr.models.ASRModel)

## 4.3.1 Load QuartzNet

Let's load a base English QuartzNet15x5 model, `stt_en_quartznet15x5`. The speech-to-text English QuartzNet model was trained on a combination of seven datasets of English speech, with a total of 7,057 hours of audio samples. Samples were limited to durations between 0.1 s and 16.7 s. The model was trained for 300 epochs with [Automatic Mixed Precision (AMP)](https://developer.nvidia.com/automatic-mixed-precision).
It achieves a Word Error Rate (WER) of 4.38% on the [LibriSpeech](https://www.openslr.org/12) dev-clean dataset, and a WER of 11.30% on the LibriSpeech dev-other dataset.


In [None]:
am_model_quartznet = nemo_asr.models.ASRModel.from_pretrained(model_name='stt_en_quartznet15x5', strict=False)

Let's now check its vocabulary. QuartzNet is a character-based ASR model. 

In [None]:
# Check vocabulary
am_model_quartznet.decoder.vocabulary

## 4.3.2 Load Citrinet

The English Citrinet-512 model has 36 million parameters. This model was trained on the ASR dataset with over 7000 hours of English speech. It uses the [SentencePiece](https://github.com/google/sentencepiece) tokenizer with a vocabulary size of 1024, and transcribes text into lower case English along with spaces, apostrophes, and a few other characters.

The WER achieved by the English Citrinet-512 model is 3.7% on LibriSpeech (test-clean) and 3.2% on the Wall Street Journal (WSJ) (Eval 92). More details can be found on the [NGC stt_en_citrinet_512 model card](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_citrinet_512).

Let's load the model, `stt_en_citrinet_512`:

In [None]:
am_model_citrinet = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_citrinet_512")

Take a look at Citrinet's vocabulary. Citrinet is a subword-based ASR model. This particular model can decode speech into 1024 subwords.

In [None]:
am_model_citrinet.decoder.vocabulary

## 4.3.3 Load Conformer-CTC

The base English Conformer-CTC Large model has around 120M parameters. This model was trained on about 24500 hours of English speech. It uses the [SentencePiece](https://github.com/google/sentencepiece) tokenizer with a vocabulary size of 128, and transcribes text in lower case English along with spaces, apostrophes, and a few other characters.
The WER achieved by English Conformer-CTC large model is 2.1% on LibriSpeech (test-clean) and 1.7% on the WSJ (Eval 92). More details can be found on the [NGC stt_en_conformer_ctc_large model card](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_ctc_large).

Let's load the model `stt_en_conformer_ctc_large`:


In [None]:
# This takes a few seconds
am_model_conformer = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_conformer_ctc_large")

Look at the Conformer-CTC vocabulary.  Similar to Citrinet, the Conformer-CTC model is a subword-based ASR model. This particular Conformer-CTC vocabulary size is 128.

In [None]:
am_model_conformer.decoder.vocabulary

## 4.3.4 ASR Inference with Greedy Decoder

If we have an entire audio clip available, we can perform offline inference with the models previously loaded (QuartzNet, Citrinet, and Conformer) to transcribe it in greedy mode. A _greedy decoder_ simply chooses the token with the highest probability at each time step to build the transcriptions. 
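
For CTC models, greedy decoding has two extra steps after the per-frame argmax: consecutive repeated tokens are merged, and the `blank` symbol is removed. The following is a minimal NumPy sketch of that rule on a made-up probability matrix; the function `ctc_greedy_decode`, the four-symbol vocabulary, and the position of the blank at the last index are assumptions for illustration, not NeMo's implementation.

```python
import numpy as np

def ctc_greedy_decode(log_probs, labels, blank_id):
    """Argmax per frame, merge consecutive repeats, then drop blanks."""
    best_ids = np.argmax(log_probs, axis=-1)
    decoded = []
    previous = None
    for idx in best_ids:
        if idx != previous and idx != blank_id:
            decoded.append(labels[idx])
        previous = idx
    return "".join(decoded)


toy_labels = [" ", "a", "c", "t"]         # hypothetical vocabulary
toy_blank = len(toy_labels)               # blank assumed to be the last index
# Intended frame-level best path: c c _ a a _ t t  (where _ is the blank)
toy_path = [2, 2, 4, 1, 1, 4, 3, 3]
toy_log_probs = np.log(np.full((len(toy_path), 5), 0.05))
toy_log_probs[np.arange(len(toy_path)), toy_path] = np.log(0.8)

toy_text = ctc_greedy_decode(toy_log_probs, toy_labels, toy_blank)
print(toy_text)
```

The blank is what allows genuine double letters: the path `a _ a` decodes to "aa", while `a a` collapses to a single "a".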
 
The easiest way to do this is to call the `transcribe(...)` method, which transcribes multiple files in a batch, applying a CTC greedy decoder to raw probability distributions over the alphabet characters from the ASR model.

In [None]:
files = [AUDIO_FILENAME]

In [None]:
# QuartzNet transcription 
transcript = am_model_quartznet.transcribe(paths2audio_files=files)[0]
print(f'QuartzNet Transcript: "{transcript}"')

In [None]:
# Citrinet transcription
transcript = am_model_citrinet.transcribe(paths2audio_files=files)[0]
print(f'Citrinet Transcript: "{transcript}"')

In [None]:
# Conformer-CTC transcription
files = [AUDIO_FILENAME]
transcript = am_model_conformer.transcribe(paths2audio_files=files)[0]
print(f'Conformer-CTC Transcript: "{transcript}"')

How do you find the performance of AMs in greedy mode for the:
- QuartzNet model?
- Citrinet model?
- Conformer-CTC model?

Discuss the performance with the instructor.

#### Visualize the Acoustic Model Outputs
To get log probabilities output by the Acoustic Model, we can set the argument `logprobs=True` when querying for transcriptions.

In [None]:
logits= am_model_conformer.transcribe(files, logprobs=True)
print(logits[0])

Let's now visualize the log probabilities output by AM models for each time step. 

First, run the next cell to define the display setup. Then run the following cells to visualize the probabilities per time step for QuartzNet, Citrinet, and Conformer-CTC. For the QuartzNet AM, you should see:

<img src="images/asr/log_prob_quartznet.PNG">


In [None]:
prob_layout={
            'height': 2000,
            'xaxis': {'title': 'Time, s'},
            'yaxis': {'title': 'Vocabulary'},
            'title': 'Vocabulary Probabilities',
            'margin': dict(l=0, r=0, t=40, b=0, pad=0),
        }
prob_colorscale=[[0, 'rgb(220,220,220)'], [1, 'rgb(255,0,0)'],]

# softmax implementation in NumPy
def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum(axis=-1).reshape([logits.shape[0], 1])

def display_logprobs(asr_model, files):
    # let's do inference once again but without decoder
    logits = asr_model.transcribe(files, logprobs=True)[0]
    probs = softmax(logits)

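    # 20 ms is the duration of one output time step for QuartzNet;
    # other models may use a different output stride, so their time axis is approximate
    time_stride = 0.02

    # get model's alphabet; CTC models emit one extra `blank` symbol
    labels = list(asr_model.decoder.vocabulary) + ['blank']
    # character-based vocabularies start with the space character
    if labels[0] == ' ':
        labels[0] = 'space'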

    # plot probability distribution over characters for each timestep
    fig_probs = go.Figure(
        go.Heatmap(z=probs.transpose(),colorscale=prob_colorscale, y=labels, dx=time_stride, name='Probs', ygap=1,xgap=0.1,
                   hovertemplate='Time: %{x:.2f} s<br>Character: %{y}<br>Probability: %{z:.2f}<extra></extra>' ),layout = prob_layout)
    fig_probs.show()


In [None]:
# Visualize the log probabilities per time step for QuartzNet
display_logprobs(am_model_quartznet, files)

In [None]:
# Visualize the log probabilities per time step for Citrinet
display_logprobs(am_model_citrinet, files)

In [None]:
# Visualize the log probabilities per time step for Conformer
display_logprobs(am_model_conformer, files)

It is easy to identify time steps for the `blank` character.

---
# 4.4 Transcript Decoders

Acoustic models output log probabilities across their vocabularies at each time step. Decoders are applied on top of this matrix to find the best transcription candidates.

## 4.4.1 Beam Search Decoder

While the greedy decoder selects the most probable token at each time step, the _beam search decoder_ keeps track of the several best sequences at each step. The _beam size_ is the fixed number of candidate sequences (beams) that are kept and expanded at each step. Beams are selected according to the highest sequence probabilities.

The probability of a sequence composed of three words $W_1 W_2 W_3$ is computed as follows: $P(W_1 W_2 W_3) = p(W_1) \cdot p(W_2 \mid W_1) \cdot p(W_3 \mid W_1 W_2)$
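
The beam mechanism itself can be sketched in a few lines of Python. The version below is deliberately simplified and is not the algorithm implemented by NeMo: it scores frame-level paths by summing log probabilities and does not merge paths that collapse to the same CTC text, as a CTC prefix beam search would. The function `toy_beam_search` and the probability matrix are assumptions for illustration.

```python
import numpy as np

def toy_beam_search(log_probs, beam_size):
    """Return the `beam_size` best frame-level paths as (token ids, log probability)."""
    beams = [((), 0.0)]
    for frame in log_probs:
        # Expand every kept path with every token, then keep only the best ones
        candidates = [
            (path + (token,), score + frame[token])
            for path, score in beams
            for token in range(len(frame))
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams


# Made-up probabilities: 3 time steps x 3 tokens
toy_probs = np.array([
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
])
toy_beams = toy_beam_search(np.log(toy_probs), beam_size=3)
for path, score in toy_beams:
    print(path, f"p = {np.exp(score):.3f}")
```

With a beam size of 1 this reduces to the greedy decoder; larger beams keep alternatives alive so that a later scoring step, such as a language model, can change the ranking.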

With NeMo, we can set up the beam search decoder using the `BeamSearchDecoderWithLM` module.
 
Run the next three cells to: 
- Set up a beam search decoder with beam size of 16
- Recompute the log probabilities of the previous audio sample using QuartzNet
- Run the beam search decoder using the `BeamSearchDecoderWithLM.forward()` method

In [None]:
# Define number of CPUs to use. Set to the max when processing large batches of log probabilities
# os.cpu_count() can return None, so fall back to 1 in that case
num_cpus=max(os.cpu_count() or 1, 1)

# Set the beam size
beam_size=16

# Get the vocabulary
vocab=list(am_model_quartznet.decoder.vocabulary)

# Beam search
beam_search = nemo_asr.modules.BeamSearchDecoderWithLM(beam_width=beam_size, lm_path=None, alpha=None, beta=None, vocab=vocab, num_cpus=num_cpus, input_tensor=False)

In [None]:
# Recompute the log probabilities of the previous audio sample using QuartzNet
files = [AUDIO_FILENAME]
logits = am_model_quartznet.transcribe(files, logprobs=True)[0]
probs = softmax(logits)

In [None]:
# Run the beam search decoder
best_sequences=beam_search.forward(log_probs = np.expand_dims(probs, axis=0), log_probs_length=None)
print( "Number of best sequences :", len(best_sequences[0]))
print( "Best sequences :")
best_sequences[0]

How do you find the performance of QuartzNet acoustic model with a beam search decoder? Discuss that with the instructor.

## 4.4.2 Beam Search Decoder with a Language Model

A vanilla beam search decoder allows us to catch some spelling errors, but it does not use any language modeling. A language model (LM) scores how likely a sentence is, based on the text corpus used to train the LM.

An LM can be used with the beam search decoder to produce more accurate candidates. The score calculation formula is:

$final\_score = acoustic\_score + beam\_alpha*lm\_score + beam\_beta*seq\_length$

where:
- `acoustic_score` is the score predicted by the acoustic encoder
- `lm_score` is the score predicted by the language model
- `beam_alpha` specifies the importance of the language model
- `beam_beta` is a penalty term according to the sequence length
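
To see how the formula changes a ranking, here is a small worked example in plain Python. The candidate texts and all scores are made-up values, and measuring `seq_length` in words is an assumption made for this sketch; the helper `fused_score` simply transcribes the formula above.

```python
def fused_score(acoustic_score, lm_score, seq_length, beam_alpha, beam_beta):
    """final_score = acoustic_score + beam_alpha * lm_score + beam_beta * seq_length"""
    return acoustic_score + beam_alpha * lm_score + beam_beta * seq_length


# Hypothetical candidates: (text, acoustic log score, LM log score)
toy_candidates = [
    ("in vidia", -4.0, -9.0),
    ("nvidia", -4.5, -5.0),
]

toy_best = {}
for toy_alpha in (0.0, 2.0):
    ranked = sorted(
        toy_candidates,
        key=lambda c: fused_score(c[1], c[2], len(c[0].split()), toy_alpha, 1.5),
        reverse=True,
    )
    toy_best[toy_alpha] = ranked[0][0]
    print(f"beam_alpha={toy_alpha}: best = {ranked[0][0]!r}")
```

With `beam_alpha=0` the acoustic score and the length term decide alone; raising `beam_alpha` lets the language model promote the candidate it finds more plausible.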


It is possible to use an external [KenLM](https://kheafield.com/code/kenlm/)-based n-gram language model to rescore multiple transcription candidates. 

In the next cells, we will:
- Download an n-gram language model from NGC. This is a simple 4-gram language model trained with Kneser-Ney smoothing using the KenLM library
- Load the n-gram language model with the `kenlm` library
- Score several candidate spellings of "nvidia"
- Run the beam search decoder with the n-gram LM on the QuartzNet acoustic model output

In [None]:
# Download the model from NGC 
!ngc registry model download-version "nvidia/tao/speechtotext_en_gb_lm:deployable_v1.0" \
    --dest dli_workspace

In [None]:
# Check the downloaded n-gram language models
! ls dli_workspace/speechtotext_en_gb_lm_vdeployable_v1.0/

In [None]:
# Install kenlm package
!pip install https://github.com/kpu/kenlm/archive/master.zip

In [None]:
import kenlm

# Load n-gram Language Model
# Relative path, matching the `--dest dli_workspace` download location above
EN_GB_LM='dli_workspace/speechtotext_en_gb_lm_vdeployable_v1.0/en_gb_comp_norm_2.1_3gram.bin'
model = kenlm.Model(EN_GB_LM)

In [None]:
# Score several candidate spellings of "nvidia" (higher, i.e. less negative, is more likely)
print("Language Model score for 'in vidia' is:", model.score('in vidia'))
print("Language Model score for 'an vidia' is:", model.score('an vidia'))
print("Language Model score for 'invidia' is:", model.score('invidia'))
print("Language Model score for 'anvidia' is:", model.score('anvidia'))
print("Language Model score for 'nvidia' is:", model.score('nvidia'))

Let's now instantiate the beam search decoder with a language model using the `BeamSearchDecoderWithLM` module by providing `lm_path`, `alpha=2`, and `beta=1.5`.

In [None]:
# Beam search with LM rescoring
beam_search_lm= nemo_asr.modules.BeamSearchDecoderWithLM(beam_width=beam_size, lm_path=EN_GB_LM, alpha=2, beta=1.5, vocab=vocab, num_cpus=num_cpus, input_tensor=False)

In [None]:
# Apply beam search with LM
hypothesis=beam_search_lm.forward(log_probs = np.expand_dims(probs, axis=0), log_probs_length=None)

# Get the best hypothesis
best_hypothesis=hypothesis[0][0]
print("Best hypothesis using the Beam Search decoder with a Language Model: ", best_hypothesis[1])
print("Best hypothesis score :", best_hypothesis[0] )

How do you find the performance when applying a beam search decoder with the n-gram LM?

---
# 4.5 Punctuation and Capitalization

So far, our Automatic Speech Recognition (ASR) pipeline has generated text without punctuation or capitalization.
In conversational AI applications, the ASR output could be used as input to Natural Language Understanding (NLU) modules. Punctuation and capitalization add more information for the NLU modules and could potentially boost their performance.

We will be using a BERT base uncased language model fine-tuned with two token classification heads for:
- Predicting a punctuation mark that should follow the word (if any). The model supports commas, periods, and question marks.
- Predicting whether the word should be capitalized or not.

Let's check available NeMo `PunctuationCapitalizationModel` models.


In [None]:
# load relevant libraries
import nemo
import nemo.collections.nlp as nemo_nlp

In [None]:
display_list_available_models(nemo_nlp.models.PunctuationCapitalizationModel)

Let's load the [punctuation_en_bert](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/punctuationcapitalization_english_bert) model.

In [None]:
# load a punctuation and capitalization NeMo model
model = nemo_nlp.models.PunctuationCapitalizationModel.from_pretrained(model_name="punctuation_en_bert")

Let's run the punctuation and capitalization model using the `add_punctuation_capitalization` function, which takes a list of uncapitalized, unpunctuated texts.

In [None]:
# Run punctuation and capitalization on some text
list_text=['really','how are you doing', 'great how about you']

list_text_PC=model.add_punctuation_capitalization(list_text)

In [None]:
# Show punctuation and capitalization results
print("Capitalization and Punctuation Results:")
results = "\n".join("      {} --> {}".format(x, y) for x, y in zip(list_text, list_text_PC))
print(results)

Now, let's go back to our transcription problem and run punctuation and capitalization on the best transcription hypothesis produced by the beam search decoder with an LM:

In [None]:
# Run punctuation and capitalization of our best transcription hypothesis
best_hypothesis_cp=model.add_punctuation_capitalization([best_hypothesis[1]])

In [None]:
print("Best transcription with capitalization and punctuation: ", best_hypothesis_cp[0])

How do you find the performance of the ASR pipeline built so far?

---
# 4.6 Inverse Text Normalization (ITN)

_Inverse text normalization (ITN)_ converts spoken transcriptions into their written format. For example, the ITN module converts the spoken transcription, "one hundred and twenty three dollars" into the written format, "$123".

State-of-the-art ITN uses grammar rules for several concepts such as dates, measures, time, telephone numbers, electronics, money, and so on. Learn more about NeMo ITN at [NeMo Inverse Text Normalization: From Development To Production 2021.](https://arxiv.org/abs/2104.05055)

In the cells below, we load ITN grammar rules for English and inverse normalize some text using the `inverse_normalizer_en.inverse_normalize()` method.

In [None]:
# create inverse text normalization instance
from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer
inverse_normalizer_en = InverseNormalizer(lang='en')

In [None]:
transcriptions_sample1  = "She was born on the sixth of september two thousand and one in London."

# Run ITN on example string input
transcriptions_normalized_sample1 = inverse_normalizer_en.inverse_normalize(transcriptions_sample1, verbose=False)

# Print the results
print("Reference transcription:", transcriptions_sample1)
print("Normalized transcription:", transcriptions_normalized_sample1)

It is also possible to output the details of the annotated text provided by the ITN model by setting the `verbose=True` argument. 
Below is an example output for the sentence "one hundred and twenty three dollars":
```
tokens { money { integer_part: "123" currency: "$" } }
```

Let's have a look at the classifications on our previous sample:  

In [None]:
transcriptions_normalized_sample1 = inverse_normalizer_en.inverse_normalize(transcriptions_sample1, verbose=True)

Let's try another example for ITN of phone number and email transcriptions. The English ITN is set for a standard American telephone number with ten digits.

In [None]:
transcriptions_sample2 = "My phone number is five five five five five five one two three four and my email is dana at nvidia dot com"

# Run ITN on example string input
transcriptions_normalized_sample2 = inverse_normalizer_en.inverse_normalize(transcriptions_sample2, verbose=False)

# Print the results
print("Reference transcription:", transcriptions_sample2)
print("Normalized transcription:", transcriptions_normalized_sample2)

## 4.6.1 ITN on Other Languages

In addition to the English model, NeMo ITN offers rules for German (de), Spanish (es), French (fr), Portuguese (pt), Russian (ru), and Vietnamese (vi). Let's try to inverse normalize a French sentence.

In [None]:
# create inverse text normalization instance for French
from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer
inverse_normalizer_fr = InverseNormalizer(lang='fr')

In [None]:
transcriptions = "Le coût total pour l'année deux mille vingt-deux est de quatre cent millions d'euro"

# run ITN on example string input
transcriptions_normalized = inverse_normalizer_fr.inverse_normalize(transcriptions, verbose=False)

# print the results
print("Reference transcription:", transcriptions)
print("Normalized transcription:", transcriptions_normalized)

It is possible to customize the ITN grammar rules. Learn more in the dedicated [NeMo tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/text_processing/WFST_Tutorial.ipynb).
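
When working with several languages, it can help to guard against unsupported language codes and to build each normalizer only once, since loading grammars can take a while. The sketch below shows one way to do that under stated assumptions: `get_inverse_normalizer` and `SUPPORTED_ITN_LANGS` are hypothetical names, the language list is simply the one given in this section, and the normalizer constructor is passed in as `factory` so the helper itself does not depend on NeMo.

```python
# Language codes listed in this section (English plus six others).
SUPPORTED_ITN_LANGS = ("en", "de", "es", "fr", "pt", "ru", "vi")

# Cache of already-built normalizers, keyed by language code.
_normalizers = {}

def get_inverse_normalizer(lang, factory):
    """Return a cached normalizer for `lang`, building it with `factory` on first use."""
    if lang not in SUPPORTED_ITN_LANGS:
        raise ValueError(f"no ITN rules listed for language code: {lang}")
    if lang not in _normalizers:
        _normalizers[lang] = factory(lang)
    return _normalizers[lang]

# In this notebook, the factory could be:
#   lambda lang: InverseNormalizer(lang=lang)
```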

---
# 4.7 Language Identification

[Speech classification](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/intro.html) refers to a set of tasks that automatically classify input utterances or audio segments into categories. Speech classification tasks include _speech command recognition_ (multi-class), _voice activity detection_ or _VAD_ (binary or multi-class), and _audio sentiment classification_ (typically multi-class). To learn more, see the [list of available models](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/results.html#speech-classification-models).

_Spoken language identification_ (Lang ID), also known as _spoken language recognition_, is the task of automatically recognizing the language of a spoken utterance. It is typically used as a preprocessing step for ASR to determine which language-specific ASR model should be activated.
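
The routing step described above can be sketched as a simple lookup from the detected language code to an ASR model. Everything in this sketch is illustrative: the model names are placeholders, and `select_asr_model` is a hypothetical helper, not a NeMo function.

```python
# Hypothetical mapping from Lang ID output to ASR model names (placeholders).
ASR_MODEL_BY_LANG = {
    "en": "my_english_asr_model",
    "es": "my_spanish_asr_model",
    "ru": "my_russian_asr_model",
}

def select_asr_model(lang_code, fallback_lang="en"):
    """Pick the ASR model for a detected language, falling back to a default."""
    return ASR_MODEL_BY_LANG.get(lang_code, ASR_MODEL_BY_LANG[fallback_lang])

print(select_asr_model("es"))  # my_spanish_asr_model
print(select_asr_model("fr"))  # my_english_asr_model (no French entry, so fallback)
```

Falling back to a default language is only one possible policy. Depending on the application, rejecting audio in an unsupported language may be the better choice.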

In [None]:
import nemo
import nemo.collections.asr as nemo_asr
import torch

In [None]:
display_list_available_models(nemo_asr.models.EncDecSpeakerLabelModel)

Let's load the language detector model [langid_ambernet](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/langid_ambernet). This model is based on the AmberNet architecture and was trained on 6628 hours of speech, with an average of 62 hours per language. It achieves a 5.22% error rate on the official evaluation set, which contains 1609 verified utterances in 33 languages.

In [None]:
langid_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(model_name="langid_ambernet")
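
As a quick sanity check on the figures quoted for `langid_ambernet`, the short calculation below derives two numbers that are implied but not stated: roughly how many languages the training data covers, and roughly how many evaluation utterances are misclassified. The inputs are the rounded values from the model description, so the results are approximations.

```python
total_hours = 6628        # training data, in hours
hours_per_language = 62   # average hours per language (rounded)
error_rate = 0.0522       # 5.22% error rate on the evaluation set
eval_utterances = 1609    # verified utterances in the evaluation set

approx_languages = round(total_hours / hours_per_language)
approx_errors = round(error_rate * eval_utterances)

print(approx_languages, approx_errors)  # 107 84
```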

In [None]:
AUDIO_SAMPLES = "/opt/nvidia-riva/tutorials/audio_samples"
!ls $AUDIO_SAMPLES

In [None]:
# Use get_label() to identify the language (available with NeMo 1.14.0 and later)
wav_file = 'en-US_sample.wav'
lang = langid_model.get_label(AUDIO_SAMPLES + '/' + wav_file)
print("The language code for '{}' is: {}".format(wav_file, lang))

In [None]:
# Spanish
wav_file = 'es-US_sample.wav'
lang = langid_model.get_label(AUDIO_SAMPLES + '/' + wav_file)
print("The language code for '{}' is: {}".format(wav_file, lang))

In [None]:
# Russian
wav_file = 'ru-RU_sample.wav'
lang = langid_model.get_label(AUDIO_SAMPLES + '/' + wav_file)
print("The language code for '{}' is: {}".format(wav_file, lang))
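
The three cells above repeat the same call for each file. A small loop can label every `.wav` file in a folder in one pass. In this sketch, `identify_languages` is a hypothetical helper, and the labeling function is passed in as an argument so that the helper can also be tried without a loaded model.

```python
import os

def identify_languages(audio_dir, get_label, extension=".wav"):
    """Map each matching file name in `audio_dir` to the label returned by `get_label`."""
    results = {}
    for name in sorted(os.listdir(audio_dir)):
        if name.endswith(extension):
            results[name] = get_label(os.path.join(audio_dir, name))
    return results

# In this notebook, the call could look like:
#   identify_languages(AUDIO_SAMPLES, langid_model.get_label)
```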

---
# 4.8 (Optional) Create Your Own Audio Samples

You can upload your own audio samples to test ASR performance.
The files should be in `.wav` format, resampled to 16 kHz.
Here is a [torchaudio](https://pytorch.org/audio/stable/index.html)-based example for `.wav` file resampling.

```
import torchaudio

input_wav_file = '/path/to/my_audio.wav'
output_wav_file = '/path/to/my_audio_resampled.wav'

# y has shape (channels, samples)
y, sr = torchaudio.load(input_wav_file)
# average all channels into one mono channel, keeping the channel dimension
y = y.mean(dim=0, keepdim=True)
if sr != 16000:
    resampler = torchaudio.transforms.Resample(sr, 16000)
    y = resampler(y)
# always save, so the mono output also exists when the input is already 16 kHz
torchaudio.save(output_wav_file, y, 16000)
```
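
Before using an uploaded file, it can be worth confirming that it really is mono and sampled at 16 kHz. The sketch below reads the channel count and sample rate straight from the WAV header using only the standard library. `read_wav_format` and `is_asr_ready` are hypothetical helper names, and the reader assumes a standard RIFF/WAVE file with a `fmt ` chunk.

```python
import struct

def read_wav_format(path):
    """Return (channels, sample_rate) read from a RIFF/WAVE file header."""
    with open(path, "rb") as f:
        header = f.read(12)
        if len(header) < 12:
            raise ValueError("file too short to be a WAV file")
        riff, _, wave_id = struct.unpack("<4sI4s", header)
        if riff != b"RIFF" or wave_id != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        # Walk the chunks until the format chunk is found.
        while True:
            chunk_header = f.read(8)
            if len(chunk_header) < 8:
                raise ValueError("no 'fmt ' chunk found")
            chunk_id, chunk_size = struct.unpack("<4sI", chunk_header)
            if chunk_id == b"fmt ":
                # Format tag (ignored), channel count, sample rate.
                _, channels, sample_rate = struct.unpack("<HHI", f.read(8))
                return channels, sample_rate
            # Skip this chunk; chunks are padded to an even number of bytes.
            f.seek(chunk_size + (chunk_size & 1), 1)

def is_asr_ready(path, expected_rate=16000):
    """True if the file is mono and sampled at `expected_rate` Hz."""
    channels, sample_rate = read_wav_format(path)
    return channels == 1 and sample_rate == expected_rate

# Example call (path is a placeholder):
#   is_asr_ready('/path/to/my_audio_resampled.wav')
```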

---
# 4.9 Shut Down the Kernel
<h3 style="color:red;">Important!</h3>

From the menu above, choose ***Kernel->Shut Down Kernel*** to fully clear GPU memory before moving on.

---
<h2 style="color:green;">Congratulations!</h2>

In this notebook, you have:
- Gained an understanding of the ASR pipeline and the various acoustic models
- Used transcript decoders to select the best transcriptions
- Used ITN, capitalization, and punctuation to improve written transcriptions
- Used a language identification model to identify what language was spoken

Next, you'll deploy the model on Riva. Move on to [Deployment with Riva](005_ASR_Deployment.ipynb).
