# **NLP Summer School: Notebook in ASR**

This notebook walks you through an offline **end-to-end attention-based speech recognizer** implemented with Speechbrain.

For simplicity, we do not train any model; instead, we use the models (AM/LM/Tokenizer) available from the Hugging Face Hub. The models were trained on an open-source dataset called [LibriSpeech](https://www.openslr.org/12/), with 960 hours of training data.

**PLEASE! CREATE YOUR ENVIRONMENT/DOWNLOAD THE REQUIRED LIBRARIES FIRST** (It takes some time: torch, speechbrain and the additional libraries alone require more than 3 GB)

You can run the following commands and follow the instructions there:
+ **Github Tutorial**: `!git clone https://github.com/maelfabien/NLP_Summer_School-2021_Speech_Tutorial`
+ **Github Demo** (not reviewed here): `!git clone https://github.com/maelfabien/NLP_Summer_School-2021_Speech_Demo`

In this tutorial, we will refer to the code in ```NLP_Summer_School-2021_Speech_Tutorial/ASR/LibriSpeech/{ASR,LM,Tokenizer}```. 
For a more detailed walkthrough of training ASR from scratch, see this Colab notebook: [Colab Notebook - Train from Scratch](https://colab.research.google.com/drive/1aFgzrUv3udM_gNJNUoLaHIm78QHtxdIz?usp=sharing)

In [3]:
# Let's first import some libraries
import time
import speechbrain as sb
from speechbrain.pretrained import EncoderDecoderASR

# Audio recording widget, audio loading and playback
from ipywebrtc import AudioRecorder, CameraStream
import torchaudio
from IPython.display import Audio

## **Modules of my ASR system**

An ASR system is normally composed of:
+ **Tokenizer** --> 5000 units, in this case
+ **Language Model** --> Transformer
+ **Acoustic Model** --> Encoder-Decoder
+ **Decoding technique** --> Beam Search
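
To make the last item more concrete, here is a minimal sketch of beam search on a toy problem. The tokens and probabilities in `STEP_LOG_PROBS` are invented for illustration, and the toy model ignores the decoding history, unlike a real attention decoder. This is not the Speechbrain implementation, only the general idea: at every step, keep the `beam_size` best partial hypotheses.

```python
import math

# Hypothetical next-token log-probabilities for two decoding steps.
# A real decoder would condition these on the audio and on the history.
STEP_LOG_PROBS = [
    {"WE": math.log(0.5), "WEEK": math.log(0.4), "WHEN": math.log(0.1)},
    {"CAN": math.log(0.5), "END": math.log(0.4), "AND": math.log(0.1)},
]


def beam_search(step_log_probs, beam_size):
    """Return the surviving (tokens, score) beams and the number of scored candidates."""
    beams = [([], 0.0)]  # start from a single empty hypothesis
    expansions = 0
    for log_probs in step_log_probs:
        candidates = []
        for tokens, score in beams:
            for token, log_prob in log_probs.items():
                candidates.append((tokens + [token], score + log_prob))
                expansions += 1
        # Keep only the beam_size best-scoring hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams, expansions


beams_1, cost_1 = beam_search(STEP_LOG_PROBS, beam_size=1)
beams_3, cost_3 = beam_search(STEP_LOG_PROBS, beam_size=3)
print(f"beam_size=1: best={beams_1[0][0]}, scored candidates={cost_1}")
print(f"beam_size=3: best={beams_3[0][0]}, scored candidates={cost_3}")
```

Keeping more hypotheses per step means more candidates to score, so the beam size trades search quality against decoding time.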

Let's fetch the model and print some useful information about it:

In [5]:
# Creating (fetch or use the pre-trained models) the ASR object:

# Source = Source folder | hparams_file = hyperparameters file in 'Source'
asr_model = EncoderDecoderASR.from_hparams(source="pretrained_models", hparams_file='hyperparams.yaml')

### **Tokenizer**


Let's check the tokenizer and what it does

In [6]:
# Modify the following phrase to check how the Tokenizer works:

phrase = 'THIS IS THE MEXICAN NLP SUMMER SCHOOL, THANK YOU FOR ATTENDING IT!!!'
tokenize_as_pieces = asr_model.tokenizer.encode(phrase, out_type=str)
tokenize_as_ids = asr_model.tokenizer.encode_as_ids(phrase)

print(f"Encoded as pieces: {tokenize_as_pieces}")
print("Encoded as ids: {}".format(tokenize_as_ids))

Encoded as pieces: ['▁THIS', '▁IS', '▁THE', '▁ME', 'X', 'IC', 'AN', '▁', 'N', 'L', 'P', '▁SUMMER', '▁SCHOOL', ',', '▁THANK', '▁YOU', '▁FOR', '▁ATTEND', 'ING', '▁IT', '!!!']
Encoded as ids: [44, 33, 3, 48, 477, 198, 187, 78, 36, 134, 102, 1321, 761, 0, 868, 24, 25, 1465, 13, 17, 0]
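
Note that `,` and `!!!` both map to id `0` in the output above, which most likely means they are not in the vocabulary and fall back to the unknown token. The sketch below illustrates the pieces-versus-ids idea with a toy tokenizer. It uses greedy longest-match instead of the actual SentencePiece algorithm, and `TOY_VOCAB` with its ids is made up for illustration.

```python
SPACE = "▁"   # word-boundary marker, as in the pieces printed above
UNK_ID = 0    # assumed id for pieces that are not in the vocabulary

# Hypothetical vocabulary: piece -> id
TOY_VOCAB = {"<unk>": 0, "▁THIS": 1, "▁IS": 2, "▁": 3,
             "N": 4, "L": 5, "P": 6, "▁SCHOOL": 7}


def encode_as_pieces(text, vocab):
    """Greedy longest-match segmentation (a simplification, not SentencePiece)."""
    text = SPACE + text.replace(" ", SPACE)
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that starts at position i
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched: emit the single character as an unknown piece
            pieces.append(text[i])
            i += 1
    return pieces


def encode_as_ids(text, vocab):
    """Map each piece to its id, falling back to UNK_ID."""
    return [vocab.get(piece, UNK_ID) for piece in encode_as_pieces(text, vocab)]


toy_phrase = "THIS IS NLP SCHOOL!"
print(encode_as_pieces(toy_phrase, TOY_VOCAB))
print(encode_as_ids(toy_phrase, TOY_VOCAB))
```

As in the real output, a word that is not in the vocabulary as a whole (`NLP`) is split into smaller pieces, and punctuation ends up as the unknown id.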


In [7]:
# do you want to know the size of the Tokenizer? 
print("The number of different units in your Tokenizer is: {} units".format(asr_model.tokenizer.vocab_size()))

The number of different units in your Tokenizer is: 5000 units


In [8]:
# Nevertheless, the same phrase can be segmented in several different ways:
for n in range(3):
    print("Version {}: {}".format(n,asr_model.tokenizer.encode(phrase, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)))

Version 0: ['▁THIS', '▁I', 'S', '▁THE', '▁', 'ME', 'X', 'IC', 'AN', '▁', 'N', 'L', 'P', '▁S', 'U', 'M', 'ME', 'R', '▁SCHOOL', ',', '▁THAN', 'K', '▁', 'Y', 'O', 'U', '▁F', 'O', 'R', '▁', 'A', 'T', 'T', 'EN', 'D', 'ING', '▁I', 'T', '!!!']
Version 1: ['▁THIS', '▁I', 'S', '▁', 'T', 'HE', '▁M', 'E', 'X', 'IC', 'A', 'N', '▁', 'N', 'L', 'P', '▁SUMMER', '▁SCHOOL', ',', '▁T', 'HA', 'N', 'K', '▁YOU', '▁FOR', '▁A', 'T', 'TEN', 'D', 'ING', '▁I', 'T', '!!!']
Version 2: ['▁T', 'HI', 'S', '▁IS', '▁THE', '▁', 'M', 'EX', 'IC', 'AN', '▁', 'N', 'L', 'P', '▁S', 'UM', 'M', 'E', 'R', '▁', 'S', 'CH', 'O', 'OL', ',', '▁T', 'HA', 'N', 'K', '▁YOU', '▁', 'F', 'OR', '▁AT', 'TEN', 'D', 'ING', '▁', 'I', 'T', '!!!']
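
The different versions exist because a subword vocabulary usually allows more than one valid split of the same word. The sketch below enumerates every split of `▁THANK` under a small made-up vocabulary (`TOY_VOCAB` is an assumption) and picks one uniformly at random. The real tokenizer samples according to its own piece probabilities, controlled by `alpha` and `nbest_size`, which this toy version does not model.

```python
import random


def all_segmentations(word, vocab):
    """Return every way to split `word` into pieces that all belong to `vocab`."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in vocab:
            for tail in all_segmentations(word[end:], vocab):
                results.append([head] + tail)
    return results


# Hypothetical vocabulary, chosen to reproduce the splits of THANK seen above
TOY_VOCAB = {"▁THANK", "▁THAN", "▁T", "HA", "N", "K"}

segmentations = all_segmentations("▁THANK", TOY_VOCAB)
for seg in segmentations:
    print(seg)

# Uniform sampling, unlike the probability-weighted sampling of the real tokenizer
rng = random.Random(0)
print("Sampled:", rng.choice(segmentations))
```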


### **Language Model and Acoustic Model**

Let's check the Language and Acoustic model and some generalities:

In [9]:
print("Our system has the following modules: {}".format(asr_model.modules.keys()))

Our system has the following modules: odict_keys(['compute_features', 'pre_transformer', 'transformer', 'asr_model', 'normalize', 'lm_model', 'encoder', 'decoder'])


In [10]:
# Some key details about the LM

dim_embedding = asr_model.modules.lm_model.__dict__['d_embedding']
num_encoder_layers = asr_model.modules.lm_model.__dict__['num_encoder_layers']
params_lm = int(sum(p.numel() for p in asr_model.modules.lm_model.parameters()))

print(f"The Language Model has an embedding dimension of: {dim_embedding}")
print(f"The Language Model has {num_encoder_layers} encoder layers")
print(f"The Language Model has {int(params_lm/1e6)}M parameters")

The Language Model has an embedding dimension of: 768
The Language Model has 12 encoder layers
The Language Model has 93M parameters
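
The parameter count above relies on `p.numel()`, which returns the number of elements of a tensor, that is, the product of its dimensions. The sketch below reproduces that arithmetic in plain Python. The layer names and shapes in `toy_shapes` are hypothetical (only the 5000 units and the 768 embedding dimension come from this notebook) and do not describe the real LM.

```python
from math import prod

# Hypothetical parameter shapes of a tiny model
toy_shapes = {
    "embedding.weight": (5000, 768),  # vocabulary size x embedding dimension
    "linear.weight": (768, 768),
    "linear.bias": (768,),
}


def count_parameters(shapes):
    """Sum the number of elements of every parameter (product of its dimensions)."""
    return sum(prod(shape) for shape in shapes.values())


total = count_parameters(toy_shapes)
# int(total / 1e6) truncates, as in the cell above
print(f"{total} parameters, reported as {int(total / 1e6)}M")
```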


In [12]:
# Some key details about the acoustic model

asr_model.modules.transformer.__dict__['_parameters']
params_asr_model = int(sum(p.numel() for p in asr_model.modules.asr_model.parameters()))
params_encoder = int(sum(p.numel() for p in asr_model.modules.encoder.parameters()))
params_decoder = int(sum(p.numel() for p in asr_model.modules.decoder.parameters()))

print(f"The Encoder has {int(params_encoder/1e6)}M parameters")
print(f"The ASR model has {int(params_asr_model/1e6)}M parameters")
print(f"The remaining {int(params_asr_model/1e6) - int(params_encoder/1e6)}M parameters are in the output/normalization/FrontEnd layers")
print(f"\n---Decoder = ASR model + LM model---")
print(f"The Decoder has {int(params_decoder/1e6)}M parameters")

The Encoder has 153M parameters
The ASR model has 161M parameters
The remaining 8M parameters are in the output/normalization/FrontEnd layers

---Decoder = ASR model + LM model---
The Decoder has 254M parameters


### **Getting Ready...**

Now that you know some basics about your ASR modules, let's do some inference and test the system

- First, we either make a speech recording or use an existing file (English speech in wav format only)
- Then, we create a 'wav' file that contains the speech recording used during inference

In [13]:
# Create and Activate the recording widget

camera_on = True
if camera_on:
    print("You should be able to see a recording button just below, click on it to start, and again to stop the recording \n")
    camera = CameraStream(constraints={'audio': True,'video':False})
    recorder = AudioRecorder(stream=camera)
else:
    recorder = "Using the example wav file from pretrained_models\n"
recorder

You should be able to see a recording button just below, click on it to start, and again to stop the recording 



AudioRecorder(audio=Audio(value=b'', format='webm'), stream=CameraStream(constraints={'audio': True, 'video': …

In [14]:
if camera_on:
    with open('recording.webm', 'wb') as f:
        f.write(recorder.audio.value)
    !ffmpeg -i recording.webm -ac 1 -ar 16000 -f wav file.wav -y -hide_banner -loglevel panic
    audio = "file.wav"
else:
    audio = "pretrained_models/example.wav"

sig, sr = torchaudio.load(audio)

print(sig.shape)
Audio(data=sig, rate=sr)

torch.Size([1, 76800])
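
The `ffmpeg` call above converts the recording to a mono, 16 kHz wav file, and the printed shape shows one channel with 76800 samples. If you want to double-check a wav file without loading it into torch, a minimal sketch with the standard-library `wave` module could look like this. The helper name `wav_info` and the generated silent file are assumptions for illustration.

```python
import os
import tempfile
import wave


def wav_info(path):
    """Return (channels, sample_rate, duration_in_seconds) of a PCM wav file."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    return channels, rate, duration


# Write 76800 frames of 16-bit silence (mono, 16 kHz) to have something to inspect
tmp_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(tmp_path, "wb") as wf:
    wf.setnchannels(1)      # mono, like `-ac 1`
    wf.setsampwidth(2)      # 16-bit samples
    wf.setframerate(16000)  # 16 kHz, like `-ar 16000`
    wf.writeframes(b"\x00\x00" * 76800)

info = wav_info(tmp_path)
print(info)
```

On your own recording, you would call `wav_info("file.wav")` instead.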


In [15]:
# Let's transcribe the recorded file or the pre-downloaded file

print(f"The ASR system is generating the transcript of your recording (recording length: {len(sig[0])/sr} seconds):\n")
start_time = time.time()
transcript = asr_model.transcribe_file(audio)
end_time = time.time()
transcript_time = int(end_time-start_time)

The ASR system is generating the transcript of your recording (recording length: 4.8 seconds):



In [16]:
print(f"Inference with a LM ({transcript_time} seconds): {transcript}")

Inference with a LM (30 seconds): HELLO I HAVE A REAL NICE WEEK END TO DAY WITH MY FRIENDS


### **What about avoiding a Language Model?...**

We can quickly perform **inference** (speech recognition) without a language model:

- First, modify the hyperparams file (the LM-related entries are commented out)
- Then, load the new model and perform inference!

In [18]:
# Loading the model in your machine and performing speech recognition

asr_model_noLM = EncoderDecoderASR.from_hparams(source="pretrained_models", hparams_file='hyperparams_noLM.yaml')

start_time = time.time()
transcript_2 = asr_model_noLM.transcribe_file(audio)
end_time = time.time()
transcript_2_time = int(end_time-start_time)

In [19]:
# Some nice and useful information :)
print(f"Inference without a Language Model ({transcript_2_time} seconds): {transcript_2}")
print(f"Inference with a Language Model ({transcript_time} seconds):    {transcript}")

Inference without a Language Model (22 seconds): HELLO I HAVE A REAL NICE WE CAN TO DAY WITH MY FRIENDS
Inference with a Language Model (30 seconds):    HELLO I HAVE A REAL NICE WEEK END TO DAY WITH MY FRIENDS
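
To quantify how different the two transcripts are, we can compute a word-level edit distance between them, as sketched below. Since we have no reference transcript for this recording, the result is a disagreement rate between the two systems, not a true word error rate.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word sequences (substitution/insertion/deletion)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


with_lm = "HELLO I HAVE A REAL NICE WEEK END TO DAY WITH MY FRIENDS".split()
without_lm = "HELLO I HAVE A REAL NICE WE CAN TO DAY WITH MY FRIENDS".split()

distance = word_edit_distance(with_lm, without_lm)
rate = distance / len(with_lm)
print(f"{distance} differing words out of {len(with_lm)} ({rate:.1%})")
```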


**But...**

Pretty similar results, right? 
Normally, an **LM increases the performance** of the whole recognizer, because it models longer sequences (words rather than acoustic features).
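
One common way to combine the two models during beam search is to add the LM log-probability, scaled by a weight, to the acoustic model's log-probability (often called shallow fusion). The sketch below illustrates the idea with two candidate continuations. The probabilities and the weight `0.6` are invented for illustration and are not taken from the pretrained models.

```python
import math


def fused_score(am_log_prob, lm_log_prob, lm_weight):
    """Acoustic score plus weighted language-model score."""
    return am_log_prob + lm_weight * lm_log_prob


# Hypothetical scores: the acoustic model slightly prefers "WE CAN",
# while the language model finds "WEEK END" much more plausible in context.
candidates = {
    "WE CAN": {"am": math.log(0.55), "lm": math.log(0.10)},
    "WEEK END": {"am": math.log(0.45), "lm": math.log(0.60)},
}


def best_candidate(candidates, lm_weight):
    """Pick the candidate with the highest fused score."""
    return max(
        candidates,
        key=lambda c: fused_score(candidates[c]["am"], candidates[c]["lm"], lm_weight),
    )


print("Without LM (lm_weight=0.0):", best_candidate(candidates, lm_weight=0.0))
print("With LM    (lm_weight=0.6):", best_candidate(candidates, lm_weight=0.6))
```

Setting the weight to zero is equivalent to ignoring the LM, which is the effect of disabling it in the hyperparameters.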

Below are the differences (only a few) between the two hyperparameter files:

In [20]:
!diff pretrained_models/hyperparams_noLM.yaml pretrained_models/hyperparams.yaml

76d75
< # Decoder - commented
86,87c85,86
<     # lm_weight: !ref <lm_weight>
<     # lm_modules: !ref <lm_model>
---
>     lm_weight: !ref <lm_weight>
>     lm_modules: !ref <lm_model>
89c88
<     # temperature_lm: 1.15
---
>     temperature_lm: 1.15
136d134
< # Modules - commented
143c141
<    # lm_model: !ref <lm_model>
---
>    lm_model: !ref <lm_model>
149d146
< # Pretrainer - commented
153c150
<       # lm: !ref <lm_model>
---
>       lm: !ref <lm_model>


### **What about a faster system?...**

Let's do speech recognition with a faster system that still includes a **Language Model!!**


In [21]:
# Loading the model in your machine and performing speech recognition

asr_model_FastDecoding = EncoderDecoderASR.from_hparams(source="pretrained_models", hparams_file='hyperparams_FastDecoding.yaml')

start_time = time.time()
transcript_3 = asr_model_FastDecoding.transcribe_file(audio)
end_time = time.time()
transcript_3_time = int(end_time-start_time)


In [24]:
# Some nice and useful information :)

print(f"Inference without a Language Model ({transcript_2_time} seconds): {transcript_2}")
print(f"Inference with a Language Model ({transcript_time} seconds):    {transcript}")
print(f"\nInference with a Language Model + Faster Decoding ({transcript_3_time} seconds):    {transcript_3}")

Inference without a Language Model (22 seconds): HELLO I HAVE A REAL NICE WE CAN TO DAY WITH MY FRIENDS
Inference with a Language Model (30 seconds):    HELLO I HAVE A REAL NICE WEEK END TO DAY WITH MY FRIENDS

Inference with a Language Model + Faster Decoding (5 seconds):    HELLO I HAVE A REAL NICE WE CAN TO DAY WITH MY FRIENDS
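
A common way to put such timings in perspective is the real-time factor (RTF): processing time divided by audio duration, where values below 1 mean faster than real time. The sketch below computes it from the numbers printed in this notebook (76800 samples at 16 kHz, and the three rounded timings). Your own timings will differ depending on the machine.

```python
SAMPLE_RATE = 16000   # Hz, set by the ffmpeg conversion
NUM_SAMPLES = 76800   # from the printed tensor shape
audio_seconds = NUM_SAMPLES / SAMPLE_RATE

# Timings (in whole seconds) reported by the cells above
timings = {
    "without LM": 22,
    "with LM": 30,
    "with LM + faster decoding": 5,
}


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration."""
    return processing_seconds / audio_seconds


rtf = {name: real_time_factor(t, audio_seconds) for name, t in timings.items()}
for name, value in rtf.items():
    print(f"{name}: RTF = {value:.2f}")

speedup = timings["with LM"] / timings["with LM + faster decoding"]
print(f"Speed-up of the faster system over the first one: {speedup:.1f}x")
```

Note that in this run the faster system returns the same transcript as the system without a Language Model (`WE CAN` instead of `WEEK END`), so the speed-up is not necessarily free.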


### **Why so fast?...**

About 6 times faster than the first model (5 vs 30 seconds), while still employing a Language Model.

We achieved this by changing one, **AND** only one, **hyper-parameter**! And it is...

In [23]:
!diff pretrained_models/hyperparams.yaml pretrained_models/hyperparams_FastDecoding.yaml 

41c41
< test_beam_size: 66
---
> test_beam_size: 10


## **Conclusion**

In this short tutorial, we tested an ASR system composed of a Tokenizer, a Language Model and an Acoustic Model with Speechbrain. 

Additionally, we learned how to use the `EncoderDecoderASR` interface, which allows you to perform speech recognition with fewer than 4 lines of code! These models are almost state-of-the-art! 

Special thanks to the [Speechbrain](https://github.com/speechbrain/speechbrain) team and [Huggingface](https://github.com/huggingface/transformers)!

Here are some recipes developed in Speechbrain that you can try out easily.

- [LibriSpeech recipes](https://github.com/speechbrain/speechbrain/tree/develop/recipes/LibriSpeech)
- [CommonVoice](https://github.com/speechbrain/speechbrain/tree/develop/recipes/CommonVoice)
- [AISHELL-1](https://github.com/speechbrain/speechbrain/tree/develop/recipes/AISHELL-1)
- [TIMIT](https://github.com/speechbrain/speechbrain/tree/develop/recipes/TIMIT)

## Related Tutorials

These are some related tutorials if you want to further explore the ASR field:

0. [ASRfromScratch](https://colab.research.google.com/drive/1aFgzrUv3udM_gNJNUoLaHIm78QHtxdIz?usp=sharing)
1. [YAML hyperparameter specification](https://colab.research.google.com/drive/1Pg9by4b6-8QD2iC0U7Ic3Vxq4GEwEdDz?usp=sharing)
2. [Brain Class](https://colab.research.google.com/drive/1fdqTk4CTXNcrcSVFvaOKzRfLmj4fJfwa?usp=sharing)
3. [Checkpointing](https://colab.research.google.com/drive/1VH7U0oP3CZsUNtChJT2ewbV_q1QX8xre?usp=sharing)
4. [Data-io](https://colab.research.google.com/drive/1AiVJZhZKwEI4nFGANKXEe-ffZFfvXKwH?usp=sharing)
5. [Tokenizer](https://colab.research.google.com/drive/12yE3myHSH-eUxzNM0-FLtEOhzdQoLYWe?usp=sharing)
6. [Speech Features](https://colab.research.google.com/drive/1CI72Xyay80mmmagfLaIIeRoDgswWHT_g?usp=sharing)
7. [Speech Augmentation](https://colab.research.google.com/drive/1JJc4tBhHNXRSDM2xbQ3Z0jdDQUw4S5lr?usp=sharing)
8. [Environmental Corruption](https://colab.research.google.com/drive/1mAimqZndq0BwQj63VcDTr6_uCMC6i6Un?usp=sharing)
9. [MultiGPU Training](https://colab.research.google.com/drive/13pBUacPiotw1IvyffvGZ-HrtBr9T6l15?usp=sharing)
10. [Pretrain and Fine-tune](https://colab.research.google.com/drive/1LN7R3U3xneDgDRK2gC5MzGkLysCWxuC3?usp=sharing)




