# **Notebook to uncover the key parts of ASR**

This tutorial walks you through all the modules needed to implement an offline **end-to-end attention-based speech recognizer** with SpeechBrain.

For simplicity, we do not train any model; instead we use the models (AM/LM/Tokenizer) available from the Hugging Face Hub. These models were trained on an open-source dataset called [librispeech](https://www.openslr.org/12/), which provides 960 hours of training data.

In this tutorial, we will refer to the code in ```NLP_Summer_School-2021_Speech_Tutorial/ASR/LibriSpeech/{ASR,LM,Tokenizer}```. 
For more detail, you can follow this Colab notebook on training ASR from scratch: [Colab Notebook - Train from Scratch](https://colab.research.google.com/drive/1aFgzrUv3udM_gNJNUoLaHIm78QHtxdIz?usp=sharing)

## **Which modules are we covering today?**

To train the LM and AM yourself, you would need to prepare the LibriSpeech folder and download all the required material. Training can take days on a cluster with several GPUs.

0. **Data preparation**.
For this tutorial we won't need any data preparation step, because we will take advantage of the SpeechBrain class `EncoderDecoderASR`, which can run inference on a single `wav` file.

1. **Tokenizer**.
The tokenizer decides which basic units are allocated during ASR training/inference (e.g., characters, phonemes, sub-words, words).

```
cd NLP_Summer_School-2021_Speech_Tutorial/ASR/LibriSpeech/Tokenizer
python train.py tokenizer.yaml
```

2. **The language model**.
After that, the language model can be trained (it is only used during inference). In this example, however, we don't train it; we download a pre-trained version instead.

We need an additional Python (Huggingface) library: `datasets`
```
pip install datasets
cd NLP_Summer_School-2021_Speech_Tutorial/ASR/LibriSpeech/LM
python train.py hparams/transformer.yaml
```

3. **Automatic speech recognizer - Speech-to-text system**.
At this point, we are ready to train our speech recognizer. In this tutorial, we will use the CRDNN model with an autoregressive GRU decoder. An attention mechanism is employed between the encoder and the decoder. The final sequence of words is retrieved with beam search coupled with the Transformer LM fetched in the previous step:
```
cd NLP_Summer_School-2021_Speech_Tutorial/ASR/LibriSpeech/ASR/transformer
python train.py hparams/transformer.yaml
```

4. **Use the speech recognizer (inference)**:
After training, we can use the speech recognizer for inference. We will use the `EncoderDecoderASR` class available in SpeechBrain to run inference.

(Most of this tutorial is based on the [ASRfromScratch](https://colab.research.google.com/drive/1aFgzrUv3udM_gNJNUoLaHIm78QHtxdIz?usp=sharing) Google Colab! Thanks!)

We will go through each of these steps in turn.

## **Step 0: Prepare your data** 

**!! You don't need to do anything here for the NLP summer school speech Tutorial. In case you'd like to continue training your own ASR engine, you could follow the notebooks' links at the end of this one.**

The goal of data preparation is to create the data manifest files. 
These files tell SpeechBrain where to find the audio data and their corresponding transcriptions. They are text files written in the popular CSV and JSON formats.

### **Data manifest files**
Let's look at what a data manifest file in JSON format looks like:


```json
{
  "1867-154075-0032": {
    "wav": "{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0032.flac",
    "length": 16.09,
    "words": "AND HE BRUSHED A HAND ACROSS HIS FOREHEAD AND WAS INSTANTLY HIMSELF CALM AND COOL VERY WELL THEN IT SEEMS I'VE MADE AN ASS OF MYSELF BUT I'LL TRY TO MAKE UP FOR IT NOW WHAT ABOUT CAROLINE"
  },
  "1867-154075-0001": {
    "wav": "{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0001.flac",
    "length": 14.9,
    "words": "THAT DROPPED HIM INTO THE COAL BIN DID HE GET COAL DUST ON HIS SHOES RIGHT AND HE DIDN'T HAVE SENSE ENOUGH TO WIPE IT OFF AN AMATEUR A RANK AMATEUR I TOLD YOU SAID THE MAN OF THE SNEER WITH SATISFACTION"
  }
}
```
As you can see, we have a hierarchical structure in which: 

- Key: **unique identifier** of the spoken sentence,
- First item: **path of the speech recording**,
- Second item: **length** of the recording (if we have a segments file, we might need to change this),
- Third item: **sequence of words** for the given train/test sample.

### **Preparation Script**
Every dataset is formatted in a different way. The script that parses your own dataset and creates the JSON or the CSV files is something that you are supposed to write. Most of the time, this is very straightforward. 

For the mini-librispeech dataset, for instance, we wrote this simple data preparation script called [mini_librispeech_prepare.py](https://github.com/speechbrain/speechbrain/blob/develop/templates/speech_recognition/mini_librispeech_prepare.py).
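
As a rough illustration of what such a script does, the sketch below writes a manifest with the same three fields as the JSON example above. The `build_manifest` helper and the sample entries are hypothetical; a real preparation script would scan the dataset folder and read the durations from the audio files.

```python
import json
import os
import tempfile


def build_manifest(utterances, json_path):
    """Write a manifest mapping utterance IDs to wav path, length, and words.

    `utterances` is a list of (utt_id, wav_path, length, words) tuples.
    In a real preparation script these would come from parsing the dataset.
    """
    manifest = {}
    for utt_id, wav_path, length, words in utterances:
        manifest[utt_id] = {"wav": wav_path, "length": length, "words": words}
    with open(json_path, "w") as fout:
        json.dump(manifest, fout, indent=2)
    return manifest


# Hypothetical entries, shaped like the manifest example shown earlier.
utterances = [
    ("utt-0001", "{data_root}/audio/utt-0001.flac", 2.5, "A STORY"),
    ("utt-0002", "{data_root}/audio/utt-0002.flac", 1.2, "DIRECTION"),
]

json_path = os.path.join(tempfile.mkdtemp(), "train.json")
manifest = build_manifest(utterances, json_path)

# Read the file back to check that it is valid JSON with the expected keys.
with open(json_path) as fin:
    loaded = json.load(fin)
print(sorted(loaded))
```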


## **Step 1: Tokenizer** 
An important decision to make when designing a speech recognizer concerns the basic tokens that our system has to predict (e.g, characters, phonemes, sub-words, words).

### **Using characters as tokens**
One way is to predict characters. In this case, we simply convert the sequence of words into its corresponding sequence of characters (using the space '_' as an additional character):

`THE CITY OF BOGOTA IN COLOMBIA => ['T','H','E','_','C','I','T','Y','_','O','F','_','B','O','G','O','T','A','_','I','N','_','C','O','L','O','M','B','I','A']`

Key information about using characters as tokens:
+ Enough training data for each token: our system only needs to predict between 20 and 30 tokens (depending on the language),
+ Our system might generalize to words never seen during training.

### **Using words as tokens**
Why not predict full words, then?

`THE CITY OF BOGOTA IN COLOMBIA => ['THE','CITY','OF','BOGOTA', 'IN', 'COLOMBIA']`

Key information about using words as tokens:
+ The output sequence is short (only words, plus some symbols if defined).
+ The system, however, can no longer generalize to new words.
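
To make the contrast between the two choices concrete, here is a minimal sketch of both tokenizations applied to the example sentence. The helper names `char_tokens` and `word_tokens` are made up for this illustration and are not part of SpeechBrain.

```python
def char_tokens(sentence):
    """Split a sentence into characters, using '_' for the space."""
    return ["_" if ch == " " else ch for ch in sentence]


def word_tokens(sentence):
    """Split a sentence into whitespace-separated words."""
    return sentence.split()


sentence = "THE CITY OF BOGOTA IN COLOMBIA"
chars = char_tokens(sentence)
words = word_tokens(sentence)

# Characters: long sequence, tiny inventory. Words: short sequence, open inventory.
print(len(chars), "character tokens from", len(set(chars)), "distinct symbols")
print(len(words), "word tokens")
```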

### **Byte Pair Encoding (BPE)**
What about something in between? 
This is what we are trying to do with BPE tokens. BPE is a simple technique inherited from data compression. The basic idea is to allocate tokens for the most frequent sequences of characters. For instance:

`THE CITY OF BOGOTA IN COLOMBIA => ['▁TH', 'E', '▁C', 'I', 'TY', '▁OF', '▁BO', 'G', 'O', 'TA', '▁I', 'N', '▁C', 'O', 'L', 'OM', 'B', 'IA']`

The [algorithm that finds these tokens](https://en.wikipedia.org/wiki/Byte_pair_encoding) is very simple: we start from the characters and we count how many times two consecutive characters are observed together. We allocate a token for the most frequent pair and we iterate over and over until a specified number of tokens is reached. For more information, you can take a look at [our tutorial on the tokenizers](https://colab.research.google.com/drive/12yE3myHSH-eUxzNM0-FLtEOhzdQoLYWe?usp=sharing).
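
The merge procedure described above can be sketched in a few lines of plain Python. This is a toy version for illustration only: the `learn_bpe` function and the tiny corpus are assumptions, and SentencePiece's actual implementation differs (for example, it handles the `▁` word-boundary marker and works on much larger corpora).

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn `num_merges` BPE merges from a list of words (toy version)."""
    # Represent each distinct word as a tuple of symbols, with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each pair of adjacent symbols occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by first occurrence.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with a merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


corpus = "THE CITY OF BOGOTA IN COLOMBIA THE THE THEN".split()
merges, vocab = learn_bpe(corpus, num_merges=2)
print(merges)
```

With this corpus, the frequent word `THE` ends up as a single token after two merges, while rarer words stay split into characters.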

#### *How many BPE tokens should I use?*
The number of tokens is one of the hyperparameters of your system.
Its optimal value depends on the amount of speech data available. Just to give you an idea, for LibriSpeech (i.e., roughly 1000 hours of sentences in English) a reasonable number of tokens ranges between 1k and 10k.

### **Train a Tokenizer**
SpeechBrain relies on the popular [SentencePiece](https://github.com/google/sentencepiece) for tokenization. To find the tokens to allocate (given the training transcriptions), run the following code:

```
cd NLP_Summer_School-2021_Speech_Tutorial/ASR/LibriSpeech/Tokenizer
python train.py tokenizer.yaml
```

### **Testing the Tokenizer fetched from Speechbrain (HuggingFace hub)**

You should be able to fetch the models that you downloaded and then unzipped (they should be in `./pretrained_models/tokenizer.ckpt`)


In [None]:
import torch
import torchaudio

import speechbrain as sb
from speechbrain.pretrained import EncoderDecoderASR

In [3]:
asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-transformer-transformerlm-librispeech", savedir="pretrained_models/")

In [14]:
# Modify the following phrase to check how the Tokenizer works:

phrase = 'THIS IS THE NLP SUMMER SCHOOL, THANK YOU FOR ATTENDING IT'
print("Encoded as pieces: {}".format(asr_model.tokenizer.encode(phrase, out_type=str)))
print("Encoded as ids: {}".format(asr_model.tokenizer.encode_as_ids(phrase)))

Encoded as pieces: ['▁THIS', '▁IS', '▁THE', '▁', 'N', 'L', 'P', '▁SUMMER', '▁SCHOOL', ',', '▁THANK', '▁YOU', '▁FOR', '▁ATTEND', 'ING', '▁IT']
Encoded as ids: [44, 33, 3, 78, 36, 134, 102, 1321, 761, 0, 868, 24, 25, 1465, 13, 17]


The Tokenizer also assigns a unique index to each token. These indices correspond to the outputs of our neural networks for LM and ASR. It's important to keep that in mind, because the vocabulary size defines the number of outputs in the ASR/LM neural networks.

In [25]:
# do you want to know the size of the Tokenizer? 
print("The number of different units in your Tokenizer is: {} units".format(asr_model.tokenizer.vocab_size()))

The number of different units in your Tokenizer is: 5000 units


In [26]:
# Nevertheless, there are several ways in which this phrase can be represented:
for n in range(3):
    print("Version {}: {}".format(n,asr_model.tokenizer.encode(phrase, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)))

Version 0: ['▁T', 'H', 'IS', '▁', 'IS', '▁', 'TH', 'E', '▁', 'N', 'L', 'P', '▁S', 'UM', 'M', 'ER', '▁SCHOOL', ',', '▁', 'TH', 'AN', 'K', '▁YOU', '▁F', 'OR', '▁A', 'T', 'TEN', 'D', 'I', 'NG', '▁IT']
Version 1: ['▁', 'T', 'HI', 'S', '▁', 'I', 'S', '▁', 'TH', 'E', '▁', 'N', 'L', 'P', '▁SUMMER', '▁SCHOOL', ',', '▁THAN', 'K', '▁YOU', '▁FOR', '▁AT', 'TEN', 'D', 'ING', '▁I', 'T']
Version 2: ['▁T', 'HI', 'S', '▁I', 'S', '▁THE', '▁', 'N', 'L', 'P', '▁SU', 'M', 'M', 'E', 'R', '▁SCH', 'O', 'O', 'L', ',', '▁', 'T', 'HA', 'N', 'K', '▁YOU', '▁FOR', '▁', 'AT', 'TEN', 'D', 'ING', '▁IT']


## Tokenizer insights
It is pretty evident that there is a lot of flexibility in how we can represent words (and word sequences) with a BPE-based Tokenizer!
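
To see where this flexibility comes from, the sketch below enumerates every way of splitting the piece `▁THIS` given a small, made-up set of sub-word pieces. The `segmentations` helper and the vocabulary are assumptions for illustration; the real tokenizer does not enumerate all options but samples among them according to the learned piece probabilities.

```python
def segmentations(text, pieces):
    """Return every way to split `text` into tokens drawn from `pieces`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in pieces:
            # Keep this prefix and recursively split the remainder.
            for tail in segmentations(text[end:], pieces):
                results.append([head] + tail)
    return results


# Hypothetical vocabulary; a trained tokenizer has thousands of pieces.
pieces = {"▁", "T", "H", "I", "S", "▁T", "HI", "IS", "▁THIS"}
options = segmentations("▁THIS", pieces)
for option in options:
    print(option)
```

Even with only nine pieces there are several valid segmentations of a single word, which is why the sampled versions printed by the notebook differ from one another.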

As mentioned before, we are not training any model in this Tutorial. Nevertheless, we want to share some key insights regarding the training scripts that you might want to know.

First, we need a hyperparameters YAML file: 
- Hyperparameter file: `tokenizer.yaml`:

```yaml
# ############################################################################
# Tokenizer: subword BPE with unigram 5K
# Training: Librispeech 960h
# Authors:  Abdel Heba 2021
# ############################################################################

output_folder: !ref results/5K_subword_unigram_960h_LM/
train_log: !ref <output_folder>/train_log.txt

# Data files
data_folder: !PLACEHOLDER # e.g., /path/to/LibriSpeech
train_splits: ["train-clean-100", "train-clean-360", "train-other-500"]
dev_splits: ["dev-clean"]
test_splits: ["test-clean", "test-other"]
train_csv: !ref <output_folder>/train.csv
valid_csv: !ref <output_folder>/dev-clean.csv

# Training parameters
token_type: unigram  # ["unigram", "bpe", "char"]
token_output: 5000  # index(blank/eos/bos/unk) = 0
character_coverage: 1.0
csv_read: words
bos_index: 1 # Beginning of sentence index
eos_index: 2 # End of sentence index


tokenizer: !name:speechbrain.tokenizers.SentencePiece.SentencePiece
   model_dir: !ref <output_folder>
   vocab_size: !ref <token_output>
   annotation_train: !ref <train_csv>
   annotation_read: wrd
   model_type: "unigram" # ["unigram", "bpe", "char"]
   character_coverage: 1.0
   bos_id: !ref <bos_index> # Define bos_id/eos_id if different from blank_id
   eos_id: !ref <eos_index>
   annotation_list_to_check: [!ref <train_csv>, !ref <valid_csv>]
```

The training script takes the `words` field of each training sample in the JSON file and trains the Tokenizer on it:

```json
{
  "1867-154075-0032": {
    "wav": "{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0032.flac",
    "length": 16.09,
    "words": "AND HE BRUSHED A HAND ACROSS HIS FOREHEAD AND WAS INSTANTLY HIMSELF CALM AND COOL VERY WELL THEN IT SEEMS I'VE MADE AN ASS OF MYSELF BUT I'LL TRY TO MAKE UP FOR IT NOW WHAT ABOUT CAROLINE"
  },...,
    n_samples
}
```

The tokenizer is trained on the training annotations only. Here we set a vocabulary size of 5000. Instead of using the standard BPE algorithm, we use a variant based on a unigram language model (`token_type: unigram`). See [sentencepiece](https://github.com/google/sentencepiece) for more info.
The tokenizer will be saved in the specified `output_folder`. 

- Training script: `train.py`:

Essentially, we prepare the data with the `prepare_mini_librispeech` script and then run the SentencePiece tokenizer wrapped in the class `speechbrain.tokenizers.SentencePiece.SentencePiece`.


### Output files
The Tokenizer script generates two files: a binary file containing all the information needed to tokenize an input text, and a file listing the model's tokens with their log probabilities:
+ *5000_unigram.model*
+ *5000_unigram.vocab*


## **Step 2: Language Model**
A Language Model (LM) can be used within a speech recognizer in different ways. In this tutorial, we perform the so-called **shallow fusion** where the language information is used within the beam searcher of the speech recognizer to rescore the partial hypothesis. In practice, for every time step, we rescore the partial hypothesis provided by the speech recognizer with the language scores (that penalize sequences of tokens that are "unlikely" to be observed).

The following image gives more details about shallow fusion:

<img src="Figures/shallow_fusion.png">

**Shallow fusion of LM + ASR**
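
The rescoring step can be sketched as follows: for each candidate token, the ASR log-probability is combined with a weighted LM log-probability, and only the best candidates are kept in the beam. The function name, the `lm_weight` parameter, and all the scores are made-up assumptions for illustration; SpeechBrain's beam searcher has its own interface and parameter names.

```python
import math


def shallow_fusion_step(asr_log_probs, lm_log_probs, lm_weight, beam_size):
    """Rescore candidate next tokens and keep the `beam_size` best ones."""
    fused = {
        token: asr_log_probs[token] + lm_weight * lm_log_probs[token]
        for token in asr_log_probs
    }
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ranked[:beam_size]


# Made-up scores for three candidate tokens following the prefix "THE BIRCH".
asr_log_probs = {
    "CANOE": math.log(0.40),
    "CANOO": math.log(0.45),
    "KANU": math.log(0.15),
}
lm_log_probs = {
    "CANOE": math.log(0.70),
    "CANOO": math.log(0.05),
    "KANU": math.log(0.25),
}

best = shallow_fusion_step(asr_log_probs, lm_log_probs, lm_weight=0.5, beam_size=2)
print(best)
```

In this toy case the acoustic scores alone slightly prefer the misspelled `CANOO`, but the LM penalizes it, so `CANOE` ranks first after fusion.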


Some recent studies have shown that a speech recognizer trained on a very large dataset can achieve impressive performance even without a language model. However, for medium-scale speech recognition tasks like Librispeech 1000h, the language model still plays a role in improving the final performance.

## **Basic information about our ASR model**

As explained before, the speechbrain ASR system fetched from Huggingface is composed of three main elements: 
+ Tokenizer
+ Language Model
+ Acoustic Model 


In [37]:
# Let's get some information from our LM
# we can access the modules of the ASR class with `.modules.`

# Get the model in `asr_model`
asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-transformer-transformerlm-librispeech", savedir="pretrained_models/")
print("Our system has the following modules: {}".format(asr_model.modules.keys()))

Our system has the following modules: odict_keys(['compute_features', 'pre_transformer', 'transformer', 'asr_model', 'normalize', 'lm_model', 'encoder', 'decoder'])


In [39]:
dim_embedding = asr_model.modules.lm_model.__dict__['d_embedding']
num_encoder_layers = asr_model.modules.lm_model.__dict__['num_encoder_layers']
params_lm = int(sum(p.numel() for p in asr_model.modules.lm_model.parameters()))

print(f"The Language Model has a embedding dimension of: {dim_embedding}")
print(f"The Language Model has {num_encoder_layers} encoder layers")
print(f"The Language Model has {int(params_lm/1e6)}M parameters")

The Language Model has a embedding dimension of: 768
The Language Model has 12 encoder layers
The Language Model has 93M parameters


In [40]:
asr_model.modules.transformer.__dict__['_parameters']
params_asr_model = int(sum(p.numel() for p in asr_model.modules.asr_model.parameters()))
params_encoder = int(sum(p.numel() for p in asr_model.modules.encoder.parameters()))
params_decoder = int(sum(p.numel() for p in asr_model.modules.decoder.parameters()))

print(f"The Encoder has {int(params_encoder/1e6)}M parameters")
print(f"The ASR model has {int(params_asr_model/1e6)}M parameters")
print(f"The remaining {int(params_asr_model/1e6) - int(params_encoder/1e6)}M parameters are in the output/normalization/FrontEnd layers")
print(f"\n---Decoder = ASR model + LM model---")
print(f"The Decoder has {int(params_decoder/1e6)}M parameters")

The Encoder has 153M parameters
The ASR model has 161M parameters
The remaining 8M parameters are in the output/normalization/FrontEnd layers

---Decoder = ASR model + LM model---
The Decoder has 254M parameters


## Training a LM

When you train an LM (with the commands below), some files/folders are generated in the `output_folder` specified in `hparams/transformer.yaml`.

```
pip install datasets
cd NLP_Summer_School-2021_Speech_Tutorial/ASR/LibriSpeech/LM
python train.py hparams/transformer.yaml
```

Such as: 

*   `train_log.txt`: contains the statistics (e.g, train_loss, valid_loss) computed at each epoch. 
*   `log.txt`: is a more detailed logger containing the timestamps for each basic operation.
*  `env.log`: shows all the dependencies used with their corresponding version (useful for replicability).

*  `train.py`, `hyperparams.yaml`:  are a copy of the experiment file along with the corresponding hyperparameters (for replicability).

* `save`:  is the place where we store the learned model.

In the `save` folder:
+ Subfolders containing the checkpoints saved during training (in the format `CKPT+date+time`),
+ Normally: two checkpoints - the best (i.e., the oldest one) and the latest (i.e., the most recent one).

Inside each checkpoint, you can find all the information needed to resume training (e.g., models, optimizers, schedulers, epoch counter, etc.). The parameters of the transformer model are stored in the `model.ckpt` file. This is just a binary format readable with `torch.load`.

### **Experiment file**
Let's now look at how the objects, functions, and hyperparameters declared in the yaml file are used in `train.py` to implement the language model.


```python
import sys

import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml

# Recipe begins!
if __name__ == "__main__":

    # Reading command line arguments
    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])

    # Initialize ddp (useful only for multi-GPU DDP training)
    sb.utils.distributed.ddp_init_group(run_opts)

    # Load hyperparameters file with command-line overrides
    with open(hparams_file) as fin:
        hparams = load_hyperpyyaml(fin, overrides)

    # Create experiment directory
    sb.create_experiment_directory(
        experiment_directory=hparams["output_folder"],
        hyperparams_to_save=hparams_file,
        overrides=overrides,
    )
```

Here we perform some preliminary operations such as:
+ Parsing the command line,
+ Initializing distributed data-parallel training (needed if multiple GPUs are used),
+ Reading the configuration file (YAML format), and
+ Creating the output folder.

```python
    # Initialize the Brain object to prepare for LM training.
    lm_brain = LM(
        modules=hparams["modules"],
        opt_class=hparams["optimizer"],
        hparams=hparams,
        run_opts=run_opts,
        checkpointer=hparams["checkpointer"],
    )
```
The brain class implements all the functionalities needed for supporting the training and validation loops. Its `fit` and `evaluate` methods perform training and testing, respectively:

```python
    lm_brain.fit(
        lm_brain.hparams.epoch_counter,
        train_data,
        valid_data,
        train_loader_kwargs=hparams["train_dataloader_opts"],
        valid_loader_kwargs=hparams["valid_dataloader_opts"],
    )

    # Load best checkpoint for evaluation
    test_stats = lm_brain.evaluate(
        test_data,
        min_key="loss",
        test_loader_kwargs=hparams["test_dataloader_opts"],
    )
```
The training and validation datasets are passed to the `fit` method, while the test dataset is fed into the `evaluate` method.
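
To give a feel for how `fit` and `evaluate` are organized, here is a deliberately tiny stand-in written in plain Python. It is not SpeechBrain code: the `ToyBrain` class, its single-weight "model", and the hand-written gradient are all assumptions made for illustration. Only the method names are chosen to echo the structure of the Brain class.

```python
class ToyBrain:
    """Minimal stand-in for the training loop structure (not SpeechBrain code)."""

    def __init__(self, weight=0.0, lr=0.1):
        self.weight = weight
        self.lr = lr

    def compute_forward(self, x):
        # Forward pass: input -> model -> prediction.
        return self.weight * x

    def compute_objectives(self, prediction, target):
        # Squared-error loss between prediction and target.
        return (prediction - target) ** 2

    def fit_batch(self, x, target):
        prediction = self.compute_forward(x)
        loss = self.compute_objectives(prediction, target)
        # Gradient of (w*x - t)^2 with respect to w is 2*(w*x - t)*x.
        grad = 2 * (prediction - target) * x
        self.weight -= self.lr * grad
        return loss

    def evaluate(self, data):
        # Average loss over a dataset, without updating the weight.
        losses = [
            self.compute_objectives(self.compute_forward(x), target)
            for x, target in data
        ]
        return sum(losses) / len(losses)

    def fit(self, epochs, train_data, valid_data):
        for epoch in range(epochs):
            for x, target in train_data:
                self.fit_batch(x, target)
            valid_loss = self.evaluate(valid_data)
            print(f"epoch {epoch}: valid_loss={valid_loss:.4f}")


train_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # target = 2 * x
valid_data = [(4.0, 8.0)]

brain = ToyBrain()
brain.fit(epochs=20, train_data=train_data, valid_data=valid_data)
test_loss = brain.evaluate([(5.0, 10.0)])
print(f"test_loss={test_loss:.6f}")
```

The real class additionally handles devices, optimizers, checkpointing, and data loaders, which is why a recipe usually only needs to define the forward computation and the objective.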

## Key methods in the Brain class

#### **1. Forward Computations**

Defines all the computations needed to transform the input text into the output predictions.

What do we do here? 
+ Put the batch on the right device (CPU/GPU),
+ Forward pass: encoded tokens --> model --> output predictions

<img src="Figures/forward_pass.png" width="400">

**Forward/backward pass in a network with fully connected layers**


#### **2. Compute Objectives**

Takes the targets, the predictions, and estimates a loss function:

How? 
+ Get predictions from Forward pass
+ Compute the loss function: compare predictions with target tokens

<img src="Figures/loss_function.png" width="400">

**How to compute the loss function between predictions and ground truth labels**
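
As a concrete example of comparing predictions with target tokens, the sketch below computes an average negative log-likelihood over a short sequence in plain Python. The function names and the numbers are assumptions for illustration; the actual recipe computes its loss on tensors and may add refinements such as label smoothing.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def nll_loss(batch_logits, targets):
    """Average negative log-likelihood of the target token at each position."""
    total = 0.0
    for logits, target in zip(batch_logits, targets):
        total -= log_softmax(logits)[target]
    return total / len(targets)


# Toy example: 3 positions, vocabulary of 4 tokens (values are made up).
batch_logits = [
    [2.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 3.0, 0.1],
    [0.1, 2.5, 0.1, 0.1],
]
good_targets = [0, 2, 1]  # match the highest-scoring token at each position
bad_targets = [3, 0, 2]   # point at low-scoring tokens

good_loss = nll_loss(batch_logits, good_targets)
bad_loss = nll_loss(batch_logits, bad_targets)
print(f"loss with matching targets:    {good_loss:.3f}")
print(f"loss with mismatching targets: {bad_loss:.3f}")
```

The loss is low when the model puts most of its probability on the target tokens and grows when it does not, which is the signal used to update the parameters.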


## **Step 3: Speech Recognizer**
At this point, we can train our speech recognizer. In this tutorial, we are
going to train an **attention-based end-to-end speech recognizer** (offline).
The encoder relies on a combination of convolutional, recurrent, and fully connected models. The decoder is an autoregressive GRU decoder. An attention mechanism is employed between the encoder and the decoder. The final sequence of words is retrieved with beam search coupled with the RNNLM trained in the previous step. 
The attention-based system is jointly trained with CTC (applied on top of the encoder).
The system uses data augmentation techniques to improve its performance.
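
Joint CTC+attention training boils down to a weighted sum of the two losses. The sketch below shows one plausible way to combine them, using the `ctc_weight` and `number_of_ctc_epochs` hyperparameters that appear in the YAML file further down. The `joint_loss` function itself, and the assumption that CTC only contributes during the first epochs, are illustrative and may not match the recipe exactly.

```python
def joint_loss(loss_seq, loss_ctc, ctc_weight, epoch, number_of_ctc_epochs):
    """Combine attention (seq2seq) and CTC losses.

    Assumption: CTC contributes only during the first `number_of_ctc_epochs`.
    """
    if epoch <= number_of_ctc_epochs:
        return ctc_weight * loss_ctc + (1 - ctc_weight) * loss_seq
    return loss_seq


# Made-up loss values, with the weights used in the hyperparameter file.
early = joint_loss(loss_seq=2.0, loss_ctc=4.0, ctc_weight=0.5,
                   epoch=1, number_of_ctc_epochs=5)
late = joint_loss(loss_seq=2.0, loss_ctc=4.0, ctc_weight=0.5,
                  epoch=6, number_of_ctc_epochs=5)
print(early, late)
```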

### **Train the speech recognizer**
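To train the speech recognizer, run `python train.py hparams/transformer.yaml` from the `NLP_Summer_School-2021_Speech_Tutorial/ASR/LibriSpeech/ASR/transformer` folder (the same command listed in the overview). The figure below summarizes the system: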

<img src="Figures/end-to-end.png" width="500">

**Standard end-to-end system trained with CTC+Attention loss**


### **Hyperparameters**


The hyperparameter file starts with the definition of basic things, such as seed and path settings:

#### **Data related**

```yaml
# Seed needs to be set at top of yaml, before objects with parameters are instantiated
seed: 2602
__set_seed: !apply:torch.manual_seed [!ref <seed>]

data_folder: ../data # In this case, data will be automatically downloaded here.
data_folder_rirs: !ref <data_folder> # noise/RIR dataset will automatically be downloaded here
output_folder: !ref results/CRDNN_BPE_960h_LM/<seed>
wer_file: !ref <output_folder>/wer.txt
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt

pretrained_path: speechbrain/asr-crdnn-rnnlm-librispeech

# Path where data manifest files will be stored. The data manifest files are created by the
# data preparation script
train_annotation: ../train.json
valid_annotation: ../valid.json
test_annotation: ../test.json

# The train logger writes training statistics to a file, as well as stdout.
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>
```

The `data_folder` corresponds to the path where the mini-librispeech is stored. 

We also have to specify the data manifest files for training, validation, and test. If not available, these files will be created by the data preparation script called in `train.py`.

After that, we define a bunch of parameters for training, feature extraction, model definition, and decoding:

#### **Model related**

```yaml
# Training parameters
number_of_epochs: 15
number_of_ctc_epochs: 5
batch_size: 8
lr: 1.0
ctc_weight: 0.5
sorting: ascending
ckpt_interval_minutes: 15 # save checkpoint every N min
label_smoothing: 0.1

# Model parameters
activation: !name:torch.nn.LeakyReLU
dropout: 0.15
cnn_blocks: 2
cnn_channels: (128, 256)
inter_layer_pooling_size: (2, 2)
cnn_kernelsize: (3, 3)
time_pooling_size: 4
rnn_class: !name:speechbrain.nnet.RNN.LSTM
rnn_layers: 4
rnn_neurons: 1024
rnn_bidirectional: True
dnn_blocks: 2
dnn_neurons: 512
emb_size: 128
dec_neurons: 1024
output_neurons: 1000  # Number of tokens (same as LM)
blank_index: 0
bos_index: 0
eos_index: 0
unk_index: 0
```

For instance, we define the number of epochs, the initial learning rate, the batch size, the weight of the CTC loss, and many others. 

We can also add: 

+ Different feature extraction 'layers' (FBANK, MFCC, or wav2vec)
+ Normalization layer
+ Environmental corruption
+ Data augmentation for Speech --> SpecAugment, Speed Perturbation, etc
+ Beam search algorithm + hyperparams


```yaml
# This object is used to pretrain the language model and the tokenizers
# (defined above). In this case, we also pretrain the ASR model (to make
# sure the model converges on a small amount of data)
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    collect_in: !ref <save_folder>
    loadables:
        lm: !ref <lm_model>
        tokenizer: !ref <tokenizer>
        model: !ref <model>
    paths:
        lm: !ref <pretrained_path>/lm.ckpt
        tokenizer: !ref <pretrained_path>/tokenizer.ckpt
        model: !ref <pretrained_path>/asr.ckpt

```
Additionally, we can add pre-trained models such as the LM.
The final object is the pretrainer, which links the language model, the tokenizer, and the acoustic speech recognition model with the corresponding files used for pre-training. Here we pre-train the acoustic model as well. On such a small dataset, it is very hard to make an end-to-end speech recognizer converge, so we use another model to pre-train it (you should skip this part when training on a larger dataset).

### **What about inference and outputs?**

Running ASR inference on a set of wav files produces a detailed report such as:


```
%WER 3.09 [ 1622 / 52576, 167 ins, 171 del, 1284 sub ]
%SER 33.66 [ 882 / 2620 ]
Scored 2620 sentences, 0 not present in hyp.
================================================================================
ALIGNMENTS

Format:
<utterance-id>, WER DETAILS
<eps> ; reference  ; on ; the ; first ;  line
  I   ;     S      ; =  ;  =  ;   S   ;   D  
 and  ; hypothesis ; on ; the ; third ; <eps>
================================================================================
672-122797-0033, %WER 0.00 [ 0 / 2, 0 ins, 0 del, 0 sub ]
A ; STORY
= ;   =  
A ; STORY
================================================================================
2094-142345-0041, %WER 0.00 [ 0 / 1, 0 ins, 0 del, 0 sub ]
DIRECTION
    =    
DIRECTION
================================================================================
2830-3980-0026, %WER 50.00 [ 1 / 2, 0 ins, 0 del, 1 sub ]
VERSE ; TWO
  S   ;  = 
FIRST ; TWO
```
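
The numbers in this report come from aligning each hypothesis with its reference by edit distance and counting insertions, deletions, and substitutions. The sketch below is a minimal plain-Python version of that computation; the `wer_details` function is written for this illustration and is not SpeechBrain's own scoring code.

```python
def wer_details(reference, hypothesis):
    """Return (wer_percent, insertions, deletions, substitutions).

    Assumes a non-empty reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] = (edits, ins, del, sub) for aligning ref[:i] with hyp[:j].
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, j, 0, 0)  # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                # Matching words cost nothing.
                table[i][j] = table[i - 1][j - 1]
                continue
            e, n_ins, n_del, n_sub = table[i - 1][j - 1]
            substitution = (e + 1, n_ins, n_del, n_sub + 1)
            e, n_ins, n_del, n_sub = table[i][j - 1]
            insertion = (e + 1, n_ins + 1, n_del, n_sub)
            e, n_ins, n_del, n_sub = table[i - 1][j]
            deletion = (e + 1, n_ins, n_del + 1, n_sub)
            # Keep the alignment with the fewest total edits.
            table[i][j] = min(substitution, insertion, deletion)
    edits, n_ins, n_del, n_sub = table[len(ref)][len(hyp)]
    return 100.0 * edits / len(ref), n_ins, n_del, n_sub


wer, n_ins, n_del, n_sub = wer_details("VERSE TWO", "FIRST TWO")
print(f"%WER {wer:.2f} [ {n_ins} ins, {n_del} del, {n_sub} sub ]")
# prints "%WER 50.00 [ 0 ins, 0 del, 1 sub ]"
```

This reproduces the last entry of the report above: one substitution over two reference words gives a WER of 50%.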

## **Step 4: Inference**

At this point, we can use the trained speech recognizer. For this type of ASR model, SpeechBrain provides some classes ([take a look here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/pretrained/interfaces.py)), such as `EncoderDecoderASR`, that make inference easier. For instance, we can transcribe an audio file with a pre-trained model hosted in our [HuggingFace repository](https://huggingface.co/speechbrain) in just 4 lines of code:


In [1]:
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-transformer-transformerlm-librispeech", savedir="pretrained_models/")
audio_file = "pretrained_models/example.wav"
transcript = asr_model.transcribe_file(audio_file)

print(transcript)

THE BIRCH CANOE SLID ON THE SMOOTH PLANKS


## **Customize your speech recognizer**
In a general case, you might have your own data and you would like to use your own model. Let's comment a bit more on how you can customize your recipe. 

**Suggestion**: start from a recipe that is working (like the one used for this template) and make only the minimal modifications needed to customize it. Test your model step by step. Make sure your model can overfit on a tiny dataset composed of a few sentences. If it doesn't overfit, there is likely a bug in your model.

## **Conclusion**

In this short tutorial, we reviewed the main parts needed to build an ASR system (i.e., Tokenizer, Language Model, and Acoustic Model) with SpeechBrain. Additionally, we learned how to use the `EncoderDecoderASR` interface, which allows you to perform speech recognition in just 4 lines of code!

Special thanks to the [Speechbrain](https://github.com/speechbrain/speechbrain) team and [Huggingface](https://github.com/huggingface/transformers)!


Here are some recipes developed in Speechbrain:

- [LibriSpeech recipes](https://github.com/speechbrain/speechbrain/tree/develop/recipes/LibriSpeech)
- [CommonVoice](https://github.com/speechbrain/speechbrain/tree/develop/recipes/CommonVoice)
- [AISHELL-1](https://github.com/speechbrain/speechbrain/tree/develop/recipes/AISHELL-1)
- [TIMIT](https://github.com/speechbrain/speechbrain/tree/develop/recipes/TIMIT)

## Related Tutorials

These are some related tutorials if you want to further explore the ASR field:

0. [ASRfromScratch](https://colab.research.google.com/drive/1aFgzrUv3udM_gNJNUoLaHIm78QHtxdIz?usp=sharing)
1. [YAML hyperparameter specification](https://colab.research.google.com/drive/1Pg9by4b6-8QD2iC0U7Ic3Vxq4GEwEdDz?usp=sharing)
2. [Brain Class](https://colab.research.google.com/drive/1fdqTk4CTXNcrcSVFvaOKzRfLmj4fJfwa?usp=sharing)
3. [Checkpointing](https://colab.research.google.com/drive/1VH7U0oP3CZsUNtChJT2ewbV_q1QX8xre?usp=sharing)
4. [Data-io](https://colab.research.google.com/drive/1AiVJZhZKwEI4nFGANKXEe-ffZFfvXKwH?usp=sharing)
5. [Tokenizer](https://colab.research.google.com/drive/12yE3myHSH-eUxzNM0-FLtEOhzdQoLYWe?usp=sharing)
6. [Speech Features](https://colab.research.google.com/drive/1CI72Xyay80mmmagfLaIIeRoDgswWHT_g?usp=sharing)
7. [Speech Augmentation](https://colab.research.google.com/drive/1JJc4tBhHNXRSDM2xbQ3Z0jdDQUw4S5lr?usp=sharing)
8. [Environmental Corruption](https://colab.research.google.com/drive/1mAimqZndq0BwQj63VcDTr6_uCMC6i6Un?usp=sharing)
9. [MultiGPU Training](https://colab.research.google.com/drive/13pBUacPiotw1IvyffvGZ-HrtBr9T6l15?usp=sharing)
10. [Pretrain and Fine-tune](https://colab.research.google.com/drive/1LN7R3U3xneDgDRK2gC5MzGkLysCWxuC3?usp=sharing)




