# **Notebook to uncover the key parts of ASR**

This tutorial walks you through all the modules needed to implement an offline **end-to-end attention-based speech recognizer** with SpeechBrain.

For simplicity, we do not train any model; instead, we use the models (AM/LM/Tokenizer) available from the Hugging Face Hub. These models were trained on the open-source [LibriSpeech](https://www.openslr.org/12/) dataset, which contains 960 hours of training data.

In this tutorial, we will refer to the code in `NLP_Summer_School-2021_Speech_Tutorial/ASR/LibriSpeech/{ASR,LM,Tokenizer}`.
For a more detailed walkthrough on training ASR from scratch, see this Colab notebook: [Colab Notebook - Train from Scratch](https://colab.research.google.com/drive/1aFgzrUv3udM_gNJNUoLaHIm78QHtxdIz?usp=sharing)

## **Which modules are we covering today?**

To train the LM and AM, you would need to prepare the LibriSpeech folder and download all the required material. Training could take days on a cluster with several GPUs.

0. **Data preparation**.
For this tutorial we won't need any data preparation step, because we will take advantage of the SpeechBrain class `EncoderDecoderASR`, which can run inference on a single `wav` file.

1. **Tokenizer**.
The tokenizer decides which basic units are used during ASR training/inference (e.g., characters, phonemes, sub-words, words).

```
cd NLP_Summer_School-2021_Speech_Tutorial/ASR/LibriSpeech/Tokenizer
python train.py tokenizer.yaml
```

2. **The language model**.
After that, the language model can be trained (it is only used during inference). In this example, however, we do not train it; we download a pre-trained version instead.

Training it requires an additional Python (Hugging Face) library: `datasets`
```
pip install datasets
cd NLP_Summer_School-2021_Speech_Tutorial/ASR/LibriSpeech/LM
python train.py hparams/transformer.yaml
```

3. **Automatic speech recognizer - Speech-to-text system**.
At this point, we are ready to train our speech recognizer. In this tutorial, we will use the CRDNN model with an autoregressive GRU decoder. An attention mechanism is employed between the encoder and the decoder. The final sequence of words is retrieved with beam search coupled with the Transformer LM fetched in the previous step:
```
cd NLP_Summer_School-2021_Speech_Tutorial/ASR/LibriSpeech/ASR/transformer
python train.py hparams/transformer.yaml
```

4. **Use the speech recognizer (inference)**:
After training, we can use the speech recognizer for inference. We will use the `EncoderDecoderASR` class available in SpeechBrain to run it.

(Most of this tutorial is based on the [ASRfromScratch](https://colab.research.google.com/drive/1aFgzrUv3udM_gNJNUoLaHIm78QHtxdIz?usp=sharing) Google Colab! Thanks!)

We will go through each of these steps in turn.

## **Step 0: Prepare your data** 

**!! You don't need to do anything here for the NLP summer school speech Tutorial. In case you'd like to continue training your own ASR engine, you could follow the notebooks' links at the end of this one.**

The goal of data preparation is to create the data manifest files. 
These files tell SpeechBrain where to find the audio data and their corresponding transcriptions. They are text files written in the popular CSV and JSON formats.

### **Data manifest files**
Let's take a look at what a data manifest file in JSON format looks like:


```json
{
  "1867-154075-0032": {
    "wav": "{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0032.flac",
    "length": 16.09,
    "words": "AND HE BRUSHED A HAND ACROSS HIS FOREHEAD AND WAS INSTANTLY HIMSELF CALM AND COOL VERY WELL THEN IT SEEMS I'VE MADE AN ASS OF MYSELF BUT I'LL TRY TO MAKE UP FOR IT NOW WHAT ABOUT CAROLINE"
  },
  "1867-154075-0001": {
    "wav": "{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0001.flac",
    "length": 14.9,
    "words": "THAT DROPPED HIM INTO THE COAL BIN DID HE GET COAL DUST ON HIS SHOES RIGHT AND HE DIDN'T HAVE SENSE ENOUGH TO WIPE IT OFF AN AMATEUR A RANK AMATEUR I TOLD YOU SAID THE MAN OF THE SNEER WITH SATISFACTION"
  }
}
```
As you can see, we have a hierarchical structure in which: 

- Key: **unique identifier** of the spoken sentence,
- First item: **path of the speech recording**,
- Second item: **length** of the recording in seconds (if we have a segments file, we might need to change this),
- Third item: **sequence of words** for the given train/test sample.
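
To make the structure above concrete, here is a minimal sketch that parses such a manifest with the standard `json` module. The helper name `load_manifest` and the path `/path/to/data` are assumptions made for illustration (they are not part of SpeechBrain), and the transcriptions are shortened.

```python
import json

# A tiny manifest in the same layout as the example above (transcripts shortened).
manifest_text = """
{
  "1867-154075-0032": {
    "wav": "{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0032.flac",
    "length": 16.09,
    "words": "AND HE BRUSHED A HAND ACROSS HIS FOREHEAD"
  },
  "1867-154075-0001": {
    "wav": "{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0001.flac",
    "length": 14.9,
    "words": "THAT DROPPED HIM INTO THE COAL BIN"
  }
}
"""


def load_manifest(text, data_root):
    """Parse a JSON manifest and resolve the {data_root} placeholder (hypothetical helper)."""
    entries = json.loads(text)
    for item in entries.values():
        item["wav"] = item["wav"].replace("{data_root}", data_root)
    return entries


manifest = load_manifest(manifest_text, data_root="/path/to/data")  # hypothetical path
total_seconds = sum(item["length"] for item in manifest.values())

for utt_id, item in manifest.items():
    print(utt_id, item["length"], len(item["words"].split()), "words")
print("Total duration: {:.2f} s".format(total_seconds))
```

A quick pass like this is a cheap sanity check on a freshly generated manifest before any training starts.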

### **Preparation Script**
Every dataset is formatted in a different way. The script that parses your own dataset and creates the JSON or the CSV files is something that you are supposed to write. Most of the time, this is very straightforward. 

For the mini-librispeech dataset, for instance, we wrote this simple data preparation script called [mini_librispeech_prepare.py](https://github.com/speechbrain/speechbrain/blob/develop/templates/speech_recognition/mini_librispeech_prepare.py).


## **Step 1: Tokenizer** 
An important decision to make when designing a speech recognizer concerns the basic tokens that our system has to predict (e.g, characters, phonemes, sub-words, words).

### **Using characters as tokens**
One way is to predict characters. In this case, we simply convert the sequence of words into its corresponding sequence of characters (using '_' as an additional character that marks the space):

`THE CITY OF BOGOTA IN COLOMBIA => ['T','H','E','_','C','I','T','Y','_','O','F','_','B','O','G','O','T','A','_','I','N','_','C','O','L','O','M','B','I','A']`

Key information about using characters as tokens:
+ There is enough training data for each token, since our system only needs to predict between 20 and 30 tokens (depending on the language),
+ Our system might generalize to words never seen during training.
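
As a small illustration of the character-level mapping, here is a sketch in plain Python. The function name `to_characters` is hypothetical and not part of SpeechBrain.

```python
def to_characters(sentence, space_symbol="_"):
    """Map a sentence to a list of character tokens, marking spaces explicitly."""
    return [space_symbol if ch == " " else ch for ch in sentence]


sentence = "THE CITY OF BOGOTA IN COLOMBIA"
tokens = to_characters(sentence)
print(tokens)
print("Sequence length:", len(tokens))
print("Distinct tokens used:", len(set(tokens)))

# A word that never appeared in training is still representable:
print(to_characters("MEDELLIN"))
```

The token inventory stays tiny, but the output sequence is as long as the sentence itself.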

### **Using words as tokens**
Why not predict full words, then?

`THE CITY OF BOGOTA IN COLOMBIA => ['THE','CITY','OF','BOGOTA', 'IN', 'COLOMBIA']`

Key information about using words as tokens:
+ The output sequence is short (only words, plus some special symbols if defined),
+ The system, however, can no longer generalize to new words.
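
The generalization problem can be seen in a few lines of plain Python. This is only a sketch: the `<unk>` symbol and the helper names `build_vocab` and `encode_words` are assumptions made for illustration.

```python
def build_vocab(sentences, unk="<unk>"):
    """Assign an index to every word seen in training; index 0 is reserved for unknown words."""
    vocab = {unk: 0}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode_words(sentence, vocab, unk="<unk>"):
    """Replace each word by its index, falling back to the unknown index."""
    return [vocab.get(word, vocab[unk]) for word in sentence.split()]


vocab = build_vocab(["THE CITY OF BOGOTA IN COLOMBIA"])
ids_seen = encode_words("THE CITY OF BOGOTA", vocab)
ids_unseen = encode_words("THE CITY OF MEDELLIN", vocab)

print(ids_seen)    # every word is in the vocabulary
print(ids_unseen)  # the unseen word collapses to the <unk> index
```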

### **Byte Pair Encoding (BPE)**
What about something in between? 
This is what we are trying to do with BPE tokens. BPE is a simple technique inherited from data compression. The basic idea is to allocate tokens for the most frequent sequences of characters. For instance:

`THE CITY OF BOGOTA IN COLOMBIA => ['▁TH', 'E', '▁C', 'I', 'TY', '▁OF', '▁BO', 'G', 'O', 'TA', '▁I', 'N', '▁C', 'O', 'L', 'OM', 'B', 'IA']`

The [algorithm that finds these tokens](https://en.wikipedia.org/wiki/Byte_pair_encoding) is very simple: we start from the characters and we count how many times two consecutive characters are observed together. We allocate a token for the most frequent pair and we iterate over and over until a specified number of tokens is reached. For more information, you can take a look at [our tutorial on the tokenizers](https://colab.research.google.com/drive/12yE3myHSH-eUxzNM0-FLtEOhzdQoLYWe?usp=sharing).
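
The merge procedure described above can be sketched in plain Python as follows. This is a toy version that works on a list of words and is meant only to illustrate the idea; it is not the SentencePiece implementation, and the function name `learn_bpe` is hypothetical.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Every word starts as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each pair of adjacent symbols occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus, replacing the best pair with one merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


merges, corpus = learn_bpe("THE THEN THEY THAT THIS CITY".split(), num_merges=2)
print("Learned merges:", merges)
print("Segmented corpus:", list(corpus))
```

In this toy corpus the pair `T`+`H` is the most frequent, so it is merged first, followed by `TH`+`E`.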

#### *How many BPE tokens should I use?*
The number of tokens is one of the hyperparameters of your system.
Its optimal value depends on the amount of speech data available. Just to give you an idea, for LibriSpeech (i.e., 1000 hours of sentences in English) a reasonable number of tokens ranges between 1k and 10k.

### **Train a Tokenizer**
SpeechBrain relies on the popular [SentencePiece](https://github.com/google/sentencepiece) for tokenization. To find the tokens to allocate (given the training transcriptions), run the following code:

```
cd NLP_Summer_School-2021_Speech_Tutorial/ASR/LibriSpeech/Tokenizer
python train.py tokenizer.yaml
```

### **Testing the Tokenizer fetched from Speechbrain (HuggingFace hub)**

You should be able to load the models that were downloaded and unzipped (they should be in `./pretrained_models/tokenizer.ckpt`).


In [2]:
import torch
import torchaudio

import speechbrain as sb
from speechbrain.pretrained import EncoderDecoderASR

In [3]:
asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-transformer-transformerlm-librispeech", savedir="pretrained_models/")

In [14]:
# Modify the following phrase to check how the Tokenizer works:

phrase = 'THIS IS THE NLP SUMMER SCHOOL, THANK YOU FOR ATTENDING IT'
print("Encoded as pieces: {}".format(asr_model.tokenizer.encode(phrase, out_type=str)))
print("Encoded as ids: {}".format(asr_model.tokenizer.encode_as_ids(phrase)))

Encoded as pieces: ['▁THIS', '▁IS', '▁THE', '▁', 'N', 'L', 'P', '▁SUMMER', '▁SCHOOL', ',', '▁THANK', '▁YOU', '▁FOR', '▁ATTEND', 'ING', '▁IT']
Encoded as ids: [44, 33, 3, 78, 36, 134, 102, 1321, 761, 0, 868, 24, 25, 1465, 13, 17]


In [18]:
# Do you want to know the size of the Tokenizer?
# `vocab_size` is a method of the SentencePiece processor, so it must be called.
print("The number of different units in your Tokenizer is: {}".format(asr_model.tokenizer.vocab_size()))

5000

In [10]:
# Nevertheless, there are several ways in which this phrase could be represented:
for n in range(3):
    print("Version {}: {}".format(n,asr_model.tokenizer.encode(phrase, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)))

Version 0: ['▁THIS', '▁I', 'S', '▁T', 'HE', '▁', 'N', 'L', 'P', '▁S', 'UM', 'M', 'ER', '▁SCHOOL', ',', '▁TH', 'A', 'N', 'K', '▁', 'Y', 'O', 'U', '▁F', 'OR', '▁AT', 'TE', 'N', 'D', 'IN', 'G', '▁', 'I', 'T']
Version 1: ['▁THIS', '▁IS', '▁THE', '▁', 'N', 'L', 'P', '▁SUM', 'M', 'ER', '▁', 'S', 'C', 'H', 'O', 'O', 'L', ',', '▁', 'T', 'HA', 'N', 'K', '▁YOU', '▁FO', 'R', '▁A', 'T', 'TEN', 'D', 'ING', '▁I', 'T']
Version 2: ['▁T', 'H', 'I', 'S', '▁I', 'S', '▁T', 'H', 'E', '▁', 'N', 'L', 'P', '▁SUMMER', '▁SC', 'H', 'O', 'O', 'L', ',', '▁THANK', '▁YOU', '▁F', 'O', 'R', '▁ATTEND', 'ING', '▁IT']


It is pretty evident that there is a lot of flexibility in how we can represent words (and word sequences) with a BPE-based Tokenizer!
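
To see where this flexibility comes from, the sketch below enumerates every way a string can be split into pieces from a given vocabulary. The vocabulary and the function name `segmentations` are hypothetical and chosen only for illustration; a real tokenizer picks (or samples) one of these candidates based on its learned scores.

```python
def segmentations(word, vocab):
    """Return every way of splitting `word` into pieces that all belong to `vocab`."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in segmentations(word[end:], vocab):
                results.append([piece] + rest)
    return results


toy_vocab = {"I", "N", "G", "IN", "NG", "ING"}  # hypothetical sub-word inventory
segs = segmentations("ING", toy_vocab)
for seg in segs:
    print(seg)
```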

As mentioned before, we are not training any model in this Tutorial. Nevertheless, we want to share some key insights regarding the training scripts that you might want to know.

- Training script: `train.py`,
- Hyperparameter file: `tokenizer.yaml`


```yaml
# ############################################################################
# Tokenizer: subword BPE tokenizer with unigram 1K
# Training: Mini-LibriSpeech
# Authors:  Abdel Heba 2021
#           Mirco Ravanelli 2021
# ############################################################################


# Set up folders for reading from and writing to
data_folder: ../data
output_folder: ./save

# Path where data-specification files are stored
train_annotation: ../train.json
valid_annotation: ../valid.json
test_annotation: ../test.json

# Tokenizer parameters
token_type: unigram  # ["unigram", "bpe", "char"]
token_output: 1000  # index(blank/eos/bos/unk) = 0
character_coverage: 1.0
annotation_read: words # field to read

# Tokenizer object
tokenizer: !name:speechbrain.tokenizers.SentencePiece.SentencePiece
   model_dir: !ref <output_folder>
   vocab_size: !ref <token_output>
   annotation_train: !ref <train_annotation>
   annotation_read: !ref <annotation_read>
   model_type: !ref <token_type> # ["unigram", "bpe", "char"]
   character_coverage: !ref <character_coverage>
   annotation_list_to_check: [!ref <train_annotation>, !ref <valid_annotation>]
   annotation_format: json
```

The tokenizer is trained on the training annotations only. Here we set a vocabulary size of 1000. Instead of the standard BPE algorithm, we use a variation based on a unigram language model. See [sentencepiece](https://github.com/google/sentencepiece) for more info.
The tokenizer will be saved in the specified `output_folder`. 

Let's now take a look into the training script `train.py`:



```python
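# Imports used by this excerpt
import sys
import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml
from mini_librispeech_prepare import prepare_mini_librispeech

if __name__ == "__main__":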

    # Load hyperparameters file with command-line overrides
    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])
    with open(hparams_file) as fin:
        hparams = load_hyperpyyaml(fin, overrides)

    # Create experiment directory
    sb.create_experiment_directory(
        experiment_directory=hparams["output_folder"],
        hyperparams_to_save=hparams_file,
        overrides=overrides,
    )

    # Data preparation, to be run on only one process.
    prepare_mini_librispeech(
        data_folder=hparams["data_folder"],
        save_json_train=hparams["train_annotation"],
        save_json_valid=hparams["valid_annotation"],
        save_json_test=hparams["test_annotation"],
    )

    # Train tokenizer
    hparams["tokenizer"]()
```

Essentially, we prepare the data with the `prepare_mini_librispeech` script and we then run the sentencepiece tokenizer wrapped in 
`speechbrain.tokenizers.SentencePiece.SentencePiece`.

Let's take a look at the files generated by the tokenizer. If you go into the specified output folder (`Tokenizer/save`), you can find two files:
+ *1000_unigram.model*
+ *1000_unigram.vocab*

The first is a binary file containing all the information needed for tokenizing an input text. The second is a text file reporting the list of tokens allocated (with their log probabilities):

```
▁THE  -3.2458
S -3.36618
ED  -3.84476
▁ -3.91777
E -3.92101
▁AND  -3.92316
▁A  -3.97359
▁TO -4.00462
▁OF -4.08116
....
```
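
As a small illustration, the sketch below parses a few lines in the `.vocab` layout and converts the scores back to probabilities. It assumes that each line holds a piece and a score separated by whitespace and that the scores are natural-log probabilities; the helper name `parse_vocab` is hypothetical.

```python
import math

# A few lines copied from the excerpt above (piece, then log probability).
vocab_excerpt = """\
▁THE\t-3.2458
S\t-3.36618
ED\t-3.84476
▁\t-3.91777
E\t-3.92101
▁AND\t-3.92316
"""


def parse_vocab(text):
    """Return a list of (piece, log_probability) pairs from `.vocab`-style text."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the last run of whitespace so the piece itself stays intact.
        piece, logprob = line.rsplit(None, 1)
        entries.append((piece, float(logprob)))
    return entries


entries = parse_vocab(vocab_excerpt)
for piece, logprob in entries:
    print("{:<6} logp={:.4f}  p={:.4f}".format(piece, logprob, math.exp(logprob)))
```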

Let me now show how we can use the learned model to tokenize a text:


In [None]:
import torch
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.load("/content/speechbrain/templates/speech_recognition/Tokenizer/save/1000_unigram.model")

# Encode as pieces
print(sp.encode_as_pieces('THE CITY OF MONTREAL'))

# Encode as ids
print(sp.encode_as_ids('THE CITY OF MONTREAL'))


['▁THE', '▁CITY', '▁OF', '▁MO', 'NT', 'RE', 'AL']
[1, 667, 9, 211, 251, 80, 57]


Note that the sentencepiece tokenizers also assign a unique index to each allocated token. These indexes will correspond to the output of our neural networks for language models and ASR.

## **Step 2: Train a Language Model**
A Language Model (LM) can be used within a speech recognizer in different ways. In this tutorial, we perform the so-called **shallow fusion**, where the language information is used within the beam searcher of the speech recognizer to rescore the partial hypotheses. In practice, at every time step, we rescore the partial hypotheses provided by the speech recognizer with the language model scores (which penalize sequences of tokens that are "unlikely" to be observed).
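
A minimal sketch of one shallow-fusion step is shown below. It assumes the common formulation in which the LM log probability is scaled by a weight and added to the ASR log probability; the weight value, the toy probabilities, and the function name `shallow_fusion_step` are all assumptions made for illustration, not the SpeechBrain beam searcher.

```python
import math


def shallow_fusion_step(asr_logprobs, lm_logprobs, partial_score, lm_weight=0.5, beam_size=2):
    """Score all candidate next tokens of one partial hypothesis and keep the best ones."""
    scored = []
    for token, asr_lp in asr_logprobs.items():
        fused = partial_score + asr_lp + lm_weight * lm_logprobs[token]
        scored.append((fused, token))
    scored.sort(reverse=True)
    return scored[:beam_size]


# Toy next-token distributions (made-up numbers).
asr = {"▁CITY": math.log(0.45), "▁SITTY": math.log(0.50), "▁KITTY": math.log(0.05)}
lm = {"▁CITY": math.log(0.70), "▁SITTY": math.log(0.01), "▁KITTY": math.log(0.29)}

best = shallow_fusion_step(asr, lm, partial_score=0.0)
print("ASR alone prefers:", max(asr, key=asr.get))
print("After fusion:", [(token, round(score, 3)) for score, token in best])
```

In this toy case the acoustic model slightly prefers an implausible token, and the language model score is enough to move the plausible one to the top of the beam.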

Some recent studies have shown that a speech recognizer trained on a very large dataset can achieve impressive performance even without a language model. However, for medium-scale speech recognition tasks like LibriSpeech 1000h, the language model still plays a role in improving the final performance.

### **Text Corpus**
A language model is normally trained on **large text corpora** and it is designed to predict the most probable next token.
If you do not have a large text corpus of in-domain data for your application, you might want to skip this part. 

Another thing to remark is that training a language model on a large text corpus is very **computationally demanding**. You should consider using an available pre-trained model (and maybe fine-tune it). 

In this tutorial, we train the language model on the training transcriptions of mini-librispeech. This is just to show you how we can train it in a short amount of time.

### **Train a LM**

We are going to train a simple RNN-based language model that estimates the next tokens given the previous ones.

To train it, run the following code:

In [None]:
!pip install datasets
%cd /content/speechbrain/templates/speech_recognition/LM
!python train.py RNNLM.yaml #--device='cpu'


When the cell runs, the training log shows that both the validation and training losses decrease.
Before diving into the code, let's see which files/folders are generated in the specified `output_folder`:

* `train_log.txt`: contains the statistics (e.g., train_loss, valid_loss) computed at each epoch.
* `log.txt`: is a more detailed logger containing the timestamps for each basic operation.
* `env.log`: shows all the dependencies used with their corresponding version (useful for replicability).
* `train.py`, `hyperparams.yaml`: are a copy of the experiment file along with the corresponding hyperparameters (for replicability).
* `save`: is the place where we store the learned model.

In the `save` folder, you find subfolders containing the checkpoints saved during training (in the format `CKPT+date+time`). Typically, you find two checkpoints here: the best (i.e., the oldest one) and the latest (i.e., the most recent one). If you find only a single checkpoint, it means that the last epoch is also the best.

Inside each checkpoint, you can find all the information needed to resume training (e.g, models, optimizers, schedulers, epoch counter, etc.). The parameters of the RNNLM model are reported in `model.ckpt` file. This is just a binary format readable with `torch.load`.


As usual, we have a `train.py` and a hyperparameter file called `RNNLM.yaml`. 

### **Hyperparameters**
[You can take a look into the full RNNLM.yaml file here](https://github.com/speechbrain/speechbrain/blob/develop/templates/speech_recognition/LM/RNNLM.yaml).

In the first part, we specify some basic settings, such as the seed, the path of the output folders and the training logger:

```yaml
seed: 2602
__set_seed: !apply:torch.manual_seed [!ref <seed>]
output_folder: !ref results/RNNLM/
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt
```

We then specify the path of the text corpora used for training, validation, and test:

```yaml
lm_train_data: data/train.txt
lm_valid_data: data/valid.txt
lm_test_data: data/test.txt
```

Unlike all the other recipes, the LM recipe directly reads big corpora in raw text format (without the need for the JSON/CSV files). This is done with the [Hugging Face](https://huggingface.co/) `datasets` library, which turned out to be very efficient and easy to use.

Next, we set up the train_logger and we specify which tokenizer we use to transform the input words into a sequence of tokens. In this case, we have to use the tokenizer trained at the previous step:

```yaml
# The train logger writes training statistics to a file, as well as stdout.
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>

# Tokenizer model (you must use the same tokenizer for LM and ASR training)
tokenizer_file: ../Tokenizer/save/1000_unigram.model
```


We can now specify some training hyperparameters such as the number of epochs, the batch size, and the learning rate. We also define the most important architectural hyperparameters (e.g, number of layers, number of neurons per layer, output dimensionality).




```yaml
# Training parameters
number_of_epochs: 20
batch_size: 80
lr: 0.001
accu_steps: 1 # Gradient accumulation to simulate large batch training
ckpt_interval_minutes: 15 # save checkpoint every N min

# Dataloader options
train_dataloader_opts:
    batch_size: !ref <batch_size>
    shuffle: True

valid_dataloader_opts:
    batch_size: 1

test_dataloader_opts:
    batch_size: 1

# Model parameters
emb_dim: 256 # dimension of the embeddings
rnn_size: 512 # dimension of hidden layers
layers: 2 # number of hidden layers

# Outputs
output_neurons: 1000 # index(eos/bos) = 0
```

Next, we define the objects that we will use to train our language model. We thus declare objects for the RNN model, the cost function, the optimizer, and the learning rate scheduler:


```yaml
model: !new:templates.speech_recognition.LM.custom_model.CustomModel
    embedding_dim: !ref <emb_dim>
    rnn_size: !ref <rnn_size>
    layers: !ref <layers>


# Cost function used for training the model
compute_cost: !name:speechbrain.nnet.losses.nll_loss

# This optimizer will be constructed by the Brain class after all parameters
# are moved to the correct device. Then it will be added to the checkpointer.
optimizer: !name:torch.optim.Adam
    lr: !ref <lr>
    betas: (0.9, 0.98)
    eps: 0.000000001

# This function manages learning rate annealing over the epochs.
# We here use the NewBoB algorithm, that anneals the learning rate if
# the improvements over two consecutive epochs is less than the defined
# threshold.
lr_annealing: !new:speechbrain.nnet.schedulers.NewBobScheduler
    initial_value: !ref <lr>
    improvement_threshold: 0.0025
    annealing_factor: 0.8
    patient: 0
```
The model that we used in this example is defined in the `custom_model.py` file. As mentioned, this is just a simple RNN, but users can easily plug in their custom models here (e.g., convolutional models or Transformers).
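
The annealing rule configured in `lr_annealing` can be sketched in plain Python. This is a simplified illustration, assuming that the improvement is measured as the relative loss decrease between two consecutive epochs and that the patience is 0; the class name `NewBobSketch` is hypothetical and this is not the SpeechBrain implementation.

```python
class NewBobSketch:
    """Simplified NewBob-style annealing with patience 0 (illustration only)."""

    def __init__(self, initial_value, improvement_threshold=0.0025, annealing_factor=0.8):
        self.value = initial_value
        self.improvement_threshold = improvement_threshold
        self.annealing_factor = annealing_factor
        self.previous_loss = None

    def step(self, loss):
        """Update the learning rate from the latest validation loss; return (old, new)."""
        old_value = self.value
        if self.previous_loss is not None and self.previous_loss != 0:
            improvement = (self.previous_loss - loss) / self.previous_loss
            if improvement < self.improvement_threshold:
                self.value *= self.annealing_factor
        self.previous_loss = loss
        return old_value, self.value


scheduler = NewBobSketch(initial_value=0.001)
history = [scheduler.step(loss)[1] for loss in [5.0, 4.0, 3.999, 3.5]]
print(history)
```

With these made-up losses, the learning rate is reduced only after the third epoch, where the loss barely improves.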

We conclude the hyperparameter specification with the declaration of the epoch counter, tokenizer, and checkpointer:


```yaml
# The first object passed to the Brain class is this "Epoch Counter"
# which is saved by the Checkpointer so that training can be resumed
# if it gets interrupted at any point.
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <number_of_epochs>

# Objects in "modules" dict will have their parameters moved to the correct
# device, as well as having train()/eval() called on them by the Brain class.
modules:
    model: !ref <model>

# Tokenizer initialization
tokenizer: !new:sentencepiece.SentencePieceProcessor

# This object is used for saving the state of training both so that it
# can be resumed if it gets interrupted, and also so that the best checkpoint
# can be later loaded for evaluation or inference.
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        model: !ref <model>
        scheduler: !ref <lr_annealing>
        counter: !ref <epoch_counter>

# Pretrain the tokenizer
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        tokenizer: !ref <tokenizer>
    paths:
        tokenizer: !ref <tokenizer_file>
```

The last class is the pre-trainer, which connects the tokenizer object with the specified pre-trained tokenizer.


### **Experiment file**
Let's now take a look into how the objects, functions, and hyperparameters declared in the yaml file are used in `train.py` to implement the language model.

Let's start with the main block of `train.py`:


```python
# Recipe begins!
if __name__ == "__main__":

    # Reading command line arguments
    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])

    # Initialize ddp (useful only for multi-GPU DDP training)
    sb.utils.distributed.ddp_init_group(run_opts)

    # Load hyperparameters file with command-line overrides
    with open(hparams_file) as fin:
        hparams = load_hyperpyyaml(fin, overrides)

    # Create experiment directory
    sb.create_experiment_directory(
        experiment_directory=hparams["output_folder"],
        hyperparams_to_save=hparams_file,
        overrides=overrides,
    )
```

We here do some preliminary operations such as parsing the command line, initializing the distributed data-parallel (needed if multiple GPUs are used), creating the output folder, and reading the yaml file.

After reading the yaml file with `load_hyperpyyaml`, all the objects declared in the hyperparameter files are initialized and available in a dictionary form (along with the other functions and parameters reported in the yaml file).
For instance,  we will have `hparams['model']`, `hparams['optimizer']`, `hparams['batch_size']`, etc.


#### **Data-IO Pipeline**
We then call a special function that creates the dataset objects for training, validation, and test.

```python
    # Create dataset objects "train", "valid", and "test"
    train_data, valid_data, test_data = dataio_prepare(hparams)
```

Let's take a closer look into that.


```python
def dataio_prepare(hparams):
    """This function prepares the datasets to be used in the brain class.
    It also defines the data processing pipeline through user-defined functions.

    The language model is trained with the text files specified by the user in
    the hyperparameter file.

    Arguments
    ---------
    hparams : dict
        This dictionary is loaded from the `train.yaml` file, and it includes
        all the hyperparameters needed for dataset construction and loading.

    Returns
    -------
    datasets : list
        List containing "train", "valid", and "test" sets that correspond
        to the appropriate DynamicItemDataset object.
    """

    logging.info("generating datasets...")

    # Prepare datasets
    datasets = load_dataset(
        "text",
        data_files={
            "train": hparams["lm_train_data"],
            "valid": hparams["lm_valid_data"],
            "test": hparams["lm_test_data"],
        },
    )

    # Convert huggingface's dataset to DynamicItemDataset via a magical function
    train_data = sb.dataio.dataset.DynamicItemDataset.from_arrow_dataset(
        datasets["train"]
    )
    valid_data = sb.dataio.dataset.DynamicItemDataset.from_arrow_dataset(
        datasets["valid"]
    )
    test_data = sb.dataio.dataset.DynamicItemDataset.from_arrow_dataset(
        datasets["test"]
    )

    datasets = [train_data, valid_data, test_data]
    tokenizer = hparams["tokenizer"]

    # Define text processing pipeline. We start from the raw text and then
    # encode it using the tokenizer. The tokens with bos are used for feeding
    # the neural network, the tokens with eos for computing the cost function.
    @sb.utils.data_pipeline.takes("text")
    @sb.utils.data_pipeline.provides("text", "tokens_bos", "tokens_eos")
    def text_pipeline(text):
        yield text
        tokens_list = tokenizer.encode_as_ids(text)
        tokens_bos = torch.LongTensor([hparams["bos_index"]] + (tokens_list))
        yield tokens_bos
        tokens_eos = torch.LongTensor(tokens_list + [hparams["eos_index"]])
        yield tokens_eos

    sb.dataio.dataset.add_dynamic_item(datasets, text_pipeline)

    # 4. Set outputs to add into the batch. The batch variable will contain
    # all these fields (e.g, batch.id, batch.text, batch.tokens_bos,..)
    sb.dataio.dataset.set_output_keys(
        datasets, ["id", "text", "tokens_bos", "tokens_eos"],
    )
    return train_data, valid_data, test_data
```

The first part is just a conversion from the HuggingFace dataset to the DynamicItemDataset used in SpeechBrain. 

Notice that we expose the text processing function `text_pipeline`, which takes the text of one sentence as input and processes it in different ways.

The text processing function converts the raw text into the corresponding tokens (in index form). We also create other variables such as the version of the sequence with the beginning of the sentence `<bos>`  token in front and the one with the end of sentence `<eos>` as the last element. Their usefulness will be clear later.
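
The relation between the two sequences can be checked with a few lines of plain Python. This sketch assumes that both `<bos>` and `<eos>` share index 0, as suggested by the comment next to `output_neurons` in the hyperparameter file; the helper name `add_bos_eos` is hypothetical.

```python
def add_bos_eos(token_ids, bos_index=0, eos_index=0):
    """Build the model input (with <bos> in front) and the target (with <eos> at the end)."""
    tokens_bos = [bos_index] + token_ids
    tokens_eos = token_ids + [eos_index]
    return tokens_bos, tokens_eos


# Token ids of 'THE CITY OF MONTREAL' from the tokenizer example above.
token_ids = [1, 667, 9, 211, 251, 80, 57]
tokens_bos, tokens_eos = add_bos_eos(token_ids)

# At each position the model sees the previous token and must predict the current one.
for inp, tgt in zip(tokens_bos, tokens_eos):
    print("input: {:>4} -> target: {:>4}".format(inp, tgt))
```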

Before returning the dataset objects, the `dataio_prepare` specifies which keys we would like to output. As we will see later, these keys will be available in the brain class as `batch.id`, `batch.text`, `batch.tokens_bos`, etc.
[For more information on the data loader, please take a look into this tutorial](https://colab.research.google.com/drive/1AiVJZhZKwEI4nFGANKXEe-ffZFfvXKwH?usp=sharing)


After the definition of the datasets, the main function can go ahead with the  initialization of the brain class:

```python
    # Initialize the Brain object to prepare for LM training.
    lm_brain = LM(
        modules=hparams["modules"],
        opt_class=hparams["optimizer"],
        hparams=hparams,
        run_opts=run_opts,
        checkpointer=hparams["checkpointer"],
    )
```
The brain class implements all the functionalities needed for supporting the training and validation loops.  Its `fit` and `evaluate` methods perform training and test, respectively:

```python
    lm_brain.fit(
        lm_brain.hparams.epoch_counter,
        train_data,
        valid_data,
        train_loader_kwargs=hparams["train_dataloader_opts"],
        valid_loader_kwargs=hparams["valid_dataloader_opts"],
    )

    # Load best checkpoint for evaluation
    test_stats = lm_brain.evaluate(
        test_data,
        min_key="loss",
        test_loader_kwargs=hparams["test_dataloader_opts"],
    )
```
The training and validation datasets are passed to the `fit` method, while the test dataset is passed to the `evaluate` method.

Let's now take a look into the most important methods defined in the brain class.

#### **Forward Computations**

Let's start with the `compute_forward` method, which defines all the computations needed to transform the input text into the output predictions.


```python
    def compute_forward(self, batch, stage):
        """Predicts the next word given the previous ones.

        Arguments
        ---------
        batch : PaddedBatch
            This batch object contains all the relevant tensors for computation.
        stage : sb.Stage
            One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.

        Returns
        -------
        predictions : torch.Tensor
            A tensor containing the posterior probabilities (predictions).
        """
        batch = batch.to(self.device)
        tokens_bos, _ = batch.tokens_bos
        pred = self.hparams.model(tokens_bos)
        return pred
```

In this case, the chain of computation is very simple. We just put the batch on the right device and feed the encoded tokens into the model. We feed the tokens with `<bos>` into the model.
When adding the `<bos>` token, in fact, we shift all the tokens by one element. This way, our input corresponds to the previous token while our model tries to predict the current one.

#### **Compute Objectives**

Let's now take a look at the `compute_objectives` method, which takes the predictions and the batch (which contains the targets) as input and computes the loss:

```python
    def compute_objectives(self, predictions, batch, stage):
        """Computes the loss given the predicted and targeted outputs.

        Arguments
        ---------
        predictions : torch.Tensor
            The posterior probabilities from `compute_forward`.
        batch : PaddedBatch
            This batch object contains all the relevant tensors for computation.
        stage : sb.Stage
            One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.

        Returns
        -------
        loss : torch.Tensor
            A one-element tensor used for backpropagating the gradient.
        """
        batch = batch.to(self.device)
        tokens_eos, tokens_len = batch.tokens_eos
        loss = self.hparams.compute_cost(
            predictions, tokens_eos, length=tokens_len
        )
        return loss
```
The predictions are those computed in the forward method. The cost function is evaluated by comparing these predictions with the target tokens. We here use the tokens with the special `<eos>` token at the end because we want to predict when the sentence ends as well.
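
To illustrate why the lengths are passed to the cost function, here is a plain-Python sketch of a negative log-likelihood that ignores padded positions. It assumes that lengths are given as fractions of the longest sequence in the batch and that the loss is averaged over valid tokens only; the function name `masked_nll` is hypothetical and this is not the SpeechBrain implementation.

```python
import math


def masked_nll(log_probs, targets, rel_lengths):
    """Average negative log-likelihood over the non-padded target positions.

    log_probs:   [batch][time][vocab] log probabilities
    targets:     [batch][time] target ids, padded to the same length
    rel_lengths: [batch] valid length of each sequence as a fraction of the padded length
    """
    total, count = 0.0, 0
    max_len = len(targets[0])
    for b, rel in enumerate(rel_lengths):
        valid = int(round(rel * max_len))
        for t in range(valid):
            total -= log_probs[b][t][targets[b][t]]
            count += 1
    return total / count


# Toy batch: 2 sequences, padded length 3, vocabulary of 3 tokens, uniform predictions.
uniform = [math.log(1.0 / 3.0)] * 3
log_probs = [[list(uniform) for _ in range(3)] for _ in range(2)]
targets = [[1, 2, 0], [2, 0, 0]]   # the last position of the second sequence is padding
rel_lengths = [1.0, 2.0 / 3.0]

# A terrible prediction at the padded position must not change the loss.
log_probs[1][2][0] = -100.0

loss = masked_nll(log_probs, targets, rel_lengths)
print("loss = {:.4f}".format(loss))
```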

#### **Other methods**
Beyond these two important functions, we have some other methods that are used by the brain class. In particular, `fit_batch` trains on each batch of data (by computing the gradient with the `backward` method and applying the updates with the `step` one). The `on_stage_end` method is called at the end of each stage (e.g., at the end of each training epoch) and mainly takes care of statistics management, learning rate annealing, and checkpointing. [For a more detailed description of the brain class, please take a look into this tutorial](https://colab.research.google.com/drive/12bg3aUdr9mTfOGqcB5pSMABoIKPgiwcM?usp=sharing). For more information on checkpointing, [take a look here](https://colab.research.google.com/drive/1VH7U0oP3CZsUNtChJT2ewbV_q1QX8xre?usp=sharing).





## **Step 4: Speech Recognizer**
At this point, we can train our speech recognizer. In this tutorial, we are
going to train an **attention-based end-to-end speech recognizer** (offline).
The encoder relies on a combination of convolutional, recurrent, and fully connected models. The decoder is an autoregressive GRU decoder. An attention mechanism is employed between the encoder and the decoder. The final sequence of words is retrieved with beam search coupled with the RNNLM trained in the previous step.
The attention-based system is jointly trained with CTC (applied on top of the encoder).
The system also uses data augmentation techniques to improve its performance.

### **Train the speech recognizer**
To train the speech recognizer, run the following code:

%cd /content/speechbrain/templates/speech_recognition/ASR
!python train.py train.yaml --batch_size=2 #--device='cpu'

/content/speechbrain/templates/speech_recognition/ASR
Downloading http://www.openslr.org/resources/28/rirs_noises.zip to ../data/rirs_noises.zip
rirs_noises.zip: 1.31GB [00:42, 30.8MB/s]                
Extracting ../data/rirs_noises.zip to ../data
speechbrain.core - Beginning experiment!
speechbrain.core - Experiment folder: results/CRDNN_BPE_960h_LM/2602
mini_librispeech_prepare - Preparation completed in previous run, skipping.
speechbrain.pretrained.fetching - Fetch lm.ckpt: Delegating to Huggingface hub, source speechbrain/asr-crdnn-rnnlm-librispeech.
filelock - Lock 140130605497168 acquired on /root/.cache/huggingface/hub/651df066b5d0b2efef7208f51df93d3a0a65bedc3a3a2500cd7b8faf064e631e.b438b9af3f549a23c4458bb066c11cd51dc1cfe9bfef30d3eb66b472e93b1e8c.lock
huggingface_hub.file_download - downloading https://huggingface.co/speechbrain/asr-crdnn-rnnlm-librispeech/resolve/main/lm.ckpt to /root/.cache/huggingface/hub/tmpps__p7vx
Downloading: 100% 212M/212M [00:04<00:00, 50.2MB/s]
huggi

Running this code might take quite a while on Google Colab. As you can see from the log, the loss progressively decreases after each epoch.
The specified `output_folder` will contain the same files and folders already discussed in the RNNLM part. In addition, we save a file called `wer.txt` that reports the word error rate (WER) achieved for every test sentence, along with the corresponding alignment with the true transcription:


```
%WER 3.09 [ 1622 / 52576, 167 ins, 171 del, 1284 sub ]
%SER 33.66 [ 882 / 2620 ]
Scored 2620 sentences, 0 not present in hyp.
================================================================================
ALIGNMENTS

Format:
<utterance-id>, WER DETAILS
<eps> ; reference  ; on ; the ; first ;  line
  I   ;     S      ; =  ;  =  ;   S   ;   D  
 and  ; hypothesis ; on ; the ; third ; <eps>
================================================================================
672-122797-0033, %WER 0.00 [ 0 / 2, 0 ins, 0 del, 0 sub ]
A ; STORY
= ;   =  
A ; STORY
================================================================================
2094-142345-0041, %WER 0.00 [ 0 / 1, 0 ins, 0 del, 0 sub ]
DIRECTION
    =    
DIRECTION
================================================================================
2830-3980-0026, %WER 50.00 [ 1 / 2, 0 ins, 0 del, 1 sub ]
VERSE ; TWO
  S   ;  = 
FIRST ; TWO
================================================================================
237-134500-0025, %WER 50.00 [ 1 / 2, 0 ins, 0 del, 1 sub ]
OH ;  EMIL
=  ;   S  
OH ; AMIEL
================================================================================
7127-75947-0012, %WER 0.00 [ 0 / 2, 0 ins, 0 del, 0 sub ]
INDEED ; AH
  =    ; = 
INDEED ; AH
================================================================================

```
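The numbers in this report can be reproduced with a standard edit-distance alignment. The sketch below is a plain-Python illustration, not the code SpeechBrain uses to write `wer.txt`; the helper names `edit_counts` and `wer` are hypothetical.

```python
def edit_counts(ref, hyp):
    """Return (insertions, deletions, substitutions) of a minimum-cost alignment."""
    # dp[i][j] = (total_errors, ins, dele, sub) for ref[:i] versus hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = (j, j, 0, 0)  # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            c, a, b, s = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, a, b, s)  # match, no extra cost
            else:
                diag = (c + 1, a, b, s + 1)  # substitution
            c, a, b, s = dp[i][j - 1]
            ins = (c + 1, a + 1, b, s)
            c, a, b, s = dp[i - 1][j]
            dele = (c + 1, a, b + 1, s)
            dp[i][j] = min(diag, ins, dele)
    return dp[-1][-1][1:]


def wer(n_ins, n_del, n_sub, n_ref_words):
    """Word error rate in percent."""
    return 100.0 * (n_ins + n_del + n_sub) / n_ref_words


# Utterance 2830-3980-0026 from the report above.
counts = edit_counts("VERSE TWO".split(), "FIRST TWO".split())
utt_wer = wer(*counts, n_ref_words=2)

# Corpus-level totals from the first line of the report.
corpus_wer = wer(167, 171, 1284, 52576)
```

Here `counts` contains one substitution, so `utt_wer` is 50%, and the corpus-level totals give the 3.09% shown in the header of the report.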



Let's now take a closer look at the hyperparameter file (`train.yaml`) and the experiment script (`train.py`).


### **Hyperparameters**

The hyperparameter file starts with the definition of basic things, such as seed and path settings:

```yaml
# Seed needs to be set at top of yaml, before objects with parameters are instantiated
seed: 2602
__set_seed: !apply:torch.manual_seed [!ref <seed>]

data_folder: ../data # In this case, data will be automatically downloaded here.
data_folder_rirs: !ref <data_folder> # noise/RIR dataset will automatically be downloaded here
output_folder: !ref results/CRDNN_BPE_960h_LM/<seed>
wer_file: !ref <output_folder>/wer.txt
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt

pretrained_path: speechbrain/asr-crdnn-rnnlm-librispeech

# Path where data manifest files will be stored. The data manifest files are created by the
# data preparation script
train_annotation: ../train.json
valid_annotation: ../valid.json
test_annotation: ../test.json

# The train logger writes training statistics to a file, as well as stdout.
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>
```

The `data_folder` corresponds to the path where mini-librispeech is stored. If not available, the mini-librispeech dataset will be downloaded here. As mentioned, the script also supports data augmentation. To do it, we use the impulse responses and noise sequences of the OpenRIR dataset (again, if not available, it will be downloaded here).

We also specify the folder where the language model is saved. In this case, we use the official pre-trained language model available on HuggingFace, but you can change it and use the one trained at the previous step (you should point to the checkpoint in the folder where the best `model.ckpt` is stored).
What is important is that the set of tokens used for the LM and the one used for training the speech recognizer match exactly.

We also have to specify the data manifest files for training, validation, and test. If not available, these files will be created by the data preparation script called in `train.py`.

After that, we define a set of parameters for training, feature extraction, model definition, and decoding:

```yaml
# Training parameters
number_of_epochs: 15
number_of_ctc_epochs: 5
batch_size: 8
lr: 1.0
ctc_weight: 0.5
sorting: ascending
ckpt_interval_minutes: 15 # save checkpoint every N min
label_smoothing: 0.1

# Dataloader options
train_dataloader_opts:
    batch_size: !ref <batch_size>

valid_dataloader_opts:
    batch_size: !ref <batch_size>

test_dataloader_opts:
    batch_size: !ref <batch_size>


# Feature parameters
sample_rate: 16000
n_fft: 400
n_mels: 40

# Model parameters
activation: !name:torch.nn.LeakyReLU
dropout: 0.15
cnn_blocks: 2
cnn_channels: (128, 256)
inter_layer_pooling_size: (2, 2)
cnn_kernelsize: (3, 3)
time_pooling_size: 4
rnn_class: !name:speechbrain.nnet.RNN.LSTM
rnn_layers: 4
rnn_neurons: 1024
rnn_bidirectional: True
dnn_blocks: 2
dnn_neurons: 512
emb_size: 128
dec_neurons: 1024
output_neurons: 1000  # Number of tokens (same as LM)
blank_index: 0
bos_index: 0
eos_index: 0
unk_index: 0

# Decoding parameters
min_decode_ratio: 0.0
max_decode_ratio: 1.0
valid_beam_size: 8
test_beam_size: 80
eos_threshold: 1.5
using_max_attn_shift: True
max_attn_shift: 240
lm_weight: 0.50
ctc_weight_decode: 0.0
coverage_penalty: 1.5
temperature: 1.25
temperature_lm: 1.25
```

For instance, we define the number of epochs, the initial learning rate, the batch size, the weight of the CTC loss, and many others. 

By setting `sorting` to `ascending`, we sort all the sentences by ascending length before creating the batches. This minimizes the need for zero-padding and thus makes training faster without losing performance (at least in this task with this model).
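The effect of sorting on zero-padding can be checked with a few lines of plain Python. This is only an illustrative sketch: the utterance lengths are made up, and `padded_steps` is a hypothetical helper that groups consecutive items into batches, roughly as a non-shuffling dataloader would.

```python
def padded_steps(lengths, batch_size):
    """Count zero-padded steps when consecutive items are grouped into batches."""
    padding = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Every item is padded up to the longest item of its batch.
        padding += sum(max(batch) - n for n in batch)
    return padding


lengths = [16, 3, 9, 2, 12, 4, 7, 10]  # hypothetical utterance lengths
unsorted_padding = padded_steps(lengths, batch_size=2)
sorted_padding = padded_steps(sorted(lengths), batch_size=2)
```

With these toy lengths, the unsorted order wastes 31 padded steps, while the ascending order wastes only 9.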

Many other parameters are defined. For the exact meaning of all of them, you can refer to the docstring of the function/class using this hyperparameter.

In the next block, we define the most important classes that are needed to implement the speech recognizer:


```yaml
# The first object passed to the Brain class is this "Epoch Counter"
# which is saved by the Checkpointer so that training can be resumed
# if it gets interrupted at any point.
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <number_of_epochs>

# Feature extraction
compute_features: !new:speechbrain.lobes.features.Fbank
    sample_rate: !ref <sample_rate>
    n_fft: !ref <n_fft>
    n_mels: !ref <n_mels>

# Feature normalization (mean and std)
normalize: !new:speechbrain.processing.features.InputNormalization
    norm_type: global

# Added noise and reverb come from OpenRIR dataset, automatically
# downloaded and prepared with this Environmental Corruption class.
env_corrupt: !new:speechbrain.lobes.augment.EnvCorrupt
    openrir_folder: !ref <data_folder_rirs>
    babble_prob: 0.0
    reverb_prob: 0.0
    noise_prob: 1.0
    noise_snr_low: 0
    noise_snr_high: 15

# Adds speed change + time and frequency dropouts (time-domain implementation).
augmentation: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
    sample_rate: !ref <sample_rate>
    speeds: [95, 100, 105]

# The CRDNN model is an encoder that combines CNNs, RNNs, and DNNs.
encoder: !new:speechbrain.lobes.models.CRDNN.CRDNN
    input_shape: [null, null, !ref <n_mels>]
    activation: !ref <activation>
    dropout: !ref <dropout>
    cnn_blocks: !ref <cnn_blocks>
    cnn_channels: !ref <cnn_channels>
    cnn_kernelsize: !ref <cnn_kernelsize>
    inter_layer_pooling_size: !ref <inter_layer_pooling_size>
    time_pooling: True
    using_2d_pooling: False
    time_pooling_size: !ref <time_pooling_size>
    rnn_class: !ref <rnn_class>
    rnn_layers: !ref <rnn_layers>
    rnn_neurons: !ref <rnn_neurons>
    rnn_bidirectional: !ref <rnn_bidirectional>
    rnn_re_init: True
    dnn_blocks: !ref <dnn_blocks>
    dnn_neurons: !ref <dnn_neurons>
    use_rnnp: False

# Embedding (from indexes to an embedding space of dimension emb_size).
embedding: !new:speechbrain.nnet.embedding.Embedding
    num_embeddings: !ref <output_neurons>
    embedding_dim: !ref <emb_size>

# Attention-based RNN decoder.
decoder: !new:speechbrain.nnet.RNN.AttentionalRNNDecoder
    enc_dim: !ref <dnn_neurons>
    input_size: !ref <emb_size>
    rnn_type: gru
    attn_type: location
    hidden_size: !ref <dec_neurons>
    attn_dim: 1024
    num_layers: 1
    scaling: 1.0
    channels: 10
    kernel_size: 100
    re_init: True
    dropout: !ref <dropout>

# Linear transformation on the top of the encoder.
ctc_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dnn_neurons>
    n_neurons: !ref <output_neurons>

# Linear transformation on the top of the decoder.
seq_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dec_neurons>
    n_neurons: !ref <output_neurons>

# Final softmax (for log posteriors computation).
log_softmax: !new:speechbrain.nnet.activations.Softmax
    apply_log: True

# Cost definition for the CTC part.
ctc_cost: !name:speechbrain.nnet.losses.ctc_loss
    blank_index: !ref <blank_index>

# Tokenizer initialization
tokenizer: !new:sentencepiece.SentencePieceProcessor

# Objects in "modules" dict will have their parameters moved to the correct
# device, as well as having train()/eval() called on them by the Brain class
modules:
    encoder: !ref <encoder>
    embedding: !ref <embedding>
    decoder: !ref <decoder>
    ctc_lin: !ref <ctc_lin>
    seq_lin: !ref <seq_lin>
    normalize: !ref <normalize>
    env_corrupt: !ref <env_corrupt>
    lm_model: !ref <lm_model>

# Gathering all the submodels in a single model object.
model: !new:torch.nn.ModuleList
    - - !ref <encoder>
      - !ref <embedding>
      - !ref <decoder>
      - !ref <ctc_lin>
      - !ref <seq_lin>

# This is the RNNLM that is used according to the Huggingface repository
# NB: It has to match the pre-trained RNNLM!!
lm_model: !new:speechbrain.lobes.models.RNNLM.RNNLM
    output_neurons: !ref <output_neurons>
    embedding_dim: !ref <emb_size>
    activation: !name:torch.nn.LeakyReLU
    dropout: 0.0
    rnn_layers: 2
    rnn_neurons: 2048
    dnn_blocks: 1
    dnn_neurons: 512
    return_hidden: True  # For inference
```

For instance, we define the functions for computing the features and normalizing them. We also define the classes for environmental corruption and data augmentation ([please, see this tutorial](https://colab.research.google.com/drive/1mAimqZndq0BwQj63VcDTr6_uCMC6i6Un?usp=sharing)), the architecture of the encoder and the decoder, and the other models needed by the speech recognizer.


We then report the parameters for beam search:

```yaml
valid_search: !new:speechbrain.decoders.S2SRNNBeamSearcher
    embedding: !ref <embedding>
    decoder: !ref <decoder>
    linear: !ref <seq_lin>
    ctc_linear: !ref <ctc_lin>
    bos_index: !ref <bos_index>
    eos_index: !ref <eos_index>
    blank_index: !ref <blank_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <valid_beam_size>
    eos_threshold: !ref <eos_threshold>
    using_max_attn_shift: !ref <using_max_attn_shift>
    max_attn_shift: !ref <max_attn_shift>
    coverage_penalty: !ref <coverage_penalty>
    temperature: !ref <temperature>

# The final decoding on the test set can be more computationally demanding.
# In this case, we use the LM + CTC probabilities during decoding as well.
# Please, remove this part if you need a faster decoder.
test_search: !new:speechbrain.decoders.S2SRNNBeamSearchLM
    embedding: !ref <embedding>
    decoder: !ref <decoder>
    linear: !ref <seq_lin>
    ctc_linear: !ref <ctc_lin>
    language_model: !ref <lm_model>
    bos_index: !ref <bos_index>
    eos_index: !ref <eos_index>
    blank_index: !ref <blank_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <test_beam_size>
    eos_threshold: !ref <eos_threshold>
    using_max_attn_shift: !ref <using_max_attn_shift>
    max_attn_shift: !ref <max_attn_shift>
    coverage_penalty: !ref <coverage_penalty>
    lm_weight: !ref <lm_weight>
    ctc_weight: !ref <ctc_weight_decode>
    temperature: !ref <temperature>
    temperature_lm: !ref <temperature_lm>
```
Here we employ different hyperparameters for the validation and test beam search. In particular, a smaller beam size is used for the validation stage. The reason is that validation is done at the end of each epoch and should thus be fast. The final evaluation, instead, is done only once at the end, so we can afford a larger (slower but more accurate) beam.
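The speed/accuracy trade-off of the beam size can be seen on a toy problem. The sketch below is a bare-bones beam search over a hand-written probability table; it has none of the extras configured above (no `<eos>` handling, coverage penalty, attention shift, CTC, or LM scoring), and the `NEXT` table and `beam_search` function are assumptions made purely for illustration.

```python
import math

# Hypothetical next-token probabilities, conditioned on the previous token
# (None marks the start of the sequence).
NEXT = {
    None: {"a": 0.6, "b": 0.4},
    "a": {"a": 0.55, "b": 0.45},
    "b": {"a": 0.9, "b": 0.1},
}


def beam_search(beam_size, steps=2):
    """Return the best (sequence, log-probability) found with the given beam."""
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            prev = seq[-1] if seq else None
            for token, prob in NEXT[prev].items():
                candidates.append((seq + (token,), score + math.log(prob)))
        # Keep only the `beam_size` best partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]


greedy_seq, greedy_score = beam_search(beam_size=1)
wide_seq, wide_score = beam_search(beam_size=2)
```

With `beam_size=1` the search commits to `"a"` at the first step and ends with `("a", "a")` (probability 0.33), while `beam_size=2` keeps `"b"` alive and finds the better `("b", "a")` (probability 0.36), at the cost of expanding twice as many hypotheses.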


Finally, we declare the last objects needed by the training recipe, such as `lr_annealing`, the optimizer class, the checkpointer, etc.:


```yaml
lr_annealing: !new:speechbrain.nnet.schedulers.NewBobScheduler
    initial_value: !ref <lr>
    improvement_threshold: 0.0025
    annealing_factor: 0.8
    patient: 0

# This optimizer will be constructed by the Brain class after all parameters
# are moved to the correct device. Then it will be added to the checkpointer.
opt_class: !name:torch.optim.Adadelta
    lr: !ref <lr>
    rho: 0.95
    eps: 1.e-8

# Functions that compute the statistics to track during the validation step.
error_rate_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats

cer_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
    split_tokens: True

# This object is used for saving the state of training both so that it
# can be resumed if it gets interrupted, and also so that the best checkpoint
# can be later loaded for evaluation or inference.
checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        model: !ref <model>
        scheduler: !ref <lr_annealing>
        normalizer: !ref <normalize>
        counter: !ref <epoch_counter>

# This object is used to pretrain the language model and the tokenizers
# (defined above). In this case, we also pretrain the ASR model (to make
# sure the model converges on a small amount of data)
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    collect_in: !ref <save_folder>
    loadables:
        lm: !ref <lm_model>
        tokenizer: !ref <tokenizer>
        model: !ref <model>
    paths:
        lm: !ref <pretrained_path>/lm.ckpt
        tokenizer: !ref <pretrained_path>/tokenizer.ckpt
        model: !ref <pretrained_path>/asr.ckpt

```

The final object is the pretrainer, which links the language model, the tokenizer, and the speech recognition (acoustic) model with the corresponding files used for pre-training. Here we pre-train the acoustic model as well. On such a small dataset, it is very hard to make an end-to-end speech recognizer converge, and we thus use another model to pre-train it (you should skip this part when training on a larger dataset).
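Going back to the `lr_annealing` entry of the block above, the sketch below shows how a new-bob style rule could behave with the values from the yaml file. It is a simplified illustration, not the `NewBobScheduler` code: it assumes that the learning rate is multiplied by `annealing_factor` whenever the relative improvement of the validation loss falls below `improvement_threshold`, with no patience (`patient: 0`). The function name `new_bob_schedule` and the loss values are hypothetical.

```python
def new_bob_schedule(
    losses, lr=1.0, improvement_threshold=0.0025, annealing_factor=0.8
):
    """Return the learning rate in effect after each validation loss."""
    history = []
    prev = None
    for loss in losses:
        if prev is not None:
            # Relative improvement with respect to the previous epoch.
            improvement = (prev - loss) / prev
            if improvement < improvement_threshold:
                lr *= annealing_factor
        history.append(lr)
        prev = loss
    return history


# Hypothetical validation losses over four epochs.
lrs = new_bob_schedule([2.0, 1.5, 1.499, 1.6])
```

In this toy run the learning rate stays at 1.0 while the loss improves clearly, drops to 0.8 when the improvement becomes negligible, and drops again to about 0.64 when the loss gets worse.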

### **Experiment file**
Let's now see how the different elements declared in the yaml file are connected in `train.py`.
The training script closely follows the one already described for the language model.

The `main` function starts with the implementation of basic functionalities such as parsing the command line, initializing the distributed data-parallel (needed for multiple GPU training), and reading the yaml file.



```python
if __name__ == "__main__":

    # Reading command line arguments
    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])

    # Initialize ddp (useful only for multi-GPU DDP training)
    sb.utils.distributed.ddp_init_group(run_opts)

    # Load hyperparameters file with command-line overrides
    with open(hparams_file) as fin:
        hparams = load_hyperpyyaml(fin, overrides)

    # Create experiment directory
    sb.create_experiment_directory(
        experiment_directory=hparams["output_folder"],
        hyperparams_to_save=hparams_file,
        overrides=overrides,
    )

    # Data preparation, to be run on only one process.
    sb.utils.distributed.run_on_main(
        prepare_mini_librispeech,
        kwargs={
            "data_folder": hparams["data_folder"],
            "save_json_train": hparams["train_annotation"],
            "save_json_valid": hparams["valid_annotation"],
            "save_json_test": hparams["test_annotation"],
        },
    )
```

The yaml file is read with the `load_hyperpyyaml` function. After reading it, all the declared objects are initialized and available in the `hparams` dictionary, along with the other functions and variables (e.g., `hparams['model']`, `hparams['test_search']`, `hparams['batch_size']`).

After that, we run the data preparation, which has the goal of creating the data manifest files (if not already available). This operation requires writing some files to disk. For this reason, we have to use `sb.utils.distributed.run_on_main` to make sure that this operation is executed by the main process only. This avoids possible conflicts when using multiple GPUs with DDP. For more info on multi-GPU training in SpeechBrain, [please see this tutorial](https://colab.research.google.com/drive/13pBUacPiotw1IvyffvGZ-HrtBr9T6l15?usp=sharing).

#### **Data-IO Pipeline**
At this point, we can create the dataset object that we will use for training, validation, and test loops:

```python
    # We can now directly create the datasets for training, valid, and test
    datasets = dataio_prepare(hparams)
```

This function allows users to fully customize the data reading pipeline. Let's take a closer look into it:

```python
def dataio_prepare(hparams):
    """This function prepares the datasets to be used in the brain class.
    It also defines the data processing pipeline through user-defined functions.


    Arguments
    ---------
    hparams : dict
        This dictionary is loaded from the `train.yaml` file, and it includes
        all the hyperparameters needed for dataset construction and loading.

    Returns
    -------
    datasets : dict
        Dictionary containing "train", "valid", and "test" keys that correspond
        to the DynamicItemDataset objects.
    """
    # Define audio pipeline. In this case, we simply read the path contained
    # in the variable wav with the audio reader.
    @sb.utils.data_pipeline.takes("wav")
    @sb.utils.data_pipeline.provides("sig")
    def audio_pipeline(wav):
        """Load the audio signal. This is done on the CPU in the `collate_fn`."""
        sig = sb.dataio.dataio.read_audio(wav)
        return sig

    # Define text processing pipeline. We start from the raw text and then
    # encode it using the tokenizer. The tokens with BOS are used for feeding
    # the decoder during training, the tokens with EOS for computing the cost
    # function. The tokens without BOS or EOS are for computing the CTC loss.
    @sb.utils.data_pipeline.takes("words")
    @sb.utils.data_pipeline.provides(
        "words", "tokens_list", "tokens_bos", "tokens_eos", "tokens"
    )
    def text_pipeline(words):
        """Processes the transcriptions to generate proper labels"""
        yield words
        tokens_list = hparams["tokenizer"].encode_as_ids(words)
        yield tokens_list
        tokens_bos = torch.LongTensor([hparams["bos_index"]] + (tokens_list))
        yield tokens_bos
        tokens_eos = torch.LongTensor(tokens_list + [hparams["eos_index"]])
        yield tokens_eos
        tokens = torch.LongTensor(tokens_list)
        yield tokens

    # Define datasets from json data manifest file
    # Define datasets sorted by ascending lengths for efficiency
    datasets = {}
    data_folder = hparams["data_folder"]
    for dataset in ["train", "valid", "test"]:
        datasets[dataset] = sb.dataio.dataset.DynamicItemDataset.from_json(
            json_path=hparams[f"{dataset}_annotation"],
            replacements={"data_root": data_folder},
            dynamic_items=[audio_pipeline, text_pipeline],
            output_keys=[
                "id",
                "sig",
                "words",
                "tokens_bos",
                "tokens_eos",
                "tokens",
            ],
        )
        hparams[f"{dataset}_dataloader_opts"]["shuffle"] = False

    # Sorting training data in ascending order makes the code much
    # faster because we minimize zero-padding. In most of the cases, this
    # does not harm the performance.
    if hparams["sorting"] == "ascending":
        datasets["train"] = datasets["train"].filtered_sorted(sort_key="length")
        hparams["train_dataloader_opts"]["shuffle"] = False

    elif hparams["sorting"] == "descending":
        datasets["train"] = datasets["train"].filtered_sorted(
            sort_key="length", reverse=True
        )
        hparams["train_dataloader_opts"]["shuffle"] = False

    elif hparams["sorting"] == "random":
        hparams["train_dataloader_opts"]["shuffle"] = True
        pass

    else:
        raise NotImplementedError(
            "sorting must be random, ascending or descending"
        )
    return datasets
```

Within `dataio_prepare` we define subfunctions for processing the entries defined in the JSON files. 
The first function, called `audio_pipeline`, takes the path of the audio signal (`wav`) and reads it. It returns a tensor containing the speech signal. The input argument of this function (i.e., `wav`) must have the same name as the corresponding key in the data manifest file:

```json
  "1867-154075-0032": {
    "wav": "{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0032.flac",
    "length": 16.09,
    "words": "AND HE BRUSHED A HAND ACROSS HIS FOREHEAD AND WAS INSTANTLY HIMSELF CALM AND COOL VERY WELL THEN IT SEEMS I'VE MADE AN ASS OF MYSELF BUT I'LL TRY TO MAKE UP FOR IT NOW WHAT ABOUT CAROLINE"
  },
```

Similarly, we define another function called `text_pipeline` that processes the transcriptions and puts them in a format usable by the model. The function reads the string `words` defined in the JSON file and tokenizes it (outputting the index of each token). It returns the sequence of tokens with the special begin-of-sentence `<bos>` token in front, the version with the end-of-sentence `<eos>` token at the end, and the plain sequence without either of them. We will see later why these additional elements are needed.
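As a small illustration of the three token sequences, the sketch below builds them for a short transcription. The real recipe uses a SentencePiece tokenizer with sub-word units; the word-level `vocab` and the helper `make_decoder_targets` are hypothetical, while `bos_index` and `eos_index` are both 0 as in `train.yaml`.

```python
def make_decoder_targets(tokens_list, bos_index=0, eos_index=0):
    """Build the decoder input (with <bos>) and the target (with <eos>)."""
    tokens_bos = [bos_index] + tokens_list
    tokens_eos = tokens_list + [eos_index]
    return tokens_bos, tokens_eos


vocab = {"A": 5, "STORY": 7}  # hypothetical word-level vocabulary
tokens_list = [vocab[word] for word in "A STORY".split()]
tokens_bos, tokens_eos = make_decoder_targets(tokens_list)

# At each step the decoder is fed tokens_bos[t] and should predict tokens_eos[t].
steps = list(zip(tokens_bos, tokens_eos))
```

The pairs in `steps` show that the two sequences are the same tokens shifted by one position: the decoder sees `<bos>` and must predict the first token, and after the last token it must predict `<eos>`.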

We then create the `DynamicItemDataset` and connect it with the processing functions defined above. We define the desired output keys. These keys will be available in the brain class within the batch variable as:
- batch.id
- batch.sig
- batch.words
- batch.tokens_bos
- batch.tokens_eos
- batch.tokens

The last part of the `dataio_prepare` function manages data sorting. In this case, we sort data in ascending order of length to minimize zero-padding and speed up training. For more information on the dataloaders, [please see this tutorial](https://colab.research.google.com/drive/1AiVJZhZKwEI4nFGANKXEe-ffZFfvXKwH?usp=sharing).


After the definition of the dataio function, we perform pre-training of the language model, ASR model, and tokenizer:


```python
    run_on_main(hparams["pretrainer"].collect_files)
    hparams["pretrainer"].load_collected(device=run_opts["device"])
```
Here we use the `run_on_main` wrapper because the `collect_files` method might need to download the pre-trained model from the web. This operation should be done by a single process only, even when using multiple GPUs with DDP.

At this point we initialize the Brain class and use it for running training and evaluation:


```python

    # Trainer initialization
    asr_brain = ASR(
        modules=hparams["modules"],
        opt_class=hparams["opt_class"],
        hparams=hparams,
        run_opts=run_opts,
        checkpointer=hparams["checkpointer"],
    )

    # Training
    asr_brain.fit(
        asr_brain.hparams.epoch_counter,
        datasets["train"],
        datasets["valid"],
        train_loader_kwargs=hparams["train_dataloader_opts"],
        valid_loader_kwargs=hparams["valid_dataloader_opts"],
    )

    # Load best checkpoint for evaluation
    test_stats = asr_brain.evaluate(
        test_set=datasets["test"],
        min_key="WER",
        test_loader_kwargs=hparams["test_dataloader_opts"],
    )
```

For more information on how the Brain class works, [please see this tutorial](https://colab.research.google.com/drive/1fdqTk4CTXNcrcSVFvaOKzRfLmj4fJfwa?usp=sharing).
Note that the `fit` and `evaluate` methods take the dataset objects as input as well. From each dataset, a PyTorch dataloader is created automatically. The latter creates the batches used for training and evaluation.

When speech sentences with **different lengths** are sampled, zero-padding is performed. To keep track of the real length of each sentence within each batch, the dataloader returns a special tensor containing **relative lengths** as well. For instance, suppose `batch.sig[0]` holds the input waveforms as a `[batch, time]` tensor:

```
tensor([[1, 1, 0, 0],
        [1, 1, 1, 0],
        [1, 1, 1, 1]])
```
The `batch.sig[1]` will contain the following relative lengths:

```
tensor([0.5000, 0.7500, 1.0000])
```

With this information, we can exclude zero-padded steps from some computations (e.g., feature normalization, statistical pooling, loss, etc.).

#### **Why relative lengths instead of absolute lengths?**

The reason is that the **time resolution can change** within a neural network. There are operations such as pooling, strided convolution, transposed convolution, FFT computation, and many others that change the number of time steps. With the relative length trick, we can compute the number of actual time steps at each stage of the neural computations just by multiplying the relative length by the length of the tensor at that stage.
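The following plain-Python sketch illustrates this multiplication. The helper names `absolute_lengths` and `padding_mask` are hypothetical, and rounding to the nearest integer is an assumption; the relative lengths are the ones from the example above.

```python
def absolute_lengths(rel_lengths, time_steps):
    """Convert relative lengths into numbers of valid steps for a given tensor."""
    return [round(rel * time_steps) for rel in rel_lengths]


def padding_mask(rel_lengths, time_steps):
    """Build a [batch][time] mask with 1 on valid steps and 0 on padding."""
    return [
        [1 if t < n else 0 for t in range(time_steps)]
        for n in absolute_lengths(rel_lengths, time_steps)
    ]


rel_lengths = [0.5, 0.75, 1.0]

# The same relative lengths work at every time resolution.
wav_lens = absolute_lengths(rel_lengths, time_steps=4)  # raw toy waveform
feat_lens = absolute_lengths(rel_lengths, time_steps=400)  # e.g., feature frames
pooled_lens = absolute_lengths(rel_lengths, time_steps=100)  # after 4x pooling
mask = padding_mask(rel_lengths, time_steps=4)
```

The mask obtained for four time steps matches the toy `[batch, time]` tensor shown earlier, and the same `rel_lengths` remain valid after the time axis is shortened by pooling.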


#### **Forward Computations**
In the Brain class we have to define some important methods such as:
- `compute_forward`, which specifies all the computations needed to transform the input waveform into the output posterior probabilities.
- `compute_objectives`, which computes the loss function given the labels and the predictions performed by the model.

Let's take a look into `compute_forward` first:


```python
    def compute_forward(self, batch, stage):
        """Runs all the computation of the CTC + seq2seq ASR. It returns the
        posterior probabilities of the CTC and seq2seq networks.

        Arguments
        ---------
        batch : PaddedBatch
            This batch object contains all the relevant tensors for computation.
        stage : sb.Stage
            One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.

        Returns
        -------
        predictions : dict
            At training time it returns predicted seq2seq log probabilities.
            If needed it also returns the ctc output log probabilities.
            At validation/test time, it returns the predicted tokens as well.
        """
        # We first move the batch to the appropriate device.
        batch = batch.to(self.device)
        feats, self.feat_lens = self.prepare_features(stage, batch.sig)
        tokens_bos, _ = self.prepare_tokens(stage, batch.tokens_bos)

        # Running the encoder (prevent propagation to feature extraction)
        encoded_signal = self.modules.encoder(feats.detach())

        # Embed tokens and pass tokens & encoded signal to decoder
        embedded_tokens = self.modules.embedding(tokens_bos)
        decoder_outputs, _ = self.modules.decoder(
            embedded_tokens, encoded_signal, self.feat_lens
        )

        # Output layer for seq2seq log-probabilities
        logits = self.modules.seq_lin(decoder_outputs)
        predictions = {"seq_logprobs": self.hparams.log_softmax(logits)}

        if self.is_ctc_active(stage):
            # Output layer for ctc log-probabilities
            ctc_logits = self.modules.ctc_lin(encoded_signal)
            predictions["ctc_logprobs"] = self.hparams.log_softmax(ctc_logits)
        elif stage == sb.Stage.VALID:
            predictions["tokens"], _ = self.hparams.valid_search(
                encoded_signal, self.feat_lens
            )
        elif stage == sb.Stage.TEST:
            predictions["tokens"], _ = self.hparams.test_search(
                encoded_signal, self.feat_lens
            )

        return predictions
```


The function takes the batch variable and the current stage (which can be `sb.Stage.TRAIN`, `sb.Stage.VALID`, or `sb.Stage.TEST`). We then put the batch on the right device, compute the features, and encode them with our CRDNN encoder.
For more information on feature computation, [take a look at this tutorial](https://colab.research.google.com/drive/1CI72Xyay80mmmagfLaIIeRoDgswWHT_g?usp=sharing), while for more details on speech augmentation [take a look here](https://colab.research.google.com/drive/1JJc4tBhHNXRSDM2xbQ3Z0jdDQUw4S5lr?usp=sharing).
After that, we feed our encoded states into an autoregressive attention-based decoder that predicts the output tokens.
At validation and test stages, we run beam search to obtain the predicted token sequences.
Our system applies an additional CTC loss on top of the encoder. The CTC loss can be turned off after N epochs (`number_of_ctc_epochs`) if desired.
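The `is_ctc_active` method called in `compute_forward` is not shown in this section. A plausible sketch of its logic, assuming that CTC is used only during training and only for the first `number_of_ctc_epochs` epochs, is given below. The `Stage` enum is a stand-in for `sb.Stage`, and passing the epoch as an argument is a simplification made for illustration.

```python
from enum import Enum, auto


class Stage(Enum):
    """Stand-in for sb.Stage, so that the sketch runs without SpeechBrain."""

    TRAIN = auto()
    VALID = auto()
    TEST = auto()


def is_ctc_active(stage, current_epoch, number_of_ctc_epochs=5):
    """Return True if the CTC branch should be computed for this stage/epoch."""
    if stage != Stage.TRAIN:
        return False  # CTC is only an auxiliary training loss here
    return current_epoch <= number_of_ctc_epochs


ctc_on_epoch_5 = is_ctc_active(Stage.TRAIN, current_epoch=5)
ctc_on_epoch_6 = is_ctc_active(Stage.TRAIN, current_epoch=6)
ctc_on_valid = is_ctc_active(Stage.VALID, current_epoch=1)
```

Under this assumption, with `number_of_ctc_epochs: 5` the CTC branch is computed for training epochs 1 to 5 and skipped afterwards, and it is never computed at validation or test time.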


#### **Compute Objectives**

Let's now take a look at the `compute_objectives` method:



```python
    def compute_objectives(self, predictions, batch, stage):
        """Computes the loss given the predicted and targeted outputs. We here
        do multi-task learning and the loss is a weighted sum of the ctc + seq2seq
        costs.

        Arguments
        ---------
        predictions : dict
            The output dict from `compute_forward`.
        batch : PaddedBatch
            This batch object contains all the relevant tensors for computation.
        stage : sb.Stage
            One of sb.Stage.TRAIN, sb.Stage.VALID, or sb.Stage.TEST.

        Returns
        -------
        loss : torch.Tensor
            A one-element tensor used for backpropagating the gradient.
        """
        # Compute sequence loss against targets with EOS
        tokens_eos, tokens_eos_lens = self.prepare_tokens(
            stage, batch.tokens_eos
        )
        loss = sb.nnet.losses.nll_loss(
            log_probabilities=predictions["seq_logprobs"],
            targets=tokens_eos,
            length=tokens_eos_lens,
            label_smoothing=self.hparams.label_smoothing,
        )

        # Add ctc loss if necessary. The total cost is a weighted sum of
        # ctc loss + seq2seq loss
        if self.is_ctc_active(stage):
            # Load tokens without EOS as CTC targets
            tokens, tokens_lens = self.prepare_tokens(stage, batch.tokens)
            loss_ctc = self.hparams.ctc_cost(
                predictions["ctc_logprobs"], tokens, self.feat_lens, tokens_lens
            )
            loss *= 1 - self.hparams.ctc_weight
            loss += self.hparams.ctc_weight * loss_ctc

        if stage != sb.Stage.TRAIN:
            # Convert predicted tokens from indexes to words
            predicted_words = [
                self.hparams.tokenizer.decode_ids(prediction).split(" ")
                for prediction in predictions["tokens"]
            ]
            target_words = [words.split(" ") for words in batch.words]

            # Monitor word error rate and character error rate at
            # valid and test time.
            self.wer_metric.append(batch.id, predicted_words, target_words)
            self.cer_metric.append(batch.id, predicted_words, target_words)

        return loss
```

Based on the predictions and the targets, we compute the negative log-likelihood (NLL) loss and, if CTC is active, the Connectionist Temporal Classification (CTC) loss as well. The two losses are combined as a weighted sum controlled by `ctc_weight`. At validation and test stages, we also accumulate the word error rate (WER) and the character error rate (CER).
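
The weighted sum of the two losses can be illustrated with plain numbers. This is a small sketch using Python floats instead of tensors; `combine_losses` is a hypothetical helper name, and the loss values are made up for the example.

```python
def combine_losses(seq_loss, ctc_loss, ctc_weight):
    """Weighted sum used for multi-task training.

    total = (1 - ctc_weight) * seq_loss + ctc_weight * ctc_loss
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must be in [0, 1]")
    return (1.0 - ctc_weight) * seq_loss + ctc_weight * ctc_loss


# Made-up loss values, only to show the arithmetic.
total = combine_losses(seq_loss=2.0, ctc_loss=4.0, ctc_weight=0.25)
print(total)  # 0.75 * 2.0 + 0.25 * 4.0 = 2.5
```

Setting `ctc_weight` to 0 recovers the pure sequence-to-sequence loss, which is what happens when CTC is not active.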

### **Other Methods**
Beyond `compute_forward` and `compute_objectives`, you can find other methods such as `on_stage_start` and `on_stage_end`. The first just initializes the statistics objects (e.g., WER and CER), while the second manages:
- statistics updates
- learning rate annealing
- logging
- checkpointing
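
To make the life cycle of a statistics object concrete, here is a minimal sketch of a WER tracker that is created at the start of a stage, updated for each batch, and summarized at the end. `WerTracker` is a hypothetical stand-in written for this example; it is not the SpeechBrain metric class used in the recipe.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


class WerTracker:
    """Hypothetical stand-in for a statistics object."""

    def __init__(self):
        self.errors = 0
        self.ref_words = 0

    def append(self, predicted_words, target_words):
        # Accumulate errors and reference lengths over one batch.
        for hyp, ref in zip(predicted_words, target_words):
            self.errors += edit_distance(ref, hyp)
            self.ref_words += len(ref)

    def summarize(self):
        # WER in percent over everything appended so far.
        return 100.0 * self.errors / max(self.ref_words, 1)


# "on_stage_start": create a fresh tracker for the stage.
wer = WerTracker()
# During the stage: accumulate one batch of predictions and targets.
wer.append(
    predicted_words=[["THE", "BIRCH", "CANOE"], ["HELLO", "WORD"]],
    target_words=[["THE", "BIRCH", "CANOE"], ["HELLO", "WORLD"]],
)
# "on_stage_end": summarize the accumulated statistics.
stage_wer = wer.summarize()
print(f"WER: {stage_wer:.1f}%")
```

Here one substitution over five reference words gives a WER of 20%.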

That's all. You can just run the code and train your speech recognizer.


The current code implements all the needed functionalities to train a state-of-the-art speech recognition system.  In a real case, however, you have to train the model with a much larger dataset to reach acceptable performance. As an example, [you can see our LibriSpeech recipes here](https://github.com/speechbrain/speechbrain/tree/develop/recipes/LibriSpeech/ASR). For more information on checkpointing, [take a look here](https://colab.research.google.com/drive/1VH7U0oP3CZsUNtChJT2ewbV_q1QX8xre?usp=sharing).

## **Pretrain and Fine-tune**
In some cases, instead of training the model from scratch, you might want to start from a pre-trained model and fine-tune it. Note that, for this to work, the architecture of your model must exactly match that of the pre-trained one. 
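
The "exact match" requirement can be pictured as a comparison of parameter names and shapes. The sketch below uses plain dictionaries with made-up names and shapes; `find_mismatches` is a hypothetical helper. With PyTorch modules, the shapes could be read from `state_dict()`.

```python
def find_mismatches(pretrained_shapes, model_shapes):
    """Compare two {parameter_name: shape} mappings.

    Returns a list of human-readable problems. An empty list means the
    pre-trained parameters line up with the model one-to-one.
    """
    problems = []
    for name in sorted(set(pretrained_shapes) | set(model_shapes)):
        if name not in model_shapes:
            problems.append(f"unexpected parameter: {name}")
        elif name not in pretrained_shapes:
            problems.append(f"missing parameter: {name}")
        elif tuple(pretrained_shapes[name]) != tuple(model_shapes[name]):
            problems.append(
                f"shape mismatch for {name}: "
                f"{tuple(pretrained_shapes[name])} vs {tuple(model_shapes[name])}"
            )
    return problems


# Made-up shapes: the new model uses a smaller DNN layer than the checkpoint.
pretrained = {"enc.rnn.weight": (4096, 512), "enc.dnn.weight": (512, 2048)}
model = {"enc.rnn.weight": (4096, 512), "enc.dnn.weight": (256, 2048)}

problems = find_mismatches(pretrained, model)
print(problems)
```

Changing a hyperparameter such as the number of DNN neurons is enough to make the checkpoint incompatible with the model.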

One convenient way is to use the `Pretrainer` class in the yaml file. If you want to load a pre-trained encoder for the speech recognizer, you can use the following code: 

```yaml
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        encoder: !ref <encoder>
    paths:
        encoder: !ref <encoder_ptfile>
```

where `!ref <encoder>` points to the encoder model previously defined in the yaml file, and `encoder_ptfile` is the path where you have stored your pre-trained model.

To load the pre-trained parameters, make sure to call the pretrainer in `train.py`:

```python
from speechbrain.utils.distributed import run_on_main

run_on_main(hparams["pretrainer"].collect_files)
hparams["pretrainer"].load_collected(device=run_opts["device"])
```
You have to call these functions before the `fit` method of the Brain class.

For more information, [please take a look into our tutorial on pre-training and fine-tune](https://colab.research.google.com/drive/1LN7R3U3xneDgDRK2gC5MzGkLysCWxuC3?usp=sharing).

## **Step 5: Inference**

At this point, we can use the trained speech recognizer. For this type of ASR model, SpeechBrain provides some classes ([take a look here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/pretrained/interfaces.py)), such as `EncoderDecoderASR`, that make inference easier. For instance, we can transcribe an audio file with a pre-trained model hosted in our [HuggingFace repository](https://huggingface.co/speechbrain) in just 4 lines of code:


```python
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech", savedir="pretrained_model")
audio_file = 'speechbrain/asr-crdnn-rnnlm-librispeech/example.wav'
asr_model.transcribe_file(audio_file)
```

'THE BIRCH CANOE SLID ON THE SMOOTH PLANKS'

But, how does this work with your custom ASR system? 

### **Train and use your speech recognizer on your data**

At this point, three options are available to you:
1. Define a custom python function in your ASR class (extended from Brain). This introduces strong coupling between the training recipe and your transcripts. It is pretty convenient for prototyping and obtaining simple transcripts on your datasets. However, it is not recommended for deployment. 
2. Use already available Interfaces (such as `EncoderDecoderASR`). This is probably the most elegant and convenient way. However, your model should be compliant with some constraints to fit the proposed interface.
3. Build your own Interface perfectly fitting to your custom ASR model.

**Important: All these solutions also apply for other tasks (speaker recognition, source separation ...)**

#### **Custom function in the training script**
The goal of this approach is to enable the user to call a function at the end of `train.py` that transcribes a given dataset:

```python
    # Trainer initialization
    asr_brain = ASR(
        modules=hparams["modules"],
        opt_class=hparams["opt_class"],
        hparams=hparams,
        run_opts=run_opts,
        checkpointer=hparams["checkpointer"],
    )
 
    # Training
    asr_brain.fit(
        asr_brain.hparams.epoch_counter,
        datasets["train"],
        datasets["valid"],
        train_loader_kwargs=hparams["train_dataloader_opts"],
        valid_loader_kwargs=hparams["valid_dataloader_opts"],
    )
 
    # Load best checkpoint for evaluation
    test_stats = asr_brain.evaluate(
        test_set=datasets["test"],
        min_key="WER",
        test_loader_kwargs=hparams["test_dataloader_opts"],
    )

    # Load best checkpoint for transcription !!!!!!
    # You need to create this function w.r.t your system architecture !!!!!!
    transcripts = asr_brain.transcribe_dataset(
        dataset=datasets["your_dataset"], # Must be obtained from the dataio_function
        min_key="WER", # We load the model with the lowest WER
        loader_kwargs=hparams["transcribe_dataloader_opts"], # opts for the dataloading
    )
```



As you can see, there exists a strong coupling with the training recipe due to the need for an instantiated Brain class. 

**Note 1:** You can remove the `.fit()` and `.evaluate()` if you don't want to call them. This is just an example to better highlight how to use it.

**Note 2:** Here, the `.transcribe_dataset()` function takes a `dataset` object to transcribe. You could also simply use a path instead. It is **completely** up to you to implement this function as you wish. 

Now: what to put in this function? Here, we will give an example based on the template, but you will need to adapt it to **your** system.

```python

# Imports needed by this method (at the top of train.py)
import torch
import speechbrain as sb
from torch.utils.data import DataLoader
from tqdm import tqdm

def transcribe_dataset(
        self,
        dataset, # Must be obtained from the dataio_function
        min_key, # We load the model with the lowest WER
        loader_kwargs # opts for the dataloading
    ):

    # If dataset isn't a DataLoader, we create it.
    if not isinstance(dataset, DataLoader):
        loader_kwargs["ckpt_prefix"] = None
        dataset = self.make_dataloader(
            dataset, sb.Stage.TEST, **loader_kwargs
        )

    self.on_evaluate_start(min_key=min_key) # We call the on_evaluate_start that will load the best model
    self.modules.eval() # We set the model to eval mode (remove dropout etc)

    # Now we iterate over the dataset and we simply compute_forward and decode
    with torch.no_grad():

        transcripts = []
        for batch in tqdm(dataset, dynamic_ncols=True):

            # Make sure that your compute_forward returns the predictions !!!
            # In the case of the template, when stage = TEST, a beam search is applied
            # in compute_forward(). As in compute_objectives above, we assume that
            # the output is a dict with the decoded token ids under "tokens".
            predictions = self.compute_forward(batch, stage=sb.Stage.TEST)

            # We go from tokens to words (one list of words per utterance).
            predicted_words = [
                self.hparams.tokenizer.decode_ids(prediction).split(" ")
                for prediction in predictions["tokens"]
            ]
            transcripts.append(predicted_words)

    return transcripts
```

The pipeline is simple: load the model -> do compute_forward -> detokenize.

#### **Using the `EncoderDecoderASR` interface**

The [EncoderDecoderASR class](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/pretrained/interfaces.py#L353) interface allows you to decouple your trained model from the training recipe and to run inference (or encoding) on any new audio file in a few lines of code. The class has the following methods:

- *encode_batch*: applies the encoder to an input batch and returns the encoded features.
- *transcribe_file*: transcribes a single input audio file.
- *transcribe_batch*: transcribes the input batch.

In fact, if you fulfill a few constraints that we will detail in the next paragraph, you can simply do:

```python
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="your_local_folder", hparams_file='your_file.yaml', savedir="pretrained_model")
audio_file = 'your_file.wav'
asr_model.transcribe_file(audio_file)
```

Nevertheless, to allow such a generalization over all the possible EncoderDecoder ASR pipelines, you will have to consider a few constraints when deploying your system:

1. **Necessary modules.** As you can see in the `EncoderDecoderASR` class, the modules defined in your yaml file MUST contain certain elements with specific names. In practice, you need a tokenizer, an encoder, and a decoder. The encoder can simply be a `speechbrain.nnet.containers.LengthsCapableSequential` composed of a sequence of feature computation, normalization, and model encoding. 
```python
    HPARAMS_NEEDED = ["tokenizer"]
    MODULES_NEEDED = [
        "encoder",
        "decoder",
    ]
```

You also need to declare these entities in the YAML file and create the following dictionary called `modules`:

```yaml
encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential
    input_shape: [null, null, !ref <n_mels>]
    compute_features: !ref <compute_features>
    normalize: !ref <normalize>
    model: !ref <enc>

decoder: !new:speechbrain.decoders.S2SRNNBeamSearchLM
    embedding: !ref <emb>
    decoder: !ref <dec>
    linear: !ref <seq_lin>
    language_model: !ref <lm_model>
    bos_index: !ref <bos_index>
    eos_index: !ref <eos_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <beam_size>
    eos_threshold: !ref <eos_threshold>
    using_max_attn_shift: !ref <using_max_attn_shift>
    max_attn_shift: !ref <max_attn_shift>
    coverage_penalty: !ref <coverage_penalty>
    lm_weight: !ref <lm_weight>
    temperature: !ref <temperature>
    temperature_lm: !ref <temperature_lm>

modules:
    encoder: !ref <encoder>
    decoder: !ref <decoder>
    lm_model: !ref <lm_model>
```

In this case, `enc` is a CRDNN, but it could be any custom neural network.

  **Why do you need to ensure this?** Simply because these are the modules we call when running inference with the `EncoderDecoderASR` class. Here is an excerpt of the `encode_batch()` function:
```python
[...]
wavs = wavs.float()
wavs, wav_lens = wavs.to(self.device), wav_lens.to(self.device)
encoder_out = self.modules.encoder(wavs, wav_lens)
return encoder_out
```
  **What if I have a complex asr_encoder structure with multiple deep neural networks?** Simply put everything in a `torch.nn.ModuleList` in your yaml:
```yaml
asr_encoder: !new:torch.nn.ModuleList
    - [!ref <enc>, my_different_blocks ... ]
```

2. **Call to the pretrainer to load the checkpoints.** Finally, you need to define a call to the pretrainer that will load the different checkpoints of your trained model into the corresponding SpeechBrain modules. In short, it will load the weights of your encoder, language model or even simply load the tokenizer. 
```yaml
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        asr: !ref <asr_model>
        lm: !ref <lm_model>
        tokenizer: !ref <tokenizer>
    paths:
      asr: !ref <asr_model_ptfile>
      lm: !ref <lm_model_ptfile>
      tokenizer: !ref <tokenizer_ptfile>
```
The `loadables` field links a key (e.g. `lm`) to a yaml instance (e.g. `<lm_model>`, which is nothing more than your LM), while the `paths` field links the same key to the corresponding checkpoint file (e.g. `<lm_model_ptfile>`). 

If you respect these two constraints, it should work! Here, we give a complete example of a yaml file that is used for inference only:

```yaml

# ############################################################################
# Model: E2E ASR with attention-based ASR
# Encoder: CRDNN model
# Decoder: GRU + beamsearch + RNNLM
# Tokens: BPE with unigram
# Authors:  Ju-Chieh Chou, Mirco Ravanelli, Abdel Heba, Peter Plantinga 2020
# ############################################################################


# Feature parameters
sample_rate: 16000
n_fft: 400
n_mels: 40

# Model parameters
activation: !name:torch.nn.LeakyReLU
dropout: 0.15
cnn_blocks: 2
cnn_channels: (128, 256)
inter_layer_pooling_size: (2, 2)
cnn_kernelsize: (3, 3)
time_pooling_size: 4
rnn_class: !name:speechbrain.nnet.RNN.LSTM
rnn_layers: 4
rnn_neurons: 1024
rnn_bidirectional: True
dnn_blocks: 2
dnn_neurons: 512
emb_size: 128
dec_neurons: 1024
output_neurons: 1000  # index(blank/eos/bos) = 0
blank_index: 0

# Decoding parameters
bos_index: 0
eos_index: 0
min_decode_ratio: 0.0
max_decode_ratio: 1.0
beam_size: 80
eos_threshold: 1.5
using_max_attn_shift: True
max_attn_shift: 240
lm_weight: 0.50
coverage_penalty: 1.5
temperature: 1.25
temperature_lm: 1.25

normalize: !new:speechbrain.processing.features.InputNormalization
    norm_type: global

compute_features: !new:speechbrain.lobes.features.Fbank
    sample_rate: !ref <sample_rate>
    n_fft: !ref <n_fft>
    n_mels: !ref <n_mels>

enc: !new:speechbrain.lobes.models.CRDNN.CRDNN
    input_shape: [null, null, !ref <n_mels>]
    activation: !ref <activation>
    dropout: !ref <dropout>
    cnn_blocks: !ref <cnn_blocks>
    cnn_channels: !ref <cnn_channels>
    cnn_kernelsize: !ref <cnn_kernelsize>
    inter_layer_pooling_size: !ref <inter_layer_pooling_size>
    time_pooling: True
    using_2d_pooling: False
    time_pooling_size: !ref <time_pooling_size>
    rnn_class: !ref <rnn_class>
    rnn_layers: !ref <rnn_layers>
    rnn_neurons: !ref <rnn_neurons>
    rnn_bidirectional: !ref <rnn_bidirectional>
    rnn_re_init: True
    dnn_blocks: !ref <dnn_blocks>
    dnn_neurons: !ref <dnn_neurons>

emb: !new:speechbrain.nnet.embedding.Embedding
    num_embeddings: !ref <output_neurons>
    embedding_dim: !ref <emb_size>

dec: !new:speechbrain.nnet.RNN.AttentionalRNNDecoder
    enc_dim: !ref <dnn_neurons>
    input_size: !ref <emb_size>
    rnn_type: gru
    attn_type: location
    hidden_size: !ref <dec_neurons>
    attn_dim: 1024
    num_layers: 1
    scaling: 1.0
    channels: 10
    kernel_size: 100
    re_init: True
    dropout: !ref <dropout>

ctc_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dnn_neurons>
    n_neurons: !ref <output_neurons>

seq_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dec_neurons>
    n_neurons: !ref <output_neurons>

log_softmax: !new:speechbrain.nnet.activations.Softmax
    apply_log: True

lm_model: !new:speechbrain.lobes.models.RNNLM.RNNLM
    output_neurons: !ref <output_neurons>
    embedding_dim: !ref <emb_size>
    activation: !name:torch.nn.LeakyReLU
    dropout: 0.0
    rnn_layers: 2
    rnn_neurons: 2048
    dnn_blocks: 1
    dnn_neurons: 512
    return_hidden: True  # For inference

tokenizer: !new:sentencepiece.SentencePieceProcessor

asr_model: !new:torch.nn.ModuleList
    - [!ref <enc>, !ref <emb>, !ref <dec>, !ref <ctc_lin>, !ref <seq_lin>]

# We compose the inference (encoder) pipeline.
encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential
    input_shape: [null, null, !ref <n_mels>]
    compute_features: !ref <compute_features>
    normalize: !ref <normalize>
    model: !ref <enc>

decoder: !new:speechbrain.decoders.S2SRNNBeamSearchLM
    embedding: !ref <emb>
    decoder: !ref <dec>
    linear: !ref <seq_lin>
    language_model: !ref <lm_model>
    bos_index: !ref <bos_index>
    eos_index: !ref <eos_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <beam_size>
    eos_threshold: !ref <eos_threshold>
    using_max_attn_shift: !ref <using_max_attn_shift>
    max_attn_shift: !ref <max_attn_shift>
    coverage_penalty: !ref <coverage_penalty>
    lm_weight: !ref <lm_weight>
    temperature: !ref <temperature>
    temperature_lm: !ref <temperature_lm>


modules:
    encoder: !ref <encoder>
    decoder: !ref <decoder>
    lm_model: !ref <lm_model>

pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    loadables:
        asr: !ref <asr_model>
        lm: !ref <lm_model>
        tokenizer: !ref <tokenizer>


```

As you can see, it is a standard YAML file, but with a pretrainer that loads the model. It is similar to the yaml file used for training. We only have to remove all the parts that are training-specific (e.g., training parameters, optimizers, checkpointers, etc.) and add the pretrainer and the `encoder` and `decoder` elements that link the needed modules with their pre-trained files. 

#### **Developing your own inference interface**

While the `EncoderDecoderASR` class has been designed to be as generic as possible, you might require a more complex inference scheme that better fits your needs. In this case, you have to develop your own interface. To do so, follow these steps:

1. Create your custom interface inheriting from `Pretrained` (code [here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/pretrained/interfaces.py)):


```python
class MySuperTask(Pretrained):
  # Here, do not hesitate to also add some required modules
  # for further transparency.
  HPARAMS_NEEDED = ["mymodule1", "mymodule2"]
  MODULES_NEEDED = [
        "mytask_enc",
        "my_searcher",
  ]
  def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Do whatever is needed here w.r.t your system
```

This will enable your class to call useful functions such as `.from_hparams()`, which fetches and loads a model based on a HyperPyYAML file, and `load_audio()`, which loads a given audio file. Most of the methods that we coded in the `Pretrained` class will likely fit your needs. If not, you can override them to implement your custom functionality.


2. Develop your interface and the different functionalities. Unfortunately, we can't provide a generic enough example here. You can add **any** function to this class that you think can make inference on your data/model easier and more natural. For instance, we can create here a function that simply encodes a wav file using the `mytask_enc` module.
```python
import torch
from speechbrain.pretrained.interfaces import Pretrained

class MySuperTask(Pretrained):
  # Here, do not hesitate to also add some required modules
  # for further transparency.
  HPARAMS_NEEDED = ["mymodule1", "mymodule2"]
  MODULES_NEEDED = [
        "mytask_enc",
        "my_searcher",
  ]
  def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Do whatever is needed here w.r.t your system
  
  def encode_file(self, path):
        waveform = self.load_audio(path)
        # Fake a batch:
        batch = waveform.unsqueeze(0)
        rel_length = torch.tensor([1.0])
        with torch.no_grad():
          rel_lens = rel_length.to(self.device)
          # encode_batch is assumed to be implemented in this class
          # (e.g., like in EncoderDecoderASR, but calling `mytask_enc`)
          encoder_out = self.encode_batch(batch, rel_lens)

        return encoder_out
```
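
The `rel_length = torch.tensor([1.0])` line above reflects the SpeechBrain convention that lengths are expressed relative to the longest item in the batch. A batch made of a single file therefore has a relative length of 1.0. The sketch below illustrates the convention with plain Python lists; `pad_and_relative_lengths` is a hypothetical helper written for this example.

```python
def pad_and_relative_lengths(waveforms, pad_value=0.0):
    """Pad variable-length signals and compute their relative lengths.

    Each relative length is len(signal) / len(longest signal), so the
    longest signal in the batch always gets 1.0.
    """
    max_len = max(len(w) for w in waveforms)
    padded = [list(w) + [pad_value] * (max_len - len(w)) for w in waveforms]
    rel_lens = [len(w) / max_len for w in waveforms]
    return padded, rel_lens


# Two signals of different lengths: the shorter one is padded.
padded, rel_lens = pad_and_relative_lengths([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6]])
print(padded)    # [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.0, 0.0]]
print(rel_lens)  # [1.0, 0.5]

# A "fake batch" with a single signal always has relative length 1.0.
single, single_lens = pad_and_relative_lengths([[0.1, 0.2, 0.3]])
print(single_lens)  # [1.0]
```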

Now, we can use your Interface in the following way:
```python
# MySuperTask is the custom class defined above (import it from your own module)

my_model = MySuperTask.from_hparams(source="your_local_folder", hparams_file='your_file.yaml', savedir="pretrained_model")
audio_file = 'your_file.wav'
encoded = my_model.encode_file(audio_file)

```

As you can see, this formalism is extremely flexible and enables you to create a holistic interface that can be used to do anything you want with your pretrained model.

We provide different generic interfaces for E2E ASR, speaker recognition, source separation, speech enhancement, etc. Please have a look [here](https://github.com/speechbrain/speechbrain/blob/develop/recipes/CommonVoice/ASR/seq2seq/train.py) if interested! 



## **Customize your speech recognizer**
In a general case, you might have your own data and you would like to use your own model. Let's comment a bit more on how you can customize your recipe. 

**Suggestion**: start from a recipe that is working (like the one used for this template) and only do the minimal modifications needed to customize it. Test your model step by step. Make sure your model can overfit on a tiny dataset composed of a few sentences. If it doesn't overfit, there is likely a bug in your model.

### **Train with your data**
All you have to do when changing the dataset is to update the data preparation script so that it creates JSON files formatted as expected. The `train.py` script expects the JSON file to look like this:



```json
{
  "1867-154075-0032": {
    "wav": "{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0032.flac",
    "length": 16.09,
    "words": "AND HE BRUSHED A HAND ACROSS HIS FOREHEAD AND WAS INSTANTLY HIMSELF CALM AND COOL VERY WELL THEN IT SEEMS I'VE MADE AN ASS OF MYSELF BUT I'LL TRY TO MAKE UP FOR IT NOW WHAT ABOUT CAROLINE"
  },
  "1867-154075-0001": {
    "wav": "{data_root}/LibriSpeech/train-clean-5/1867/154075/1867-154075-0001.flac",
    "length": 14.9,
    "words": "THAT DROPPED HIM INTO THE COAL BIN DID HE GET COAL DUST ON HIS SHOES RIGHT AND HE DIDN'T HAVE SENSE ENOUGH TO WIPE IT OFF AN AMATEUR A RANK AMATEUR I TOLD YOU SAID THE MAN OF THE SNEER WITH SATISFACTION"
  }
}
```

You have to parse your dataset and create JSON files with a unique ID for each sentence, the path of the audio signal (`wav`), the length of the speech sentence in seconds (`length`), and the word transcriptions (`words`). That's all!
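
As an illustration, a data preparation script could build such a file along the following lines. This is a minimal sketch under stated assumptions: `build_manifest`, the utterance IDs, and the file paths are all made up for the example, and the lengths are given directly rather than measured from the audio.

```python
import json


def build_manifest(utterances):
    """Build a {utterance_id: {"wav", "length", "words"}} dictionary."""
    manifest = {}
    for utt in utterances:
        utt_id = utt["id"]
        if utt_id in manifest:
            # IDs must be unique across the whole dataset.
            raise ValueError(f"duplicate utterance ID: {utt_id}")
        manifest[utt_id] = {
            "wav": utt["wav"],
            "length": utt["length"],  # duration in seconds
            "words": utt["words"],
        }
    return manifest


# Hypothetical utterances of a custom corpus.
utterances = [
    {"id": "spk1-utt1", "wav": "{data_root}/my_corpus/spk1/utt1.wav",
     "length": 3.5, "words": "HELLO WORLD"},
    {"id": "spk1-utt2", "wav": "{data_root}/my_corpus/spk1/utt2.wav",
     "length": 2.25, "words": "GOOD MORNING"},
]

manifest = build_manifest(utterances)
manifest_text = json.dumps(manifest, indent=2)
print(manifest_text)
```

The resulting text can then be written to one JSON file per data split (for example with `open(...).write(manifest_text)`).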



### **Train with your own model**
At some point, you might have your own model and you would like to plug it into the speech recognition pipeline. 
For instance, you might want to replace our CRDNN encoder with something different. To do that, you have to create your own class and specify there the list of computations for your neural network. You can take a look at the models already existing in [speechbrain.lobes.models](https://github.com/speechbrain/speechbrain/tree/develop/speechbrain/lobes/models). If your model is a plain pipeline of computations, you can use the [sequential container](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/lobes/models/CRDNN.py#L14). If the model is a more complex chain of computations, you can create it as an instance of `torch.nn.Module` and define there the `__init__` and `forward` methods like [here](https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/lobes/models/Xvector.py#L18).

Once you have defined your model, you only have to declare it in the yaml file and use it in `train.py`.


**Important:**  
When plugging in a new model, you have to tune again the most important hyperparameters of the system (e.g., learning rate, batch size, and the architectural parameters) to make it work well.






## **Conclusion**

In this tutorial, we showed how to create an end-to-end speech recognizer from scratch using SpeechBrain. The proposed system contains all the basic ingredients to develop a state-of-the-art system (i.e., data augmentation, tokenization, language models, beam search, attention, etc.).

We described all the steps using a small dataset only. In a real case you have to train with much more data (see for instance our [LibriSpeech recipes](https://github.com/speechbrain/speechbrain/tree/develop/recipes/LibriSpeech)).

## Related Tutorials
0. [ASRfromScratch](https://colab.research.google.com/drive/1aFgzrUv3udM_gNJNUoLaHIm78QHtxdIz?usp=sharing)
1. [YAML hyperparameter specification](https://colab.research.google.com/drive/1Pg9by4b6-8QD2iC0U7Ic3Vxq4GEwEdDz?usp=sharing)
2. [Brain Class](https://colab.research.google.com/drive/1fdqTk4CTXNcrcSVFvaOKzRfLmj4fJfwa?usp=sharing)
3. [Checkpointing](https://colab.research.google.com/drive/1VH7U0oP3CZsUNtChJT2ewbV_q1QX8xre?usp=sharing)
4. [Data-io](https://colab.research.google.com/drive/1AiVJZhZKwEI4nFGANKXEe-ffZFfvXKwH?usp=sharing)
5. [Tokenizer](https://colab.research.google.com/drive/12yE3myHSH-eUxzNM0-FLtEOhzdQoLYWe?usp=sharing)
6. [Speech Features](https://colab.research.google.com/drive/1CI72Xyay80mmmagfLaIIeRoDgswWHT_g?usp=sharing)
7. [Speech Augmentation](https://colab.research.google.com/drive/1JJc4tBhHNXRSDM2xbQ3Z0jdDQUw4S5lr?usp=sharing)
8. [Environmental Corruption](https://colab.research.google.com/drive/1mAimqZndq0BwQj63VcDTr6_uCMC6i6Un?usp=sharing)
9. [MultiGPU Training](https://colab.research.google.com/drive/13pBUacPiotw1IvyffvGZ-HrtBr9T6l15?usp=sharing)
10. [Pretrain and Fine-tune](https://colab.research.google.com/drive/1LN7R3U3xneDgDRK2gC5MzGkLysCWxuC3?usp=sharing)




