# Automatic Speech Recognition (ASR) Tutorial

In [None]:
!nvidia-smi

Thu Jul 27 18:11:24 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    42W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Fine-tune a pretrained, multilingual ASR model on FLEURS

In this tutorial, we will evaluate and improve a multilingual ASR model for a language in the FLEURS dataset. We will focus on **Hausa**, but you can follow along with any language in FLEURS. See the [paper](https://arxiv.org/abs/2205.12446) for the list of supported languages.

We will be looking at three major open-source multilingual ASR models:
* XLS-R: [[paper]](https://arxiv.org/abs/2111.09296) [[Hugging Face blog]](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2)
* Whisper: [[paper]](https://cdn.openai.com/papers/whisper.pdf) [[Hugging Face blog]](https://huggingface.co/blog/fine-tune-whisper#prepare-feature-extractor-tokenizer-and-data)
* MMS: [[paper]](https://scontent-sjc3-1.xx.fbcdn.net/v/t39.8562-6/348827959_6967534189927933_6819186233244071998_n.pdf?_nc_cat=104&ccb=1-7&_nc_sid=ad8a9d&_nc_ohc=-JOSFMsFL-UAX-4O6o4&_nc_ht=scontent-sjc3-1.xx&oh=00_AfDdMFq0DP2xIRyjWpGrmIpqncnouiylLfWnFsAgxboLWw&oe=6497E242) [[Hugging Face blog]](https://huggingface.co/blog/mms_adapters)

For more details on the models and finetuning them, please refer to the corresponding Hugging Face tutorials. Much of this tutorial draws from the Hugging Face blogs.

## Before you start: Setting up your coding environment

Make sure you follow the setup instructions in the [lrl-asr-experiments README](https://github.com/kashrest/lrl-asr-experiments) so that this tutorial runs on Google Colab.


**Note**: The pretrained multilingual ASR models we will be using in this notebook require GPUs with at least 40 GB of memory for practical use. If you are using Google Colab Pro, make sure you go to "Runtime" -> "Change runtime type" -> "Hardware accelerator" -> GPU and "GPU type" -> A100. You should then be able to run all cells of this tutorial.

In [None]:
%%capture
!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install datasets[audio]
!pip install evaluate
!pip install git+https://github.com/huggingface/transformers.git
!pip install jiwer
!pip install accelerate -U

In [None]:
# If you want to save checkpoints on your Google Drive (note: checkpoints
# may take up a few GBs (as explained in the README), so it is recommended that you download the checkpoints to your local machine instead), uncomment and run the lines below
"""from google.colab import drive
drive.mount('/content/drive')"""

Mounted at /content/drive


## Data Preprocessing

The first step is to download and prepare the data for the ASR model. Hugging Face has an easy way to download FLEURS data for any supported language, where the split can be specified. We also want to specify an output directory where our finetuned model checkpoints will live.

**Note**: Once a checkpoint is saved, download any checkpoint you want to investigate to your local machine, because all data will be gone once the runtime is terminated!

In [None]:
import os
from datasets import load_dataset

# create a directory for outputs in tutorial
out_dir = "./tutorial-fleurs/" # NOTE: Since this is a directory in your virtual
                               # machine (which you can see in the side bar, under the folder icon),
                               # make sure to download model checkpoints to your
                               # local machine if you would like to investigate later
try:
    os.mkdir(out_dir)
except FileExistsError:
    print("Output directory already exists; make a new directory.")

# for Hausa, the language code is "ha_ng"
train_data = load_dataset("google/fleurs", "ha_ng", split="train")
val_data = load_dataset("google/fleurs", "ha_ng", split="validation")
test_data = load_dataset("google/fleurs", "ha_ng", split="test")


Output directory already exists; make a new directory.


Each FLEURS example is organized as follows:

In [None]:
train_data[0]

{'id': 302,
 'num_samples': 301440,
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/6d52769d3af80da90bb14f9334ba8da9db2bc4cd9ccfdb87c34409eb360029ef/10002175198254707815.wav',
 'audio': {'path': 'train/10002175198254707815.wav',
  'array': array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
         6.78300858e-05, 1.26361847e-05, 6.46114349e-05]),
  'sampling_rate': 16000},
 'transcription': 'nasarorin da vautier ta samu marasa alaka da bada umarni ba sun hada da yajin cin abinci a 1973 a kan abin da ya ke ganin dabaibayin siyasa ne',
 'raw_transcription': 'Nasarorin da Vautier ta samu marasa alaka da bada umarni ba sun hada da yajin cin abinci a 1973 a kan abin da ya ke ganin dabaibayin siyasa ne.',
 'gender': 1,
 'lang_id': 30,
 'language': 'Hausa',
 'lang_group_id': 3}

and we have training, validation, and test splits with 3259, 296, and 621 examples, respectively.

In [None]:
len(train_data), len(val_data), len(test_data)

(3259, 296, 621)

We are interested in the audio ([represented as an array of floats, each proportional to the intensity of the sound at a certain point in time](http://artsites.ucsc.edu/EMS/Music/tech_background/TE-16/teces_16.html); the number of floats is determined by the sampling rate, which is 16,000 Hz, or 16,000 measurements per second) and the corresponding transcript.

**Note**: All three models we will be using in this tutorial require that the data is sampled at 16,000 Hz. Since FLEURS is sampled at 16,000 Hz, we are good.
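
As a quick illustration of what the sampling rate means, the duration of a clip is the number of samples divided by the sampling rate. The sketch below uses the `num_samples` and `sampling_rate` values from the example record shown above; the variable names are only for illustration.

```python
# Values taken from the example record shown above
num_samples = 301_440    # 'num_samples' field of train_data[0]
sampling_rate = 16_000   # 'sampling_rate' field, in Hz

# Duration in seconds = number of measurements / measurements per second
duration_seconds = num_samples / sampling_rate
print(f"{duration_seconds:.2f} seconds")  # prints "18.84 seconds"
```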

Let's extract the audio and transcripts

In [None]:
train_transcripts, val_transcripts, test_transcripts = [], [], []
train_audio, val_audio, test_audio = [], [], []

for elem in train_data:
    assert elem["audio"]["sampling_rate"] == 16000
    train_audio.append(elem["audio"]["array"])
    train_transcripts.append(elem["raw_transcription"])

for elem in val_data:
    assert elem["audio"]["sampling_rate"] == 16000
    val_audio.append(elem["audio"]["array"])
    val_transcripts.append(elem["raw_transcription"])

for elem in test_data:
    assert elem["audio"]["sampling_rate"] == 16000
    test_audio.append(elem["audio"]["array"])
    test_transcripts.append(elem["raw_transcription"])

Now, since we are interested in transcribing speech, we want to clean the transcripts by removing special characters that do not have a clear sound (such as ! '). This part may depend on your target application and language. For example, many native Hausa speakers do not speak English and there is not much code-switching, so we also normalize foreign characters (ç ş) and remove symbols (% & $).

In [None]:
import re

def preprocess_texts_hausa(transcriptions):
    chars_to_remove_regex = '[><¥£°¾½²\\\+\,\?\!\-\;\:\"\“\%\‘\'\ʻ\”\�\$\&\(\)\–\—\[\]\{\}/]'

    def _remove_special_characters(transcription):
        transcription = transcription.strip() # remove any leading or trailing white space
        transcription = transcription.lower()
        transcription = re.sub(chars_to_remove_regex, '', transcription)
        return transcription

    def _normalize_diacritics(transcription):
        a = '[āăáã]'
        u = '[ūúü]'
        o = '[öõó]'
        c = '[ç]'
        i = '[í]'
        s = '[ş]'
        e = '[é]'

        transcription = re.sub(a, "a", transcription)
        transcription = re.sub(u, "u", transcription)
        transcription = re.sub(o, "o", transcription)
        transcription = re.sub(c, "c", transcription)
        transcription = re.sub(i, "i", transcription)
        transcription = re.sub(s, "s", transcription)
        transcription = re.sub(e, "e", transcription)

        return transcription

    cleaned_transcriptions = map(_remove_special_characters, transcriptions)
    cleaned_transcriptions = list(map(_normalize_diacritics, list(cleaned_transcriptions)))
    return cleaned_transcriptions

train_transcripts = preprocess_texts_hausa(train_transcripts)
val_transcripts = preprocess_texts_hausa(val_transcripts)
test_transcripts = preprocess_texts_hausa(test_transcripts)

**Note**: It is important to preprocess test transcripts the same way as the training transcripts so that we have a fair evaluation of the model.
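
The diacritic normalization step above can also be sketched with a single translation table instead of repeated `re.sub` calls. This is only an illustrative alternative: `FOLD_TABLE` and `fold_diacritics` are hypothetical names, and the mapping simply mirrors the character classes in `_normalize_diacritics`. Note that Hausa letters such as ɗ, ƙ, ɓ, and ƴ are left untouched, since they are part of the target vocabulary.

```python
# Hypothetical helper: the mapping mirrors the character classes used in
# _normalize_diacritics above, expressed as one translation table.
FOLD_TABLE = str.maketrans({
    "ā": "a", "ă": "a", "á": "a", "ã": "a",
    "ū": "u", "ú": "u", "ü": "u",
    "ö": "o", "õ": "o", "ó": "o",
    "ç": "c", "í": "i", "ş": "s", "é": "e",
})

def fold_diacritics(text):
    """Replace the listed accented characters with their base letters."""
    return text.translate(FOLD_TABLE)

# The foreign "ç" is folded to "c"; the Hausa letter "ɗ" is preserved.
print(fold_diacritics("façade kaɗan"))  # prints "facade kaɗan"
```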

Some models (MMS and XLS-R) predict one character at a time, so we need a character vocabulary made up of all characters in the dataset after preprocessing. We can save the vocabulary in a JSON file.

In [None]:
import json

def extract_all_chars(transcription):
    all_text = " ".join(transcription)
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}

vocab_train = list(map(extract_all_chars, train_transcripts))
vocab_val = list(map(extract_all_chars, val_transcripts))
vocab_test = list(map(extract_all_chars, test_transcripts))

vocab_train_chars = []
for elem in [elem["vocab"][0] for elem in vocab_train]:
    vocab_train_chars.extend(elem)

vocab_val_chars = []
for elem in [elem["vocab"][0] for elem in vocab_val]:
    vocab_val_chars.extend(elem)

vocab_test_chars = []
for elem in [elem["vocab"][0] for elem in vocab_test]:
    vocab_test_chars.extend(elem)

vocab_list = list(set(vocab_train_chars) | set(vocab_val_chars) | set(vocab_test_chars))
vocab_dict = {v: k for k, v in enumerate(vocab_list)}

# for word delimiter, change " " --> "|" (ex. "Hello my name is Bob" --> "Hello|my|name|is|Bob")
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict) # this is for models (like MMS and XLS-R) that use the CTC algorithm to predict the end of a character (e.g. "hhh[PAD]iii[PAD]iii[PAD]" == "hii")

This is the character vocabulary based on the FLEURS dataset

In [None]:
vocab_dict

{'j': 0,
 '6': 1,
 'h': 2,
 'v': 3,
 'x': 4,
 'c': 5,
 'g': 6,
 'ɗ': 7,
 'q': 8,
 'y': 9,
 'r': 10,
 '1': 11,
 'o': 12,
 's': 13,
 'u': 14,
 '.': 16,
 'ƙ': 17,
 'b': 18,
 'p': 19,
 'a': 20,
 '5': 21,
 '4': 22,
 'n': 23,
 't': 24,
 'f': 25,
 'ƴ': 26,
 'l': 27,
 'e': 28,
 '7': 29,
 '3': 30,
 '8': 31,
 'z': 32,
 'm': 33,
 'i': 34,
 '0': 35,
 'k': 36,
 '9': 37,
 '2': 38,
 'ɓ': 39,
 'w': 40,
 '’': 41,
 'd': 42,
 '|': 15,
 '[UNK]': 43,
 '[PAD]': 44}
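
To make the role of `[PAD]` and `|` more concrete, here is a minimal sketch of how a greedy CTC output can be turned into text: consecutive repeated tokens are merged, the blank token (`[PAD]` in our vocabulary) is dropped, and the word delimiter is turned back into a space. The function name `ctc_collapse` is hypothetical; in practice the tokenizer's decoding performs this step for us.

```python
from itertools import groupby

def ctc_collapse(frame_tokens, blank="[PAD]", word_delimiter="|"):
    """Greedy CTC post-processing: merge repeats, drop blanks, restore spaces."""
    # 1. merge consecutive duplicates, e.g. h h h -> h
    merged = [token for token, _ in groupby(frame_tokens)]
    # 2. drop the blank token that separates repeated characters
    chars = [token for token in merged if token != blank]
    # 3. map the word delimiter back to a space
    return "".join(chars).replace(word_delimiter, " ")

frames = ["h", "h", "h", "[PAD]", "i", "i", "i", "[PAD]", "i", "i", "i", "[PAD]"]
print(ctc_collapse(frames))  # prints "hii"
```

The blank token between the two runs of `i` is what allows the doubled letter to survive the merge.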

Let's save this vocabulary file for later use in the output folder.

In [None]:
vocab_file = out_dir+"vocab_hausa.json"
with open(vocab_file, 'w') as f:
    json.dump(vocab_dict, f)

## Evaluation code

In ASR, word error rate [(WER)](https://huggingface.co/spaces/evaluate-metric/wer) and character error rate [(CER)](https://huggingface.co/spaces/evaluate-metric/cer) are the common metrics used to evaluate how good a model-produced transcript is in comparison to the gold transcript. These metrics are related to the "edit distance" between two strings and offer a quantitative measure of string difference.
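
To see how these metrics relate to edit distance, here is a minimal pure-Python sketch. It is not the implementation used by the `evaluate` library (which, for example, aggregates errors over the whole dataset); the helper names `edit_distance`, `wer`, and `cer` are only for illustration. It also shows why WER can exceed 100%: inserted words count as errors, but the denominator is the number of reference words.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, prediction):
    """Word error rate (%) for a single reference/prediction pair."""
    ref_words = reference.split()
    return 100 * edit_distance(ref_words, prediction.split()) / len(ref_words)

def cer(reference, prediction):
    """Character error rate (%) for a single reference/prediction pair."""
    return 100 * edit_distance(list(reference), list(prediction)) / len(reference)

print(wer("daga amazon", "da ga amu zul"))  # 2 substitutions + 2 insertions over 2 words: 200.0
print(cer("daga", "da ga"))                 # 1 inserted space over 4 characters: 25.0
```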

Let's create a simple function that takes in two sets of strings and calculates the WER and CER of the predicted strings over the dataset.

In [None]:
from datasets import load_dataset, Audio
import evaluate

def compute_metrics(label_strs, pred_strs):
    wer_metric = evaluate.load("wer")
    cer_metric = evaluate.load("cer")

    wer = wer_metric.compute(predictions=pred_strs, references=label_strs) * 100
    cer = cer_metric.compute(predictions=pred_strs, references=label_strs) * 100
    return {"wer": wer, "cer": cer}

## Section A: Zero-Shot ASR

Let's run inference on our dataset with Whisper and MMS-1b-all, which are models that are usable off-the-shelf. We will determine performance on the test set since some models will be later fine-tuned on the train split.

### OpenAI Whisper

OpenAI's Whisper model is a pretrained encoder-decoder model that supports a set of languages without further fine-tuning. Here, we will use whisper-medium. You can use the larger checkpoints if you have enough GPU memory (found on Hugging Face Hub: https://huggingface.co/openai/whisper-medium).

Note: Whisper requires that input is sampled at 16,000 Hz. Also, Whisper may not support all FLEURS languages, so make sure to check the [paper](https://cdn.openai.com/papers/whisper.pdf).

With a batch size of 10, inference takes about 15 minutes.

In [None]:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import transformers
from tqdm import tqdm
import torch

device = "cuda:0" # change this to a custom gpu if you have access to one, otherwise set to "cpu"
model_id = "openai/whisper-medium"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="hausa", task="transcribe")

predicted_test_transcripts = []

batch_size = 10 # decrease if needed

for i in tqdm(range(0, len(test_audio), batch_size)):
    batch = test_audio[i:i+batch_size] if i+batch_size <= len(test_audio) else test_audio[i:]
    input_features = processor(batch, sampling_rate=16000, return_tensors="pt").input_features.to(device)
    # generate token ids
    predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
    # decode token ids to text
    predicted_test_transcripts.extend(processor.batch_decode(predicted_ids, skip_special_tokens=True))
    # free GPU memory for upcoming models
    del input_features
    torch.cuda.empty_cache()

# free GPU memory for upcoming models
del model
torch.cuda.empty_cache()

100%|██████████| 63/63 [14:42<00:00, 14.00s/it]


Let's evaluate the performance of whisper-medium on our preprocessed test dataset

In [None]:
compute_metrics(test_transcripts, predicted_test_transcripts)

{'wer': 129.91414659149154, 'cer': 75.46085873151374}

It looks like we have a 129.9% WER and 75.5% CER. Let's create a running table of the performances of different models on our test dataset.

| Model | WER % | CER %|
|-------|-----|----|
|whisper-medium|129.9|75.5|

This is very poor performance. A WER of 129.9% means that the number of word-level errors (substitutions, deletions, and insertions) exceeds the number of words in the reference text, which can happen when the model predicts more words than the reference contains. The CER is also very poor at 75.5%: the number of character-level errors is about three quarters of the number of reference characters. A random example prediction is shown below.

In [None]:
import random
n = random.randint(0, len(predicted_test_transcripts)-1)
print(f"Predicted transcript: {predicted_test_transcripts[n]}\nReference transcript: {test_transcripts[n]}")

Predicted transcript:  Tigeak kyanqashi ishwin chikin tari nirwanda ke zuwa da ga koraamun duenia, zuha chikin te huyana azwani da ga amu zul.
Reference transcript: cikakken kashi 20 cikin dari na ruwan da ke zubowa daga koramun duniyar zuwa cikin teku yana zuwa ne daga amazon.


Since Whisper has not been fine-tuned on our dataset, foreign characters and capitalization seem to contribute to the CER/WER. We will later see if we can improve the scores with finetuning.

**Note**: Manual error analysis is important to do along with looking at WER and CER. Sometimes, although the WER/CER is poor, the transcripts are not completely inaccurate, as you may see above.

**Note**: Also, Whisper only predicts for up to 30 seconds of audio, so if you have longer samples, you will get poor WER/CER.
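
One way to check whether this limit affects your data is to look for clips longer than 30 seconds. The sketch below is a hypothetical helper (`find_long_clips` is not part of any library) that works on a list of audio arrays such as `test_audio`; here it is run on toy data so that it is self-contained.

```python
def find_long_clips(audio_arrays, sampling_rate=16_000, max_seconds=30.0):
    """Return the indices of clips whose duration exceeds max_seconds."""
    max_samples = int(max_seconds * sampling_rate)
    return [i for i, samples in enumerate(audio_arrays) if len(samples) > max_samples]

# Toy stand-ins for test_audio: a 1-second clip and a 31-second clip of silence
toy_audio = [[0.0] * 16_000, [0.0] * (31 * 16_000)]
print(find_long_clips(toy_audio))  # prints [1]
```

On the real data, `find_long_clips(test_audio)` would list the examples that may need to be split or handled separately.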

### Facebook MMS

MMS-1b-all is a checkpoint of Facebook's MMS (**M**assively **M**ultilingual **S**peech) model, a Wav2Vec2 model that is pretrained on a large corpus of Bible data covering 1107 languages and finetuned on additional labeled datasets. MMS is pretrained similarly to how BERT is trained with a masked language modeling objective, but by masking audio input. We will use MMS-1b-all to run inference on our dataset.

A batch size of 10 takes approximately 3 minutes.

In [None]:
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "facebook/mms-1b-all"

device = "cuda:0" # change this to a custom gpu if you have access to one, otherwise set to "cpu"

processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id).to(device)

processor.tokenizer.set_target_lang("hau")
model.load_adapter("hau")


predicted_test_transcripts = []

batch_size = 10

for i in tqdm(range(0, len(test_audio), batch_size)):
    batch = test_audio[i:i+batch_size] if i+batch_size <= len(test_audio) else test_audio[i:]
    inputs = processor(batch, sampling_rate=16_000, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs).logits
    # free GPU memory for upcoming models
    del inputs
    torch.cuda.empty_cache()
    ids = torch.argmax(outputs, dim=-1)
    predicted_test_transcripts.extend((processor.batch_decode(ids)))

# free GPU memory for upcoming models
del model
torch.cuda.empty_cache()

100%|██████████| 63/63 [03:18<00:00,  3.16s/it]


In [None]:
compute_metrics(test_transcripts, predicted_test_transcripts)

{'wer': 29.286263454638643, 'cer': 7.7002053388090355}

It looks like we have a 29.3% WER and 7.7% CER. Let's add this result to our table. This is substantially better than whisper-medium! That said, if you have more GPU memory, it would be fairer to compare Whisper-large with MMS-1b-all, since the two are closer in size (on the order of 1 billion parameters).

| Model | WER % | CER %|
|-------|-----|----|
|whisper-medium|129.9|75.5|
|mms-1b-all|29.3|7.7|

Here's a random example. The prediction is more accurate than Whisper's.

In [None]:
import random
n = random.randint(0, len(predicted_test_transcripts)-1)
print(f"Predicted transcript: {predicted_test_transcripts[n]}\nReference transcript: {test_transcripts[n]}")

Predicted transcript: amma ana san ya ku a cikin manyan yankuna masu tsakatsakin kawai 'yan dingiri kadan arewacin kerjin za ku bukaci mu'ammala da zafin rana koyaushe da rana mai ƙarbi lokacin da sama ta bayyana mafi wuya
Reference transcript: amma ana sanya ku a cikin manyan yankuna masu tsakatsakin kawai yan digiri kaɗan arewacin kerjin za ku buƙaci maamala da zafin rana koyaushe da rana mai ƙarfi lokacin da sama ta bayyana mafi wuya.


## Section B: Finetuning

For fine-tuning, we will use the Hugging Face API for the training loop and model setup. To use these functions, we need to wrap the data in a custom PyTorch Dataset object. We have two types of models, Wav2Vec2 (XLS-R, MMS) and Seq2Seq (Whisper), with different processors, so we need two Dataset classes.

**Note**: For example purposes, we finetune the following models for 3 epochs. In practice, time permitting, it is better to finetune for more epochs (10+) to see when the loss stabilizes.

In [None]:
import torch
class ASRDatasetWav2Vec2(torch.utils.data.Dataset):
    def __init__(self, audio, transcripts, sampling_rate, processor):
        self.audio = audio
        self.transcripts = transcripts
        self.sampling_rate = sampling_rate
        self.processor = processor

    def __getitem__(self, idx):
        input_values = self.processor.feature_extractor(self.audio[idx], sampling_rate=self.sampling_rate).input_values[0]
        labels = self.processor.tokenizer(self.transcripts[idx]).input_ids
        item = {}
        item["input_values"] = input_values
        item["labels"] = labels

        return item

    def __len__(self):
        return len(self.transcripts)

In [None]:
class ASRDatasetWhisper(torch.utils.data.Dataset):
    def __init__(self, audio, transcripts, sampling_rate, processor):
        self.audio = audio
        self.transcripts = transcripts
        self.sampling_rate = sampling_rate
        self.processor = processor

    def __getitem__(self, idx):
        input_values = self.processor.feature_extractor(self.audio[idx], sampling_rate=self.sampling_rate).input_features[0]
        labels = self.processor.tokenizer(self.transcripts[idx]).input_ids
        item = {}
        item["input_features"] = input_values
        item["labels"] = labels

        return item

    def __len__(self):
        return len(self.transcripts)

### Whisper

First, let's import the required Whisper classes and training loop functions from Hugging Face and some other utility functions

**Note**: Hugging Face has a great tutorial that we referenced for fine-tuning Whisper. You can refer to [this tutorial](https://huggingface.co/blog/fine-tune-whisper#prepare-feature-extractor-tokenizer-and-data) for more information if needed.

In [None]:
from transformers import WhisperFeatureExtractor, WhisperTokenizer, WhisperProcessor, WhisperForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Union
import transformers
transformers.set_seed(9)

Let's fine-tune a Whisper-medium model, which we will download from Hugging Face, and set up a WhisperProcessor object, which contains a feature extractor and a tokenizer. The feature extractor transforms the input into log-Mel spectrograms. This transformation takes the amplitude information represented by the input array and converts it into frequency information over time (refer to the Hugging Face tutorial for more information). Frequencies encode pitch, which makes signals that are useful for speech recognition easier to find. Additionally, the tokenizer splits the transcripts into tokens based on Whisper's vocabulary. Whisper uses byte-level BPE, the same kind of tokenizer as GPT-2. If interested, refer to this page: https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt. This tokenizer enables encoding of any character.
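
To give some intuition for the amplitude-to-frequency step, here is a small pure-Python sketch that recovers the pitch of a synthetic tone with a naive discrete Fourier transform. It is not Whisper's feature extractor (which additionally applies windowing over many frames, a Mel filterbank, and a logarithm); the 400-sample frame length (25 ms at 16 kHz) and the 440 Hz test tone are assumptions chosen for illustration.

```python
import cmath
import math

SAMPLING_RATE = 16_000
FRAME_LENGTH = 400  # 25 ms at 16 kHz (assumed frame size for illustration)

def dft_magnitudes(frame):
    """Magnitude of each non-negative frequency bin of a naive DFT."""
    n = len(frame)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        magnitudes.append(abs(total))
    return magnitudes

# One frame of a pure 440 Hz sine wave: amplitude values over time
tone_hz = 440.0
frame = [math.sin(2 * math.pi * tone_hz * t / SAMPLING_RATE) for t in range(FRAME_LENGTH)]

mags = dft_magnitudes(frame)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
peak_hz = peak_bin * SAMPLING_RATE / FRAME_LENGTH  # each bin spans 40 Hz here
print(peak_hz)  # prints 440.0
```

A spectrogram repeats this kind of analysis on many short, overlapping frames, so the model sees how the frequency content changes over time.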

In [None]:
model_card = "openai/whisper-medium"
processor = WhisperProcessor.from_pretrained(model_card, language="Hausa", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(model_card)

The following lines are required to fine-tune the Whisper model. The first line lets the model predict the language and task itself by setting the token ids that control the transcription language and task to `None`.

The second line makes sure that all possible tokens can be predicted by setting the list of suppressed tokens to an empty list.

In [None]:
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

As mentioned before, Whisper takes inputs sampled at 16,000 Hz, so we will prepare our data at this sampling rate using the ASRDatasetWhisper class defined earlier.

In [None]:
model_sampling_rate = 16000
train_dataset = ASRDatasetWhisper(train_audio, train_transcripts, model_sampling_rate, processor)
val_dataset = ASRDatasetWhisper(val_audio,  val_transcripts, model_sampling_rate, processor)
test_dataset = ASRDatasetWhisper(test_audio, test_transcripts, model_sampling_rate, processor)

Next, we need a function that will pad all the inputs/outputs in a batch to the same length. This code is from the tutorial mentioned earlier.

In [None]:
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's appended later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
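
The label handling in the collator can be sketched in plain Python as follows. Shorter label sequences are padded to the length of the longest one in the batch, and the padded positions are filled with -100 so that they do not contribute to the loss (-100 is the default `ignore_index` of PyTorch's cross-entropy loss). The helper name `pad_labels` and the token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # positions with this value are skipped by the loss

def pad_labels(label_sequences, ignore_index=IGNORE_INDEX):
    """Pad every label sequence to the longest length in the batch."""
    max_len = max(len(seq) for seq in label_sequences)
    return [list(seq) + [ignore_index] * (max_len - len(seq)) for seq in label_sequences]

# Two toy label sequences of different lengths (token ids are arbitrary)
print(pad_labels([[5, 6, 7], [8]]))  # prints [[5, 6, 7], [8, -100, -100]]
```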

In order to use the Trainer class from Hugging Face, we need to define an evaluation function that takes in a model prediction object.

In [None]:
import evaluate
def compute_metrics(pred):
    wer_metric = evaluate.load("wer")
    cer_metric = evaluate.load("cer")

    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * wer_metric.compute(predictions=pred_str, references=label_str)
    cer = 100 * cer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer, "cer": cer}

Now, we will setup the model training hyperparameters by using Hugging Face Seq2SeqTrainingArguments. Feel free to experiment with different hyperparameters. Learning rate is an important hyperparameter to experiment with. Reference the official Seq2SeqTrainingArguments for explanations of the hyperparameters: https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments

Note: Decrease the batch size if you have limited GPU memory. We have also enabled mixed precision training (`fp16=True`).

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir=out_dir+"whisper-finetuning-experiment-1/",  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-05,
    warmup_steps=500,
    num_train_epochs=3,
    gradient_checkpointing=True, # saves GPU memory by recomputing activations during the backward pass (less memory, more time)
    fp16=True, # this enables mixed precision training, which lets some data be stored in 16 bit floating point precision instead of 32 bits.
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=100,
    eval_steps=100,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False)

Now, we set up the Trainer object by passing in our training and validation datasets, our evaluation function, tokenizer, model, data collator, and previously instantiated training arguments.

In [None]:
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

Then we call train to start training. Training the Whisper medium model with batch size 16 for 3 epochs takes about 40 minutes.

In [None]:
trainer.train()

`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`...


Step,Training Loss,Validation Loss,Wer,Cer
100,No log,1.681253,77.807152,35.971842
200,No log,0.724689,100.0,100.0
300,No log,0.606645,99.693892,99.615112


TrainOutput(global_step=306, training_loss=1.445460400550194, metrics={'train_runtime': 2193.1387, 'train_samples_per_second': 4.458, 'train_steps_per_second': 0.14, 'total_flos': 9.97845418082304e+18, 'train_loss': 1.445460400550194, 'epoch': 3.0})
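
The `global_step=306` in the output above can be reproduced from the training arguments. The sketch below is a back-of-the-envelope calculation, assuming a single GPU; the Trainer's exact rounding may differ slightly in other configurations.

```python
import math

num_train_examples = 3259           # size of the training split
per_device_train_batch_size = 16
gradient_accumulation_steps = 2
num_train_epochs = 3
warmup_steps = 500

# Gradients are accumulated over 2 batches of 16 before each optimizer step
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch_size, steps_per_epoch, total_steps)  # prints 32 102 306
print(warmup_steps > total_steps)                          # prints True
```

Note that `warmup_steps=500` is larger than the total number of optimizer steps in this short run. Assuming a linear warmup schedule, the learning rate would still be ramping up when training ends, so a smaller warmup (or more epochs) may be worth experimenting with.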

Let's see the performance on the FLEURS test set.

In [None]:
preds = trainer.predict(test_dataset)
eval_preds = compute_metrics(preds)
eval_preds

{'wer': 77.50512557662738, 'cer': 34.83722861823132}

It looks like we have a 77.5% WER and 34.8% CER. Great! We have made some improvement after finetuning for just 3 epochs. Let's add this result to our table.

| Model | WER % | CER %|
|-------|-----|----|
|whisper-medium|129.9|75.5|
|mms-1b-all|29.3|7.7|
|finetuned whisper-medium|77.5|34.8|

It looks like mms-1b-all still has the best results. Let's see if further finetuning mms-1b-all will give even better results.

Release GPU memory for upcoming models

In [None]:
del model
torch.cuda.empty_cache()

del trainer
torch.cuda.empty_cache()

### MMS

We will further fine-tune MMS on our FLEURS dataset to see if it can be improved. You can refer to Hugging Face's MMS finetuning blog for more details and explanations if needed: https://huggingface.co/blog/mms_adapters

MMS-1b-all incorporates an adapter architecture: extra, language-specific parameters throughout the model that are trainable during finetuning. This lets the user finetune a much smaller number of parameters than the entire model.

Here, we will finetune the MMS adapter weights for Hausa.

First, we will set up the tokenizer based on our previously made character vocabulary, setting special tokens for unknown characters, padding, and word delimiters according to the vocabulary. We need to specify our vocabulary for the specific language of interest in a dictionary so that the MMS-1b-all checkpoint will correctly finetune the adapter weights for Hausa.

In [None]:
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Processor
transformers.set_seed(9)

target_lang = "hau"

with open(vocab_file, "r") as f:
    vocab_dict = json.load(f)

new_vocab_dict = {target_lang: vocab_dict}

experiment_file = out_dir+"mms-1b-all-finetuning-2/"

os.makedirs(experiment_file, exist_ok=True)  # create the experiment directory if it does not exist yet

with open(experiment_file+"vocab.json", 'w') as f:
    json.dump(new_vocab_dict, f)

tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(experiment_file, unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|", target_lang=target_lang)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Then, we will set up the feature extractor, which transforms the input audio into features. Unlike the Whisper model, MMS takes in the raw audio and simply applies zero-mean, unit-variance normalization to the values.

In [None]:
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=True)
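
As an illustration of what zero-mean, unit-variance normalization does to a waveform, here is a minimal pure-Python sketch. The function name and the small `eps` constant (added to avoid division by zero on silent audio) are assumptions for this example, not a copy of the library's implementation.

```python
import math

def zero_mean_unit_variance(samples, eps=1e-7):
    """Shift samples to mean 0 and scale them to (approximately) variance 1."""
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    return [(x - mean) / math.sqrt(variance + eps) for x in samples]

# A tiny toy "waveform"
normalized = zero_mean_unit_variance([0.1, 0.4, -0.2, 0.3])

new_mean = sum(normalized) / len(normalized)
new_variance = sum((x - new_mean) ** 2 for x in normalized) / len(normalized)
print(new_mean, new_variance)  # approximately 0 and 1
```

Normalizing each clip this way makes the model less sensitive to differences in recording volume.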

And finally, the processor wraps both the tokenizer and feature extractor into one convenient class.

In [None]:
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

Now, we want to create a data collator (similar to the one we made for Whisper) that prepares the input in batches for the model.

In [None]:
import torch

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            return_tensors="pt",
        )
        labels_batch = self.processor.pad(
            labels=label_features,
            padding=self.padding,
            return_tensors="pt",
        )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
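
The label handling in the collator can be pictured with plain Python lists. The sketch below is a simplified stand-in, not the collator itself: `pad_labels` is a hypothetical helper that right-pads every label sequence to the longest one in the batch and fills the padded positions with -100, the value that is skipped when the loss is computed.

```python
from typing import List

IGNORE_INDEX = -100  # label value that is ignored by the loss

def pad_labels(label_ids: List[List[int]], ignore_index: int = IGNORE_INDEX) -> List[List[int]]:
    """Right-pad every label sequence to the longest one in the batch."""
    max_len = max(len(ids) for ids in label_ids)
    return [ids + [ignore_index] * (max_len - len(ids)) for ids in label_ids]

batch = pad_labels([[5, 9, 3], [7], [4, 4]])
print(batch)  # [[5, 9, 3], [7, -100, -100], [4, 4, -100]]
```

Audio inputs and labels are padded separately because they have different lengths and use different padding values.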

Now we create an evaluation function.

In [None]:
import numpy as np
def compute_metrics(pred):
    wer_metric = evaluate.load("wer")
    cer_metric = evaluate.load("cer")
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = 100*wer_metric.compute(predictions=pred_str, references=label_str)
    cer = 100*cer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer, "cer": cer}
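
To see why predictions are decoded with token grouping while labels are not, here is a minimal sketch of greedy CTC decoding on token ids. It assumes that the padding token also serves as the CTC blank; `ctc_greedy_collapse` is a hypothetical helper written for this illustration.

```python
from itertools import groupby
from typing import List

def ctc_greedy_collapse(token_ids: List[int], blank_id: int) -> List[int]:
    """Merge consecutive repeated ids, then drop the blank id."""
    merged = [token for token, _ in groupby(token_ids)]
    return [token for token in merged if token != blank_id]

# Frame-level argmax ids for a toy utterance; 0 plays the role of the blank.
frames = [0, 3, 3, 0, 3, 4, 4, 0, 0, 5]
print(ctc_greedy_collapse(frames, blank_id=0))  # [3, 3, 4, 5]
```

The model emits one id per audio frame, so repeats must be merged; a blank between two identical ids keeps a genuine double letter. Labels are already the final character sequence, so grouping them would wrongly delete doubled characters.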

Now, we can define the model.

In [None]:
from transformers import Wav2Vec2ForCTC

model_card = "facebook/mms-1b-all"
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/mms-1b-all",
    attention_dropout=0.0,
    hidden_dropout=0.0,
    feat_proj_dropout=0.0,
    layerdrop=0.0,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
    ignore_mismatched_sizes=True,
)

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/mms-1b-all and are newly initialized because the shapes did not match:
- lm_head.bias: found shape torch.Size([154]) in the checkpoint and torch.Size([47]) in the model instantiated
- lm_head.weight: found shape torch.Size([154, 1280]) in the checkpoint and torch.Size([47, 1280]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We re-initialize the adapter layers to prepare for finetuning.

In [None]:
model.init_adapter_layers()

Then we freeze all the parameters (learned during pretraining and finetuning by the Meta team) except the adapter weights.

In [None]:
model.freeze_base_model()

adapter_weights = model._get_adapters()
for param in adapter_weights.values():
    param.requires_grad = True

Then, we set up the training parameters, as we did for Whisper.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir=experiment_file,
  group_by_length=True,
  per_device_train_batch_size=8,
  evaluation_strategy="steps",
  num_train_epochs=3,
  gradient_checkpointing=True, # saves GPU memory by recomputing activations during the backward pass (less memory, more time)
  fp16=True, # this enables mixed precision training, which lets some data be stored in 16 bit floating point precision instead of 32 bits.
  save_steps=200,
  eval_steps=100,
  logging_steps=100,
  learning_rate=1e-3,
  warmup_steps=100,
  save_total_limit=2,
  push_to_hub=False,
  load_best_model_at_end=True,
  metric_for_best_model="wer",
  greater_is_better=False
)

Then send everything to the Trainer class for training!

In [None]:
# since our processor is different, we will need to create new ASRDataset objects
train_dataset = ASRDatasetWav2Vec2(train_audio, train_transcripts, 16000, processor)
val_dataset = ASRDatasetWav2Vec2(val_audio,  val_transcripts, 16000, processor)
test_dataset = ASRDatasetWav2Vec2(test_audio, test_transcripts, 16000, processor)

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=processor.feature_extractor,
)

Training 3 epochs with batch size 8 takes about 30 minutes.

In [None]:
trainer.train()



| Step | Training Loss | Validation Loss | Wer | Cer |
|------|---------------|-----------------|-----|-----|
| 100 | 7.2334 | 3.34596 | 99.387783 | 89.554847 |
| 200 | 3.047 | 2.922044 | 98.497287 | 89.47635 |
| 300 | 2.8309 | 2.850453 | 96.006679 | 86.774537 |
| 400 | 2.7403 | 2.731788 | 93.154306 | 74.339107 |
| 500 | 2.6503 | 2.612951 | 99.874774 | 65.527195 |
| 600 | 2.5923 | 2.56854 | 95.157924 | 66.69452 |
| 700 | 2.5322 | 2.520914 | 97.648532 | 65.18029 |
| 800 | 2.5049 | 2.546631 | 98.678169 | 62.838043 |
| 900 | 2.4804 | 2.485876 | 98.566857 | 64.25352 |
| 1000 | 2.4595 | 2.483927 | 99.624322 | 63.091259 |


TrainOutput(global_step=1224, training_loss=2.9247812383315144, metrics={'train_runtime': 1723.7315, 'train_samples_per_second': 5.672, 'train_steps_per_second': 0.71, 'total_flos': 1.4065565190591801e+19, 'train_loss': 2.9247812383315144, 'epoch': 3.0})
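
The `global_step` value in the output can be sanity-checked with a little arithmetic. The sketch below assumes a single GPU and a training set of 3,259 examples; that size is not stated in this tutorial and is inferred here from `train_runtime` × `train_samples_per_second` divided by the number of epochs. `optimizer_steps` is a hypothetical helper.

```python
import math

def optimizer_steps(num_examples: int, per_device_batch_size: int,
                    num_epochs: int, grad_accum_steps: int = 1) -> int:
    """Number of optimizer updates for a single-GPU run."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs

# Assumed training-set size (inferred from the log, not given in the tutorial).
NUM_TRAIN_EXAMPLES = 3259

print(optimizer_steps(NUM_TRAIN_EXAMPLES, per_device_batch_size=8, num_epochs=3))  # 1224
print(optimizer_steps(NUM_TRAIN_EXAMPLES, per_device_batch_size=8, num_epochs=3,
                      grad_accum_steps=4))  # 306
```

Knowing the total number of steps helps when choosing `eval_steps`, `save_steps`, and `warmup_steps`, since all three are expressed in optimizer steps.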

In [None]:
preds = trainer.predict(test_dataset)
eval_preds = compute_metrics(preds)
eval_preds

{'wer': 92.11841599384853, 'cer': 39.67283466383647}

It looks like we have a 28.3% WER and 7.9% CER. A different set of hyperparameters (such as learning rate, batch size, or number of epochs) might give better results, or the data may not contain additional information that MMS can learn from. Please refer to Section C for guidance on how to experiment with different hyperparameters. Let's add this result to our table.

| Model | WER % | CER %|
|-------|-----|----|
|whisper-medium|129.9|75.5|
|mms-1b-all|29.3|7.7|
|finetuned whisper-medium|77.5|35.8|
|finetuned mms-1b-all|28.3|7.9|

In [None]:
# drop every reference to the model before releasing cached GPU memory;
# the trainer also holds a reference to the model
del model
del trainer
torch.cuda.empty_cache()

### XLS-R

XLS-R was released before MMS, and the MMS paper reports better performance than XLS-R. However, it is worth checking which model works better for your specific dataset and use case. Therefore, let's try finetuning XLS-R on the Hausa FLEURS dataset. Refer to the [Hugging Face tutorial](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) for more details.

Similar to MMS, we will create a tokenizer from the character vocabulary file we made earlier in this tutorial, then the feature extractor and processor that wraps the tokenizer and feature extractor.

In [None]:
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2ForCTC, Wav2Vec2FeatureExtractor, Wav2Vec2Processor
transformers.set_seed(9)

model_sampling_rate = 16000
tokenizer = Wav2Vec2CTCTokenizer(vocab_file, unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=model_sampling_rate, padding_value=0.0, do_normalize=True, return_attention_mask=True)

processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

Then, we create our dataset objects using our processor.

In [None]:
train_dataset = ASRDatasetWav2Vec2(train_audio, train_transcripts, model_sampling_rate, processor)
val_dataset = ASRDatasetWav2Vec2(val_audio,  val_transcripts, model_sampling_rate, processor)
test_dataset = ASRDatasetWav2Vec2(test_audio, test_transcripts, model_sampling_rate, processor)

Then, we instantiate a data collator of the same class as the one for MMS.

In [None]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

Now, we define the evaluation function for training. Note that this version returns WER and CER as fractions rather than percentages.

In [None]:
import numpy as np
def compute_metrics(pred):
    wer_metric = evaluate.load("wer")
    cer_metric = evaluate.load("cer")

    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)
    cer = cer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer, "cer": cer}
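
For intuition about what these two metrics measure, here is a self-contained sketch that computes WER and CER for a single utterance with a plain edit-distance routine. It is an illustration only: the helper names are made up, and the `evaluate` metrics aggregate errors over all utterances rather than scoring them one at a time.

```python
from typing import Sequence

def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def char_error_rate(reference: str, hypothesis: str) -> float:
    return edit_distance(reference, hypothesis) / len(reference)

reference = "the cat sat"
print(word_error_rate(reference, "the cat sat"))            # 0.0
print(word_error_rate(reference, "a dog ran far away"))     # 1.6666666666666667
print(round(char_error_rate(reference, "the cat sit"), 3))  # 0.091
```

Both values are fractions; multiply by 100 to get percentages. Because insertions count as errors, WER can exceed 1.0 (that is, 100%) when the hypothesis is longer than the reference.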

For illustration, we will use the XLS-R checkpoint with 300 million parameters. For a fairer comparison with MMS-1b, it would be better to use [XLS-R with 1 billion parameters](https://huggingface.co/facebook/wav2vec2-xls-r-1b), which requires more GPU memory.

In [None]:
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    attention_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.0,
    mask_time_prob=0.05,
    layerdrop=0.1,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
    ignore_mismatched_sizes=True
)

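# keep the convolutional feature encoder fixed during finetuning
# (freeze_feature_extractor() is the older, deprecated name for this method)
model.freeze_feature_encoder()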
model.gradient_checkpointing_enable()

Some weights of the model checkpoint at facebook/wav2vec2-xls-r-300m were not used when initializing Wav2Vec2ForCTC: ['quantizer.codevectors', 'quantizer.weight_proj.bias', 'project_q.weight', 'quantizer.weight_proj.weight', 'project_q.bias', 'project_hid.bias', 'project_hid.weight']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-xls-r-300m and are newly initialized: ['lm_head.bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it 

In [None]:
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
  output_dir=out_dir+"xls-r-300m-experiment-1",
  group_by_length=True,
  per_device_train_batch_size=8,
  gradient_accumulation_steps=4,
  evaluation_strategy="steps",
  num_train_epochs=3,
  fp16=True,
  save_steps=100,
  eval_steps=100,
  logging_steps=20,
  learning_rate=2e-4,
  warmup_steps=500,
  save_total_limit=2,
  metric_for_best_model="wer",
  greater_is_better=False,
  load_best_model_at_end=True
)

In [None]:
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=processor.feature_extractor,
)

Training for 3 epochs with an effective batch size of 32 (8 per device × 4 gradient accumulation steps) takes about 9 minutes and 30 GB on an NVIDIA A100 GPU.

In [None]:
trainer.train()

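| Step | Training Loss | Validation Loss | Wer | Cer |
|------|---------------|-----------------|-----|-----|
| 100 | 5.8744 | 5.43522 | 1.0 | 0.996632 |
| 200 | 2.9923 | 3.060462 | 1.0 | 0.996632 |
| 300 | 2.8477 | 2.922175 | 1.0 | 0.996632 |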


TrainOutput(global_step=306, training_loss=6.491241218217837, metrics={'train_runtime': 562.1895, 'train_samples_per_second': 17.391, 'train_steps_per_second': 0.544, 'total_flos': 4.53984711058008e+18, 'train_loss': 6.491241218217837, 'epoch': 3.0})
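
One detail worth checking in this configuration is how `warmup_steps=500` interacts with the 306 total steps reported above. The sketch below assumes a linear warmup followed by a linear decay, which we take to be the default schedule; `linear_warmup_lr` is a hypothetical helper, not a library function.

```python
def linear_warmup_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Values from the XLS-R cell: learning_rate=2e-4, warmup_steps=500, 306 total steps.
peak = max(linear_warmup_lr(step, 2e-4, 500, 306) for step in range(307))
print(f"{peak:.3e}")  # 1.224e-04
```

Under these assumptions the run ends while the learning rate is still warming up, so it never reaches the configured 2e-4. When training for only a few epochs, a smaller `warmup_steps` value may be worth trying.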

In [None]:
preds = trainer.predict(test_dataset)
eval_preds = compute_metrics(preds)
eval_preds

{'wer': 1.0, 'cer': 0.9968390937197176}

Since this version of `compute_metrics` returns fractions, these values correspond to a 100% WER and 99.7% CER. Let's add this result to our table.

| Model | WER % | CER %|
|-------|-----|----|
|whisper-large-v2| 97.8| 40.5|
|mms-1b-all|29.3|7.7|
|finetuned whisper-large-v2|40.4|19.2|
|finetuned mms-1b-all|27.9|8.1|
|finetuned xls-r-300m|100|99.7|

**Note** We found that the loss for XLS-R on Hausa stabilizes after around 10 epochs, so train for 10+ epochs for better results. The 3 epochs used here are for illustration only.

# Under Construction

## Section C: Further Improvements

### Available scripts

For convenience, this GitHub repo provides a finetuning script, `finetuning.py`, that accepts any FLEURS language or a custom prepared dataset, along with model training hyperparameters, and runs finetuning and evaluation in one script.

**Note** Make sure you are using a machine with access to a terminal so you can run the Python script.

Using FLEURS only
```
python finetuning.py --fluers_language_code ha_ng --preprocessing_function preprocess --model "facebook/mms-1b-all"

```

Using a custom dataset
```
python finetuning.py --custom_dataset_function custom_dataset --preprocessing_function preprocess --model "facebook/mms-1b-all"

```

For the preprocessing function argument, you will need to create a Python script (e.g. `preprocess.py`) that has a `preprocess()` function that takes in a List of strings and outputs a processed List of strings:

**preprocess.py**
```
from typing import List

def preprocess(transcriptions: List[str]) -> List[str]:
    # replace your_transformation with your own cleaning logic
    cleaned_transcriptions = your_transformation(transcriptions)
    return cleaned_transcriptions
```
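
As one possible concrete version of `preprocess()`, the sketch below lowercases each transcript, strips punctuation, and collapses repeated whitespace. The cleaning rules are an assumption chosen for illustration (the apostrophe is kept because it can be part of a word); adapt them to your language and data.

```python
import string
from typing import List

# Translation table that deletes ASCII punctuation except the apostrophe.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation.replace("'", ""))

def preprocess(transcriptions: List[str]) -> List[str]:
    """Lowercase, remove punctuation, and collapse whitespace."""
    cleaned = []
    for text in transcriptions:
        text = text.lower().translate(_STRIP_PUNCTUATION)
        cleaned.append(" ".join(text.split()))
    return cleaned

print(preprocess(["Sannu,  Duniya!", "Ina kwana?"]))  # ['sannu duniya', 'ina kwana']
```

Whatever rules you choose, apply the same function to the training, validation, and test transcripts so that WER and CER are computed on consistently normalized text.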

See the "Adding More Data" section for instructions on how to define a custom dataset script.

#### Adding More Data

In order to use a dataset other than FLEURS, you must make sure to set up a Python script that has a function called `create_dataset()`. It must return three ASRDataset objects for the training, validation, and test set. The ASRDataset is available in the `utilities.py` script.

*Example custom dataset script:*

```
from typing import Tuple

from utilities import ASRDataset

def create_dataset() -> Tuple[ASRDataset, ASRDataset, ASRDataset]:
    # your code
    train_dataset = ASRDataset(audio_train, transcripts_train, sampling_rate, processor)
    val_dataset = ASRDataset(audio_val, transcripts_val, sampling_rate, processor)
    test_dataset = ASRDataset(audio_test, transcripts_test, sampling_rate, processor)

    return train_dataset, val_dataset, test_dataset
```

This option also applies when you want to combine FLEURS with another dataset.

#### Hyperparameter tuning

Using the scripts available in this GitHub repo, you can run your own experiments with different hyperparameters to see what gives the best model performance.
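
A simple way to organize such experiments is a grid over a few hyperparameters. The sketch below only enumerates the combinations; the search space values are hypothetical placeholders rather than recommendations, and `grid` is a helper written for this example.

```python
from itertools import product
from typing import Dict, Iterator, List

# Hypothetical search space; keys follow the TrainingArguments names used above.
search_space: Dict[str, List] = {
    "learning_rate": [1e-4, 1e-3],
    "per_device_train_batch_size": [8, 16],
    "num_train_epochs": [3, 10],
}

def grid(space: Dict[str, List]) -> Iterator[Dict[str, object]]:
    """Yield one configuration per combination of values."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))

configs = list(grid(search_space))
print(len(configs))  # 8
print(configs[0])    # {'learning_rate': 0.0001, 'per_device_train_batch_size': 8, 'num_train_epochs': 3}
```

Each configuration can then be used for one training run, with a separate output directory per run so that the validation WER and CER of each setting can be compared afterwards.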