# Fine-tune a wav2vec 2.0 Checkpoint for Automatic Speech Recognition on IPUs

This notebook will demonstrate how to fine-tune a pre-trained wav2vec 2.0 model with PyTorch on Graphcore IPUs. We will use a `wav2vec2-base` model and fine-tune it for a CTC downstream task using the [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) dataset.

We will show how to use a wav2vec 2.0 model written in PyTorch from the [🤗 Transformers library](https://huggingface.co/docs/transformers/index) and parallelize it easily using the [🤗 Optimum Graphcore library](https://github.com/huggingface/optimum-graphcore).

🤗 provides convenient access to pre-trained transformer models. The partnership between 🤗 and Graphcore allows us to run these models on the IPU.

🤗 models ported to the IPU can be found on the [Graphcore Hugging Face organisation page](https://huggingface.co/Graphcore).

|  Domain | Tasks | Model | Datasets | Workflow |   Number of IPUs   | Execution time |
|---------|-------|-------|----------|----------|--------------|--------------|
| Speech | ASR | wav2vec 2.0 | LibriSpeech (librispeech_asr) | Training | recommended: 16XX (min: 4X) | 20Xmn (X1h20mn)   |

[![Run on Gradient](../images/gradient-badge.svg)](https://ipu.dev/3CGkbMq)  [![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)

## Background

Automatic speech recognition (ASR), the task of transcribing audio automatically, has historically required large amounts of labelled data. Additionally, these systems predominantly used fixed feature extraction methods that do not learn from the raw signal, for example the short-time Fourier transform or Mel-frequency cepstral coefficients. Research conducted by Facebook AI (now Meta AI) introduced a [framework for self-supervised learning of speech representations](https://arxiv.org/abs/2006.11477). In other words, the authors demonstrated a pre-training phase and architecture that can learn feature representations, and their relationships, by leveraging large amounts of unlabelled, raw audio data.

There are two phases to training: pre-training on unlabelled data, and fine-tuning on a downstream task. In the original paper, the model is fine-tuned for ASR using a connectionist temporal classification (CTC) loss. The modules shared between pre-training and fine-tuning are what you'd expect to see in a CTC system: a feature extractor and an encoder. Unlike many earlier models, however, the feature extractor is a convolutional neural network, which makes it trainable. It is followed by a BERT-style encoder in which a large convolutional block placed before the first layer provides positional information, rather than sinusoidal positional encoding.
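
To give a sense of how the convolutional feature extractor shortens the raw waveform, the sketch below computes how many feature frames it would produce for a given number of audio samples. It assumes the kernel sizes and strides of the default `wav2vec2-base` configuration and no padding; the helper name `num_output_frames` is hypothetical and not part of any library.

```python
# Assumed values: kernel sizes and strides of the seven convolutional layers
# in the default wav2vec2-base configuration (no padding).
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Return the number of frames produced from `num_samples` raw audio samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        # Standard output-length formula for an unpadded 1D convolution.
        length = (length - kernel) // stride + 1
    return length


print(num_output_frames(16000))   # one second of 16 kHz audio -> 49
print(num_output_frames(249600))  # 15.6 seconds of 16 kHz audio -> 779
```

Under these assumptions the overall stride is 320 samples, so each frame advances by roughly 20 ms of audio. This is the resolution at which the CTC output layer emits characters.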

## Environment setup

The best way to run this demo is on Paperspace Gradient's cloud IPUs because everything is already set up for you.

[![Run on Gradient](../images/gradient-badge.svg)](https://ipu.dev/3CGkbMq)

To run the demo using other IPU hardware, you need to have the Poplar SDK enabled. Refer to the [Getting Started guide](https://docs.graphcore.ai/en/latest/getting-started.html#getting-started) for your system for details on how to enable the Poplar SDK. Also refer to the [Jupyter Quick Start guide](https://docs.graphcore.ai/projects/jupyter-notebook-quick-start/en/latest/index.html) for how to set up Jupyter to be able to run this notebook on a remote IPU machine.

Install the dependencies for this notebook.

In [None]:
%%bash
apt update
apt-get install libsndfile1 -y

In [None]:
%pip install -r requirements.txt

## Utility imports and configuration
We start by importing the utilities that will be used later in the notebook: 

In [None]:
import functools
import json
import logging
import os
import re
import sys
import warnings
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union

import datasets
import numpy as np
import torch
from datasets import DatasetDict, load_dataset, load_metric
from pathlib import Path
import transformers
from optimum.graphcore import IPUConfig, IPUTrainer
from optimum.graphcore import IPUTrainingArguments
from transformers import (
    AutoConfig,
    AutoFeatureExtractor,
    AutoModelForCTC,
    AutoProcessor,
    AutoTokenizer,
    HfArgumentParser,
    Wav2Vec2Processor,
    set_seed,
)
from transformers.trainer_utils import get_last_checkpoint
from transformers.utils import check_min_version
from transformers.utils.versions import require_version

In [None]:
set_seed(0)

Values for machine size and cache directories can be configured through environment variables or directly in the notebook:

In [None]:
import os

pod_type = os.getenv("GRAPHCORE_POD_TYPE", "pod4")
executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "/tmp/exe_cache/") + "/wav2vec2_fine_tuning"
checkpoint_directory = Path(os.getenv("CHECKPOINT_DIR", "/tmp")) / "demo"

## Preparing the LibriSpeech dataset

The [🤗 Datasets](https://huggingface.co/docs/datasets/index) library can be used to conveniently load the LibriSpeech dataset, and the library provides easy-to-use tools to process the data.

First, we create a `DatasetDict` to hold our data, and then load the LibriSpeech splits for training and validation. For this notebook we will use `train.100`, which contains 100 hours of clean training data. Section C of the appendix in the paper on the [framework for self-supervised learning of speech representations](https://arxiv.org/abs/2006.11477) suggests that fine-tuning a `Base` model on this split can yield a 6.1% word error rate (WER) without an additional language model.

In [None]:
raw_datasets = DatasetDict()
raw_datasets["train"] = load_dataset("librispeech_asr", "clean", split="train.100")
raw_datasets["eval"] = load_dataset("librispeech_asr", "clean", split="validation")

In [None]:
raw_datasets

### Text normalisation

Using the `map` function of 🤗 Datasets, we remove special characters from each transcription and then lower-case the result. This means the model does not have to learn punctuation or capitalisation. Although the model may be able to learn both, training is easier if it does not need to.

Text normalisation can also include other steps, such as converting digits into their spelled-out form. This is not done in this notebook because the LibriSpeech transcripts already spell out numbers.
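
The normalisation described above can be sketched on a single string. This is a minimal illustration rather than the notebook's own cell: the character list is a shortened, assumed subset, and the names `CHARS_TO_IGNORE`, `IGNORE_PATTERN` and `normalise` are hypothetical.

```python
import re

# Assumed subset of characters to strip, chosen for illustration.
CHARS_TO_IGNORE = [",", "?", ".", "!", "-", ";", ":", '"', "%"]
# Build a character class so that any single listed character matches.
IGNORE_PATTERN = re.compile("[" + "".join(re.escape(c) for c in CHARS_TO_IGNORE) + "]")


def normalise(text: str) -> str:
    """Remove the ignored characters, lower-case, and append a trailing space."""
    return IGNORE_PATTERN.sub("", text).lower() + " "


print(repr(normalise("Hello, World!")))  # 'hello world '
```

Escaping each character with `re.escape` avoids the hyphen being read as a range inside the character class.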

In [None]:
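# Wrap the escaped characters in a character class so that each one is matched individually
chars_to_ignore = [",", "?", ".", "!", "-", ";", ":", "\"", "“", "%", "‘", "”", "�"]
chars_to_ignore_regex = "[" + "".join(re.escape(char) for char in chars_to_ignore) + "]"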
text_column_name = "text"


def remove_special_characters(batch):
    if chars_to_ignore_regex is not None:
        batch["target_text"] = re.sub(chars_to_ignore_regex, "", batch[text_column_name]).lower() + " "
    else:
        batch["target_text"] = batch[text_column_name].lower() + " "
    return batch


raw_datasets = raw_datasets.map(
    remove_special_characters,
    remove_columns=[text_column_name],
    desc="remove special characters from datasets",
)

### Create vocabulary and tokenizer

We now create a vocabulary from the dataset. This will find all the unique characters from all the normalised text in the datasets.
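
As a small, self-contained sketch of the idea, the snippet below builds a character-level vocabulary from two toy transcripts. The helper name `build_char_vocab` and the example strings are made up for illustration.

```python
from typing import Dict, List


def build_char_vocab(transcripts: List[str]) -> Dict[str, int]:
    """Map every unique character in `transcripts` to an integer index."""
    unique_chars = set("".join(transcripts))
    # Sorting makes the mapping deterministic across runs.
    return {char: index for index, char in enumerate(sorted(unique_chars))}


toy_vocab = build_char_vocab(["hello world ", "good day "])
print(toy_vocab)
```

The space character sorts first and so receives index 0 in this toy vocabulary.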

In [None]:
def create_vocabulary_from_data(
        datasets: DatasetDict,
        word_delimiter_token=None,
        unk_token=None,
        pad_token=None,
):
    # Given training and test labels create vocabulary
    def extract_all_chars(batch):
        all_text = " ".join(batch["target_text"])
        vocab = list(set(all_text))
        return {"vocab": [vocab], "all_text": [all_text]}

    vocabs = datasets.map(
        extract_all_chars,
        batched=True,
        batch_size=-1,
        keep_in_memory=True,
        remove_columns=datasets["train"].column_names,
    )

    # take union of all unique characters in each dataset
    vocab_set = functools.reduce(
        lambda vocab_1, vocab_2: set(vocab_1["vocab"][0]) | set(vocab_2["vocab"][0]), vocabs.values()
    )

    vocab_dict = {v: k for k, v in enumerate(sorted(list(vocab_set)))}

    # replace white space with delimiter token
    if word_delimiter_token is not None:
        vocab_dict[word_delimiter_token] = vocab_dict[" "]
        del vocab_dict[" "]

    # add unk and pad token
    if unk_token is not None:
        vocab_dict[unk_token] = len(vocab_dict)

    if pad_token is not None:
        vocab_dict[pad_token] = len(vocab_dict)

    return vocab_dict


word_delimiter_token = "|"
unk_token = "[UNK]"
pad_token = "[PAD]"

vocab_dict = create_vocabulary_from_data(raw_datasets,
                                         word_delimiter_token=word_delimiter_token,
                                         unk_token=unk_token,
                                         pad_token=pad_token)

In [None]:
vocab_dict

With the vocabulary generated from the normalised transcripts, we create a tokenizer which is included in the [🤗 Transformers](https://huggingface.co/docs/transformers/index) library. This will later be used to encode text into indexes, and decode indexes into text.
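
To make the encoding and decoding step concrete, here is a deliberately simplified sketch of a character-level tokenizer. It is not the 🤗 Transformers implementation: `TOY_VOCAB`, `encode` and `decode` are hypothetical names, and the vocabulary is invented for the example.

```python
from typing import Dict, List

# Toy vocabulary for illustration only; the real one is built from the dataset.
TOY_VOCAB: Dict[str, int] = {"|": 0, "a": 1, "c": 2, "t": 3, "[UNK]": 4, "[PAD]": 5}
INDEX_TO_TOKEN = {index: token for token, index in TOY_VOCAB.items()}


def encode(text: str) -> List[int]:
    """Convert text to indexes, one per character."""
    # Spaces are represented by the word delimiter token.
    tokens = ["|" if char == " " else char for char in text]
    # Characters missing from the vocabulary map to the unknown token.
    return [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]


def decode(indexes: List[int]) -> str:
    """Convert indexes back to text, restoring spaces from the delimiter."""
    return "".join(INDEX_TO_TOKEN[i] for i in indexes).replace("|", " ")


ids = encode("a cat 2")
print(ids)          # [1, 0, 2, 1, 3, 0, 4]
print(decode(ids))  # a cat [UNK]
```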

In [None]:
tokenizer_name_or_path = "/tmp/wav2vec2-notebook"

vocab_file = os.path.join(tokenizer_name_or_path, "vocab.json")

if os.path.isfile(vocab_file):
    os.remove(vocab_file)

os.makedirs(tokenizer_name_or_path, exist_ok=True)

with open(vocab_file, "w") as file:
    json.dump(vocab_dict, file)

tokenizer_kwargs = {
    "config": None,
    "tokenizer_type": "wav2vec2",
    "unk_token": unk_token,
    "pad_token": pad_token,
    "word_delimiter_token": word_delimiter_token,
}

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path, use_auth_token=False, **tokenizer_kwargs)

Let's look at an example of using the tokenizer. The vocabulary does not contain any digits, so the digits in the example are mapped to `[UNK]`. Remember that special characters (such as commas) have already been removed from the dataset.

In [None]:
tokenizer("wav2vec2 finetuning on ipu")

In [None]:
tokenizer.decode(tokenizer("wav2vec2 finetuning on ipu").input_ids)

### Feature extraction

Now we create the feature extractor for the model. Because the model learns from the raw audio signal, no hand-crafted features are computed: the audio only needs to be at the sampling rate the model expects. If the dataset's sampling rate differs from 16 kHz, the audio column is cast so that the audio is resampled when it is loaded.

Afterwards, we set the minimum and maximum input lengths in samples. These correspond to 2.0 and 15.6 seconds, that is, 32,000 and 249,600 samples at a sampling frequency of 16 kHz.

In [None]:
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

dataset_sampling_rate = next(iter(raw_datasets.values())).features["audio"].sampling_rate
if dataset_sampling_rate != 16000:
    raw_datasets = raw_datasets.cast_column("audio", datasets.features.Audio(sampling_rate=16000))

max_input_length = int(15.6 * feature_extractor.sampling_rate)
min_input_length = int(2.0 * feature_extractor.sampling_rate)

### Prepare dataset

In this step, feature extraction and tokenization are applied to the audio and the transcript, respectively. The feature extractor converts the audio into the model's input values, and the tokenizer converts the normalised text into indexes.

After the `map` function has completed, the dataset is filtered by audio length: any example whose raw audio is not between 2.0 and 15.6 seconds long is removed. The result of the filtering is cached.

In [None]:
def prepare_dataset(batch):
    sample = batch["audio"]

    inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"])
    batch["input_values"] = inputs.input_values[0]
    batch["input_length"] = len(inputs.input_values[0])

    batch["labels"] = tokenizer(batch["target_text"]).input_ids

    return batch


def is_audio_in_length_range(length):
    # Treat a missing or non-numeric length as out of range
    try:
        return min_input_length < length < max_input_length
    except TypeError:
        return False


vectorized_datasets = raw_datasets.map(prepare_dataset,
                                       remove_columns=raw_datasets["train"].column_names,
                                       num_proc=8,
                                       desc="preprocess datasets")

vectorized_datasets = vectorized_datasets.filter(is_audio_in_length_range,
                                                 input_columns=["input_length"],
                                                 num_proc=8)

## Data loading

With the dataset prepared, most of the processing is complete and the data is almost ready to be sent to the model. The role of the collator is to pad the audio and the encoded text to a static size. The padding value for the audio is `0.0`, whereas the label indexes are padded with `-100` so that padding cannot be confused with an index in the vocabulary and is ignored by the loss.
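
The padding behaviour can be sketched with plain Python lists. This is only an illustration of the idea, not the collator used in this notebook; `pad_to_multiple` is a hypothetical helper and the toy values are made up.

```python
from typing import List, Union

Number = Union[int, float]


def pad_to_multiple(sequence: List[Number], multiple: int, pad_value: Number) -> List[Number]:
    """Right-pad `sequence` with `pad_value` up to the next multiple of `multiple`."""
    remainder = len(sequence) % multiple
    if remainder == 0:
        return list(sequence)
    return list(sequence) + [pad_value] * (multiple - remainder)


audio = [0.1, -0.2, 0.3]  # toy audio samples
labels = [7, 4, 11]       # toy label indexes

print(pad_to_multiple(audio, 8, 0.0))    # [0.1, -0.2, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0]
print(pad_to_multiple(labels, 5, -100))  # [7, 4, 11, -100, -100]
```

When every sequence is shorter than the chosen multiple, every padded sequence ends up with exactly that length, which gives each batch the same shape.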

In [None]:
@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.AutoProcessor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence if provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
        pad_to_multiple_of (:obj:`int`, `optional`):
            If set, pads the ``input_values`` to a multiple of the provided value.
        pad_to_multiple_of_labels (:obj:`int`, `optional`):
            If set, pads the ``labels`` to a multiple of the provided value.
    """

    processor: AutoProcessor
    padding: Union[bool, str] = "longest"
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                pad_to_multiple_of=self.pad_to_multiple_of_labels,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        batch["labels"] = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        batch["input_values"] = batch["input_values"].half()

        return batch.data


processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

data_collator = DataCollatorCTCWithPadding(processor=processor, pad_to_multiple_of=int(max_input_length),
                                           pad_to_multiple_of_labels=1000)

## Preparing the model

For the model, we are using `wav2vec2-base` from the [🤗 Models Hub](https://huggingface.co/models). This model has been pre-trained only.
Some of the default options for the model need to be changed for training:
* The CTC loss is normalised by the target lengths (`ctc_loss_reduction` is set to `"mean"`).
* No feature masking is applied, so both masking probabilities (`mask_time_prob` and `mask_feature_prob`) are set to 0.0. The current masking strategy isn't supported on the IPU.
* LayerDrop is disabled (`layerdrop` is set to 0.0).
* The `[PAD]` index and the vocabulary size are set because the model later uses them for the final output layer and the CTC loss.
* The layer normalisation epsilon is adjusted for FP16 training.

The IPU config describes how to parallelise the model across several IPUs. It also includes additional options such as gradient accumulation, device iterations, and memory proportion. 

In [None]:
config = AutoConfig.from_pretrained("facebook/wav2vec2-base")
config.update(
    {
        "ctc_loss_reduction": "mean",
        "mask_time_prob": 0.0,
        "mask_feature_prob": 0.0,
        "layerdrop": 0.0,
        "pad_token_id": tokenizer.pad_token_id,
        "vocab_size": len(tokenizer),
        "layer_norm_eps": 0.0001,
    }
)

model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base", config=config)

ipu_config = IPUConfig.from_pretrained("Graphcore/wav2vec2-base-ipu", executable_cache_dir=executable_cache_dir)

In [None]:
ipu_config.layers_per_ipu = [5, 5, 5, 6]

Let's set our training hyperparameters using `IPUTrainingArguments`. This subclasses the Hugging Face `TrainingArguments` class, adding parameters specific to the IPU and its execution characteristics.

In [None]:
training_args = IPUTrainingArguments(output_dir=checkpoint_directory,
                                     overwrite_output_dir=True,
                                     do_train=True,
                                     do_eval=True,
                                     evaluation_strategy="epoch",
                                     learning_rate=3e-4,
                                     num_train_epochs=5.0,
                                     adam_epsilon=0.0001,
                                     warmup_steps=400,
                                     dataloader_drop_last=True,
                                     dataloader_num_workers=16,
                                     )

In [None]:
feature_extractor.save_pretrained(training_args.output_dir)
tokenizer.save_pretrained(training_args.output_dir)
processor.save_pretrained(training_args.output_dir)

The performance of the model is measured using the WER. This metric compares a predicted transcript with the reference transcript and computes the word-level [edit distance](https://en.wikipedia.org/wiki/Edit_distance) between them, normalised by the number of words in the reference.
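
As a rough illustration of how WER is computed for a single utterance, the sketch below implements a word-level Levenshtein distance in plain Python. It is not the implementation behind the `wer` metric loaded in this notebook. The function names are hypothetical, and it assumes a non-empty reference.

```python
from typing import List


def word_edit_distance(reference: List[str], hypothesis: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of reference words (assumed non-empty)."""
    ref_words = reference.split()
    return word_edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```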

To add this metric to our evaluation, we define a `compute_metrics` function and load the metric from the `datasets` package. The metric is computed once, after all the evaluation outputs have been produced.
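
Before the WER can be computed, the per-frame model outputs have to be turned into text. The sketch below shows the idea of greedy CTC decoding in simplified form: take the most likely index per frame, collapse repeats, and drop the blank. It assumes that `[PAD]` plays the role of the CTC blank. The toy vocabulary and the helpers `greedy_ctc_decode` and `one_hot` are hypothetical, and the tokenizer's real decoding handles more cases.

```python
from itertools import groupby
from typing import List, Sequence

VOCAB = ["|", "a", "c", "t", "[PAD]"]  # toy vocabulary; "[PAD]" acts as the CTC blank
BLANK_ID = VOCAB.index("[PAD]")


def greedy_ctc_decode(logits: Sequence[Sequence[float]]) -> str:
    """Decode per-frame scores into text with greedy CTC decoding."""
    # Pick the most likely token index at every frame.
    frame_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    # Collapse runs of repeated indexes, then drop the blanks.
    collapsed = [index for index, _ in groupby(frame_ids)]
    tokens = [VOCAB[index] for index in collapsed if index != BLANK_ID]
    return "".join(tokens).replace("|", " ")


def one_hot(index: int, size: int = len(VOCAB)) -> List[float]:
    """Build a toy score vector that puts all its mass on `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


print(greedy_ctc_decode([one_hot(i) for i in [2, 2, 4, 1, 1, 4, 3]]))  # cat
print(greedy_ctc_decode([one_hot(i) for i in [1, 4, 1]]))              # aa
```

The second example shows why the blank is needed: without it, the two consecutive `a` frames would collapse into a single character.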

In [None]:
eval_metrics = {"wer": load_metric("wer")}


def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = tokenizer.pad_token_id

    pred_str = tokenizer.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = tokenizer.batch_decode(pred.label_ids, group_tokens=False)

    metrics = {k: v.compute(predictions=pred_str, references=label_str) for k, v in eval_metrics.items()}

    return metrics

To train the model, we define a trainer using the `IPUTrainer` class which takes care of compiling the model to run on IPUs, and of performing training and evaluation. The `IPUTrainer` class works just like the Hugging Face `Trainer` class, but takes the additional `ipu_config` argument.

In [None]:
# Initialize Trainer
trainer = IPUTrainer(
    model=model,
    ipu_config=ipu_config,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=vectorized_datasets["train"],
    eval_dataset=vectorized_datasets["eval"],
    tokenizer=feature_extractor,
)

## Run the training

In [None]:
trainer.train()
trainer.save_model()

## Next steps

You can try out the notebook on [Running Automated Speech Recognition using a Fine-tuned wav2vec 2.0 Checkpoint on IPUs](https://github.com/huggingface/optimum-graphcore/blob/main/notebooks/wav2vec2/wav2vec2-inference-checkpoint.ipynb) to use the outputs of this notebook.

Also, check out the full list of [Optimum Graphcore notebooks](https://github.com/huggingface/optimum-graphcore/tree/main/notebooks) to get a feel for how IPUs perform on other tasks. 