# wav2vec 2.0 Fine-Tuning on IPU

This notebook will demonstrate how to fine-tune a pre-trained wav2vec 2.0 model with PyTorch on the Graphcore IPU-POD16 system. We will use a "wav2vec2-base" model and fine-tune for a CTC downstream task using LibriSpeech.

We will show how to use a wav2vec 2.0 model written in PyTorch from the 🤗`transformers` library from Hugging Face and parallelize it easily using the 🤗`optimum-graphcore` library.

### Background

ASR (Automatic Speech Recognition), the task of transcribing audio automatically, has historically required large amounts of labelled data. In addition, these systems have predominantly used fixed feature extraction methods that do not learn from the raw signal, such as the STFT or mel-frequency features. Research conducted by Facebook AI (now Meta AI) demonstrates a “framework for self-supervised learning of speech representations”: a pre-training phase and architecture that can learn feature representations, and their relationships, by leveraging large amounts of unlabelled, raw audio data.

There are two phases to training: pre-training on unlabelled data, and fine-tuning on a downstream task. In the original paper the model is fine-tuned for ASR with a CTC (connectionist temporal classification) loss. The modules shared between pre-training and fine-tuning are what you’d expect to see in a CTC system: a feature extractor and an encoder. Unlike many earlier models, the feature extractor is a convolutional neural network, which makes it trainable. It is followed by a BERT-style encoder in which a large convolutional block before the first layer replaces sinusoidal positional encoding.
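To make the convolutional feature extractor more concrete, the sketch below estimates how many feature frames it produces for a given number of audio samples. It assumes the kernel sizes and strides that appear in the `wav2vec2-base` config printed later in this notebook, unpadded 1D convolutions, and 16 kHz audio. `num_output_frames` is a hypothetical helper written for illustration, not a library function.

```python
# Kernel sizes and strides as printed in the wav2vec2-base config in this notebook.
CONV_KERNEL = [10, 3, 3, 3, 3, 2, 2]
CONV_STRIDE = [5, 2, 2, 2, 2, 2, 2]


def num_output_frames(num_samples: int) -> int:
    """Frames left after the conv stack, assuming unpadded 1D convolutions."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        length = (length - kernel) // stride + 1
    return length


# The overall downsampling factor is the product of the strides.
total_stride = 1
for stride in CONV_STRIDE:
    total_stride *= stride

print(total_stride)              # 320 samples, i.e. 20 ms at 16 kHz
print(num_output_frames(16000))  # 49 frames for one second of audio
```

Under these assumptions the encoder sees roughly one frame every 20 ms, which is the resolution at which CTC predictions are made.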

### Environment

Requirements:
- A Poplar SDK environment enabled (see the [Getting Started](https://docs.graphcore.ai/en/latest/getting-started.html) guide for your IPU system)
- Python packages installed with `python -m pip install -r requirements.txt`

In [2]:
%pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://pypi.python.org/simple/
Looking in links: https://download.pytorch.org/whl/torch_stable.html


Note: you may need to restart the kernel to use updated packages.


To run this Jupyter notebook on a remote IPU machine:
1. Enable a Poplar SDK environment (see the [Getting Started](https://docs.graphcore.ai/en/latest/getting-started.html) guide for your IPU system) and install required packages with `python -m pip install -r requirements.txt`
2. In the same environment, install the Jupyter notebook server: `python -m pip install notebook`
3. Launch a Jupyter Server on a specific port: `jupyter-notebook --no-browser --port <port number>`
4. Connect via SSH to your remote machine, forwarding your chosen port:
`ssh -NL <port number>:localhost:<port number> <your username>@<remote machine>`

For more details about this process, or if you need troubleshooting, 
see our [guide on using IPUs from Jupyter notebooks](../../standard_tools/using_jupyter/README.md).

### Graphcore Hugging Face models
Hugging Face provides convenient access to pre-trained transformer models. The partnership between Hugging Face and Graphcore allows us to run these models on the IPU.

Hugging Face models ported to the IPU can be found on the Graphcore organisation page on Hugging Face. 

### Utility imports
We start by importing the utilities that will be used later in the tutorial: 

In [3]:
import functools
import json
import logging
import os
import re
import sys
import warnings
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union

import datasets
import numpy as np
import torch
from datasets import DatasetDict, load_dataset, load_metric

import transformers
from optimum.graphcore import IPUConfig, IPUTrainer
from optimum.graphcore import IPUTrainingArguments
from transformers import (
    AutoConfig,
    AutoFeatureExtractor,
    AutoModelForCTC,
    AutoProcessor,
    AutoTokenizer,
    HfArgumentParser,
    Wav2Vec2Processor,
    set_seed,
)
from transformers.trainer_utils import get_last_checkpoint
from transformers.utils import check_min_version
from transformers.utils.versions import require_version



## Preparing the LibriSpeech Dataset

The 🤗`datasets` library from Hugging Face can be used to conveniently load the LibriSpeech dataset, and it provides easy-to-use tools to process the data.

First we are going to create a `DatasetDict` dictionary to handle our data, and then load the LibriSpeech splits for training and validation. For this notebook we will use `train.100` which is 100 hours of clean training data. Section C of the appendix in the [paper](https://arxiv.org/abs/2006.11477) suggests that fine-tuning a `Base` model can yield 6.1% WER without an additional language model.

In [4]:
raw_datasets = DatasetDict()
raw_datasets["train"] = load_dataset("librispeech_asr", "clean", split="train.100")
raw_datasets["eval"] = load_dataset("librispeech_asr", "clean", split="validation")

Reusing dataset librispeech_asr (/home/thorinf/.cache/huggingface/datasets/librispeech_asr/clean/2.1.0/14c8bffddb861b4b3a4fcdff648a56980dbb808f3fc56f5a3d56b18ee88458eb)
Reusing dataset librispeech_asr (/home/thorinf/.cache/huggingface/datasets/librispeech_asr/clean/2.1.0/14c8bffddb861b4b3a4fcdff648a56980dbb808f3fc56f5a3d56b18ee88458eb)


In [5]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'],
        num_rows: 28539
    })
    eval: Dataset({
        features: ['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'],
        num_rows: 2703
    })
})

### Text Normalisation

Using the `map` function from the 🤗`datasets` package, any special characters are removed from the transcriptions, which are then lower-cased. As a result the model does not have to learn punctuation or capitalisation. Although it may be able to learn them, the task is much easier without them.

There are other situations where text normalisation may be used like converting digits into their text counterpart. This is not performed in this script as LibriSpeech already has the text counterpart.
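As a minimal, standalone sketch of this normalisation, the snippet below strips punctuation with a regex character class and lower-cases the result. The punctuation list and the `normalise` helper are illustrative assumptions, not the exact set used in the notebook cell.

```python
import re

# Hypothetical punctuation set for illustration; re.escape keeps "-" literal
# inside the character class instead of letting it define a range.
PUNCTUATION = [",", "?", ".", "!", "-", ";", ":", '"']
PATTERN = re.compile("[" + "".join(re.escape(ch) for ch in PUNCTUATION) + "]")


def normalise(text: str) -> str:
    """Strip the listed punctuation, then lower-case the result."""
    return PATTERN.sub("", text).lower()


print(normalise('Hello, World! "Is it semi-trained?"'))
# hello world is it semitrained
```

Wrapping the characters in `[...]` matters: without it the joined string would only match the whole punctuation sequence in order, not each character individually.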

In [6]:
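# Build a regex character class; re.escape keeps "-" and other symbols literal
chars_to_ignore = [",", "?", ".", "!", "-", ";", ":", "\"", "“", "%", "‘", "”", "�"]
chars_to_ignore_regex = "[" + "".join(re.escape(char) for char in chars_to_ignore) + "]"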
text_column_name = "text"


def remove_special_characters(batch):
    if chars_to_ignore_regex is not None:
        batch["target_text"] = re.sub(chars_to_ignore_regex, "", batch[text_column_name]).lower() + " "
    else:
        batch["target_text"] = batch[text_column_name].lower() + " "
    return batch


raw_datasets = raw_datasets.map(
    remove_special_characters,
    remove_columns=[text_column_name],
    desc="remove special characters from datasets",
)

Loading cached processed dataset at /home/thorinf/.cache/huggingface/datasets/librispeech_asr/clean/2.1.0/14c8bffddb861b4b3a4fcdff648a56980dbb808f3fc56f5a3d56b18ee88458eb/cache-8c8acfcd6e3fb31e.arrow
Loading cached processed dataset at /home/thorinf/.cache/huggingface/datasets/librispeech_asr/clean/2.1.0/14c8bffddb861b4b3a4fcdff648a56980dbb808f3fc56f5a3d56b18ee88458eb/cache-2519815fd64082dc.arrow


### Create Vocabulary and Tokenizer

We now create a vocabulary from the dataset. This will find all the unique characters from all the normalised text in the datasets.
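The idea can be sketched on a couple of toy transcripts without the 🤗`datasets` machinery. `build_vocab` is a hypothetical helper for illustration; the token names mirror the ones used in this notebook.

```python
def build_vocab(transcripts, word_delimiter="|", unk="[UNK]", pad="[PAD]"):
    """Map every unique character to an index, then append special tokens."""
    chars = sorted(set("".join(transcripts)))
    vocab = {ch: idx for idx, ch in enumerate(chars)}
    # Represent the space between words with a visible delimiter token.
    if " " in vocab:
        vocab[word_delimiter] = vocab.pop(" ")
    vocab[unk] = len(vocab)
    vocab[pad] = len(vocab)
    return vocab


toy_vocab = build_vocab(["a cab", "bad"])
print(toy_vocab)
# {'a': 1, 'b': 2, 'c': 3, 'd': 4, '|': 0, '[UNK]': 5, '[PAD]': 6}
```

Because the space character sorts before the letters, the word delimiter ends up with index 0, and the special tokens take the last two indexes.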

In [7]:
def create_vocabulary_from_data(
        datasets: DatasetDict,
        word_delimiter_token=None,
        unk_token=None,
        pad_token=None,
):
    # Given training and test labels create vocabulary
    def extract_all_chars(batch):
        all_text = " ".join(batch["target_text"])
        vocab = list(set(all_text))
        return {"vocab": [vocab], "all_text": [all_text]}

    vocabs = datasets.map(
        extract_all_chars,
        batched=True,
        batch_size=-1,
        keep_in_memory=True,
        remove_columns=datasets["train"].column_names,
    )

    # take union of all unique characters in each dataset
    vocab_set = functools.reduce(
        lambda vocab_1, vocab_2: set(vocab_1["vocab"][0]) | set(vocab_2["vocab"][0]), vocabs.values()
    )

    vocab_dict = {v: k for k, v in enumerate(sorted(list(vocab_set)))}

    # replace white space with delimiter token
    if word_delimiter_token is not None:
        vocab_dict[word_delimiter_token] = vocab_dict[" "]
        del vocab_dict[" "]

    # add unk and pad token
    if unk_token is not None:
        vocab_dict[unk_token] = len(vocab_dict)

    if pad_token is not None:
        vocab_dict[pad_token] = len(vocab_dict)

    return vocab_dict


word_delimiter_token = "|"
unk_token = "[UNK]"
pad_token = "[PAD]"

vocab_dict = create_vocabulary_from_data(raw_datasets,
                                         word_delimiter_token=word_delimiter_token,
                                         unk_token=unk_token,
                                         pad_token=pad_token)

100%|██████████| 1/1 [00:03<00:00,  3.28s/ba]
100%|██████████| 1/1 [00:00<00:00,  6.01ba/s]


In [8]:
vocab_dict

{"'": 1,
 'a': 2,
 'b': 3,
 'c': 4,
 'd': 5,
 'e': 6,
 'f': 7,
 'g': 8,
 'h': 9,
 'i': 10,
 'j': 11,
 'k': 12,
 'l': 13,
 'm': 14,
 'n': 15,
 'o': 16,
 'p': 17,
 'q': 18,
 'r': 19,
 's': 20,
 't': 21,
 'u': 22,
 'v': 23,
 'w': 24,
 'x': 25,
 'y': 26,
 'z': 27,
 '|': 0,
 '[UNK]': 28,
 '[PAD]': 29}

With the vocabulary generated from the normalised transcripts we create a `tokenizer`, which is included in the 🤗`transformers` library. This will later be used to encode text into indexes, and decode indexes into text.

In [9]:
tokenizer_name_or_path = "/tmp/wav2vec2-notebook"

vocab_file = os.path.join(tokenizer_name_or_path, "vocab.json")

if os.path.isfile(vocab_file):
    os.remove(vocab_file)

os.makedirs(tokenizer_name_or_path, exist_ok=True)

with open(vocab_file, "w") as file:
    json.dump(vocab_dict, file)

tokenizer_kwargs = {
    "config": None,
    "tokenizer_type": "wav2vec2",
    "unk_token": unk_token,
    "pad_token": pad_token,
    "word_delimiter_token": word_delimiter_token,
}

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path, use_auth_token=False, **tokenizer_kwargs)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Let's look at an example for using the tokenizer. The vocabulary does not contain any digits, so these will be set to `[UNK]`. Remember, any special characters (such as commas) have already been removed from the dataset.

In [10]:
tokenizer("wav2vec2 finetuning on ipu")

{'input_ids': [24, 2, 23, 28, 23, 6, 4, 28, 0, 7, 10, 15, 6, 21, 22, 15, 10, 15, 8, 0, 16, 15, 0, 10, 17, 22], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [11]:
tokenizer.decode(tokenizer("wav2vec2 finetuning on ipu").input_ids)

'wav[UNK]vec[UNK] finetuning on ipu'
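The encoding above is essentially a character-level dictionary lookup. The sketch below rebuilds the vocabulary printed earlier and reproduces the same `input_ids` in plain Python. The `encode` helper is a hypothetical stand-in for illustration and ignores details the real tokenizer handles, such as multi-character special tokens in the input.

```python
import string

# Rebuild the vocabulary printed earlier: "|" is 0, "'" is 1, a-z are 2-27.
vocab = {"|": 0, "'": 1}
vocab.update({ch: idx for idx, ch in enumerate(string.ascii_lowercase, start=2)})
vocab["[UNK]"] = 28
vocab["[PAD]"] = 29


def encode(text: str) -> list:
    """Character-level lookup; spaces become "|", unknown characters "[UNK]"."""
    return [vocab.get("|" if ch == " " else ch, vocab["[UNK]"]) for ch in text]


ids = encode("wav2vec2 finetuning on ipu")
print(ids[:8])  # [24, 2, 23, 28, 23, 6, 4, 28]
```

Both occurrences of the digit `2` map to index 28, which is why they decode to `[UNK]`.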

### Feature Extraction

Now we generate the feature extraction method for the model and map it across the datasets onto the audio data. In this model we are learning from raw audio signal, so the feature extraction is just used to resample the audio to the rate which the model expects. 

Afterwards we set the minimum and maximum input lengths in samples. These are 2.0 and 15.6 seconds which, at a sampling rate of 16 kHz, correspond to 32000 and 249600 samples.
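The conversion from seconds to samples is a single multiplication by the sampling rate. In this small sketch, `seconds_to_samples` is a hypothetical helper, and it uses `round` rather than `int` to guard against floating-point error.

```python
SAMPLING_RATE = 16_000  # Hz, the rate the audio is resampled to in this notebook


def seconds_to_samples(seconds: float, sampling_rate: int = SAMPLING_RATE) -> int:
    """Convert a duration in seconds to a whole number of samples."""
    return round(seconds * sampling_rate)


min_samples = seconds_to_samples(2.0)
max_samples = seconds_to_samples(15.6)
print(min_samples, max_samples)  # 32000 249600
```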

In [12]:
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

dataset_sampling_rate = next(iter(raw_datasets.values())).features["audio"].sampling_rate
if dataset_sampling_rate != 16000:
    raw_datasets = raw_datasets.cast_column("audio", datasets.features.Audio(sampling_rate=16000))

max_input_length = int(15.6 * feature_extractor.sampling_rate)
min_input_length = int(2.0 * feature_extractor.sampling_rate)


### Prepare Dataset

In this step both the feature extraction and tokenization are applied to the audio and transcript, respectively. The feature extractor resamples the audio, and the tokenizer will convert the normalised text into indexes.

After the map function, the dataset will be filtered by the audio length. If the length of the raw audio is not between 2.0 and 15.6 seconds then it will be removed from the data. The result of filtering is cached.

In [13]:
def prepare_dataset(batch):
    sample = batch["audio"]

    inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"])
    batch["input_values"] = inputs.input_values[0]
    batch["input_length"] = len(inputs.input_values[0])

    batch["labels"] = tokenizer(batch["target_text"]).input_ids

    return batch


def is_audio_in_length_range(length):
    # Keep clips strictly between the minimum and maximum lengths (in samples)
    try:
        return min_input_length < length < max_input_length
    except TypeError:
        return False


vectorized_datasets = raw_datasets.map(prepare_dataset,
                                       remove_columns=raw_datasets["train"].column_names,
                                       num_proc=8,
                                       desc="preprocess datasets")

vectorized_datasets = vectorized_datasets.filter(is_audio_in_length_range,
                                                 input_columns=["input_length"],
                                                 num_proc=8)

Loading cached processed dataset at /home/thorinf/.cache/huggingface/datasets/librispeech_asr/clean/2.1.0/14c8bffddb861b4b3a4fcdff648a56980dbb808f3fc56f5a3d56b18ee88458eb/cache-299bce06cb35f50b.arrow
Loading cached processed dataset at /home/thorinf/.cache/huggingface/datasets/librispeech_asr/clean/2.1.0/14c8bffddb861b4b3a4fcdff648a56980dbb808f3fc56f5a3d56b18ee88458eb/cache-439b13de76e6512e.arrow
Loading cached processed dataset at /home/thorinf/.cache/huggingface/datasets/librispeech_asr/clean/2.1.0/14c8bffddb861b4b3a4fcdff648a56980dbb808f3fc56f5a3d56b18ee88458eb/cache-126dce57a7d4e9cb.arrow
Loading cached processed dataset at /home/thorinf/.cache/huggingface/datasets/librispeech_asr/clean/2.1.0/14c8bffddb861b4b3a4fcdff648a56980dbb808f3fc56f5a3d56b18ee88458eb/cache-c8c5d63b1a540489.arrow
Loading cached processed dataset at /home/thorinf/.cache/huggingface/datasets/librispeech_asr/clean/2.1.0/14c8bffddb861b4b3a4fcdff648a56980dbb808f3fc56f5a3d56b18ee88458eb/cache-5c84e472820247b7.arrow


## Data Loading

With the dataset prepared, the majority of the processing is complete and the data is nearly ready to be sent to the model. The role of the collator is to pad the resampled audio and the encoded text to a static size. The padding value for audio is `0.0`, but for the label indexes it is `-100` so that padding is not confused with an index in the vocabulary.
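The padding scheme can be illustrated with plain Python lists. `pad_to_multiple` is a hypothetical helper, not the processor's `pad` method. The point it illustrates is that when every sequence is shorter than the chosen multiple, padding to that multiple gives every example the same length.

```python
PAD_AUDIO = 0.0
PAD_LABEL = -100  # ignored by the loss, and never a valid vocabulary index


def pad_to_multiple(seq, multiple, pad_value):
    """Right-pad seq so its length is the next multiple of `multiple`."""
    target = -(-len(seq) // multiple) * multiple  # ceiling division
    return list(seq) + [pad_value] * (target - len(seq))


audio = [0.1, -0.2, 0.3]
labels = [9, 6, 13, 13, 16]
print(pad_to_multiple(audio, 4, PAD_AUDIO))   # [0.1, -0.2, 0.3, 0.0]
print(pad_to_multiple(labels, 8, PAD_LABEL))  # [9, 6, 13, 13, 16, -100, -100, -100]
```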

In [14]:
@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.AutoProcessor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence if provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
        pad_to_multiple_of (:obj:`int`, `optional`):
            If set will pad the ``input_values`` to a multiple of the provided value.
        pad_to_multiple_of_labels (:obj:`int`, `optional`):
            If set will pad the ``labels`` to a multiple of the provided value.
    """

    processor: AutoProcessor
    padding: Union[bool, str] = "longest"
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                pad_to_multiple_of=self.pad_to_multiple_of_labels,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        batch["labels"] = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        return batch.data


processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

data_collator = DataCollatorCTCWithPadding(processor=processor, pad_to_multiple_of=int(max_input_length),
                                           pad_to_multiple_of_labels=1000)

## Preparing the Model


For the model we are using `wav2vec2-base` from the Hugging Face model hub. This model has only been pre-trained, not fine-tuned.
Some of the default options for the model need to be changed for training:
* The CTC loss is averaged (`ctc_loss_reduction` is set to `"mean"`) so that it is normalised by the target lengths.
* No masking is applied to the features, so both mask probabilities are set to 0.0; the current masking strategy isn't supported on the IPU. LayerDrop is also disabled.
* The `[PAD]` index and vocabulary size are set, as they are later used in the model for the final output layer and the CTC loss.
* The layer norm epsilon is increased for FP16 training.


The IPU config describes how to parallelise the model across several IPUs. It also includes additional options such as gradient accumulation, device iterations, and memory proportion. 

In [16]:
config = AutoConfig.from_pretrained("facebook/wav2vec2-base")
config.update(
    {
        "ctc_loss_reduction": "mean",
        "mask_time_prob": 0.0,
        "mask_feature_prob": 0.0,
        "layerdrop": 0.0,
        "pad_token_id": tokenizer.pad_token_id,
        "vocab_size": len(tokenizer),
        "layer_norm_eps": 0.0001,
    }
)

model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base", config=config)

ipu_config = IPUConfig.from_pretrained("Graphcore/wav2vec2-ctc-base-ipu")
Some weights of the model checkpoint at facebook/wav2vec2-base were not used when initializing Wav2Vec2ForCTC: ['quantizer.weight_proj.bias', 'project_q.bias', 'project_q.weight', 'wav2vec2.masked_spec_embed', 'quantizer.weight_proj.weight', 'project_hid.bias', 'quantizer.codevectors', 'project_hid.weight']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initi

In [17]:
config

Wav2Vec2Config {
  "_name_or_path": "facebook/wav2vec2-base",
  "activation_dropout": 0.0,
  "adapter_kernel_size": 3,
  "adapter_stride": 2,
  "add_adapter": false,
  "apply_spec_augment": true,
  "architectures": [
    "Wav2Vec2ForPreTraining"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 1,
  "classifier_proj_size": 256,
  "codevector_dim": 256,
  "contrastive_logits_temperature": 0.1,
  "conv_bias": false,
  "conv_dim": [
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "conv_kernel": [
    10,
    3,
    3,
    3,
    3,
    2,
    2
  ],
  "conv_stride": [
    5,
    2,
    2,
    2,
    2,
    2,
    2
  ],
  "ctc_loss_reduction": "mean",
  "ctc_zero_infinity": false,
  "diversity_loss_weight": 0.1,
  "do_stable_layer_norm": false,
  "eos_token_id": 2,
  "feat_extract_activation": "gelu",
  "feat_extract_norm": "group",
  "feat_proj_dropout": 0.1,
  "feat_quantizer_dropout": 0.0,
  "final_dropout": 0.0,
  "freeze_feat_extract_train": true,
  "hidden_act"

Let's set our training hyperparameters using `IPUTrainingArguments`. This subclasses the Hugging Face `TrainingArguments` class, adding parameters specific to the IPU and its execution characteristics.

In [18]:
training_args = IPUTrainingArguments(output_dir="./demo",
                                     overwrite_output_dir=True,
                                     do_train=True,
                                     do_eval=True,
                                     evaluation_strategy="epoch",
                                     learning_rate=3e-4,
                                     num_train_epochs=5.0,
                                     adam_epsilon=0.0001,
                                     warmup_steps=400,
                                     logging_steps=100,
                                     dataloader_drop_last=True,
                                     dataloader_num_workers=16,
                                     )

In [19]:
feature_extractor.save_pretrained(training_args.output_dir)
tokenizer.save_pretrained(training_args.output_dir)

('./demo/tokenizer_config.json',
 './demo/special_tokens_map.json',
 './demo/vocab.json',
 './demo/added_tokens.json')

The performance of the model is measured using the word error rate (WER). This metric takes a predicted string and the correct string and computes a word-level edit distance normalised by the number of words in the correct string. Over many sentences, the sum of the edit distances is divided by the sum of the reference lengths.
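As a rough illustration of the metric, the sketch below computes a corpus-level WER with a standard Levenshtein distance over words. `edit_distance` and `word_error_rate` are hypothetical helpers written for this example. The notebook itself relies on the `wer` metric loaded from the `datasets` package, which may differ in details such as text pre-processing.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(references, predictions):
    """Total word-level edits divided by total reference words."""
    edits = sum(edit_distance(r.split(), p.split()) for r, p in zip(references, predictions))
    words = sum(len(r.split()) for r in references)
    return edits / words


refs = ["the cat sat on the mat", "hello world"]
preds = ["the cat sat on mat", "hello word"]
print(word_error_rate(refs, preds))  # 0.25
```

Here one deleted word and one substituted word over eight reference words give a WER of 25%.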

To add this metric to our evaluation we define a `compute_metrics` function and load the metric from the `datasets` package. This is performed once after all the evaluation outputs have been computed.

In [20]:
eval_metrics = {"wer": load_metric("wer")}


def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = tokenizer.pad_token_id

    pred_str = tokenizer.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = tokenizer.batch_decode(pred.label_ids, group_tokens=False)

    metrics = {k: v.compute(predictions=pred_str, references=label_str) for k, v in eval_metrics.items()}

    return metrics
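The `argmax` followed by `batch_decode` amounts to greedy CTC decoding: take the most likely index per frame, collapse consecutive repeats, and drop the blank token. The sketch below illustrates this in plain Python, assuming the vocabulary printed earlier in this notebook and assuming that `[PAD]` acts as the CTC blank. `greedy_ctc_decode` is a hypothetical helper, not the tokenizer's implementation.

```python
from itertools import groupby

# Index-to-token table matching the vocabulary printed earlier in this notebook.
ID_TO_TOKEN = {0: "|", 1: "'", 28: "[UNK]", 29: "[PAD]"}
ID_TO_TOKEN.update({i + 2: chr(ord("a") + i) for i in range(26)})
BLANK_ID = 29  # assumption: the [PAD] token doubles as the CTC blank


def greedy_ctc_decode(frame_ids):
    """Collapse repeated frame predictions, drop blanks, map "|" to a space."""
    collapsed = [idx for idx, _ in groupby(frame_ids)]
    tokens = [ID_TO_TOKEN[idx] for idx in collapsed if idx != BLANK_ID]
    return "".join(tokens).replace("|", " ").strip()


# One predicted index per frame: "i", "i", blank, "p", "p", "u", "|"
print(greedy_ctc_decode([10, 10, 29, 17, 17, 22, 0]))  # ipu
```

Labels are not frame-level predictions, so repeated characters in them are genuine (for example the double "l" in "hello"), which is why the labels are decoded with `group_tokens=False`.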

To train the model, we define a trainer using the `IPUTrainer` class which takes care of compiling the model to run on IPUs, and of performing training and evaluation. The `IPUTrainer` class works just like the HuggingFace `Trainer` class, but takes the additional `ipu_config` argument.

In [21]:
# Initialize Trainer
trainer = IPUTrainer(
    model=model,
    ipu_config=ipu_config,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=vectorized_datasets["train"],
    eval_dataset=vectorized_datasets["eval"],
    tokenizer=feature_extractor,
)

---------- Device Allocation -----------
Conv 0  --> IPU 0
Conv 1  --> IPU 0
Conv 2  --> IPU 0
Conv 3  --> IPU 0
Conv 4  --> IPU 0
Conv 5  --> IPU 0
Conv 6  --> IPU 0
Positional Embedding --> IPU 0
Encoder 0  --> IPU 0
Encoder 1  --> IPU 1
Encoder 2  --> IPU 1
Encoder 3  --> IPU 1
Encoder 4  --> IPU 1
Encoder 5  --> IPU 2
Encoder 6  --> IPU 2
Encoder 7  --> IPU 2
Encoder 8  --> IPU 2
Encoder 9  --> IPU 3
Encoder 10 --> IPU 3
Encoder 11 --> IPU 3
Project Hidden --> IPU 3
---------------------------------------


## Run the Training

In [None]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `PipelinedWav2Vec2ForCTC.forward` and have been ignored: input_length.


In [None]:
trainer.save_model()

In [None]:
feature_extractor.save_pretrained(training_args.output_dir)
tokenizer.save_pretrained(training_args.output_dir)
processor.save_pretrained(training_args.output_dir)