# Multi-lingual ASR Transcription on IPUs using Whisper - LoRA Fine-tuning

This notebook demonstrates [LoRA](https://arxiv.org/abs/2106.09685) fine-tuning for multi-lingual speech transcription on the IPU using the [Whisper implementation in the Hugging Face Transformers library](https://huggingface.co/spaces/openai/whisper) alongside [Optimum Graphcore](https://github.com/huggingface/optimum-graphcore). We will be using the Spanish subset of the [Common Voice dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0).

Whisper is a versatile speech recognition model that can transcribe speech as well as perform multi-lingual translation and recognition tasks.
It was trained on diverse datasets to give human-level speech recognition performance without the need for fine-tuning.

[🤗 Optimum Graphcore](https://github.com/huggingface/optimum-graphcore) is the interface between the [🤗 Transformers library](https://huggingface.co/docs/transformers/index) and [Graphcore IPUs](https://www.graphcore.ai/products/ipu).
It provides a set of tools for parallelizing and loading models on IPUs, and for training and fine-tuning on all the tasks already supported by Transformers, while remaining compatible with the Hugging Face Hub and every model available on it out of the box.

LoRA is a training method that injects trainable low-rank matrices into Transformer layers. This greatly reduces the number of trainable parameters in a model, which accelerates fine-tuning and reduces device memory use.
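
As a rough illustration of the idea, the sketch below applies a LoRA update to a single linear layer using NumPy. The dimensions (768, rank 16, scaling 32) mirror the values used later in this notebook; the variable names and the random initialisation are illustrative assumptions and not the `peft` implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, rank, alpha = 768, 16, 32  # mirror the LoRA settings used later

# Frozen pre-trained weight of one linear layer (random stand-in values).
W = rng.standard_normal((d_model, d_model)).astype(np.float32)

# Trainable low-rank factors. B starts at zero, so the adapted layer
# initially reproduces the frozen layer exactly.
A = (0.01 * rng.standard_normal((rank, d_model))).astype(np.float32)
B = np.zeros((d_model, rank), dtype=np.float32)


def lora_linear(x):
    """Compute x W^T plus the scaled low-rank correction x A^T B^T."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T


x = rng.standard_normal((4, d_model)).astype(np.float32)
base_out = x @ W.T
lora_out = lora_linear(x)

frozen_params = W.size
trainable_params = A.size + B.size
print(f"frozen: {frozen_params}, trainable: {trainable_params}")
```

Only `A` and `B` would receive gradients during fine-tuning, which is why the trainable parameter count for this layer is a small fraction of the frozen one.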

> **Hardware requirements:** We will fine-tune `whisper-small` with two replicas on the smallest IPU-POD4 machine.

| Domain | Tasks | Model | Datasets | Workflow | Number of IPUs | Execution time |
|--------|-------|-------|----------|----------|----------------|----------------|
| Automatic Speech Recognition | Transcription | Whisper-small | Common Voice (es) dataset | Fine-tuning | 4 | 1 hour |

[![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)

## Environment setup

The best way to run this demo is on Paperspace Gradient's cloud IPUs because everything is already set up for you.

To run the demo using other IPU hardware, you need to have the Poplar SDK enabled. Refer to the [Getting Started guide](https://docs.graphcore.ai/en/latest/getting-started.html#getting-started) for your system for details on how to enable the Poplar SDK. Also refer to the [Jupyter Quick Start guide](https://docs.graphcore.ai/projects/jupyter-notebook-quick-start/en/latest/index.html) for how to set up Jupyter to be able to run this notebook on a remote IPU machine.

## Dependencies

This notebook requires SDK 3.3 or later; the best performance is achieved with SDK 3.4.

In [None]:
import re
import warnings

sdk_version = !popc --version
match = re.search(r'\d+\.\d+\.\d+', sdk_version[0]) if sdk_version else None
# Compare versions as integer tuples; comparing strings would rank '3.10.0' below '3.4'.
version = tuple(int(part) for part in match.group().split('.')) if match else (0, 0, 0)
if version >= (3, 4):
    pass
elif version >= (3, 3):
    warnings.warn("SDK versions lower than 3.4 do not support all the functionality in this notebook so performance will be reduced. We recommend you relaunch the Paperspace Notebook with the PyTorch SDK 3.4 image. You can use https://hub.docker.com/r/graphcore/pytorch-early-access",
                  category=Warning, stacklevel=2)
else:
    raise ValueError("SDK versions lower than 3.3 are not supported by this notebook. Please relaunch the Paperspace Notebook with the PyTorch SDK 3.3 image.")

Install the dependencies the notebook needs.

In [None]:
# Install Optimum Graphcore from source, along with the audio and metric dependencies
!pip install git+https://github.com/huggingface/optimum-graphcore.git "soundfile" "librosa" "evaluate" "jiwer"

In [1]:
import os

n_ipu = int(os.getenv("NUM_AVAILABLE_IPU", 4))
# os.path.join avoids a doubled slash when the base directory ends with "/".
executable_cache_dir = os.path.join(os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "/tmp/exe_cache/"), "whisper_lora")

## Imports

In [2]:
# Generic imports
import copy
from dataclasses import dataclass
from typing import Any, Dict, List, Union

import evaluate
import numpy as np
import torch
from datasets import load_dataset, Audio, DatasetDict

# IPU-specific imports
from optimum.graphcore import (
    pipeline,
    IPUConfig, 
    IPUSeq2SeqTrainer, 
    IPUSeq2SeqTrainingArguments, 
)
from optimum.graphcore.models.whisper import WhisperProcessorTorch

# HF-related imports
from peft import LoraConfig, PeftConfig, PeftModel, get_peft_model
from transformers import WhisperForConditionalGeneration

In [3]:
LANGUAGE = "es"
MAX_LENGTH = 224
MODEL_NAME = "openai/whisper-small"
TASK = "transcribe"

## Load Dataset

Common Voice datasets consist of recordings of speakers reading text from Wikipedia in different languages. 🤗 Datasets enables us to download and prepare the training and evaluation splits easily.

First, ensure you have accepted the terms of use on the Hugging Face Hub: [mozilla-foundation/common_voice_13_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0). Once you have accepted the terms, you will have full access to the dataset and be able to download the data locally.

In [4]:
common_voice = DatasetDict()
common_voice["train"] = load_dataset("mozilla-foundation/common_voice_13_0", LANGUAGE, split="train+validation", use_auth_token=True)
common_voice["eval"] = load_dataset("mozilla-foundation/common_voice_13_0", LANGUAGE, split="test", use_auth_token=True)

print(common_voice)

Found cached dataset common_voice_13_0 (/home/gorank/.cache/huggingface/datasets/mozilla-foundation___common_voice_13_0/es/13.0.0/2506e9a8950f5807ceae08c2920e814222909fd7f477b74f5d225802e9f04055)
Found cached dataset common_voice_13_0 (/home/gorank/.cache/huggingface/datasets/mozilla-foundation___common_voice_13_0/es/13.0.0/2506e9a8950f5807ceae08c2920e814222909fd7f477b74f5d225802e9f04055)


DatasetDict({
    train: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        num_rows: 296037
    })
    eval: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        num_rows: 15708
    })
})


The columns of interest are `audio` (the raw audio samples) and `sentence` (the corresponding ground-truth transcription).

In [5]:
common_voice = common_voice.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])

Since Whisper was pre-trained on audio sampled at 16 kHz, we must ensure the Common Voice samples are resampled to that rate.

In [6]:
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))
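
To make the effect of resampling concrete, the arithmetic can be sketched as follows. The 30-second window and hop length of 160 samples are the `WhisperFeatureExtractor` defaults; the 48 kHz source rate and the 5-second clip are assumptions chosen for illustration.

```python
TARGET_SAMPLE_RATE = 16_000  # Hz, the rate Whisper expects
WINDOW_SECONDS = 30          # every clip is padded or truncated to 30 s
HOP_LENGTH = 160             # samples between successive spectrogram frames


def samples_after_resampling(num_samples: int, source_rate: int) -> int:
    """Number of samples a clip has once resampled to 16 kHz."""
    return round(num_samples * TARGET_SAMPLE_RATE / source_rate)


# A hypothetical 5-second clip recorded at 48 kHz (assumed source rate).
clip_samples = samples_after_resampling(5 * 48_000, 48_000)

# Size of the fixed-length window the feature extractor works on.
window_samples = WINDOW_SECONDS * TARGET_SAMPLE_RATE
num_frames = window_samples // HOP_LENGTH

print(clip_samples, window_samples, num_frames)
```

Because every clip is brought to the same window length, the resulting log-mel features have the same number of frames regardless of the original clip duration.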

## Prepare Dataset

We prepare the datasets by extracting features from the raw audio inputs and adding labels, which are simply the transcriptions after some basic processing.

The feature extraction is provided by the 🤗 Transformers `WhisperFeatureExtractor`. To decode generated tokens to text after running the model, we similarly require a tokenizer, `WhisperTokenizer`. Both of these are wrapped by a `WhisperProcessor`; here we use the `WhisperProcessorTorch` variant imported from Optimum Graphcore.

In [8]:
processor = WhisperProcessorTorch.from_pretrained(MODEL_NAME, language=LANGUAGE, task=TASK)
processor.tokenizer.pad_token = processor.tokenizer.eos_token
processor.tokenizer.max_length = MAX_LENGTH
processor.tokenizer.set_prefix_tokens(language=LANGUAGE, task=TASK)

In [9]:
def prepare_dataset(batch, processor):
    inputs = processor.feature_extractor(
        raw_speech=batch["audio"]["array"],
        sampling_rate=batch["audio"]["sampling_rate"],
    )
    batch["input_features"] = inputs.input_features[0].astype(np.float16)

    transcription = batch["sentence"]
    if transcription.startswith('"') and transcription.endswith('"'):
        transcription = transcription[1:-1]
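    # endswith() also handles an empty transcription, where indexing [-1] would raise an IndexError.
    if not transcription.endswith((".", "?", "!")):
        transcription = transcription + "."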
    batch["labels"] = processor.tokenizer(text=transcription).input_ids
    return batch

# num_proc > 1 hangs if the num_threads are not set.
torch.set_num_threads(1)
columns_to_remove = common_voice.column_names["train"]
common_voice = common_voice.map(
    lambda elem: prepare_dataset(elem, processor),
    remove_columns=columns_to_remove,
    num_proc=1,
)
torch.set_num_threads(4)

train_dataset = common_voice["train"]
eval_dataset = common_voice["eval"]
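
The transcription clean-up inside `prepare_dataset` can be tried on its own. The helper below is a standalone restatement of that step; the name `clean_transcription` and the example sentences are hypothetical, and it uses `str.endswith` so that an empty string is handled without an error.

```python
def clean_transcription(text: str) -> str:
    """Strip one pair of surrounding double quotes and ensure end punctuation."""
    if text.startswith('"') and text.endswith('"'):
        text = text[1:-1]
    if not text.endswith((".", "?", "!")):
        text = text + "."
    return text


# Made-up sentences covering the three cases: quoted, already punctuated, bare.
examples = ['"Hola mundo"', "¿Cómo estás?", "Buenos días"]
cleaned = [clean_transcription(example) for example in examples]
print(cleaned)
```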

Lastly, we pre-process the labels by padding them with values that will be ignored during fine-tuning. We do this on the fly using the data collator below.

In [10]:
@dataclass
class DataCollatorSpeechSeq2SeqWithLabelProcessing:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        batch = {}
        batch["input_features"] = torch.tensor([feature["input_features"] for feature in features])
        
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt", padding="longest", pad_to_multiple_of=MAX_LENGTH)
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch
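
The label handling in the collator can be sketched in plain Python. This is a simplified stand-in, not the tokenizer's `pad` method: it pads each label list to a shared length that is a multiple of a given value and fills the padding with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The token IDs and the helper name `pad_labels` are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_labels(label_lists, multiple):
    """Right-pad each label list with IGNORE_INDEX up to a shared length.

    The shared length is the longest list rounded up to a multiple of `multiple`.
    """
    longest = max(len(labels) for labels in label_lists)
    target = -(-longest // multiple) * multiple  # round up to the next multiple
    return [labels + [IGNORE_INDEX] * (target - len(labels)) for labels in label_lists]


# Hypothetical token IDs, padded to a multiple of 8 to keep the output short.
padded = pad_labels([[11, 12, 13], [21, 22, 23, 24, 25]], multiple=8)
for row in padded:
    print(row)
```

With `pad_to_multiple_of=MAX_LENGTH` as in the collator, every batch ends up with the same label length as long as no label sequence exceeds `MAX_LENGTH` tokens.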

## Define metrics

The performance of our fine-tuned model will be evaluated using word error rate (WER).

In [11]:
metric = evaluate.load("wer")


def compute_metrics(pred, tokenizer):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    pred_ids = np.where(pred_ids != -100, pred_ids, tokenizer.pad_token_id)
    label_ids = np.where(label_ids != -100, label_ids, tokenizer.pad_token_id)

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    normalized_pred_str = [tokenizer._normalize(pred).strip() for pred in pred_str]
    normalized_label_str = [tokenizer._normalize(label).strip() for label in label_str]

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    normalized_wer = 100 * metric.compute(predictions=normalized_pred_str, references=normalized_label_str)

    return {"wer": wer, "normalized_wer": normalized_wer}
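
For intuition, WER is the word-level edit distance between a reference and a prediction, divided by the number of reference words. The function below is a minimal single-sentence sketch of that definition with a made-up example; the `wer` metric loaded above pools errors over the whole evaluation set, so this is not a substitute for it.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumes the reference contains at least one word.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# One of four reference words is missing from the hypothesis.
print(100 * word_error_rate("el gato negro duerme", "el gato duerme"))
```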

## Load Pre-trained Model

In [None]:
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)

In [None]:
model.config.max_length = MAX_LENGTH
model.generation_config.max_length = MAX_LENGTH

Ensure language-appropriate tokens, if any, are set for generation.

In [None]:
model.config.forced_decoder_ids = processor.tokenizer.get_decoder_prompt_ids(
    language=LANGUAGE, task=TASK
)
model.config.suppress_tokens = []

## Apply LoRA Adapters

`peft` injects the low-rank trainable matrices into the attention blocks (the query and value projection linear layers) via `get_peft_model`. Observe that the proportion of trainable parameters is only 0.73%!

In [None]:
config = LoraConfig(
    r=16, 
    lora_alpha=32, 
    target_modules=["q_proj", "v_proj"], 
    lora_dropout=0.05, 
    bias="none"
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
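
The quoted proportion can be checked with some arithmetic. The sketch below counts the adapter parameters from the `whisper-small` architecture (hidden size 768, 12 encoder and 12 decoder layers). The base parameter count is an assumption taken to be 241,734,912; the value printed by `print_trainable_parameters()` is the authoritative one.

```python
# Architecture values for whisper-small.
D_MODEL = 768
ENCODER_LAYERS = 12
DECODER_LAYERS = 12
RANK = 16
TARGET_PROJECTIONS = 2  # q_proj and v_proj

# Assumption: parameter count of the base checkpoint before adapters are added.
BASE_PARAMS = 241_734_912

# Each adapted projection gains A (rank x d_model) and B (d_model x rank).
params_per_projection = 2 * RANK * D_MODEL

# Encoder layers have one attention block; decoder layers have two
# (self-attention and cross-attention).
attention_blocks = ENCODER_LAYERS * 1 + DECODER_LAYERS * 2

trainable = attention_blocks * TARGET_PROJECTIONS * params_per_projection
percentage = 100 * trainable / (BASE_PARAMS + trainable)
print(f"trainable params: {trainable:,} ({percentage:.2f}%)")
```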

## Fine-tuning Whisper on the IPU

The resulting `peft.PeftModel` can be directly fine-tuned on the IPU using the `IPUSeq2SeqTrainer`. 

The `IPUConfig` object specifies how the model will be pipelined across the IPUs. 

For fine-tuning we place the encoder on one IPU, and the decoder on a different IPU. We then create two replicas to use all four IPUs via data-parallelism.

For inference, the entire model is placed on one IPU, and replicated four times.

In [None]:
replication_factor = n_ipu // 2
ipu_config = IPUConfig.from_dict(
    {
        "recompute_checkpoint_every_layer": True,
        "enable_half_partials": True,
        "executable_cache_dir": "./whisper_exe_cache",
        "gradient_accumulation_steps": 32 // replication_factor,
        "replication_factor": replication_factor,
        "layers_per_ipu": [12, 12],
        "matmul_proportion": [0.2, 0.2],
        "projection_serialization_factor": 5,
        "inference_replication_factor": 4,
        "inference_layers_per_ipu": [-1],
        "inference_matmul_proportion": [0.15],
        "inference_projection_serialization_factor": 5,
    }
)
eval_parallelize_kwargs = {
    "use_cache": True,
    "sequence_serialization_factor": 4, 
    "use_cond_encoder": True,
}
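
The relationship between the IPU count, the replication factor and gradient accumulation in the configuration above can be sketched as a small helper. The function name and constants are hypothetical; the logic simply restates the two expressions used in the cell, which keep the number of micro-batches per weight update at 32 as replicas are added.

```python
IPUS_PER_REPLICA = 2         # encoder on one IPU, decoder on another
MICRO_BATCHES_PER_STEP = 32  # micro-batches per weight update across all replicas


def data_parallel_settings(ipu_count: int) -> dict:
    """Derive replication and gradient accumulation from the IPU count."""
    replication_factor = ipu_count // IPUS_PER_REPLICA
    gradient_accumulation_steps = MICRO_BATCHES_PER_STEP // replication_factor
    return {
        "replication_factor": replication_factor,
        "gradient_accumulation_steps": gradient_accumulation_steps,
    }


for ipu_count in (4, 16):
    print(ipu_count, data_parallel_settings(ipu_count))
```

With 16 IPUs each replica accumulates over fewer micro-batches, so each weight update needs fewer sequential passes per replica.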

Lastly, we specify the arguments controlling the training process.

In [None]:
training_args = IPUSeq2SeqTrainingArguments(
    output_dir="./whisper-small-lora-ipu-checkpoints",
    do_train=True,
    do_eval=True,
    predict_with_generate=True,
    learning_rate=1e-5,
    num_train_epochs=1.0,
    warmup_steps=50,
    evaluation_strategy="steps",
    eval_steps=1000,
    max_steps=1000,
    save_strategy="steps",
    save_steps=1000,
    logging_steps=25,
    dataloader_num_workers=4,
    dataloader_drop_last=True,
    remove_unused_columns=False,  # required as the PeftModel forward doesn't have the signature of the wrapped model's forward
    label_names=["labels"],  # same reason as above
)
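
Note that in 🤗 Transformers training arguments `max_steps` takes precedence over `num_train_epochs`, so this run stops after 1000 steps. The sketch below shows how the learning rate would evolve over those steps, assuming the default linear scheduler of 🤗 Transformers applies here (an assumption, since no scheduler is set explicitly above). The function name is hypothetical.

```python
PEAK_LEARNING_RATE = 1e-5
WARMUP_STEPS = 50
MAX_STEPS = 1000


def linear_schedule(step: int) -> float:
    """Learning rate at `step`: linear warm-up, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, MAX_STEPS - step)
    return PEAK_LEARNING_RATE * remaining / (MAX_STEPS - WARMUP_STEPS)


# Start, mid warm-up, peak, mid decay, and final step.
for step in (0, 25, 50, 525, 1000):
    print(step, linear_schedule(step))
```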

Then we just need to pass all of this together with our datasets to the `IPUSeq2SeqTrainer` class:

In [None]:
trainer = IPUSeq2SeqTrainer(
    model=model,
    ipu_config=ipu_config,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorSpeechSeq2SeqWithLabelProcessing(processor),
    compute_metrics=lambda x: compute_metrics(x, processor.tokenizer),
    tokenizer=processor.feature_extractor,
    eval_parallelize_kwargs=eval_parallelize_kwargs,
)

All that remains is to fine-tune the model! The fine-tuning process should take around an hour.

In [None]:
trainer.train()

The model should achieve a WER of around X%. The original pretrained checkpoint achieves Y% WER on the same dataset.

## Using the model for inference

The fine-tuned model can be used for generation by instantiating it as shown below.

In [12]:
peft_model_id = "./whisper-small-lora-ipu-checkpoints/checkpoint-1000"
peft_config = PeftConfig.from_pretrained(peft_model_id)
processor = WhisperProcessorTorch.from_pretrained(peft_config.base_model_name_or_path)
model = WhisperForConditionalGeneration.from_pretrained(peft_config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, peft_model_id)
# If inference performance is paramount, we can merge the adapter weights back into the 
# base model. This should be done in a deployment scenario, but here we skip this step 
# to demonstrate support of `PeftModel` in pipelines.
# model = model.merge_and_unload()

## Conclusion

In this notebook we demonstrated how to fine-tune Whisper with LoRA for multi-lingual speech recognition and transcription on the IPU. Reducing the fine-tuning time by using more than two replicas requires more IPUs. On Paperspace, these are available using either an IPU-POD16 or a BoW-IPU-POD16. Please contact Graphcore if you need assistance running on larger platforms.

For all available notebooks, check [IPU-powered Jupyter Notebooks](https://www.graphcore.ai/ipu-jupyter-notebooks) to see how IPUs perform on other tasks.

Have a question? Please contact us on our [Graphcore community channel](https://www.graphcore.ai/join-community).
