<div class="alert alert-block alert-warning">
<h1><span style="color:green"> Under-Graduate Research Internship Program (UGRIP) - 2024 <br> Lab 03 - Part A </span></h1>

<h2><span style="color:green"> Transformers for NLP </span></h2>
</div>

---
---

# Attention Is All You Need! Transformers: the architecture

[Watch this now.](https://youtu.be/zxQyTK8quyY?si=bm8xzz5dnFO1Xsqo)

[Read paper later 😎](https://arxiv.org/abs/1706.03762)


## Hugging Face Transformers🤗 for Natural Language Processing: the library


### Automatic Speech Recognition (ASR):

Automatic speech recognition is the task of transcribing spoken words to text. It is framed as a sequence-to-sequence task using the transformer architecture: the raw speech waveform (or its mel spectrogram) is treated as a sequence of continuous numbers, and the tokenized text is treated as a sequence of tokens.

Today **we will build an Arabic ASR model** without the hassle of creating complex model classes, using Hugging Face Transformers🤗. We will fine-tune a pre-trained model (Whisper) from [Hugging Face Transformers](https://huggingface.co/docs/transformers/en/index).



> Pretrained models are machine learning models that have been previously trained on a large dataset and can be used as a starting point for further training or for immediate deployment in various tasks.

Read the Whisper paper later; it is linked from the [official repository](https://github.com/openai/whisper).

The ASR task has 3 steps:
1. Feature extraction: we extract audio features suited to the model we use. Whisper takes a mel spectrogram as its input, so we'll extract a mel spectrogram from the audio. We'll also tokenize the target Arabic text.
2. Model training: we train the model.
3. Output decoding: we decode the model's output tokens back to Arabic characters.

## Make sure you install these dependencies

In [None]:
!pip install -q transformers datasets librosa evaluate jiwer accelerate

In [None]:
import warnings
warnings.filterwarnings("ignore")  # suppress warning messages

## Obtaining dataset

We pull the dataset from [Hugging Face Datasets](https://huggingface.co/datasets). We'll use ClArTTS, a Classical Arabic speech dataset.

In [None]:
from datasets import load_dataset
# load dataset from hugging face
dataset = load_dataset("MBZUAI/ClArTTS")

In [None]:
# Check dataset characteristics

dataset

In [None]:
from IPython.display import Audio

# Play audio sample

print(dataset['test'][0]['text'])
Audio(dataset['test'][0]['audio'], rate= dataset['test'][0]['sampling_rate'])

In [None]:
# View dataset features

dataset['train'].features

## What is audio sampling rate?

> *the number of samples of audio carried per second, measured in Hertz (Hz). It determines how frequently the audio signal is measured or sampled.*


It is important to pay attention to the sampling rate of your audio: it determines how many samples a clip of a given duration contains, and therefore the input length of the audio data. Different pre-trained models also expect different sampling rates.
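
As a rough illustration of the point above, the sketch below counts how many samples a clip contains at two sampling rates. The 5-second duration and the helper name `num_samples` are assumptions made for this example; 40,100 Hz and 16,000 Hz are the rates that appear later in this notebook.

```python
def num_samples(duration_s: float, rate_hz: int) -> int:
    """Number of samples in a clip lasting `duration_s` seconds."""
    return round(duration_s * rate_hz)


clip_seconds = 5.0  # hypothetical clip length
original = num_samples(clip_seconds, 40_100)   # rate used for this dataset
resampled = num_samples(clip_seconds, 16_000)  # rate Whisper expects

print(original, resampled)
```

The same clip is represented by roughly 2.5 times fewer numbers after resampling, which is why the sampling rate has to match what the model was trained on.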

In [None]:
# index the row first so only one example is loaded, not the whole column
dataset['train'][0]['sampling_rate']

### Feature Extraction

We will load the processor from the pre-trained checkpoint, setting the language to `arabic` and task to `transcribe`

In [None]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", # TODO: Complete
)

In [None]:
# Get sampling rate of whisper model
sampling_rate = # TODO: Check docs :)
print(sampling_rate)

Notice that the sampling rate of this dataset is about 40 kHz, whereas Whisper expects 16 kHz. Hence we must resample before feature extraction. We'll create a single function that resamples the audio and extracts the features in one pass.

We resample the audio to 16 kHz using librosa, then use the feature extractor to compute the `log-mel spectrogram` input features from the 1-dimensional audio array.
We encode the transcriptions to `label ids` with the tokenizer.


> Log-mel spectrogram: a mel spectrogram whose values are on a logarithmic scale (log == logarithmic).

> Spectrogram: A spectrogram is a visual representation of the spectrum of frequencies in a signal as it varies with time. It is generated by applying a Short-Time Fourier Transform (STFT) to the audio signal, which breaks it into short overlapping segments and computes the Fourier transform for each segment.

> Mel Scale: The Mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. It is designed to approximate the way humans perceive the pitch of sounds, focusing on how our ears respond to different frequencies.

- Lower frequencies: the Mel scale is approximately linear below 1 kHz.
- Higher frequencies: the Mel scale is approximately logarithmic above 1 kHz.
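
To make the linear-then-logarithmic behaviour concrete, here is a minimal sketch using one common Hz-to-mel formula, $m = 2595 \log_{10}(1 + f/700)$. This particular formula is an assumption for illustration: audio libraries implement slightly different variants of the mel scale, and the function names below are not part of any library used in this notebook.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (one common variant of the formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


for freq in (100, 200, 1000, 2000, 4000, 8000):
    print(f"{freq:>5} Hz -> {hz_to_mel(freq):7.1f} mel")
```

With this variant, 1000 Hz maps to roughly 1000 mel, and equal steps in Hz shrink to smaller and smaller steps in mel as the frequency grows.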


Steps to Generate a Mel Spectrogram:
- Ask ChatGPT `explain mel spectrogram`
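
For readers who prefer code, the first step (the spectrogram itself) can be sketched in pure Python: cut the signal into overlapping frames, apply a window, and take the magnitude of the Fourier transform of each frame. This is a slow toy version for intuition only, not the implementation used by the Whisper feature extractor; the function name, the 8 kHz rate, and the frame sizes are all assumptions. A mel spectrogram would then pool these frequency bins into mel-spaced bands, and a log-mel spectrogram would take the logarithm of the result.

```python
import cmath
import math


def stft_magnitudes(signal, frame_length, hop_length):
    """Magnitudes of a Hann-windowed short-time Fourier transform, one list per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_length) for n in range(frame_length)]
    frames = []
    for start in range(0, len(signal) - frame_length + 1, hop_length):
        frame = [signal[start + n] * window[n] for n in range(frame_length)]
        # keep only the non-negative frequency bins 0 .. frame_length // 2
        spectrum = [
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_length)
                    for n in range(frame_length)))
            for k in range(frame_length // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


toy_rate = 8000  # hypothetical sampling rate in Hz
tone_hz = 1000   # a pure tone we expect to find in the spectrum
toy_signal = [math.sin(2 * math.pi * tone_hz * t / toy_rate) for t in range(512)]

toy_spectrogram = stft_magnitudes(toy_signal, frame_length=64, hop_length=32)
peak_bin = max(range(len(toy_spectrogram[0])), key=lambda k: toy_spectrogram[0][k])
print(len(toy_spectrogram), "frames, peak at", peak_bin * toy_rate / 64, "Hz")
```

Each frame's strongest bin lands on the frequency of the tone, which is the "spectrum of frequencies as it varies with time" described above.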




Note - since the audio is saved as a sequence in this dataset, we convert it to a numpy array before resampling.

In [None]:
import librosa
import numpy as np


def prepare_dataset(example):
    # resample audio with librosa
    audio = librosa.resample(np.array(example["audio"]), orig_sr=example['sampling_rate'], target_sr=sampling_rate)

    # use whisper processor for feature extraction, pass in audio, sampling rate and target text
    example = processor(
        audio=audio,
        sampling_rate=sampling_rate,
        text=example["text"],
    )

    # compute input length of audio sample in seconds
    example["input_length"] = len(audio) / sampling_rate

    return example

In [None]:
# View column names
dataset.column_names['train']

In [None]:
# Map the function over each row of the dataset (similar to `apply` on a pandas DataFrame).
# Processing the full dataset takes about 30 minutes, largely because of the resampling,
# so we run it on the test set only and load an already processed dataset afterwards🥲
# We use remove_columns to drop the columns that are not needed for training.

dataset['test'].map(prepare_dataset, remove_columns=dataset.column_names['train'])


In [None]:
# Load already processed dataset
df = load_dataset("herwoww/clartts-whisper-ugrip")

Finally, we filter out any training samples whose audio is longer than 30s. These samples would otherwise be truncated by the Whisper feature extractor, which could affect the stability of training. We define a function that returns True for samples that are shorter than 30s, and False for those that are longer:

In [None]:
max_input_length = 30.0

def is_audio_in_length_range(length):
    return length < max_input_length

We apply our filter function to all samples of our training dataset through Datasets’ `.filter` method:

In [None]:
df["train"] = df["train"].filter(
    is_audio_in_length_range,
    input_columns=["input_length"],
)

In [None]:
df["train"]

Since the dataset has no samples longer than 30s, the filter has no effect here.

### Training and Evaluation


Now that we’ve prepared our data, we’re ready for training. The Hugging Face Trainer will do much of the heavy lifting for us. All we have to do is:

1. Define a data collator: the data collator takes our pre-processed data and prepares PyTorch tensors ready for the model. (Similar to the PyTorch DataLoader discussed in Lab 2A.)

2. Define evaluation metrics: during evaluation, we want to evaluate the model using the word error rate (WER) metric. We need to define a `compute_metrics` function that handles this computation.

What's WER?

> WER is calculated as the sum of substitutions, insertions, and deletions needed to transform the output of the speech recognition system into the reference text, divided by the total number of words in the reference text.


$$
\text{WER} = \frac{S + D + I}{N}
$$

where:
- $S$ is the number of substitutions (incorrect words),
- $D$ is the number of deletions (words that were missed),
- $I$ is the number of insertions (extra words that were added),
- $N$ is the total number of words in the reference text.


3. Load the pre-trained checkpoint: we need to load the pre-trained checkpoint and configure it correctly for training.

4. Define the training arguments: these will be used by the [huggingface Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer) in constructing the training schedule.
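
The WER formula from step 2 can be sketched in a few lines of pure Python. This is an illustrative implementation based on word-level edit distance, not the one used by the metric we load later in this notebook; the function name and the English example sentences are assumptions, and the sketch assumes a non-empty reference.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(S + D + I) / N computed with word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edits needed to turn the reference words seen so far into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# one substitution (sat -> sit) and one deletion (the), over 6 reference words
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(example_wer, 3))
```

Note that WER can exceed 1 when the hypothesis contains many inserted words, since the denominator only counts reference words.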

<!-- Once we’ve fine-tuned the model, we will evaluate it on the test data to verify that we have correctly trained it to transcribe speech in Arabic. -->



### Data Collator

The data collator for a sequence-to-sequence speech model is unique in the sense that it treats the `input_features` and `labels` independently: the `input_features` must be handled by the feature extractor and the `labels` by the tokenizer.

The `input_features` are already padded to 30s and converted to a log-Mel spectrogram of fixed dimension, so all we have to do is convert them to batched PyTorch tensors. We do this using the feature extractor’s `.pad` method with `return_tensors="pt"`. Note that no additional padding is applied here: since the inputs are of fixed dimension, the `input_features` are simply converted to PyTorch tensors.
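
To see where the fixed dimension comes from, the arithmetic can be sketched as follows. The hop length of 160 samples and the 80 mel bands are assumptions taken to be the Whisper feature extractor's defaults for this checkpoint; inspect `processor.feature_extractor` to confirm them for the model you load.

```python
whisper_rate_hz = 16_000  # sampling rate Whisper expects
chunk_seconds = 30        # every input is padded or truncated to this length
hop_samples = 160         # assumed hop between spectrogram frames (10 ms)
num_mel_bins = 80         # assumed number of mel bands

samples_per_chunk = whisper_rate_hz * chunk_seconds
frames_per_chunk = samples_per_chunk // hop_samples

print("samples per input:", samples_per_chunk)
print("spectrogram shape:", (num_mel_bins, frames_per_chunk))
```

Under these assumptions every example has the same spectrogram shape regardless of how long the original clip was, which is why the collator only needs to stack the inputs.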

On the other hand, the labels are un-padded. We first pad the sequences to the maximum length in the batch using the tokenizer’s `.pad` method. The padding tokens are then replaced by `-100` so that these tokens are not taken into account when computing the loss. We then cut the start-of-transcript token from the beginning of the label sequence, since it is added back later during training.
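
The label handling described above can be illustrated with plain Python lists. This is a simplified sketch: it pads directly with `-100` instead of padding with the tokenizer and masking afterwards, and the token ids and the helper name are made up for the example rather than taken from Whisper's vocabulary.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def pad_and_mask_labels(label_sequences, bos_token_id):
    """Right-pad label id lists to equal length and drop a shared leading start token."""
    max_len = max(len(seq) for seq in label_sequences)
    padded = [seq + [IGNORE_INDEX] * (max_len - len(seq)) for seq in label_sequences]
    # remove the start token only if every sequence begins with it
    if all(seq[0] == bos_token_id for seq in padded):
        padded = [seq[1:] for seq in padded]
    return padded


toy_bos = 1  # hypothetical start-of-transcript id
toy_labels = [[1, 7, 8, 9, 2], [1, 5, 2]]
toy_batch = pad_and_mask_labels(toy_labels, toy_bos)
print(toy_batch)
```

The shorter sequence ends in `-100` entries, so those positions contribute nothing to the loss, and both sequences lose their leading start token.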

We can leverage the WhisperProcessor we defined earlier to perform both the feature extractor and the tokenizer operations:

In [None]:
# First let's set up training device
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
device

In [None]:
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [
            {"input_features": feature["input_features"][0]} for feature in features
        ]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        # if a bos token was added at the start in the previous tokenization step,
        # cut it here since it is added back later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

We can now initialise the data collator we’ve just defined:

In [None]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

### Evaluation metrics

We’ll load the WER metric from the Hugging Face Evaluate library:

In [None]:
import evaluate

metric = # TODO: Check the docs :)

In [None]:
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

normalizer = BasicTextNormalizer()


def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    # decode predictions and labels back to text, dropping special tokens
    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)

    # compute orthographic wer
    wer_ortho = 100 * metric.compute(predictions=pred_str, references=label_str)

    # compute normalised WER
    pred_str_norm = [normalizer(pred) for pred in pred_str]
    label_str_norm = [normalizer(label) for label in label_str]
    # filtering step to only evaluate the samples that correspond to non-empty references:
    pred_str_norm = [
        pred_str_norm[i] for i in range(len(pred_str_norm)) if len(label_str_norm[i]) > 0
    ]
    label_str_norm = [
        label_str_norm[i]
        for i in range(len(label_str_norm))
        if len(label_str_norm[i]) > 0
    ]

    wer = 100 * metric.compute(predictions=pred_str_norm, references=label_str_norm)

    return {"wer_ortho": wer_ortho, "wer": wer}

## Load a Pre-Trained Checkpoint

Now let’s load the pre-trained Whisper small checkpoint.

In [None]:
from transformers import WhisperForConditionalGeneration
# Load pre-trained model and move to device
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to(device)


We’ll set `use_cache` to `False` for training since we’re using gradient checkpointing and the two are incompatible. We’ll also override generation arguments to control the behaviour of the model during inference: we’ll force the language and task tokens during generation by setting the language and task arguments, and also re-enable the cache for generation to speed up inference:
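
If `functools.partial` is new to you, the sketch below shows how it pre-binds keyword arguments so that later calls do not have to repeat them. `toy_generate` is a hypothetical stand-in written for this example, not the real `model.generate`, and the argument values are illustrative.

```python
from functools import partial


def toy_generate(inputs, language=None, task=None, use_cache=False):
    """Hypothetical stand-in for a generate method; it just reports its arguments."""
    return {"inputs": inputs, "language": language, "task": task, "use_cache": use_cache}


# bind the keyword arguments once; later calls only pass the inputs
toy_generate_bound = partial(toy_generate, language="arabic", task="transcribe", use_cache=True)

print(toy_generate_bound("features"))
# a bound argument can still be overridden at call time
print(toy_generate_bound("features", task="translate"))
```

Wrapping a method this way changes its defaults without touching the code that calls it, which is useful when the caller is a library routine you do not control.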



In [None]:
from functools import partial

# disable cache during training since it's incompatible with gradient checkpointing
model.config.use_cache = False

# set language and task for generation and re-enable cache
model.generate = partial(
    model.generate, # TODO: Complete the function call
)

## Define the Training Configuration

We define all the parameters related to training. Here, we set the number of training steps to 500. This is enough steps to see a big WER improvement compared to the pre-trained Whisper model, while ensuring that fine-tuning can be run in approximately 45 minutes on the Google Colab free tier. For more detail on the training arguments, refer to the Seq2SeqTrainingArguments [docs](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments).

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-ar",  # local directory where checkpoints are written
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    lr_scheduler_type="constant_with_warmup",
    warmup_steps=50,
    max_steps=500,  # increase to 4000 if you have your own GPU or a Colab paid plan
    gradient_checkpointing=True,
    fp16=True,
    fp16_full_eval=True,
    evaluation_strategy="steps",  # newer transformers releases name this argument eval_strategy
    per_device_eval_batch_size=16,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=500,
    eval_steps=50,
    logging_steps=25,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
)
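
Two of the settings above are easier to read with a small sketch: the `constant_with_warmup` schedule and the trade-off between batch size and gradient accumulation. The functions below are illustrative re-implementations written for this example (their names are assumptions), not the code the Trainer runs internally.

```python
def constant_with_warmup(step: int, base_lr: float, warmup: int) -> float:
    """Learning rate that ramps linearly from 0 to base_lr, then stays constant."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr


def effective_batch_size(per_device: int, accumulation: int, num_devices: int = 1) -> int:
    """Number of examples that contribute to each optimizer step."""
    return per_device * accumulation * num_devices


for step in (0, 25, 50, 500):
    print(step, constant_with_warmup(step, base_lr=1e-5, warmup=50))

# halving the batch size while doubling accumulation keeps the product unchanged
print(effective_batch_size(16, 1), effective_batch_size(8, 2))
```

If a batch size of 16 does not fit in GPU memory, using 8 with `gradient_accumulation_steps=2` keeps the effective batch size the same at the cost of more forward and backward passes per update.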

We can forward the training arguments to the transformers Trainer along with our model, dataset, data collator and compute_metrics function:

In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=df["train"],
    eval_dataset=df["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor,
)

## Training

To launch training, simply execute `trainer.train()`:



In [None]:
trainer.train()

# Since we use load_best_model_at_end=True, the trainer ends up holding the checkpoint with the best WER
# trainer.save_model("./checkpoint_best")

## Evaluating from fine-tuned model

With the `transformers` `pipeline` we can do the feature extraction, model pass and output decoding steps in one line.



In [None]:
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="/content/whisper-small-ar/checkpoint-500")

In [None]:
dataset['test'][0]['text']

In [None]:
pipe(librosa.resample(np.array(dataset['test'][0]['audio']), orig_sr=40100, target_sr=16000)) # Why do we use np.array here?

Task: Now generate text with the pre-trained model and compare the performance of both approaches.

In [None]:
# TODO: Complete