## **Fine Tune Whisper**

Fine-tuning lets us leverage the extensive multilingual ASR knowledge that Whisper acquired during pre-training for our low-resource language, Singlish.

**Resources**

<u>Fine-tune</u>
- https://medium.com/htx-dsai/finetuning-whisper-for-the-singaporean-home-team-context-a3ae1a6ae809
- https://www.jensenlwt.com/blog/singlish-whisper-finetuning-asr-for-singapore-unique-english
- https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/fine-tune-whisper-non-streaming.ipynb

<u>Stream</u>
- https://huggingface.co/docs/datasets/en/about_mapstyle_vs_iterable
- https://huggingface.co/docs/datasets/en/stream

<u>Create dataset</u>
- https://huggingface.co/docs/datasets/en/audio_dataset
- https://huggingface.co/datasets/AILAB-VNUHCM/vivos/blob/main/vivos.py

<u>PEFT</u>
- https://github.com/Vaibhavs10/fast-whisper-finetuning/blob/main/Whisper_w_PEFT.ipynb
- https://github.com/huggingface/peft/blob/main/examples/int8_training/peft_bnb_whisper_large_v2_training.ipynb


### **Load Dataset**

Whenever the dataset repo changes, clear the local cache so that the updated data is downloaded again. In PowerShell, run `Remove-Item -Recurse -Force ~/.cache/huggingface/datasets/` from the terminal.
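
The `Remove-Item` command above is PowerShell-specific. A cross-platform alternative could look like the sketch below. The helper name `clear_datasets_cache` is hypothetical, and the default path assumes the standard cache location unless the `HF_DATASETS_CACHE` environment variable points elsewhere.

```python
import os
import shutil
from pathlib import Path


def clear_datasets_cache(cache_dir=None):
    """Delete a Hugging Face datasets cache directory if it exists.

    Returns True if a directory was removed, False if there was nothing to remove.
    """
    if cache_dir is None:
        # Assumption: default cache location, unless HF_DATASETS_CACHE overrides it
        cache_dir = os.environ.get(
            "HF_DATASETS_CACHE",
            Path.home() / ".cache" / "huggingface" / "datasets",
        )
    cache_dir = Path(cache_dir).expanduser()
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True
```

Calling `clear_datasets_cache()` before `load_dataset` should then have the same effect as the terminal command.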

In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-18.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)

In [None]:
from datasets import load_dataset
from IPython.display import Audio

**User Action Required**

- Specify the desired dataset to load for fine-tuning

In [None]:
dataset_repo = "johnlohjy/imda_nsc_p3_same_closemic_train"
dataset_train = load_dataset(dataset_repo, split='train', streaming=True, trust_remote_code=True)

In [None]:
print(dataset_train)

IterableDataset({
    features: ['path', 'audio', 'sentence'],
    num_shards: 1
})


### **Prepare Dataset for Whisper**

- Feature extractor
    - Pads (with silence) or truncates audio to 30 s
    - Converts the raw audio input to log-Mel spectrogram input features

- Tokenizer
    - Encodes the target text into label ids for training
    - Maps the sequence of token ids output by the Whisper model back to the corresponding text string
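
The padding and truncation step of the feature extractor can be sketched in plain Python. This is an illustration only, not the `WhisperFeatureExtractor` implementation. The constants are the values I understand to be the Whisper defaults (16 kHz audio, 30 s window, hop length of 160 samples), and `pad_or_truncate` is a hypothetical helper name.

```python
SAMPLING_RATE = 16_000                      # Whisper expects 16 kHz audio
CHUNK_SECONDS = 30                          # fixed input window
N_SAMPLES = SAMPLING_RATE * CHUNK_SECONDS   # 480,000 samples per window
HOP_LENGTH = 160                            # samples between spectrogram frames


def pad_or_truncate(samples, n_samples=N_SAMPLES):
    """Return exactly n_samples values: cut long audio, zero-pad short audio."""
    samples = list(samples[:n_samples])
    return samples + [0.0] * (n_samples - len(samples))


short_clip = [0.1] * (5 * SAMPLING_RATE)    # 5 s of audio
long_clip = [0.1] * (40 * SAMPLING_RATE)    # 40 s of audio

padded = pad_or_truncate(short_clip)        # silence (zeros) appended at the end
truncated = pad_or_truncate(long_clip)      # everything after 30 s is dropped
n_frames = N_SAMPLES // HOP_LENGTH          # spectrogram frames per window
```

Every clip therefore ends up with the same number of samples, so the log-Mel spectrogram has the same number of frames regardless of the original clip length.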

In [None]:
from transformers import WhisperProcessor

**User Action Required**

- Specify the desired Whisper version to fine-tune

In [None]:
whisper_ver = 'whisper-tiny'

In [None]:
# The WhisperProcessor class wraps both the feature extractor and the tokenizer
processor = WhisperProcessor.from_pretrained(f"openai/{whisper_ver}", language="English", task="transcribe")

In [None]:
def prepare_dataset(batch):
    # load audio data
    audio = batch["audio"]

    # Perform feature extraction: Compute log-Mel input features from input audio array
    batch["input_features"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # Perform tokenization: Encode target text to label ids
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

In [None]:
print(dataset_train.column_names)

['path', 'audio', 'sentence']


In [None]:
# IterableDataset.map() for processing IterableDataset. Applies processing on-the-fly as examples are streamed
dataset_train_processed = dataset_train.map(prepare_dataset, remove_columns=dataset_train.column_names)
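
To illustrate what "on-the-fly" means here, the sketch below imitates a lazy map with plain Python generators. It is not the `datasets` implementation. `stream_examples`, `lazy_map` and `add_length` are hypothetical names used only for this illustration.

```python
def stream_examples(log):
    """Hypothetical stand-in for a streamed dataset: yields one example at a time."""
    for i in range(3):
        log.append(f"load {i}")          # record when an example is actually read
        yield {"sentence": f"utterance {i}"}


def lazy_map(examples, fn):
    """Apply fn to each example only when the consumer asks for it."""
    for example in examples:
        yield fn(example)


def add_length(example):
    return {"n_chars": len(example["sentence"])}


log = []
processed = lazy_map(stream_examples(log), add_length)
events_before = list(log)    # nothing has been loaded or processed yet
first = next(processed)      # pulls and processes exactly one example
events_after = list(log)     # only the first example has been read
```

In the same way, `dataset_train_processed` should do no work until it is iterated, for example by the trainer or by `next(iter(dataset_train_processed))`.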

### **Define Data Collator For Training**

- Collate pre-processed examples into batches that the model can train on
  - Pad the audio features to the appropriate max length
  - Pad the tokenized labels to the appropriate max length

In [None]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

In [None]:
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # The data collator takes pre-processed examples and prepares PyTorch tensors for the model
        # input_features and labels are treated independently:
        # - input_features are handled by the feature extractor
        # - labels are handled by the tokenizer

        # Split inputs and labels, since they have different lengths and need different padding methods
        # First treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # Replace padding token ids with -100 so that padded positions
        # are ignored when computing the loss
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # If a BOS (beginning-of-sentence) token was prepended in the previous tokenization step,
        # cut it here, since it is added again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [None]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
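
The label handling in the collator can be traced on toy data without any tensors. The sketch below mirrors the three steps (pad, mask with -100, drop a shared leading BOS token) in plain Python. The token ids and the `collate_labels` helper are made up for illustration and are not Whisper's real ids.

```python
PAD_ID = 0            # hypothetical pad token id, for illustration only
BOS_ID = 1            # hypothetical beginning-of-sentence token id
IGNORE_INDEX = -100   # index that PyTorch's cross-entropy loss ignores by default


def collate_labels(label_seqs, pad_id=PAD_ID, bos_id=BOS_ID):
    """Pad label sequences, mask the padding with -100, and drop a shared BOS."""
    max_len = max(len(seq) for seq in label_seqs)
    # Pad every sequence to the longest one and remember which positions are real
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in label_seqs]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in label_seqs]
    # Positions where the mask is 0 are padding and must not contribute to the loss
    labels = [
        [tok if keep else IGNORE_INDEX for tok, keep in zip(row, mask_row)]
        for row, mask_row in zip(padded, mask)
    ]
    # Drop the first column only if every sequence starts with BOS
    if all(row[0] == bos_id for row in labels):
        labels = [row[1:] for row in labels]
    return labels


labels = collate_labels([[1, 7, 8, 9], [1, 5]])
```

The mask, rather than the pad id itself, decides which positions are ignored, mirroring `attention_mask.ne(1)` in the collator.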

### **Define Evaluation Metrics For Training**

- Monitor the model's performance during training with a task-specific metric
- During evaluation, score the model with the word error rate (WER) metric
  - WER is easier to interpret and compare than the default loss

In [None]:
import evaluate

In [None]:
metric = evaluate.load("wer")

In [None]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    # Undoing the step in the data collator to ignore padded tokens correctly to calculate loss
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    # Decode the predicted and label ids to strings, dropping special tokens
    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    # Compute WER between predictions and reference labels
    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
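
For intuition about the number `compute_metrics` reports, WER is the word-level edit distance between reference and prediction divided by the number of reference words. The sketch below is a minimal standalone implementation for illustration. It is not the code behind `evaluate.load("wer")`, and `word_error_rate` is a hypothetical helper that assumes whitespace-separated words.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# The prediction drops the final word of a five-word reference
wer = 100 * word_error_rate("can you help me lah", "can you help me")
```

One deleted word out of five reference words gives a WER of 20%. Because insertions also count as errors, WER can exceed 100% when the prediction is much longer than the reference.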

### **Define Whisper Model For Fine-Tuning**

In [None]:
from transformers import WhisperForConditionalGeneration

In [None]:
model = WhisperForConditionalGeneration.from_pretrained(f"openai/{whisper_ver}")

In [None]:
# Override generation arguments
# forced_decoder_ids: a list of pairs of integers that maps generation indices
# to the token ids that are forced before sampling
# Setting it to None means no tokens are forced as decoder outputs
model.config.forced_decoder_ids = None
# suppress_tokens: a list of tokens that are suppressed during generation
# The SuppressTokens logits processor sets their log probs to -inf so that they are never sampled
# Setting it to an empty list means no tokens are suppressed during generation
model.config.suppress_tokens = []
# We use gradient checkpointing to save memory during training
# - Only strategically selected activations are stored in the computational graph,
#   and the rest are re-computed during backpropagation to calculate gradients
# - The decoder's key/value cache (use_cache) only speeds up generation and is
#   incompatible with gradient checkpointing, so we set use_cache to False
model.config.use_cache = False