## **Fine Tune Whisper**

Leverage the extensive multilingual ASR knowledge that Whisper acquired during pre-training for our low-resource language: Singlish.

**Resources**

<u>Fine-tune</u>
- https://medium.com/htx-dsai/finetuning-whisper-for-the-singaporean-home-team-context-a3ae1a6ae809
- https://www.jensenlwt.com/blog/singlish-whisper-finetuning-asr-for-singapore-unique-english
- https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/fine-tune-whisper-non-streaming.ipynb

<u>Stream</u>
- https://huggingface.co/docs/datasets/en/about_mapstyle_vs_iterable
- https://huggingface.co/docs/datasets/en/stream

<u>Create dataset</u>
- https://huggingface.co/docs/datasets/en/audio_dataset
- https://huggingface.co/datasets/AILAB-VNUHCM/vivos/blob/main/vivos.py

<u>PEFT</u>
- https://github.com/Vaibhavs10/fast-whisper-finetuning/blob/main/Whisper_w_PEFT.ipynb
- https://github.com/huggingface/peft/blob/main/examples/int8_training/peft_bnb_whisper_large_v2_training.ipynb
- https://huggingface.co/docs/peft/main/en/task_guides/int8-asr
- https://huggingface.co/docs/peft/en/developer_guides/quantization

### **GPU Setup**

In [1]:
import os

In [2]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Mon Nov 18 15:34:08 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P0              51W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [3]:
# Tell the program to use the GPU allocated to us by setting the env variable used by CUDA
# "0" selects the first GPU on the machine
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

### **HuggingFace Environment Setup**

Used to upload model checkpoints to the HF Hub while training

Useful features from HF Hub
- Version control of model checkpoints
- TensorBoard logs, to track important metrics during training

In [4]:
# Log in with a HF access token when prompted (never store the token in the notebook)
#from huggingface_hub import notebook_login
#notebook_login()

### **GoogleDrive Environment Setup**

- Store model checkpoints

In [5]:
from google.colab import drive
google_drive_folder = 'whisper-small-checkpoints'
google_drive_path = f'/content/drive/My Drive/{google_drive_folder}'
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
!ls '/content/drive/My Drive/whisper-tiny-checkpoints'

ls: cannot access '/content/drive/My Drive/whisper-tiny-checkpoints': No such file or directory


### **Load Dataset**

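Whenever changes are made to the dataset repo, clear the local dataset cache before reloading. From a PowerShell terminal, run ```Remove-Item -Recurse -Force ~/.cache/huggingface/datasets/```; on Linux or Colab the equivalent is ```rm -rf ~/.cache/huggingface/datasets/```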

In [7]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl

In [8]:
from datasets import load_dataset
from IPython.display import Audio

**User Action Required**

- Specify the desired dataset to load for fine-tuning

In [9]:
dataset_repo_train = "johnlohjy/imda_nsc_p3_same_closemic_train"
dataset_train = load_dataset(dataset_repo_train, split='train', streaming=True, trust_remote_code=True)


In [10]:
print(dataset_train)

IterableDataset({
    features: ['path', 'audio', 'sentence'],
    num_shards: 1
})


In [11]:
dataset_repo_test = "johnlohjy/imda_nsc_p3_same_closemic_test"
dataset_test = load_dataset(dataset_repo_test, split='test', streaming=True, trust_remote_code=True)


In [12]:
print(dataset_test)

IterableDataset({
    features: ['path', 'audio', 'sentence'],
    num_shards: 1
})


### **Prepare Dataset for Whisper**

- Feature extractor
    - Pads (with silence) or truncates audio to 30 s
    - Converts the raw audio input to log-Mel spectrogram input features

- Tokenizer
    - Encodes each target transcription into label token ids for training
    - Maps the sequence of token ids output by the Whisper model back to the corresponding text string
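
To make the padding step concrete, here is a minimal pure-Python sketch of what "pad or truncate to 30 s" means in sample counts. It assumes Whisper's usual 16 kHz sampling rate and a 160-sample hop between spectrogram frames. `pad_or_trim` is a hypothetical helper written for illustration, not the `WhisperFeatureExtractor` implementation.

```python
SAMPLING_RATE = 16_000   # Hz (assumed Whisper input rate)
CHUNK_SECONDS = 30       # fixed window length
HOP_LENGTH = 160         # samples between spectrogram frames (assumption)
N_SAMPLES = SAMPLING_RATE * CHUNK_SECONDS


def pad_or_trim(samples, length=N_SAMPLES):
    """Return exactly `length` samples: cut long audio, zero-pad short audio."""
    if len(samples) >= length:
        return samples[:length]
    # Zeros correspond to silence
    return samples + [0.0] * (length - len(samples))


short_clip = [0.1] * (5 * SAMPLING_RATE)    # 5 s of dummy audio
long_clip = [0.1] * (45 * SAMPLING_RATE)    # 45 s of dummy audio

padded = pad_or_trim(short_clip)
trimmed = pad_or_trim(long_clip)
n_frames = N_SAMPLES // HOP_LENGTH          # frames per 30 s window

print(len(padded), len(trimmed), n_frames)
```

Under these assumptions every clip becomes 480000 samples, i.e. 3000 spectrogram frames, regardless of its original duration.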

In [13]:
from transformers import WhisperProcessor


**User Action Required**

- Input the desired whisper version for fine-tuning

Whisper Model Card: https://github.com/openai/whisper/blob/main/model-card.md

In [14]:
whisper_ver = 'whisper-small'

In [15]:
# The WhisperProcessor class provides both the feature extractor and the tokenizer
processor = WhisperProcessor.from_pretrained(f"openai/{whisper_ver}", language="en", task="transcribe")


In [16]:
def prepare_dataset(batch):
    # Load the decoded audio (1D sample array plus its sampling rate)
    audio = batch["audio"]

    # Feature extraction: compute log-Mel spectrogram input features from the 1D audio array
    batch["input_features"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # Tokenization: encode the target transcription to label ids
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

In [17]:
print(dataset_train.column_names)

['path', 'audio', 'sentence']


In [18]:
# IterableDataset.map() applies the processing lazily, on-the-fly, as examples are streamed
# Note: num_proc (multiprocessing) is an argument of the map-style Dataset.map(), not of IterableDataset.map(),
# so it does not apply to a streamed dataset
dataset_train_processed = dataset_train.map(prepare_dataset, remove_columns=dataset_train.column_names)

In [19]:
print(dataset_test.column_names)

['path', 'audio', 'sentence']


In [20]:
dataset_test_processed = dataset_test.map(prepare_dataset, remove_columns=dataset_test.column_names)

### **Define Data Collator For Training**

- Prepare data in training batches that are ready to be trained on by the model
  - Pad audio features to appropriate max length
  - Pad tokenized labels to appropriate max length
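
The label-padding step can be sketched without any libraries. The example below is a simplified stand-in for what the tokenizer's padding and the `-100` masking do together. `pad_labels` and `mask_padding` are hypothetical helpers, and the token ids are made up except for 50257, which is the pad/EOS id printed further down in this notebook.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_labels(label_seqs, pad_id):
    """Right-pad every sequence to the longest one and build an attention mask."""
    max_len = max(len(seq) for seq in label_seqs)
    padded, mask = [], []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


def mask_padding(padded, mask):
    """Replace padded positions (mask == 0) with IGNORE_INDEX."""
    return [
        [tok if m == 1 else IGNORE_INDEX for tok, m in zip(row, mask_row)]
        for row, mask_row in zip(padded, mask)
    ]


PAD_ID = 50257  # same id as EOS for this tokenizer
batch = [[7, 11, 22, PAD_ID], [7, 33, PAD_ID]]  # each sequence ends with a real EOS

padded, mask = pad_labels(batch, PAD_ID)
labels = mask_padding(padded, mask)
print(labels)
```

Because padding is identified through the attention mask rather than by token id, the genuine EOS token at the end of each sequence stays in the labels even though it shares its id with the pad token.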
  

In [21]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

In [22]:
processor.tokenizer.eos_token_id

50257

In [23]:
processor.tokenizer.pad_token_id

50257

In [24]:
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # The data collator takes pre-processed examples and prepares PyTorch tensors ready for the model.
        # input_features and labels are treated independently because they have different lengths
        # and need different padding methods: the feature extractor handles input_features,
        # the tokenizer handles labels.

        # The audio inputs are already fixed-size, so this simply returns them as torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # Pad the labels to the longest sequence in the batch
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # Replace padding with -100 so that padded positions are ignored when computing the loss.
        # eos_token_id and pad_token_id are both 50257, so padding is identified through the
        # attention mask rather than by token id; the real EOS token stays in the labels.
        # Related warning: "The attention mask is not set and cannot be inferred from input
        # because pad token is same as eos token."
        # https://discuss.huggingface.co/t/finetuning-whisper-attention-mask-not-set-and-canot-be-inferred/97456
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # If the BOS (beginning-of-sentence) token was added in the previous tokenization step,
        # cut it from the start of the label sequence because it is added again later during training
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [25]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

### **Define Evaluation Metrics For Training**

- To monitor the model's performance more effectively
- During evaluation, the model is scored with the word error rate (WER) metric
  - WER is a better basis for comparison than the default loss metric
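
As a rough illustration of what WER measures, here is a minimal pure-Python sketch: the number of word-level substitutions, insertions and deletions needed to turn the prediction into the reference, divided by the number of reference words. It is a simplified stand-in for the `wer` metric loaded below, not its implementation. `normalize`, `edit_distance` and `wer` are names chosen for this example, and the sentences are made up.

```python
import re


def normalize(text):
    """Lowercase and strip punctuation before scoring."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return text.strip()


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance using a rolling row."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, predictions):
    """Corpus-level WER in percent: total errors / total reference words."""
    errors, total = 0, 0
    for ref, hyp in zip(references, predictions):
        ref_words = normalize(ref).split()
        hyp_words = normalize(hyp).split()
        errors += edit_distance(ref_words, hyp_words)
        total += len(ref_words)
    return 100 * errors / total


refs = ["Can you pass me the kopi, please?"]
hyps = ["can you pass me the coffee please"]
score = wer(refs, hyps)
print(round(score, 2))
```

Here one word out of seven is wrong after normalization, so the score is 100/7. Casing and punctuation differences do not count as errors.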

In [26]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [27]:
!pip install jiwer

Collecting jiwer
  Downloading jiwer-3.0.5-py3-none-any.whl.metadata (2.7 kB)
Collecting rapidfuzz<4,>=3 (from jiwer)
  Downloading rapidfuzz-3.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading jiwer-3.0.5-py3-none-any.whl (21 kB)
Downloading rapidfuzz-3.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Installing collected packages: rapidfuzz, jiwer
Successfully installed jiwer-3.0.5 rapidfuzz-3.10.1


In [28]:
import evaluate

In [29]:
metric = evaluate.load("wer")


In [30]:
import re
# https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string
def normalize_wer(text):
    # Lowercase and strip punctuation so that casing and punctuation differences do not count as errors
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text.strip()

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # Replace -100 with the pad_token_id in label_ids.
    # This undoes the masking step in the data collator so that the labels can be decoded
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    # Decode the predicted and label ids to strings (special tokens are skipped)
    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    # Normalize predictions and references in the same way before scoring
    pred_str = [normalize_wer(text) for text in pred_str]
    label_str = [normalize_wer(text) for text in label_str]

    # Compute WER (as a percentage) between predictions and reference labels
    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

### **Define Whisper Model for fine-tuning**





In [31]:
!pip install -q bitsandbytes accelerate

In [32]:
from transformers import WhisperForConditionalGeneration

In [33]:
model = WhisperForConditionalGeneration.from_pretrained(f"openai/{whisper_ver}", device_map="auto")


In [34]:
# Override generation arguments
# forced_decoder_ids: a list of pairs of integers mapping generation indices to token ids
# that are forced before sampling. None means no tokens are forced as decoder outputs
model.config.forced_decoder_ids = None
# suppress_tokens: a list of tokens that are suppressed at generation.
# The SuppressTokens logits processor sets their log probs to -inf so that they are not sampled.
# An empty list means no tokens are suppressed during generation
model.config.suppress_tokens = []
# We are using gradient checkpointing to save memory:
# only strategically selected activations are kept during the forward pass, and the rest
# are re-computed during backpropagation to calculate gradients.
# use_cache (the decoder's key/value cache that speeds up generation) is incompatible with
# gradient checkpointing, so it is disabled for training
model.config.use_cache = False # re-enable for inference!

### **Define Training Config**

```predict_with_generate```: Whether to use generate to calculate generative metrics

```compute_metrics```: Function that will be used to compute metrics at evaluation

When ```predict_with_generate=True```, the trainer calls the model's ```generate``` method during evaluation to produce an output sequence for each input sample in the evaluation dataset.
- The outputs are produced autoregressively, the same way as at inference time
- This allows sequence-level metrics such as WER, which assess the model more accurately than the loss alone

The ```compute_metrics``` function takes these generated sequences and calculates the metrics.


In [35]:
from transformers import Seq2SeqTrainingArguments

In [36]:
# Mix of arguments from
# https://medium.com/htx-dsai/finetuning-whisper-for-the-singaporean-home-team-context-a3ae1a6ae809
# https://www.jensenlwt.com/blog/singlish-whisper-finetuning-asr-for-singapore-unique-english
# https://github.com/huggingface/peft/blob/main/examples/int8_training/peft_bnb_whisper_large_v2_training.ipynb

'''
Calculating number of steps for 100 hours
https://discuss.huggingface.co/t/what-is-the-meaning-of-steps-parameters/56411

num_samples is 80276 files (from the train folder): 530 hours

To get 100 hours, take 80276/5=16055 samples. Round to 16100

In one step, 'batch_size' samples are processed

Assuming batch_size=8,

To process 16100 samples, we need 16100/8=2012.5 steps

2015 steps.

...

Tutorial for whisper-small ran for 5000 steps

'''

# https://huggingface.co/docs/transformers/v4.46.2/en/main_classes/trainer#transformers.Seq2SeqTrainingArguments
training_args = Seq2SeqTrainingArguments(
    output_dir=google_drive_path, # Output dir where model predictions and checkpoints are written
    per_device_train_batch_size=128, # default, HF:8. HTX:64. Jensen:128
    gradient_accumulation_steps=1, # default, HF:1. HTX:1. Jensen:1. A way to train with a batch size that doesn't fit in memory: gradients are accumulated over several smaller batches and the weights are updated once the desired batch size is reached (works because gradient calculation is linear). According to the GitHub notebook: increase by 2x for every 2x decrease in batch size
    learning_rate=1.25e-5, # Note that different sized models are used across all 3. HF:1e-3. HTX:6.25e-6. Jensen:1e-5. Suggested by whisper paper for tiny: 3.75x10^-5 : https://github.com/vasistalodagala/whisper-finetune?tab=readme-ov-file#hyperparameter-tuning
    warmup_steps=500, # HF:50. HTX:300. Jensen:500. Increase the learning rate from 0. Idea: Helps the network to slowly adapt to the data (intuitively)
    #num_train_epochs=3, # total number of training epochs
    evaluation_strategy="steps", # HF:epoch. HTX:steps. Jensen:steps. Newer transformers releases name this argument eval_strategy
    fp16=True, # Whether to use fp16 16-bit (mixed) precision training instead of 32-bit training.
    per_device_eval_batch_size=32, # default, HF:8. HTX:32. Jensen:32
    generation_max_length=225, # HF:128. HTX:225. Jensen 225. Max length to use on each evaluation loop when predict_with_generate=True. Defaults to max_length of the model config
    logging_steps=25, # HF:25. HTX:25. Jensen:500. Log every x steps. In one step, batch_size examples are processed. An epoch consists of 1 full cycle through the training data (num_samples). logging_dir defaults to *output_dir/runs/CURRENT_DATETIME_HOSTNAME*
    remove_unused_columns=False, # If True, columns unused by the model's forward method are removed automatically. =False was carried over from the PEFT notebook, where it is required because PeftModel's forward doesn't have the signature of the base model's forward
    label_names=["labels"], # list of keys in our dict of inputs that correspond to the labels,

    gradient_checkpointing = True, # Can use because we are not using PEFT. Use gradient checkpointing to save memory at the expense of a slower backward pass
    max_steps = 5000, # HTX:3000. Jensen:5000. Overrides num_train_epochs
    predict_with_generate = True,
    save_steps = 50, #HTX:50. Jensen:500. Number of update steps between two checkpoint saves if save_strategy="steps". Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as ratio of total training steps
    eval_steps = 50, #HTX:50. Jensen:500
    load_best_model_at_end=True, #  Whether or not to load the best model found during training at the end of training. When this option is enabled, the best checkpoint will always be saved
    metric_for_best_model="wer", # from custom compute_metrics function
    greater_is_better=False, # because lower wer is better

    # push_to_hub = False, # Jensen only. default is false. push_to_hub (bool, optional, defaults to False) — Whether or not to push the model to the Hub every time the model is saved. If this is activated, output_dir will begin a git directory synced with the repo (determined by hub_model_id) and the content will be pushed each time a save is triggered (depending on your save_strategy). Calling save_model() will also trigger a push.
    # max_grad_norm=1.0 # Jensen only. Default is also 1
)
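
The step arithmetic in the notes above can be checked with a short sketch. It uses the sample counts from those notes (80276 training files, about 16100 of them for roughly 100 hours). `optimizer_steps` and `epochs_covered` are hypothetical helpers, and a single GPU is assumed.

```python
import math

NUM_TRAIN_SAMPLES = 80_276   # files in the train folder (from the notes above)
SAMPLES_100H = 16_100        # rounded sample count for ~100 hours (from the notes above)


def optimizer_steps(num_samples, per_device_batch_size, grad_accum_steps=1, num_devices=1):
    """Steps needed to see `num_samples` once, given the effective batch size."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return math.ceil(num_samples / effective_batch)


def epochs_covered(max_steps, per_device_batch_size, grad_accum_steps=1, num_devices=1):
    """How many passes over the full training set `max_steps` corresponds to."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return max_steps * effective_batch / NUM_TRAIN_SAMPLES


steps_bs8 = optimizer_steps(SAMPLES_100H, 8)
steps_bs128 = optimizer_steps(SAMPLES_100H, 128)
steps_bs64_accum2 = optimizer_steps(SAMPLES_100H, 64, grad_accum_steps=2)
epochs = epochs_covered(5000, 128)

print(steps_bs8, steps_bs128, steps_bs64_accum2, round(epochs, 2))
```

With a batch size of 8 this gives 2013 steps, close to the 2015 in the notes. With the batch size of 128 actually configured, the same 16100 samples take only 126 steps, and `max_steps=5000` corresponds to roughly 8 passes over all 80276 files. Halving the batch size while doubling `gradient_accumulation_steps` leaves the step count unchanged.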



### **Training**

- https://wandb.ai/mostafaibrahim17/ml-articles/reports/A-Deep-Dive-Into-Learning-Curves-in-Machine-Learning--Vmlldzo0NjA1ODY0

- Monitor the training and evaluation loss curve
  - Signs of overfitting: the two curves diverge, i.e. the train loss continues to decrease (or begins to fluctuate) while the eval loss plateaus
- Select the model checkpoint based on the training and evaluation loss curve as well as the wer curve
  - Select the checkpoint that has the best balance where the training loss is still relatively low and the validation loss has not started to rise
- Consider ```EarlyStoppingCallback``` and Auto-Pause features
  - ```EarlyStoppingCallback``` can help stop training early if the model's performance stops improving
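
The idea behind patience-based early stopping and best-checkpoint selection can be sketched in plain Python using the WER values logged in the training run further down. This is a simplified illustration of the logic, not the code of ```EarlyStoppingCallback```. `early_stopping_step` and its `patience`/`threshold` parameters are names chosen for this example.

```python
def early_stopping_step(history, patience, threshold=0.0):
    """Scan (step, metric) pairs where lower is better.

    Returns (stop_step, best_step): stop_step is the step at which training
    would stop after `patience` evaluations in a row without improvement,
    or None if that never happens.
    """
    best_step, best_value = None, float("inf")
    bad_evals = 0
    for step, value in history:
        if value < best_value - threshold:
            best_step, best_value = step, value
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return step, best_step
    return None, best_step


# (step, WER) pairs copied from the training log in this notebook
wer_log = [
    (50, 41.505576), (100, 25.167286), (150, 15.092937), (200, 17.918216),
    (250, 12.825279), (300, 14.089219), (350, 11.979554), (400, 11.877323),
]

print(early_stopping_step(wer_log, patience=2))
print(early_stopping_step(wer_log, patience=1))
```

With a patience of 2 the run never stops in these 400 steps and the best checkpoint is step 400. With a patience of 1 it would stop at step 200 and keep step 150, which shows how a noisy WER curve can end training too early when the patience is small.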

<u>Curves</u>
- Smoothness: Smooth learning curve indicates model is steadily improving over time (learning in a stable and consistent manner) -> Changes gradually and consistently
  - No big and sudden jumps

- Convergence: When learning curve reaches steady state (levels off or plateaus), further training doesn't lead to significant improvements. This means the model has learned as much as it can from the training data and has reached its best performance
  
- Generalisation: Train-Eval loss curve does not show a big difference.
  - Model could be overfitting: Memorises training data too well and struggles to handle new examples

- Curve types
  - Low training loss, High validation loss: Model is overfitting
    - Model is too focused on capturing the patterns of training data, does not generalise well to new, unseen data
    - Solutions: Regularization methods, early stopping
  - Training loss decreases, Validation loss plateaus: Model is overfitting. Same issue as above: the model is too specialized in fitting the training set and struggles to apply the patterns it learned to unseen examples.
  - Large gap between training and validation performance: Issues with generalisation.
    - Model is not able to generalize its learnings from the training data to new examples effectively
  - Erratic/Unstable learning curve: Problems with model or data
    - High learning rate causing instability, inadequate processing of data, noisy data

Problem: Model overfitting
- Solution: Regularization techniques, dropout, early stopping

Problem: Model underfitting
- Solution: Experiment with other hyperparameters: learning rate, batch size etc.

  


In [37]:
from transformers import Seq2SeqTrainer, TrainerCallback, TrainingArguments, TrainerState, TrainerControl

In [38]:
# Slice the streaming dataset because evaluating on the whole set would take too long
from itertools import islice
from torch.utils.data import IterableDataset

class SlicedDataset(IterableDataset):
    def __init__(self, dataset, num_examples):
        self.dataset = dataset
        self.num_examples = num_examples

    def __iter__(self):
        return islice(iter(self.dataset), self.num_examples)

    def __len__(self):
        return self.num_examples

dataset_test_processed_reduced = SlicedDataset(dataset_test_processed, num_examples=200)
# dataset_test_processed_reduced = SlicedDataset(dataset_test_processed, num_examples=3)
# dataset_test_processed_reduced_iter = iter(dataset_test_processed_reduced)
# print(next(dataset_test_processed_reduced_iter))
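
A small standalone sketch of why `islice` suits a streamed dataset: it pulls only the requested number of examples and never touches the rest. `CountingStream` is a hypothetical stand-in for the streamed dataset that counts how many examples were actually read.

```python
from itertools import islice


class CountingStream:
    """Stand-in for a streamed dataset: yields examples lazily and counts reads."""

    def __init__(self, total):
        self.total = total
        self.examples_read = 0

    def __iter__(self):
        for i in range(self.total):
            self.examples_read += 1
            yield {"id": i}


stream = CountingStream(total=1_000_000)

first_three = list(islice(iter(stream), 3))
reads_after_first_pass = stream.examples_read

# A fresh iterator restarts from the beginning, so every pass sees the same examples
second_pass = list(islice(iter(stream), 3))

print(first_three)
print(reads_after_first_pass)
```

Only three examples are read out of the million available, and each new pass yields the same three, so every evaluation round scores the model on an identical subset.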

In [39]:
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset_train_processed,
    #eval_dataset=dataset_test_processed,
    eval_dataset=dataset_test_processed_reduced,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    #processing_class = processor # replace tokenizer with this
    tokenizer=processor.feature_extractor,
)

max_steps is given, it will override any value given in num_train_epochs


**Training Warnings and Errors**

- Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
  - ```processor = WhisperProcessor.from_pretrained(f"openai/{whisper_ver}", language="en", task="transcribe")```

- The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
  - ```labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)```
  - https://discuss.huggingface.co/t/finetuning-whisper-attention-mask-not-set-and-canot-be-inferred/97456/4

- Evaluation takes a long time
  - https://discuss.huggingface.co/t/trainer-freezes-crashes-after-evaluation-step/77556
  - https://stackoverflow.com/questions/78128694/huggingface-seq2seqtrainer-freezes-on-evaluation
  - Probably because evaluation runs on the whole evaluation dataset, which is why the evaluation set is sliced to 200 examples with ```SlicedDataset```

In [40]:
trainer.train()

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc



Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.

| Step | Training Loss | Validation Loss | Wer |
|---|---|---|---|
| 50 | 1.2192 | 1.045158 | 41.505576 |
| 100 | 0.7272 | 0.584415 | 25.167286 |
| 150 | 0.5634 | 0.503478 | 15.092937 |
| 200 | 0.5258 | 0.439317 | 17.918216 |
| 250 | 0.3959 | 0.361556 | 12.825279 |
| 300 | 0.2986 | 0.334546 | 14.089219 |
| 350 | 0.3773 | 0.332445 | 11.979554 |
| 400 | 0.3799 | 0.319868 | 11.877323 |


KeyboardInterrupt: 

### **Clean Up**

In [41]:
drive.flush_and_unmount()


### **Extra**

<u>Quantization</u>

Load the pre-trained Whisper model for fine-tuning

Load the model in 8-bit (8-bit integers): Quantize the model to use 1/4th precision as compared to float32 with minimal loss in performance

Uses the bitsandbytes lib

Quantization reduces the memory and computational cost of a model by representing its weights and activations with a lower-precision data type, such as 8-bit integers (int8), instead of the usual 32-bit floating point (float32). This makes it possible to load larger models than would normally fit into memory, and it can speed up inference because calculations with fewer bits take less time.

For example, if the model weights are stored as 32-bit floating points and they are quantized to 16-bit floating points, the model size is halved, which makes the model easier to store and reduces memory usage.

Fewer bits also mean that the model consumes less energy (in theory), and operations like matrix multiplication can be performed much faster with integer arithmetic. Quantization also allows models to run on embedded devices, which sometimes only support integer data types.

8-bit (int8) quantization uses only a quarter of the bits of float32. It does not just drop bits or data: values are rounded from one data type to the other, which is why the loss in accuracy is usually small.
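
The rounding idea can be illustrated with a minimal absmax quantization sketch in plain Python. This is one common textbook scheme, shown only to make the "round, don't drop" point concrete; bitsandbytes uses a more elaborate method internally. `quantize_absmax` and `dequantize` are hypothetical helpers, and the weights are made-up values.

```python
def quantize_absmax(values):
    """Map floats to integers in [-127, 127] using a single scale factor.

    Assumes at least one value is non-zero.
    """
    scale = max(abs(v) for v in values) / 127
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 representation."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]   # made-up float32-style weights

q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(max_error <= scale / 2 + 1e-12)
```

Each weight is stored as one small integer plus a shared scale, and the rounding error per weight is at most half the scale. Very small weights such as 0.003 round to zero, which is where the (usually small) accuracy loss comes from.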

Setting device_map="auto" automatically fills all available space on the GPU(s) first, then the CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory

Enable int8 Quantization, more for inference
- reduced memory usage for storing weights and faster computation

```
from transformers import BitsAndBytesConfig, WhisperForConditionalGeneration
from peft import prepare_model_for_kbit_training

# Load the model weights in 8-bit via bitsandbytes
model = WhisperForConditionalGeneration.from_pretrained(
    f"openai/{whisper_ver}",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# Prepare the quantized model for training -> This is more for PEFT
# (older peft releases exposed this step as prepare_model_for_int8_training)
# - adds a forward hook to the input embedding layer to calculate the gradients of the input hidden states
# - enables gradient checkpointing for more memory-efficient training
# - casts all the non int8 modules to full precision (fp32) for stability
# not all parts need to be in 8-bit
model = prepare_model_for_kbit_training(model)
```