# Fine-Tuning the Whisper Model on Darija

### Notes:
* We'll use the Hugging Face `datasets` library to store and load the data, and Hugging Face Transformers to load and fine-tune the model.
* The model was chosen based on hardware constraints and on research indicating its strong performance on English-language tasks.
* The model's performance will be evaluated using Word Error Rate (WER).
* The dataset was sourced from YouTube.

### Steps:
* Prepare Feature Extractor, Tokenizer, and Data: Set up the feature extractor, tokenizer, and data loaders for training and validation.
* Fine-tuning the Model and Saving the Checkpoint: Implement the fine-tuning process, adjusting the model's parameters to specialize in Darija transcription.
* Evaluate the Model Performance: Assess the model's performance using WER on the validation set.
* Additional Steps: Any further enhancements or adjustments to improve the model's efficacy and efficiency can be explored.

## Part 1: Prepare Feature Extractor, Tokenizer, and Data

In [1]:
# Install the necessary libraries and packages
# (version specifiers are quoted so the shell does not treat ">" as a redirect)
!pip install --upgrade pip
!pip install accelerate -U
!pip install "datasets>=2.6.1"
!pip install git+https://github.com/huggingface/transformers
!pip install librosa
!pip install "evaluate>=0.3.0"
!pip install jiwer
!pip install gradio
!pip install --upgrade datasets transformers accelerate soundfile librosa evaluate jiwer tensorboard gradio
!pip install "transformers[torch]"
!pip install "accelerate>=0.21.0"
!pip install webvtt-py



In [2]:
# Authenticate with your Hugging Face account
from huggingface_hub import notebook_login

notebook_login()


#### Load Data
We have approximately 1 h 30 min of training data and about 20 minutes of validation data, which we'll use as the test set:

In [3]:
from datasets import load_dataset, DatasetDict

dar_voice = DatasetDict()

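# `token=True` reuses the credentials stored by notebook_login()
# (it replaces the deprecated `use_auth_token` argument)
dar_voice["train"] = load_dataset("team4/8dretna_daridja", split="train", token=True)
dar_voice["test"] = load_dataset("team4/8dretna_daridja", split="validation", token=True)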



In [4]:
# Remove unnecessary columns
dar_voice = dar_voice.remove_columns(['start_time', 'end_time'])

#### Prepare Feature Extractor, Tokenizer and Data
The Whisper feature extractor performs two operations:

* Pads or truncates the audio inputs to 30s: inputs shorter than 30s are padded to 30s with silence (zeros), and inputs longer than 30s are truncated to 30s.
* Converts the audio inputs to log-Mel spectrogram input features, a visual representation of the audio and the input format the Whisper model expects.
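
The pad/truncate step can be illustrated without any library. The sketch below is a toy version written for this notebook, not the `WhisperFeatureExtractor` implementation; the helper name `pad_or_truncate` is hypothetical, and it assumes audio that is already sampled at 16 kHz.

```python
# Toy illustration of the pad/truncate step; not the library implementation.
SAMPLING_RATE = 16_000                        # Hz, the rate Whisper expects
MAX_SECONDS = 30                              # fixed input window
TARGET_LENGTH = SAMPLING_RATE * MAX_SECONDS   # 480,000 samples


def pad_or_truncate(samples, target_length=TARGET_LENGTH):
    """Return a copy of `samples` that is exactly `target_length` long."""
    if len(samples) >= target_length:
        # Drop everything past the 30 s window
        return list(samples[:target_length])
    # Append silence (zeros) until the window is full
    padding = [0.0] * (target_length - len(samples))
    return list(samples) + padding


short_clip = [0.1] * (5 * SAMPLING_RATE)    # 5 s of audio
long_clip = [0.1] * (40 * SAMPLING_RATE)    # 40 s of audio

print(len(pad_or_truncate(short_clip)))  # 480000
print(len(pad_or_truncate(long_clip)))   # 480000
```

Both clips end up with the same length, which is why every example later yields a spectrogram of the same shape.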

In [5]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

The Whisper model outputs a sequence of token ids. The tokenizer maps each of these token ids to its corresponding text string.

In [6]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="arabic", task="transcribe")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


The WhisperProcessor class wraps both the WhisperFeatureExtractor and the WhisperTokenizer, providing a single interface for preparing audio inputs and decoding model predictions.

In [7]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="arabic", task="transcribe")



#### Prepare Data

In [8]:
print(dar_voice["train"][0])

{'audio': {'path': 'darfchouch_train_00-00-00.mp3', 'array': array([ 0.        ,  0.        ,  0.        , ..., -0.01024634,
        0.00450849,  0.01092858]), 'sampling_rate': 44100}, 'text': 'ليوم جبنالكم واحد ليزانفيتي خرجناهم من واحد الدار واحد كان اسمو وسيم خرج اسمو مختار او وحدة ما تخرجش ملكوزينا مرحبا بيكم مايا لايماش'}


* As the output shows, the sampling rate is 44.1 kHz (44,100 Hz). Before feeding the audio into the Whisper feature extractor, we need to downsample it to 16 kHz, the sampling rate the Whisper model expects.
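
To make the effect of downsampling concrete, here is a deliberately naive sketch that resamples by linear interpolation. It is only an illustration: the function name `resample_linear` is hypothetical, and a real resampler also applies an anti-aliasing low-pass filter, which this sketch omits.

```python
def resample_linear(samples, orig_rate, target_rate):
    """Naive linear-interpolation resampler (no anti-aliasing filter)."""
    if not samples:
        return []
    n_out = int(len(samples) * target_rate / orig_rate)
    step = orig_rate / target_rate  # input samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step                           # fractional input position
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        # Blend the two neighbouring input samples
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


one_second = [float(i) for i in range(44_100)]   # 1 s ramp at 44.1 kHz
resampled = resample_linear(one_second, 44_100, 16_000)
print(len(resampled))  # 16000
```

One second of audio shrinks from 44,100 samples to 16,000 samples while covering the same duration.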

In [9]:
from datasets import Audio

dar_voice = dar_voice.cast_column("audio", Audio(sampling_rate=16000))

In [10]:
print(dar_voice["train"][0])

{'audio': {'path': 'darfchouch_train_00-00-00.mp3', 'array': array([ 0.        ,  0.        ,  0.        , ..., -0.02435937,
       -0.03153793, -0.00449912]), 'sampling_rate': 16000}, 'text': 'ليوم جبنالكم واحد ليزانفيتي خرجناهم من واحد الدار واحد كان اسمو وسيم خرج اسمو مختار او وحدة ما تخرجش ملكوزينا مرحبا بيكم مايا لايماش'}


* Accessing batch["audio"] loads the audio and resamples it to 16 kHz on the fly, so the preparation function only needs to read that column.

In [11]:
def prepare_dataset(batch):
    # load the audio; the cast column resamples it from 44.1 kHz to 16 kHz on access
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["text"]).input_ids
    return batch

In [12]:
# Run the function on every example in both splits
dar_voice = dar_voice.map(prepare_dataset, remove_columns=dar_voice.column_names["train"], num_proc=2)

#### Define a Data Collator
The data collator for a sequence-to-sequence speech model handles input features and labels independently. The input features have already been padded or truncated to 30 seconds and converted to fixed-dimension log-Mel spectrograms, so the feature extractor's .pad method only needs to convert them into batched PyTorch tensors. The labels, on the other hand, are still unpadded: the tokenizer's .pad method pads them to the length of the longest sequence in the batch. Padding tokens are then replaced with -100 so that they are excluded from the loss computation, and the beginning-of-sequence (BOS) token is removed from the start of the label sequence because it is appended again later during training. The WhisperProcessor class gives access to both the feature extractor and the tokenizer for these operations.
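
The label handling described above can be sketched on toy data with plain Python lists. The token ids below (`PAD_ID`, `BOS_ID`) and the helper `collate_labels` are hypothetical values chosen for illustration; the real collator works on PyTorch tensors and takes the ids from the tokenizer.

```python
IGNORE_INDEX = -100   # label value that the loss computation skips
PAD_ID = 0            # hypothetical padding token id
BOS_ID = 1            # hypothetical beginning-of-sequence token id


def collate_labels(label_sequences):
    """Pad label lists to the batch maximum, mask padding, strip a shared BOS."""
    max_len = max(len(seq) for seq in label_sequences)
    padded, masks = [], []
    for seq in label_sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [PAD_ID] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)   # 1 = real token, 0 = padding

    # Replace padded positions with -100 using the attention mask
    labels = [
        [tok if keep else IGNORE_INDEX for tok, keep in zip(row, mask)]
        for row, mask in zip(padded, masks)
    ]

    # Drop the first token only if every sequence starts with BOS
    if all(row[0] == BOS_ID for row in labels):
        labels = [row[1:] for row in labels]
    return labels


batch = collate_labels([[1, 7, 8, 9], [1, 5]])
print(batch)  # [[7, 8, 9], [5, -100, -100]]
```

Masking with the attention mask rather than by comparing against `PAD_ID` keeps a genuine token with the same id from being ignored by mistake.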

In [13]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Split inputs and labels since they have to be of different lengths and need different padding methods
        # First treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        #Get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        #Pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        #Replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # If the BOS token was prepended in the earlier tokenization step,
        # cut it here because it is appended again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [14]:
#Initialise the data collator
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

In [15]:
import evaluate

metric = evaluate.load("wer")


In [16]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    #Replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    #We do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
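
For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference. The following is a minimal pure-Python sketch of that definition; the function `word_error_rate` is a hypothetical helper and does not reproduce any text normalization that the `evaluate` or `jiwer` libraries may apply.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 (or 100 when scaled as in `compute_metrics`) when the hypothesis is much longer than the reference.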

In [17]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

In [18]:
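# Do not force any decoder prompt tokens during generation
model.config.forced_decoder_ids = None
# Do not suppress any tokens during generation
model.config.suppress_tokens = []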

In [19]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="team4/whisperDAR",
    per_device_train_batch_size=8,      # Reasonable batch size for efficiency
    gradient_accumulation_steps=4,      # Accumulate gradients over 4 batches per optimizer step
    learning_rate=3e-5,                  # Common learning rate for fine-tuning
    warmup_steps=300,                   # Conservative warmup steps
    max_steps=5000,                      # Utilize the 10-hour training time
    gradient_checkpointing=True,        # Save memory, recommended
    fp16=True,                          # Leveraging mixed precision training
    evaluation_strategy="steps",        # Evaluate every few steps
    per_device_eval_batch_size=8,       # Consistent with train batch size
    predict_with_generate=True,         # Enable generation during evaluation
    generation_max_length=50,           # Suitable length for generation
    save_steps=500,                     # Save model periodically
    eval_steps=500,                     # Evaluate every save step
    logging_steps=100,                  # Log metrics regularly
    report_to=["tensorboard"],
    load_best_model_at_end=True,        # Load the best model at the end
    metric_for_best_model="wer",        # Use WER for model selection
    greater_is_better=False,            # Lower WER is better
    push_to_hub=True,                   # Push model to Hub after training
    save_total_limit=5,                 # Limit the number of saved checkpoints
    num_train_epochs=25,                # Ignored when max_steps is set
)
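
The batch-related settings combine as follows. This is a small arithmetic sketch that reuses the values from the cell above; `num_devices = 1` is an assumption, since the notebook does not state how many GPUs were used.

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
max_steps = 5000
num_devices = 1  # assumption: a single GPU

# Examples that contribute to each optimizer step
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Upper bound on examples processed if training runs for all max_steps
examples_seen = effective_batch_size * max_steps

print(effective_batch_size)  # 32
print(examples_seen)         # 160000
```

Gradient accumulation trades speed for memory: each optimizer step sees 32 examples while only 8 are held on the device at a time.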




In [20]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dar_voice["train"],
    eval_dataset=dar_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

max_steps is given, it will override any value given in num_train_epochs


In [21]:
processor.save_pretrained(training_args.output_dir)

[]

In [None]:
trainer.train()

`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`...


| Step | Training Loss | Validation Loss | WER |
|------|---------------|-----------------|-----|
| 200 | 1.0693 | 1.099554 | 82.798125 |
| 400 | 0.707 | 1.015955 | 79.170283 |
| 600 | 0.6128 | 1.015628 | 72.973442 |
| 800 | 0.3929 | 1.010359 | 76.080542 |

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
Non-default generation parameters: {'max_length': 448, 'suppress_tokens': [], 'begin_suppress_tokens': [220, 50257]}


In [75]:
kwargs = {
    "dataset_tags": "team4/8dretna_daridja",
    "dataset": "8dretna_daridja",  # a 'pretty' name for the training dataset
    "dataset_args": "split: test",
    "language": "ar",
    "model_name": "whisperDAR",  # a 'pretty' name for our model
    "finetuned_from": "openai/whisper-small",
    "tasks": "automatic-speech-recognition",
    "tags": "hf-asr-leaderboard",
}

In [76]:
trainer.push_to_hub(**kwargs)



CommitInfo(commit_url='https://huggingface.co/hamzabennz/whisperDAR/commit/7051b76b7d2c5e00a7cce54c9d8079b728f9abbf', commit_message='End of training', commit_description='', oid='7051b76b7d2c5e00a7cce54c9d8079b728f9abbf', pr_url=None, pr_revision=None, pr_num=None)

In [77]:
from transformers import pipeline
import gradio as gr

pipe = pipeline(model="hamzabennz/whisperDAR")  # change to "your-username/the-name-you-picked"

def transcribe(audio):
    text = pipe(audio)["text"]
    return text

iface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),  # pass the recording to transcribe() as a file path
    outputs="text",
    title="Whisper Small DARIDJA",
    description="",
)

iface.launch(share=True)



Running on local URL:  http://127.0.0.1:7863
Running on public URL: https://4e1a02cc3db33dfa54.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




In [78]:
from transformers import pipeline

# Load the speech recognition pipeline
pipe = pipeline(model="hamzabennz/whisperDAR")  # change to "your-username/the-name-you-picked"

def transcribe_audio(audio_path):
    try:
        # Transcribe the audio using the pipeline
        text = pipe(audio_path)["text"]
        return text
    except Exception as e:
        return f"Error: {str(e)}"

# Provide the path to the audio file
audio_path = "validation.mp3"  # Replace this with the path to your audio file

# Transcribe the audio and print the result
transcription = transcribe_audio(audio_path)
print("Transcription:", transcription)



Transcription:  سيد أنا هدرت وحده دو زري دميه والله بالميكرو من هذا كانوا عنده دو كست سجل هب من بعد أنا روحت من مغنية هو تلغة زوات هو تلغة زوات أنا ماتلعتش المغنية بندوق وحد لي وي مواهولة ما تلعش ما عندي علشان تلعى المغنية تلعط وحد النهار ومع الاباليش باللي هادك الكست داعي سيد كان يدير دي كوبي ويبيع آه أوي ونمع الاباليش هنقول لي تمعروف في مغنية ونمع الاباليش أمتوا سعلاً دهركوا من البلدك شو سيقصي مولاكي بالله لعالي العدم طلعت المغنية في تاكسي كي ملعبها وللا لكي هاش لي أعرفني طلعت في تاكسي نزلت مشيتوا على القيسارية يبعو سوالة وكي ينواحدي بيعلي كست فيديو تك وندخل ووجي أوفون لي بيعلي كست جي أوفون ونو كي ينسلى واللي سورفات سبردينات وانتاعي ليه هاده وشوفي هادين قولك سي اووووو كتروفان دي ست نسمع للا فوانتاعي وين اقولي هاد انا لا لما تصورتش باللي انا وانا انت مش ها حتى جيت جبرت سيد اندول غاشي شوية ويدخلو وانا بصعما دخلت ليش من هذا؟ من هذا؟ من هذا؟ من هذا؟ من هذا؟ من هذا؟ من هذا؟ من هذا؟ من هذا؟ من هذا؟ من هذا؟ من هذا؟ من هذا؟ من هذا؟ وانا تخلط سعنة مكاشل يعرفني نس عندهم غير الأوديو وانا تخلطون

In [81]:
import webvtt

# Load the VTT file and extract the ground truth transcription
def load_ground_truth(vtt_file):
    captions = webvtt.read(vtt_file)
    ground_truth = " ".join([caption.text for caption in captions])
    return ground_truth

In [82]:
import jiwer

# Define the path to your VTT file
vtt_file = "validation.vtt"  # Replace this with the path to your VTT file

# Load the ground truth transcription from the VTT file
ground_truth_transcription = load_ground_truth(vtt_file)

# Get the predicted transcription
predicted_transcription = transcribe_audio(audio_path)

# Calculate the Word Error Rate (WER)
wer = jiwer.wer(ground_truth_transcription, predicted_transcription)

print("Predicted Transcription:", predicted_transcription)
print("Ground Truth Transcription:", ground_truth_transcription)
print("Word Error Rate (WER):", wer)

Predicted Transcription:  سيد أنا هدرت وحده دو زري دميه والله بالميكرو من هذا كانوا عنده دو كست سجل هب من بعد أنا روحت من مغنية هو تلغة زوات هو تلغة زوات أنا ماتلعتش المغنية بندوق وحد لي وي مواهولة ما تلعش ما عندي علشان تلعى المغنية تلعط وحد النهار ومع الاباليش باللي هادك الكست داعي سيد كان يدير دي كوبي ويبيع آه أوي ونمع الاباليش هنقول لي تمعروف في مغنية ونمع الاباليش أمتوا سعلاً دهركوا من البلدك شو سيقصي مولاكي بالله لعالي العدم طلعت المغنية في تاكسي كي ملعبها وللا لكي هاش لي أعرفني طلعت في تاكسي نزلت مشيتوا على القيسارية يبعو سوالة وكي ينواحدي بيعلي كست فيديو تك وندخل ووجي أوفون لي بيعلي كست جي أوفون ونو كي ينسلى واللي سورفات سبردينات وانتاعي ليه هاده وشوفي هادين قولك سي اووووو كتروفان دي ست نسمع للا فوانتاعي وين اقولي هاد انا لا لما تصورتش باللي انا وانا انت مش ها حتى جيت جبرت سيد اندول غاشي شوية ويدخلو وانا بصعما دخلت ليش من هذا؟ من هذا؟ من هذا؟ من هذا؟ من هذا؟ من هذا؟ من هذا؟ من هذا؟ من هذا؟ من هذا؟ من هذا؟ من هذا؟ من هذا؟ من هذا؟ وانا تخلط سعنة مكاشل يعرفني نس عندهم غير الأوديو و