# Fine-Tuning Whisper For Malayalam Language

## Whisper ASR
*   Whisper is a multilingual ASR model pre-trained on 680,000 hours of labelled audio data.
*   It is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model.
*   In a sequence-to-sequence model, the encoder transforms the audio inputs into a set of hidden state representations, extracting important features from the speech.
*   The decoder plays the role of a language model, processing the hidden state representations and generating the corresponding text transcriptions.









<figure>
<img src="https://raw.githubusercontent.com/sanchit-gandhi/notebooks/main/whisper_architecture.svg" alt="Whisper encoder-decoder architecture" style="width:100%">
<figcaption align = "center"><b>Figure 1:</b> Whisper model. The architecture
follows the standard Transformer-based encoder-decoder model. A
log-Mel spectrogram is input to the encoder. The last encoder
hidden states are input to the decoder via cross-attention mechanisms. The
decoder autoregressively predicts text tokens, jointly conditional on the
encoder hidden states and previously predicted tokens. Figure source:
<a href="https://openai.com/blog/whisper/">OpenAI Whisper Blog</a>.</figcaption>
</figure>
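
The autoregressive decoding described in the caption can be sketched as a simple loop. This is only an illustration of the idea, not Whisper's actual generation code: `greedy_decode`, `toy_next_token`, and the token ids are hypothetical names and values chosen for the example, and the toy function stands in for the real decoder network.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_token: Callable[[Sequence[float], List[int]], int],
    encoder_states: Sequence[float],
    start_id: int,
    end_id: int,
    max_new_tokens: int = 10,
) -> List[int]:
    """Generate token ids one at a time until end_id or the length limit."""
    tokens = [start_id]
    for _ in range(max_new_tokens):
        # Each step is conditioned on the encoder output and on every
        # token predicted so far.
        token = next_token(encoder_states, tokens)
        tokens.append(token)
        if token == end_id:
            break
    return tokens


def toy_next_token(encoder_states: Sequence[float], tokens: List[int]) -> int:
    """Hypothetical stand-in for the decoder: emits 1, 2, 3, then the end id 0."""
    return len(tokens) if len(tokens) <= 3 else 0


decoded = greedy_decode(toy_next_token, encoder_states=[0.1, 0.2], start_id=99, end_id=0)
print(decoded)
```

In the real model, `next_token` would correspond to a forward pass of the decoder with cross-attention over the encoder hidden states, followed by picking the most likely next token.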

In [1]:
!pip install "datasets>=2.6.1"    # quoted so the shell does not treat >= as a redirect
!pip install git+https://github.com/huggingface/transformers
!pip install librosa
!pip install "evaluate>=0.3.0"
!pip install jiwer
!pip install transformers[torch]  # additional requirements for training in colab
!pip install accelerate -U        # additional requirements for training in colab

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-j8aq9tgq
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-j8aq9tgq
  Resolved https://github.com/huggingface/transformers to commit 37fa1f654f17b68bbe30440c64e611f1a4d55bc7
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.41.0.dev0-py3-none-any.whl size=9017227 sha256=ed37f1a6095507cfac4ab62cf3626cf467213d6699bb635bd38b0f4db01d7598
  Stored in directory: /tmp/pip-ephem-wheel-cache-tjhcikzu/wheels/c0/14/d6/6c9a5582d2ac191ec0a483be151a4495fe1eb2a6706ca49f1b
Successfully built transformers

## Load Dataset

In [2]:
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

common_voice["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "ml", split="train+validation")
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "ml", split="test")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.



In [3]:
print(common_voice)

DatasetDict({
    train: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 430
    })
    test: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 112
    })
})


In [4]:
common_voice = common_voice.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])

print(common_voice)

DatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 430
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 112
    })
})


## Prepare Feature Extractor, Tokenizer and Data

The ASR pipeline can be decomposed into three stages:

1. A feature extractor, which pre-processes the raw audio inputs
2. The model, which performs the sequence-to-sequence mapping
3. A tokenizer, which post-processes the model outputs to text format

### Load WhisperFeatureExtractor

The Whisper feature extractor performs two operations:
1. Pads / truncates the audio inputs to 30s
2. Converts the audio inputs to _log-Mel spectrogram_ input features, a visual representation of the audio and the form of the input expected by the Whisper model

We'll load the feature extractor from the pre-trained checkpoint with the default values:

In [5]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

In [6]:
print(feature_extractor)

WhisperFeatureExtractor {
  "chunk_length": 30,
  "feature_extractor_type": "WhisperFeatureExtractor",
  "feature_size": 80,
  "hop_length": 160,
  "n_fft": 400,
  "n_samples": 480000,
  "nb_max_frames": 3000,
  "padding_side": "right",
  "padding_value": 0.0,
  "processor_class": "WhisperProcessor",
  "return_attention_mask": false,
  "sampling_rate": 16000
}
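
The values in the printed configuration are related by simple arithmetic, and the pad / truncate step can be sketched in plain Python. This is a minimal illustration under the assumption that padding is applied on the right with `padding_value`, as the configuration suggests. `pad_or_truncate` is a hypothetical helper, not the library's implementation.

```python
from typing import List

SAMPLING_RATE = 16_000  # samples per second ("sampling_rate")
CHUNK_LENGTH = 30       # seconds ("chunk_length")
HOP_LENGTH = 160        # samples between spectrogram frames ("hop_length")

N_SAMPLES = CHUNK_LENGTH * SAMPLING_RATE  # matches "n_samples": 480000
NB_MAX_FRAMES = N_SAMPLES // HOP_LENGTH   # matches "nb_max_frames": 3000


def pad_or_truncate(
    samples: List[float],
    length: int = N_SAMPLES,
    padding_value: float = 0.0,
) -> List[float]:
    """Right-pad with padding_value, or cut, so the result has exactly `length` samples."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [padding_value] * (length - len(samples))


short_clip = [0.5] * (5 * SAMPLING_RATE)   # a 5 s clip gets padded
long_clip = [0.5] * (40 * SAMPLING_RATE)   # a 40 s clip gets truncated
print(N_SAMPLES, NB_MAX_FRAMES)
print(len(pad_or_truncate(short_clip)), len(pad_or_truncate(long_clip)))
```

Every clip therefore reaches the encoder as an array of the same shape: 80 Mel bins (`feature_size`) by 3000 frames.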



### Load WhisperTokenizer

* The Whisper tokenizer is pre-trained on the transcriptions for the **96 pre-training languages**
* The Whisper model outputs a sequence of _token ids_. The tokenizer maps each of these token ids to their corresponding text string.
* For Malayalam, we can load the pre-trained tokenizer and use it for fine-tuning without any further modifications.
* We simply have to specify the target language and the task. These arguments tell the tokenizer to prefix the language and task tokens to the start of encoded label sequences:

In [7]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="Malayalam", task="transcribe")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Verify the Tokenizer

We can verify that the tokenizer correctly encodes Malayalam characters by encoding and decoding the first sample of the Common Voice dataset.

In [11]:
input_str = common_voice["train"][0]["sentence"]
labels = tokenizer(input_str).input_ids            # tokenize the input string to create label ids
decoded_with_special = tokenizer.decode(labels, skip_special_tokens=False)   # decode the label ids, keeping special tokens
decoded_str = tokenizer.decode(labels, skip_special_tokens=True)             # decode the label ids, dropping special tokens

In [13]:
print(f"Input:                 {input_str}")
print(f"Decoded with special tokens:    {decoded_with_special}")
print(f"Decoded w/out special tokens: {decoded_str}")
print(f"Are equal:             {input_str == decoded_str}")

Input:                 എന്തുകൊണ്ട് യുവാക്കൾ കൂടുതൽ രാഷ്ട്രീയമായി ചിന്തിക്കണം, എന്തുകൊണ്ട് അവർ സംഘടിതരാകണം എന്നതിന്റെ ഉദാത്തമായ ഉദാഹരണമാകുന്നു കേരളം.
Decoded with special tokens:    <|startoftranscript|><|ml|><|transcribe|><|notimestamps|>എന്തുകൊണ്ട് യുവാക്കൾ കൂടുതൽ രാഷ്ട്രീയമായി ചിന്തിക്കണം, എന്തുകൊണ്ട് അവർ സംഘടിതരാകണം എന്നതിന്റെ ഉദാത്തമായ ഉദാഹരണമാകുന്നു കേരളം.<|endoftext|>
Decoded w/out special tokens: എന്തുകൊണ്ട് യുവാക്കൾ കൂടുതൽ രാഷ്ട്രീയമായി ചിന്തിക്കണം, എന്തുകൊണ്ട് അവർ സംഘടിതരാകണം എന്നതിന്റെ ഉദാത്തമായ ഉദാഹരണമാകുന്നു കേരളം.
Are equal:             True
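
The structure of the decoded label sequence shown in the output can be mimicked with plain string handling. This is a toy sketch that works on text rather than token ids. `format_label_text` and `strip_special_tokens` are hypothetical helpers written for illustration and are not part of the tokenizer's API.

```python
import re


def format_label_text(text: str, language_code: str = "ml", task: str = "transcribe") -> str:
    """Wrap a transcription in the special-token layout seen in the decoded output."""
    prefix = f"<|startoftranscript|><|{language_code}|><|{task}|><|notimestamps|>"
    return f"{prefix}{text}<|endoftext|>"


def strip_special_tokens(label_text: str) -> str:
    """Remove every <|...|> marker, leaving only the transcription text."""
    return re.sub(r"<\|[^|]*\|>", "", label_text)


sentence = "കേരളം"
with_special = format_label_text(sentence)
print(with_special)
print(strip_special_tokens(with_special) == sentence)
```

The language and task markers are what the `language="Malayalam"` and `task="transcribe"` arguments control when the tokenizer is loaded.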


### Combine To Create A WhisperProcessor

To simplify using the feature extractor and tokenizer, we can _wrap_
both into a single `WhisperProcessor` class. This processor object
wraps a `WhisperFeatureExtractor` and a `WhisperTokenizer`,
and can be used on the audio inputs and model predictions as required.
In doing so, we only need to keep track of two objects during training:
the `processor` and the `model`:

In [9]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Malayalam", task="transcribe")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Prepare Data

In [11]:
print(common_voice["train"][0])

{'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/7a5483e18eadf711d9e766a084a54317359e0c06c24e82760bfe23139091a2f4/ml_train_0/common_voice_ml_28913601.mp3', 'array': array([-2.66453526e-14, -1.17239551e-13, -7.10542736e-14, ...,
       -1.06581410e-14, -1.77635684e-14,  7.10542736e-14]), 'sampling_rate': 48000}, 'sentence': 'എന്തുകൊണ്ട് യുവാക്കൾ കൂടുതൽ രാഷ്ട്രീയമായി ചിന്തിക്കണം, എന്തുകൊണ്ട് അവർ സംഘടിതരാകണം എന്നതിന്റെ ഉദാത്തമായ ഉദാഹരണമാകുന്നു കേരളം.'}


Since our input audio is sampled at 48kHz, we need to _downsample_ it to
16kHz before passing it to the Whisper feature extractor, 16kHz being the sampling rate expected by the Whisper model.

We'll set the audio inputs to the correct sampling rate using the
[`cast_column`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=cast_column#datasets.DatasetDict.cast_column)
method from 🤗 Datasets. This operation does not change the audio in-place,
but rather signals to `datasets` to resample audio samples _on the fly_ the
first time that they are loaded:

In [12]:
from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

In [13]:
print(common_voice["train"][0])

{'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/7a5483e18eadf711d9e766a084a54317359e0c06c24e82760bfe23139091a2f4/ml_train_0/common_voice_ml_28913601.mp3', 'array': array([ 0.00000000e+00, -1.30967237e-10,  0.00000000e+00, ...,
        9.82254278e-11,  4.00177669e-11,  9.09494702e-11]), 'sampling_rate': 16000}, 'sentence': 'എന്തുകൊണ്ട് യുവാക്കൾ കൂടുതൽ രാഷ്ട്രീയമായി ചിന്തിക്കണം, എന്തുകൊണ്ട് അവർ സംഘടിതരാകണം എന്നതിന്റെ ഉദാത്തമായ ഉദാഹരണമാകുന്നു കേരളം.'}
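
The effect of the cast on array length can be checked with a little arithmetic. This is a rough sketch: `resampled_length` is a hypothetical helper, and a real resampler also applies a low-pass filter and may round the output length slightly differently.

```python
def resampled_length(num_samples: int, orig_sr: int, target_sr: int) -> int:
    """Approximate number of samples after resampling; the duration is unchanged."""
    return round(num_samples * target_sr / orig_sr)


def duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Clip duration implied by a sample count and a sampling rate."""
    return num_samples / sampling_rate


n_48k = 5 * 48_000                                  # a 5 s clip at 48 kHz
n_16k = resampled_length(n_48k, 48_000, 16_000)     # the same clip at 16 kHz
print(n_48k, n_16k)
print(duration_seconds(n_48k, 48_000), duration_seconds(n_16k, 16_000))
```

Going from 48kHz to 16kHz keeps one third as many samples, which is why the `array` values and `sampling_rate` in the second printout differ from the first while the clip itself is the same.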


Now we can write a function to prepare our data for the model:
1. We load and resample the audio data by accessing `batch["audio"]`. As explained above, 🤗 Datasets performs any necessary resampling operations on the fly.
2. We use the feature extractor to compute the log-Mel spectrogram input features from our 1-dimensional audio array.
3. We encode the transcriptions to label ids through the use of the tokenizer.

In [14]:
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

In [15]:
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"])

## Training and Evaluation

### Define a Data Collator
*  The data collator takes our pre-processed data and prepares PyTorch tensors ready for the model.

In [20]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was appended in the previous tokenization step,
        # cut it here, as it is appended again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

Let's initialise the data collator we've just defined:

In [21]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

In [22]:
print(data_collator)

DataCollatorSpeechSeq2SeqWithPadding(processor=WhisperProcessor:
- feature_extractor: WhisperFeatureExtractor {
  "chunk_length": 30,
  "feature_extractor_type": "WhisperFeatureExtractor",
  "feature_size": 80,
  "hop_length": 160,
  "n_fft": 400,
  "n_samples": 480000,
  "nb_max_frames": 3000,
  "padding_side": "right",
  "padding_value": 0.0,
  "processor_class": "WhisperProcessor",
  "return_attention_mask": false,
  "sampling_rate": 16000
}

- tokenizer: WhisperTokenizer(name_or_path='openai/whisper-small', vocab_size=50258, model_max_length=1024, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|endoftext|>', '<|startoftranscript|>', '<|en|>', '<|zh|>', '<|de|>', '<|es|>', '<|ru|>', '<|ko|>', '<|fr|>', '<|ja|>', '<|pt|>', '<|tr|>', '<|pl|>', '<|ca|>', '<|nl|>', '<|ar|>', '<|sv|>', '<|it|>', '<|id|>', '<
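
The label handling inside the collator can be illustrated without PyTorch. The sketch below is a simplification under stated assumptions: it writes `-100` directly into the padded positions instead of padding with the pad token and masking afterwards, and `pad_and_mask_labels`, `start_id`, and the example ids are hypothetical. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so those positions do not contribute to the loss.

```python
from typing import List

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def pad_and_mask_labels(label_ids: List[List[int]], start_id: int) -> List[List[int]]:
    """Pad label sequences to a common length and drop a shared leading start token."""
    max_len = max(len(ids) for ids in label_ids)
    # Right-pad every sequence to the longest one, using IGNORE_INDEX as filler.
    padded = [ids + [IGNORE_INDEX] * (max_len - len(ids)) for ids in label_ids]
    # Remove the first token only if every sequence in the batch starts with it.
    if all(row[0] == start_id for row in padded):
        padded = [row[1:] for row in padded]
    return padded


# Made-up ids: 1 marks the start of a sequence, 2 marks its end.
batch_labels = [[1, 7, 8, 9, 2], [1, 7, 2]]
masked = pad_and_mask_labels(batch_labels, start_id=1)
print(masked)
```

The audio side needs no such treatment, because the feature extractor has already padded or truncated every example to the same 30 s length.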

### Evaluation Metrics

We'll use the word error rate (WER) metric.

In [None]:
import evaluate

metric = evaluate.load("wer")

In [None]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
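
For intuition, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference. The sketch below computes it for a single sentence pair in plain Python. It is an illustration only, not the implementation used by `evaluate`, and `word_edit_distance` and `wer_percent` are hypothetical names.

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word lists, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer_percent(reference: str, hypothesis: str) -> float:
    """Word error rate in percent for one reference / hypothesis pair."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * word_edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(wer_percent("the cat sat", "the cat sat"))
print(wer_percent("the cat sat", "the dog sat"))
print(wer_percent("hello", "hello there big world"))
```

Because insertions count as errors, a hypothesis that is much longer than the reference can push WER above 100.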

### Load a Pre-Trained Checkpoint

Now let's load the pre-trained Whisper `small` checkpoint. Again, this
is trivial through use of 🤗 Transformers!

In [None]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

In [None]:
# model.config.forced_decoder_ids = None
# model.config.suppress_tokens = []

### Define the Training Configuration

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-ml",
    per_device_train_batch_size=16,         # Batch size per GPU/CPU for training
    gradient_accumulation_steps=1,         # Number of gradient accumulation steps
    learning_rate=1e-5,
    warmup_steps=5,                  # Number of warmup steps for learning rate scheduler
    max_steps=30,                    # Total number of training steps
    gradient_checkpointing=True,     # Reduce memory during backpropagation
    fp16=True,                       # Enable mixed precision training (16-bit precision)
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,           # Maximum length for generated sequences
    save_steps=10,                # Save model checkpoints at specified steps
    eval_steps=10,                # Evaluate at specified steps during training
    logging_steps=2,
    report_to=["tensorboard"],             # Report training metrics to TensorBoard
    load_best_model_at_end=True,       # Load the best model based on evaluation metric at the end of training
    metric_for_best_model="wer",        # Metric to select the best model (Word Error Rate in this case)
    greater_is_better=False,
)
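
A quick back-of-the-envelope check shows how much training this configuration amounts to. The sketch assumes a single device and the default linear learning-rate schedule with warmup. `NUM_DEVICES` and `linear_schedule_lr` are assumptions introduced for illustration, while the other values are copied from the arguments above and from the dataset printout.

```python
TRAIN_EXAMPLES = 430   # rows in the training split
PER_DEVICE_BATCH = 16  # per_device_train_batch_size
GRAD_ACCUM = 1         # gradient_accumulation_steps
NUM_DEVICES = 1        # assumption: a single GPU
MAX_STEPS = 30         # max_steps
WARMUP_STEPS = 5       # warmup_steps
BASE_LR = 1e-5         # learning_rate

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
examples_seen = effective_batch * MAX_STEPS
approx_epochs = examples_seen / TRAIN_EXAMPLES


def linear_schedule_lr(step: int) -> float:
    """Linear warmup to BASE_LR, then linear decay to zero at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    return BASE_LR * max(0.0, (MAX_STEPS - step) / (MAX_STEPS - WARMUP_STEPS))


print(effective_batch, examples_seen, round(approx_epochs, 2))
for step in (0, 5, 10, 20, 30):
    print(step, linear_schedule_lr(step))
```

With these settings the model sees roughly one pass over the 430 training examples, so this run is better read as a smoke test of the pipeline than as a full fine-tuning run.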


In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

Since the processor is not trainable, it won't change over the course of training, so it can be saved once before training starts (uncomment the line below to do so):

In [None]:
# processor.save_pretrained(training_args.output_dir)

### Training

In [None]:
trainer.train()

`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`...


| Step | Training Loss | Validation Loss | WER |
|------|---------------|-----------------|-----|
| 10 | 1.8017 | 1.795273 | 163.678161 |


KeyboardInterrupt: 

## Evaluation

In [None]:
import os

import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Directory containing audio files
audio_dir = "/Testing/"    # replace with your test directory

# Directory to store text files
output_dir = "check"

# Get list of audio files in the directory
audio_files = os.listdir(audio_dir)

# Create the output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

# Load the pretrained model and processor
model_path = "/pretrained-whisper-medium-native-v3"
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path).to("cuda")

# Whisper's feature extractor expects 16 kHz audio
target_sample_rate = 16000

# Process each audio file and save transcriptions in text files
for audio_file in audio_files:
    # Load audio file and resample to the target sample rate
    audio_input, _ = librosa.load(os.path.join(audio_dir, audio_file), sr=target_sample_rate, mono=True)

    # Preprocess the audio using the processor
    inputs = processor(audio_input, return_tensors="pt", sampling_rate=target_sample_rate).input_features

    # Perform inference without tracking gradients
    with torch.no_grad():
        predicted_ids = model.generate(inputs.to("cuda"))[0]

    # Decode the token ids to text, dropping special tokens such as <|endoftext|>
    transcription = processor.decode(predicted_ids, skip_special_tokens=True)

    # Create a text file with the same name as the audio file in the output directory
    text_file_name = os.path.splitext(audio_file)[0] + ".txt"
    with open(os.path.join(output_dir, text_file_name), "w", encoding="utf-8") as text_file:
        text_file.write(transcription)
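
Since `os.listdir` returns every entry in the directory, stray files such as notes or hidden system files would also be passed to `librosa.load`. One way to guard against that is to filter by extension first. This is a small sketch: `list_audio_files` and the extension list are assumptions for illustration and should be adjusted to the formats actually present in your test directory.

```python
import os
from typing import Iterable, List

# Assumed set of audio formats; extend as needed.
AUDIO_EXTENSIONS = (".wav", ".mp3", ".flac")


def list_audio_files(filenames: Iterable[str]) -> List[str]:
    """Keep only names with a known audio extension, in sorted order."""
    return sorted(
        name for name in filenames
        if os.path.splitext(name)[1].lower() in AUDIO_EXTENSIONS
    )


def transcript_name(audio_file: str) -> str:
    """Name of the text file that stores the transcription of audio_file."""
    return os.path.splitext(audio_file)[0] + ".txt"


entries = ["b.mp3", "notes.txt", "a.WAV", ".DS_Store", "c.flac"]
audio_only = list_audio_files(entries)
print(audio_only)
print([transcript_name(name) for name in audio_only])
```

In the loop above, this would amount to replacing `os.listdir(audio_dir)` with `list_audio_files(os.listdir(audio_dir))`.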