<a href="https://colab.research.google.com/github/julius-kanani-ops/kisii-asr/blob/main/notebooks/01_Kisii_ASR_Training_ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Step 1: Install Required Libraries**

**What these libraries do:**

1. transformers: The core Hugging Face library. It lets us download, use, and fine-tune pre-trained models such as Whisper.

2. datasets: Loads and processes datasets, including audio datasets.

3. soundfile & librosa: Read, resample, and manipulate audio files.

4. accelerate: Helps transformers run our training code efficiently on the GPU.

5. evaluate & jiwer: Compute the Word Error Rate (WER), the fraction of words the model gets wrong, which we use later to score the model.

6. torchaudio & torchcodec: PyTorch audio I/O and decoding packages, also installed in the cell below.
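To make the Word Error Rate concrete: it is the word-level edit distance between the prediction and the reference (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below is a minimal pure-Python illustration; `word_error_rate` is a hypothetical helper written for this explanation, not part of `jiwer` or `evaluate`, and the training code in this notebook uses `evaluate.load("wer")` instead.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    pred_words = prediction.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edits needed to turn the reference words seen so far
    # into the first j predicted words
    previous = list(range(len(pred_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, pred_word in enumerate(pred_words, start=1):
            substitution = previous[j - 1] + (ref_word != pred_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


print(word_error_rate("erio nyasae agachiika", "erio nyasae agachiika"))  # 0.0
print(word_error_rate("erio", "erio erio erio"))  # 2.0
```

The second call shows that extra inserted words can push WER above 1.0, which is why a score above 100% is possible when the metric is reported as a percentage.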

In [None]:
# Cell 1: Environment Setup

# Install FFmpeg, the system-level audio decoding tool
!apt-get -qq install --yes ffmpeg

# Install the necessary Python libraries
!pip install -q transformers datasets soundfile librosa accelerate evaluate jiwer torchaudio torchcodec

**Step 2: Clone the kisii-asr GitHub Repository**

In [None]:
# Clone our project repository to access our dataset
!git clone https://github.com/julius-kanani-ops/kisii-asr.git

fatal: destination path 'kisii-asr' already exists and is not an empty directory.


**Step 3: Loading and Preparing the Dataset**

**Step 3.1: Import Libraries and Define Paths**

In [None]:
# Import the main library for loading datasets
from datasets import load_dataset, DatasetDict

# Define the path to our data within the Colab environment
# The repository is now a folder in our workspace
data_folder = "kisii-asr/data/"

**Step 3.2: Load the Dataset**

In [None]:
# Step 3.2: Build the dataset from metadata.csv

from datasets import Dataset
import pandas as pd
import os

# Define the path to our data within the Colab environment
data_folder = "kisii-asr/data/"

# metadata.csv has no header row; each line is "<audio file name>|<transcription>"
metadata_df = pd.read_csv(os.path.join(data_folder, "metadata.csv"), sep='|', header=None, names=['file_path', 'text'])

# Prefix each file name with the audio folder to get a path relative to the working directory
metadata_df['file_path'] = metadata_df['file_path'].apply(lambda x: os.path.join(data_folder, 'audio', x))

# Create the dataset from our pandas DataFrame
dataset = Dataset.from_pandas(metadata_df)

# Create the train/test split
split_dataset = dataset.train_test_split(test_size=0.2, seed=42)

print("--- Dataset with file paths loaded and split ---")
print(split_dataset)

--- Dataset with file paths loaded and split ---
DatasetDict({
    train: Dataset({
        features: ['file_path', 'text'],
        num_rows: 4
    })
    test: Dataset({
        features: ['file_path', 'text'],
        num_rows: 1
    })
})
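Because the dataset stores only file paths at this stage, a misspelled file name in `metadata.csv` would not surface until the audio is loaded in Step 4. The sketch below is one way to check for that up front, assuming the same pipe-delimited layout as above; `find_missing_audio` is a hypothetical helper, not something the repository provides.

```python
import csv
import os


def find_missing_audio(metadata_path: str, audio_dir: str) -> list[str]:
    """Return the file names listed in the metadata file that are absent from audio_dir."""
    missing = []
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        # Rows are assumed to look like: sentence1.wav|transcription text
        reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in reader:
            if not row:
                continue  # skip blank lines
            file_name = row[0].strip()
            if not os.path.isfile(os.path.join(audio_dir, file_name)):
                missing.append(file_name)
    return missing
```

In this notebook it could be called as `find_missing_audio("kisii-asr/data/metadata.csv", "kisii-asr/data/audio")`; an empty list means every listed recording was found.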


**Step 3.3: Inspect Your Loaded Data**

In [None]:
# Let's look at the first example from the training set
print("\n--- Example from Training Set ---")
print(split_dataset["train"][0])


--- Example from Training Set ---
{'file_path': 'kisii-asr/data/audio/sentence3.wav', 'text': 'Erio agwo rikaba mogoroba, naende bokaba maambia, rituko rie ritang’ani. Erio Nyasae agachiika, “Tiga oboiko bobe egati‐gati y’amaache, erio bwatanane amaache korwa ase amaache ande.”'}
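The transcriptions here are long sentences, so clip length is worth checking: Whisper works on 30-second windows, and its feature extractor pads shorter clips and, by default, truncates longer ones, which would leave the labels describing audio the model never sees. The sketch below reads durations from the WAV headers using only the standard library. It assumes the recordings are uncompressed PCM WAV files; `wav_duration_seconds` and `clips_over_limit` are hypothetical helper names.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of an uncompressed PCM WAV file, read from its header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def clips_over_limit(paths, limit_seconds: float = 30.0):
    """Return (path, duration) pairs for clips longer than limit_seconds."""
    durations = ((path, wav_duration_seconds(path)) for path in paths)
    return [(path, seconds) for path, seconds in durations if seconds > limit_seconds]
```

For example, `clips_over_limit(split_dataset["train"]["file_path"])` would list any training clips that exceed the 30-second window.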


**Step 4: Preparing the Data for the Model**

**Step 4.1: Load the Whisper Processor**

In [None]:
from transformers import WhisperProcessor
import librosa

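# --- 4.1: Load Processor ---
model_name = "openai/whisper-small"
# Whisper's language list does not include Kisii (Ekegusii); Swahili, a Bantu language
# that Whisper supports, is used here as a stand-in.
processor = WhisperProcessor.from_pretrained(model_name, language="Swahili", task="transcribe")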

**Step 4.2: Create the Data Preparation Function**

In [None]:
def prepare_dataset(batch):
    # Manually load the audio file using librosa
    audio_array, sampling_rate = librosa.load(batch["file_path"], sr=16000, mono=True)

    # Process the audio array to get the input_features
    batch["input_features"] = processor.feature_extractor(audio_array, sampling_rate=sampling_rate).input_features[0]

    # Process the text to get the labels
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids

    return batch
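Each `input_features` entry produced by this function is a log-Mel spectrogram of fixed size, regardless of how long the original clip was. The arithmetic below shows where that size comes from. The constants are the commonly documented defaults for the `openai/whisper-small` feature extractor, written out as literals here as an assumption rather than read from `processor`.

```python
SAMPLING_RATE = 16_000   # Hz, matches sr=16000 in prepare_dataset
CHUNK_SECONDS = 30       # every clip is padded or truncated to this length
HOP_LENGTH = 160         # samples between successive spectrogram frames
N_MELS = 80              # Mel frequency bins for whisper-small

samples_per_chunk = SAMPLING_RATE * CHUNK_SECONDS
frames_per_chunk = samples_per_chunk // HOP_LENGTH
feature_shape = (N_MELS, frames_per_chunk)

print(samples_per_chunk)  # 480000
print(feature_shape)      # (80, 3000)
```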

**Step 4.3: Apply the Function to Our Entire Dataset**

In [None]:
# We remove the original columns to keep the dataset clean
processed_dataset = split_dataset.map(prepare_dataset, remove_columns=split_dataset["train"].column_names, num_proc=1)

print("\n--- Dataset after processing ---")
print(processed_dataset)

print("\n--- Example of processed data ---")
print(processed_dataset["train"][0])

Map:   0%|          | 0/4 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]


--- Dataset after processing ---
DatasetDict({
    train: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 4
    })
    test: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 1
    })
})

--- Example of processed data ---
{'input_features': [[-0.45548343658447266, -0.45548343658447266, -0.45548343658447266, ... (output truncated)

**Step 5: Setting Up the Training Pipeline**

In [None]:
# Import the necessary components for training
from transformers import WhisperForConditionalGeneration
from transformers import Seq2SeqTrainingArguments
from transformers import Seq2SeqTrainer
import torch
import evaluate
from dataclasses import dataclass
from typing import Any, Dict, List, Union

# --- Custom data collator ---
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Split inputs and labels since they have to be of different lengths and need different padding methods.

        # First, pad the audio inputs (input_features)
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Second, pad the text labels
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # Replace padding positions with -100 so the loss function ignores them
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # If every sequence starts with the BOS token, drop it here,
        # because the model prepends it again when it builds the decoder inputs
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

# --- Initialize the custom data collator ---
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)


# --- Define the evaluation metric (Word Error Rate) ---
metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    # Restore the pad token where the collator wrote -100, so the labels can be decoded
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
    # Decode token IDs back to text, dropping special tokens
    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)
    # The metric returns WER as a fraction; report it as a percentage
    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}

# --- Load the pre-trained model ---
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = WhisperForConditionalGeneration.from_pretrained(model_name).to(device)

# --- Define the training arguments ---
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-kisii",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=5,
    num_train_epochs=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    predict_with_generate=True,
    logging_steps=5,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
    report_to="none",  # do not log to Weights & Biases or other integrations
)

# --- Initialize the Trainer (no tokenizer argument is passed) ---
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=processed_dataset["train"],
    eval_dataset=processed_dataset["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print("--- Trainer successfully initialized with CUSTOM data collator! ---")
print("Ready to start training.")

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


--- Trainer successfully initialized with CUSTOM data collator! ---
Ready to start training.
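The label handling inside the collator can be illustrated without tensors. The sketch below pads label sequences to a common length and marks the padding with -100, the value that PyTorch's cross-entropy loss ignores by default. It is a simplified stand-in for `tokenizer.pad` followed by `masked_fill`, and the token IDs are made up for the example.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_and_mask(label_sequences, pad_token_id):
    """Right-pad label sequences to equal length, marking padding with IGNORE_INDEX."""
    max_len = max(len(seq) for seq in label_sequences)
    padded, attention = [], []
    for seq in label_sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_token_id] * n_pad)
        attention.append([1] * len(seq) + [0] * n_pad)
    # Same effect as labels.masked_fill(attention_mask.ne(1), -100) in the collator
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(row, mask)]
        for row, mask in zip(padded, attention)
    ]


print(pad_and_mask([[5, 6, 7], [8]], pad_token_id=0))
# [[5, 6, 7], [8, -100, -100]]
```

Masking by attention mask rather than by comparing against the pad token ID matters for Whisper, where the pad token is the same as the end-of-sequence token: a value comparison would also hide the genuine end-of-sequence label.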


**Step 6: Train the Model!**

In [None]:
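# We don't need the Weights & Biases logger for this tutorial, so we disable it.
# Note: WANDB_DISABLED is deprecated (see the warning printed under Step 5) and should be set
# before the training arguments and trainer are created; run this cell before Step 5,
# or pass report_to="none" to Seq2SeqTrainingArguments there.
import os
os.environ["WANDB_DISABLED"] = "true"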

In [None]:
# Let's start the training!
trainer.train()

You're using a WhisperTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Wer
1,No log,3.366241,246.666667
2,No log,2.932081,110.0
3,3.133500,2.510672,116.666667
4,3.133500,2.206677,116.666667
5,1.340300,2.032724,120.0
6,1.340300,1.95825,100.0
7,1.340300,1.907223,100.0
8,0.646300,1.876098,100.0
9,0.646300,1.855771,100.0
10,0.431400,1.846762,100.0


Using custom `forced_decoder_ids` from the (generation) config. This is deprecated in favor of the `task` and `language` flags/config options.
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`. See https://github.com/huggingface/transformers/pull/28687 for more details.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
There were missing keys in the checkpoint model loaded: ['proj_out.weight'].


TrainOutput(global_step=20, training_loss=1.3878890514373778, metrics={'train_runtime': 1167.5452, 'train_samples_per_second': 0.034, 'train_steps_per_second': 0.017, 'total_flos': 1.15434160128e+16, 'train_loss': 1.3878890514373778, 'epoch': 10.0})
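The `global_step=20` in this output follows from the dataset size and the training arguments: 4 training examples at a batch size of 2 give 2 optimizer steps per epoch, and 10 epochs give 20 steps. The sketch below reproduces that arithmetic. It approximates how the trainer counts steps and is not its actual implementation; `total_optimizer_steps` is a hypothetical helper.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                          num_epochs, num_devices=1):
    """Approximate number of optimizer updates over a full training run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # Several batches are accumulated into one update when grad_accum_steps > 1
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * num_epochs


print(total_optimizer_steps(4, 2, 1, 10))  # 20
```

This is also why the final checkpoint folder is named `checkpoint-20`: checkpoint folders are numbered by global step, not by epoch.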

In [None]:
# --- Save the Final Model and Processor ---

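# With save_strategy="epoch", the trainer wrote a checkpoint folder after every epoch.
# checkpoint-20 is the final one (10 epochs x 2 steps); it is not necessarily the one with the best WER.
# We save the processor configuration files to that same directory so the folder can be loaded on its own.
model_path = "./whisper-small-kisii/checkpoint-20"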
processor.save_pretrained(model_path)

print(f"Processor files saved to {model_path}")

Processor files saved to ./whisper-small-kisii/checkpoint-20


**Step 7: Testing the Custom Model**

In [None]:
# Import the pipeline tool from transformers
from transformers import pipeline
import librosa

# --- Step 1: Load the Fine-Tuned Model ---
# The trainer saved its checkpoints in sub-folders named 'checkpoint-X', where X is the global training step.
# WER did not improve steadily during training, so we simply use the last checkpoint.
model_path = "./whisper-small-kisii/checkpoint-20" # 10 epochs x 2 steps per epoch = step 20

# Create a transcription pipeline, pointing it to our fine-tuned model
transcriber = pipeline("automatic-speech-recognition", model=model_path, device=device)

print("--- Fine-tuned model loaded successfully! ---")

# --- Step 2: Prepare a Test Audio File ---
# Let's use the single file from our test set for this demonstration.
test_sample = split_dataset["test"][0]
audio_path = test_sample["file_path"]
reference_transcription = test_sample["text"]

# Load the audio with librosa, resampled to 16 kHz mono.
# When given a raw NumPy array, the pipeline assumes it is already at the model's sampling rate.
speech_array, sampling_rate = librosa.load(audio_path, sr=16000, mono=True)

# --- Step 3: Transcribe! ---
print("\nTranscribing the audio file...")
prediction = transcriber(speech_array)

print("\n--- RESULTS ---")
print(f"Reference: {reference_transcription}")
print(f"Prediction: {prediction['text']}")

Device set to use cuda:0
`return_token_timestamps` is deprecated for WhisperFeatureExtractor and will be removed in Transformers v5. Use `return_attention_mask` instead, as the number of frames can be inferred from it.


--- Fine-tuned model loaded successfully! ---

Transcribing the audio file...


`generation_config` default values have been modified to match model-specific defaults: {'suppress_tokens': [1, 2, 7, 8, 9, 10, 14, 25, 26, 27, 28, 29, 31, 58, 59, 60, 61, 62, 63, 90, 91, 92, 93, 359, 503, 522, 542, 873, 893, 902, 918, 922, 931, 1350, 1853, 1982, 2460, 2627, 3246, 3253, 3268, 3536, 3846, 3961, 4183, 4667, 6585, 6647, 7273, 9061, 9383, 10428, 10929, 11938, 12033, 12331, 12562, 13793, 14157, 14635, 15265, 15618, 16553, 16604, 18362, 18956, 20075, 21675, 22520, 26130, 26161, 26435, 28279, 29464, 31650, 32302, 32470, 36865, 42863, 47425, 49870, 50254, 50258, 50360, 50361, 50362], 'begin_suppress_tokens': [220, 50257]}. If this is not desired, please set these values explicitly.
A custom logits processor of type <class 'transformers.generation.logits_process.SuppressTokensLogitsProcessor'> has been passed to `.generate()`, but it was also created in `.generate()`, given its parameterization. The custom <class 'transformers.generation.logits_process.SuppressTokensLogitsProce


--- RESULTS ---
Reference: Naende Nyasae agachiika, “Tiga amaache ayare inse y’erioba asangererekane aase aamo, egere ense enyomo erorekane;” ayio akaba boigo. Akaroka aase aria aomo ense, na amaache aria asangererekanete amo akayaroka chinyancha.
Prediction:  කරනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙනෙ
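The prediction above is a repeated sequence in a different script from the reference, and the training log warns that a multilingual Whisper model falls back to language detection when no language is set for generation. One thing that may help is to pin the language and task on the model's generation config before training and inference. The sketch below assumes the Swahili setting used for the processor earlier; `pin_generation_language` is a hypothetical helper name. With only five recordings, this alone should not be expected to produce usable Kisii transcriptions.

```python
def pin_generation_language(model, language="swahili", task="transcribe"):
    """Fix the decoding language and task on a Whisper model's generation config."""
    model.generation_config.language = language
    model.generation_config.task = task
    # Clear the legacy setting that the deprecation warning in the training log refers to
    model.generation_config.forced_decoder_ids = None
    return model
```

It would be called as `pin_generation_language(model)` right after the model is loaded with `from_pretrained` and before the trainer is created.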
