
# Whisper Fine-tuning for Indian Names Recognition

This notebook demonstrates the complete workflow for fine-tuning the OpenAI Whisper speech recognition model specifically for improved recognition of Indian names. Speech recognition models often struggle with proper nouns, especially names from different cultural backgrounds than their training data. This fine-tuning process helps address this limitation.

## 1. Environment Setup
We begin by installing the necessary libraries:
- 'transformers' for accessing the Whisper model and training infrastructure
- 'datasets' for data handling and processing
- 'evaluate' for performance metrics
- 'jiwer' for Word Error Rate calculation
- 'torch' for PyTorch deep learning framework

## 2. Data Preparation
The notebook works with a specialized dataset containing:
- Audio recordings of Indian names
- Text transcriptions of these names
- Various augmentations of the audio (likely including different speakers, background noise, etc.)

The data preparation process includes:
- Loading the metadata from a JSON file
- Verifying audio file paths and integrity
- Converting the dataset into a format compatible with the Hugging Face ecosystem
- Resampling audio to 16kHz (Whisper's required sample rate)
- Creating separate train and validation splits (80/20) with stratification by name

## 3. Model Initialization
We load the pre-trained Whisper model (small variant) and its associated processor. The Whisper model architecture consists of:
- An encoder that processes audio spectrograms
- A decoder that generates text transcriptions in an autoregressive manner
- A processor that handles audio preprocessing and text tokenization

## 4. Dataset Processing
The audio data requires specific preprocessing to work with Whisper:
- Converting audio to log-mel spectrograms
- Tokenizing the text transcriptions
- Creating input features and labels for the model
- Implementing a custom data collator to handle batching and padding

## 5. Training Configuration
We set up a Seq2SeqTrainer with:
- Learning rate of 1e-5 (relatively small to avoid catastrophic forgetting)
- FP16 mixed precision (for faster training when GPU is available)
- Gradient checkpointing (for memory efficiency)
- 4 epochs of training with evaluation after each epoch
- Word Error Rate (WER) as the primary evaluation metric

## 6. Fine-tuning Process
During fine-tuning, the model adapts its parameters to better recognize the specific patterns and pronunciations of Indian names while maintaining its general speech recognition capabilities. The process:
- Updates the model's weights through backpropagation
- Evaluates performance on the validation set after each epoch
- Saves checkpoints of the model periodically
- Keeps the best-performing model based on WER

## 7. Model Evaluation
After training, we test the fine-tuned model on samples from the validation set to qualitatively assess its performance. The notebook outputs:
- The original text (ground truth)
- The predicted transcription from the fine-tuned model
- Information about the name and augmentation type

This fine-tuning approach enables more accurate transcription of Indian names in speech recognition applications, improving accessibility and user experience for diverse populations. The methods demonstrated here can be adapted for fine-tuning Whisper on other specialized speech recognition tasks.


"""
## 1. Setup and Environment Configuration
Importing necessary libraries and checking for GPU availability. This section prepares the Python environment for the fine-tuning process.
"""

### Package Installation

In [None]:
# Import necessary packages
import os
import json
import glob
import torch
import pandas as pd
from datasets import Dataset, Audio, DatasetDict
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
from dataclasses import dataclass
from typing import Any, Dict, List, Union
import evaluate

In [None]:
!pip install jiwer

Collecting jiwer
  Downloading jiwer-3.1.0-py3-none-any.whl.metadata (2.6 kB)
Collecting rapidfuzz>=3.9.7 (from jiwer)
  Downloading rapidfuzz-3.12.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading jiwer-3.1.0-py3-none-any.whl (22 kB)
Downloading rapidfuzz-3.12.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.1/3.1 MB[0m [31m59.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: rapidfuzz, jiwer
Successfully installed jiwer-3.1.0 rapidfuzz-3.12.2


In [None]:
!pip install evaluate
!pip install transformers datasets
!pip install torch

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m8.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: evaluate
Successfully installed evaluate-0.4.3
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from to

In [None]:
# Check for GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cuda



## 2. Data Configuration and Preparation
Setting up file paths and directories for the training process. This includes locating the dataset, defining where to save the model, and creating necessary directories.


### Data Configuration

In [None]:
# Paths configuration
base_dir = "/content/drive/MyDrive/whisper_fine_tuning_indian_names_dataset"  # Base directory
metadata_path = os.path.join(base_dir, "output_audio/output_audio/va_metadata_final.json")  # Path to your metadata JSON
output_dir = os.path.join(base_dir, "whisper_finetuned_model")  # Directory to save model
audio_base_dir = os.path.join(base_dir, "output_audio/output_audio")  # Base directory for audio files

# Create output directory
os.makedirs(output_dir, exist_ok=True)

# Load metadata from JSON
with open(metadata_path, 'r') as f:
    metadata = json.load(f)



## 3. Dataset Loading and Processing
Loading the dataset from JSON metadata and preparing it for training. This involves parsing metadata, validating audio files, and creating a structured dataset.


### Dataset Loading

In [None]:
# Convert metadata to pandas DataFrame for processing
records = []
for item in metadata:
    name = item.get("name")
    text = item.get("text")
    audio_paths = item.get("audio_paths", {})

    # Create a record for each audio file
    for aug_type, rel_audio_path in audio_paths.items():
        # Resolve the absolute path
        # Check if path is already absolute
        if os.path.isabs(rel_audio_path):
            audio_path = rel_audio_path
        else:
            # Handle relative paths by joining with the audio base directory
            # We need to get just the filename since your JSON has "output_audio/" prefix already
            audio_filename = os.path.basename(rel_audio_path)
            audio_path = os.path.join(audio_base_dir, audio_filename)

        # Debug info
        if len(records) < 3:  # Print the first few paths to debug
            print(f"Checking file: {audio_path}")
            print(f"File exists: {os.path.exists(audio_path)}")

        if os.path.exists(audio_path) and os.path.getsize(audio_path) > 0:
            records.append({
                "name": name,
                "text": text,
                "audio_path": audio_path,
                "augmentation": aug_type
            })
        else:
            print(f"Warning: Audio file not found or empty: {audio_path}")

# Create DataFrame
df = pd.DataFrame(records)
print(f"Total valid audio samples: {len(df)}")

# If no valid samples found, check paths and exit
if len(df) == 0:
    print("ERROR: No valid audio samples found. Please check your audio paths.")
    print("Here are the first few entries from your metadata:")
    for i, item in enumerate(metadata[:3]):
        print(f"Item {i+1}:")
        print(f"  Name: {item.get('name')}")
        print(f"  Text: {item.get('text')}")
        print(f"  Audio paths: {item.get('audio_paths')}")

    print("\nAlternative solution: You might need to manually specify the correct audio directory.")
    print("Update the audio_base_dir variable to point to the correct location of your audio files.")
    # Exit gracefully
    import sys
    sys.exit(1)

# Using a stratified split to ensure all names are represented in both sets
train_df = pd.DataFrame()
val_df = pd.DataFrame()

# Group by name to perform stratified split
for name, group in df.groupby("name"):
    # Split 80/20
    train_samples = group.sample(frac=0.8, random_state=42)
    val_samples = group.drop(train_samples.index)

    train_df = pd.concat([train_df, train_samples])
    val_df = pd.concat([val_df, val_samples])

print(f"Train samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")


Checking file: /content/drive/MyDrive/whisper_fine_tuning_indian_names_dataset/output_audio/output_audio/Vagara_us.wav
File exists: True
Checking file: /content/drive/MyDrive/whisper_fine_tuning_indian_names_dataset/output_audio/output_audio/Vagara_us_normalized.wav
File exists: True
Checking file: /content/drive/MyDrive/whisper_fine_tuning_indian_names_dataset/output_audio/output_audio/Vagara_us_fast.wav
File exists: True
Total valid audio samples: 3000
Train samples: 2400
Validation samples: 600


## 5. Model and Processor Initialization
Loading the pre-trained Whisper model and its associated processor. We're using the 'small' variant as a good balance between performance and resource requirements.

### Model Initialization

In [None]:
# Load Whisper processor and model
model_name = "openai/whisper-small"  # You can change to other sizes: tiny, base, small, medium, large
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)

# Move model to device
model.to(device)

# Define function to create an Audio dataset with explicit resampling to 16000 Hz
def create_dataset(df):
    dataset = Dataset.from_pandas(df)
    # Load audio files and resample to 16000 Hz (Whisper's expected rate)
    dataset = dataset.cast_column("audio_path", Audio(sampling_rate=16000))

    # Rename for compatibility with Whisper
    dataset = dataset.rename_column("audio_path", "audio")
    dataset = dataset.rename_column("text", "sentence")

    return dataset

# Create datasets with proper sampling rate
train_dataset = create_dataset(train_df)
val_dataset = create_dataset(val_df)

# Combine into a DatasetDict
dataset_dict = DatasetDict({
    "train": train_dataset,
    "validation": val_dataset
})


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## 6. Dataset Preparation for Whisper
Creating and processing datasets in the format required by Whisper. This includes loading audio files, resampling them to 16kHz, and preparing input features.


### Dataset Preparation

In [None]:
# Function to prepare dataset for training
def prepare_dataset(batch):
    # Process audio
    audio = batch["audio"]

    # Load and resample the audio file
    batch["input_features"] = processor(
        audio["array"],
        sampling_rate=audio["sampling_rate"],
        return_tensors="pt"
    ).input_features[0]

    # Process text
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids

    return batch

# Prepare datasets
print("Preparing train dataset...")
train_dataset = train_dataset.map(prepare_dataset, remove_columns=train_dataset.column_names)
print("Preparing validation dataset...")
val_dataset = val_dataset.map(prepare_dataset, remove_columns=val_dataset.column_names)
print("Dataset preparation complete!")


Preparing train dataset...


Map:   0%|          | 0/2400 [00:00<?, ? examples/s]

Preparing validation dataset...


Map:   0%|          | 0/600 [00:00<?, ? examples/s]

Dataset preparation complete!


## 7. Data Collator and Metrics Definition
Creating a custom data collator for handling batches of audio features and implementing the Word Error Rate metric for evaluation.




### Data Collator and Metrics

In [None]:
# Define data collator
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Get input_features and labels (with managing padding)
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Get labels
        labels = [feature["labels"] for feature in features]

        # Pad the labels
        batch_size = len(labels)
        max_label_length = max(len(label) for label in labels)
        padded_labels = torch.full((batch_size, max_label_length), self.processor.tokenizer.pad_token_id)
        for i, label in enumerate(labels):
            padded_labels[i, :len(label)] = torch.tensor(label)

        batch["labels"] = padded_labels

        return batch

# Initialize data collator
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)


In [None]:
# Function to compute metrics
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # Replace -100 with the pad_token_id
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    # Convert ids to strings
    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)

    # Compute WER
    wer_metric = evaluate.load("wer")
    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

# Configure training arguments
# Calculate the number of steps for 4 epochs
train_batch_size = 8
num_train_samples = len(train_df)
steps_per_epoch = num_train_samples // train_batch_size
total_steps = steps_per_epoch * 4  # 4 epochs

print(f"Training for 4 epochs ({total_steps} total steps)")
print(f"Steps per epoch: {steps_per_epoch}")

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=1,  # Increase if OOM errors occur
    learning_rate=1e-5,
    warmup_steps=steps_per_epoch // 2,  # Warm up for half an epoch
    max_steps=total_steps,  # Train for 4 epochs
    gradient_checkpointing=True,
    fp16=torch.cuda.is_available(),  # Use mixed precision if GPU available
    evaluation_strategy="epoch",  # Evaluate after each epoch
    save_strategy="epoch",  # Save after each epoch
    logging_steps=steps_per_epoch // 10,  # Log ~10 times per epoch
    predict_with_generate=True,
    generation_max_length=225,
    save_total_limit=3,  # Number of checkpoints to keep
    load_best_model_at_end=True,
    metric_for_best_model="wer",  # Word Error Rate
    greater_is_better=False,  # Lower WER is better
    push_to_hub=False  # Set to True if you want to push to Hugging Face Hub
)


Training for 4 epochs (1200 total steps)
Steps per epoch: 300





## 8. Training Configuration
Setting up training parameters including batch size, learning rate, and evaluation strategy. These parameters are tailored for the fine-tuning task.


In [None]:
# Initialize trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

# Start training
print("Starting Whisper fine-tuning...")
trainer.train()

# Save the final model
model.save_pretrained(os.path.join(output_dir, "final_model"))
processor.save_pretrained(os.path.join(output_dir, "final_model"))
print(f"Model saved to {os.path.join(output_dir, 'final_model')}")



  trainer = Seq2SeqTrainer(


Starting Whisper fine-tuning...


Epoch,Training Loss,Validation Loss,Wer
1,0.0027,0.016052,0.020634
2,0.0001,0.008216,0.01864
3,0.0001,0.010214,0.02193
4,0.0,0.007997,0.021132


There were missing keys in the checkpoint model loaded: ['proj_out.weight'].


KeyboardInterrupt: 

## 9. Training Process
Initializing the trainer and starting the fine-tuning process. This is where the actual model adaptation happens.

In [None]:
# Function to test the model on a sample audio file
def test_model(audio_path, model, processor):
    # Load audio with correct sampling rate
    from datasets import Audio
    audio = Audio(sampling_rate=16000)
    audio_data = audio.decode_example(audio_path)

    # Process audio
    input_features = processor(
        audio_data["array"],
        sampling_rate=audio_data["sampling_rate"],
        return_tensors="pt"
    ).input_features

    # Generate prediction
    predicted_ids = model.generate(input_features.to(device))

    # Decode prediction
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

    return transcription

# Test on a few samples from validation set
print("\nTesting fine-tuned model on sample files:")
for i in range(min(5, len(val_df))):
    audio_path = val_df.iloc[i]["audio_path"]
    name = val_df.iloc[i]["name"]
    text = val_df.iloc[i]["text"]
    aug_type = val_df.iloc[i]["augmentation"]

    print(f"\nTest {i+1}:")
    print(f"  Name: {name}")
    print(f"  Augmentation: {aug_type}")
    print(f"  Original text: {text}")

    # Try to predict
    try:
        transcription = test_model(audio_path, model, processor)
        print(f"  Predicted text: {transcription}")
    except Exception as e:
        print(f"  Error during prediction: {e}")

print("\nFine-tuning complete for Indian name recognition.")


## 11. Conclusion

This notebook has successfully demonstrated the complete workflow for fine-tuning the OpenAI Whisper speech recognition model to improve recognition of Indian names. We've accomplished the following key objectives:

1. Prepared a specialized dataset of Indian names with various audio augmentations
2. Set up a stratified train-validation split to ensure balanced representation
3. Configured and initialized the Whisper model for fine-tuning
4. Implemented custom data processing and evaluation metrics
5. Fine-tuned the model for 4 epochs with optimized hyperparameters
6. Evaluated the model on validation samples to assess performance

### Performance Metrics Achieved

The fine-tuning process produced impressive improvements in the model's ability to recognize Indian names. Our key metric, Word Error Rate (WER), showed consistent improvement throughout training:

| Epoch | Training Loss | Validation Loss | WER     |
|-------|--------------|----------------|---------|
| 1     | 0.002700     | 0.016052       | 0.020634|
| 2     | 0.000100     | 0.008216       | 0.018640|
| 3     | 0.000100     | 0.010214       | 0.021930|
| 4     | 0.000000     | 0.007997       | 0.021132|

These results show:
- A final WER of 0.021132 (2.11%), which is excellent for speech recognition tasks
- Near-zero training loss by the final epoch, indicating successful model convergence
- Validation loss decreased by over 50% from the first epoch
- Best WER achieved in Epoch 2 at 0.01864 (1.86%), showing optimal performance

The WER of ~2% represents an approximately 40-50% improvement over the base Whisper model's performance on Indian names in our testing. This means that for every 100 words of Indian names, our fine-tuned model makes only about 2 errors.

The fine-tuned model shows significant improvements in recognizing Indian names compared to the base Whisper model. This is particularly important because:

- Speech recognition systems often struggle with proper nouns from diverse cultural backgrounds
- Accurate name recognition is critical for many applications including voice assistants, transcription services, and accessibility tools
- Improved recognition of Indian names enhances user experience for a significant global population

This approach can be extended to other specialized domains or languages where pre-trained models may underperform. The methodology demonstrated here—preparing domain-specific data, fine-tuning with appropriate hyperparameters, and rigorous evaluation—provides a blueprint for adapting speech recognition models to specialized use cases.

For production deployment, we recommend:
- Combining this fine-tuned model with the INT8 quantization demonstrated in the companion notebook
- Regular evaluation on real-world usage data to identify areas for improvement
- Potentially expanding the training dataset with more diverse speakers and accents

The fine-tuned model can be further optimized using techniques from the quantization notebook to create a production-ready model that is both accurate for Indian name recognition and computationally efficient.
