## Evaluation of Whisper-Medium ASR Model

This notebook evaluates the **Whisper-medium** automatic speech recognition (ASR) model on the **GSL English Podcast Dataset** (`vietnhat/gsl-english-podcast-dataset`) from **Hugging Face**. The goal is to measure how accurately the model transcribes short English speech clips, using the first ten clips of the `train` split.

Each audio sample is converted to mono, resampled to 16 kHz, and peak-normalized, then passed through the pretrained **Whisper-medium** model to generate a transcription. Model performance is measured on lowercased text using two standard ASR evaluation metrics:

- **Word Error Rate (WER)**
- **Character Error Rate (CER)**
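Both metrics follow the same formula, (S + D + I) / N, where S, D, and I are the substitutions, deletions, and insertions needed to turn the reference into the prediction, and N is the reference length in words (WER) or characters (CER). The sketch below is a minimal pure-Python illustration of that definition, pooled over all clips. The helper names `edit_distance` and `corpus_error_rate` are hypothetical and are not part of `jiwer`, whose handling of whitespace and edge cases may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_error_rate(refs, hyps, unit="word"):
    """Total edits divided by total reference length, pooled over all pairs."""
    split = (lambda s: s.split()) if unit == "word" else list
    errors = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    return errors / total


# One clip where the only difference is a comma after "here"
refs = ["so if you are new here let me just"]
hyps = ["so if you are new here, let me just"]

print("WER:", corpus_error_rate(refs, hyps, unit="word"))  # 1 edit / 9 words
print("CER:", corpus_error_rate(refs, hyps, unit="char"))  # 1 edit / 34 characters
```

A single stray comma costs a full word error under WER but only one character under CER, which is why CER is usually the lower of the two.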



In [1]:
!pip install transformers datasets torchaudio librosa soundfile --quiet

In [2]:
!pip install torchcodec --quiet

In [3]:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium").to(device)


Using device: cuda



In [4]:
!pip install jiwer --quiet

In [7]:
from datasets import load_dataset, Audio
import librosa
import numpy as np
from jiwer import wer, cer

# Load dataset (decoded audio)
dataset = load_dataset("vietnhat/gsl-english-podcast-dataset")
dataset = dataset.cast_column("audio", Audio(decode=True))

samples = dataset["train"].select(range(10))  # first 10

target_sr = 16000

all_refs = []
all_preds = []

for i, sample in enumerate(samples):
    print(f"\n========== AUDIO CLIP {i+1} (Whisper) ==========")

    # Load and preprocess audio
    audio_array = sample["audio"]["array"]
    sr = sample["audio"]["sampling_rate"]

    # Mono
    if audio_array.ndim > 1:
        audio_array = np.mean(audio_array, axis=1)

    # Resample to 16kHz
    if sr != target_sr:
        audio_array = librosa.resample(audio_array, orig_sr=sr, target_sr=target_sr)

    # Peak-normalize; skip silent clips to avoid dividing by zero
    peak = np.max(np.abs(audio_array))
    if peak > 0:
        audio_array = audio_array / peak

    # Prepare inputs
    inputs = whisper_processor(
        audio_array,
        sampling_rate=target_sr,
        return_tensors="pt"
    ).input_features.to(device)  # audio features

    # Generate tokens
    with torch.no_grad():
        predicted_ids = whisper_model.generate(inputs)

    # Decode to text
    transcription = whisper_processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

    # Store lowercased, stripped text for scoring (punctuation is kept)
    all_refs.append(sample["text"].lower().strip())
    all_preds.append(transcription.lower().strip())

    # Print
    print("Ground Truth:")
    print(sample["text"])
    print("Whisper Prediction:")
    print(transcription)



Ground Truth:
 Hello there everyone and welcome back to GSL English. My name is Gideon and in today's lesson
Whisper Prediction:
 Hello there everyone and welcome back to GSL English. My name is Gideon and in today's lesson

Ground Truth:
 we are going to study English together through a short story. So if you are new here let me just
Whisper Prediction:
 We are going to study English together through a short story. So if you are new here, let me just

Ground Truth:
 very briefly explain how this lesson is going to work. So we are firstly going to read the story
Whisper Prediction:
 very briefly explain how this lesson is going to work. So we are firstly going to read the story

Ground Truth:
 in its entirety okay and then we're just going to talk about it a little bit to make sure we
Whisper Prediction:
 in its entirety. Okay. And then we're just going to talk about it a little bit to make sure we

Ground Truth:
 fully understood what was going on and then we are going to break it do

In [6]:
wer_score = wer(all_refs, all_preds)
cer_score = cer(all_refs, all_preds)

print("\n========== WHISPER EVALUATION ==========")
print(f"Total samples evaluated: {len(all_refs)}")
print(f"Word Error Rate (WER): {wer_score:.3f}")
print(f"Character Error Rate (CER): {cer_score:.3f}")



Total samples evaluated: 10
Word Error Rate (WER): 0.038
Character Error Rate (CER): 0.015
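The scores above are computed on text that has only been lowercased and stripped of surrounding whitespace, so a prediction such as "entirety. okay. and" is penalized against a reference of "entirety okay and" even though the words match. A common remedy is to normalize both sides before scoring. The following is a minimal sketch of one such normalizer; `normalize_text` is a hypothetical helper, and the choice to keep apostrophes (so that "we're" stays one word) is an assumption, not something the notebook specifies.

```python
import re


def normalize_text(text):
    """Lowercase, drop punctuation other than apostrophes, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip()


ref = " in its entirety okay and then we're just going to talk about it"
hyp = " in its entirety. Okay. And then we're just going to talk about it"

print(normalize_text(ref))
print(normalize_text(hyp))
print("Match after normalization:", normalize_text(ref) == normalize_text(hyp))
```

To use it in the evaluation, apply `normalize_text` to each reference and prediction in place of `.lower().strip()` before calling `wer` and `cer`. The resulting scores would then reflect word-level mistakes only and are not directly comparable to the figures reported above.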


## Evaluation Summary: Whisper-Medium

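The Whisper-medium ASR model was evaluated on ten English podcast audio clips. It achieved a Word Error Rate (WER) of 3.8% and a Character Error Rate (CER) of 1.5%. Both scores were computed on lowercased text with punctuation retained, so punctuation differences such as "here" versus "here," count as errors, and several of the mismatches visible in the printed transcripts are of this kind. With only ten clips, these figures indicate high transcription accuracy but should be read as a small-sample estimate.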
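Because the evaluation covers only ten clips, it can help to attach an uncertainty range to the WER. One option is a percentile bootstrap over clips, sketched below. The function `bootstrap_wer_interval` is hypothetical, and the per-clip counts in the example are illustrative placeholders, not values taken from this notebook.

```python
import random


def bootstrap_wer_interval(per_clip_counts, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for corpus-level WER.

    per_clip_counts: list of (word_errors, reference_words) tuples, one per clip.
    """
    rng = random.Random(seed)
    n = len(per_clip_counts)
    estimates = []
    for _ in range(n_resamples):
        # Resample clips with replacement, then pool errors and words
        resample = [per_clip_counts[rng.randrange(n)] for _ in range(n)]
        errors = sum(e for e, _ in resample)
        words = sum(w for _, w in resample)
        estimates.append(errors / words)
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Illustrative counts only, NOT the notebook's actual per-clip results
counts = [(0, 18), (2, 20), (0, 19), (3, 19), (1, 17)]
low, high = bootstrap_wer_interval(counts)
print(f"95% bootstrap interval for WER: [{low:.3f}, {high:.3f}]")
```

With so few clips the interval is typically wide, which is a useful reminder that a single headline WER from a small sample is only a rough estimate.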

Compared to Wav2Vec2-base, Whisper performed substantially better, particularly on conversational speech and on words whose transcription depends on context.