## Evaluation of Hubert Large ASR model

This notebook evaluates the **Hubert Large ASR model** automatic speech recognition (ASR) model on the **GSL English Podcast Dataset** from **Hugging Face**. The objective is to analyze how accurately the model transcribes short English speech clips.

The audio samples are preprocessed and passed through the pretrained **Hubert Large ASR ** model to generate transcriptions. Model performance is measured using standard ASR evaluation metrics:

- **Word Error Rate (WER)**
- **Character Error Rate (CER)**

In [1]:
!pip install transformers torchaudio librosa soundfile --quiet

In [2]:
!pip install torchcodec --quiet

In [3]:
!pip install jiwer --quiet

In [4]:
import torch
import librosa
import numpy as np
from datasets import load_dataset
from transformers import HubertForCTC, Wav2Vec2Processor
from jiwer import wer, cer

In [5]:
dataset = load_dataset("vietnhat/gsl-english-podcast-dataset")
samples = dataset["train"].select(range(10))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [6]:
processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft").to("cuda")

In [7]:
target_sr = 16000

predictions = []
references = []

for i, sample in enumerate(samples):
    print(f"\n========== AUDIO CLIP {i+1} ==========")

    audio = sample["audio"]["array"]
    sr = sample["audio"]["sampling_rate"]

    if audio.ndim > 1:
        audio = np.mean(audio, axis=1)

    if sr != target_sr:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)

    audio = audio / np.max(np.abs(audio))

    inputs = processor(audio, sampling_rate=target_sr, return_tensors="pt", padding=True)
    input_values = inputs.input_values.to("cuda")

    with torch.no_grad():
        logits = model(input_values).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.decode(predicted_ids[0])

    print("Ground Truth:")
    print(sample["text"])
    print("Prediction:")
    print(transcription.lower())

    predictions.append(transcription.lower())
    references.append(sample["text"].lower())


Ground Truth:
 Hello there everyone and welcome back to GSL English. My name is Gideon and in today's lesson
Prediction:
hello there every one and welcome back to g s el english my name is gideon and in to day's lesson

Ground Truth:
 we are going to study English together through a short story. So if you are new here let me just
Prediction:
we are going to study english together through a short story so if you are new here let me just

Ground Truth:
 very briefly explain how this lesson is going to work. So we are firstly going to read the story
Prediction:
very briefly explain how this lesson is going to work so we are firstly going to read the story

Ground Truth:
 in its entirety okay and then we're just going to talk about it a little bit to make sure we
Prediction:
in its entirety o k and then we're just going to talk about it a little bit to make sure we've

Ground Truth:
 fully understood what was going on and then we are going to break it down
Prediction:
fully understood wha

In [8]:
final_wer = wer(references, predictions)
final_cer = cer(references, predictions)

print("\n========== FINAL EVALUATION ==========")
print("Total samples evaluated:", len(samples))
print("Word Error Rate (WER): {:.3f}".format(final_wer))
print("Character Error Rate (CER): {:.3f}".format(final_cer))


Total samples evaluated: 10
Word Error Rate (WER): 0.115
Character Error Rate (CER): 0.033


Evaluation Summary â€“ Hubert Large ASR

The Hubert-large ASR model was evaluated on ten English podcast audio clips. It achieved a Word Error Rate (WER) of 11.5% and a Character Error Rate (CER) of 3.3%, indicating good transcription accuracy, though slightly behind Whisper-medium in this dataset.