# Introduction
## Evaluation of ASR models

## Audio Analysis & Automatic Speech Recognition (ASR) Evaluation

This notebook explores **Automatic Speech Recognition (ASR)** using pretrained models. It shows how raw audio is converted into text and how ASR performance can be measured.

This notebook uses the **GSL English Podcast Dataset** from **Hugging Face**, which contains short English audio clips with their transcripts. Basic audio preprocessing is applied before a pretrained **Wav2Vec2 (CTC-based)** model transcribes the speech to text.


---

## Goals of This Notebook

- Load the dataset
- Apply essential audio preprocessing steps:
  - Resampling to **16 kHz**
  - **Mono** conversion
  - **Amplitude normalization**
- Perform **speech-to-text transcription** using a pretrained ASR model
- Evaluate transcription quality using:
  - **Word Error Rate (WER)**
  - **Character Error Rate (CER)**
- Understand how ASR performance changes as the number of evaluation samples increases
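
For reference, both metrics are normalized edit distances. In the standard definition, with $S$ substitutions, $D$ deletions, and $I$ insertions needed to align the prediction with the reference, and $N$ words in the reference:

$$
\mathrm{WER} = \frac{S + D + I}{N}
$$

CER uses the same formula with characters in place of words. As a small worked case, take the reference fragment "hello there everyone" ($N = 3$) and the prediction "hello there every one": one substitution ("everyone" becomes "every") and one insertion ("one") give $\mathrm{WER} = (1 + 0 + 1)/3 \approx 0.67$. Because insertions are counted, WER can exceed 1.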




In [2]:
!pip install datasets transformers torchaudio librosa --quiet

In [3]:
import torch
import librosa
import numpy as np
from datasets import load_dataset
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC


In [4]:
dataset = load_dataset("vietnhat/gsl-english-podcast-dataset")

samples = dataset["train"].select(range(10))




In [5]:
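# Use the GPU when one is available; otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base-960h"
).to(device)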


Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:

!pip install torchcodec --quiet

In [7]:
!pip install jiwer --quiet


In [8]:
from jiwer import wer, cer

all_references = []
all_predictions = []


In [9]:
target_sr = 16000

for i, sample in enumerate(samples):
    print(f"\n========== AUDIO CLIP {i+1} ==========")

    # Load audio
    audio = sample["audio"]["array"]
    sr = sample["audio"]["sampling_rate"]
    print("Original SR:", sr, "| Length:", len(audio))

    # Convert to mono (if multi-channel); assumes a (channels, samples) layout
    if audio.ndim > 1:
        audio = np.mean(audio, axis=0)
    print("Mono audio length:", len(audio))

    # Resample to 16 kHz
    if sr != target_sr:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)
    print("Resampled length:", len(audio))

    # Peak-normalize to [-1, 1]; skip silent clips to avoid dividing by zero
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak
    print("After normalization → Min:", np.min(audio), "Max:", np.max(audio))

    # Feed to Wav2Vec2
    inputs = processor(
        audio,
        sampling_rate=target_sr,
        return_tensors="pt",
        padding=True
    )

    input_values = inputs.input_values.to(model.device)

    with torch.no_grad():
        logits = model(input_values).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.decode(predicted_ids[0])

    # Store for evaluation
    all_references.append(sample["text"].lower().strip())
    all_predictions.append(transcription.lower().strip())


    # Output
    print("Ground Truth:")
    print(sample["text"])
    print("Prediction:")
    print(transcription.lower())



Original SR: 16000 | Length: 143198
Mono audio length: 143198
Resampled length: 143198
After normalization → Min: -1.0 Max: 0.9409418
Ground Truth:
 Hello there everyone and welcome back to GSL English. My name is Gideon and in today's lesson
Prediction:
hello there every one and welcome back to geersal english my name is gideon and into day's lesson

Original SR: 16000 | Length: 134558
Mono audio length: 134558
Resampled length: 134558
After normalization → Min: -1.0 Max: 0.9253715
Ground Truth:
 we are going to study English together through a short story. So if you are new here let me just
Prediction:
we are going to study english together through a short story so if you are new here let me just

Original SR: 16000 | Length: 125918
Mono audio length: 125918
Resampled length: 125918
After normalization → Min: -1.0 Max: 0.8111806
Ground Truth:
 very briefly explain how this lesson is going to work. So we are firstly going to read the story
Prediction:
very briefly explain how this le
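
The `torch.argmax` and `processor.decode` calls in the loop above amount to greedy CTC decoding: pick the most likely token per frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that collapse on a hand-written frame sequence. It is a toy stand-in, not the tokenizer's actual implementation; `ctc_greedy_collapse` and the example frames are hypothetical, and it assumes `<pad>` acts as the CTC blank and `|` as the word delimiter, as in the `facebook/wav2vec2-base-960h` vocabulary.

```python
from itertools import groupby

BLANK = "<pad>"          # assumed CTC blank token
WORD_DELIMITER = "|"     # assumed word-boundary token


def ctc_greedy_collapse(frame_tokens):
    """Collapse a per-frame token sequence into text (greedy CTC decoding)."""
    # Step 1: merge runs of identical consecutive tokens
    collapsed = [token for token, _ in groupby(frame_tokens)]
    # Step 2: remove blanks (they only separate genuine repeats, e.g. "L L")
    chars = [token for token in collapsed if token != BLANK]
    # Step 3: turn word delimiters into spaces
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# Hypothetical per-frame argmax output for two short words
frames = ["H", "H", "<pad>", "E", "L", "L", "<pad>", "L", "O", "|", "|", "H", "I"]
decoded = ctc_greedy_collapse(frames)
print(decoded)
```

The blank between the two `L` runs is what lets the double letter survive the merge step; without it, the repeats would collapse into a single `L`.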

In [10]:
wer_score = wer(all_references, all_predictions)
cer_score = cer(all_references, all_predictions)

print("\n========== FINAL EVALUATION ==========")
print(f"Total samples evaluated: {len(all_references)}")
print(f"Word Error Rate (WER): {wer_score:.3f}")
print(f"Character Error Rate (CER): {cer_score:.3f}")



Total samples evaluated: 10
Word Error Rate (WER): 0.115
Character Error Rate (CER): 0.036


## Evaluation Results

The Wav2Vec2-base-960h model was evaluated on the first ten audio samples of the GSL English Podcast dataset. Transcription quality was measured using Word Error Rate (WER) and Character Error Rate (CER).
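
One detail that affects both scores is text normalization. The printed ground-truth transcripts contain capitalization and punctuation, while the model output is unpunctuated, so a token such as `english.` will not match `english` after lowercasing alone. A minimal sketch of a stricter normalizer follows, assuming it is acceptable to drop everything except letters, digits, apostrophes, and spaces (`normalize_text` is a hypothetical helper, not part of the notebook or of `jiwer`):

```python
import re


def normalize_text(text):
    """Lowercase, strip punctuation (except apostrophes), and collapse whitespace."""
    text = text.lower()
    # Replace anything that is not a letter, digit, apostrophe, or space
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    # Collapse repeated spaces and trim both ends
    return " ".join(text.split())


reference = " Hello there everyone and welcome back to GSL English. My name is Gideon"
normalized = normalize_text(reference)
print(normalized)
```

Applying the same function to both references and predictions before calling `wer` and `cer` would keep punctuation from being counted as an error; the scores reported here were computed with lowercasing and whitespace stripping only.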

The model achieved a WER of 11.5% and a CER of 3.6%, demonstrating good transcription accuracy on short English speech clips. Increasing the number of evaluation samples led to a more stable and slightly improved WER, indicating consistent model performance across different audio samples.
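
The effect of sample count can be checked by tracking the pooled WER after each additional clip. The sketch below is a stand-alone illustration with a small word-level edit distance and made-up sentences; `edit_distance` and `running_wer` are hypothetical helpers. It assumes, as `jiwer` is understood to do for lists of sentences, that errors and reference words are summed across clips instead of averaging per-clip scores, and that the first reference is non-empty.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def running_wer(references, hypotheses):
    """Pooled WER over the first k samples, for k = 1..len(references)."""
    total_errors = 0
    total_words = 0
    scores = []
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        total_errors += edit_distance(ref_words, hyp.split())
        total_words += len(ref_words)
        scores.append(total_errors / total_words)
    return scores


# Made-up sentences for illustration only (not taken from the dataset)
refs = ["the cat sat", "a dog barked loudly", "hello world"]
hyps = ["the cat sat", "a dog barked", "hello word"]
scores = running_wer(refs, hyps)
for k, score in enumerate(scores, start=1):
    print(f"first {k} sample(s): WER = {score:.3f}")
```

In the notebook, the same function could be called on `all_references` and `all_predictions` to see how the score settles as clips are added.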