## Evaluation of the S2T Medium Model

This notebook evaluates the **S2T medium** automatic speech recognition (ASR) model on the **GSL English Podcast Dataset** (`vietnhat/gsl-english-podcast-dataset`) from **Hugging Face**. The goal is to measure how accurately the model transcribes short English speech clips, using the first ten clips of the `train` split.

The audio samples are preprocessed and passed through the pretrained **S2T-medium** model to generate transcriptions. Model performance is measured using standard ASR evaluation metrics:

- **Word Error Rate (WER)**
- **Character Error Rate (CER)**
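
Both metrics are edit-distance ratios: the minimum number of substitutions, deletions, and insertions needed to turn the reference into the hypothesis, divided by the reference length, i.e. WER = (S + D + I) / N over words and CER the same over characters. The sketch below is a minimal pure-Python illustration of that definition. The helper names `edit_distance`, `simple_wer`, and `simple_cer` are hypothetical and are not part of `jiwer`, which this notebook uses for the actual scores.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token dropped)
                curr[j - 1] + 1,     # insertion (extra hypothesis token)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def simple_wer(reference, hypothesis):
    """Word-level error rate for one reference/hypothesis pair."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def simple_cer(reference, hypothesis):
    """Character-level error rate for one reference/hypothesis pair."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(simple_wer("the cat sat on the mat", "the cat sit on mat"))
# One substituted character over 5 characters.
print(simple_cer("hello", "hallo"))
```

Because the denominator is the reference length, both rates can exceed 1.0 when the hypothesis contains many insertions.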

In [1]:
!pip install transformers datasets torchaudio librosa soundfile --quiet

In [2]:
!pip install torchcodec --quiet

In [3]:
!pip install jiwer --quiet

In [4]:
import torch
import librosa
import numpy as np
from datasets import load_dataset
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
from jiwer import wer, cer

In [5]:
dataset = load_dataset("vietnhat/gsl-english-podcast-dataset")
samples = dataset["train"].select(range(10))

In [6]:
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-medium-librispeech-asr")
model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-medium-librispeech-asr")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

Speech2TextForConditionalGeneration(
  (model): Speech2TextModel(
    (encoder): Speech2TextEncoder(
      (conv): Conv1dSubsampler(
        (conv_layers): ModuleList(
          (0): Conv1d(80, 1024, kernel_size=(5,), stride=(2,), padding=(2,))
          (1): Conv1d(512, 1024, kernel_size=(5,), stride=(2,), padding=(2,))
        )
      )
      (embed_positions): Speech2TextSinusoidalPositionalEmbedding()
      (layers): ModuleList(
        (0-11): 12 x Speech2TextEncoderLayer(
          (self_attn): Speech2TextAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation_fn): ReLU()
          (fc1): Linear(in_features=

In [7]:
target_sr = 16000
predictions = []
ground_truths = []

for i, sample in enumerate(samples):
    print(f"\n========== AUDIO CLIP {i+1} ==========")

    # Load audio
    audio = sample["audio"]["array"]
    sr = sample["audio"]["sampling_rate"]
    print("Original SR:", sr, "| Length:", len(audio))

    # Convert to mono
    if audio.ndim > 1:
        audio = np.mean(audio, axis=1)
    print("Mono audio length:", len(audio))

    # Resample to 16 kHz
    if sr != target_sr:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)
    print("Resampled length:", len(audio))

    # Peak-normalize to [-1, 1]; leave silent clips unchanged to avoid dividing by zero
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak
    print("After normalization → Min:", np.min(audio), "Max:", np.max(audio))

    inputs = processor(audio, sampling_rate=target_sr, return_tensors="pt").input_features.to(device)

    generated_ids = model.generate(inputs)
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

    predictions.append(transcription.lower())
    ground_truths.append(sample["text"].lower())

    print("Ground Truth:", sample["text"])
    print("Prediction  :", transcription)


Original SR: 16000 | Length: 143198
Mono audio length: 143198
Resampled length: 143198
After normalization → Min: -1.0 Max: 0.9409418
Ground Truth:  Hello there everyone and welcome back to GSL English. My name is Gideon and in today's lesson
Prediction  : hallo there everyone and welcome back to g s o english my name is gideon and into days lesser

Original SR: 16000 | Length: 134558
Mono audio length: 134558
Resampled length: 134558
After normalization → Min: -1.0 Max: 0.9253715
Ground Truth:  we are going to study English together through a short story. So if you are new here let me just
Prediction  : we are going to study english together through a short story so if you are new here let me just

Original SR: 16000 | Length: 125918
Mono audio length: 125918
Resampled length: 125918
After normalization → Min: -1.0 Max: 0.8111806
Ground Truth:  very briefly explain how this lesson is going to work. So we are firstly going to read the story
Prediction  : very briefly explain how this 

In [8]:
final_wer = wer(ground_truths, predictions)
final_cer = cer(ground_truths, predictions)

print("\n========== FINAL EVALUATION ==========")
print("Total samples evaluated:", len(samples))
print(f"Word Error Rate (WER): {final_wer:.3f}")
print(f"Character Error Rate (CER): {final_cer:.3f}")


Total samples evaluated: 10
Word Error Rate (WER): 0.141
Character Error Rate (CER): 0.059
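
One caveat about these scores: the scoring cell only lowercases the text, so the punctuation kept in the reference transcripts (for example `english.`) is compared against unpunctuated model output and counted as an error. A common remedy is to normalize both sides before scoring. The sketch below shows one possible normalizer. The function name `normalize_text` and its exact rules are assumptions made for illustration, and the figures reported above were not computed this way.

```python
import re


def normalize_text(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace.

    Illustrative English-only rule set: keeps a-z, 0-9, and apostrophes.
    """
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)  # everything else becomes a space
    return " ".join(text.split())               # trim and collapse runs of spaces


reference = (
    " Hello there everyone and welcome back to GSL English. "
    "My name is Gideon and in today's lesson"
)
print(normalize_text(reference))
```

Applying the same function to every entry of `ground_truths` and `predictions` before calling `wer` and `cer` would score the two sides on equal terms. This would likely lower both error rates, though by how much is not measured here.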


## Evaluation Summary: S2T Medium

The `facebook/s2t-medium-librispeech-asr` model was evaluated on ten English podcast audio clips. It achieved a Word Error Rate (WER) of 14.1% and a Character Error Rate (CER) of 5.9%, indicating moderate transcription accuracy.

Because this model was trained primarily on read speech from LibriSpeech, its accuracy on conversational, podcast-style speech is lower than that of models such as Whisper-medium.