## Lab 7.1 – Automatic Speech Recognition (ASR): Thonburian Whisper, Typhoon ASR model

In this lab, we demonstrate how to use pre-trained models for Thai Automatic Speech Recognition (ASR). Two models are introduced: Thonburian Whisper and the Typhoon ASR model.


## 1) Setup
The code below installs the dependencies and imports all required libraries used in the rest of this notebook.

In [None]:
# Install dependencies
!pip install git+https://github.com/huggingface/transformers
!pip install librosa soundfile
!sudo apt install -y ffmpeg
!pip install torchaudio ipywebrtc notebook
!pip install -q gradio
!pip install pytube
!jupyter nbextension enable --py widgetsnbextension
# Quote the extras specifier so the shell does not treat the brackets as a glob pattern
!pip install -U "nemo_toolkit[asr]"
!pip install yt_dlp

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-oq3hj_rn
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-oq3hj_rn
  Resolved https://github.com/huggingface/transformers to commit 88a5623361e1b3d844daef3c6c95535d12e70056
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting huggingface-hub<2.0,>=1.2.1 (from transformers==5.0.0.dev0)
  Using cached huggingface_hub-1.2.4-py3-none-any.whl.metadata (13 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers==5.0.0.dev0)
  Downloading tokenizers-0.22.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.3 kB)
Using cached huggingface_hub-1.2.4-py3-none-any.whl (520 kB)
Downloading tokenizers-0.22.2-cp39-abi3-manylinux_2_17_x86_64.manylinux

In [None]:
# Import library
import os
import time

import torch
from transformers import pipeline, WhisperProcessor
import nemo.collections.asr as nemo_asr

import librosa
import soundfile as sf
import yt_dlp  # used by yt_transcribe() below
from ipywebrtc import AudioRecorder, CameraStream
from google.colab import output
output.enable_custom_widget_manager()

[NeMo W 2026-01-07 18:44:43 nemo_logging:405] Megatron num_microbatches_calculator not found, using Apex version.


## 2) Thonburian Whisper

Thonburian Whisper is an Automatic Speech Recognition (ASR) model for Thai, fine-tuned from OpenAI's Whisper model on Common Voice 13, the Gowajee corpus, Thai Elderly Speech, and Thai Dialect datasets. The model is robust to environmental noise and has been adapted to domain-specific audio, such as financial and medical speech.

<figure>
<img src="https://raw.githubusercontent.com/sanchit-gandhi/notebooks/main/whisper_architecture.svg" alt="Whisper encoder-decoder architecture" style="width:100%">
<figcaption align = "center"><b>Figure 1:</b> Whisper model. The architecture
follows the standard Transformer-based encoder-decoder model. Figure source:
<a href="https://openai.com/blog/whisper/">OpenAI Whisper Blog</a>.</figcaption>
</figure>

Thonburian Whisper is available in three different model types:

1. Thonburian Whisper (Standard Models)

These models are fine-tuned versions of OpenAI's Whisper, optimized for Thai ASR.

2. Distilled Thonburian Whisper Models

These models are distilled versions of the larger Thonburian Whisper models, offering improved efficiency. Use these models for efficient Thai ASR in resource-constrained environments or for faster inference times.

3. Thonburian Whisper with Timestamps

This model is specifically designed for Thai ASR with timestamp generation. It's based on the Whisper medium architecture and fine-tuned on a custom longform dataset.

| Size               | Parameters | URL |
|--------------------|------------|-----|
| Small              | 244M       | [Link](https://huggingface.co/biodatlab/whisper-th-small-combined) |
| Medium             | 769M       | [Link](https://huggingface.co/biodatlab/whisper-th-medium-combined) |
| Large-v2           | 1550M      | [Link](https://huggingface.co/biodatlab/whisper-th-large-combined) |
| Large-v3           | 1550M      | [Link](https://huggingface.co/biodatlab/whisper-th-large-v3-combined) |
| Distilled (Small)  | -          | [Link](https://huggingface.co/biodatlab/distill-whisper-th-small) |
| Distilled (Medium) | -          | [Link](https://huggingface.co/biodatlab/distill-whisper-th-medium) |
| Distilled (Large)  | -          | [Link](https://huggingface.co/biodatlab/distill-whisper-th-large) |
| Timestamps (Medium) | 769M      | [Link](https://huggingface.co/biodatlab/whisper-th-medium-timestamp) |

For demonstration purposes, we'll use the `whisper-th-medium-timestamp` version.

The key components for using Thonburian Whisper are:

`pipeline` in Hugging Face Transformers is a high-level API that bundles the tokenizer, feature extractor, and model behind a single call.

`WhisperProcessor`'s main functions are:
- audio preprocessing: e.g., converting waveforms to log-Mel spectrograms.
- tokenization: e.g., converting text to token IDs and decoding token IDs back to text.
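
To make the audio-preprocessing step more concrete, the sketch below reproduces only the shape bookkeeping of Whisper-style feature extraction: audio is padded or trimmed to a fixed 30-second window at 16 kHz and then framed with a hop of 160 samples. The constants are the standard Whisper settings. The helpers `pad_or_trim` and `num_frames` are hypothetical names defined here for illustration, not part of the `WhisperProcessor` API, and no Mel filterbank is actually computed.

```python
# Illustrative only: shape bookkeeping for Whisper-style audio preprocessing.
SAMPLE_RATE = 16_000      # Hz, the sampling rate Whisper models expect
CHUNK_SECONDS = 30        # fixed input window length in seconds
HOP_LENGTH = 160          # samples between consecutive spectrogram frames

N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per window


def pad_or_trim(samples, length=N_SAMPLES):
    """Zero-pad or truncate a list of samples to exactly `length` items."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


def num_frames(n_samples, hop_length=HOP_LENGTH):
    """Number of spectrogram frames produced for `n_samples` samples."""
    return n_samples // hop_length


two_seconds = [0.1] * (2 * SAMPLE_RATE)   # hypothetical 2-second clip
window = pad_or_trim(two_seconds)
print(len(window), num_frames(len(window)))  # 480000 3000
```

In practice `processor.feature_extractor` handles all of this internally; the sketch only shows why every input ends up with the same number of frames regardless of the clip length.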



In [None]:
MODEL_NAME = "biodatlab/whisper-th-medium-timestamp"
lang = "th"

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load processor explicitly
processor = WhisperProcessor.from_pretrained(MODEL_NAME)

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    device=device,
    return_timestamps=True,
)
print(pipe)

Device set to use cuda:0


<transformers.pipelines.automatic_speech_recognition.AutomaticSpeechRecognitionPipeline object at 0x7a136f8f6c30>


In [None]:
# Inference from your own voice
camera = CameraStream(constraints={'audio': True, 'video': False})
recorder = AudioRecorder(stream=camera)
recorder

AudioRecorder(audio=Audio(value=b'', format='webm'), stream=CameraStream(constraints={'audio': True, 'video': …

In [None]:
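# ipywebrtc records WebM audio (see format='webm' above); ffmpeg detects the
# container from the file contents, so the .mp3 extension does not matter here.
recorder.save("recorder.mp3")
recorder_result = pipe("recorder.mp3")
print(recorder_result)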

{'text': 'ทดสอบ การ บันทึก เสีย เพื่อ ใช้ ใน การ วัด ผล', 'chunks': [{'timestamp': (0.1, 2.8), 'text': 'ทดสอบ การ บันทึก เสีย เพื่อ ใช้ ใน การ วัด ผล'}]}
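
The result is a plain dictionary with a `text` field and a list of `chunks`, each carrying a `(start, end)` timestamp tuple. As a minimal sketch, the hypothetical helper `format_chunks` below turns such a dictionary into readable lines. The sample dictionary is made up and only mirrors the layout of the output above; the `None` handling assumes the end time of the last chunk may be missing.

```python
def format_chunks(result):
    """Turn a pipeline result dict into '[start - end] text' lines."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # Assumption: the end time may be None if the final segment is cut off.
        end_label = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s - {end_label}] {chunk['text']}")
    return lines


# Hypothetical result with the same layout as the pipeline output
sample_result = {
    "text": "hello world",
    "chunks": [
        {"timestamp": (0.1, 2.8), "text": "hello"},
        {"timestamp": (2.8, None), "text": "world"},
    ],
}
for line in format_chunks(sample_result):
    print(line)
```

Calling `format_chunks(recorder_result)` on the real output should work the same way, since only the `chunks` list is read.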




In [None]:
# Inference from YouTube
def yt_transcribe(yt_url: str):
    """Transcribe a given YouTube URL using yt-dlp"""

    audio_path = "audio.wav"

    ydl_opts = {
        "format": "bestaudio/best",
        "outtmpl": "audio.%(ext)s",
        "postprocessors": [{
            "key": "FFmpegExtractAudio",
            "preferredcodec": "wav",
            "preferredquality": "192",
        }],
        "quiet": True,
    }

    # Download audio
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([yt_url])

    # Transcribe with Whisper / Thonburian Whisper
    result = pipe(
        audio_path,
        generate_kwargs={
            "language": "<|th|>",
            "task": "transcribe"
        },
        return_timestamps=True,
        batch_size=8
    )

    return result

In [None]:
url = r"https://www.youtube.com/watch?v=VmK7imxgfVw"

transcriptions = yt_transcribe(url)
print(transcriptions)



{'text': 'ส่วน ที่ นิวเดลี นะคะ ปรากฏ ว่า หมด ปี แต่ว่า สิ่ง ที่ ตามมา ก็ คือ มี เที่ยวบิน แล้วก็ รถไฟ ที่ ต้อง ล่าช้า เพราะ เนี่ย ค่ะ มอง น้า ทึบ เลย ปกคลุม นิวเดลี วันที่ สาม สิบเอ็ด ธันวาคม นะคะ ซึ่ง คุณภาพอากาศ ใน นิวเดลี อยู่ ใน ระดับ ที่ รุนแรง ค่ะ เอคิวไอ ใน ช่วง ยี่สิบ สี่ ชั่วโมง ที่ผ่านมา วัด ได้ สูง กว่า สี่ ร้อย สี่ สิบ แปด จาก ห้า ร้อย นะคะ ก็ เลย ทำให้ เที่ยวบิน ขา ออก ขา เข้า เนี่ย ลาด ช้า แล้วก็ ถูก ยกเลิก จาก หมอก หนาแบบนี้ ค่ะ', 'chunks': [{'timestamp': (0.48, 33.98), 'text': 'ส่วน ที่ นิวเดลี นะคะ ปรากฏ ว่า หมด ปี แต่ว่า สิ่ง ที่ ตามมา ก็ คือ มี เที่ยวบิน แล้วก็ รถไฟ ที่ ต้อง ล่าช้า เพราะ เนี่ย ค่ะ มอง น้า ทึบ เลย ปกคลุม นิวเดลี วันที่ สาม สิบเอ็ด ธันวาคม นะคะ ซึ่ง คุณภาพอากาศ ใน นิวเดลี อยู่ ใน ระดับ ที่ รุนแรง ค่ะ เอคิวไอ ใน ช่วง ยี่สิบ สี่ ชั่วโมง ที่ผ่านมา วัด ได้ สูง กว่า สี่ ร้อย สี่ สิบ แปด จาก ห้า ร้อย นะคะ ก็ เลย ทำให้ เที่ยวบิน ขา ออก ขา เข้า เนี่ย ลาด ช้า แล้วก็ ถูก ยกเลิก จาก หมอก หนาแบบนี้ ค่ะ'}]}


## 3) Typhoon ASR model

Typhoon ASR Real-Time is an open-source streaming ASR model for the Thai language that runs efficiently on CPUs, without expensive hardware or cloud dependencies. It is based on NVIDIA's FastConformer Transducer architecture, which is optimized for low-latency, real-time performance.

`nemo_asr` (the ASR collection of NVIDIA NeMo) provides:

- ASR model classes built on PyTorch Lightning, such as `EncDecCTCModel`, `EncDecRNNTModel`, and `EncDecHybridRNNTCTCModel`.
- Data layers and dataset utilities for loading and managing audio data, such as audio files (`.wav`) and manifest files (`.json`, `.manifest`).
- A feature extractor for converting waveforms to spectrograms or Mel filterbank features.
- Training scripts and configs for training ASR models.
- Pretrained models that can be loaded and used immediately, such as QuartzNet, Citrinet, Conformer, FastConformer, and Whisper.
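
As an illustration of the manifest files mentioned above, the sketch below writes and reads a JSON-lines manifest using only the standard library. The keys `audio_filepath`, `duration`, and `text` follow the convention NeMo ASR manifests commonly use; the file paths and transcripts are hypothetical placeholders, and the helpers `write_manifest` and `read_manifest` are defined here rather than taken from NeMo.

```python
import json
import os
import tempfile


def write_manifest(entries, manifest_path):
    """Write one JSON object per line (JSON-lines manifest)."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for entry in entries:
            # ensure_ascii=False keeps Thai text readable in the file
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_manifest(manifest_path):
    """Read a JSON-lines manifest back into a list of dicts."""
    with open(manifest_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical entries: paths and transcripts are placeholders
entries = [
    {"audio_filepath": "clips/sample_001.wav", "duration": 2.7, "text": "สวัสดี"},
    {"audio_filepath": "clips/sample_002.wav", "duration": 4.1, "text": "ขอบคุณ"},
]
manifest_path = os.path.join(tempfile.mkdtemp(), "train_manifest.jsonl")
write_manifest(entries, manifest_path)
print(read_manifest(manifest_path)[0]["audio_filepath"])
```

A manifest is only needed for training or batch evaluation; for the single-file inference in this lab, passing audio paths directly to `transcribe` is enough.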

In [None]:
# Select processing device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

# Load Typhoon ASR Real-Time model
print("Loading Typhoon ASR Real-Time...")
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="scb10x/typhoon-asr-realtime",
    map_location=device
)

Using device: cuda
Loading Typhoon ASR Real-Time...


typhoon-asr-realtime.nemo:   0%|          | 0.00/462M [00:00<?, ?B/s]

[NeMo I 2026-01-07 15:23:23 nemo_logging:393] Tokenizer SentencePieceTokenizer initialized with 2048 tokens


[NeMo W 2026-01-07 15:23:24 nemo_logging:405] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /data/workspace/warit/nemo-asr/stt_th_conformer_transducer_large/prepare_data/typhoon_cleanser/20250814/Split_gg/train_data_typhoon_asr_realtime.jsonl
    sample_rate: 16000
    batch_size: 8
    shuffle: true
    num_workers: 8
    pin_memory: true
    use_start_end_token: false
    trim_silence: false
    max_duration: 30.0
    min_duration: 0.1
    is_tarred: false
    tarred_audio_filepaths: null
    shuffle_n: 2048
    bucketing_strategy: fully_randomized
    bucketing_batch_size: null
    
[NeMo W 2026-01-07 15:23:24 nemo_logging:405] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation d

[NeMo I 2026-01-07 15:23:24 nemo_logging:393] PADDING: 0
[NeMo I 2026-01-07 15:23:26 nemo_logging:393] Using RNNT Loss : warprnnt_numba
    Loss warprnnt_numba_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0}


[NeMo W 2026-01-07 15:23:26 nemo_logging:405] No conditional node support for Cuda.
    Cuda graphs with while loops are disabled, decoding speed will be slower
    Reason: Driver supports cuda toolkit version 12.4, but the driver needs to support at least 12,6. Please update your cuda driver.



[NeMo I 2026-01-07 15:23:26 nemo_logging:393] Model EncDecRNNTBPEModel was successfully restored from /root/.cache/huggingface/hub/models--scb10x--typhoon-asr-realtime/snapshots/a14b79d50c788dbdfe559c8a28a9b90153cf3865/typhoon-asr-realtime.nemo.


In [None]:
def prepare_audio(input_path, output_path=None, target_sr=16000):
    """
    Prepare audio file for Typhoon ASR Real-Time processing
    """
    if not os.path.exists(input_path):
        print(f"❌ File not found: {input_path}")
        return None

    if output_path is None:
        output_path = "processed_audio.wav"

    try:
        print(f"🎵 Processing: {input_path}")

        # Load and resample audio
        y, sr = librosa.load(input_path, sr=None)
        duration = len(y) / sr

        if sr != target_sr:
            y = librosa.resample(y, orig_sr=sr, target_sr=target_sr)
            print(f"   Resampled: {sr} Hz → {target_sr} Hz")

        # Normalize to peak amplitude 1.0; skip silent clips to avoid division by zero
        peak = abs(y).max()
        if peak > 0:
            y = y / peak

        # Save processed audio
        sf.write(output_path, y, target_sr)
        print(f"✅ Saved: {output_path} ({duration:.1f}s)")
        return output_path

    except Exception as e:
        print(f"❌ Error: {e}")
        return None

In [None]:
# Process your audio file
input_file = "recorder.mp3"  # Update this path; the sample output below was produced with /content/test.mp3
processed_file = prepare_audio(input_file)

if processed_file:
    print("🌪️ Running Typhoon ASR Real-Time inference...")

    start_time = time.time()

    # Run transcription
    transcriptions = asr_model.transcribe(audio=[processed_file])

    processing_time = time.time() - start_time

    # Get audio duration for performance calculation
    audio_info = sf.info(processed_file)
    audio_duration = audio_info.duration
    rtf = processing_time / audio_duration

    print(f"⚡ Processing time: {processing_time:.2f}s")
    print(f"🎵 Audio duration: {audio_duration:.2f}s")
    print(f"📊 Real-time factor: {rtf:.2f}x")

    if rtf < 1.0:
        print("🚀 Real-time capable!")
    else:
        print("✅ Batch processing mode")

else:
    print("❌ No processed audio file available")
    transcriptions = []

🎵 Processing: /content/test.mp3


[NeMo W 2026-01-07 15:24:21 nemo_logging:405] The following configuration keys are ignored by Lhotse dataloader: use_start_end_token
[NeMo W 2026-01-07 15:24:21 nemo_logging:405] You are using a non-tarred dataset and requested tokenization during data sampling (pretokenize=True). This will cause the tokenization to happen in the main (GPU) process,possibly impacting the training speed if your tokenizer is very large.If the impact is noticable, set pretokenize=False in dataloader config.(note: that will disable token-per-second filtering and 2D bucketing features)


   Resampled: 44100 Hz → 16000 Hz
✅ Saved: processed_audio.wav (16.0s)
🌪️ Running Typhoon ASR Real-Time inference...


Transcribing: 1it [00:00,  1.03it/s]

⚡ Processing time: 1.02s
🎵 Audio duration: 16.00s
📊 Real-time factor: 0.06x
🚀 Real-time capable!
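
The real-time factor (RTF) printed above is simply processing time divided by audio duration, so values below 1.0 mean the model transcribes faster than the audio plays. A minimal sketch using the figures from this run (1.02 s of processing for 16.00 s of audio); the helper name `real_time_factor` is hypothetical:

```python
def real_time_factor(processing_time_s, audio_duration_s):
    """RTF = processing time / audio duration (lower is faster)."""
    if audio_duration_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_time_s / audio_duration_s


rtf = real_time_factor(1.02, 16.00)
print(f"RTF = {rtf:.2f}, i.e. about {1 / rtf:.1f}x faster than real time")
# RTF = 0.06, i.e. about 15.7x faster than real time
```

Note that the first call usually includes one-off warm-up costs, so a single measurement is only a rough indication of steady-state speed.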





In [None]:
if transcriptions:
    print("=" * 50)
    print("📝 TRANSCRIPTION RESULTS")
    print("=" * 50)

    transcription = transcriptions[0]

    print(f"Text: {transcription.text}")

else:
    print("❌ No transcription results available")

📝 TRANSCRIPTION RESULTS
Text: ขนมหวานที่ดีที่สุดในโลกสองพันยี่สิบห้า ซึ่งเขาโหวตจริงจากคนใช้งานเทสแอดล้านเกือบแสนราย แล้วคัดจนเหลือคะแนนที่เชื้อถือได้ ขนมไทยติดอยู่สองเมนู


In [None]:
# Process your audio file
# 📤 IMPORTANT: First, upload your audio file to Colab. Then, update the path below.
input_file = "/content/test.mp3"
processed_file = prepare_audio(input_file)

if processed_file:
    print("🌪️ Running Typhoon ASR Real-Time inference...")

    start_time = time.time()

    # Run transcription
    transcriptions = asr_model.transcribe(audio=[processed_file], timestamps=True)

    processing_time = time.time() - start_time

    # Get audio duration for performance calculation
    audio_info = sf.info(processed_file)
    audio_duration = audio_info.duration
    rtf = processing_time / audio_duration

    print(f"⚡ Processing time: {processing_time:.2f}s")
    print(f"🎵 Audio duration: {audio_duration:.2f}s")
    print(f"📊 Real-time factor: {rtf:.2f}x")

    if rtf < 1.0:
        print("🚀 Real-time capable!")
    else:
        print("✅ Batch processing mode")

    # by default, timestamps are enabled for char, word and segment level
    print("=" * 50)
    print("📝 TRANSCRIPTION RESULTS WITH WORD TIMESTAMPS")
    print("=" * 50)
    word_timestamps = transcriptions[0].timestamp['word'] # word level timestamps for first sample
    for stamp in word_timestamps:
        print(f"{stamp['start']}s - {stamp['end']}s : {stamp['word']}")

else:
    print("❌ No processed audio file available")
    transcriptions = []

🎵 Processing: /content/test.mp3
   Resampled: 44100 Hz → 16000 Hz
✅ Saved: processed_audio.wav (16.0s)
🌪️ Running Typhoon ASR Real-Time inference...
[NeMo I 2026-01-07 15:24:50 nemo_logging:393] Timestamps requested, setting decoding timestamps to True. Capture them in Hypothesis object,                         with output[0][idx].timestep['word'/'segment'/'char']
Transcribing: 1it [00:00,  8.60it/s]

⚡ Processing time: 0.19s
🎵 Audio duration: 16.00s
📊 Real-time factor: 0.01x
🚀 Real-time capable!
📝 TRANSCRIPTION RESULTS WITH WORD TIMESTAMPS
0.16s - 2.72s : ขนมหวานที่ดีที่สุดในโลกสองพันยี่สิบห้า
3.36s - 7.36s : ซึ่งเขาโหวตจริงจากคนใช้งานเทสแอดล้านเกือบแสนราย
8.4s - 10.8s : แล้วคัดจนเหลือคะแนนที่เชื้อถือได้
11.68s - 14.16s : ขนมไทยติดอยู่สองเมนู
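
One common use of these timestamps is subtitle generation. The sketch below converts entries with the same `start`, `end`, and `word` keys as printed above into SubRip (SRT) text. The helpers `to_srt_time` and `to_srt` are hypothetical names defined here, and the sample list copies two entries from the output above rather than calling the model.

```python
def to_srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(stamps):
    """Build SRT text from dicts with 'start', 'end', and 'word' keys."""
    blocks = []
    for index, stamp in enumerate(stamps, start=1):
        blocks.append(
            f"{index}\n"
            f"{to_srt_time(stamp['start'])} --> {to_srt_time(stamp['end'])}\n"
            f"{stamp['word']}\n"
        )
    # A blank line separates consecutive subtitle blocks
    return "\n".join(blocks)


# Two entries copied from the printed output above
sample_stamps = [
    {"start": 0.16, "end": 2.72, "word": "ขนมหวานที่ดีที่สุดในโลกสองพันยี่สิบห้า"},
    {"start": 11.68, "end": 14.16, "word": "ขนมไทยติดอยู่สองเมนู"},
]
print(to_srt(sample_stamps))
```

With the real output, `to_srt(transcriptions[0].timestamp['word'])` should produce one subtitle block per timestamped span, assuming the entries keep the keys shown above.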



