Release-1.2.00

@mountlord mountlord released this 15 Jun 01:46
· 5 commits to main since this release

GenSRT v1.2 — Chunked inference for fine-tuned Whisper models

What's new

Chunked inference for fine-tuned Whisper models. Community fine-tunes like smcproject/vegam-whisper-medium-ml were practically unusable on long-form audio in v1.1: they would transcribe the first ~6-8 seconds and silently drop the rest. v1.2 solves this with silent-boundary chunked inference: audio is sliced along naturally detected pauses, each chunk is transcribed independently, and the per-chunk results are assembled into a single SRT with the original timestamps preserved.

For Malayalam users with vegam, this typically produces 2-3× more transcribed content than running the same model without chunking. The chunked path engages automatically when a fine-tuned model is selected — no configuration required.
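
The slice-and-reassemble flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not GenSRT's actual code: it assumes silence intervals have already been detected, and that a hypothetical transcribe(start, end) callable returns (relative_start, relative_end, text) tuples for one chunk. The names plan_chunks, assemble, to_srt and the 25-second max_len default are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds from the start of the original audio
    end: float
    text: str


def plan_chunks(duration, silences, max_len=25.0):
    """Cut [0, duration] at silence midpoints, greedily keeping each chunk
    at or under max_len whenever a pause is available to cut on."""
    cuts = sorted((s + e) / 2 for s, e in silences)
    cuts = [c for c in cuts if 0.0 < c < duration] + [duration]
    chunks, start, best = [], 0.0, None
    for cut in cuts:
        # Extending to this cut would overshoot: close at the last good cut.
        if best is not None and cut - start > max_len:
            chunks.append((start, best))
            start, best = best, None
        best = cut
    chunks.append((start, duration))
    return chunks


def assemble(chunks, transcribe):
    """Transcribe each chunk independently and shift its chunk-relative
    timestamps back onto the original timeline."""
    cues = []
    for chunk_start, chunk_end in chunks:
        for rel_start, rel_end, text in transcribe(chunk_start, chunk_end):
            end = min(chunk_start + rel_end, chunk_end)
            cues.append(Cue(chunk_start + rel_start, end, text))
    return cues


def srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"


def to_srt(cues):
    """Render cues as numbered SRT blocks."""
    return "\n".join(
        f"{i}\n{srt_time(c.start)} --> {srt_time(c.end)}\n{c.text}\n"
        for i, c in enumerate(cues, 1)
    )
```

If no pause falls inside a max_len window, the sketch lets that chunk run long rather than cutting mid-speech; a real implementation might choose differently.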

ASR engine factory. Pluggable engine layer under gensrt/asr/ mirroring the existing translation factory pattern. Two engines ship in v1.2: Whisper (Multilingual) for the built-in OpenAI Whisper sizes and Whisper (Monolingual) for chunked inference on fine-tunes. Future engines (e.g. IndicConformer) can slot in alongside without changes to the pipeline.

Other improvements

  • Auto-Detect now works correctly with monolingual fine-tunes. Models in the known-monolingual registry (currently the vegam variants from SMC and Kurian Benoy's namespace) use their registered training language directly, skipping the per-chunk language detection that produces unreliable results on fine-tuned models. For unknown custom Whisper models, language is detected on the first chunk and reused for the rest.
  • Translation batching fixed for Indic scripts. Google Translate calls now budget by URL-encoded byte count rather than Unicode character count. Previously, batches of Malayalam (or Hindi, Tamil, Bengali — any 3-byte UTF-8 script) blew past Google's URL length limit and silently fell back to MyMemory.
  • In-player subtitle display works after long-running operations. Several <track> element state issues in CEF/WebView2 (pywebview's renderer) were causing subtitles not to display in the player overlay after Generate completed. Replacing the <track> element on each refresh sidesteps the issue.
  • Bundled ffmpeg and ffprobe. No separate ffmpeg install required on target machines (in fact, an outdated ffmpeg on the system PATH is bypassed in favor of the bundled one).
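
To make the translation batching fix concrete: a 3-byte UTF-8 character becomes nine characters once percent-encoded (for example %E0%B4%AE), so a batch that looks small by character count can be many times larger in the URL. The sketch below budgets by encoded length instead. It is not GenSRT's code; the function names, the newline separator, and the budget parameter are assumptions, and the real URL limit is deliberately left to the caller.

```python
from urllib.parse import quote


def encoded_len(text):
    """Length of text after percent-encoding, as it would appear in a URL."""
    return len(quote(text, safe=""))


def batch_by_encoded_bytes(lines, budget, sep="\n"):
    """Group lines so each batch, joined with sep and URL-encoded,
    stays within budget characters."""
    sep_cost = encoded_len(sep)
    batches, current, used = [], [], 0
    for line in lines:
        cost = encoded_len(line)
        extra = cost + (sep_cost if current else 0)
        if current and used + extra > budget:
            # Adding this line would overflow: start a new batch.
            batches.append(current)
            current, used = [], 0
            extra = cost
        current.append(line)
        used += extra
    if current:
        batches.append(current)
    return batches
```

A single line that exceeds the budget on its own still gets its own batch here; splitting within a line is left out of the sketch.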

Acknowledgments

GenSRT's chunked inference path was developed against vegam-whisper-medium-ml from Swathanthra Malayalam Computing (SMC). Kavya Manohar, Leena G Pillai, and Elizabeth Sherly's analysis of Indic-script ASR evaluation pitfalls (arxiv 2409.02449) shaped how we think about quality measurement for these models. AI4Bharat's OIWER benchmark (arxiv 2603.00941) provides the most rigorous Malayalam ASR comparison currently published.

Known limitations

  • Vegam occasionally emits a phrase from earlier in the audio at chunk tails — visible as substring overlap with the previous cue's text. A post-processor for this is candidate work for v1.3.
  • Whisper's tokenizer can stop generating mid-character on Indic scripts. A marker character at the end of a subtitle line is GenSRT signaling this honestly rather than masking it; the text before the marker is accurate. See docs/INVESTIGATIONS.md for the full story.
  • IndicConformer-based ASR was evaluated during v1.2 development and deferred, since vegam with chunked inference proved sufficient for the release quality bar. The investigation arc is documented in docs/INVESTIGATIONS.md and can be re-opened if quality complaints arise.
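
For readers who want to spot the chunk-tail repetition described in the first limitation, one possible check is sketched below. It is purely illustrative and is not the v1.3 post-processor, which does not exist yet; the function name repeated_tail and the three-word minimum are assumptions. It compares whitespace-separated words, which suits Malayalam but would need adapting for scripts that do not delimit words with spaces.

```python
def repeated_tail(prev_text, text, min_words=3):
    """Return how many trailing words of text also occur as a contiguous
    run in prev_text (at least min_words), or 0 if none do."""
    prev, cur = prev_text.split(), text.split()
    # Try the longest possible tail first so the largest overlap wins.
    for n in range(min(len(cur), len(prev)), min_words - 1, -1):
        tail = cur[-n:]
        if any(prev[i:i + n] == tail for i in range(len(prev) - n + 1)):
            return n
    return 0
```

A caller could flag a cue for review, or trim the last n words, when the return value is non-zero.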

Getting started

Download gensrt-install.exe from the assets below and run it in a folder of your choice. GenSRT is installed into a gensrt subfolder; from there, run gensrt.exe. See user_guide.html next to the executable for full usage.

Requirements: Windows 10 or 11, a CUDA-capable NVIDIA GPU (~2 GB VRAM is plenty), and a stable internet connection for the first-run model download.

For Malayalam transcription: select smcproject/vegam-whisper-medium-ml-int8_float16 from the footer Model dropdown. The model auto-downloads on first use (~1.5 GB) and is cached for subsequent runs.