Release-1.2.00
GenSRT v1.2 — Chunked inference for fine-tuned Whisper models
What's new
Chunked inference for fine-tuned Whisper models. Community fine-tunes like `smcproject/vegam-whisper-medium-ml` were practically unusable on long-form audio in v1.1: they would transcribe the first ~6-8 seconds and silently drop the rest. v1.2 solves this with silent-boundary chunked inference: audio is sliced along naturally detected pauses, each chunk is transcribed independently, and the per-chunk results are assembled into a single SRT with original timestamps preserved.
For Malayalam users with vegam, this typically produces 2-3× more transcribed content than running the same model without chunking. The chunked path engages automatically when a fine-tuned model is selected — no configuration required.
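The slice-then-reassemble flow described above can be sketched roughly as follows. This is an illustration only, not GenSRT's actual code: `plan_chunks`, `fmt_ts`, `assemble_srt`, and the 25-second `max_len` are hypothetical names and values, and pause detection and the per-chunk model call are assumed to happen elsewhere.

```python
def plan_chunks(pause_points, duration, max_len=25.0):
    """Greedily cut at the latest detected pause within max_len seconds.

    pause_points: pause midpoints in seconds (assumed to come from a
    separate silence detector). Falls back to a hard cut if no pause fits.
    """
    pauses = sorted(pause_points)
    chunks, start = [], 0.0
    while duration - start > max_len:
        candidates = [p for p in pauses if start < p <= start + max_len]
        cut = candidates[-1] if candidates else start + max_len
        chunks.append((start, cut))
        start = cut
    chunks.append((start, duration))
    return chunks


def fmt_ts(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def assemble_srt(chunks, per_chunk_segments):
    """Merge chunk-relative segments into one SRT on the original timeline.

    per_chunk_segments[i] holds (start, end, text) tuples whose times are
    relative to the start of chunks[i]; adding the chunk offset restores
    the position in the full recording.
    """
    lines, index = [], 1
    for (chunk_start, _), segments in zip(chunks, per_chunk_segments):
        for seg_start, seg_end, text in segments:
            lines.append(str(index))
            lines.append(
                f"{fmt_ts(chunk_start + seg_start)} --> {fmt_ts(chunk_start + seg_end)}"
            )
            lines.append(text.strip())
            lines.append("")  # blank line terminates the cue
            index += 1
    return "\n".join(lines)
```

In this sketch a segment at 1.0-2.5 s inside a chunk that starts at 19.0 s ends up as a cue at 00:00:20,000 --> 00:00:21,500, which is the sense in which original timestamps are preserved.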
ASR engine factory. Pluggable engine layer under `gensrt/asr/` mirroring the existing translation factory pattern. Two engines ship in v1.2: `Whisper (Multilingual)` for the built-in OpenAI Whisper sizes and `Whisper (Monolingual)` for chunked inference on fine-tunes. Future engines (e.g. IndicConformer) can slot in alongside without changes to the pipeline.
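For readers unfamiliar with the pattern, a registry-style factory might look something like the sketch below. Only the two engine display names come from these notes; the class names, `register_engine`, `create_engine`, and `available_engines` are assumptions made for illustration and are not the actual contents of `gensrt/asr/`.

```python
from typing import Callable, Dict, List, Optional, Protocol, Tuple

Segment = Tuple[float, float, str]  # (start seconds, end seconds, text)


class ASREngine(Protocol):
    """Interface the pipeline is assumed to depend on (hypothetical)."""

    def transcribe(self, audio_path: str, language: Optional[str] = None) -> List[Segment]:
        ...


_ENGINES: Dict[str, Callable[[], ASREngine]] = {}


def register_engine(name: str):
    """Class decorator that records an engine under its display name."""
    def decorator(factory):
        _ENGINES[name] = factory
        return factory
    return decorator


def available_engines() -> List[str]:
    return list(_ENGINES)


def create_engine(name: str) -> ASREngine:
    factory = _ENGINES.get(name)
    if factory is None:
        raise ValueError(f"Unknown ASR engine {name!r}; available: {sorted(_ENGINES)}")
    return factory()


@register_engine("Whisper (Multilingual)")
class MultilingualWhisperEngine:
    def transcribe(self, audio_path, language=None):
        raise NotImplementedError("placeholder: single-pass Whisper inference")


@register_engine("Whisper (Monolingual)")
class MonolingualWhisperEngine:
    def transcribe(self, audio_path, language=None):
        raise NotImplementedError("placeholder: chunked inference for fine-tunes")
```

With this shape, adding another engine means defining one more decorated class; callers keep using `create_engine(name)` and the pipeline code does not change.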
Other improvements
- Auto-Detect now works correctly with monolingual fine-tunes. Models in the known-monolingual registry (currently the vegam variants from SMC and Kurian Benoy's namespace) use their registered training language directly, skipping the per-chunk language detection that produces unreliable results on fine-tuned models. For unknown custom Whisper models, language is detected on the first chunk and reused for the rest.
- Translation batching fixed for Indic scripts. Google Translate calls now budget by URL-encoded byte count rather than Unicode character count. Previously, batches of Malayalam (or Hindi, Tamil, Bengali — any 3-byte UTF-8 script) blew past Google's URL length limit and silently fell back to MyMemory.
- In-player subtitle display works after long-running operations. Several `<track>` element state issues in CEF/WebView2 (pywebview's renderer) were causing subtitles not to display in the player overlay after Generate completed. Replacing the `<track>` element on each refresh sidesteps the issue.
- **Bundled `ffmpeg` and `ffprobe`.** No separate ffmpeg install required on target machines (in fact, an outdated ffmpeg on the system PATH is bypassed in favor of the bundled one).
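To make the translation batching fix concrete: a Malayalam character is one Unicode code point but three UTF-8 bytes, and each byte becomes a three-character `%XX` escape once URL-encoded. The sketch below budgets batches by that encoded length. It is a minimal illustration using the standard library's `urllib.parse.quote`; the function names and the 1800-character `budget` are placeholders, not GenSRT's code or Google's actual limit.

```python
from urllib.parse import quote


def encoded_len(text):
    """Length of text after percent-encoding, as it would appear in a URL."""
    return len(quote(text, safe=""))


def batch_by_encoded_size(lines, budget=1800, sep="\n"):
    """Group lines so each batch's URL-encoded size stays within budget.

    A single line larger than the budget still gets its own batch;
    splitting inside a line is out of scope for this sketch.
    """
    batches, current, used = [], [], 0
    sep_cost = encoded_len(sep)
    for line in lines:
        cost = encoded_len(line)
        extra = cost + (sep_cost if current else 0)
        if current and used + extra > budget:
            batches.append(current)
            current, used = [], 0
            extra = cost
        current.append(line)
        used += extra
    if current:
        batches.append(current)
    return batches
```

Here `encoded_len("മ")` is 9 while `encoded_len("a")` is 1, so a batch sized by character count can be roughly nine times larger on the wire for Malayalam than for ASCII text.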
Acknowledgments
GenSRT's chunked inference path was developed against vegam-whisper-medium-ml from Swathanthra Malayalam Computing (SMC). Kavya Manohar, Leena G Pillai, and Elizabeth Sherly's analysis of Indic-script ASR evaluation pitfalls (arxiv 2409.02449) shaped how we think about quality measurement for these models. AI4Bharat's OIWER benchmark (arxiv 2603.00941) provides the most rigorous Malayalam ASR comparison currently published.
Known limitations
- Vegam occasionally emits a phrase from earlier in the audio at chunk tails — visible as substring overlap with the previous cue's text. A post-processor for this is candidate work for v1.3.
- Whisper's tokenizer can stop generating mid-character on Indic scripts. A `�` at the end of a subtitle line is GenSRT signaling this honestly rather than masking it; the text before the `�` is accurate. See `docs/INVESTIGATIONS.md` for the full story.
- IndicConformer-based ASR was evaluated during v1.2 development and deferred: vegam with chunked inference proved sufficient for the release quality bar. The investigation arc is documented in `docs/INVESTIGATIONS.md` and can be re-opened if quality complaints arise.
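The trailing `�` in the tokenizer limitation above follows from how byte-level output decodes. Whisper's vocabulary is byte-level BPE, so generation can end partway through a multi-byte character; decoding those bytes with replacement yields U+FFFD at the tail while the earlier text is unaffected. The snippet below reproduces that effect with plain Python string handling. It is not GenSRT's decoding path, and `split_trailing_replacement` is a hypothetical helper.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, rendered as �


def decode_bytes(raw):
    """Decode UTF-8, substituting U+FFFD for incomplete or invalid bytes."""
    return raw.decode("utf-8", errors="replace")


def split_trailing_replacement(text):
    """Return (clean_text, was_truncated) for a decoded subtitle line."""
    clean = text.rstrip(REPLACEMENT)
    return clean, len(clean) != len(text)


word = "മലയാളം"                 # every Malayalam code point is 3 bytes in UTF-8
raw = word.encode("utf-8")
cut = raw[:-1]                   # simulate output that stops mid-character
decoded = decode_bytes(cut)
clean, truncated = split_trailing_replacement(decoded)
```

After running this, `clean` equals the original word minus its final character and `truncated` is `True`, matching the note that the text before the marker can be trusted.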
Getting started
Download `gensrt-install.exe` from the assets below and run it in a folder of your choice. GenSRT will be installed in a `gensrt` subfolder. From there, run `gensrt.exe`. See `user_guide.html` next to the executable for full usage.
Requirements: Windows 10 or 11, a CUDA-capable NVIDIA GPU (~2 GB VRAM is plenty), and a stable internet connection for the first-run model download.
For Malayalam transcription: select `smcproject/vegam-whisper-medium-ml-int8_float16` from the footer Model dropdown. The model auto-downloads on first use (~1.5 GB) and is cached for subsequent runs.