Releases: mountlord/GenSRT
Release-1.2.01
What's new
Added support for Adalat AI's whisper-medium-ml-rmft, a Malayalam-specific Whisper fine-tune trained with the R-MFT (Reverse Multi-Stage Fine-Tuning) recipe. The model is now first-class supported in GenSRT's chunked inference pipeline alongside SMC's vegam.
Compared to vegam, on the same hardware and chunking plan:
| Measurement | adalat-ai R-MFT | vegam | Direction |
|---|---|---|---|
| Vividh-ASR Broadcast WER | 31.66 | 55.10 | ~43% relative improvement |
| Vividh-ASR Global WER | 39.64 | 53.39 | ~26% relative improvement |
| Runtime on 197s Malayalam news clip (RTX 3060 Ti) | 155s | 296s | ~2× faster |
| Mid-character truncation rate | 2.6% | 10.2% | ~4× cleaner |
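
The "Direction" column can be rechecked from the raw figures in the table. The sketch below is only arithmetic on those published numbers; the helper name `relative_reduction` is ours, not part of GenSRT.

```python
def relative_reduction(new: float, baseline: float) -> float:
    """Fractional reduction of `new` against `baseline` (lower is better)."""
    return (baseline - new) / baseline


# Figures copied from the table above (adalat-ai R-MFT vs vegam).
broadcast = relative_reduction(31.66, 55.10)  # Vividh-ASR Broadcast WER
global_ = relative_reduction(39.64, 53.39)    # Vividh-ASR Global WER
speedup = 296 / 155                           # runtime ratio, vegam / R-MFT
truncation_ratio = 10.2 / 2.6                 # truncation-rate ratio, vegam / R-MFT

print(f"Broadcast WER: {broadcast:.1%} relative reduction")
print(f"Global WER:    {global_:.1%} relative reduction")
print(f"Runtime:       {speedup:.2f}x faster")
print(f"Truncation:    {truncation_ratio:.2f}x lower")
```
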
Native-reader review on broadcast Malayalam content (FIFA World Cup news clip) confirms a noticeable quality improvement over vegam, especially on English code-switching, place names, and entity names.
How to use
In GenSRT's footer Model dropdown:
- Click New...
- Paste: `adalat-ai/ct2-whisper-medium-ml-rmft`
- Click Validate & Add
The model auto-routes to chunked inference and auto-pins source language to Malayalam, same as vegam.
Known limitation
The R-MFT model emits more chunk-tail hallucination fragments (typically 1-second cues containing common verb endings) than vegam. These are imperceptible in playback but visible in the SRT cue list. A UI affordance for surfacing and cleaning them is planned for v1.3.
Acknowledgments
CT2 conversion published by Adalat AI in cooperation with GenSRT — thanks to Kavya Manohar for the conversion and for the technical guidance on Whisper's CT2 decoder length limit (the 224-token text-token clamp that motivated the chunked inference approach in v1.2.0).
Paper: Vividh-ASR: A Robust ASR Benchmark for Indic Languages (arxiv 2605.13087) by Juvekar, Manohar, et al.
Download
gensrt-install-1.2.1.exe — Windows 7z self-extracting installer. Requirements: Windows 10/11, NVIDIA GPU recommended (CPU fallback works).
Release-1.2.00
GenSRT v1.2 — Chunked inference for fine-tuned Whisper models
What's new
Chunked inference for fine-tuned Whisper models. Community fine-tunes like `smcproject/vegam-whisper-medium-ml` were practically unusable on long-form audio in v1.1: they would transcribe the first ~6-8 seconds and silently drop the rest. v1.2 solves this with silent-boundary chunked inference: audio is sliced along naturally-detected pauses, each chunk is transcribed independently, and the per-chunk results are assembled into a single SRT with original timestamps preserved.
For Malayalam users with vegam, this typically produces 2-3× more transcribed content than running the same model without chunking. The chunked path engages automatically when a fine-tuned model is selected — no configuration required.
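The slice, transcribe, and reassemble flow described above can be sketched roughly as follows. This is a minimal illustration, not GenSRT's actual code: the pause list is assumed to come from some silence detector, the 30-second `max_len` is an arbitrary illustrative limit, and `plan_chunks`, `assemble`, `Cue`, and the `transcribe` callable are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


@dataclass
class Cue:
    start: float
    end: float
    text: str


def plan_chunks(duration: float, silences: List[Span], max_len: float = 30.0) -> List[Span]:
    """Greedily cut [0, duration] at silence midpoints, keeping chunks <= max_len
    whenever a pause is available (a stretch with no pause stays overlong)."""
    cuts = sorted(m for m in ((s + e) / 2.0 for s, e in silences) if 0.0 < m < duration)
    chunks: List[Span] = []
    start, candidate = 0.0, None
    for cut in cuts + [duration]:
        # The next pause would make the chunk too long: close it at the last good pause.
        if cut - start > max_len and candidate is not None:
            chunks.append((start, candidate))
            start, candidate = candidate, None
        if cut < duration:
            candidate = cut
    chunks.append((start, duration))
    return chunks


def assemble(chunks: List[Span], transcribe: Callable[[float, float], List[Cue]]) -> List[Cue]:
    """Run the (hypothetical) per-chunk transcriber and shift its chunk-relative
    cue times back onto the original timeline."""
    cues: List[Cue] = []
    for start, end in chunks:
        for cue in transcribe(start, end):
            cues.append(Cue(start + cue.start, min(start + cue.end, end), cue.text))
    return cues


def srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(t * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


# Demo with made-up pauses and a stand-in transcriber that emits one cue per chunk.
silences = [(9, 11), (24, 26), (49, 51), (58, 62), (89, 91)]
chunks = plan_chunks(100.0, silences)
cues = assemble(chunks, lambda s, e: [Cue(0.0, e - s, f"chunk {s:.0f}-{e:.0f}s")])
for i, cue in enumerate(cues, 1):
    print(f"{i}\n{srt_time(cue.start)} --> {srt_time(cue.end)}\n{cue.text}\n")
```

Cutting at the midpoint of each pause keeps words intact on both sides of a boundary, and adding the chunk's start offset to every cue preserves the original timestamps.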
ASR engine factory. Pluggable engine layer under `gensrt/asr/` mirroring the existing translation factory pattern. Two engines ship in v1.2: `Whisper (Multilingual)` for the built-in OpenAI Whisper sizes and `Whisper (Monolingual)` for chunked inference on fine-tunes. Future engines (e.g. IndicConformer) can slot in alongside without changes to the pipeline.
Other improvements
- Auto-Detect now works correctly with monolingual fine-tunes. Models in the known-monolingual registry (currently the vegam variants from SMC and Kurian Benoy's namespace) use their registered training language directly, skipping the per-chunk language detection that produces unreliable results on fine-tuned models. For unknown custom Whisper models, language is detected on the first chunk and reused for the rest.
- Translation batching fixed for Indic scripts. Google Translate calls now budget by URL-encoded byte count rather than Unicode character count. Previously, batches of Malayalam (or Hindi, Tamil, Bengali — any 3-byte UTF-8 script) blew past Google's URL length limit and silently fell back to MyMemory.
- In-player subtitle display works after long-running operations. Several `<track>` element state issues in CEF/WebView2 (pywebview's renderer) were causing subtitles not to display in the player overlay after Generate completed. Replacing the `<track>` element on each refresh sidesteps the issue.
- **Bundled `ffmpeg` and `ffprobe`.** No separate ffmpeg install required on target machines (in fact, an outdated ffmpeg on the system PATH is bypassed in favor of the bundled one).
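
To see why the translation batching item above budgets by URL-encoded bytes, note that one Malayalam code point is 3 bytes in UTF-8 and each byte becomes a 3-character `%XX` escape, so a single character costs 9 URL characters. The sketch below illustrates the idea with the standard library. It is not GenSRT's implementation: the function names and the budget of 200 are placeholders, since the notes do not state the real limit.

```python
from typing import List
from urllib.parse import quote


def encoded_len(text: str) -> int:
    """Length of `text` once percent-encoded for use in a URL."""
    return len(quote(text, safe=""))


def batch_by_encoded_bytes(lines: List[str], budget: int) -> List[List[str]]:
    """Group lines so each batch's percent-encoded size stays within `budget`.
    A single line larger than the budget still gets a batch of its own."""
    separator_cost = encoded_len("\n")  # lines are assumed to be newline-joined
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for line in lines:
        cost = encoded_len(line) + separator_cost
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        batches.append(current)
    return batches


word = "മലയാളം"
print(len(word), encoded_len(word))  # character count vs URL-encoded size

lines = ["മ" * 10] * 5  # five subtitle lines of ten Malayalam characters each
print([len(b) for b in batch_by_encoded_bytes(lines, budget=200)])
```

Counting Unicode characters would treat each of those lines as 10 units, while the URL actually carries 90 characters per line, which is how character-based batches overshoot.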
Acknowledgments
GenSRT's chunked inference path was developed against vegam-whisper-medium-ml from Swathanthra Malayalam Computing (SMC). Kavya Manohar, Leena G Pillai, and Elizabeth Sherly's analysis of Indic-script ASR evaluation pitfalls (arxiv 2409.02449) shaped how we think about quality measurement for these models. AI4Bharat's OIWER benchmark (arxiv 2603.00941) provides the most rigorous Malayalam ASR comparison currently published.
Known limitations
- Vegam occasionally emits a phrase from earlier in the audio at chunk tails — visible as substring overlap with the previous cue's text. A post-processor for this is candidate work for v1.3.
- Whisper's tokenizer can stop generating mid-character on Indic scripts. A `�` at the end of a subtitle line is GenSRT signaling this honestly rather than masking it; the text before the `�` is accurate. See `docs/INVESTIGATIONS.md` for the full story.
- IndicConformer-based ASR was evaluated during v1.2 development and deferred: vegam with chunked inference proved sufficient for the release quality bar. The investigation arc is documented in `docs/INVESTIGATIONS.md` and can be re-opened if quality complaints arise.
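The `�` marker mentioned above is U+FFFD, the Unicode replacement character. A minimal sketch of how it arises, assuming the truncation comes from byte-level tokens ending partway through a multi-byte UTF-8 character (the helper names are ours, and this is not GenSRT's code):

```python
REPLACEMENT = "\ufffd"  # rendered as the replacement-character glyph


def decode_lossy(raw: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for any incomplete or invalid sequence."""
    return raw.decode("utf-8", errors="replace")


def split_truncation(text: str):
    """Return (clean_text, was_truncated) for a decoded subtitle line."""
    clean = text.rstrip(REPLACEMENT)
    return clean, clean != text


full = "മഴ".encode("utf-8")  # two Malayalam characters, three bytes each
truncated = full[:-1]        # generation stopped one byte short of the second character

text = decode_lossy(truncated)
clean, was_truncated = split_truncation(text)
print(repr(text), repr(clean), was_truncated)
```

The first character survives intact, which matches the note that the text before the marker is accurate.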
Getting started
Download `gensrt-install.exe` from the assets below and run it in a folder of your choice. GenSRT will be installed in a `gensrt` subfolder. From there, run `gensrt.exe`. See `user_guide.html` next to the executable for full usage.
Requirements: Windows 10 or 11, a CUDA-capable NVIDIA GPU (~2 GB VRAM is plenty), and a stable internet connection for the first-run model download.
For Malayalam transcription: select `smcproject/vegam-whisper-medium-ml-int8_float16` from the footer Model dropdown. The model auto-downloads on first use (~1.5 GB) and is cached for subsequent runs.
Release-1.1.00
Added the ability to burn subtitles into the video using an ffmpeg filter. A new user guide in HTML ships next to the executable.