High-quality Turkish Dizi transcription – issues with segmentation, alignment, and diarization #2762
abnatan39-dot asked this question in Q&A
I’m working on building a pipeline to transcribe Turkish TV series (Dizi) and generate high-quality, well-synchronized Turkish SRT subtitles.
My goal is to reach a very high level of accuracy in terms of:
Timing (precise sync with speech)
Sentence segmentation (natural subtitle breaks)
Speaker separation
Current pipeline:
Extract audio from video using FFmpeg
Separate audio into vocals and background (non-vocals)
Run WhisperX transcription on the vocals track
Perform alignment
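For readers who want to reproduce the first step, the audio extraction could look roughly like the sketch below. The function names, the 44.1 kHz stereo defaults, and the file names are assumptions for illustration, not the exact command used in this pipeline. The defaults keep the original channel layout so that a source-separation tool still has full-quality input; WhisperX resamples to 16 kHz mono on its own when it loads audio.

```python
import subprocess


def build_extract_command(video_path, wav_path, sample_rate=44100, channels=2):
    """Return the FFmpeg argument list that extracts an uncompressed WAV track.

    sample_rate and channels are illustrative defaults (assumptions).
    """
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(video_path),    # input video
        "-vn",                    # drop the video stream
        "-ac", str(channels),     # number of audio channels
        "-ar", str(sample_rate),  # audio sample rate in Hz
        "-c:a", "pcm_s16le",      # 16-bit PCM, i.e. a plain WAV payload
        str(wav_path),
    ]


def extract_audio(video_path, wav_path):
    """Run FFmpeg and raise CalledProcessError if it fails."""
    subprocess.run(build_extract_command(video_path, wav_path), check=True)
```

Calling `extract_audio("ep01.mkv", "ep01.wav")` (hypothetical file names) requires `ffmpeg` on the PATH; `build_extract_command` can be inspected without it.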
Transcription model:
WhisperX with large-v3
Alignment models tested:
ozcangundes/wav2vec2-large-xlsr-53-turkish
mpoyraz/wav2vec2-xls-r-300m-cv7-turkish
Cosmobillian/turkish_whisper_for_noisy_datas_v1
I also tested with and without diarization.
Issues I’m encountering:
1. Poor silence-based segmentation: WhisperX does not split segments properly on pauses. Long chunks of speech remain merged even when there are clear silences.
2. Multiple speakers in the same subtitle line: Different speakers are often grouped into a single subtitle line instead of being separated.
3. Broken subtitle flow between speakers: In some cases, one speaker starts within another speaker’s subtitle line, causing the text to break unnaturally into the next subtitle.
4. Diarization not effective enough: Enabling diarization does not significantly improve speaker separation.
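One possible workaround for the first three issues is to ignore the segment boundaries and rebuild subtitle blocks from word-level timestamps. The sketch below is a minimal illustration, not a WhisperX feature: it assumes each word is a dict with `word`, `start`, `end`, and optionally `speaker` keys (the shape of aligned word output once speakers are assigned), and the function name `regroup_words` and the thresholds `max_gap` and `max_chars` are hypothetical values that would need tuning.

```python
def regroup_words(words, max_gap=0.6, max_chars=84):
    """Group word dicts into subtitle blocks.

    A new block starts when the speaker changes, when the silence before
    a word exceeds max_gap seconds, or when the block would grow beyond
    max_chars characters. Words without timestamps stay in the current block.
    """
    blocks = []
    current = None
    for w in words:
        text = w["word"].strip()
        if not text:
            continue
        start, end, speaker = w.get("start"), w.get("end"), w.get("speaker")

        new_block = current is None
        if current is not None:
            speaker_changed = (
                speaker is not None
                and current["speaker"] is not None
                and speaker != current["speaker"]
            )
            long_pause = start is not None and start - current["end"] > max_gap
            too_long = len(current["text"]) + 1 + len(text) > max_chars
            new_block = speaker_changed or long_pause or too_long

        if new_block:
            if start is None:
                # Untimed word opening a block: reuse the previous end time.
                start = current["end"] if current is not None else 0.0
            current = {
                "start": start,
                "end": end if end is not None else start,
                "speaker": speaker,
                "text": text,
            }
            blocks.append(current)
        else:
            current["text"] += " " + text
            if end is not None:
                current["end"] = max(current["end"], end)
            if current["speaker"] is None:
                current["speaker"] = speaker
    return blocks
```

With this approach the quality of speaker separation still depends on how accurate the per-word speaker labels are; the regrouping only guarantees that a block never mixes two different labels.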
Goal:
Produce clean, professional-grade Turkish SRT subtitles with:
Accurate timing and alignment
Natural sentence splitting
Clear speaker separation (ideally one speaker per subtitle block)
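To make the target format concrete, here is a small sketch of how subtitle blocks could be serialized to SRT. It assumes each block is a dict with `start` and `end` in seconds and a `text` string; the helper names `format_timestamp` and `to_srt` are hypothetical and not part of WhisperX.

```python
def format_timestamp(seconds):
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, total_ms = divmod(total_ms, 3_600_000)
    minutes, total_ms = divmod(total_ms, 60_000)
    secs, millis = divmod(total_ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(blocks):
    """Serialize blocks as numbered SRT cues, one speaker turn per cue."""
    lines = []
    for index, block in enumerate(blocks, start=1):
        lines.append(str(index))
        lines.append(
            f"{format_timestamp(block['start'])} --> {format_timestamp(block['end'])}"
        )
        lines.append(block["text"])
        lines.append("")  # blank line terminates the cue
    return "\n".join(lines)
```

The result should be written as UTF-8 so that Turkish characters such as ş, ğ, and ı survive in the subtitle file.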
Questions:
Are there recommended configurations or preprocessing steps specifically for Turkish (Dizi-style content)?
Which alignment model gives the best results for Turkish in WhisperX?
How can I improve silence-based segmentation?
Are there best practices for combining WhisperX with diarization to achieve reliable speaker separation?
Thanks in advance for your help 🙏