fix(qwen-asr): enable timestamp output when forced_aligner is configured #10013

Merged: mudler merged 7 commits into mudler:master from fqscfqj:fix/qwen-asr-timestamps on May 26, 2026
fqscfqj (Contributor) commented May 26, 2026

Problem

The qwen-asr backend loads the forced_aligner model correctly but never actually produces timestamps. All segments return start=0, end=0.

Two bugs cause this:

Bug 1: return_time_stamps not passed to transcribe()

Qwen3ASRModel.transcribe() defaults return_time_stamps=False. The backend never passes True, so the forced aligner is loaded but silently skipped during inference.

Bug 2: Timestamp item format mismatch

The parsing code checks isinstance(ts, (list, tuple)), but qwen_asr returns ForcedAlignItem dataclass instances with .text, .start_time, .end_time attributes — not tuples. The check always fails, so timestamps are zeroed out even if Bug 1 were fixed.

Fix

  1. Pass return_time_stamps=True to transcribe() when a forced_aligner is loaded.
  2. Add hasattr() check for ForcedAlignItem dataclass before falling back to tuple parsing.
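
The parsing order in fix 2 can be sketched as below. This is a minimal illustration, not the backend's actual code: `ForcedAlignItem` here is a local stand-in with the three attributes named above, `parse_timestamp` is a hypothetical helper, and the `(start, end, text)` tuple layout is an assumption, since the PR does not spell out the legacy tuple order.

```python
from dataclasses import dataclass


@dataclass
class ForcedAlignItem:
    # Local stand-in for the qwen_asr class; field names follow the PR description.
    text: str
    start_time: float
    end_time: float


def parse_timestamp(ts):
    """Return (text, start_seconds, end_seconds) for one aligned item."""
    # Attribute-style items are checked first, because a dataclass instance
    # is neither a list nor a tuple and would fail the isinstance() test.
    if hasattr(ts, "start_time") and hasattr(ts, "end_time") and hasattr(ts, "text"):
        start = float(ts.start_time) if ts.start_time is not None else 0.0
        end = float(ts.end_time) if ts.end_time is not None else 0.0
        return str(ts.text or ""), start, end
    # Legacy fallback. ASSUMPTION: tuples are laid out as (start, end, text).
    if isinstance(ts, (list, tuple)) and len(ts) >= 3:
        start, end, text = ts[0], ts[1], ts[2]
        return str(text or ""), float(start or 0.0), float(end or 0.0)
    # Unknown shape: keep the old behaviour of zeroed timestamps.
    return "", 0.0, 0.0
```

Checking attributes rather than importing the class keeps the backend independent of where qwen_asr defines `ForcedAlignItem`.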

Testing

Verified against qwen3-asr-0.6b with Qwen/Qwen3-ForcedAligner-0.6B — timestamps now return correctly in verbose_json, srt, and vtt formats.

Copilot AI review requested due to automatic review settings May 26, 2026 08:33
Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds support for returning and parsing word/segment timestamps when the Qwen ASR forced aligner is available.

Changes:

  • Detects presence of a forced aligner and requests timestamps from model.transcribe.
  • Adds parsing for forced-aligner timestamp objects (start_time, end_time, text) in addition to tuple/list timestamps.

Comment thread backend/python/qwen-asr/backend.py Outdated
Comment on lines +155 to +158
results = self.model.transcribe(
    audio=audio_path, language=language, context=context,
    return_time_stamps=has_aligner,
)
Comment thread backend/python/qwen-asr/backend.py Outdated
Comment on lines +171 to +176
if hasattr(ts, 'start_time') and hasattr(ts, 'end_time') and hasattr(ts, 'text'):
    # ForcedAlignItem dataclass (from qwen_asr forced aligner)
    start_ms = int(ts.start_time * 1000) if ts.start_time is not None else 0
    end_ms = int(ts.end_time * 1000) if ts.end_time is not None else 0
    seg_text = ts.text or ""
elif isinstance(ts, (list, tuple)) and len(ts) >= 3:
fqscfqj added 2 commits May 26, 2026 08:51
Two bugs prevented timestamps from working in the qwen-asr backend:

1. transcribe() was called without return_time_stamps=True, so the
   forced aligner was loaded but never invoked. Now we pass
   return_time_stamps=True when a forced_aligner is present.

2. The timestamp parsing code expected (list, tuple) items, but the
   qwen_asr library returns ForcedAlignItem dataclass instances with
   .text, .start_time, .end_time attributes. Added hasattr() check
   to handle this correctly, falling back to tuple parsing for
   backward compatibility.

- Wrap return_time_stamps kwarg in try/except TypeError for safety
- Add defensive float() normalization for timestamp times
- Use str() for text extraction to ensure string type
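
A rough sketch of the try/except TypeError guard from the first bullet, assuming the intent is to tolerate a `transcribe()` that does not accept the keyword. `transcribe_with_optional_timestamps` is a hypothetical helper name; `model` is any object exposing `transcribe(**kwargs)`.

```python
def transcribe_with_optional_timestamps(model, has_aligner, **kwargs):
    """Request timestamps only when an aligner is loaded, with a safe fallback."""
    if not has_aligner:
        return model.transcribe(**kwargs)
    try:
        return model.transcribe(return_time_stamps=True, **kwargs)
    except TypeError:
        # ASSUMPTION: the TypeError means this transcribe() signature has no
        # return_time_stamps parameter, so retry without it.
        return model.transcribe(**kwargs)
```

One caveat of this pattern: a TypeError raised inside `transcribe()` for an unrelated reason also triggers the retry, so the fallback can mask a genuine bug.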
@fqscfqj fqscfqj force-pushed the fix/qwen-asr-timestamps branch from 4f283ba to 346c5d2 Compare May 26, 2026 08:51

The Go server reads TranscriptSegment.start/end via time.Duration,
which is in nanoseconds. Previously the backend sent milliseconds
(* 1000), causing timestamps to be 1,000,000x too small (e.g. 8e-8
instead of 0.08). Convert seconds → nanoseconds (* 1e9) instead.

Also applies to the legacy tuple path for consistency.
fqscfqj (Contributor, Author) commented May 26, 2026

Additional fix: seconds → nanoseconds

While testing this PR against a real deployment, I discovered a third issue beyond the two bugs described above:

Bug 3: Timestamp unit mismatch between Python backend and Go server

The Go server reads TranscriptSegment.start/end (int64) and wraps them in time.Duration:

// core/backend/transcript.go
segments = append(segments, &schema.TranscriptSegment{
    Start: time.Duration(s.Start),
    End:   time.Duration(s.End),
})

Go's time.Duration is in nanoseconds, but the backend was sending milliseconds (* 1000). This caused timestamps to be 1,000,000x too small, e.g. 8e-8 seconds instead of 0.08 seconds.

Fix: Convert seconds → nanoseconds (* 1_000_000_000) instead of seconds → milliseconds (* 1000).
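
To make the arithmetic concrete, here is a small sketch of the conversion on the Python side. The helper names are hypothetical, and `round()` is my own choice to avoid float truncation; it is not necessarily what the backend does.

```python
NANOS_PER_SECOND = 1_000_000_000


def seconds_to_nanos(seconds):
    """Convert seconds (float) to integer nanoseconds, the unit of Go's time.Duration."""
    return int(round(float(seconds) * NANOS_PER_SECOND))


def nanos_to_seconds(nanos):
    """What a nanosecond-based reader recovers from the integer it receives."""
    return nanos / NANOS_PER_SECOND


# Old behaviour: 0.08 s sent as milliseconds, then read back as nanoseconds.
old_value = int(0.08 * 1000)          # 80
# New behaviour: 0.08 s sent as nanoseconds.
new_value = seconds_to_nanos(0.08)    # 80_000_000
```

Read back as nanoseconds, `old_value` gives 8e-8 seconds while `new_value` gives 0.08 seconds, which matches the symptom described above.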

Verified output

After applying all three fixes, verbose_json, srt, and vtt formats all produce correct timestamps:

{"segments": [{"id": 0, "start": 0.08, "end": 0.24, "text": ""}, ...]}

1
00:00:00,080 --> 00:00:00,240
今
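
For reference, the SRT timestamp above can be derived from a nanosecond value as follows. This is an illustrative sketch only (`format_srt_timestamp` is a hypothetical helper, not the server's formatter).

```python
def format_srt_timestamp(nanos):
    """Format integer nanoseconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = nanos // 1_000_000
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


cue = f"{format_srt_timestamp(80_000_000)} --> {format_srt_timestamp(240_000_000)}"
```

With the corrected nanosecond values for 0.08 s and 0.24 s, `cue` reproduces the cue line shown in the verified output.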

Pushed the additional commit to the PR branch.

fqscfqj added 4 commits May 26, 2026 10:05
Read request.timestamp_granularities from the gRPC request.
- 'word': return one segment per aligned item (character / word)
- 'segment' (default): merge consecutive items at sentence boundaries

Sentence boundaries detected via CJK punctuation (。!?;…)
and Latin endings (. ! ? ;). This matches the OpenAI Whisper API
contract where omitting the parameter defaults to segment-level.

Unicode curly quotes (U+2018/2019) were being interpreted as Python
string delimiters, causing SyntaxError. Use explicit unicode escapes.

The forced aligner strips punctuation from its output, so text-based
sentence detection doesn't work. Instead, detect segment boundaries
by measuring time gaps between consecutive aligned items.

Threshold = max(median_gap * 4, 0.3s). This cleanly separates
intra-sentence gaps (< 0.24s) from inter-sentence gaps (> 0.3s)
across Chinese, English, and other languages.
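
A minimal sketch of the gap-based splitting described in this commit message. The threshold formula comes from the message; everything else is an assumption: `split_by_gaps` is a hypothetical name, items are `(text, start, end)` tuples in seconds, and a gap is measured from one item's end to the next item's start.

```python
from statistics import median


def split_by_gaps(items, factor=4.0, min_threshold=0.3):
    """Group time-sorted (text, start, end) items into segments at large time gaps."""
    if not items:
        return []
    # ASSUMPTION: gap = next item's start minus current item's end, floored at 0.
    gaps = [max(0.0, nxt[1] - cur[2]) for cur, nxt in zip(items, items[1:])]
    # Threshold = max(median_gap * 4, 0.3s), as stated in the commit message.
    threshold = max(median(gaps) * factor, min_threshold) if gaps else min_threshold
    segments = [[items[0]]]
    for gap, item in zip(gaps, items[1:]):
        if gap > threshold:
            segments.append([item])    # large pause: start a new segment
        else:
            segments[-1].append(item)  # small pause: same segment
    return segments
```

The 0.3 s floor matters for short clips, where every gap is tiny and a pure median-based threshold would split on ordinary pauses between words.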
The forced aligner strips whitespace from tokenized text, so English
words like ['hello', 'world'] were joined as 'helloworld'. Add
_smart_join() that inserts spaces between non-CJK tokens while
keeping CJK characters and punctuation unspaced. Works for Chinese,
English, Korean, Japanese, and mixed-language text.
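
The joining rule can be sketched roughly as below. This is not the PR's `_smart_join()` but a simplified stand-in (`smart_join_sketch`), and the character ranges are my assumption: Han, kana, CJK punctuation, and full-width forms count as unspaced, while Hangul is left out because the commit message does not say how Korean tokens are treated.

```python
import string


def _is_cjk(ch):
    # ASSUMPTION: these code point ranges are an illustrative choice.
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Extension A
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0x3000 <= cp <= 0x303F   # CJK symbols and punctuation
        or 0xFF00 <= cp <= 0xFFEF   # Full-width forms
    )


def smart_join_sketch(tokens):
    """Join tokens, adding a space only between two adjacent non-CJK tokens."""
    out = ""
    for tok in tokens:
        if not tok:
            continue
        needs_space = (
            out
            and not _is_cjk(out[-1])
            and not _is_cjk(tok[0])
            and tok[0] not in string.punctuation
        )
        if needs_space:
            out += " "
        out += tok
    return out
```

This sketch does not handle opening brackets or quotes specially, so it is only a starting point for mixed-language text.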
@mudler mudler enabled auto-merge (squash) May 26, 2026 20:09
@mudler mudler merged commit 4e5ec6f into mudler:master May 26, 2026
60 checks passed

3 participants