
fix: Whisper word timestamp OOB access on trailing replacement char #44902

Closed
guoyangzhen wants to merge 1 commit into huggingface:main from guoyangzhen:fix/whisper-timestamp-oob

@guoyangzhen

Problem

_split_tokens_on_unicode() crashes with IndexError: string index out of range when the decoded token stream ends with a dangling Unicode replacement character (\uFFFD).

The computed index unicode_offset + decoded.index(replacement_char) can equal len(decoded_full), causing an out-of-bounds string access at line ~1334:

decoded_full[unicode_offset + decoded.index(replacement_char)]

This happens when ASR output produces a run of valid tokens followed by a final incomplete Unicode fragment at EOF.

Root Cause

When processing tokens sequentially, unicode_offset tracks the current position in decoded_full. If the last decoded chunk ends with a replacement character and unicode_offset + decoded.index(replacement_char) == len(decoded_full), the computed index equals the string's length, which is one past the last valid position (len(decoded_full) - 1).
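
The failure mode can be reproduced with plain strings, independent of the tokenizer. The sketch below uses made-up values (a five-character decoded_full and a one-character decoded), not actual Whisper output; only the variable names come from the description above, and unguarded_lookup is a hypothetical helper added for illustration.

```python
# Illustrative values only; not taken from a real Whisper decode.
replacement_char = "\ufffd"
decoded_full = "hello"        # full decoded text, length 5
decoded = replacement_char    # current chunk is a lone replacement character
unicode_offset = 5            # offset already equals len(decoded_full)

target_index = unicode_offset + decoded.index(replacement_char)


def unguarded_lookup(text: str, index: int) -> str:
    """Hypothetical helper: index into text with no bounds check."""
    return text[index]


try:
    unguarded_lookup(decoded_full, target_index)
    raised = False
except IndexError:
    # target_index == len(decoded_full), so the lookup is out of range.
    raised = True

print(target_index, len(decoded_full), raised)
```

Because Python string indices run from 0 to len(text) - 1, any computed position equal to the length fails in this way.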

Fix

Add a bounds check before indexing into decoded_full:

replacement_pos = unicode_offset + decoded.index(replacement_char) if replacement_char in decoded else -1
if (
    replacement_char not in decoded
    or replacement_pos >= len(decoded_full)
    or decoded_full[replacement_pos] == replacement_char
):

When replacement_pos >= len(decoded_full), the word is treated as valid and split out (same as the replacement_char not in decoded case), which safely handles the trailing incomplete Unicode fragment.
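
The guarded condition can also be read as a standalone predicate. The following is a minimal sketch rather than the code in this PR: the function name is_split_point and its signature are hypothetical, and it only mirrors the three-way condition shown above.

```python
REPLACEMENT_CHAR = "\ufffd"


def is_split_point(
    decoded: str,
    decoded_full: str,
    unicode_offset: int,
    replacement_char: str = REPLACEMENT_CHAR,
) -> bool:
    """Hypothetical helper mirroring the bounds-checked condition.

    Returns True when the current chunk should be split out as a word.
    """
    if replacement_char not in decoded:
        return True  # chunk decoded cleanly
    replacement_pos = unicode_offset + decoded.index(replacement_char)
    if replacement_pos >= len(decoded_full):
        return True  # position is past the end: do not index, treat as valid
    # In range: split only if the full text really has U+FFFD at that position.
    return decoded_full[replacement_pos] == replacement_char
```

For example, with unicode_offset = 298 and a 298-character decoded_full, is_split_point("\ufffd", "x" * 298, 298) returns True instead of raising IndexError.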

Testing

The fix handles the edge case where:

  • unicode_offset = 298
  • decoded.index(replacement_char) = 0
  • len(decoded_full) = 298
  • target_index = 298 (would cause OOB)

With the bounds check, this correctly falls through to the word-splitting branch.

Fixes #44869

Fixes IndexError in _split_tokens_on_unicode() when the decoded token
stream ends with a dangling Unicode replacement character (U+FFFD).

The computed index unicode_offset + decoded.index(replacement_char)
can equal len(decoded_full), causing an out-of-bounds access.

Fix: add bounds check before indexing into decoded_full.

Fixes huggingface#44869
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: whisper

@Rocketknight1
Member

No pure code agent PRs please, please check CONTRIBUTING.md!

Development

Successfully merging this pull request may close these issues.

Whisper word timestamp decode crashes on trailing replacement character at end of decoded token stream
