fix: Whisper word timestamp OOB access on trailing replacement char by guoyangzhen · Pull Request #44902 · huggingface/transformers

guoyangzhen · 2026-03-20T22:08:49Z

Problem

_split_tokens_on_unicode() crashes with IndexError: string index out of range when the decoded token stream ends with a dangling Unicode replacement character (\uFFFD).

The computed index unicode_offset + decoded.index(replacement_char) can equal len(decoded_full), causing an out-of-bounds string access at line ~1334:

decoded_full[unicode_offset + decoded.index(replacement_char)]

This happens when ASR output produces a run of valid tokens followed by a final incomplete Unicode fragment at EOF.

Root Cause

When processing tokens sequentially, unicode_offset tracks how many characters of decoded_full the previously split words have already covered. If the last decoded chunk contains a replacement character and unicode_offset + decoded.index(replacement_char) == len(decoded_full), the computed index is one past the last valid index, len(decoded_full) - 1, so the lookup raises IndexError.
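
One way the offset can run up to the end of the string is when chunks decoded one at a time produce more characters than the same bytes decoded together. The sketch below illustrates only that length mismatch. It uses raw UTF-8 byte strings as stand-in tokens and a hypothetical toy_decode helper, neither of which comes from the Whisper tokenizer, and whether this is the exact mechanism behind the reported crash is an assumption.

```python
def toy_decode(chunks):
    """Hypothetical stand-in for tokenizer.decode: join byte chunks, decode leniently."""
    return b"".join(chunks).decode("utf-8", errors="replace")


# Two stand-in "tokens" that together form a truncated 3-byte UTF-8 sequence.
tokens = [b"\xe4", b"\xbd"]

decoded_full = toy_decode(tokens)               # decoded in one pass
per_chunk = [toy_decode([t]) for t in tokens]   # decoded one token at a time

full_len = len(decoded_full)
summed_len = sum(len(piece) for piece in per_chunk)

# If summed_len exceeds full_len, an offset advanced by per-chunk lengths
# can reach or pass len(decoded_full).
print(full_len, summed_len)
```

In this toy setup the truncated sequence collapses to a single replacement character when decoded in one pass, while each byte decoded alone yields its own replacement character.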

Fix

Add a bounds check before indexing into decoded_full:

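# Position of the first replacement char within decoded_full, or -1 if the chunk has none.
replacement_pos = unicode_offset + decoded.index(replacement_char) if replacement_char in decoded else -1
if (
    replacement_char not in decoded
    or replacement_pos >= len(decoded_full)  # trailing fragment: index would be out of range
    or decoded_full[replacement_pos] == replacement_char
):
    ...  # existing word-splitting branch, unchanged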

When replacement_pos >= len(decoded_full), the word is treated as valid and split out (same as the replacement_char not in decoded case), which safely handles the trailing incomplete Unicode fragment.
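
To see the guard in context, here is a minimal, self-contained sketch of a splitting loop with the bounds check applied. It is a simplified model, not the library's implementation: toy_decode, split_on_unicode, and the guard flag are hypothetical names introduced for illustration, and the tokens are raw UTF-8 byte strings rather than Whisper token ids.

```python
REPLACEMENT_CHAR = "\ufffd"


def toy_decode(chunks):
    """Hypothetical stand-in for tokenizer.decode: join byte chunks, decode leniently."""
    return b"".join(chunks).decode("utf-8", errors="replace")


def split_on_unicode(tokens, guard=True):
    """Group tokens into words wherever the pending chunk decodes cleanly.

    guard=False mimics the unchecked lookup; guard=True adds the bounds check.
    """
    decoded_full = toy_decode(tokens)
    words, word_tokens = [], []
    current = []
    unicode_offset = 0

    for token in tokens:
        current.append(token)
        decoded = toy_decode(current)
        has_replacement = REPLACEMENT_CHAR in decoded
        replacement_pos = (
            unicode_offset + decoded.index(REPLACEMENT_CHAR) if has_replacement else -1
        )
        out_of_range = guard and replacement_pos >= len(decoded_full)
        if (
            not has_replacement
            or out_of_range
            or decoded_full[replacement_pos] == REPLACEMENT_CHAR
        ):
            # Emit the pending chunk as a word and advance past it.
            words.append(decoded)
            word_tokens.append(current)
            current = []
            unicode_offset += len(decoded)

    return words, word_tokens


# "hi" followed by a truncated multi-byte sequence split across two tokens.
tokens = [b"hi", b"\xe4", b"\xbd"]

words, word_tokens = split_on_unicode(tokens, guard=True)

try:
    split_on_unicode(tokens, guard=False)
    unguarded_raised = False
except IndexError:
    unguarded_raised = True
```

With the guard enabled, the trailing fragment is emitted as its own word instead of raising. With it disabled, the same input reaches an index equal to len(decoded_full) and fails.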

Testing

The fix handles the edge case where:

unicode_offset = 298
decoded.index(replacement_char) = 0
len(decoded_full) = 298
target_index = 298 (would cause OOB)

With the bounds check, this correctly falls through to the word-splitting branch.
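
The numbers above can be checked directly with plain strings. In this sketch decoded_full is placeholder text of length 298 (only its length matters, the real transcript is not reproduced here), and the final chunk is assumed to be a lone replacement character.

```python
replacement_char = "\ufffd"

decoded_full = "x" * 298      # placeholder content; only the length is relevant
decoded = replacement_char    # assumed final chunk: a single replacement character
unicode_offset = 298

replacement_pos = (
    unicode_offset + decoded.index(replacement_char)
    if replacement_char in decoded
    else -1
)

# Unchecked lookup: index 298 on a 298-character string.
try:
    decoded_full[replacement_pos]
    unguarded_raises = False
except IndexError:
    unguarded_raises = True

# Guarded condition: the bounds check short-circuits before the lookup.
takes_split_branch = (
    replacement_char not in decoded
    or replacement_pos >= len(decoded_full)
    or decoded_full[replacement_pos] == replacement_char
)
```

The ordering of the conditions matters here: because the bounds check comes before the lookup, Python's short-circuit evaluation never reaches decoded_full[replacement_pos] when the position is out of range.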

Fixes #44869

Fixes IndexError in _split_tokens_on_unicode() when the decoded token stream ends with a dangling Unicode replacement character (U+FFFD). The computed index unicode_offset + decoded.index(replacement_char) can equal len(decoded_full), causing an out-of-bounds access. Fix: add bounds check before indexing into decoded_full. Fixes huggingface#44869

github-actions · 2026-03-20T22:09:53Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: whisper

Rocketknight1 · 2026-03-23T11:59:13Z

No pure code agent PRs please, please check CONTRIBUTING.md!

Rocketknight1 closed this Mar 23, 2026

Rocketknight1 added the Code agent slop label Mar 23, 2026

Rocketknight1 mentioned this pull request Mar 23, 2026

fix: prevent IndexError in Whisper word timestamp decode #44885

Closed

guoyangzhen wants to merge 1 commit into huggingface:main from guoyangzhen:fix/whisper-timestamp-oob
