
fix: prevent IndexError in Whisper word timestamp decode #44885

Closed
guoyangzhen wants to merge 1 commit into huggingface:main from guoyangzhen:fix/whisper-timestamp-indexerror

@guoyangzhen

Problem

In _split_tokens_on_unicode(), when the decoded token stream ends with a dangling Unicode replacement character (U+FFFD), the computed index can equal len(decoded_full), causing IndexError: string index out of range.

The failing line:

decoded_full[unicode_offset + decoded.index(replacement_char)] == replacement_char

When unicode_offset + decoded.index(replacement_char) >= len(decoded_full), this crashes.
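The out-of-range access can be reproduced with plain strings, without a tokenizer. This is only a sketch: the values of `decoded_full`, `decoded`, and `unicode_offset` below are made up for illustration and are not taken from a real Whisper decode.

```python
replacement_char = "\ufffd"

# Hypothetical values: the full decode has length 2, and the running
# offset already points one past its last valid position.
decoded_full = "hi"
decoded = "\ufffd"
unicode_offset = 2

# Same index expression as the failing line: 2 + 0 = 2 == len(decoded_full).
index = unicode_offset + decoded.index(replacement_char)

try:
    decoded_full[index] == replacement_char
    raised = False
    message = ""
except IndexError as exc:
    raised = True
    message = str(exc)

print(raised, message)
```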

Fix

Add a bounds check before indexing into decoded_full:

if (
    replacement_char not in decoded
    or (unicode_offset + decoded.index(replacement_char) < len(decoded_full)
        and decoded_full[unicode_offset + decoded.index(replacement_char)] == replacement_char)
):

This ensures we only access decoded_full when the index is valid.
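The guarded condition can be sketched as a standalone predicate to show how it behaves on a few inputs. The function name `is_complete_word` and the sample strings are hypothetical and are not part of the transformers code; the sketch assumes only the three values used in the condition above.

```python
REPLACEMENT_CHAR = "\ufffd"


def is_complete_word(decoded: str, decoded_full: str, unicode_offset: int) -> bool:
    """Hypothetical helper mirroring the bounds-checked condition."""
    if REPLACEMENT_CHAR not in decoded:
        # No replacement character: the chunk decoded cleanly.
        return True
    # Compute the index once instead of calling decoded.index() twice.
    index = unicode_offset + decoded.index(REPLACEMENT_CHAR)
    # Short-circuit: decoded_full is only indexed when the index is in range.
    return index < len(decoded_full) and decoded_full[index] == REPLACEMENT_CHAR


# Clean chunk, no replacement character.
print(is_complete_word("hello", "hello", 0))
# Index equals len(decoded_full): returns False instead of raising IndexError.
print(is_complete_word("\ufffd", "hi", 2))
# The replacement character is really present in the full decode at that position.
print(is_complete_word("a\ufffd", "a\ufffdb", 0))
```

With this shape, a dangling replacement character at the end of the stream is treated as an incomplete chunk, so the loop keeps accumulating tokens rather than crashing.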

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: whisper

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44885&sha=a326a1

@Rocketknight1
Member

Your bot has created a duplicate PR at #44902 for the same issue. As we mentioned in CONTRIBUTING, we're being flooded with code agent PRs right now, so we're temporarily blocking you to cut down on the notification spam.


2 participants