Decoding Tokens added by the user for Whisper models  #803

@aravindMahadevan

Description

Feature request

Support decoding user-defined tokens that are added to the end of the tokenizer's vocabulary for Whisper-based models. This requires modifying the if statement in _decode_asr.

Motivation

The motivation for this proposal is feature parity with tokenizers.decode, which is able to decode user-added tokens.

Your contribution

To support this feature, we only need to change the condition of the if statement in _decode_asr from token >= timestamp_begin to token >= timestamp_begin && token <= timestamp_end, where timestamp_end = this.model.convert_tokens_to_ids(["<|30.00|>"])[0].
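As a rough illustration of the difference between the two conditions, here is a minimal sketch. The helper names and the small ids are made up for the example and are not taken from the library source.

```javascript
// Hypothetical helpers for illustration; not the actual _decode_asr code.

// Current check: any id at or above timestampBegin is treated as a timestamp.
function isTimestampUnbounded(token, timestampBegin) {
  return token >= timestampBegin;
}

// Proposed check: only ids inside [timestampBegin, timestampEnd] are timestamps.
function isTimestampBounded(token, timestampBegin, timestampEnd) {
  return token >= timestampBegin && token <= timestampEnd;
}

// Made-up ids: timestamps occupy 100..110, a user-added token sits at 111.
const timestampBegin = 100;
const timestampEnd = 110;
const addedToken = 111;

console.log(isTimestampUnbounded(addedToken, timestampBegin));             // true
console.log(isTimestampBounded(addedToken, timestampBegin, timestampEnd)); // false
```

With the unbounded check the added token is mistaken for a timestamp; with the bounded check it falls through to the regular-token branch.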

Why this should work:

When a user adds a new token to the tokenizer, it gets placed at the end of the tokenizer's vocabulary. The last 1501 vocab tokens in whisper-tiny, whisper-tiny.en, whisper-small.en, whisper-small, whisper-base, whisper-base.en, whisper-large, and whisper-large-v2 correspond to the timestamp tokens from "<|0.00|>" to "<|30.00|>" (0.02-second steps, both endpoints included). By tightening the if statement condition from token >= timestamp_begin to token >= timestamp_begin && token <= timestamp_end, we ensure that user-added tokens are decoded as regular tokens: the condition evaluates to false for them, so execution goes to the else block.
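The id layout described above can be sketched with a little arithmetic. The starting id below is an assumed example value and the 0.02-second step is an assumption about the timestamp spacing; in practice both bounds should be looked up from the tokenizer rather than hard-coded. Counting both endpoints, 0.00 through 30.00 at that step comes to 1501 ids.

```javascript
// Illustrative values only; a real tokenizer should be queried for these ids.
const TIME_PRECISION = 0.02;     // assumed seconds between consecutive timestamp tokens
const firstTimestampId = 50364;  // assumed example id of "<|0.00|>"

// 0.00 through 30.00 inclusive, so one more than the number of steps.
const timestampCount = Math.round(30 / TIME_PRECISION) + 1;

// Id of "<|30.00|>" if the timestamp ids are contiguous (assumption).
const lastTimestampId = firstTimestampId + timestampCount - 1;

// A token appended to the vocabulary afterwards would get the next free id.
const firstAddedId = lastTimestampId + 1;

console.log({ timestampCount, lastTimestampId, firstAddedId });
```

Under these assumptions any user-added token id is strictly greater than the id of "<|30.00|>", which is why an upper bound on the check is enough to separate the two groups.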

Labels: enhancement
