Decoding Tokens added by the user for Whisper models #803

### Feature request

Support decoding user-defined tokens that are added to the end of the tokenizer's vocabulary for Whisper-based models. This requires modifying the if statement in [_decode_asr](https://github.com/xenova/transformers.js/blob/da2688626d7812ad1ea47fd304c2072cc685051b/src/tokenizers.js#L3700).


### Motivation

The motivation for this proposal is feature parity with the tokenizer's `decode`, which is able to decode user-added tokens.

### Your contribution

To support this feature, we just need to modify the if statement in [_decode_asr](https://github.com/xenova/transformers.js/blob/da2688626d7812ad1ea47fd304c2072cc685051b/src/tokenizers.js#L3700) from `token >= timestamp_begin` to `token >= timestamp_begin && token <= timestamp_end` where `timestamp_end = this.model.convert_tokens_to_ids(["<|30.00|>"])[0]`.
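The bounded check can be sketched in isolation as follows. This is a simplified stand-in for the branch in `_decode_asr`, not the library's code: `splitTokens` is a hypothetical helper, the token IDs are made up for illustration, and the `timePrecision` default of 0.02 seconds per timestamp step is an assumption.

```javascript
// Hypothetical helper (not part of transformers.js): separates timestamp
// tokens from regular tokens using the proposed bounded condition.
function splitTokens(tokenIds, timestampBegin, timestampEnd, timePrecision = 0.02) {
    const timestamps = [];
    const textTokens = [];
    for (const token of tokenIds) {
        if (token >= timestampBegin && token <= timestampEnd) {
            // Inside the timestamp range: convert the offset to seconds.
            timestamps.push((token - timestampBegin) * timePrecision);
        } else {
            // Below the range (ordinary vocabulary) or above it (user-added tokens).
            textTokens.push(token);
        }
    }
    return { timestamps, textTokens };
}

// Illustrative IDs only: the timestamp range is 100..200, and 201 plays the
// role of a user-added token appended after the last timestamp.
const { timestamps, textTokens } = splitTokens([10, 100, 105, 200, 201], 100, 200);
console.log(timestamps.length, textTokens);
```

With the original unbounded condition, the ID 201 would have been treated as a timestamp; with the upper bound it falls through to the regular-token branch.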

### Why this should work: 
When a user adds a new token to the tokenizer, it gets placed at the end of the tokenizer's vocabulary. The last 1501 vocab tokens in whisper-tiny, whisper-tiny.en, whisper-small.en, whisper-small, whisper-base, whisper-base.en, whisper-large, and whisper-large-v2 correspond to the timestamp tokens from `"<|0.00|>"` to `"<|30.00|>"` in steps of 0.02. By tightening the if statement condition from `token >= timestamp_begin` to `token >= timestamp_begin && token <= timestamp_end`, we ensure that user-added tokens are decoded as regular tokens: the condition evaluates to `false` for them, so execution goes to the [else block](https://github.com/xenova/transformers.js/blob/da2688626d7812ad1ea47fd304c2072cc685051b/src/tokenizers.js#L3757).
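To make the vocabulary layout concrete, here is a toy reconstruction of it. The three leading tokens, the `<|my_custom_token|>` name, and the resulting small IDs are all invented for illustration; only the ordering (regular tokens, then `<|notimestamps|>`, then the timestamp tokens, then user additions) mirrors the argument above.

```javascript
// Toy vocabulary: a couple of text tokens followed by <|notimestamps|>.
const vocab = ["hello", " world", "<|notimestamps|>"];

// Timestamp tokens from <|0.00|> to <|30.00|> in steps of 0.02 (i / 50).
for (let i = 0; i <= 1500; ++i) {
    vocab.push(`<|${(i / 50).toFixed(2)}|>`);
}

// A user-added token (hypothetical name) is appended after the timestamps.
vocab.push("<|my_custom_token|>");

const toId = (token) => vocab.indexOf(token);
const timestampBegin = toId("<|notimestamps|>") + 1;
const timestampEnd = toId("<|30.00|>");
const addedId = toId("<|my_custom_token|>");

// Current condition versus the proposed bounded condition.
const unbounded = (t) => t >= timestampBegin;
const bounded = (t) => t >= timestampBegin && t <= timestampEnd;

console.log(unbounded(addedId)); // true: the added token is mistaken for a timestamp
console.log(bounded(addedId));   // false: the added token is decoded as a regular token
```

Every real timestamp ID still satisfies the bounded condition, so only IDs past `<|30.00|>` change behavior.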
