WhisperTokenizer with decode_with_timestamps: behaviour is incoherent when skipping special tokens #32378

bruno-hays · 2024-08-01T14:09:13Z

System Info

transformers == 4.44.0.dev0

Who can help?

@sanchit-gandhi @kamilakesbi @ArthurZucker

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

import torch

from transformers import WhisperTokenizer, WhisperForConditionalGeneration, WhisperFeatureExtractor
from datasets import load_dataset

model_name = "openai/whisper-tiny"

tokenizer = WhisperTokenizer.from_pretrained(model_name)
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)

dataset = load_dataset("BrunoHays/Accueil_UBS_pseudo_labelled")

audio_sample = dataset["test"][-1]["audio"]

features = feature_extractor(
    audio_sample["array"], return_tensors="pt", sampling_rate=audio_sample["sampling_rate"]
).input_features

preprompt_tokens = [50361, 50364, 47320, 485, 50464, 50464, 8774, 11, 13274, 1930, 288, 257, 2251, 15081, 1506, 408,
                    18297, 1736, 485, 50564, 50564, 47320, 485, 50614, 50614, 16227, 871, 485, 50664, 50664, 10802,
                    48291, 8404, 871, 465, 6668, 30, 50714, 50714, 25475, 13, 50764, 50764, 3790, 485, 50814, 50814,
                    1282, 29531, 1736, 5977, 12, 9498, 465, 17872, 11, 2420, 1930, 8487, 485, 50914, 50914, 4416, 8487,
                    421, 6, 268, 2994, 1776, 517, 368, 2449, 368, 1117, 4900, 465, 2630, 12, 306, 2156, 13, 51014,
                    51014, 3790, 485, 51064, 51064, 2193, 1798, 1053, 1956, 262, 6, 25543, 485, 51164, 51164, 2251, 371,
                    9304, 19984, 11, 2251, 2199, 13, 51314]

preprompt_tokens_no_timestamps = [tok for tok in preprompt_tokens if tok < 50364]

decoder_in_ids = torch.tensor(preprompt_tokens)
generated_ids = model.generate(inputs=features, return_timestamps=True, prompt_ids=decoder_in_ids)

print("PREPROMPT TIMESTAMPS + SKIP SPECIAL TOKENS")
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True, decode_with_timestamps=True))
print()
print("PREPROMPT TIMESTAMPS + SPECIAL TOKENS")
print(tokenizer.decode(generated_ids[0], skip_special_tokens=False, decode_with_timestamps=True))
print()

decoder_in_ids = torch.tensor(preprompt_tokens_no_timestamps)
generated_ids = model.generate(inputs=features, return_timestamps=True, prompt_ids=decoder_in_ids)

print("PREPROMPT NO TIMESTAMPS + SKIP SPECIAL TOKENS")
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True, decode_with_timestamps=True))
print()
print("PREPROMPT NO TIMESTAMPS + SPECIAL TOKENS")
print(tokenizer.decode(generated_ids[0], skip_special_tokens=False, decode_with_timestamps=True))

The output:

PREPROMPT TIMESTAMPS + SKIP SPECIAL TOKENS
<|0.00|> Bien, bon je m'ai reçu.<|1.00|><|1.00|> Non, c'est parce que nous avons profini à l'ocier d'inscription.<|4.00|><|4.00|> Il demande de dérogation pour entrer en deuxième année.<|7.00|><|7.00|> Et je voulais savoir d'une partilée, c'était bien parvenu.<|10.00|><|10.00|> Et quand ils seront en aurait la réponse,<|12.00|><|12.00|> et comment se passe les inscriptions...<|15.00|><|15.00|> ...falus le piment.<|17.00|><|17.00|> Euh...<|18.00|><|18.00|> C'est pour ma fille.<|20.00|><|20.00|> Non, non, non, non.<|21.00|><|21.00|> Non, non, non, non.<|22.00|><|22.00|> On va être une accolée, hein?<|23.00|><|23.00|> On d'accord.<|24.00|><|24.00|> Putain, merci.<|25.00|><|25.00|> Ouais.<|26.00|>

PREPROMPT TIMESTAMPS + SPECIAL TOKENS
<|startofprev|><|0.00|> Euh...<|2.00|><|2.00|> Non, après il y a une autre je ne vois pas...<|4.00|><|4.00|> Euh...<|5.00|><|5.00|> Elle est...<|6.00|><|6.00|> Vous dites elle est en zone?<|7.00|><|7.00|> Ouais.<|8.00|><|8.00|> Et...<|9.00|><|9.00|> On aurait pas peut-être en première, mais il faut...<|11.00|><|11.00|> Il faut qu'enforcer un deux dements sont en vous-leurs.<|13.00|><|13.00|> Et...<|14.00|><|14.00|> En organien qui s'appelle...<|16.00|><|16.00|> une viche correction, une simple.<|19.00|><|startoftranscript|><|fr|><|transcribe|><|19.00|> Bien, bon je m'ai reçu.<|20.00|><|20.00|> Non, c'est parce que nous avons profini à l'ocier d'inscription.<|23.00|><|23.00|> Il demande de dérogation pour entrer en deuxième année.<|26.00|><|26.00|> Et je voulais savoir d'une partilée, c'était bien parvenu.<|29.00|><|29.00|> Et quand ils seront en aurait la réponse,<|31.00|><|31.00|> et comment se passe les inscriptions...<|34.00|><|34.00|> ...falus le piment.<|36.00|><|36.00|> Euh...<|37.00|><|37.00|> C'est pour ma fille.<|39.00|><|39.00|> Non, non, non, non.<|40.00|><|40.00|> Non, non, non, non.<|41.00|><|41.00|> On va être une accolée, hein?<|42.00|><|42.00|> On d'accord.<|43.00|><|43.00|> Putain, merci.<|44.00|><|44.00|> Ouais.<|45.00|>

PREPROMPT NO TIMESTAMPS + SKIP SPECIAL TOKENS
<|0.00|> Bien, bon je m'airs-y.<|1.20|><|1.20|> Non, c'est parce que nous avons profini à l'ocid, à l'inscription,<|4.20|><|4.20|> une demande de dérogation pour entrer en deuxième année.<|6.80|><|6.80|> Et je voulais savoir d'une partilée, c'était bien parvenu.<|10.40|><|10.40|> Et quand elles seront nos rélareis, pôts, c'est comment se passe les inscriptions...<|15.60|><|15.60|> Enfin, le pépiment.<|17.00|><|17.00|> Euh... C'est pour moi, là... C'est pour ma fille.<|20.20|><|20.20|> Non, non, mais les sons, c'est que la passion on va être ma collègue.<|22.80|><|22.80|> On d'accord, putain, merci.<|24.20|><|24.20|> Ouais.<|25.20|>

PREPROMPT NO TIMESTAMPS + SPECIAL TOKENS
<|startofprev|> Euh... Non, après il y a une autre je ne vois pas... Euh... Elle est... Vous dites elle est en zone? Ouais. Et... On aurait pas peut-être en première, mais il faut... Il faut qu'enforcer un deux dements sont en vous-leurs. Et... En organien qui s'appelle... une viche correction, une simple.<|startoftranscript|><|fr|><|transcribe|><|0.00|> Bien, bon je m'airs-y.<|1.20|><|1.20|> Non, c'est parce que nous avons profini à l'ocid, à l'inscription,<|4.20|><|4.20|> une demande de dérogation pour entrer en deuxième année.<|6.80|><|6.80|> Et je voulais savoir d'une partilée, c'était bien parvenu.<|10.40|><|10.40|> Et quand elles seront nos rélareis, pôts, c'est comment se passe les inscriptions...<|15.60|><|15.60|> Enfin, le pépiment.<|17.00|><|17.00|> Euh... C'est pour moi, là... C'est pour ma fille.<|20.20|><|20.20|> Non, non, mais les sons, c'est que la passion on va être ma collègue.<|22.80|><|22.80|> On d'accord, putain, merci.<|24.20|><|24.20|> Ouais.<|25.20|>

As you can see the timestamps differ when we skip special tokens or don't include timestamps in the prompt_ids.
It seems like the timestamps are offset by the last timestamp of the prompt, but only when it is not skipped in the final text.

Expected behavior

I don't think this is the intended behaviour.
I think decoding should only replace each token with its corresponding string: identical token_ids should always be decoded to the same string, regardless of the context.

As is, we output tokens like <|45.00|> that do not even exist in the tokenizer vocabulary, which is quite misleading.
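To make the expectation concrete, here is a minimal sketch of a context-free mapping. It assumes the multilingual Whisper vocabulary, where <|0.00|> has id 50364 (as in the reproduction above) and timestamp tokens advance in 0.02 s steps up to <|30.00|>. The helper name timestamp_token_to_str is hypothetical and not part of transformers.

```python
# Hypothetical helper, not part of transformers: a context-free view of timestamp tokens.
TIMESTAMP_BEGIN = 50364   # id of <|0.00|> in the multilingual vocabulary (see the reproduction)
TIME_PRECISION = 0.02     # assumed: seconds per timestamp step
MAX_STEPS = 1500          # assumed: the last timestamp token is <|30.00|>


def timestamp_token_to_str(token_id: int) -> str:
    """Map one timestamp token id to its string, ignoring the surrounding tokens."""
    steps = token_id - TIMESTAMP_BEGIN
    if not 0 <= steps <= MAX_STEPS:
        raise ValueError(f"{token_id} is not a timestamp token")
    return f"<|{steps * TIME_PRECISION:.2f}|>"


print(timestamp_token_to_str(51314))  # last token of the prompt above
```

Under these assumptions, <|45.00|> would need id 52614, which is past the last timestamp token, so that string can only come from context-dependent post-processing.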


ArthurZucker · 2024-08-27T08:05:15Z

I think I can reproduce, here is what I got.

PREPROMPT TIMESTAMPS + SKIP SPECIAL TOKENS
<|0.00|> Bien, bon je m'ai reçu.<|1.00|><|1.00|> Non, c'est parce que nous avons profini à l'ocier d'inscription.<|4.00|><|4.00|> Il demande de dérogation pour entrer en deuxième année.<|7.00|><|7.00|> Et je voulais savoir d'une partilée, c'était bien parvenu.<|10.00|><|10.00|> Et quand ils seront en aurait la réponse,<|12.00|><|12.00|> et comment se passe les inscriptions...<|15.00|><|15.00|> ...falus le piment.<|17.00|><|17.00|> Euh...<|18.00|><|18.00|> C'est pour ma fille.<|20.00|><|20.00|> Non, non, non, non.<|21.00|><|21.00|> Non, non, non, non.<|22.00|><|22.00|> On va être une accolée, hein?<|23.00|><|23.00|> On d'accord.<|24.00|><|24.00|> Putain, merci.<|25.00|><|25.00|> Ouais.<|26.00|>

PREPROMPT TIMESTAMPS + SPECIAL TOKENS
<|startofprev|><|0.00|> Euh...<|2.00|><|2.00|> Non, après il y a une autre je ne vois pas...<|4.00|><|4.00|> Euh...<|5.00|><|5.00|> Elle est...<|6.00|><|6.00|> Vous dites elle est en zone?<|7.00|><|7.00|> Ouais.<|8.00|><|8.00|> Et...<|9.00|><|9.00|> On aurait pas peut-être en première, mais il faut...<|11.00|><|11.00|> Il faut qu'enforcer un deux dements sont en vous-leurs.<|13.00|><|13.00|> Et...<|14.00|><|14.00|> En organien qui s'appelle...<|16.00|><|16.00|> une viche correction, une simple.<|19.00|><|startoftranscript|><|fr|><|transcribe|><|19.00|> Bien, bon je m'ai reçu.<|20.00|><|20.00|> Non, c'est parce que nous avons profini à l'ocier d'inscription.<|23.00|><|23.00|> Il demande de dérogation pour entrer en deuxième année.<|26.00|><|26.00|> Et je voulais savoir d'une partilée, c'était bien parvenu.<|29.00|><|29.00|> Et quand ils seront en aurait la réponse,<|31.00|><|31.00|> et comment se passe les inscriptions...<|34.00|><|34.00|> ...falus le piment.<|36.00|><|36.00|> Euh...<|37.00|><|37.00|> C'est pour ma fille.<|39.00|><|39.00|> Non, non, non, non.<|40.00|><|40.00|> Non, non, non, non.<|41.00|><|41.00|> On va être une accolée, hein?<|42.00|><|42.00|> On d'accord.<|43.00|><|43.00|> Putain, merci.<|44.00|><|44.00|> Ouais.<|45.00|>

PREPROMPT NO TIMESTAMPS + SKIP SPECIAL TOKENS
<|0.00|> Bien, bon je m'airs-y.<|1.20|><|1.20|> Non, c'est parce que nous avons profini à l'ocid, à l'inscription,<|4.20|><|4.20|> une demande de dérogation pour entrer en deuxième année.<|6.80|><|6.80|> Et je voulais savoir d'une partilée, c'était bien parvenu.<|10.40|><|10.40|> Et quand elles seront nos rélareis, pôts, c'est comment se passe les inscriptions...<|15.60|><|15.60|> Enfin, le pépiment.<|17.00|><|17.00|> Euh... C'est pour moi, là... C'est pour ma fille.<|20.20|><|20.20|> Non, non, mais les sons, c'est que la passion on va être ma collègue.<|22.80|><|22.80|> On d'accord, putain, merci.<|24.20|><|24.20|> Ouais.<|25.20|>

PREPROMPT NO TIMESTAMPS + SPECIAL TOKENS
<|startofprev|> Euh... Non, après il y a une autre je ne vois pas... Euh... Elle est... Vous dites elle est en zone? Ouais. Et... On aurait pas peut-être en première, mais il faut... Il faut qu'enforcer un deux dements sont en vous-leurs. Et... En organien qui s'appelle... une viche correction, une simple.<|startoftranscript|><|fr|><|transcribe|><|0.00|> Bien, bon je m'airs-y.<|1.20|><|1.20|> Non, c'est parce que nous avons profini à l'ocid, à l'inscription,<|4.20|><|4.20|> une demande de dérogation pour entrer en deuxième année.<|6.80|><|6.80|> Et je voulais savoir d'une partilée, c'était bien parvenu.<|10.40|><|10.40|> Et quand elles seront nos rélareis, pôts, c'est comment se passe les inscriptions...<|15.60|><|15.60|> Enfin, le pépiment.<|17.00|><|17.00|> Euh... C'est pour moi, là... C'est pour ma fille.<|20.20|><|20.20|> Non, non, mais les sons, c'est que la passion on va être ma collègue.<|22.80|><|22.80|> On d'accord, putain, merci.<|24.20|><|24.20|> Ouais.<|25.20|>

It seems that skip_special_tokens will skip the prompt ids.
This seems related to WhisperTokenizer/WhisperTokenizerFast. It might be documented somewhere, but I am not exactly sure.

If this is expected, maybe the doc needs an update; otherwise feel free to open a PR. There have been a lot of changes, so this could be expected.

ArthurZucker · 2024-08-27T08:10:10Z

Regarding <|45.00|>: this was a deliberate choice, and I am pretty sure it matches the output you would get from the original Whisper for audio longer than 30 seconds. It has now become the convention, so we can't really change it!

ylacombe · 2024-09-05T13:43:10Z

Hey @Hubert-Bonisseur, thanks for opening this issue!

There are a few things to unpack here, and I'll try to be as clear as possible:

1. Regarding skip_special_tokens impact on the prompt_ids:

It seems that the tokenizer strips everything that is before the first decoder token id "<|startoftranscript|>" when skip_special_tokens=True.

I don't think that's a bug here, but just an implementation choice. However, it's indeed not documented. Would you like to open a PR to make it clearer?
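The stripping described in point 1 can be sketched roughly as follows. This is a simplified model of the observed behaviour, not the library's code: strip_prompt is a hypothetical name, 50361 is the <|startofprev|> id that opens the prompt in the reproduction, and 50258 is assumed to be the <|startoftranscript|> id of the multilingual vocabulary.

```python
# Simplified sketch of the stripping described above; not the library's code.
START_OF_PREV = 50361         # <|startofprev|>, first id of the prompt in the reproduction
START_OF_TRANSCRIPT = 50258   # <|startoftranscript|> in the multilingual vocabulary (assumed)


def strip_prompt(token_ids: list[int]) -> list[int]:
    """Drop everything before <|startoftranscript|> when the sequence starts with a prompt."""
    if not token_ids or token_ids[0] != START_OF_PREV:
        return token_ids  # no prompt: nothing to strip
    if START_OF_TRANSCRIPT not in token_ids:
        return []  # assumption: a prompt with no transcript start leaves nothing to decode
    return token_ids[token_ids.index(START_OF_TRANSCRIPT):]
```

Because the prompt timestamps are removed before any timestamp handling runs, the generated timestamps start from <|0.00|> when skip_special_tokens=True.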

2. Impact of skip_special_tokens on timestamps:

The stripping mentioned above is done as a pre-processing step. Decoding with timestamps is a post-processing step, so there are two possible cases:

If you have a prompt_ids with timestamps and skip_special_tokens=False:

As highlighted in point 1, the tokenizer will keep the prompt ids, including their timestamps. The timestamp-decoding step will thus also include the timestamps of the prompt_ids. In other words, the tokenizer makes no distinction between the prompt_ids and the generated tokens and sees everything as a single block.

This is also why you see a transcription going from 0s to 45s: the tokenizer includes the prompt input ids in its timestamp computation.
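A minimal model of that computation, consistent with the outputs in this thread, is sketched below. It is an assumption about the logic rather than the library's actual code: render_timestamps is a hypothetical name, and the 0.02 s step is assumed. The idea is that whenever a raw timestamp is smaller than the previous one, the tokenizer treats it as the start of a new segment and adds the length of what came before.

```python
# Simplified model of the behaviour described above; not the library's code.
TIMESTAMP_BEGIN = 50364   # id of <|0.00|>, as in the reproduction
TIME_PRECISION = 0.02     # assumed: seconds per timestamp step


def render_timestamps(token_ids):
    """Return the timestamp strings of a sequence, adding an offset when time goes backwards."""
    rendered = []
    offset = 0.0        # total length of the segments already seen
    current_max = 0.0   # latest raw timestamp in the current segment
    for token in token_ids:
        if token < TIMESTAMP_BEGIN:
            continue  # text or other special token: ignored in this sketch
        raw = (token - TIMESTAMP_BEGIN) * TIME_PRECISION
        if raw < current_max:
            # Time went backwards: treat it as the start of a new segment.
            offset += current_max
        current_max = raw
        rendered.append(f"<|{raw + offset:.2f}|>")
    return rendered


# Last prompt timestamp (19.00 s raw), then the first two generated ones (0.00 s and 1.00 s raw).
print(render_timestamps([51314, 50364, 50414]))  # ['<|19.00|>', '<|19.00|>', '<|20.00|>']
# With the prompt stripped, the same generated ids keep their raw values.
print(render_timestamps([50364, 50414]))         # ['<|0.00|>', '<|1.00|>']
```

The same arithmetic gives the final <|45.00|>: the last generated timestamp is 26.00 s raw, plus the 19.00 s carried over from the prompt.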

BTW, if you actually use output_offsets=True:

outputs = tokenizer.decode(generated_ids[0], skip_special_tokens=False, decode_with_timestamps=True, output_offsets=True)

You'll see that outputs["offsets"][11] is:

{'text': ' une viche correctio...ai reçu.", 'timestamp': (16.0, 1.0)}

So you see the transition between the prompt inputs ids and the generated tokens. It is as if the model had decoded a long input into two chunks (long-form generation).
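If you want to locate that transition programmatically, one option is to look for entries whose end timestamp is earlier than their start. The sketch below assumes each offset is a dict with a 'timestamp' tuple, as in the entry above; find_segment_boundaries is a hypothetical helper, and apart from the (16.0, 1.0) pair the sample data is made up for illustration.

```python
def find_segment_boundaries(offsets):
    """Return the indices of offsets whose end timestamp is earlier than their start."""
    return [
        index
        for index, chunk in enumerate(offsets)
        if chunk["timestamp"][1] < chunk["timestamp"][0]
    ]


# Illustrative data only: the texts are placeholders, and only (16.0, 1.0) comes from the thread.
sample_offsets = [
    {"text": "prompt chunk", "timestamp": (14.0, 16.0)},
    {"text": "chunk spanning prompt and generation", "timestamp": (16.0, 1.0)},
    {"text": "generated chunk", "timestamp": (1.0, 4.0)},
]
print(find_segment_boundaries(sample_offsets))  # [1]
```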

In all other cases, since there are no timestamps other than the ones in the generated tokens, the tokenizer will start the first timestamp at the beginning of the generated text.

BTW, there's another shortcoming in the documentation: if there's a prompt_ids without timestamps and output_offsets=True, the prompt_ids will be removed from the decoded text. Would you like to open a documentation PR about it?

3. Impact of prompt_ids timestamps on the generation:

As you can see the timestamps differ when we [...] don't include timestamps in the prompt_ids.

This is clearly true, but in my opinion it's because including or leaving out timestamps changes the prompt itself. The model then behaves differently, because it doesn't see the same tokens in the two cases.
That might explain both the timestamp difference and the generated text difference.

TL;DR: everything has a reason!

Would you like to open a documentation PR to correct the points highlighted above?

cc @eustlb for visibility

bruno-hays · 2024-09-06T16:12:06Z

@ylacombe
Thanks for the very detailed answer. I'll try to update the doc on Monday

ylacombe · 2024-09-11T15:28:19Z

Closing the issue, since the docs were updated!

bruno-hays added the bug label Aug 1, 2024

amyeroberts added the Audio label Aug 1, 2024

ArthurZucker added the Whisper label Aug 27, 2024

ArthurZucker mentioned this issue Aug 27, 2024

Whisper generate return a slice of result if result have more than one added token #33082

Closed

4 tasks

bruno-hays mentioned this issue Sep 9, 2024

Update WhisperTokenizer Doc: Timestamps and Previous Tokens Behaviour #33390

Merged

5 tasks

ylacombe closed this as completed Sep 11, 2024
