
Add per token confidence to each segment. #991

Draft
wants to merge 1 commit into main

Conversation


@Pikauba commented Feb 22, 2023

As it seems to be useful for some users, I implemented a way to obtain the per-token confidence of the model.

A tokens_probs list is added to the dictionary of each segment, the same way tokens is. The two lists are index-aligned: each item in tokens_probs sits at the same position as the token it scores in tokens, so the confidence of any token can be retrieved directly by position.
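For illustration, here is a minimal sketch of how a caller could consume the index-aligned lists. The tokens_probs field follows this PR's naming and is not in upstream whisper; the model size and audio path are placeholders.

import whisper
from whisper.tokenizer import get_tokenizer

model = whisper.load_model("base")           # model choice is arbitrary
result = model.transcribe("audio.mp3")       # placeholder path

tokenizer = get_tokenizer(model.is_multilingual)
for segment in result["segments"]:
    # tokens and tokens_probs are index-aligned, so zip pairs each
    # token id with the probability the model assigned to it.
    # tokens_probs is the field proposed in this PR.
    for token, prob in zip(segment["tokens"], segment["tokens_probs"]):
        print(f"{tokenizer.decode([token])!r}: {prob:.3f}")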

Since it is difficult to devise a general, language-agnostic method for grouping together the tokens that make up a word, I focused solely on producing the token-level confidence. I am aware of the work in progress on word-level timestamps, which deals with regrouping tokens per word when possible: word-level timestamps in transcribe() #869

More specifically, the split_tokens_on_spaces function.

That function could eventually be used to merge token confidences into word-level confidences for languages where the tokenization allows it, as sketched below.
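As a rough sketch of that idea, assuming the tokens_probs field from this PR, a space-delimited language, and a geometric mean as one possible merging rule. The grouping below mirrors the spirit of split_tokens_on_spaces rather than its exact code.

import numpy as np

def word_confidences(tokenizer, tokens, tokens_probs):
    """Group index-aligned token/probability pairs into words.

    Sketch only: a new word starts whenever the decoded piece begins
    with a space, which only works for space-delimited languages.
    """
    words, confidences = [], []
    current_pieces, current_probs = [], []
    for token, prob in zip(tokens, tokens_probs):
        piece = tokenizer.decode([token])
        if piece.startswith(" ") and current_pieces:
            words.append("".join(current_pieces).strip())
            # Geometric mean: one assumed way to merge token confidences.
            confidences.append(float(np.exp(np.mean(np.log(current_probs)))))
            current_pieces, current_probs = [], []
        current_pieces.append(piece)
        current_probs.append(prob)
    if current_pieces:
        words.append("".join(current_pieces).strip())
        confidences.append(float(np.exp(np.mean(np.log(current_probs)))))
    return list(zip(words, confidences))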

I would like to know whether it would be necessary to make the tokens_probs results in each segment optional, and if so, what the expectations would be for the structure of that implementation.

@ryanheise
Contributor

I am aware of the work in progress on word-level timestamps, which deals with regrouping tokens per word when possible: word-level timestamps in transcribe() #869

More specifically, the split_tokens_on_spaces function.

There is actually word-level confidence in the word-level timestamps branch already (here).

That said, I can see some benefit to having a token-level analysis option. For example, German can have very long compound words, where a finer level of analysis might be helpful in some use cases.

@Pikauba marked this pull request as draft February 23, 2023 05:19
@Pikauba
Author

Pikauba commented Feb 28, 2023

Thank you for pointing that out! I looked at the code and wonder whether obtaining the per-token confidence after the whole autoregressive process has completed, by applying a forward pass with the resulting tokens (as described here) and fetching the confidence the model assigns to each token at that point, while assuming inference is done without timestamps (text_tokens = [token for token in tokens if token < tokenizer.eot]), is equivalent to saving the confidence score per token throughout the autoregressive process. (I tested it, and from what I obtained it seems this might actually not be the case.)

However, I might be wrong or misunderstanding something, but isn't it true that the timestamp predictions affect the predictions of all the other tokens, since they are produced through the same inference process?

# Coming from line 173 of transcribe.py
# (https://github.com/openai/whisper/blob/8eb29c3ef10559910cbee47b1baedefd8388458a/whisper/transcribe.py#L173)
import numpy as np
import torch

# Re-score the already-decoded text tokens in a single forward pass,
# with the no-timestamps token standing in for any timestamp tokens.
tokens = torch.tensor(
    [
        *tokenizer.sot_sequence,
        tokenizer.no_timestamps,
        *text_tokens,
        tokenizer.eot,
    ]
).to(model.device)

logits = model(mel.unsqueeze(0), tokens.unsqueeze(0))[0]

# Drop the prompt positions and the special-token logits, then read off
# the probability the model assigns to each decoded text token.
token_probs = logits[len(tokenizer.sot_sequence):, :tokenizer.eot].softmax(dim=-1)
text_token_probs = token_probs[np.arange(len(text_tokens)), text_tokens].tolist()
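For contrast, here is a minimal sketch of the other approach: saving each token's probability at the moment it is sampled, so the score is conditioned on exactly the same context (including any timestamp tokens) the model saw while decoding. This is an illustrative greedy loop under assumed names (max_new_tokens is a hypothetical decoding budget), not whisper's actual DecodingTask.

import torch

tokens = list(tokenizer.sot_sequence)
token_probs = []
for _ in range(max_new_tokens):  # max_new_tokens: assumed decoding budget
    logits = model(mel.unsqueeze(0), torch.tensor([tokens]).to(model.device))[0]
    probs = logits[-1].softmax(dim=-1)  # distribution over the next token
    next_token = int(probs.argmax())    # greedy choice, for illustration
    token_probs.append(float(probs[next_token]))
    tokens.append(next_token)
    if next_token == tokenizer.eot:
        break

If the two methods were equivalent, the probabilities recorded here would match the post-hoc text_token_probs above; the difference observed in practice is consistent with the timestamp tokens changing the conditioning context.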
