
Add per token confidence to each segment. #991

Draft
wants to merge 1 commit into main

Conversation


@Pikauba commented Feb 22, 2023

As it seems to be useful for some users, I implemented a way to obtain the per-token confidence of the model.

A tokens_probs list is added to the dictionary of each segment, the same way tokens is. The two lists are index-aligned: each item in tokens_probs sits at the same position as the token it scores in tokens, so the confidence of any token can be retrieved directly by position.
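For illustration, here is a minimal sketch of how a caller could consume the index-aligned lists. The tokens_probs field follows this PR's naming and is not in upstream whisper; the model size and audio path are placeholders.

import whisper
from whisper.tokenizer import get_tokenizer

model = whisper.load_model("base")           # model choice is arbitrary
result = model.transcribe("audio.mp3")       # placeholder path

tokenizer = get_tokenizer(model.is_multilingual)
for segment in result["segments"]:
    # tokens and tokens_probs are index-aligned, so zip pairs each
    # token id with the probability the model assigned to it.
    # tokens_probs is the field proposed in this PR.
    for token, prob in zip(segment["tokens"], segment["tokens_probs"]):
        print(f"{tokenizer.decode([token])!r}: {prob:.3f}")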

Since it is difficult to devise a general, language-agnostic method for grouping together the tokens that make up a word, I focused solely on producing the token-level confidence. I am aware of the work in progress on word-level timestamps, which deals with regrouping tokens per word when possible: word-level timestamps in transcribe() #869

More specifically, the split_tokens_on_spaces function.

That function could eventually be used to merge token confidences into word-level confidences for languages where the tokenization allows it, as sketched below.
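As a rough sketch of that idea, assuming the tokens_probs field from this PR, a space-delimited language, and a geometric mean as one possible merging rule. The grouping below mirrors the spirit of split_tokens_on_spaces rather than its exact code.

import numpy as np

def word_confidences(tokenizer, tokens, tokens_probs):
    """Group index-aligned token/probability pairs into words.

    Sketch only: a new word starts whenever the decoded piece begins
    with a space, which only works for space-delimited languages.
    """
    words, confidences = [], []
    current_pieces, current_probs = [], []
    for token, prob in zip(tokens, tokens_probs):
        piece = tokenizer.decode([token])
        if piece.startswith(" ") and current_pieces:
            words.append("".join(current_pieces).strip())
            # Geometric mean: one assumed way to merge token confidences.
            confidences.append(float(np.exp(np.mean(np.log(current_probs)))))
            current_pieces, current_probs = [], []
        current_pieces.append(piece)
        current_probs.append(prob)
    if current_pieces:
        words.append("".join(current_pieces).strip())
        confidences.append(float(np.exp(np.mean(np.log(current_probs)))))
    return list(zip(words, confidences))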

I would like to know whether it would be necessary to make the tokens_probs results in each segment optional, and if so, what the expectations would be for the structure of that implementation.

@ryanheise
Contributor

I am aware of the work in progress on word-level timestamps, which deals with regrouping tokens per word when possible: word-level timestamps in transcribe() #869

More specifically, the split_tokens_on_spaces function.

There is actually word-level confidence in the word-level timestamps branch already (here).

That said, I can see some benefit to having a token-level analysis option. For example, German can have very long compound words, where a finer level of analysis might be helpful in some use cases.

@Pikauba marked this pull request as draft February 23, 2023 05:19
@Pikauba
Author

Pikauba commented Feb 28, 2023

Thank you for pointing that out! I looked at the code and wonder whether obtaining the per-token confidence after the whole autoregressive process has completed, by applying a forward pass with the resulting tokens (as described here) and fetching the confidence the model assigns to each token at that point, while assuming inference is done without timestamps (text_tokens = [token for token in tokens if token < tokenizer.eot]), is equivalent to saving the confidence score per token throughout the autoregressive process. (I tested it, and from what I obtained it seems this might actually not be the case.)

However, I might be wrong or misunderstanding something, but isn't it true that the timestamp predictions affect the predictions of all the other tokens, since they are produced through the same inference process?

# Coming from line 173 of transcribe.py
# (https://github.com/openai/whisper/blob/8eb29c3ef10559910cbee47b1baedefd8388458a/whisper/transcribe.py#L173)
import numpy as np
import torch

# Re-score the already-decoded text tokens in a single forward pass,
# with the no-timestamps token standing in for any timestamp tokens.
tokens = torch.tensor(
    [
        *tokenizer.sot_sequence,
        tokenizer.no_timestamps,
        *text_tokens,
        tokenizer.eot,
    ]
).to(model.device)

logits = model(mel.unsqueeze(0), tokens.unsqueeze(0))[0]

# Drop the prompt positions and the special-token logits, then read off
# the probability the model assigns to each decoded text token.
token_probs = logits[len(tokenizer.sot_sequence):, :tokenizer.eot].softmax(dim=-1)
text_token_probs = token_probs[np.arange(len(text_tokens)), text_tokens].tolist()
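For contrast, here is a minimal sketch of the other approach: saving each token's probability at the moment it is sampled, so the score is conditioned on exactly the same context (including any timestamp tokens) the model saw while decoding. This is an illustrative greedy loop under assumed names (max_new_tokens is a hypothetical decoding budget), not whisper's actual DecodingTask.

import torch

tokens = list(tokenizer.sot_sequence)
token_probs = []
for _ in range(max_new_tokens):  # max_new_tokens: assumed decoding budget
    logits = model(mel.unsqueeze(0), torch.tensor([tokens]).to(model.device))[0]
    probs = logits[-1].softmax(dim=-1)  # distribution over the next token
    next_token = int(probs.argmax())    # greedy choice, for illustration
    token_probs.append(float(probs[next_token]))
    tokens.append(next_token)
    if next_token == tokenizer.eot:
        break

If the two methods were equivalent, the probabilities recorded here would match the post-hoc text_token_probs above; the difference observed in practice is consistent with the timestamp tokens changing the conditioning context.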
