
Fix probability computation in WhisperNoSpeechDetection when recomputing scores #29248

Merged · 3 commits into huggingface:main on Apr 3, 2024

Conversation

@cifkao (Contributor) commented Feb 23, 2024

What does this PR do?

Fix #29313.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@patrickvonplaten @sanchit-gandhi @ylacombe

@ArthurZucker (Collaborator) left a comment:

For Whisper, the num_beams > 1 path seems a bit tricky. Would you mind adding a test to make sure we have the expected new results?

@cifkao (Contributor, Author) commented Mar 30, 2024

@ArthurZucker I added a slow test where I set the logprob_threshold to 0 to make sure the no-speech detection is triggered. Without the fix, all the outputs are empty because all the no-speech probabilities are >1.
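The mechanism behind that test setup can be sketched numerically (illustrative values only, not the actual test code; the thresholds are hypothetical): average log-probabilities are never positive, so a logprob_threshold of 0 fails for every segment and forces the no-speech probability to decide, and before the fix that probability was an exponentiated raw logit that could exceed any threshold.

```python
# Illustrative sketch (not the actual test): why logprob_threshold=0.0
# guarantees the no-speech check is exercised, and how the pre-fix bug
# then discarded every segment.
import math

# Average log-probabilities are always <= 0, so a threshold of 0 fails
# for every segment and triggers the no-speech fallback.
avg_logprob = math.log(0.9)            # even a confident segment: ~ -0.105
triggers_fallback = avg_logprob < 0.0  # True for any real segment

# Pre-fix: raw logits were exponentiated as if they were log-probs,
# producing "probabilities" above 1 that exceed any threshold.
buggy_no_speech_prob = math.exp(3.2)   # ~24.5, impossible as a probability
no_speech_threshold = 0.6              # hypothetical threshold value
dropped = triggers_fallback and buggy_no_speech_prob > no_speech_threshold
print(dropped)  # True: every segment is wrongly treated as silence
```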

@ylacombe (Collaborator) commented Apr 1, 2024

Hey @cifkao, thanks for the PR!
Could you make sure to rebase the branch to fix the failing test?

> This scenario arises when the model is called with language set and with num_beams > 1.

Also, could you point out why this scenario applies when language is set? I understand the case for num_beams > 1 but don't see the point for the other case!
Many thanks!


@cifkao (Contributor, Author) commented Apr 1, 2024

> Also, could you point out why this scenario applies when language is set? I understand the case for num_beams > 1 but don't see the point for the other case!

@ylacombe The bug manifests only when both conditions (num_beams > 1 and language set) are true. The latter condition causes decoding to start after the language token, which is why we need to call the model again to get the logits at that position (which are then incorrectly treated as log-probabilities). With language unset, the decoding starts from the beginning, so we never enter that branch.
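The distinction can be illustrated with a toy example (illustrative numbers, independent of the library code): exponentiating a raw logit can yield a value above 1, while exponentiating a properly normalized log-softmax score always yields a probability in [0, 1].

```python
# Toy illustration (not the library code) of logits vs log-probabilities.
import math

logits = [2.0, 0.5, -1.0]  # raw decoder scores for three tokens

# Bug: exp(raw logit) is not a probability and can exceed 1.
bad_prob = math.exp(logits[0])                      # exp(2.0) ~ 7.39

# Fix: normalize first (log-softmax), then exponentiate.
log_norm = math.log(sum(math.exp(x) for x in logits))
good_prob = math.exp(logits[0] - log_norm)          # ~ 0.79, a valid probability

assert bad_prob > 1.0 and 0.0 < good_prob < 1.0
```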

@sanchit-gandhi (Contributor) left a comment:

Thanks for this fix @cifkao!

    if input_ids.shape[1] == self.begin_index:
        if self.start_of_trans_offset > 1:
            with torch.no_grad():
                logits = self.model(**self.inputs).logits

            no_speech_index = self.begin_index - self.start_of_trans_offset
            no_speech_scores = logits[:, no_speech_index]
            is_scores_logprobs = False

Nice catch!
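How a flag like this is consumed can be sketched as follows (a simplified stand-in, not the actual generation code; the function name is hypothetical): scores that are already log-probabilities are exponentiated directly, while raw logits must be softmax-normalized first.

```python
# Simplified stand-in (not the actual generation code) for how an
# is_scores_logprobs flag selects the probability computation.
import math

def no_speech_prob(scores, no_speech_index, is_scores_logprobs):
    """Return the probability of the no-speech token from a row of scores."""
    if is_scores_logprobs:
        # Scores are already normalized log-probabilities.
        return math.exp(scores[no_speech_index])
    # Scores are raw logits: softmax-normalize before reading a probability.
    norm = sum(math.exp(s) for s in scores)
    return math.exp(scores[no_speech_index]) / norm

scores = [2.0, 0.5, -1.0]
print(no_speech_prob(scores, 0, is_scores_logprobs=False))  # ~0.79, valid
print(no_speech_prob(scores, 0, is_scores_logprobs=True))   # ~7.39: why the flag matters
```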

@@ -2615,6 +2615,59 @@ def test_whisper_longform_multi_batch_hard_prev_cond(self):
         for i in range(num_samples):
             assert decoded_all[i] == EXPECTED_TEXT[i]

+    @slow
+    def test_whisper_longform_no_speech_detection(self):

Thanks for adding this slow test! Just to confirm, before the fix all the transcriptions are empty due to the no-speech probabilities exceeding 1?

@cifkao (Contributor, Author):

Exactly.

@sanchit-gandhi (Contributor):

Ready for final review from @ArthurZucker

@ArthurZucker (Collaborator) left a comment:

🤯 awesome catch!

@ArthurZucker ArthurZucker merged commit 240e106 into huggingface:main Apr 3, 2024
21 checks passed
@ArthurZucker (Collaborator):

Thank you for taking the time to add a test 🔥

@sanchit-gandhi (Contributor):

Thanks for the contribution @cifkao!

ArthurZucker pushed a commit that referenced this pull request Apr 22, 2024
…uting scores (#29248)

* Fix is_scores_logprobs in WhisperNoSpeechDetection

* Add test_whisper_longform_no_speech_detection

* Fix typo
itazap pushed a commit that referenced this pull request May 14, 2024
…uting scores (#29248)

* Fix is_scores_logprobs in WhisperNoSpeechDetection

* Add test_whisper_longform_no_speech_detection

* Fix typo

Successfully merging this pull request may close these issues.

Incorrect no-speech probability in WhisperNoSpeechDetection when language is set and num_beams > 1
5 participants