Bugfix: Illogical "Avoid computing higher temperatures on no_speech" #1903

Open · wants to merge 2 commits into main

Conversation

@Purfview commented Dec 17, 2023

Bugfix for #1279

The bug: since #1279, a window is treated as "silence" when decoding has failed due to compression_ratio_threshold [+no_speech_threshold], while further down the code it is no longer treated as "silence".

"Silence" should apply only when decoding has failed due to logprob_threshold [+no_speech_threshold].

As described there:

parser.add_argument("--no_speech_threshold", type=optional_float, default=0.6, help="if the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence")

And in the code there:

```python
if no_speech_threshold is not None:
    # no voice activity check
    should_skip = result.no_speech_prob > no_speech_threshold
    if (
        logprob_threshold is not None
        and result.avg_logprob > logprob_threshold
    ):
        # don't skip if the logprob is high enough, despite the no_speech_prob
        should_skip = False
```
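
For context, here is roughly what the corresponding check in decode_with_fallback looks like (paraphrased, not the exact diff of this PR): before the fix, any failed decoding with a high no_speech_prob is treated as "silence" and the higher temperatures are skipped; the fix makes that conditional on logprob_threshold as well, matching the should_skip check above.

```python
# Paraphrased from decode_with_fallback() in whisper/transcribe.py;
# see the PR diff for the exact change.
needs_fallback = False
if (
    compression_ratio_threshold is not None
    and decode_result.compression_ratio > compression_ratio_threshold
):
    needs_fallback = True  # too repetitive
if (
    logprob_threshold is not None
    and decode_result.avg_logprob < logprob_threshold
):
    needs_fallback = True  # average log probability is too low

# Before: needs_fallback was cleared whenever no_speech_prob was high, even if
# the failure came from compression_ratio_threshold alone.
# After (this bugfix): only treat the window as "silence" when the decoding
# also failed due to logprob_threshold.
if (
    no_speech_threshold is not None
    and decode_result.no_speech_prob > no_speech_threshold
    and logprob_threshold is not None
    and decode_result.avg_logprob < logprob_threshold
):
    needs_fallback = False  # silence
```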

@Purfview (Author)

Related: SYSTRAN/faster-whisper#621

@Purfview (Author) commented Dec 17, 2023

I think this bug can trigger hallucination loops: on some hallucinations the prompt reset on high temperature wouldn't be triggered, because higher temperatures are not computed on what is not actually "silence".
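
To spell it out: the prompt is only reset when the segment ends up being decoded at a high temperature, roughly like this in the transcribe() loop (paraphrased):

```python
# Paraphrased from the transcribe() loop in whisper/transcribe.py.
# If the buggy check stops the fallback at temperature 0.0, result.temperature
# stays at 0.0, this reset never fires, and the hallucinated tokens keep being
# fed as the prompt for the following windows.
if not condition_on_previous_text or result.temperature > 0.5:
    # do not feed the prompt tokens if a high temperature was used
    prompt_reset_since = len(all_tokens)
```

(faster-whisper exposes that 0.5 value as prompt_reset_on_temperature, which is what the DEBUG lines below refer to.)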

@Purfview (Author)

@TheoBoyer @jongwook , would be great if you could have a look.

@TheoBoyer (Contributor)

This change is consistent with the rest of the code, so I'm not against it.

The original PR indeed skipped processing based on the logprob_threshold, but it was also contingent on logprob_threshold being set. @jongwook modified this. I assume the intention was to make the process independent of whether a threshold is set, but there may be reasons for this change that I'm unaware of.

However, I'm skeptical about involving logprob_threshold in silence discrimination in the first place.
The approach figure in the original paper clearly shows that there shouldn't be any decoding after no_speech.

[Figure: the "Approach" overview diagram from the Whisper paper]

PR #1279 was created because no_speech does not depend on token decoding; hence, regardless of the tokens decoded, no_speech_prob will remain unchanged.
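
(For reference, no_speech_prob is read from the softmax at the <|startoftranscript|> position on the very first forward pass, roughly like this in whisper/decoding.py, so it is fixed before any token is sampled:)

```python
# Paraphrased from DecodingTask._main_loop() in whisper/decoding.py.
# no_speech_prob comes from the first forward pass, at the SOT position,
# so it does not change with whatever tokens are decoded afterwards.
if i == 0 and self.tokenizer.no_speech is not None:  # save no_speech_probs
    probs_at_sot = logits[:, self.sot_index].float().softmax(dim=-1)
    no_speech_probs = probs_at_sot[:, self.tokenizer.no_speech].tolist()
```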

In the (too) few experiments I conducted, the model seemed capable of hallucinating high-probability tokens during silences. It would be beneficial if someone could further investigate the relevance of incorporating logprob_threshold in silence discrimination. I'm also interested to know if any related experiments already exist.

@Purfview (Author) commented Dec 18, 2023

> However, I'm skeptical about involving logprob_threshold in silence discrimination in the first place.

no_speech_threshold alone is pretty unreliable; the model can generate a no_speech_prob close to 1.0 on perfectly fine speech.

@Purfview (Author)

> I think this bug can trigger hallucination loops: on some hallucinations the prompt reset on high temperature wouldn't be triggered, because higher temperatures are not computed on what is not actually "silence".

My guess was right; today I encountered one:

DEBUG: Compression ratio threshold is not met with temperature 0.0 (6.677966 > 2.400000)
[04:17.320 --> 04:29.020]  been doing it for a long time. I'm a professional. I'm a professional. I'm a
[04:29.020 --> 04:29.340]  professional. I'm a professional. I'm a professional. I'm a professional. I'm
[04:29.340 --> 04:34.560]  a professional. I'm a professional. I'm a professional. I'm a professional. I'm
[04:34.560 --> 04:38.360]  a professional. I'm a professional. I'm a professional. I'm a professional. I'm
[04:38.360 --> 05:03.750]  a professional. I'm a professional. I'm a professional. I'm a professional. I'm

No hallucination loop with this bugfix:

DEBUG: Compression ratio threshold is not met with temperature 0.0 (6.677966 > 2.400000)
DEBUG: Compression ratio threshold is not met with temperature 0.2 (8.533333 > 2.400000)
DEBUG: Compression ratio threshold is not met with temperature 0.4 (8.884615 > 2.400000)
[04:17.320 --> 04:22.640]  got me feeling natural. Finding a natural-seeming way to fail at any given task.
[04:23.700 --> 04:27.140]  In each of the commercials that I'm in, I'm the one who simply can't go on
[04:27.140 --> 04:33.340]  without the product. It's ridiculous that we don't have the product. Show them.
DEBUG: Reset prompt. prompt_reset_on_temperature threshold is met 0.600000 > 0.500000
DEBUG: Log probability threshold is not met with temperature 0.0 (-1.344815 < -1.000000)
DEBUG: Log probability threshold is not met with temperature 0.2 (-1.150256 < -1.000000)
[04:33.340 --> 04:35.340]  No, you shouldn't.
[04:36.020 --> 04:36.300]  Please.
[04:36.560 --> 04:37.520]  You wanna see?
[04:38.020 --> 04:39.080]  Yeah, I wanna see.
[04:43.260 --> 04:44.120]  She's amazing.
[05:03.870 --> 05:05.110]  I just...
[05:05.110 --> 05:05.650]  I...

Commit: Bugfix for openai#1279

It's "silence" when decoding has failed due to `compression_ratio_threshold` too, when further down the code it's not "silence" anymore.

"Silence" should be only when decoding has failed due to `logprob_threshold`.

As described there:
https://github.com/openai/whisper/blob/8bc8860694949db53c42ba47ddc23786c2e02a8b/whisper/transcribe.py#L421

And in the code there:
https://github.com/openai/whisper/blob/8bc8860694949db53c42ba47ddc23786c2e02a8b/whisper/transcribe.py#L243-L251

@Purfview (Author)

Another example of hallucination fix: #1962
