Sharing settings that have had the most accurate transcription for me so far #2766

Roman215 · 2026-04-19T02:09:44Z

I wanted to share a configuration that has given me extremely accurate results for English ASR. It uses whisper-ctranslate2 with settings that are probably not conventional. Here's an example of the command line I used:

whisper-ctranslate2 --model large-v2 --device cuda --output_format srt --task transcribe --language English --patience 1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 --logprob_threshold -10000000000 --temperature 0.0000000000001 --best_of 1 --beam_size 1 --condition_on_previous_text False --no_speech_threshold 0.99 --compute_type float32 --suppress_tokens 50364 --suppress_blank False --repetition_penalty 1 --length_penalty -10000000000 <input_file_path>

What I found is that large-v2 produces accurate results with the following combination: patience set to an extremely high value; temperature as close to 0 as possible without being 0; logprob_threshold and length_penalty set very low to try to "avoid" failures in detection; best_of and beam_size both set to 1; and no_speech_threshold at 0.99 so the model hears as much as possible. I also almost never see any hallucinations in the output with large-v2 (with large-v3 I see more hallucinations and generally worse results).
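For anyone who wants to run this over many files, the command above can be assembled from a script. This is only a rough sketch: every flag name and value is copied from the command in this post, except `PATIENCE`, which is a shortened stand-in (an assumption) for the very long number, and `build_command`, which is a hypothetical helper name.

```python
import shlex
import subprocess

# Shortened stand-in for the very long patience value in the post (assumption).
# Pass the exact string yourself if you want to match the original command.
PATIENCE = "1e30"

# Flags and values copied from the command line in the post. Values are kept
# as strings so they reach the CLI exactly as written.
SETTINGS = {
    "model": "large-v2",
    "device": "cuda",
    "output_format": "srt",
    "task": "transcribe",
    "language": "English",
    "logprob_threshold": "-10000000000",
    "temperature": "0.0000000000001",
    "best_of": "1",
    "beam_size": "1",
    "condition_on_previous_text": "False",
    "no_speech_threshold": "0.99",
    "compute_type": "float32",
    "suppress_tokens": "50364",
    "suppress_blank": "False",
    "repetition_penalty": "1",
    "length_penalty": "-10000000000",
}


def build_command(input_path, patience=PATIENCE):
    """Return the argv list for one whisper-ctranslate2 run (hypothetical helper)."""
    argv = ["whisper-ctranslate2"]
    for flag, value in SETTINGS.items():
        argv += [f"--{flag}", value]
    argv += ["--patience", patience, input_path]
    return argv


if __name__ == "__main__":
    cmd = build_command("input.wav")  # "input.wav" is a placeholder path
    print(shlex.join(cmd))            # show the command that would be run
    # Uncomment to actually run it (needs whisper-ctranslate2 installed):
    # subprocess.run(cmd, check=True)
```

Keeping the values as strings avoids any rounding or reformatting on the Python side before the CLI parses them.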

It even detects music and can often produce the lyrics. It can also label sound effects the way many subtitles do, e.g. [laughter], [evil laughter], [growling], [baby crying], for sounds that closely resemble those.

I know tools like whisperx include VAD to try to reduce hallucinations, but I found that VAD is actually harmful to this method. With the settings I mentioned, large-v2 is very good at listening to the complete audio without producing hallucinations.

If anyone has tried other settings and gotten better results, I'm open to hearing about those too so I can test them out as well.

misutoneko · 2026-04-20T13:19:27Z

Thank you so much for sharing! Wow, look at all those zeroes :D
Not sure how effective those are (what does "float" mean in Python?) but I guess you've tested it...
Well, I appreciate the full disclosure in any case :D
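On the "float" question, here is a small sketch of what happens to numbers like these, assuming the CLI parses the options with Python's built-in `float` (an assumption, not checked against the whisper-ctranslate2 source). `parse_like_float_arg` is a hypothetical helper, and the digit counts are illustrative, not a count of the zeroes in the post.

```python
import math


def parse_like_float_arg(text):
    """Parse a command-line string the way a `type=float` option would (hypothetical helper)."""
    return float(text)


# A 1 followed by 400 zeros is beyond what a 64-bit double can represent,
# so Python's float() returns infinity instead of raising an error.
huge = parse_like_float_arg("1" + "0" * 400)

# A 1 followed by 300 zeros is still within range and parses as 1e300.
big = parse_like_float_arg("1" + "0" * 300)

# The temperature from the post is tiny but still strictly greater than 0.
tiny = parse_like_float_arg("0.0000000000001")

print(math.isinf(huge))  # True
print(big == 1e300)      # True
print(tiny > 0.0)        # True
```

So past a certain length, extra zeroes stop changing the parsed value: it is either a very large finite number or infinity.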

In my own testing I've seen that --suppress_tokens alone can be very effective, but my tests have been limited in many ways. And I've never bothered much with v2 or v3 (those are a bit too heavy for the potato CPU that I'm using).
I can't replicate your setup exactly, but from some quick trials (with the small model & original OpenAI whisper) this does seem to work pretty well. Or, at least it didn't do any worse than using --suppress_tokens only.
More testing is needed, of course.
