Sharing settings that have had the most accurate transcription for me so far #2766
Replies: 1 comment
Thank you so much for sharing! Wow, look at all those zeroes :D In my own testing I've seen that --suppress-tokens alone can be very effective, but my tests have been somewhat limited in many ways. And I've never bothered much with v2 or v3 (those are a bit too heavy for the potato CPU that I'm using).
So I wanted to share a configuration that has given me extremely accurate results for English ASR. It uses whisper-ctranslate2 with settings that are probably not conventional. Here's an example of the command line I used:
whisper-ctranslate2 --model large-v2 --device cuda --output_format srt --task transcribe --language English --patience 1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 --logprob_threshold -10000000000 --temperature 0.0000000000001 --best_of 1 --beam_size 1 --condition_on_previous_text False --no_speech_threshold 0.99 --compute_type float32 --suppress_tokens 50364 --suppress_blank False --repetition_penalty 1 --length_penalty -10000000000 <input_file_path>
I found that large-v2 produces accurate results with this combination: patience set to an extremely high value; temperature as close to 0 as possible without being 0; logprob_threshold and length_penalty set very low to try to "avoid" failures in detection; best_of and beam_size both set to 1; and no_speech_threshold at 0.99 so that it hears as much of the audio as possible. I also almost never see any hallucinations in the output with large-v2 (with large-v3 I see more hallucinations and generally worse results).
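For anyone who wants to try variations of these settings (for example swapping the model), here is a rough Python sketch that assembles the same command as an argument list. The flag names and values are copied from the command above; `build_command`, `BASE_OPTIONS`, and the `overrides` parameter are just names made up for this illustration, not part of whisper-ctranslate2. The patience value is passed in as a string so you can paste the long number from the command unchanged.

```python
import shlex

# Flags and values copied from the command in the post. --patience is left
# out here and supplied by the caller, so the very long number can be pasted
# in verbatim as a string.
BASE_OPTIONS = {
    "--model": "large-v2",
    "--device": "cuda",
    "--output_format": "srt",
    "--task": "transcribe",
    "--language": "English",
    "--logprob_threshold": "-10000000000",
    "--temperature": "0.0000000000001",
    "--best_of": "1",
    "--beam_size": "1",
    "--condition_on_previous_text": "False",
    "--no_speech_threshold": "0.99",
    "--compute_type": "float32",
    "--suppress_tokens": "50364",
    "--suppress_blank": "False",
    "--repetition_penalty": "1",
    "--length_penalty": "-10000000000",
}


def build_command(input_path, patience, overrides=None):
    """Return the argv list for one whisper-ctranslate2 run.

    This is a hypothetical helper for experimenting with the settings;
    `overrides` maps flag names to replacement values.
    """
    options = dict(BASE_OPTIONS)
    options["--patience"] = patience
    options.update(overrides or {})

    argv = ["whisper-ctranslate2"]
    for flag, value in options.items():
        argv += [flag, value]
    argv.append(input_path)  # input file goes last, as in the post
    return argv


if __name__ == "__main__":
    # Placeholder arguments: substitute a real file path and the patience
    # value from the post before running the printed command.
    cmd = build_command("<input_file_path>", patience="<patience value>")
    print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once the placeholders are replaced, and something like `build_command(path, patience, {"--model": "large-v3"})` makes it easy to compare models with everything else held fixed.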
It even detects music and can often transcribe the lyrics. It can also label sound effects the way many subtitles do, and might output [laughter], [evil laughter], [growling], or [baby crying] for sounds that are very close to those.
I know tools like whisperx include VAD to try to reduce hallucinations, but I found that VAD is actually harmful to this method: with the settings I mentioned, large-v2 is very good at listening to the complete audio without producing hallucinations.
If anyone has tried other settings and gotten better results, I'm open to hearing about those too so I can test them out as well.