Wrong result #367

tcl8273 · 2022-10-19T09:51:00Z

I ran Whisper with the large model three times on the same file and got three different results:
{'text': ' every', 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 2.0, 'text': ' every', 'tokens': [50364, 633, 50464], 'temperature': 1.0, 'avg_logprob': -3.2822134494781494, 'compression_ratio': 0.38461538461538464, 'no_speech_prob': 0.3573426604270935}], 'language': 'ja'}
{'text': '發信さんが楽しみにしているのですごい楽しみにしている人はいい加減してます', 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 16.98, 'text': '發信さんが楽しみにしているのですごい楽しみにしている人はいい加減してます', 'tokens': [50364, 14637, 17665, 15567, 5142, 35479, 2849, 11362, 4108, 8822, 22979, 23072, 41068, 9991, 1764, 35479, 2849, 11362, 4108, 8822, 22979, 4035, 3065, 220, 13806, 9990, 9592, 249, 8822, 5368, 51213], 'temperature': 1.0, 'avg_logprob': -2.9063494205474854, 'compression_ratio': 0.4367816091954023, 'no_speech_prob': 0.3573426604270935}], 'language': 'ja'}
{'text': '歌うわくわり', 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 16.54, 'text': '歌うわくわり', 'tokens': [50364, 29582, 35845, 6134, 9206, 5095, 51191], 'temperature': 1.0, 'avg_logprob': -3.4128024578094482, 'compression_ratio': 0.25, 'no_speech_prob': 0.3573426604270935}], 'language': 'ja'}
None of these results matches the audio.
Could you help me resolve this problem?
Thank you.

japaneseV2.1.mp4
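
All three results above report `'temperature': 1.0` and an `avg_logprob` around -3. That is consistent with decoding having fallen back to sampling at a high temperature, which would explain why each run differs. The sketch below flags such segments in a result dictionary shaped like the ones in this post. The helper name `low_confidence_segments` is hypothetical and not part of Whisper, and the `-1.0` cutoff is an assumption chosen for illustration.

```python
def low_confidence_segments(result, logprob_cutoff=-1.0):
    """Return (id, temperature, avg_logprob) for segments that look unreliable.

    A segment is flagged if it was decoded with sampling (temperature > 0)
    or if its average token log-probability is below the cutoff.
    `logprob_cutoff` is an illustrative assumption, not a Whisper constant.
    """
    flagged = []
    for seg in result["segments"]:
        sampled = seg["temperature"] > 0.0
        unlikely = seg["avg_logprob"] < logprob_cutoff
        if sampled or unlikely:
            flagged.append((seg["id"], seg["temperature"], seg["avg_logprob"]))
    return flagged


# First result from the post above, trimmed to the fields used here.
result = {
    "text": " every",
    "segments": [
        {
            "id": 0,
            "text": " every",
            "temperature": 1.0,
            "avg_logprob": -3.2822134494781494,
            "no_speech_prob": 0.3573426604270935,
        }
    ],
    "language": "ja",
}

print(low_confidence_segments(result))
```

A flagged segment is a hint to retry with different settings rather than proof that the text is wrong.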

jongwook (Maintainer) · 2022-10-20T00:22:27Z

Whisper is generally not very accurate on singing voices. Using the large model with an initial prompt (the words preceding the audio) gets pretty close, except for the last line, where it hallucinated and got the timestamp wrong.

root@devbox-0:~$ whisper --model large 196658015-54bed2d2-218b-414a-8010-43c2021fe8fa.mp4 --language ja --initial_prompt "ふわふわる"
[00:00.000 --> 00:06.000] ふわふわり あなたが笑っている それだけで笑いになる
[00:06.000 --> 00:14.000] 神様ありがとう 運命のイタズラでも
[00:14.000 --> 00:31.000] 眩暮らし
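
For anyone calling Whisper from Python, as the dictionary output in the question suggests, the command above can be approximated with the sketch below. It assumes the `openai-whisper` package is installed; `transcribe_with_prompt`, `format_timestamp`, and `format_segments` are hypothetical helper names introduced here, and the timestamp layout only imitates the CLI output shown above.

```python
def transcribe_with_prompt(path, prompt, model_name="large", language="ja"):
    """Transcribe `path`, fixing the language and passing an initial prompt."""
    import whisper  # imported lazily; requires the openai-whisper package

    model = whisper.load_model(model_name)
    return model.transcribe(path, language=language, initial_prompt=prompt)


def format_timestamp(seconds):
    """Format seconds as MM:SS.mmm, imitating the CLI output above."""
    ms = round(seconds * 1000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(result):
    """Return one '[start --> end] text' line per segment of a result dict."""
    lines = []
    for seg in result["segments"]:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} --> {end}] {seg['text'].strip()}")
    return lines


# Example usage (downloads the model on first run, so it is left commented out):
# result = transcribe_with_prompt("japaneseV2.1.mp4", "ふわふわる")
# print("\n".join(format_segments(result)))
```

Fixing the language skips language detection, and the prompt steers the decoder toward the expected vocabulary. Neither guarantees a correct transcript for sung audio.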
