"Субтитры подогнал «Симон»" - dirty datasets for Russian subtitles. #2131

LocalVoidPictures · 2024-04-12T17:02:11Z

On multiple occasions I have received nonsensical output such as the line in the title (read literally, "Subtitles by Simon"). A simple online search turns up this exact line frequently. The phrase has nothing to do with the actual audio, so its presence in the training dataset appears to produce wildly incorrect transcriptions.

Funnily enough, I only get this with the large model, not with medium or tiny.

Where do we report this?

Here's a sample of what I sometimes get:

% whisper "video.mov" --model large --language Russian
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/whisper/transcribe.py:115: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
[00:00.000 --> 00:29.980]  Субтитры подогнал «Симон»
[00:30.000 --> 00:59.980]  Субтитры подогнал «Симон»
[01:00.000 --> 01:29.980]  Субтитры подогнал «Симон»
[01:30.000 --> 01:59.980]  Субтитры подогнал «Симон»
[02:00.000 --> 02:29.980]  Субтитры подогнал «Симон»
[02:30.000 --> 02:59.980]  Субтитры подогнал «Симон»
[03:00.000 --> 03:29.980]  Субтитры подогнал «Симон»
[03:30.000 --> 03:59.980]  Субтитры подогнал «Симон»
[04:00.000 --> 04:29.980]  Субтитры подогнал «Симон»
[04:30.000 --> 04:59.980]  Субтитры подогнал «Симон»
[05:00.000 --> 05:29.980]  Субтитры подогнал «Симон»
[05:30.000 --> 05:59.980]  Субтитры подогнал «Симон»
[06:00.000 --> 06:29.980]  Субтитры подогнал «Симон»
[06:30.000 --> 06:59.980]  Субтитры подогнал «Симон»
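
A loop like the one above can be flagged automatically before anyone reads the transcript. The following is a minimal sketch, assuming the CLI prints segments in the `[start --> end]  text` form shown in the sample; the function names and the `min_run` threshold are illustrative choices, not part of Whisper.

```python
import re
from itertools import groupby

# Matches segment lines such as "[00:30.000 --> 00:59.980]  some text".
# Timestamps may be mm:ss.mmm or hh:mm:ss.mmm.
SEGMENT_RE = re.compile(
    r"^\[(\d+(?::\d+)+\.\d+) --> (\d+(?::\d+)+\.\d+)\]\s*(.*)$"
)


def parse_segments(output):
    """Return (start, end, text) tuples for every segment line in the output."""
    segments = []
    for line in output.splitlines():
        match = SEGMENT_RE.match(line.strip())
        if match:
            start, end, text = match.groups()
            segments.append((start, end, text.strip()))
    return segments


def find_repeated_runs(segments, min_run=3):
    """Return (text, first_start, last_end, count) for each run of at least
    min_run consecutive segments that carry identical text."""
    runs = []
    for text, group in groupby(segments, key=lambda seg: seg[2]):
        group = list(group)
        if text and len(group) >= min_run:
            runs.append((text, group[0][0], group[-1][1], len(group)))
    return runs


if __name__ == "__main__":
    # Three lines copied from the sample output above.
    sample = (
        "[00:00.000 --> 00:29.980]  Субтитры подогнал «Симон»\n"
        "[00:30.000 --> 00:59.980]  Субтитры подогнал «Симон»\n"
        "[01:00.000 --> 01:29.980]  Субтитры подогнал «Симон»\n"
    )
    for text, start, end, count in find_repeated_runs(parse_segments(sample)):
        print(f"{count} identical segments from {start} to {end}: {text}")
```

Warning lines and anything else that does not look like a segment are ignored, so the whole console output can be passed in unchanged. A run of identical text is only a hint that the model got stuck; it does not prove a hallucination, since a speaker can genuinely repeat a short phrase.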

itaipee · 2024-04-14T07:59:04Z

large-v2 produces far fewer hallucinations than the original large model.

Please try:
whisper "video.mov" --model large-v2 --language Russian

If that does not help, you will need to read up on hallucinations; there are several posts on the subject.

P.S. Do the 7 minutes of video contain any speech, or just silence or music?


LocalVoidPictures Apr 16, 2024
Author

In fact, I used large-v3, which is what the CLI tool downloads automatically whenever I set --model large. Perhaps it's not as straightforward as it sounds, though? I'll try your suggestion nevertheless.

The audio is full of speech, except for the first 2 minutes (ambience). However, the smaller models pick it up correctly (they just make too many mistakes in the words themselves, which is why I tried a larger model).

I'll check out the posts on hallucinations, of course. However, I believe the issue I raised is important regardless. It's not a random hallucination; it looks like part of a larger issue. You can find this exact phrase all over the internet with a simple search. For example, check here (Ctrl+F for Субтитры подогнал «Симон»). Lines like this introduce unnecessary noise into the model. The question is whether this is the correct place to raise the topic.
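
Until the training data is cleaned, a user-side stopgap is to drop segments whose text is a known subtitle credit line. The sketch below assumes segments are dictionaries with a "text" key, as in the "segments" list returned by Whisper's Python `transcribe()` call; the blocklist and helper names are hypothetical, and the only phrase listed is the one reported in this thread.

```python
import unicodedata

# Hypothetical blocklist: extend it with other credit lines you encounter.
KNOWN_CREDIT_LINES = ["Субтитры подогнал «Симон»"]


def normalize(text):
    """Case-fold the text and keep only letters and digits, so that quote
    style, punctuation, and spacing do not affect the comparison."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return "".join(ch for ch in folded if ch.isalnum())


BLOCKED = {normalize(phrase) for phrase in KNOWN_CREDIT_LINES}


def drop_credit_lines(segments):
    """Return the segments whose whole text is not a blocked credit line."""
    return [seg for seg in segments if normalize(seg["text"]) not in BLOCKED]


if __name__ == "__main__":
    segments = [
        {"text": " Субтитры подогнал «Симон»"},
        {"text": " Привет, мир"},
    ]
    for seg in drop_credit_lines(segments):
        print(seg["text"].strip())
```

This only hides the symptom. A dropped segment may have covered real speech that the model failed to transcribe, so the time range of every removed segment is worth logging and re-checking.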
