
Language detection for large-v3 #136

Closed · andruxa-smirnov opened this issue Nov 15, 2023 · 11 comments

Comments

@andruxa-smirnov

Language detection for large-v3 does not work:

RuntimeError: Given groups=1, weight of size [1280, 128, 3], expected input[1, 80, 3000] to have 128 channels, but got 80 channels instead

For all other models, language detection works fine.

@Jeronymous
Member

Can you please provide the code or command that is failing?
And the version of whisper-timestamped?

Some related issues were fixed on Monday, and whisper-timestamped should now work perfectly with large-v3 (I double-checked: not specifying the language works with that latest model).
It's not clear how you use whisper-timestamped (which is designed for timestamped transcriptions) for language detection...

@andruxa-smirnov
Author

andruxa-smirnov commented Nov 15, 2023

whisper_timestamped 1.13.2

```python
import whisper_timestamped as whisper

audio = whisper.load_audio("test.tmp")

lang_audio = whisper.pad_or_trim(audio)

mel = whisper.log_mel_spectrogram(lang_audio).to("cuda")

# detect the spoken language
model = whisper.load_model("large-v3", download_root="/opt/models/", device="cuda")
_, probs = model.detect_language(mel)

result = whisper.transcribe(model, audio)

import json
print(json.dumps(result, indent=2, ensure_ascii=False))
```

@andruxa-smirnov
Author

whisper - 20231106

@Jeronymous
Member

OK, then you should just call log_mel_spectrogram with the right number of features (it changed in large-v3, from 80 to 128):

mel = whisper.log_mel_spectrogram(lang_audio, model.dims.n_mels).to("cuda")

Nothing related to this repo, as your usage is not documented here.

Side note: why don't you pass the detected language to whisper.transcribe?
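For reference, here is a minimal sketch of the full detect-then-transcribe flow suggested in this comment, assuming the same file path and CUDA device as in the snippet above (the `detected_language` selection via `max` is just an illustration, not part of the library):

```python
import whisper_timestamped as whisper

model = whisper.load_model("large-v3", download_root="/opt/models/", device="cuda")
audio = whisper.load_audio("test.tmp")

# Language detection only looks at the first 30 seconds
lang_audio = whisper.pad_or_trim(audio)

# large-v3 uses 128 mel bins instead of 80, so pass model.dims.n_mels explicitly
mel = whisper.log_mel_spectrogram(lang_audio, model.dims.n_mels).to("cuda")
_, probs = model.detect_language(mel)
detected_language = max(probs, key=probs.get)

# Pass the detected language so transcribe does not run its own detection
result = whisper.transcribe(model, audio, language=detected_language)
```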

@andruxa-smirnov
Author

I want to detect the language first, then transcribe using the top language.

@andruxa-smirnov
Author

andruxa-smirnov commented Nov 15, 2023

> mel = whisper.log_mel_spectrogram(lang_audio, model.dims.n_mels).to("cuda")

Ok. You are absolutely right. Now it works fine. Thank you!

@Jeronymous
Member

> I want to detect the language first, then transcribe using the top language.

Yes, look at my command above.
I suggest you pass the detected language to the model,
both to save computation and to guarantee that the languages are the same (they could differ in some settings, for instance if VAD is used in whisper.transcribe).

And thinking about it, your usage is not optimal, because you possibly compute the mel spectrogram on a very long audio only to use the first 30 seconds in the end (language detection is performed on the first 30 seconds).
It seems easy to add something in whisper-timestamped that exposes the language probability as a new key in the output dictionary.
You can open an issue requesting this feature if you are interested (in a simple, optimized way to do what you do).

@andruxa-smirnov
Author

I don't understand why it's not optimal. I am taking 30 seconds from the audio, then transforming it to a mel spectrogram for language detection. After detecting the language, I pass the audio to transcription.

@Jeronymous
Member

Jeronymous commented Nov 15, 2023

OK my bad, I missed the use of pad_or_trim.

Then there is only the problem of guaranteeing that the detected languages are the same (between model.detect_language and whisper.transcribe). The devil is in the details (butterfly effects...) and there can be some corner cases where the two detect different languages.
You can probably start by checking that they are the same (a quick check is sketched below).

Note that using VAD (an option available in whisper_timestamped.transcribe) can improve language detection a lot, in cases where the first 30 seconds of audio mainly contain silence or background music.
So again, adding the language detection probability "inside" whisper_timestamped.transcribe would be more user-friendly and unlock possible improvements.
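A quick sketch of such a consistency check, reusing the `model`, `audio` and `detected_language` names from the earlier snippet (the variable names are just for illustration):

```python
# Let transcribe run its own language detection (no language= argument),
# then compare it against the language found by model.detect_language
result = whisper.transcribe(model, audio)

if result["language"] != detected_language:
    print(
        f"Warning: detect_language chose '{detected_language}' "
        f"but transcribe used '{result['language']}'"
    )
```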

@andruxa-smirnov
Author

> So again, adding the language detection probability "inside" whisper_timestamped.transcribe would be more user-friendly and unlock possible improvements.

That would be interesting.

@Jeronymous
Member

@andruxa-smirnov I added the feature.
Now, if you don't specify the language of the audio, you will get a new key in the output dictionary with the language probabilities.
So the output will look like:

{
  ...
  "language": "fr",
  "language_probs": {
    "en": 0.027954353019595146,
    "zh": 0.02743500843644142,
    ...
    "su": 3.0119704064190955e-08,
    "yue": 2.2565967810805887e-05
  }
}

and it should work with all options.
You can read https://github.com/linto-ai/whisper-timestamped#options-that-may-improve-results to see options that can improve accuracy (such as VAD).
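For reference, a small sketch of how that new key could be consumed, assuming the output format shown above:

```python
import whisper_timestamped as whisper

model = whisper.load_model("large-v3", download_root="/opt/models/", device="cuda")
audio = whisper.load_audio("test.tmp")

# Do not pass language=, so transcribe detects it and reports the probabilities
result = whisper.transcribe(model, audio)

print("Detected language:", result["language"])

# Show the top 3 candidate languages by probability
top3 = sorted(result["language_probs"].items(), key=lambda kv: kv[1], reverse=True)[:3]
for lang, prob in top3:
    print(f"  {lang}: {prob:.3f}")
```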
