
Language detection for large-v3 #136

Closed · andruxa-smirnov opened this issue Nov 15, 2023 · 11 comments

Comments

@andruxa-smirnov

Language detection for large-v3 does not work:

RuntimeError: Given groups=1, weight of size [1280, 128, 3], expected input[1, 80, 3000] to have 128 channels, but got 80 channels instead

For all other models, language detection works fine.

@Jeronymous
Member

Can you please provide the code or command that is failing?
And the version of whisper-timestamped?

Some related issues were fixed on Monday, and whisper-timestamped should now work perfectly with large-v3 (I double-checked: not specifying the language works with that latest model).
It's not clear how you use whisper-timestamped (which is designed for timestamped transcriptions) for language detection...

@andruxa-smirnov
Author

andruxa-smirnov commented Nov 15, 2023

whisper_timestamped 1.13.2

```python
import whisper_timestamped as whisper

audio = whisper.load_audio("test.tmp")

lang_audio = whisper.pad_or_trim(audio)

mel = whisper.log_mel_spectrogram(lang_audio).to("cuda")

# detect the spoken language
model = whisper.load_model("large-v3", download_root="/opt/models/", device="cuda")
_, probs = model.detect_language(mel)

result = whisper.transcribe(model, audio)

import json
print(json.dumps(result, indent=2, ensure_ascii=False))
```

@andruxa-smirnov
Author

whisper - 20231106

@Jeronymous
Member

OK, then you should just call log_mel_spectrogram with the right number of features (it changed in large-v3, from 80 to 128):

mel = whisper.log_mel_spectrogram(lang_audio, model.dims.n_mels).to("cuda")

Nothing related to this repo, as your usage is not documented here.

Side note: why don't you pass the detected language to whisper.transcribe?
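For reference, here is a minimal sketch of the full detect-then-transcribe flow suggested in this comment, assuming the same file path and CUDA device as in the snippet above (the `detected_language` selection via `max` is just an illustration, not part of the library):

```python
import whisper_timestamped as whisper

model = whisper.load_model("large-v3", download_root="/opt/models/", device="cuda")
audio = whisper.load_audio("test.tmp")

# Language detection only looks at the first 30 seconds
lang_audio = whisper.pad_or_trim(audio)

# large-v3 uses 128 mel bins instead of 80, so pass model.dims.n_mels explicitly
mel = whisper.log_mel_spectrogram(lang_audio, model.dims.n_mels).to("cuda")
_, probs = model.detect_language(mel)
detected_language = max(probs, key=probs.get)

# Pass the detected language so transcribe does not run its own detection
result = whisper.transcribe(model, audio, language=detected_language)
```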

@andruxa-smirnov
Author

I want to detect the language first, then transcribe using the top language.

@andruxa-smirnov
Author

andruxa-smirnov commented Nov 15, 2023

> mel = whisper.log_mel_spectrogram(lang_audio, model.dims.n_mels).to("cuda")

Ok. You are absolutely right. Now it works fine. Thank you!

@Jeronymous
Member

> I want to detect the language first, then transcribe using the top language.

Yes, look at my command above.
I suggest you pass the detected language to the model,
both to save computation and to guarantee that the languages are the same (they could differ in some settings, for instance if VAD is used in whisper.transcribe).

And thinking about it, your usage is not optimal, because you possibly compute the mel spectrogram on a very long audio only to use the first 30 seconds in the end (language detection is performed on the first 30 seconds).
It seems easy to add something in whisper-timestamped that exposes the language probability as a new key in the output dictionary.
You can open an issue requesting this feature if you are interested (in a simple, optimized way to do what you do).

@andruxa-smirnov
Author

I don't understand why it's not optimal. I am taking 30 seconds from the audio, then transforming it to a mel spectrogram for language detection. After detecting the language, I pass the audio to transcription.

@Jeronymous
Member

Jeronymous commented Nov 15, 2023

OK my bad, I missed the use of pad_or_trim.

Then there is only the problem of guaranteeing that the detected languages are the same (between model.detect_language and whisper.transcribe). The devil is in the details (butterfly effects...) and there can be some corner cases where the two detect different languages.
You can probably start by checking that they are the same (a quick check is sketched below).

Note that using VAD (an option available in whisper_timestamped.transcribe) can improve language detection a lot, in cases where the first 30 seconds of audio mainly contain silence or background music.
So again, adding the language detection probability "inside" whisper_timestamped.transcribe would be more user-friendly and unlock possible improvements.
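A quick sketch of such a consistency check, reusing the `model`, `audio` and `detected_language` names from the earlier snippet (the variable names are just for illustration):

```python
# Let transcribe run its own language detection (no language= argument),
# then compare it against the language found by model.detect_language
result = whisper.transcribe(model, audio)

if result["language"] != detected_language:
    print(
        f"Warning: detect_language chose '{detected_language}' "
        f"but transcribe used '{result['language']}'"
    )
```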

@andruxa-smirnov
Author

> So again, adding the language detection probability "inside" whisper_timestamped.transcribe would be more user-friendly and unlock possible improvements.

That would be interesting.

@Jeronymous
Member

@andruxa-smirnov I added the feature.
Now, if you don't specify the language of the audio, you will get a new key in the output dictionary with the language probabilities.
So the output will look like:

{
  ...
  "language": "fr",
  "language_probs": {
    "en": 0.027954353019595146,
    "zh": 0.02743500843644142,
    ...
    "su": 3.0119704064190955e-08,
    "yue": 2.2565967810805887e-05
  }
}

and it should work with all options.
You can read https://github.com/linto-ai/whisper-timestamped#options-that-may-improve-results to see options that can improve accuracy (such as VAD).
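For reference, a small sketch of how that new key could be consumed, assuming the output format shown above:

```python
import whisper_timestamped as whisper

model = whisper.load_model("large-v3", download_root="/opt/models/", device="cuda")
audio = whisper.load_audio("test.tmp")

# Do not pass language=, so transcribe detects it and reports the probabilities
result = whisper.transcribe(model, audio)

print("Detected language:", result["language"])

# Show the top 3 candidate languages by probability
top3 = sorted(result["language_probs"].items(), key=lambda kv: kv[1], reverse=True)[:3]
for lang, prob in top3:
    print(f"  {lang}: {prob:.3f}")
```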
