Language detection for large-v3 #136
Can you please provide the code or command that is failing? Some related issues were fixed on Monday, and whisper-timestamped should now work perfectly with large-v3 (I double-checked: not specifying the language works with that latest model).
whisper_timestamped 1.13.2

```python
import whisper_timestamped as whisper

audio = whisper.load_audio("test.tmp")
lang_audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(lang_audio).to("cuda")
result = whisper.transcribe(model, audio)
import json
```

whisper 20231106
OK, then you should just call
Nothing related to this repo, as your usage is not documented here. Side note: why don't you pass the detected language to
I want to detect the language first, then transcribe using the top language.
```python
mel = whisper.log_mel_spectrogram(lang_audio, model.dims.n_mels).to("cuda")
```

OK, you are absolutely right. Now it works fine. Thank you!
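The fix generalizes beyond this one call: take the mel-bin count from the model's dimensions instead of hard-coding it. As a minimal stdlib sketch of the rule (the `n_mels_for` helper and the name-based lookup are illustrative assumptions, not whisper-timestamped API; real code should read `model.dims.n_mels` as above):

```python
# Illustrative helper (not part of whisper or whisper-timestamped):
# Whisper models up to large-v2 use 80 mel bins, while large-v3
# uses 128. In real code, read the value from model.dims.n_mels.
def n_mels_for(model_name: str) -> int:
    return 128 if model_name.startswith("large-v3") else 80
```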
Yes, look at my command above. And thinking about it, your usage is not optimal, because you possibly extract the mel spectrogram from a very long audio file just to use the first 30 seconds in the end (language detection is performed on the first 30 seconds only).
I don't understand why it's not optimal? I am taking 30 seconds from the audio, then transforming it to a mel spectrogram for language detection. After detecting the language, I pass the audio to transcription.
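That workflow matches what `pad_or_trim` does: language detection only ever sees a fixed 30-second window (480,000 samples at 16 kHz). A conceptual stdlib sketch of the idea, using a plain list in place of a NumPy array:

```python
# Conceptual sketch of whisper.pad_or_trim: cut the signal to a fixed
# length, or zero-pad it when shorter (30 s * 16000 Hz = 480000 samples).
def pad_or_trim_sketch(samples, length=480000):
    if len(samples) >= length:
        return samples[:length]
    return samples + [0] * (length - len(samples))
```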
OK, my bad, I missed the use of
Then there is only the problem of guaranteeing that the detected languages are the same (between
Note that using VAD (an option available in whisper_timestamped.transcribe) can improve language detection a lot, in cases where the first 30 seconds of audio mainly contain silence or background music.
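The benefit of VAD here can be sketched in plain Python: instead of taking the first 30 seconds of raw audio, take the first 30 seconds' worth of frames flagged as speech, so leading silence or music does not consume the detection window (`first_voiced_window` is a toy illustration, not whisper-timestamped's actual VAD):

```python
# Toy illustration of why VAD helps language detection: keep only
# frames flagged as speech, then take the first fixed-size window.
def first_voiced_window(frames, is_speech, window_size):
    voiced = [f for f, s in zip(frames, is_speech) if s]
    return voiced[:window_size]
```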
It would be interesting |
@andruxa-smirnov I added the feature.
and it should work with all options.
Language detection for large-v3 does not work:

```
RuntimeError: Given groups=1, weight of size [1280, 128, 3], expected input[1, 80, 3000] to have 128 channels, but got 80 channels instead
```

For all other models, detection works fine.
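The error is a shape mismatch in the model's first convolution: large-v3's weight has shape [1280, 128, 3], i.e. it expects 128 input channels (mel bins), while `log_mel_spectrogram` defaults to 80. A minimal sketch of the failing check (illustrative only, not PyTorch's actual error path):

```python
# Sketch of the mismatch: large-v3's first conv layer expects 128 mel
# channels, but a default 80-bin spectrogram was passed in.
def check_mel_channels(got: int, expected: int = 128) -> None:
    if got != expected:
        raise RuntimeError(
            f"expected input to have {expected} channels, "
            f"but got {got} channels instead"
        )
```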