I wanted to use WhisperX to do forced alignment on the Mozilla Common Voice German dataset, but the words are often cut off or the segments do not align at all.
Additionally, some audio tracks are recognized as Farsi instead of German.
Is this because of the short duration of these clips (2-5 seconds each)?
And how can I improve the accuracy?
Is the accuracy of the English models (for English audio) better?
I am using the Python API (result = model.transcribe(audio_file)) and was not aware of a parameter of the transcribe function that would let me enforce a specific language.
I was able to improve the performance to a usable level by setting the extend_duration parameter to 0.1, but it still cuts off the beginning of a word from time to time.
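For reference, a minimal sketch of the two changes described above, assuming the WhisperX Python API as it existed at the time of this issue (the load_model / load_align_model / align names are from WhisperX, and the extend_duration parameter has been removed or renamed in some later releases, so check your installed version):

```python
# Sketch: force the transcription language and pad alignment segments
# in WhisperX. Assumes a WhisperX version that still exposes
# extend_duration on whisperx.align (true for the version in this issue).
import whisperx

device = "cuda"
audio_file = "clip_de.wav"  # hypothetical Common Voice clip

model = whisperx.load_model("large-v2", device)

# Passing language="de" skips automatic language detection, which is
# what misidentifies short German clips as Farsi.
result = model.transcribe(audio_file, language="de")

# Load the German alignment model and run forced alignment.
model_a, metadata = whisperx.load_align_model(language_code="de", device=device)
aligned = whisperx.align(
    result["segments"], model_a, metadata, audio_file, device,
    extend_duration=0.1,  # pad each segment by 0.1 s to reduce clipped word onsets
)
```

This requires downloading both the Whisper model and a German wav2vec2 alignment model on first run, so it is not a drop-in snippet for offline environments.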