finetuning on own dataset #156
Comments
Hey @raghav-menon,

Did you play around with the hyper-parameters a bit to see what works / doesn't work well? One important thing to notice with

Also I would definitely keep utterances that are <4s. I usually filter out only utterances that are shorter than 1s (or even keep those as well). It's very normal that the validation loss goes up again whereas the WER continues going down.

In terms of hyperparameters, I would try to play around with the

Also it might help quite a bit to use data augmentation (@anton-l - do you maybe have a link to some good techniques here?)
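For concreteness, here is a rough sketch of a starting configuration in that spirit (the specific values and the output directory name are illustrative assumptions, close to the blog-post defaults, and not tuned for this dataset):

```python
from transformers import Wav2Vec2ForCTC, TrainingArguments

# Sketch of a starting point for CTC fine-tuning of XLSR-53 on a small dataset.
# All values are illustrative and worth sweeping.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    attention_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.0,
    mask_time_prob=0.05,      # SpecAugment-style time masking
    layerdrop=0.1,
    ctc_loss_reduction="mean",
)
model.freeze_feature_extractor()  # keep the convolutional feature encoder frozen

training_args = TrainingArguments(
    output_dir="./wav2vec2-finetuned",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    num_train_epochs=30,
    fp16=True,
    learning_rate=3e-4,       # values in the 1e-4 .. 4e-4 range are worth trying
    warmup_steps=500,
    evaluation_strategy="steps",
    eval_steps=400,
    save_steps=400,
    logging_steps=100,
)
```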
Hi @raghav-menon! I found that these hyperparameters work pretty well for most languages in CommonVoice:

Also, these augmentations from Audiomentations gave a bit more stability:

Keeping shorter utterances should not be a problem either. It's more important to catch any incorrectly transcribed clips, since they can greatly destabilize the CTC loss.
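As a rough illustration of a waveform-level Audiomentations pipeline (the specific transforms and parameter ranges below are assumptions for illustration, not necessarily the ones used in the experiments above):

```python
import numpy as np
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch

# Augmentations are applied to the raw 16 kHz mono waveform before feature extraction.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.3),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.3),
])

waveform = np.random.uniform(-1.0, 1.0, 16_000).astype(np.float32)  # stand-in for a real clip
augmented = augment(samples=waveform, sample_rate=16_000)
```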
Hello @patrickvonplaten, Thank you for your response. I did play around with the hyperparameters; with even a slight deviation from the given values, the WER remains at 1 throughout, so not much progress on that end. The data is real-time data from radio transmission recordings and hence not of studio quality. I will, as you have mentioned, filter out the ones which are <1s, keep the rest, and try it. I had also tried pretraining wav2vec2 with untranscribed data, but it looks like even the Colab Pro memory is not enough. I will let you know how it progresses. Thanks. Regards, Raghav
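For what it's worth, a few TrainingArguments that usually reduce GPU memory pressure when training on a single Colab GPU (a sketch with illustrative values; full self-supervised pretraining may still not fit even with these):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-low-memory",
    per_device_train_batch_size=4,    # small per-device batch...
    gradient_accumulation_steps=8,    # ...with accumulation to keep the effective batch size
    gradient_checkpointing=True,      # trade extra compute for lower activation memory
    fp16=True,                        # mixed precision on GPU
    group_by_length=True,             # batch clips of similar length to reduce padding
)
```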
Hello @anton-l, Thanks for your suggestions. How did the trained model fare on the test data when you experienced increasing validation loss and decreasing WER? Just curious. I did not bother to run the model on the test data, as the model's final WER was 76% and far worse than my HMM-GMM, where I obtained a WER of 60. The best WER I had obtained for this data was around 50, with a TDNN architecture where I had included a little bit of self-supervised learning as well. Just to let you know, my data is not studio quality, as these are real-time radio transmission recordings. I am wondering what the impact of noise is on the wav2vec2 feature extractor, as it makes a huge difference in WER. I will indeed try out your suggestions and let you know. Thanks. Regards, Raghav
@raghav-menon the test data followed the same WER dynamic as the validation data. I also had very limited success working with noisy speech (YouTube & radio). With a frozen feature encoder the model stopped converging at around 40 WER, even with hundreds of hours of speech. At the moment, the most promising pretrained model for noisy speech is Wav2Vec2-Robust (https://huggingface.co/facebook/wav2vec2-large-robust), which may or may not work for you, since the training data for it was English-only.
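Starting from that checkpoint only changes the model identifier in the usual CTC fine-tuning recipe; a minimal sketch (the dropout and masking values here are illustrative assumptions):

```python
from transformers import Wav2Vec2ForCTC

# Same CTC fine-tuning setup as before, but initialized from the robust checkpoint,
# which was pretrained on noisier (English-only) data.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-robust",
    attention_dropout=0.1,
    hidden_dropout=0.1,
    mask_time_prob=0.05,
    ctc_loss_reduction="mean",
)
model.freeze_feature_extractor()
```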
Hello Patrick @patrickvonplaten
Thanks for the nice post on how to finetune wav2vec2. It was quite intuitive and simple.
I have been trying to fine-tune "facebook/wav2vec2-large-xlsr-53" and also "facebook/wav2vec2-base" on my own Somali dataset, which I had used previously in my research [the best WER obtained was approximately 50, using a combined CNN-LSTM system]. The dataset contains an extremely low amount of data for fine-tuning (approximately an hour). In terms of utterances, I have approximately 500-odd utterances, and this is after cleaning and keeping only utterances whose length is > 4s. I tried fine-tuning in the way described in your post, but unfortunately the best WER I could obtain is 76 [my HMM-GMM system was giving me a WER of approximately 63 on the same dataset, where I had included all the sentences and not just the ones > 4s]. Something else I noticed was that the validation loss decreases to a certain point and then starts increasing, but the WER does not seem to increase to reflect the increase in validation loss. I am not able to comprehend what is going on. Would you be able to comment on what is happening, and am I overlooking anything important here? I was interested in wav2vec2 as there were claims that it could be fine-tuned with as little as 10 minutes of data.
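For reference, the duration filter described above can be sketched roughly like this with 🤗 Datasets (the loading call, directory layout, and column names are illustrative assumptions, not my actual preprocessing code):

```python
from datasets import Audio, load_dataset

# Hypothetical local layout: audio clips plus transcriptions loaded as an audiofolder dataset.
ds = load_dataset("audiofolder", data_dir="somali_clips", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

MIN_SECONDS = 4.0  # keep only utterances longer than 4 s, as described above

def long_enough(example):
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"] > MIN_SECONDS

ds = ds.filter(long_enough)
```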
Thanks in advance
Regards,
Raghav