finetuning on own dataset #156

Open · raghav-menon opened this issue Sep 16, 2021 · 5 comments

@raghav-menon commented Sep 16, 2021

Hello Patrick @patrickvonplaten

Thanks for the nice post on how to finetune wav2vec2. It was quite intuitive and simple.

I have been trying to fine-tune "facebook/wav2vec2-large-xlsr-53" and also "facebook/wav2vec2-base" on my own Somali dataset, which I had used previously in my research (the best WER obtained was approximately 50, using a combined CNN-LSTM system). The dataset is extremely small for fine-tuning, roughly an hour of audio. In terms of utterances, I have approximately 500-odd utterances, and this is after cleaning and keeping only utterances longer than 4s. I tried fine-tuning the way your post describes, but unfortunately the best WER I could obtain was 76 (my HMM-GMM system gave a WER of approximately 63 on the same dataset, with all sentences included, not just those > 4s).

Something else I noticed was that the validation loss decreases to a certain point and then starts increasing, but the WER does not increase to reflect the rising validation loss. I am not able to comprehend what is going on. Would you be able to comment on this, and on whether I am overlooking anything important? I was interested in wav2vec2 because of claims that it could be fine-tuned with as little as 10 minutes of data.

Thanks in advance

Regards,
Raghav

@patrickvonplaten (Contributor)

Hey @raghav-menon,

Did you play around with the hyperparameters a bit to see what works / doesn't work well? One important thing to note about facebook/wav2vec2-large-xlsr-53 is that it was pretrained on read-out audio, meaning the data was quite clean. Is this also the case for your dataset?

Also, I would definitely keep utterances that are < 4s. I usually filter out only utterances that are shorter than 1s (or even keep those as well).
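
For example, a minimal sketch of that kind of length filter with the `datasets` library (the CSV file and column layout are placeholders for your own data-loading setup):

```python
from datasets import Audio, load_dataset

# Placeholder: a CSV with an "audio" file-path column plus a transcript column.
dataset = load_dataset("csv", data_files="somali_train.csv")["train"]
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

def long_enough(example):
    # Duration in seconds = number of samples / sampling rate.
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"] >= 1.0

# Keep everything >= 1s instead of the stricter > 4s cut-off.
dataset = dataset.filter(long_enough)
```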

It's very normal that the validation loss goes up again whereas the WER continues going down.

In terms of hyperparameters, I would try playing around with learning_rate, batch_size, and hidden_dropout -> those seem to be quite correlated with the final WER. Here is a nice graphic about hyperparameters for fine-tuning the model on Turkish: https://wandb.ai/wandb/xlsr/sweeps/p23j88jo?workspace=user-borisd13

Also, it might help quite a bit to use data augmentation (@anton-l - do you maybe have a link to some good techniques here?)

@anton-l (Member) commented Sep 21, 2021

Hi @raghav-menon!
My experience with low-resource languages was quite similar to yours (increasing validation loss + decreasing WER). I'm speculating that this is due to overfitting to high-frequency words.

I found that these hyperparameters work pretty well for most languages in CommonVoice:
attention_dropout=0.2, activation_dropout=0.1, hidden_dropout=0.05, final_dropout=0.1, feat_proj_dropout=0.05, mask_time_prob=0.05, layerdrop=0.04, learning_rate=3e-4 + batch_size=128 (can be obtained e.g. with batch_size=16, gradient_accumulation_steps=8)
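
For reference, a minimal sketch of plugging those values into `transformers` (the checkpoint and output directory are placeholders, and the tokenizer/data-collator/Trainer wiring from the fine-tuning post is omitted):

```python
from transformers import TrainingArguments, Wav2Vec2ForCTC

# Dropout/masking values as listed above. The CTC head is newly initialized,
# so vocab_size and pad_token_id should come from your own tokenizer.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",  # placeholder checkpoint
    attention_dropout=0.2,
    activation_dropout=0.1,
    hidden_dropout=0.05,
    final_dropout=0.1,
    feat_proj_dropout=0.05,
    mask_time_prob=0.05,
    layerdrop=0.04,
)

# Effective batch size of 128 = 16 per device x 8 accumulation steps.
training_args = TrainingArguments(
    output_dir="wav2vec2-somali",  # placeholder
    per_device_train_batch_size=16,
    gradient_accumulation_steps=8,
    learning_rate=3e-4,
)
```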

Also these augmentations from Audiomentations gave a bit more stability:
AddGaussianNoise(min_amplitude=0.0001, max_amplitude=0.005, p=0.2) and PitchShift(min_semitones=-1, max_semitones=2, p=0.2)
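
Wired up, that looks roughly like the following (each transform fires independently with probability p=0.2; the dummy waveform is only there to keep the snippet self-contained):

```python
import numpy as np
from audiomentations import AddGaussianNoise, Compose, PitchShift

augment = Compose([
    AddGaussianNoise(min_amplitude=0.0001, max_amplitude=0.005, p=0.2),
    PitchShift(min_semitones=-1, max_semitones=2, p=0.2),
])

# Dummy 1 s mono waveform at 16 kHz; in practice, run your training audio
# through `augment` before the feature extractor.
waveform = np.random.uniform(-1.0, 1.0, 16_000).astype(np.float32)
augmented = augment(samples=waveform, sample_rate=16_000)
```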

Keeping shorter utterances should not be a problem either. It's more important to catch any incorrectly transcribed clips, since they can greatly destabilize the CTC loss (if you get an inf loss with ctc_zero_infinity=False, that's the likely cause).
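
If inf losses do show up, `ctc_zero_infinity` can be flipped on in the config so a handful of bad clips zero out instead of derailing training (sketch; the checkpoint is a placeholder):

```python
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",  # placeholder checkpoint
    ctc_zero_infinity=True,  # replace inf CTC losses with 0 instead of propagating them
)
```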

@raghav-menon (Author)

Hello @patrickvonplaten,

Thank you for your response. I did play around with the hyperparameters; with even a slight deviation from the given values, the WER remains at 1 throughout, so not much progress on that end. The data is real recordings of radio transmissions and hence not of studio quality.

I will, as you mentioned, filter out only the utterances that are < 1s, keep the rest, and try it. I had also tried pretraining wav2vec2 with untranscribed data, but it looks like even the Colab Pro memory is not enough.

I will let you know how it progresses.

Thanks.

Regards,
Raghav

@raghav-menon (Author)

Hello @anton-l,

Thanks for your suggestions. Just curious: how did the trained model fare on the test data when you experienced increasing validation loss and decreasing WER? I did not bother to run the model on the test data, as the final WER was 76 and far worse than my HMM-GMM, where I obtained a WER of 60. The best WER I had obtained for this data was around 50, with a TDNN architecture where I had included a little bit of self-supervised learning as well. Just to let you know, my data is not studio quality, as these are real recordings of radio transmissions. I am wondering what the impact of noise is on the wav2vec2 feature extractor, as the difference in WER is huge.

I will indeed try out your suggestions and let you know.

Thanks.

Regards,
Raghav

@anton-l (Member) commented Sep 22, 2021

@raghav-menon the test data followed the same WER dynamic as the validation one.

I also had very limited success working with noisy speech (YouTube & radio). With a frozen feature encoder, the model stopped converging at around 40 WER even with hundreds of hours of speech.

At the moment, the most promising pretrained model for noisy speech is Wav2Vec-Robust (https://huggingface.co/facebook/wav2vec2-large-robust), which may or may not work for you, since the training data for it was English-only.
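
Loading it for CTC fine-tuning works like the other checkpoints; a minimal sketch, with the feature encoder frozen as in the experiment above:

```python
from transformers import Wav2Vec2ForCTC

# The robust checkpoint is pretrained-only, so the CTC head is randomly
# initialized and vocab_size should match your own tokenizer.
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-robust")

# Freeze the convolutional feature encoder so only the transformer fine-tunes
# (newer transformers versions call this freeze_feature_encoder()).
model.freeze_feature_extractor()
```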
