This is the final setup that was used to train the best fine-tuned Whisper model in the HuggingFace Fine-Tuning Event 2022.
The first modification was to get a bigger batch_size without gradient_accumulation_steps by using DeepSpeed.
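A minimal sketch of what that can look like with the HF Trainer; the ZeRO stage, batch size, and config values below are assumptions for illustration, not the exact event configuration:

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical values for illustration; "auto" lets DeepSpeed inherit
# the matching settings from the TrainingArguments below.
ds_config = {
    "zero_optimization": {"stage": 2},  # ZeRO-2: shard optimizer state + gradients
    "fp16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
}

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-finetuned",   # hypothetical path
    per_device_train_batch_size=64,     # the bigger batch that now fits
    gradient_accumulation_steps=1,      # no accumulation needed
    fp16=True,
    deepspeed=ds_config,                # accepts a dict or a path to a JSON config
)
```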
To make it run inside Docker, I used the guide from Zihao's blog post.
The idea came from Bayar. Whisper works on 30-second inputs, but each Common Voice sample holds only around 3-5 seconds of audio. We can concatenate audio and text from several samples into fewer, denser ones: training runs faster and the model learns a lot more from each sample.
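A rough sketch of that packing step (the function name and details are mine, not the actual event code):

```python
import numpy as np

def pack_samples(batch, max_seconds=30, sampling_rate=16_000):
    """Greedily pack short clips and their transcripts into ~30 s chunks.

    Column names ("audio", "sentence") follow Common Voice; meant to be
    applied with dataset.map(..., batched=True).
    """
    max_len = max_seconds * sampling_rate
    packed_audio, packed_text = [], []
    cur_audio, cur_text, cur_len = [], [], 0
    for audio, text in zip(batch["audio"], batch["sentence"]):
        clip = audio["array"]
        if cur_audio and cur_len + len(clip) > max_len:
            packed_audio.append(np.concatenate(cur_audio))
            packed_text.append(" ".join(cur_text))
            cur_audio, cur_text, cur_len = [], [], 0
        cur_audio.append(clip)
        cur_text.append(text)
        cur_len += len(clip)
    if cur_audio:
        packed_audio.append(np.concatenate(cur_audio))
        packed_text.append(" ".join(cur_text))
    # Raw arrays go straight to the feature extractor downstream.
    return {"audio_array": packed_audio, "sentence": packed_text}

# e.g.: dataset.map(pack_samples, batched=True, batch_size=64,
#                   remove_columns=dataset.column_names)
```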
Based on some details of how the Large-v2 model was trained in the Whisper paper, I have some ideas to try in the next steps:
- SpecAugment ("SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition"); see the first sketch after this list
- More data collected from other datasets (google/fleurs, Common Voice 12/13), combined with Farsipal's multistreaming modification; see the dataset-mixing sketch after this list
- PyTorch 2.0 optimization; see the torch.compile sketch after this list
- Collect a custom dataset to get more training data
  - filter YouTube for Creative Commons videos with subtitles
  - download videos without subtitles, generate transcripts with Whisper, and fix them manually with ASR corpus creator
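A minimal sketch of the SpecAugment idea, using torchaudio's masking transforms on Whisper's log-mel input features (the mask sizes are illustrative, not the paper's values):

```python
import torch
import torchaudio.transforms as T

# Illustrative mask sizes; the SpecAugment paper tunes these per setup.
freq_mask = T.FrequencyMasking(freq_mask_param=27)  # mask up to 27 mel bins
time_mask = T.TimeMasking(time_mask_param=100)      # mask up to 100 frames

def spec_augment(input_features: torch.Tensor) -> torch.Tensor:
    """Mask random frequency bands and time spans of a (batch, mels, frames)
    tensor, e.g. Whisper's 80x3000 log-mel features."""
    return time_mask(freq_mask(input_features))
```

Recent transformers versions also expose SpecAugment-style masking directly through WhisperConfig (apply_spec_augment, mask_time_prob and friends), which may be simpler than rolling your own.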
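For the dataset mixing, a rough sketch with the stock interleave_datasets API; the language configs and column alignment here are my assumptions, and Farsipal's actual multistreaming code likely differs:

```python
from datasets import Audio, interleave_datasets, load_dataset

# Language/config names are placeholders; pick your event language.
cv = load_dataset("mozilla-foundation/common_voice_12_0", "mn", split="train")
fleurs = load_dataset("google/fleurs", "mn_mn", split="train")

# Give both streams an identical schema before mixing.
fleurs = fleurs.rename_column("transcription", "sentence")
cv = cv.remove_columns([c for c in cv.column_names if c not in ("audio", "sentence")])
fleurs = fleurs.remove_columns([c for c in fleurs.column_names if c not in ("audio", "sentence")])
cv = cv.cast_column("audio", Audio(sampling_rate=16_000))
fleurs = fleurs.cast_column("audio", Audio(sampling_rate=16_000))

# "all_exhausted" keeps sampling until every dataset has been fully seen.
mixed = interleave_datasets([cv, fleurs], stopping_strategy="all_exhausted")
```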
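And for the PyTorch 2.0 point, the headline feature is torch.compile; whether it speeds up Whisper fine-tuning end to end is something to measure, not a given:

```python
import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# PyTorch >= 2.0: JIT-compile the forward pass; "reduce-overhead" trades
# longer compile time for lower per-step launch overhead.
model = torch.compile(model, mode="reduce-overhead")
```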
Thanks to:
- HuggingFace crew - for the event itself and all the support on Discord
- LambdaLabs - for all the GPU hours (insane 20k+ !!!)
- OpenAI - for the Whisper model