In order to quickly train in Google Colab, audio files were locally converted into melspectograms using the generate_mel_spec script and then stored as a HDF5 file which was then read into memory for traning and inference (reading from google drive while training was about 5x slower than reading from RAM).
- Melspectogram
- 32kHz sampling rate
- 128 mels
- Mel power 2
- Pretrained efficientNet-B4, 2x64 fully connected layers with 0.3 dropout
- Input size 380x380
- No image augmentations
- 200 epochs
- Batch size of 32
- Random 4 second crop from 6 second window around sample (hard labels)
- 5-fold cross validation
- Projections made on every 4 second window for the 60 second sample using a sliding window of 1 second.
- These results were then adjusted by using a 3 second running average to smoothen out the estimates (raised LB score by roughly 0.005)
- Max value accross all samples and folds for each class gave the final score for each recording
- EfficientNet-B4
- EfficientNet-B3
- ResNest50
- ResNest101