What is it?
This is the repo for our Portfolio project for DSR. The goal is to train a model that can reliably classify bird species from their songs and make it available as a webservice/app. Our motivation is twofold: we want to contribute to the development of tools for automated biodiversity monitoring and provide bird enthusiasts with a handy tool.
The bird recordings are downloaded from the Xeno-Canto database with
The audio recordings vary greatly in quality and in the number of species present. Assuming that the foreground species is usually the loudest in a recording, we follow the methodology described in Sprengel et al., 2016 to extract signal sections from a noisy background. Our script localizes spectrogram sections whose amplitudes exceed 3 times the frequency-axis and time-axis medians, allowing us to extract the audio sections most likely to contain foreground bird vocalizations. We run the script over all recordings in our storage and store the timestamps of the signal sections in our database.
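The thresholding step described above can be sketched as follows. This is a minimal illustration rather than the script used in this repo: the function names are hypothetical, the 22050 Hz sample rate is an assumption, and any further cleanup of the mask is omitted.

```python
import numpy as np

def signal_mask(spec, factor=3.0):
    """Mark pixels louder than factor x both their row and column median."""
    spec = np.asarray(spec, dtype=float)
    row_median = np.median(spec, axis=1, keepdims=True)  # one value per frequency band
    col_median = np.median(spec, axis=0, keepdims=True)  # one value per time frame
    return (spec > factor * row_median) & (spec > factor * col_median)

def signal_frames(spec, factor=3.0):
    """Boolean vector that is True for time frames with any marked pixel."""
    return signal_mask(spec, factor).any(axis=0)

def frames_to_sections(frames, hop_length=512, sample_rate=22050):
    """Turn runs of True frames into (start, end) timestamps in seconds.

    sample_rate=22050 is an assumption, not a value taken from the repo.
    """
    padded = np.concatenate(([False], frames, [False]))
    # Indices where the vector flips between False and True.
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = changes[::2], changes[1::2]
    scale = hop_length / sample_rate
    return [(float(s * scale), float(e * scale)) for s, e in zip(starts, ends)]
```

Given a magnitude spectrogram `spec` with shape (frequency bands, time frames), `frames_to_sections(signal_frames(spec))` returns the kind of timestamp list that could be stored per recording.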
Initially, our aim was to store only raw audio and integrate the preprocessing of spectrograms into a custom PyTorch Dataset. That way we would have retained flexibility in terms of the spectrogram functions and parameters. But despite extensive efforts to cut down preprocessing time, data loading remained the main bottleneck in our training times.
We therefore decided to precompute spectrogram slices according to a procedure common in the literature: 5 second slices with 2.5 second overlap were first converted into spectrograms using STFT (FFT window size: 2048, hop length: 512) and then passed through a 256-band Mel filterbank, resulting in Mel-Spectrograms with a dimension of 256 x 216 x 1 as input to our models.
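To make the slice arithmetic concrete, here is a small sketch of the slicing and of where the 216 time frames come from. It assumes a sample rate of 22050 Hz and a centered STFT that yields `1 + n_samples // hop_length` frames; neither is stated above, but together they reproduce the 216 frames per slice. The function names are illustrative only.

```python
import numpy as np

SAMPLE_RATE = 22050      # assumption: not stated in this README
SLICE_SECONDS = 5.0      # slice length from the text
OVERLAP_SECONDS = 2.5    # overlap between consecutive slices from the text
HOP_LENGTH = 512         # STFT hop length from the text

def slice_audio(samples, sample_rate=SAMPLE_RATE):
    """Cut a 1-D signal into 5 s slices that overlap by 2.5 s."""
    size = int(SLICE_SECONDS * sample_rate)
    step = int((SLICE_SECONDS - OVERLAP_SECONDS) * sample_rate)
    # Only full-length slices are kept; a trailing remainder is dropped.
    return [samples[start:start + size]
            for start in range(0, len(samples) - size + 1, step)]

def stft_frame_count(n_samples, hop_length=HOP_LENGTH):
    """Number of frames of a centered STFT (assumed convention)."""
    return 1 + n_samples // hop_length

slices = slice_audio(np.zeros(10 * SAMPLE_RATE))  # 10 s of silence
n_frames = stft_frame_count(len(slices[0]))       # 1 + 110250 // 512
```

Under these assumptions, a 10 second recording yields three slices of 110250 samples each, and each slice maps to 216 STFT frames, which matches the 256 x 216 input once the 256 Mel bands are applied.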
For our early approaches we rebuilt models we found in the literature, all of which treat the audio classification task as an image problem. Through experiments with non-square kernels we tried to take into account the different nature of information on the time axis versus the frequency axis of the spectrogram, leading to improved results (Hawk, Zilpzalp). A breakthrough came when we decided to couple a CNN with an LSTM, designing the former to pick up on timbral features and the latter to detect time-dependent patterns (Puffin).
We built the following models:
- Bulbul: (Grill & Schlüter, 2017)
- Sparrow: (Grill & Schlüter, 2017)
- SparrowExp: (Schlüter, 2018)
- Zilpzalp: own creation by Tim
- Hawk: (Pons et al., 2018)
- Puffin: own creation by Tim
Configure your run in config.py
Running a job locally:
python train_precomputed.py config.py