# Train a CNN to detect bird vocalizations

This notebook demonstrates how to train a CNN deep learning model with OpenSoundscape. We will train the CNN to recognize bird vocalizations in spectrogram representations of audio data.

#### package imports

In [15]:
# OpenSoundscape imports
from opensoundscape import BoxedAnnotations, CNN

# general purpose packages
import pandas as pd
import numpy as np
from pathlib import Path
import re # for regex matching of annotation and audio files
import random 
from glob import glob
import sklearn.model_selection # provides train_test_split, used below

random.seed(0)
np.random.seed(0)

from matplotlib import pyplot as plt
plt.rcParams['figure.figsize']=[10,3] #set default graphic size
%config InlineBackend.figure_format = 'retina'

## Step 1: Prepare CNN training data from Raven-annotated audio

If you have listened to some of your field recordings and annotated them for the presence of your sounds of interest, you can use those annotations as training data for a classifier in OpenSoundscape. This notebook shows the data processing steps that turn audio annotations into the format OpenSoundscape uses for model training. In this example we use a set of recordings that were annotated with Raven Pro 1.6.4 (Bioacoustics Research Program 2022):

<i>An annotated set of audio recordings of Eastern North American birds containing frequency, time, and species information. </i><br>
Lauren M. Chronister,  Tessa A. Rhinehart,  Aidan Place,  Justin Kitzes <br>
https://doi.org/10.1002/ecy.3329 


## Download instructions
Download the datasets to your current working directory and unzip them. You can either download both `annotation_Files.zip` and `wav_Files.zip` from the URL below or run the cell below.

https://datadryad.org/stash/dataset/doi:10.5061/dryad.d2547d81z

In [None]:
!wget -O annotation_Files.zip https://datadryad.org/stash/downloads/file_stream/641805
!wget -O wav_Files.zip https://datadryad.org/stash/downloads/file_stream/641808
!unzip annotation_Files.zip
!unzip wav_Files.zip

### Load Raven annotations and create label dataframes
The cells below read the Raven selection files and turn them into dataframes that we can use as training and test sets for our model. We will convert the annotation files into a dataframe of one-hot labels for each 3-second clip: a label is 1 if the species is present in that clip and 0 if it is not.

In [7]:
# path to the folder where the dataset was downloaded and unzipped
dataset_path = Path("./ecy3329-sup-0001-datas1/").resolve() 

# make a list of all of the selection table files
selections = glob(f"{dataset_path}/Annotation_Files/*/*.txt")

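# Audio files have the same names as selection files, but a different folder and extension
# (this assumes .mp3 audio; adjust the extension if your download contains .wav files)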
audio_files = [f.replace('Annotation_Files','Recordings').replace('.Table.1.selections.txt','.mp3') for f in selections]
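The path mapping above can be checked on a single example before loading anything. This is a minimal sketch: the helper name `audio_path_for` and the example file name are hypothetical, and the `.mp3` extension simply mirrors the cell above.

```python
from pathlib import Path

def audio_path_for(selection_path):
    """Map a Raven selection table path to the matching audio path."""
    return (
        selection_path
        .replace("Annotation_Files", "Recordings")
        .replace(".Table.1.selections.txt", ".mp3")
    )

# hypothetical selection table path, for illustration only
example = "/data/Annotation_Files/Recording_1/example_clip.Table.1.selections.txt"
mapped = audio_path_for(example)
print(mapped)

# list any audio files that the mapping points to but that are not on disk
missing = [p for p in [mapped] if not Path(p).exists()]
print(f"{len(missing)} audio file(s) not found")
```

Running the same existence check over the full `audio_files` list is a quick way to catch a wrong folder name or extension before the annotations are loaded.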

#### Loading Raven annotations
The BoxedAnnotations class stores frequency-time annotations in a table. It can parse and load Raven-formatted selection tables with the `from_raven_files()` method. We pass the method a list of Raven files and the corresponding list of audio files.


In [8]:
all_annotations = BoxedAnnotations.from_raven_files(selections,audio_files)
all_annotations.df.head(2)

| | audio_file | raven_file | annotation | start_time | end_time | low_f | high_f | View | Selection | Channel |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | /Users/SML161/demos-for-opso/ecy3329-sup-0001-... | /Users/SML161/demos-for-opso/ecy3329-sup-0001-... | BTNW | 0.913636 | 2.202273 | 4635.1 | 7439.0 | Spectrogram 1 | 1 | 1 |
| 1 | /Users/SML161/demos-for-opso/ecy3329-sup-0001-... | /Users/SML161/demos-for-opso/ecy3329-sup-0001-... | EATO | 2.236363 | 2.693182 | 3051.9 | 4101.0 | Spectrogram 1 | 2 | 1 |


This table contains one row per annotation created in Raven Pro.
We can easily convert this annotation format to a table of 0 (absent) or 1 (present) labels for a series of time-regions in each audio file. Each class will be a separate column. We can specify a list of classes or let the function automatically create one class for each unique annotation in the Raven selection tables. 

Here, we need to make some choices. First, how long should each audio "clip" that we generate a label for be (`clip_duration`), and how many seconds of overlap should there be between consecutive clips (`clip_overlap`)? Here, we'll choose 3-second clips with zero overlap.

Second, how much does an annotation need to overlap with a clip for us to consider the annotation to apply to the clip (`min_label_overlap`)? For example, if an annotation spans 1-3.02 seconds, we might not want to consider it a part of a clip that spans 3-6 seconds, since only 0.02 seconds of that annotation overlap with the clip. Here, we'll choose a `min_label_overlap` of 0.25 seconds.

In [9]:
clip_labels = all_annotations.one_hot_clip_labels(
    clip_duration=3,
    clip_overlap=0,
    min_label_overlap=0.25)
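To make the overlap rule concrete, here is a minimal pure-Python sketch of the idea. It is not OpenSoundscape's implementation: the helper names `clip_windows` and `one_hot_labels` and the toy annotations are hypothetical, and this sketch simply drops any final partial clip.

```python
def clip_windows(total_duration, clip_duration, clip_overlap=0.0):
    """Return (start, end) windows that fit entirely within total_duration."""
    step = clip_duration - clip_overlap
    windows = []
    start = 0.0
    while start + clip_duration <= total_duration + 1e-9:
        windows.append((start, start + clip_duration))
        start += step
    return windows

def one_hot_labels(annotations, classes, windows, min_label_overlap):
    """Label a clip 1 for a class if any annotation overlaps it enough."""
    rows = []
    for clip_start, clip_end in windows:
        row = {c: 0 for c in classes}
        for label, ann_start, ann_end in annotations:
            # seconds shared by the annotation and the clip (negative if disjoint)
            overlap = min(clip_end, ann_end) - max(clip_start, ann_start)
            if label in row and overlap >= min_label_overlap:
                row[label] = 1
        rows.append(row)
    return rows

# toy annotations: (label, start_time, end_time) in seconds
annotations = [("BTNW", 0.9, 2.2), ("EATO", 1.0, 3.02)]
classes = ["BTNW", "EATO"]
windows = clip_windows(total_duration=6.0, clip_duration=3.0, clip_overlap=0.0)
labels = one_hot_labels(annotations, classes, windows, min_label_overlap=0.25)
for window, row in zip(windows, labels):
    print(window, row)
```

The EATO annotation reaches only 0.02 seconds into the 3-6 second clip, which is below the 0.25-second threshold, so that clip is labeled 0 for both classes.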

### Split annotated data into training and test sets

Our plan is to train a machine learning model on the files in folders `Recording_1`, `Recording_2` and `Recording_3` and test its performance on recordings in the folder `Recording_4`. Let's separate the labels into two sets called `training_set` and `test_set`. We'll use the training set to train the CNN, and hold out the test set to check how the model performs on data that it has not seen during training.


In [10]:
# select all files from Recording_4 as a test set
mask = clip_labels.reset_index()['file'].apply(lambda x: 'Recording_4' in x).values
test_set = clip_labels[mask]

# all other files will be used as a training set
training_set = clip_labels.drop(test_set.index)
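The same folder-based split can be sketched without pandas to show what it guarantees. This is an illustration under stated assumptions: the index tuples and file names are made up, `Recording_40` is a hypothetical folder that does not exist in this dataset, and `in_folder` is a hypothetical helper that matches whole path components instead of substrings.

```python
from pathlib import PurePosixPath

def in_folder(file_path, folder_name):
    """True if folder_name is an exact component of file_path."""
    return folder_name in PurePosixPath(file_path).parts

# hypothetical (file, start_time, end_time) index entries
clip_index = [
    ("/data/Recordings/Recording_1/clip_a.mp3", 0.0, 3.0),
    ("/data/Recordings/Recording_1/clip_a.mp3", 3.0, 6.0),
    ("/data/Recordings/Recording_4/clip_b.mp3", 0.0, 3.0),
    ("/data/Recordings/Recording_40/clip_c.mp3", 0.0, 3.0),
]

test_keys = [key for key in clip_index if in_folder(key[0], "Recording_4")]
train_keys = [key for key in clip_index if not in_folder(key[0], "Recording_4")]

# no audio file should contribute clips to both sets
train_files = {key[0] for key in train_keys}
test_files = {key[0] for key in test_keys}
assert train_files.isdisjoint(test_files)
print(len(train_keys), "training clips,", len(test_keys), "test clips")
```

With only four recording folders, the substring test used in the cell above selects the same files; matching on path components only matters if folder names could share a prefix.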

Save .csv tables of the training and test sets for use in training and evaluating a model.

In [11]:
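# create the output folder if it does not exist yet
Path("./resources/03").mkdir(parents=True, exist_ok=True)
training_set.to_csv("./resources/03/training_set.csv")
test_set.to_csv("./resources/03/test_set.csv")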

## Train a CNN
Now that we have prepared and split our labeled data into training and testing sets, we can train a CNN to recognize the labeled classes. Let's choose 7 classes from the annotated data and train our CNN to recognize vocalizations of these species:

- NOCA: Northern Cardinal
- EATO: Eastern Towhee
- SCTA: Scarlet Tanager
- BAWW: Black-and-white Warbler
- BCCH: Black-capped Chickadee
- AMCR: American Crow
- NOFL: Northern Flicker

In [13]:
# Filter just to our species of interest
species_of_interest = ["NOCA", "EATO", "SCTA", "BAWW", "BCCH", "AMCR", "NOFL"]
training_set = training_set[species_of_interest]
test_set = test_set[species_of_interest]

In [17]:
# Split our training data into training and validation sets
train_df, valid_df = sklearn.model_selection.train_test_split(training_set, test_size=0.1, random_state=0)
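A plain random hold-out can leave a rare class with few or no positive clips in the validation set, so it is worth counting positives per class after splitting. The sketch below illustrates the idea with the standard library only. `holdout_split` and `positives_per_class` are hypothetical helpers, the toy labels are made up, and this is not scikit-learn's implementation.

```python
import math
import random

def holdout_split(rows, valid_fraction=0.1, seed=0):
    """Shuffle rows and hold out a fraction of them for validation."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_valid = math.ceil(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]

def positives_per_class(rows, classes):
    """Count how many rows are labeled 1 for each class."""
    return {c: sum(row[c] for row in rows) for c in classes}

classes = ["NOCA", "EATO"]
# toy labels: 20 clips, NOCA common and EATO rare
rows = [{"NOCA": 1, "EATO": 0}] * 18 + [{"NOCA": 0, "EATO": 1}] * 2

train_rows, valid_rows = holdout_split(rows, valid_fraction=0.1, seed=0)
print(len(train_rows), "training clips,", len(valid_rows), "validation clips")
print("validation positives:", positives_per_class(valid_rows, classes))
```

With the real dataframes, `valid_df.sum()` gives the same per-class counts, since each column holds 0/1 labels.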

### Resample data for even class representation

Here, we balance the number of samples of each class in the training set. This helps the model learn all of the classes, rather than paying too much attention to the classes with the most labeled annotations. 

In [18]:
# this upsamples the less common classes and improves performance
# in classes that would otherwise be rare in the training set
from opensoundscape.data_selection import resample
balanced_train_df = resample(train_df,n_samples_per_class=800,random_state=0)
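The idea behind class balancing can be sketched as sampling with replacement until every class has the same number of positive clips. The code below is a conceptual illustration only, not the `resample` function's actual implementation: `upsample_per_class` and the toy rows are hypothetical.

```python
import random

def upsample_per_class(rows, classes, n_samples_per_class, seed=0):
    """Draw n_samples_per_class positive rows per class, with replacement."""
    rng = random.Random(seed)
    balanced = []
    for c in classes:
        positives = [row for row in rows if row["labels"][c] == 1]
        if not positives:
            continue  # nothing to sample for a class with no positive clips
        balanced.extend(rng.choices(positives, k=n_samples_per_class))
    return balanced

classes = ["NOCA", "EATO"]
# toy training rows: 5 NOCA clips and 1 EATO clip
rows = [{"clip": f"noca_{i}", "labels": {"NOCA": 1, "EATO": 0}} for i in range(5)]
rows.append({"clip": "eato_0", "labels": {"NOCA": 0, "EATO": 1}})

balanced = upsample_per_class(rows, classes, n_samples_per_class=4)
for c in classes:
    print(c, sum(row["labels"][c] for row in balanced))
```

Because sampling is done with replacement, the single EATO clip is repeated four times. Duplicated clips add no new information, which is one reason balancing helps most when combined with augmentation during training.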

Create the model object. We're using a resnet34 architecture CNN.

In [6]:
# create a CNN object designed to recognize 3-second samples
# we use the resnet34 architecture; single_target=False means a clip can contain several classes
model = CNN('resnet34',classes=species_of_interest,sample_duration=3.0, single_target=False)

# if your computer has a GPU, uncomment the relevant line to set model.device
# model.device='mps' #uncomment for Apple Silicon GPU use
# model.device='cuda' #uncomment for GPUs using cuda

#### initialize Weights and Biases logging session

Note: to use wandb logging, you will need to create an account on the [wandb website](https://wandb.ai/). The first time you use wandb, you'll be asked for an authentication key which can be found in your wandb profile. 

In [None]:
import wandb
try:
    wandb_session = wandb.init(
        entity='kitzeslab', # replace with your own wandb entity
        project='opensoundscape training demo',
        name='Notebook 03: Train CNN',
    )
except Exception: # if wandb.init fails, don't use wandb logging
    wandb_session = None

## Training
Train the model for 30 epochs.

We use default training parameters, but many aspects of CNN training can be customized (see [this tutorial](http://opensoundscape.org/en/latest/tutorials/cnn_training_advanced.html) for examples).

Training can be a slow process, and its speed will depend on your computer. If you wish to skip this step, you can load the model that this cell would create by uncommenting the subsequent cell and running it instead.

In [None]:
%%capture --no-stdout --no-display

model.train(
    balanced_train_df,
    valid_df,
    epochs=30,
    batch_size=64,
    log_interval=100,
    num_workers=32, # reduce if your machine has fewer CPU cores
    wandb_session=wandb_session,
)

As training progresses, performance metrics will be plotted to the wandb logging platform and visible on this run's web page. Once this cell completes, you have trained the CNN. In the next notebook, we will use the CNN to predict on the test set and evaluate its performance.