# Prepare CNN training data from Raven-annotated audio

If you have listened to some of your field recordings and annotated them for the presence of your sounds of interest, you can use those annotations as training data for a classifier in OpenSoundscape. This notebook shows the data processing steps that turn audio annotations into the format OpenSoundscape uses for training models. This example uses a set of recordings that were annotated with the software Raven Pro:

<i>An annotated set of audio recordings of Eastern North American birds containing frequency, time, and species information. </i><br>
Lauren M. Chronister,  Tessa A. Rhinehart,  Aidan Place,  Justin Kitzes <br>
https://doi.org/10.1002/ecy.3329 


#### Package imports

In [29]:
# OpenSoundscape imports
from opensoundscape.annotations import BoxedAnnotations

# general purpose packages
import pandas as pd
import numpy as np
from pathlib import Path
import re # for regex matching of annotation and audio files
import random 
from glob import glob

random.seed(0)
np.random.seed(0)

## Download instructions
Download the datasets to your current working directory and unzip them. You can do so by downloading both `annotation_Files.zip` and `wav_Files.zip` from the URL below or by executing the cell below.

https://datadryad.org/stash/dataset/doi:10.5061/dryad.d2547d81z

In [2]:
!wget -O annotation_Files.zip https://datadryad.org/stash/downloads/file_stream/641805
!wget -O wav_Files.zip https://datadryad.org/stash/downloads/file_stream/641808
!unzip annotation_Files.zip
!unzip wav_Files.zip


# Load Raven annotations and create label dataframes
The cells below read in the Raven files and use them to create the dataframes we will use as training and test sets for our model. We will turn the annotation files into a dataframe with one-hot labels for each 3-second interval: a label is 1 if a species is present in that interval of audio and 0 if it is not.

In [71]:
# path to the folder where the dataset was downloaded and unzipped
dataset_path = Path("./ecy3329-sup-0001-datas1/").resolve() 

# make a list of all of the selection table files
selections = glob(f"{dataset_path}/Annotation_Files/*/*.txt")

# audio files share the selection files' names, in a different folder and with a different extension
audio_files = [f.replace('Annotation_Files','Recordings').replace('.Table.1.selections.txt','.mp3') for f in selections]
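
Before loading anything, it can help to confirm that every selection table maps to an audio file that exists on disk. The sketch below repeats the string replacement used above in a small helper and lists the selection tables whose audio is missing. The helper names `audio_path_for` and `find_missing_audio` are illustrative and not part of OpenSoundscape, and the example path is only a plausible one inferred from the replacement pattern.

```python
from pathlib import Path


def audio_path_for(selection_path):
    """Map a Raven selection table path to its expected audio path.

    Uses the same folder and suffix replacement as the cell above.
    """
    return (
        str(selection_path)
        .replace("Annotation_Files", "Recordings")
        .replace(".Table.1.selections.txt", ".mp3")
    )


def find_missing_audio(selection_paths):
    """Return the selection tables whose expected audio file does not exist."""
    return [s for s in selection_paths if not Path(audio_path_for(s)).exists()]


# hypothetical path, used only to illustrate the mapping
example = "data/Annotation_Files/Recording_1/Recording_1_Segment_31.Table.1.selections.txt"
print(audio_path_for(example))
```

With the notebook's variables, `find_missing_audio(selections)` would list any selection tables without matching audio, which is a quick way to catch an incomplete download.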

### Loading Raven annotations
The `BoxedAnnotations` class stores frequency-time annotations in a table. It can parse and load Raven-formatted selection tables with the `from_raven_files()` method. We pass the method a list of Raven files and the corresponding list of audio files.


In [89]:
annotations = BoxedAnnotations.from_raven_files(raven_paths=[selections[0]],audio_files=[audio_files[0]])
annotations.df.head(2)

| | file | annotation | start_time | end_time | low_f | high_f | View | index | Channel | Selection |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | /Users/SML161/demos-for-opso/ecy3329-sup-0001-... | BTNW | 0.913636 | 2.202273 | 4635.1 | 7439.0 | Spectrogram 1 | 0 | 1 | 1 |
| 1 | /Users/SML161/demos-for-opso/ecy3329-sup-0001-... | EATO | 2.236363 | 2.693182 | 3051.9 | 4101.0 | Spectrogram 1 | 1 | 1 | 2 |


This table contains one row per annotation created in Raven Pro.
We can convert this annotation format to a table of 0 (absent) or 1 (present) labels for a series of time regions in each audio file. Each class will be a separate column. We can specify a list of classes, or let the function automatically create one class for each unique annotation in the Raven selection tables.

Here, we need to make some choices. First, how many seconds long is each audio "clip" that we want to generate a label for (`clip_duration`), and how many seconds of overlap should there be between consecutive clips (`clip_overlap`)? Here we'll choose 3-second clips with zero overlap.
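
To make the two parameters concrete, here is a minimal sketch of how clip start and end times follow from `clip_duration` and `clip_overlap`. It is not OpenSoundscape's implementation: the function name `clip_windows` is hypothetical, and the sketch assumes that a trailing partial clip is simply dropped.

```python
def clip_windows(total_duration, clip_duration, clip_overlap=0.0):
    """Return (start, end) times of full-length clips covering an audio file.

    Assumption: a final clip shorter than clip_duration is discarded.
    """
    step = clip_duration - clip_overlap
    if step <= 0:
        raise ValueError("clip_overlap must be smaller than clip_duration")

    windows = []
    i = 0
    # multiply rather than accumulate to avoid floating point drift
    while i * step + clip_duration <= total_duration:
        start = i * step
        windows.append((start, start + clip_duration))
        i += 1
    return windows


# a 10 second file cut into 3 second clips with no overlap
print(clip_windows(10, 3, 0))
# the same clip length with 1 second of overlap between neighbours
print(clip_windows(7, 3, 1))
```

With zero overlap each clip starts where the previous one ends; with a nonzero overlap the start times advance by `clip_duration - clip_overlap` instead.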

Second, how much does an annotation need to overlap with a clip for us to consider the annotation to apply to the clip (`min_label_overlap`)? For example, if an annotation spans 1-3.02 seconds, we might not want to consider it a part of a clip that spans 3-6 seconds, since only 0.02 seconds of that annotation overlaps with the clip. Here, we'll choose a `min_label_overlap` of 0.25 seconds.
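
The overlap rule in this example can be written out as a short sketch. The functions `overlap_seconds` and `label_applies` are hypothetical helpers for illustration, not OpenSoundscape functions, and treating an overlap exactly equal to the threshold as sufficient is an assumption of this sketch.

```python
def overlap_seconds(ann_start, ann_end, clip_start, clip_end):
    """Seconds shared by an annotation and a clip (0.0 if they do not intersect)."""
    return max(0.0, min(ann_end, clip_end) - max(ann_start, clip_start))


def label_applies(ann_start, ann_end, clip_start, clip_end, min_label_overlap=0.25):
    """True if the annotation overlaps the clip by at least min_label_overlap seconds.

    Assumption: the comparison is inclusive at the threshold.
    """
    return overlap_seconds(ann_start, ann_end, clip_start, clip_end) >= min_label_overlap


# the annotation from the text spans 1 to 3.02 seconds
print(label_applies(1.0, 3.02, 0.0, 3.0))  # clip 0-3 s: about 2 s of overlap -> True
print(label_applies(1.0, 3.02, 3.0, 6.0))  # clip 3-6 s: about 0.02 s of overlap -> False
```

Under this simple rule, the annotation labels the 0-3 second clip but not the 3-6 second clip.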

In [90]:
# generate "one-hot" clip labels for the annotations loaded above
clip_labels = annotations.one_hot_clip_labels(
    clip_duration=3,
    clip_overlap=0,
    min_label_overlap=0.25)

# show the first few rows
clip_labels.head(2)

| file | start_time | end_time | BTNW | EATO | BHCO | AMCR | OVEN | RBWO | RCKI | AMGO | TUTI | BAWW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| /Users/SML161/demos-for-opso/ecy3329-sup-0001-datas1/Recordings/Recording_1/Recording_1_Segment_31.mp3 | 0.0 | 3.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| /Users/SML161/demos-for-opso/ecy3329-sup-0001-datas1/Recordings/Recording_1/Recording_1_Segment_31.mp3 | 3.0 | 6.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## Load all annotations and create clip labels

Let's use the functions shown above to load all annotations from the downloaded dataset.

In [91]:
all_annotations = BoxedAnnotations.from_raven_files(selections,audio_files)

Now, let's generate the clip dataframes across all files and annotations in this labeled dataset. 

We'll use the same parameters as in the cells above for creating one-hot labels.

In [92]:
truth_df = all_annotations.one_hot_clip_labels(
    clip_duration=3,
    clip_overlap=0,
    min_label_overlap=0.25)



## Split into training and test sets

Our plan is to train a machine learning model on the files in the folders `Recording_1`, `Recording_2`, and `Recording_3` and test its performance on the recordings in the folder `Recording_4`. Let's separate the labels into two sets called `training_set` and `test_set`. We'll use the training set to train the CNN, and the test set to check how it performs on data that it has not seen during training.


In [95]:
# select all files from Recording_4 as a test set
mask = truth_df.reset_index()['file'].apply(lambda x: 'Recording_4' in x).values
test_set = truth_df[mask]

# all other files will be used as a training set
training_set = truth_df.drop(test_set.index)
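
The substring test above works for this dataset because the folder names are distinct. As a slightly stricter alternative, the sketch below compares each file's parent folder name exactly, so a hypothetical folder such as `Recording_40` would not be swept into the test set. The function `split_by_recording` and the example paths are illustrative assumptions, not part of OpenSoundscape.

```python
from pathlib import Path


def split_by_recording(files, held_out="Recording_4"):
    """Split file paths into (train, test) by the name of their parent folder."""
    train, test = [], []
    for f in files:
        if Path(f).parent.name == held_out:
            test.append(f)
        else:
            train.append(f)
    return train, test


# hypothetical paths that mimic the dataset's folder layout
files = [
    "Recordings/Recording_1/Recording_1_Segment_31.mp3",
    "Recordings/Recording_4/Recording_4_Segment_01.mp3",
    "Recordings/Recording_40/Recording_40_Segment_01.mp3",
]
train_files, test_files = split_by_recording(files)
print(train_files)
print(test_files)
```

Splitting by recording folder, rather than by individual clip, keeps clips from the same recording out of both sets at once.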

Save .csv tables of the training and test sets for use in training a model.

In [96]:
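# create the output folder if it does not exist yet
Path("./resources").mkdir(parents=True, exist_ok=True)
training_set.to_csv("./resources/training_set.csv")
test_set.to_csv("./resources/test_set.csv")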
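
Because the label tables are indexed by `file`, `start_time`, and `end_time`, reading a saved CSV back requires telling pandas which columns form the index. The sketch below round-trips a small made-up table through an in-memory buffer instead of a file; the toy file name and label values are assumptions for illustration only.

```python
import io

import pandas as pd

# toy label table with the same three-level index as the clip labels
index = pd.MultiIndex.from_tuples(
    [("a.mp3", 0.0, 3.0), ("a.mp3", 3.0, 6.0)],
    names=["file", "start_time", "end_time"],
)
toy = pd.DataFrame({"BTNW": [1.0, 1.0], "EATO": [1.0, 0.0]}, index=index)

# write to an in-memory buffer, then read it back
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# the first three columns are restored as the (file, start_time, end_time) index
reloaded = pd.read_csv(buffer, index_col=[0, 1, 2])
print(reloaded)
```

For the files saved above, the equivalent call would be `pd.read_csv("./resources/training_set.csv", index_col=[0, 1, 2])`.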