# Raven annotations
[Raven Sound Analysis Software](https://ravensoundsoftware.com/) enables users to inspect spectrograms, draw time and frequency boxes around sounds of interest, and label these boxes with species identities. OpenSoundscape contains functionality to prepare and use these annotations for machine learning.

## Download annotated data
We published an example Raven-annotated dataset here: https://doi.org/10.1002/ecy.3329

Note: these files are in mp3 format. If you receive a `NoBackendError`, make sure ffmpeg is installed on your machine.
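One way to check ahead of time whether ffmpeg is visible to Python is to look for it on the system PATH. The following is a minimal sketch; the helper name `ffmpeg_on_path` is made up for illustration and is not part of OpenSoundscape.

```python
import shutil


def ffmpeg_on_path() -> bool:
    """Return True if an `ffmpeg` executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None


if not ffmpeg_on_path():
    print("ffmpeg was not found on the PATH; mp3 files may fail to load.")
```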

In [1]:
from opensoundscape.commands import run_command
from pathlib import Path

Download the zipped data here:

In [2]:
link = "https://esajournals.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fecy.3329&file=ecy3329-sup-0001-DataS1.zip"
name = 'powdermill_data.zip'
out = run_command(f"wget -O {name} {link}")

Unzip the files to a new directory, `powdermill_data/`

In [3]:
out = run_command("unzip powdermill_data.zip -d powdermill_data")
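If `wget` or `unzip` is not available on your system, the same two steps could be done with the Python standard library instead. This is only a sketch: the helper names `download_file` and `extract_zip` are hypothetical, and whether the publisher's server accepts a plain scripted download is an assumption that has not been checked here.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_file(url: str, out_path: Path) -> Path:
    """Download `url` to `out_path` (hypothetical helper; needs network access)."""
    urllib.request.urlretrieve(url, str(out_path))
    return out_path


def extract_zip(zip_path: Path, out_dir: Path) -> list:
    """Extract every member of `zip_path` into `out_dir` and return the member names."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()


# Example usage, reusing `link` from the download cell:
# download_file(link, Path("powdermill_data.zip"))
# extract_zip(Path("powdermill_data.zip"), Path("powdermill_data"))
```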

Keep track of the files we have now so we can delete them later.

In [4]:
files_to_delete = [Path("powdermill_data"), Path("powdermill_data.zip")]

## Preprocess Raven data

The `opensoundscape.raven` module contains preprocessing functions for Raven data, including:

* `annotation_check` - check that every selections file contains labels
* `lowercase_annotations` - convert all of the annotations to lowercase
* `generate_class_corrections` - create a CSV listing every unique label, so that typos and nonstandard names are easy to spot
    * Modify the CSV as needed. If you need to look up files, you can use `query_annotations`
    * Can be used in `SplitterDataset`
* `apply_class_corrections` - replace incorrect labels with correct labels
* `query_annotations` - find files that contain a particular label, such as a species code or a typo

In [5]:
import pandas as pd
import opensoundscape.raven as raven
import opensoundscape.audio as audio

In [6]:
raven_files_raw = Path("./powdermill_data/Annotation_Files/")

### Check Raven files have labels

Check that all selections files contain labels under one column name. In this dataset the labels column is named `"species"`.

In [7]:
raven.annotation_check(directory=raven_files_raw, col='species')

All rows in powdermill_data/Annotation_Files contain labels in column `species`
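To make the idea of this check concrete, here is a rough standard-library sketch of a similar test. It is not how `annotation_check` is implemented. It assumes that the selection tables are tab-delimited text files with a header row and that their names end in `.selections.txt`; the helper name `files_missing_labels` is hypothetical.

```python
import csv
from pathlib import Path


def files_missing_labels(directory: Path, col: str, pattern: str = "*.selections.txt") -> list:
    """Return selection files where `col` is absent or empty in at least one row."""
    problems = []
    for path in sorted(Path(directory).glob(pattern)):
        with open(path, newline="") as handle:
            # Assumption: tab-delimited table whose first line is the header
            rows = list(csv.DictReader(handle, delimiter="\t"))
        if any(not (row.get(col) or "").strip() for row in rows):
            problems.append(path)
    return problems


# Example usage:
# files_missing_labels(Path("./powdermill_data/Annotation_Files/"), col="species")
```

An empty list means that every row of every matching file has a non-empty value in the chosen column.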


### Create lowercase files

Convert all the text in the files to lowercase to standardize them. Save these to a new directory. They will be saved with the same filename but with ".lower" appended.

In [8]:
raven_directory = Path('./powdermill_data/Annotation_Files_Standardized')
raven_directory.mkdir(exist_ok=True)
raven.lowercase_annotations(directory=raven_files_raw, out_dir=raven_directory)

Check that the outputs are saved as expected.

In [9]:
list(raven_directory.glob("*.lower"))[:5]

[PosixPath('powdermill_data/Annotation_Files_Standardized/Recording_1_Segment_22.Table.1.selections.txt.lower'),
 PosixPath('powdermill_data/Annotation_Files_Standardized/Recording_4_Segment_15.Table.1.selections.txt.lower'),
 PosixPath('powdermill_data/Annotation_Files_Standardized/Recording_4_Segment_24.Table.1.selections.txt.lower'),
 PosixPath('powdermill_data/Annotation_Files_Standardized/Recording_1_Segment_13.Table.1.selections.txt.lower'),
 PosixPath('powdermill_data/Annotation_Files_Standardized/Recording_1_Segment_06.Table.1.selections.txt.lower')]

### Generate class corrections

This function generates a table that can be modified by hand to correct labels with typos in them. It identifies the unique labels in the provided column (here `"species"`) in all of the lowercase files in the directory `raven_directory`.

For instance, the generated table could be something like the following:
```
raw,corrected
sparrow,sparrow
sparow,sparow
goose,goose
```

In [10]:
print(raven.generate_class_corrections(directory=raven_directory, col='species'))

raw,corrected
amcr,amcr
amgo,amgo
amre,amre
amro,amro
baor,baor
baww,baww
bbwa,bbwa
bcch,bcch
bggn,bggn
bhco,bhco
bhvi,bhvi
blja,blja
brcr,brcr
btnw,btnw
bwwa,bwwa
cang,cang
carw,carw
cedw,cedw
cora,cora
coye,coye
cswa,cswa
dowo,dowo
eato,eato
eawp,eawp
hawo,hawo
heth,heth
howa,howa
kewa,kewa
lowa,lowa
nawa,nawa
noca,noca
nofl,nofl
oven,oven
piwo,piwo
rbgr,rbgr
rbwo,rbwo
rcki,rcki
revi,revi
rsha,rsha
rwbl,rwbl
scta,scta
swth,swth
tuti,tuti
veer,veer
wbnu,wbnu
witu,witu
woth,woth
ybcu,ybcu



The released dataset has no need for class corrections, but if it did, we could save the returned text to a CSV and use the CSV to apply corrections to future dataframes.
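As an illustration of how a hand-edited corrections table might be used, the sketch below parses `raw,corrected` text into a dictionary and applies it to a list of labels. It uses only the standard library and stands in for, rather than reproduces, `apply_class_corrections`; the helper names `parse_corrections` and `correct_labels` are hypothetical.

```python
import csv
import io


def parse_corrections(csv_text: str) -> dict:
    """Turn 'raw,corrected' CSV text into a {raw: corrected} dictionary."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["raw"]: row["corrected"] for row in reader}


def correct_labels(labels: list, corrections: dict) -> list:
    """Replace each label found in `corrections`; leave unknown labels unchanged."""
    return [corrections.get(label, label) for label in labels]


# Hand-edited table: the typo "sparow" now maps to "sparrow"
edited = "raw,corrected\nsparrow,sparrow\nsparow,sparrow\ngoose,goose\n"
corrections = parse_corrections(edited)
print(correct_labels(["sparow", "goose", "sparrow"], corrections))
# prints ['sparrow', 'goose', 'sparrow']
```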

### Query annotations
This function prints all annotations of a particular class, e.g. "amro" (American Robin).

In [11]:
output = raven.query_annotations(directory=raven_directory, cls='amro', col='species', print_out=True)

powdermill_data/Annotation_Files_Standardized/Recording_4_Segment_16.Table.1.selections.txt.lower

     selection           view  channel  begin time (s)  end time (s)  \
85          86  spectrogram 1        1       77.634876     82.129659   
93          94  spectrogram 1        1       84.226733     86.313096   
98          99  spectrogram 1        1       88.825438     91.272182   
107        108  spectrogram 1        1       96.028977     97.552840   
111        112  spectrogram 1        1       99.990354    100.914517   
116        117  spectrogram 1        1      104.327755    108.656087   
122        123  spectrogram 1        1      109.525937    112.021391   
129        130  spectrogram 1        1      113.765766    117.386474   
137        138  spectrogram 1        1      121.053454    121.383161   
141        142  spectrogram 1        1      124.864220    129.139630   
154        155  spectrogram 1        1      132.583749    135.017840   
162        163  spectrogram 1        

## Split Raven annotations and audio files

The Raven module's `raven_audio_split_and_save` function splits both audio files and their associated annotations. It requires that annotation and audio filenames are unique, and that each annotation file is named the same as its corresponding audio file (apart from the file extension).
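Before splitting, it can be useful to confirm that every audio file has a matching annotation file. The sketch below compares base names, assuming the base name is everything before the first `.` in the filename (as in `Recording_1_Segment_22.Table.1.selections.txt.lower`). The library's own matching rule may differ, and the helper names are hypothetical.

```python
from pathlib import Path


def base_name(path: Path) -> str:
    """Return the filename up to its first '.', e.g. 'Recording_1_Segment_22'."""
    return Path(path).name.split(".", 1)[0]


def unmatched_files(audio_paths, annotation_paths):
    """Return base names found on only one side, as (audio_only, annotation_only)."""
    audio = {base_name(p) for p in audio_paths}
    annotations = {base_name(p) for p in annotation_paths}
    return sorted(audio - annotations), sorted(annotations - audio)


audio_only, annotation_only = unmatched_files(
    [Path("Recording_1_Segment_22.mp3"), Path("Recording_1_Segment_23.mp3")],
    [Path("Recording_1_Segment_22.Table.1.selections.txt.lower")],
)
print(audio_only, annotation_only)
# prints ['Recording_1_Segment_23'] []
```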

In [12]:
audio_directory = Path('./powdermill_data/Recordings/')
destination = Path('./powdermill_data/Split_Recordings')
out = raven.raven_audio_split_and_save(
    
    # Where to look for Raven files
    raven_directory = raven_directory,
    
    # Where to look for audio files
    audio_directory = audio_directory,
    
    # The destination to save clips and the labels CSV to 
    destination = destination,
    
    # The column name of the labels
    col = 'species',
    
    # Desired audio sample rate
    sample_rate = 22050,
    
    # Desired duration of clips
    clip_duration = 5,
    
    # Verbose (uncomment the next line to see progress--this cell takes a while to run)
    #verbose=True,
)

Found 77 sets of matching audio files and selection tables out of 77 audio files and 77 selection tables


The results of the splitting are saved in the destination folder under the name `labels.csv`.

In [13]:
labels = pd.read_csv(destination.joinpath("labels.csv"), index_col='filename')
labels.head()

filename,amcr,amgo,amre,amro,baor,baww,bbwa,bcch,bggn,bhco,...,rsha,rwbl,scta,swth,tuti,veer,wbnu,witu,woth,ybcu
powdermill_data/Split_Recordings/Recording_4_Segment_13_0.0s_5.0s.wav,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
powdermill_data/Split_Recordings/Recording_4_Segment_13_5.0s_10.0s.wav,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
powdermill_data/Split_Recordings/Recording_4_Segment_13_10.0s_15.0s.wav,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
powdermill_data/Split_Recordings/Recording_4_Segment_13_15.0s_20.0s.wav,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
powdermill_data/Split_Recordings/Recording_4_Segment_13_20.0s_25.0s.wav,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
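A quick sanity check on a labels file is to count how many clips are labeled positive for each class. The sketch below does this with the standard library, assuming the first column is named `filename` and every other column holds 0/1 values as in the table above; the helper name `count_positive_clips` is hypothetical. With the dataframe already loaded in pandas, `labels.sum()` gives the same per-class counts.

```python
import csv
from collections import Counter


def count_positive_clips(labels_csv_path) -> Counter:
    """Count, for each class column, how many clips have a label greater than 0."""
    counts = Counter()
    with open(labels_csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            for column, value in row.items():
                if column != "filename" and float(value) > 0:
                    counts[column] += 1
    return counts


# Example usage:
# count_positive_clips("./powdermill_data/Split_Recordings/labels.csv").most_common(5)
```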


The `raven_audio_split_and_save` function contains several options. Notable options are:

* `clip_duration`: the length of the clips
* `clip_overlap`: the overlap, in seconds, between clips
* `final_clip`: what to do with the final clip if it is not exactly `clip_duration` in length (see API docs for more details)
* `labeled_clips_only`: whether to only save labeled clips
* `min_label_len`: minimum length, in seconds, of an annotation for a clip to be considered labeled. For instance, if an annotation only overlaps 0.1s with a 5s clip, you might want to exclude it with `min_label_len=0.2`.
* `species`: a subset of species for which to find labels (by default, finds all species labels in the dataset)
* `dry_run`: if `True`, produces print statements and returns dataframe of labels, but does not save files.
* `verbose`: if `True`, prints more information, e.g. clip-by-clip progress.
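To show how clip duration, clip overlap, and a minimum label length interact, here is a simplified sketch of the underlying arithmetic. It is an illustration under stated assumptions, not the library's implementation: it drops any trailing partial clip (the real behavior is governed by `final_clip`), and all function names are hypothetical.

```python
def clip_windows(total_duration: float, clip_duration: float, clip_overlap: float = 0.0) -> list:
    """Return (start, end) times of full-length clips; a trailing partial clip is dropped."""
    step = clip_duration - clip_overlap
    if step <= 0:
        raise ValueError("clip_overlap must be smaller than clip_duration")
    windows = []
    index = 0
    while index * step + clip_duration <= total_duration:
        start = index * step
        windows.append((start, start + clip_duration))
        index += 1
    return windows


def overlap_seconds(clip: tuple, annotation: tuple) -> float:
    """Return how many seconds of `annotation` fall inside `clip`."""
    return max(0.0, min(clip[1], annotation[1]) - max(clip[0], annotation[0]))


def is_labeled(clip: tuple, annotation: tuple, min_label_len: float) -> bool:
    """Treat a clip as labeled if the annotation overlaps it by at least `min_label_len`."""
    return overlap_seconds(clip, annotation) >= min_label_len


# A 12 s recording cut into 5 s clips, with one annotation from 4.9 s to 6.0 s
windows = clip_windows(12, 5)
annotation = (4.9, 6.0)
print([is_labeled(w, annotation, min_label_len=0.2) for w in windows])
# prints [False, True]
```

The first clip contains only about 0.1 s of the annotation and is therefore excluded, while the second contains about 1 s of it and is kept.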

For instance, let's extract labels for one species, American Redstart (AMRE), saving only clips that contain at least 1s of label for that species. The `verbose` flag causes the function to print its progress as each file is split.

In [14]:
btnw_split_dir = Path('./powdermill_data/btnw_recordings')
out = raven.raven_audio_split_and_save(
    raven_directory = raven_directory,
    audio_directory = audio_directory,
    destination = btnw_split_dir,
    col = 'species',
    sample_rate = 22050,
    clip_duration = 5,
    clip_overlap = 0,
    verbose=True,
    species='amre',
    labeled_clips_only=True,
    min_label_len=1
)

Found 77 sets of matching audio files and selection tables out of 77 audio files and 77 selection tables
Making directory powdermill_data/btnw_recordings
1. Finished powdermill_data/Recordings/Recording_4/Recording_4_Segment_13.mp3
2. Finished powdermill_data/Recordings/Recording_1/Recording_1_Segment_33.mp3
3. Finished powdermill_data/Recordings/Recording_1/Recording_1_Segment_26.mp3
4. Finished powdermill_data/Recordings/Recording_4/Recording_4_Segment_19.mp3
5. Finished powdermill_data/Recordings/Recording_1/Recording_1_Segment_11.mp3
6. Finished powdermill_data/Recordings/Recording_2/Recording_2_Segment_13.mp3
7. Finished powdermill_data/Recordings/Recording_1/Recording_1_Segment_29.mp3
8. Finished powdermill_data/Recordings/Recording_2/Recording_2_Segment_01.mp3
9. Finished powdermill_data/Recordings/Recording_1/Recording_1_Segment_15.mp3
10. Finished powdermill_data/Recordings/Recording_4/Recording_4_Segment_20.mp3
11. Finished powdermill_data/Recordings/Recording_1/Recording_1_S

The labels CSV only has a column for the species of interest:

In [15]:
btnw_labels = pd.read_csv(btnw_split_dir.joinpath("labels.csv"), index_col='filename')
btnw_labels.head()

filename,amre
powdermill_data/btnw_recordings/Recording_2_Segment_13_60.0s_65.0s.wav,1.0
powdermill_data/btnw_recordings/Recording_2_Segment_13_65.0s_70.0s.wav,1.0
powdermill_data/btnw_recordings/Recording_2_Segment_13_85.0s_90.0s.wav,1.0
powdermill_data/btnw_recordings/Recording_2_Segment_13_95.0s_100.0s.wav,1.0
powdermill_data/btnw_recordings/Recording_2_Segment_13_105.0s_110.0s.wav,1.0


The split files and associated labels CSV can now be used to train machine learning models (see additional tutorials).

The command below cleans up after the tutorial is done -- only run it if you want to delete all of the files.

In [16]:
from shutil import rmtree
for file in files_to_delete:
    if file.is_dir():
        rmtree(file)
    else:
        file.unlink()