# Raven annotations
[Raven Sound Analysis Software](https://ravensoundsoftware.com/) enables users to inspect spectrograms, draw time and frequency boxes around sounds of interest, and label these boxes with species identities. OpenSoundscape contains functionality to prepare and use these annotations for machine learning.

## Download annotated data
We published an example Raven-annotated dataset here: https://doi.org/10.1002/ecy.3329

In [10]:
from pathlib import Path
import subprocess

Download the zipped data here:

In [11]:
url = "https://esajournals.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fecy.3329&file=ecy3329-sup-0001-DataS1.zip"
name = 'powdermill_data.zip'
subprocess.run(['curl', url, '-L', '-o', name])  # Download the data

CompletedProcess(args=['curl', 'https://esajournals.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fecy.3329&file=ecy3329-sup-0001-DataS1.zip', '-L', '-o', 'powdermill_data.zip'], returncode=0)

Unzip the files to a new directory, `powdermill_data/`

In [12]:
subprocess.run(["unzip", "powdermill_data.zip", "-d", "powdermill_data"])

CompletedProcess(args=['unzip', 'powdermill_data.zip', '-d', 'powdermill_data'], returncode=9)
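
If `curl` or `unzip` is unavailable, or a command reports a nonzero `returncode` (which signals a warning or an error), the same two steps can be sketched with Python's standard library. This is a minimal alternative rather than part of the original workflow; the helper name `download_and_extract` is hypothetical, and some servers reject scripted requests, in which case the zip file may need to be downloaded manually.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url, zip_path, out_dir):
    """Download `url` to `zip_path` (if not already present) and unzip it into `out_dir`."""
    zip_path, out_dir = Path(zip_path), Path(out_dir)
    if not zip_path.exists():
        # urlopen follows HTTP redirects, similar to curl's -L flag
        with urllib.request.urlopen(url) as response, open(zip_path, "wb") as f:
            shutil.copyfileobj(response, f)
    # Raises zipfile.BadZipFile if the file is not a valid zip archive
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    return out_dir


# Usage with the names defined above (commented out to avoid an accidental download):
# download_and_extract(url, "powdermill_data.zip", "powdermill_data")
```

Because `zipfile` raises an exception on a corrupt or missing archive, a failed download is reported immediately instead of being hidden in a return code.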

Keep track of the files we have now so we can delete them later.

In [13]:
files_to_delete = [Path("powdermill_data"), Path("powdermill_data.zip")]

## Preprocess Raven data

The `opensoundscape.raven` module contains preprocessing functions for Raven data, including:

* `annotation_check` - check that every selection file contains labels
* `lowercase_annotations` - convert all of the annotations to lowercase
* `generate_class_corrections` - create a CSV of the unique labels so that typos and unexpected names can be spotted
    * Modify the CSV as needed. If you need to look up files, you can use `query_annotations`
    * Can be used in `SplitterDataset`
* `apply_class_corrections` - replace incorrect labels with correct labels
* `query_annotations` - find files that contain a particular species or a typo

In [14]:
import pandas as pd
import opensoundscape.raven as raven
import opensoundscape.audio as audio

ModuleNotFoundError: No module named 'opensoundscape.raven'

In [None]:
raven_files_raw = Path("./powdermill_data/Annotation_Files/")

### Check Raven files have labels

Check that all selection files contain labels under a single column name. In this dataset the labels column is named `"species"`.

In [None]:
raven.annotation_check(directory=raven_files_raw, col='species')
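
For intuition, a check of this kind can be approximated with pandas. The sketch below assumes the selection files are tab-delimited text tables with a `.txt` extension; the helper `files_missing_column` is hypothetical and is not how `raven.annotation_check` is implemented.

```python
from pathlib import Path

import pandas as pd


def files_missing_column(directory, col):
    """Return names of selection files that lack column `col` or have empty entries in it."""
    problems = []
    for path in sorted(Path(directory).glob("*.txt")):
        # Selection tables are assumed to be tab-delimited text
        table = pd.read_csv(path, sep="\t")
        if col not in table.columns or table[col].isna().any():
            problems.append(path.name)
    return problems
```

An empty list means every file has the column and no row is missing a label.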

### Create lowercase files

Convert all the text in the files to lowercase to standardize them, and save the results to a new directory. Each output file keeps its original filename with `.lower` appended.

In [None]:
raven_directory = Path('./powdermill_data/Annotation_Files_Standardized')
raven_directory.mkdir(exist_ok=True)
raven.lowercase_annotations(directory=raven_files_raw, out_dir=raven_directory)

Check that the outputs are saved as expected.

In [None]:
list(raven_directory.glob("*.lower"))[:5]
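
The transformation itself is simple. A minimal sketch of the same idea using only the standard library is shown here; the helper `lowercase_copy` is hypothetical, and the real function may differ in details such as which files it selects.

```python
from pathlib import Path


def lowercase_copy(path, out_dir):
    """Write a lowercased copy of a text file to `out_dir`, appending '.lower' to its name."""
    path, out_dir = Path(path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / (path.name + ".lower")
    # Lowercase the entire file, including the header row
    out_path.write_text(path.read_text().lower())
    return out_path
```

Note that lowercasing the whole file also lowercases the column headers, so later steps should refer to columns by their lowercase names.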

### Generate class corrections

This function generates a table that can be modified by hand to correct labels with typos in them. It identifies the unique labels in the provided column (here `"species"`) in all of the lowercase files in the directory `raven_directory`.

For instance, the generated table could be something like the following:
```
raw,corrected
sparrow,sparrow
sparow,sparow
goose,goose
```

In [None]:
print(raven.generate_class_corrections(directory=raven_directory, col='species'))

The released dataset has no need for class corrections, but if it did, we could save the returned text to a CSV and use the CSV to apply corrections to future dataframes.
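
As a rough illustration of that workflow, the sketch below parses CSV text in the `raw,corrected` layout shown earlier, fixes one entry by hand, and applies the mapping to a label column with pandas. The example text and dataframe are made up for illustration, and this is not the implementation of `raven.apply_class_corrections`.

```python
from io import StringIO

import pandas as pd

# Hypothetical text in the same layout as the example table
corrections_text = "raw,corrected\nsparrow,sparrow\nsparow,sparow\ngoose,goose\n"
corrections = pd.read_csv(StringIO(corrections_text))

# Fix the typo by hand (this could equally be done by editing a saved CSV)
corrections.loc[corrections["raw"] == "sparow", "corrected"] = "sparrow"

# Build a raw -> corrected lookup and apply it to a label column,
# leaving any label that is not in the lookup unchanged
lookup = dict(zip(corrections["raw"], corrections["corrected"]))
annotations = pd.DataFrame({"species": ["sparow", "goose", "sparrow"]})
annotations["species"] = annotations["species"].map(lambda label: lookup.get(label, label))

print(annotations["species"].tolist())  # ['sparrow', 'goose', 'sparrow']
```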

### Query annotations
This function prints all annotations of a particular class, e.g. "amro" (American Robin).

In [None]:
output = raven.query_annotations(directory=raven_directory, cls='amro', col='species', print_out=True)

## Split Raven annotations and audio files

The Raven module's `raven_audio_split_and_save` function splits both the audio data and the associated annotations. It requires that annotation and audio filenames are unique, and that each annotation file has the same name as its corresponding audio file.
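
Before splitting, it can help to confirm that every annotation file has a matching audio file. The sketch below assumes that two files correspond when the part of the filename before the first `.` is identical. This matching rule, the helper names, and the example filenames are assumptions for illustration, so consult the API docs for the rule the function actually applies.

```python
from pathlib import Path


def base_name(path):
    """Return the part of the filename before the first '.'."""
    return Path(path).name.split(".")[0]


def unmatched_annotations(annotation_files, audio_files):
    """Return annotation files with no audio file sharing the same base name."""
    audio_bases = {base_name(f) for f in audio_files}
    return [str(f) for f in annotation_files if base_name(f) not in audio_bases]


# Hypothetical filenames
annotation_files = ["rec1.Table.1.selections.txt.lower", "rec2.Table.1.selections.txt.lower"]
audio_files = ["rec1.mp3"]
print(unmatched_annotations(annotation_files, audio_files))
# ['rec2.Table.1.selections.txt.lower']
```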

In [None]:
audio_directory = Path('./powdermill_data/Recordings/')
destination = Path('./powdermill_data/Split_Recordings')
out = raven.raven_audio_split_and_save(
    
    # Where to look for Raven files
    raven_directory = raven_directory,
    
    # Where to look for audio files
    audio_directory = audio_directory,
    
    # The destination to save clips and the labels CSV to 
    destination = destination,
    
    # The column name of the labels
    col = 'species',
    
    # Desired audio sample rate
    sample_rate = 22050,
    
    # Desired duration of clips
    clip_duration = 5,
    
    # Verbose (uncomment the next line to see progress--this cell takes a while to run)
    #verbose=True,
)

The results of the splitting are saved in the destination folder under the name `labels.csv`.

In [None]:
labels = pd.read_csv(destination.joinpath("labels.csv"), index_col='filename')
labels.head()
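
A quick way to summarize such a table is to count clips per class. The sketch below uses a small made-up dataframe and assumes the labels table has one row per clip and one 0/1 column per class; check the real `labels.csv` before relying on that layout.

```python
import pandas as pd

# Hypothetical stand-in for the labels table: one row per clip, one 0/1 column per class
example_labels = pd.DataFrame(
    {"amre": [1, 0, 0], "btnw": [0, 1, 0]},
    index=pd.Index(["clip_0.wav", "clip_1.wav", "clip_2.wav"], name="filename"),
)

# Number of clips containing each class
clips_per_class = example_labels.sum(axis=0)

# Clips with no labels at all
unlabeled = example_labels.index[example_labels.sum(axis=1) == 0].tolist()

print(clips_per_class)
print(unlabeled)
```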

The `raven_audio_split_and_save` function contains several options. Notable options are:

* `clip_duration`: the length of the clips
* `clip_overlap`: the overlap, in seconds, between clips
* `final_clip`: what to do with the final clip if it is not exactly `clip_duration` in length (see API docs for more details)
* `labeled_clips_only`: whether to only save labeled clips
* `min_label_len`: minimum length, in seconds, of an annotation for a clip to be considered labeled. For instance, if an annotation only overlaps 0.1s with a 5s clip, you might want to exclude it with `min_label_len=0.2`.
* `species`: a subset of species to search for labels of (by default, finds all species labels in dataset)
* `dry_run`: if `True`, produces print statements and returns dataframe of labels, but does not save files.
* `verbose`: if `True`, prints more information, e.g. clip-by-clip progress.
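
To make the clip duration, clip overlap, and minimum label length options concrete, here is a small sketch of the arithmetic involved. It is an illustration under simplifying assumptions (clips start at 0 s, and a trailing clip shorter than `clip_duration` is simply dropped), not the library's implementation; the helper names are hypothetical.

```python
def clip_windows(total_duration, clip_duration, clip_overlap=0):
    """Return (start, end) times of full-length clips; a shorter trailing clip is dropped."""
    step = clip_duration - clip_overlap
    if step <= 0:
        raise ValueError("clip_overlap must be smaller than clip_duration")
    windows = []
    start = 0
    while start + clip_duration <= total_duration:
        windows.append((start, start + clip_duration))
        start += step
    return windows


def label_overlap(clip, annotation):
    """Return the seconds of overlap between a clip and an annotation, both (start, end)."""
    return max(0, min(clip[1], annotation[1]) - max(clip[0], annotation[0]))


windows = clip_windows(total_duration=12, clip_duration=5, clip_overlap=0)
print(windows)  # [(0, 5), (5, 10)]

# An annotation from 4.9 s to 6.0 s overlaps the first clip by only about 0.1 s
annotation = (4.9, 6.0)
min_len = 0.2
labeled = [w for w in windows if label_overlap(w, annotation) >= min_len]
print(labeled)  # [(5, 10)]
```

With a 0.2 s threshold, the first clip is treated as unlabeled because the annotation only grazes its final tenth of a second, while the second clip keeps the label.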

For instance, let's extract labels for one species, American Redstart (AMRE), saving only clips that contain at least 1 s of label for that species (`min_label_len=1`). The `verbose` flag causes the function to print progress as it splits each clip.

In [None]:
amre_split_dir = Path('./powdermill_data/amre_recordings')
out = raven.raven_audio_split_and_save(
    raven_directory = raven_directory,
    audio_directory = audio_directory,
    destination = amre_split_dir,
    col = 'species',
    sample_rate = 22050,
    clip_duration = 5,
    clip_overlap = 0,
    verbose=True,
    species='amre',
    labeled_clips_only=True,
    min_label_len=1
)

The labels CSV only has a column for the species of interest:

In [None]:
amre_labels = pd.read_csv(amre_split_dir.joinpath("labels.csv"), index_col='filename')
amre_labels.head()

The split files and associated labels CSV can now be used to train machine learning models (see additional tutorials).

The command below cleans up after the tutorial. Only run it if you want to delete all of the downloaded and generated files.

In [None]:
from shutil import rmtree
for file in files_to_delete:
    if file.is_dir():
        rmtree(file)
    else:
        file.unlink()