In [7]:
import pandas as pd

# Setting up the environment

In [None]:
!conda env create -f ../environment.yml

# The format of the dataset

## Phase 1

### Training

The training data for the first phase consists of a large number of small audio files. The paths to the audio files that contain a click, along with the click times, should be listed in "dataset/p1_train_clicks.csv" in the following format:

In [8]:
pd.read_csv("dataset/p1_train_clicks.csv")

Unnamed: 0,audio_path,click_time
0,/data/vision/torralba/scratch/fjacob/simplifie...,5020
1,/data/vision/torralba/scratch/fjacob/simplifie...,5020
2,/data/vision/torralba/scratch/fjacob/simplifie...,5020
3,/data/vision/torralba/scratch/fjacob/simplifie...,5020
4,/data/vision/torralba/scratch/fjacob/simplifie...,5020
...,...,...
14759,/data/vision/torralba/scratch/fjacob/simplifie...,5020
14760,/data/vision/torralba/scratch/fjacob/simplifie...,5020
14761,/data/vision/torralba/scratch/fjacob/simplifie...,5020
14762,/data/vision/torralba/scratch/fjacob/simplifie...,5020
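
As a rough sketch of how a file with this layout could be produced, the snippet below writes a two-row index and reads it back. The helper name `write_click_index` and the clip file names are hypothetical, and the value 5020 is simply copied from the table above (its unit is not stated here). The "Unnamed: 0" column in the output above is what pandas shows when a CSV was saved with the default row index.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_click_index(audio_paths, click_times, csv_path):
    """Write a click index with the audio_path and click_time columns."""
    if len(audio_paths) != len(click_times):
        raise ValueError("need exactly one click time per audio file")
    table = pd.DataFrame(
        {
            "audio_path": [str(p) for p in audio_paths],
            "click_time": list(click_times),
        }
    )
    # The default index=True writes the row index as an unnamed first column,
    # which pandas displays as "Unnamed: 0" when the file is read back.
    table.to_csv(csv_path)
    return table


# Hypothetical clip names; a temporary directory keeps the demo self-contained.
with tempfile.TemporaryDirectory() as tmp:
    demo_csv = Path(tmp) / "p1_train_clicks.csv"
    write_click_index(
        ["clips/click_0000.wav", "clips/click_0001.wav"], [5020, 5020], demo_csv
    )
    click_index = pd.read_csv(demo_csv)

print(click_index.columns.tolist())
```

In a real setup the CSV would be written to "dataset/p1_train_clicks.csv" instead of a temporary directory.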


The paths to the audio files that don't contain clicks should be listed in "dataset/p1_train_noise.csv" in the following format:

In [9]:
pd.read_csv("dataset/p1_train_noise.csv")

Unnamed: 0,audio_path
0,/data/vision/torralba/scratch/fjacob/simplifie...
1,/data/vision/torralba/scratch/fjacob/simplifie...
2,/data/vision/torralba/scratch/fjacob/simplifie...
3,/data/vision/torralba/scratch/fjacob/simplifie...
4,/data/vision/torralba/scratch/fjacob/simplifie...
...,...
1975992,/data/vision/torralba/scratch/fjacob/simplifie...
1975993,/data/vision/torralba/scratch/fjacob/simplifie...
1975994,/data/vision/torralba/scratch/fjacob/simplifie...
1975995,/data/vision/torralba/scratch/fjacob/simplifie...
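
One possible way to build this file is to list every `.wav` file under a folder of click-free clips. This is only a sketch: the helper `build_noise_index` and the folder layout are assumptions for illustration, not part of the repository.

```python
import tempfile
from pathlib import Path

import pandas as pd


def build_noise_index(clip_dir):
    """List every .wav file under clip_dir in a one-column audio_path table."""
    paths = sorted(str(p) for p in Path(clip_dir).rglob("*.wav"))
    return pd.DataFrame({"audio_path": paths})


# Demo on a temporary folder with two empty placeholder clips and one
# unrelated file, which the .wav filter should skip.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("noise_b.wav", "noise_a.wav", "notes.txt"):
        (Path(tmp) / name).touch()
    noise_index = build_noise_index(tmp)
    # Saved with the default index, matching the "Unnamed: 0" column above.
    noise_index.to_csv(Path(tmp) / "p1_train_noise.csv")
    reloaded_noise = pd.read_csv(Path(tmp) / "p1_train_noise.csv")

print(reloaded_noise.columns.tolist())
```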


If you only have long audio files, set up the phase 2 dataset as explained below, then run the following cell (or run the script directly) to extract short audio files from it.

In [None]:
!python extract_short_audio.py

### Validation

The validation dataset for phase 1 is a single audio file stored in "dataset/p1_validation_audio.wav", and its annotations are stored in "dataset/p1_validation_annotations.csv". The annotations CSV file should contain columns with the click start time (TsTo) and the inter-click intervals (ICIX for the Xth inter-click interval), as in the following example:

In [10]:
pd.read_csv("dataset/p1_validation_annotations.csv").head(1)

Unnamed: 0,REC,nClicks,Duration,ICI1,ICI2,ICI3,ICI4,ICI5,ICI6,ICI7,...,ICI33,ICI34,ICI35,ICI36,ICI37,ICI38,ICI39,ICI40,Whale,TsTo
0,CETI23-280,5,0.98399,0.31889,0.32649,0.1892,0.14941,0.0,0.0,0.0,...,0.0,0,0,0,0,0,0,0.0,1,3.2977


The audio file should be at least several minutes long and contain multiple clicks.
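
To illustrate how the annotation columns relate to individual clicks, the sketch below reconstructs click times from one row. It assumes that TsTo is the time of the first click in seconds and that each ICI is the gap to the next click; the example row above is consistent with this, since its four non-zero ICIs add up to its Duration. The function name `click_times` is hypothetical.

```python
import numpy as np
import pandas as pd


def click_times(row):
    """Return absolute click times for one annotation row.

    Assumes TsTo is the time of the first click and ICI1..ICI(n-1) are the
    gaps between consecutive clicks, where n is nClicks.
    """
    n_clicks = int(row["nClicks"])
    ici_columns = [f"ICI{i}" for i in range(1, n_clicks)]
    intervals = row[ici_columns].to_numpy(dtype=float)
    offsets = np.concatenate(([0.0], np.cumsum(intervals)))
    return float(row["TsTo"]) + offsets


# Values copied from the example row above (unused ICI columns omitted).
example_row = pd.Series(
    {
        "REC": "CETI23-280",
        "nClicks": 5,
        "Duration": 0.98399,
        "ICI1": 0.31889,
        "ICI2": 0.32649,
        "ICI3": 0.1892,
        "ICI4": 0.14941,
        "ICI5": 0.0,
        "TsTo": 3.2977,
    }
)
times = click_times(example_row)
print(times)
```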

## Phase 2

For each partition of the dataset (train / val / test) you need a file named "dataset/p2_[partition name]_dataset.csv" with the following format:

In [11]:
pd.read_csv("dataset/p2_train_dataset.csv")

Unnamed: 0,file_name,part,first_context_start_frame,last_context_start_frame,audio_path
0,sw061b003,0,0,89630080,/raid/lingo/martinrm/original_data/dataset/201...
1,sw100a002,0,30050520,48107200,/raid/lingo/martinrm/original_data/dataset/201...
2,sw091b001,0,0,132048120,/raid/lingo/martinrm/original_data/dataset/201...
3,sw106a004,1,131236080,178467680,/raid/lingo/martinrm/original_data/dataset/201...
4,sw097a001,1,71600280,90439760,/raid/lingo/martinrm/original_data/dataset/201...
...,...,...,...,...,...
78,sw091b002,0,28198160,48821480,/raid/lingo/martinrm/original_data/dataset/201...
79,sw114b001,2,162627080,179025640,/raid/lingo/martinrm/original_data/dataset/201...
80,sw114b003,1,119066040,134428240,/raid/lingo/martinrm/original_data/dataset/201...
81,sw085a002,1,94663640,176517440,/raid/lingo/martinrm/original_data/dataset/201...


Each row represents a subset of an audio file. "part" is a number that identifies each subset of an audio file. "first_context_start_frame" and "last_context_start_frame" give the start and end of the subset in frames, assuming a sample rate of 22050 frames per second.
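
As a small worked example of the frame convention, the sketch below converts the frame columns to seconds by dividing by the 22050 frames-per-second sample rate. The two rows are copied from the table above; the column names `start_s`, `end_s`, and `length_s` are introduced here only for illustration.

```python
import pandas as pd

SAMPLE_RATE = 22050  # frames per second, as stated above

# Two rows copied from the table above (audio_path omitted).
parts = pd.DataFrame(
    {
        "file_name": ["sw061b003", "sw100a002"],
        "part": [0, 0],
        "first_context_start_frame": [0, 30050520],
        "last_context_start_frame": [89630080, 48107200],
    }
)

# Frames divided by the sample rate give seconds.
parts["start_s"] = parts["first_context_start_frame"] / SAMPLE_RATE
parts["end_s"] = parts["last_context_start_frame"] / SAMPLE_RATE
parts["length_s"] = parts["end_s"] - parts["start_s"]

print(parts[["file_name", "part", "start_s", "end_s", "length_s"]])
```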

The annotations for all the audio files need to be in the file "dataset/p2_all_annotations.csv". In the CSV file, each row represents a coda and must include at least the following columns:

1) "REC" identifies the audio file. The first 6 letters should be unique.
2) "TsTo" is the start time of the coda in seconds.
3) "Whale" is an ID that identifies the speaker.
4) "nClicks" is the number of clicks.
5) "ICI[X]" is the Xth inter-click interval.
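
Before training, it can help to check that an annotations file has the columns listed above. The following is a minimal sketch of such a check; the helper `missing_annotation_columns` is hypothetical and only tests for column names, not for the values in them.

```python
import re

import pandas as pd

# Column names taken from the list above.
REQUIRED_COLUMNS = ("REC", "TsTo", "Whale", "nClicks")


def missing_annotation_columns(columns):
    """Return the required column names that are absent from `columns`."""
    present = [str(c) for c in columns]
    missing = [name for name in REQUIRED_COLUMNS if name not in present]
    # At least one inter-click interval column (ICI1, ICI2, ...) is expected.
    if not any(re.fullmatch(r"ICI\d+", name) for name in present):
        missing.append("ICI[X]")
    return missing


# In-memory stand-ins for pd.read_csv("dataset/p2_all_annotations.csv").
complete = pd.DataFrame(
    columns=["REC", "nClicks", "Duration", "ICI1", "ICI2", "Whale", "TsTo"]
)
incomplete = pd.DataFrame(columns=["REC", "nClicks", "Duration"])

print(missing_annotation_columns(complete.columns))
print(missing_annotation_columns(incomplete.columns))
```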

In [12]:
pd.read_csv("dataset/p2_all_annotations.csv").columns

Index(['REC', 'nClicks', 'Duration', 'ICI1', 'ICI2', 'ICI3', 'ICI4', 'ICI5',
       'ICI6', 'ICI7', 'ICI8', 'ICI9', 'ICI10', 'ICI11', 'ICI12', 'ICI13',
       'ICI14', 'ICI15', 'ICI16', 'ICI17', 'ICI18', 'ICI19', 'ICI20', 'ICI21',
       'ICI22', 'ICI23', 'ICI24', 'ICI25', 'ICI26', 'ICI27', 'ICI28', 'ICI29',
       'ICI30', 'ICI31', 'ICI32', 'ICI33', 'ICI34', 'ICI35', 'ICI36', 'ICI37',
       'ICI38', 'ICI39', 'ICI40', 'Whale', 'TsTo'],
      dtype='object')

In [13]:
pd.read_csv("dataset/p2_all_annotations.csv").head(1)

Unnamed: 0,REC,nClicks,Duration,ICI1,ICI2,ICI3,ICI4,ICI5,ICI6,ICI7,...,ICI33,ICI34,ICI35,ICI36,ICI37,ICI38,ICI39,ICI40,Whale,TsTo
0,sw106a002,9,0.386608,0.038208,0.042742,0.044058,0.041092,0.048275,0.049517,0.057583,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,1725.051967


# Training new models

Once you have replaced the dataset files with your own, simply run the following cells in order (alternatively, run the scripts from the command line). The models will be saved in "phase_1_checkpoints" and "transformer_training_output".

## Train phase 1 models

In [None]:
!python click_candidate_detector_training.py

## Create phase 2 dataset

In [None]:
!python make_transformer_dataset.py

## Train phase 2 models

In [None]:
!python candidate_revision_training.py
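
For running the three steps outside the notebook, a small driver script is one option. The sketch below is an assumption about how that could look, not part of the repository: `run_in_order` is a hypothetical helper that calls each script with the current Python interpreter and stops at the first failure.

```python
import subprocess
import sys

# Script names taken from the cells above, in the order they should run.
TRAINING_SCRIPTS = [
    "click_candidate_detector_training.py",
    "make_transformer_dataset.py",
    "candidate_revision_training.py",
]


def run_in_order(scripts, run=subprocess.run):
    """Run each script in sequence with the current Python interpreter.

    With check=True, subprocess.run raises CalledProcessError on a non-zero
    exit code, so later steps do not start after a failed one. The `run`
    parameter exists so the function can be exercised without launching
    any training.
    """
    for script in scripts:
        run([sys.executable, script], check=True)


# To launch the full pipeline from the repository folder, call:
# run_in_order(TRAINING_SCRIPTS)
```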