<a href="https://colab.research.google.com/github/marathomas/meerkat/blob/master/01_0_marmoset_prepare_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Marmoset vocalization dataset custom parsing
- This dataset has:
    - A number of WAV files whose naming convention encodes the individuals vocalizing
    - Corresponding .mat files with the timing of each phee/call and the individual making the vocalization
- This notebook extracts periods of vocalization into new WAV files, and creates a corresponding JSON and TextGrid for each WAV with annotation information
- Dataset origin:
    - Received via correspondence with the Miller Lab

In [0]:
from avgn.utils.general import prepare_env

In [0]:
prepare_env()

env: CUDA_VISIBLE_DEVICES=GPU


### Import relevant packages

In [3]:
from joblib import Parallel, delayed
from tqdm.autonotebook import tqdm
import pandas as pd
import librosa
from datetime import datetime
import json

  


In [0]:
import avgn
from avgn.custom_parsing.miller_marmoset import (
    parse_marmoset_data,
    parse_marmoset_calls,
    annotate_bouts,
    segment_wav_into_bouts
)
from avgn.utils.paths import DATA_DIR

[Mara:] The functions parse_marmoset_data etc. come from here: https://drive.google.com/drive/folders/1298bW06MLbUDFNzSRCDI-6B3N4EqrFBt

### Load data in original format

In [4]:
# create a unique datetime identifier for the files output by this notebook
DT_ID = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
DT_ID

'2020-06-02_16-23-39'

In [0]:
DSLOC = avgn.utils.paths.Path('/mnt/cube/Datasets/Marmosets/FromMillerLab')

In [0]:
wavs = list(DSLOC.glob('*.wav'))
len(wavs), wavs[:3]

(186,
 [PosixPath('/mnt/cube/Datasets/Marmosets/FromMillerLab/han.todd.170621.wav'),
  PosixPath('/mnt/cube/Datasets/Marmosets/FromMillerLab/ares_spn_230217_203.wav'),
  PosixPath('/mnt/cube/Datasets/Marmosets/FromMillerLab/ares_ant_010317_33.wav')])

In [0]:
matfiles = list(DSLOC.glob("*.mat"))
len(matfiles), matfiles[:3]

(82,
 [PosixPath('/mnt/cube/Datasets/Marmosets/FromMillerLab/apollo_angel_140217.mat'),
  PosixPath('/mnt/cube/Datasets/Marmosets/FromMillerLab/jasmine.hermes.170622.mat'),
  PosixPath('/mnt/cube/Datasets/Marmosets/FromMillerLab/aladdin_banana_060317.mat')])

[Mara:] .mat files store MATLAB variables, but I haven't come across them so far, so I don't know what is supposed to be stored in them.

### Parse data into dataframe

In [0]:
wav_df = parse_marmoset_data(wavs, _filetype = "wav")
print(len(wav_df))
display(wav_df[:3])

183


| | monkey1 | monkey2 | date | date_idx | wav_loc |
|---|---|---|---|---|---|
| 0 | han | todd | 170621 | | /mnt/cube/Datasets/Marmosets/FromMillerLab/han... |
| 1 | ares | spn | 230217 | 203.0 | /mnt/cube/Datasets/Marmosets/FromMillerLab/are... |
| 2 | ares | ant | 10317 | 33.0 | /mnt/cube/Datasets/Marmosets/FromMillerLab/are... |


[Mara:] parse_marmoset_data only splits each filename into its parts (monkey identifiers, date, etc.) and fills them into the corresponding columns of a dataframe.
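As a rough illustration of that filename splitting, here is a minimal sketch. It is not the avgn implementation: `split_recording_name` is a hypothetical helper, and the assumed convention (`monkey1`, `monkey2`, `date` and an optional `date_idx`, separated by `.` or `_`) is inferred only from the example filenames listed above.

```python
import re
from pathlib import Path


def split_recording_name(path):
    """Split '<monkey1><sep><monkey2><sep><date>[<sep><date_idx>].<ext>' into fields.

    Hypothetical helper; the separators '.' and '_' are an assumption based on
    the example filenames in this notebook.
    """
    stem = Path(path).stem  # drop the directory and the file extension
    parts = re.split(r"[._]", stem)
    if len(parts) not in (3, 4):
        return None  # the name does not follow the assumed convention
    monkey1, monkey2, date = parts[:3]
    date_idx = parts[3] if len(parts) == 4 else None
    return {
        "monkey1": monkey1,
        "monkey2": monkey2,
        "date": date,  # kept as a string so a leading zero survives
        "date_idx": date_idx,
    }


print(split_recording_name("han.todd.170621.wav"))
print(split_recording_name("ares_ant_010317_33.wav"))
```

The sketch keeps the date as a string. In the dataframe above, `ares_ant_010317_33.wav` shows up with date `10317`, which suggests the date is stored as a number there and loses its leading zero.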

In [0]:
mf_df = parse_marmoset_data(matfiles, _filetype = "mat")
print(len(mf_df))
display(mf_df[:3])

81


| | monkey1 | monkey2 | date | date_idx | mat_loc |
|---|---|---|---|---|---|
| 0 | apollo | angel | 140217 | | /mnt/cube/Datasets/Marmosets/FromMillerLab/apo... |
| 1 | jasmine | hermes | 170622 | | /mnt/cube/Datasets/Marmosets/FromMillerLab/jas... |
| 2 | aladdin | banana | 60317 | | /mnt/cube/Datasets/Marmosets/FromMillerLab/ala... |


In [0]:
# merge dataframes: attach the matching wav path to every .mat row
mf_df = pd.merge(
    mf_df,
    wav_df,
    how="left",
    on=["monkey1", "monkey2", "date", "date_idx"],
    suffixes=(False, False),
)
# drop .mat rows for which no matching wav was found
mf_df = mf_df[mf_df.wav_loc.notnull()]
print(len(mf_df))
display(mf_df[:3])

80


| | monkey1 | monkey2 | date | date_idx | mat_loc | wav_loc |
|---|---|---|---|---|---|---|
| 0 | apollo | angel | 140217 | | /mnt/cube/Datasets/Marmosets/FromMillerLab/apo... | /mnt/cube/Datasets/Marmosets/FromMillerLab/apo... |
| 1 | jasmine | hermes | 170622 | | /mnt/cube/Datasets/Marmosets/FromMillerLab/jas... | /mnt/cube/Datasets/Marmosets/FromMillerLab/jas... |
| 2 | aladdin | banana | 60317 | | /mnt/cube/Datasets/Marmosets/FromMillerLab/ala... | /mnt/cube/Datasets/Marmosets/FromMillerLab/ala... |


[Mara:] OK, so some of the information was in the .mat dataframe and some in the wav dataframe, but now each row of a single df holds both the mat_loc and the wav_loc of a recording.
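To make the effect of the left merge concrete, here is a small sketch on toy data. The rows, file names and the monkey names `zeus`/`hera` are made up for illustration, and `indicator=True` is added only to show which rows found a partner.

```python
import pandas as pd

# Toy stand-ins for mf_df and wav_df (values are invented for illustration)
mat_toy = pd.DataFrame(
    {
        "monkey1": ["apollo", "zeus"],
        "monkey2": ["angel", "hera"],
        "date": [140217, 10101],
        "mat_loc": ["a.mat", "z.mat"],
    }
)
wav_toy = pd.DataFrame(
    {
        "monkey1": ["apollo", "han"],
        "monkey2": ["angel", "todd"],
        "date": [140217, 170621],
        "wav_loc": ["a.wav", "h.wav"],
    }
)

keys = ["monkey1", "monkey2", "date"]
# A left merge keeps every .mat row; wav-only rows never enter the result
merged = pd.merge(mat_toy, wav_toy, how="left", on=keys, indicator=True)
# Keep only the .mat rows that found a matching wav
paired = merged[merged["wav_loc"].notnull()].drop(columns="_merge")

print(merged)
print(paired)
```

In this toy case the `zeus`/`hera` annotation has no recording and is dropped, while the `han`/`todd` recording has no annotation and never appears. The same pattern would explain why the notebook goes from 81 .mat rows to 80 after the filter.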

### Parse matfiles into syllables

In [0]:
syllable_df = pd.concat(
    Parallel(n_jobs=-1, verbose=10)(
        delayed(parse_marmoset_calls)(row)
        for idx, row in tqdm(mf_df.iterrows(), total=len(mf_df))
    )
)


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    4.2s
[Parallel(n_jobs=-1)]: Done  13 tasks      | elapsed:    4.8s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:    5.5s





[Parallel(n_jobs=-1)]: Done  42 out of  80 | elapsed:    6.3s remaining:    5.7s
[Parallel(n_jobs=-1)]: Done  51 out of  80 | elapsed:    6.9s remaining:    3.9s
[Parallel(n_jobs=-1)]: Done  60 out of  80 | elapsed:    7.3s remaining:    2.4s
[Parallel(n_jobs=-1)]: Done  69 out of  80 | elapsed:    7.8s remaining:    1.3s
[Parallel(n_jobs=-1)]: Done  78 out of  80 | elapsed:    8.9s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:    9.5s finished


In [0]:
print(len(syllable_df))
display(syllable_df[:3])

14295


| | indv | partner | date | call_type | wav_loc | call_num | pulse_n | pulse_start | pulse_end |
|---|---|---|---|---|---|---|---|---|---|
| 0 | apollo | angel | 140217 | phee | /mnt/cube/Datasets/Marmosets/FromMillerLab/apo... | 0 | 0 | 14.038007 | 16.171723 |
| 1 | apollo | angel | 140217 | phee | /mnt/cube/Datasets/Marmosets/FromMillerLab/apo... | 1 | 0 | 107.359792 | 108.729595 |
| 2 | apollo | angel | 140217 | phee | /mnt/cube/Datasets/Marmosets/FromMillerLab/apo... | 1 | 1 | 109.060383 | 110.417463 |


[Mara:] So we've created this syllable dataframe containing all vocalizations: the individual that produced each one, the label (call type), the location of the wav file, the call and pulse indices, and the start and stop times (columns: indv, partner, date, call_type, wav_loc, call_num, pulse_n, pulse_start, pulse_end). Judging from the rows above, call_num seems to number the calls within a recording and pulse_n the pulses within a call (?).

### Segment WAVs into 'bouts'
- Large parts of the original recordings contain no vocalizations. Here, we cut those silent periods out and create new sub-WAVs. For each sub-WAV, we generate a JSON with metadata and segment information.

In [0]:
# HParams is just a python object storing a set of hyperparameters.
hparams = avgn.utils.general.HParams(
    bout_segmentation_min_s = 30,  # Minimum amount of seconds between vocal activity required to split a wavfile
    bout_pad_s = 5, # how much time to pad this bout with on either side
    # noise clip
    get_noise_clip = True, # if a noise clip preceding the vocalization should be grabbed to help reduce noise in analysis
    max_noise_clip_size_s = 10, # how large the noise clip can be
    min_noise_clip_size_s = 1, # how small the noise clip can be
    
)

[Mara:] **Very weird, but it looks like the noise clip is never used in the subsequent processing and analysis steps (0.0-marmoset-create-syllable-df.ipynb and 4.0-marmoset-dataset-umap.ipynb)!**

The only related step is in create-syllable-df.ipynb, which "masks a spectrogram to be above some % of the maximum power" (mask_spec in avgn_paper/avgn/signalprocessing/create_spectrogram_dataset.py); I can't find any other filtering.

**BUT: There's a package from Sainburg called noisereduce, which does exactly what I would have expected here: it removes noise from an audio file based on a prototypical "example file" of the noise: https://timsainburg.com/noise-reduction-python.html**

It looks super easy to use and is available here: https://github.com/timsainb/noisereduce
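A minimal sketch of how the saved noise clip could be fed to noisereduce is shown below. It is not part of the avgn pipeline: `denoise_with_noise_clip` is a hypothetical wrapper, and the keyword names (`y`, `sr`, `y_noise`, `stationary`) are those of the noisereduce 2.x `reduce_noise` function as far as I know; older releases used different argument names, so check the installed version.

```python
import numpy as np


def denoise_with_noise_clip(call, noise_clip, sr):
    """Spectral-gate `call` using noise statistics estimated from `noise_clip`.

    Hypothetical wrapper; assumes the noisereduce 2.x keyword names.
    """
    # Imported lazily so this definition loads even where noisereduce is missing
    import noisereduce as nr  # third-party: pip install noisereduce

    return nr.reduce_noise(
        y=np.asarray(call, dtype=float),
        sr=sr,
        y_noise=np.asarray(noise_clip, dtype=float),
        stationary=True,  # the separate noise clip is used in stationary mode
    )
```

With the files written by this notebook, `call` and `noise_clip` would be the arrays returned by `librosa.load(..., sr=None)` for a bout WAV and its matching noise WAV.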

In [0]:
bout_dfs =  Parallel(n_jobs=-1, verbose=10)(
        delayed(segment_wav_into_bouts)(
            (
                syllable_df[syllable_df.wav_loc == wav_loc]
                .sort_values(by=["pulse_start"])
                .reset_index()
            ),
            hparams,
        )
        for wav_loc in tqdm(syllable_df.wav_loc.unique())
    )
bout_dfs = [item for sublist in bout_dfs for item in sublist]

[Mara:] For each wav, we now create a list of dataframes, each holding the information for one 'bout' of that wav, a bout being roughly one vocalization. I think a bout_df is essentially one row of the wav_df. The only extra step is a filter: if the gap after one vocalization is shorter than bout_segmentation_min_s, the next vocalization is treated as part of the same bout. (bout_df has the same column names as syllable_df/wav_df.)
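The gap rule described above can be sketched in a few lines. This is an illustration only, not the code of segment_wav_into_bouts: `group_into_bouts` is a hypothetical helper, and whether avgn compares the gap with `<` or `<=` is an assumption.

```python
def group_into_bouts(pulses, min_gap_s):
    """Group (start, end) pulse times into bouts.

    A new bout starts whenever the silence since the end of the current bout
    is at least `min_gap_s` seconds (hypothetical helper, for illustration).
    """
    bouts = []
    bout_end = None
    for start, end in sorted(pulses):
        if bout_end is not None and start - bout_end < min_gap_s:
            # gap too short: this pulse joins the current bout
            bouts[-1].append((start, end))
            bout_end = max(bout_end, end)
        else:
            # first pulse, or a long enough silence: open a new bout
            bouts.append([(start, end)])
            bout_end = end
    return bouts


# Pulse times rounded from the first three rows of syllable_df shown above
pulses = [(14.04, 16.17), (107.36, 108.73), (109.06, 110.42)]
bouts = group_into_bouts(pulses, min_gap_s=30)
print(len(bouts))
print(bouts)
```

With bout_segmentation_min_s = 30, the first phee stands alone (about 91 s of silence follows it), while the two pulses of the second call, only about 0.3 s apart, end up in the same bout.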

### Save bouts and JSON files

In [0]:
Parallel(n_jobs=-1, verbose=10)(
    delayed(annotate_bouts)(
        DT_ID,
        bout_number,
        syllable_df[syllable_df.wav_loc == bout_df.iloc[0].wav_loc]
        .sort_values(by=["pulse_start"])
        .reset_index(),
        bout_df,
        hparams,
    )
    for bout_number, bout_df in tqdm(enumerate(bout_dfs), total=len(bout_dfs))
);

[Mara:] annotate_bouts takes as input:
- DT_ID: the datetime string
- bout_number: the index from enumerate
- wav_df: the row(s) of the syllable df that belong to the same wav file as the bout, sorted by pulse_start
- bout_df: the dataframe for this bout (a one-row dataframe?)
- hparams

and does several things:
- pads each bout by a fixed bout_pad_s (from hparams)
- extracts a noise clip from before or after the vocalization where possible (i.e. if no other call is directly before or after)
- writes three types of output files: a .WAV of the call (padded with bout_pad_s), a .JSON and a .WAV of the noise, saved in the folders wav_out, json_out and noise_out
- for the JSON file, annotates each bout with the calls of each possible individual. (The dataset can contain calls from more than one individual per bout, I think.) Luckily we don't have that problem, so I could simplify that part of the code.

Explanation of extract_noise_clip: it tries to get a noise clip before the bout first, then after it. From: avgn_paper/avgn/custom_parsing/general.py

import librosa
import numpy as np

# The signature below is assumed so that the excerpt is self-contained;
# voc_ends must be a numpy array of vocalization end times in seconds.
def extract_noise_pre(
    wav_loc, bout_start, voc_ends, min_noise_clip_size_s, max_noise_clip_size_s
):
    # try to get a noise clip from the time preceding this bout
    if bout_start > min_noise_clip_size_s:
        # get time of preceding pulses
        # Mara: td holds the distance from bout_start to the end of every vocalization in the wav
        td = bout_start - voc_ends
        # Mara: only the vocalizations PRIOR to the bout are needed, i.e. positive distances
        td = td[td > 0]
        # if there is anything within this timeframe, this timeframe is unusable
        # Mara: if any distance is shorter than the minimum noise clip size, no noise clip can be taken before the bout
        if not np.any(td < min_noise_clip_size_s):
            # Mara: otherwise take the largest allowed duration: either the distance to a prior
            # vocalization or, if all are far away, the maximum noise clip size
            # Mara: I think this should be td - 1 instead of td + 1. td + 1 allows overlap with a
            # vocalization, whereas td - 1 would leave at least 1 s between the last vocalization
            # and the start of the noise clip. list(td + 1) adds 1 to each item of td and turns it
            # into a list; + [max_noise_clip_size_s] appends the maximum clip size to that list
            noise_start = bout_start - np.min(
                list(td + 1) + [max_noise_clip_size_s]
            )
            noise_end = bout_start

            # load the clip
            noise_clip, sr = librosa.load(
                wav_loc,
                mono=True,
                sr=None,
                offset=noise_start,
                duration=noise_end - noise_start,
            )
            return noise_clip, sr
    return None, None
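
To check the td + 1 versus td - 1 question from the comments above with numbers, here is a small sketch that reproduces only the window arithmetic, without loading audio. `noise_window` and its `margin_s` parameter are hypothetical names introduced for this comparison, and the times are invented.

```python
import numpy as np


def noise_window(bout_start, voc_ends, min_clip_s, max_clip_s, margin_s):
    """Return (noise_start, noise_end) before a bout, or None if there is no room.

    margin_s=+1 mirrors the excerpt as written; margin_s=-1 is the proposed change.
    """
    if bout_start <= min_clip_s:
        return None
    td = bout_start - np.asarray(voc_ends, dtype=float)
    td = td[td > 0]  # keep only vocalizations that end before the bout
    if np.any(td < min_clip_s):
        return None  # another vocalization ends too close to the bout
    noise_start = bout_start - np.min(list(td + margin_s) + [max_clip_s])
    return float(noise_start), float(bout_start)


prev_end = 45.0  # invented: the previous vocalization ends 5 s before the bout
as_written = noise_window(50.0, [10.0, prev_end], 1.0, 10.0, margin_s=+1.0)
proposed = noise_window(50.0, [10.0, prev_end], 1.0, 10.0, margin_s=-1.0)

print(as_written)  # window starts at 44.0 s, i.e. 1 s before prev_end
print(proposed)    # window starts at 46.0 s, i.e. 1 s after prev_end
```

In this example the clip as written starts 1 s before the previous vocalization has ended, so it overlaps that call, whereas td - 1 starts 1 s after it. One caveat of td - 1: when the previous vocalization ends only slightly more than min_noise_clip_size_s before the bout, the resulting clip can be shorter than the minimum, so an extra length check might be needed.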