In [None]:
import scripts.functions.preprocessing as pp
from scripts.functions.metadata import metadata_longform_papio, meta_papio
from scripts.functions.hypermodel import WinWavTransferLearning
from scripts.functions.segmentation import segmentation, df_pred, wav_creation
import os
import tensorflow as tf
import kerastuner as kt
import pandas as pd
from datetime import datetime
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)

We present here an example that replicates the model used in the paper https://arxiv.org/abs/2302.07640

The labeled data, the long-form audio recordings used in this example, and all the vocalizations extracted during the month are available at https://doi.org/10.5281/zenodo.7963124

The folder structure should be as follows: in the working directory,   
- a folder *data*. In this folder,
    - a folder *vocalizations*, in which we have the labeled data,
    - a folder *data_augmentation*, in which we save the modified vocalizations,
        - a folder *test*, for the modified recordings of the test partition,
        - a folder *train*, for the modified recordings of the train partition,
        - a folder *val*, for the modified recordings of the validation partition.  
        For each one, there are three folders, one per condition of modification of the original recordings:
            - a folder *noise*,
            - a folder *tonality*,
            - a folder *speed*,
    - a folder *longform_recordings*, in which we have the long form audio recordings,
    - a folder *segmentation_papio*, in which we will save the vocalizations detected.
- a folder *scripts*. In this folder,
    - a folder *functions*, in which we have all the functions used during the pipeline,
    - a folder *learning*, for the scripts used to train the models presented in the paper,
    - a folder *prediction*, for the scripts used to run the prediction on all the long-form audio recordings presented in the paper, which output the data on the baboon and baby vocalizations.
    
The GitLab repository and the data available on Zenodo are already organized that way. The libraries to install are listed in requirements.txt   

We present here an example that replicates the model using a subset of the baboon recordings. We cannot provide the whole month for legal reasons. We make 2 hours accessible, along with the labeled dataset used in the paper. We show how to train a model from the labeled dataset and how to use it to segment these two hours.   
The complete output of the segmentation of the month is available on Zenodo.   

For the baby data, neither the long-form audio recordings nor the output of the segmentation is available, for legal reasons. The labeled dataset is BabbleCor and can be found at https://osf.io/rz4tx/   

# Learning

We start by loading the metadata of the labeled data used for training.

In [None]:
meta_train, meta_val, meta_test = meta_papio(os.getcwd(), data_augmentation=False, weighting_sampling=False)

Then, we prepare the data. This is done by creating a dataset from the metadata of the labeled data. 
The recordings are loaded and resampled to 16 kHz, in mono. 
We create independent frames with a 1-second sliding window and 80% overlap. 
We use a resampling strategy to obtain a uniform distribution among classes during training. 
Because we use transfer learning from YAMNet, the frames are mapped to log-mel spectrograms.
Data augmentation is done beforehand rather than during training, because we do not expect to have much labeled data. This saves time during training without costing too much memory.
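
To make the framing step concrete, here is a minimal sketch of a 1-second window with 80% overlap at 16 kHz, which gives a hop of 3200 samples between frame starts. The function name `frame_signal` and the choice to drop an incomplete trailing frame are assumptions made for illustration; the actual logic lives in `scripts.functions.preprocessing` and may differ.

```python
def frame_signal(samples, sample_rate=16000, window_s=1.0, overlap=0.8):
    """Split a 1-D sequence of samples into fixed-length overlapping frames."""
    window = int(round(window_s * sample_rate))   # 16000 samples per frame
    hop = int(round(window * (1.0 - overlap)))    # 3200 samples between frame starts
    if hop <= 0:
        raise ValueError("overlap must be strictly below 1")
    frames = []
    # Assumption: only complete frames are kept; a shorter trailing piece is dropped.
    for start in range(0, len(samples) - window + 1, hop):
        frames.append(samples[start:start + window])
    return frames


# Three seconds of silence stand in for a real recording.
signal = [0.0] * (3 * 16000)
frames = frame_signal(signal)
print(len(frames), len(frames[0]))  # 11 16000
```

With this overlap, each second of audio contributes to five frames, which multiplies the number of training examples obtained from a small labeled dataset.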

In [None]:
train, steps_per_epoch = pp.preparation_data_set(meta_train, resample=True, batch_size=32, transfer_learning=True)
val = pp.preparation_data_set(meta_val, resample=False, batch_size=32, transfer_learning=True)

input_shape = next(iter(train.unbatch()))[0].shape

train = train.shuffle(1000)

We instantiate the hypermodel and the hyperparameters, as well as the callbacks.
The values set here are for the example and can be changed and increased for a full training run.
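
The callbacks configured in the next cell use a patience of 2 epochs for the learning-rate reduction (factor 0.2) and a patience of 5 epochs for early stopping, so the learning rate can be reduced twice before a stalled run is stopped. The sketch below replays a list of validation losses through simplified versions of these two rules. It is not the Keras implementation: `simulate_callbacks` is a hypothetical helper, any strict decrease counts as an improvement, and cooldown and minimum learning rate are ignored.

```python
def simulate_callbacks(val_losses, lr=1e-3, factor=0.2, lr_patience=2, stop_patience=5):
    """Replay validation losses through simplified plateau rules.

    Returns (stop_epoch, final_lr); stop_epoch is None if training never stops early.
    """
    best = float("inf")
    lr_wait = 0    # epochs without improvement since the last learning-rate change
    stop_wait = 0  # epochs without improvement since the best epoch
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, lr_wait, stop_wait = loss, 0, 0
            continue
        lr_wait += 1
        stop_wait += 1
        if lr_wait >= lr_patience:
            lr *= factor   # reduce the learning rate on a plateau
            lr_wait = 0
        if stop_wait >= stop_patience:
            return epoch, lr   # early stopping triggers here
    return None, lr


# Validation loss improves twice, then stalls.
losses = [1.0, 0.8, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
stop_epoch, final_lr = simulate_callbacks(losses)
print(stop_epoch, final_lr)
```

In this run the learning rate is reduced at epochs 4 and 6 and training stops at epoch 7, five epochs after the best validation loss.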

In [None]:
hypermodel = WinWavTransferLearning(input_shape=input_shape, n_labels=6)
hp = kt.HyperParameters()

tuner = kt.tuners.bayesian.BayesianOptimization(hypermodel=hypermodel,
                                               hyperparameters=hp,
                                               objective=kt.Objective("val_loss", direction="min"),
                                                # increase for more search iterations;
                                                # set to 2 here for the example
                                                max_trials=2,
                                                num_initial_points=1,
                                                tune_new_entries=True,
                                                project_name="exemple")

earlystop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", min_delta=0, 
                                             patience=5, verbose=1, 
                                             restore_best_weights=True)

checkpoint = tf.keras.callbacks.ModelCheckpoint(filepath=os.path.join(os.getcwd(), "exemple/cp.hdf5"),
                                               monitor="val_loss", mode="min", 
                                                save_best_only=True, verbose=1)

history = tf.keras.callbacks.CSVLogger(os.path.join(os.getcwd(), "exemple/train.csv"),
                                      separator=",", append=False)


lr_schedule = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.2,
                                                  patience=2, verbose=1)

callbacks = [earlystop, checkpoint, history, lr_schedule]

We can now start the training.

In [None]:
# Disable AutoShard
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.OFF
train = train.with_options(options)
val = val.with_options(options)

print("Start of the learning")
start = datetime.now()
tuner.search(train, epochs=20, validation_data=val, callbacks=callbacks,
            steps_per_epoch=steps_per_epoch)
delta_training = datetime.now() - start

# Prediction

The model has been trained on the labeled data and can now be used to detect vocalizations in long-form audio recordings.
We start by loading the metadata of the long-form audio recordings, including their length, and we prepare the data to be processed by the model.

In [None]:
meta_longform = metadata_longform_papio(os.path.join(os.getcwd(), "data/longform_recordings"), 
                                       length=True)

ds = pp.preparation_longform_papio(meta_longform, batch_size=32, transfer_learning=True)

We take the best model from the optimization process and use it to find the vocalization segments in the recordings.

In [None]:
model = tuner.get_best_models()[0]
model.summary()

In [None]:
start = datetime.now()
y = model.predict(ds)
delta_pred = datetime.now() - start
print("Duration prediction:", delta_pred)

# Segmentation

Once the model has been trained and used to find the vocalization segments in the long-form audio recordings, we extract the information.
First, we create txt files, one per recording, containing the number of vocalizations found and their times in the recording.
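
As an illustration of how frame-level predictions can become time-stamped segments, the sketch below merges consecutive frames that share the same non-background label into one interval. It is not the repository's `segmentation` function: the helper `frames_to_segments`, the background label index `0`, the 0.2-second hop, and the merging rule are all assumptions made for this example.

```python
def frames_to_segments(frame_labels, hop_s=0.2, window_s=1.0, background=0):
    """Merge runs of consecutive frames sharing a non-background label.

    Returns a list of (label, start_s, end_s) tuples.
    """
    segments = []
    run_label, run_start = None, None
    # A trailing background label flushes the last open run.
    for i, label in enumerate(list(frame_labels) + [background]):
        if label == run_label:
            continue
        if run_label is not None and run_label != background:
            # The run covers frames run_start..i-1; its last frame ends
            # one window after that frame's start.
            end_s = (i - 1) * hop_s + window_s
            segments.append((run_label, round(run_start * hop_s, 3), round(end_s, 3)))
        run_label, run_start = label, i
    return segments


# One predicted label per frame, e.g. the argmax of each row of probabilities.
labels = [0, 0, 3, 3, 3, 0, 0, 1, 0]
print(frames_to_segments(labels))  # [(3, 0.4, 1.8), (1, 1.4, 2.4)]
```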

In [None]:
segmentation(meta_longform, y, baby=False)

Then, we use the information in the txt files to create the wav files, one per vocalization.

In [None]:
wav_creation(baby=False)

We create a dataframe with more information on each vocalization the model detected (the day, the hour, the duration, the probability of each label).
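
One plausible way to obtain a single probability per label for a vocalization is to average the frame-level probabilities over the frames it spans. The sketch below shows this under stated assumptions: `summarize_segment` is a hypothetical helper, the 0.2-second hop and 1-second window follow the framing described earlier, and the averaging rule is not confirmed to be what `df_pred` does.

```python
def summarize_segment(frame_probs, first_frame, last_frame, hop_s=0.2, window_s=1.0):
    """Average the per-frame probabilities of one detected vocalization."""
    rows = frame_probs[first_frame:last_frame + 1]
    n_labels = len(rows[0])
    # Mean probability of each label over the frames of the segment.
    mean_probs = [sum(row[k] for row in rows) / len(rows) for k in range(n_labels)]
    return {
        "start_s": round(first_frame * hop_s, 3),
        "duration_s": round((last_frame - first_frame) * hop_s + window_s, 3),
        "label": max(range(n_labels), key=mean_probs.__getitem__),
        "probabilities": mean_probs,
    }


# Four frames, two labels; the vocalization spans frames 1 and 2.
probs = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.8, 0.2]]
summary = summarize_segment(probs, first_frame=1, last_frame=2)
print(summary["label"], summary["start_s"], summary["duration_s"])
```

A list of such dictionaries, one per vocalization, can be passed directly to `pd.DataFrame` to obtain one row per detection.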

In [None]:
df = df_pred(meta_longform, y)