# Exploring deep sea acoustic events
## Whale song detector: pre-process target and background audio

### Feb 2020 PDSG Applied Data Science Meetup series<br>John Burt

### Session details

For February’s four-session meetup series we’ll be working with long-term hydrophone recordings from the University of Hawaii's Aloha Cabled Observatory (ACO, http://aco-ssds.soest.hawaii.edu), located at a depth of 4728 m off Oahu. The recordings span a year and contain many acoustic events: wave movements, the sound of rain, ship noise, possible bomb noises, geologic activity, and whale calls and songs. There is a wide range of project topics to explore: identifying and counting acoustic events such as whale calls, measuring daily or seasonal noise trends, measuring wave hydrodynamics, etc.

### This notebook:

For model training and testing, I overlay example clips of whale song notes on background noise from the recordings. In this notebook, I clean and prepare the training target sounds (whale song notes) and the background sound, then save them in an HDF5 file for quick loading during training runs.
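
The overlay step itself happens in the training notebooks, not here. As a rough illustration of what overlaying a note on background noise can look like, here is a minimal sketch. The function name `overlay_clip`, the `gain` parameter, the random placement, and the synthetic test signals are all my assumptions for illustration, not the code used in the series.

```python
import numpy as np

def overlay_clip(background, target, gain=1.0, rng=None):
    """Add `target` into a copy of `background` at a random offset.

    Hypothetical helper: assumes both arrays are 1-D float waveforms at the
    same sample rate and that the target is no longer than the background.
    """
    if len(target) > len(background):
        raise ValueError("target clip is longer than the background segment")
    rng = np.random.default_rng() if rng is None else rng
    # pick a start sample so the whole target fits inside the background
    start = int(rng.integers(0, len(background) - len(target) + 1))
    mixed = background.astype(np.float32, copy=True)
    mixed[start:start + len(target)] += gain * target
    # keep the result inside the [-1, 1] range used for float audio
    return np.clip(mixed, -1.0, 1.0), start

# synthetic stand-ins for real clips: 2 s of low-level noise, 0.5 s tone
samprate = 5000
rng = np.random.default_rng(0)
background = 0.05 * rng.standard_normal(2 * samprate).astype(np.float32)
t = np.arange(samprate // 2) / samprate
target = 0.5 * np.sin(2 * np.pi * 400 * t).astype(np.float32)

mixed, start = overlay_clip(background, target, gain=0.8, rng=rng)
print(mixed.shape, start)
```

Returning the start offset alongside the mixed waveform makes it possible to label where the note sits in each training sample.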

Packages required:
- librosa
- soundfile
- scikit-learn
- pandas (with PyTables, for HDF5 output)
- numpy
- matplotlib


In [11]:
# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---

%matplotlib inline
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')

import pandas as pd
pd.options.display.max_columns = 100

import numpy as np

import librosa
import librosa.display

import soundfile as sf



## Read soundfile paths

I'll combine waveform data from two sources:
- background noise from the hydrophone recording, with no whale song
- clips of humpback whale song notes


In [12]:
import os
import fnmatch

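def get_pathlist(rootdir, exts):
    # search rootdir recursively for files matching the glob patterns in exts
    # exts may be a single pattern ('*.wav') or a list of patterns
    if isinstance(exts, str):
        # a bare string would otherwise be iterated character by character
        exts = [exts]
    paths = []
    for ext in exts:
        for root, dirnames, filenames in os.walk(rootdir):
            for filename in fnmatch.filter(filenames, ext):
                paths.append(os.path.join(root, filename).replace('\\','/'))
    # sort so the order (and any IDs derived from it) is repeatable
    return sorted(paths)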

                

## Create or read the intermediate target / background wave files

I create a set of intermediate wave files, all resampled to the selected sample rate and with any other desired pre-processing applied. When I train and test the model, I will read these and merge them randomly to create the train/test samples. I also prefix each target file name with a numeric ID for clarity.

Note: humpback whales produce sounds between about 80 and 4000 Hz, but the sounds with the highest amplitudes most commonly occur at 100-2000 Hz. I've chosen a sample rate of 5000 Hz, which captures frequencies up to 2500 Hz (the Nyquist limit). That covers the loudest part of the whale call while reducing the memory and processing overhead.
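
The arithmetic behind that choice can be checked in a few lines. The band edges below are the ones quoted in the note; the helper `fraction_covered` and the 44100 Hz comparison rate are hypothetical, since the original sample rate of the clips isn't stated here.

```python
samprate = 5000
nyquist = samprate / 2  # highest frequency representable at this rate

full_band = (80, 4000)   # approximate range of humpback sounds, Hz
loud_band = (100, 2000)  # range holding the highest amplitudes, Hz

def fraction_covered(band, limit):
    """Fraction of a (low, high) frequency band that lies below `limit`."""
    lo, hi = band
    return max(0.0, min(hi, limit) - lo) / (hi - lo)

full_cov = fraction_covered(full_band, nyquist)  # about 62% of 80-4000 Hz
loud_cov = fraction_covered(loud_band, nyquist)  # all of 100-2000 Hz

# hypothetical source rate, used only to illustrate the size reduction
assumed_source_rate = 44100
size_ratio = samprate / assumed_source_rate

print(f"Nyquist limit: {nyquist:.0f} Hz")
print(f"full band covered: {full_cov:.1%}, loud band covered: {loud_cov:.1%}")
print(f"samples kept relative to {assumed_source_rate} Hz: {size_ratio:.1%}")
```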


In [13]:
%%time

from sklearn.preprocessing import minmax_scale

src_target = './data/model/source/target/'
dest_target = './data/model/preprocess/target/'

src_background = './data/model/source/background/'
dest_background = './data/model/preprocess/background/'

preprocess_waves = True # True = preprocess, False = bypass 

samprate = 5000

if preprocess_waves:
    # make sure the output folders exist before writing to them
    os.makedirs(dest_target, exist_ok=True)
    os.makedirs(dest_background, exist_ok=True)

    print('preprocessing target samples...')
    # get paths to target sound clip files
    paths = get_pathlist(src_target, ['*.wav'])
    for i, path in enumerate(paths):
        # load the clip, resampling to samprate
        y, sr = librosa.load(path, sr=samprate)
        # prefix the output name with a zero-padded numeric ID, then
        # rescale to [-1, 1] and save as 16-bit PCM
        sf.write(dest_target + '%04d,'%(i) + os.path.basename(path).split('.')[0]+'.wav', 
                minmax_scale(y, feature_range=(-1, 1)), sr, subtype='PCM_16')

    print('preprocessing background samples...')
    # get paths to background noise files
    paths = get_pathlist(src_background, ['*.wav'])
    for path in paths:
        # same processing as the targets, but the file name is kept as-is
        y, sr = librosa.load(path, sr=samprate)
        sf.write(dest_background + os.path.basename(path).split('.')[0]+'.wav', 
                minmax_scale(y, feature_range=(-1, 1)), sr, subtype='PCM_16')


preprocessing target samples...
preprocessing background samples...
Wall time: 11min 8s
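
One thing to be aware of with `minmax_scale(y, feature_range=(-1, 1))` in the cell above: it maps the clip's minimum to -1 and its maximum to +1, so a waveform with unequal positive and negative peaks gets shifted as well as scaled. The sketch below reproduces that mapping for a 1-D array in plain NumPy and compares it with peak normalization, which I mention only as a possible alternative, not something this notebook does. The helper names and the toy waveform are made up for illustration.

```python
import numpy as np

def minmax_to_unit_range(y):
    """Same mapping as minmax_scale(y, feature_range=(-1, 1)) for a
    non-constant 1-D array: min -> -1, max -> +1."""
    return 2.0 * (y - y.min()) / (y.max() - y.min()) - 1.0

def peak_normalize(y):
    """Divide by the largest absolute sample, so zero stays at zero."""
    peak = np.abs(y).max()
    return y / peak if peak > 0 else y

# toy waveform with asymmetric peaks: min is -0.2, max is 0.6
y = np.array([-0.2, 0.0, 0.1, 0.6])

a = minmax_to_unit_range(y)
b = peak_normalize(y)

print('min-max :', a)   # the silent sample (0.0) is moved away from zero
print('peak    :', b)   # the silent sample stays at zero
```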


## Read the preprocessed sound files

Read the preprocessed target and background sound files and save them as dataframes in a single HDF5 file. I do this because the HDF5 file loads much faster than the individual wave files in the other notebooks where I train and test the model.

In [14]:
%%time

target_wavs = []
target_ids = []
target_names = [] 

backgnd_wavs = []
backgnd_names = []

print('reading target clips...')
paths = get_pathlist(dest_target, ['*.wav'])
for path in paths:
    target_ids.append(int(os.path.basename(path).split(',')[0]))
    target_names.append(os.path.basename(path).split('.')[0])
    target_wavs.append(librosa.load(path, sr=samprate)[0])
    
print('reading background sound...')
paths = get_pathlist(dest_background, ['*.wav'])
for path in paths:
    backgnd_wavs.append(librosa.load(path, sr=samprate)[0])
    backgnd_names.append(os.path.basename(path).split('.')[0])

print('converting to dataframe')
target_df = pd.DataFrame({'id':target_ids, 'name':target_names, 'wave':target_wavs})
background_df = pd.DataFrame({'id':[-1]*len(backgnd_wavs), 'name':backgnd_names, 'wave':backgnd_wavs})

print('saving target and background dataframes')
# mode='w' starts a fresh file so re-runs don't accumulate stale data
target_df.to_hdf('./data/model/wavdata_v1.h5', key='target', mode='w')
background_df.to_hdf('./data/model/wavdata_v1.h5', key='background')

print('done')

reading target clips...
reading background sound...
converting to dataframe
saving target and background dataframes
done
Wall time: 1min 9s
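
In the training notebooks, the two tables can then be read back with `pd.read_hdf`, using the same file path and keys as above. A minimal sketch, assuming PyTables is installed (pandas needs it for HDF5) and with `load_wavdata` as a hypothetical helper name:

```python
import os
import pandas as pd

def load_wavdata(h5_path='./data/model/wavdata_v1.h5'):
    """Return (target_df, background_df) as saved by this notebook.

    Hypothetical helper: the path and keys match the to_hdf calls above.
    """
    if not os.path.exists(h5_path):
        raise FileNotFoundError(h5_path)
    target_df = pd.read_hdf(h5_path, key='target')
    background_df = pd.read_hdf(h5_path, key='background')
    return target_df, background_df

# usage, once the HDF5 file has been written:
# target_df, background_df = load_wavdata()
# first_wave = target_df.loc[0, 'wave']   # 1-D NumPy array of samples
```

Each row's `wave` column holds the full waveform as a NumPy array, with `id` set to -1 for background rows.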
