## Whale Sound Exploration

In this tutorial we explore a dataset of short audio clips, some of which contain right whale up-calls. The dataset was shared as part of a [2013 Kaggle competition](https://www.kaggle.com/c/whale-detection-challenge). Our goal is not to present the winning detection algorithm, but to share a simple pipeline for processing oscillatory data that can potentially be applied to a wide range of time series.

Objectives:
* read and extract features from audio data
* apply dimensionality reduction techniques
* perform supervised classification
* learn how to evaluate machine learning models
* train a neural network to detect whale calls


### Data Loading and Exploration
---

In [1]:
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
# importing multiple visualization libraries
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import mlab
import pylab as pl
import seaborn

In [3]:
# importing libraries to manipulate the data files
import os
from glob import glob

In [4]:
# importing scientific python packages
import numpy as np

In [5]:
# import a library to read the .aiff format (aifc is part of the standard library up to Python 3.12)
import aifc

The `train` folder contains many `.aiff` files (2-second snippets), and the `train.csv` file contains the corresponding labels.

In [6]:
!ls whale_data

test  test.csv  train  train.csv


In [14]:
# the data folder sits next to the notebook (see the `!ls whale_data` listing above)
filenames = glob(os.path.join('whale_data','train','*.aiff'))

In [15]:
print('There are '+str(len(filenames))+' files.' )

In [17]:
# read the labels, using the first column (the file name) as the index
import pandas as pd
labels = pd.read_csv(os.path.join('whale_data','train.csv'), index_col = 0)

The labels have the following format:

In [None]:
labels.head(10)

In [None]:
# filenames which contain calls
# whale_labels = labels[labels['label'] == 1].index

In [None]:
# save a variable which only contains files with right whale calls
#X_whales = X.loc[whale_labels]
#X_whales.shape

In [None]:
#whale_labels[0]

In [None]:
#print('There are '+str(len(whale_labels))+' calls.')

Let's look at one of those files.

In [None]:
# reading the file info
#whale_sample_file = whale_labels[0] 
whale_sample_file = 'train6.aiff'
whale_aiff = aifc.open(os.path.join('whale_data','train',whale_sample_file),'r')
print("Frames:", whale_aiff.getnframes())
print("Frame rate (frames per second):", whale_aiff.getframerate())

In [None]:
# reading the data
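whale_strSig = whale_aiff.readframes(whale_aiff.getnframes())
whale_aiff.close()
# AIFF stores samples in big-endian byte order; the files are assumed to hold 16-bit samples
whale_array = np.frombuffer(whale_strSig, dtype='>i2')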
plt.plot(whale_array)
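
The decoding step depends on byte order: AIFF stores samples big-endian, while most machines are little-endian, which is why the raw frames need either a byte swap or an explicitly big-endian dtype. The sketch below illustrates this on a few made-up sample values (they are not taken from the dataset) and assumes 16-bit samples.

```python
import numpy as np

# Hypothetical samples standing in for the frames returned by readframes()
samples = np.array([0, 1, -1, 1000, -32768], dtype=np.int16)

# Serialize as big-endian 16-bit integers, the byte order AIFF uses
raw_bytes = samples.astype('>i2').tobytes()

# Option 1: declare the byte order in the dtype
decoded = np.frombuffer(raw_bytes, dtype='>i2')

# Option 2: read as little-endian, then swap the two bytes of every sample
swapped = np.frombuffer(raw_bytes, dtype='<i2').byteswap()

print(decoded.tolist())
print(swapped.tolist())
```

Both options recover the original values; the first one does not depend on the byte order of the machine running the notebook.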

In [None]:
signal = whale_array.astype('float64')

In [None]:
# playing a whale upcall in the notebook
from IPython.display import Audio
# the rate is set to 3000 because the widget does not seem to run with lower rates;
# since the spectrograms below use Fs=2000, playback is sped up by a factor of 1.5
Audio(signal, rate=3000, autoplay = True)

Working directly with the raw waveforms is hard because much of the relevant information is in their frequency content. Let's calculate a spectrogram for each signal and use it as the feature representation.

In [None]:
# a function for plotting spectrograms
def PlotSpecgram(P, freqs, bins):
    """Plot spectrogram P with time (bins) on the x-axis and frequency (freqs) on the y-axis."""
    Z = np.flipud(P) # flip rows so that low frequencies end up at the bottom of the image
    # image extent: time from 0 to the last bin, frequency from the lowest to the highest value
    extent = 0, np.amax(bins), freqs[0], freqs[-1]
    pl.imshow(Z, extent=extent, cmap='plasma')
    pl.axis('auto')
    pl.xlim([0.0, bins[-1]])
    pl.ylim([0, freqs[-1]])

In [None]:
params = {'NFFT':256, 'Fs':2000, 'noverlap':192}
P, freqs, bins = mlab.specgram(whale_array, **params)
PlotSpecgram(P, freqs, bins)
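
To get a feel for what these parameters mean, here is a minimal sketch that applies the same `NFFT`, `Fs` and `noverlap` values to a synthetic up-sweep. The signal (a sine whose frequency rises from 100 Hz to 300 Hz over 2 seconds) is a made-up stand-in for an up-call, not data from the competition.

```python
import numpy as np
from matplotlib import mlab

fs = 2000                     # sampling rate in Hz, as in params above
t = np.arange(2 * fs) / fs    # 2 seconds of samples, like the clips in the dataset

# Hypothetical up-sweep: instantaneous frequency rises linearly from f0 to f1
f0, f1 = 100.0, 300.0
phase = 2 * np.pi * (f0 * t + 0.5 * (f1 - f0) / t[-1] * t**2)
chirp = np.sin(phase)

P_chirp, freqs_chirp, bins_chirp = mlab.specgram(chirp, NFFT=256, Fs=fs, noverlap=192)

# frequency bin with the most power in each time slice
peak_freqs = freqs_chirp[P_chirp.argmax(axis=0)]

print("spectrogram shape (frequencies x time slices):", P_chirp.shape)
print("frequency resolution in Hz:", freqs_chirp[1] - freqs_chirp[0])
print("time step in seconds:", bins_chirp[1] - bins_chirp[0])
print("peak frequency in first and last slice:", peak_freqs[0], peak_freqs[-1])
```

The frequency resolution is `Fs / NFFT` and the time step is `(NFFT - noverlap) / Fs`, so a larger `NFFT` gives finer frequency detail at the cost of coarser timing.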

### Feature Extraction
---

We will go through all the files and extract the spectrograms from each of them.

In [None]:
# create a dictionary which contains all the spectrograms, labeled by the filename
spec_dict = {}

# number of lowest-frequency bins to keep; with NFFT=256 and Fs=2000 each bin
# spans about 7.8 Hz, so 60 bins cover roughly 0-460 Hz
m = 60

# loop through all the files
for filename in filenames:
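    # read the file (the context manager closes it afterwards)
    with aifc.open(filename,'r') as aiff:
        whale_strSig = aiff.readframes(aiff.getnframes())
    # decode the samples, assumed to be big-endian 16-bit integers
    whale_array = np.frombuffer(whale_strSig, dtype='>i2')
    # create the spectrogram
    P, freqs, bins = mlab.specgram(whale_array, **params)
    # keep only the first m frequency bins
    spec_dict[filename] = P[:m,:]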

# save the dimensions of the spectrogram
spec_dim = P[:m,:].shape
print(spec_dim)
    

Most machine learning algorithms in Python expect the data in the format **observations** x **features**. To get the data into this format we need to convert each two-dimensional spectrogram into one long vector. For that we will use the `ravel` function.

In [None]:
# We will put the data in a dictionary keyed by the bare file name,
# so that the keys match the index of the labels
feature_dict = {}
for key in filenames:
    # vectorize the spectrogram
    feature_dict[os.path.basename(key)] = spec_dict[key].ravel()

# convert to a pandas dataframe
X = pd.DataFrame(feature_dict).T

In [None]:
X.head(5)
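
As a small illustration of the flattening step, the sketch below applies `ravel` to a toy 2 x 3 array standing in for a spectrogram (the values are arbitrary). It shows that the rows are concatenated one after another and that the original shape can be recovered with `reshape`, which is why it is worth keeping the spectrogram dimensions around.

```python
import numpy as np

# Toy "spectrogram": 2 frequency bins x 3 time slices (arbitrary values)
toy_spec = np.arange(6).reshape(2, 3)

# Flatten row by row into a single feature vector
toy_vec = toy_spec.ravel()

# The saved dimensions let us turn a feature vector back into an image
toy_restored = toy_vec.reshape(toy_spec.shape)

print(toy_vec)
print(toy_restored)
```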

In [None]:
# we do not need these objects anymore so let's release them from memory
del feature_dict
del spec_dict

In [None]:
# let's save these variables for reuse
np.save('X.npy',X)
# select the labels in the same order as the rows of X
np.save('y.npy',np.array(labels.loc[X.index, 'label']))
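
The order of the labels matters here: the rows of `X` follow the order in which the files were read, which need not match the order of the rows in `train.csv`. Indexing the labels by `X.index` lines them up. The sketch below shows this on made-up file names and labels (the names `X_toy` and `labels_toy` are hypothetical and used only for illustration).

```python
import pandas as pd

# Hypothetical feature table whose rows are in "file reading" order
X_toy = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    index=['train3.aiff', 'train1.aiff', 'train2.aiff'],
)

# Hypothetical label table in a different order, indexed by file name
labels_toy = pd.DataFrame(
    {'label': [0, 1, 0]},
    index=['train1.aiff', 'train2.aiff', 'train3.aiff'],
)

# Select the labels row by row in the order of the feature table
y_toy = labels_toy.loc[X_toy.index, 'label'].to_numpy()

print(y_toy)
```

After this step, entry `i` of `y_toy` is the label of row `i` of `X_toy`, even though the two tables were built in different orders.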

### References:

https://www.kaggle.com/c/whale-detection-challenge

https://github.com/jaimeps/whale-sound-classification