# Model experiments - sample set

In the previous notebooks we have separated a small subset of our data, called "sample", on which we can now experiment with simple models to assess the effectiveness of our preprocessing & data augmentation techniques.

We do it this way to avoid spending too much time training on the entire set; the assumption is that methods which are effective on the sample will also work well at a larger scale.

We will start by testing a couple of simple models on untouched sample data (as numpy arrays) and then proceed towards data augmentation and finally spectrograms.

In [1]:
# first make sure we're in the parent directory of our data/sample folders.
!pwd

/home/paperspace/tensorflow_speech_recognition


## Import
We'll need a couple of additional libraries so let's import them.

In [2]:
# filter out warnings
import warnings
warnings.filterwarnings('ignore') 

In [62]:
import bcolz
import glob
import librosa
import matplotlib.pyplot as plt
import numpy as np
import os
import tensorflow

# utils
from importlib import reload
import utils; reload(utils)

# Keras as bundled with TensorFlow
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, BatchNormalization, Dropout, Convolution1D, MaxPooling1D, Flatten
from tensorflow.python.keras.layers import SimpleRNN, GRU, ConvLSTM2D
from tensorflow.python.keras.optimizers import Adam

# F1 and accuracy metrics, plus the Random Forest classifier
from sklearn.metrics import f1_score, accuracy_score
from sklearn.ensemble import RandomForestClassifier

## Prepare data
The easiest way to work with data is to turn it into a list of numbers, in our case a numpy array. We can use one of the functions from utils to load the raw data, or use the librosa.load() function. The difference is that the former returns int16s, whereas librosa returns float32s and resamples to its default sampling rate of 22050 Hz unless we explicitly tell it to keep the file's original sampling rate of 16000 Hz (by passing sr=None).

We should also consider normalizing our data (so that it all falls within the same scale) and using the preprocessing methods explored in the previous notebook (MFCCs, Mel spectrogram, fast Fourier transform and tempogram).
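As a rough illustration of the normalization mentioned above, the sketch below scales int16 samples into the [-1, 1) range. The function name `normalize_int16` is hypothetical (it is not part of utils), and dividing by 32768 is only one of several reasonable scaling choices.

```python
import numpy as np

def normalize_int16(samples):
    """Scale int16 audio samples to float32 values in [-1, 1)."""
    # 32768 is the magnitude of the most negative int16 value
    return samples.astype(np.float32) / 32768.0

raw = np.array([-32768, -11, 0, 16384, 32767], dtype=np.int16)
scaled = normalize_int16(raw)
print(scaled.dtype, scaled.min(), scaled.max())
```

Scaling like this keeps the relative shape of the waveform and only changes its range, which tends to make gradient-based training better behaved.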

In [4]:
path_to_sample = "data/sample"

We'll have to go through each of the folders in our sample/train, cv and test sets, one-hot encode their label and load the 16K long array of raw data. The y data will be of shape (m, 12), where m is the number of examples, and the X data will be of shape (m, 16000) - at least for the raw .wav input.

Let's calculate **m** first. We will do that by using a function that creates a list of all the .wav files within a directory.

### Create a list of paths
We will use the glob module that we learned about in the very first notebook and a function from utils.py which, given a directory, returns a list of paths to the .wav files within it. We will repeat the process for all 3 sets within sample, and for every category subdirectory within those too.

In [5]:
# for example we can grab all .wav files from sample/train/stop
path_to_sample_train_stop = os.path.join(path_to_sample, "train", "stop")
utils.grab_wavs(path_to_sample_train_stop)[:5]

['data/sample/train/stop/01b4757a_nohash_0.wav',
 'data/sample/train/stop/3ac2e76f_nohash_0.wav',
 'data/sample/train/stop/3e31dffe_nohash_3.wav',
 'data/sample/train/stop/37bd115d_nohash_1.wav',
 'data/sample/train/stop/6c2dd2d5_nohash_0.wav']
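The body of `utils.grab_wavs` is not shown in this notebook, so the following is only a guess at how such a helper could be written with glob. The name `grab_wavs_sketch` is hypothetical, and the sorting is an addition for reproducibility (the output above is not sorted).

```python
import glob
import os
import tempfile

def grab_wavs_sketch(directory):
    """Return a sorted list of paths to the .wav files directly inside `directory`."""
    return sorted(glob.glob(os.path.join(directory, "*.wav")))

# demo on a throwaway directory containing two .wav files and one other file
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.wav", "a.wav", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(p) for p in grab_wavs_sketch(tmp)]

print(found)
```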

In [6]:
# we'll need a list of all category folder names
categories_to_predict = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go", "silence", "unknown"]

In [7]:
# first grab the training set
path_to_train = os.path.join(path_to_sample, "train")
sample_train_wavs = []

for category in categories_to_predict:
    path_to_category = os.path.join(path_to_train, category)
    category_files = utils.grab_wavs(path_to_category)
    
    # we use extend instead of append to add all elements from the iterable
    sample_train_wavs.extend(category_files)
    
sample_train_wavs

['data/sample/train/yes/0f3f64d5_nohash_0.wav',
 'data/sample/train/yes/8a28231e_nohash_3.wav',
 'data/sample/train/yes/d3f22f0e_nohash_0.wav',
 'data/sample/train/yes/2d3c8dcb_nohash_1.wav',
 'data/sample/train/yes/d486fb84_nohash_0.wav',
 'data/sample/train/yes/61d3e51e_nohash_0.wav',
 'data/sample/train/yes/8f811bbc_nohash_0.wav',
 'data/sample/train/yes/b43de700_nohash_0.wav',
 'data/sample/train/yes/66a412a7_nohash_0.wav',
 'data/sample/train/yes/92e17cc4_nohash_0.wav',
 'data/sample/train/yes/6c9223bd_nohash_0.wav',
 'data/sample/train/yes/d5356b9a_nohash_0.wav',
 'data/sample/train/yes/e7d0eb3f_nohash_1.wav',
 'data/sample/train/yes/712e4d58_nohash_2.wav',
 'data/sample/train/yes/a2fefcb4_nohash_0.wav',
 'data/sample/train/yes/324210dd_nohash_1.wav',
 'data/sample/train/yes/742d6431_nohash_0.wav',
 'data/sample/train/yes/70a00e98_nohash_1.wav',
 'data/sample/train/yes/28e47b1a_nohash_4.wav',
 'data/sample/train/yes/7014b07e_nohash_1.wav',
 'data/sample/train/no/f0ebef1b_nohash_0

In [8]:
# repeat for cv
path_to_cv = os.path.join(path_to_sample, "cv")
sample_cv_wavs = []

for category in categories_to_predict:
    path_to_category = os.path.join(path_to_cv, category)
    category_files = utils.grab_wavs(path_to_category)
    sample_cv_wavs.extend(category_files)

# repeat for test
path_to_test = os.path.join(path_to_sample, "test")
sample_test_wavs = []

for category in categories_to_predict:
    path_to_category = os.path.join(path_to_test, category)
    category_files = utils.grab_wavs(path_to_category)
    sample_test_wavs.extend(category_files)

### One-hot encode the y

Now that we have the 3 lists of files from each set (train, cv and test) we can construct our train_y, cv_y and test_y numpy arrays. These will be matrices of size (m, 12), one-hot encoded. E.g. if a row belongs to the category "up" it will take the form of an array of zeros, where the entry at index 2 (the third from the left) will become a 1.

We will use a function from utils that takes a path to a .wav, the index at which the category name starts within it (we want to control this because we will eventually use this for the main set, not just the sample) and a list of categories to predict. For our current example, the category name in the paths belonging to "train" starts at index 18, because the prefix "data/sample/train/" is 18 characters long (separators count as one char). For paths under "cv" and "test" the index shifts to 15 and 17 respectively, because those folder names are shorter.

In [9]:
# let's grab a single path (this one is a "left")
a_wav = sample_train_wavs[80]
a_wav

'data/sample/train/left/4ec7d027_nohash_0.wav'

In [10]:
# let's see if the 1 is correctly placed
utils.one_hot_encode_path(a_wav, 18, categories_to_predict)

array([0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.])

The path belonged to the fifth category ("left") and the one-hot encoding correctly placed the 1 at index 4 (zero-indexed).
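To make the role of the start index concrete, here is a minimal sketch of how such an encoder might work. It is a hypothetical stand-in (`one_hot_encode_path_sketch`), not the actual `utils.one_hot_encode_path`, and it assumes "/" as the path separator.

```python
import numpy as np

def one_hot_encode_path_sketch(path, start_index, categories):
    """One-hot encode the category folder name that starts at `start_index` in `path`."""
    # the category name runs from start_index up to the next path separator
    category = path[start_index:].split("/")[0]
    encoded = np.zeros(len(categories))
    encoded[categories.index(category)] = 1.0
    return encoded

categories = ["yes", "no", "up", "down", "left", "right",
              "on", "off", "stop", "go", "silence", "unknown"]
encoded = one_hot_encode_path_sketch(
    "data/sample/train/left/4ec7d027_nohash_0.wav", 18, categories)
print(encoded)
```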

We want to repeat this for all examples in each of the 3 subsets, adding each new one-hot encoded numpy array as a new row of the y matrix, in order.

In [11]:
# figure out the dimensions of train_y
rows = len(sample_train_wavs)
columns = len(categories_to_predict)
dimensions = (rows, columns)
dimensions

(240, 12)

In [12]:
# create train_y as empty array
train_y = np.array([])

# append each row to train_y
for path_to_wav in sample_train_wavs:
    row = utils.one_hot_encode_path(path_to_wav, 18, categories_to_predict)
    
    # append the new row
    train_y = np.append(train_y, row)
    
# we currently have a flattened vector
print("Current shape: {}".format(*train_y.shape))

# let's reshape it
train_y = np.reshape(train_y, dimensions)
print("New shape: {}".format(train_y.shape))

Current shape: 2880
New shape: (240, 12)


In [13]:
# show the train_y matrix to confirm
train_y

array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.]])

We can see that the first three rows have the 1 at index 0, which means they belong to the category "yes", and the last three have the 1 at the last index ("unknown"). This is what we expect, because our list of paths is ordered by category.

We should bear in mind that by default the np.array contains float64s and our functions for loading a .wav return int16s.

Since this is a highly-repetitive task we'll want to use the utils function for obtaining the y.
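One design note: `np.append` returns a new copy of the array on every call, so appending inside a loop gets slower as the array grows. A possible alternative, sketched here with stand-in rows rather than real paths, is to collect the rows in a Python list and stack them once. Whether `utils.get_y` works this way is not shown in this notebook.

```python
import numpy as np

rows = []
for category_index in (0, 0, 4, 11):   # stand-ins for the category of each path
    row = np.zeros(12)
    row[category_index] = 1.0
    rows.append(row)

# a single allocation, and the result already has shape (m, 12), so no reshape is needed
y_demo = np.vstack(rows)
print(y_demo.shape)
```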

Repeat for **CV set**.

In [14]:
# figure out the dimensions
rows = len(sample_cv_wavs)
columns = len(categories_to_predict)
dimensions = (rows, columns)
print("Target dimensions: {}".format(dimensions))

# get the y
cv_y = utils.get_y(sample_cv_wavs, 15, categories_to_predict)
print("Received shape: {}".format(cv_y.shape))

Target dimensions: (60, 12)
Received shape: (60, 12)


Repeat for **Test set**.

In [15]:
# figure out the dimensions
rows = len(sample_test_wavs)
columns = len(categories_to_predict)
dimensions = (rows, columns)
print("Target dimensions: {}".format(dimensions))

# get the y
test_y = utils.get_y(sample_test_wavs, 17, categories_to_predict)
print("Received shape: {}".format(test_y.shape))

Target dimensions: (60, 12)
Received shape: (60, 12)


In [16]:
test_y[0]

array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

### Get the X
We have the y - the one-hot encoded vectors representing the category for each training, cv and test example in the sample set. We need the feature vectors, conventionally referred to as X. We will use both the simplest way of extracting the .wav data and the preprocessing techniques - MFCCs, Mel spectrogram, FFT and tempogram.

Let's start by defining a simple helper function for just the raw .wav data. Since our samples are of slightly differing lengths but each row of our X always has to have the same length, we will **add padding by default.**

In [17]:
# get the desired number of columns (n)
n = len(utils.get_wav_info(path_to_wav)[1])
n

16000

#### Raw .wav data

In [18]:
# define a simple helper function
def get_X_with_padding(list_of_paths, columns=16000):
    
    # get shape data
    rows = len(list_of_paths)
    dimensions = (rows, columns)
    
    # create placeholder
    X = np.array([])
    
    # go through every file path in the list
    for path_to_wav in list_of_paths:

        # get raw array of signed ints
        row = utils.get_wav_info(path_to_wav)[1]
        
        # some of our samples have fewer (or slightly more) than 16000 values, so let's adjust them
        # trim to fixed length
        row = row[:columns]
        
        # pad with zeros, calculating amount of padding needed
        padding = columns - len(row)
        row = np.pad(row, (0, padding), mode='constant', constant_values=0)

        # append the new row
        X = np.append(X, row)
    
    # reshape (unroll)
    X = np.reshape(X, dimensions)
    
    return X

In [19]:
# get the X for each set
train_X = utils.get_X(sample_train_wavs, n)
cv_X = utils.get_X(sample_cv_wavs, n)
test_X = utils.get_X(sample_test_wavs, n)

print("Train: ", train_X.shape)
print("CV: ", cv_X.shape)
print("Test: ",test_X.shape)

Train:  (240, 16000)
CV:  (60, 16000)
Test:  (60, 16000)


In [20]:
train_X[0][:5]

array([-11., -21., -25., -42., -33.])

#### MFCCs

We can do the same for the MFCCs. Here we can choose whether to keep only the mean value of each coefficient (1D) or the full matrix (2D). For now let's obtain both the 1D (mean) and the 2D version.
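The relationship between the two versions can be sketched with a random stand-in matrix. This assumes the mean is taken over the time frames, which is what the output shapes in the following cells, (100,) versus (100, 32) per clip, suggest.

```python
import numpy as np

rng = np.random.default_rng(0)
mfcc_2d = rng.normal(size=(100, 32))   # stand-in for one clip: (n_mfcc, frames)

# average each coefficient over the 32 frames, leaving one value per coefficient
mfcc_1d = mfcc_2d.mean(axis=1)
print(mfcc_2d.shape, "->", mfcc_1d.shape)
```

The 1D version is much smaller but discards how each coefficient changes over time, which may matter for telling short words apart.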

In [21]:
# let's start with a reasonable number of mfccs to return
n_mfcc = 100

In [22]:
train_X_mfccs_1D = utils.get_X_mfccs(sample_train_wavs, shape=(n_mfcc, 32), mean=True)
cv_X_mfccs_1D = utils.get_X_mfccs(sample_cv_wavs, shape=(n_mfcc, 32), mean=True)
test_X_mfccs_1D = utils.get_X_mfccs(sample_test_wavs, shape=(n_mfcc, 32), mean=True)

print("Train mfccs: ", train_X_mfccs_1D.shape)
print("CV mfccs: ", cv_X_mfccs_1D.shape)
print("Test mfccs: ",test_X_mfccs_1D.shape)

Train mfccs:  (240, 100)
CV mfccs:  (60, 100)
Test mfccs:  (60, 100)


In [23]:
train_X_mfccs_1D[0][:5]

array([-445.35308297,   31.80853315,   -3.24210645,   18.580648  ,
          0.94357346])

And now for the 2-dim output.

In [24]:
train_X_mfccs_2D = utils.get_X_mfccs(sample_train_wavs, shape=(n_mfcc, 32), mean=False)
cv_X_mfccs_2D = utils.get_X_mfccs(sample_cv_wavs, shape=(n_mfcc, 32), mean=False)
test_X_mfccs_2D = utils.get_X_mfccs(sample_test_wavs, shape=(n_mfcc, 32), mean=False)

print("Train mfccs: ", train_X_mfccs_2D.shape)
print("CV mfccs: ", cv_X_mfccs_2D.shape)
print("Test mfccs: ",test_X_mfccs_2D.shape)

Train mfccs:  (240, 100, 32)
CV mfccs:  (60, 100, 32)
Test mfccs:  (60, 100, 32)


In [25]:
train_X_mfccs_2D[0][0][:5]

array([-576.51036871, -570.90724179, -544.55853427, -553.88026846,
       -578.52544577])

#### Mel spectrogram

A Mel spectrogram turns each 1D vector of raw data into a 2D matrix, so our final X will be 3-dimensional.

In [26]:
# let's see the difference in dimensions
sr, raw_data = utils.get_wav_info(path_to_wav)
print("Raw data shape: {}".format(raw_data.shape))
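# keyword arguments and a float input keep this call working on newer librosa versions
x = librosa.feature.melspectrogram(y=raw_data.astype(np.float32), sr=sr)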
print("Mel spectrogram shape: {}".format(x.shape))

Raw data shape: (16000,)
Mel spectrogram shape: (128, 32)
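The (128, 32) shape can be reproduced with a little arithmetic, assuming librosa's default settings of 128 Mel bands, a hop length of 512 samples and centered frames. These defaults are not set anywhere in this notebook, so treat them as assumptions.

```python
num_samples = 16000   # one second of audio at 16 kHz
hop_length = 512      # assumed: librosa's default hop length
n_mels = 128          # assumed: librosa's default number of Mel bands

# with centered frames (assumed default) there is one frame per full hop, plus one
num_frames = 1 + num_samples // hop_length
print((n_mels, num_frames))
```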


In [27]:
# here's the function we'll use (via utils.py)
def get_X_mel_spectrogram(list_of_paths, shape=(128, 32)):

    # get shape data
    rows = len(list_of_paths)

    # create placeholder
    result = np.array([])

    # go through every file path in the list
    for path_to_wav in list_of_paths:
        
        # get raw array of signed ints
        sr, raw_data = utils.get_wav_info(path_to_wav)
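        # keyword arguments and a float input keep this call working on newer librosa versions
        mel_spectrogram = librosa.feature.melspectrogram(y=raw_data.astype(np.float32), sr=sr)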

        # some of our samples have fewer (or slightly more) values than expected,
        # so let's adjust them
        placeholder = np.array([])
        for row in mel_spectrogram:
            
            # trim to fixed length
            row = row[:shape[1]]

            # pad with zeros, calculating amount of padding needed
            padding = shape[1] - len(row)
            row = np.pad(row, (0, padding), mode='constant', constant_values=0)

            # append the new row
            placeholder = np.append(placeholder, row)
        
        # append the new unrolled matrix to the final result array
        result = np.append(result, placeholder)
    
    # reshape into a 3-dim matrix
    result = np.reshape(result, (len(list_of_paths), shape[0], shape[1]))
    
    return result

Let's obtain the Mel spectrograms for all sample sets.

In [28]:
train_X_mel_spectrogram = utils.get_X_mel_spectrogram(sample_train_wavs)
cv_X_mel_spectrogram = utils.get_X_mel_spectrogram(sample_cv_wavs)
test_X_mel_spectrogram = utils.get_X_mel_spectrogram(sample_test_wavs)

print("Train mel spectrogram: ", train_X_mel_spectrogram.shape)
print("CV mel spectrogram: ", cv_X_mel_spectrogram.shape)
print("Test mel spectrogram: ",test_X_mel_spectrogram.shape)

Train mel spectrogram:  (240, 128, 32)
CV mel spectrogram:  (60, 128, 32)
Test mel spectrogram:  (60, 128, 32)


In [29]:
# each row is a 2D matrix (hence double-indexing)
train_X_mel_spectrogram[0][0]

array([1457703.31342286,  427352.21837965,  201594.17632909,
        238226.53307341,   65491.03447233,  178705.29710062,
        412570.67348324,  378435.94871593,  258376.05295182,
        187066.96734191,  239451.9017311 ,   56142.51095658,
         42836.42147842,  139791.55001964,  102884.20436902,
        167352.41037348,  321818.54914338,  559749.40569307,
        989871.95717842,  918093.81247816, 1827327.90723131,
       1677686.13316353,  673552.71678095,  419856.38671465,
        384360.98528384,  454044.27309286,  670187.30942213,
        427213.30004477,  395041.75416788,  548740.11881148,
        294776.165019  ,  336066.61377942])

#### FFT (Fast Fourier Transform)

Let's obtain the FFT of our raw data too. For simplicity, the utils.get_X_fft() function casts the complex FFT output to numpy float64.
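How the complex values are turned into floats matters. The sketch below contrasts two common options on a toy signal: keeping the real part and taking the magnitude. Which one `utils.get_X_fft` uses is not shown here, so this is only an illustration of the trade-off.

```python
import numpy as np

signal = np.array([0.0, 1.0, 0.0, -1.0])   # toy 4-sample signal
spectrum = np.fft.fft(signal)              # complex128 values: [0, -2j, 0, 2j]

real_part = spectrum.real                  # keeps the sign, drops the imaginary part
magnitude = np.abs(spectrum)               # keeps the size of each component, drops the phase

print(real_part, magnitude)
```

For this signal the real part is zero everywhere while the magnitude is not, which shows that discarding the imaginary part can lose information.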

In [30]:
# here the shapes are the same
x = utils.extract_fft(path_to_wav)
x.shape

(16000,)

In [31]:
train_X_fft = utils.get_X_fft(sample_train_wavs)
cv_X_fft = utils.get_X_fft(sample_cv_wavs)
test_X_fft = utils.get_X_fft(sample_test_wavs)

print("Train fft: ", train_X_fft.shape)
print("CV fft: ", cv_X_fft.shape)
print("Test fft: ",test_X_fft.shape)

Train fft:  (240, 16000)
CV fft:  (60, 16000)
Test fft:  (60, 16000)


In [32]:
# no longer complex numbers
print(type(test_X_fft[0][0]))
test_X_fft[0][:5]

<class 'numpy.float64'>


array([-11519.1640625 ,   -783.21451673,  -2758.71579679,   4396.02183472,
         3284.95650186])

#### Tempogram

With the tempogram we have to do some reshaping to get a 3D matrix, just like with Mel spectrograms. We will also have to do a little bit of padding and trimming, to account for small differences in the length of the original samples.

In [33]:
# let's see the difference in dimensions
x = utils.extract_tempogram(path_to_wav)
print("Tempogram: {}".format(x.shape))

Tempogram: (384, 32)


In [34]:
train_X_tempogram = utils.get_X_tempogram(sample_train_wavs)
cv_X_tempogram = utils.get_X_tempogram(sample_cv_wavs)
test_X_tempogram = utils.get_X_tempogram(sample_test_wavs)

print("Train tempogram: ", train_X_tempogram.shape)
print("CV tempogram: ", cv_X_tempogram.shape)
print("Test tempogram: ",test_X_tempogram.shape)

Train tempogram:  (240, 384, 32)
CV tempogram:  (60, 384, 32)
Test tempogram:  (60, 384, 32)


In [35]:
# each row is a 2D matrix (hence double-indexing)
train_X_tempogram[0][5]

array([0.1764312 , 0.17643258, 0.17643404, 0.17643558, 0.17643721,
       0.17643893, 0.17644072, 0.1764426 , 0.17644456, 0.17644661,
       0.17644875, 0.17645096, 0.17645327, 0.17645565, 0.17645813,
       0.17646069, 0.17646334, 0.17646607, 0.1764689 , 0.17647181,
       0.17647481, 0.1764779 , 0.17648109, 0.17648437, 0.17648774,
       0.1764912 , 0.17649476, 0.17649842, 0.17650217, 0.17650602,
       0.17650997, 0.17651403])

## Persist the preprocessed X and y
It's good practice to persist preprocessed datasets so that we don't have to recalculate all of the preprocessing (which can be time-consuming on large datasets).

A great library for this purpose is the bcolz library (for binary columns).

In [5]:
# define the bcolz array saving functions
def bcolz_save(fname, arr):
    c = bcolz.carray(arr, rootdir=fname, mode='w')
    c.flush()

def bcolz_load(fname):
    return bcolz.open(fname)[:]

In [6]:
!pwd

/home/paperspace/tensorflow_speech_recognition


In [7]:
path_to_sample_preprocessed = os.path.join(path_to_sample, "preprocessed")
path_to_sample_preprocessed

'data/sample/preprocessed'

In [39]:
# create the directory if it's not there already
# !mkdir $path_to_sample_preprocessed

#### Persist the y

In [40]:
# save the y
bcolz_save(path_to_sample_preprocessed + os.path.sep + "train_y" + ".bc", train_y)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "cv_y" + ".bc", cv_y)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "test_y" + ".bc", test_y)

#### Persist the X

In [41]:
# save the X
# raw data
bcolz_save(path_to_sample_preprocessed + os.path.sep + "train_X" + ".bc", train_X)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "cv_X" + ".bc", cv_X)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "test_X" + ".bc", test_X)

In [42]:
# MFCCs (1dim and 2dim)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "train_X_mfccs_1D" + ".bc", train_X_mfccs_1D)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "cv_X_mfccs_1D" + ".bc", cv_X_mfccs_1D)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "test_X_mfccs_1D" + ".bc", test_X_mfccs_1D)

bcolz_save(path_to_sample_preprocessed + os.path.sep + "train_X_mfccs_2D" + ".bc", train_X_mfccs_2D)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "cv_X_mfccs_2D" + ".bc", cv_X_mfccs_2D)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "test_X_mfccs_2D" + ".bc", test_X_mfccs_2D)

In [43]:
# Mel spectrogram
bcolz_save(path_to_sample_preprocessed + os.path.sep + "train_X_mel_spectrogram" + ".bc", train_X_mel_spectrogram)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "cv_X_mel_spectrogram" + ".bc", cv_X_mel_spectrogram)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "test_X_mel_spectrogram" + ".bc", test_X_mel_spectrogram)

In [44]:
# FFT
bcolz_save(path_to_sample_preprocessed + os.path.sep + "train_X_fft" + ".bc", train_X_fft)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "cv_X_fft" + ".bc", cv_X_fft)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "test_X_fft" + ".bc", test_X_fft)

In [45]:
# Tempogram
bcolz_save(path_to_sample_preprocessed + os.path.sep + "train_X_tempogram" + ".bc", train_X_tempogram)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "cv_X_tempogram" + ".bc", cv_X_tempogram)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "test_X_tempogram" + ".bc", test_X_tempogram)
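The path concatenation repeated in the cells above could be factored into a small helper. The sketch below is one way to do it; `preprocessed_path` is a hypothetical name, not an existing function in utils.

```python
import os

path_to_sample_preprocessed = os.path.join("data", "sample", "preprocessed")

def preprocessed_path(name, root=path_to_sample_preprocessed):
    """Build the .bc directory path for a named array, e.g. 'train_y'."""
    return os.path.join(root, name + ".bc")

names = ["train_y", "cv_y", "test_y"]
paths = [preprocessed_path(name) for name in names]
print(paths)
```

With a helper like this, each save call would shrink to something like `bcolz_save(preprocessed_path("train_y"), train_y)`, and the same helper could be reused when loading.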

## Reload the preprocessed X and y
In order not to have to re-run the entire notebook to obtain the preprocessed X and the corresponding y matrices, let's reload them and then proceed to train simple models.

If you're reloading the X & y after restarting the notebook you will also have to run the cells that define the bcolz functions and the path names.

#### Reload the y

In [8]:
# load the y
train_y = bcolz_load(path_to_sample_preprocessed + os.path.sep + "train_y" + ".bc")
cv_y = bcolz_load(path_to_sample_preprocessed + os.path.sep + "cv_y" + ".bc")
test_y = bcolz_load(path_to_sample_preprocessed + os.path.sep + "test_y" + ".bc")

In [9]:
train_y.shape

(240, 12)

#### Reload the X

In [10]:
# load the X
# raw data
train_X = bcolz_load(path_to_sample_preprocessed + os.path.sep + "train_X" + ".bc")
cv_X = bcolz_load(path_to_sample_preprocessed + os.path.sep + "cv_X" + ".bc")
test_X = bcolz_load(path_to_sample_preprocessed + os.path.sep + "test_X" + ".bc")
train_X.shape

(240, 16000)

In [11]:
# MFCCs (1D and 2D)
train_X_mfccs_1D = bcolz_load(path_to_sample_preprocessed + os.path.sep + "train_X_mfccs_1D" + ".bc")
cv_X_mfccs_1D = bcolz_load(path_to_sample_preprocessed + os.path.sep + "cv_X_mfccs_1D" + ".bc")
test_X_mfccs_1D = bcolz_load(path_to_sample_preprocessed + os.path.sep + "test_X_mfccs_1D" + ".bc")
print(train_X_mfccs_1D.shape)

train_X_mfccs_2D = bcolz_load(path_to_sample_preprocessed + os.path.sep + "train_X_mfccs_2D" + ".bc")
cv_X_mfccs_2D = bcolz_load(path_to_sample_preprocessed + os.path.sep + "cv_X_mfccs_2D" + ".bc")
test_X_mfccs_2D = bcolz_load(path_to_sample_preprocessed + os.path.sep + "test_X_mfccs_2D" + ".bc")
print(train_X_mfccs_2D.shape)

(240, 100)
(240, 100, 32)


In [12]:
# Mel spectrogram
train_X_mel_spectrogram = bcolz_load(path_to_sample_preprocessed + os.path.sep + "train_X_mel_spectrogram" + ".bc")
cv_X_mel_spectrogram = bcolz_load(path_to_sample_preprocessed + os.path.sep + "cv_X_mel_spectrogram" + ".bc")
test_X_mel_spectrogram = bcolz_load(path_to_sample_preprocessed + os.path.sep + "test_X_mel_spectrogram" + ".bc")
train_X_mel_spectrogram.shape

(240, 128, 32)

In [13]:
# FFT
train_X_fft = bcolz_load(path_to_sample_preprocessed + os.path.sep + "train_X_fft" + ".bc")
cv_X_fft = bcolz_load(path_to_sample_preprocessed + os.path.sep + "cv_X_fft" + ".bc")
test_X_fft = bcolz_load(path_to_sample_preprocessed + os.path.sep + "test_X_fft" + ".bc")
train_X_fft.shape

(240, 16000)

In [14]:
# Tempogram
train_X_tempogram = bcolz_load(path_to_sample_preprocessed + os.path.sep + "train_X_tempogram" + ".bc")
cv_X_tempogram = bcolz_load(path_to_sample_preprocessed + os.path.sep + "cv_X_tempogram" + ".bc")
test_X_tempogram = bcolz_load(path_to_sample_preprocessed + os.path.sep + "test_X_tempogram" + ".bc")
train_X_tempogram.shape

(240, 384, 32)

## Train simple models
We will start by training the simplest models and then try out more and more complex architectures, aiming for the highest possible accuracy and F1 score.

The simplest model we can try is a linear model, which we can obtain by using the Keras Dense layer followed by an activation function such as softmax (as in our case categories are mutually exclusive).

Since we have 12 mutually exclusive categories, random guessing gives an accuracy of 1/12 ≈ 0.0833 (8.33%), so we need an **accuracy of more than 0.0833** to beat it.

#### Linear Model
We'll need to keep track of the dimensions that we pass into our models, so let's assign their values to separate variables.

In [18]:
# we'll need the number of input features and the number of output categories
num_features = train_X.shape[1]
num_categories = train_y.shape[1]
print("Input features: {}\nCategories to predict: {}".format(num_features, num_categories))

Input features: 16000
Categories to predict: 12


In [48]:
# design & compile the model
linear_model = Sequential([
    Dense(input_shape=(num_features,), units = num_categories, activation="softmax")
])

# we choose the Adam optimizer with a specific learning rate
linear_model.compile(Adam(lr=0.001),loss="categorical_crossentropy", metrics=["accuracy"])

In [49]:
# let's evaluate our loss before fitting the model
initial_score = linear_model.evaluate(test_X, test_y, verbose=0)
categorical_crossentropy = initial_score[0]
accuracy = initial_score[1]

print("Based on random weights initialization (values will change every time you compile the model)\nCategorical crossentropy (loss): {:.4f}\nAccuracy: {:.2f}".format(categorical_crossentropy, accuracy))

Based on random weights initialization (values will change every time you compile the model)
Categorical crossentropy (loss): 15.0436
Accuracy: 0.07
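A loss of about 15 is far above ln(12) ≈ 2.48, the value that a uniform guess over 12 categories would give. One plausible reading, not verified against this model, is that the unscaled int16-range inputs produce very large logits, so the softmax saturates to hard 0/1 predictions. The sketch below illustrates this with hypothetical logits. It also shows that if 56 of 60 saturated predictions are wrong and probabilities are clipped at 1e-7 (assumed to be the Keras default epsilon), the loss comes out at roughly the value printed above.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# hypothetical logits with the large magnitudes that unscaled inputs can produce
large_logits = np.array([1500.0, -700.0, 220.0])
probs_large = softmax(large_logits)             # saturates to a hard 0/1 vector
probs_small = softmax(large_logits / 32768.0)   # rescaled logits stay close to uniform

# mean loss if 56 of 60 saturated predictions are wrong and probabilities
# are clipped to [eps, 1 - eps] (eps = 1e-7 is an assumption)
eps = 1e-7
wrong, total = 56, 60
loss = -(wrong * np.log(eps) + (total - wrong) * np.log(1 - eps)) / total

print(probs_large, probs_small.round(3), round(loss, 4))
```

If this reading is right, normalizing the inputs before training should bring the initial loss much closer to ln(12).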


Let's fit our simple linear model for a couple of epochs and see the **F1 score** and **accuracy**.

In [50]:
# we pass our training data and our cross-validation data to check whether we're overfitting
history = linear_model.fit(train_X, train_y, batch_size=32, epochs=5, validation_data=(cv_X, cv_y))

Train on 240 samples, validate on 60 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [51]:
# show latest results
best_training_accuracy = max(history.history["acc"])
best_validation_accuracy = max(history.history["val_acc"])
print("Best scores\nTrain acc: {:.4f}\nCV acc: {:.4f}".format(best_training_accuracy, best_validation_accuracy))

Best scores
Train acc: 0.0833
CV acc: 0.0667


Depending on the random initialization of weights we should have an **accuracy** score between 0.05 and 0.15 on both the training and cross-validation set. Let's also calculate the **F1 score**.

In [52]:
# first use the model to predict the labels
pred_cv_y = linear_model.predict(cv_X, batch_size=32)

In [54]:
# check if shape matches expectation (number of examples, number of categories to predict)
pred_cv_y.shape

(60, 12)

In [61]:
# softmax returns one probability per category, so rows are not guaranteed to be exactly one-hot (all zeroes and a single 1)
pred_cv_y[:10]

array([[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

So before we pass our predictions to sklearn's F1 score function, we need to make sure that all of our rows are actually one-hot encoded.

In [62]:
pred_cv_y = utils.one_hot_encode(pred_cv_y)
pred_cv_y[:10]

array([[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
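A common way to turn softmax probabilities into strict one-hot rows is to keep only the largest entry in each row. The sketch below shows that approach; `one_hot_from_probabilities` is a hypothetical name, and `utils.one_hot_encode` may be implemented differently.

```python
import numpy as np

def one_hot_from_probabilities(probs):
    """Set the highest-probability entry of each row to 1 and the rest to 0."""
    encoded = np.zeros_like(probs)
    encoded[np.arange(len(probs)), probs.argmax(axis=1)] = 1.0
    return encoded

probs = np.array([[0.1, 0.7, 0.2],
                  [0.5, 0.3, 0.2]])
hard = one_hot_from_probabilities(probs)
print(hard)
```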

In [63]:
# we can also use sklearn directly to get accuracy
sk_cv_accuracy = accuracy_score(cv_y, pred_cv_y)
print("Final linear model CV accuracy via sklearn: {:.4f}".format(sk_cv_accuracy))

Final linear model CV accuracy via sklearn: 0.0667


In [64]:
# because we're dealing with a multiclass classification challenge, we need to change the default value of average
# (which is "binary")
cv_f1_score = f1_score(cv_y, pred_cv_y, average="weighted")
print("Linear model f1 score (CV): {:.4f}".format(cv_f1_score))

Linear model f1 score (CV): 0.0513
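To show what the "weighted" average means, here is a small hand-rolled version on toy integer labels. It computes F1 per class and weights each class by its number of true examples. This is a sketch for intuition (`weighted_f1` is a hypothetical helper), not a replacement for sklearn's function.

```python
import numpy as np

def weighted_f1(y_true, y_pred, num_classes):
    """Per-class F1 averaged with weights proportional to each class's support."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = 0.0
    for c in range(num_classes):
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        # F1 = 2TP / (2TP + FP + FN); treated as 0 when the class is never seen or predicted
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * np.sum(y_true == c)
    return total / len(y_true)

score = weighted_f1([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
print(round(score, 4))
```

Here the per-class F1 scores are 0.5, 0.8 and 2/3, and since every class has two examples the weighted average is simply their mean.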


In summary, our accuracy and F1 score for the simplest possible model fall within 0.05 - 0.15. This is our earliest benchmark to beat, and it's **not much better than random guessing**, which given 12 categories would give us an accuracy of 0.08333.

#### Random Forest
It is also useful to try other ML methods before jumping into neural networks and deep learning. Random Forests are a simple but very often quite effective (and computationally inexpensive) method of obtaining a good benchmark.

For the sklearn implementation of Random Forest we actually do not want our target to be one-hot encoded.
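Reversing a one-hot encoding usually amounts to taking the position of the 1 in each row. The sketch below uses argmax and yields zero-based class indices; `utils.reverse_one_hot_encoding` may number the classes differently.

```python
import numpy as np

y_one_hot = np.array([[1., 0., 0.],
                      [0., 0., 1.],
                      [0., 1., 0.]])

# position of the 1 in each row, as zero-based class indices
labels = y_one_hot.argmax(axis=1)
print(labels)
```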

In [65]:
# reverse the one-hot encoding
rf_train_y = utils.reverse_one_hot_encoding(train_y)
rf_cv_y = utils.reverse_one_hot_encoding(cv_y)
rf_test_y = utils.reverse_one_hot_encoding(test_y)

In [66]:
rand_forest = RandomForestClassifier(max_depth=20, random_state=0)
rand_forest.fit(train_X, rf_train_y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=20, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [67]:
rf_predicted_cv_y = rand_forest.predict(cv_X)
rf_predicted_cv_y

array([ 4.,  1.,  6.,  8.,  1.,  3.,  4.,  6.,  4.,  1.,  7., 11., 12.,
        3.,  1.,  3.,  9.,  2.,  3.,  8.,  9.,  4.,  1.,  4.,  1.,  5.,
        2.,  8.,  5.,  6.,  8.,  3.,  3.,  6.,  1.,  1.,  2.,  8.,  1.,
        5.,  6.,  3.,  7.,  1.,  6., 10.,  5.,  2.,  9.,  6., 11.,  4.,
       11.,  5.,  6.,  2.,  2.,  9.,  4.,  5.])

In [68]:
# calculate accuracy and F1 for Random Forest
rf_cv_f1_score = f1_score(rf_cv_y, rf_predicted_cv_y, average="weighted")
rf_cv_accuracy = accuracy_score(rf_cv_y, rf_predicted_cv_y)

print("Random forest f1 score (CV): {:.3f}".format(rf_cv_f1_score))
print("Random forest accuracy (CV): {:.3f}".format(rf_cv_accuracy))

Random forest f1 score (CV): 0.135
Random forest accuracy (CV): 0.133


For the Random Forest method, using only default parameters (except for max depth), we are getting an **F1 score and accuracy around 0.10 - 0.15**.<br/> Slightly better than random, nowhere near good enough.

In [69]:
# set benchmark
best_cv_acc = 0.15

## Train Neural Networks
Now that we have a benchmark obtained via simple linear and Random Forest models we can proceed towards trying to outdo it with MLPs and deep learning models.

#### MLP - multi-layer perceptron
Let's start with the simplest possible neural network of just 2 dense layers. We'll be working only on the mfccs data from now on, as it tends to produce better results. We will also add **batch normalization** and **dropout** to reduce overfitting.
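In Keras, the argument of Dropout is the fraction of units to drop, so a rate of 0.96 leaves only about 4% of the 2000 hidden units (roughly 80) active on each training step. The sketch below imitates this with plain numpy using the usual "inverted dropout" scaling. It is an illustration of the idea, not Keras's internal code.

```python
import numpy as np

def inverted_dropout(activations, rate, rng):
    """Zero a `rate` fraction of units and rescale the survivors by 1 / (1 - rate)."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones(100_000)
dropped = inverted_dropout(activations, rate=0.96, rng=rng)

kept_fraction = np.mean(dropped != 0)
print(round(kept_fraction, 3), round(dropped.mean(), 2))
```

The rescaling keeps the expected activation unchanged, but with so few units kept the training signal becomes very noisy, which is worth bearing in mind when reading the results.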

In [91]:
# design & compile the model
num_nodes = 2000
mlp = Sequential([
    Dense(input_shape=(num_features,), units = num_nodes, activation="relu"),
    BatchNormalization(),
    Dropout(0.96),
    Dense(num_categories, activation='softmax')
])

# we choose the Adam optimizer with a specific learning rate
mlp.compile(Adam(lr=0.001),loss="categorical_crossentropy", metrics=["accuracy"])

In [92]:
mlp_results = mlp.fit(train_X, train_y, batch_size=32, epochs=10, validation_data=(cv_X, cv_y))

Train on 240 samples, validate on 60 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [93]:
# show latest results
best_training_accuracy = max(mlp_results.history["acc"])
best_validation_accuracy = max(mlp_results.history["val_acc"])
print("Best MLP scores\nTrain acc: {:.4f}\nCV acc: {:.4f}".format(best_training_accuracy, best_validation_accuracy))

Best MLP scores
Train acc: 0.3000
CV acc: 0.1500


In [94]:
# predict and one-hot encode
mlp_pred_cv_y = mlp.predict(cv_X, batch_size=32)
mlp_pred_cv_y = utils.one_hot_encode(mlp_pred_cv_y)
mlp_pred_cv_y.shape

(60, 12)
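
The body of `utils.one_hot_encode` isn't shown in this notebook. A plausible reading, and it is only an assumption, is that it turns each row of softmax probabilities into a one-hot row by marking the most likely class. A minimal sketch of that idea, using the hypothetical name `one_hot_from_probabilities`:

```python
import numpy as np

def one_hot_from_probabilities(probs):
    """Hypothetical stand-in for utils.one_hot_encode: argmax per row -> one-hot."""
    probs = np.asarray(probs)
    encoded = np.zeros_like(probs, dtype=int)
    encoded[np.arange(probs.shape[0]), probs.argmax(axis=1)] = 1
    return encoded

# toy probabilities for 2 samples and 3 classes
toy_probs = np.array([[0.1, 0.7, 0.2],
                      [0.5, 0.3, 0.2]])
toy_one_hot = one_hot_from_probabilities(toy_probs)
print(toy_one_hot)
```

One-hot predictions have the same shape as `cv_y`, which is what lets us pass both straight into sklearn's metrics.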

In [95]:
# we can also use sklearn directly to get accuracy
mlp_cv_accuracy = accuracy_score(cv_y, mlp_pred_cv_y)
mlp_cv_f1_score = f1_score(cv_y, mlp_pred_cv_y, average="weighted")
print("MLP accuracy via sklearn (CV): {:.4f}".format(mlp_cv_accuracy))
print("MLP f1 score (CV): {:.4f}".format(mlp_cv_f1_score))

MLP accuracy via sklearn (CV): 0.1500
MLP f1 score (CV): 0.1499


We can see that a simple MLP reaches a CV accuracy of 0.15, the same as our previous benchmark. Both this model and the previous ones can be tuned to reach approximately 0.25, but let's save fine-tuning for when we have a more promising approach. We are also already overfitting: the best training accuracy (0.30) is twice the CV accuracy.

#### Deep Neural Networks
Let's try adding more layers to capture more complex interactions.

In [103]:
dnn = Sequential([
    Dense(input_shape=(num_features,), units = 4000, activation="relu"),
    BatchNormalization(),
    Dropout(0.8),
    Dense(3000, activation="relu"),
    BatchNormalization(),
    Dropout(0.85),
    Dense(2000, activation="relu"),
    BatchNormalization(),
    Dropout(0.9),
    Dense(num_categories, activation='softmax')
])

# we choose the Adam optimizer with a specific learning rate
dnn.compile(Adam(lr=0.001),loss="categorical_crossentropy", metrics=["accuracy"])

In [104]:
dnn_results = dnn.fit(train_X, train_y, batch_size=64, epochs=10, validation_data=(cv_X, cv_y))

Train on 240 samples, validate on 60 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [105]:
# show latest results
best_training_accuracy = max(dnn_results.history["acc"])
best_validation_accuracy = max(dnn_results.history["val_acc"])
print("Best DNN scores\nTrain acc: {:.4f}\nCV acc: {:.4f}".format(best_training_accuracy, best_validation_accuracy))

Best DNN scores
Train acc: 0.1500
CV acc: 0.1500


In [106]:
# predict and one-hot encode
dnn_pred_cv_y = dnn.predict(cv_X, batch_size=32)
dnn_pred_cv_y = utils.one_hot_encode(dnn_pred_cv_y)
dnn_pred_cv_y.shape

(60, 12)

In [107]:
# we can also use sklearn directly to get accuracy
dnn_cv_accuracy = accuracy_score(cv_y, dnn_pred_cv_y)
dnn_cv_f1_score = f1_score(cv_y, dnn_pred_cv_y, average="weighted")
print("DNN accuracy via sklearn (CV): {:.4f}".format(dnn_cv_accuracy))
print("DNN f1 score (CV): {:.4f}".format(dnn_cv_f1_score))

DNN accuracy via sklearn (CV): 0.1500
DNN f1 score (CV): 0.1133


### Convolutional Models
It seems we're stuck at around 0.15 accuracy. That makes sense: the actual "no" and other words may appear anywhere in the vector, yet a dense layer ties each weight to a specific index, so whatever it learns is bound to fixed positions. Let's try convolutional layers, which can find a pattern regardless of whether it appears at the start or the end of the file.

We will also move towards using our preprocessed data as convolutions work better with data that conveys dimensionality, beginning with mean MFCCs.
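
The position argument can be illustrated with plain numpy. This is a toy sketch under our own assumptions: the `pattern`, the offsets and the global max are made up for illustration, and the global max is a simplification of the local `MaxPooling1D` layers used in the models here.

```python
import numpy as np

pattern = np.array([1.0, -2.0, 3.0, -1.0])  # toy "word" signature

def signal_with_pattern(offset, length=50):
    """Silence everywhere except for the pattern placed at `offset`."""
    signal = np.zeros(length)
    signal[offset:offset + len(pattern)] = pattern
    return signal

early = signal_with_pattern(offset=3)
late = signal_with_pattern(offset=40)

# a dense unit has one weight per index, so its response depends on position
dense_weights = np.zeros(50)
dense_weights[3:7] = pattern
dense_early = dense_weights @ early
dense_late = dense_weights @ late

# a convolutional filter slides the same weights over every position;
# taking the max keeps the strongest response wherever it occurred
conv_early = np.correlate(early, pattern, mode="valid").max()
conv_late = np.correlate(late, pattern, mode="valid").max()

print("dense:", dense_early, dense_late)
print("conv: ", conv_early, conv_late)
```

The dense unit only fires when the pattern sits where its weights expect it, while the filter responds equally strongly in both positions.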

In [15]:
# In order to use convolutions we have to reshape our X -> expand it to 3 dimensions
conv_train_X_mfccs_1D = np.expand_dims(train_X_mfccs_1D, axis=2)
conv_train_X_mfccs_1D.shape

(240, 100, 1)

In [16]:
# repeat for cv & test
conv_cv_X_mfccs_1D = np.expand_dims(cv_X_mfccs_1D, axis=2)
conv_test_X_mfccs_1D = np.expand_dims(test_X_mfccs_1D, axis=2)

In [169]:
cnn1 = Sequential([
        Convolution1D(input_shape=(conv_train_X_mfccs_1D.shape[1], 1), kernel_size=64, filters=64, padding="same", activation="relu"),
        Dropout(0.12),
        MaxPooling1D(),
        Convolution1D(kernel_size=64, filters=64, padding="same", activation="relu"),
        Dropout(0.12),
        MaxPooling1D(),
        Flatten(),
        Dense(2000, activation="relu"),
        Dropout(.7),
        Dense(num_categories, activation="softmax")
    ])

cnn1.compile(Adam(lr=0.0001),loss="categorical_crossentropy", metrics=["accuracy"])

In [170]:
cnn1_results = cnn1.fit(conv_train_X_mfccs_1D, train_y, batch_size=64, epochs=80, 
                        validation_data=(conv_cv_X_mfccs_1D, cv_y))

Train on 240 samples, validate on 60 samples
Epoch 1/80
Epoch 2/80
Epoch 3/80
Epoch 4/80
Epoch 5/80
Epoch 6/80
Epoch 7/80
Epoch 8/80
Epoch 9/80
Epoch 10/80
Epoch 11/80
Epoch 12/80
Epoch 13/80
Epoch 14/80
Epoch 15/80
Epoch 16/80
Epoch 17/80
Epoch 18/80
Epoch 19/80
Epoch 20/80
Epoch 21/80
Epoch 22/80
Epoch 23/80
Epoch 24/80
Epoch 25/80
Epoch 26/80
Epoch 27/80
Epoch 28/80
Epoch 29/80
Epoch 30/80
Epoch 31/80
Epoch 32/80
Epoch 33/80
Epoch 34/80
Epoch 35/80
Epoch 36/80
Epoch 37/80
Epoch 38/80
Epoch 39/80
Epoch 40/80
Epoch 41/80
Epoch 42/80
Epoch 43/80
Epoch 44/80
Epoch 45/80
Epoch 46/80
Epoch 47/80
Epoch 48/80
Epoch 49/80
Epoch 50/80
Epoch 51/80
Epoch 52/80
Epoch 53/80
Epoch 54/80
Epoch 55/80
Epoch 56/80
Epoch 57/80
Epoch 58/80
Epoch 59/80
Epoch 60/80
Epoch 61/80
Epoch 62/80
Epoch 63/80
Epoch 64/80
Epoch 65/80
Epoch 66/80
Epoch 67/80
Epoch 68/80
Epoch 69/80
Epoch 70/80
Epoch 71/80
Epoch 72/80
Epoch 73/80
Epoch 74/80
Epoch 75/80
Epoch 76/80
Epoch 77/80
Epoch 78/80
Epoch 79/80
Epoch 80/80


In [171]:
# show best results
best_training_accuracy = max(cnn1_results.history["acc"])
best_validation_accuracy = max(cnn1_results.history["val_acc"])
print("Best CNN 1 scores\nTrain acc: {:.4f}\nCV acc: {:.4f}".format(best_training_accuracy, best_validation_accuracy))

Best CNN 1 scores
Train acc: 0.3500
CV acc: 0.3000


This CNN architecture should get to 0.3 accuracy within 50-70 epochs and then start to overfit.
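
To see where overfitting starts, we can look for the epoch with the best validation accuracy rather than only the maximum value. The sketch below uses a made-up `toy_history` dictionary shaped like `cnn1_results.history`; its numbers are illustrative and not taken from the run above.

```python
import numpy as np

# illustrative values only, not taken from the run above
toy_history = {
    "acc":     [0.10, 0.18, 0.25, 0.33, 0.45, 0.60],
    "val_acc": [0.08, 0.15, 0.22, 0.30, 0.27, 0.23],
}

# epochs are 1-indexed in the training log
best_epoch = int(np.argmax(toy_history["val_acc"])) + 1
best_val_acc = toy_history["val_acc"][best_epoch - 1]
final_val_acc = toy_history["val_acc"][-1]
gap_at_end = toy_history["acc"][-1] - final_val_acc

print("best epoch: {}, best val acc: {:.2f}".format(best_epoch, best_val_acc))
print("final val acc: {:.2f}, train/CV gap at end: {:.2f}".format(final_val_acc, gap_at_end))
```

Keep in mind that `predict` uses the weights from the last epoch, not the best one. That is one plausible reason why an accuracy computed afterwards can come out lower than the best value in the history.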

In [173]:
# predict and one-hot encode
cnn1_pred_cv_y_mfccs_1D = cnn1.predict(conv_cv_X_mfccs_1D, batch_size=32)
cnn1_pred_cv_y_mfccs_1D = utils.one_hot_encode(cnn1_pred_cv_y_mfccs_1D)
cnn1_pred_cv_y_mfccs_1D.shape

(60, 12)

In [174]:
# we can also use sklearn directly to get accuracy
cnn1_cv_accuracy = accuracy_score(cv_y, cnn1_pred_cv_y_mfccs_1D)
cnn1_cv_f1_score = f1_score(cv_y, cnn1_pred_cv_y_mfccs_1D, average="weighted")
print("CNN 1 accuracy via sklearn (CV): {:.4f}".format(cnn1_cv_accuracy))
print("CNN 1 f1 score (CV): {:.4f}".format(cnn1_cv_f1_score))

CNN 1 accuracy via sklearn (CV): 0.2167
CNN 1 f1 score (CV): 0.2299


#### Convolutional models (1D) with FFT
We have another form of preprocessing that results in a 1D vector - the Fast Fourier Transform. Let's see how our convolutional model might perform in that area.

In [47]:
# In order to use convolutions we have to reshape our X -> expand it to 3 dimensions
conv_train_X_fft = np.expand_dims(train_X_fft, axis=2)
conv_train_X_fft.shape

(240, 16000, 1)

Note that the FFT results in a 16K column vector - which will also require a lot more computational resources to process.
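
How `train_X_fft` was built isn't shown in this notebook, so the following is only a sketch under the assumption that it holds one FFT value per input sample. For a real-valued signal the magnitude spectrum is mirrored, so `np.fft.rfft` carries the same magnitude information in roughly half the columns. The 440 Hz test tone is our own toy input.

```python
import numpy as np

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440 * t)         # one second of a 440 Hz tone

full_spectrum = np.abs(np.fft.fft(signal))   # one magnitude per input sample
half_spectrum = np.abs(np.fft.rfft(signal))  # non-negative frequencies only

# the magnitudes of a real signal are mirrored around the Nyquist bin
mirrored = np.allclose(full_spectrum[1:8000], full_spectrum[:8000:-1])

print(full_spectrum.shape, half_spectrum.shape, mirrored)
print("strongest bin:", int(np.argmax(half_spectrum)))
```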

In [48]:
# repeat for cv & test
conv_cv_X_fft = np.expand_dims(cv_X_fft, axis=2)
conv_test_X_fft = np.expand_dims(test_X_fft, axis=2)

In [52]:
cnn_fft = Sequential([
    Convolution1D(input_shape=(conv_train_X_fft.shape[1], 1), kernel_size=64, filters=64, padding="same", activation="relu"),
    Dropout(0.12),
    MaxPooling1D(),
    Convolution1D(kernel_size=64, filters=64, padding="same", activation="relu"),
    Dropout(0.12),
    MaxPooling1D(),
    Flatten(),
    Dense(2000, activation="relu"),
    Dropout(.7),
    Dense(num_categories, activation="softmax")
])

cnn_fft.compile(Adam(lr=0.001),loss="categorical_crossentropy", metrics=["accuracy"])

In [53]:
cnn_fft_results = cnn_fft.fit(conv_train_X_fft, train_y, batch_size=64, epochs=20, 
                        validation_data=(conv_cv_X_fft, cv_y))

Train on 240 samples, validate on 60 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


As we can see, the length of the 1D vector significantly increases the training time per epoch, and we don't seem to be converging on better predictions. We can't even fit the training set well with this fairly basic stack of layers, most probably because our kernel sizes are too small for a vector of this size. Let's leave this approach for now. An important practical aspect of ML: human time is ultimately the most valuable resource.
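
The kernel-size hypothesis can be checked with a rough receptive-field calculation. This is a sketch, assuming the usual recurrence for stacked 1D layers and that `MaxPooling1D()` uses its defaults (pool size 2, stride 2); the helper name `receptive_field` is ours.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, first layer first."""
    field, jump = 1, 1
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump  # each layer widens the view
        jump *= stride                     # strides spread later kernels out
    return field

# Convolution1D(64) -> MaxPooling1D() -> Convolution1D(64) -> MaxPooling1D()
stack = [(64, 1), (2, 2), (64, 1), (2, 2)]
field = receptive_field(stack)

fft_coverage = field / 16000           # FFT input length
mfcc_coverage = min(field / 100, 1.0)  # mean-MFCC input length

print("receptive field: {} steps".format(field))
print("share of FFT vector seen:  {:.1%}".format(fft_coverage))
print("share of MFCC vector seen: {:.1%}".format(mfcc_coverage))
```

Each convolutional feature sees only a small slice of the FFT vector, whereas the same stack spans the whole mean-MFCC vector. The dense layer after `Flatten` still sees everything, so this describes only the convolutional features, but it is consistent with the kernels being too small for this input.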

#### Convolutional models (2D) with MFCCs
Let's try a similar model on the _2D MFCC data._

In [178]:
# Our data is already 2D, we don't need to expand dimensions
train_X_mfccs_2D.shape

(240, 100, 32)
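
Although the data is 2D, a 1D convolution such as Keras' `Convolution1D` slides only along the first axis (length 100) and treats the 32 values at each step as input channels. The sketch below works out the weight count and output shape for such a first layer with `kernel_size=12` and `filters=128`. The helper names are ours, and the arithmetic assumes stride 1 with `padding="same"`.

```python
def conv1d_param_count(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter for a 1D convolution."""
    return kernel_size * in_channels * filters + filters

def conv1d_same_output_shape(steps, filters):
    """With padding="same" and stride 1 the number of steps is preserved."""
    return (steps, filters)

steps, channels = 100, 32  # from train_X_mfccs_2D.shape[1:]
first_layer_params = conv1d_param_count(kernel_size=12, in_channels=channels, filters=128)
first_layer_output = conv1d_same_output_shape(steps, filters=128)

print("first layer parameters:", first_layer_params)
print("first layer output shape per sample:", first_layer_output)
```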

In [226]:
cnn2 = Sequential([
        Convolution1D(input_shape=(train_X_mfccs_2D.shape[1], train_X_mfccs_2D.shape[2]), 
                      kernel_size=12, filters=128, padding="same", activation="relu"),
        Dropout(0.11),
        MaxPooling1D(),
        Convolution1D(kernel_size=12, filters=128, padding="same", activation="relu"),
        Dropout(0.13),
        MaxPooling1D(),
        Flatten(),
        Dense(2000, activation="relu"),
        Dropout(.7),
        Dense(num_categories, activation="softmax")
    ])

cnn2.compile(Adam(lr=0.0001),loss="categorical_crossentropy", metrics=["accuracy"])

In [227]:
cnn2_results = cnn2.fit(train_X_mfccs_2D, train_y, batch_size=64, 
                        epochs=100, validation_data=(cv_X_mfccs_2D, cv_y))

Train on 240 samples, validate on 60 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
...

In [228]:
# show best results
best_training_accuracy = max(cnn2_results.history["acc"])
best_validation_accuracy = max(cnn2_results.history["val_acc"])
print("Best CNN 2 results\nTrain acc: {:.4f}\nCV acc: {:.4f}".format(best_training_accuracy, best_validation_accuracy))

Best CNN 2 results
Train acc: 0.9458
CV acc: 0.3833


We can see that our **best convolutional model's accuracy is above 0.35**. The model's performance is very brittle though, as it depends heavily on the random weight initialization. It is also already overfitting, with a training accuracy of 0.95 against a CV accuracy of 0.38. Let's calculate the F1 score for our latest model.

In [229]:
# predict and one-hot encode
cnn2_pred_cv_y_mfccs_2D = cnn2.predict(cv_X_mfccs_2D, batch_size=32)
cnn2_pred_cv_y_mfccs_2D = utils.one_hot_encode(cnn2_pred_cv_y_mfccs_2D)
cnn2_pred_cv_y_mfccs_2D.shape

(60, 12)

In [230]:
# we can also use sklearn directly to get accuracy
cnn2_cv_accuracy = accuracy_score(cv_y, cnn2_pred_cv_y_mfccs_2D)
cnn2_cv_f1_score = f1_score(cv_y, cnn2_pred_cv_y_mfccs_2D, average="weighted")
print("CNN 2 accuracy via sklearn (CV): {:.4f}".format(cnn2_cv_accuracy))
print("CNN 2 f1 score (CV): {:.4f}".format(cnn2_cv_f1_score))

CNN 2 accuracy via sklearn (CV): 0.3500
CNN 2 f1 score (CV): 0.3767


#### Convolutional models (2D) with Tempogram data
We have 2 other forms of preprocessed data with dimensions that lend themselves to 2D convolutions: Mel Spectrogram and Tempogram. Initial experiments with the Mel Spectrogram data didn't yield promising results, so let's try Tempogram instead.

In [91]:
# Our data is already 2D, we don't need to expand dimensions
print("Tempogram shape: ", train_X_tempogram.shape)

Tempogram shape:  (240, 384, 32)


In [97]:
cnn3 = Sequential([
        Convolution1D(input_shape=(train_X_tempogram.shape[1], train_X_tempogram.shape[2]), 
                      kernel_size=32, filters=128, padding="same", activation="relu"),
        Dropout(0.11),
        MaxPooling1D(),
        Convolution1D(kernel_size=12, filters=128, padding="same", activation="relu"),
        Dropout(0.13),
        MaxPooling1D(),
        Flatten(),
        Dense(2000, activation="relu"),
        Dropout(.7),
        Dense(num_categories, activation="softmax")
    ])

cnn3.compile(Adam(lr=0.0001),loss="categorical_crossentropy", metrics=["accuracy"])

In [98]:
cnn3_results = cnn3.fit(train_X_tempogram, train_y, batch_size=64, 
                        epochs=50, validation_data=(cv_X_tempogram, cv_y))

Train on 240 samples, validate on 60 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


In [99]:
# show best results
best_training_accuracy = max(cnn3_results.history["acc"])
best_validation_accuracy = max(cnn3_results.history["val_acc"])
print("Best CNN 3 results\nTrain acc: {:.4f}\nCV acc: {:.4f}".format(best_training_accuracy, best_validation_accuracy))

Best CNN 3 results
Train acc: 0.3833
CV acc: 0.3500


We can see that our **best convolutional model with the tempogram data reaches an accuracy around 0.35**. This model's performance is far less brittle than that of the 2D MFCC model, and after 50 epochs we aren't overfitting much. Let's check the F1 score and continue training for a couple more epochs.

In [100]:
# predict and one-hot encode
cnn3_pred_cv_y_tempogram = cnn3.predict(cv_X_tempogram, batch_size=32)
cnn3_pred_cv_y_tempogram = utils.one_hot_encode(cnn3_pred_cv_y_tempogram)
cnn3_pred_cv_y_tempogram.shape

(60, 12)

In [101]:
# we can also use sklearn directly to get accuracy
cnn3_cv_accuracy = accuracy_score(cv_y, cnn3_pred_cv_y_tempogram)
cnn3_cv_f1_score = f1_score(cv_y, cnn3_pred_cv_y_tempogram, average="weighted")
print("CNN 3 (tempogram) accuracy via sklearn (CV): {:.4f}".format(cnn3_cv_accuracy))
print("CNN 3 (tempogram) f1 score (CV): {:.4f}".format(cnn3_cv_f1_score))

CNN 3 (tempogram) accuracy via sklearn (CV): 0.3333
CNN 3 (tempogram) f1 score (CV): 0.2913


In [102]:
cnn3_results = cnn3.fit(train_X_tempogram, train_y, batch_size=64, 
                        epochs=50, validation_data=(cv_X_tempogram, cv_y))

Train on 240 samples, validate on 60 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


In [103]:
# show best results after another 50 epochs
best_training_accuracy = max(cnn3_results.history["acc"])
best_validation_accuracy = max(cnn3_results.history["val_acc"])
print("Best CNN 3 results\nTrain acc: {:.4f}\nCV acc: {:.4f}".format(best_training_accuracy, best_validation_accuracy))

Best CNN 3 results
Train acc: 0.5083
CV acc: 0.4000


In [104]:
# predict and one-hot encode
cnn3_pred_cv_y_tempogram = cnn3.predict(cv_X_tempogram, batch_size=32)
cnn3_pred_cv_y_tempogram = utils.one_hot_encode(cnn3_pred_cv_y_tempogram)

# we can also use sklearn directly to get accuracy
cnn3_cv_accuracy = accuracy_score(cv_y, cnn3_pred_cv_y_tempogram)
cnn3_cv_f1_score = f1_score(cv_y, cnn3_pred_cv_y_tempogram, average="weighted")
print("Final CNN 3 (tempogram) accuracy via sklearn (CV): {:.4f}".format(cnn3_cv_accuracy))
print("Final CNN 3 (tempogram) f1 score (CV): {:.4f}".format(cnn3_cv_f1_score))

Final CNN 3 (tempogram) accuracy via sklearn (CV): 0.3833
Final CNN 3 (tempogram) f1 score (CV): 0.3357


We have reached the **best CV accuracy so far - 0.4**, but we're beginning to overfit. Given that we're working on a relatively small sample, this could be a viable starting point for training models on the entire training set.
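
With only 60 CV samples, it is worth checking how much of the difference between models could be noise. Below is a rough sketch using the binomial standard error of an accuracy estimate. It assumes independent samples and a normal approximation, which is crude at this size, and the helper name is ours.

```python
import math

def accuracy_standard_error(accuracy, num_samples):
    """Binomial standard error of an accuracy measured on num_samples examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_samples)

num_cv_samples = 60
step = 1.0 / num_cv_samples  # one sample changes the accuracy by this much
se_040 = accuracy_standard_error(0.40, num_cv_samples)
approx_95_interval = (0.40 - 1.96 * se_040, 0.40 + 1.96 * se_040)

print("accuracy step per sample: {:.4f}".format(step))
print("standard error at 0.40:   {:.4f}".format(se_040))
print("approx. 95% interval:     ({:.3f}, {:.3f})".format(*approx_95_interval))
```

Under these assumptions, the gap between 0.35 and 0.40 amounts to 3 samples and sits well inside the interval. The ranking of our CNN variants should therefore be confirmed on more data before we rely on it.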

### Recurrent Models
We can also try to take advantage of architectures specifically designed for time sequences: RNNs. We will start with the basic Keras implementation of a simple RNN. After that we can consider moving on to GRUs & LSTMs.

#### Simple RNN
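
Before using the Keras layer, here is a minimal numpy sketch of what a simple RNN computes: the hidden state is updated at every time step from the current input and the previous state. The weights are random, the sizes are kept tiny, and the function name is ours; it illustrates the recurrence and is not the Keras implementation.

```python
import numpy as np

def simple_rnn_forward(sequence, w_in, w_rec, bias):
    """Return the final hidden state after reading `sequence` of shape (steps, features)."""
    hidden = np.zeros(w_rec.shape[0])
    for x_t in sequence:
        # relu activation, as in the SimpleRNN layer used in this section
        hidden = np.maximum(0.0, x_t @ w_in + hidden @ w_rec + bias)
    return hidden

rng = np.random.RandomState(0)
steps, features, units = 100, 1, 8  # units kept tiny; the notebook uses 1000
w_in = rng.normal(scale=0.1, size=(features, units))
w_rec = rng.normal(scale=0.1, size=(units, units))
bias = np.zeros(units)

toy_sequence = rng.normal(size=(steps, features))
final_state = simple_rnn_forward(toy_sequence, w_in, w_rec, bias)
print(final_state.shape)
```

Only the final state is passed on to the dense layers, so the network has to compress all 100 steps into that one vector.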

In [51]:
rnn_1 = Sequential([
        SimpleRNN(input_shape=(conv_train_X_mfccs_1D.shape[1], 1), units=1000, activation='relu'),
        Dropout(.4),
        Dense(2000, activation="relu"),
        BatchNormalization(),
        Dropout(.8),
        Dense(num_categories, activation="softmax")
    ])

rnn_1.compile(Adam(lr=0.001),loss="categorical_crossentropy", metrics=["accuracy"])

In [52]:
rnn_1_results = rnn_1.fit(conv_train_X_mfccs_1D, train_y, batch_size=32, epochs=50, validation_data=(conv_cv_X_mfccs_1D, cv_y))

Train on 240 samples, validate on 60 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


In [54]:
# show best results
best_training_accuracy = max(rnn_1_results.history["acc"])
best_validation_accuracy = max(rnn_1_results.history["val_acc"])
print("Best RNN 1 results\nTrain acc: {:.4f}\nCV acc: {:.4f}".format(best_training_accuracy, best_validation_accuracy))

Best RNN 1 results
Train acc: 0.9458
CV acc: 0.1667


Even after some experiments to reduce overfitting, our SimpleRNN doesn't seem able to reach a good CV accuracy: it fits the training set (0.95) but stays near 0.17 on the CV set.

#### GRU
Let's try the Gated Recurrent Unit network.
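
As a reminder of what the gates do, here is a numpy sketch of a single GRU step in one common formulation. Implementations differ in details such as where the reset gate is applied, so treat this as an illustration and not as the exact Keras computation. The weights are random and all names are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, hidden, params):
    """One GRU update; `params` maps gate name -> (input weights, recurrent weights, bias)."""
    w_z, u_z, b_z = params["update"]
    w_r, u_r, b_r = params["reset"]
    w_h, u_h, b_h = params["candidate"]
    update = sigmoid(x_t @ w_z + hidden @ u_z + b_z)
    reset = sigmoid(x_t @ w_r + hidden @ u_r + b_r)
    candidate = np.tanh(x_t @ w_h + (reset * hidden) @ u_h + b_h)
    # the update gate interpolates between the old state and the candidate
    return update * hidden + (1.0 - update) * candidate

rng = np.random.RandomState(0)
features, units = 1, 8  # units kept tiny; the notebook uses 1000
gru_params = {}
for name in ("update", "reset", "candidate"):
    gru_params[name] = (rng.normal(scale=0.1, size=(features, units)),
                        rng.normal(scale=0.1, size=(units, units)),
                        np.zeros(units))

gru_state = np.zeros(units)
for x_t in rng.normal(size=(100, features)):
    gru_state = gru_step(x_t, gru_state, gru_params)
print(gru_state.shape)
```

Because the new state is a gated mix of the old state and a `tanh` candidate, it stays bounded, which tends to make GRUs easier to train on long sequences than a plain RNN.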

In [118]:
rnn_2 = Sequential([
        GRU(input_shape=(conv_train_X_mfccs_1D.shape[1], 1), units=1000, activation='tanh'),
        Dense(1000, activation="relu"),
        BatchNormalization(),
        Dropout(.7),
        Dense(num_categories, activation="softmax")
    ])

rnn_2.compile(Adam(lr=0.003),loss="categorical_crossentropy", metrics=["accuracy"])

In [119]:
rnn_2_results = rnn_2.fit(conv_train_X_mfccs_1D, train_y, batch_size=32, epochs=10, validation_data=(conv_cv_X_mfccs_1D, cv_y))

Train on 240 samples, validate on 60 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
