# Model experiments - sample set

In the previous notebooks we have separated a small subset of our data, called "sample", on which we can now experiment with simple models to assess the effectiveness of our preprocessing & data augmentation techniques.

We do it this way to avoid spending too much time training on the entire set. The assumption is that the methods which are effective on the sample will work well on a larger scale too.

We will start by testing a couple of simple models on untouched sample data (as numpy arrays) and then proceed towards data augmentation and finally spectrograms.

In [1]:
# first make sure we're in the parent directory of our data/sample folders.
!pwd

/home/paperspace/tensorflow_speech_recognition


## Import
We'll need a couple of additional libraries so let's import them.

In [2]:
# filter out warnings
import warnings
warnings.filterwarnings('ignore') 

In [3]:
import bcolz
import glob
import librosa
import matplotlib.pyplot as plt
import numpy as np
import os
import tensorflow

# utils
from importlib import reload
import utils; reload(utils)

# keras as tensorflow backend
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, BatchNormalization, Dropout, Convolution1D, MaxPooling1D, Flatten
from tensorflow.python.keras.optimizers import Adam

# F1 and accuracy score metric
from sklearn.metrics import f1_score, accuracy_score
from sklearn.ensemble import RandomForestClassifier

## Prepare data
The easiest way to work with data is by turning it into a list of numbers, in our case a numpy array. We can use one of the functions from utils to load the raw data or use the librosa.load() function. The difference lies in the fact that the former returns int16s whereas librosa returns float32s and uses its default sampling rate of 22050Hz, unless we explicitly tell it to use the file's original sampling rate of 16000Hz.

We should also consider normalizing our data (so that it all falls within the same scale) and using the preprocessing methods explored in the previous notebook (MFCCs, Mel spectrogram, fast fourier transform and tempogram). 
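
As a rough illustration of what normalizing could look like, here is a minimal sketch using only numpy. The function names `int16_to_float32` and `standardize` are hypothetical (they are not part of utils), and dividing by 32768 is just one common convention for bringing int16 samples into the [-1, 1) range.

```python
import numpy as np

def int16_to_float32(samples):
    """Scale int16 PCM samples into the range [-1.0, 1.0) (hypothetical helper)."""
    return samples.astype(np.float32) / 32768.0

def standardize(x, eps=1e-8):
    """Shift to zero mean and scale to unit variance (hypothetical helper)."""
    return (x - x.mean()) / (x.std() + eps)

# toy stand-in for the raw int16 values of a .wav file
raw = np.array([-32768, -11, 0, 21, 32767], dtype=np.int16)
scaled = int16_to_float32(raw)
print(scaled.dtype, scaled.min(), scaled.max())
```

Which of the two (fixed scaling or per-example standardization) works better is something we would have to test on the sample.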

In [4]:
path_to_sample = "data/sample"

We'll have to go through each of the folders in our sample/train, cv and test sets, one-hot encode their label and load the 16K long array of raw data. The y data will be of shape (m, 12), where m is the number of examples, and the X data will be of shape (m, 16000) - at least for the raw .wav input.

Let's calculate **m** first. We will do that by using a function that creates a list of all the .wav files within a directory.

### Create a list of paths
We will use the glob module that we learned about in the very first notebook and a function from utils.py which can, given a directory, return a list of paths to .wav files within it. We will repeat the process for all 3 sets within sample, and every category subdirectory within those too.
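
For readers who don't have utils.py at hand, a function like this could plausibly be written as follows. This is only a sketch: `grab_wavs_sketch` is a hypothetical name, and sorting the result is our own addition (the real helper may return the files in a different order).

```python
import glob
import os
import tempfile

def grab_wavs_sketch(directory):
    """Return the paths of all .wav files directly inside `directory`."""
    # sorting is an assumption made here to keep the output deterministic
    return sorted(glob.glob(os.path.join(directory, "*.wav")))

# small demo on a temporary directory with two .wav files and one other file
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.wav", "a.wav", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(p) for p in grab_wavs_sketch(tmp)]

print(found)
```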

In [5]:
# for example we can grab all .wav files from sample/train/stop
path_to_sample_train_stop = os.path.join(path_to_sample, "train", "stop")
utils.grab_wavs(path_to_sample_train_stop)[:5]

['data/sample/train/stop/01b4757a_nohash_0.wav',
 'data/sample/train/stop/3ac2e76f_nohash_0.wav',
 'data/sample/train/stop/3e31dffe_nohash_3.wav',
 'data/sample/train/stop/37bd115d_nohash_1.wav',
 'data/sample/train/stop/6c2dd2d5_nohash_0.wav']

In [6]:
# we'll need a list of all category folder names
categories_to_predict = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go", "silence", "unknown"]

In [7]:
# first grab the training set
path_to_train = os.path.join(path_to_sample, "train")
sample_train_wavs = []

for category in categories_to_predict:
    path_to_category = os.path.join(path_to_train, category)
    category_files = utils.grab_wavs(path_to_category)
    
    # we use extend instead of append to add all elements from the iterable
    sample_train_wavs.extend(category_files)
    
sample_train_wavs

['data/sample/train/yes/0f3f64d5_nohash_0.wav',
 'data/sample/train/yes/8a28231e_nohash_3.wav',
 'data/sample/train/yes/d3f22f0e_nohash_0.wav',
 'data/sample/train/yes/2d3c8dcb_nohash_1.wav',
 'data/sample/train/yes/d486fb84_nohash_0.wav',
 'data/sample/train/yes/61d3e51e_nohash_0.wav',
 'data/sample/train/yes/8f811bbc_nohash_0.wav',
 'data/sample/train/yes/b43de700_nohash_0.wav',
 'data/sample/train/yes/66a412a7_nohash_0.wav',
 'data/sample/train/yes/92e17cc4_nohash_0.wav',
 'data/sample/train/yes/6c9223bd_nohash_0.wav',
 'data/sample/train/yes/d5356b9a_nohash_0.wav',
 'data/sample/train/yes/e7d0eb3f_nohash_1.wav',
 'data/sample/train/yes/712e4d58_nohash_2.wav',
 'data/sample/train/yes/a2fefcb4_nohash_0.wav',
 'data/sample/train/yes/324210dd_nohash_1.wav',
 'data/sample/train/yes/742d6431_nohash_0.wav',
 'data/sample/train/yes/70a00e98_nohash_1.wav',
 'data/sample/train/yes/28e47b1a_nohash_4.wav',
 'data/sample/train/yes/7014b07e_nohash_1.wav',
 'data/sample/train/no/f0ebef1b_nohash_0

In [8]:
# repeat for cv
path_to_cv = os.path.join(path_to_sample, "cv")
sample_cv_wavs = []

for category in categories_to_predict:
    path_to_category = os.path.join(path_to_cv, category)
    category_files = utils.grab_wavs(path_to_category)
    sample_cv_wavs.extend(category_files)

# repeat for test
path_to_test = os.path.join(path_to_sample, "test")
sample_test_wavs = []

for category in categories_to_predict:
    path_to_category = os.path.join(path_to_test, category)
    category_files = utils.grab_wavs(path_to_category)
    sample_test_wavs.extend(category_files)

### One-hot encode the y

Now that we have the 3 lists of files from each set (train, cv and test) we can construct our train_y, cv_y and test_y numpy arrays. These will be matrices of size (m, 12), one-hot encoded. E.g. if a row belongs to the category "up" it will take the form of an array of zeros, where the entry at index 2 (the third from the left) will become a 1.

We will use a function from the utils that takes a path to a .wav, the index at which the category name starts within it (we want to control this because we will eventually use this for the main set, not just the sample) and a list of categories to predict. For our current example, the category name in the paths belonging to "train" starts at the 18th index (separators count as one char).

In [9]:
# let's grab a single path (this one is a "left")
a_wav = sample_train_wavs[80]
a_wav

'data/sample/train/left/4ec7d027_nohash_0.wav'

In [10]:
# let's see if the 1 is correctly placed
utils.one_hot_encode_path(a_wav, 18, categories_to_predict)

array([0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.])

The path belonged to the fifth category ("left") and the one-hot encoding correctly placed the 1 at index 4 (zero-indexed).
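
Instead of counting characters by hand, the start index can be derived from the path of the set. The sketch below assumes forward slashes as separators, and both function names are hypothetical stand-ins rather than the actual utils implementation.

```python
import numpy as np

categories = ["yes", "no", "up", "down", "left", "right",
              "on", "off", "stop", "go", "silence", "unknown"]

def category_start_index(path_to_set):
    # the category name starts right after "<path_to_set>/"
    return len(path_to_set) + 1

def one_hot_from_path(path_to_wav, start, categories):
    # the category name runs from `start` up to the next separator
    category = path_to_wav[start:].split("/")[0]
    encoded = np.zeros(len(categories))
    encoded[categories.index(category)] = 1.0
    return encoded

start = category_start_index("data/sample/train")
encoded = one_hot_from_path("data/sample/train/left/4ec7d027_nohash_0.wav",
                            start, categories)
print(start, encoded)
```

The same calculation gives 15 for "data/sample/cv" and 17 for "data/sample/test".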

We want to repeat this for all examples in each of the 3 subsets, adding each new one-hot encoded numpy array as a new row of the y matrix, in order.

In [11]:
# figure out the dimensions of train_y
rows = len(sample_train_wavs)
columns = len(categories_to_predict)
dimensions = (rows, columns)
dimensions

(240, 12)

In [12]:
# create train_y as empty array
train_y = np.array([])

# append each row to train_y
for path_to_wav in sample_train_wavs:
    row = utils.one_hot_encode_path(path_to_wav, 18, categories_to_predict)
    
    # append the new row
    train_y = np.append(train_y, row)
    
# we currently have a flattened vector
print("Current shape: {}".format(*train_y.shape))

# let's reshape it
train_y = np.reshape(train_y, dimensions)
print("New shape: {}".format(train_y.shape))

Current shape: 2880
New shape: (240, 12)


In [13]:
# show the train_y matrix to confirm
train_y

array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.]])

We can see that the first 3 entries have the 1 at index 0, which means they belong to the category "yes", and the last three have the 1 at the last index ("unknown"). Both are correct, since our list of paths was built in the same order as the list of categories.

We should bear in mind that by default the np.array contains float64s and our functions for loading a .wav return int16s.

Since this is a highly-repetitive task we'll want to use the utils function for obtaining the y.
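
Appending to a flat array and reshaping afterwards works, but every np.append call copies the whole array. A possible alternative is to allocate the matrix once and fill it row by row. This is a minimal sketch with a hypothetical function name, working on label strings rather than paths, and it is not necessarily how utils.get_y is written.

```python
import numpy as np

def build_y_preallocated(labels, categories):
    # allocate the full (m, number of categories) matrix of zeros once
    y = np.zeros((len(labels), len(categories)), dtype=np.float64)
    for row_index, label in enumerate(labels):
        # place the single 1 in the column of the matching category
        y[row_index, categories.index(label)] = 1.0
    return y

toy_categories = ["yes", "no", "up"]
toy_y = build_y_preallocated(["yes", "yes", "up", "no"], toy_categories)
print(toy_y.shape, toy_y.dtype)
```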

Repeat for **CV set**.

In [14]:
# figure out the dimensions
rows = len(sample_cv_wavs)
columns = len(categories_to_predict)
dimensions = (rows, columns)
print("Target dimensions: {}".format(dimensions))

# get the y
cv_y = utils.get_y(sample_cv_wavs, 15, categories_to_predict)
print("Received shape: {}".format(cv_y.shape))

Target dimensions: (60, 12)
Received shape: (60, 12)


Repeat for **Test set**.

In [15]:
# figure out the dimensions
rows = len(sample_test_wavs)
columns = len(categories_to_predict)
dimensions = (rows, columns)
print("Target dimensions: {}".format(dimensions))

# get the y
test_y = utils.get_y(sample_test_wavs, 17, categories_to_predict)
print("Received shape: {}".format(test_y.shape))

Target dimensions: (60, 12)
Received shape: (60, 12)


In [16]:
test_y[0]

array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

### Get the X
We have the y - the one-hot encoded vectors representing the category for each training, cv and test example in the sample set. We need the feature vectors, conventionally referred to as X. We will use both the simplest way of extracting the .wav data and the preprocessing techniques - MFCCs, Mel spectrogram, FFT and tempogram.

Let's start by defining a simple helper function for just the raw .wav data. Since our samples are of slightly differing lengths but each row of our X always has to have the same length, we will **add padding by default.**

In [17]:
# get the desired number of columns (n)
n = len(utils.get_wav_info(path_to_wav)[1])
n

16000

#### Raw .wav data

In [18]:
# define a simple helper function
def get_X_with_padding(list_of_paths, columns=16000):
    
    # get shape data
    rows = len(list_of_paths)
    dimensions = (rows, columns)
    
    # create placeholder
    X = np.array([])
    
    # go through every file path in the list
    for path_to_wav in list_of_paths:

        # get raw array of signed ints
        row = utils.get_wav_info(path_to_wav)[1]
        
        # some of our samples have fewer (or slightly more) than 16000 values, so let's adjust them
        # trim to fixed length
        row = row[:columns]
        
        # pad with zeros, calculating amount of padding needed
        padding = columns - len(row)
        row = np.pad(row, (0, padding), mode='constant', constant_values=0)

        # append the new row
        X = np.append(X, row)
    
    # reshape (unroll)
    X = np.reshape(X, dimensions)
    
    return X
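
To convince ourselves that trimming first and padding second handles both cases, here is a quick check of the same logic on toy arrays. The name `pad_or_trim` is hypothetical and only isolates the two steps used in the function above.

```python
import numpy as np

def pad_or_trim(row, columns):
    # trim anything beyond the fixed length
    row = row[:columns]
    # pad with zeros on the right if the row is too short
    return np.pad(row, (0, columns - len(row)), mode="constant", constant_values=0)

short_row = pad_or_trim(np.array([1, 2, 3]), 5)   # too short, gets padded
long_row = pad_or_trim(np.arange(8), 5)           # too long, gets trimmed
print(short_row, long_row)
```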

In [19]:
# get the X for each set
train_X = utils.get_X(sample_train_wavs, n)
cv_X = utils.get_X(sample_cv_wavs, n)
test_X = utils.get_X(sample_test_wavs, n)

print("Train: ", train_X.shape)
print("CV: ", cv_X.shape)
print("Test: ",test_X.shape)

Train:  (240, 16000)
CV:  (60, 16000)
Test:  (60, 16000)


In [20]:
train_X[0][:5]

array([-11., -21., -25., -42., -33.])

#### MFCCs

We can do the same for the MFCCs. We can choose whether the function returns only the mean values (1D) or the full matrix (2D). For now let's obtain both versions.
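
Judging by the shapes we get back, the 1D version presumably averages each coefficient over the time frames. That is an assumption about utils.get_X_mfccs, but the idea can be sketched with a random matrix standing in for real MFCCs:

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for one example's MFCC matrix: (n_mfcc, frames)
mfccs_2d = rng.normal(size=(100, 32))

# averaging over the time axis leaves one value per coefficient
mfccs_1d = mfccs_2d.mean(axis=1)

print(mfccs_2d.shape, mfccs_1d.shape)
```

The 1D version is much smaller but throws away how each coefficient changes over time.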

In [21]:
# let's start with a reasonable number of mfccs to return
n_mfcc = 100

In [22]:
train_X_mfccs_1D = utils.get_X_mfccs(sample_train_wavs, shape=(n_mfcc, 32), mean=True)
cv_X_mfccs_1D = utils.get_X_mfccs(sample_cv_wavs, shape=(n_mfcc, 32), mean=True)
test_X_mfccs_1D = utils.get_X_mfccs(sample_test_wavs, shape=(n_mfcc, 32), mean=True)

print("Train mfccs: ", train_X_mfccs_1D.shape)
print("CV mfccs: ", cv_X_mfccs_1D.shape)
print("Test mfccs: ",test_X_mfccs_1D.shape)

Train mfccs:  (240, 100)
CV mfccs:  (60, 100)
Test mfccs:  (60, 100)


In [23]:
train_X_mfccs_1D[0][:5]

array([-445.35308297,   31.80853315,   -3.24210645,   18.580648  ,
          0.94357346])

And now for the 2-dim output.

In [24]:
train_X_mfccs_2D = utils.get_X_mfccs(sample_train_wavs, shape=(n_mfcc, 32), mean=False)
cv_X_mfccs_2D = utils.get_X_mfccs(sample_cv_wavs, shape=(n_mfcc, 32), mean=False)
test_X_mfccs_2D = utils.get_X_mfccs(sample_test_wavs, shape=(n_mfcc, 32), mean=False)

print("Train mfccs: ", train_X_mfccs_2D.shape)
print("CV mfccs: ", cv_X_mfccs_2D.shape)
print("Test mfccs: ",test_X_mfccs_2D.shape)

Train mfccs:  (240, 100, 32)
CV mfccs:  (60, 100, 32)
Test mfccs:  (60, 100, 32)


In [25]:
train_X_mfccs_2D[0][0][:5]

array([-576.51036871, -570.90724179, -544.55853427, -553.88026846,
       -578.52544577])

#### Mel spectrogram

In case of Mel spectrograms we expect to get a matrix from a vector, therefore our final X will be 3 dimensional.

In [26]:
# let's see the difference in dimensions
sr, raw_data = utils.get_wav_info(path_to_wav)
print("Raw data shape: {}".format(raw_data.shape))
x = librosa.feature.melspectrogram(y=raw_data, sr=sr)
print("Mel spectrogram shape: {}".format(x.shape))

Raw data shape: (16000,)
Mel spectrogram shape: (128, 32)
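
The (128, 32) shape can be explained with a little arithmetic, assuming the librosa defaults of 128 Mel bands, a hop length of 512 samples and centered frames (we did not override any of them). The helper name below is hypothetical.

```python
def expected_frames(num_samples, hop_length=512):
    # with centered frames there is one frame per hop, plus the initial frame
    return 1 + num_samples // hop_length

frames = expected_frames(16000)
print(frames)
```

So a 16000-value clip yields 32 columns, and shorter or longer clips yield a different number of columns, which is why we need trimming and padding here too.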


In [27]:
# here's the function we'll use (via utils.py)
def get_X_mel_spectrogram(list_of_paths, shape=(128, 32)):

    # get shape data
    rows = len(list_of_paths)

    # create placeholder
    result = np.array([])

    # go through every file path in the list
    for path_to_wav in list_of_paths:
        
        # get raw array of signed ints
        sr, raw_data = utils.get_wav_info(path_to_wav)
        mel_spectrogram = librosa.feature.melspectrogram(y=raw_data, sr=sr)

        # some of our samples have less (or slightly more) than the expected amount of values,
        # so let's adjust them
        placeholder = np.array([])
        for row in mel_spectrogram:
            
            # trim to fixed length
            row = row[:shape[1]]

            # pad with zeros, calculating amount of padding needed
            padding = shape[1] - len(row)
            row = np.pad(row, (0, padding), mode='constant', constant_values=0)

            # append the new row
            placeholder = np.append(placeholder, row)
        
        # append the new unrolled matrix to the final result array
        result = np.append(result, placeholder)
    
    # reshape into a 3-dim matrix
    result = np.reshape(result, (len(list_of_paths), shape[0], shape[1]))
    
    return result

Let's obtain the Mel spectrograms for all sample sets.

In [28]:
train_X_mel_spectrogram = utils.get_X_mel_spectrogram(sample_train_wavs)
cv_X_mel_spectrogram = utils.get_X_mel_spectrogram(sample_cv_wavs)
test_X_mel_spectrogram = utils.get_X_mel_spectrogram(sample_test_wavs)

print("Train mel spectrogram: ", train_X_mel_spectrogram.shape)
print("CV mel spectrogram: ", cv_X_mel_spectrogram.shape)
print("Test mel spectrogram: ",test_X_mel_spectrogram.shape)

Train mel spectrogram:  (240, 128, 32)
CV mel spectrogram:  (60, 128, 32)
Test mel spectrogram:  (60, 128, 32)


In [29]:
# each row is a 2D matrix (hence double-indexing)
train_X_mel_spectrogram[0][0]

array([1457703.31342286,  427352.21837965,  201594.17632909,
        238226.53307341,   65491.03447233,  178705.29710062,
        412570.67348324,  378435.94871593,  258376.05295182,
        187066.96734191,  239451.9017311 ,   56142.51095658,
         42836.42147842,  139791.55001964,  102884.20436902,
        167352.41037348,  321818.54914338,  559749.40569307,
        989871.95717842,  918093.81247816, 1827327.90723131,
       1677686.13316353,  673552.71678095,  419856.38671465,
        384360.98528384,  454044.27309286,  670187.30942213,
        427213.30004477,  395041.75416788,  548740.11881148,
        294776.165019  ,  336066.61377942])

#### FFT (Fast Fourier Transform)

Let's obtain the FFT of our raw data too. For simplicity the utils.get_X_fft() function casts the complex numbers to the numpy float64.
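
It's worth being aware of what such a cast does: keeping only a float from a complex number discards the imaginary part. The sketch below uses numpy's own FFT on a synthetic tone (not utils.extract_fft, whose internals we don't show here) and compares the real part with the magnitude, which is a common alternative.

```python
import numpy as np

# one second of a 440 Hz sine wave sampled at 16000 Hz
t = np.arange(16000) / 16000.0
signal = np.sin(2 * np.pi * 440.0 * t)

spectrum = np.fft.fft(signal)     # complex values, same length as the input
real_part = spectrum.real         # what remains after dropping the imaginary part
magnitude = np.abs(spectrum)      # alternative that keeps the energy of both parts

# for a one-second clip, bin k corresponds to k Hz
peak_bin = int(np.argmax(magnitude[:8000]))
print(spectrum.dtype, real_part.dtype, peak_bin)
```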

In [30]:
# here the shapes are the same
x = utils.extract_fft(path_to_wav)
x.shape

(16000,)

In [31]:
train_X_fft = utils.get_X_fft(sample_train_wavs)
cv_X_fft = utils.get_X_fft(sample_cv_wavs)
test_X_fft = utils.get_X_fft(sample_test_wavs)

print("Train fft: ", train_X_fft.shape)
print("CV fft: ", cv_X_fft.shape)
print("Test fft: ",test_X_fft.shape)

Train fft:  (240, 16000)
CV fft:  (60, 16000)
Test fft:  (60, 16000)


In [32]:
# no longer complex numbers
print(type(test_X_fft[0][0]))
test_X_fft[0][:5]

<class 'numpy.float64'>


array([-11519.1640625 ,   -783.21451673,  -2758.71579679,   4396.02183472,
         3284.95650186])

#### Tempogram

With tempogram we have to do some reshaping to get a 3D matrix, just like with Mel spectrograms. We will also have to do a little bit of padding and trimming, to account for small differences in the length of the original sample.

In [33]:
# let's see the difference in dimensions
x = utils.extract_tempogram(path_to_wav)
print("Tempogram: {}".format(x.shape))

Tempogram: (384, 32)


In [34]:
train_X_tempogram = utils.get_X_tempogram(sample_train_wavs)
cv_X_tempogram = utils.get_X_tempogram(sample_cv_wavs)
test_X_tempogram = utils.get_X_tempogram(sample_test_wavs)

print("Train tempogram: ", train_X_tempogram.shape)
print("CV tempogram: ", cv_X_tempogram.shape)
print("Test tempogram: ",test_X_tempogram.shape)

Train tempogram:  (240, 384, 32)
CV tempogram:  (60, 384, 32)
Test tempogram:  (60, 384, 32)


In [35]:
# each row is a 2D matrix (hence double-indexing)
train_X_tempogram[0][5]

array([0.1764312 , 0.17643258, 0.17643404, 0.17643558, 0.17643721,
       0.17643893, 0.17644072, 0.1764426 , 0.17644456, 0.17644661,
       0.17644875, 0.17645096, 0.17645327, 0.17645565, 0.17645813,
       0.17646069, 0.17646334, 0.17646607, 0.1764689 , 0.17647181,
       0.17647481, 0.1764779 , 0.17648109, 0.17648437, 0.17648774,
       0.1764912 , 0.17649476, 0.17649842, 0.17650217, 0.17650602,
       0.17650997, 0.17651403])

## Persist the preprocessed X and y
It's good practice to persist our preprocessed datasets so that we don't have to recalculate all of the preprocessing (which in large datasets can be time-consuming).

A great library for this purpose is the bcolz library (for binary columns).

In [36]:
# define the bcolz array saving functions
def bcolz_save(fname, arr):
    c = bcolz.carray(arr, rootdir=fname, mode='w')
    c.flush()

def bcolz_load(fname):
    return bcolz.open(fname)[:]
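
If bcolz is not available in your environment, numpy's own .npy format can serve the same purpose for arrays of this size. This is a swapped-in alternative, not what the rest of this notebook uses, and the function names are hypothetical.

```python
import os
import tempfile
import numpy as np

def npy_save(fname, arr):
    # write the array to a single .npy file
    np.save(fname, arr)

def npy_load(fname):
    # read the whole array back into memory
    return np.load(fname)

# round trip on a small one-hot matrix in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_y.npy")
    original = np.eye(12)[:3]
    npy_save(path, original)
    restored = npy_load(path)

print(restored.shape)
```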

In [37]:
!pwd

/home/paperspace/tensorflow_speech_recognition


In [38]:
path_to_sample_preprocessed = os.path.join(path_to_sample, "preprocessed")
path_to_sample_preprocessed

'data/sample/preprocessed'

In [39]:
# create the directory if it's not there already
# !mkdir $path_to_sample_preprocessed

#### Persist the y

In [40]:
# save the y
bcolz_save(path_to_sample_preprocessed + os.path.sep + "train_y" + ".bc", train_y)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "cv_y" + ".bc", cv_y)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "test_y" + ".bc", test_y)

#### Persist the X

In [41]:
# save the X
# raw data
bcolz_save(path_to_sample_preprocessed + os.path.sep + "train_X" + ".bc", train_X)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "cv_X" + ".bc", cv_X)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "test_X" + ".bc", test_X)

In [42]:
# MFCCs (1dim and 2dim)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "train_X_mfccs_1D" + ".bc", train_X_mfccs_1D)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "cv_X_mfccs_1D" + ".bc", cv_X_mfccs_1D)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "test_X_mfccs_1D" + ".bc", test_X_mfccs_1D)

bcolz_save(path_to_sample_preprocessed + os.path.sep + "train_X_mfccs_2D" + ".bc", train_X_mfccs_2D)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "cv_X_mfccs_2D" + ".bc", cv_X_mfccs_2D)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "test_X_mfccs_2D" + ".bc", test_X_mfccs_2D)

In [43]:
# Mel spectrogram
bcolz_save(path_to_sample_preprocessed + os.path.sep + "train_X_mel_spectrogram" + ".bc", train_X_mel_spectrogram)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "cv_X_mel_spectrogram" + ".bc", cv_X_mel_spectrogram)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "test_X_mel_spectrogram" + ".bc", test_X_mel_spectrogram)

In [44]:
# FFT
bcolz_save(path_to_sample_preprocessed + os.path.sep + "train_X_fft" + ".bc", train_X_fft)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "cv_X_fft" + ".bc", cv_X_fft)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "test_X_fft" + ".bc", test_X_fft)

In [45]:
# Tempogram
bcolz_save(path_to_sample_preprocessed + os.path.sep + "train_X_tempogram" + ".bc", train_X_tempogram)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "cv_X_tempogram" + ".bc", cv_X_tempogram)
bcolz_save(path_to_sample_preprocessed + os.path.sep + "test_X_tempogram" + ".bc", test_X_tempogram)

## Reload the preprocessed X and y
In order not to have to re-run the entire notebook to obtain the preprocessed X and the corresponding y matrices, let's reload them and then proceed to train simple models.

If you're reloading the X & y after restarting the notebook you will also have to run the cells that define the bcolz functions and the path names.

#### Reload the y

In [46]:
# load the y
train_y = bcolz_load(path_to_sample_preprocessed + os.path.sep + "train_y" + ".bc")
cv_y = bcolz_load(path_to_sample_preprocessed + os.path.sep + "cv_y" + ".bc")
test_y = bcolz_load(path_to_sample_preprocessed + os.path.sep + "test_y" + ".bc")

In [47]:
train_y.shape

(240, 12)

#### Reload the X

In [48]:
# load the X
# raw data
train_X = bcolz_load(path_to_sample_preprocessed + os.path.sep + "train_X" + ".bc")
cv_X = bcolz_load(path_to_sample_preprocessed + os.path.sep + "cv_X" + ".bc")
test_X = bcolz_load(path_to_sample_preprocessed + os.path.sep + "test_X" + ".bc")
train_X.shape

(240, 16000)

In [51]:
# MFCCs (1D and 2D)
train_X_mfccs_1D = bcolz_load(path_to_sample_preprocessed + os.path.sep + "train_X_mfccs_1D" + ".bc")
cv_X_mfccs_1D = bcolz_load(path_to_sample_preprocessed + os.path.sep + "cv_X_mfccs_1D" + ".bc")
test_X_mfccs_1D = bcolz_load(path_to_sample_preprocessed + os.path.sep + "test_X_mfccs_1D" + ".bc")
print(train_X_mfccs_1D.shape)

train_X_mfccs_2D = bcolz_load(path_to_sample_preprocessed + os.path.sep + "train_X_mfccs_2D" + ".bc")
cv_X_mfccs_2D = bcolz_load(path_to_sample_preprocessed + os.path.sep + "cv_X_mfccs_2D" + ".bc")
test_X_mfccs_2D = bcolz_load(path_to_sample_preprocessed + os.path.sep + "test_X_mfccs_2D" + ".bc")
print(train_X_mfccs_2D.shape)

(240, 100)
(240, 100, 32)


In [52]:
# Mel spectrogram
train_X_mel_spectrogram = bcolz_load(path_to_sample_preprocessed + os.path.sep + "train_X_mel_spectrogram" + ".bc")
cv_X_mel_spectrogram = bcolz_load(path_to_sample_preprocessed + os.path.sep + "cv_X_mel_spectrogram" + ".bc")
test_X_mel_spectrogram = bcolz_load(path_to_sample_preprocessed + os.path.sep + "test_X_mel_spectrogram" + ".bc")
train_X_mel_spectrogram.shape

(240, 128, 32)

In [53]:
# FFT
train_X_fft = bcolz_load(path_to_sample_preprocessed + os.path.sep + "train_X_fft" + ".bc")
cv_X_fft = bcolz_load(path_to_sample_preprocessed + os.path.sep + "cv_X_fft" + ".bc")
test_X_fft = bcolz_load(path_to_sample_preprocessed + os.path.sep + "test_X_fft" + ".bc")
train_X_fft.shape

(240, 16000)

In [54]:
# Tempogram
train_X_tempogram = bcolz_load(path_to_sample_preprocessed + os.path.sep + "train_X_tempogram" + ".bc")
cv_X_tempogram = bcolz_load(path_to_sample_preprocessed + os.path.sep + "cv_X_tempogram" + ".bc")
test_X_tempogram = bcolz_load(path_to_sample_preprocessed + os.path.sep + "test_X_tempogram" + ".bc")
train_X_tempogram.shape

(240, 384, 32)

## Train simple models
We will start by training the simplest models and then try out more and more complex architectures, aiming for the highest possible accuracy and F1 score.

The simplest model we can try is a linear model, which we can obtain by using the Keras Dense layer followed by an activation function such as softmax (as in our case categories are mutually exclusive).

Since we have 12 mutually exclusive categories, we need to get an **accuracy of more than 8.33%** (1/12 ≈ 0.0833) to beat random guessing.
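
The baseline is simple arithmetic, and it holds as long as the categories are balanced. For reference, the sketch below also computes the categorical crossentropy of a model that assigns the same probability to every category.

```python
import math

num_categories = 12

# accuracy of guessing uniformly at random over balanced categories
chance_accuracy = 1 / num_categories

# crossentropy when every category gets probability 1/12
uniform_loss = math.log(num_categories)

print("{:.4f} {:.4f}".format(chance_accuracy, uniform_loss))
```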

#### Linear Model
We'll need to keep track of the dimensions that we pass into our models, so let's assign their values to separate variables.

In [55]:
# we'll need the number of input features and the output categories
num_features = train_X.shape[1]
num_categories = train_y.shape[1]
print("Input features: {}\nCategories to predict: {}".format(num_features, num_categories))

Input features: 16000
Categories to predict: 12


In [56]:
# design & compile the model
linear_model = Sequential([
    Dense(input_shape=(num_features,), units = num_categories, activation="softmax")
])

# we choose the Adam optimizer with a specific learning rate
linear_model.compile(Adam(lr=0.001),loss="categorical_crossentropy", metrics=["accuracy"])

In [57]:
# let's evaluate our loss before fitting the model
initial_score = linear_model.evaluate(test_X, test_y, verbose=0)
categorical_crossentropy = initial_score[0]
accuracy = initial_score[1]

print("Based on random weights initialization (values will change every time you compile the model)\nCategorical crossentropy (loss): {:.4f}\nAccuracy: {:.2f}".format(categorical_crossentropy, accuracy))

Based on random weights initialization (values will change every time you compile the model)
Categorical crossentropy (loss): 14.7749
Accuracy: 0.08
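
A loss of roughly 14.77 is far above the ln(12) ≈ 2.48 that a uniform guess would give. One plausible explanation (an assumption, not something verified in this notebook) is that the unscaled raw inputs saturate the softmax, so every prediction is effectively a hard 0 or 1. Keras clips predicted probabilities with a small epsilon (1e-7 by default) before taking the log, so each wrong example then costs about -ln(1e-7) and each correct one costs almost nothing. Assuming 5 of the 60 test examples were classified correctly (an accuracy of 0.0833, printed as 0.08), the arithmetic works out as:

```python
import math

epsilon = 1e-7       # assumed default clipping value
num_examples = 60
num_correct = 5      # assumption: 5 / 60 = 0.0833 accuracy

cost_per_wrong = -math.log(epsilon)
cost_per_correct = -math.log(1 - epsilon)

saturated_loss = ((num_examples - num_correct) * cost_per_wrong
                  + num_correct * cost_per_correct) / num_examples
print(round(saturated_loss, 4))
```

If this explanation is right, normalizing the inputs should bring the initial loss much closer to ln(12).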


Let's fit our simple linear model for a couple of epochs and see the **F1 score** and **accuracy**.

In [62]:
# we pass our training data and our cross-validation data to see if we're not overfitting
history = linear_model.fit(train_X, train_y, batch_size=32, epochs=5, validation_data=(cv_X, cv_y))

Train on 240 samples, validate on 60 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [63]:
# show latest results
best_training_accuracy = max(history.history["acc"])
best_validation_accuracy = max(history.history["val_acc"])
print("Best scores\nTrain acc: {:.4f}\nCV acc: {:.4f}".format(best_training_accuracy, best_validation_accuracy))

Best scores
Train acc: 0.1292
CV acc: 0.0833


Depending on the random initialization of weights we should have an **accuracy** score between 0.05 and 0.15 on both the training and cross-validation set. Let's also calculate the **F1 score**.

In [64]:
# first use the model to predict the labels
pred_cv_y = linear_model.predict(cv_X, batch_size=32)

In [66]:
# check if shape matches expectation (number of examples, number of categories to predict)
pred_cv_y.shape

(60, 12)

In [67]:
# we use softmax to get a result towards one-hot encoding, but not all rows will necessarily be just zeroes and one 1
pred_cv_y[:10]

array([[0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 1.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 2.7456033e-25,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        1.0000000e+00, 0.0000000e+00, 0.0000

So before we pass our predictions to sklearn's f1_score function we need to make sure that all of our rows are actually one-hot encoded.
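
One way to do this is to put a 1 at the position of the highest probability in each row and zeros everywhere else. The sketch below shows the idea with a hypothetical function name; utils.one_hot_encode may be implemented differently.

```python
import numpy as np

def one_hot_from_probabilities(probabilities):
    # index of the most likely category in each row
    winners = np.argmax(probabilities, axis=1)
    encoded = np.zeros_like(probabilities)
    # set exactly one 1 per row, at the winning index
    encoded[np.arange(len(winners)), winners] = 1.0
    return encoded

toy_probabilities = np.array([[0.1, 0.7, 0.2],
                              [0.5, 0.3, 0.2]])
toy_encoded = one_hot_from_probabilities(toy_probabilities)
print(toy_encoded)
```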

In [68]:
pred_cv_y = utils.one_hot_encode(pred_cv_y)
pred_cv_y[:10]

array([[0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.]])

In [69]:
# we can also use sklearn directly to get accuracy
sk_cv_accuracy = accuracy_score(cv_y, pred_cv_y)
print("Final linear model CV accuracy via sklearn: {:.4f}".format(sk_cv_accuracy))

Final linear model CV accuracy via sklearn: 0.0833


In [70]:
# because we're dealing with a multiclass classification challenge, we need to change the default value of average
# (which is binary)
cv_f1_score = f1_score(cv_y, pred_cv_y, average="weighted")
print("Linear model f1 score (CV): {:.4f}".format(cv_f1_score))

Linear model f1 score (CV): 0.0789
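
To make the "weighted" average less of a black box: it computes an F1 score per category and averages them, weighting each category by its number of true examples. Here is a small numpy sketch of that calculation on integer labels. It is a hand-rolled illustration with a hypothetical function name, not sklearn's code.

```python
import numpy as np

def weighted_f1(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    total = 0.0
    for label in np.unique(y_true):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        # F1 = 2TP / (2TP + FP + FN), defined as 0 when there is nothing to count
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        # weight each category by its support (number of true examples)
        total += f1 * (tp + fn)
    return total / len(y_true)

toy_f1 = weighted_f1([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print("{:.4f}".format(toy_f1))
```

In the toy example the per-category scores are 0.5, 0.8 and 0.667, each with a support of 2, so the weighted score is their plain average.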


In summary, our accuracy and F1 score for the simplest possible model fall within 0.05 - 0.15. This is our earliest benchmark to beat, and it's **not much better than random guessing**, which given 12 categories would give us an accuracy of 0.08333.

#### Random Forest
It is also useful to try other ML methods before jumping into neural networks and deep learning. Random Forests are a simple but very often quite effective (and computationally inexpensive) method of obtaining a good benchmark.

For the sklearn implementation of Random Forest we actually do not want our target to be one-hot encoded.
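
Reversing a one-hot encoding boils down to finding the position of the 1 in each row. The sketch below uses a hypothetical function name and assumes labels that start at 1; that offset is only a guess about what utils.reverse_one_hot_encoding does.

```python
import numpy as np

def reverse_one_hot(y, first_label=1):
    # position of the 1 in each row, shifted so that labels start at first_label
    return np.argmax(y, axis=1) + first_label

toy_one_hot = np.array([[1., 0., 0.],
                        [0., 0., 1.],
                        [0., 1., 0.]])
toy_labels = reverse_one_hot(toy_one_hot)
print(toy_labels)
```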

In [79]:
# reverse the one-hot encoding
rf_train_y = utils.reverse_one_hot_encoding(train_y)
rf_cv_y = utils.reverse_one_hot_encoding(cv_y)
rf_test_y = utils.reverse_one_hot_encoding(test_y)

In [92]:
rand_forest = RandomForestClassifier(max_depth=20, random_state=0)
rand_forest.fit(train_X, rf_train_y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=20, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [93]:
rf_predicted_cv_y = rand_forest.predict(cv_X)
rf_predicted_cv_y

array([ 4.,  1.,  6.,  8.,  1.,  3.,  4.,  6.,  4.,  1.,  7., 11., 12.,
        3.,  1.,  3.,  9.,  2.,  3.,  8.,  9.,  4.,  1.,  4.,  1.,  5.,
        2.,  8.,  5.,  6.,  8.,  3.,  3.,  6.,  1.,  1.,  2.,  8.,  1.,
        5.,  6.,  3.,  7.,  1.,  6., 10.,  5.,  2.,  9.,  6., 11.,  4.,
       11.,  5.,  6.,  2.,  2.,  9.,  4.,  5.])

In [94]:
# calculate accuracy and F1 for Random Forest
rf_cv_f1_score = f1_score(rf_cv_y, rf_predicted_cv_y, average="weighted")
rf_cv_accuracy = accuracy_score(rf_cv_y, rf_predicted_cv_y)

print("Random forest f1 score (CV): {:.3f}".format(rf_cv_f1_score))
print("Random forest accuracy (CV): {:.3f}".format(rf_cv_accuracy))

Random forest f1 score (CV): 0.135
Random forest accuracy (CV): 0.133


For the Random Forest method, using only default parameters (except for max depth), we are getting an **F1 score and accuracy around 0.10 - 0.15**.<br/> That is slightly better than random guessing, but nowhere near good enough.

In [109]:
# set benchmark
best_cv_acc = 0.15

## Train Neural Networks
Now that we have a benchmark obtained via simple linear and Random Forest models we can proceed towards trying to outdo it with MLPs and deep learning models.

#### MLP - multi-layer perceptron
Let's start with the simplest possible neural network of just 2 dense layers. We'll be working only on the mfccs data from now on, as it tends to produce better results. We will also add **batch normalization** and **dropout** to reduce overfitting.
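
To get a feel for what a dropout rate means: the rate passed to the Keras Dropout layer is the fraction of inputs that is set to zero during training, and the remaining ones are scaled up so that the expected sum stays the same. The numpy sketch below imitates that behaviour (it is not the Keras implementation) for a very high rate of 0.95.

```python
import numpy as np

def inverted_dropout(activations, rate, rng):
    keep_probability = 1.0 - rate
    # keep each unit independently with probability keep_probability
    mask = rng.random(activations.shape) < keep_probability
    # scale the survivors so the expected value is unchanged
    return activations * mask / keep_probability

rng = np.random.default_rng(0)
activations = np.ones(100_000)
dropped = inverted_dropout(activations, rate=0.95, rng=rng)

zero_fraction = np.mean(dropped == 0)
print("{:.2f} {:.2f}".format(zero_fraction, dropped.mean()))
```

With a rate this high only about 1 in 20 units survives each training step, which is a very strong form of regularization.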

In [18]:
# design & compile the model
num_nodes = 2000
mlp = Sequential([
    Dense(input_shape=(num_features,), units = num_nodes, activation="relu"),
    BatchNormalization(),
    Dropout(0.95),
    Dense(num_categories, activation='softmax')
])

# we choose the Adam optimizer with a specific learning rate
mlp.compile(Adam(lr=0.001),loss="categorical_crossentropy", metrics=["accuracy"])

In [19]:
mlp_results = mlp.fit(train_X, train_y, batch_size=32, epochs=10, validation_data=(cv_X, cv_y))

Train on 240 samples, validate on 60 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [21]:
# show latest results
best_training_accuracy = max(mlp_results.history["acc"])
best_validation_accuracy = max(mlp_results.history["val_acc"])
print("Best MLP scores\nTrain acc: {:.4f}\nCV acc: {:.4f}".format(best_training_accuracy, best_validation_accuracy))

Best MLP scores
Train acc: 0.3208
CV acc: 0.1833


In [23]:
# predict and one-hot encode
mlp_pred_cv_y = mlp.predict(cv_X, batch_size=32)
mlp_pred_cv_y = utils.one_hot_encode(mlp_pred_cv_y)
mlp_pred_cv_y.shape

(60, 12)
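
The body of `utils.one_hot_encode` is not shown in this notebook. A plausible sketch of what such a helper might do, assuming it turns each row of softmax probabilities into a one-hot row for the most likely class, is given below. The name `one_hot_from_probs` is hypothetical.

```python
import numpy as np

def one_hot_from_probs(probs):
    """Hypothetical stand-in for utils.one_hot_encode (an assumption, not the original)."""
    probs = np.asarray(probs)
    winners = probs.argmax(axis=1)                    # most likely class per row
    encoded = np.zeros_like(probs, dtype=np.float32)
    encoded[np.arange(len(probs)), winners] = 1.0     # set a single 1 per row
    return encoded

example = np.array([[0.1, 0.7, 0.2],
                    [0.5, 0.3, 0.2]])
print(one_hot_from_probs(example))
```

The output keeps the `(samples, categories)` shape of the predictions, which is what allows it to be compared against one-hot labels such as `cv_y`.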

In [24]:
# we can also use sklearn directly to get accuracy
mlp_cv_accuracy = accuracy_score(cv_y, mlp_pred_cv_y)
mlp_cv_f1_score = f1_score(cv_y, mlp_pred_cv_y, average="weighted")
print("MLP accuracy via sklearn (CV): {:.4f}".format(mlp_cv_accuracy))
print("MLP f1 score (CV): {:.4f}".format(mlp_cv_f1_score))

MLP accuracy via sklearn (CV): 0.1667
MLP f1 score (CV): 0.1654


The simple MLP reaches an accuracy very similar to our previous benchmark of 0.15. Both this model and the earlier ones can be tuned to reach approximately 0.25, but let's save fine-tuning for when we have a more promising approach. We are also already overfitting: the best training accuracy (0.32) is well above the best CV accuracy (0.18).

#### Deep Neural Networks
Let's try adding more layers to capture more complex interactions.

In [48]:
dnn = Sequential([
    Dense(input_shape=(num_features,), units = 4000, activation="relu"),
    BatchNormalization(),
    Dropout(0.8),
    Dense(3000, activation="relu"),
    BatchNormalization(),
    Dropout(0.8),
    Dense(2000, activation="relu"),
    BatchNormalization(),
    Dropout(0.8),
    Dense(num_categories, activation='softmax')
])

# we choose the Adam optimizer with a specific learning rate
dnn.compile(Adam(lr=0.001),loss="categorical_crossentropy", metrics=["accuracy"])

In [49]:
dnn_results = dnn.fit(train_X, train_y, batch_size=64, epochs=10, validation_data=(cv_X, cv_y))

Train on 240 samples, validate on 60 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [50]:
# show latest results
best_training_accuracy = max(dnn_results.history["acc"])
best_validation_accuracy = max(dnn_results.history["val_acc"])
print("Best DNN scores\nTrain acc: {:.4f}\nCV acc: {:.4f}".format(best_training_accuracy, best_validation_accuracy))

Best DNN scores
Train acc: 0.2292
CV acc: 0.1667


In [51]:
# predict and one-hot encode
dnn_pred_cv_y = dnn.predict(cv_X, batch_size=32)
dnn_pred_cv_y = utils.one_hot_encode(dnn_pred_cv_y)
dnn_pred_cv_y.shape

(60, 12)

In [52]:
# we can also use sklearn directly to get accuracy
dnn_cv_accuracy = accuracy_score(cv_y, dnn_pred_cv_y)
dnn_cv_f1_score = f1_score(cv_y, dnn_pred_cv_y, average="weighted")
print("DNN accuracy via sklearn (CV): {:.4f}".format(dnn_cv_accuracy))
print("DNN f1 score (CV): {:.4f}".format(dnn_cv_f1_score))

DNN accuracy via sklearn (CV): 0.1667
DNN f1 score (CV): 0.1209


#### Convolutional Models
It seems we're stuck at around 0.15 accuracy. That makes sense: the spoken word ("no" or any other) may start anywhere in the vector, but a dense layer ties each weight to a specific index. Let's try convolutional layers, which can find a pattern regardless of whether it appears at the start or the end of the file.
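
The toy example below illustrates this argument in numpy. The pattern, signal length and offsets are made up for illustration: a weight vector tied to fixed positions only responds when the pattern sits where it expects it, while sliding the same small filter over the signal and taking the maximum responds equally wherever the pattern occurs.

```python
import numpy as np

pattern = np.array([1.0, -1.0, 1.0, -1.0])  # made-up "word" to detect

def signal_with_pattern(offset, length=32):
    """Silent signal with the pattern placed at the given offset."""
    signal = np.zeros(length)
    signal[offset:offset + len(pattern)] = pattern
    return signal

early = signal_with_pattern(2)
late = signal_with_pattern(20)

# Dense-style detector: weights are tied to positions 2..5
dense_weights = signal_with_pattern(2)
dense_early = float(dense_weights @ early)
dense_late = float(dense_weights @ late)

# Convolution-style detector: slide the filter, then take the global maximum
conv_early = float(np.correlate(early, pattern, mode="valid").max())
conv_late = float(np.correlate(late, pattern, mode="valid").max())

print("dense:", dense_early, dense_late)
print("conv: ", conv_early, conv_late)
```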

In [56]:
sr, tmp = utils.get_wav_info("data/sample/train/stop/01b4757a_nohash_0.wav")
tmp.shape

(11606,)

In [59]:
mfcc_tmp = librosa.feature.mfcc(tmp, sr)
mfcc_tmp

array([[ 7.91494527e+02,  7.97293842e+02,  7.96024977e+02,
         8.13325323e+02,  8.29413353e+02,  8.38056295e+02,
         8.33492077e+02,  8.20852905e+02,  8.92750655e+02,
         9.64842738e+02,  9.80583510e+02,  9.79446237e+02,
         9.71215749e+02,  9.69958534e+02,  9.64338481e+02,
         9.51096909e+02,  8.99682175e+02,  8.45057228e+02,
         8.39876141e+02,  8.52041665e+02,  8.64775350e+02,
         9.02867648e+02,  8.82603551e+02],
       [ 7.15329464e+01,  7.46569244e+01,  6.51241728e+01,
         4.65793993e+01,  3.24022451e+01,  2.84170819e+01,
         3.95254722e+01,  5.62807149e+01,  9.04809104e+01,
         1.24341812e+02,  1.36372590e+02,  1.39704835e+02,
         1.32579195e+02,  1.26472082e+02,  1.20161508e+02,
         1.28909614e+02,  1.25371020e+02,  1.00925500e+02,
         9.80818789e+01,  1.08411666e+02,  7.19331180e+01,
         1.87766598e+01,  2.47276794e+01],
       [ 6.33864938e+00,  5.12257677e+00,  5.64245337e-01,
        -6.88541402e-01,  9.4

In [95]:
# In order to use convolutions we have to reshape our X -> expand it to 3 dimensions
conv_train_X_mfccs = np.expand_dims(train_X_mfccs, axis=2)
conv_train_X_mfccs.shape

(240, 16000, 1)
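
A 1D convolution layer expects input shaped `(samples, steps, channels)`, which is why a trailing axis of size 1 is added. The short sketch below uses a placeholder array (not the notebook's data) to show that `np.expand_dims` and indexing with `np.newaxis` produce the same result.

```python
import numpy as np

# Placeholder batch: 4 examples with 16000 steps each (stand-in for the real X)
batch = np.zeros((4, 16000), dtype=np.float32)

with_channel = np.expand_dims(batch, axis=2)   # add a single "channel" axis
same_thing = batch[..., np.newaxis]            # equivalent indexing form

print(with_channel.shape, same_thing.shape)
```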

In [96]:
# repeat for cv & test
conv_cv_X_mfccs = np.expand_dims(cv_X_mfccs, axis=2)
conv_test_X_mfccs = np.expand_dims(test_X_mfccs, axis=2)

In [97]:
cnn1 = Sequential([
        Convolution1D(input_shape=(num_features, 1), kernel_size=32, filters=8, padding="same", activation="relu"),
        Dropout(0.1),
        MaxPooling1D(),
        Convolution1D(kernel_size=64, filters=16, padding="same", activation="relu"),
        Dropout(0.1),
        MaxPooling1D(),
        Flatten(),
        Dense(500, activation="relu"),
        Dropout(.6),
        Dense(num_categories, activation="softmax")
    ])

cnn1.compile(Adam(lr=0.001),loss="categorical_crossentropy", metrics=["accuracy"])

This CNN architecture should reach 0.367 accuracy at around epoch 35 and then start to overfit.
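
One way to check where a model peaks before it overfits is to look up the epoch with the best validation score in the history dictionary that `fit` returns. The sketch below uses a hypothetical helper, `best_epoch`, and made-up numbers in `toy_history`; the `"val_acc"` key matches the one used elsewhere in this notebook, but newer Keras versions name it `"val_accuracy"`.

```python
def best_epoch(history, metric="val_acc"):
    """Return (1-based epoch, value) of the highest score for `metric`."""
    values = history[metric]
    best_index = max(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]

# Made-up history: training accuracy keeps rising while CV accuracy peaks early
toy_history = {
    "acc":     [0.10, 0.20, 0.35, 0.50, 0.70],
    "val_acc": [0.10, 0.18, 0.25, 0.22, 0.20],
}

epoch, score = best_epoch(toy_history)
print("best CV accuracy {:.2f} at epoch {}".format(score, epoch))
```

Keras also provides an `EarlyStopping` callback that stops training once a monitored metric has stopped improving, which avoids having to guess the number of epochs in advance.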

In [98]:
cnn1_results = cnn1.fit(conv_train_X_mfccs, train_y, batch_size=32, epochs=60, validation_data=(conv_cv_X_mfccs, cv_y))

Train on 240 samples, validate on 60 samples
Epoch 1/60
Epoch 2/60
Epoch 3/60
Epoch 4/60

KeyboardInterrupt: 

In [99]:
# show best results
best_training_accuracy = max(cnn1_results.history["acc"])
best_validation_accuracy = max(cnn1_results.history["val_acc"])
print("Best CNN 1 scores\nTrain acc: {:.4f}\nCV acc: {:.4f}".format(best_training_accuracy, best_validation_accuracy))

NameError: name 'cnn1_results' is not defined

In [100]:
# predict and one-hot encode
cnn1_pred_cv_y = cnn1.predict(conv_cv_X_mfccs, batch_size=32)
cnn1_pred_cv_y = utils.one_hot_encode(cnn1_pred_cv_y)
cnn1_pred_cv_y.shape

(60, 12)

In [101]:
# we can also use sklearn directly to get accuracy
cnn1_cv_accuracy = accuracy_score(cv_y, cnn1_pred_cv_y)
cnn1_cv_f1_score = f1_score(cv_y, cnn1_pred_cv_y, average="weighted")
print("CNN 1 accuracy via sklearn (CV): {:.4f}".format(cnn1_cv_accuracy))
print("CNN 1 f1 score (CV): {:.4f}".format(cnn1_cv_f1_score))

CNN 1 accuracy via sklearn (CV): 0.1500
CNN 1 f1 score (CV): 0.0981


Let's increase the kernel size - patterns in speech might require more than e.g. 32 consecutive samples to be recognizable.
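
To get a feel for how much audio a kernel covers, the arithmetic below converts kernel sizes into milliseconds. It assumes that each input step is one audio sample at 16000 Hz, which the 16000-step input shape suggests but which may not hold if the features were transformed first. The helper `kernel_duration_ms` is hypothetical.

```python
sample_rate_hz = 16000  # assumption: one input step per raw audio sample

def kernel_duration_ms(kernel_size, sample_rate_hz):
    """Length of audio (in milliseconds) seen by a first-layer kernel."""
    return 1000.0 * kernel_size / sample_rate_hz

durations = {k: kernel_duration_ms(k, sample_rate_hz) for k in (32, 64, 256, 512)}

for kernel_size, ms in durations.items():
    print("kernel_size={:>3} -> {:.1f} ms".format(kernel_size, ms))
```

Under that assumption a kernel of 32 spans only 2 ms of audio, while 256 spans 16 ms. Kernels in later layers see more than this, because each `MaxPooling1D` layer with the default pool size halves the number of steps.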

In [102]:
cnn2 = Sequential([
        Convolution1D(input_shape=(num_features, 1), kernel_size=256, filters=32, padding="same", activation="relu"),
        Dropout(0.2),
        MaxPooling1D(),
        Convolution1D(kernel_size=512, filters=32, padding="same", activation="relu"),
        Dropout(0.2),
        MaxPooling1D(),
        Flatten(),
        Dense(500, activation="relu"),
        Dropout(.6),
        Dense(num_categories, activation="softmax")
    ])

cnn2.compile(Adam(lr=0.001),loss="categorical_crossentropy", metrics=["accuracy"])

In [103]:
cnn2_results = cnn2.fit(conv_train_X_mfccs, train_y, batch_size=32, epochs=50, validation_data=(conv_cv_X_mfccs, cv_y))

Train on 240 samples, validate on 60 samples
Epoch 1/50
Epoch 2/50

KeyboardInterrupt: 

#### Recurrent Models
We can also try architectures designed specifically for time sequences: RNNs. We will start with the basic Keras implementation of a simple RNN and then move on to GRUs & LSTMs.
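
For intuition, a simple RNN keeps a hidden state that is updated once per time step from the current input and the previous state. The numpy sketch below shows that recurrence with a ReLU activation; the weights are random and the sizes are made up, so it illustrates the mechanism rather than reproducing the Keras layer.

```python
import numpy as np

def simple_rnn_forward(inputs, w_in, w_rec, bias):
    """Run h_t = relu(x_t @ w_in + h_{t-1} @ w_rec + bias) and return the last h."""
    hidden = np.zeros(w_rec.shape[0])
    for x_t in inputs:  # one update per time step, strictly in order
        hidden = np.maximum(0.0, x_t @ w_in + hidden @ w_rec + bias)
    return hidden

rng = np.random.default_rng(0)
steps, features, units = 50, 1, 100  # made-up sizes for illustration

inputs = rng.normal(size=(steps, features))
w_in = rng.normal(scale=0.1, size=(features, units))
w_rec = rng.normal(scale=0.05, size=(units, units))
bias = np.zeros(units)

last_hidden = simple_rnn_forward(inputs, w_in, w_rec, bias)
print(last_hidden.shape)
```

Because every step depends on the previous one, the loop cannot be parallelized across time, so very long sequences are likely to make training slow.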

In [None]:
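# SimpleRNN is not among the layers imported at the top of the notebook
from tensorflow.python.keras.layers import SimpleRNN

rnn_1 = Sequential([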
        SimpleRNN(input_shape=(num_features, 1), units=100, activation='relu'),
        Dense(500, activation="relu"),
        BatchNormalization(),
        Dropout(.7),
        Dense(num_categories, activation="softmax")
    ])

rnn_1.compile(Adam(lr=0.001),loss="categorical_crossentropy", metrics=["accuracy"])

In [None]:
rnn_1_results = rnn_1.fit(conv_train_X_mfccs, train_y, batch_size=32, epochs=50, validation_data=(conv_cv_X_mfccs, cv_y))

## Action plan
X) turn the sample data into numpy arrays with X and y normally <br>
X) turn sample data into numpy arrays with X and y via mfccs<br>
X) Use linear model? (towards first benchmark)<br>
X) Use random forest?<br>
X) Use MLP<br>
X) Use multiple dense layers<br>
4c) Use convolutions (try the increased kernel size that takes 400s per epoch)<br>
4d) Use RNN -> like in Nietzsche [https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/]<br>
5) Add preprocessing and test a couple of the best models<br>

6) Consider splitting the work on images into separate notebook depending on how bulky this gets<br>
7) Experiments on images without data augmentation<br>
8) Experiments on images with data augmentation<br>

9) Decide on e.g. 3 most promising methods<br>

And then:<br>
10) Move to writing the most promising models in tensorflow<br>
11) Include tensorboard visualization of training & graph<br>
12) Code for turning results into kaggle format of results to get score<br>
13) Obtain a good score on kaggle<br>
14) Re-read everything from start to finish and adjust<br>
15) Write a good Readme for markdown<br>
16) Add to CV<br>