# Model experiments - sample set

In the previous notebooks we have separated a small subset of our data, called "sample", on which we can now experiment with simple models to assess the effectiveness of our preprocessing & data augmentation techniques.

We do it this way to avoid spending too much time training on the entire set. The assumption is that methods which are effective on the sample will also work well at a larger scale.

We will start by testing a couple of simple models on untouched sample data (as numpy arrays) and then proceed towards data augmentation and finally spectrograms.

In [1]:
# first make sure we're in the parent directory of our data/sample folders.
!pwd

/c/Users/mateusz/Documents/Mateusz/Career/Machine Learning & AI/tensorflow_speech_recognition/tensorflow_speech_recognition


## Import
We'll need a couple of additional libraries so let's import them.

In [2]:
# filter out warnings
import warnings
warnings.filterwarnings('ignore') 

In [3]:
import glob
import librosa
import matplotlib.pyplot as plt
import numpy as np
import os
import tensorflow

# Keras API bundled with TensorFlow
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, BatchNormalization, Dropout, Convolution1D, MaxPooling1D, Flatten
from tensorflow.python.keras.optimizers import Adam

# F1 / accuracy metrics and the Random Forest classifier from scikit-learn
from sklearn.metrics import f1_score, accuracy_score
from sklearn.ensemble import RandomForestClassifier

# utils
from importlib import reload
import utils; reload(utils)

<module 'utils' from 'C:\\Users\\mateusz\\Documents\\Mateusz\\Career\\Machine Learning & AI\\tensorflow_speech_recognition\\tensorflow_speech_recognition\\utils.py'>

## Prepare data
The easiest way to work with data is by turning it into a list of numbers, in our case a numpy array. We can use one of the functions from utils to load the raw data or use the librosa.load() function. The difference lies in the fact that the former returns int16s whereas librosa returns float32s and uses its default sampling rate of 22050Hz, unless we explicitly tell it to use the file's original sampling rate of 16000Hz.

We should also consider normalizing our data (so that it all falls within the same scale) and extracting a 1D mel-frequency cepstrum.
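As a minimal sketch of the normalization mentioned above, the snippet below scales int16 samples to floats and then peak-normalizes them. The helper names `int16_to_float` and `peak_normalize` are illustrative and not part of `utils`, and dividing by 32768 is one common convention that we assume here.

```python
import numpy as np

def int16_to_float(samples):
    """Scale int16 PCM samples to float32 in [-1, 1) (assumed convention: divide by 32768)."""
    return samples.astype(np.float32) / 32768.0

def peak_normalize(samples, eps=1e-8):
    """Rescale so the largest absolute value is 1; eps guards against all-zero (silent) clips."""
    peak = np.max(np.abs(samples))
    return samples / max(peak, eps)

raw = np.array([0, 8192, -16384, 4096], dtype=np.int16)
as_float = int16_to_float(raw)          # values: 0, 0.25, -0.5, 0.125
normalized = peak_normalize(as_float)   # values: 0, 0.5, -1, 0.25
```

Peak normalization puts every clip on the same scale regardless of how loudly it was recorded, which is the property we care about here.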

In [4]:
path_to_sample = "data\\sample"

We'll have to go through each of the folders in our sample/train, cv and test sets, one-hot encode their label and load the 16K long array of raw data. The y data will be of shape (m, 12), where m is the number of examples, and the X data will be of shape (m, 16000).

Let's calculate **m** first. We will do that by using a function that creates a list of all the .wav files within a directory.

### Create a list of paths
We will use the glob module that we learned about in the very first notebook and a function from utils.py which, given a directory, returns a list of paths to the .wav files within it. We will repeat the process for all 3 sets within sample, and for every category subdirectory within each of them.
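The source of `utils.grab_wavs` is not shown in this notebook, so the following is only a sketch of how such a helper could look. The names `grab_wavs_sketch` and `grab_all_categories` are hypothetical, and the throwaway directory tree exists only for the demonstration.

```python
import glob
import os
import tempfile

def grab_wavs_sketch(directory):
    """Hypothetical stand-in for utils.grab_wavs: sorted paths of the .wav files in a directory."""
    pattern = os.path.join(directory, "*.wav")
    return sorted(glob.glob(pattern))

def grab_all_categories(set_directory, categories):
    """Collect .wav paths from every category subfolder, in category order."""
    paths = []
    for category in categories:
        paths.extend(grab_wavs_sketch(os.path.join(set_directory, category)))
    return paths

# tiny demonstration on a temporary directory tree
with tempfile.TemporaryDirectory() as root:
    for category, name in [("yes", "a.wav"), ("yes", "b.wav"), ("no", "c.wav"), ("no", "notes.txt")]:
        os.makedirs(os.path.join(root, category), exist_ok=True)
        open(os.path.join(root, category, name), "w").close()
    found = [os.path.relpath(p, root) for p in grab_all_categories(root, ["yes", "no"])]
```

Sorting the result keeps the order of paths deterministic, which matters later because the rows of X and y are built from these lists in order.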

In [5]:
# for example we can grab all .wav files from sample/train/stop
path_to_sample_train_stop = os.path.join(path_to_sample, "train", "stop")
utils.grab_wavs(path_to_sample_train_stop)[:5]

['data\\sample\\train\\stop\\01bcfc0c_nohash_1.wav',
 'data\\sample\\train\\stop\\17cc40ee_nohash_1.wav',
 'data\\sample\\train\\stop\\2da58b32_nohash_2.wav',
 'data\\sample\\train\\stop\\2da58b32_nohash_4.wav',
 'data\\sample\\train\\stop\\311fde72_nohash_2.wav']

In [6]:
# we'll need a list of all category folder names
categories_to_predict = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go", "silence", "unknown"]

In [7]:
# first grab the training set
path_to_train = os.path.join(path_to_sample, "train")
sample_train_wavs = []

for category in categories_to_predict:
    path_to_category = os.path.join(path_to_train, category)
    category_files = utils.grab_wavs(path_to_category)
    
    # we use extend instead of append to add all elements from the iterable
    sample_train_wavs.extend(category_files)
    
sample_train_wavs

['data\\sample\\train\\yes\\023a61ad_nohash_0.wav',
 'data\\sample\\train\\yes\\0f3f64d5_nohash_0.wav',
 'data\\sample\\train\\yes\\190821dc_nohash_4.wav',
 'data\\sample\\train\\yes\\28ed6bc9_nohash_1.wav',
 'data\\sample\\train\\yes\\324210dd_nohash_5.wav',
 'data\\sample\\train\\yes\\32561e9e_nohash_0.wav',
 'data\\sample\\train\\yes\\3fdafe25_nohash_0.wav',
 'data\\sample\\train\\yes\\48e8b82a_nohash_1.wav',
 'data\\sample\\train\\yes\\493392c6_nohash_1.wav',
 'data\\sample\\train\\yes\\589bce2c_nohash_1.wav',
 'data\\sample\\train\\yes\\5c237956_nohash_0.wav',
 'data\\sample\\train\\yes\\65c73b55_nohash_0.wav',
 'data\\sample\\train\\yes\\89f680f3_nohash_0.wav',
 'data\\sample\\train\\yes\\953fe1ad_nohash_1.wav',
 'data\\sample\\train\\yes\\b43de700_nohash_1.wav',
 'data\\sample\\train\\yes\\b7669804_nohash_0.wav',
 'data\\sample\\train\\yes\\e48a80ed_nohash_2.wav',
 'data\\sample\\train\\yes\\f5c3de1b_nohash_0.wav',
 'data\\sample\\train\\yes\\f839238a_nohash_1.wav',
 'data\\samp

In [8]:
# repeat for cv
path_to_cv = os.path.join(path_to_sample, "cv")
sample_cv_wavs = []

for category in categories_to_predict:
    path_to_category = os.path.join(path_to_cv, category)
    category_files = utils.grab_wavs(path_to_category)
    sample_cv_wavs.extend(category_files)

# repeat for test
path_to_test = os.path.join(path_to_sample, "test")
sample_test_wavs = []

for category in categories_to_predict:
    path_to_category = os.path.join(path_to_test, category)
    category_files = utils.grab_wavs(path_to_category)
    sample_test_wavs.extend(category_files)

### One-hot encode the y

Now that we have the 3 lists of files from each set (train, cv and test) we can construct our train_y, cv_y and test_y numpy arrays. These will be matrices of size (m, 12), one-hot encoded. E.g. if a row belongs to the category "up" it will take the form of an array of zeros, where the entry at index 2 (the third from the left) will become a 1.

We will use a function from the utils that takes a path to a .wav, the index at which the category name starts within it (we want to control this because we will eventually use this for the main set, not just the sample) and a list of categories to predict. For our current example, the category name in the paths belonging to "train" starts at the 18th index (separators count as one char).
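The start index does not have to be counted by hand: it is the length of everything that precedes the category name in the path. The sketch below shows this, together with a hypothetical stand-in for `utils.one_hot_encode_path` (whose real implementation is not shown here). The file name `example.wav` is made up for illustration.

```python
import numpy as np

categories = ["yes", "no", "up", "down", "left", "right",
              "on", "off", "stop", "go", "silence", "unknown"]

def one_hot_from_path(path, start_index, categories):
    """Hypothetical stand-in for utils.one_hot_encode_path."""
    # the category name runs from start_index up to the next path separator
    remainder = path[start_index:].replace("\\", "/")
    category = remainder.split("/")[0]
    row = np.zeros(len(categories))
    row[categories.index(category)] = 1.0
    return row

# the start index is the length of the prefix before the category name
train_start = len("data\\sample\\train\\")   # 18
cv_start = len("data\\sample\\cv\\")         # 15
test_start = len("data\\sample\\test\\")     # 17

example = one_hot_from_path("data\\sample\\train\\up\\example.wav", train_start, categories)
```

Because "cv" and "test" are shorter than "train", their start indices differ, which is why the later cells pass 15 and 17 instead of 18.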

In [9]:
# let's grab a single path (this one is a "yes")
a_wav = sample_train_wavs[0]

In [10]:
# let's see if the 1 is correctly placed
utils.one_hot_encode_path(a_wav, 18, categories_to_predict)

array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

The path belonged to the first category ("yes") and the one-hot encoding correctly placed the 1 at index 0.

We want to repeat this for all examples in each of the 3 subsets, adding each new one-hot encoded numpy array as a new row of the y matrix, in order.

In [11]:
# figure out the dimensions of train_y
rows = len(sample_train_wavs)
columns = len(categories_to_predict)
dimensions = (rows, columns)
dimensions

(240, 12)

In [12]:
# create train_y as empty array
train_y = np.array([])

# append each row to train_y
for path_to_wav in sample_train_wavs:
    row = utils.one_hot_encode_path(path_to_wav, 18, categories_to_predict)
    
    # append the new row
    train_y = np.append(train_y, row)
    
# we currently have a flattened vector
print("Current shape: {}".format(*train_y.shape))

# let's reshape it
train_y = np.reshape(train_y, dimensions)
print("New shape: {}".format(train_y.shape))

Current shape: 2880
New shape: (240, 12)


In [13]:
# show the train_y matrix to confirm
train_y

array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.]])

We can see that the first 3 entries have the 1 at index 0, which means they belong to the category "yes", and the last three have the 1 at the last index ("unknown"), which is also correct given that our list of paths was ordered by category.

We should bear in mind that by default the np.array contains float64s and our functions for loading a .wav return int16s.
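Appending row by row works for a small sample, but `np.append` returns a new copy of the array on every call, so it becomes slow on the full data set. A possible alternative, sketched here under the assumption that we already have an integer class index per example, is to preallocate the matrix and fill it in one step. The function name `one_hot_matrix` is illustrative.

```python
import numpy as np

def one_hot_matrix(label_indices, num_categories):
    """Build an (m, num_categories) one-hot matrix from integer class indices."""
    label_indices = np.asarray(label_indices)
    y = np.zeros((len(label_indices), num_categories))
    # fancy indexing sets exactly one entry per row
    y[np.arange(len(label_indices)), label_indices] = 1.0
    return y

y = one_hot_matrix([0, 0, 2, 11], num_categories=12)
```

The result is a float64 array, the same dtype the append-based loop produces.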

Repeat for **CV set**.

In [14]:
# figure out the dimensions
rows = len(sample_cv_wavs)
columns = len(categories_to_predict)
dimensions = (rows, columns)
print("Target dimensions: {}".format(dimensions))

# empty array
cv_y = np.array([])

for path_to_wav in sample_cv_wavs:
    row = utils.one_hot_encode_path(path_to_wav, 15, categories_to_predict)
    
    # append the new row
    cv_y = np.append(cv_y, row)
    
# we currently have a flattened vector
print("Current shape: {}".format(*cv_y.shape))

# let's reshape it
cv_y = np.reshape(cv_y, dimensions)
print("New shape: {}".format(cv_y.shape))

Target dimensions: (60, 12)
Current shape: 720
New shape: (60, 12)


Repeat for **Test set**.

In [15]:
# figure out the dimensions
rows = len(sample_test_wavs)
columns = len(categories_to_predict)
dimensions = (rows, columns)
print("Target dimensions: {}".format(dimensions))

# empty array
test_y = np.array([])

for path_to_wav in sample_test_wavs:
    row = utils.one_hot_encode_path(path_to_wav, 17, categories_to_predict)
    
    # append the new row
    test_y = np.append(test_y, row)
    
# we currently have a flattened vector
print("Current shape: {}".format(*test_y.shape))

# let's reshape it
test_y = np.reshape(test_y, dimensions)
print("New shape: {}".format(test_y.shape))

Target dimensions: (60, 12)
Current shape: 720
New shape: (60, 12)


### Get the X
We have the y - the one-hot encoded vectors representing the category for each training, cv and test example in the sample set. We need the feature vectors, conventionally referred to as X. We will use both the simplest way of extracting the .wav data and the 1D mel frequency cepstrum (mfccs).

In [16]:
len(librosa.core.load(sample_train_wavs[0], sr=16000)[0])

16000

In [17]:
# define a simple helper function
def get_X_with_padding(list_of_paths, columns=16000):
    
    # get shape data
    rows = len(list_of_paths)
    dimensions = (rows, columns)
    
    # create placeholder
    X = np.array([])
    
    # go through every file path in the list
    for path_to_wav in list_of_paths:

        # get raw array of signed ints
        row = utils.get_wav_info(path_to_wav)[1]
        
        # some of our samples have fewer (or slightly more) than 16000 values, so let's adjust them
        # trim to fixed length
        row = row[:columns]
        
        # pad with zeros, calculating amount of padding needed
        padding = columns - len(row)
        row = np.pad(row, (0, padding), mode='constant', constant_values=0)

        # append the new row
        X = np.append(X, row)
    
    # reshape (unroll)
    X = np.reshape(X, dimensions)
    
    return X

In [18]:
# get the X for each set
train_X = get_X_with_padding(sample_train_wavs)
cv_X = get_X_with_padding(sample_cv_wavs)
test_X = get_X_with_padding(sample_test_wavs)

print("Train: ", train_X.shape)
print("CV: ", cv_X.shape)
print("Test: ",test_X.shape)

Train:  (240, 16000)
CV:  (60, 16000)
Test:  (60, 16000)
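The trim-then-pad step can also be written against a preallocated matrix, which avoids the repeated `np.append` calls. This is a sketch of that variant, not a replacement for the helper above. `pad_or_trim` and `stack_fixed_length` are illustrative names, and a length of 8 is used only to keep the demonstration readable.

```python
import numpy as np

def pad_or_trim(signal, length=16000):
    """Return a copy of signal with exactly `length` values: trimmed or zero-padded at the end."""
    signal = np.asarray(signal)[:length]
    out = np.zeros(length, dtype=signal.dtype)
    out[:len(signal)] = signal
    return out

def stack_fixed_length(signals, length=16000):
    """Stack variable-length signals into an (m, length) float matrix."""
    X = np.zeros((len(signals), length))
    for i, signal in enumerate(signals):
        X[i] = pad_or_trim(signal, length)
    return X

short = np.arange(5, dtype=np.int16)    # shorter than the target length, gets padded
long = np.arange(10, dtype=np.int16)    # longer than the target length, gets trimmed
X = stack_fixed_length([short, long], length=8)
```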


We can also do the same for the 1D mel frequency cepstrum.

In [19]:
train_X_mfccs = utils.get_X_with_padding_mfccs(sample_train_wavs)
cv_X_mfccs = utils.get_X_with_padding_mfccs(sample_cv_wavs)
test_X_mfccs = utils.get_X_with_padding_mfccs(sample_test_wavs)

print("Train mfccs: ", train_X_mfccs.shape)
print("CV mfccs: ", cv_X_mfccs.shape)
print("Test mfccs: ",test_X_mfccs.shape)

Train mfccs:  (240, 16000)
CV mfccs:  (60, 16000)
Test mfccs:  (60, 16000)


## Train simple models
We will start by training the simplest models and then try out more and more complex architectures, aiming for the highest possible accuracy and F1 score.

The simplest model we can try is a linear model, which we can obtain by using the Keras Dense layer followed by an activation function such as softmax (as in our case categories are mutually exclusive).

#### Linear Model
We'll need to keep track of the dimensions that we pass into our models, so let's assign their values to separate variables.

In [20]:
# we'll need the number of input features and the number of output categories
num_features = train_X.shape[1]
num_categories = train_y.shape[1]
print("Input features: {}\nCategories to predict: {}".format(num_features, num_categories))

Input features: 16000
Categories to predict: 12


In [21]:
# design & compile the model
linear_model = Sequential([
    Dense(input_shape=(num_features,), units = num_categories, activation="softmax")
])

# we choose the Adam optimizer with a specific learning rate
linear_model.compile(Adam(lr=0.001),loss="categorical_crossentropy", metrics=["accuracy"])

In [22]:
# let's evaluate our loss before fitting the model
initial_score = linear_model.evaluate(test_X, test_y, verbose=0)
categorical_crossentropy = initial_score[0]
accuracy = initial_score[1]

print("Based on random weights initialization (values will change every time you compile the model)\nCategorical crossentropy (loss): {:.4f}\nAccuracy: {:.2f}".format(categorical_crossentropy, accuracy))

Based on random weights initialization (values will change every time you compile the model)
Categorical crossentropy (loss): 15.5808
Accuracy: 0.03
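The initial loss of about 15.6 is far above what a uniform guess over 12 classes would score, which is ln(12) ≈ 2.48. One plausible explanation (an assumption, not something verified in this notebook) is that the unscaled raw inputs produce very large logits, so softmax saturates and almost every wrong prediction is penalized at the clipping limit of the loss. The sketch below reproduces both numbers with plain numpy. The clipping value of 1e-7 is assumed to match the Keras default epsilon.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean categorical cross-entropy with probabilities clipped to [eps, 1 - eps]."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=-1)))

num_classes = 12
y_true = np.eye(num_classes)[[0]]   # a single example whose true class is 0

# uniform prediction: every class gets probability 1/12
uniform_loss = cross_entropy(y_true, np.full((1, num_classes), 1.0 / num_classes))

# saturated wrong prediction: a huge logit pushes all probability mass onto class 5
big_logits = np.zeros((1, num_classes))
big_logits[0, 5] = 1e4
saturated_loss = cross_entropy(y_true, softmax(big_logits))
```

Here `uniform_loss` is about 2.48 and `saturated_loss` is about 16.12. If 58 of the 60 test clips were mispredicted at that limit, the mean would be roughly 58/60 × 16.12 ≈ 15.58, which is consistent with the value printed above and suggests that scaling the inputs is worth trying.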


Let's fit our simple linear model for a couple of epochs and see the **F1 score** and **accuracy**.

In [23]:
# we pass our training data and our cross-validation data to see if we're not overfitting
history = linear_model.fit(train_X, train_y, batch_size=32, epochs=5, validation_data=(cv_X, cv_y))

Train on 240 samples, validate on 60 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [24]:
# show best results across epochs
best_training_accuracy = max(history.history["acc"])
best_validation_accuracy = max(history.history["val_acc"])
print("Best scores\nTrain acc: {:.4f}\nCV acc: {:.4f}".format(best_training_accuracy, best_validation_accuracy))

Best scores
Train acc: 0.1083
CV acc: 0.0833


Depending on the random initialization of weights we should have an **accuracy** score between 0.05 and 0.15 on both the training and cross-validation set. Let's also calculate the **F1 score**.

In [25]:
# first use the model to predict the labels
pred_cv_y = linear_model.predict(cv_X, batch_size=32)

In [27]:
# check if shape matches expectation (number of examples, number of categories to predict)
pred_cv_y.shape

(60, 12)

In [28]:
# we use softmax to get a result towards one-hot encoding, but not all rows will be just zeroes and one 1
pred_cv_y[:10]

array([[0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]], dtype=float32)

So before we pass our predictions to the sklearn's f1 score function we need to make sure that all of our rows are actually one-hot encoded.

In [29]:
pred_cv_y = utils.one_hot_encode(pred_cv_y)
pred_cv_y[:10]

array([[0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])
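The implementation of `utils.one_hot_encode` is not shown here. A minimal sketch of the same idea, assuming it simply places a 1 at the most probable class of each row, could look like this (`one_hot_from_probabilities` is a hypothetical name):

```python
import numpy as np

def one_hot_from_probabilities(probabilities):
    """Hypothetical stand-in for utils.one_hot_encode: put a 1 at each row's argmax."""
    probabilities = np.asarray(probabilities)
    winners = np.argmax(probabilities, axis=1)
    encoded = np.zeros_like(probabilities, dtype=np.float64)
    encoded[np.arange(len(winners)), winners] = 1.0
    return encoded

soft = np.array([[0.2, 0.5, 0.3],
                 [0.6, 0.3, 0.1]])
hard = one_hot_from_probabilities(soft)
```

Each row of `hard` contains exactly one 1, so it can be compared row by row against the one-hot encoded targets.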

In [30]:
# we can also use sklearn directly to get accuracy
sk_cv_accuracy = accuracy_score(cv_y, pred_cv_y)
print("Final linear model CV accuracy via sklearn: {:.4f}".format(sk_cv_accuracy))

Final linear model CV accuracy via sklearn: 0.0667


In [31]:
# because we're dealing with a multiclass classification challenge, we need to change the default value of average
# (which is binary)
cv_f1_score = f1_score(cv_y, pred_cv_y, average="weighted")
print("Linear model f1 score (CV): {:.4f}".format(cv_f1_score))

Linear model f1 score (CV): 0.0601


In summary, the accuracy and F1 score of the simplest possible model fall within 0.05 - 0.15. This is our earliest benchmark to beat, and it's not much better than **random guessing**, which given 12 categories would give us an accuracy of 1/12 ≈ 0.0833.
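To make the "weighted" average concrete, here is a small pure-Python sketch of the idea: compute an F1 score per class and average them, weighting each class by how many true examples it has. It is meant as an illustration of the concept rather than a replacement for sklearn's `f1_score`, and the toy labels are made up.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * count
    return total / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = weighted_f1(y_true, y_pred)

# accuracy expected from guessing uniformly among 12 categories
chance_accuracy = 1 / 12
```

In this toy case the per-class F1 scores are 0.5, 0.8 and 2/3, and since every class has the same support the weighted score is their plain mean.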

#### Random Forest
It is also useful to try other ML methods before jumping into neural networks and deep learning. Random Forests are a simple but very often quite effective (and computationally inexpensive) method of obtaining a good benchmark.

For the sklearn implementation of Random Forest we actually do not want our target to be one-hot encoded.

In [32]:
# reverse the one-hot encoding
rf_train_y = utils.reverse_one_hot_encoding(train_y)
rf_cv_y = utils.reverse_one_hot_encoding(cv_y)
rf_test_y = utils.reverse_one_hot_encoding(test_y)
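Reversing a one-hot encoding amounts to taking the column index of the 1 in each row. The predictions printed later in this section run from 1 to 12, which suggests that `utils.reverse_one_hot_encoding` returns 1-based labels. That is an inference from the output, and the sketch below (with the hypothetical name `labels_from_one_hot`) makes the offset an explicit parameter.

```python
import numpy as np

def labels_from_one_hot(y, first_label=1):
    """Hypothetical stand-in for utils.reverse_one_hot_encoding: index of each row's 1, plus an offset."""
    return np.argmax(y, axis=1) + first_label

y = np.array([[1., 0., 0.],
              [0., 0., 1.],
              [0., 1., 0.]])
labels = labels_from_one_hot(y)
```

Whether labels start at 0 or 1 does not matter to the Random Forest, as long as the same convention is used for the training and evaluation targets.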

In [33]:
rand_forest = RandomForestClassifier(max_depth=7, random_state=0)
rand_forest.fit(train_X, rf_train_y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=7, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [34]:
rf_predicted_cv_y = rand_forest.predict(cv_X)
rf_predicted_cv_y

array([10.,  6., 11., 11., 11.,  7.,  2.,  6., 12.,  2.,  1., 12., 10.,
        2.,  1.,  1., 11., 10., 11.,  1., 12., 10.,  4.,  1.,  3.,  6.,
        4.,  4.,  5.,  1.,  8.,  5.,  7.,  5.,  4.,  1.,  4.,  1.,  6.,
        4.,  1.,  4.,  3.,  6.,  3.,  6.,  4.,  3., 12., 12.,  3.,  6.,
        7., 11., 11.,  2., 12., 10.,  1.,  5.])

In [35]:
# calculate accuracy and F1 for Random Forest
rf_cv_f1_score = f1_score(rf_cv_y, rf_predicted_cv_y, average="weighted")
rf_cv_accuracy = accuracy_score(rf_cv_y, rf_predicted_cv_y)

print("Random forest f1 score (CV): {:.3f}".format(rf_cv_f1_score))
print("Random forest accuracy (CV): {:.3f}".format(rf_cv_accuracy))

Random forest f1 score (CV): 0.115
Random forest accuracy (CV): 0.117


For the Random Forest method, using only default parameters (except for a max depth of 7), we are getting an **F1 score and accuracy around 0.11**. Slightly better than random, nowhere near good enough.

#### MFCCS Linear Model & Random Forest
Let's see if our methods result in a higher score for the 1D mel frequency cepstrum coefficients.

In [36]:
# design & compile the model
mfcc_linear_model = Sequential([
    Dense(input_shape=(num_features,), units = num_categories, activation="softmax")
])

# we choose the Adam optimizer with a specific learning rate
mfcc_linear_model.compile(Adam(lr=0.001),loss="categorical_crossentropy", metrics=["accuracy"])

In [37]:
# let's evaluate our loss before fitting the model
initial_score = mfcc_linear_model.evaluate(cv_X_mfccs, cv_y, verbose=0)
categorical_crossentropy = initial_score[0]
accuracy = initial_score[1]

print("MFCCs\nBased on random weights initialization (values will change every time you compile the model)\nCategorical crossentropy (loss): {:.4f}\nAccuracy: {:.2f}".format(categorical_crossentropy, accuracy))

MFCCs
Based on random weights initialization (values will change every time you compile the model)
Categorical crossentropy (loss): 7.8995
Accuracy: 0.10


In [38]:
# we pass our training data and our cross-validation data to see if we're not overfitting
history = mfcc_linear_model.fit(train_X_mfccs, train_y, batch_size=32, epochs=10, validation_data=(cv_X_mfccs, cv_y))

Train on 240 samples, validate on 60 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


We can quickly observe that the linear model based on the mfcc data is much better at fitting the training data - getting to a train accuracy of 0.44 after 10 epochs and a validation accuracy of around 0.233 (compared to the raw data linear model not progressing beyond train and cv accuracy of 0.15).

Let's run the linear model for 10 more epochs to see if we can get a better cv accuracy, despite clearly already overfitting.

In [39]:
# 10 more epochs
history = mfcc_linear_model.fit(train_X_mfccs, train_y, batch_size=32, epochs=10, validation_data=(cv_X_mfccs, cv_y))

Train on 240 samples, validate on 60 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Indeed we managed to reach a cv accuracy of over 0.3. If you do further experiments you can also stop on an epoch where the cv accuracy was around 0.35.

Let's see what accuracy and F1 score we can obtain from this model on the cv set.

In [40]:
# first use the model to predict the labels
mfccs_pred_cv_y = mfcc_linear_model.predict(cv_X_mfccs, batch_size=32)
mfccs_pred_cv_y.shape

(60, 12)

In [41]:
# make sure it's one-hot encoded
mfccs_pred_cv_y = utils.one_hot_encode(mfccs_pred_cv_y)
mfccs_pred_cv_y.shape

(60, 12)

In [42]:
# we can also use sklearn directly to get accuracy
mfccs_cv_accuracy = accuracy_score(cv_y, mfccs_pred_cv_y)
mfccs_cv_f1_score = f1_score(cv_y, mfccs_pred_cv_y, average="weighted")
print("MFCCs Linear model accuracy via sklearn (CV): {:.4f}".format(mfccs_cv_accuracy))
print("MFCCs Linear model f1 score (CV): {:.4f}".format(mfccs_cv_f1_score))

MFCCs Linear model accuracy via sklearn (CV): 0.2833
MFCCs Linear model f1 score (CV): 0.2476


We have a new benchmark - the **linear model based on the mfccs** reaches a CV accuracy of about 0.28 and a weighted F1 score of about 0.25 in this run, which we will round to a benchmark of **0.3**.

Let's try the random forest approach on the mfccs.

In [43]:
# initialize the random forest and fit it to mfcc X
mfccs_rand_forest = RandomForestClassifier(max_depth=7, random_state=0)
mfccs_rand_forest.fit(train_X_mfccs, rf_train_y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=7, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [44]:
mfccs_rf_predicted_cv_y = mfccs_rand_forest.predict(cv_X_mfccs)
mfccs_rf_predicted_cv_y

array([ 4.,  4.,  8.,  4.,  5., 12.,  9.,  2., 11.,  4.,  6., 11.,  9.,
        7.,  5.,  4.,  4.,  3.,  2., 11., 10.,  1.,  1.,  7.,  5.,  9.,
        5.,  4.,  4., 10.,  1.,  7.,  3.,  8.,  4.,  8.,  8.,  8.,  2.,
        9., 12., 12., 12.,  4.,  5., 12., 10.,  7.,  2.,  5., 11., 11.,
       11., 11., 11.,  2., 12.,  9., 12.,  3.])

In [45]:
# calculate accuracy and F1 for Random Forest
mfccs_rf_cv_f1_score = f1_score(rf_cv_y, mfccs_rf_predicted_cv_y, average="weighted")
mfccs_rf_cv_accuracy = accuracy_score(rf_cv_y, mfccs_rf_predicted_cv_y)

print("Random forest f1 score (CV): {:.3f}".format(mfccs_rf_cv_f1_score))
print("Random forest accuracy (CV): {:.3f}".format(mfccs_rf_cv_accuracy))

Random forest f1 score (CV): 0.235
Random forest accuracy (CV): 0.267


The Random Forest model with all default parameters and a max_depth of 7 is able to make more accurate predictions based on the mfccs data than on the raw data, but not more accurate than the linear model.

Let's experiment a little with some of the other parameters of our Random Forest to see if we can get a better result than our mfccs linear model.

In [46]:
# initialize the random forest and fit it to mfcc X
optimized_mfccs_rand_forest = RandomForestClassifier(
    n_estimators=50,
    max_depth=7,
    max_features = 3000,
    random_state=0)
optimized_mfccs_rand_forest.fit(train_X_mfccs, rf_train_y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=7, max_features=3000, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [47]:
optimized_mfccs_rf_predicted_cv_y = optimized_mfccs_rand_forest.predict(cv_X_mfccs)
optimized_mfccs_rf_predicted_cv_y

array([12.,  4.,  7.,  7.,  1.,  5., 12., 10.,  3.,  4.,  8., 10.,  6.,
        5.,  3.,  4.,  2.,  6.,  2.,  9.,  5.,  7.,  5., 10., 10., 12.,
        1.,  4., 10.,  1.,  7.,  7.,  2.,  2.,  6.,  8.,  2.,  7.,  4.,
        9., 10.,  9.,  6.,  9.,  2.,  2., 10.,  2.,  7.,  2., 11., 11.,
       11., 11., 11., 10.,  7., 12.,  1.,  5.])

In [48]:
# calculate accuracy and F1 for Random Forest
optimized_mfccs_rf_cv_f1_score = f1_score(rf_cv_y, optimized_mfccs_rf_predicted_cv_y, average="weighted")
optimized_mfccs_rf_cv_accuracy = accuracy_score(rf_cv_y, optimized_mfccs_rf_predicted_cv_y)

print("Random forest f1 score (CV): {:.3f}".format(optimized_mfccs_rf_cv_f1_score))
print("Random forest accuracy (CV): {:.3f}".format(optimized_mfccs_rf_cv_accuracy))

Random forest f1 score (CV): 0.293
Random forest accuracy (CV): 0.283


After a little bit of tweaking we can get a Random Forest with accuracy and F1 score approaching 0.3, right around our current benchmark.

In [49]:
# set benchmark
best_cv_acc = 0.3

## Train Neural Networks
Now that we have a benchmark obtained via simple linear and Random Forest models we can proceed towards trying to outdo it with MLPs and deep learning models.

#### MLP - multi-layer perceptron
Let's start with the simplest possible neural network of just 2 dense layers. We'll be working only on the mfccs data from now on, as it tends to produce better results. We will also add **batch normalization** and **dropout** to reduce overfitting.

In [50]:
# design & compile the model
num_nodes = 1000
mlp = Sequential([
    Dense(input_shape=(num_features,), units = num_nodes, activation="relu"),
    BatchNormalization(),
    Dropout(0.8),
    Dense(num_categories, activation='softmax')
])

# we choose the Adam optimizer with a specific learning rate
mlp.compile(Adam(lr=0.001),loss="categorical_crossentropy", metrics=["accuracy"])

In [51]:
# let's train 
for i in range(30):
    print("Actual epoch: {}".format(i + 1))
    mlp_results = mlp.fit(train_X_mfccs, train_y, batch_size=32, epochs=1, validation_data=(cv_X_mfccs, cv_y))
    # stop if we exceed previous best (benchmark)
    current_cv_acc = mlp_results.history["val_acc"][0] 
    if current_cv_acc > best_cv_acc:
        break

Actual epoch: 1
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 2
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 3
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 4
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 5
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 6
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 7
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 8
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 9
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 10
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 11
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 12
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 13
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 14
Train on 240 samples, validate on 60 samples
Epoch 1/1
A

In [52]:
# show latest results
last_training_accuracy = max(mlp_results.history["acc"])
last_validation_accuracy = max(mlp_results.history["val_acc"])
print("Last MLP scores\nTrain acc: {:.4f}\nCV acc: {:.4f}".format(last_training_accuracy, last_validation_accuracy))

Last MLP scores
Train acc: 0.4583
CV acc: 0.3000
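Because each `fit` call in the loop runs a single epoch, the returned history holds only that one epoch, so the scores above describe the last epoch rather than the whole run. If we wanted the full curve, one option is to collect the values ourselves. The sketch below assumes a hypothetical callable `fit_one_epoch` that trains for one epoch and returns the validation accuracy. Here it is replaced by a stub with made-up numbers.

```python
def train_until_benchmark(fit_one_epoch, benchmark, max_epochs=30):
    """Run single-epoch fits until validation accuracy beats the benchmark or max_epochs is reached.

    fit_one_epoch is a hypothetical callable that trains for one epoch and
    returns that epoch's validation accuracy.
    """
    history = []
    for _ in range(max_epochs):
        val_acc = fit_one_epoch()
        history.append(val_acc)
        if val_acc > benchmark:
            break
    # 1-based number of the epoch with the highest validation accuracy
    best_epoch = max(range(len(history)), key=history.__getitem__) + 1
    return history, best_epoch

# stub standing in for a real single-epoch training call (made-up values)
fake_val_accs = iter([0.10, 0.22, 0.18, 0.31, 0.40])
history, best_epoch = train_until_benchmark(lambda: next(fake_val_accs), benchmark=0.3)
```

With the stub values, training stops after the fourth epoch, the first one to exceed the benchmark of 0.3.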


In [53]:
# predict and one-hot encode
mlp_pred_cv_y = mlp.predict(cv_X_mfccs, batch_size=32)
mlp_pred_cv_y = utils.one_hot_encode(mlp_pred_cv_y)
mlp_pred_cv_y.shape

(60, 12)

In [54]:
# we can also use sklearn directly to get accuracy
mlp_cv_accuracy = accuracy_score(cv_y, mlp_pred_cv_y)
mlp_cv_f1_score = f1_score(cv_y, mlp_pred_cv_y, average="weighted")
print("MLP accuracy via sklearn (CV): {:.4f}".format(mlp_cv_accuracy))
print("MLP f1 score (CV): {:.4f}".format(mlp_cv_f1_score))

MLP accuracy via sklearn (CV): 0.3000
MLP f1 score (CV): 0.2717


We can see that a simple MLP model reaches a very similar accuracy score to our previous benchmark of 0.3. Both this one and the previous ones can be tuned to reach approximately 0.35 but let's save fine-tuning for when we have a more promising approach - we are also already overfitting.

#### Deep Neural Networks
Let's try adding more layers to capture more complex interactions.

In [55]:
dnn = Sequential([
    Dense(input_shape=(num_features,), units = 1500, activation="relu"),
    BatchNormalization(),
    Dropout(0.87),
    Dense(1000, activation="relu"),
    BatchNormalization(),
    Dropout(0.87),
    Dense(num_categories, activation='softmax')
])

# we choose the Adam optimizer with a specific learning rate
dnn.compile(Adam(lr=0.001),loss="categorical_crossentropy", metrics=["accuracy"])

In [56]:
# let's train 
for i in range(30):
    print("Actual epoch: {}".format(i + 1))
    dnn_results = dnn.fit(train_X_mfccs, train_y, batch_size=32, epochs=1, validation_data=(cv_X_mfccs, cv_y))
    # stop if we exceed or meet previous best (benchmark) - you can then re-run to see if we're overfitting or not
    current_cv_acc = dnn_results.history["val_acc"][0] 
    if current_cv_acc >= best_cv_acc:
        break

Actual epoch: 1
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 2
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 3
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 4
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 5
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 6
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 7
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 8
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 9
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 10
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 11
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 12
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 13
Train on 240 samples, validate on 60 samples
Epoch 1/1
Actual epoch: 14
Train on 240 samples, validate on 60 samples
Epoch 1/1
A

In [57]:
# show latest results
last_training_accuracy = max(dnn_results.history["acc"])
last_validation_accuracy = max(dnn_results.history["val_acc"])
print("Last DNN scores\nTrain acc: {:.4f}\nCV acc: {:.4f}".format(last_training_accuracy, last_validation_accuracy))

Last DNN scores
Train acc: 0.2458
CV acc: 0.2667


In [58]:
# predict and one-hot encode
dnn_pred_cv_y = dnn.predict(cv_X_mfccs, batch_size=32)
dnn_pred_cv_y = utils.one_hot_encode(dnn_pred_cv_y)
dnn_pred_cv_y.shape

(60, 12)

In [59]:
# we can also use sklearn directly to get accuracy
dnn_cv_accuracy = accuracy_score(cv_y, dnn_pred_cv_y)
dnn_cv_f1_score = f1_score(cv_y, dnn_pred_cv_y, average="weighted")
print("DNN accuracy via sklearn (CV): {:.4f}".format(dnn_cv_accuracy))
print("DNN f1 score (CV): {:.4f}".format(dnn_cv_f1_score))

DNN accuracy via sklearn (CV): 0.2667
DNN f1 score (CV): 0.2264
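
`utils.one_hot_encode` is not shown in this notebook. Below is a minimal sketch of what such a helper might do, assuming it sets the highest-probability class in each row to 1 and everything else to 0. The name `one_hot_from_probs` and the toy arrays are hypothetical. The sketch also shows why accuracy on one-hot rows counts a sample as correct only when the whole row matches, which is how sklearn's `accuracy_score` treats label-indicator arrays.

```python
import numpy as np

def one_hot_from_probs(probs):
    """Turn each row of class probabilities into a one-hot row (argmax wins)."""
    probs = np.asarray(probs)
    one_hot = np.zeros(probs.shape, dtype=int)
    one_hot[np.arange(probs.shape[0]), probs.argmax(axis=1)] = 1
    return one_hot

# toy softmax output for 3 samples and 4 classes (made-up numbers)
toy_probs = np.array([[0.1, 0.6, 0.2, 0.1],
                      [0.7, 0.1, 0.1, 0.1],
                      [0.2, 0.2, 0.5, 0.1]])
toy_true = np.array([[0, 1, 0, 0],
                     [0, 0, 0, 1],
                     [0, 0, 1, 0]])
toy_pred = one_hot_from_probs(toy_probs)

# a sample is correct only if its entire one-hot row matches the label row
toy_accuracy = np.mean(np.all(toy_pred == toy_true, axis=1))
print(toy_pred)
print(toy_accuracy)
```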


#### Convolutional Models
It seems we're stuck around 0.3 accuracy. That makes sense: the actual "no" (or any other word) can start anywhere in the vector, yet dense layers tie each weight to a specific index, so the same word at a different offset looks like a different input. Let's try convolutional layers, which can find a pattern regardless of whether it appears at the start or the end of the file.
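
To make the position argument concrete, here is a toy sketch in plain numpy, not the Keras models used in this notebook. It slides one filter over a signal and keeps the strongest response (a global max, which is a simplification of the `MaxPooling1D` + `Flatten` + `Dense` stack), then compares that with a single dense unit whose weights sit at fixed indexes. All names and values are illustrative.

```python
import numpy as np

def max_filter_response(signal, kernel):
    """Slide the kernel over the signal and return the strongest match."""
    responses = np.correlate(signal, kernel, mode="valid")
    return responses.max()

pattern = np.array([1.0, -2.0, 3.0, -2.0, 1.0])

# the same pattern placed early and late in an otherwise silent signal
early = np.zeros(100)
early[10:15] = pattern
late = np.zeros(100)
late[80:85] = pattern

conv_early = max_filter_response(early, pattern)
conv_late = max_filter_response(late, pattern)
print(conv_early, conv_late)    # same response at both positions

# a dense unit whose weights happen to line up with the early position only
dense_weights = np.zeros(100)
dense_weights[10:15] = pattern
dense_early = early @ dense_weights
dense_late = late @ dense_weights
print(dense_early, dense_late)  # responds to the early copy, misses the late one
```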

In [64]:
# In order to use convolutions we have to reshape our X -> expand it to 3 dimensions
conv_train_X_mfccs = np.expand_dims(train_X_mfccs, axis=2)
conv_train_X_mfccs.shape

(240, 16000, 1)

In [65]:
# repeat for cv & test
conv_cv_X_mfccs = np.expand_dims(cv_X_mfccs, axis=2)
conv_test_X_mfccs = np.expand_dims(test_X_mfccs, axis=2)

In [81]:
cnn1 = Sequential([
        Convolution1D(input_shape=(num_features, 1), kernel_size=32, filters=8, padding="same", activation="relu"),
        Dropout(0.1),
        MaxPooling1D(),
        Convolution1D(kernel_size=64, filters=16, padding="same", activation="relu"),
        Dropout(0.1),
        MaxPooling1D(),
        Flatten(),
        Dense(500, activation="relu"),
        Dropout(.6),
        Dense(num_categories, activation="softmax")
    ])

cnn1.compile(Adam(lr=0.001),loss="categorical_crossentropy", metrics=["accuracy"])

This CNN architecture should reach about 0.367 validation accuracy around epoch 35 and then start to overfit.
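
One way to locate the point where overfitting starts is to look for the epoch with the best validation accuracy, or to stop once it has not improved for a few epochs. The sketch below does both on a plain list of numbers. The helper names, the `patience` value and the accuracies are made up for illustration and are not results from this notebook. Keras also ships an `EarlyStopping` callback that implements the patience idea during `fit`.

```python
def best_epoch(val_acc_history):
    """Return (1-based epoch, value) of the highest validation accuracy."""
    best_index = max(range(len(val_acc_history)), key=lambda i: val_acc_history[i])
    return best_index + 1, val_acc_history[best_index]

def stop_epoch(val_acc_history, patience=3):
    """Return the 1-based epoch at which patience-based early stopping would halt."""
    best = float("-inf")
    epochs_without_gain = 0
    for epoch, acc in enumerate(val_acc_history, start=1):
        if acc > best:
            best = acc
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                return epoch
    return len(val_acc_history)

# made-up validation accuracies, NOT results from this notebook
toy_val_acc = [0.20, 0.25, 0.30, 0.28, 0.35, 0.33, 0.32, 0.31, 0.30]
print(best_epoch(toy_val_acc))
print(stop_epoch(toy_val_acc, patience=3))
```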

In [82]:
cnn1_results = cnn1.fit(conv_train_X_mfccs, train_y, batch_size=32, epochs=50, validation_data=(conv_cv_X_mfccs, cv_y))

Train on 240 samples, validate on 60 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


In [89]:
# show best results
best_training_accuracy = max(cnn1_results.history["acc"])
best_validation_accuracy = max(cnn1_results.history["val_acc"])
print("Best CNN 1 scores\nTrain acc: {:.4f}\nCV acc: {:.4f}".format(best_training_accuracy, best_validation_accuracy))

Best CNN 1 scores
Train acc: 0.5750
CV acc: 0.3667


In [91]:
# predict and one-hot encode
cnn1_pred_cv_y = cnn1.predict(conv_cv_X_mfccs, batch_size=32)
cnn1_pred_cv_y = utils.one_hot_encode(cnn1_pred_cv_y)
cnn1_pred_cv_y.shape

(60, 12)

In [92]:
# we can also use sklearn directly to get accuracy
cnn1_cv_accuracy = accuracy_score(cv_y, cnn1_pred_cv_y)
cnn1_cv_f1_score = f1_score(cv_y, cnn1_pred_cv_y, average="weighted")
print("CNN 1 accuracy via sklearn (CV): {:.4f}".format(cnn1_cv_accuracy))
print("CNN 1 f1 score (CV): {:.4f}".format(cnn1_cv_f1_score))

CNN 1 accuracy via sklearn (CV): 0.3167
CNN 1 f1 score (CV): 0.3039


Let's increase the kernel size: patterns in speech might require more than, e.g., 32 consecutive samples to be recognizable.
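
A rough back-of-the-envelope sketch of how much audio each kernel covers, assuming every input position is one audio sample at 16000 Hz and that `MaxPooling1D()` keeps its default pool size and stride of 2. The helper names are hypothetical.

```python
SAMPLE_RATE_HZ = 16000  # original sampling rate of the wav files

def kernel_duration_ms(kernel_size, sample_rate=SAMPLE_RATE_HZ):
    """Length of audio, in milliseconds, spanned by kernel_size samples."""
    return 1000.0 * kernel_size / sample_rate

def two_layer_receptive_field(kernel_1, kernel_2, pool_size=2):
    """Input samples seen by one output of conv -> max-pool -> conv (stride-1 convs)."""
    after_conv_1 = kernel_1
    after_pool = after_conv_1 + (pool_size - 1)
    # each step in the second conv moves pool_size samples in the input
    return after_pool + (kernel_2 - 1) * pool_size

for name, k1, k2 in [("cnn1", 32, 64), ("cnn2", 256, 512)]:
    rf = two_layer_receptive_field(k1, k2)
    print(name, kernel_duration_ms(k1), "ms first kernel;",
          rf, "samples =", kernel_duration_ms(rf), "ms after the second conv")
```

Under these assumptions the first `cnn1` kernel spans 2 ms of audio, while the first `cnn2` kernel spans 16 ms and its second convolution sees roughly 80 ms.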

In [95]:
cnn2 = Sequential([
        Convolution1D(input_shape=(num_features, 1), kernel_size=256, filters=32, padding="same", activation="relu"),
        Dropout(0.2),
        MaxPooling1D(),
        Convolution1D(kernel_size=512, filters=32, padding="same", activation="relu"),
        Dropout(0.2),
        MaxPooling1D(),
        Flatten(),
        Dense(500, activation="relu"),
        Dropout(.6),
        Dense(num_categories, activation="softmax")
    ])

cnn2.compile(Adam(lr=0.001),loss="categorical_crossentropy", metrics=["accuracy"])

In [None]:
cnn2_results = cnn2.fit(conv_train_X_mfccs, train_y, batch_size=32, epochs=50, validation_data=(conv_cv_X_mfccs, cv_y))

Train on 240 samples, validate on 60 samples
Epoch 1/50
Epoch 2/50

## Action plan
X) turn the sample data into numpy arrays with X and y normally <br>
X) turn sample data into numpy arrays with X and y via mfccs<br>
X) Use linear model? (towards first benchmark)<br>
X) Use random forest?<br>
X) Use MLP<br>
X) Use multiple dense layers<br>
4c) Use convolutions (try the increased kernel size that takes 400s per epoch)<br>
4d) USE RNN -> like in Nietzsche [https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/]<br>
5) Add preprocessing and test a couple of the best models<br>

6) Consider splitting the work on images into separate notebook depending on how bulky this gets<br>
7) Experiments on images without data augmentation<br>
8) Experiments on images with data augmentation<br>

9) Decide on e.g. 3 most promising methods<br>

And then:<br>
10) Move to writing the most promising models in tensorflow<br>
11) Include tensorboard visualization of training & graph<br>
12) Code for turning results into kaggle format of results to get score<br>
13) Obtain a good score on kaggle<br>
14) Re-read everything from start to finish and adjust<br>
15) Write a good Readme for markdown<br>
16) Add to CV<br>

You can start by trying a simple model on the 1D mfccs (even a linear model), then maybe 1D convolutions in Keras, then move on to actual 2D stuff.

**If we work on 1D data (like mfccs/waveforms), we can use the data augmentation from this Kaggle kernel: https://www.kaggle.com/CVxTz/audio-data-augmentation when passing our files into the Keras DataGenerator. If we decide to work with the MEL images instead, we can just use the same image augmentation as in fastai.**
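
As a rough idea of what 1D augmentation can look like, here is a minimal numpy sketch with two common waveform transforms: a random time shift and additive noise. It is not taken from the linked kernel. The function names, the shift range and the noise level are all assumptions chosen for illustration.

```python
import numpy as np

def shift_in_time(wave, max_shift, rng):
    """Shift the waveform left or right by up to max_shift samples, padding with zeros."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.zeros_like(wave)
    if shift > 0:
        shifted[shift:] = wave[:-shift]
    elif shift < 0:
        shifted[:shift] = wave[-shift:]
    else:
        shifted[:] = wave
    return shifted

def add_noise(wave, noise_level, rng):
    """Add Gaussian noise scaled by noise_level, keeping the input dtype."""
    noise = rng.standard_normal(len(wave)).astype(wave.dtype)
    return wave + noise_level * noise

rng = np.random.default_rng(0)
# one second of a synthetic tone standing in for a 16000-sample clip
toy_wave = np.sin(np.linspace(0, 2 * np.pi * 440, 16000)).astype(np.float32)

augmented = add_noise(shift_in_time(toy_wave, max_shift=1600, rng=rng),
                      noise_level=0.005, rng=rng)
print(augmented.shape, augmented.dtype)
```

Both transforms keep the clip length and dtype, so the augmented arrays can be fed to the same models as the originals.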

Use a very simple linear model / keras network to see how we do on the current sample, then experiment with different preprocessing.

In [14]:
import librosa
import numpy as np
import os
import matplotlib.pyplot as plt

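def extract_mfccs(wav_file):
    """
    Load a wav file and return its 40 MFCCs (mel-frequency cepstral coefficients),
    averaged over time.
    """
    # sr=None keeps the file's original sampling rate instead of resampling
    X, sample_rate = librosa.load(wav_file, res_type='kaiser_fast', sr=None)
    # librosa returns shape (n_mfcc, n_frames); the mean over frames gives 40 numbers
    mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
    return mfccs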

In [15]:
# build paths with os.path.join so they also work outside Windows
path_to_sample = os.path.join("data", "sample")
path_to_a_wav = os.path.join(path_to_sample, "cv", "unknown", "9db2bfe9_nohash_4_five.wav")

In [16]:
extract_mfccs(path_to_a_wav)

array([-4.06296364e+02,  6.31600698e+01, -2.38641127e+01, -4.86630969e+00,
       -3.53521586e+01, -3.80595467e+00, -1.06260360e+01, -5.45357225e+00,
       -5.38032267e-01, -3.13763738e+00, -1.61412864e+00, -3.92968492e+00,
       -5.57078467e+00, -4.21382641e+00, -8.39318905e+00,  2.59598676e+00,
       -1.21718174e+01,  6.58169994e+00, -6.52752377e+00,  2.20022835e+00,
       -4.70370097e+00, -7.75634867e-01, -2.45838166e+00, -1.27684907e+00,
       -9.24384769e-01, -2.84166555e+00, -2.06350172e+00, -8.51055474e-01,
       -6.62192168e-01, -1.39785145e+00, -1.65039538e+00,  3.35274945e-03,
        1.09041363e+00, -5.96439092e-01,  5.99651357e-01, -2.19326520e+00,
        7.19763870e-01,  1.33843908e+00,  1.59644506e-01, -9.80777004e-01])
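
The 40 values above are averages over all time frames. The sketch below uses a random matrix as a stand-in for the `(n_mfcc, n_frames)` output of `librosa.feature.mfcc` (the frame count is made up) to show what the averaging does to the shape, and that it throws away the order of the frames.

```python
import numpy as np

n_mfcc, n_frames = 40, 32  # the frame count is illustrative only
rng = np.random.default_rng(0)
# random stand-in for an MFCC matrix of shape (n_mfcc, n_frames)
toy_mfcc = rng.standard_normal((n_mfcc, n_frames))

# same reduction as in extract_mfccs: transpose, then average over frames
time_averaged = np.mean(toy_mfcc.T, axis=0)
print(toy_mfcc.shape, time_averaged.shape)

# playing the frames backwards gives the same averaged features
reversed_in_time = toy_mfcc[:, ::-1]
same_features = np.allclose(time_averaged, np.mean(reversed_in_time.T, axis=0))
print(same_features)
```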

In [17]:
# and that's what the waveform uses, I think
librosa.core.load(path_to_a_wav, sr=None)

(array([ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
        -3.0517578e-05, -3.0517578e-05, -3.0517578e-05], dtype=float32), 16000)

In [18]:
from utils import get_wav_info



In [19]:
len(get_wav_info(path_to_a_wav)[1])

16000

In [20]:
# without sr=None librosa resamples to its default rate of 22050 Hz
len(librosa.core.load(path_to_a_wav)[0])

22050
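
The two lengths are consistent with a one-second clip: resampling keeps the duration and changes the number of samples in proportion to the sampling rate. A small sketch of that arithmetic follows. The helper name is hypothetical, and the rounding up is an assumption about how the resampler picks the output length.

```python
import math

def resampled_length(n_samples, orig_sr, target_sr):
    """Number of samples needed to cover the same duration at target_sr."""
    return math.ceil(n_samples * target_sr / orig_sr)

# a 1-second clip recorded at 16000 Hz
at_default_rate = resampled_length(16000, orig_sr=16000, target_sr=22050)
at_original_rate = resampled_length(16000, orig_sr=16000, target_sr=16000)
print(at_default_rate, at_original_rate)
```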