# Experiments with simple models on the sample

In the previous notebooks we have separated a small subset of our data, called "sample", on which we can now experiment with simple models to assess the effectiveness of our preprocessing & data augmentation techniques.

We do it this way to avoid spending too much time training on the entire set. The assumption is that methods which are effective on the sample will also work well at a larger scale.

We will start by testing a couple of simple models on untouched sample data (as numpy arrays) and then proceed towards data augmentation and finally spectrograms.

In [1]:
# first make sure we're in the parent directory of our data/sample folders.
!pwd

/c/Users/mateusz/Documents/Mateusz/Career/Machine Learning & AI/tensorflow_speech_recognition/tensorflow_speech_recognition


## Import
We'll need a couple of additional libraries so let's import them.

In [2]:
# filter out warnings
import warnings
warnings.filterwarnings('ignore') 

In [3]:
import glob
import librosa
import matplotlib.pyplot as plt
import numpy as np
import os

# utils
from importlib import reload
import utils; reload(utils)

<module 'utils' from 'C:\\Users\\mateusz\\Documents\\Mateusz\\Career\\Machine Learning & AI\\tensorflow_speech_recognition\\tensorflow_speech_recognition\\utils.py'>

## Prepare data
The easiest way to work with data is by turning it into a list of numbers, in our case a numpy array. We can use one of the functions from utils to load the raw data or use the librosa.load() function. The difference is that the former returns int16 values, whereas librosa returns float32 values and resamples to its default sampling rate of 22050 Hz unless we explicitly tell it to keep the file's original sampling rate of 16000 Hz (by passing sr=None).
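To make the int16 vs float32 difference concrete, here is a minimal sketch of how int16 samples can be brought onto a float scale. It assumes the float values are simply the int16 values divided by 32768, which is consistent with values such as -3.0517578e-05 (that is, -1/32768) in the librosa output later in this notebook. The function name `int16_to_float32` is made up for this illustration and is not part of utils or librosa.

```python
import numpy as np

def int16_to_float32(samples):
    """Scale int16 PCM samples to float32 values in [-1.0, 1.0)."""
    samples = np.asarray(samples, dtype=np.int16)
    # 32768 = 2**15, the magnitude of the most negative int16 value
    return samples.astype(np.float32) / 32768.0

# a few hand-picked int16 values, not real audio
pcm = np.array([0, -1, 16384, -32768], dtype=np.int16)
as_float = int16_to_float32(pcm)
print(as_float.dtype, as_float)
```

Whichever loader we pick, the important thing is to use the same scale for the train, cv and test sets.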

We should also consider normalizing our data (so that it all falls within the same scale) and extracting a 1D mel-frequency cepstrum.
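One possible form of normalization is sketched below: each clip (row) is scaled to zero mean and unit variance. This is only one option, chosen here for illustration; the notebook has not yet committed to a particular scheme, and the `standardize` helper and the random demo data are assumptions.

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Scale each row (one clip) to zero mean and unit variance."""
    X = np.asarray(X, dtype=np.float32)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    # eps guards against division by zero for all-silent clips
    return (X - mean) / (std + eps)

# random stand-in for 3 one-second clips at 16000 Hz
rng = np.random.default_rng(0)
X_demo = rng.integers(-2000, 2000, size=(3, 16000)).astype(np.float32)
X_norm = standardize(X_demo)
print(X_norm.mean(axis=1), X_norm.std(axis=1))
```

Per-clip scaling removes loudness differences between recordings; scaling with statistics computed over the whole training set would be an alternative worth comparing.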

In [4]:
path_to_sample = "data\\sample"

We'll have to go through each of the folders in our sample/train, cv and test sets, one-hot encode their labels and load the 16,000-sample array of raw data for each file. The y data will be of shape (m, 12), where m is the number of examples, and the X data will be of shape (m, 16000).
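As a rough sketch of how the X matrix of shape (m, 16000) could be assembled, the helper below preallocates the matrix and fills it row by row. Zero-padding shorter clips and truncating longer ones is an assumption added for safety; the `build_X` name and the fake clips are hypothetical stand-ins for the loaded .wav data.

```python
import numpy as np

CLIP_LENGTH = 16000  # samples per clip, as stated above

def build_X(clips, clip_length=CLIP_LENGTH):
    """Stack 1D clips into an (m, clip_length) float32 matrix.

    Clips shorter than clip_length are zero-padded at the end,
    longer ones are truncated (an assumption of this sketch).
    """
    X = np.zeros((len(clips), clip_length), dtype=np.float32)
    for row, clip in enumerate(clips):
        n = min(len(clip), clip_length)
        X[row, :n] = clip[:n]
    return X

# stand-ins for loaded audio: exact length, too short, too long
fake_clips = [np.ones(16000), np.ones(12000), np.ones(20000)]
X_demo = build_X(fake_clips)
print(X_demo.shape)  # (3, 16000)
```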

Let's calculate **m** first. We will do that by using a function that creates a list of all the .wav files within a directory.

### Create a list of paths
We will use the glob module that we learned about in the very first notebook and a function from utils.py which, given a directory, returns a list of paths to the .wav files within it. We will repeat the process for all 3 sets within sample, and for every category subdirectory within those too.
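For readers without access to utils.py, a glob-based helper along these lines would do the job. This is a guess at the general idea, not the actual implementation of `utils.grab_wavs`; the name `list_wavs` is hypothetical.

```python
import glob
import os

def list_wavs(directory):
    """Return a sorted list of paths to the .wav files directly inside `directory`."""
    pattern = os.path.join(directory, "*.wav")
    # sorting keeps the order stable across runs and operating systems
    return sorted(glob.glob(pattern))

# example call (prints an empty list if the folder does not exist)
print(list_wavs(os.path.join("data", "sample", "train", "stop"))[:5])
```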

In [5]:
# for example we can grab all .wav files from sample/train/stop
path_to_sample_train_stop = os.path.join(path_to_sample, "train", "stop")
utils.grab_wavs(path_to_sample_train_stop)[:5]

['data\\sample\\train\\stop\\01bcfc0c_nohash_1.wav',
 'data\\sample\\train\\stop\\17cc40ee_nohash_1.wav',
 'data\\sample\\train\\stop\\2da58b32_nohash_2.wav',
 'data\\sample\\train\\stop\\2da58b32_nohash_4.wav',
 'data\\sample\\train\\stop\\311fde72_nohash_2.wav']

In [6]:
# we'll need a list of all category folder names
categories_to_predict = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go", "silence", "unknown"]

In [7]:
# first grab the training set
path_to_train = os.path.join(path_to_sample, "train")
sample_train_wavs = []

for category in categories_to_predict:
    path_to_category = os.path.join(path_to_train, category)
    category_files = utils.grab_wavs(path_to_category)
    
    # we use extend instead of append to add all elements from the iterable
    sample_train_wavs.extend(category_files)
    
sample_train_wavs

['data\\sample\\train\\yes\\023a61ad_nohash_0.wav',
 'data\\sample\\train\\yes\\0f3f64d5_nohash_0.wav',
 'data\\sample\\train\\yes\\190821dc_nohash_4.wav',
 'data\\sample\\train\\yes\\28ed6bc9_nohash_1.wav',
 'data\\sample\\train\\yes\\324210dd_nohash_5.wav',
 'data\\sample\\train\\yes\\32561e9e_nohash_0.wav',
 'data\\sample\\train\\yes\\3fdafe25_nohash_0.wav',
 'data\\sample\\train\\yes\\48e8b82a_nohash_1.wav',
 'data\\sample\\train\\yes\\493392c6_nohash_1.wav',
 'data\\sample\\train\\yes\\589bce2c_nohash_1.wav',
 'data\\sample\\train\\yes\\5c237956_nohash_0.wav',
 'data\\sample\\train\\yes\\65c73b55_nohash_0.wav',
 'data\\sample\\train\\yes\\89f680f3_nohash_0.wav',
 'data\\sample\\train\\yes\\953fe1ad_nohash_1.wav',
 'data\\sample\\train\\yes\\b43de700_nohash_1.wav',
 'data\\sample\\train\\yes\\b7669804_nohash_0.wav',
 'data\\sample\\train\\yes\\e48a80ed_nohash_2.wav',
 'data\\sample\\train\\yes\\f5c3de1b_nohash_0.wav',
 'data\\sample\\train\\yes\\f839238a_nohash_1.wav',
 'data\\samp

In [8]:
# repeat for cv
path_to_cv = os.path.join(path_to_sample, "cv")
sample_cv_wavs = []

for category in categories_to_predict:
    path_to_category = os.path.join(path_to_cv, category)
    category_files = utils.grab_wavs(path_to_category)
    sample_cv_wavs.extend(category_files)

# repeat for test
path_to_test = os.path.join(path_to_sample, "test")
sample_test_wavs = []

for category in categories_to_predict:
    path_to_category = os.path.join(path_to_test, category)
    category_files = utils.grab_wavs(path_to_category)
    sample_test_wavs.extend(category_files)

### One-hot encode the y

Now that we have the 3 lists of files from each set (train, cv and test) we can construct our train_y, cv_y and test_y numpy arrays. These will be matrices of size (m, 12), one-hot encoded. E.g. if a row belongs to the category "up" it will take the form of an array of zeros, where the entry at index 2 (the third from the left) will become a 1.

We will use a function from the utils that takes a path to a .wav, the index at which the category name starts within it (we want to control this because we will eventually use this for the main set, not just the sample) and a list of categories to predict. For our current example, the category name in the paths belonging to "train" starts at the 18th index (separators count as one char).
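The start index does not have to be counted by hand: it is the length of the path to the set folder plus one for the separator. The sketch below uses `ntpath` so that it reproduces the Windows-style paths of this notebook on any operating system; `category_start_index` is a hypothetical helper, not a function from utils.

```python
import ntpath  # Windows path rules, matching the "data\\sample\\..." paths above

def category_start_index(set_dir):
    """Index of the first character of the category name in paths under `set_dir`."""
    # +1 accounts for the separator between the set folder and the category folder
    return len(set_dir) + 1

path_to_sample = ntpath.join("data", "sample")
for set_name in ("train", "cv", "test"):
    set_dir = ntpath.join(path_to_sample, set_name)
    print(set_name, category_start_index(set_dir))
# prints:
# train 18
# cv 15
# test 17
```

Because "train", "cv" and "test" have different lengths, each set needs its own index.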

In [9]:
# let's grab a single path (this one is a "yes")
a_wav = sample_train_wavs[0]

In [10]:
# let's see if the 1 is correctly placed
utils.one_hot_encode_path(a_wav, 18, categories_to_predict)

array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

The path belonged to the first category ("yes") and the one-hot encoding correctly placed the 1 at index 0.

We want to repeat this for all examples in each of the 3 subsets, adding each new one-hot encoded numpy array as a new row of the y matrix, in order.

In [16]:
# figure out the dimensions of train_y
rows = len(sample_train_wavs)
columns = len(categories_to_predict)
dimensions = (rows, columns)
dimensions

(240, 12)

In [44]:
# create train_y as empty array
train_y = np.array([])

# append each row to train_y
for path_to_wav in sample_train_wavs:
    row = utils.one_hot_encode_path(path_to_wav, 18, categories_to_predict)
    
    # append the new row
    train_y = np.append(train_y, row)
    
# we currently have a flattened vector
print("Current shape: {}".format(*train_y.shape))

# let's reshape it
train_y = np.reshape(train_y, dimensions)
print("New shape: {}".format(train_y.shape))

Current shape: 2880
New shape: (240, 12)
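Appending to a flat array and reshaping afterwards works, but the same matrix can be built in one step by indexing an identity matrix with the category indices. The sketch below assumes the label names have already been extracted from the paths; `one_hot_matrix` is a hypothetical helper and does not reflect how `utils.one_hot_encode_path` works internally.

```python
import numpy as np

categories = ["yes", "no", "up", "down", "left", "right",
              "on", "off", "stop", "go", "silence", "unknown"]

def one_hot_matrix(labels, categories):
    """Return an (m, len(categories)) float64 one-hot matrix for a list of label names."""
    index_of = {name: i for i, name in enumerate(categories)}
    indices = np.array([index_of[label] for label in labels], dtype=int)
    # row i of the identity matrix is the one-hot vector for category i
    return np.eye(len(categories))[indices]

demo_y = one_hot_matrix(["yes", "up", "unknown"], categories)
print(demo_y.shape)  # (3, 12)
```

This avoids the repeated copying that np.append performs on every iteration, which matters more once we move from the sample to the full set.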


In [47]:
# show the train_y matrix to confirm
train_y

array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.]])

We can see that the first 3 entries have the 1 at index 0, which means they belong to the category "yes", and the last three have the 1 at the last index (category "unknown"). This is correct, because our list of paths follows the order of the category list.

We should bear in mind that by default the np.array contains float64s and our functions for loading a .wav return int16s.

Repeat for **CV set**.

In [54]:
# figure out the dimensions
rows = len(sample_cv_wavs)
columns = len(categories_to_predict)
dimensions = (rows, columns)
print("Target dimensions: {}".format(dimensions))

# empty array
cv_y = np.array([])

for path_to_wav in sample_cv_wavs:
    row = utils.one_hot_encode_path(path_to_wav, 15, categories_to_predict)
    
    # append the new row
    cv_y = np.append(cv_y, row)
    
# we currently have a flattened vector
print("Current shape: {}".format(*cv_y.shape))

# let's reshape it
cv_y = np.reshape(cv_y, dimensions)
print("New shape: {}".format(cv_y.shape))

Target dimensions: (60, 12)
Current shape: 720
New shape: (60, 12)


Repeat for **Test set**.

In [57]:
# figure out the dimensions
rows = len(sample_test_wavs)
columns = len(categories_to_predict)
dimensions = (rows, columns)
print("Target dimensions: {}".format(dimensions))

# empty array
test_y = np.array([])

for path_to_wav in sample_test_wavs:
    row = utils.one_hot_encode_path(path_to_wav, 17, categories_to_predict)
    
    # append the new row
    test_y = np.append(test_y, row)
    
# we currently have a flattened vector
print("Current shape: {}".format(*test_y.shape))

# let's reshape it
test_y = np.reshape(test_y, dimensions)
print("New shape: {}".format(test_y.shape))

Target dimensions: (60, 12)
Current shape: 720
New shape: (60, 12)


## Action plan
1) turn the sample data into numpy arrays with X and y from the raw waveforms <br>
1b) turn sample data into numpy arrays with X and y via mfccs<br>
2) Use random forest?<br>
3) Use linear model? (towards first benchmark)<br>
4) Use various keras CNNs?<br>
5) Add preprocessing and test a couple of the best models<br>
6) Experiments on images without data augmentation<br>
7) Experiments on images with data augmentation<br>
8) Decide on e.g. 3 most promising methods<br>

We can start by trying a simple model on the 1D MFCCs (even a linear model), then maybe 1D convolutions in Keras, and then move on to actual 2D inputs.

**If we work on 1D data (like MFCCs/waveforms), we can use the audio data augmentation described here: https://www.kaggle.com/CVxTz/audio-data-augmentation when passing our files into the Keras DataGenerator. If we decide to work with the MEL images instead, we can just use the same image augmentation as in fastai.**

Use a very simple linear model / Keras network to see how we do on the current sample, then experiment with different preprocessing.

In [14]:
import librosa
import numpy as np
import os
import matplotlib.pyplot as plt

def extract_mfccs(wav_file):
    """
    Load a .wav file at its original sampling rate and return
    its 40 MFCCs, each averaged over all time frames.
    """
    X, sample_rate = librosa.load(wav_file, res_type='kaiser_fast', sr=None)
    mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
    return mfccs
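The `np.mean(... .T, axis=0)` step in extract_mfccs collapses the time axis, so every clip ends up as a single vector of 40 numbers. The sketch below shows this on a random stand-in array with the (n_mfcc, n_frames) layout that librosa.feature.mfcc returns; the frame count of 32 is just an illustrative number.

```python
import numpy as np

rng = np.random.default_rng(0)
fake_mfcc = rng.normal(size=(40, 32))  # stand-in for an (n_mfcc, n_frames) MFCC matrix

mean_via_transpose = np.mean(fake_mfcc.T, axis=0)  # form used in extract_mfccs
mean_over_frames = np.mean(fake_mfcc, axis=1)      # equivalent, without the transpose

print(mean_via_transpose.shape)  # (40,)
print(np.allclose(mean_via_transpose, mean_over_frames))  # True
```

Averaging over time throws away the order of the frames, which is fine for a first benchmark but is the main reason to move on to 2D inputs later.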

In [15]:
path_to_sample = "data\\sample"
path_to_a_wav = os.path.join(path_to_sample, "cv\\unknown\\9db2bfe9_nohash_4_five.wav")

In [16]:
extract_mfccs(path_to_a_wav)

array([-4.06296364e+02,  6.31600698e+01, -2.38641127e+01, -4.86630969e+00,
       -3.53521586e+01, -3.80595467e+00, -1.06260360e+01, -5.45357225e+00,
       -5.38032267e-01, -3.13763738e+00, -1.61412864e+00, -3.92968492e+00,
       -5.57078467e+00, -4.21382641e+00, -8.39318905e+00,  2.59598676e+00,
       -1.21718174e+01,  6.58169994e+00, -6.52752377e+00,  2.20022835e+00,
       -4.70370097e+00, -7.75634867e-01, -2.45838166e+00, -1.27684907e+00,
       -9.24384769e-01, -2.84166555e+00, -2.06350172e+00, -8.51055474e-01,
       -6.62192168e-01, -1.39785145e+00, -1.65039538e+00,  3.35274945e-03,
        1.09041363e+00, -5.96439092e-01,  5.99651357e-01, -2.19326520e+00,
        7.19763870e-01,  1.33843908e+00,  1.59644506e-01, -9.80777004e-01])

In [17]:
# for comparison, the raw waveform loaded at the file's original sampling rate (sr=None)
librosa.core.load(path_to_a_wav, sr=None)

(array([ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
        -3.0517578e-05, -3.0517578e-05, -3.0517578e-05], dtype=float32), 16000)

In [18]:
from utils import get_wav_info



In [19]:
len(get_wav_info(path_to_a_wav)[1])

16000

In [20]:
len(librosa.core.load(path_to_a_wav)[0])

22050
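The two lengths differ only because of resampling: a one-second clip has as many samples as the sampling rate. A small arithmetic sketch follows; `resampled_length` is a hypothetical helper, and the rounding is an approximation of what a real resampler does.

```python
def resampled_length(n_samples, original_sr, target_sr):
    """Approximate number of samples after resampling a clip to target_sr."""
    duration_s = n_samples / original_sr  # clip duration in seconds
    return round(duration_s * target_sr)

print(resampled_length(16000, 16000, 22050))  # 22050
print(resampled_length(16000, 16000, 16000))  # 16000
```

Passing sr=None to librosa.load keeps the original 16000 samples, so X retains the (m, 16000) shape described earlier.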