# Experiments with simple models on the sample

In the previous notebooks we have separated a small subset of our data, called "sample", on which we can now experiment with simple models to assess the effectiveness of our preprocessing & data augmentation techniques.

We do this to avoid spending too much time training on the entire set. The assumption is that methods which are effective on the sample will also work well at a larger scale.

We will start by testing a couple of simple models on untouched sample data (as numpy arrays) and then proceed towards data augmentation and finally spectrograms.

In [25]:
# first make sure we're in the parent directory of our data/sample folders.
!pwd

/c/Users/mateusz/Documents/Mateusz/Career/Machine Learning & AI/tensorflow_speech_recognition/tensorflow_speech_recognition


## Import
We'll need a couple of additional libraries so let's import them.

In [26]:
# filter out warnings
import warnings
warnings.filterwarnings('ignore') 

In [30]:
import glob
import librosa
import matplotlib.pyplot as plt
import numpy as np
import os

# utils
from importlib import reload
import utils; reload(utils)

<module 'utils' from 'C:\\Users\\mateusz\\Documents\\Mateusz\\Career\\Machine Learning & AI\\tensorflow_speech_recognition\\tensorflow_speech_recognition\\utils.py'>

## Prepare data
The easiest way to work with data is by turning it into a list of numbers, in our case a numpy array. We can use one of the functions from utils to load the raw data or use the librosa.load() function. The difference lies in the fact that the former returns int16s whereas librosa returns float32s and uses its default sampling rate of 22050Hz, unless we explicitly tell it to use the file's original sampling rate of 16000Hz.
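
The relationship between the two representations can be sketched with NumPy alone. This assumes that 16-bit PCM is scaled by 1/32768 when decoded to floats, which is consistent with the -3.0517578e-05 values in the librosa output further down but has not been checked against `utils`. `int16_to_float32` is a hypothetical helper name, not something from `utils` or librosa.

```python
import numpy as np

def int16_to_float32(samples):
    """Scale 16-bit PCM samples to float32 values in [-1.0, 1.0)."""
    samples = np.asarray(samples, dtype=np.int16)
    # 32768 = 2**15, the magnitude of the most negative int16 value (assumed scale factor)
    return samples.astype(np.float32) / np.float32(32768.0)

pcm = np.array([0, -1, 16384, -32768], dtype=np.int16)
as_float = int16_to_float32(pcm)
```

Under this assumption an int16 sample of -1 becomes -1/32768, so both loaders would carry the same information at different scales.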

We should also consider normalizing our data (so that it all falls within the same scale) and extracting a 1D mel-frequency cepstrum.
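
As a rough sketch of what normalizing could look like, here are two common options: peak normalization of a single waveform and per-feature standardization of a feature matrix. The function names and the `eps` parameter are illustrative choices, and neither option is necessarily the one we will end up using.

```python
import numpy as np

def peak_normalize(waveform, eps=1e-8):
    """Scale a waveform so that its largest absolute sample is (almost exactly) 1.0."""
    waveform = np.asarray(waveform, dtype=np.float32)
    peak = np.max(np.abs(waveform))
    # eps guards against division by zero for all-silent clips
    return waveform / (peak + eps)

def standardize(features, eps=1e-8):
    """Zero-mean, unit-variance scaling per column of an (m, n_features) array."""
    features = np.asarray(features, dtype=np.float64)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

quiet = np.array([0.0, 0.01, -0.02, 0.005], dtype=np.float32)
loud = peak_normalize(quiet)

rng = np.random.default_rng(0)
scaled = standardize(rng.normal(loc=3.0, scale=10.0, size=(50, 4)))
```

When standardizing, the mean and standard deviation should come from the training set only and then be reused for the cv and test sets.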

In [31]:
# build the path with os.path.join so it also works outside Windows
path_to_sample = os.path.join("data", "sample")

We'll have to go through each of the label folders in our sample/train, cv and test sets, one-hot encode the label and load the 16,000-sample array of raw data. The y data will be of shape (m, 12), where m is the number of examples, and the X data will be of shape (m, 16000).

Let's calculate **m** first.
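
One way to get **m** without hard-coding it is to count the .wav files on disk. The sketch below assumes each split folder (train, cv, test) contains one sub-folder per label with the .wav files directly inside. `count_examples` is a hypothetical helper, not a function from `utils`.

```python
import glob
import os

def count_examples(split_dir):
    """Count the .wav files found in split_dir/<label>/ across all label folders."""
    pattern = os.path.join(split_dir, "*", "*.wav")
    return len(glob.glob(pattern))
```

For the training split this would be called as `count_examples(os.path.join("data", "sample", "train"))`. It returns 0 if the folder does not exist.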

In [35]:
train_X = np.zeros((100, 16000))
train_X

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
# the shape has to be passed as a single tuple
train_y = np.zeros((100, 12))
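
A minimal sketch of how preallocated arrays like these could be filled, using random data in place of real clips. The label list and its order are assumptions (the real ones depend on the folder names in the sample), and `one_hot` and `pad_or_truncate` are hypothetical helpers. The padding step is there in case a clip is not exactly 16000 samples long.

```python
import numpy as np

# Assumed label set and order; adjust to the actual folder names in the sample.
LABELS = ["yes", "no", "up", "down", "left", "right",
          "on", "off", "stop", "go", "silence", "unknown"]

def one_hot(label, labels=LABELS):
    """Return a vector of zeros with a single 1.0 at the index of `label`."""
    vec = np.zeros(len(labels), dtype=np.float32)
    vec[labels.index(label)] = 1.0
    return vec

def pad_or_truncate(waveform, length=16000):
    """Cut a waveform to `length` samples, or pad it with trailing zeros."""
    waveform = np.asarray(waveform, dtype=np.float32)[:length]
    return np.pad(waveform, (0, length - len(waveform)), mode="constant")

# Random arrays stand in for loaded clips; the second one is deliberately short.
rng = np.random.default_rng(0)
examples = [(rng.normal(size=16000), "yes"), (rng.normal(size=12000), "unknown")]

X = np.zeros((len(examples), 16000), dtype=np.float32)
y = np.zeros((len(examples), len(LABELS)), dtype=np.float32)
for i, (wave, label) in enumerate(examples):
    X[i] = pad_or_truncate(wave)
    y[i] = one_hot(label)
```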

## Action plan
1) Turn the sample data into numpy arrays X and y from the raw waveforms<br>
1b) Turn the sample data into numpy arrays X and y via MFCCs<br>
2) Use random forest?<br>
3) Use linear model? (towards first benchmark)<br>
4) Use various keras CNNs?<br>
5) Add preprocessing and test a couple of the best models<br>
6) Experiments on images without data augmentation<br>
7) Experiments on images with data augmentation<br>
8) Decide on e.g. 3 most promising methods<br>

We can start by trying a simple model on the 1D MFCCs (even a linear model), then maybe 1D convolutions in Keras, and then move on to the actual 2D data.
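
To make the "even a linear model" idea concrete, here is a small sketch of softmax (multinomial logistic) regression in plain NumPy. The data is synthetic and only mimics the shape of averaged MFCC features (40 values per example). The learning rate, step count and three-class setup are arbitrary illustrative choices. In practice a library implementation would be the more sensible option.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax_regression(X, y_onehot, lr=0.01, n_steps=200):
    """Batch gradient descent on the mean cross-entropy loss of a linear model."""
    m, n_features = X.shape
    W = np.zeros((n_features, y_onehot.shape[1]))
    b = np.zeros(y_onehot.shape[1])
    for _ in range(n_steps):
        probs = softmax(X @ W + b)
        grad = (probs - y_onehot) / m       # gradient of the loss w.r.t. the logits
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(X, W, b):
    return np.argmax(X @ W + b, axis=1)

# Synthetic stand-in for MFCC features: 3 well-separated classes, 40 features each
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 40))
labels = rng.integers(0, 3, size=300)
X = centers[labels] + rng.normal(size=(300, 40))
X = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize each feature
y_onehot = np.eye(3)[labels]

W, b = fit_softmax_regression(X, y_onehot)
accuracy = np.mean(predict(X, W, b) == labels)
```

The accuracy here is measured on the training data and only shows that the model fits. A real benchmark would use the cv set.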

**If we work on 1D data (like MFCCs/waveforms), we can use the data augmentation from this kernel: https://www.kaggle.com/CVxTz/audio-data-augmentation when passing our files into the Keras DataGenerator, but if we decide to work with the mel images we can just use the same image augmentation as in fastai.**
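
As a rough idea of what 1D augmentation can look like, the sketch below adds white noise to a waveform and shifts it in time. These are generic techniques written from scratch with NumPy, not necessarily the ones used in the linked kernel. The function names and default values are illustrative.

```python
import numpy as np

def add_white_noise(waveform, noise_level=0.005, rng=None):
    """Add Gaussian noise with standard deviation `noise_level`."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(scale=noise_level, size=waveform.shape)
    return (waveform + noise).astype(np.float32)

def time_shift(waveform, shift):
    """Shift by `shift` samples (positive = later), filling the gap with zeros."""
    shifted = np.zeros_like(waveform)
    if shift > 0:
        shifted[shift:] = waveform[:-shift]
    elif shift < 0:
        shifted[:shift] = waveform[-shift:]
    else:
        shifted[:] = waveform
    return shifted

rng = np.random.default_rng(0)
# A one-second test tone of roughly 440 Hz at 16 kHz stands in for a real clip
clip = np.sin(np.linspace(0, 2 * np.pi * 440, 16000)).astype(np.float32)
noisy = add_white_noise(clip, rng=rng)
delayed = time_shift(clip, 1600)    # 0.1 s later at 16 kHz
```

Both functions keep the length at 16000 samples, so the augmented clips still fit the (m, 16000) layout.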

Use a very simple linear model / Keras network to see how we do on the current sample, then experiment with different preprocessing.

In [14]:
import librosa
import numpy as np
import os
import matplotlib.pyplot as plt

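def extract_mfccs(wav_file):
    """
    Load a wav file and return its 40 MFCCs averaged over time (a 1D array).
    """
    # sr=None keeps the file's original sampling rate, so no resampling takes place
    X, sample_rate = librosa.load(wav_file, res_type='kaiser_fast', sr=None)
    # mfcc returns shape (n_mfcc, n_frames); averaging over frames leaves one value per coefficient
    mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
    return mfccs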

In [15]:
# os.path.join keeps the paths portable across operating systems
path_to_sample = os.path.join("data", "sample")
path_to_a_wav = os.path.join(path_to_sample, "cv", "unknown", "9db2bfe9_nohash_4_five.wav")

In [16]:
extract_mfccs(path_to_a_wav)

array([-4.06296364e+02,  6.31600698e+01, -2.38641127e+01, -4.86630969e+00,
       -3.53521586e+01, -3.80595467e+00, -1.06260360e+01, -5.45357225e+00,
       -5.38032267e-01, -3.13763738e+00, -1.61412864e+00, -3.92968492e+00,
       -5.57078467e+00, -4.21382641e+00, -8.39318905e+00,  2.59598676e+00,
       -1.21718174e+01,  6.58169994e+00, -6.52752377e+00,  2.20022835e+00,
       -4.70370097e+00, -7.75634867e-01, -2.45838166e+00, -1.27684907e+00,
       -9.24384769e-01, -2.84166555e+00, -2.06350172e+00, -8.51055474e-01,
       -6.62192168e-01, -1.39785145e+00, -1.65039538e+00,  3.35274945e-03,
        1.09041363e+00, -5.96439092e-01,  5.99651357e-01, -2.19326520e+00,
        7.19763870e-01,  1.33843908e+00,  1.59644506e-01, -9.80777004e-01])
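
The 40 values above are per-coefficient averages over all time frames, so the temporal order of the clip is lost. The reduction can be illustrated with a random matrix standing in for the output of `librosa.feature.mfcc`. The frame count of 32 is only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mfcc, n_frames = 40, 32                           # n_frames is an illustrative value
mfcc_matrix = rng.normal(size=(n_mfcc, n_frames))   # stand-in for the real MFCC matrix

# Same reduction as in extract_mfccs: transpose to (n_frames, n_mfcc), then average over frames
mean_over_time = np.mean(mfcc_matrix.T, axis=0)

# Equivalent without the transpose
same_result = mfcc_matrix.mean(axis=1)
```

Keeping the full (n_mfcc, n_frames) matrix instead would preserve the time axis, which is what convolutional models need.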

In [17]:
# for comparison: the raw waveform loaded at the file's original sampling rate (sr=None)
librosa.core.load(path_to_a_wav, sr=None)

(array([ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
        -3.0517578e-05, -3.0517578e-05, -3.0517578e-05], dtype=float32), 16000)

In [18]:
from utils import get_wav_info



In [19]:
len(get_wav_info(path_to_a_wav)[1])

16000

In [20]:
# without sr=None librosa resamples to its default 22050 Hz, which changes the length
len(librosa.core.load(path_to_a_wav)[0])

22050
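
The two lengths describe the same one-second clip at different sampling rates. A short arithmetic sketch, assuming the resampled length is the original length scaled by the ratio of the sampling rates and rounded up (`resampled_length` is a hypothetical helper):

```python
import math

def resampled_length(n_samples, orig_sr, target_sr):
    """Number of samples after resampling a clip from orig_sr to target_sr."""
    return math.ceil(n_samples * target_sr / orig_sr)

n_native = 16000                      # samples in the file at 16000 Hz
duration_s = n_native / 16000         # clip duration in seconds
n_default = resampled_length(n_native, 16000, 22050)
```

To keep X at shape (m, 16000), the files should therefore be loaded with `sr=None` (or `sr=16000`).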