Our simple yes/no recognizer used a neural network with four convolutional layers and three fully connected layers (over 3 million trainable weights). It reached about 96% accuracy on both the training and test sets. What do you think would happen if we replaced the convolutional layers with densely connected layers only? A network with one hidden layer of 200 units ends up with a similar number of trainable weights. Make a guess as to what training and test accuracy you'd see, then run the notebook to find out what happens.

In this notebook we will build a speech recognition model.  

Below we'll import the libraries we'll be using.

In [1]:
import os
import librosa   #for audio processing
import IPython.display as ipd
import matplotlib.pyplot as plt
import numpy as np
from scipy.io import wavfile #for audio processing
import warnings
warnings.filterwarnings("ignore")

Next, we'll download the Speech Commands dataset from TensorFlow.

In [2]:
!wget http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz

--2020-07-05 10:00:15--  http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 173.194.76.128, 2a00:1450:400c:c00::80
Connecting to download.tensorflow.org (download.tensorflow.org)|173.194.76.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1489096277 (1.4G) [application/gzip]
Saving to: ‘speech_commands_v0.01.tar.gz’


2020-07-05 10:00:22 (207 MB/s) - ‘speech_commands_v0.01.tar.gz’ saved [1489096277/1489096277]



Here, we extract the archive we downloaded from TensorFlow.

In [3]:
!mkdir speech_commands
!tar -C ./speech_commands -xf speech_commands_v0.01.tar.gz 
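
The shell commands above can also be written in Python, which helps when the notebook is re-run (a second `mkdir` on an existing directory reports an error) or when shell escapes are unavailable. This is only a sketch using the standard library; the helper name `extract_archive` is our own and not part of the original notebook.

```python
import os
import tarfile

def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir, creating the directory if needed."""
    os.makedirs(dest_dir, exist_ok=True)  # safe to call again if the directory exists
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest_dir)

# Usage in the notebook (assumes the download above has finished):
# extract_archive("speech_commands_v0.01.tar.gz", "speech_commands")
```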

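Below we load the audio into `all_wavs` and the corresponding labels into `all_labs`. The labels are either `yes` or `no`. Each clip is loaded at 16 kHz, and we keep only the clips that are exactly 16,000 samples (one second) long.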

We'll also print the number of examples in `all_wavs`.

In [4]:
import os

directory = 'speech_commands/'

all_wavs = []
all_labs = []
for label in ['yes', 'no']:
    print(label)
    wavs = [f for f in os.listdir(directory + label) if f.endswith('.wav')]
    for wav in wavs:
        samples, sample_rate = librosa.load(os.path.join(directory, label, wav), sr=16000)
        if len(samples) == 16000:  # keep only full one-second clips
            all_wavs.append(samples)
            all_labs.append(label)
print(len(all_wavs))

yes
no
4255
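
The total alone doesn't tell us how many clips carry each label, and that matters later: the share of the most common label is the accuracy a model would get by always guessing that label. Here is a small sketch of how we could check it. The list `example_labels` is made-up stand-in data, not the real contents of `all_labs`.

```python
from collections import Counter

def label_counts(labels):
    """Count how many examples carry each label."""
    return Counter(labels)

# Hypothetical stand-in for all_labs; in the notebook we would pass all_labs itself.
example_labels = ['yes', 'no', 'yes', 'yes', 'no']

counts = label_counts(example_labels)
# Accuracy of a model that always predicts the most common label.
majority_share = max(counts.values()) / len(example_labels)
print(counts)
print(majority_share)
```

In the notebook, `label_counts(all_labs)` would need to run before the next cell, which turns the string labels into booleans.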


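Below we split the data into training and test sets, holding out 20% of the examples for testing. `X_train` holds the audio clips used for training and `y_train` holds their labels. `X_test` and `y_test` are the test clips and their labels, respectively. The labels are also converted to booleans, with `True` meaning `yes`.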

In [5]:
from sklearn.model_selection import train_test_split
 
all_wavs = np.array(all_wavs).reshape(-1,16000,1)
all_labs = np.array([lab == 'yes' for lab in all_labs])
X_train, X_test, y_train, y_test = train_test_split(all_wavs,all_labs,test_size = 0.2)
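
To make the idea behind the split concrete, here is a minimal pure-Python sketch of a shuffled 80/20 split. It is not scikit-learn's implementation: the function `shuffled_split` and its `seed` parameter are our own names, and the data is a stand-in with the same number of examples as the notebook loaded.

```python
import random

def shuffled_split(items, labels, test_size=0.2, seed=0):
    """Shuffle indices with a fixed seed, then hold out a test_size fraction for testing."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # fixed seed -> same split on every run
    n_test = round(test_size * len(items))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Stand-in data: 4255 fake "clips" with alternating boolean labels.
clips = list(range(4255))
labels = [i % 2 == 0 for i in clips]
X_tr, X_te, y_tr, y_te = shuffled_split(clips, labels)
print(len(X_tr), len(X_te))  # 3404 851
```

With scikit-learn itself, passing `random_state` to `train_test_split` fixes the split in the same way, so the results can be reproduced from run to run.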

Since we're not using convolutions, we'll just reshape the data to be flat vectors.

In [10]:
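# Flatten each clip from (16000, 1) to (16000,); -1 lets NumPy infer the size,
# so the cell keeps working if the number of examples changes.
X_train = X_train.reshape(len(X_train), -1)
X_test = X_test.reshape(len(X_test), -1)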

In the following lines, we put together the layers of our model: a single hidden `Dense` layer with 200 ReLU units, followed by a one-unit sigmoid output.

In [45]:
!pip install keras==2.3.1
from keras.layers import Input, Dense  # no convolution or pooling layers needed here
from keras.models import Model
 
inputs = Input(shape=(16000,))

x = inputs

x = Dense(200, activation='relu')(x)
x = Dense(1, activation='sigmoid')(x)

outputs = x 
 
model = Model(inputs, outputs)

print(model.summary())

Model: "model_22"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_22 (InputLayer)        (None, 16000)             0         
_________________________________________________________________
dense_70 (Dense)             (None, 200)               3200200   
_________________________________________________________________
dense_71 (Dense)             (None, 1)                 201       
Total params: 3,200,401
Trainable params: 3,200,401
Non-trainable params: 0
_________________________________________________________________
None
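
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is our own name, written only to show the arithmetic.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units

hidden = dense_params(16000, 200)  # 16000 * 200 weights + 200 biases
output = dense_params(200, 1)      # 200 weights + 1 bias
total = hidden + output
print(hidden, output, total)  # 3200200 201 3200401
```

Almost all of the weights sit in the first layer, because every one of the 16,000 input samples is connected to every hidden unit.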


We then `fit` the model. We use `mean_squared_error` as the `loss` and `adam` as the `optimizer` that updates the weights. We iterate through the training data 15 times. After each pass, or `epoch`, Keras prints the `accuracy` and `loss` of our model so far. Finally, we `evaluate` the model on the test data.

In [46]:
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])

model.fit(X_train, y_train, epochs=15, batch_size=32)

model.evaluate(X_test, y_test)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


[0.38268632302553196, 0.533490002155304]
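
The two numbers returned by `model.evaluate` are the test loss and the test accuracy, in that order: a mean squared error of about 0.38 and an accuracy of about 53%. To see what those metrics measure, here is a small sketch that computes both by hand. The labels and predictions are made up for illustration, and we assume the usual 0.5 threshold for turning a sigmoid output into a yes/no decision.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between labels (0 or 1) and predicted probabilities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum((p > threshold) == bool(t) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

y_true = [1, 0, 1, 0]          # hypothetical labels (1 means "yes")
y_pred = [0.9, 0.8, 0.4, 0.1]  # hypothetical sigmoid outputs

print(mean_squared_error(y_true, y_pred))
print(accuracy(y_true, y_pred))

# Reference point: always predicting 0.5 gives a squared error of 0.25 per example.
print(mean_squared_error(y_true, [0.5] * len(y_true)))
```

For comparison, a model that always outputs 0.5 would score a mean squared error of 0.25, so a test loss of 0.38 together with an accuracy near 50% suggests the dense network has not learned anything that carries over to new clips.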