# Activation word Neural Network

## Summary
In this notebook I will complete two different tasks:
1. Create a large data set that combines positive and negative example words to train the neural network
2. Design and test the neural network

First, as in the previous notebook, I will import the libraries needed to create the data set. The documentation for each library is:
* For pyaudio: [main page](https://people.csail.mit.edu/hubert/pyaudio/) and [documentation](https://people.csail.mit.edu/hubert/pyaudio/docs/)
* For NumPy: I used version [1.22.4](https://numpy.org/doc/1.22/)
* For Random: [documentation](https://docs.python.org/3/library/random.html)

In [1]:
import numpy as np
import random

Next, I load the NumPy arrays saved during the previous recording sessions into two lists (act, neg) and then turn the lists into NumPy arrays themselves. If every audio clip was recorded correctly, each array will have the shape (number of examples, samples per recording) and the cells below should run without errors.

In [2]:
act, neg = [], []
for i in range(15):
    act.append(np.load("./Act_train/activation/"+str(i+1)+".npy"))
for i in range(120):
    neg.append(np.load("./Act_train/negative/"+str(i+1)+".npy"))

In [3]:
act, neg = np.array(act), np.array(neg)
print(act.shape)
print(neg.shape)

(15, 44032)
(120, 44032)
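
The shape check above relies on every recording having the same length. A minimal sketch of making that check explicit before stacking is shown here; `stack_recordings` is a hypothetical helper that is not part of the original notebook, and the demo uses synthetic arrays instead of the real files.

```python
import numpy as np

def stack_recordings(recordings):
    """Stack equally sized 1-D recordings into a (n_examples, n_samples) array.

    Hypothetical helper: raises a ValueError naming the first recording
    whose length differs from the first one.
    """
    expected = len(recordings[0])
    for idx, rec in enumerate(recordings):
        if len(rec) != expected:
            raise ValueError(
                f"recording {idx + 1} has {len(rec)} samples, expected {expected}"
            )
    return np.stack(recordings)

# Synthetic stand-ins for three one-second recordings of 44032 samples each.
demo = [np.zeros(44032, dtype=np.float32) for _ in range(3)]
stacked = stack_recordings(demo)
print(stacked.shape)
```

A clip with the wrong length then fails loudly at load time, with the file number in the message.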


Here I declare three variables. The first two, length and width, will be used later when I build the neural network. The third, number_of_act, is used by the function "create_training_example": it holds the number of activation examples, so that a counter can walk from the first example to the last and wrap around without ever going out of range. That way each activation example can be included at least once.

In [4]:
length = 688
width = 64
number_of_act=act.shape[0]
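
The values of length and width are tied to the recordings: the model later reshapes each example to (width, length), so their product has to equal the number of samples in one recording. A quick check, using the 44032 samples reported by the shape printout:

```python
length = 688
width = 64
samples_per_example = 44032  # second dimension of act and neg

# A reshape to (width, length) only works if the sizes multiply out exactly.
assert width * length == samples_per_example
print(width * length)
```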

In [5]:
def create_training_example(activates, negatives, num_active):
    """
    Creates a single training example at random from the activates and negatives NumPy arrays.
    
    Arguments:
    activates -- a NumPy array of recordings of the activation word you chose.
    negatives -- a NumPy array of recordings of random words that are not the activation word.
    num_active -- the index of the next activation example to use.

    Returns:
    x -- the example that was chosen.
    y -- a two-element list [yes, no] that marks whether the example is the activation word.
    num_active -- the updated index of the next activation example.
    """
    # Label layout: [yes, no]
    y = [0,0]
    x = None
    # flag is an integer from 1 to 99; multiples of 5 select an activation example
    flag = np.random.randint(1, 100)
    if flag%5==0:
        x = activates[num_active]
        y[0] = 1
        num_active = (num_active+1)%number_of_act
    else:
        x = negatives[np.random.choice(len(negatives))]
        y[1] = 1
    
    return x, y, num_active

In [6]:
x, y, num_active = create_training_example(act, neg, 0)
print("Type of x and y:",type(x), type(y))
print("Size of x and y:",x.shape, len(y))
print("Values of variables: ")
print(" X:", x)
print(" Y:", y)
print(" Id of activation:", num_active)

Type of x and y: <class 'numpy.ndarray'> <class 'list'>
Size of x and y: (44032,) 2
Values of variables: 
 X: [-0.00866699 -0.0085144  -0.00814819 ...  0.0123291   0.01123047
  0.01004028]
 Y: [0, 1]
 Id of activation: 0
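
The `flag%5==0` test decides how often an activation example is drawn. Since `np.random.randint(1, 100)` returns integers from 1 to 99 (the upper bound is exclusive), the share of positives can be counted directly. A small sketch in plain Python:

```python
# Values that np.random.randint(1, 100) can return: 1..99 inclusive.
possible_flags = range(1, 100)
positive_flags = [f for f in possible_flags if f % 5 == 0]

p_positive = len(positive_flags) / len(possible_flags)
print(len(positive_flags), len(possible_flags))  # 19 99
print(round(p_positive, 3))                      # 0.192

# Expected number of activation examples in a set of 2000 draws.
print(round(2000 * p_positive))                  # 384
```

So roughly one example in five is positive, and with 15 activation recordings each of them is expected to appear about 25 times in a 2000-example set.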


This is the last step of the first task: building the training set and the development/test set. First things first, I need to decide how big each set should be. I settled on two thousand examples each, mainly because my computer doesn't have enough power to build more.

In [7]:
nsamples = 2000

Next comes the creation of each set. Both procedures are practically the same, and each one saves its set at the end in case you want to reuse it later. I also added a cell that prints the shapes of x and y and checks that the set contains at least one positive example.

In [None]:
X = []
Y = []
z = 0
for i in range(0, nsamples):
    if i%100 == 0:
        print(i)
    x, y, z= create_training_example(act, neg, z)
    X.append(x)
    Y.append(y)
X = np.array(X)
Y = np.array(Y).reshape((nsamples, 2))
# Save the data for later use
np.save('./Act_train/XY_train/X_npy.npy', X)
np.save('./Act_train/XY_train/Y_npy.npy', Y)

In [9]:
print("X and Y dimensions: ",X.shape, Y.shape)
print("Are there any positive examples in the training set?")
print("Yes" if 1 in Y[:,0] else "No")

X and Y dimensions:  (2000, 44032) (2000, 2)
Are there any positive examples in the training set?
Yes


In [None]:
X_dev = []
Y_dev = []
z = 0
for i in range(0, nsamples):
    if i%100 == 0:
        print(i)
    x, y, z= create_training_example(act, neg, z)
    X_dev.append(x)
    Y_dev.append(y)
X_dev = np.array(X_dev)
Y_dev = np.array(Y_dev).reshape((nsamples, 2))
# Save the data for later use
np.save('./Act_train/XY_dev/X_dev_npy.npy', X_dev)
np.save('./Act_train/XY_dev/Y_dev_npy.npy', Y_dev)

In [10]:
print(X_dev.shape)
print(Y_dev.shape)
print("Are there any positive examples in the dev set?")
print("Yes" if 1 in Y_dev[:,0] else "No")

(2000, 44032)
(2000, 2)
Are there any positive examples in the dev set?
Yes
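
Since the training and dev sets are built by the same loop, the loop can be wrapped in a single function. The following is only a sketch under assumptions: `build_set` and `pick_example` are hypothetical names, the fixed 0.2 positive rate approximates the original `flag%5` rule, and the demo uses tiny synthetic arrays rather than the recordings.

```python
import numpy as np

def pick_example(activates, negatives, next_act, rng, p_positive=0.2):
    """Return (x, y, next_act) using the [yes, no] label layout."""
    if rng.random() < p_positive:
        x = activates[next_act]
        y = [1, 0]
        next_act = (next_act + 1) % len(activates)  # cycle through positives
    else:
        x = negatives[rng.integers(len(negatives))]
        y = [0, 1]
    return x, y, next_act

def build_set(activates, negatives, n_samples, seed=None):
    """Build X of shape (n_samples, n_features) and Y of shape (n_samples, 2)."""
    rng = np.random.default_rng(seed)
    X, Y, next_act = [], [], 0
    for _ in range(n_samples):
        x, y, next_act = pick_example(activates, negatives, next_act, rng)
        X.append(x)
        Y.append(y)
    return np.array(X), np.array(Y)

# Synthetic demo: 3 "activation" rows of ones, 5 "negative" rows of zeros.
act_demo = np.ones((3, 8), dtype=np.float32)
neg_demo = np.zeros((5, 8), dtype=np.float32)
X_demo, Y_demo = build_set(act_demo, neg_demo, n_samples=50, seed=0)
print(X_demo.shape, Y_demo.shape)
```

Calling it once per set with a different seed gives two sets that can be rebuilt exactly later.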


The next cell loads previously saved sets, in case you want to reuse them instead of building new ones.

In [8]:
X = np.load("./Act_train/XY_train/X_npy.npy")
Y = np.load("./Act_train/XY_train/Y_npy.npy")
X_dev = np.load("./Act_train/XY_dev/X_dev_npy.npy")
Y_dev = np.load("./Act_train/XY_dev/Y_dev_npy.npy")

Here starts the second task of this notebook. First of all, I should point out that the base structure of this network was taken from one of the assignments of the Deep Learning Specialization offered by DeepLearning.AI on Coursera. Thank you so much to Professor Andrew Ng and everyone who worked on that specialization.

The reason is that the base structure was designed for a model that detects an activation word inside a 10-second audio file, so I thought I could build on it to make a network that detects the word in almost real time using 1-second audio clips.

First, I import all the TensorFlow components that I'll need. The documentation for TensorFlow is:
* [Documentation(tf.keras)](https://www.tensorflow.org/api_docs/python/tf/keras)

In [11]:
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import Model, load_model, Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, Input, Masking, TimeDistributed, LSTM, Conv1D
from tensorflow.keras.layers import GRU, Bidirectional, BatchNormalization, Reshape, Flatten
from tensorflow.keras.optimizers import Adam

Next comes my very first NN, which I had to tune until it did what I needed. Here is my experience experimenting with it. (Skip this part if you want; the explanation of how the model works comes right after.)
At first, I tried to use .wav files that I had to transform into spectrograms, but the process was too long: it would have needed a decent amount of computing power to hear me, create a wave file, read that file, transform it into a spectrogram, and feed it to the network.
Because of that, I decided to use the NumPy arrays directly, since (from my point of view) they need very little preprocessing. In real time the process is just to hear me, turn a byte array into floats, and feed them to the network, with no need to write or read anything on disk. And up till now, it has worked well.

The model is structured this way:
* The input layer: 
 1. A reshape of the flat input into a (width, length) = (64, 688) array, which helps the computer make the calculations faster.
 2. A convolutional layer of 1 dimension.
 3. A batch normalization that standardizes the outputs of the convolution.
 4. An activation using a "relu" function, which sets the negative values to zero.
 5. A Dropout with a rate of 0.85, which randomly zeroes 85% of the values during training.
* The hidden layers: every hidden layer is composed of a GRU, a Dropout, and a BatchNormalization.
 * After some high-bias results I ended up making the network deeper and decreasing the number of units, because I noticed that with every new hidden layer the accuracy went up and the loss went down. I assumed this was because the important characteristics became more evident with every new layer.
* The output layer: a TimeDistributed Dense layer, which was an important part of the base model, followed by a Flatten of its result and a final Dense layer.
 * The 2 inside the last Dense layer refers to the two labels of the data:
   * "Yes", it's the activation word.
   * "No", it's not the activation word.
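
To make the dropout rates in this list concrete: a rate of 0.85 means that about 85% of the values are set to zero during training, and Keras scales the surviving values by 1 / (1 - rate) so that the expected sum stays the same. A NumPy sketch of that behaviour follows; it is an illustration, not the Keras implementation, and `dropout_sketch` is a hypothetical name.

```python
import numpy as np

def dropout_sketch(values, rate, rng):
    """Zero out roughly `rate` of the values and rescale the rest."""
    keep_mask = rng.random(values.shape) >= rate
    return values * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones(10_000)
dropped = dropout_sketch(activations, rate=0.85, rng=rng)

kept_fraction = np.count_nonzero(dropped) / dropped.size
print(kept_fraction)    # close to 0.15
print(dropped.mean())   # close to 1.0
```

At prediction time dropout is inactive, so the network sees all of its units.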

In [12]:
def modelf(input_shape):
    """
    Creates an experimental model of a NN that detects an activation word based on a NumPy array
    
    Argument:
    input_shape -- shape of the model's input data (using Keras conventions)

    Returns:
    model -- Keras model instance
    """
    
    X_input = Input(shape = input_shape)
    
    X = Reshape((width, length))(X_input)
    X = Conv1D(filters = 196, kernel_size=5, strides=2)(X)
    X = BatchNormalization()(X)
    X = Activation("relu")(X)
    X = Dropout(rate=0.85)(X)                                  

    X = GRU(units = 128, return_sequences=True)(X)
    X = Dropout(rate = 0.85)(X)
    X = BatchNormalization()(X)                           
    
    X = GRU(units = 128, return_sequences=True)(X)
    X = Dropout(rate = 0.85)(X)       
    X = BatchNormalization()(X)
    
    X = GRU(units = 128, return_sequences=True)(X)
    X = Dropout(rate = 0.85)(X)       
    X = BatchNormalization()(X) 
    
    X = GRU(units = 60, return_sequences=True)(X)
    X = Dropout(rate = 0.85)(X)       
    X = BatchNormalization()(X) 
    
    X = GRU(units = 60, return_sequences=True)(X)
    X = Dropout(rate = 0.85)(X)       
    X = BatchNormalization()(X) 
    
    X = GRU(units = 30, return_sequences=True)(X)
    X = Dropout(rate = 0.90)(X)       
    X = BatchNormalization()(X)  
    
    X = TimeDistributed(Dense(2, activation = "sigmoid"))(X)
    X = (Dense(2, activation = "sigmoid"))(Flatten()(X))

    model = Model(inputs = X_input, outputs = X)
    
    return model

Here we call the function and pass it the second dimension of the data set, which is the size of each example. If the model was built correctly, the next cell shows a summary of its layers.

In [13]:
model = modelf(input_shape = (X.shape[1],))  # trailing comma makes this a one-element tuple

In [14]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 44032)]           0         
                                                                 
 reshape (Reshape)           (None, 64, 688)           0         
                                                                 
 conv1d (Conv1D)             (None, 30, 196)           674436    
                                                                 
 batch_normalization (BatchN  (None, 30, 196)          784       
 ormalization)                                                   
                                                                 
 activation (Activation)     (None, 30, 196)           0         
                                                                 
 dropout (Dropout)           (None, 30, 196)           0         
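
The Conv1D and BatchNormalization rows of the summary can be reproduced by hand. With the default "valid" padding, the output length is floor((steps - kernel_size) / strides) + 1, and the parameter count is kernel_size * input_channels * filters plus one bias per filter. A short check with the values used in modelf:

```python
steps, channels = 64, 688          # shape after Reshape((width, length))
filters, kernel_size, strides = 196, 5, 2

# Output length for padding="valid" (the Keras default for Conv1D).
out_steps = (steps - kernel_size) // strides + 1

# One kernel of size kernel_size x channels per filter, plus a bias each.
conv_params = kernel_size * channels * filters + filters

# BatchNormalization keeps gamma, beta, moving mean and moving variance.
bn_params = 4 * filters

print(out_steps, conv_params, bn_params)  # 30 674436 784
```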
                                                             

Just before training, I set up an Adam optimizer with a learning rate of 1e-3 because, after experimenting with several values, this one worked best for the model. I left the betas at 0.9 and 0.999 because (as far as I understand it) those are the values most people use.
The loss function is binary cross-entropy because that is what the base model uses.

In [15]:
opt = Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.999)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])
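
For reference, binary cross-entropy averages -[y log(p) + (1 - y) log(1 - p)] over the output units. A plain-Python sketch of that formula follows; it is not the Keras implementation (which also protects against log(0)), and the probabilities are illustrative values rather than model outputs.

```python
import math

def binary_crossentropy(y_true, y_pred):
    """Mean binary cross-entropy over the output units of one example."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(y_true, y_pred)
    ]
    return sum(losses) / len(losses)

# Label [0, 1] means "not the activation word".
confident_right = binary_crossentropy([0, 1], [0.01, 0.99])
confident_wrong = binary_crossentropy([0, 1], [0.99, 0.01])
print(round(confident_right, 4), round(confident_wrong, 4))
```

A confident correct output gives a loss near zero, while a confident wrong one is penalized heavily.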

The moment of truth: I will finally train and test the model. I found that the model gets its best results when it runs through the training set 30 times (30 epochs) in batches of 5. After that, the next cell evaluates it on the dev/test set.

In [16]:
model.fit(X, Y, batch_size = 5, epochs=30)

Epoch 1/30
...
Epoch 30/30


<keras.callbacks.History at 0x140bfd03490>

In [17]:
loss, acc = model.evaluate(X_dev, Y_dev)
print("Dev set accuracy = ", acc)

Dev set accuracy =  1.0
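
With roughly one positive in five examples, accuracy alone can hide how the positive class is doing. A sketch of counting true and false positives from one-hot labels and model outputs follows; `positive_class_report` is a hypothetical helper, and the arrays are made-up stand-ins for Y_dev and model.predict(X_dev).

```python
import numpy as np

def positive_class_report(y_true, y_prob):
    """Precision and recall for the 'yes' column (index 0) of one-hot labels."""
    true_yes = y_true[:, 0] == 1
    pred_yes = np.argmax(y_prob, axis=1) == 0
    tp = int(np.sum(true_yes & pred_yes))
    fp = int(np.sum(~true_yes & pred_yes))
    fn = int(np.sum(true_yes & ~pred_yes))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels and predictions for four examples.
y_true_demo = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
y_prob_demo = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]])
print(positive_class_report(y_true_demo, y_prob_demo))  # (0.5, 0.5)
```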


This part is for saving and loading the model if you want to continue testing it.

In [None]:
model.save("model/act_word")

In [None]:
model = load_model("model/act_word")

Here is where I tested by hand how well it works: first by taking an example from one of the sets and comparing the prediction with the real label, and then by using a fresh example, recording a word (either the activation word or a negative one) to see the result.

In [18]:
model.predict(X[1].reshape((1,X.shape[1])))



array([[1.655904e-04, 9.998277e-01]], dtype=float32)

In [19]:
Y[1]

array([0, 1])
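
The two outputs follow the label layout used when building the sets: index 0 is "yes" and index 1 is "no". A sketch of turning a prediction row into a label follows; `decode_prediction` is a hypothetical helper, and the example reuses the values printed above.

```python
import numpy as np

LABELS = ("yes", "no")  # same order as y = [yes, no] in the data set

def decode_prediction(pred_row):
    """Return the label with the highest score and that score."""
    idx = int(np.argmax(pred_row))
    return LABELS[idx], float(pred_row[idx])

pred = np.array([[1.655904e-04, 9.998277e-01]], dtype=np.float32)
label, score = decode_prediction(pred[0])
print(label, round(score, 4))  # no 0.9998
```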

Here are the cells needed to try the model on a fresh recording.

In [None]:
import pyaudio

CHUNK = 1024
FORMAT = pyaudio.paFloat32
CHANNELS = 1
RATE = 44100
RECORD_SECONDS = 1.01

p = pyaudio.PyAudio()

inputs = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK)

In [None]:
frames = bytearray()
print("* recording")
for j in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
    print(j)
    frames += inputs.read(CHUNK)
frames = np.frombuffer(frames, dtype = "float32")
print("\nPrediction")
model.predict(frames.reshape((1,frames.shape[0])))
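
The recording settings produce exactly the input size the model expects. The loop reads `int(RATE / CHUNK * RECORD_SECONDS)` chunks of CHUNK samples each, which can be checked without a microphone:

```python
CHUNK = 1024
RATE = 44100
RECORD_SECONDS = 1.01

# Number of whole chunks read by the recording loop.
n_chunks = int(RATE / CHUNK * RECORD_SECONDS)

# Total float32 samples handed to the model.
n_samples = n_chunks * CHUNK
print(n_chunks, n_samples)  # 43 44032
```

The 44032 samples match the second dimension of the training examples, so the recording can be reshaped to (1, 44032) and passed straight to model.predict.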