# <center> Simple speech recognition utilizing MFCC </center>

## Import necessities
First, we import the helper functions from $\texttt{preprocess.py}$, as well as Keras and Matplotlib.

In [None]:
from preprocess import *  # Import all the preprocessing functions
import numpy as np  # Used explicitly below (np.argmax, np.random)
import keras  # Keras as frontend
from keras.models import Sequential  # Sequential NN model
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D  # Useful layers to be used
from keras.utils import to_categorical  # One-hot encoding
import matplotlib.pyplot as plt  # Graphical representations and plotting

## Preprocessing
We want matrices of dimensions $20\times20$. This choice is somewhat arbitrary, but fewer MFCCs carry less information, so $20\times20$ is a reasonable size.
We call the `transformData` function, which applies the MFC transformation to each of the $\textit{.wav}$ files in the folders within $\texttt{./data/}$, packs the results into tensors, and saves them as $\textit{.npy}$ files.
`getTrainTestData` loads these files, concatenates all of the matrices into one tensor and their labels into one array, then calls `train_test_split` from $\textit{scikit-learn}$ to obtain the training and test sets.

In [None]:
# The dimensions of matrices MFC transform
# produces from .wav files
M = 20
N = 20

# Save data to array file first
transformData(dim=[M,N])

# Loading train set and test set
X_train, X_test, y_train, y_test = getTrainTestData(splitRatio=0.6)
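
How `transformData` brings every recording to the same $M\times N$ shape is hidden inside $\texttt{preprocess.py}$. One common approach, shown here as a minimal sketch and not necessarily what $\texttt{preprocess.py}$ does, is to zero-pad short MFCC matrices and truncate long ones along the time axis. The helper name `fit_to_width` is hypothetical.

```python
import numpy as np

def fit_to_width(mfcc, width):
    """Zero-pad or truncate a (coefficients x frames) matrix to `width` frames.

    Hypothetical helper for illustration; not taken from preprocess.py.
    """
    frames = mfcc.shape[1]
    if frames < width:
        # Pad only on the right of the time axis with zeros
        return np.pad(mfcc, ((0, 0), (0, width - frames)), mode='constant')
    # Keep the first `width` frames
    return mfcc[:, :width]

short = np.ones((20, 12))   # recording with too few frames
long_ = np.ones((20, 31))   # recording with too many frames
print(fit_to_width(short, 20).shape, fit_to_width(long_, 20).shape)
```

Both calls return a $20\times20$ matrix, so recordings of different lengths can be stacked into one tensor.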

Verify the dimensions of the resulting arrays.

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

## Prepare everything for the network

We define several auxiliary variables that determine how the network is built and trained, such as the number of epochs the network is trained for, the batch size of the data provided to the network, and the number of output classes (output nodes). Then we reshape the datasets into 4D tensors so that Keras accepts them: the first dimension is the number of matrices in the dataset, the second and third are the dimensions $M$ and $N$ of those matrices, and the fourth is the number of channels per matrix (in the sense of the RGB channels of a color image; here it is simply 1).

In [None]:
# Training features
channels = 1 # Fourth dimension of the data
             # network is receiving
epochs = 100 # Iterate epochs times over
             # training dataset
batchSize = 100   # Train network in batches
numOfClasses = 5  # This is the number of output
                  # neurons and depends on classes
                  # of .wav files which were provided
                  # from Kaggle dataset

# The CNN expects a 4D tensor as input: M x N "pictures"
# with `channels` channels each
X_train = X_train.reshape(X_train.shape[0], M, N, channels)
X_test = X_test.reshape(X_test.shape[0], M, N, channels)

# One-hot encoding of outputs 
y_train_hot = to_categorical(y_train)
y_test_hot = to_categorical(y_test)
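
To make the two transformations concrete, here is a small NumPy-only sketch on made-up data. It is an illustration of the idea, not a replacement for the Keras call: the reshape adds the trailing channel axis, and one-hot encoding is done by selecting rows of an identity matrix.

```python
import numpy as np

M, N, channels = 20, 20, 1

# Toy stand-in for the real dataset: 3 random "MFCC matrices"
X = np.random.rand(3, M, N)
y = np.array([0, 2, 4])      # integer class labels
num_classes = 5

# Add the trailing channel axis expected by Conv2D
X4d = X.reshape(X.shape[0], M, N, channels)

# One-hot encode by picking rows of the identity matrix
y_hot = np.eye(num_classes)[y]

print(X4d.shape)
print(y_hot)
```

Note that `to_categorical` infers the number of classes from the largest label when `num_classes` is not given, so passing `num_classes=numOfClasses` explicitly may be safer if a split could miss the highest class.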

## Define the network architecture

Next we define a function that returns a `Sequential` NN model whose architecture is given by the layers we stack: mostly convolutional layers, a max-pooling layer that downsizes the intermediate features to reduce the computational cost, dropout layers that randomly deactivate units during training to limit overfitting, and a fully connected output layer with $\texttt{numOfClasses}$ nodes. We train the network with the standard `keras.losses.categorical_crossentropy` loss function and track accuracy as the metric.
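
As a rough illustration of what the softmax output and the categorical cross-entropy loss compute, the following NumPy sketch evaluates both on made-up numbers. The function names and the clipping constant `eps` are choices made for this example, not the Keras internals.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true_hot, y_prob, eps=1e-7):
    # Mean over samples of -sum_k y_k * log(p_k)
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(np.mean(-np.sum(y_true_hot * np.log(y_prob), axis=1)))

# Two made-up samples with 5 classes each
logits = np.array([[2.0, 0.5, 0.1, 0.1, 0.1],
                   [0.0, 0.0, 3.0, 0.0, 0.0]])
y_true = np.eye(5)[[0, 2]]

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs.sum(axis=1), loss)
```

A network that assigns equal probability to all 5 classes would score a loss of $\ln 5 \approx 1.61$, which is a useful baseline when reading the training logs.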

We define one more function, which takes a path to a $\textit{.wav}$ file and a NN model and returns the network's prediction of the word spoken in that audio file.

In [None]:
""" CONSTRUCT_NETWORK
    Function which forms a sequential CNN network with
    provided layers
    
    input:  none
    output: model - constructed NN
"""
def constructNetwork():
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(2, 2), activation='relu', input_shape=(M, N, channels)))
    model.add(Conv2D(48, kernel_size=(2, 2), activation='relu'))
    model.add(Conv2D(120, kernel_size=(2, 2), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.4))
    model.add(Dense(numOfClasses, activation='softmax'))
    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer='Adam',
                  metrics=['accuracy'])
    return model

""" PREDICT
    For passed .wav file as input argument, function calculates its
    MFCC, passes them to CNN and provides estimation of class the spoken
    word belongs to
    
    input:  filePath - path to .wav file
            model - NN model to pass the MFCCs
    output: label (word) of the class the network considers most likely
"""
def predict(filePath, model):
    # Process the .wav file
    sample = audio2mfcc(filePath)
    # reshape it so it matches the CNN's input dimensions
    sampleReshaped = sample.reshape(1, M, N, channels)
    # return the label of the output neuron which produces
    # largest output
    return getLabels()[0][np.argmax(model.predict(sampleReshaped))]
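
It can help to trace how the feature-map size and the parameter count evolve through the layers above. The sketch below does this arithmetic in plain Python, assuming the Keras defaults for layers that do not override them ('valid' padding and stride 1 for `Conv2D`, stride equal to the pool size for `MaxPooling2D`) and $M = N = 20$ with 5 output classes. The helper names are made up for this example.

```python
def conv_out(size, kernel, stride=1):
    # 'valid' padding: no border is added
    return (size - kernel) // stride + 1

def conv_params(kernel, in_ch, out_ch):
    # kernel*kernel*in_ch weights per filter, plus one bias per filter
    return (kernel * kernel * in_ch + 1) * out_ch

def dense_params(n_in, n_out):
    # one weight per input plus one bias, for every output unit
    return (n_in + 1) * n_out

side, ch = 20, 1
total = 0
for filters in (32, 48, 120):          # the three Conv2D layers
    total += conv_params(2, ch, filters)
    side, ch = conv_out(side, 2), filters

side = conv_out(side, 2, stride=2)     # 2x2 max pooling
flat = side * side * ch                # size after Flatten

n_in = flat
for units in (128, 64, 5):             # the three Dense layers
    total += dense_params(n_in, units)
    n_in = units

print(side, flat, total)  # 8 7680 1021261
```

Under these assumptions the spatial size shrinks as 20, 19, 18, 17, 8, and most of the roughly one million parameters sit in the first dense layer. Calling `model.summary()` on the constructed model is a way to check the figures.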

## Train the network

We construct the network and train it on the prepared training dataset, using the test set as validation data. Validation does not by itself prevent overfitting, but it lets us monitor it: if the training loss keeps falling while the validation loss starts to rise, the network is overfitting the training set.

In [None]:
# Construct the CNN
model = constructNetwork()
# Train it and validate
history = model.fit(X_train, y_train_hot, batch_size=batchSize, epochs=epochs, verbose=1, validation_data=(X_test, y_test_hot))
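
One common way to act on the validation curve is early stopping: training ends once the validation loss has stopped improving for a number of epochs. The following plain-Python sketch shows the rule on made-up loss values; the function name and the `patience` default are illustrative choices. Keras ships an `EarlyStopping` callback (with `monitor` and `patience` arguments) built on a similar idea.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops once the validation loss has not improved
    for `patience` consecutive epochs.
    """
    best_loss = float('inf')
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Patience never ran out: training used all epochs
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: improve, then drift upward (overfitting)
losses = [1.2, 0.9, 0.7, 0.65, 0.66, 0.70, 0.71, 0.75, 0.80]
print(early_stopping_epoch(losses, patience=3))  # (6, 3)
```

Here the best validation loss occurs at epoch 3 and training would stop at epoch 6, well before the last epoch.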

## Plot the results

In [None]:
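# Plot accuracies and losses
# Older Keras versions store accuracy under 'acc'/'val_acc',
# newer ones under 'accuracy'/'val_accuracy'
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
plt.plot(history.history[acc_key])
plt.title('Model accuracy on training set')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.show()
plt.plot(history.history['val_' + acc_key])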
plt.title('Model accuracy on test set')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.show()
plt.plot(history.history['loss'])
plt.title('Model loss on training set')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.show()
plt.plot(history.history['val_loss'])
plt.title('Model loss on test set')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.show()

## Evaluate network

Now we can see the accuracy of our network on the test set.

In [None]:
eval_loss, eval_acc = model.evaluate(X_test, y_test_hot)
print('Test loss:', eval_loss)
print('Test accuracy (%):', eval_acc*100)
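
A single accuracy number hides which words get confused with each other. As a hedged sketch on made-up predictions (3 classes instead of the 5 used here, and a hand-written `confusion_matrix` helper rather than a library call), accuracy and a confusion matrix can be derived from the softmax outputs like this:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    # Rows: true class, columns: predicted class
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Made-up softmax outputs for 4 samples and 3 classes
probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]])
y_true = np.array([0, 1, 2, 1])

y_pred = probs.argmax(axis=1)              # most likely class per sample
accuracy = float((y_pred == y_true).mean())
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(accuracy)
print(cm)
```

With the real model, `probs` would come from `model.predict(X_test)` and `y_true` from `y_test`; the diagonal of the matrix counts the correct predictions per class.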

## Test
Test the network's prediction on a random audio file from each of the folders.

In [None]:
words,_,_ = getLabels('./data/')

for word in words:
    files = getLabels('./data/'+ word)
    # Pick any file in the folder, including the first and the last one
    random_file = './data/' + word + '/' + files[0][np.random.randint(len(files[0]))]
    print(random_file)
    print(predict(random_file, model))
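
The same selection can be written with the standard library only, without relying on what `getLabels` returns. This is a minimal sketch: the function name `random_file_per_class` is hypothetical, and the folder names `cat` and `dog` are placeholders created in a temporary directory, not the classes of the actual dataset.

```python
import os
import random
import tempfile

def random_file_per_class(data_dir):
    """Map each class folder name to one randomly chosen .wav path."""
    picks = {}
    for word in sorted(os.listdir(data_dir)):
        folder = os.path.join(data_dir, word)
        if not os.path.isdir(folder):
            continue
        wavs = [f for f in os.listdir(folder) if f.endswith('.wav')]
        if wavs:  # skip empty folders instead of failing
            picks[word] = os.path.join(folder, random.choice(wavs))
    return picks

# Demo on a throwaway directory tree with empty placeholder files
with tempfile.TemporaryDirectory() as root:
    for word in ('cat', 'dog'):
        os.makedirs(os.path.join(root, word))
        for i in range(3):
            open(os.path.join(root, word, f'{i}.wav'), 'w').close()
    picks = random_file_per_class(root)
    print(sorted(picks))
```

Each returned path could then be passed to `predict` together with the trained model.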
