# Exercise 9 - Isolated Word Speech Recognition

## a) Isolated Word Speech Recognition Using a CNN

### Program 1 - Implementing isolated word speech recognition on the Speech Commands dataset using a CNN

#### AIM
To build an isolated word speech recognition model using a CNN on the Speech Commands dataset and test it with recorded audio that is not from the dataset.

#### About the Dataset
(2018) Speech commands dataset version 2. Available: http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz

**Note:** Only the words bed, cat and happy are used in this exercise

#### Modules used:

| Modules         | Version   |
| --------------  | --------- |
| tensorflow      |  2.6.0    |
| numpy           |  1.19.5   |
| librosa         |  0.8.1    |
| matplotlib      |  3.4.3    |
| ipython         |  7.26.0   |

#### Neural Network Architecture

| Layer (type) | Output Shape       |
| ------------ | ------------------ |
| Conv2D       | (None, 11, 29, 32) |
| Conv2D       | (None, 9, 27, 48)  |
| Conv2D       | (None, 6, 24, 64)  |
| MaxPooling2D | (None, 1, 6, 64)   |
| Dropout      | (None, 1, 6, 64)   |
| Flatten      | (None, 384)        |
| Dense        | (None, 128)        |
| Dropout      | (None, 128)        |
| Dense        | (None, 64)         |
| Dropout      | (None, 64)         |
| Dense        | (None, 3)          |
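
The output shapes in the table follow from the layer settings used in Part 2: stride-1 convolutions without padding and a 4x4 max pooling layer. As a rough check, the sketch below recomputes them in plain Python. The helper names `conv2d_valid` and `max_pool` are made up for this illustration and are not part of Keras.

```python
def conv2d_valid(shape, kernel):
    """Spatial size after a stride-1 convolution with no ('valid') padding."""
    return tuple(s - k + 1 for s, k in zip(shape, kernel))


def max_pool(shape, pool):
    """Spatial size after non-overlapping pooling (stride equal to the pool size)."""
    return tuple(s // p for s, p in zip(shape, pool))


shape = (12, 30)  # (n_mfcc, t): 12 MFCCs over 30 time frames
shapes = []
for kernel in [(2, 2), (3, 3), (4, 4)]:  # kernel sizes of the three Conv2D layers
    shape = conv2d_valid(shape, kernel)
    shapes.append(shape)

pooled = max_pool(shape, (4, 4))
flattened = pooled[0] * pooled[1] * 64  # 64 channels come out of the last Conv2D

print(shapes, pooled, flattened)
# [(11, 29), (9, 27), (6, 24)] (1, 6) 384
```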

#### Part 1 - Loading the audio files and extracting MFCC features from them

In [1]:
import numpy as np
import librosa 

In [2]:
data_path = "./datasets/isolated_word_dataset"
labels = np.array(["bed","cat","happy"])
n_classes = labels.shape[0]
n_mfcc = 12 # no. of mfc coefficients
t = 30 # no. of time windows on which the mfc coefficients are computed
input_shape = (-1,n_mfcc,t,1)

In [3]:
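def wav2mfcc(file,t,n_mfcc):
    # Load at the file's native sampling rate (sr=None disables resampling)
    data, sr = librosa.load(file,sr=None)
    # Keyword arguments work on librosa 0.8.1 and on newer releases
    # that no longer accept the audio and sampling rate positionally
    mfcc = librosa.feature.mfcc(y=data,sr=sr,n_mfcc=n_mfcc)
    # Truncate or zero-pad along the time axis so every clip has exactly t frames
    if mfcc.shape[1]>t:
        mfcc = mfcc[:,:t]
    if mfcc.shape[1]<t:
        mfcc = np.pad(mfcc,pad_width=((0,0),(0,t- mfcc.shape[1])))
    return mfcc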
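
The truncate-or-pad step in `wav2mfcc` can be tried without any audio files. The sketch below applies the same logic to dummy arrays. `fix_width` and `expected_frames` are hypothetical helper names. The frame-count formula assumes librosa's default hop length of 512 samples with centered frames, and the 16 kHz one-second clip length is also an assumption about the dataset.

```python
import numpy as np


def fix_width(mfcc, t):
    """Truncate or zero-pad a (n_mfcc, frames) array to exactly t frames."""
    if mfcc.shape[1] > t:
        return mfcc[:, :t]
    return np.pad(mfcc, pad_width=((0, 0), (0, t - mfcc.shape[1])), mode="constant")


def expected_frames(n_samples, hop_length=512):
    """Frame count for centered framing (assumed librosa defaults)."""
    return 1 + n_samples // hop_length


short_clip = np.ones((12, 20))  # fewer frames than t: zero-padded on the right
long_clip = np.ones((12, 45))   # more frames than t: cut off after t frames

print(fix_width(short_clip, 30).shape, fix_width(long_clip, 30).shape)
print(expected_frames(16000))   # frames for one second of audio at 16 kHz
```

Under those assumptions a full one-second clip yields slightly more than 30 frames, so `t = 30` would drop only the last few frames of such a clip.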

In [4]:
def load_speech_dataset_features(data_path,labels):
    X,y = [],[]
    for i,label in enumerate(labels):
        # Cache file holding the MFCC features of every clip for this label
        file_name = f"{data_path}/{label}.npy"
        try:
            data = np.load(file_name)
        except FileNotFoundError:
            # No cache yet: extract features from each audio file, then save them
            data = np.array([
                wav2mfcc(file,t=t,n_mfcc=n_mfcc)
                for file in librosa.util.find_files(f"{data_path}/{label}/")
            ])
            np.save(file_name,data)
        X.append(data)
        y.append(np.full((data.shape[0],1),i)) # class index for each clip
    X = np.vstack(X).reshape(input_shape) # (n_clips, n_mfcc, t, 1)
    y = np.vstack(y)==np.arange(n_classes) # y is one hot encoded
    return X,y
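
The one-hot encoding in the last lines of the function relies on NumPy broadcasting: comparing a column of class indices with `np.arange(n_classes)` yields one boolean row per sample. A minimal illustration with made-up indices:

```python
import numpy as np

n_classes = 3
y_idx = np.array([[0], [2], [1]])        # column vector of class indices, shape (3, 1)
one_hot = y_idx == np.arange(n_classes)  # broadcast against shape (3,) gives (3, 3)

print(one_hot.astype(int))
# [[1 0 0]
#  [0 0 1]
#  [0 1 0]]
```

The result is a boolean array. Casting it with `.astype("float32")` is optional but makes the label dtype explicit.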

In [5]:
X,y = load_speech_dataset_features(data_path,labels)

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.2)
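
The split above is random and unseeded, so the class balance of the test set can differ from run to run. As an illustration only, and not the method used in this exercise, the NumPy sketch below shows a seeded, per-class (stratified) split. `stratified_split` is a hypothetical helper and the labels are synthetic.

```python
import numpy as np


def stratified_split(y_idx, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) holding out the same fraction of every class."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for c in np.unique(y_idx):
        idx = rng.permutation(np.flatnonzero(y_idx == c))  # shuffle this class only
        n_test = int(round(test_size * len(idx)))
        test_parts.append(idx[:n_test])
        train_parts.append(idx[n_test:])
    return np.concatenate(train_parts), np.concatenate(test_parts)


y_idx = np.repeat([0, 1, 2], 10)  # synthetic labels: 10 clips per class
train_idx, test_idx = stratified_split(y_idx)

print(len(train_idx), len(test_idx), np.bincount(y_idx[test_idx]))
# 24 6 [2 2 2]
```

With one-hot labels such as `y`, the class indices can be recovered with `y.argmax(axis=1)` before calling a helper like this.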

#### Part 2 - CNN Architecture Design

In [8]:
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.optimizers import Adadelta

In [9]:
def get_model():
    model = Sequential()
    
    model.add(Conv2D(
        32, kernel_size=(2, 2), activation='relu',
        input_shape=(n_mfcc,t , 1)
    ))
    model.add(Conv2D(48, kernel_size=(3, 3), activation='relu'))
    model.add(Conv2D(64, kernel_size=(4, 4), activation='relu'))
    model.add(MaxPooling2D(pool_size=(4, 4)))
    model.add(Dropout(0.25))
    
    model.add(Flatten())
    
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(n_classes, activation='softmax'))
    model.compile(
        loss=CategoricalCrossentropy(),
        optimizer=Adadelta(.3),
        metrics=['accuracy']
    )
    return model
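
The trainable parameter counts of this architecture can be worked out by hand: a Conv2D layer has one weight per kernel cell and input channel plus one bias for each filter, and a Dense layer has one weight per input plus one bias for each unit. The sketch below does the arithmetic. The helper names are invented for this example.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights (kernel_h * kernel_w * in_channels) plus one bias, per filter."""
    return (kernel[0] * kernel[1] * in_channels + 1) * filters


def dense_params(in_units, out_units):
    """One weight per input plus one bias, per output unit."""
    return (in_units + 1) * out_units


conv = [
    conv2d_params((2, 2), 1, 32),
    conv2d_params((3, 3), 32, 48),
    conv2d_params((4, 4), 48, 64),
]
dense = [
    dense_params(384, 128),  # 384 = flattened 1 x 6 x 64 feature map
    dense_params(128, 64),
    dense_params(64, 3),
]
total = sum(conv) + sum(dense)

print(conv, dense, total)
# [160, 13872, 49216] [49280, 8256, 195] 120979
```

The pooling, dropout, and flatten layers have no trainable parameters, so they do not contribute to the total.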

#### Part 3 - Training the model

In [10]:
tf.random.set_seed(0)
model = get_model()
model.summary() # summary() prints the table itself and returns None

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 11, 29, 32)        160       
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 9, 27, 48)         13872     
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 6, 24, 64)         49216     
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 1, 6, 64)          0         
_________________________________________________________________
dropout (Dropout)            (None, 1, 6, 64)          0         
_________________________________________________________________
flatten (Flatten)            (None, 384)               0         
_________________________________________________________________
dense (Dense)                (None, 128)               4

In [11]:
model.fit(
    X_train, y_train, batch_size=50, epochs=50,
    verbose=True, validation_data=(X_test, y_test)
)
tf.keras.models.save_model(model,"./models/isolated_word_speech_recognition_model.h5")
model = tf.keras.models.load_model("./models/isolated_word_speech_recognition_model.h5")

Epoch 1/50
...
Epoch 50/50
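
The epoch log above does not show loss or accuracy values, so one way to summarise performance on held-out data is a confusion matrix. The sketch below is a NumPy-only illustration with made-up numbers. In practice `probs` would be the output of `model.predict(X_test)` and `y_true` would be `y_test`. `confusion_matrix` here is a hypothetical helper, not a library call.

```python
import numpy as np


def confusion_matrix(y_true_onehot, probs):
    """Rows are true classes, columns are predicted classes."""
    n = y_true_onehot.shape[1]
    true_idx = y_true_onehot.argmax(axis=1)
    pred_idx = probs.argmax(axis=1)
    cm = np.zeros((n, n), dtype=int)
    np.add.at(cm, (true_idx, pred_idx), 1)  # count each (true, predicted) pair
    return cm


# Made-up one-hot labels and softmax outputs for four clips
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]], dtype=bool)
probs = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],  # a "cat" clip misclassified as "bed"
])

cm = confusion_matrix(y_true, probs)
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(accuracy)  # 0.75
```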


#### Part 4 - Testing with recorded speech audio

In [12]:
from IPython.display import Audio, display

In [13]:
for label in labels:
    file_name = f"{data_path}/recorded_test_audios/{label}.wav"
    test_audio,sr = librosa.load(file_name,sr=None,mono=True)
    print(f"File Name: {label}.wav")
    display(Audio(test_audio, rate = sr))
    features = wav2mfcc(file_name,t,n_mfcc).reshape(input_shape)
    print("Predicted Output :", labels[np.argmax(model.predict(features))])
    print("-"*40)

File Name: bed.wav


Predicted Output : bed
----------------------------------------
File Name: cat.wav


Predicted Output : cat
----------------------------------------
File Name: happy.wav


Predicted Output : happy
----------------------------------------
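
Because the model has only three output classes, `np.argmax` always returns one of the three words, even for audio that contains none of them. A possible refinement, sketched below with made-up probabilities, is to report the softmax confidence as well and reject low-confidence predictions. `decode_prediction` and its `threshold` value are assumptions for illustration and are not part of the exercise.

```python
import numpy as np

labels = np.array(["bed", "cat", "happy"])


def decode_prediction(probs, labels, threshold=0.5):
    """Map one softmax output to (label or None, confidence)."""
    probs = np.asarray(probs).reshape(-1)  # accept shape (1, n_classes) or (n_classes,)
    best = int(np.argmax(probs))
    confidence = float(probs[best])
    label = str(labels[best]) if confidence >= threshold else None
    return label, confidence


print(decode_prediction([[0.05, 0.9, 0.05]], labels))   # confident prediction
print(decode_prediction([[0.4, 0.35, 0.25]], labels))   # below the threshold, label is None
```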
