# Project: Speech Recognition with RNNs


#### Part 1

Implemented the LSTM RNN cell.

#### Part 2 

- Implemented a Bidirectional RNN.
- Compared vanilla RNN, GRU and LSTM cells, and bidirectional vs. unidirectional RNNs, and reported their accuracy and time cost.

#### Part 3 

- Performed architecture optimisation.
- Improved utilisation of the hidden state sequence output by the RNN, for classification.

### Dataset

Using the Google [*Speech Commands*](https://www.tensorflow.org/tutorials/sequences/audio_recognition) v0.02 [1] dataset.

[1] Warden, P. (2018). [Speech commands: A dataset for limited-vocabulary speech recognition](https://arxiv.org/abs/1804.03209). *arXiv preprint arXiv:1804.03209.*


Set-up code and imports

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive

Drive already mounted at /content/drive

In [None]:
import math
import os
from collections import defaultdict

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.utils.data import Dataset
import numpy as np
from scipy.io.wavfile import read
import librosa
from matplotlib import pyplot as plt

cuda = torch.cuda.is_available()

Tensor = torch.cuda.FloatTensor if cuda else torch.FloatTensor

torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

Data provider class definition.



In [None]:
class SpeechCommandsDataset(Dataset):
    """Google Speech Commands dataset."""

    def __init__(self, root_dir, split):
        """
        Args:
            root_dir (string): Directory with all the data files.
            split    (string): In ["train", "valid", "test"].
        """
        self.root_dir = root_dir
        self.split = split

        self.number_of_classes = len(self.get_classes())

        self.class_to_file = defaultdict(list)

        self.valid_filenames = self.get_valid_filenames()
        self.test_filenames = self.get_test_filenames()

        for c in self.get_classes():
            file_name_list = sorted(os.listdir(self.root_dir + "/data_speech_commands_v0.02/" + c))
            for filename in file_name_list:
                if split == "train":
                    if (filename not in self.valid_filenames[c]) and (filename not in self.test_filenames[c]):
                        self.class_to_file[c].append(filename)
                elif split == "valid":
                    if filename in self.valid_filenames[c]:
                        self.class_to_file[c].append(filename)
                elif split == "test":
                    if filename in self.test_filenames[c]:
                        self.class_to_file[c].append(filename)
                else:
                    raise ValueError("Invalid split name.")

        self.filepath_list = list()
        self.label_list = list()
        for cc, c in enumerate(self.get_classes()):
            f_extension = sorted(list(self.class_to_file[c]))
            l_extension = [cc for i in f_extension]
            f_extension = [self.root_dir + "/data_speech_commands_v0.02/" + c + "/" + filename for filename in f_extension]
            self.filepath_list.extend(f_extension)
            self.label_list.extend(l_extension)
        self.number_of_samples = len(self.filepath_list)

    def __len__(self):
        return self.number_of_samples

    def __getitem__(self, idx):
        sample = np.zeros((16000, ), dtype=np.float32)

        sample_file = self.filepath_list[idx]

        sample_from_file = read(sample_file)[1]
        sample[:sample_from_file.size] = sample_from_file
        sample = sample.reshape((16000, ))
        
        # Experimenting with MFCC function configuration:
        sample = librosa.feature.mfcc(y=sample, sr=16000, hop_length=512, n_fft=2048).transpose().astype(np.float32)
        
        label = self.label_list[idx]

        return sample, label

    def get_classes(self):
        return ['one', 'two', 'three']

    def get_valid_filenames(self):
        class_names = self.get_classes()

        class_to_filename = defaultdict(set)
        with open(self.root_dir + "/data_speech_commands_v0.02/validation_list.txt", "r") as fp:
            for line in fp:
                clean_line = line.strip().split("/")

                if clean_line[0] in class_names:
                    class_to_filename[clean_line[0]].add(clean_line[1])

        return class_to_filename

    def get_test_filenames(self):
        class_names = self.get_classes()

        class_to_filename = defaultdict(set)
        with open(self.root_dir + "/data_speech_commands_v0.02/testing_list.txt", "r") as fp:
            for line in fp:
                clean_line = line.strip().split("/")

                if clean_line[0] in class_names:
                    class_to_filename[clean_line[0]].add(clean_line[1])

        return class_to_filename

Load Dataset 

In [None]:
## Dataset
dataset_folder = "./"

train_dataset = SpeechCommandsDataset(dataset_folder,
                                      "train")
valid_dataset = SpeechCommandsDataset(dataset_folder,
                                      "valid")

test_dataset = SpeechCommandsDataset(dataset_folder,
                                     "test")

batch_size = 20
num_epochs = 3
valid_every_n_steps = 20
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)
valid_loader = torch.utils.data.DataLoader(dataset=valid_dataset,
                                           batch_size=batch_size,
                                           shuffle=False)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

## Part 1 
 
Implementation of LSTMCell, BasicRNNCell and GRUCell
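
As a sanity check on the gate arithmetic, a single LSTM step can be written as a plain function and compared against `torch.nn.LSTMCell`, which stacks its gate weights in the order input, forget, candidate, output (the same order the `chunk(4, 1)` split in `LSTMCell` assumes). This is a minimal sketch: the helper name `lstm_step` and the tensor sizes are illustrative assumptions, not part of the notebook.

```python
import torch
import torch.nn as nn


def lstm_step(x, h_prev, c_prev, w_ih, w_hh, b_ih, b_hh):
    """One LSTM time step; gates are stacked as [input, forget, candidate, output]."""
    gates = x @ w_ih.t() + b_ih + h_prev @ w_hh.t() + b_hh
    i, f, g, o = gates.chunk(4, dim=1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c = f * c_prev + i * g      # additive cell-state update
    h = o * torch.tanh(c)       # gated output
    return h, c


torch.manual_seed(0)
batch, input_size, hidden_size = 4, 20, 8   # illustrative sizes
ref = nn.LSTMCell(input_size, hidden_size)  # reference implementation
x = torch.randn(batch, input_size)
h0 = torch.randn(batch, hidden_size)
c0 = torch.randn(batch, hidden_size)

with torch.no_grad():
    h_ref, c_ref = ref(x, (h0, c0))
    h_new, c_new = lstm_step(x, h0, c0,
                             ref.weight_ih, ref.weight_hh,
                             ref.bias_ih, ref.bias_hh)

max_h_diff = (h_ref - h_new).abs().max().item()
max_c_diff = (c_ref - c_new).abs().max().item()
print("max |h| difference:", max_h_diff)
print("max |c| difference:", max_c_diff)
```

Both differences should be at floating-point noise level. The same comparison can be run against the notebook's `LSTMCell` by copying the reference weights into its `x2h` and `h2h` layers.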

In [None]:
class LSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size, bias=True):
        super(LSTMCell, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bias = bias
        
        self.x2h = nn.Linear(self.input_size, self.hidden_size * 4, bias = self.bias)
        self.h2h = nn.Linear(self.hidden_size, self.hidden_size * 4, bias = self.bias)

        self.reset_parameters()

    def reset_parameters(self):
        std = 1.0 / math.sqrt(self.hidden_size)
        for w in self.parameters():
            w.data.uniform_(-std, std)

    def forward(self, input, hx=None):
        if hx is None:
            hx = input.new_zeros(input.size(0), self.hidden_size, requires_grad=False)
            hx = (hx, hx)
            
        x_t = input
        h_t_p = hx[0]
        c_t_p = hx[1]

        # Define the current input, the sum of the input vector x_t and the previous hidden state vector, h_t_p
        current_input = self.x2h(x_t) + self.h2h(h_t_p)

        # Define the four gates and their activations
        igate, fgate, ggate, ogate = current_input.chunk(4, 1)

        igate = torch.sigmoid(igate)
        fgate = torch.sigmoid(fgate)
        ggate = torch.tanh(ggate)
        ogate = torch.sigmoid(ogate)

        # Compute new cell state
        c_t = fgate * c_t_p + igate * ggate

        # Compute candidate hidden state values to output
        h_cand_t = torch.tanh(c_t)

        # Compute actual hidden state values to output using output gate activations
        h_t = ogate * h_cand_t

        hy = h_t
        cy = c_t

        return (hy, cy)

class BasicRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size, bias=True, nonlinearity="tanh"):
        super(BasicRNNCell, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bias = bias
        self.nonlinearity = nonlinearity
        if self.nonlinearity not in ["tanh", "relu"]:
            raise ValueError("Invalid nonlinearity selected for RNN.")

        self.x2h = nn.Linear(input_size, hidden_size, bias=bias)
        self.h2h = nn.Linear(hidden_size, hidden_size, bias=bias)

        self.reset_parameters()
        

    def reset_parameters(self):
        std = 1.0 / math.sqrt(self.hidden_size)
        for w in self.parameters():
            w.data.uniform_(-std, std)

            
    def forward(self, input, hx=None):
        if hx is None:
            hx = input.new_zeros(input.size(0), self.hidden_size, requires_grad=False)

        hy = self.x2h(input) + self.h2h(hx)
        if self.nonlinearity == 'tanh':
            hy = torch.tanh(hy)
        elif self.nonlinearity == 'relu':
            hy = torch.relu(hy)
            
        return hy

    
    
class GRUCell(nn.Module):
    def __init__(self, input_size, hidden_size, bias=True):
        super(GRUCell, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bias = bias

        self.x2h = nn.Linear(self.input_size, self.hidden_size * 3, bias = self.bias)
        self.h2h = nn.Linear(self.hidden_size, self.hidden_size * 3, bias = self.bias)

        self.reset_parameters()
        

    def reset_parameters(self):
        std = 1.0 / math.sqrt(self.hidden_size)
        for w in self.parameters():
            w.data.uniform_(-std, std)

    def forward(self, input, hx=None):
        if hx is None:
            hx = input.new_zeros(input.size(0), self.hidden_size, requires_grad=False)

        x, h = self.x2h(input), self.h2h(hx)

        # Gates:
        xr, xz, xo = x.chunk(3, 1)
        hr, hz, ho = h.chunk(3, 1)
        rgate = xr + hr
        ugate = xz + hz
        rgate = torch.sigmoid(rgate)
        ugate = torch.sigmoid(ugate)
        hgate = xo + rgate * ho 
        hgate = torch.tanh(hgate)

        hy = ((1 - ugate) * hx) + (ugate * hgate)
        
        return hy

In [None]:
class RNNModel(nn.Module):
    def __init__(self, mode, input_size, hidden_size, num_layers, bias, output_size):
        super(RNNModel, self).__init__()
        self.mode = mode
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.bias = bias
        self.output_size = output_size
        
        self.rnn_cell_list = nn.ModuleList()
        
        if mode == 'LSTM':
            self.rnn_cell_list.append(LSTMCell(self.input_size,
                                               self.hidden_size,
                                               self.bias))
            for l in range(1, self.num_layers):
                self.rnn_cell_list.append(LSTMCell(self.hidden_size,
                                                   self.hidden_size,
                                                   self.bias))


        elif mode == 'GRU':
            self.rnn_cell_list.append(GRUCell(self.input_size,
                                              self.hidden_size,
                                              self.bias))
            for l in range(1, self.num_layers):
                self.rnn_cell_list.append(GRUCell(self.hidden_size,
                                                  self.hidden_size,
                                                  self.bias))

        elif mode == 'RNN_TANH':
            self.rnn_cell_list.append(BasicRNNCell(self.input_size,
                                                   self.hidden_size,
                                                   self.bias,
                                                   "tanh"))
            for l in range(1, self.num_layers):
                self.rnn_cell_list.append(BasicRNNCell(self.hidden_size,
                                                       self.hidden_size,
                                                       self.bias,
                                                       "tanh"))

        elif mode == 'RNN_RELU':
            self.rnn_cell_list.append(BasicRNNCell(self.input_size,
                                                   self.hidden_size,
                                                   self.bias,
                                                   "relu"))
            for l in range(1, self.num_layers):
                self.rnn_cell_list.append(BasicRNNCell(self.hidden_size,
                                                   self.hidden_size,
                                                   self.bias,
                                                   "relu"))
        else:
            raise ValueError("Invalid RNN mode selected.")


        self.att_fc = nn.Linear(self.hidden_size, 1)
        # Classification head: maps a single hidden state to class logits.
        self.fc = nn.Linear(self.hidden_size, self.output_size)

        
    def forward(self, input, hx=None):
        if hx is None:
            if torch.cuda.is_available():
                h0 = Variable(torch.zeros(self.num_layers, input.size(0), self.hidden_size).cuda())
            else:
                h0 = Variable(torch.zeros(self.num_layers, input.size(0), self.hidden_size))

        else:
             h0 = hx

        outs = []

        sequence_length = input.size()[1] 

        # Loop over number of layers:
        for layer in range(self.num_layers):
            if layer == 0:
                # Loop over length of sequence
                for j in range(sequence_length):
                    if j == 0:
                        hx = self.rnn_cell_list[layer].forward(input[:,j,:], None)
                    else:
                        hx = self.rnn_cell_list[layer].forward(input[:,j,:], hx)
                    outs.append(hx)
            else: 
                if self.mode == 'LSTM':
                    # Loop over length of sequence
                    for j in range(sequence_length):
                        if j == 0:
                            outs[j] = self.rnn_cell_list[layer].forward(outs[j][0], None)
                        else:
                            outs[j] = self.rnn_cell_list[layer].forward(outs[j][0], outs[j-1])
                else:
                    # Loop over length of sequence
                    for j in range(sequence_length):
                        if j == 0:
                            outs[j] = self.rnn_cell_list[layer].forward(outs[j], None)
                        else:
                            outs[j] = self.rnn_cell_list[layer].forward(outs[j], outs[j-1])

        if self.mode == 'LSTM':
            # Keep only the hidden state h_t from each (h_t, c_t) pair.
            outs = [outs[i][0] for i in range(len(outs))]

        # Classify from the final hidden state only. This returns logits of shape
        # (batch, output_size), which is what nn.CrossEntropyLoss and torch.max
        # in the training loop expect.
        out = self.fc(outs[-1])

        # Alternative explored for Part 3: one prediction per hidden state.
        # out = [self.fc(h) for h in outs]

        return out
    

class BidirRecurrentModel(nn.Module):
    def __init__(self, mode, input_size, hidden_size, num_layers, bias, output_size):
        super(BidirRecurrentModel, self).__init__()
        self.mode = mode
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.bias = bias
        self.output_size = output_size
        
        self.rnn_cell_list = nn.ModuleList()
        
        self.rnn_cell_list_forward = nn.ModuleList()
        self.rnn_cell_list_backward = nn.ModuleList()
        if mode == 'Bi_LSTM':
            self.rnn_cell_list_forward.append(LSTMCell(self.input_size,
                                               self.hidden_size,
                                               self.bias))
            self.rnn_cell_list_backward.append(LSTMCell(self.input_size,
                                               self.hidden_size,
                                               self.bias))
            for l in range(1, self.num_layers):
                self.rnn_cell_list_forward.append(LSTMCell(self.hidden_size,
                                                   self.hidden_size,
                                                   self.bias))
                self.rnn_cell_list_backward.append(LSTMCell(self.hidden_size,
                                                   self.hidden_size,
                                                   self.bias))

        elif mode == 'Bi_GRU':
            self.rnn_cell_list_forward.append(GRUCell(self.input_size,
                                               self.hidden_size,
                                               self.bias))
            self.rnn_cell_list_backward.append(GRUCell(self.input_size,
                                               self.hidden_size,
                                               self.bias))
            for l in range(1, self.num_layers):
                self.rnn_cell_list_forward.append(GRUCell(self.hidden_size,
                                                   self.hidden_size,
                                                   self.bias))
                self.rnn_cell_list_backward.append(GRUCell(self.hidden_size,
                                                   self.hidden_size,
                                                   self.bias))

        elif mode == 'Bi_RNN_TANH':
            self.rnn_cell_list_forward.append(BasicRNNCell(self.input_size,
                                               self.hidden_size,
                                               self.bias))
            self.rnn_cell_list_backward.append(BasicRNNCell(self.input_size,
                                               self.hidden_size,
                                               self.bias))
            for l in range(1, self.num_layers):
                self.rnn_cell_list_forward.append(BasicRNNCell(self.hidden_size,
                                                   self.hidden_size,
                                                   self.bias))
                self.rnn_cell_list_backward.append(BasicRNNCell(self.hidden_size,
                                                   self.hidden_size,
                                                   self.bias))

        elif mode == 'Bi_RNN_RELU':
            # Pass "relu" explicitly: BasicRNNCell defaults to tanh otherwise.
            self.rnn_cell_list_forward.append(BasicRNNCell(self.input_size,
                                               self.hidden_size,
                                               self.bias,
                                               "relu"))
            self.rnn_cell_list_backward.append(BasicRNNCell(self.input_size,
                                               self.hidden_size,
                                               self.bias,
                                               "relu"))
            for l in range(1, self.num_layers):
                self.rnn_cell_list_forward.append(BasicRNNCell(self.hidden_size,
                                                   self.hidden_size,
                                                   self.bias,
                                                   "relu"))
                self.rnn_cell_list_backward.append(BasicRNNCell(self.hidden_size,
                                                   self.hidden_size,
                                                   self.bias,
                                                   "relu"))
        else:
            raise ValueError("Invalid Bidirectional RNN mode selected.")

        self.att_fc = nn.Linear(self.hidden_size * 2, 1)
        self.fc1 = nn.Linear(self.hidden_size * 2, self.output_size)
        self.fc2 = nn.Linear(self.output_size * 32, self.output_size)
        
        
    def forward(self, input, hx=None):
        if torch.cuda.is_available():
            h0 = Variable(torch.zeros(self.num_layers, input.size(0), self.hidden_size).cuda())
        else:
            h0 = Variable(torch.zeros(self.num_layers, input.size(0), self.hidden_size))
            
        if torch.cuda.is_available():
            hT = Variable(torch.zeros(self.num_layers, input.size(0), self.hidden_size).cuda())
        else:
            hT = Variable(torch.zeros(self.num_layers, input.size(0), self.hidden_size))
            
            
        outs = []
        outs_rev = []
        
        direction = ['forward', 'backward']
        sequence_length = input.size()[1] 

        for d in direction:
            if d == 'forward':
                # Loop over number of layers:
                for layer in range(self.num_layers):
                    if layer == 0:
                        # Loop over length of sequence
                        for j in range(sequence_length):
                            if j == 0:
                                hx = self.rnn_cell_list_forward[layer].forward(input[:,j,:], None)
                            else:
                                hx = self.rnn_cell_list_forward[layer].forward(input[:,j,:], hx)
                            outs.append(hx)
                    else: 
                        if self.mode == 'Bi_LSTM':
                            # Loop over length of sequence
                            for j in range(sequence_length):
                                if j == 0:
                                    outs[j] = self.rnn_cell_list_forward[layer].forward(outs[j][0], None)
                                else:
                                    outs[j] = self.rnn_cell_list_forward[layer].forward(outs[j][0], outs[j-1])
                        else:
                            # Loop over length of sequence
                            for j in range(sequence_length):
                                if j == 0:
                                    outs[j] = self.rnn_cell_list_forward[layer].forward(outs[j], None)
                                else:
                                    outs[j] = self.rnn_cell_list_forward[layer].forward(outs[j], outs[j-1])
                if self.mode == 'Bi_LSTM':
                    outs = [outs[i][0] for i in range(len(outs))]
            elif d == 'backward':
                # Loop over number of layers:
                for layer in range(self.num_layers):
                    if layer == 0:
                        # Loop over length of sequence
                        for j in range(sequence_length-1,-1,-1):
                            if j == sequence_length-1:
                                hx = self.rnn_cell_list_backward[layer].forward(input[:,j,:], None)
                            else:
                                hx = self.rnn_cell_list_backward[layer].forward(input[:,j,:], hx)
                            outs_rev.append(hx)
                    else: 
                        # outs_rev is stored in processing order (index 0 holds the last
                        # frame), so iterating upwards keeps the recurrence running
                        # backwards in time and makes outs_rev[j-1] the state this layer
                        # has just computed.
                        if self.mode == 'Bi_LSTM':
                            for j in range(sequence_length):
                                if j == 0:
                                    outs_rev[j] = self.rnn_cell_list_backward[layer].forward(outs_rev[j][0], None)
                                else:
                                    outs_rev[j] = self.rnn_cell_list_backward[layer].forward(outs_rev[j][0], outs_rev[j-1])
                        else:
                            for j in range(sequence_length):
                                if j == 0:
                                    outs_rev[j] = self.rnn_cell_list_backward[layer].forward(outs_rev[j], None)
                                else:
                                    outs_rev[j] = self.rnn_cell_list_backward[layer].forward(outs_rev[j], outs_rev[j-1])
                if self.mode == 'Bi_LSTM':
                    outs_rev = [outs_rev[i][0] for i in range(len(outs_rev))]

        '''out = outs[-1].squeeze()  
        out_rev = outs_rev[0].squeeze()
        out = torch.cat((out, out_rev), 1)

        out = self.fc(out)'''

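        # Experiments for Part 3:
        # Using all hidden states. outs_rev is stored in processing order (last frame
        # first), so it is indexed from the end to pair each forward state with the
        # backward state of the same time step.
        outs_cat = [torch.cat((outs[i], outs_rev[len(outs) - 1 - i]), 1) for i in range(len(outs))]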

        # Using all hidden states:
        out_pred = [self.fc1(outs_cat[j]) for j in range(len(outs_cat))]
        out_pred_ensemble = torch.cat(tuple(out_pred), 1)

        # Ensemble prediction
        out = self.fc2(out_pred_ensemble)

        return out

## Part 2

Experimented with different RNN architectures (RNN cell type, number of layers, hidden state dimensionality, MFCC configuration, unidirectional vs. bidirectional) and reported the accuracy as well as the training time.
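
One rough way to reason about the time cost of each cell type is its parameter count: the cells in Part 1 are each built from an `x2h` and an `h2h` linear layer whose output width is 1, 3 or 4 times the hidden size for the vanilla RNN, GRU and LSTM respectively. The sketch below only does that arithmetic; the helper name `cell_param_count` is hypothetical, and the input size of 20 assumes librosa's default `n_mfcc=20`, which the MFCC call in the data provider does not override.

```python
def cell_param_count(input_size, hidden_size, num_gates, bias=True):
    """Parameters in one cell built from an x2h and an h2h linear layer."""
    weights = num_gates * hidden_size * (input_size + hidden_size)
    biases = 2 * num_gates * hidden_size if bias else 0  # one bias vector per layer
    return weights + biases


# Output-width multiplier of x2h / h2h for each cell type.
GATES = {"RNN": 1, "GRU": 3, "LSTM": 4}

input_size, hidden_size = 20, 64  # 20 = librosa default n_mfcc (assumed), 64 = hidden_dim

counts = {name: cell_param_count(input_size, hidden_size, g)
          for name, g in GATES.items()}

for name, n in counts.items():
    print("{:>4}: {} parameters in the first layer".format(name, n))
```

A bidirectional model doubles these counts, since it keeps separate forward and backward cells per layer.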


In [None]:
# Parts of experiment code based on: https://github.com/emadRad/lstm-gru-pytorch
import time
seq_dim, input_dim = train_dataset[0][0].shape
output_dim = 3
hidden_dim = 64
layer_dim = 4
bias = True

model = RNNModel("LSTM", input_dim, hidden_dim, layer_dim, bias, output_dim)
# model = BidirRecurrentModel("Bi_LSTM", input_dim, hidden_dim, layer_dim, bias, output_dim)

if torch.cuda.is_available():
    model.cuda()
    
criterion = nn.CrossEntropyLoss()

learning_rate = 0.001
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

loss_list = []
iter = 0
max_v_accuracy = 0
reported_t_accuracy = 0
max_t_accuracy = 0
# time.clock() was removed in Python 3.8; perf_counter() measures wall-clock time.
t_before = time.perf_counter()

for epoch in range(num_epochs):
    for i, (audio, labels) in enumerate(train_loader):
        audio = audio.view(-1, seq_dim, input_dim)
        if torch.cuda.is_available():
            audio = audio.cuda()
            labels = labels.cuda()

        optimizer.zero_grad()

        outputs = model(audio)

        loss = criterion(outputs, labels)

        loss.backward()

        optimizer.step()

        loss_list.append(loss.item())
        iter += 1

        if iter % valid_every_n_steps == 0:
            # Validation pass: no gradients are needed.
            correct = 0
            total = 0
            with torch.no_grad():
                for audio, labels in valid_loader:
                    audio = audio.view(-1, seq_dim, input_dim)
                    if torch.cuda.is_available():
                        audio = audio.cuda()

                    outputs = model(audio)

                    _, predicted = torch.max(outputs, 1)

                    total += labels.size(0)
                    correct += (predicted.cpu() == labels).sum().item()

            v_accuracy = 100 * correct / total
            
            is_best = False
            if v_accuracy >= max_v_accuracy:
                max_v_accuracy = v_accuracy
                is_best = True

            if is_best:
                # Reset the counters so that test accuracy is not mixed with
                # the validation counts accumulated above.
                correct = 0
                total = 0
                with torch.no_grad():
                    for audio, labels in test_loader:
                        audio = audio.view(-1, seq_dim, input_dim)
                        if torch.cuda.is_available():
                            audio = audio.cuda()

                        outputs = model(audio)

                        _, predicted = torch.max(outputs, 1)

                        total += labels.size(0)
                        correct += (predicted.cpu() == labels).sum().item()

                t_accuracy = 100 * correct / total
                reported_t_accuracy = t_accuracy

            print('Iteration: {}. Loss: {}. V-Accuracy: {}  T-Accuracy: {}'.format(iter, loss.item(), v_accuracy, reported_t_accuracy))

training_time = time.perf_counter() - t_before
print('Training time: ' + str(training_time) + ' seconds \n')
print('Training time: ' + str(training_time / 60.0) + ' minutes')

Iteration: 20. Loss: 1.0984433889389038. V-Accuracy: 33  T-Accuracy: 33
Iteration: 40. Loss: 1.0906832218170166. V-Accuracy: 43  T-Accuracy: 46
Iteration: 60. Loss: 1.0014851093292236. V-Accuracy: 61  T-Accuracy: 59
Iteration: 80. Loss: 0.7186201214790344. V-Accuracy: 69  T-Accuracy: 68
Iteration: 100. Loss: 0.8055158853530884. V-Accuracy: 75  T-Accuracy: 73
Iteration: 120. Loss: 0.45969128608703613. V-Accuracy: 81  T-Accuracy: 79
Iteration: 140. Loss: 0.4297122061252594. V-Accuracy: 81  T-Accuracy: 80
Iteration: 160. Loss: 0.3701402544975281. V-Accuracy: 77  T-Accuracy: 80
Iteration: 180. Loss: 0.3631555736064911. V-Accuracy: 82  T-Accuracy: 82
Iteration: 200. Loss: 0.5101146697998047. V-Accuracy: 85  T-Accuracy: 84
Iteration: 220. Loss: 0.3955325484275818. V-Accuracy: 86  T-Accuracy: 85
Iteration: 240. Loss: 0.4306756854057312. V-Accuracy: 87  T-Accuracy: 86
Iteration: 260. Loss: 0.5295273065567017. V-Accuracy: 87  T-Accuracy: 87
Iteration: 280. Loss: 0.25445204973220825. V-Accuracy:

In [None]:
#################### Experimental Results for Part 2 #####################
from IPython.display import Image, display
display(Image(filename='results_table.png', width=1300))

## Part 3 



## Discussion about Predictions.

Using only the final hidden state may not be the best basis for the final prediction, because it relies most heavily on the hidden representation of the last part of the audio sample. With a larger number of classes, the final hidden states of different words may be very similar, since the words may end in the same phoneme. This is analogous to classifying an image from only a small region of a feature map after a convolutional layer: many images of different classes have similar-looking local regions. This can lead to many false positives.

Variations on final prediction:


*   Instead of using only the last hidden state, an average over all hidden states can be computed and fed into the FC layer; alternatively, each hidden state can be fed into the FC layer to give 32 individual predictions (one per MFCC frame), with the final prediction taken as the majority vote over these 32.

*   Better still, certain frames in the audio sample may be more important than others. Attention can therefore be applied at the output to use a weighted average of the hidden states to feed into the FC layer to make a final prediction. 
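
The read-out options discussed above can be sketched on a dummy batch of hidden states. The shapes follow the notebook: 32 frames (for 16000 samples with `hop_length=512` and librosa's default centred framing, 1 + 16000 // 512 = 32), a hidden size of 64 and 3 classes. The layer names `fc` and `att` and the random inputs are illustrative assumptions rather than the trained model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, hidden, n_classes = 5, 1 + 16000 // 512, 64, 3
states = torch.randn(batch, seq_len, hidden)  # stand-in for the RNN hidden states

fc = nn.Linear(hidden, n_classes)  # hypothetical classification head
att = nn.Linear(hidden, 1)         # hypothetical attention scorer

# 1) Final hidden state only.
logits_last = fc(states[:, -1, :])

# 2) Mean over all hidden states, then classify.
logits_mean = fc(states.mean(dim=1))

# 3) One prediction per frame, followed by a majority vote per sample.
frame_preds = fc(states).argmax(dim=2)        # (batch, seq_len)
vote = torch.mode(frame_preds, dim=1).values  # (batch,)

# 4) Attention: softmax-normalised scores weight the hidden states.
weights = torch.softmax(att(states).squeeze(-1), dim=1)  # (batch, seq_len)
context = (weights.unsqueeze(-1) * states).sum(dim=1)    # (batch, hidden)
logits_att = fc(context)

print(logits_last.shape, logits_mean.shape, vote.shape, logits_att.shape)
```

The majority vote is not differentiable, so a model read out this way would need to be trained on the per-frame losses; the mean and attention read-outs can be trained end to end with a single cross-entropy loss.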


In [None]:
# Parts of experiment code based on: https://github.com/emadRad/lstm-gru-pytorch

from collections import Counter
import time

seq_dim, input_dim = train_dataset[0][0].shape
output_dim = 3
hidden_dim = 64
layer_dim = 4
bias = True

# model = RNNModel("LSTM", input_dim, hidden_dim, layer_dim, bias, output_dim)
model = BidirRecurrentModel("Bi_GRU", input_dim, hidden_dim, layer_dim, bias, output_dim)

if torch.cuda.is_available():
    model.cuda()
    
criterion = nn.CrossEntropyLoss()

learning_rate = 0.001
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

loss_list = []
iter = 0
max_v_accuracy = 0
reported_t_accuracy = 0
max_t_accuracy = 0
# time.clock() was removed in Python 3.8; perf_counter() measures wall-clock time.
t_before = time.perf_counter()

for epoch in range(num_epochs):
    for i, (audio, labels) in enumerate(train_loader):
        audio = audio.view(-1, seq_dim, input_dim)
        if torch.cuda.is_available():
            audio = audio.cuda()
            labels = labels.cuda()

        optimizer.zero_grad()

        outputs = model(audio)

        loss = criterion(outputs, labels)

        '''
        # Alternative for a model that returns one prediction per hidden state:
        # average the per-state losses. torch.stack keeps the autograd graph intact,
        # unlike building a new FloatTensor from the individual loss values.
        loss = torch.stack([criterion(out, labels) for out in outputs]).mean()
        '''

        loss.backward()

        optimizer.step()

        loss_list.append(loss.item())
        iter += 1

        if iter % valid_every_n_steps == 0:
            # Validation pass: no gradients are needed.
            correct = 0
            total = 0
            with torch.no_grad():
                for audio, labels in valid_loader:
                    audio = audio.view(-1, seq_dim, input_dim)
                    if torch.cuda.is_available():
                        audio = audio.cuda()

                    outputs = model(audio)

                    _, predicted = torch.max(outputs, 1)

                    '''
                    # Alternative for per-hidden-state predictions: majority vote per sample.
                    frame_preds = torch.stack([out.argmax(dim=1) for out in outputs], dim=1)
                    predicted = torch.mode(frame_preds, dim=1).values
                    '''

                    total += labels.size(0)
                    correct += (predicted.cpu() == labels).sum().item()

            v_accuracy = 100 * correct / total
            
            is_best = False
            if v_accuracy >= max_v_accuracy:
                max_v_accuracy = v_accuracy
                is_best = True

            if is_best:
                # Reset the counters so that test accuracy is not mixed with
                # the validation counts accumulated above.
                correct = 0
                total = 0
                with torch.no_grad():
                    for audio, labels in test_loader:
                        audio = audio.view(-1, seq_dim, input_dim)
                        if torch.cuda.is_available():
                            audio = audio.cuda()

                        outputs = model(audio)

                        _, predicted = torch.max(outputs, 1)

                        '''
                        # Alternative for per-hidden-state predictions: majority vote per sample.
                        frame_preds = torch.stack([out.argmax(dim=1) for out in outputs], dim=1)
                        predicted = torch.mode(frame_preds, dim=1).values
                        '''

                        total += labels.size(0)
                        correct += (predicted.cpu() == labels).sum().item()

                t_accuracy = 100 * correct / total
                reported_t_accuracy = t_accuracy

            print('Iteration: {}. Loss: {}. V-Accuracy: {}  T-Accuracy: {}'.format(iter, loss.item(), v_accuracy, reported_t_accuracy))

training_time = time.perf_counter() - t_before
print('Training time: ' + str(training_time) + ' seconds \n')
print('Training time: ' + str(training_time / 60.0) + ' minutes')

### Short discussion on the vanishing/exploding gradient problem.

Vanishing and exploding gradients arise when computing the derivative of the loss with respect to the weights of an RNN. Because the parameters are shared across time steps, the gradient of the loss with respect to a single parameter $w$ sums contributions from every path from $w$ to the loss. By the chain rule, each path contributes a product of recurrent weight matrices (and activation derivatives), and the further back in time the path starts, the more factors the product contains.

Therefore, for long sequences, the contribution to the gradient w.r.t. $w$ from much earlier time-steps vanishes if the parameter is less than 1 in magnitude (for a weight matrix, roughly when its largest singular value is below 1), and can explode and dominate if it is greater than 1. When gradients vanish, the update to $w$ is dominated by the most recent time-steps in the sequence, so relationships between distant states are lost and information from early time-steps does not improve the current prediction. When gradients explode, the update to $w$ is disproportionately dominated by older terms, effectively overwriting any prior learning.
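
A small numerical illustration of this geometric behaviour, under simplifying assumptions: for a purely linear recurrence $h_t = W h_{t-1}$ the Jacobian of the final state with respect to the initial state is the matrix power $W^T$, and if $W$ is an orthogonal matrix scaled by $s$, its norm is exactly $s^T$. The helper name `jacobian_norm`, the scales and the hidden size are chosen for illustration only; a tanh non-linearity would shrink the product further, since its derivative is at most 1.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 16
# Random orthogonal matrix: all of its singular values are exactly 1.
q, _ = np.linalg.qr(rng.standard_normal((hidden, hidden)))


def jacobian_norm(scale, steps):
    """Spectral norm of d h_T / d h_0 for the linear recurrence h_t = (scale * Q) h_{t-1}."""
    w = scale * q
    jac = np.linalg.matrix_power(w, steps)  # product of `steps` identical Jacobians
    return np.linalg.norm(jac, 2)


for scale in (0.9, 1.0, 1.1):
    norms = [jacobian_norm(scale, t) for t in (1, 10, 32)]
    print(scale, ["{:.3g}".format(n) for n in norms])
```

With 32 time steps, as in the MFCC sequences used here, a scale of 0.9 already shrinks the Jacobian norm by more than an order of magnitude, while a scale of 1.1 grows it by more than an order of magnitude.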

LSTMs mitigate this problem. Although products of gradients still arise over long sequences, the gradient is preserved because the new cell state is an additive update of the old cell state rather than a purely multiplicative transformation of it. Influence from the past therefore does not disappear unless, for example, the forget gate is closed. This improved gradient flow allows memory to be retained over longer time periods.
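
To make the contrast concrete, here is a sketch of the two Jacobian products in the notation of the cells implemented in Part 1. The symbols are introduced here for illustration: $W_{hh}$ is the weight of the `h2h` layer, $f_t$, $i_t$ and $g_t$ are the forget gate, input gate and candidate, and $\odot$ is element-wise multiplication. For the tanh RNN, $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b)$, so the Jacobian across $T - k$ steps is a product of factors that each contain $W_{hh}$ (later time-steps on the left):

$$
\frac{\partial h_T}{\partial h_k} = \prod_{t=k+1}^{T} \operatorname{diag}\left(1 - h_t \odot h_t\right) W_{hh}
$$

For the LSTM, $c_t = f_t \odot c_{t-1} + i_t \odot g_t$, so along the direct cell-state path each factor is just the forget gate:

$$
\left.\frac{\partial c_T}{\partial c_k}\right|_{\text{direct}} = \prod_{t=k+1}^{T} \operatorname{diag}\left(f_t\right)
$$

Wherever the forget gate stays close to 1, each factor is close to the identity and the gradient passes through almost unchanged. The gates themselves also depend on $c_{t-1}$ through $h_{t-1}$; those indirect terms are omitted from this sketch.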