# HW1: Frame-Level Speech Recognition

In this homework, you will be working with MFCC data consisting of 15 features at each time step/frame. Your model should be able to recognize the phoneme occured in that frame.

# Libraries

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from torchsummaryX import summary
import sklearn
import gc
import zipfile
import pandas as pd
from tqdm.auto import tqdm
import os
import datetime
import wandb
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

from hparams import Hparams
from model import DLHW1_MLP_modular

if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_built():
    device = 'mps'
else:
    device = 'cpu' 

hparams = Hparams()

print("Device: ", device)

Device:  cuda


In [2]:
# model = DLHW1_MLP_modular(hparams, torch.nn.SiLU).to(device)

# model_parameters = filter(lambda p: p.requires_grad, model.parameters())
# params = sum([np.prod(p.size()) for p in model_parameters])
# params

In [3]:
### PHONEME LIST
PHONEMES = [
            'SIL',   'AA',    'AE',    'AH',    'AO',    'AW',    'AY',  
            'B',     'CH',    'D',     'DH',    'EH',    'ER',    'EY',
            'F',     'G',     'HH',    'IH',    'IY',    'JH',    'K',
            'L',     'M',     'N',     'NG',    'OW',    'OY',    'P',
            'R',     'S',     'SH',    'T',     'TH',    'UH',    'UW',
            'V',     'W',     'Y',     'Z',     'ZH',    '<sos>', '<eos>']

## Data Exploration and Testing

In [4]:
datapath = '/home/jbajor/Dev/hw1/data/dev-clean'

idx = 83

mfcc_list = os.listdir(f'{datapath}/mfcc')
trans_list = os.listdir(f'{datapath}/transcript')

mfcc = np.load(f'{datapath}/mfcc/{mfcc_list[idx]}')
trans = np.load(f'{datapath}/transcript/{mfcc_list[idx]}')[1:-1]

In [5]:
mfcc.shape

(334, 15)

In [6]:
np.mean(mfcc, axis=0).shape

(15,)

In [7]:
(mfcc - np.mean(mfcc, axis=0)).shape

(334, 15)

In [8]:
alltrans = np.concatenate([np.load(f'{datapath}/transcript/{i}') for i in trans_list])

In [9]:
unq = np.unique(alltrans, return_counts=True)

#sil count
sil_count = unq[1][np.where(unq[0]=='SIL')[0]]
print(f'SIL count: {sil_count}')

#least common phoneme
least_common = unq[0][np.argmin(unq[1])]
print(f'Least Common: {least_common}')


SIL count: [396367]
Least Common: ZH


In [10]:
unq[1][np.where(unq[0]=='SIL')[0]]

array([396367])

In [11]:
np.unique(PHONEMES, return_inverse=True)

(array(['<eos>', '<sos>', 'AA', 'AE', 'AH', 'AO', 'AW', 'AY', 'B', 'CH',
        'D', 'DH', 'EH', 'ER', 'EY', 'F', 'G', 'HH', 'IH', 'IY', 'JH', 'K',
        'L', 'M', 'N', 'NG', 'OW', 'OY', 'P', 'R', 'S', 'SH', 'SIL', 'T',
        'TH', 'UH', 'UW', 'V', 'W', 'Y', 'Z', 'ZH'], dtype='<U5'),
 array([32,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 34, 35,
        36, 37, 38, 39, 40, 41,  1,  0]))

In [12]:
#Mapping tests
phone_map = {lb:idx for idx, lb in enumerate(PHONEMES)}

for idx, i in enumerate(trans):
    trans[idx] = phone_map[i]


In [13]:
trans.astype(int)

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0, 10, 10, 10, 10, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
        7,  7,  7, 18, 18, 18, 18, 18, 18, 15, 15, 15, 15, 15, 15, 15,  2,
        2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2, 23, 23, 23, 23, 23,
       31, 31, 31, 31, 31, 31, 31, 34, 34, 34, 34, 34, 34, 34, 21, 21, 21,
       21,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2, 14, 14,
       14, 14, 14, 14,  0,  0,  0,  0,  7,  7,  7,  7, 26, 26, 26, 26, 26,
       26, 26, 26, 26, 26, 26, 29, 29, 29, 29, 29, 29, 29, 29, 29, 31, 31,
       31,  3,  3,  3, 28, 28, 28,  3,  3,  3, 29, 29, 29, 29, 29, 29, 29,
       29, 29, 29, 29, 29, 29, 29, 21, 21, 21, 18, 18, 18, 18, 18, 18, 18,
       18, 18, 18, 36, 36, 36, 17, 17, 17, 23, 23, 23, 23, 23, 10, 10, 10,
       13, 13, 13, 13, 13, 13, 13, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29,
       29, 29,  4,  4,  4

## End of Testing Section

# Dataset

This section covers the dataset/dataloader class for speech data. You will have to spend time writing code to create this class successfully. We have given you a lot of comments guiding you on what code to write at each stage, from top to bottom of the class. Please try and take your time figuring this out, as it will immensely help in creating dataset/dataloader classes for future homeworks.

Before running the following cells, please take some time to analyse the structure of data. Try loading a single MFCC and its transcipt, print out the shapes and print out the values. Do the transcripts look like phonemes?

Note: Dataset functions have been modularized in another file (dataset.py)

# Parameters Configuration

Storing your parameters and hyperparameters in a single configuration dictionary makes it easier to keep track of them during each experiment. It can also be used with weights and biases to log your parameters for each experiment and keep track of them across multiple experiments. 

In [14]:
config = {
    'epochs': 20,
    'batch_size' : 1400,
    'context' : 25,
    'learning_rate' : 0.001,
    'architecture' : '6-deep-v1',
    'dropout_p' : 0.2
    # Add more as you need them - e.g dropout values, weight decay, scheduler parameters
}

from hparams import Hparams

hparams = Hparams()

# Create Datasets

In [15]:
from dataset import AudioDataset, AudioTestDataset

In [16]:
train_data = AudioDataset('/home/jbajor/Dev/hw1/data/train-clean-100', context=hparams.context) #TODO: Create a dataset object using the AudioDataset class for the training data 

val_data = AudioDataset('/home/jbajor/Dev/hw1/data/dev-clean', context=hparams.context) # TODO: Create a dataset object using the AudioDataset class for the validation data 

test_data = AudioTestDataset('/home/jbajor/Dev/hw1/data/test-clean', context=hparams.context) # TODO: Create a dataset object using the AudioTestDataset class for the test data 

In [17]:
# Define dataloaders for train, val and test datasets
# Dataloaders will yield a batch of frames and phonemes of given batch_size at every iteration


train_loader = torch.utils.data.DataLoader(train_data, num_workers= 10,
                                           batch_size=hparams.batch_size, pin_memory= True,
                                           shuffle= True)

val_loader = torch.utils.data.DataLoader(val_data, num_workers= 10,
                                         batch_size=hparams.batch_size, pin_memory= True,
                                         shuffle= False)

test_loader = torch.utils.data.DataLoader(test_data, num_workers= 10, 
                                          batch_size=hparams.batch_size, pin_memory= True, 
                                          shuffle= False)


print("Batch size: ", hparams.batch_size)
print("Context: ", hparams.context)
print("Input size: ", (2*hparams.context+1)*15)
print("Output symbols: ", len(PHONEMES))

print("Train dataset samples = {}, batches = {}".format(train_data.__len__(), len(train_loader)))
print("Validation dataset samples = {}, batches = {}".format(val_data.__len__(), len(val_loader)))
print("Test dataset samples = {}, batches = {}".format(test_data.__len__(), len(test_loader)))

Batch size:  1028
Context:  30
Input size:  915
Output symbols:  42
Train dataset samples = 36191134, batches = 35206
Validation dataset samples = 1937496, batches = 1885
Test dataset samples = 1943253, batches = 1891


In [18]:
# Testing code to check if your data loaders are working
for i, data in enumerate(train_loader):
    frames, phoneme = data
    print(frames.shape, phoneme.shape)
    break

torch.Size([1028, 915]) torch.Size([1028])


# Define Model, Loss Function and Optimizer

Here we define the model, loss function, optimizer and optionally a learning rate scheduler. 

In [19]:
model = DLHW1_MLP_modular(hparams, torch.nn.SiLU).to(device)
frames,phoneme = next(iter(train_loader))
# Check number of parameters of your network - Remember, you are limited to 20 million parameters for HW1 (including ensembles)
summary(model, frames.to(device))

                          Kernel Shape  Output Shape    Params  Mult-Adds
Layer                                                                    
0_layers.Linear_0          [915, 1745]  [1028, 1745]  1.59842M  1.596675M
1_layers.SiLU_1                      -  [1028, 1745]         -          -
2_layers.Dropout_2                   -  [1028, 1745]         -          -
3_layers.BatchNorm1d_3          [1745]  [1028, 1745]     3.49k     1.745k
4_layers.Linear_4         [1745, 1745]  [1028, 1745]  3.04677M  3.045025M
5_layers.SiLU_5                      -  [1028, 1745]         -          -
6_layers.Dropout_6                   -  [1028, 1745]         -          -
7_layers.BatchNorm1d_7          [1745]  [1028, 1745]     3.49k     1.745k
8_layers.Linear_8         [1745, 1745]  [1028, 1745]  3.04677M  3.045025M
9_layers.SiLU_9                      -  [1028, 1745]         -          -
10_layers.Dropout_10                 -  [1028, 1745]         -          -
11_layers.BatchNorm1d_11        [1745]

  df_sum = df.sum()


Unnamed: 0_level_0,Kernel Shape,Output Shape,Params,Mult-Adds
Layer,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0_layers.Linear_0,"[915, 1745]","[1028, 1745]",1598420.0,1596675.0
1_layers.SiLU_1,-,"[1028, 1745]",,
2_layers.Dropout_2,-,"[1028, 1745]",,
3_layers.BatchNorm1d_3,[1745],"[1028, 1745]",3490.0,1745.0
4_layers.Linear_4,"[1745, 1745]","[1028, 1745]",3046770.0,3045025.0
5_layers.SiLU_5,-,"[1028, 1745]",,
6_layers.Dropout_6,-,"[1028, 1745]",,
7_layers.BatchNorm1d_7,[1745],"[1028, 1745]",3490.0,1745.0
8_layers.Linear_8,"[1745, 1745]","[1028, 1745]",3046770.0,3045025.0
9_layers.SiLU_9,-,"[1028, 1745]",,


In [20]:
from sched import scheduler


criterion = torch.nn.KLDivLoss(reduction='batchmean')
optimizer = torch.optim.Adam(model.parameters(), lr=hparams.lr) #Defining Optimizer
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer, mode='min', patience=4)

# Training and Validation Functions

This section covers the training, and validation functions for each epoch of running your experiment with a given model architecture. The code has been provided to you, but we recommend going through the comments to understand the workflow to enable you to write these loops for future HWs.

In [21]:
if device == 'cuda':
    torch.cuda.empty_cache()
    
gc.collect()

0

In [22]:
def train(model, optimizer, criterion, dataloader, use_wandb:bool=True, mixed_pr:bool=False):

    model.train()
    train_loss = 0.0 #Monitoring Loss
    
    for iter, (mfccs, phonemes) in enumerate(dataloader):

        ### Initialize Gradients
        optimizer.zero_grad()

        ### Move Data to Device (Ideally GPU)
        mfccs = mfccs.to(device)
        phonemes = phonemes.to(device)
        phonemes = F.one_hot(phonemes, num_classes=40).float()

        if mixed_pr:
            with torch.autocast(device):
                
                ### Forward Propagation
                logits = model(mfccs)

                ### Loss Calculation
                loss = criterion(logits, phonemes)
                train_loss += loss.item()
        else:
            ### Forward Propagation
            logits = model(mfccs)

            ### Loss Calculation
            loss = criterion(logits, phonemes)
            train_loss += loss.item()
            
        ### Backward Propagation
        loss.backward()

        ### Gradient Descent
        optimizer.step()
  
    train_loss /= len(dataloader)
    if wandb:
        wandb.log({'train loss':train_loss})

    return train_loss

In [23]:
def eval(model, dataloader, use_wandb=True):

    model.eval() # set model in evaluation mode

    phone_true_list = []
    phone_pred_list = []
    val_loss = 0.0

    for i, data in enumerate(dataloader):

        frames, phonemes = data
        ### Move data to device (ideally GPU)
        frames, phonemes = frames.to(device), phonemes.to(device)
        with torch.inference_mode(): # makes sure that there are no gradients computed as we are not training the model now
            ### Forward Propagation
            logits = model(frames)

        ### Get Predictions
        predicted_phonemes = torch.argmax(logits, dim=1)
        
        ### Store Pred and True Labels
        phone_pred_list.extend(predicted_phonemes.cpu().tolist())
        phone_true_list.extend(phonemes.cpu().tolist())
        
        ### Convert pred to onehot
        phonemes = F.one_hot(phonemes, num_classes=40).float()
        predicted_phonemes = F.one_hot(predicted_phonemes, num_classes=40).float()

        loss = criterion(predicted_phonemes, phonemes)
        val_loss += loss.item()
    
        del frames, phonemes, logits
        if device == 'cuda':
            torch.cuda.empty_cache()

    val_loss = val_loss/len(dataloader)

    ### Calculate Accuracy
    accuracy = accuracy_score(phone_pred_list, phone_true_list)
    if use_wandb:
        wandb.log({'validation accuracy':accuracy*100})
        wandb.log({'validation loss':val_loss})

    return val_loss, accuracy*100

# Weights and Biases Setup

This section is to enable logging metrics and files with Weights and Biases. Please refer to wandb documentationa and recitation 0 that covers the use of weights and biases for logging, hyperparameter tuning and monitoring your runs for your homeworks. Using this tool makes it very easy to show results when submitting your code and models for homeworks, and also extremely useful for study groups to organize and run ablations under a single team in wandb. 

We have written code for you to make use of it out of the box, so that you start using wandb for all your HWs from the beginning.

In [24]:
use_wandb:bool = True

In [25]:
if use_wandb:
    from dataclasses import asdict
    import time
    wandb.login(key="c319fb8dfa7ce22e07aa0cefe0823a9752d50720") #API Key is in your wandb account, under settings (wandb.ai/settings)

    ### Create your wandb run
    run = wandb.init(
        name = f"{hparams.architecture}_{int(time.time())}", ### Wandb creates random run names if you skip this field, we recommend you give useful names
        reinit=True, ### Allows reinitalizing runs when you re-run this cell
        project="hw1p2", ### Project should be created in your wandb account 
        config=asdict(hparams) ### Wandb Config for your run
    )
    ### Save your model architecture as a string with str(model) 
    model_arch = str(model)

    ### Save it in a txt file 
    arch_file = open("model_arch.txt", "w")
    file_write = arch_file.write(model_arch)
    arch_file.close()

    ### log it in your wandb run with wandb.save()
    wandb.save('model_arch.txt')

[34m[1mwandb[0m: Currently logged in as: [33mjbajor[0m ([33midl-group[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/jbajor/.netrc


# Experiment

In [32]:
torch.load('model_checkpoint.pth')

{'epoch': 23,
 'model_state_dict': OrderedDict([('layers.0.weight',
               tensor([[ 1.1667,  0.5752,  0.4339,  ...,  0.0970, -0.0850, -0.4561],
                       [-0.1759,  0.1760, -0.5892,  ..., -0.4625,  0.1752, -0.3077],
                       [ 0.2314,  0.0965, -0.0439,  ..., -0.3021, -0.0099,  0.1964],
                       ...,
                       [ 0.2226,  0.4204, -0.2168,  ...,  0.2165, -0.0953, -0.3430],
                       [ 0.1174, -1.8793, -0.6521,  ..., -0.1232,  0.1457,  0.3114],
                       [-0.3169, -0.1427, -0.2361,  ..., -0.2070,  0.3545, -0.2763]],
                      device='cuda:0')),
              ('layers.0.bias',
               tensor([ -6.6531, -11.4309,  -4.7974,  ..., -15.0290, -13.9033, -16.9025],
                      device='cuda:0')),
              ('layers.3.weight',
               tensor([1.3066, 1.5920, 1.4017,  ..., 1.5795, 1.3451, 1.6311], device='cuda:0')),
              ('layers.3.bias',
               tensor([-0.

Now, it is time to finally run your ablations! Have fun!

In [26]:
# Iterate over number of epochs to train and evaluate your model
if device == 'cuda':
  torch.cuda.empty_cache()

best_acc = 0.0 ### Monitor best accuracy in your run

for epoch in range(hparams.epochs):
    print("\nEpoch {}/{}".format(epoch+1, hparams.epochs))

    train_loss = train(model, optimizer, criterion, train_loader, use_wandb, mixed_pr=hparams.mixed_precision)
    val_loss, accuracy = eval(model, val_loader, use_wandb)

    scheduler.step(val_loss)

    print("\tTrain Loss: {:.4f}".format(train_loss))
    print("\tValidation Accuracy: {:.2f}%".format(accuracy))

    ### Save checkpoint if accuracy is better than your current best
    if accuracy >= best_acc:
      best_acc = accuracy

      ### Save checkpoint with information you want
      torch.save({'epoch': epoch,
              'model_state_dict': model.state_dict(),
              'optimizer_state_dict': optimizer.state_dict(),
              'loss': train_loss,
              'acc': accuracy}, 
        './model_checkpoint.pth')
      
      ### Save checkpoint in wandb
      if use_wandb:
        wandb.save('checkpoint.pth')

### Finish your wandb run
if use_wandb:
  run.finish()


Epoch 1/25
	Train Loss: 0.7448
	Validation Accuracy: 81.40%

Epoch 2/25
	Train Loss: 0.5949
	Validation Accuracy: 83.01%

Epoch 3/25
	Train Loss: 0.5527
	Validation Accuracy: 83.88%

Epoch 4/25
	Train Loss: 0.5277
	Validation Accuracy: 84.33%

Epoch 5/25
	Train Loss: 0.5105
	Validation Accuracy: 84.73%

Epoch 6/25
	Train Loss: 0.4974
	Validation Accuracy: 85.01%

Epoch 7/25
	Train Loss: 0.4869
	Validation Accuracy: 85.16%

Epoch 8/25
	Train Loss: 0.4781
	Validation Accuracy: 85.33%

Epoch 9/25
	Train Loss: 0.4709
	Validation Accuracy: 85.55%

Epoch 10/25
	Train Loss: 0.4648
	Validation Accuracy: 85.66%

Epoch 11/25
	Train Loss: 0.4591
	Validation Accuracy: 85.81%

Epoch 12/25
	Train Loss: 0.4541
	Validation Accuracy: 85.88%

Epoch 13/25
	Train Loss: 0.4498
	Validation Accuracy: 85.94%

Epoch 14/25
	Train Loss: 0.4457
	Validation Accuracy: 86.06%

Epoch 15/25
	Train Loss: 0.4422
	Validation Accuracy: 86.15%

Epoch 16/25
	Train Loss: 0.4388
	Validation Accuracy: 86.15%

Epoch 17/25
	Tra

VBox(children=(Label(value='0.002 MB of 0.002 MB uploaded (0.000 MB deduped)\r'), FloatProgress(value=1.0, max…

0,1
train loss,█▅▄▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁
validation accuracy,▁▃▄▅▅▆▆▆▇▇▇▇▇▇▇▇█████████
validation loss,█▆▅▄▄▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁

0,1
train loss,0.41755
validation accuracy,86.56921
validation loss,-0.8657


# Testing and submission to Kaggle

Before we get to the following code, make sure to see the format of submission given in *random_submission.csv*. Once you have done so, it is time to fill the following function to complete your inference on test data. Refer the eval function from previous cells to get an idea of how to go about completing this function.

In [27]:
def test(model, test_loader):
  ### What you call for model to perform inference?
  model.eval()

  ### List to store predicted phonemes of test data
  test_predictions = []

  ### Which mode do you need to avoid gradients?
  with torch.no_grad():

      for i, frames in enumerate(tqdm(test_loader)):

          frames = frames.float().to(device)             
          
          output = model(frames)

          ### Get most likely predicted phoneme with argmax
          predicted_phonemes = torch.argmax(output, dim=1)

          test_predictions.extend(predicted_phonemes.cpu().tolist())

          ### How do you store predicted_phonemes with test_predictions? Hint, look at eval 
          
  return test_predictions

In [28]:
predictions = test(model, test_loader)

  0%|          | 0/1891 [00:00<?, ?it/s]

In [29]:
### Create CSV file with predictions
with open("./submission.csv", "w+") as f:
    f.write("id,label\n")
    for i in range(len(predictions)):
        f.write("{},{}\n".format(i, predictions[i]))

In [30]:
### Submit to kaggle competition using kaggle API
!kaggle competitions submit -c 11-785-f22-hw1p2 -f ./submission.csv -m "Test Submission"

100%|██████████████████████████████████████| 18.6M/18.6M [00:03<00:00, 6.15MB/s]
Successfully submitted to Frame-Level Speech Recognition