# 11785 HW3P2: Automatic Speech Recognition

Welcome to HW3P2. In this homework, you will use the same data as in HW1 but will incorporate sequence models. We recommend that you familiarize yourself with sequential data and the workings of RNNs, LSTMs, and GRUs so that this part of the homework goes smoothly.

Disclaimer: This starter notebook is not as elaborate as those of HW1P2 or HW2P2. You will need to do most of the implementation yourself because, after two homeworks, you are expected to be able to write a notebook from scratch. You are welcome to reuse code from the previous starter notebooks, but you may need to adapt it for this homework. <br>
We have also given you 3 log files for the Very Low Cutoff (Levenshtein Distance = 30) so that you can observe how the loss decreases.

Common errors you may face:

*   Shape errors: About half of the errors in this homework fall into this category. Print the shapes between intermediate steps to debug.
*   CUDA out of memory: This can happen when your architecture has a lot of parameters. The usual remedies are to (1) reduce batch_size, (2) call *torch.cuda.empty_cache* often, even inside your training loop, (3) call *gc.collect* if it helps, and (4) restart the runtime if nothing else works.
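The memory-related remedies can be bundled into one small helper. This is a minimal sketch, not part of the starter code: `release_memory` is a hypothetical name, and the `torch` import is made optional here only so that the helper also runs on a machine without PyTorch.

```python
import gc

try:
    import torch  # optional: only needed for the CUDA cache step
except ImportError:
    torch = None


def release_memory():
    """Run the garbage collector, then release PyTorch's cached CUDA blocks if possible.

    Returns the number of unreachable objects found by gc.collect().
    """
    collected = gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
    return collected


print("objects collected:", release_memory())
```

Calling it after deleting large tensors (for example at the end of an epoch) is usually enough; calling it on every batch costs time and rarely helps more than reducing the batch size.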







# Preliminaries

You will need to install packages for decoding and calculating the Levenshtein distance

In [1]:
!pip install python-Levenshtein
!git clone --recursive https://github.com/parlance/ctcdecode.git
!pip install wget
%cd ctcdecode
!pip install .
%cd ..

!pip install torchsummaryX # We also install a summary package to check our model's forward before training

Collecting python-Levenshtein
  Downloading python-Levenshtein-0.12.2.tar.gz (50 kB)
Building wheels for collected packages: python-Levenshtein
  Building wheel for python-Levenshtein (setup.py) ... done
  Created wheel for python-Levenshtein: filename=python_Levenshtein-0.12.2-cp37-cp37m-linux_x86_64.whl size=149862 sha256=0cac5a01d022af30b0a2763c1167c71be93cf57a67e9c9c6eb804a91b317bbb1
  Stored in directory: /root/.cache/pip/wheels/05/5f/ca/7c4367734892581bb5ff896f15027a932c551080b2abd3e00d
Successfully built python-Levenshtein
Installing collected packages: python-Levenshtein
Successfully installed python-Levenshtein-0.12.2
Clo

# Libraries

In [2]:
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torchsummaryX import summary
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

from sklearn.metrics import accuracy_score
import gc
import zipfile
import pandas as pd
from tqdm import tqdm
import os
import datetime

# imports for decoding and distance calculation
import ctcdecode
import Levenshtein
from ctcdecode import CTCBeamDecoder

import warnings
warnings.filterwarnings('ignore')

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Device: ", device)

Device:  cuda


# Kaggle (TODO)

You need to set up your Kaggle credentials and download the data.

In [3]:
import json

TOKEN = {"username": "<your-kaggle-username>", "key": "<your-kaggle-api-key>"} # TODO: fill in your own credentials; never share or commit a real API key

! pip install kaggle==1.5.12
! mkdir -p .kaggle
! mkdir -p /content && mkdir -p /content/.kaggle && mkdir -p /root/.kaggle/

with open('/content/.kaggle/kaggle.json', 'w') as file:
    json.dump(TOKEN, file)

! pip install --upgrade --force-reinstall --no-deps kaggle
! ls "/content/.kaggle"
! chmod 600 /content/.kaggle/kaggle.json
! cp /content/.kaggle/kaggle.json /root/.kaggle/

! kaggle config set -n path -v /content

Collecting kaggle
  Downloading kaggle-1.5.12.tar.gz (58 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73051 sha256=71f05c1e061fdae32d384abd96249bcbd27f952a875b3413695e93a90400feb2
  Stored in directory: /root/.cache/pip/wheels/62/d6/58/5853130f941e75b2177d281eb7e44b4a98ed46dd155f556dc5
Successfully built kaggle
Installing collected packages: kaggle
  Attempting uninstall: kaggle
    Found existing installation: kaggle 1.5.12
    Uninstalling kaggle-1.5.12:
      Successfully uninstalled kaggle-1.5.12
Successfully installed kaggle-1.5.12
kaggle.json
- path is now set to: /content


In [4]:
! kaggle competitions download -c 11-785-s22-hw3p2

Downloading 11-785-s22-hw3p2.zip to /content/competitions/11-785-s22-hw3p2
 99% 1.83G/1.84G [00:29<00:00, 65.9MB/s]
100% 1.84G/1.84G [00:29<00:00, 66.7MB/s]


In [5]:
! unzip /content/competitions/11-785-s22-hw3p2/11-785-s22-hw3p2.zip

Streaming output truncated to the last 5000 lines.
  inflating: hw3p2_student_data/hw3p2_student_data/train/transcript/7113-086041-020.npy  
  inflating: hw3p2_student_data/hw3p2_student_data/train/transcript/7113-086041-021.npy  
  inflating: hw3p2_student_data/hw3p2_student_data/train/transcript/7113-086041-022.npy  
  inflating: hw3p2_student_data/hw3p2_student_data/train/transcript/7113-086041-023.npy  
  inflating: hw3p2_student_data/hw3p2_student_data/train/transcript/7113-086041-024.npy  
  inflating: hw3p2_student_data/hw3p2_student_data/train/transcript/7113-086041-025.npy  
  inflating: hw3p2_student_data/hw3p2_student_data/train/transcript/7113-086041-026.npy  
  inflating: hw3p2_student_data/hw3p2_student_data/train/transcript/7113-086041-027.npy  
  inflating: hw3p2_student_data/hw3p2_student_data/train/transcript/7113-086041-028.npy  
  inflating: hw3p2_student_data/hw3p2_student_data/train/transcript/7113-086041-029.npy  
  inflating: hw3p2_student_data/hw3

# Dataset and dataloading (TODO)

In [6]:
# PHONEME_MAP is the list that maps each phoneme to a single character.
# The dataset contains lists of phonemes, but you need to map them to their corresponding characters to calculate the Levenshtein distance.
# Your final submission should contain the mapped strings, not the phonemes.
# No TODOs in this cell

PHONEME_MAP = [
    " ",
    ".", #SIL
    "a", #AA
    "A", #AE
    "h", #AH
    "o", #AO
    "w", #AW
    "y", #AY
    "b", #B
    "c", #CH
    "d", #D
    "D", #DH
    "e", #EH
    "r", #ER
    "E", #EY
    "f", #F
    "g", #G
    "H", #H
    "i", #IH 
    "I", #IY
    "j", #JH
    "k", #K
    "l", #L
    "m", #M
    "n", #N
    "N", #NG
    "O", #OW
    "Y", #OY
    "p", #P 
    "R", #R
    "s", #S
    "S", #SH
    "t", #T
    "T", #TH
    "u", #UH
    "U", #UW
    "v", #V
    "W", #W
    "?", #Y
    "z", #Z
    "Z" #ZH
]
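To make the mapping concrete, here is a minimal sketch of turning a sequence of phoneme indices into the single-character string used for the Levenshtein distance. `PHONEME_CHARS` and `indices_to_string` are hypothetical names; the list is the same 41 characters as `PHONEME_MAP`, written compactly so the example is self-contained.

```python
# Same 41 characters as PHONEME_MAP, in the same order (index 0 is the blank).
PHONEME_CHARS = list(" .aAhowybcdDerEfgHiIjklmnNOYpRsStTuUvW?zZ")


def indices_to_string(indices, phoneme_map=PHONEME_CHARS):
    """Map a sequence of phoneme indices to its single-character string."""
    return "".join(phoneme_map[int(i)] for i in indices)


# B=8, IH=18, K=21, SH=31, AA=2 in the phoneme ordering used in this notebook
print(indices_to_string([8, 18, 21, 31, 2]))  # bikSa
```

Using one character per phoneme means that one edit in the string corresponds to exactly one phoneme insertion, deletion, or substitution.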

In [7]:
# This cell is where your actual TODOs start
# You will need to implement the Dataset class on your own. You may implement it similarly to HW1P2 (context is not required)
# The implementation steps given below are how we have implemented it.
# However, you are welcome to do it your own way if that is more comfortable or efficient.

class LibriSamples(torch.utils.data.Dataset):

    def __init__(self, data_path, partition= "train"): # You can use partition to specify train or dev

        self.X_dir = data_path + "/" + partition + "/mfcc/"
        self.Y_dir = data_path + "/" + partition +"/transcript/"

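        # os.listdir returns entries in arbitrary order, so sort both listings.
        # This assumes each mfcc file and its transcript share the same file name.
        self.X_files = sorted(os.listdir(self.X_dir))
        self.Y_files = sorted(os.listdir(self.Y_dir))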

        # TODO: store PHONEMES from phonemes.py inside the class. phonemes.py will be downloaded from kaggle.
        # You may wish to store PHONEMES as a class attribute or a global variable as well.
        self.PHONEMES = ["", 'SIL',   'AA',    'AE',    'AH',    'AO',    'AW',    'AY',  
                        'B',     'CH',    'D',     'DH',    'EH',    'ER',    'EY',
                        'F',     'G',     'HH',    'IH',    'IY',    'JH',    'K',
                        'L',     'M',     'N',     'NG',    'OW',    'OY',    'P',
                        'R',     'S',     'SH',    'T',     'TH',    'UH',    'UW',
                        'V',     'W',     'Y',     'Z',     'ZH']

        assert(len(self.X_files) == len(self.Y_files))

    def __len__(self):
        return len(self.X_files)

    def __getitem__(self, ind):

        X_path = self.X_dir + self.X_files[ind]
        Y_path = self.Y_dir + self.Y_files[ind]
        label = [self.PHONEMES.index(yy) for yy in np.load(Y_path)[1:-1]]

        X = np.load(X_path)
        # X = (X - X.mean(axis=0))/X.std(axis=0)
    
        # X = # TODO: Load the mfcc npy file at the specified index ind in the directory
        # Y = # TODO: Load the corresponding transcripts

        # Remember, the transcripts are sequences of phonemes, e.g. np.array(['<sos>', 'B', 'IH', 'K', 'SH', 'AA', '<eos>'])
        # You need to convert these into a sequence of Long tensors
        # Tip: You may need to use self.PHONEMES
        # Remember, PHONEMES and PHONEME_MAP do not contain '<sos>' or '<eos>', but the transcripts do.
        # You need to remove '<sos>' and '<eos>' from the transcripts.
        # The inefficient way is to search for them in a loop. The efficient way is to note that '<sos>' always occurs at the start and '<eos>' at the end, so slicing with [1:-1] is enough.
        
        Yy = torch.tensor(label).long()

        return torch.tensor(X), Yy
    
    @staticmethod
    def collate_fn(batch):

        batch_x = [x for x,y in batch]
        batch_y = [y for x,y in batch]

        batch_x_pad = pad_sequence(batch_x, batch_first=True)
        lengths_x = [len(x) for x,y in batch]

        batch_y_pad = pad_sequence(batch_y, batch_first=True)
        lengths_y = [len(y) for x,y in batch]

        return batch_x_pad, batch_y_pad, torch.tensor(lengths_x), torch.tensor(lengths_y)


# You can either try to combine test data in the previous class or write a new Dataset class for test data
class LibriSamplesTest(torch.utils.data.Dataset):

    def __init__(self, data_path, test_order): # test_order is the csv similar to what you used in hw1

        self.X_dir = data_path + "/" + "test" + "/mfcc/"
        test_order_path = data_path + "/" + "test" + "/" + test_order
        self.X_files = list(pd.read_csv(test_order_path).file)
        # self.X = # TODO: Load the npy files from test_order.csv and append into a list
        # You can load the files here or save the paths here and load inside __getitem__ like the previous class
    
    def __len__(self):
        return len(self.X_files)
    
    def __getitem__(self, ind):
        # TODOs: Need to return only X because this is the test dataset
        X_path = self.X_dir + self.X_files[ind]
        X = np.load(X_path)
        # X = (X - X.mean(axis=0))/X.std(axis=0)
        return torch.tensor(X)
    
    @staticmethod
    def collate_fn(batch):
        batch_x = [x for x in batch]
        batch_x_pad = pad_sequence(batch_x, batch_first=True)
        lengths_x = [len(x) for x in batch]

        return batch_x_pad, torch.tensor(lengths_x)
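The transcript conversion described in the comments of `__getitem__` can be sketched without any file loading. This is an illustrative version, not the reference solution: `PHONEME_TO_INDEX` and `transcript_to_indices` are hypothetical names, and a dictionary replaces the repeated `list.index` lookups.

```python
PHONEMES = ["", "SIL", "AA", "AE", "AH", "AO", "AW", "AY",
            "B", "CH", "D", "DH", "EH", "ER", "EY",
            "F", "G", "HH", "IH", "IY", "JH", "K",
            "L", "M", "N", "NG", "OW", "OY", "P",
            "R", "S", "SH", "T", "TH", "UH", "UW",
            "V", "W", "Y", "Z", "ZH"]

# Build the lookup table once; each lookup is then O(1) instead of a list scan.
PHONEME_TO_INDEX = {phoneme: index for index, phoneme in enumerate(PHONEMES)}


def transcript_to_indices(transcript):
    """Drop the leading '<sos>' and trailing '<eos>', then look up each phoneme."""
    return [PHONEME_TO_INDEX[phoneme] for phoneme in transcript[1:-1]]


example = ["<sos>", "B", "IH", "K", "SH", "AA", "<eos>"]
print(transcript_to_indices(example))  # [8, 18, 21, 31, 2]
```

The resulting list can be wrapped in `torch.tensor(...).long()` exactly as the dataset class does.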

In [8]:
batch_size = 64

root = "/content/hw3p2_student_data/hw3p2_student_data"

train_data = LibriSamples(root, 'train')
val_data = LibriSamples(root, 'dev')
test_data = LibriSamplesTest(root, 'test_order.csv')

train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, collate_fn=LibriSamples.collate_fn, shuffle= True)
val_loader = torch.utils.data.DataLoader(val_data, batch_size=batch_size, collate_fn=LibriSamples.collate_fn, shuffle= True) 
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, collate_fn=LibriSamplesTest.collate_fn) 

print("Batch size: ", batch_size)
print("Train dataset samples = {}, batches = {}".format(train_data.__len__(), len(train_loader)))
print("Val dataset samples = {}, batches = {}".format(val_data.__len__(), len(val_loader)))
print("Test dataset samples = {}, batches = {}".format(test_data.__len__(), len(test_loader)))

Batch size:  64
Train dataset samples = 28539, batches = 446
Val dataset samples = 2703, batches = 43
Test dataset samples = 2620, batches = 41


In [9]:
# Optional
# Test code for checking shapes and return arguments of the train and val loaders
for data in val_loader:
    x, y, lx, ly = data # if you face an error saying "Cannot unpack", then you are not passing the collate_fn argument
    # print(x.shape, y.shape, lx.shape, ly.shape)
    packed_input = pack_padded_sequence(x, lx, batch_first=True, enforce_sorted=False)
    print(x.shape, lx.shape)
    print(packed_input[0].shape)
    # print(packed_input[1].shape)
    # print(packed_input[2].shape)
    # print(packed_input[3].shape)
    break
# lstm = nn.LSTM(input_size=13, hidden_size=256, num_layers=1, batch_first=True)
# out1, (out2, out3) = lstm(packed_input)
# out, lengths  = pad_packed_sequence(out1, batch_first=True)

torch.Size([64, 3264, 13]) torch.Size([64])
torch.Size([45630, 13])
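The two printed shapes differ because the padded batch stores every sequence at the maximum length (3264 frames here), while the packed data stores one row per real frame, so its first dimension is the sum of the 64 sequence lengths. The pure-Python sketch below mimics zero-padding with `batch_first=True` on toy data; `pad_batch` is a hypothetical helper for illustration and is not how `pad_sequence` is implemented.

```python
def pad_batch(sequences, pad_value=0.0):
    """Right-pad variable-length sequences (lists of frames) to a common length."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    feat_dim = len(sequences[0][0])
    padded = [
        list(seq) + [[pad_value] * feat_dim for _ in range(max_len - len(seq))]
        for seq in sequences
    ]
    return padded, lengths


# Three toy utterances with 2 features per frame and lengths 5, 3, and 1
batch = [[[1.0, 1.0]] * 5, [[2.0, 2.0]] * 3, [[3.0, 3.0]] * 1]
padded, lengths = pad_batch(batch)

print(len(padded), len(padded[0]), len(padded[0][0]))  # 3 5 2  (batch, max_len, features)
print(sum(lengths))  # 9  (number of rows a packed version would hold)
```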


# Model Configuration (TODO)

In [10]:
class InvertedResidualBlock(nn.Module):
    
    def __init__(self,
                 in_channels,
                 out_channels,
                 stride,
                 expand_ratio, i):
        super().__init__() # Just have to do this for all nn.Module classes

        # hidden_dim = in_channels * expand_ratio (expand_ratio is 4 in the stage configuration of Network), so hidden_dim > in_channels
        hidden_dim = in_channels * expand_ratio

        self.residual_blocks = nn.Sequential(
            nn.Conv1d(in_channels, hidden_dim, kernel_size=1, stride=1, padding=0, bias=False),
            nn.BatchNorm1d(hidden_dim),
            nn.GELU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, stride=stride, padding=1, groups=hidden_dim, bias=False),
            nn.BatchNorm1d(hidden_dim),
            nn.GELU(),
            nn.Conv1d(hidden_dim, out_channels, kernel_size=1, stride=1, padding=0, bias=False),
            nn.BatchNorm1d(out_channels),
            nn.GELU(),
            nn.Dropout(0.2)
        )


    def forward(self, x):
        out = self.residual_blocks(x)

        return x + out


In [12]:
class Network(nn.Module):

    def __init__(self): # You can add any extra arguments as you wish

        super(Network, self).__init__()

        # Embedding layer converts the raw input into features which may (or may not) help the LSTM to learn better 
        # For the very low cut-off you don't require an embedding layer. You can pass the input directly to the LSTM
        # self.embedding = 
        
        # 4-layer bidirectional LSTM with hidden_size = 256 (the very low cut-off baseline is a single-layer, uni-directional LSTM)
        self.lstm = nn.LSTM(input_size=256, hidden_size=256, num_layers=4, dropout=0.4, bidirectional=True, batch_first=True)
        # See https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html for the arguments of nn.LSTM()

        self.stem = nn.Sequential(
            nn.Conv1d(13, 64, kernel_size=3, stride=1, padding=1, bias=False),
            # nn.BatchNorm1d(64),
        )

        self.stage_cfgs = [
            # expand_ratio, channels, # blocks, stride of first block
            [4,  64, 2, 1],
            [4,  128, 2, 1],
            [4,  256, 2, 1],
        ]

        self.downsampling_layer = [
                                   nn.Sequential(
                                      #  nn.BatchNorm1d(64),
                                       nn.Conv1d(64, 128, kernel_size=2, stride=2)),
                                      #  nn.Dropout(0.3)),

                                   nn.Sequential(
                                      #  nn.BatchNorm1d(128),
                                       nn.Conv1d(128, 256, kernel_size=2, stride=2))
                                      #  nn.Dropout(0.2)),
                                   ]

        in_channels = 64

        # Let's make the layers
        layers_convnext = []
        ix = 0
        idx_dropout = -1
        for curr_stage in self.stage_cfgs:
            expand_ratio, num_channels, num_blocks, stride = curr_stage

            for block_idx in range(num_blocks):
                idx_dropout += 1
                out_channels = num_channels
                layers_convnext.append(InvertedResidualBlock(
                    in_channels=in_channels,
                    out_channels=out_channels, 
                    # only have non-trivial stride if first block
                    stride=stride if block_idx == 0 else 1,
                    expand_ratio=expand_ratio, i = idx_dropout,
                ))
                # In channels of the next block is the out_channels of the current one
                in_channels = out_channels
            if ix < 2:
              layers_convnext.append(self.downsampling_layer[ix])
              ix += 1
              in_channels = 2 * in_channels

        self.layers_convnext = nn.Sequential(*layers_convnext)


        layers_classification = [
                  nn.Linear(2 * 256, 2048),
                  nn.GELU(),
                  nn.Dropout(p=0.4),
                  nn.Linear(2048, 41),
                  # nn.GELU(),
                  # nn.Dropout(p=0.1),
                  # nn.Linear(1024, 41),
        ]

        self.layers_cl = nn.Sequential(*layers_classification)


    def forward(self, x, lengths_x): # lengths_x: unpadded length of each sequence in the batch

        # Embedding layers: Conv1d expects (batch, channels, time), so swap the last two axes
        x = torch.transpose(x, 2, 1)
        x = self.stem(x)
        x = self.layers_convnext(x)
        x = torch.transpose(x, 2, 1)
        # The two stride-2 convolutions shrink the time axis by a factor of 4
        lengths_x = lengths_x//4

        # x is returned from the dataloader. So it is assumed to be padded with the help of the collate_fn
        packed_input = pack_padded_sequence(x, lengths_x, batch_first=True, enforce_sorted=False)
        # h0 = (1, 256)
        # c0 = (1, 256)
        out1, (out2, out3) = self.lstm(packed_input)
        # output, (hn, cn) = rnn(input, (h0, c0))
        # As you may see from the LSTM docs, LSTM returns 3 vectors. Which one do you need to pass to the next function?
        out, lengths  = pad_packed_sequence(out1, batch_first=True)

        # out = self.classification1(out)
        # out = self.classification2(out)
        out = self.layers_cl(out)
        out = F.log_softmax(out, dim=2) # Log softmax over the class dimension

        return out, lengths

model = Network().to(device)
print(model)
# model.load_state_dict(torch.load("/content/model_epoch_after_50_13.pt"))
summary(model, x.to(device), lx) # x and lx are from the previous cell

Network(
  (lstm): LSTM(256, 256, num_layers=4, batch_first=True, dropout=0.4, bidirectional=True)
  (stem): Sequential(
    (0): Conv1d(13, 64, kernel_size=(3,), stride=(1,), padding=(1,), bias=False)
  )
  (layers_convnext): Sequential(
    (0): InvertedResidualBlock(
      (residual_blocks): Sequential(
        (0): Conv1d(64, 256, kernel_size=(1,), stride=(1,), bias=False)
        (1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): GELU()
        (3): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,), groups=256, bias=False)
        (4): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): GELU()
        (6): Conv1d(256, 64, kernel_size=(1,), stride=(1,), bias=False)
        (7): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (8): GELU()
        (9): Dropout(p=0.2, inplace=False)
      )
    )
    (1): InvertedResidualBlock(
      (residual_blo

| Layer | Kernel Shape | Output Shape | Params | Mult-Adds |
|---|---|---|---|---|
| 0_stem.Conv1d_0 | [13, 64, 3] | [64, 64, 3264] | 2496.0 | 8146944.0 |
| 1_layers_convnext.0.residual_blocks.Conv1d_0 | [64, 256, 1] | [64, 256, 3264] | 16384.0 | 53477376.0 |
| 2_layers_convnext.0.residual_blocks.BatchNorm1d_1 | [256] | [64, 256, 3264] | 512.0 | 256.0 |
| 3_layers_convnext.0.residual_blocks.GELU_2 | - | [64, 256, 3264] | | |
| 4_layers_convnext.0.residual_blocks.Conv1d_3 | [1, 256, 3] | [64, 256, 3264] | 768.0 | 2506752.0 |
| ... | ... | ... | ... | ... |
| 63_lstm | - | [11387, 512] | 5783552.0 | 5767168.0 |
| 64_layers_cl.Linear_0 | [512, 2048] | [64, 816, 2048] | 1050624.0 | 1048576.0 |
| 65_layers_cl.GELU_1 | - | [64, 816, 2048] | | |
| 66_layers_cl.Dropout_2 | - | [64, 816, 2048] | | |

# Training Configuration (TODO)

In [13]:
criterion = nn.CTCLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay = 2.00e-03) 
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5, mode='min', threshold=0.01)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=(len(train_loader) * 30))
# scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5,10, 15, 20, 25], gamma=0.7)
# Do you need to transpose or permute the model output to find out the loss? Read its documentation

PHONEMES = ["", 'SIL',   'AA',    'AE',    'AH',    'AO',    'AW',    'AY',  
                        'B',     'CH',    'D',     'DH',    'EH',    'ER',    'EY',
                        'F',     'G',     'HH',    'IH',    'IY',    'JH',    'K',
                        'L',     'M',     'N',     'NG',    'OW',    'OY',    'P',
                        'R',     'S',     'SH',    'T',     'TH',    'UH',    'UW',
                        'V',     'W',     'Y',     'Z',     'ZH']

decoder = CTCBeamDecoder(
    PHONEME_MAP,
    model_path=None,
    alpha=0,
    beta=0,
    cutoff_top_n=40,
    cutoff_prob=1.0,
    beam_width=20,
    num_processes=4,
    blank_id=0,
    log_probs_input=True
)
# beam_results, beam_scores, timesteps, out_lens = decoder.decode(output)
# Check out https://github.com/parlance/ctcdecode for the details on how to implement decoding
# Do you need to give log_probs_input = True or False?
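To build intuition for what a CTC decoder returns, the sketch below implements greedy (best-path) decoding for a single utterance. This is a simpler technique than the beam search performed by `CTCBeamDecoder` and is not how `ctcdecode` works internally; `ctc_greedy_decode` is a hypothetical name. It illustrates the two CTC rules that matter here: repeated labels are merged, and the blank (index 0, matching `blank_id=0`) is removed.

```python
def ctc_greedy_decode(log_probs, blank_id=0):
    """Best-path CTC decoding for one utterance.

    log_probs: list of per-frame score lists, shape (time, num_classes).
    Returns the label indices after merging repeats and removing blanks.
    """
    # Pick the highest-scoring class at every time step
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in log_probs]

    decoded = []
    previous = None
    for label in best_path:
        # Keep a label only when it starts a new run and is not the blank
        if label != previous and label != blank_id:
            decoded.append(label)
        previous = label
    return decoded


# Toy scores over 3 classes (0 = blank) whose best path is 1 1 0 1 2 2 0
path = [1, 1, 0, 1, 2, 2, 0]
frames = [[-0.1 if c == label else -5.0 for c in range(3)] for label in path]
print(ctc_greedy_decode(frames))  # [1, 1, 2]
```

Note how the blank between the two runs of label 1 is what allows the same label to appear twice in a row in the output.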

In [14]:
# this function calculates the Levenshtein distance 

def calculate_levenshtein(h, y, lh, ly, decoder, PHONEME_MAP):

    # h - output from the model. Probability distributions at each time step 
    # y - target output sequence - sequence of Long tensors
    # lh, ly - Lengths of output and target
    # decoder - decoder object which was initialized in the previous cell
    # PHONEME_MAP - maps output to a character to find the Levenshtein distance

    # TODO: You may need to transpose or permute h based on how you passed it to the criterion
    # Print out the shapes often to debug
    beam_results, beam_scores, timesteps, out_lens = decoder.decode(h, seq_lens=lh)

    # TODO: call the decoder's decode method and get beam_results and out_len (Read the docs about the decode method's outputs)
    # Input to the decode method will be h and its lengths lh 
    # You need to pass lh for the 'seq_lens' parameter. This is not explicitly mentioned in the git repo of ctcdecode.

    batch_size = h.shape[0]

    dist = 0

    for i in range(batch_size): # Loop through each element in the batch

        h_sliced = beam_results[i][0][:out_lens[i][0].item()]
        # Remember that h is padded to the max sequence length and lh contains lengths of individual sequences
        # Same goes for beam_results and out_lens
        # You do not require the padded portion of beam_results - you need to slice it with out_lens 
        # If it is confusing, print out the shapes of all the variables and try to understand

        h_string =  "".join([PHONEME_MAP[phoneme] for phoneme in h_sliced])
        # TODO: MAP the sequence of numbers to its corresponding characters with PHONEME_MAP and merge everything as a single string

        y_sliced = y[i][:ly[i].item()]
        # TODO: Do the same for y - slice off the padding with ly
        y_string = "".join([PHONEME_MAP[phoneme] for phoneme in y_sliced])
        # TODO: MAP the sequence of numbers to its corresponding characters with PHONEME_MAP and merge everything as a single string
        
        dist += Levenshtein.distance(h_string, y_string)

    dist/=batch_size

    return dist
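For reference, the quantity returned by `Levenshtein.distance` is the classic edit distance. The sketch below is a plain dynamic-programming version for illustration only (`levenshtein` is a hypothetical name); the notebook itself relies on the `python-Levenshtein` package, which is much faster.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute each cost 1)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("bikSa", "bika"))      # 1
```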

In [15]:
# Optional but recommended

with torch.no_grad(): # no gradients are needed for this sanity check
    for i, data in enumerate(train_loader, 0):

        # Write test code to perform a single forward pass and also compute the Levenshtein distance
        # Make sure that you are able to get this right before going on to the actual training
        # You may encounter a lot of shape errors
        # Printing out the shapes will help in debugging
        # Keep in mind that the loss expects its input in one format and the decoder expects it in another
        # Make sure to read the corresponding docs about it
        h, y, lh, ly = data
        h = h.to(device)
        h, lh = model(h, lh)

        print(calculate_levenshtein(h, y, lh, ly, decoder, PHONEME_MAP))

        del h # free the model output before leaving the loop
        break # one iteration is enough

125.828125


In [16]:
torch.cuda.empty_cache() # Use this often

# TODO: Write the model evaluation function if you want to validate after every epoch

# You are free to write your own code for model evaluation or you can use the code from previous homeworks' starter notebooks
# However, you will have to make modifications because of the following.
# (1) The dataloader returns 4 items unlike 2 for hw2p2
# (2) The model forward returns 2 outputs
# (3) The loss may require transpose or permuting

# Note that when you give a higher beam width, decoding will take a longer time to get executed
# Therefore, it is recommended that you calculate only the val dataset's Levenshtein distance (train not recommended) with a small beam width
# When you are evaluating on your test set, you may have a higher beam width

def validate(model, device, val_loader):
  model.eval()
  levenshtein_distances = []
  with torch.no_grad(): # gradients are not needed for evaluation; this saves memory
    for i, data in enumerate(val_loader):

      h, y, lh, ly = data
      h = h.to(device)
      h, lh = model(h, lh)
      levenshtein_distances.append(calculate_levenshtein(h, y, lh, ly, decoder, PHONEME_MAP))
  # Average of the per-batch average distances
  return np.average(levenshtein_distances)
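One subtlety of averaging per-batch averages: the last validation batch is smaller (2703 = 42 × 64 + 15), so its utterances get slightly more weight than the others. The effect is usually small, but a sample-weighted mean avoids it. The sketch below uses made-up numbers purely to show the difference; both function names are hypothetical.

```python
def mean_of_batch_means(batch_means):
    """Unweighted average of per-batch averages."""
    return sum(batch_means) / len(batch_means)


def sample_weighted_mean(batch_means, batch_sizes):
    """Average over all samples, weighting each batch mean by its batch size."""
    total = sum(mean * size for mean, size in zip(batch_means, batch_sizes))
    return total / sum(batch_sizes)


# Made-up per-batch distances; the last batch is smaller than the others
batch_means = [10.0, 10.0, 40.0]
batch_sizes = [64, 64, 32]

print(mean_of_batch_means(batch_means))                # 20.0
print(sample_weighted_mean(batch_means, batch_sizes))  # 16.0
```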

In [17]:
# validate(model, device, val_loader)


In [18]:
torch.cuda.empty_cache()

# TODO: Write the model training code 

# You are free to write your own code for training or you can use the code from previous homeworks' starter notebooks
# However, you will have to make modifications because of the following.
# (1) The dataloader returns 4 items unlike 2 for hw2p2
# (2) The model forward returns 2 outputs
# (3) The loss may require transpose or permuting

# Tip: Implement mixed precision training

def train(model, device, train_loader, optimizer, criterion, log_interval):
    # Note: the logging below reads the global variable `epoch` set by the training loop
    model.train()
    scaler = torch.cuda.amp.GradScaler()
    for batch_idx, data in enumerate(train_loader):
        h, y, lh, ly = data
        h, y = h.to(device), y.to(device) # keep the targets on the same device as the model output
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
          h1, lh1 = model(h, lh)
          h1 = torch.transpose(h1, 0, 1) # CTCLoss expects (time, batch, classes)
          loss = criterion(h1, y, lh1, ly)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        if batch_idx % log_interval == 0:
                print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}\tLR: {}'.format(
                    epoch, batch_idx * len(h), len(train_loader.dataset),
                    100. * batch_idx / len(train_loader), loss.item(), optimizer.state_dict()['param_groups'][0]['lr']))
          

# Train

In [19]:
torch.cuda.empty_cache()
# scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=3, mode='max', threshold=0.001)
epochs = 1
log_interval = 32

for epoch in range(1, epochs + 1):
  train(model, device, train_loader, optimizer, criterion,log_interval)

  # train_lev_distance = validate(model, device, train_loader)
  dev_lev_distance = validate(model, device, val_loader)
  scheduler.step(dev_lev_distance) # ReduceLROnPlateau needs the monitored metric

  torch.save(model.state_dict(), f"model_epoch_after_50_{epoch}.pt")
  
  print(f'Epoch {epoch}/{epochs}')
  print('Lev Distance ', dev_lev_distance)




RuntimeError: ignored
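`ReduceLROnPlateau` expects the monitored metric to be passed to `step()` and lowers the learning rate only after the metric has stopped improving for more than `patience` epochs. The simulation below is a simplified sketch of that rule in `'min'` mode with a relative threshold, using the same `factor`, `patience`, and `threshold` as the scheduler configured above. It ignores options such as `cooldown` and `min_lr`, and `plateau_schedule` is a hypothetical name, not a PyTorch function.

```python
def plateau_schedule(metrics, lr=1e-3, factor=0.5, patience=5, threshold=0.01):
    """Simulate a reduce-on-plateau rule ('min' mode, relative threshold).

    Returns the learning rate in effect after each epoch's metric is reported.
    """
    best = float("inf")
    bad_epochs = 0
    history = []
    for metric in metrics:
        if metric < best * (1.0 - threshold):
            # Significant improvement: remember it and reset the counter
            best = metric
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            # Too many epochs without improvement: shrink the learning rate
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# A validation distance stuck at 30.0: the rate is halved after 6 stagnant epochs
print(plateau_schedule([30.0] * 7))
```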

# Submit to kaggle (TODO)

In [None]:
# this function decodes the model output into strings for the test set

def output_string_test(h, lh, decoder, PHONEME_MAP):

    # h - output from the model. Probability distributions at each time step 
    # lh - Lengths of the output sequences
    # decoder - decoder object which was initialized in an earlier cell
    # PHONEME_MAP - maps each output index to a character

    # TODO: You may need to transpose or permute h based on how you passed it to the criterion
    # Print out the shapes often to debug
    beam_results, beam_scores, timesteps, out_lens = decoder.decode(h, seq_lens=lh)

    # TODO: call the decoder's decode method and get beam_results and out_len (Read the docs about the decode method's outputs)
    # Input to the decode method will be h and its lengths lh 
    # You need to pass lh for the 'seq_lens' parameter. This is not explicitly mentioned in the git repo of ctcdecode.

    batch_size = h.shape[0]
    
    output_strings_batch = []
    for i in range(batch_size): # Loop through each element in the batch

        h_sliced = beam_results[i][0][:out_lens[i][0].item()]
        # Remember that h is padded to the max sequence length and lh contains lengths of individual sequences
        # Same goes for beam_results and out_lens
        # You do not require the padded portion of beam_results - you need to slice it with out_lens 
        # If it is confusing, print out the shapes of all the variables and try to understand

        h_string =  "".join([PHONEME_MAP[phoneme] for phoneme in h_sliced])
        # TODO: MAP the sequence of numbers to its corresponding characters with PHONEME_MAP and merge everything as a single string
        output_strings_batch.append(h_string)

    return output_strings_batch

In [None]:
# TODO: Write your model evaluation code for the test dataset
# You can write your own code or use code from the previous homeworks' starter notebooks
# You can't calculate loss here. Why?

def test(model, device, test_loader):
  model.eval()
  
  output_strings = []
  with torch.no_grad(): # gradients are not needed for inference; this saves memory
    for i, data in enumerate(test_loader):

      h, lh = data
      h = h.to(device)
      h, lh = model(h, lh)
      list_strings_batch = output_string_test(h, lh, decoder, PHONEME_MAP)
      output_strings.extend(list_strings_batch)
  return output_strings

In [None]:
# TODO: Generate the csv file
output_strings = test(model, device, test_loader)
output = pd.DataFrame()
output['id'] = np.array(range(len(output_strings)))
output['predictions'] = np.array(output_strings) 
output.to_csv("submission.csv", index = False)
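The resulting file has two columns, `id` and `predictions`, with one row per test utterance in the order of `test_order.csv`. As a quick way to see that layout without pandas, the sketch below writes two made-up predictions to an in-memory buffer using the standard `csv` module; `write_submission` is a hypothetical helper, and the prediction strings are invented for illustration.

```python
import csv
import io


def write_submission(predictions, handle):
    """Write (id, predictions) rows, with ids counting up from 0."""
    writer = csv.writer(handle)
    writer.writerow(["id", "predictions"])
    for index, text in enumerate(predictions):
        writer.writerow([index, text])


buffer = io.StringIO()
write_submission(["bikSa", ".hAt."], buffer)  # made-up predictions
print(buffer.getvalue())
```

The `csv` module quotes fields automatically when a prediction contains a comma or a quote character, so the file stays parseable.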

In [None]:
!kaggle competitions submit -c 11-785-s22-hw3p2 -f submission.csv -m "submission 50 epochs, residual embedding"

100% 212k/212k [00:02<00:00, 106kB/s]
Successfully submitted to Automatic Speech Recognition (ASR)