# **Homework 2-1 Phoneme Classification**

## The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT)
The TIMIT corpus of read speech is designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems.

This homework is a multiclass classification task: 
we train a deep neural network classifier to predict the phoneme of each frame in the TIMIT speech corpus.

link: https://academictorrents.com/details/34e2b78745138186976cbc27939b1b34d18bd5b3

## Set up environment variable

In [None]:
# For reproducibility: set before any CUDA work so that cuBLAS (used by the LSTM on GPU) can run deterministically
import os
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ":16:8"

## Download Data
Download the data from Google Drive, then unzip it.

You should have `timit_11/train_11.npy`, `timit_11/train_label_11.npy`, and `timit_11/test_11.npy` after running this block.<br><br>
`timit_11/`
- `train_11.npy`: training data<br>
- `train_label_11.npy`: training labels<br>
- `test_11.npy`: testing data<br><br>

**Note: if the Google Drive link is dead, you can download the data directly from Kaggle and upload it to the workspace.**




In [None]:
!gdown --id '1HPkcmQmFGu-3OknddKIa5dNDsR05lIQR' --output data.zip
!unzip data.zip
!ls 

Downloading...
From: https://drive.google.com/uc?id=1HPkcmQmFGu-3OknddKIa5dNDsR05lIQR
To: /content/data.zip
372MB [00:02, 125MB/s]
Archive:  data.zip
   creating: timit_11/
  inflating: timit_11/train_11.npy   
  inflating: timit_11/test_11.npy    
  inflating: timit_11/train_label_11.npy  
data.zip  sample_data  timit_11


## Import Packages

In [None]:
import numpy as np
import torch
import torch.nn as nn

## Preparing Data
Load the training and testing data from the `.npy` files (NumPy arrays).

In [None]:
print('Loading data ...')

data_root='./timit_11/'
train = np.load(data_root + 'train_11.npy')
train_label = np.load(data_root + 'train_label_11.npy').astype(int)
test = np.load(data_root + 'test_11.npy')

print('Size of training data: {}'.format(train.shape))
print('Size of testing data: {}'.format(test.shape))

Loading data ...
Size of training data: (1229932, 429)
Size of testing data: (451552, 429)
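
The 429 columns are consistent with 11 context frames of 39 features each (11 × 39 = 429), which is the layout the slicing code later in this notebook relies on. The sketch below illustrates that layout on a synthetic row; the 11 × 39 interpretation is an assumption inferred from the shapes, and `row`, `frames`, and `center` are illustrative names.

```python
import numpy as np

n_context, n_features = 11, 39   # assumed layout: 11 frames x 39 features = 429 columns

# Synthetic stand-in for one row of train_11.npy
row = np.arange(n_context * n_features, dtype=np.float32)

# View the flat row as (frame, feature) and pick the middle frame
frames = row.reshape(n_context, n_features)
center = frames[n_context // 2]

# The middle frame occupies columns 195..233, i.e. offset 39 * 5
print(frames.shape)                              # (11, 39)
print(np.array_equal(center, row[195:234]))      # True
```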


### Reconstruct frames into sequences

In [None]:
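# Find the indices at which one sequence (utterance) ends and the next begins.
# Each 429-dim row appears to hold 11 consecutive frames of 39 features; the center frame starts at offset 39 * 5.
center_offset = 39 * 5
n_features = 39
def find_split_indices(arr):
    indices = [0]
    each_seq_len = []
    for i in range(arr.shape[0] - 1):
        # Within one sequence, the center frame of row i reappears as the frame just before the center of row i + 1
        cur_center = arr[i][center_offset : center_offset + n_features]
        next_prev = arr[i + 1][center_offset - n_features : center_offset]
        if not np.array_equal(cur_center, next_prev):
            indices.append(i + 1)
            each_seq_len.append(indices[-1] - indices[-2])
    each_seq_len.append(arr.shape[0] - indices[-1])
    return indices, each_seq_len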

train_split_indices, train_seq_len = find_split_indices(train)
test_split_indices, test_seq_len = find_split_indices(test)
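
To see why comparing the center frame of one row with the preceding-frame slot of the next row recovers sequence boundaries, here is a small synthetic sketch. It assumes context windows are zero-padded at the edges of each utterance (the padding used in the real data is not stated here), and `make_context_windows` and `split_points` are hypothetical helpers, not part of the notebook.

```python
import numpy as np

def make_context_windows(utt, context=5):
    """Stack each frame with `context` neighbours per side (zero-padded at the edges)."""
    n_frames, _ = utt.shape
    padded = np.pad(utt, ((context, context), (0, 0)))
    return np.stack([padded[t : t + 2 * context + 1].reshape(-1) for t in range(n_frames)])

def split_points(rows, n_feat, context=5):
    """Return the row index at which each sequence starts."""
    center = context * n_feat
    cur_center = rows[:-1, center : center + n_feat]       # center frame of row i
    next_prev = rows[1:, center - n_feat : center]          # frame before the center of row i + 1
    is_boundary = np.any(cur_center != next_prev, axis=1)   # mismatch => new sequence starts
    return [0] + (np.flatnonzero(is_boundary) + 1).tolist()

rng = np.random.default_rng(0)
utt_a = rng.normal(size=(7, 3))   # toy utterance: 7 frames, 3 features
utt_b = rng.normal(size=(4, 3))   # toy utterance: 4 frames, 3 features
rows = np.concatenate([make_context_windows(utt_a), make_context_windows(utt_b)])

starts = split_points(rows, n_feat=3)
print(starts)  # [0, 7]
```

The vectorized comparison does the same check as the loop above in a single pass, which may matter on the 1.2 million training rows.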

In [None]:
# Fetch the center frame
train = train[:, center_offset : center_offset + n_features]
test = test[:, center_offset : center_offset + n_features]

### Pad each sequence to the same length (only for training data)

In [None]:
def transform_to_padded_seqs(data, split_indices, seq_len):
    # Slice `data` into one tensor per sequence; the last sequence runs to the end of `data`
    bounds = list(split_indices) + [len(data)]
    seqs = [torch.tensor(data[bounds[i] : bounds[i + 1]]) for i in range(len(split_indices))]
    # Pad every sequence to the longest one; -1 marks padded positions
    padded_seqs = torch.nn.utils.rnn.pad_sequence(seqs, batch_first=True, padding_value=-1)
    # Reorder by descending length, the order pack_padded_sequence expects by default
    sorted_idx = np.argsort(np.array(seq_len) * -1)
    padded_seqs = padded_seqs[sorted_idx]

    return padded_seqs
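
A toy illustration of what the padding step produces, using three short integer sequences in place of the real frames. The values are made up; only `pad_sequence` and the descending sort mirror the function above.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three toy sequences of different lengths
seqs = [torch.tensor([1, 2]), torch.tensor([3, 4, 5, 6]), torch.tensor([7, 8, 9])]
lengths = torch.tensor([len(s) for s in seqs])

# Pad to the longest sequence with -1, then reorder by descending length
padded = pad_sequence(seqs, batch_first=True, padding_value=-1)
order = torch.argsort(lengths, descending=True)
padded = padded[order]
sorted_lengths = lengths[order].tolist()

print(padded)
# tensor([[ 3,  4,  5,  6],
#         [ 7,  8,  9, -1],
#         [ 1,  2, -1, -1]])
print(sorted_lengths)  # [4, 3, 2]
```

For the labels, the value -1 matters because the loss in this notebook is created with `ignore_index=-1`; for the features it is only a filler, since packed sequences skip the padded positions.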

In [None]:
# preprocessing training data
padded_train = transform_to_padded_seqs(train, train_split_indices, train_seq_len)
padded_train_labels = transform_to_padded_seqs(train_label, train_split_indices, train_seq_len)
train_seq_len.sort(reverse=True)
max_train_seq_len = train_seq_len[0]
# preprocessing testing data
l = len(test_split_indices)
test_seqs_list = [torch.tensor(test[test_split_indices[i] : test_split_indices[i + 1]]) if i != l - 1 else torch.tensor(test[test_split_indices[i]:]) for i in range(l)]

### Split data into training and validation set if needed

In [None]:
VAL = False
if VAL:
    train_indices = [i for i in range(padded_train.shape[0]) if i % 10 != 0]
    val_indices = [i for i in range(padded_train.shape[0]) if i % 10 == 0]
    padded_val = padded_train[val_indices]
    padded_train = padded_train[train_indices]
    padded_val_labels = padded_train_labels[val_indices]
    padded_train_labels = padded_train_labels[train_indices]
    val_seq_len = [train_seq_len[i] for i in range(len(train_seq_len)) if i % 10 == 0 ]
    train_seq_len = [train_seq_len[i] for i in range(len(train_seq_len)) if i % 10 != 0]
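
The split above sends every tenth sequence to the validation set. Because the sequences are already sorted by length, this keeps long and short sequences in both sets. A small sketch of the same split with a boolean mask, where `n_seqs` is a hypothetical sequence count:

```python
import numpy as np

n_seqs = 25                      # hypothetical number of sequences
idx = np.arange(n_seqs)

val_mask = idx % 10 == 0         # every tenth sequence goes to validation
val_indices = idx[val_mask]
train_indices = idx[~val_mask]

print(val_indices.tolist())      # [0, 10, 20]
print(len(train_indices))        # 22
```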

## Create Model

In [None]:
class Classifier(nn.Module):
    def __init__(self, device):
        super(Classifier, self).__init__()
        self.n_layer = 2
        self.hidden_dim = 768 * 2
        self.lstm = nn.LSTM(input_size=39, hidden_size=self.hidden_dim // 2, num_layers=self.n_layer, bidirectional=True, dropout=0.5)
        self.out = nn.Linear(self.hidden_dim, 39)
        self.act = nn.ReLU()
        self.drop = nn.Dropout(0.5)

    def forward(self, x, seq_len, total_length):
        x = torch.nn.utils.rnn.pack_padded_sequence(x, seq_len, batch_first=True)
        x, _ = self.lstm(x)
        x, l = torch.nn.utils.rnn.pad_packed_sequence(x, total_length=total_length, batch_first=True)
        x = x.reshape(-1, self.hidden_dim)
        x = self.act(x)
        x = self.drop(x)
        x = self.out(x)
        return x
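
The shape changes inside `forward` are easy to lose track of, so here is a sketch that follows a tiny batch through pack, LSTM, unpack, and flatten. The sizes (`hidden = 8`, a batch of two sequences) are chosen only for illustration and are much smaller than the model above.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
batch, max_len, n_feat, hidden = 2, 5, 39, 8   # toy sizes, not the notebook's
x = torch.randn(batch, max_len, n_feat)
lengths = [5, 3]                                # true lengths, in descending order

# Each direction gets hidden // 2 units, so the concatenated output has `hidden` units
lstm = nn.LSTM(input_size=n_feat, hidden_size=hidden // 2, num_layers=2, bidirectional=True)

packed = pack_padded_sequence(x, lengths, batch_first=True)   # drops padded steps
out, _ = lstm(packed)
out, out_lengths = pad_packed_sequence(out, total_length=max_len, batch_first=True)

flat = out.reshape(-1, hidden)                  # one row per (sequence, time step)
logits = nn.Linear(hidden, 39)(flat)            # 39 phoneme classes

print(tuple(out.shape))      # (2, 5, 8)
print(tuple(logits.shape))   # (10, 39)
print(out_lengths.tolist())  # [5, 3]
```

Positions beyond a sequence's true length come back from `pad_packed_sequence` as zeros, so they still produce logits after flattening; they have to be excluded later through the padded labels.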

## **Train**
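
The training loop in this section excludes padded positions in two places: `CrossEntropyLoss(ignore_index=-1)` skips them in the loss, and the accuracy count masks them out. A minimal sketch on made-up logits and labels (three classes instead of 39) shows both effects:

```python
import torch
import torch.nn as nn

# Four flattened time steps; the last two are padding (label -1)
logits = torch.tensor([[2.0, 0.0, 0.0],
                       [0.0, 3.0, 0.0],
                       [0.0, 0.0, 1.0],
                       [5.0, 0.0, 0.0]])
labels = torch.tensor([0, 2, -1, -1])

criterion = nn.CrossEntropyLoss(ignore_index=-1)
loss = criterion(logits, labels)

# Same value as averaging the loss over the two real positions only
manual = nn.functional.cross_entropy(logits[:2], labels[:2])

# Accuracy over real positions only
pred = logits.argmax(dim=1)
valid = labels != -1
correct = ((pred == labels) & valid).sum().item()
acc = correct / valid.sum().item()

print(torch.isclose(loss, manual).item())  # True
print(acc)                                 # 0.5
```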

In [None]:
# fix random seed
def same_seeds(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  
    np.random.seed(seed)  
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
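
A quick way to check that seeding behaves as intended is to seed twice and compare the random draws. The sketch below uses a hypothetical helper, `seed_everything`, that mirrors the seeding calls of `same_seeds`; it only checks the CPU generators.

```python
import numpy as np
import torch

def seed_everything(seed):
    """Seed the NumPy and PyTorch generators (hypothetical helper mirroring same_seeds)."""
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

seed_everything(0)
first_torch, first_np = torch.rand(3), np.random.rand(3)

seed_everything(0)
second_torch, second_np = torch.rand(3), np.random.rand(3)

same_torch = torch.equal(first_torch, second_torch)
same_np = np.array_equal(first_np, second_np)
print(same_torch, same_np)  # True True
```

Identical draws after reseeding do not by themselves guarantee bit-identical GPU training runs; that also depends on the cuDNN flags and the `CUBLAS_WORKSPACE_CONFIG` setting used in this notebook.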

In [None]:
#check device
def get_device():
  return 'cuda' if torch.cuda.is_available() else 'cpu'

In [None]:
# fix random seed for reproducibility
same_seeds(0)

# get device 
device = get_device()
print(f'DEVICE: {device}')

n_train_seqs = padded_train.shape[0]
if VAL:
    n_val_seqs = padded_val.shape[0]

# Hyperparameters
n_epoch = 8
batch_size = 2
learning_rate = 0.001          
model_path = './model.ckpt'

# create model, define a loss function, and optimizer
model = Classifier(device).to(device)
criterion = nn.CrossEntropyLoss(ignore_index=-1) 
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# random number generator for batch index
rng = np.random.default_rng(0)

best_acc = 0.0

for i in range(n_epoch):

    train_acc = 0.0
    train_loss = 0.0
    total_labels_num = 0
    model.train()

    if i == 5:
        for g in optimizer.param_groups:
            g['lr'] = g['lr'] / 10
            print("Change learning rate to: ", g['lr'])

    for j in range(int(n_train_seqs / batch_size)):
        
        batch_indices = np.sort(rng.choice(n_train_seqs, batch_size, replace=False))
        batch_padded_train = padded_train[batch_indices].to(device)
        batch_train_seq_len = np.array(train_seq_len)[batch_indices].tolist() # The length of each sequence
        batch_padded_train_labels = padded_train_labels[batch_indices].to(device).reshape(-1)

        optimizer.zero_grad() 
        
        batch_logits = model(batch_padded_train.float(), batch_train_seq_len, max_train_seq_len)
        _, batch_train_pred = torch.max(batch_logits, 1) 
        
        batch_loss = criterion(batch_logits, batch_padded_train_labels)

        train_acc += (torch.logical_and(batch_train_pred.cpu() == batch_padded_train_labels.cpu(), batch_padded_train_labels.cpu() != -1)).sum().item()
        train_loss += batch_loss.item()
        total_labels_num += (batch_padded_train_labels.cpu() != -1).sum().item()

        batch_loss.backward()  # the graph is rebuilt on every step, so retain_graph is not needed
        optimizer.step()

        avg_train_loss = train_loss / int(n_train_seqs / batch_size)
        avg_train_acc = train_acc / total_labels_num

    if VAL:
        model.eval()
        with torch.no_grad():
            val_data = padded_val.float().to(device)
            logits = model(val_data, val_seq_len, max_train_seq_len)
            _, val_pred = torch.max(logits, 1)
            labels = padded_val_labels.to(device).reshape(-1)
            val_loss = criterion(logits, labels)
            val_acc = (torch.logical_and(val_pred.cpu() == labels.cpu(), labels.cpu() != -1)).sum().item()
            total_labels_num = (labels.cpu() != -1).sum().item()
            avg_val_acc = val_acc / total_labels_num
            avg_val_loss = val_loss
            if avg_val_acc > best_acc:
                best_acc = avg_val_acc
                torch.save(model.state_dict(), model_path)
                print('saving model with acc {:.3f}'.format(avg_val_acc))
        print('[{:03d}/{:03d}] Train Acc: {:3.6f} Loss: {:3.6f} | Val Acc: {:3.6f} loss: {:3.6f}'.format(
            i + 1, n_epoch, avg_train_acc, avg_train_loss, avg_val_acc, avg_val_loss
            ))
    else:
        print('[{:03d}/{:03d}] Train Acc: {:3.6f} Loss: {:3.6f}'.format(
            i + 1, n_epoch, avg_train_acc, avg_train_loss))
        torch.save(model.state_dict(), model_path)

DEVICE: cuda
[001/008] Train Acc: 0.659272 Loss: 1.095358
[002/008] Train Acc: 0.767911 Loss: 0.711162
[003/008] Train Acc: 0.804336 Loss: 0.588268
[004/008] Train Acc: 0.832855 Loss: 0.498345
[005/008] Train Acc: 0.851354 Loss: 0.435548
Change learning rate to:  0.0001
[006/008] Train Acc: 0.881419 Loss: 0.342824
[007/008] Train Acc: 0.900688 Loss: 0.281273
[008/008] Train Acc: 0.911380 Loss: 0.247932
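
The log shows the learning rate dropping by a factor of ten at the start of the sixth epoch, which the loop does by editing `optimizer.param_groups` by hand. As an alternative sketch (not what the notebook does), the same schedule could be expressed with `torch.optim.lr_scheduler.MultiStepLR`; the single dummy parameter below stands in for the model.

```python
import torch

# Dummy parameter standing in for model.parameters()
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.Adam(params, lr=0.001)

# Multiply the learning rate by 0.1 once epoch index 5 is reached
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5], gamma=0.1)

lrs = []
for epoch in range(8):
    lrs.append(optimizer.param_groups[0]['lr'])  # rate used during this epoch
    optimizer.step()                              # training for the epoch would happen here
    scheduler.step()                              # advance the schedule once per epoch

# Epochs 1-5 use 0.001; epochs 6-8 use roughly 0.0001
print(lrs)
```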


## Testing

In [None]:
# create model and load weights from checkpoint
del model
model = Classifier(device).to(device)
model.load_state_dict(torch.load(model_path))

<All keys matched successfully>

Make predictions.

In [None]:
predict = []
model.eval() # set the model to evaluation mode
with torch.no_grad():
    for seq in test_seqs_list:
        seq = seq.float().to(device)
        logits = model(seq.unsqueeze(0), [seq.shape[0]], seq.shape[0])
        _, test_pred = torch.max(logits, 1)
        # collect the predictions as plain Python ints, one transfer per sequence
        predict.extend(test_pred.cpu().tolist())

Write the predictions to a CSV file.

After this block finishes running, download `prediction.csv` from the file browser on the left-hand side and submit it to Kaggle.

In [None]:
with open('prediction.csv', 'w') as f:
    f.write('Id,Class\n')
    for i, y in enumerate(predict):
        f.write('{},{}\n'.format(i, y))
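
Before submitting, it can help to read the file back and confirm the header and row count. The sketch below writes a few made-up predictions with the standard `csv` module and checks the result; `write_predictions` is a hypothetical helper and the temporary path is used only to keep the example from touching the real `prediction.csv`.

```python
import csv
import os
import tempfile

def write_predictions(path, predictions):
    """Write one `Id,Class` row per frame-level prediction (hypothetical helper)."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Id', 'Class'])
        for i, y in enumerate(predictions):
            writer.writerow([i, int(y)])

toy_predictions = [3, 3, 17, 0]   # made-up class ids
path = os.path.join(tempfile.mkdtemp(), 'prediction.csv')
write_predictions(path, toy_predictions)

# Read the file back: one header line plus one line per prediction
with open(path) as f:
    lines = f.read().splitlines()

print(lines[0])        # Id,Class
print(len(lines) - 1)  # 4
```

For the real submission, the row count should equal the number of test frames reported when the data was loaded.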

## Reference
Source: TA Sample code (https://github.com/ga642381/ML2021-Spring/blob/main/HW02/HW02-1.ipynb)
<br>
PyTorch official documentation
<br>
https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_sequence.html
https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_packed_sequence.html
https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html
<br>
RNN introduction
<br>
https://www.youtube.com/watch?v=xCGidAeyS4M (Prof. Hung-Yi Lee's YouTube channel)


