
# KAIST AI605 Assignment 1: Text Classification with RNNs
Authors: Hyeong-Gwon Hong (honggudrnjs@kaist.ac.kr) and Minjoon Seo (minjoon@kaist.ac.kr)

**Due Date:** March 31 (Wed) 11:00pm, 2021

## Assignment Objectives
- Verify theoretically and empirically why gating mechanisms (LSTM, GRU) help in Recurrent Neural Networks (RNNs).
- Design an LSTM-based text classification model from scratch using PyTorch.
- Apply the classification model to a popular classification task, Stanford Sentiment Treebank v2 (SST-2).
- Achieve higher accuracy by applying common machine learning strategies, including Dropout.
- Utilize pretrained word embeddings (e.g. GloVe) to leverage self-supervision over a large text corpus.
- (Bonus) Use the Hugging Face library (`transformers`) to leverage self-supervision via large language models.

## Your Submission
Your submission will be a link to a Colab notebook that has all written answers and is fully executable. You will submit your assignment via KLMS. Use in-line LaTeX (see below) for mathematical expressions. Collaboration among students is allowed, but this is not a group assignment, so make sure your answers and code are your own. Make sure to mention your collaborators in your assignment with their names and student IDs.

## Grading
The entire assignment is out of 100 points. There are four bonus questions with 10 points each (two bonus questions added on Mar 19). Your final score can be higher than 100 points.


## Environment
You will only use Python 3.7 and PyTorch 1.8, which are already available on Colab:

In [None]:
from platform import python_version
import torch

print("python", python_version())
print("torch", torch.__version__)

python 3.7.10
torch 1.8.1+cu101


## 1. Limitations of Vanilla RNNs
In Lecture 04 and 05, we saw how RNNs suffer from exploding or vanishing gradients. We mathematically showed that, if the recurrent relation is
$$ \textbf{h}_t = \sigma (\textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b}) $$
then
$$ \frac{\partial \textbf{h}_t}{\partial \textbf{h}_{t-1}} = \text{diag}(\sigma' (\textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b}))\textbf{V}$$
so
$$\frac{\partial \textbf{h}_T}{\partial \textbf{h}_1} \propto \textbf{V}^{T-1}$$
which means this term will be very close to zero if the norm of $\bf{V}$ is smaller than 1 and can become very large if it is greater than 1.
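
A minimal numeric sketch of this behaviour, assuming a scalar stand-in `v` for the matrix $\textbf{V}$ and ignoring the $\sigma'$ factor; the helper name `gradient_scale` is made up for illustration:

```python
def gradient_scale(v, T):
    """Size of the factor v ** (T - 1) that scales the gradient across T - 1 steps."""
    return v ** (T - 1)


# Slightly below 1, exactly 1, and slightly above 1, over 100 recurrent steps.
for v in (0.9, 1.0, 1.1):
    print(v, gradient_scale(v, T=101))
```

Even a small deviation from 1 compounds quickly: with 100 steps the factor is tiny for `v = 0.9` and very large for `v = 1.1`.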

**Problem 1.1** *(10 points)* Explain how exploding gradient can be mitigated if we use gradient clipping.

**Answer 1.1** Gradient clipping is defined as follows.

$$ \begin{aligned}
\hat{g} \leftarrow &\left\{\begin{array}{ll}
\frac{\text{threshold}}{\|\hat{g}\|} \hat{g} & \text{if } \|\hat{g}\| \geq \text{threshold} \\
\hat{g} & \text{otherwise}
\end{array}\right. \\
& \text{where } \hat{g}=\frac{\partial L}{\partial \theta}
\end{aligned} $$

Here $\hat{g}$ is the gradient of the loss $L$ with respect to the parameters $\theta$. If the gradient explodes, a single update can push the weights to extremely large values that overflow to inf or NaN. Gradient clipping rescales the gradient before the weights are updated whenever its norm reaches the threshold: the direction is kept, but the norm, and therefore the size of the update step, is bounded by the threshold. This makes an overflow much less likely.
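
The rescaling rule can be sketched in plain Python as below. This is only an illustration on a flat list of floats; `clip_by_norm` is a hypothetical helper name, and in PyTorch the corresponding utility is `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_norm(grad, threshold):
    """Rescale `grad` (a list of floats) so that its L2 norm is at most `threshold`.

    Assumes `threshold` is positive.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm >= threshold:
        # Keep the direction, shrink the length to `threshold`.
        return [g * threshold / norm for g in grad]
    return list(grad)


exploded = [300.0, -400.0]  # L2 norm is 500
clipped = clip_by_norm(exploded, threshold=5.0)
print(clipped)  # same direction as `exploded`, but with norm 5
```

A gradient whose norm is already below the threshold is returned unchanged, so clipping only affects the rare exploding steps.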


**Problem 1.2** *(10 points)* Explain how vanishing gradient can be mitigated if we use LSTM. See the Lecture 04 and 05 slides for the definition of LSTM.

**Answer 1.2** The LSTM is defined as follows.

$$
\begin{aligned}
f_{t} &=\sigma_{g}\left(W_{f} x_{t}+U_{f} h_{t-1}+b_{f}\right) \\
i_{t} &=\sigma_{g}\left(W_{i} x_{t}+U_{i} h_{t-1}+b_{i}\right) \\
o_{t} &=\sigma_{g}\left(W_{o} x_{t}+U_{o} h_{t-1}+b_{o}\right) \\
\tilde{c}_{t} &=\sigma_{c}\left(W_{c} x_{t}+U_{c} h_{t-1}+b_{c}\right) \\
c_{t} &=f_{t} \circ c_{t-1}+i_{t} \circ \tilde{c}_{t} \\
h_{t} &=o_{t} \circ \sigma_{h}\left(c_{t}\right)
\end{aligned}
$$

Then,

$$
\frac{\partial c_{T}}{\partial c_{t}}=
\frac{\partial c_{T}}{\partial c_{T-1}} * 
\frac{\partial c_{T-1}}{\partial c_{T-2}} * 
\ldots * 
\frac{\partial c_{t+1}}{\partial c_{t}}
$$
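and, along the direct path through the cell state (treating the gate values as constants),
$$
\frac{\partial c_{T}}{\partial c_{t}}=\prod_{i=t+1}^{T} f_{i}
$$

Each $f_{i}$ is the output of a sigmoid, so it lies strictly between 0 and 1. Unlike the vanilla RNN, where the same matrix $\textbf{V}$ is multiplied at every step, the forget gate is computed from the input at every step, so the model can learn to keep $f_{i}$ close to 1 where long-term memory is needed. In that case the product stays close to 1 and the gradient does not vanish. When $f_{i}$ is close to 0, the cell state deliberately forgets the past. In addition, the cell state is updated additively, so this path does not involve repeated multiplication by a weight matrix. This is how the LSTM mitigates the vanishing gradient.

Similarly,
$$
\frac{\partial c_{T}}{\partial \tilde{c}_{t}}=i_{t} \circ \prod_{j=t+1}^{T} f_{j}
$$
so the gradient with respect to the candidate state, and through $h_{t}=o_{t} \circ \sigma_{h}(c_{t})$ the gradient $\frac{\partial h_{T}}{\partial h_{t}}$, can flow along the same path and is protected from vanishing as well.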
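
As a small numeric illustration of the product of forget gates, assuming scalar gates that stay constant over time (a simplification; real gates vary per step and per dimension), with `cell_gradient` as a made-up helper name:

```python
def cell_gradient(forget_gates):
    """Product of forget-gate activations along the direct cell-state path."""
    prod = 1.0
    for f in forget_gates:
        prod *= f
    return prod


steps = 100
print(cell_gradient([0.99] * steps))  # gates near 1: the gradient survives
print(cell_gradient([0.5] * steps))   # gates near 0.5: the gradient vanishes
```

With gates close to 1 the gradient after 100 steps is still of order 0.1 to 1, while gates around 0.5 shrink it to practically zero.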

## 2. Creating Vocabulary from Training Data
Creating the vocabulary is the first step for every natural language processing model. In this section, you will use Stanford Sentiment Treebank v2, a popular dataset for sentiment classification, to create your vocabulary.

### Obtaining SST-2 via GLUE
The General Language Understanding Evaluation (GLUE) benchmark is a collection of tools for evaluating the performance of models across a diverse set of existing natural language understanding (NLU) tasks. See the GLUE website (https://gluebenchmark.com/) and the GLUE paper (https://openreview.net/pdf?id=rJ4km2R5t7) for more details. GLUE provides an easy way to access the datasets, including SST-2.
You can download the SST-2 dataset by following the steps below:

1. Clone GitHub repository:

In [None]:
!git clone https://github.com/nyu-mll/GLUE-baselines.git

Cloning into 'GLUE-baselines'...
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 891 (delta 1), reused 2 (delta 0), pack-reused 886
Receiving objects: 100% (891/891), 1.48 MiB | 9.00 MiB/s, done.
Resolving deltas: 100% (610/610), done.


2. Download SST-2 only:

In [None]:
%cd GLUE-baselines/
!python download_glue_data.py --data_dir glue_data --tasks SST

/content/GLUE-baselines
Downloading and extracting SST...
	Completed!


Your training, dev, and test data can be found at `glue_data/SST-2`. Note that each file is in a tsv format, where the first column is the sentence and the second column is the label (either 0 or 1, where 1 means positive sentiment). 

In [None]:
!head -10 glue_data/SST-2/train.tsv

sentence	label
hide new secretions from the parental units 	0
contains no wit , only labored gags 	0
that loves its characters and communicates something rather beautiful about human nature 	1
remains utterly satisfied to remain the same throughout 	0
on the worst revenge-of-the-nerds clichés the filmmakers could dredge up 	0
that 's far too tragic to merit such superficial treatment 	0
demonstrates that the director of such hollywood blockbusters as patriot games can still turn out a small , personal film with an emotional wallop . 	1
of saucy 	1
a depressed fifteen-year-old 's suicidal poetry 	0


**Problem 2.1** *(10 points)* Using a space tokenizer, create the vocabulary for the training data and report the vocabulary size here. Make sure that you add an `UNK` token to the vocabulary to account for words (during inference time) that you haven't seen. See below for an example with a short text.

In [None]:
# Space tokenization
text = "Hello world!"
tokens = text.split(' ')
print(tokens)

['Hello', 'world!']


In [None]:
# Constructing vocabulary with `UNK`
vocab = ['UNK'] + list(set(text.split(' ')))
word2id = {word: id_ for id_, word in enumerate(vocab)}
print(vocab)
print(word2id['Hello'])

['UNK', 'world!', 'Hello']
2


In [None]:
# Constructing vocabulary using space tokenizer and 'UNK'
import pandas as pd

def make_vocab(fname):
    df = pd.read_csv(fname, sep='\t')
    tokens = list()
    token_occur = dict()

    for i, row in df.iterrows():
        for token in row['sentence'].strip().split(' '):
            if token in token_occur:
                token_occur[token] += 1
            else:
                token_occur[token] = 1
                tokens.append(token)

    vocab = ['UNK'] + tokens
    return vocab

vocab = make_vocab('glue_data/SST-2/train.tsv')
print("The size of the vocabulary is ", len(vocab))

The size of the vocabulary is  14817


**Problem 2.2** *(10 points)* Using all words in the training data will make the vocabulary very big. Reduce its size by only including words that occur at least 2 times. How does the size of the vocabulary change?

In [None]:
# Constructing reduced vocabulary (occur at least twice)
import pandas as pd

def make_reduced_vocab(fname, min_occur=0):
    df = pd.read_csv(fname, sep='\t')
    # tokens = list()
    token_occur = dict()

    for i, row in df.iterrows():
        for token in row['sentence'].strip().split(' '):
            if token in token_occur:
                token_occur[token] += 1
            else:
                token_occur[token] = 1
                # tokens.append(token)

    vocab = ['UNK', 'PAD']
    for token, occur in token_occur.items():
        if occur >= min_occur:
            vocab.append(token)
    word2id = {word: id_ for id_, word in enumerate(vocab)}
    return vocab, word2id

train_fname = 'glue_data/SST-2/train.tsv'
dev_fname = 'glue_data/SST-2/dev.tsv'
vocab, word2id = make_reduced_vocab(train_fname)
reduced_vocab, word2id = make_reduced_vocab(train_fname, min_occur=2)
print(len(vocab))
print(len(reduced_vocab))
print("The size of the vocabulary change is ", len(vocab) - len(reduced_vocab))

14818
14311
The size of the vocabulary change is  507
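
The same frequency filter can also be written with `collections.Counter`. The sketch below uses a three-sentence toy corpus instead of `train.tsv`; `build_vocab`, `toy_vocab` and `toy_word2id` are hypothetical names chosen so they do not overwrite the variables above.

```python
from collections import Counter


def build_vocab(sentences, min_occur=1, specials=('UNK', 'PAD')):
    """Space-tokenize, count tokens, and keep those seen at least `min_occur` times."""
    counts = Counter(tok for s in sentences for tok in s.strip().split(' '))
    vocab = list(specials) + [tok for tok, c in counts.items() if c >= min_occur]
    word2id = {word: id_ for id_, word in enumerate(vocab)}
    return vocab, word2id


toy_sentences = ["a good film ", "a bad film ", "good fun "]
toy_vocab, toy_word2id = build_vocab(toy_sentences, min_occur=2)
print(toy_vocab)

# Unseen or filtered-out words fall back to the `UNK` id.
toy_ids = [toy_word2id.get(tok, toy_word2id['UNK']) for tok in "a great film".split(' ')]
print(toy_ids)
```

Words that occur only once ("bad", "fun") are dropped, and at lookup time they map to `UNK` just like the unseen word "great".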


## 3. Text Classification Baselines

You can now use the vocabulary constructed from the training data to create an embedding matrix. You will use the embedding matrix to map each input sequence of tokens to a list of embedding vectors. One of the simplest baselines is to pass each embedding through one neural network layer, average the outputs, and classify the averaged vector:

In [None]:
from torch import nn

input_ = "hi world!"
input_tokens = input_.split(' ')
input_ids = [word2id[word] if word in word2id else 0 for word in input_tokens]
input_tensor = torch.LongTensor([input_ids]) # the first dimension is minibatch size
print(input_tensor)

tensor([[0, 0]])


In [None]:
# One layer, average pooling and classification
class Baseline(nn.Module):
  def __init__(self, d):
    super(Baseline, self).__init__()
    self.embedding = nn.Embedding(len(vocab), d)
    self.layer = nn.Linear(d, d, bias=True)
    self.relu = nn.ReLU()
    self.class_layer = nn.Linear(d, 2, bias=True)

  def forward(self, input_tensor):
    emb = self.embedding(input_tensor)
    out = self.relu(self.layer(emb))
    avg = out.mean(1)
    logits = self.class_layer(avg)
    return logits

d = 3 # usually bigger, e.g. 128
baseline = Baseline(d)
logits = baseline(input_tensor)
softmax = nn.Softmax(1)
print(softmax(logits)) # probability for each class

tensor([[0.4992, 0.5008]], grad_fn=<SoftmaxBackward>)


Now we will compute the loss, which is the negative log probability of the input text's label being the target label (`1`). This turns out to be equivalent to the cross entropy (https://en.wikipedia.org/wiki/Cross_entropy) between the predicted probability distribution and a one-hot distribution of the target label. Note that we pass `logits` instead of `softmax(logits)` to the cross entropy, which allows us to avoid numerical instability.

In [None]:
cel = nn.CrossEntropyLoss()
label = torch.LongTensor([1]) # The ground truth label for "hi world!" is positive.
loss = cel(logits, label) # Loss, a.k.a L
print(loss)

tensor(0.6915, grad_fn=<NllLossBackward>)
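
The equivalence can be checked by hand with the probabilities printed earlier (`[0.4992, 0.5008]`). The sketch below works directly on those rounded probabilities rather than on the logits, so it only reproduces the loss up to rounding; `nll` and `cross_entropy` are illustrative helper names.

```python
import math


def nll(probs, target):
    """Negative log probability of the target class."""
    return -math.log(probs[target])


def cross_entropy(probs, one_hot):
    """Cross entropy between a reference distribution `one_hot` and `probs`."""
    return -sum(q * math.log(p) for q, p in zip(one_hot, probs))


toy_probs = [0.4992, 0.5008]  # softmax output shown above, rounded to 4 digits
print(nll(toy_probs, 1))
print(cross_entropy(toy_probs, [0.0, 1.0]))
```

Because the one-hot distribution puts all its mass on the target class, every other term in the sum is multiplied by zero and only the target's negative log probability remains.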


Once we have the loss defined, only one step remains! We compute the gradients of the loss with respect to the parameters and update them. Fortunately, PyTorch does this for us in a very convenient way. Note that we used only one example to update the model, which is basically Stochastic Gradient Descent (SGD) with a minibatch size of 1. A recommended minibatch size in this exercise is at least 16. It is also recommended that you reuse your training data at least 10 times (i.e. 10 *epochs*).

In [None]:
optimizer = torch.optim.SGD(baseline.parameters(), lr=0.1)
optimizer.zero_grad() # reset process
loss.backward() # compute gradients
optimizer.step() # update parameters

Once you have done this, all weight parameters will have `grad` attributes that contain their gradients with respect to the loss.

In [None]:
print(baseline.layer.weight.grad) # dL/dw of weights in the linear layer

tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])


**Problem 3.1** *(10 points)* Properly train this average-pooling baseline model on SST-2 and report the model's accuracy on the dev data.


In [None]:
# Constructing the dataset
import random  # needed by DataLoader when shuffle=True

vocab, word2id = make_reduced_vocab(train_fname, min_occur=2)
vocab_size = len(vocab)
batch_size = 128
shuffle = False

def seq2id(seq, word2id):
    sentence = seq.strip().split(' ')
    return [word2id[word] if word in word2id else 0 for word in sentence]

train_df = pd.read_csv(train_fname, sep='\t')
train_src = [seq2id(seq, word2id) for seq in train_df['sentence'].tolist()]
train_tgt = train_df['label'].tolist()

dev_df = pd.read_csv(dev_fname, sep='\t')
dev_src = [seq2id(seq, word2id) for seq in dev_df['sentence'].tolist()]
dev_tgt = dev_df['label'].tolist()

class DataLoader:
    def __init__(self, src, tgt, batch_size, pad_idx, shuffle=False):
        assert len(src) == len(tgt)
        self.src = src
        self.tgt = tgt
        self.size = len(src)
        self.batch_size = batch_size
        self.pad_idx = pad_idx
        self.shuffle = shuffle

    def __iter__(self):
        self.index = 0
        if self.shuffle:
            index = list(range(self.size))
            random.shuffle(index)

            shuffle_src = list()
            shuffle_tgt = list()

            for i in index:
                shuffle_src.append(self.src[i])
                shuffle_tgt.append(self.tgt[i])

            self.src = shuffle_src
            self.tgt = shuffle_tgt

        return self

    def pad(self, batch):
        # Record the true lengths first, then right-pad copies of the sequences
        # so that the lists stored in self.src are not modified in place.
        seq_lens = torch.LongTensor([len(seq) for seq in batch])
        max_len = int(seq_lens.max())
        padded = [seq + [self.pad_idx] * (max_len - len(seq)) for seq in batch]

        return padded, seq_lens

    def __next__(self):
        if self.batch_size * self.index >= self.size:
            raise StopIteration

        src_batch = self.src[self.batch_size * self.index : self.batch_size * (self.index+1)]
        tgt_batch = self.tgt[self.batch_size * self.index : self.batch_size * (self.index+1)]

        padded_src_batch, src_seq_lens = self.pad(src_batch)

        self.index += 1

        return padded_src_batch, src_seq_lens, tgt_batch

train_data_loader = DataLoader(train_src, train_tgt, batch_size=batch_size, pad_idx=1, shuffle=shuffle)
dev_data_loader = DataLoader(dev_src, dev_tgt, batch_size=batch_size, pad_idx=1, shuffle=shuffle)

In [None]:
# Training the baseline model on SST-2

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

d = 128 # embedding and hidden size
baseline = Baseline(d).to(device)

cel = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(baseline.parameters(), lr=0.1)

epochs = 100

for epoch in range(epochs):
    train_loss = 0
    train_accuracy = 0.0
    train_data_num = 0
    for i, (source, src_seq_lens, target) in enumerate(train_data_loader):
        train_tensor_src = torch.LongTensor(source).to(device)
        train_tensor_tgt = torch.LongTensor(target).to(device)
        logits = baseline(train_tensor_src)

        optimizer.zero_grad() # reset process
        loss = cel(logits, train_tensor_tgt) # Loss, a.k.a L
        loss.backward() # compute gradients
        optimizer.step() # update parameters
        train_loss += loss.item()
        
        _, train_preds = torch.max(logits, 1)
        train_accuracy += (train_preds == train_tensor_tgt).sum().float()

        train_data_num += train_tensor_tgt.shape[0]

    print('train:: Epoch:', '%04d' % (epoch + 1), 
                'cost =', '{:.6f},'.format(train_loss / train_data_num), 
                'acc =', '{:.6f}'.format(train_accuracy / train_data_num))
    
    if (epoch + 1) % 1 == 0:
        with torch.no_grad():
            valid_loss = 0
            valid_accuracy = 0.0
            valid_data_num = 0
            for i, (source, src_seq_lens, target) in enumerate(dev_data_loader):
                dev_tensor_src = torch.LongTensor(source).to(device)
                dev_tensor_tgt = torch.LongTensor(target).to(device)
                logits = baseline(dev_tensor_src)
                loss = cel(logits, dev_tensor_tgt) # Loss, a.k.a L
                valid_loss += loss.item()

                _, valid_preds = torch.max(logits, 1)
                valid_accuracy += (valid_preds == dev_tensor_tgt).sum().float()

                valid_data_num += dev_tensor_tgt.shape[0]
                
            print('valid:: Epoch:', '%04d' % (epoch + 1), 
                'cost =', '{:.6f},'.format(valid_loss / valid_data_num), 
                'acc =', '{:.6f}'.format(valid_accuracy / valid_data_num))


train:: Epoch: 0001 cost = 0.005350, acc = 0.560736
valid:: Epoch: 0001 cost = 0.005533, acc = 0.534404
train:: Epoch: 0002 cost = 0.005285, acc = 0.578004
valid:: Epoch: 0002 cost = 0.005398, acc = 0.581422
train:: Epoch: 0003 cost = 0.005201, acc = 0.600143
valid:: Epoch: 0003 cost = 0.005245, acc = 0.628440
train:: Epoch: 0004 cost = 0.005105, acc = 0.620573
valid:: Epoch: 0004 cost = 0.005113, acc = 0.653670
train:: Epoch: 0005 cost = 0.005002, acc = 0.637975
valid:: Epoch: 0005 cost = 0.005010, acc = 0.670872
train:: Epoch: 0006 cost = 0.004898, acc = 0.653566
valid:: Epoch: 0006 cost = 0.004923, acc = 0.693807
train:: Epoch: 0007 cost = 0.004801, acc = 0.666291
valid:: Epoch: 0007 cost = 0.004841, acc = 0.701835
train:: Epoch: 0008 cost = 0.004708, acc = 0.677263
valid:: Epoch: 0008 cost = 0.004761, acc = 0.705275
train:: Epoch: 0009 cost = 0.004618, acc = 0.687107
valid:: Epoch: 0009 cost = 0.004687, acc = 0.699541
train:: Epoch: 0010 cost = 0.004532, acc = 0.697263
valid:: Epoc

In [None]:
I used d = 128, batch_size = 128, and 100 epochs for the baseline. The results are as follows.

The model converges after about 30 epochs (the validation loss begins to increase afterwards):
train:: Epoch: 0030 cost = 0.003190 ,acc = 0.823991
valid:: Epoch: 0030 cost = 0.003767 ,acc = 0.770642

**Problem 3.2** *(10 points)* Implement a recurrent neural network (without using PyTorch's RNN module) where the output of the linear layer not only depends on the current input but also the previous output. Report the model's accuracy on the dev data. Is it better or worse than the baseline? Why?



In [None]:
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# One RNN layer and classification
class RNNNode(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(RNNNode, self).__init__()

        self.W_h2h = torch.nn.Linear(input_size + hidden_size, hidden_size)
        self.W_h2y = torch.nn.Linear(hidden_size, hidden_size)
        self.tanh = torch.nn.Tanh()
        self.softmax = torch.nn.Softmax()
    
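    def forward(self, input, hidden, src_batch_sizes):
        # input: len * batch * emb (sorted by decreasing length); hidden: 1 * batch * hidden
        h_last = hidden.squeeze(0)  # latest hidden state of every sequence
        total = int(src_batch_sizes[0])
        y_list = list()
        for i, batch in enumerate(src_batch_sizes):
            batch = int(batch)  # number of sequences still active at step i
            combined_input = torch.cat([input[i][:batch, :], h_last[:batch, :]], dim=1)
            h_t = self.tanh(self.W_h2h(combined_input))
            # Finished sequences keep their last state; concatenating avoids in-place writes.
            h_last = torch.cat([h_t, h_last[batch:, :]], dim=0)
            # Zero-pad the rows of finished sequences (at the bottom) so every step has `total` rows.
            y_list.append(nn.functional.pad(self.W_h2y(h_t), (0, 0, 0, total - batch)))
        y = torch.stack(y_list, dim=0) # len * batch * hidden
        return y, h_last.unsqueeze(0)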


class RNNModel(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, n_layers):
        super(RNNModel, self).__init__()
        self.embedding = nn.Embedding(len(vocab), embedding_dim)

        # nn.RNN
        self.rnn = nn.RNN(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=n_layers)

        # my RNN
        # self.rnn = RNNNode(input_size=embedding_dim, hidden_size=hidden_dim)

        self.fc = nn.Linear(hidden_dim * n_layers, 2)
        self.softmax = torch.nn.Softmax()

    def forward(self, input_tensor, src_seq_lens, hidden):
        emb = self.embedding(input_tensor) # emb.shape = batch * len * hidden
        emb = emb.transpose(0, 1) # emb.shape = len * batch * hidden

        # nn.RNN
        packed = pack_padded_sequence(emb, src_seq_lens.tolist(), batch_first=False)
        outs, hidden = self.rnn(packed, hidden) # h0 = zero initialization
        outs, out_lens = pad_packed_sequence(outs, batch_first=False)

        # my RNN
        # print("input shape : ", emb.shape, hidden.shape) # len * batch * emb, 1 * batch * hidden
        # src_batch_sizes = seqs2batches(src_seq_lens)
        # outs, hidden = self.rnn(emb, hidden, src_batch_sizes) # h0 = zero initialization
        # print("output shape : ", outs.shape, hidden.shape) # len * batch * hidden, 1 * batch * hidden

        logits = self.fc(hidden.squeeze(0))
        # logits = self.fc(outs[-1])
        return logits


def seqs2batches(src_seq_lens):
    """
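    Convert sequence lengths (sorted in decreasing order) into per-step batch
    sizes: entry t is the number of sequences that are longer than t.
    Applying the same conversion to the batch sizes recovers the lengths,
    so this function also serves as batches2seqs().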
    """
    assert src_seq_lens is not None
    assert src_seq_lens[-1] > 0
    src_batch_sizes = torch.zeros(int(src_seq_lens[0]))
    pointer = int(src_seq_lens[0]) - 1
    for i, seq_len in enumerate(src_seq_lens.tolist() + [0]):
        while seq_len <= pointer:
            src_batch_sizes[pointer] = i
            pointer -= 1
    return src_batch_sizes


# Training the RNN model on SST-2

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

embedding_dim = 256
hidden_dim = 128
n_layers = 1
rnnmodel = RNNModel(embedding_dim, hidden_dim, n_layers).to(device)

cel = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(rnnmodel.parameters(), lr=0.1)

epochs = 150

for epoch in range(epochs):
    train_loss = 0
    train_accuracy = 0.0
    train_data_num = 0
    for i, (source, src_seq_lens, target) in enumerate(train_data_loader):
        train_tensor_src = torch.LongTensor(source).to(device)
        train_tensor_tgt = torch.LongTensor(target).to(device)

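        sorted_seq_lens, sorted_indices = torch.sort(src_seq_lens, descending=True)
        train_sorted_batch = train_tensor_src[sorted_indices]
        # The logits follow the length-sorted order, so reorder the labels to match.
        train_tensor_tgt = train_tensor_tgt[sorted_indices]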

        # print(src_seq_lens.shape, src_batch_sizes.shape, seqs2batches(src_batch_sizes).shape)
        h0 = torch.zeros(n_layers, train_sorted_batch.shape[0], hidden_dim, requires_grad=True).to(device)
        # logits = rnnmodel(train_tensor_src, src_seq_lens, h0)
        logits = rnnmodel(train_sorted_batch, sorted_seq_lens, h0)

        # print(train_tensor_src.shape, train_tensor_tgt.shape, logits.shape)

        optimizer.zero_grad() # reset process
        loss = cel(logits, train_tensor_tgt) # Loss, a.k.a L
        loss.backward() # compute gradients
        # torch.nn.utils.clip_grad_norm_(rnnmodel.parameters(), 5) # gradient clipping
        optimizer.step() # update parameters
        train_loss += loss.item()
        
        _, train_preds = torch.max(logits, 1)
        train_accuracy += (train_preds == train_tensor_tgt).sum().float()

        train_data_num += train_tensor_tgt.shape[0]

    print('train:: Epoch:', '%04d' % (epoch + 1), 
                'cost =', '{:.6f},'.format(train_loss / train_data_num), 
                'acc =', '{:.6f}'.format(train_accuracy / train_data_num))
    
    if (epoch + 1) % 1 == 0:
        with torch.no_grad():
            valid_loss = 0
            valid_accuracy = 0.0
            valid_data_num = 0
            for i, (source, src_seq_lens, target) in enumerate(dev_data_loader):
                dev_tensor_src = torch.LongTensor(source).to(device)
                dev_tensor_tgt = torch.LongTensor(target).to(device)
            
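                sorted_seq_lens, sorted_indices = torch.sort(src_seq_lens, descending=True)
                dev_sorted_batch = dev_tensor_src[sorted_indices]
                # The logits follow the length-sorted order, so reorder the labels to match.
                dev_tensor_tgt = dev_tensor_tgt[sorted_indices]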

                h0 = torch.zeros(n_layers, dev_sorted_batch.shape[0], hidden_dim, requires_grad=True).to(device)
                logits = rnnmodel(dev_sorted_batch, sorted_seq_lens, h0)

                loss = cel(logits, dev_tensor_tgt) # Loss, a.k.a L
                valid_loss += loss.item()

                _, valid_preds = torch.max(logits, 1)
                valid_accuracy += (valid_preds == dev_tensor_tgt).sum().float()

                valid_data_num += dev_tensor_tgt.shape[0]
                
            print('valid:: Epoch:', '%04d' % (epoch + 1), 
                'cost =', '{:.6f},'.format(valid_loss / valid_data_num), 
                'acc =', '{:.6f}'.format(valid_accuracy / valid_data_num))
        


    

train:: Epoch: 0001 cost = 0.005422, acc = 0.542577
valid:: Epoch: 0001 cost = 0.005589, acc = 0.503440
train:: Epoch: 0002 cost = 0.005398, acc = 0.547105
valid:: Epoch: 0002 cost = 0.005594, acc = 0.488532
train:: Epoch: 0003 cost = 0.005388, acc = 0.549852
valid:: Epoch: 0003 cost = 0.005603, acc = 0.491972
train:: Epoch: 0004 cost = 0.005376, acc = 0.552287
valid:: Epoch: 0004 cost = 0.005614, acc = 0.489679
train:: Epoch: 0005 cost = 0.005364, acc = 0.557024
valid:: Epoch: 0005 cost = 0.005627, acc = 0.494266
train:: Epoch: 0006 cost = 0.005350, acc = 0.561434
valid:: Epoch: 0006 cost = 0.005643, acc = 0.489679
train:: Epoch: 0007 cost = 0.005335, acc = 0.565042
valid:: Epoch: 0007 cost = 0.005663, acc = 0.490826
train:: Epoch: 0008 cost = 0.005319, acc = 0.569660
valid:: Epoch: 0008 cost = 0.005688, acc = 0.489679


KeyboardInterrupt: ignored

In [None]:
The loss barely decreases and the accuracy is poor.
It is much worse than the baseline, both with nn.RNN and with my own implementation.

...
train:: Epoch: 0019 cost = 0.005243 ,acc = 0.588739
train:: Epoch: 0020 cost = 0.005223 ,acc = 0.591694
valid:: Epoch: 0020 cost = 0.005724 ,acc = 0.512615

This is because of the 

**Problem 3.3 (bonus)** *(10 points)* Show that the cross entropy computed above is equivalent to the negative log likelihood of the probability distribution.



**Problem 3.4 (bonus)** *(10 points)* Why is it numerically unstable if you compute log on top of softmax?
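
The instability can be seen on a toy example. The sketch below is plain Python with made-up logits, comparing a naive log-of-softmax with the usual log-sum-exp rewrite; the function names are illustrative. In plain Python the failure surfaces as an exception, whereas tensor libraries typically return `inf` or `nan` instead.

```python
import math


def naive_log_softmax(logits):
    """log(softmax(z)) computed literally: exp() can overflow, log(0) can occur."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [math.log(e / total) for e in exps]


def stable_log_softmax(logits):
    """Same quantity via log-sum-exp: subtract the max before exponentiating."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


toy_logits = [1000.0, 0.0]
try:
    naive_log_softmax(toy_logits)
    naive_failed = False
except (OverflowError, ValueError):
    naive_failed = True  # exp(1000) does not fit in a float

print(naive_failed)
print(stable_log_softmax(toy_logits))
```

The stable version never exponentiates a positive number, so the largest term is exactly `exp(0) = 1` and the sum cannot overflow or collapse to zero.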

## 4. Text Classification with LSTM and Dropout

Now it is time to improve your baselines! Replace your RNN module with an LSTM module. See Lecture slides 04 and 05 for the formal definition of LSTMs. 

You will also use Dropout, which during training randomly sets each dimension to zero with probability `p` and scales the remaining dimensions by `1/(1-p)`. Put it either at the input or the output of the LSTM to prevent the model from overfitting.

In [None]:
a = torch.FloatTensor([0.1, 0.3, 0.5, 0.7, 0.9])
dropout = nn.Dropout(0.5) # p=0.5
print(dropout(a))

tensor([0.2000, 0.0000, 0.0000, 1.4000, 1.8000])
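
What `nn.Dropout` does during training can be sketched in plain Python as follows. This is an illustration only, not PyTorch's implementation; `toy_dropout` and the seeded `rng` are assumptions made for the example.

```python
import random


def toy_dropout(values, p, rng):
    """Zero each value with probability p and scale the survivors by 1 / (1 - p)."""
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


rng = random.Random(0)  # fixed seed, only so the sketch is repeatable
toy_a = [0.1, 0.3, 0.5, 0.7, 0.9]
dropped = toy_dropout(toy_a, p=0.5, rng=rng)
print(dropped)
```

The `1/(1-p)` scaling keeps the expected value of each dimension unchanged, which is why no rescaling is needed at evaluation time, when dropout is switched off (in PyTorch, by calling `model.eval()`).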


**Problem 4.1** *(20 points)* Implement and use LSTM (without using PyTorch's LSTM module) instead of vanilla RNN to improve your model. Report the accuracy on the dev data.

In [None]:
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# One LSTM layer and classification
class LSTMNode(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(LSTMNode, self).__init__()

        self.W_ft = torch.nn.Linear(input_size + hidden_size, hidden_size) # forget gate
        self.W_it = torch.nn.Linear(input_size + hidden_size, hidden_size) # input gate
        self.W_ot = torch.nn.Linear(input_size + hidden_size, hidden_size) # output gate
        self.W_ct = torch.nn.Linear(input_size + hidden_size, hidden_size) # cell state
        self.W_ht = torch.nn.Linear(hidden_size, hidden_size) # hidden state
        self.W_h2y = torch.nn.Linear(hidden_size, hidden_size)
        self.sigmoid = torch.nn.Sigmoid()
        self.tanh = torch.nn.Tanh()
    
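    def forward(self, input, hidden, cell, src_batch_sizes):
        # input: len * batch * emb (sorted by decreasing length)
        # hidden, cell: 1 * batch * hidden
        h_t = hidden.squeeze(0)
        c_t = cell.squeeze(0)

        for i, batch in enumerate(src_batch_sizes):
            batch = int(batch)  # number of sequences still active at step i
            token = input[i][:batch, :]
            combined_input = torch.cat([token, h_t[:batch, :]], dim=1)

            f_t = self.sigmoid(self.W_ft(combined_input))
            i_t = self.sigmoid(self.W_it(combined_input))
            o_t = self.sigmoid(self.W_ot(combined_input))
            c_hat_t = self.tanh(self.W_ct(combined_input))

            c_next = f_t * c_t[:batch, :] + i_t * c_hat_t
            h_next = o_t * self.tanh(c_next)

            # Rows of finished sequences keep their last state; concatenating
            # avoids in-place writes that would break autograd.
            c_t = torch.cat([c_next, c_t[batch:, :]], dim=0)
            h_t = torch.cat([h_next, h_t[batch:, :]], dim=0)

        # Return both states as 1 * batch * hidden, matching the input layout.
        return c_t.unsqueeze(0), h_t.unsqueeze(0)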


class LSTMModel(nn.Module):
    def __init__(self, embedding_dim, hidden_dim):
        super(LSTMModel, self).__init__()
        self.embedding = nn.Embedding(len(vocab), embedding_dim)

        # nn.LSTM
        # self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim)

        # my LSTM
        self.lstm = LSTMNode(input_size=embedding_dim, hidden_size=hidden_dim)

        self.fc = nn.Linear(hidden_dim, 2, bias=True)

    def forward(self, input_tensor, src_seq_lens, hidden, cell=None):
        if cell is None:
            cell = torch.zeros_like(hidden) # c0 = zero initialization
        emb = self.embedding(input_tensor) # emb.shape = batch * len * emb
        emb = emb.transpose(0, 1) # emb.shape = len * batch * emb

        # nn.LSTM
        # packed = pack_padded_sequence(emb, src_seq_lens.tolist(), batch_first=False)
        # outs, (hidden, cell) = self.lstm(packed, (hidden, cell))
        # outs, out_lens = pad_packed_sequence(outs, batch_first=False)

        # my LSTM: returns the final cell and hidden states
        src_batch_sizes = seqs2batches(src_seq_lens)
        cell, hidden = self.lstm(emb, hidden, cell, src_batch_sizes)

        logits = self.fc(hidden.transpose(0, 1).reshape(hidden.shape[1], -1))
        return logits


def seqs2batches(src_seq_lens):
    """
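    Convert sequence lengths (sorted in decreasing order) into per-step batch
    sizes: entry t is the number of sequences that are longer than t.
    Applying the same conversion to the batch sizes recovers the lengths,
    so this function also serves as batches2seqs().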
    """
    assert src_seq_lens is not None
    assert src_seq_lens[-1] > 0
    src_batch_sizes = torch.zeros(int(src_seq_lens[0]))
    pointer = int(src_seq_lens[0]) - 1
    for i, seq_len in enumerate(src_seq_lens.tolist() + [0]):
        while seq_len <= pointer:
            src_batch_sizes[pointer] = i
            pointer -= 1
    return src_batch_sizes


# Training the LSTM model on SST-2

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

embedding_dim = 128
hidden_dim = 128
rnnmodel = LSTMModel(embedding_dim, hidden_dim).to(device)

cel = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(rnnmodel.parameters(), lr=0.1)

epochs = 150

for epoch in range(epochs):
    train_loss = 0
    train_accuracy = 0.0
    train_data_num = 0
    for i, (source, src_seq_lens, target) in enumerate(train_data_loader):
        train_tensor_src = torch.LongTensor(source).to(device)
        train_tensor_tgt = torch.LongTensor(target).to(device)

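        sorted_seq_lens, sorted_indices = torch.sort(src_seq_lens, descending=True)
        train_sorted_batch = train_tensor_src[sorted_indices]
        # The logits follow the length-sorted order, so reorder the labels to match.
        train_tensor_tgt = train_tensor_tgt[sorted_indices]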

        # print(src_seq_lens.shape, src_batch_sizes.shape, seqs2batches(src_batch_sizes).shape)
        h0 = torch.zeros(1, train_sorted_batch.shape[0], hidden_dim, requires_grad=True).to(device)
        logits = rnnmodel(train_sorted_batch, sorted_seq_lens, h0)

        # print(train_tensor_src.shape, train_tensor_tgt.shape, logits.shape)

        optimizer.zero_grad() # reset process
        loss = cel(logits, train_tensor_tgt) # Loss, a.k.a L
        loss.backward() # compute gradients
        # torch.nn.utils.clip_grad_norm_(rnnmodel.parameters(), 5) # gradient clipping
        optimizer.step() # update parameters
        train_loss += loss.item()
        
        _, train_preds = torch.max(logits, 1)
        train_accuracy += (train_preds == train_tensor_tgt).sum().float()

        train_data_num += train_tensor_tgt.shape[0]

    print('train:: Epoch:', '%04d' % (epoch + 1), 
                'cost =', '{:.6f},'.format(train_loss / train_data_num), 
                'acc =', '{:.6f}'.format(train_accuracy / train_data_num))
    
    if (epoch + 1) % 10 == 0:
        with torch.no_grad():
            valid_loss = 0
            valid_accuracy = 0.0
            valid_data_num = 0
            for i, (source, src_seq_lens, target) in enumerate(dev_data_loader):
                dev_tensor_src = torch.LongTensor(source).to(device)
                dev_tensor_tgt = torch.LongTensor(target).to(device)
            
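                sorted_seq_lens, sorted_indices = torch.sort(src_seq_lens, descending=True)
                dev_sorted_batch = dev_tensor_src[sorted_indices]
                # The logits follow the length-sorted order, so reorder the labels to match.
                dev_tensor_tgt = dev_tensor_tgt[sorted_indices]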

                h0 = torch.zeros(1, dev_sorted_batch.shape[0], hidden_dim, requires_grad=True).to(device)
                logits = rnnmodel(dev_sorted_batch, sorted_seq_lens, h0)

                loss = cel(logits, dev_tensor_tgt) # Loss, a.k.a L
                valid_loss += loss.item()

                _, valid_preds = torch.max(logits, 1)
                valid_accuracy += (valid_preds == dev_tensor_tgt).sum().float()

                valid_data_num += dev_tensor_tgt.shape[0]
                
            print('valid:: Epoch:', '%04d' % (epoch + 1), 
                'cost =', '{:.6f},'.format(valid_loss / valid_data_num), 
                'acc =', '{:.6f}'.format(valid_accuracy / valid_data_num))
        


NameError: ignored

**Problem 4.2** *(10 points)* Use Dropout on LSTM (either at input or output). Report the accuracy on the dev data and briefly describe how it differs from 4.1.

**Problem 4.3 (bonus)** *(10 points)* Consider implementing bidirectional LSTM and two layers of LSTM to further improve your model. Report your accuracy on dev data.

## 5. Pretrained Word Vectors
The last step is to use pretrained vocabulary and word vectors. The prebuilt vocabulary will replace the vocabulary you built with SST-2 training data, and the word vectors will replace the embedding vectors. You will observe the power of leveraging self-supervised pretrained models.

**Problem 5.1** *(10 points)* Go to https://nlp.stanford.edu/projects/glove/ and download `glove.6B.zip`. Use these pretrained word vectors to further improve your model from 4.2. Report the model's accuracy on the dev data.
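
A possible starting point for reading the vectors, as a sketch under stated assumptions: the files inside `glove.6B.zip` (for example `glove.6B.100d.txt`) are plain text with one token per line followed by its space-separated values. The example parses a tiny made-up sample instead of the real file, and `load_glove` and `build_embedding_matrix` are hypothetical helper names.

```python
import io
import random


def load_glove(lines):
    """Parse GloVe text format: a token followed by its vector on each line."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(vocab, vectors, dim, rng):
    """One row per vocabulary word; words without a pretrained vector get small random values."""
    matrix = []
    for word in vocab:
        if word in vectors:
            matrix.append(list(vectors[word]))
        else:
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix


# Made-up 3-dimensional sample; these are not real GloVe values.
sample = io.StringIO("good 0.1 0.2 0.3\nfilm -0.4 0.5 0.6\n")
glove = load_glove(sample)
glove_vocab = ['UNK', 'PAD', 'good', 'film', 'wallop']
matrix = build_embedding_matrix(glove_vocab, glove, dim=3, rng=random.Random(0))
print(len(matrix), len(matrix[0]))
```

With the real file, `sample` would be replaced by an opened file handle, and the resulting matrix could be loaded into the model with `nn.Embedding.from_pretrained(torch.FloatTensor(matrix))`. Whether to freeze or fine-tune the vectors is a design choice left open here.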

**Problem 5.2 (bonus)** *(10 points)* You can go one step further by using word vectors obtained from pretrained language models. Can you import the word embeddings from bert-base-uncased model (via Hugging Face's transformers: https://huggingface.co/transformers/pretrained_models.html) into your model and improve it further? Report the accuracy on the dev data here. If the score is now higher, explain where the improvement is coming from.