<a href="https://colab.research.google.com/github/hjori66/Kaist-AI605-2021-Spring/blob/main/Kaist_AI605_Assignment_1_20194364_Taehwan_Kim.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# KAIST AI605 Assignment 1: Text Classification with RNNs
Authors: Hyeong-Gwon Hong (honggudrnjs@kaist.ac.kr) and Minjoon Seo (minjoon@kaist.ac.kr)

**Due Date:** March 31 (Wed) 11:00pm, 2021

## Assignment Objectives
- Verify theoretically and empirically why gating mechanism (LSTM, GRU) helps in Recurrent Neural Networks (RNNs)
- Design an LSTM-based text classification model from scratch using PyTorch.
- Apply the classification model to a popular classification task, Stanford Sentiment Treebank v2 (SST-2).
- Achieve higher accuracy by applying common machine learning strategies, including Dropout.
- Utilize pretrained word embedding (e.g. GloVe) to leverage self-supervision over a large text corpus.
- (Bonus) Use Hugging Face library (`transformers`) to leverage self-supervision via large language models.

## Your Submission
Your submission will be a link to a Colab notebook that has all written answers and is fully executable. You will submtit your assignment via KLMS. Use in-line LaTeX (see below) for mathematical expressions. Collaboration among students is allowed but it is not a group assignment so make sure your answer and code are your own. Make sure to mention your collaborators in yoru assignment with their names and their student ids.

## Grading
The entire assignment is out of 100 points. There are four bonus questions with 10 points each (two bonus questions added on Mar 19). Your final score can be higher than 100 points.


## Environment
You will only use Python 3.7 and PyTorch 1.8, which is already available on Colab:

In [1]:
from platform import python_version
import torch

print("python", python_version())
print("torch", torch.__version__)

python 3.7.10
torch 1.8.1+cu101


## 1. Limitations of Vanilla RNNs
In Lecture 04 and 05, we saw how RNNs suffer from exploding or vanishing gradients. We mathematically showed that, if the recurrent relation is
$$ \textbf{h}_t = \sigma (\textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b}) $$
then
$$ \frac{\partial \textbf{h}_t}{\partial \textbf{h}_{t-1}} = \text{diag}(\sigma' (\textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b}))\textbf{V}$$
so
$$\frac{\partial \textbf{h}_T}{\partial \textbf{h}_1} \propto \textbf{V}^{T-1}$$
which means this term will be very close to zero if the norm of $\bf{V}$ is smaller than 1 and really big otherwise.

**Problem 1.1** *(10 points)* Explain how exploding gradient can be mitigated if we use gradient clipping.

**Answer 1.1** \\

The definition of the gradient clipping is followed.

$$ \begin{aligned}
\frac{\partial \textbf{h}}{\partial \theta} \leftarrow &\left\{\begin{array}{ll}
\frac{\text { threshold }}{\|\hat{g}\|} \hat{g} & \text { if }\|\hat{g}\| \geq \text { threshold }
\\
\hat{g} & \text { otherwise }
\end{array}\right
.\\
& \text { where } \hat{g}=\frac{\partial \textbf{h}}{\partial \theta}
\end{aligned} $$

If the gradient exploded, weights of the model can be NaN value (either overflow or underflow). So, rescaling the error derivative before propagating it backward through the network to update the weights can be one of the solution. If we do that, we can decrease the likelihood of an over or underflow.


**Problem 1.2** *(10 points)* Explain how vanishing gradient can be mitigated if we use LSTM. See the Lecture 04 and 05 slides for the definition of LSTM.

**Answer 1.2** \\

The formulation of LSTM is followed.

$$
\begin{aligned}
f_{t} &=\sigma_{g}\left(W_{f} x_{t}+U_{f} h_{t-1}+b_{f}\right) \\
i_{t} &=\sigma_{g}\left(W_{i} x_{t}+U_{i} h_{t-1}+b_{i}\right) \\
o_{t} &=\sigma_{g}\left(W_{o} x_{t}+U_{o} h_{t-1}+b_{o}\right) \\
\tilde{c}_{t} &=\sigma_{c}\left(W_{c} x_{t}+U_{c} h_{t-1}+b_{c}\right) \\
c_{t} &=f_{t} \circ c_{t-1}+i_{t} \circ \tilde{c}_{t} \\
h_{t} &=o_{t} \circ \sigma_{h}\left(c_{t}\right)
\end{aligned}
$$

Then,

$$
\frac{\partial c_{T}}{\partial c_{t}}=
\frac{\partial c_{T}}{\partial c_{T-1}} * 
\frac{\partial c_{T-1}}{\partial c_{T-2}} * 
\ldots * 
\frac{\partial c_{t+1}}{\partial c_{t}}
$$
and
$$
\frac{\partial c_{T}}{\partial c_{t}}=\prod_{i=t+1}^{T} f_{i}
$$

Now, $f_{i}$ is sigmoid function. so, it is larger than 0 and "smaller than 1". If $f_{i}$ was closed to 1, cell state ($c_{i}$) considers the long term memory. Otherwise, it doesn't consider the long term memory. This formulation mitigates the vanishing gradient. 

Similary, 
$$
\frac{\partial c_{T}}{\partial \tilde{c}_{t}}=\prod_{j=t+1}^{T} i_{j} \text { and } \frac{\partial h_{T}}{\partial h_{t}}
$$
mitigates the vanishing, too.

## 2. Creating Vocabulary from Training Data
Creating the vocabulary is the first step for every natural language processing model. In this section, you will use Stanford Sentiment Treebank v2, a popular dataset for sentiment classification, to create your vocabulary.

### Obtaining SST-2 via GLUE
General Language Understanding Evaluation (GLUE) benchmark is a collection of tools for evaluating the performance of models across a diverse set of existing natural language understanding (NLU) tasks. See GLUE website (https://gluebenchmark.com/) and the GLUE paper (https://openreview.net/pdf?id=rJ4km2R5t7) for more details. GLUE provides an easy way to access the datasets, including SST-2.
You can download SST-2 dataset by following the steps below:

1. Clone GitHub repository:

In [2]:
!git clone https://github.com/nyu-mll/GLUE-baselines.git

Cloning into 'GLUE-baselines'...
remote: Enumerating objects: 5, done.[K
remote: Counting objects: 100% (5/5), done.[K
remote: Compressing objects: 100% (5/5), done.[K
remote: Total 891 (delta 1), reused 2 (delta 0), pack-reused 886[K
Receiving objects: 100% (891/891), 1.48 MiB | 4.76 MiB/s, done.
Resolving deltas: 100% (610/610), done.


2. Download SST-2 only:

In [3]:
%cd GLUE-baselines/
!python download_glue_data.py --data_dir glue_data --tasks SST

/content/GLUE-baselines
Downloading and extracting SST...
	Completed!


Your training, dev, and test data can be found at `glue_data/SST-2`. Note that each file is in a tsv format, where the first column is the sentence and the second column is the label (either 0 or 1, where 1 means positive sentiment). 

In [None]:
!head -10 glue_data/SST-2/train.tsv

sentence	label
hide new secretions from the parental units 	0
contains no wit , only labored gags 	0
that loves its characters and communicates something rather beautiful about human nature 	1
remains utterly satisfied to remain the same throughout 	0
on the worst revenge-of-the-nerds clichés the filmmakers could dredge up 	0
that 's far too tragic to merit such superficial treatment 	0
demonstrates that the director of such hollywood blockbusters as patriot games can still turn out a small , personal film with an emotional wallop . 	1
of saucy 	1
a depressed fifteen-year-old 's suicidal poetry 	0


**Problem 2.1** *(10 points)* Using space tokenizer, create the vocabulary for the training data and report the vocabulary size here. Make sure that you add an `UNK` token to the vocabulary to account for words (during inference time) that you haven't seen. See below for an example with a short text.

In [None]:
# Space tokenization
text = "Hello world!"
tokens = text.split(' ')
print(tokens)

['Hello', 'world!']


In [4]:
# Constructing vocabulary with `UNK`
vocab = ['UNK'] + list(set(text.split(' ')))
word2id = {word: id_ for id_, word in enumerate(vocab)}
print(vocab)
print(word2id['Hello'])

NameError: ignored

**Answer 2.1** 

In [None]:
# Constructing vocabulary using space tokenizer and 'UNK'
import pandas as pd

def make_vocab(fname):
    df = pd.read_csv(fname, sep='\t')
    tokens = list()
    token_occur = dict()

    for i, row in df.iterrows():
        for token in row['sentence'].strip().split(' '):
            if token in token_occur:
                token_occur[token] += 1
            else:
                token_occur[token] = 1
                tokens.append(token)

    vocab = ['UNK'] + tokens
    return vocab

vocab = make_vocab('glue_data/SST-2/train.tsv')
print("The size of the vocabulary is ", len(vocab))

The size of the vocabulary is  14817


**Problem 2.2** *(10 points)* Using all words in the training data will make the vocabulary very big. Reduce its size by only including words that occur at least 2 times. How does the size of the vocabulary change?

**Answer 2.2** 

In [5]:
# Constructing reduced vocabulary (occur at least twice)
import pandas as pd

def make_reduced_vocab(fname, min_occur=0):
    df = pd.read_csv(fname, sep='\t')
    # tokens = list()
    token_occur = dict()

    for i, row in df.iterrows():
        for token in row['sentence'].strip().split(' '):
            if token in token_occur:
                token_occur[token] += 1
            else:
                token_occur[token] = 1
                # tokens.append(token)

    vocab = ['UNK', 'PAD']
    for token, occur in token_occur.items():
        if occur >= min_occur:
            vocab.append(token)
    word2id = {word: id_ for id_, word in enumerate(vocab)}
    return vocab, word2id

train_fname = 'glue_data/SST-2/train.tsv'
dev_fname = 'glue_data/SST-2/dev.tsv'
vocab, word2id = make_reduced_vocab(train_fname)
reduced_vocab, word2id = make_reduced_vocab(train_fname, min_occur=2)
print(len(vocab))
print(len(reduced_vocab))
print("The size of the vocabulary change is ", len(vocab) - len(reduced_vocab))

14818
14311
The size of the vocabulary change is  507


## 3. Text Classification Baselines

You can now use the vocabulary constructed from the training data to create an embedding matrix. You will use the embedding matrix to map each input sequence of tokens to a list of embedding vectors. One of the simplest baseline is to go through one layer of neural network and then average the outputs, and finally classify the average embedding: 

In [6]:
from torch import nn

input_ = "hi world!"
input_tokens = input_.split(' ')
input_ids = [word2id[word] if word in word2id else 0 for word in input_tokens]
input_tensor = torch.LongTensor([input_ids]) # the first dimension is minibatch size
print(input_tensor)

tensor([[0, 0]])


In [7]:
# One layer, average pooling and classification
class Baseline(nn.Module):
  def __init__(self, d):
    super(Baseline, self).__init__()
    self.embedding = nn.Embedding(len(vocab), d)
    self.layer = nn.Linear(d, d, bias=True)
    self.relu = nn.ReLU()
    self.class_layer = nn.Linear(d, 2, bias=True)

  def forward(self, input_tensor):
    emb = self.embedding(input_tensor)
    out = self.relu(self.layer(emb))
    avg = out.mean(1)
    logits = self.class_layer(avg)
    return logits

d = 3 # usually bigger, e.g. 128
baseline = Baseline(d)
logits = baseline(input_tensor)
softmax = nn.Softmax(1)
print(softmax(logits)) # probability for each class

tensor([[0.4660, 0.5340]], grad_fn=<SoftmaxBackward>)


Now we will compute the loss, which is the negative log probability of the input text's label being the target label (`1`), which in fact turns out to be equivalent to the cross entropy (https://en.wikipedia.org/wiki/Cross_entropy) between the probability distribution and a one-hot distribution of the target label (note that we use `logits` instead of `softmax(logits)` as the input to the cross entropy, which allow us to avoid numerical instability). 

In [None]:
cel = nn.CrossEntropyLoss()
label = torch.LongTensor([1]) # The ground truth label for "hi world!" is positive.
loss = cel(logits, label) # Loss, a.k.a L
print(loss)

tensor(0.7606, grad_fn=<NllLossBackward>)


Once we have the loss defined, only one step remains! We compute the gradients of parameters with respective to the loss and update. Fortunately, PyTorch does this for us in a very convenient way. Note that we used only one example to update the model, which is basically a Stochastic Gradient Descent (SGD) with minibatch size of 1. A recommended minibatch size in this exercise is at least 16. It is also recommended that you reuse your training data at least 10 times (i.e. 10 *epochs*).

In [None]:
optimizer = torch.optim.SGD(baseline.parameters(), lr=0.1)
optimizer.zero_grad() # reset process
loss.backward() # compute gradients
optimizer.step() # update parameters

RuntimeError: ignored

Once you have done this, all weight parameters will have `grad` attributes that contain their gradients with respect to the loss.

In [None]:
print(baseline.layer.weight.grad) # dL/dw of weights in the linear layer

tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])


**Problem 3.1** *(10 points)* Properly train this average-pooling baseline model on SST-2 and report the model's accuracy on the dev data.

**Answer 3.1** 


In [8]:
# Constructing the dataset
import random

vocab, word2id = make_reduced_vocab(train_fname, min_occur=2)
vocab_size = len(vocab)
batch_size = 128
shuffle = False

UNK_TOKEN = 0

def unk_preprocessing(word, word2id_):
    # one_hot = torch.zeros(len(word2id_))
    if word in word2id_:
        index = word2id_[word]
    else:
        index = UNK_TOKEN
    # one_hot[index]=1.
    return index
    # return one_hot

def seq2id(seq, word2id_):
    sentence = seq.strip().split(' ')
    source = [unk_preprocessing(word, word2id_) for word in sentence]
    return source

train_df = pd.read_csv(train_fname, sep='\t')
train_src = [seq2id(seq, word2id) for seq in train_df['sentence'].tolist()]
train_tgt = train_df['label'].tolist()

dev_df = pd.read_csv(dev_fname, sep='\t')
dev_src = [seq2id(seq, word2id) for seq in dev_df['sentence'].tolist()]
dev_tgt = dev_df['label'].tolist()


# for sentence in dev_src:
#     print(sentence.count(0))

class DataLoader:
    def __init__(self, src, tgt, batch_size, pad_idx, shuffle=False):
        assert len(src) == len(tgt)
        self.src = src
        self.tgt = tgt
        self.size = len(src)
        self.batch_size = batch_size
        self.pad_idx = pad_idx
        self.shuffle = shuffle

    def __iter__(self):
        self.index = 0
        if self.shuffle:
            index = list(range(self.size))
            random.shuffle(index)

            shuffle_src = list()
            shuffle_tgt = list()

            for i in index:
                shuffle_src.append(self.src[i])
                shuffle_tgt.append(self.tgt[i])

            self.src = shuffle_src
            self.tgt = shuffle_tgt

        return self

    def pad(self, batch):
        max_len = 0
        for seq in batch:
            if max_len < len(seq):
                max_len = len(seq)

        for i, seq in enumerate(batch):
            if max_len > len(seq):
                batch[i] += [self.pad_idx] * (max_len - len(seq))
        seq_lens = torch.LongTensor([(batch[i] + [self.pad_idx]).index(self.pad_idx) for i in range(len(batch))])

        return batch, seq_lens

    def __next__(self):
        if self.batch_size * self.index >= self.size:
            raise StopIteration

        src_batch = self.src[self.batch_size * self.index : self.batch_size * (self.index+1)]
        tgt_batch = self.tgt[self.batch_size * self.index : self.batch_size * (self.index+1)]

        padded_src_batch, src_seq_lens = self.pad(src_batch)

        self.index += 1

        return padded_src_batch, src_seq_lens, tgt_batch

train_data_loader = DataLoader(train_src, train_tgt, batch_size=batch_size, pad_idx=1, shuffle=shuffle)
dev_data_loader = DataLoader(dev_src, dev_tgt, batch_size=batch_size, pad_idx=1, shuffle=shuffle)

print(len(train_src))
print(len(dev_src))

67349
872


In [9]:
# Training the baseline model on SST-2

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

d = 128 # usually bigger, e.g. 128
baseline = Baseline(d).to(device)

cel = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(baseline.parameters(), lr=0.1)

epochs = 50

for epoch in range(epochs):
    train_loss = 0
    train_accuracy = 0.0
    train_data_num = 0
    for i, (source, src_seq_lens, target) in enumerate(train_data_loader):
        train_tensor_src = torch.LongTensor(source).to(device)
        train_tensor_tgt = torch.LongTensor(target).to(device)
        logits = baseline(train_tensor_src)

        optimizer.zero_grad() # reset process
        loss = cel(logits, train_tensor_tgt) # Loss, a.k.a L
        loss.backward() # compute gradients
        optimizer.step() # update parameters
        train_loss += loss.item()
        
        _, train_preds = torch.max(logits, 1)
        train_accuracy += (train_preds == train_tensor_tgt).sum().float()

        train_data_num += train_tensor_tgt.shape[0]

    print('train:: Epoch:', '%04d' % (epoch + 1), 
                'cost =', '{:.6f},'.format(train_loss / train_data_num), 
                'acc =', '{:.6f}'.format(train_accuracy / train_data_num))
    
    if (epoch + 1) % 1 == 0:
        with torch.no_grad():
            valid_loss = 0
            valid_accuracy = 0.0
            valid_data_num = 0
            for i, (source, src_seq_lens, target) in enumerate(dev_data_loader):
                dev_tensor_src = torch.LongTensor(source).to(device)
                dev_tensor_tgt = torch.LongTensor(target).to(device)
                logits = baseline(dev_tensor_src)
                loss = cel(logits, dev_tensor_tgt) # Loss, a.k.a L
                valid_loss += loss.item()

                _, valid_preds = torch.max(logits, 1)
                valid_accuracy += (valid_preds == dev_tensor_tgt).sum().float()

                valid_data_num += dev_tensor_tgt.shape[0]
                
            print('valid:: Epoch:', '%04d' % (epoch + 1), 
                'cost =', '{:.6f},'.format(valid_loss / valid_data_num), 
                'acc =', '{:.6f}'.format(valid_accuracy / valid_data_num))


train:: Epoch: 0001 cost = 0.005359, acc = 0.558910
valid:: Epoch: 0001 cost = 0.005538, acc = 0.527523
train:: Epoch: 0002 cost = 0.005306, acc = 0.573743
valid:: Epoch: 0002 cost = 0.005425, acc = 0.567661


KeyboardInterrupt: ignored

In [None]:
I used d = 128, batch_size = 128, epoch = 100, for the baseline. The result is followed.

After 30 Iteration, It converges. (Validation Loss begins to increase)
train:: Epoch: 0030 cost = 0.003190 ,acc = 0.823991
valid:: Epoch: 0030 cost = 0.003767 ,acc = 0.770642

**Problem 3.2** *(10 points)* Implement a recurrent neural network (without using PyTorch's RNN module) where the output of the linear layer not only depends on the current input but also the previous output. Report the model's accuracy on the dev data. Is it better or worse than the baseline? Why?

**Answer 3.2** 

In [11]:
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# One RNN layer and classification
class RNNNode(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(RNNNode, self).__init__()

        self.W_h2h = torch.nn.Linear(input_size + hidden_size, hidden_size)
        self.W_h2y = torch.nn.Linear(hidden_size, hidden_size)
        self.tanh = torch.nn.Tanh()
        self.softmax = torch.nn.Softmax()
    
    def forward(self, input, hidden, src_batch_sizes):
        h_t = hidden.squeeze(0)
        h_last = h_t
        y_list = list()
        batch_size = int(src_batch_sizes[0])
        for i, batch in enumerate(src_batch_sizes):
            batch = int(batch)
            token = input[i][:batch, :]
            combined_input = torch.cat([token, h_t[:batch, :]], dim=1)
            h_last[:batch, :] = h_t = self.tanh(self.W_h2h(combined_input))
            y_t = self.W_h2y(h_t)
            y_list.append(nn.ZeroPad2d((0, 0, batch_size - batch, 0))(y_t)) # Zero-padding
        y = torch.stack(y_list, dim=0) # 
        return y, h_last.unsqueeze(0)


class RNNModel(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, n_layers):
        super(RNNModel, self).__init__()
        self.embedding = nn.Embedding(len(vocab), embedding_dim, padding_idx=1)

        # nn.RNN
        # self.rnn = nn.RNN(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=n_layers)

        # my RNN
        self.rnn = RNNNode(input_size=embedding_dim, hidden_size=hidden_dim)

        self.fc = nn.Linear(hidden_dim * n_layers, 2)
        self.softmax = torch.nn.Softmax()

    def forward(self, input_tensor, src_seq_lens, hidden):
        emb = self.embedding(input_tensor) # emb.shape = batch * len * hidden
        emb = emb.transpose(0, 1) # emb.shape = len * batch * hidden

        # nn.RNN
        # packed = pack_padded_sequence(emb, src_seq_lens.tolist(), batch_first=False)
        # outs, hidden = self.rnn(packed, hidden) # h0 = zero initialization
        # outs, out_lens = pad_packed_sequence(outs, batch_first=False)

        # my RNN
        # print("input shape : ", emb.shape, hidden.shape) # len * batch * emb, 1 * batch * hidden
        src_batch_sizes = seqs2batches(src_seq_lens)
        outs, hidden = self.rnn(emb, hidden, src_batch_sizes) # h0 = zero initialization
        # print("output shape : ", outs.shape, hidden.shape) # len * batch * hidden, 1 * batch * hidden

        hidden = hidden.transpose(0, 1)
        hidden = hidden.contiguous().view(hidden.shape[0], -1)
        logits = self.fc(hidden)
        # logits = self.fc(outs[-1])
        return logits


def seqs2batches(src_seq_lens):
    """
    This is same with batches2seqs() 
    """
    assert src_seq_lens is not None
    assert src_seq_lens[-1] > 0
    src_batch_sizes = torch.zeros(int(src_seq_lens[0]))
    pointer = int(src_seq_lens[0]) - 1
    for i, seq_len in enumerate(src_seq_lens.tolist() + [0]):
        while seq_len <= pointer:
            src_batch_sizes[pointer] = i
            pointer -= 1
    return src_batch_sizes


# Training the baseline model on SST-2

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

embedding_dim = 128 # usually bigger, e.g. 128
hidden_dim = 256
n_layers = 1
rnnmodel = RNNModel(embedding_dim, hidden_dim, n_layers).to(device)

cel = nn.CrossEntropyLoss()
cel = cel.to(device)
# optimizer = torch.optim.SGD(rnnmodel.parameters(), lr=1e-1)
optimizer = torch.optim.Adam(rnnmodel.parameters(), lr=2e-4)

epochs = 50
vocab_size = len(vocab)

for epoch in range(epochs):
    train_loss = 0
    train_accuracy = 0.0
    train_data_num = 0
    for i, (source, src_seq_lens, target) in enumerate(train_data_loader):
        train_tensor_src = torch.LongTensor(source).to(device)
        train_tensor_tgt = torch.LongTensor(target).to(device)

        sorted_seq_lens, sorted_indices = torch.sort(src_seq_lens, descending=True)
        train_sorted_src = train_tensor_src[sorted_indices]
        train_sorted_tgt = train_tensor_tgt[sorted_indices]
        b_size = train_sorted_src.shape[0]

        # print(src_seq_lens.shape, src_batch_sizes.shape, seqs2batches(src_batch_sizes).shape)
        h0 = torch.zeros(n_layers, b_size, hidden_dim, requires_grad=True).to(device)
        # logits = rnnmodel(train_tensor_src, src_seq_lens, h0)
        logits = rnnmodel(train_sorted_src, sorted_seq_lens, h0)

        # print(train_tensor_src.shape, train_sorted_tgt.shape, logits.shape)

        optimizer.zero_grad() # reset process
        loss = cel(logits, train_sorted_tgt) # Loss, a.k.a L
        loss.backward() # compute gradients
        # torch.nn.utils.clip_grad_norm_(rnnmodel.parameters(), 5) # gradent clipping
        optimizer.step() # update parameters
        train_loss += loss.item()
        
        _, train_preds = torch.max(logits, 1)
        train_accuracy += (train_preds == train_sorted_tgt).sum().float()

        train_data_num += train_sorted_tgt.shape[0]

    print('train:: Epoch:', '%04d' % (epoch + 1), 
                'cost =', '{:.6f},'.format(train_loss / train_data_num), 
                'acc =', '{:.6f}'.format(train_accuracy / train_data_num))
    
    if (epoch + 1) % 1 == 0:
        with torch.no_grad():
            valid_loss = 0
            valid_accuracy = 0.0
            valid_data_num = 0
            for i, (source, src_seq_lens, target) in enumerate(dev_data_loader):
                dev_tensor_src = torch.LongTensor(source).to(device)
                dev_tensor_tgt = torch.LongTensor(target).to(device)
            
                sorted_seq_lens, sorted_indices = torch.sort(src_seq_lens, descending=True)
                dev_sorted_src = dev_tensor_src[sorted_indices]
                dev_sorted_tgt = dev_tensor_tgt[sorted_indices]

                h0 = torch.zeros(n_layers, dev_sorted_src.shape[0], hidden_dim, requires_grad=True).to(device)
                logits = rnnmodel(dev_sorted_src, sorted_seq_lens, h0)

                loss = cel(logits, dev_sorted_tgt) # Loss, a.k.a L
                valid_loss += loss.item()

                _, valid_preds = torch.max(logits, 1)
                valid_accuracy += (valid_preds == dev_sorted_tgt).sum().float()

                valid_data_num += dev_sorted_tgt.shape[0]
                
            print('valid:: Epoch:', '%04d' % (epoch + 1), 
                'cost =', '{:.6f},'.format(valid_loss / valid_data_num), 
                'acc =', '{:.6f}'.format(valid_accuracy / valid_data_num))
        

train:: Epoch: 0001 cost = 0.005026, acc = 0.623142
valid:: Epoch: 0001 cost = 0.005065, acc = 0.642202


KeyboardInterrupt: ignored

In [None]:
It is slightly better than the baseline. (Either cases, nn.RNN or my code)
RNN can read the context between words. (time-series data)


After 10 Iter, It converges. (Validation Loss begins to increase at 6th iter)
...
train:: Epoch: 0001 cost = 0.005095, acc = 0.610655
valid:: Epoch: 0001 cost = 0.005056, acc = 0.646789
train:: Epoch: 0002 cost = 0.004459, acc = 0.700203
valid:: Epoch: 0002 cost = 0.004732, acc = 0.685780
train:: Epoch: 0003 cost = 0.003854, acc = 0.760976
valid:: Epoch: 0003 cost = 0.004496, acc = 0.735092
train:: Epoch: 0004 cost = 0.003335, acc = 0.804229
valid:: Epoch: 0004 cost = 0.004379, acc = 0.751147
train:: Epoch: 0005 cost = 0.002910, acc = 0.837340
valid:: Epoch: 0005 cost = 0.004287, acc = 0.769495
train:: Epoch: 0006 cost = 0.002569, acc = 0.860696
valid:: Epoch: 0006 cost = 0.004296, acc = 0.770642 -> Minimal Val Loss
train:: Epoch: 0007 cost = 0.002291, acc = 0.878899
valid:: Epoch: 0007 cost = 0.004403, acc = 0.780963
train:: Epoch: 0008 cost = 0.002055, acc = 0.893644
valid:: Epoch: 0008 cost = 0.004595, acc = 0.784404
train:: Epoch: 0009 cost = 0.001855, acc = 0.905670
valid:: Epoch: 0009 cost = 0.004847, acc = 0.791284
train:: Epoch: 0010 cost = 0.001684, acc = 0.915366
valid:: Epoch: 0010 cost = 0.005108, acc = 0.791284 -> Maximal Val Acc


**Problem 3.3 (bonus)** *(10 points)* Show that the cross entropy computed above is equivalent to the negative log likelihood of the probability distribution.



**Answer 3.3** \\
 
We assume that the probability distribution follows bernoulli's distribution. 

$$
P\left(Y=y_{i}\right)=p^{y_{i}}(1-p)^{1-y_{i}} \quad\left(y_{i}=0,1\right)
$$


Therefore, the likelihood function is product of pdfs.

$$
L=\prod_{i} p^{y_{i}}(1-p)^{1-y_{i}}
$$

Then, "negative" "log" likelihood function is 
$$
l=-\log L=-\log [\prod_{i} p^{y_{i}}(1-p)^{1-y_{i}}]
\\
= -\sum_{i} \log [p^{y_{i}}(1-p)^{1-y_{i}}]
\\
= -\sum_{i} [{y_{i}}\log p + ({1-y_{i}})\log(1-p)]
$$

Actually, this is the binary cross-entropy. \\

The definition of cross-entropy is

$$
\sum_{x} g(x) \log \frac{1}{f(x)}=-\sum_{x} g(x) \log f(x)
$$

So, in the binary classification problem, negative log likelihood is equvalent to (binary) cross-entropy.

**Problem 3.4 (bonus)** *(10 points)* Why is it numerically unstable if you compute log on top of softmax?

**Answer 3.4**  \\

Softmax function is numerically unstable when too large numbers(or too small number, << 1) are approximated as infinity (or zero).

There is an example code.

In [None]:
import numpy as np

def softmax(x):
    prob = np.exp(x) / np.sum(np.exp(x))
    return prob

outs = np.array([10.0, 20.0, 30.0, 9000.0])
print(softmax(outs))

[ 0.  0.  0. nan]


  after removing the cwd from sys.path.
  after removing the cwd from sys.path.


The result is [ 0.  0.  0. nan].
To avoid this, we need to use other softmax function.

There is an example code.

In [None]:
import numpy as np

def softmax(x):
    # prob = np.exp(x) / np.sum(np.exp(x))
    z = x - np.max(x) # It makes the softmax function is numerically stable.
    prob = np.exp(z) / np.sum(np.exp(z)) 
    return prob

outs = np.array([10.0, 20.0, 30.0, 9000.0])
print(softmax(outs))

[0. 0. 0. 1.]


## 4. Text Classification with LSTM and Dropout

Now it is time to improve your baselines! Replace your RNN module with an LSTM module. See Lecture slides 04 and 05 for the formal definition of LSTMs. 

You will also use Dropout, which randomly makes each dimension zero with the probability of `p` and scale it by `1/(1-p)` if it is not zero during training. Put it either at the input or the output of the LSTM to prevent it from overfitting.

In [None]:
a = torch.FloatTensor([0.1, 0.3, 0.5, 0.7, 0.9])
dropout = nn.Dropout(0.5) # p=0.5
print(dropout(a))

tensor([0.2000, 0.0000, 0.0000, 0.0000, 1.8000])


Problem 4.1 (20 points) Implement and use LSTM (without using PyTorch's LSTM module) instead of vanilla RNN to improve your model. Report the accuracy on the dev data.

**Answer 4.1** 

In [None]:
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# One LSTM layer and classification
class LSTMNode(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(LSTMNode, self).__init__()

        self.W_ft = torch.nn.Linear(input_size + hidden_size, hidden_size) # forget gate
        self.W_it = torch.nn.Linear(input_size + hidden_size, hidden_size) # input gate
        self.W_ot = torch.nn.Linear(input_size + hidden_size, hidden_size) # output gate
        self.W_ct = torch.nn.Linear(input_size + hidden_size, hidden_size) # cell state
        self.W_ht = torch.nn.Linear(hidden_size, hidden_size) # hidden state
        self.W_h2y = torch.nn.Linear(hidden_size, hidden_size)
        self.sigmoid = torch.nn.Sigmoid()
        self.tanh = torch.nn.Tanh()
    
    def forward(self, input, hidden, cell, src_batch_sizes):
        h_t = hidden.squeeze(0)
        c_t = cell.squeeze(0)
        h_last = h_t
        c_last = c_t
        y_list = list()
        batch_size = int(src_batch_sizes[0])

        for i, batch in enumerate(src_batch_sizes):
            batch = int(batch)
            token = input[i][:batch, :].clone()
            combined_input = torch.cat([token, h_t[:batch, :]], dim=1)

            f_t = self.sigmoid(self.W_ft(combined_input))
            i_t = self.sigmoid(self.W_it(combined_input))
            o_t = self.sigmoid(self.W_ot(combined_input))
            c_hat_t = self.tanh(self.W_ct(combined_input))
            
            c_last[:batch, :] = c_t = f_t*c_t[:batch, :].clone() + i_t*c_hat_t
            h_last[:batch, :] = h_t = o_t*self.tanh(c_t)

            y_t = self.W_h2y(h_t)
            y_list.append(nn.ZeroPad2d((0, 0, batch_size - batch, 0))(y_t)) # Zero-padding
        y = torch.stack(y_list, dim=0)

        return y, c_last.unsqueeze(0), h_last.unsqueeze(0)


class LSTMModel(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, n_layers, dropout):
        super(LSTMModel, self).__init__()
        self.embedding = nn.Embedding(len(vocab), embedding_dim)

        # nn.LSTM
        # self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=n_layers)

        # my LSTM
        self.lstm = LSTMNode(input_size=embedding_dim, hidden_size=hidden_dim)

        self.fc = nn.Linear(hidden_dim * n_layers, 2, bias=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_tensor, src_seq_lens, hidden, cell):
        emb = self.embedding(input_tensor) # emb.shape = batch * len * hidden
        # emb = self.dropout(emb)
        emb = emb.transpose(0, 1) # emb.shape = len * batch * hidden

        # nn.LSTM
        # packed = pack_padded_sequence(emb, src_seq_lens.tolist(), batch_first=False)
        # outs, (hidden, cell) = self.lstm(packed) # h0 = zero initialization
        # outs, out_lens = pad_packed_sequence(outs, batch_first=False)

        # my LSTM
        # print(hidden.shape, cell.shape)
        # print("input shape : ", emb.shape, hidden.shape) # len * batch * emb, 1 * batch * hidden
        src_batch_sizes = seqs2batches(src_seq_lens)
        outs, cell, hidden = self.lstm(emb, hidden, cell, src_batch_sizes) # h0 = zero initialization
        # print("output shape : ", outs.shape, cell.shape, hidden.shape) # len * batch * hidden, 1 * batch * hidden, 1 * batch * hidden

        hidden = hidden.transpose(0, 1)
        hidden = hidden.contiguous().view(hidden.shape[0], -1)

        # hidden = self.dropout(hidden)
        logits = self.fc(hidden)
        # logits = self.fc(outs[-1])
        return logits


def seqs2batches(src_seq_lens):
    """
    This is same with batches2seqs() 
    """
    assert src_seq_lens is not None
    assert src_seq_lens[-1] > 0
    src_batch_sizes = torch.zeros(int(src_seq_lens[0]))
    pointer = int(src_seq_lens[0]) - 1
    for i, seq_len in enumerate(src_seq_lens.tolist() + [0]):
        while seq_len <= pointer:
            src_batch_sizes[pointer] = i
            pointer -= 1
    return src_batch_sizes


# Training the baseline model on SST-2

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

embedding_dim = 128 # usually bigger, e.g. 128
hidden_dim = 256
n_layers = 1
dropout = 0.5
rnnmodel = LSTMModel(embedding_dim, hidden_dim, n_layers, dropout).to(device)

cel = nn.CrossEntropyLoss()
# optimizer = torch.optim.SGD(rnnmodel.parameters(), lr=1e-1)
optimizer = torch.optim.Adam(rnnmodel.parameters(), lr=1e-4)

epochs = 150

for epoch in range(epochs):
    train_loss = 0
    train_accuracy = 0.0
    train_data_num = 0
    for i, (source, src_seq_lens, target) in enumerate(train_data_loader):
        train_tensor_src = torch.LongTensor(source).to(device)
        train_tensor_tgt = torch.LongTensor(target).to(device)

        sorted_seq_lens, sorted_indices = torch.sort(src_seq_lens, descending=True)
        train_sorted_src = train_tensor_src[sorted_indices]
        train_sorted_tgt = train_tensor_tgt[sorted_indices]

        # print(src_seq_lens.shape, src_batch_sizes.shape, seqs2batches(src_batch_sizes).shape)
        h0 = torch.zeros(1, train_sorted_src.shape[0], hidden_dim, requires_grad=True).to(device)
        c0 = torch.zeros(1, train_sorted_src.shape[0], hidden_dim, requires_grad=True).to(device)
        logits = rnnmodel(train_sorted_src, sorted_seq_lens, h0, c0)

        # print(train_tensor_src.shape, train_tensor_tgt.shape, logits.shape)

        optimizer.zero_grad() # reset process
        loss = cel(logits, train_sorted_tgt) # Loss, a.k.a L
        loss.backward() # compute gradients
        # torch.nn.utils.clip_grad_norm_(rnnmodel.parameters(), 5) # gradent clipping
        optimizer.step() # update parameters
        train_loss += loss.item()
        
        _, train_preds = torch.max(logits, 1)
        train_accuracy += (train_preds == train_sorted_tgt).sum().float()

        train_data_num += train_sorted_tgt.shape[0]

    print('train:: Epoch:', '%04d' % (epoch + 1), 
                'cost =', '{:.6f},'.format(train_loss / train_data_num), 
                'acc =', '{:.6f}'.format(train_accuracy / train_data_num))
    
    if (epoch + 1) % 1 == 0:
        with torch.no_grad():
            valid_loss = 0
            valid_accuracy = 0.0
            valid_data_num = 0
            for i, (source, src_seq_lens, target) in enumerate(dev_data_loader):
                dev_tensor_src = torch.LongTensor(source).to(device)
                dev_tensor_tgt = torch.LongTensor(target).to(device)
            
                sorted_seq_lens, sorted_indices = torch.sort(src_seq_lens, descending=True)
                dev_sorted_src = dev_tensor_src[sorted_indices]
                dev_sorted_tgt = dev_tensor_tgt[sorted_indices]

                h0 = torch.zeros(1, dev_sorted_src.shape[0], hidden_dim, requires_grad=True).to(device)
                c0 = torch.zeros(1, dev_sorted_src.shape[0], hidden_dim, requires_grad=True).to(device)
                logits = rnnmodel(dev_sorted_src, sorted_seq_lens, h0, c0)

                loss = cel(logits, dev_sorted_tgt) # Loss, a.k.a L
                valid_loss += loss.item()

                _, valid_preds = torch.max(logits, 1)
                valid_accuracy += (valid_preds == dev_sorted_tgt).sum().float()

                valid_data_num += dev_sorted_tgt.shape[0]
                
            print('valid:: Epoch:', '%04d' % (epoch + 1), 
                'cost =', '{:.6f},'.format(valid_loss / valid_data_num), 
                'acc =', '{:.6f}'.format(valid_accuracy / valid_data_num))
        


In [None]:
It is slightly better than the RNN. (Either cases, nn.LSTM or my code)
LSTM also read the context between words. (time-series data)

nn.LSTM is slightly better than my code.
But, both are overfit too fast. we need the regularization (ex. Dropout)

# my LSTM
After 2 Iter, It converges. (Validation Loss begins to increase at 2th iter)
train:: Epoch: 0001 cost = 0.005046, acc = 0.620276
valid:: Epoch: 0001 cost = 0.004601, acc = 0.706422
train:: Epoch: 0002 cost = 0.004217, acc = 0.728266
valid:: Epoch: 0002 cost = 0.004331, acc = 0.731651 -> Minimal Val Loss
train:: Epoch: 0003 cost = 0.003472, acc = 0.792142
valid:: Epoch: 0003 cost = 0.004532, acc = 0.754587
train:: Epoch: 0004 cost = 0.002940, acc = 0.831430
valid:: Epoch: 0004 cost = 0.004979, acc = 0.756881
train:: Epoch: 0005 cost = 0.002581, acc = 0.856256
valid:: Epoch: 0005 cost = 0.004944, acc = 0.769495
train:: Epoch: 0006 cost = 0.002306, acc = 0.873985
valid:: Epoch: 0006 cost = 0.005360, acc = 0.776376
train:: Epoch: 0007 cost = 0.002091, acc = 0.887066
valid:: Epoch: 0007 cost = 0.005567, acc = 0.780963
train:: Epoch: 0008 cost = 0.001913, acc = 0.899152
valid:: Epoch: 0008 cost = 0.005734, acc = 0.775229
train:: Epoch: 0009 cost = 0.001777, acc = 0.906888
valid:: Epoch: 0009 cost = 0.005808, acc = 0.769495
train:: Epoch: 0010 cost = 0.001653, acc = 0.915158
valid:: Epoch: 0010 cost = 0.006519, acc = 0.774083
train:: Epoch: 0011 cost = 0.001562, acc = 0.920637
valid:: Epoch: 0011 cost = 0.006924, acc = 0.783257
train:: Epoch: 0012 cost = 0.001464, acc = 0.926205
valid:: Epoch: 0012 cost = 0.007025, acc = 0.802752 -> Maximal Val Acc

# nn.LSTM
After 3 Iter, It converges. (Validation Loss begins to increase at 3th iter)
train:: Epoch: 0001 cost = 0.005045, acc = 0.621464
valid:: Epoch: 0001 cost = 0.004841, acc = 0.677752
train:: Epoch: 0002 cost = 0.004179, acc = 0.731533
valid:: Epoch: 0002 cost = 0.004319, acc = 0.721330
train:: Epoch: 0003 cost = 0.003417, acc = 0.796849
valid:: Epoch: 0003 cost = 0.004258, acc = 0.753440 -> Minimal Val Loss
train:: Epoch: 0004 cost = 0.002875, acc = 0.835454
valid:: Epoch: 0004 cost = 0.004336, acc = 0.764908
train:: Epoch: 0005 cost = 0.002508, acc = 0.860354
valid:: Epoch: 0005 cost = 0.004433, acc = 0.780963
train:: Epoch: 0006 cost = 0.002240, acc = 0.878320
valid:: Epoch: 0006 cost = 0.004692, acc = 0.790138
train:: Epoch: 0007 cost = 0.002028, acc = 0.892070
valid:: Epoch: 0007 cost = 0.004736, acc = 0.792431
train:: Epoch: 0008 cost = 0.001864, acc = 0.901840
valid:: Epoch: 0008 cost = 0.004864, acc = 0.797018
train:: Epoch: 0009 cost = 0.001728, acc = 0.909947
valid:: Epoch: 0009 cost = 0.005114, acc = 0.798165
train:: Epoch: 0010 cost = 0.001621, acc = 0.916020
valid:: Epoch: 0010 cost = 0.005273, acc = 0.801605
train:: Epoch: 0011 cost = 0.001515, acc = 0.922167
valid:: Epoch: 0011 cost = 0.005428, acc = 0.801605
train:: Epoch: 0012 cost = 0.001436, acc = 0.927230
valid:: Epoch: 0012 cost = 0.005878, acc = 0.803899 -> Maximal Val Acc

Problem 4.2 (10 points) Use Dropout on LSTM (either at input or output). Report the accuracy on the dev data and briefly describe how it differs from 4.1.

**Answer 4.2** 

In [None]:
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# One LSTM layer and classification
class LSTMNode(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(LSTMNode, self).__init__()

        self.W_ft = torch.nn.Linear(input_size + hidden_size, hidden_size) # forget gate
        self.W_it = torch.nn.Linear(input_size + hidden_size, hidden_size) # input gate
        self.W_ot = torch.nn.Linear(input_size + hidden_size, hidden_size) # output gate
        self.W_ct = torch.nn.Linear(input_size + hidden_size, hidden_size) # cell state
        self.W_ht = torch.nn.Linear(hidden_size, hidden_size) # hidden state
        self.W_h2y = torch.nn.Linear(hidden_size, hidden_size)
        self.sigmoid = torch.nn.Sigmoid()
        self.tanh = torch.nn.Tanh()
    
    def forward(self, input, hidden, cell, src_batch_sizes):
        h_t = hidden.squeeze(0)
        c_t = cell.squeeze(0)
        h_last = h_t
        c_last = c_t
        y_list = list()
        batch_size = int(src_batch_sizes[0])

        for i, batch in enumerate(src_batch_sizes):
            batch = int(batch)
            token = input[i][:batch, :].clone()
            combined_input = torch.cat([token, h_t[:batch, :]], dim=1)

            f_t = self.sigmoid(self.W_ft(combined_input))
            i_t = self.sigmoid(self.W_it(combined_input))
            o_t = self.sigmoid(self.W_ot(combined_input))
            c_hat_t = self.tanh(self.W_ct(combined_input))
            
            c_last[:batch, :] = c_t = f_t*c_t[:batch, :].clone() + i_t*c_hat_t
            h_last[:batch, :] = h_t = o_t*self.tanh(c_t)

            y_t = self.W_h2y(h_t)
            y_list.append(nn.ZeroPad2d((0, 0, batch_size - batch, 0))(y_t)) # Zero-padding
        y = torch.stack(y_list, dim=0)

        return y, c_last.unsqueeze(0), h_last.unsqueeze(0)


class LSTMModel(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, n_layers, dropout):
        super(LSTMModel, self).__init__()
        self.embedding = nn.Embedding(len(vocab), embedding_dim)

        # nn.LSTM
        # self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=n_layers, dropout=dropout)

        # my LSTM
        self.lstm = LSTMNode(input_size=embedding_dim, hidden_size=hidden_dim)

        self.fc = nn.Linear(hidden_dim * n_layers, 2, bias=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_tensor, src_seq_lens, hidden, cell):
        emb = self.embedding(input_tensor) # emb.shape = batch * len * hidden
        emb = self.dropout(emb)
        emb = emb.transpose(0, 1) # emb.shape = len * batch * hidden

        # nn.LSTM
        # packed = pack_padded_sequence(emb, src_seq_lens.tolist(), batch_first=False)
        # outs, (hidden, cell) = self.lstm(packed) # h0 = zero initialization
        # outs, out_lens = pad_packed_sequence(outs, batch_first=False)

        # my LSTM
        # print(hidden.shape, cell.shape)
        # print("input shape : ", emb.shape, hidden.shape) # len * batch * emb, 1 * batch * hidden
        src_batch_sizes = seqs2batches(src_seq_lens)
        outs, cell, hidden = self.lstm(emb, hidden, cell, src_batch_sizes) # h0 = zero initialization
        # print("output shape : ", outs.shape, cell.shape, hidden.shape) # len * batch * hidden, 1 * batch * hidden, 1 * batch * hidden

        hidden = hidden.transpose(0, 1)
        hidden = hidden.contiguous().view(hidden.shape[0], -1)

        hidden = self.dropout(hidden)
        logits = self.fc(hidden)
        # logits = self.fc(outs[-1])
        return logits


def seqs2batches(src_seq_lens):
    """
    This is same with batches2seqs() 
    """
    assert src_seq_lens is not None
    assert src_seq_lens[-1] > 0
    src_batch_sizes = torch.zeros(int(src_seq_lens[0]))
    pointer = int(src_seq_lens[0]) - 1
    for i, seq_len in enumerate(src_seq_lens.tolist() + [0]):
        while seq_len <= pointer:
            src_batch_sizes[pointer] = i
            pointer -= 1
    return src_batch_sizes


# Training the baseline model on SST-2

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

embedding_dim = 128 # usually bigger, e.g. 128
hidden_dim = 256
n_layers = 1
dropout = 0.5
rnnmodel = LSTMModel(embedding_dim, hidden_dim, n_layers, dropout).to(device)

cel = nn.CrossEntropyLoss()
# optimizer = torch.optim.SGD(rnnmodel.parameters(), lr=1e-1)
optimizer = torch.optim.Adam(rnnmodel.parameters(), lr=1e-3)

epochs = 50

for epoch in range(epochs):
    train_loss = 0
    train_accuracy = 0.0
    train_data_num = 0
    for i, (source, src_seq_lens, target) in enumerate(train_data_loader):
        train_tensor_src = torch.LongTensor(source).to(device)
        train_tensor_tgt = torch.LongTensor(target).to(device)

        sorted_seq_lens, sorted_indices = torch.sort(src_seq_lens, descending=True)
        train_sorted_src = train_tensor_src[sorted_indices]
        train_sorted_tgt = train_tensor_tgt[sorted_indices]

        # print(src_seq_lens.shape, src_batch_sizes.shape, seqs2batches(src_batch_sizes).shape)
        h0 = torch.zeros(1, train_sorted_src.shape[0], hidden_dim, requires_grad=True).to(device)
        c0 = torch.zeros(1, train_sorted_src.shape[0], hidden_dim, requires_grad=True).to(device)
        logits = rnnmodel(train_sorted_src, sorted_seq_lens, h0, c0)

        # print(train_tensor_src.shape, train_tensor_tgt.shape, logits.shape)

        optimizer.zero_grad() # reset process
        loss = cel(logits, train_sorted_tgt) # Loss, a.k.a L
        loss.backward() # compute gradients
        # torch.nn.utils.clip_grad_norm_(rnnmodel.parameters(), 5) # gradent clipping
        optimizer.step() # update parameters
        train_loss += loss.item()
        
        _, train_preds = torch.max(logits, 1)
        train_accuracy += (train_preds == train_sorted_tgt).sum().float()

        train_data_num += train_sorted_tgt.shape[0]

    print('train:: Epoch:', '%04d' % (epoch + 1), 
                'cost =', '{:.6f},'.format(train_loss / train_data_num), 
                'acc =', '{:.6f}'.format(train_accuracy / train_data_num))
    
    if (epoch + 1) % 1 == 0:
        with torch.no_grad():
            valid_loss = 0
            valid_accuracy = 0.0
            valid_data_num = 0
            for i, (source, src_seq_lens, target) in enumerate(dev_data_loader):
                dev_tensor_src = torch.LongTensor(source).to(device)
                dev_tensor_tgt = torch.LongTensor(target).to(device)
            
                sorted_seq_lens, sorted_indices = torch.sort(src_seq_lens, descending=True)
                dev_sorted_src = dev_tensor_src[sorted_indices]
                dev_sorted_tgt = dev_tensor_tgt[sorted_indices]

                h0 = torch.zeros(1, dev_sorted_src.shape[0], hidden_dim, requires_grad=True).to(device)
                c0 = torch.zeros(1, dev_sorted_src.shape[0], hidden_dim, requires_grad=True).to(device)
                logits = rnnmodel(dev_sorted_src, sorted_seq_lens, h0, c0)

                loss = cel(logits, dev_sorted_tgt) # Loss, a.k.a L
                valid_loss += loss.item()

                _, valid_preds = torch.max(logits, 1)
                valid_accuracy += (valid_preds == dev_sorted_tgt).sum().float()

                valid_data_num += dev_sorted_tgt.shape[0]
                
            print('valid:: Epoch:', '%04d' % (epoch + 1), 
                'cost =', '{:.6f},'.format(valid_loss / valid_data_num), 
                'acc =', '{:.6f}'.format(valid_accuracy / valid_data_num))
        


RuntimeError: ignored

In [None]:
It is better than the vanilla LSTM. (Accuracy)
and it takes more time for training.
Dropout works as regularizer, so the model does not overfit too fast.
(The validation loss of this model is lower than others model.)

# my LSTM
After 5 Iter, It converges. (Validation Loss begins to increase at 5th iter)
train:: Epoch: 0001 cost = 0.004816, acc = 0.649646
valid:: Epoch: 0001 cost = 0.004635, acc = 0.717890
train:: Epoch: 0002 cost = 0.003708, acc = 0.770687
valid:: Epoch: 0002 cost = 0.004408, acc = 0.744266
train:: Epoch: 0003 cost = 0.003049, acc = 0.823724
valid:: Epoch: 0003 cost = 0.003992, acc = 0.776376
train:: Epoch: 0004 cost = 0.002638, acc = 0.851609
valid:: Epoch: 0004 cost = 0.003777, acc = 0.785550
train:: Epoch: 0005 cost = 0.002331, acc = 0.872916
valid:: Epoch: 0005 cost = 0.003772, acc = 0.800459 -> Minimal Val Loss
train:: Epoch: 0006 cost = 0.002082, acc = 0.889382
valid:: Epoch: 0006 cost = 0.003856, acc = 0.798165
train:: Epoch: 0007 cost = 0.001931, acc = 0.898053
valid:: Epoch: 0007 cost = 0.003924, acc = 0.807339
train:: Epoch: 0008 cost = 0.001785, acc = 0.906591
valid:: Epoch: 0008 cost = 0.003915, acc = 0.805046
train:: Epoch: 0009 cost = 0.001650, acc = 0.914861
valid:: Epoch: 0009 cost = 0.003891, acc = 0.810780
train:: Epoch: 0010 cost = 0.001551, acc = 0.919910
valid:: Epoch: 0010 cost = 0.003962, acc = 0.806193
train:: Epoch: 0011 cost = 0.001486, acc = 0.923830
valid:: Epoch: 0011 cost = 0.004208, acc = 0.825688
train:: Epoch: 0012 cost = 0.001389, acc = 0.930095
valid:: Epoch: 0012 cost = 0.004485, acc = 0.809633
train:: Epoch: 0013 cost = 0.001314, acc = 0.935218
valid:: Epoch: 0013 cost = 0.004200, acc = 0.814220
train:: Epoch: 0013 cost = 0.001314, acc = 0.935218
valid:: Epoch: 0013 cost = 0.004200, acc = 0.814220
train:: Epoch: 0014 cost = 0.001282, acc = 0.935946
valid:: Epoch: 0014 cost = 0.004002, acc = 0.830275
train:: Epoch: 0015 cost = 0.001205, acc = 0.939331
valid:: Epoch: 0015 cost = 0.004838, acc = 0.808486
train:: Epoch: 0016 cost = 0.001151, acc = 0.943785
valid:: Epoch: 0016 cost = 0.004312, acc = 0.823394
train:: Epoch: 0017 cost = 0.001083, acc = 0.946161
valid:: Epoch: 0017 cost = 0.004074, acc = 0.832569
train:: Epoch: 0018 cost = 0.001050, acc = 0.947557
valid:: Epoch: 0018 cost = 0.004598, acc = 0.826835
train:: Epoch: 0019 cost = 0.001021, acc = 0.949650
valid:: Epoch: 0019 cost = 0.003923, acc = 0.837156 -> Maximal Val Acc
train:: Epoch: 0020 cost = 0.000978, acc = 0.952264
valid:: Epoch: 0020 cost = 0.004235, acc = 0.825688



Problem 4.3 (bonus) (10 points) Consider implementing bidirectional LSTM and two layers of LSTM to further improve your model. Report your accuracy on dev data.

**Answer 4.3** 

In [None]:
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Two bidirectional LSTM layer and classification
class BidirectionalLSTMNode(nn.Module):
    def __init__(self, input_size, hidden_size, n_layers):
        super(LSTMNode, self).__init__()

        self.W_ft = torch.nn.Linear(input_size + hidden_size, hidden_size) # forget gate
        self.W_it = torch.nn.Linear(input_size + hidden_size, hidden_size) # input gate
        self.W_ot = torch.nn.Linear(input_size + hidden_size, hidden_size) # output gate
        self.W_ct = torch.nn.Linear(input_size + hidden_size, hidden_size) # cell state
        self.W_ht = torch.nn.Linear(hidden_size, hidden_size) # hidden state
        self.W_h2y = torch.nn.Linear(hidden_size, hidden_size)
        self.sigmoid = torch.nn.Sigmoid()
        self.tanh = torch.nn.Tanh()
    
    def forward(self, input, hidden, cell, src_batch_sizes):
        h_t = hidden.squeeze(0)
        c_t = cell.squeeze(0)
        h_last = h_t
        c_last = c_t
        y_list = list()
        batch_size = int(src_batch_sizes[0])

        for i, batch in enumerate(src_batch_sizes):
            batch = int(batch)
            token = input[i][:batch, :].clone()
            combined_input = torch.cat([token, h_t[:batch, :]], dim=1)

            f_t = self.sigmoid(self.W_ft(combined_input))
            i_t = self.sigmoid(self.W_it(combined_input))
            o_t = self.sigmoid(self.W_ot(combined_input))
            c_hat_t = self.tanh(self.W_ct(combined_input))
            
            c_last[:batch, :] = c_t = f_t*c_t[:batch, :].clone() + i_t*c_hat_t
            h_last[:batch, :] = h_t = o_t*self.tanh(c_t)

            y_t = self.W_h2y(h_t)
            y_list.append(nn.ZeroPad2d((0, 0, batch_size - batch, 0))(y_t)) # Zero-padding
        y = torch.stack(y_list, dim=0)

        return y, c_last.unsqueeze(0), h_last.unsqueeze(0)


class BidirectionalLSTMModel(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, n_layers, dropout, bidirectional):
        super(LSTMModel, self).__init__()
        self.embedding = nn.Embedding(len(vocab), embedding_dim)

        # nn.LSTM
        # self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=n_layers, dropout=dropout, bidirectional=bidirectional)

        # my LSTM
        self.lstm = LSTMNode(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=n_layers)

        self.fc = nn.Linear(hidden_dim * n_layers, 2, bias=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_tensor, src_seq_lens, hidden, cell):
        emb = self.embedding(input_tensor) # emb.shape = batch * len * hidden
        emb = self.dropout(emb)
        emb = emb.transpose(0, 1) # emb.shape = len * batch * hidden

        # nn.LSTM
        # packed = pack_padded_sequence(emb, src_seq_lens.tolist(), batch_first=False)
        # outs, (hidden, cell) = self.lstm(packed) # h0 = zero initialization
        # outs, out_lens = pad_packed_sequence(outs, batch_first=False)

        # my LSTM
        # print(hidden.shape, cell.shape)
        # print("input shape : ", emb.shape, hidden.shape) # len * batch * emb, 1 * batch * hidden
        src_batch_sizes = seqs2batches(src_seq_lens)
        outs, cell, hidden = self.lstm(emb, hidden, cell, src_batch_sizes) # h0 = zero initialization
        # print("output shape : ", outs.shape, cell.shape, hidden.shape) # len * batch * hidden, 1 * batch * hidden, 1 * batch * hidden

        hidden = hidden.transpose(0, 1)
        hidden = hidden.contiguous().view(hidden.shape[0], -1)

        hidden = self.dropout(hidden)
        logits = self.fc(hidden)
        # logits = self.fc(outs[-1])
        return logits


def seqs2batches(src_seq_lens):
    """
    This is same with batches2seqs() 
    """
    assert src_seq_lens is not None
    assert src_seq_lens[-1] > 0
    src_batch_sizes = torch.zeros(int(src_seq_lens[0]))
    pointer = int(src_seq_lens[0]) - 1
    for i, seq_len in enumerate(src_seq_lens.tolist() + [0]):
        while seq_len <= pointer:
            src_batch_sizes[pointer] = i
            pointer -= 1
    return src_batch_sizes


# Training the baseline model on SST-2

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

embedding_dim = 128 # usually bigger, e.g. 128
hidden_dim = 256
n_layers = 1
dropout = 0.5
bidirectional = True
rnnmodel = BidirectionalLSTMModel(embedding_dim, hidden_dim, n_layers, dropout, bidirectional).to(device)

cel = nn.CrossEntropyLoss()
# optimizer = torch.optim.SGD(rnnmodel.parameters(), lr=1e-1)
optimizer = torch.optim.Adam(rnnmodel.parameters(), lr=2e-4)

epochs = 50

for epoch in range(epochs):
    train_loss = 0
    train_accuracy = 0.0
    train_data_num = 0
    for i, (source, src_seq_lens, target) in enumerate(train_data_loader):
        train_tensor_src = torch.LongTensor(source).to(device)
        train_tensor_tgt = torch.LongTensor(target).to(device)

        sorted_seq_lens, sorted_indices = torch.sort(src_seq_lens, descending=True)
        train_sorted_src = train_tensor_src[sorted_indices]
        train_sorted_tgt = train_tensor_tgt[sorted_indices]

        # print(src_seq_lens.shape, src_batch_sizes.shape, seqs2batches(src_batch_sizes).shape)
        h0 = torch.zeros(1, train_sorted_src.shape[0], hidden_dim, requires_grad=True).to(device)
        c0 = torch.zeros(1, train_sorted_src.shape[0], hidden_dim, requires_grad=True).to(device)
        logits = rnnmodel(train_sorted_src, sorted_seq_lens, h0, c0)

        # print(train_tensor_src.shape, train_tensor_tgt.shape, logits.shape)

        optimizer.zero_grad() # reset process
        loss = cel(logits, train_sorted_tgt) # Loss, a.k.a L
        loss.backward() # compute gradients
        # torch.nn.utils.clip_grad_norm_(rnnmodel.parameters(), 5) # gradent clipping
        optimizer.step() # update parameters
        train_loss += loss.item()
        
        _, train_preds = torch.max(logits, 1)
        train_accuracy += (train_preds == train_sorted_tgt).sum().float()

        train_data_num += train_sorted_tgt.shape[0]

    print('train:: Epoch:', '%04d' % (epoch + 1), 
                'cost =', '{:.6f},'.format(train_loss / train_data_num), 
                'acc =', '{:.6f}'.format(train_accuracy / train_data_num))
    
    if (epoch + 1) % 1 == 0:
        with torch.no_grad():
            valid_loss = 0
            valid_accuracy = 0.0
            valid_data_num = 0
            for i, (source, src_seq_lens, target) in enumerate(dev_data_loader):
                dev_tensor_src = torch.LongTensor(source).to(device)
                dev_tensor_tgt = torch.LongTensor(target).to(device)
            
                sorted_seq_lens, sorted_indices = torch.sort(src_seq_lens, descending=True)
                dev_sorted_src = dev_tensor_src[sorted_indices]
                dev_sorted_tgt = dev_tensor_tgt[sorted_indices]

                h0 = torch.zeros(1, dev_sorted_src.shape[0], hidden_dim, requires_grad=True).to(device)
                c0 = torch.zeros(1, dev_sorted_src.shape[0], hidden_dim, requires_grad=True).to(device)
                logits = rnnmodel(dev_sorted_src, sorted_seq_lens, h0, c0)

                loss = cel(logits, dev_sorted_tgt) # Loss, a.k.a L
                valid_loss += loss.item()

                _, valid_preds = torch.max(logits, 1)
                valid_accuracy += (valid_preds == dev_sorted_tgt).sum().float()

                valid_data_num += dev_sorted_tgt.shape[0]
                
            print('valid:: Epoch:', '%04d' % (epoch + 1), 
                'cost =', '{:.6f},'.format(valid_loss / valid_data_num), 
                'acc =', '{:.6f}'.format(valid_accuracy / valid_data_num))
        


## 5. Pretrained Word Vectors
The last step is to use pretrained vocabulary and word vectors. The prebuilt vocabulary will replace the vocabulary you built with SST-2 training data, and the word vectors will replace the embedding vectors. You will observe the power of leveraging self-supservised pretrained models.

**Problem 5.1** *(10 points)* Go to https://nlp.stanford.edu/projects/glove/ and download `glove.6B.zip`. Use these pretrained word vectors to further improve your model from 4.2. Report the model's accuracy on the dev data.

**Answer 5.1** 

In [None]:
!git clone https://github.com/stanfordnlp/glove

Cloning into 'glove'...
remote: Enumerating objects: 3, done.[K
remote: Counting objects: 100% (3/3), done.[K
remote: Compressing objects: 100% (3/3), done.[K
remote: Total 595 (delta 0), reused 1 (delta 0), pack-reused 592[K
Receiving objects: 100% (595/595), 222.33 KiB | 14.82 MiB/s, done.
Resolving deltas: 100% (338/338), done.


In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

--2021-04-02 11:57:00--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-04-02 11:57:00--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-04-02 11:57:01--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2021-0

In [None]:
import zipfile
path_to_zip_file = "glove.6B.zip"
zip_ref = zipfile.ZipFile(path_to_zip_file, 'r')
zip_ref.extractall()
zip_ref.close()

In [None]:
!head -10 glove.6B.200d.txt

the -0.071549 0.093459 0.023738 -0.090339 0.056123 0.32547 -0.39796 -0.092139 0.061181 -0.1895 0.13061 0.14349 0.011479 0.38158 0.5403 -0.14088 0.24315 0.23036 -0.55339 0.048154 0.45662 3.2338 0.020199 0.049019 -0.014132 0.076017 -0.11527 0.2006 -0.077657 0.24328 0.16368 -0.34118 -0.06607 0.10152 0.038232 -0.17668 -0.88153 -0.33895 -0.035481 -0.55095 -0.016899 -0.43982 0.039004 0.40447 -0.2588 0.64594 0.26641 0.28009 -0.024625 0.63302 -0.317 0.10271 0.30886 0.097792 -0.38227 0.086552 0.047075 0.23511 -0.32127 -0.28538 0.1667 -0.0049707 -0.62714 -0.24904 0.29713 0.14379 -0.12325 -0.058178 -0.001029 -0.082126 0.36935 -0.00058442 0.34286 0.28426 -0.068599 0.65747 -0.029087 0.16184 0.073672 -0.30343 0.095733 -0.5286 -0.22898 0.064079 0.015218 0.34921 -0.4396 -0.43983 0.77515 -0.87767 -0.087504 0.39598 0.62362 -0.26211 -0.30539 -0.022964 0.30567 0.06766 0.15383 -0.11211 -0.09154 0.082562 0.16897 -0.032952 -0.28775 -0.2232 -0.090426 1.2407 -0.18244 -0.0075219 -0.041388 -0.011083 0.078186 0.3

In [None]:
glove_vocab = ['UNK', 'PAD']
glove_embeddings_dict = dict()
vector_size = 200

glove_embeddings_dict[0] = torch.zeros(vector_size)
glove_embeddings_dict[1] = torch.zeros(vector_size)

word2id = {word: id_ for id_, word in enumerate(vocab)}
idx = 2

with open("glove.6B.100d.txt", 'r', encoding="utf-8") as f:
    line = f.readline()
    while line is not None:
        values = line.strip().split()
        if not values:
            break
        word = values[0]
        vector = np.asarray(values[1:], "float32")

        glove_vocab.append(word)
        glove_embeddings_dict[word] = torch.FloatTensor(vector)
        idx += 1
        line = f.readline()
if 'UNK' in glove_vocab:
    print("UNK")
if 'PAD' in glove_vocab:
    print("PAD")
print(len(glove_vocab))


UNK
PAD
400002


In [13]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pickle

filename = '/content/drive/My Drive/Kaist_석사_2021_여름_AI605_Assignment/glove_vectors.pkl'
pickle.dump({'glove_embeddings_dict' : glove_embeddings_dict, 'glove_vocab' : glove_vocab} , open(filename, 'wb'))

In [14]:
import pickle

filename = '/content/drive/My Drive/Kaist_석사_2021_여름_AI605_Assignment/glove_vectors.pkl'

glove_data = dict()
with open(filename, 'rb') as f:
    glove_data = pickle.load(f)

glove_embeddings_dict = glove_data['glove_embeddings_dict']
glove_vocab = glove_data['glove_vocab']

In [15]:
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

filename = '/content/drive/My Drive/Kaist_석사_2021_여름_AI605_Assignment/glove_vectors.pkl'
train_fname = 'glue_data/SST-2/train.tsv'
dev_fname = 'glue_data/SST-2/dev.tsv'

vocab, word2id = make_reduced_vocab(train_fname)
glove_vocab_size = len(glove_vocab)

vector_size = 200

embedding_matrix = torch.zeros((len(vocab), vector_size))
for i, word in enumerate(vocab):
    temp = glove_embeddings_dict.get(word)
    if temp is not None:
        embedding_matrix[i] = temp
    elif i > 0:
        embedding_matrix[i] = torch.rand(vector_size)


# One LSTM layer and classification with glove.
class LSTMNode(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(LSTMNode, self).__init__()

        self.W_ft = torch.nn.Linear(input_size + hidden_size, hidden_size) # forget gate
        self.W_it = torch.nn.Linear(input_size + hidden_size, hidden_size) # input gate
        self.W_ot = torch.nn.Linear(input_size + hidden_size, hidden_size) # output gate
        self.W_ct = torch.nn.Linear(input_size + hidden_size, hidden_size) # cell state
        self.W_ht = torch.nn.Linear(hidden_size, hidden_size) # hidden state
        self.W_h2y = torch.nn.Linear(hidden_size, hidden_size)
        self.sigmoid = torch.nn.Sigmoid()
        self.tanh = torch.nn.Tanh()
    
    def forward(self, input, hidden, cell, src_batch_sizes):
        h_t = hidden.squeeze(0)
        c_t = cell.squeeze(0)
        h_last = h_t
        c_last = c_t
        y_list = list()
        batch_size = int(src_batch_sizes[0])

        for i, batch in enumerate(src_batch_sizes):
            batch = int(batch)
            token = input[i][:batch, :].clone()
            combined_input = torch.cat([token, h_t[:batch, :]], dim=1)

            f_t = self.sigmoid(self.W_ft(combined_input))
            i_t = self.sigmoid(self.W_it(combined_input))
            o_t = self.sigmoid(self.W_ot(combined_input))
            c_hat_t = self.tanh(self.W_ct(combined_input))
            
            c_last[:batch, :] = c_t = f_t*c_t[:batch, :].clone() + i_t*c_hat_t
            h_last[:batch, :] = h_t = o_t*self.tanh(c_t)

            y_t = self.W_h2y(h_t)
            y_list.append(nn.ZeroPad2d((0, 0, batch_size - batch, 0))(y_t)) # Zero-padding
        y = torch.stack(y_list, dim=0)

        return y, c_last.unsqueeze(0), h_last.unsqueeze(0)


class LSTMModel(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, n_layers, dropout):
        super(LSTMModel, self).__init__()
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix)
        
        # nn.LSTM
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=n_layers, dropout=dropout)

        # my LSTM
        # self.lstm = LSTMNode(input_size=embedding_dim, hidden_size=hidden_dim)

        self.fc = nn.Linear(hidden_dim * n_layers, 2, bias=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_tensor, src_seq_lens, hidden, cell):
        emb = self.embedding(input_tensor) # emb.shape = batch * len * hidden
        emb = self.dropout(emb)
        emb = emb.transpose(0, 1) # emb.shape = len * batch * hidden

        # nn.LSTM
        packed = pack_padded_sequence(emb, src_seq_lens.tolist(), batch_first=False)
        outs, (hidden, cell) = self.lstm(packed) # h0 = zero initialization
        outs, out_lens = pad_packed_sequence(outs, batch_first=False)

        # my LSTM
        # print(hidden.shape, cell.shape)
        # print("input shape : ", emb.shape, hidden.shape) # len * batch * emb, 1 * batch * hidden
        # src_batch_sizes = seqs2batches(src_seq_lens)
        # outs, cell, hidden = self.lstm(emb, hidden, cell, src_batch_sizes) # h0 = zero initialization
        # print("output shape : ", outs.shape, cell.shape, hidden.shape) # len * batch * hidden, 1 * batch * hidden, 1 * batch * hidden

        hidden = hidden.transpose(0, 1)
        hidden = hidden.contiguous().view(hidden.shape[0], -1)

        hidden = self.dropout(hidden)
        logits = self.fc(hidden)
        # logits = self.fc(outs[-1])
        return logits


def seqs2batches(src_seq_lens):
    """
    This is same with batches2seqs() 
    """
    assert src_seq_lens is not None
    assert src_seq_lens[-1] > 0
    src_batch_sizes = torch.zeros(int(src_seq_lens[0]))
    pointer = int(src_seq_lens[0]) - 1
    for i, seq_len in enumerate(src_seq_lens.tolist() + [0]):
        while seq_len <= pointer:
            src_batch_sizes[pointer] = i
            pointer -= 1
    return src_batch_sizes


# Training the baseline model on SST-2

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

embedding_dim = vector_size # usually bigger, e.g. 128
hidden_dim = 512
n_layers = 1
dropout = 0.5
rnnmodel = LSTMModel(embedding_dim, hidden_dim, n_layers, dropout).to(device)

cel = nn.CrossEntropyLoss()
# optimizer = torch.optim.SGD(rnnmodel.parameters(), lr=1e-1)
optimizer = torch.optim.Adam(rnnmodel.parameters(), lr=2e-4)

epochs = 50

for epoch in range(epochs):
    train_loss = 0
    train_accuracy = 0.0
    train_data_num = 0
    for i, (source, src_seq_lens, target) in enumerate(train_data_loader):
        train_tensor_src = torch.LongTensor(source).to(device)
        train_tensor_tgt = torch.LongTensor(target).to(device)

        sorted_seq_lens, sorted_indices = torch.sort(src_seq_lens, descending=True)
        train_sorted_src = train_tensor_src[sorted_indices]
        train_sorted_tgt = train_tensor_tgt[sorted_indices]

        # print(src_seq_lens.shape, src_batch_sizes.shape, seqs2batches(src_batch_sizes).shape)
        h0 = torch.zeros(1, train_sorted_src.shape[0], hidden_dim, requires_grad=True).to(device)
        c0 = torch.zeros(1, train_sorted_src.shape[0], hidden_dim, requires_grad=True).to(device)
        logits = rnnmodel(train_sorted_src, sorted_seq_lens, h0, c0)

        # print(train_tensor_src.shape, train_tensor_tgt.shape, logits.shape)

        optimizer.zero_grad() # reset process
        loss = cel(logits, train_sorted_tgt) # Loss, a.k.a L
        loss.backward() # compute gradients
        # torch.nn.utils.clip_grad_norm_(rnnmodel.parameters(), 5) # gradent clipping
        optimizer.step() # update parameters
        train_loss += loss.item()
        
        _, train_preds = torch.max(logits, 1)
        train_accuracy += (train_preds == train_sorted_tgt).sum().float()

        train_data_num += train_sorted_tgt.shape[0]

    print('train:: Epoch:', '%04d' % (epoch + 1), 
                'cost =', '{:.6f},'.format(train_loss / train_data_num), 
                'acc =', '{:.6f}'.format(train_accuracy / train_data_num))
    
    if (epoch + 1) % 1 == 0:
        with torch.no_grad():
            valid_loss = 0
            valid_accuracy = 0.0
            valid_data_num = 0
            for i, (source, src_seq_lens, target) in enumerate(dev_data_loader):
                dev_tensor_src = torch.LongTensor(source).to(device)
                dev_tensor_tgt = torch.LongTensor(target).to(device)
            
                sorted_seq_lens, sorted_indices = torch.sort(src_seq_lens, descending=True)
                dev_sorted_src = dev_tensor_src[sorted_indices]
                dev_sorted_tgt = dev_tensor_tgt[sorted_indices]

                h0 = torch.zeros(1, dev_sorted_src.shape[0], hidden_dim, requires_grad=True).to(device)
                c0 = torch.zeros(1, dev_sorted_src.shape[0], hidden_dim, requires_grad=True).to(device)
                logits = rnnmodel(dev_sorted_src, sorted_seq_lens, h0, c0)

                loss = cel(logits, dev_sorted_tgt) # Loss, a.k.a L
                valid_loss += loss.item()

                _, valid_preds = torch.max(logits, 1)
                valid_accuracy += (valid_preds == dev_sorted_tgt).sum().float()

                valid_data_num += dev_sorted_tgt.shape[0]
                
            print('valid:: Epoch:', '%04d' % (epoch + 1), 
                'cost =', '{:.6f},'.format(valid_loss / valid_data_num), 
                'acc =', '{:.6f}'.format(valid_accuracy / valid_data_num))
        


  "num_layers={}".format(dropout, num_layers))


train:: Epoch: 0001 cost = 0.005226, acc = 0.589496
valid:: Epoch: 0001 cost = 0.005301, acc = 0.606651
train:: Epoch: 0002 cost = 0.005099, acc = 0.612199
valid:: Epoch: 0002 cost = 0.005313, acc = 0.618119
train:: Epoch: 0003 cost = 0.005011, acc = 0.628131
valid:: Epoch: 0003 cost = 0.005222, acc = 0.639908
train:: Epoch: 0004 cost = 0.004844, acc = 0.650715
valid:: Epoch: 0004 cost = 0.005139, acc = 0.644495
train:: Epoch: 0005 cost = 0.004574, acc = 0.685741
valid:: Epoch: 0005 cost = 0.004743, acc = 0.700688
train:: Epoch: 0006 cost = 0.004331, acc = 0.709409
valid:: Epoch: 0006 cost = 0.004617, acc = 0.719037
train:: Epoch: 0007 cost = 0.004141, acc = 0.723648
valid:: Epoch: 0007 cost = 0.004763, acc = 0.730505
train:: Epoch: 0008 cost = 0.003944, acc = 0.744480
valid:: Epoch: 0008 cost = 0.004847, acc = 0.685780


KeyboardInterrupt: ignored

In [None]:
train:: Epoch: 0001 cost = 0.005256, acc = 0.581271
valid:: Epoch: 0001 cost = 0.005354, acc = 0.606651
train:: Epoch: 0002 cost = 0.005129, acc = 0.609259
valid:: Epoch: 0002 cost = 0.005264, acc = 0.595183
train:: Epoch: 0003 cost = 0.005034, acc = 0.625043
valid:: Epoch: 0003 cost = 0.005212, acc = 0.626147
train:: Epoch: 0004 cost = 0.004888, acc = 0.646958
valid:: Epoch: 0004 cost = 0.004962, acc = 0.662844
train:: Epoch: 0005 cost = 0.004645, acc = 0.672972
valid:: Epoch: 0005 cost = 0.004899, acc = 0.680046
train:: Epoch: 0006 cost = 0.004435, acc = 0.697457
valid:: Epoch: 0006 cost = 0.004754, acc = 0.682339
train:: Epoch: 0007 cost = 0.004286, acc = 0.715927
valid:: Epoch: 0007 cost = 0.004898, acc = 0.692661
train:: Epoch: 0008 cost = 0.004149, acc = 0.727672
valid:: Epoch: 0008 cost = 0.004482, acc = 0.730505
train:: Epoch: 0009 cost = 0.004053, acc = 0.735467
valid:: Epoch: 0009 cost = 0.004775, acc = 0.716743
train:: Epoch: 0010 cost = 0.003946, acc = 0.744599
valid:: Epoch: 0010 cost = 0.004714, acc = 0.712156
train:: Epoch: 0011 cost = 0.003872, acc = 0.751192
valid:: Epoch: 0011 cost = 0.004655, acc = 0.716743
train:: Epoch: 0012 cost = 0.003791, acc = 0.759150
valid:: Epoch: 0012 cost = 0.004669, acc = 0.719037
train:: Epoch: 0013 cost = 0.003713, acc = 0.766322
valid:: Epoch: 0013 cost = 0.004641, acc = 0.728211
train:: Epoch: 0014 cost = 0.003638, acc = 0.769959
valid:: Epoch: 0014 cost = 0.005130, acc = 0.708716
train:: Epoch: 0015 cost = 0.003591, acc = 0.775854
valid:: Epoch: 0015 cost = 0.004853, acc = 0.722477
train:: Epoch: 0016 cost = 0.003537, acc = 0.779195
valid:: Epoch: 0016 cost = 0.004530, acc = 0.748853
train:: Epoch: 0017 cost = 0.003468, acc = 0.786070
valid:: Epoch: 0017 cost = 0.004969, acc = 0.729358
train:: Epoch: 0018 cost = 0.003400, acc = 0.789143
valid:: Epoch: 0018 cost = 0.004735, acc = 0.707569
train:: Epoch: 0019 cost = 0.003371, acc = 0.791029
valid:: Epoch: 0019 cost = 0.004883, acc = 0.738532
train:: Epoch: 0020 cost = 0.003312, acc = 0.795231
valid:: Epoch: 0020 cost = 0.005051, acc = 0.705275
train:: Epoch: 0021 cost = 0.003258, acc = 0.799774
valid:: Epoch: 0021 cost = 0.004609, acc = 0.738532
train:: Epoch: 0022 cost = 0.003251, acc = 0.801333
valid:: Epoch: 0022 cost = 0.005300, acc = 0.716743
train:: Epoch: 0023 cost = 0.003186, acc = 0.806397
valid:: Epoch: 0023 cost = 0.005136, acc = 0.732798
train:: Epoch: 0024 cost = 0.003164, acc = 0.808193
valid:: Epoch: 0024 cost = 0.004915, acc = 0.748853
train:: Epoch: 0025 cost = 0.003162, acc = 0.807154
valid:: Epoch: 0025 cost = 0.005657, acc = 0.701835
train:: Epoch: 0026 cost = 0.003094, acc = 0.812202
valid:: Epoch: 0026 cost = 0.004950, acc = 0.754587
train:: Epoch: 0027 cost = 0.003060, acc = 0.815201
valid:: Epoch: 0027 cost = 0.005012, acc = 0.738532
train:: Epoch: 0028 cost = 0.003032, acc = 0.816686
valid:: Epoch: 0028 cost = 0.005080, acc = 0.724771
train:: Epoch: 0029 cost = 0.002989, acc = 0.820680
valid:: Epoch: 0029 cost = 0.004741, acc = 0.735092
train:: Epoch: 0030 cost = 0.002963, acc = 0.822328
valid:: Epoch: 0030 cost = 0.006050, acc = 0.693807
train:: Epoch: 0031 cost = 0.002931, acc = 0.823368
valid:: Epoch: 0031 cost = 0.005211, acc = 0.729358
train:: Epoch: 0032 cost = 0.002881, acc = 0.826367
valid:: Epoch: 0032 cost = 0.005319, acc = 0.717890
train:: Epoch: 0033 cost = 0.002865, acc = 0.826575
valid:: Epoch: 0033 cost = 0.005474, acc = 0.737385
train:: Epoch: 0034 cost = 0.002856, acc = 0.830049
valid:: Epoch: 0034 cost = 0.005913, acc = 0.728211
train:: Epoch: 0035 cost = 0.002807, acc = 0.833197
valid:: Epoch: 0035 cost = 0.005829, acc = 0.740826
train:: Epoch: 0036 cost = 0.002784, acc = 0.833494
valid:: Epoch: 0036 cost = 0.005323, acc = 0.751147
train:: Epoch: 0037 cost = 0.002759, acc = 0.836152
valid:: Epoch: 0037 cost = 0.005726, acc = 0.730505
train:: Epoch: 0038 cost = 0.002736, acc = 0.836701
valid:: Epoch: 0038 cost = 0.005862, acc = 0.720183
train:: Epoch: 0039 cost = 0.002696, acc = 0.840116
valid:: Epoch: 0039 cost = 0.005092, acc = 0.754587
train:: Epoch: 0040 cost = 0.002675, acc = 0.841022
valid:: Epoch: 0040 cost = 0.005334, acc = 0.753440
train:: Epoch: 0041 cost = 0.002658, acc = 0.843086
valid:: Epoch: 0041 cost = 0.005361, acc = 0.747706
train:: Epoch: 0042 cost = 0.002640, acc = 0.842982
valid:: Epoch: 0042 cost = 0.005408, acc = 0.740826
train:: Epoch: 0043 cost = 0.002625, acc = 0.845031
valid:: Epoch: 0043 cost = 0.005344, acc = 0.741972
train:: Epoch: 0044 cost = 0.002600, acc = 0.845714
valid:: Epoch: 0044 cost = 0.005625, acc = 0.739679
train:: Epoch: 0045 cost = 0.002575, acc = 0.849812
valid:: Epoch: 0045 cost = 0.006035, acc = 0.727064
train:: Epoch: 0046 cost = 0.002528, acc = 0.850451
valid:: Epoch: 0046 cost = 0.005553, acc = 0.719037
train:: Epoch: 0047 cost = 0.002527, acc = 0.850302
valid:: Epoch: 0047 cost = 0.005731, acc = 0.735092
train:: Epoch: 0048 cost = 0.002498, acc = 0.852544
valid:: Epoch: 0048 cost = 0.005318, acc = 0.725917
train:: Epoch: 0049 cost = 0.002487, acc = 0.854667
valid:: Epoch: 0049 cost = 0.005767, acc = 0.731651
train:: Epoch: 0050 cost = 0.002475, acc = 0.855113
valid:: Epoch: 0050 cost = 0.005406, acc = 0.759174

Problem 5.2 (bonus) (10 points) You can go one step further by using word vectors obtained from pretrained language models. Can you import the word embeddings from bert-base-uncased model (via Hugging Face's transformers: https://huggingface.co/transformers/pretrained_models.html) into your model and improve it further? Report the accuracy on the dev data here. If the score is now higher, explain where the improvement is coming from.

**Answer 5.2** 

In [None]:
.