# KAIST AI605 Assignment 1: Text Classification
TA in charge: Miyoung Ko (miyoungko@kaist.ac.kr)

**Due Date:** March 31 (Thu) 11:00pm, 2022

## Your Submission
If you are a KAIST student, you will submit your assignment via [KLMS](https://klms.kaist.ac.kr). If you are a NAVER student, you will submit via [Google Form](https://forms.gle/qjjkqazLvA7tkfUz7). 

You need to submit both (1) a PDF of this notebook, and (2) a link to CoLab for execution (.ipynb file is also allowed).

Use in-line LaTeX (see below) for mathematical expressions. Collaboration among students is allowed but it is not a group assignment so make sure your answer and code are your own. Make sure to mention your collaborators in your assignment with their names and their student ids.

## Grading
The entire assignment is out of 20 points. You can obtain up to 2 bonus points (i.e. max score is 22 points). For every late day, your grade will be deducted by 2 points (KAIST students only). You can use one of your no-penalty late days (7 days in total). Make sure to mention this in your submission. You will receive a grade of zero if you submit after 7 days.


## Environment
You will need Python 3.7+ and PyTorch 1.9+, which are already available on Colab:

In [1]:
from platform import python_version
import torch

print("python", python_version())
print("torch", torch.__version__)

python 3.7.13
torch 1.10.0+cu111


## 1. Limitations of Vanilla RNNs
In Lecture 02, we saw that a multi-layer perceptron (MLP) without activation function is equivalent to a single linear transformation with respect to the inputs. One can define a vanilla recurrent neural network without activation as, given inputs $\textbf{x}_1 \dots \textbf{x}_T$, the outputs $\textbf{h}_t$ is obtained by
$$\textbf{h}_t = \textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b},$$
where $\textbf{V}, \textbf{U}, \textbf{b}$ are trainable weights. 

> **Problem 1.1** *(2 point)* Show that such recurrent neural network (RNN) without activation function is equivalent to a single linear transformation with respect to the inputs, which means each $\textbf{h}_t$ is a linear combination of the inputs.



In Lecture 05 and 06, we will see how RNNs can model non-linearity via activation function, but they still suffer from exploding or vanishing gradients. We can mathematically show that, if the recurrent relation is
$$ \textbf{h}_t = \sigma (\textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b}) $$
then
$$ \frac{\partial \textbf{h}_t}{\partial \textbf{h}_{t-1}} = \text{diag}(\sigma' (\textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b}))\textbf{V}$$
so
$$\frac{\partial \textbf{h}_T}{\partial \textbf{h}_1} \propto \textbf{V}^{T-1}$$
which means this term will be very close to zero if the norm of $\bf{V}$ is smaller than 1 and really big otherwise.

> **Problem 1.2** *(2 points)* Explain how exploding gradient can be mitigated if we use gradient clipping.

> **Problem 1.3** *(2 points)* Explain how vanishing gradient can be mitigated if we use LSTM. See the Lecture 05 and 06 slides for the definition of LSTM.

### Solution
> Problem 1.1: Lets say RNN is defined by equation $$ h_t = tanh(b + W*h_{t-1} + V * x_{t-1}) $$ Where b, W and V are the weights and tanh is the nonlinear activation funtion. Now
 $$h_{t-1} =tanh(b + W*h_{t-2} + V * x_{t-2})$$
Replacing it in above equation
$$ h_t = tanh(b + W*tanh(b + W*h_{t-2} + V * x_{t-2}) + V * x_{t-1}) $$
With tanh activation there remains non linearity in the equation hence previous time step calulation must be done before doing next time step calulation. If we eliminate tanh nonlinaerlity the eq becomes
$$ h_t = b + W*(b + W*h_{t-2} + V * x_{t-2}) + V * x_{t-1} $$
$$ h_t = b + W*b + W^T*W*h_{t-2} + W*V * x_{t-2} + V * x_{t-1} $$
Now this equation is a linear combination of weights, hidden state and input. We can keep repeating the process of replacing $h_t$ with $h_{t-1}$, $h_{t-2}$ and so on and the equation will still stay in a linear combination of input with weights and hidden layer


> Problem 1.2: Exploding gradient means the gradients will grow without a limit and for larger models(or sentences) can go to infinity. To solve this we can set a limit to the maximum value a gradient can have and with that this problem of exploding gradient is mitigated. This process is know as gradient clipping.

> Problem 1.3: LSTMs solve the problem using a gated gradient architecture it control the flow of information from previous time step to next time step using a cell, an input gate, an output gate and a forget gate. With the use of these gates LSTM can allow the gradient to flow uncahnged hence the problem of vanishing gradient can be mitigated


## 2. Creating Vocabulary from Training Data
Creating the vocabulary is the first step for every natural language processing model. In this section, you will use Stanford Sentiment Treebank (SST), a popular dataset for sentiment classification, to create your vocabulary.

### Obtaining SST via Hugging Face
We will use `datasets` package offered by Hugging Face, which allows us to easily download various language datasets, including Stanford Sentiment Treebank.

First, install the package:

In [2]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.0.0-py3-none-any.whl (325 kB)
[?25l[K     |█                               | 10 kB 21.1 MB/s eta 0:00:01[K     |██                              | 20 kB 9.6 MB/s eta 0:00:01[K     |███                             | 30 kB 8.3 MB/s eta 0:00:01[K     |████                            | 40 kB 7.9 MB/s eta 0:00:01[K     |█████                           | 51 kB 4.3 MB/s eta 0:00:01[K     |██████                          | 61 kB 5.1 MB/s eta 0:00:01[K     |███████                         | 71 kB 5.5 MB/s eta 0:00:01[K     |████████                        | 81 kB 5.7 MB/s eta 0:00:01[K     |█████████                       | 92 kB 6.3 MB/s eta 0:00:01[K     |██████████                      | 102 kB 5.3 MB/s eta 0:00:01[K     |███████████                     | 112 kB 5.3 MB/s eta 0:00:01[K     |████████████                    | 122 kB 5.3 MB/s eta 0:00:01[K     |█████████████                   | 133 kB 5.3 MB/s eta 0:00:01[K

Then download SST and print the first example:

In [3]:
from datasets import load_dataset
from pprint import pprint

sst_dataset = load_dataset('sst')
pprint(sst_dataset['train'][0])

Downloading builder script:   0%|          | 0.00/2.59k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

No config specified, defaulting to: sst/default


Downloading and preparing dataset sst/default (download: 6.83 MiB, generated: 3.73 MiB, post-processed: Unknown size, total: 10.56 MiB) to /root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff...


Downloading data:   0%|          | 0.00/6.37M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/790k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8544 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1101 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2210 [00:00<?, ? examples/s]

Dataset sst downloaded and prepared to /root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

{'label': 0.6944400072097778,
 'sentence': "The Rock is destined to be the 21st Century 's new `` Conan '' "
             "and that he 's going to make a splash even greater than Arnold "
             'Schwarzenegger , Jean-Claud Van Damme or Steven Segal .',
 'tokens': "The|Rock|is|destined|to|be|the|21st|Century|'s|new|``|Conan|''|and|that|he|'s|going|to|make|a|splash|even|greater|than|Arnold|Schwarzenegger|,|Jean-Claud|Van|Damme|or|Steven|Segal|.",
 'tree': '70|70|68|67|63|62|61|60|58|58|57|56|56|64|65|55|54|53|52|51|49|47|47|46|46|45|40|40|41|39|38|38|43|37|37|69|44|39|42|41|42|43|44|45|50|48|48|49|50|51|52|53|54|55|66|57|59|59|60|61|62|63|64|65|66|67|68|69|71|71|0'}


Note that each `label` is a score between 0 and 1. You will round it to either 0 or 1 for binary classification (positive for 1, negative for 0).
In this first example, the label is rounded to 1, meaning that the sentence is a positive review.
You will only use `sentence` as the input; please ignore other values.

> **Problem 2.1** *(2 points)* Using space tokenizer, create the vocabulary for the training data and report the vocabulary size here. Make sure that you add an `UNK` token to the vocabulary to account for words (during inference time) that you haven't seen. See below for an example with a short text.

In [4]:
# Space tokenization
text = "Hello world!"
tokens = text.split(' ')
print(tokens)

['Hello', 'world!']


In [5]:
# Constructing vocabulary with `UNK`
vocab = ['PAD', 'UNK'] + list(set(text.split(' ')))
word2id = {word: id_ for id_, word in enumerate(vocab)}
print(vocab)
print(word2id['Hello'])

['PAD', 'UNK', 'Hello', 'world!']
2


> **Problem 2.2** *(1 point)* Using all words in the training data will make the vocabulary very big. Reduce its size by only including words that occur at least 2 times. How does the size of the vocabulary change?

### Solution


> Problem 2.1

In [6]:
sstDataset = str(sst_dataset['train']['sentence'] + sst_dataset['test']['sentence'] )
originalVocab  = list(sstDataset.split(' '))
vocab = ['PAD', 'UNK'] + list(set(originalVocab))
print('Original Vocab Size is: ', len(vocab))

Original Vocab Size is:  22262


> Problem 2.2

In [7]:
from collections import Counter

vocab = ['PAD', 'UNK'] + [key for key, value in Counter(originalVocab).items() if value >= 2]
print('Reduced Vocab Size is: ', len(vocab))

word2id = {word: id_ for id_, word in enumerate(vocab)} # Generating word to Id mapping 

Reduced Vocab Size is:  10400


## 3. Text Classification with Multi-Layer Perceptron and Recurrent Neural Network

You can now use the vocabulary constructed from the training data to create an embedding matrix. You will use the embedding matrix to map each input sequence of tokens to a list of embedding vectors. One of the simplest baseline is to fix the input length (with truncation or padding), flatten the word embeddings, apply a linear transformation followed by an activation, and finally classify the output into the two classes: 

In [8]:
from torch import nn

length = 8
input_ = "hi world!"
input_tokens = input_.split(' ')
input_ids = [word2id[word] if word in word2id else 1 for word in input_tokens] # UNK if word not found
if len(input_ids) < length:
  input_ids = input_ids + [0] * (length - len(input_ids)) # PAD tokens at the end
else:
  input_ids = input_ids[:length]

input_tensor = torch.LongTensor([input_ids]) # the first dimension is minibatch size
print(input_tensor)

tensor([[1, 1, 0, 0, 0, 0, 0, 0]])


In [9]:
# Two-layer MLP classification
class Baseline(nn.Module):
  def __init__(self, d, length):
    super(Baseline, self).__init__()
    self.embedding = nn.Embedding(len(vocab), d)
    self.layer = nn.Linear(d * length, d, bias=True)
    self.relu = nn.ReLU()
    self.class_layer = nn.Linear(d, 2, bias=True)

  def forward(self, input_tensor):
    emb = self.embedding(input_tensor) # [batch_size, length, d]
    emb_flat = emb.view(emb.size(0), -1) # [batch_size, length*d]
    hidden = self.relu(self.layer(emb_flat))
    logits = self.class_layer(hidden)
    return logits

d = 3 # usually bigger, e.g. 128
baseline = Baseline(d, length)
logits = baseline(input_tensor)
softmax = nn.Softmax(1)
print(softmax(logits)) # probability for each class

tensor([[0.6286, 0.3714]], grad_fn=<SoftmaxBackward0>)


Now we will compute the loss, which is the negative log probability of the input text's label being the target label (`1`), which in fact turns out to be equivalent to the cross entropy (https://en.wikipedia.org/wiki/Cross_entropy) between the probability distribution and a one-hot distribution of the target label (note that we use `logits` instead of `softmax(logits)` as the input to the cross entropy, which allow us to avoid numerical instability). 

In [10]:
cel = nn.CrossEntropyLoss()
label = torch.LongTensor([1]) # The ground truth label for "hi world!" is positive.
loss = cel(logits, label) # Loss, a.k.a L
print(loss)

tensor(0.9905, grad_fn=<NllLossBackward0>)


Once we have the loss defined, only one step remains! We compute the gradients of parameters with respective to the loss and update. Fortunately, PyTorch does this for us in a very convenient way. Note that we used only one example to update the model, which is basically a Stochastic Gradient Descent (SGD) with minibatch size of 1. A recommended minibatch size in this exercise is at least 16. It is also recommended that you reuse your training data at least 10 times (i.e. 10 *epochs*).

In [11]:
optimizer = torch.optim.SGD(baseline.parameters(), lr=0.1)
optimizer.zero_grad() # reset process
loss.backward() # compute gradients
optimizer.step() # update parameters

Once you have done this, all weight parameters will have `grad` attributes that contain their gradients with respect to the loss.

In [12]:
print(baseline.layer.weight.grad) # dL/dw of weights in the linear layer

tensor([[ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000, -0.0000,
         -0.0000,  0.0000, -0.0000, -0.0000,  0.0000, -0.0000, -0.0000,  0.0000,
         -0.0000, -0.0000,  0.0000, -0.0000, -0.0000,  0.0000, -0.0000, -0.0000],
        [-0.0262, -0.0723, -0.0353, -0.0262, -0.0723, -0.0353, -0.0008,  0.0244,
          0.0680, -0.0008,  0.0244,  0.0680, -0.0008,  0.0244,  0.0680, -0.0008,
          0.0244,  0.0680, -0.0008,  0.0244,  0.0680, -0.0008,  0.0244,  0.0680],
        [-0.0528, -0.1456, -0.0710, -0.0528, -0.1456, -0.0710, -0.0017,  0.0492,
          0.1370, -0.0017,  0.0492,  0.1370, -0.0017,  0.0492,  0.1370, -0.0017,
          0.0492,  0.1370, -0.0017,  0.0492,  0.1370, -0.0017,  0.0492,  0.1370]])


> **Problem 3.1** *(2 points)* Properly train a MLP baseline model on SST and report the model's accuracy on the dev data.

> **Problem 3.2** *(2 points)* Implement a recurrent neural network (without using PyTorch's RNN module) with `tanh` activation, and use the output of the RNN at the final time step for the classification. Report the model's accuracy on the dev data.

> **Problem 3.3** *(2 points)* Show that the cross entropy computed above is equivalent to the negative log likelihood of the probability distribution.

> **Problem 3.4** *(1 points)* Why is it numerically unstable if you compute log on top of softmax?

### Solution

In [13]:
length = 8
def sentence2Index(sentences):
  indexs = []
  for input_ in sentences:
    input_tokens = input_.split(' ')
    input_ids = [word2id[word] if word in word2id else 1 for word in input_tokens] # UNK if word not found
    if len(input_ids) < length:
      input_ids = input_ids + [0] * (length - len(input_ids)) # PAD tokens at the end
    else:
      input_ids = input_ids[:length]
    indexs.append(input_ids)
  indexs = torch.LongTensor(indexs)
  return indexs # the first dimension is minibatch size

In [14]:
# Making Batches
batchSize = 32
dataSetSize = len(sst_dataset['train'])
trainingData = sentence2Index(sst_dataset['train']['sentence']).reshape(dataSetSize//batchSize, batchSize, length)
trainingLabels = torch.Tensor(sst_dataset['train']['label']).round()
trainingLabels = trainingLabels.reshape(dataSetSize//batchSize, batchSize).to(torch.long)

def trainModel(model, num_epochs=100, lr=0.0001, data=trainingData, labels=trainingLabels):
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  print("Current used device is:", device)

  model.to(device)
  data = data.to(device)
  labels = labels.to(device)

  optimizer = torch.optim.Adam(model.parameters(), lr=lr)
  criterion = nn.CrossEntropyLoss()

  for epoch in range(num_epochs):
    current_loss = 0
    for i in range(len(data)):
        s = data[i]
        l = labels[i]

        optimizer.zero_grad() # reset process
        logits = model(s)
        loss = criterion(logits, l) # Loss, a.k.a L
        loss.backward() # compute gradients
        optimizer.step() # update parameters

        current_loss += loss.item()
    print('Epoch:', epoch+1, ' Loss:', current_loss/batchSize)

In [15]:
testData = sentence2Index(sst_dataset['test']['sentence'])
testLabel = torch.Tensor(sst_dataset['test']['label']).round().to(torch.long)

def testModel(model, data=testData, label=testLabel):
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  print("Current used device is:", device)
  
  model.to(device)
  data=data.to(device)
  label=label.to(device)

  predictions = model(data).argmax(dim=1)

  print('Model accuracy is:', (label==predictions).sum()/len(sst_dataset['test']))

> Problem 3.1

In [16]:
# Two-layer MLP classification
class Baseline(nn.Module):
  def __init__(self, d, length):
    super(Baseline, self).__init__()
    self.embedding = nn.Embedding(len(vocab), d)
    self.layer = nn.Linear(d * length, d, bias=True)
    self.relu = nn.ReLU()
    self.class_layer = nn.Linear(d, 2, bias=True)
    
    nn.init.normal_(self.layer.weight, 0, 0.01)
    nn.init.normal_(self.layer.bias, 0, 0.01)
    nn.init.normal_(self.class_layer.weight, 0, 0.01)
    nn.init.normal_(self.class_layer.bias, 0, 0.01)

  def forward(self, x):
    emb = self.embedding(x) # [batch_size, length, d]
    emb_flat = emb.view(emb.size(0), -1) # [batch_size, length*d]
    hidden = self.relu(self.layer(emb_flat))
    logits = self.class_layer(hidden)
    return logits


d = 50
baselineModel = Baseline(d, length)
trainModel(baselineModel)
testModel(baselineModel)

Current used device is: cuda
Epoch: 1  Loss: 5.855256732553244
Epoch: 2  Loss: 5.7876552529633045
Epoch: 3  Loss: 5.765292001888156
Epoch: 4  Loss: 5.736551998183131
Epoch: 5  Loss: 5.698025980964303
Epoch: 6  Loss: 5.6485468707978725
Epoch: 7  Loss: 5.588731094263494
Epoch: 8  Loss: 5.520514510571957
Epoch: 9  Loss: 5.444377202540636
Epoch: 10  Loss: 5.362850101664662
Epoch: 11  Loss: 5.276637480594218
Epoch: 12  Loss: 5.187767146155238
Epoch: 13  Loss: 5.098160369321704
Epoch: 14  Loss: 5.0071788262575865
Epoch: 15  Loss: 4.915527734905481
Epoch: 16  Loss: 4.824512069113553
Epoch: 17  Loss: 4.732862449251115
Epoch: 18  Loss: 4.639154360629618
Epoch: 19  Loss: 4.54683852288872
Epoch: 20  Loss: 4.4537335420027375
Epoch: 21  Loss: 4.359397957101464
Epoch: 22  Loss: 4.265775416046381
Epoch: 23  Loss: 4.170694429427385
Epoch: 24  Loss: 4.076530613936484
Epoch: 25  Loss: 3.9806137550622225
Epoch: 26  Loss: 3.8851571269333363
Epoch: 27  Loss: 3.7874110443517566
Epoch: 28  Loss: 3.6908824779

> Problem 3.2

In [17]:
class RNN(nn.Module):
    def __init__(self, d):
        super(RNN, self).__init__()
        self.hidden_size = d
        self.embedding = nn.Embedding(len(vocab), d)
        self.layer = nn.Linear(d * 2, d)
        self.class_layer = nn.Linear(d, 2)
        nn.init.normal_(self.layer.weight, 0, 0.01)
        nn.init.normal_(self.layer.bias, 0, 0.01)
        nn.init.normal_(self.class_layer.weight, 0, 0.01)
        nn.init.normal_(self.class_layer.bias, 0, 0.01)
    
    def forward(self, x):
      hidden = nn.init.kaiming_uniform_(torch.empty(x.shape[0], self.hidden_size))

      if x.is_cuda:
        device = x.get_device()
        hidden = hidden.to(device)

      for i in range(x.shape[1]):
        word = x[:, i]
        emb = self.embedding(word) # [batch_size, d]
        combined = torch.cat((emb, hidden), 1)
        hidden = torch.tanh(self.layer(combined))

      output = self.class_layer(hidden)
      return output

d = 50
rnnModel = RNN(d)
trainModel(rnnModel)
testModel(rnnModel)

Current used device is: cuda
Epoch: 1  Loss: 5.852286495268345
Epoch: 2  Loss: 5.787801954895258
Epoch: 3  Loss: 5.7819845881313086
Epoch: 4  Loss: 5.7761559169739485
Epoch: 5  Loss: 5.769342046231031
Epoch: 6  Loss: 5.759808072820306
Epoch: 7  Loss: 5.741614473983645
Epoch: 8  Loss: 5.6906804125756025
Epoch: 9  Loss: 5.717555275186896
Epoch: 10  Loss: 5.686293960548937
Epoch: 11  Loss: 5.66786097548902
Epoch: 12  Loss: 5.644720732234418
Epoch: 13  Loss: 5.619270577095449
Epoch: 14  Loss: 5.597526997327805
Epoch: 15  Loss: 5.573094626888633
Epoch: 16  Loss: 5.542904443107545
Epoch: 17  Loss: 5.519987327978015
Epoch: 18  Loss: 5.494157557375729
Epoch: 19  Loss: 5.4698748383671045
Epoch: 20  Loss: 5.438894420862198
Epoch: 21  Loss: 5.409568814560771
Epoch: 22  Loss: 5.379035102203488
Epoch: 23  Loss: 5.349802661687136
Epoch: 24  Loss: 5.30925939232111
Epoch: 25  Loss: 5.279627070762217
Epoch: 26  Loss: 5.243805057369173
Epoch: 27  Loss: 5.2126004276797175
Epoch: 28  Loss: 5.1746652210131

>Problem 3.3: 
Negative log likelihood
In case of binary classification the likelihood can be represented using Bernoulli's distribution. This distribution is optimized by maximizing the function wrt θ. We maximize the log-liklihood in place of likelihood function as applying log changes the product into a summation which offers a more efficient solution. 
 $$ p(y|\pi) = \prod_{i=1}^n {\pi_i}^{y_i} (1-\pi_i)^{1-y_i} $$
$$ p(y|x,θ) = \prod_{i=1}^n p_θ (y|x_i)^{y_i} (1-p_θ (y|x_i))^{1-y_i} $$
$$ L(θ;x,y) = \sum_{i=1}^n y_i \log p_θ (y|x_i) + (1-y_i) \log (1-p_θ (y|x_i)) $$
Cross entropy:
Cross entropy loss for a binary classification problem parametrized by θ, having true and predicted labels as y and pθ(y|x) respectively, can be represented as the following equation. This equation is optimized wrt to θ. 
 $$ BCE(y,x,θ) = -\sum_{i=1}^n y_i \log p_θ (y|x_i) + (1-y_i) \log (1-p_θ (y|x_i)) $$
The above equations show that negative log likelihood of probability distribution (Bernoulli's distribution in our case) and cross entropy loss are equivalent.

>Problem 3.4: In case when the softmax value is too close to zero, applying log on top of softmax would result in a greater negative value. These large negative values can cause our model to show undesirable behaviour and can go unstable due to large number multiplication.

## 4. Text Classification with LSTM and Dropout

Replace your RNN module with an LSTM module. See Lecture slides 05 and 06 for the formal definition of LSTMs. 

You will also use Dropout, which randomly makes each dimension zero with the probability of `p` and scale it by `1/(1-p)` if it is not zero during training. Put it either at the input or the output of the LSTM to prevent it from overfitting.

In [18]:
a = torch.FloatTensor([0.1, 0.3, 0.5, 0.7, 0.9])
dropout = nn.Dropout(0.5) # p=0.5
print(dropout(a))

tensor([0.0000, 0.6000, 1.0000, 1.4000, 0.0000])


> **Problem 4.1** *(2 points)* Implement and use LSTM (without using PyTorch's LSTM module) instead of vanilla RNN. Report the accuracy on the dev data.

> **Problem 4.2** *(2 points)* Use Dropout on LSTM (either at input or output). Report the accuracy on the dev data.

### Solution

> Problem 4.1

In [19]:
class LSTM(nn.Module):
  def __init__(self, d):
    super(LSTM, self).__init__()
    self.embedding = nn.Embedding(len(vocab), d)
    self.layer = nn.Linear(2 * d, 4 * d, bias=True)
    self.class_layer = nn.Linear(d, 2, bias=True)
    self.sigmoid = nn.Sigmoid()
    self.tanh = nn.Tanh()
    self.hidden_size = d
    nn.init.normal_(self.layer.weight, 0, 0.01)
    nn.init.normal_(self.layer.bias, 0, 0.01)
    nn.init.normal_(self.class_layer.weight, 0, 0.01)
    nn.init.normal_(self.class_layer.bias, 0, 0.01)
  
  def forward(self, x):
    hidden = nn.init.kaiming_uniform_(torch.empty(x.shape[0], self.hidden_size*2))

    if x.is_cuda:
        device = x.get_device()
        hidden = hidden.to(device)
    
    for i in range(x.shape[1]):
      word = x[:, i]
      emb = self.embedding(word) # [batch_size, d]
      prev_h, prev_c = hidden.chunk(2, -1)
      tensor = torch.cat([prev_h, emb], dim=-1)
      tensor = self.layer(tensor)
      input_, forget, output, cand_c = tensor.chunk(4, -1)
      input_ = self.sigmoid(input_)
      forget = self.sigmoid(forget)
      output = self.sigmoid(output) 
      cand_c = self.tanh(cand_c)
      cur_c = input_ * cand_c + forget * prev_c
      cur_h = output * self.tanh(cur_c)
      hidden  = torch.cat([cur_h, cur_c], -1)
    return self.class_layer(output)

d = 50
lstmModel = LSTM(d)
trainModel(lstmModel)
testModel(lstmModel)

Current used device is: cuda
Epoch: 1  Loss: 5.802698826417327
Epoch: 2  Loss: 5.8155719712376595
Epoch: 3  Loss: 5.8483905382454395
Epoch: 4  Loss: 5.862160194665194
Epoch: 5  Loss: 5.866503959521651
Epoch: 6  Loss: 5.86661870777607
Epoch: 7  Loss: 5.864469218999147
Epoch: 8  Loss: 5.860420489683747
Epoch: 9  Loss: 5.853739967569709
Epoch: 10  Loss: 5.842786896973848
Epoch: 11  Loss: 5.825650458224118
Epoch: 12  Loss: 5.804028078913689
Epoch: 13  Loss: 5.772273316979408
Epoch: 14  Loss: 5.735757066868246
Epoch: 15  Loss: 5.691756432875991
Epoch: 16  Loss: 5.644140676595271
Epoch: 17  Loss: 5.596884664148092
Epoch: 18  Loss: 5.544505879282951
Epoch: 19  Loss: 5.4931627893820405
Epoch: 20  Loss: 5.441881031729281
Epoch: 21  Loss: 5.3924176106229424
Epoch: 22  Loss: 5.340746805071831
Epoch: 23  Loss: 5.29076037183404
Epoch: 24  Loss: 5.2429200280457735
Epoch: 25  Loss: 5.200033342465758
Epoch: 26  Loss: 5.149148302152753
Epoch: 27  Loss: 5.104621906764805
Epoch: 28  Loss: 5.0684024207293

> Problem 4.2

In [20]:
class LSTMDropout(nn.Module):
  def __init__(self, d):
    super(LSTMDropout, self).__init__()
    self.embedding = nn.Embedding(len(vocab), d)
    self.layer = nn.Linear(2 * d, 4 * d, bias=True)
    self.class_layer = nn.Linear(d, 2, bias=True)
    self.sigmoid = nn.Sigmoid()
    self.tanh = nn.Tanh()
    self.hidden_size = d
    self.dropout = nn.Dropout(0.5)
    nn.init.normal_(self.layer.weight, 0, 0.01)
    nn.init.normal_(self.layer.bias, 0, 0.01)
    nn.init.normal_(self.class_layer.weight, 0, 0.01)
    nn.init.normal_(self.class_layer.bias, 0, 0.01)
  
  def forward(self, x):
    hidden = nn.init.kaiming_uniform_(torch.empty(x.shape[0], self.hidden_size*2))

    if x.is_cuda:
        device = x.get_device()
        hidden = hidden.to(device)
    
    for i in range(x.shape[1]):
      word = x[:, i]
      emb = self.embedding(word) # [batch_size, d]
      prev_h, prev_c = hidden.chunk(2, -1)
      tensor = torch.cat([prev_h, emb], dim=-1)
      tensor = self.layer(tensor)
      input_, forget, output, cand_c = tensor.chunk(4, -1)
      input_ = self.sigmoid(input_)
      forget = self.sigmoid(forget)
      output = self.sigmoid(output) 
      cand_c = self.tanh(cand_c)
      cur_c = input_ * cand_c + forget * prev_c
      cur_h = output * self.tanh(cur_c)
      hidden  = torch.cat([cur_h, cur_c], -1)
    return self.class_layer(self.dropout(output))

d = 50
lstmdropoutModel = LSTMDropout(d)
trainModel(lstmdropoutModel)
testModel(lstmdropoutModel)

Current used device is: cuda
Epoch: 1  Loss: 5.804907591082156
Epoch: 2  Loss: 5.8158073872327805
Epoch: 3  Loss: 5.846118737012148
Epoch: 4  Loss: 5.866849524900317
Epoch: 5  Loss: 5.870863452553749
Epoch: 6  Loss: 5.871144410222769
Epoch: 7  Loss: 5.870280003175139
Epoch: 8  Loss: 5.861121268942952
Epoch: 9  Loss: 5.851257419213653
Epoch: 10  Loss: 5.839736694470048
Epoch: 11  Loss: 5.821695843711495
Epoch: 12  Loss: 5.794440632686019
Epoch: 13  Loss: 5.770312869921327
Epoch: 14  Loss: 5.724709061905742
Epoch: 15  Loss: 5.682821719907224
Epoch: 16  Loss: 5.639177007600665
Epoch: 17  Loss: 5.608760331757367
Epoch: 18  Loss: 5.565666951239109
Epoch: 19  Loss: 5.5117966355755925
Epoch: 20  Loss: 5.4848199263215065
Epoch: 21  Loss: 5.426799094304442
Epoch: 22  Loss: 5.389909945428371
Epoch: 23  Loss: 5.346701493486762
Epoch: 24  Loss: 5.316683175973594
Epoch: 25  Loss: 5.260393305681646
Epoch: 26  Loss: 5.246219001710415
Epoch: 27  Loss: 5.212224384769797
Epoch: 28  Loss: 5.1503115156665

## 5. Pretrained Word Vectors
The last step is to use pretrained vocabulary and word vectors. The prebuilt vocabulary will replace the vocabulary you built with SST training data, and the word vectors will replace the embedding vectors. You will observe the power of leveraging self-supservised pretrained models.

> **Problem 5.1 (bonus)** *(2 points)* Go to https://nlp.stanford.edu/projects/glove/ and download `glove.6B.zip`. Use these pretrained word vectors to replace word embeddings in your model from 4.2. Report the model's accuracy on the dev data.

### Solution

> Problem 5.1

In [21]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip
!ls -lat

--2022-04-05 11:00:04--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2022-04-05 11:00:05--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2022-04-05 11:00:05--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2022-0

In [22]:
import numpy as np

vocab,embeddings = [],[]
with open('glove.6B.50d.txt','rt') as fi:
    full_content = fi.read().strip().split('\n')
for i in range(len(full_content)):
    i_word = full_content[i].split(' ')[0]
    i_embeddings = [float(val) for val in full_content[i].split(' ')[1:]]
    vocab.append(i_word)
    embeddings.append(i_embeddings)

vocab_npa = np.array(vocab)
embs_npa = np.array(embeddings)

vocab_npa = np.insert(vocab_npa, 0, '<pad>')
vocab_npa = np.insert(vocab_npa, 1, '<unk>')

pad_emb_npa = np.zeros((1,embs_npa.shape[1]))
unk_emb_npa = np.mean(embs_npa,axis=0,keepdims=True)

embs_npa = np.vstack((pad_emb_npa,unk_emb_npa,embs_npa))
word2id = {word: id_ for id_, word in enumerate(vocab_npa)}

In [23]:
class LSTMDropoutPreTrained(nn.Module):
  def __init__(self, d):
    super(LSTMDropoutPreTrained, self).__init__()
    self.layer = nn.Linear(2 * d, 4 * d, bias=True)
    self.class_layer = nn.Linear(d, 2, bias=True)
    self.sigmoid = nn.Sigmoid()
    self.my_embedding_layer = torch.nn.Embedding.from_pretrained(torch.from_numpy(embs_npa).float())
    self.tanh = nn.Tanh()
    self.hidden_size = d
    self.dropout = nn.Dropout(0.5)
    nn.init.normal_(self.layer.weight, 0, 0.01)
    nn.init.normal_(self.layer.bias, 0, 0.01)
    nn.init.normal_(self.class_layer.weight, 0, 0.01)
    nn.init.normal_(self.class_layer.bias, 0, 0.01)
  
  def forward(self, x):
    hidden = nn.init.kaiming_uniform_(torch.empty(x.shape[0], self.hidden_size*2))
    
    if x.is_cuda:
        device = x.get_device()
        hidden = hidden.to(device)
    
    for i in range(x.shape[1]):
      word = x[:, i]
      emb = self.my_embedding_layer(word) # [batch_size, d]
      prev_h, prev_c = hidden.chunk(2, -1)
      tensor = torch.cat([prev_h, emb], dim=-1)
      tensor = self.layer(tensor)
      input_, forget, output, cand_c = tensor.chunk(4, -1)
      input_ = self.sigmoid(input_)
      forget = self.sigmoid(forget)
      output = self.sigmoid(output) 
      cand_c = self.tanh(cand_c)
      cur_c = input_ * cand_c + forget * prev_c
      cur_h = output * self.tanh(cur_c)
      hidden  = torch.cat([cur_h, cur_c], -1)
      
    return self.class_layer(self.dropout(output))

d = 50
lstmDropoutGloveModel = LSTMDropoutPreTrained(d)
trainModel(lstmDropoutGloveModel)
testModel(lstmDropoutGloveModel)

Current used device is: cuda
Epoch: 1  Loss: 5.810080042108893
Epoch: 2  Loss: 5.82396262511611
Epoch: 3  Loss: 5.853065188974142
Epoch: 4  Loss: 5.8663366455584764
Epoch: 5  Loss: 5.869462383911014
Epoch: 6  Loss: 5.871553055942059
Epoch: 7  Loss: 5.875411454588175
Epoch: 8  Loss: 5.866199430078268
Epoch: 9  Loss: 5.866932790726423
Epoch: 10  Loss: 5.868659470230341
Epoch: 11  Loss: 5.862235719338059
Epoch: 12  Loss: 5.863466378301382
Epoch: 13  Loss: 5.855251913890243
Epoch: 14  Loss: 5.858594594523311
Epoch: 15  Loss: 5.853400304913521
Epoch: 16  Loss: 5.853346349671483
Epoch: 17  Loss: 5.84447380527854
Epoch: 18  Loss: 5.844160944223404
Epoch: 19  Loss: 5.840326946228743
Epoch: 20  Loss: 5.837163126096129
Epoch: 21  Loss: 5.8361697178334
Epoch: 22  Loss: 5.8341509234160185
Epoch: 23  Loss: 5.832984020933509
Epoch: 24  Loss: 5.833384791389108
Epoch: 25  Loss: 5.830110834911466
Epoch: 26  Loss: 5.828036533668637
Epoch: 27  Loss: 5.829818746075034
Epoch: 28  Loss: 5.828698327764869
Ep

> Comments:
Accuracy with GloVe was suppose to increase but after seeing the vocab of sst dataset I expected the accuracy with GloVe to decrease because most of the words in orignal vocab are like "The, [Hello, is', and these are the words that don't exist in GloVe. Hence most words will be assigned 1 ID and accuracy will decrease