# HW 2: Text Classification using LSTMs
**Due: February 27, 9:30 AM**

In this homework assignment, you will define and train an [LSTM model](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) for _sentiment analysis_, a text classification task that we learned about in the Week 2 lab and Week 3 lecture. In this task, the model will read a user-generated movie review from the website [IMDb](https://www.imdb.com/) and predict whether the review is a _positive review_ (class `1`) or a _negative review_ (class `0`). For instance, the model should classify `This movie is amazing!` as `1` (positive) and `This movie is terrible!` as `0` (negative).

## Important: Read Before Starting

In the following exercises, you will need to implement functions defined in the `tokenizer`, `model`, and `train_test` modules. **Please write all your code in the respective `.py` files for those modules.** You should not submit this notebook with your solutions, and we will not grade it if you do. Please be aware that code written in a Jupyter notebook may run differently when copied into Python modules.

The outputs shown in this notebook are the outputs that you should get **when all problems have been completed correctly**. You may obtain different results if you attempt to run the code cells before you have completed the problem set, or if you have completed one or more problems incorrectly.

To begin, please run the following `import` statements.

In [2]:
!export LC_ALL=en_US.UTF-8
!export LANG=en_US.UTF-8

In [1]:
import torch.nn as nn
import torch.nn.functional as F
from datasets import load_dataset
from nltk.tokenize import TreebankWordTokenizer

from embeddings import Embeddings
from model import LSTMSentimentClassifier
from tokenizer import Tokenizer
from train_test import evaluate, train


In [2]:
embeddings = Embeddings.from_file(filename='data/glove_300d.txt')

In [3]:
import torch
vecs= torch.tensor(embeddings.vectors)

In [4]:
torch.cat([vecs, vecs]).argmax(axis=1).shape

torch.Size([60000])

## Problem 1: Inspect the Data (10 Points in Total)

The official name of the dataset we are using is the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/). More commonly, however, it is simply known as the _IMDb dataset_. It was originally created in 2011 by a group of graduate students at Stanford University supervised by Professors [Andrew Ng](https://www.andrewng.org/) and [Christopher Potts](https://web.stanford.edu/~cgpotts/). The IMDb dataset has since become one of the most commonly used sentiment analysis datasets in NLP research.

### Problem 1a: Load and Inspect Examples (No Submission, 0 Points)

In this assignment, we will work with the IMDb dataset using a Python interface provided by the 🤗 Datasets library. This library stores a number of NLP datasets on a remote server, which you can download using the `load_dataset` function.

In [5]:
from datasets import load_dataset
imdb = load_dataset("imdb")


The IMDb dataset is split into three parts: `train`, `test`, and `unsupervised`. There is no validation set. (The `unsupervised` dataset contains movie reviews without labels.)

In [6]:
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

Since we will need a validation set for early stopping and hyperparameter tuning, we will take 20% of the training data and use that as the validation data, as we did during lab.

In [7]:
split = imdb["train"].train_test_split(.2, seed=3463)
imdb["train"] = split["train"]
imdb["val"] = split["test"]
if "unsupervised" in imdb:
    del imdb["unsupervised"]  # Save memory by deleting the unlabeled examples



To get an idea of what the data look like, please inspect a few examples from the dataset.

In [8]:
imdb["train"][1000]

{'text': "Ever since I was eight years old I have been a big wrestling fan. It didn't matter what federation I watched. WWE,WCW,USWA. To me the action is all I watched it for.<br /><br />May 23rd 1999. That was my 19 birthday. I ordered Over the Edge and I was just expecting another pay per view. But this time. I was wrong. Instead that was the night one of the best wrestlers to come out of Canada a true human being fell to his death due to a stunt gone wrong. Not much you can do to change the situation. But what happened affter Owens death made me very mad.<br /><br />Rather then ending the pay per view and doing the right thing as human beings the WWE decided to protect what comes first and that was the money by keeping the pay per view going as if Owens death never happened.<br /><br />I gotta tell you. Vince Mchmaon has made some stupid decisions in his life but this was by far the stupidest decision he ever made.<br /><br />And this crap with saying Owen would have wanted the pay 

### Problem 1b: Benchmarks in NLP (Written, 5 Points)

Many papers in the sentiment analysis literature use the IMDb dataset for training and testing. When a lot of papers use the same dataset, we refer to that dataset as a "benchmark," and we often keep track of the best ("state-of-the-art") performance that has been achieved on each benchmark. For instance, the website [NLP-progress](https://nlpprogress.com/) lists state-of-the-art results for several NLP benchmarks, [including IMDb](http://nlpprogress.com/english/sentiment_analysis.html).

Why would so many papers use the same dataset, instead of creating their own datasets? (Assume that it's not merely because of convenience.)

- controls for confounding factors such as the quality of label collection + cleaning, underlying distributions 
- allows for fair comparison of different approaches 

### Problem 1c: Extra Credit (Written, 5 Points)

Why do we use 20% of the training data as the validation dataset, instead of taking 20% from the testing data?

**Hint:** Your answer should be related to your answer for Problem 1b.

- 


### Problem 1d: Understanding the Dataset (Written, 5 Points)

This screenshot shows [a typical movie review from the IMDb website](https://www.imdb.com/title/tt1630029/reviews/?ref_=tt_ql_urv).

<img src="images/imdb-screenshot.png" style="width:75%" alt="A screenshot of a movie review from the IMDb website." />

As you can see, IMDb reviews come with a text and a star rating out of 10; but nowhere does it say whether the review is "positive" or "negative." According to [the original paper describing the IMDb dataset (Maas et al., 2011)](https://aclanthology.org/P11-1015/), how were the labels in the IMDb dataset assigned to each movie review?

**Hint:** Sophie said something about this during lab on February 2, but she was wrong.

## Problem 2: Tokenization (25 Points in Total)

Now that you understand what the IMDb dataset looks like, the first step to building a sentiment analyzer is to build a _tokenizer_. Tokenization is the process of splitting up a string into pieces called _tokens_. For example, in this assignment the text `Hello world!` would be split up into the tokens `['Hello', 'world', '!']`. After tokenization, your tokenizer will replace each token with a unique numerical identifier, which the LSTM model will use to look up the word embedding for each token. Thus, the tokens `['Hello', 'world', '!']` will be replaced by the sequence of numbers `[3147, 207, 36]`.

In this problem, you will implement the `Tokenizer` class in the `tokenizer` module.

### Problem 2a: Inspect the GloVe Vocabulary (No Submission, 0 Points)

The set of all possible unique tokens is called the _vocabulary_. Since we will use pre-trained GloVe embeddings in our sentiment analyzer, the vocabulary we use for tokenization simply consists of all the tokens that appear in the word embedding file.

To help you load the GloVe embeddings, the starter code for this assignment comes with the solution code for Problem 1b of HW 1, where you implemented an `Embeddings` class that holds a set of word embeddings. Here we will use it to load 300-dimensional embeddings trained on 840 billion tokens from the Common Crawl corpus.  

In [9]:
# Load the GloVe embeddings
glove = Embeddings.from_file("data/glove_300d.txt")

# Inspect some of the words in the vocabulary
print(glove.words[:15])

[',', '.', 'the', 'and', 'to', 'of', 'a', 'in', '"', ':', 'is', 'for', 'I', ')', '(']


Take a look at the first 15 words in the GloVe vocabulary. What order do you think the words are in? Is the GloVe vocabulary case-sensitive?

### Problem 2b: Extra Credit (Written, 5 Points)

You might think that all a tokenizer needs to do is to separate out the words in a text. But remember from class that the concept of a "word" is difficult to define linguistically, so sometimes it might make sense for a word to be split up into multiple tokens. For example, observe below that the GloVe vocabulary contains common word pieces like `'s` and `n't`.

In [10]:
# Inspect the 20th and 40th words in the GloVe vocabulary
print(glove.words[20]) 
print(glove.words[40])

's
n't


What is the linguistics term that refers to word pieces like `'s` or `n't`, as well as prefixes and suffixes like `pre-` or `-tion` and "root" words like `berry`? Why should these kinds of word pieces be treated as separate tokens?

### Problem 2c: Understand Tokenizer Usage (No Submission, 0 Points)

Your code for tokenization will belong to the `Tokenizer` class in the `tokenizer` module. Before you implement the tokenizer, take a look at the following usage examples to understand how the tokenizer should work.

A `Tokenizer` object is instantiated from a vocabulary, represented as a list of strings where each string is a unique token.

In [11]:
# Load a tokenizer from the GloVe vocabulary
tokenizer = Tokenizer(glove.words)

Once the `Tokenizer` is loaded, its vocabulary will contain all the unique tokens in the provided list of strings. It will also contain four additional tokens:
* `[BOS]` ("beginning of sequence"), which marks the beginning of a text;
* `[EOS]` ("end of sequence"), which marks the end of a text;
* `[UNK]` ("unknown"), which represents any token not in the vocabulary; and
* `[PAD]` ("padding"), a token that is appended to the end of certain texts to make sure that all the inputs in a mini-batch have the same length. (This is so that each mini-batch can be represented as a matrix.)

Each vocabulary item is assigned a unique numerical identifier known as its _index_. The indices are assigned in the same order as in the vocabulary used to instantiate the `Tokenizer`. The four special tokens `[BOS]`, `[EOS]`, `[UNK]`, and `[PAD]` are assigned the last four indices.

In [12]:
# Visualize some of the words and indices
print("Vocabulary size:", len(tokenizer))
sample_words = glove.words[:15] + ["[BOS]", "[EOS]", "[UNK]", "[PAD]"]
for w in sample_words:
    print(w, tokenizer[w], sep="\t")

Vocabulary size: 30004
,	0
.	1
the	2
and	3
to	4
of	5
a	6
in	7
"	8
:	9
is	10
for	11
I	12
)	13
(	14
[BOS]	30000
[EOS]	30001
[UNK]	30002
[PAD]	30003
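
The index assignment described above can be sketched in a few lines of plain Python. This is only an illustration, not the `Tokenizer` class from the starter code: the function `build_index` and the four-word toy vocabulary are hypothetical stand-ins.

```python
SPECIAL_TOKENS = ["[BOS]", "[EOS]", "[UNK]", "[PAD]"]


def build_index(vocab):
    """Map each token to its position; special tokens take the last four indices."""
    words = list(vocab) + SPECIAL_TOKENS
    return {word: i for i, word in enumerate(words)}


toy_vocab = [",", ".", "the", "and"]  # hypothetical stand-in for glove.words
index = build_index(toy_vocab)
print(index["the"])    # position in the original vocabulary
print(index["[BOS]"])  # first index after the regular tokens
```

With a 30,000-word vocabulary, the same scheme places `[BOS]` at index 30000, as in the output above.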


The `Tokenizer` object can be called as a function. Its input is a `dict` containing the texts and labels for a mini-batch. It should tokenize all the texts in the mini-batch before adding `[BOS]`, `[EOS]`, and `[PAD]` tokens as appropriate. Its output should contain the mini-batch with the texts represented as a PyTorch `LongTensor`, where each token has been replaced by its index. It should also contain the length of each example (including `[BOS]` and `[EOS]`), also represented as a `LongTensor`.

In [15]:
# Defining a raw input batch
text1 = "<em>The Shawshank Redemption</em> was a great movie.<br />I enjoyed " \
        "it a lot!"
text2 = "This movie was terrible. I could barely watch it."
batch = {"text": [text1, text2], "label": [1, 0]}

In [16]:
# Turning the raw input batch into a model input
tokenizer(batch)

{'lengths': tensor([16, 13]),
 'label': tensor([1, 0]),
 'text': tensor([[30000,    22, 30002, 24052,    30,     6,   158,   603,     1,    12,
           2203,    21,     6,   271,    36, 30001],
         [30000,    76,   603,    30,  4534,     1,    12,   121,  5333,   712,
             21,     1, 30001, 30003, 30003, 30003]])}

### Problem 2d: Implement the Tokenization Pipeline (Code, 10 Points)

The first part of the tokenizer that you will implement is called the _tokenization pipeline_. This pipeline consists of three steps: _normalization_, _tokenization_, and _postprocessing_, in that order. After these three steps, an input text such as 

> <em>The Shawshank Redemption</em> was a great movie.<br />I enjoyed it a lot!

will be transformed into a sequence of tokens such as

`['[BOS]', 'The', '[UNK]', 'Redemption', 'was', 'a', 'great', 'movie', '.', 'I', 'enjoyed', 'it', 'a', 'lot', '!', '[EOS]']`


**Normalization:** In the _normalization_ step, HTML tags are removed from the text. This step has already been implemented for you.

In [17]:
raw_text = batch["text"][0]
normalized_text = tokenizer.normalize(raw_text)
print(normalized_text)

The Shawshank Redemption was a great movie. I enjoyed it a lot!
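
For intuition, tag removal of this kind can be approximated with a regular expression. The sketch below is an assumption about one way to do it, not the provided `normalize` method, which may work differently; `strip_html` is a hypothetical name.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_html(text):
    """Replace HTML tags with spaces, then collapse runs of whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return " ".join(without_tags.split())


raw = "<em>The Shawshank Redemption</em> was a great movie.<br />I enjoyed it a lot!"
print(strip_html(raw))
```

Replacing each tag with a space (rather than deleting it) keeps `movie.` and `I` from being fused when the `<br />` between them is removed.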


**Tokenization:** In the _tokenization_ step, the text is divided into tokens. Some of the tokens may be out of vocabulary, but we will not worry about that in this step. You will use [NLTK's Penn Treebank Tokenizer](https://www.nltk.org/api/nltk.tokenize.treebank.html) for tokenization.

In [18]:
raw_tokens = tokenizer.tokenize(normalized_text)
print(raw_tokens)

['The', 'Shawshank', 'Redemption', 'was', 'a', 'great', 'movie', '.', 'I', 'enjoyed', 'it', 'a', 'lot', '!']


Note that the NLTK tokenizer does not behave correctly when it is applied to more than one sentence. In the example below, `'movie.'` is treated as a single token even though the period should be a separate token: `['movie', '.']`.

In [19]:
nltk_tokenizer = TreebankWordTokenizer()
print(nltk_tokenizer.tokenize(normalized_text))

['The', 'Shawshank', 'Redemption', 'was', 'a', 'great', 'movie.', 'I', 'enjoyed', 'it', 'a', 'lot', '!']


In order to obtain the correct behavior, you will need to perform [sentence segmentation](https://www.nltk.org/book/ch03.html) before the tokenization step. See [Section 3.8 of the NLTK book](https://www.nltk.org/book/ch03.html) for more details.
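
The segment-then-tokenize idea can be illustrated without NLTK. In the sketch below, `tokenize_sentence` is a toy stand-in that, like the behavior observed above, only detaches punctuation at the very end of its input, and `split_sentences` is a naive regex splitter. Both are hypothetical helpers for illustration; a real solution would use NLTK's own tools.

```python
import re


def split_sentences(text):
    """Naive segmentation: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize_sentence(sentence):
    """Toy word tokenizer: split on whitespace and detach punctuation
    only at the very end of the input."""
    tokens = sentence.split()
    if tokens and len(tokens[-1]) > 1 and tokens[-1][-1] in ".!?":
        tokens[-1:] = [tokens[-1][:-1], tokens[-1][-1]]
    return tokens


def tokenize_text(text):
    """Segment first, then tokenize each sentence and concatenate the results."""
    return [tok for sent in split_sentences(text) for tok in tokenize_sentence(sent)]


text = "The Shawshank Redemption was a great movie. I enjoyed it a lot!"
print(tokenize_sentence(text))  # 'movie.' stays fused
print(tokenize_text(text))      # 'movie' and '.' are separate tokens
```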

**Postprocessing:** In the _postprocessing_ step, `'[BOS]'` is added to the beginning of each text, and `'[EOS]'` is added to the end of each text. Tokens not in the vocabulary are replaced by `'[UNK]'`.

In [21]:
print(tokenizer.postprocess(raw_tokens))

['[BOS]', 'The', '[UNK]', 'Redemption', 'was', 'a', 'great', 'movie', '.', 'I', 'enjoyed', 'it', 'a', 'lot', '!', '[EOS]']
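
As a rough sketch of the postprocessing logic, the standalone function below takes the vocabulary as an explicit argument. The method in the starter code has a different signature (it is called as `tokenizer.postprocess(raw_tokens)`), so treat the function and the toy vocabulary here as assumptions for illustration.

```python
def postprocess(tokens, vocab):
    """Wrap tokens in [BOS]/[EOS] and replace out-of-vocabulary tokens with [UNK]."""
    known = set(vocab)  # set membership tests are O(1) on average
    body = [tok if tok in known else "[UNK]" for tok in tokens]
    return ["[BOS]"] + body + ["[EOS]"]


toy_vocab = ["The", "movie", "was", "great", "."]  # hypothetical vocabulary
result = postprocess(["The", "Shawshank", "movie", "was", "great", "."], toy_vocab)
print(result)
```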


To complete this problem, you will need to implement the `tokenize` and `postprocess` methods of the `Tokenizer` class.

### Problem 2e: Prepare Model Input (Code, 10 Points)

With the tokenization pipeline complete, your next step is to implement the `Tokenizer.__call__` method, which defines the `Tokenizer`'s behavior when it is called as a function. The `Tokenizer`-as-function should apply the tokenization pipeline to each text in an input batch before converting it into a `LongTensor` with all the words replaced by their indices. All the texts in the batch should be combined into a single matrix (a 2-dimensional `LongTensor`), where the `[PAD]` token is repeatedly appended to shorter texts until all the texts are the same length. In addition, `__call__` should convert the labels to a 1-dimensional `LongTensor` and add a 1-dimensional `LongTensor` to the batch that indicates the number of tokens in each text, not including `'[PAD]'`.

In [20]:
batch = tokenizer(batch)
batch 

{'lengths': tensor([16, 13]),
 'label': tensor([1, 0]),
 'text': tensor([[30000,    22, 30002, 24052,    30,     6,   158,   603,     1,    12,
           2203,    21,     6,   271,    36, 30001],
         [30000,    76,   603,    30,  4534,     1,    12,   121,  5333,   712,
             21,     1, 30001, 30003, 30003, 30003]])}

In [22]:
# Visualize the tokens: notice that text2 has '[PAD]' at the end
for i, text in enumerate(batch["text"]):
    print("Text {}:".format(i), [tokenizer.words[t] for t in text])

Text 0: ['[BOS]', 'The', '[UNK]', 'Redemption', 'was', 'a', 'great', 'movie', '.', 'I', 'enjoyed', 'it', 'a', 'lot', '!', '[EOS]']
Text 1: ['[BOS]', 'This', 'movie', 'was', 'terrible', '.', 'I', 'could', 'barely', 'watch', 'it', '.', '[EOS]', '[PAD]', '[PAD]', '[PAD]']
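
The padding step can be sketched with plain lists before any tensors are involved. The helper `pad_batch` is hypothetical, and the index sequences are shortened toy examples.

```python
def pad_batch(sequences, pad_index):
    """Right-pad each index sequence to the longest length in the batch.

    Returns the padded sequences and the original (unpadded) lengths.
    """
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


padded, lengths = pad_batch(
    [[30000, 22, 36, 30001], [30000, 76, 30001]],
    pad_index=30003,
)
print(padded)
print(lengths)
```

Because every row now has the same length, the nested list can be converted into a 2-dimensional tensor, for example with `torch.tensor(padded, dtype=torch.long)`.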


## Problem 3: Model Architecture Definition (25 Points in Total)

In this problem, you will implement the `LSTMSentimentClassifier` class in the `model` module, which defines the architecture for your sentiment analysis model. The architecture for the model is illustrated in the following diagram.

<img src="images/lstm.png" style="width:75%" alt="The architecture for the sentiment analysis classifier." />

In this architecture, the word embedding for each word is fed into an LSTM encoder network, which computes a sequence of hidden state vectors. The last hidden state vector $\boldsymbol{h}^{(n)}$ (with $n = 5$ in the diagram), computed when the LSTM reads $\overrightarrow{\text{[EOS]}}$, is used as an embedding for the text. A decoder network—here just a linear layer—takes $\boldsymbol{h}^{(n)}$ and predicts the label assigned to the text.
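
One practical consequence of padding is that the final time step of a shorter text corresponds to `[PAD]`, not `[EOS]`. The sketch below shows one way to pick out the hidden state at position `length - 1` for each text. It assumes the hidden states have shape `(batch, max_len, hidden_size)`, as produced with `batch_first=True`, and uses a dummy tensor in place of real LSTM output.

```python
import torch

batch_size, max_len, hidden_size = 2, 5, 3

# Dummy "hidden states": entry values are just running counters
hidden_states = torch.arange(
    batch_size * max_len * hidden_size, dtype=torch.float32
).reshape(batch_size, max_len, hidden_size)

# Number of real (non-padding) tokens in each text
lengths = torch.tensor([5, 3])

# For row b, select the hidden state at time step lengths[b] - 1
last = hidden_states[torch.arange(batch_size), lengths - 1]
print(last.shape)
print(last)
```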

Unlike with the SGNS architecture and the MLP architecture we saw during the Week 3 lab, the LSTM architecture in this assignment has no sigmoid activation function in the decoder. Instead, the decoder's linear layer produces a 2-dimensional output $\hat{\boldsymbol{y}} \in \mathbb{R}^2$:

$$\hat{\boldsymbol{y}} = \boldsymbol{W}\boldsymbol{h}^{(n)} + \boldsymbol{b}$$

where $\text{softmax}(\hat{\boldsymbol{y}})$ contains the probabilities assigned by the model to the two possible labels. The softmax activation function is not shown in the diagram because [PyTorch's `nn.CrossEntropyLoss` module](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) combines the softmax activation function and the cross-entropy loss function into a single module.
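
To make the relationship between logits, softmax probabilities, and the loss concrete, here is a plain-Python sketch that computes the cross-entropy directly from the logits using the log-sum-exp trick. It illustrates the idea only and is not PyTorch's implementation; the function names are hypothetical, and the first logit pair is taken from the sample output in Problem 3d.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax via the log-sum-exp trick."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def cross_entropy(logits, label):
    """Negative log-probability of the true label, computed from raw logits."""
    return -log_softmax(logits)[label]


logits = [0.0963, 0.4546]
probs = [math.exp(lp) for lp in log_softmax(logits)]
print(probs)
print(cross_entropy(logits, 0))

# Extreme logits stay finite here, whereas a naive math.exp(1000.0) overflows
print(cross_entropy([1000.0, -1000.0], 0))
```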

### Problem 3a: Extra Credit (Written, 5 Points)

The cross-entropy loss function for this assignment is given by:

$$L(\text{softmax}(\hat{\boldsymbol{y}}), y) = -\ln(\text{softmax}(\hat{\boldsymbol{y}})_{y + 1}) = \begin{cases}
-\ln(e^{\hat{y}_1}/(e^{\hat{y}_1} + e^{\hat{y}_2})), & y = 0 \\
-\ln(e^{\hat{y}_2}/(e^{\hat{y}_1} + e^{\hat{y}_2})), & y = 1
\end{cases}$$

where $y \in \lbrace 0, 1 \rbrace$ is the true label of the input text. For each $i \in \lbrace 1, 2 \rbrace$, compute 

$$\frac{\partial}{\partial \hat{y}_i} L(\text{softmax}(\hat{\boldsymbol{y}}), y)$$

by hand. Why do you think we are combining softmax and the cross-entropy loss function into a single module, as opposed to using a softmax module in the model architecture?
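
One way to check a hand-computed derivative is to compare it against a central finite-difference estimate. The sketch below does this in plain Python; `loss` and `numerical_grad` are hypothetical helpers, and list positions 0 and 1 correspond to $\hat{y}_1$ and $\hat{y}_2$ in the formula above.

```python
import math


def loss(logits, label):
    """Cross-entropy of softmax(logits) against the true label (0-based)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def numerical_grad(logits, label, eps=1e-6):
    """Central-difference estimate of the partial derivative for each logit."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, label) - loss(down, label)) / (2 * eps))
    return grads


grads = numerical_grad([0.0963, 0.4546], label=0)
print(grads)
```

If your closed-form answer is correct, plugging the same logits and label into it should agree with these estimates to several decimal places.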

### Problem 3b: Define Architecture Components (Code, 10 Points)

Please implement the `__init__` method of the `LSTMSentimentClassifier`, paying close attention to the docstrings for the parameters. This method starts out with three incomplete lines of code, which are intended to define PyTorch modules that form part of the architecture. The only thing you need to do is to complete these lines of code by initializing the appropriate module. Please refer to [the `torch.nn` documentation](https://pytorch.org/docs/stable/nn.html) for guidance on the usage of these modules.

The specific modules you should initialize are:
* `nn.Embedding` ([see documentation](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html))
* `nn.LSTM` ([see documentation](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html))
* `nn.Linear` ([see documentation](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html))

Please use the setting `batch_first=True` for the `nn.LSTM` module, and use the default values for all other keyword arguments. Please **do not** use the `nn.LSTMCell` module, which also implements an LSTM network.

When this problem is finished, you should be able to instantiate an `LSTMSentimentClassifier` as follows.

In [93]:
# Create a model with hidden size of 10
from importlib import reload
import model as modellib
reload(modellib)
model = modellib.LSTMSentimentClassifier(len(tokenizer), 300, 10)

### Problem 3c: Load Pre-Trained Embeddings (Code, 5 Points)

Notice that in Problem 3b, you have defined the word embedding vectors used by the `LSTMSentimentClassifier` as parameters of the model. By default, parameters of PyTorch modules are initialized to random values. However, in this assignment you will initialize them to pre-trained GloVe embeddings.

Please implement the `load_pretrained_embeddings` method, which sets the model's word embedding matrix to values defined by an `Embeddings` object. 

**Important:** The last 4 rows of the model's word embedding matrix should _not_ be initialized to pre-trained values, since they represent the `[BOS]`, `[EOS]`, `[UNK]`, and `[PAD]` tokens, respectively.

In [94]:
# Initialize the model's word embeddings to pre-trained GloVe embeddings
model.load_pretrained_embeddings(glove)

# Visualize the pre-trained embeddings
print("GloVe Vectors:", glove.vectors, sep="\n", end="\n\n")
print("Initialized Embeddings:", model.embeddings.weight[:-4], sep="\n", 
      end="\n\n") 

# Your output here doesn't need to match the output shown
print("Random Embeddings for [BOS], [EOS], [UNK], and [PAD]:", 
      model.embeddings.weight[-4:], sep="\n")

GloVe Vectors:
[[-0.082752  0.67204  -0.14987  ... -0.1918   -0.37846  -0.06589 ]
 [ 0.012001  0.20751  -0.12578  ...  0.13871  -0.36049  -0.035   ]
 [ 0.27204  -0.06203  -0.1884   ...  0.13015  -0.18317   0.1323  ]
 ...
 [ 0.44812   0.55796  -0.88695  ...  0.33111  -0.067436 -0.20892 ]
 [-0.60233  -0.30839  -0.24441  ...  0.2574   -0.18594  -0.076442]
 [-0.20681   0.20913   0.1064   ... -0.30741  -0.11888   0.032769]]

Initialized Embeddings:
tensor([[-0.0828,  0.6720, -0.1499,  ..., -0.1918, -0.3785, -0.0659],
        [ 0.0120,  0.2075, -0.1258,  ...,  0.1387, -0.3605, -0.0350],
        [ 0.2720, -0.0620, -0.1884,  ...,  0.1302, -0.1832,  0.1323],
        ...,
        [ 0.4481,  0.5580, -0.8870,  ...,  0.3311, -0.0674, -0.2089],
        [-0.6023, -0.3084, -0.2444,  ...,  0.2574, -0.1859, -0.0764],
        [-0.2068,  0.2091,  0.1064,  ..., -0.3074, -0.1189,  0.0328]])

Random Embeddings for [BOS], [EOS], [UNK], and [PAD]:
tensor([[-0.0581, -0.5485,  1.4825,  ...,  0.3752,  0.0261, -0.

### Problem 3d: Define Forward Pass (Code, 15 Points)

Please implement the `forward` method of the `LSTMSentimentClassifier`. When running the forward computation of the model, we do not call `forward` directly, but rather call the `LSTMSentimentClassifier` object as a function. The following snippet shows how this should work.

In [82]:
hidden_states, _ = model.lstm(model.embeddings(batch["text"]))

In [77]:
output[5,].shape

torch.Size([398, 10])

In [96]:
logits.shape

torch.Size([6, 2])

In [95]:
# Get the logit scores for each text 
logits = model(batch["text"], batch["lengths"])

# Visualize the logit scores and probability scores
print("Logit scores:", logits, sep="\n", end="\n\n")
print("Probabilities:", F.softmax(logits, dim=-1), sep="\n")

Logit scores:
tensor([[0.0963, 0.4546],
        [0.0611, 0.4811],
        [0.0781, 0.3523],
        [0.0925, 0.3837],
        [0.0537, 0.4864],
        [0.0843, 0.4594]], grad_fn=<AddmmBackward>)

Probabilities:
tensor([[0.4114, 0.5886],
        [0.3965, 0.6035],
        [0.4319, 0.5681],
        [0.4277, 0.5723],
        [0.3935, 0.6065],
        [0.4073, 0.5927]], grad_fn=<SoftmaxBackward>)


In [104]:
logits.argmax(axis=1).shape, batch['label'].shape

(torch.Size([6]), torch.Size([6]))

In [105]:
logits.argmax(axis=1) == batch['label']

tensor([False, False, False, False, False, False])

In [106]:
batch['label']

tensor([0, 0, 0, 0, 0, 0])

In the output above, the first matrix ("logit scores") shows the output of the model, while the second matrix shows the softmax of the first matrix. Each row corresponds to one text in the batch. Reading the first two rows of the probability matrix:
* the model predicts there is a 41.14% chance that the first text has label `0`, and a 58.86% chance that it has label `1`;
* the model predicts there is a 39.65% chance that the second text has label `0`, and a 60.35% chance that it has label `1`.

Remember that this is a randomly initialized model that has not yet been trained, so these probabilities should not mean anything.

## Problem 4: Training and Evaluation (40 Points in Total)

You will now write code to train and test a sentiment classification model. For this problem, you will implement the `evaluate` and `train` functions in the `train_test` module.

### Problem 4a: Dataset Preparation (No Submission, 0 Points)

Before training and testing a model, you will first prepare the IMDb dataset by applying the data processing code you implemented in Problem 2.

In [27]:
train_data = imdb["train"].with_transform(tokenizer)
val_data = imdb["val"].with_transform(tokenizer)
test_data = imdb["test"].with_transform(tokenizer)

20000

After running the code above, taking a slice from `train_data`, `val_data`, or `test_data` will result in a fully processed batch ready to be used as a model input.

In [28]:
# Taking a batch from the training data
train_data[:3]

{'lengths': tensor([237, 302, 176]),
 'label': tensor([1, 0, 0]),
 'text': tensor([[30000,    12,    69,   601,   279,    21,   167,    56,     5,   167,
              1,   569,     0,    83,    23,    85,  8891,   823,    28,   133,
             28,  4511,    15,    23, 30002,    67,  2861,     2, 30002,     5,
              2,   667,     1,    22,   448,   382,    10,   158,     3,   133,
           6987,    54,    18,    38,  1328,   157,     4,   712,    27,  9536,
              4,  1434,   700,    21,     1,    22,  1713,    23,  5436,     3,
             59,    10,     2,   448,   382,     0,  1456,     4,    26,   568,
            102,   372,   180,     4,   709, 20264,    34,   525,     0,    21,
             20,     6,   667,   161,    43,    10,  3813,     7,     2,   283,
              1,    22,  6368,    10,   158,     3,     2,  2354,     5, 30002,
          10948,     3,   667, 30002,   827,     7,     6,   158,   667,    15,
             10,  5790,   465,   639,    12,  

For the purpose of testing your code, we will also create small versions of the IMDb dataset by taking 1% of the training data.

In [29]:
train_data_small = train_data.shard(num_shards=100, index=0)
val_data_small = train_data_small
test_data_small = train_data_small
print(train_data_small)

Dataset({
    features: ['text', 'label'],
    num_rows: 200
})



Notice in the snippet above that we are using the _same_ small data for training, validation, and testing. This is okay when it is solely being used to test your code, but not when we are actually training and evaluating a full model.

### Problem 4b: Model Evaluation (Code, 10 Points)

Please implement the `evaluate` function in `train_test`. This function should evaluate a model on a given testing dataset, returning its classification accuracy. In order to save memory, the evaluation should occur in mini-batches.

In [116]:
import train_test
reload(train_test)

<module 'train_test' from '/Users/kenzeng/Desktop/College/DSCI/NLU/hw2-main/train_test.py'>

In [108]:
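# Evaluate in mini-batches; gradients are not needed for evaluation
batch_size = 32  # assumed value; not defined in the original cell
test_data = test_data_small
correct = 0
with torch.no_grad():
    for i in range(0, len(test_data), batch_size):
        batch = test_data[i:i + batch_size]
        output = model(batch['text'], batch['lengths'])
        # Count predictions (argmax over the two logits) that match the labels
        batch_correct = torch.sum(output.argmax(axis=1) == batch['label'])
        correct += batch_correct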

In [114]:
correct * 1.0 / len(test_data)

tensor(0.5350)

In [117]:
test_acc = train_test.evaluate(model, test_data_small)
print("Test accuracy before training: {:.3f}".format(test_acc))

100%|██████████| 7/7 [00:01<00:00,  6.49it/s]

Test accuracy before training: 0.535





### Problem 4c: Model Training (Code, 20 Points)

Please implement the `train` function in `train_test`. This function should train a model using a given training set and validation set.
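
Since the validation set is meant to support early stopping, the control flow of such a loop can be sketched independently of PyTorch. Everything in this sketch is an assumption for illustration: the callbacks `run_epoch` and `validate`, the `patience` parameter, and the function name are hypothetical, and the `train` function in the starter code may have a different signature and stopping rule.

```python
def train_with_early_stopping(run_epoch, validate, max_epochs=20, patience=3):
    """Run up to max_epochs, stopping once validation accuracy has not
    improved for `patience` consecutive epochs.

    Returns (best_accuracy, best_epoch).
    """
    best_acc, best_epoch = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        run_epoch()       # one pass over the training data
        acc = validate()  # accuracy on the validation data
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_acc, best_epoch


# Toy usage with a scripted sequence of validation accuracies
scripted = iter([0.60, 0.72, 0.71, 0.70, 0.69, 0.90])
best = train_with_early_stopping(lambda: None, lambda: next(scripted))
print(best)
```

In a real training loop, the model parameters from the best epoch would typically be saved as well, so that they can be restored after stopping.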

In [120]:
# Train the model 
train_test.train(model, train_data_small, val_data_small, max_epochs=20)

# Evaluate the model
test_acc = train_test.evaluate(model, test_data_small)
print("Test accuracy after training: {:.3f}".format(test_acc))

Epoch 1 of 20
Training...
Evaluating on validation data...
Validation accuracy: 1.000
Epoch 2 of 20
Training...
Evaluating on validation data...
Validation accuracy: 1.000
Epoch 3 of 20
Training...
Evaluating on validation data...
Validation accuracy: 1.000
Epoch 4 of 20
Training...
Evaluating on validation data...
Validation accuracy: 1.000
Epoch 5 of 20
Training...
Evaluating on validation data...
Validation accuracy: 1.000
Epoch 6 of 20
Training...
Evaluating on validation data...
Validation accuracy: 1.000
Epoch 7 of 20
Training...
Evaluating on validation data...
Validation accuracy: 1.000
Epoch 8 of 20
Training...
Evaluating on validation data...
Validation accuracy: 1.000
Epoch 9 of 20
Training...
Evaluating on validation data...
Validation accuracy: 1.000
Epoch 10 of 20
Training...
Evaluating on validation data...
Validation accuracy: 1.000
Epoch 11 of 20
Training...
Evaluating on validation data...
Validation accuracy: 1.000
Epoch 12 of 20
Training...
Evaluating on validation data...
Validation accuracy: 1.000
Epoch 13 of 20
Training...
Evaluating on validation data...
Validation accuracy: 1.000
Epoch 14 of 20
Training...
Evaluating on validation data...
Validation accuracy: 1.000
Epoch 15 of 20
Training...
Evaluating on validation data...
Validation accuracy: 1.000
Epoch 16 of 20
Training...
Evaluating on validation data...
Validation accuracy: 1.000
Epoch 17 of 20
Training...
Evaluating on validation data...
Validation accuracy: 1.000
Epoch 18 of 20
Training...
Evaluating on validation data...
Validation accuracy: 1.000
Epoch 19 of 20
Training...
Evaluating on validation data...
Validation accuracy: 1.000
Epoch 20 of 20
Training...


Evaluating on validation data...


100%|██████████| 7/7 [00:01<00:00,  5.57it/s]


Validation accuracy: 1.000


100%|██████████| 7/7 [00:01<00:00,  5.78it/s]

Test accuracy after training: 1.000

Your `train` function must fulfill the following criteria.
* It must use the `nn.CrossEntropyLoss` module as the loss function. (This has already been instantiated for you; see [the documentation](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) for usage details.)
* It must use `optim.Adam` as the optimization algorithm. (This has already been instantiated for you; see [the documentation](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) for usage details.)
* It must save the version of the model with the best validation accuracy to a specified filename, and load this version of the model at the end of training. (See [the documentation on saving and loading models](https://pytorch.org/tutorials/beginner/saving_loading_models.html) for information on how to do this.)
* The `train` function has no return value, but after it has finished running, the model passed to the function must contain the version of the model with the best validation performance.
* An _epoch_ is one complete pass of the `train` function through all the examples in the training data. Training should stop when a certain number of epochs (given by the `max_epochs` hyperparameter) have been completed.
* The `train` function must use early stopping. That is, if a certain number of consecutive epochs (given by the `patience` hyperparameter) have been completed without attaining the highest validation accuracy so far, then training must end immediately even if the maximum number of epochs has not yet been reached.
* The `train` function must allow for tuning the Adam learning rate and the batch size.
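
The control flow implied by the `max_epochs` and `patience` criteria can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the required implementation: `early_stopping_loop` and `run_epoch` are hypothetical names, the validation accuracies are made up, and the choice that a tie with the best accuracy does not reset the patience counter is an assumption that the criteria above do not settle. The comments mark where a PyTorch `train` function would typically call `torch.save` and `load_state_dict`.

```python
from typing import Callable, Tuple


def early_stopping_loop(run_epoch: Callable[[int], float],
                        max_epochs: int,
                        patience: int) -> Tuple[int, float, int]:
    """Run at most max_epochs epochs, stopping early when `patience`
    consecutive epochs pass without a new best validation accuracy.

    run_epoch(epoch) is a hypothetical stand-in that trains for one epoch
    and returns the validation accuracy measured afterwards.
    Returns (best_epoch, best_val_acc, epochs_run).
    """
    best_acc = float("-inf")
    best_epoch = 0
    epochs_without_best = 0
    epochs_run = 0

    for epoch in range(1, max_epochs + 1):
        val_acc = run_epoch(epoch)
        epochs_run = epoch

        if val_acc > best_acc:  # assumption: a tie does not count as a new best
            best_acc = val_acc
            best_epoch = epoch
            epochs_without_best = 0
            # A PyTorch train function would save the checkpoint here, e.g.
            # torch.save(model.state_dict(), filename)
        else:
            epochs_without_best += 1
            if epochs_without_best >= patience:
                break  # early stopping

    # A PyTorch train function would restore the best checkpoint here, e.g.
    # model.load_state_dict(torch.load(filename))
    return best_epoch, best_acc, epochs_run


# Made-up validation accuracies, for illustration only
fake_val_accs = [0.60, 0.72, 0.70, 0.71, 0.69, 0.90]
result = early_stopping_loop(lambda e: fake_val_accs[e - 1],
                             max_epochs=6, patience=3)
print(result)
```

With these made-up numbers, the loop stops after epoch 5 because epochs 3, 4, and 5 all fail to beat the accuracy from epoch 2, so the sixth epoch never runs.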

### Problem 4d: Experiment (Written, 10 Points + 5 EC)

To complete the assignment, you will perform an experiment that measures the impact of initializing the model's word embeddings to pre-trained GloVe embeddings. Using the default hyperparameter settings (or the settings found using hyperparameter tuning, if you choose to do the optional extra credit problem below), please train and test two LSTM sentiment classifiers with an embedding size of 300 and hidden size of 10: one with the embedding matrix initialized to pre-trained GloVe embeddings, and one with the embedding matrix initialized randomly. Report the final test accuracies attained for both models, and comment (1 to 2 sentences) on any differences you observe between the two models.

**Warning:** The code can take a long time to run.

**Optional:** Up to 5 extra credit points may be awarded for hyperparameter tuning. Before completing this problem, please report the highest validation accuracy attained for at least 3 combinations of values for the following hyperparameters: `batch_size`, `max_epochs`, `patience`, and `lr` (the Adam learning rate). At least one of the 3 configurations must be the default settings. Then, complete this problem using the hyperparameter values with the best validation accuracy.
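
One way to organize the optional tuning step is to record the best validation accuracy for each configuration and then pick the maximum. The sketch below assumes a hypothetical `select_best` helper and a hypothetical scoring function. The hyperparameter values are placeholders rather than the assignment's defaults, and the accuracies are made up. In practice the scoring function would train a fresh model with the given settings and return the highest validation accuracy it reached.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def select_best(configs: List[Config],
                best_val_acc: Callable[[Config], float]
                ) -> Tuple[Config, List[Tuple[Config, float]]]:
    """Score every configuration and return the best one plus all results.

    best_val_acc(cfg) is a hypothetical stand-in for "train a fresh model
    with cfg and return its highest validation accuracy".
    """
    results = [(cfg, best_val_acc(cfg)) for cfg in configs]
    best_cfg, _ = max(results, key=lambda pair: pair[1])
    return best_cfg, results


# Placeholder values; substitute the actual default settings for the first entry
configs = [
    {"batch_size": 32, "max_epochs": 10, "patience": 3, "lr": 1e-3},
    {"batch_size": 64, "max_epochs": 10, "patience": 3, "lr": 1e-2},
    {"batch_size": 32, "max_epochs": 20, "patience": 5, "lr": 1e-4},
]


def fake_best_val_acc(cfg: Config) -> float:
    # Made-up accuracies keyed by learning rate, for illustration only
    return {1e-3: 0.80, 1e-2: 0.85, 1e-4: 0.75}[cfg["lr"]]


best_cfg, results = select_best(configs, fake_best_val_acc)
for cfg, acc in results:
    print(cfg, "best validation accuracy: {:.3f}".format(acc))
print("Selected:", best_cfg)
```

Selecting on validation accuracy rather than test accuracy keeps the test set untouched until the final comparison between the two models.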