# Homework 3: Recurrent Neural Networks and Transformers
#### CSCI 3832 Natural Language Processing

Julia Troni

julia.troni@colorado.edu or jutr6738@colorado.edu

## Section 1: Recurrent Neural Networks

In this section, we'll revisit the sentiment analysis problem one final time, this time using an LSTM to classify the movie reviews. 

In [1]:
# Necessary Imports

import os, random, sys, copy
import torch, torch.nn as nn, numpy as np
from tqdm.notebook import tqdm
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from nltk.tokenize import word_tokenize

As before, load the Glove embeddings:

In [2]:
glove_file = 'glove.6B.50d.txt'

embeddings_dict = {}

with open(glove_file, 'r', encoding='utf8') as f:
    for i, line in enumerate(f):
        line = line.strip().split(' ')
        word = line[0]
        embed = np.asarray(line[1:], "float")

        embeddings_dict[word] = embed

print('Loaded {} words from glove'.format(len(embeddings_dict)))

low = -1.0 / 3
high = 1.0 / 3
embedding_matrix = np.random.uniform(low=low, high=high, size=(len(embeddings_dict)+1, 50))

word2id = {}
for i, word in enumerate(embeddings_dict.keys(), 1):

    word2id[word] = i                                
    embedding_matrix[i] = embeddings_dict[word]      

word2id['<pad>'] = 0

Loaded 400000 words from glove


There are some differences from the previous homework, namely that we set the `<pad>` token to be the 0th word in our vocabulary. This is important when working with recurrent networks, as it PyTorch specifically expects that value when calculating losses and packing sequences (unless you specify otherwise). 

Furthermore, we're no longer initializing our embedding matrix as a matrix of zeros. This doesn't have a practical effect for this homework, since our vocabulary is exactly that of the Glove vocabulary, however it becomes important if we want to add tokens which are not found in Glove (what happens if we embed a word as a vector of zeros?). 

In [3]:
class RNNMovieReviewDataset(torch.utils.data.Dataset):

    def __init__(self, directory=None, split=None, word2id=None, finalized_data=None, data_limit=250, max_length=256):
        """


        :param directory: The location of aclImdb
        :param split: Train or test
        :param word2id: The generated glove word2id dictionary
        :param finalized_data: We'll use this to initialize a validation set without reloading the data.
        :param data_limit: Limiter on the number of examples we load
        :param max_length: Maximum length of the sequence
        """

        self.data_limit = data_limit
        self.max_length = max_length
        self.word2id = word2id

        if finalized_data:
            self.data = finalized_data

        else:

            pos_dir = directory + '{}/pos/'.format(split)
            neg_dir = directory + '{}/neg/'.format(split)

            pos_examples = self.read_folder(pos_dir)
            neg_examples = self.read_folder(neg_dir)

            pos_examples_tokenized = [(ids, 1) for ids in self.tokenize(pos_examples)]
            neg_examples_tokenized = [(ids, 0) for ids in self.tokenize(neg_examples)]

            self.data = pos_examples_tokenized + neg_examples_tokenized
            random.seed(42)
            random.shuffle(self.data)

    def read_folder(self, folder):
        examples = []
        files = os.listdir(folder)
        files.sort()
        for fname in files[:self.data_limit]:
            with open(os.path.join(folder, fname), encoding='utf8') as f:
                examples.append(f.readline().strip())
        return examples

    def tokenize(self, examples):

        example_ids = []
        misses = 0
        total = 0
        for example in tqdm(examples):
            tokens = word_tokenize(example)
            ids = []
            for tok in tokens:
                if tok in word2id:
                    ids.append(word2id[tok])
                else:
                    misses += 1
                    ids.append(word2id['unk'])
                total += 1

            if len(ids) >= self.max_length:
                ids = ids[:self.max_length]
                length = self.max_length
            else:
                length = len(ids)
                ids = ids + [word2id['<pad>']]*(self.max_length - len(ids))

            example_ids.append((torch.tensor(ids), length))
        print('Missed {} out of {} words -- {:.2f}%'.format(misses, total, misses/total))
        return example_ids

    def generate_validation_split(self, ratio=0.8):

        split_idx = int(ratio * len(self.data))

        # Take a chunk of the processed data, and return it in order to initialize a validation dataset
        validation_split = self.data[split_idx:]

        #We'll remove this data from the training data to prevent leakage
        self.data = self.data[:split_idx]

        return validation_split


    def __getitem__(self, item):
        return self.data[item]

    def __len__(self):
        return len(self.data)

We'll also initialize our movie review dataset again, but with another slight change: while we're still padding the examples to the maximum length we describe, we need to keep track of the original length. This will help us save on some computations down the line. 

Now we'll load a train and validation dataset. We won't be training any model's in this homework, so we'll only load a couple of examples to use. 

In [4]:
train_dataset = RNNMovieReviewDataset('aclImdb/', 'train', word2id, data_limit=100)
validation_examples = train_dataset.generate_validation_split()
print('Loaded {} train examples'.format(train_dataset.__len__()))

valid_dataset = RNNMovieReviewDataset(finalized_data=validation_examples, word2id=word2id)
print('Loaded {} validation examples'.format(valid_dataset.__len__()))

print(valid_dataset[0])

  0%|          | 0/100 [00:00<?, ?it/s]

Missed 3285 out of 28372 words -- 0.12%


  0%|          | 0/100 [00:00<?, ?it/s]

Missed 3757 out of 29419 words -- 0.13%
Loaded 160 train examples
Loaded 40 validation examples
((tensor([201535,     10,      8,   5349,     13,     95,     34,    493,    242,
          4588,    589,      2,    455,      1,    249,   3499,      3, 201535,
         17377,   8210,      1,   1000,      4,   1614,   2833,    754,      6,
          4582,  10380,      3, 201535,   6096,  17377, 201535,      8,   2324,
           965,      5,   1281,     27,    761,      5,      3, 201535,     64,
             2,     54,   1817,    102,    132,      1,  38083,      4,     45,
          5297,      2,     35,    416,    215,    386,      1,    621,      3,
        201535,    329,      5,    329,      2,     19,  10348,      6,     32,
            78,   1715,      4,  17146,      3, 201535,      1,    247,      6,
             1,   7659,      2,      1,    301,     15,    922,      6, 201535,
         19796,  30411,    275,  12258,  19796,  30411,    275,  12258, 201535,
             2,      1

Initialize the prediction function and training loop as before

In [5]:
def predict(model, valid_dataloader):

    sigmoid = nn.Sigmoid()

    total_correct = 0
    total_examples = len(valid_dataloader.dataset)

    for (x, x_lengths), y in valid_dataloader:
        
        output = sigmoid(model(x, x_lengths))
        
        for i in range(output.shape[0]):
            if (output[i] < 0.5 and y[i] == 0) or (output[i] >= 0.5 and y[i] == 1):
                total_correct += 1

    accuracy = total_correct / total_examples
    print('accuracy: {}'.format(accuracy))
    return accuracy

In [6]:
def train_lstm_classification(model, train_dataset, valid_dataset, epochs=10, batch_size=32, learning_rate=.001, print_frequency=25):

    criteria = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)


    epochs = epochs
    batch_size = batch_size
    print_frequency = print_frequency

    #We'll create an instance of a torch dataloader to collate our data. This class handles batching and shuffling (should be done each epoch)
    train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=False)
    valid_dataloader = torch.utils.data.DataLoader(valid_dataset, batch_size=128, shuffle=False)

    print('Total train batches: {}'.format(train_dataset.__len__() / batch_size))

    best_accuracy = 0.0
    best_model_sd = None

    for i in range(epochs):
        print('### Epoch: ' + str(i+1) + ' ###')
    
        model.train()

        avg_loss = 0

        for step, data in enumerate(train_dataloader):

            (x, x_lengths), y = data	# Our dataset is returning the input example x and also the lengths of the examples, so we'll unpack that here

            optimizer.zero_grad()

            model_output = model(x, x_lengths)

            loss = criteria(model_output.squeeze(1), y.float())

            loss.backward()
            optimizer.step()

            avg_loss += loss.item()

            if step % print_frequency == (print_frequency - 1):
                print('epoch: {} batch: {} loss: {}'.format(
                    i,
                    step,
                    avg_loss / print_frequency
                ))
                avg_loss = 0

        print('Evaluating...')
        model.eval()
        with torch.no_grad(): 
            acc = predict(model, valid_dataloader)
            if acc > best_accuracy: #all diff
                best_model_sd = copy.deepcopy(model.state_dict())
                best_accuracy = acc

    return model.state_dict(), best_model_sd

There are some differences in the training loop compared to the previous homework, namely with how we evaluate during training.

**Question (5 points)**: Identify these changes, and describe why might they be useful (consider what might happen if we train for a large number of epochs) 

_Your answer here_
The differences compared to the previous homework: 
Tracking the best accuracy and best model 

<code> if acc > best_accuracy: #all diff
                best_model_sd = copy.deepcopy(model.state_dict())
                best_accuracy = acc
                </code>

We'll now define our LSTM model:

In [8]:
class LSTMModel(nn.Module):

    def __init__(self, embedding_matrix, lstm_hidden_size=50, num_lstm_layers=1, bidirectional=True):

        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(torch.FloatTensor(embedding_matrix))
        self.lstm = nn.LSTM(input_size = embedding_matrix.shape[1],
                            hidden_size = lstm_hidden_size,
                            num_layers = num_lstm_layers,
                            bidirectional = bidirectional,
                            batch_first = True)
        
        self.hidden_1 = nn.Linear(lstm_hidden_size * 2, lstm_hidden_size)
        self.hidden_2 = nn.Linear(lstm_hidden_size, 1)
        self.num_directions = 2 if bidirectional else 1
        self.relu = nn.ReLU()

    def forward(self, input_batch, input_lengths):
        
        print('Input batch shape: {}'.format(input_batch.shape))
        embedded_input = self.embedding(input_batch)
        
        print('Embedded input shape: {}'.format(embedded_input.shape))
        packed_input = pack_padded_sequence(embedded_input, input_lengths, batch_first=True, enforce_sorted=False)

        packed_output, (hn, cn) = self.lstm(packed_input) # See docs linked below for description of hn.shape
        print('hn shape: {}'.format(hn.shape))
  
        hn_view = hn.view(self.lstm.num_layers, self.num_directions, input_batch.shape[0], self.lstm.hidden_size) # Reshape hn for clarity -- first dimension now represents each layer (total set by num_lstm_layers)
        print('hn_view shape: {}'.format(hn_view.shape))
        
        hn_view_last_layer = hn_view[-1]   # Taking the last layer for our final LSTM output
        print('hn_view_last_layer shape: {}'.format(hn_view_last_layer.shape))
       
        hn_cat = torch.cat([hn_view_last_layer[-2, :, :], hn_view_last_layer[-1, :, :]], dim=1) # Each layer has two directions. We want to use both of these vectors, so concatenate them
        print('hn_cat shape: {}'.format(hn_cat.shape))
  
        hid = self.relu(self.hidden_1(hn_cat))
        print('hid shape: {}'.format(hid.shape))
  
        output = self.hidden_2(hid)
        print('output shape: {}'.format(output.shape))
  
        # raise KeyError
        
        return output

Modify the forward pass above to print the shapes of the following variables: `input_batch`, `embedded_input`, `hn`, `hn_view`, `hn_view_last_layer`, `hn_cat`, `hid`, `output`. **(5 pts)**

Initialize the model and run the model using the block below to print out the shapes (stop the training loop after one forward pass for cleaner outputs). Then, with references to the shapes, describe what each step of the forward pass is doing. **(40 pts)**

Hint: For the output shape of the LSTM, remember that we are using a bidrectional LSTM with 2 layers. This means that there is a forward and backward final hidden state for each layer. In our code, `hn` represents the output from the _final timestep_ of the sequence. In general, the first dimension of `hn` when using a multi-layer bidirectional LSTM will be `(layer_1_forward, layer_1_backward, layer_2_forward, layer_2_backward, layer_3_forward, layer_4_backward, ..., layer_n_backward)`.

See [here](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM) for the LSTM documentation (check the output section) -- for our code, `batch_first = True`. 

In [9]:
model = LSTMModel(embedding_matrix, lstm_hidden_size=50, num_lstm_layers=2, bidirectional=True)
model, best_model = train_lstm_classification(model, train_dataset, valid_dataset, batch_size=128, epochs=1)

Total train batches: 1.25
### Epoch: 1 ###
Input batch shape: torch.Size([128, 256])
Embedded input shape: torch.Size([128, 256, 50])
hn shape: torch.Size([4, 128, 50])
hn_view shape: torch.Size([2, 2, 128, 50])
hn_view_last_layer shape: torch.Size([2, 128, 50])
hn_cat shape: torch.Size([128, 100])
hid shape: torch.Size([128, 50])
output shape: torch.Size([128, 1])
Input batch shape: torch.Size([32, 256])
Embedded input shape: torch.Size([32, 256, 50])
hn shape: torch.Size([4, 32, 50])
hn_view shape: torch.Size([2, 2, 32, 50])
hn_view_last_layer shape: torch.Size([2, 32, 50])
hn_cat shape: torch.Size([32, 100])
hid shape: torch.Size([32, 50])
output shape: torch.Size([32, 1])
Evaluating...
Input batch shape: torch.Size([40, 256])
Embedded input shape: torch.Size([40, 256, 50])
hn shape: torch.Size([4, 40, 50])
hn_view shape: torch.Size([2, 2, 40, 50])
hn_view_last_layer shape: torch.Size([2, 40, 50])
hn_cat shape: torch.Size([40, 100])
hid shape: torch.Size([40, 50])
output shape: torc

_Your answer here_

1. `input_batch` - shape: [128, 256] - The first dimension is 128 as this represents the batch size. For each element in the batch, we have set the maximum sequence length to be 256
2. `embedded_input` - shape: [128, 256, 50] - After passing `input_batch` through the embedding layer, for each example in the batch, every input token is now represented by a vector of size [50]
3. `hn` - shape: 
4. `hn_view` - shape: 
5. `hn_view_last_layer` = shape: 
6. `hn_cat` - shape: 
7. `hid` - shape: 
8. `output` - shape: 

A couple things to note when training neural networks:
- Models can be sensitive to the random seed/starting initialization for the weights. To measure this sensitivity, we can train a number of models with different initializations. To get an overall performance for our model, we'll calculate the accuracy of each instance on the test set, and report the mean. 
- With enough training, models will almost always begin to overfit to the training set. The most common hallmark of this is if you notice your validation accuracy start to decrease, while your training accuracy continues to improve. To prevent this, we'll keep track of the validation accuracies throughout training -- every time we see a better validation accuracy than previously seen, we'll save the model weights at that time. This way, after training we'll have two sets of model weights: the one with best validation accuracy, and the one after the complete training loop.


We've trained 5 LSTM models on the full movie reviews training set (each with a different seed). You can download the model weights from canvas (check for a link in the assignment). The codeblock below shows how to load the saved weights:

In [10]:
# First initialize the model -- right now it'll be random. Don't change these parameters, or the saved weights won't load
loaded_model_example = LSTMModel(embedding_matrix, lstm_hidden_size=50, num_lstm_layers=2, bidirectional=True)

# Set the weights using the saved state dictionary
loaded_model_example.load_state_dict(torch.load('models/model_0.pt'))

# The state dictionary is a dictionary matching layers with saved weights
example_state_dict = torch.load('models/model_0.pt')
print(example_state_dict.keys())

# Since we are only using the model for evaluation, we'll set it in evaluation mode so any potential normalization or dropout layers work correctly
loaded_model_example.eval()

# Now we can evaluate the model on the validation set
valid_dataloader = torch.utils.data.DataLoader(valid_dataset, batch_size=128, shuffle=False) 
predict(loaded_model_example, valid_dataloader)

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

Each of the 5 trained models (i.e. model_0 .. model_4) has two versions: the fully trained one and the one saved from the best checkpoint (denoted with _best). 

Complete the following steps **(20 pts)**:

1. Load the full test dataset and create a dataloader (example above).
2. Load all 10 models, and calculate the accuracy of each model on the test set. 
3. For each model, report the difference in accuracy between the best checkpoint and fully trained model. Report the mean difference across all models.
4. Report the mean and standard deviation of accuracies of the fully trained models.
5. Report the mean and standard deviation of accuracies of the best-checkpoint models. 
6. Create a box-and-whisker plot comparing the the distributions of accuracies across the best-checkpoint and fully trained model. (x-axis will be one of [best-checkpoint|fully-trained], and y-axis will the the box and whisker)

In [None]:
'''

Your code here

'''

## Section 2: Transformers with Huggingface

Huggingface is a high-level library which abstracts away most of the implementations we've done in the homeworks -- this includes tokenization, creating a vocabulary, initializing a model, the training loop, saving/loading models, and evaluation. We'll provide a brief introduction to how the library works.

For this section, we'll focus on another task: question answering. If you haven't before, install the `transformers`, `datasets`, and `evaluate` libraries. 

In [3]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
from datasets import load_dataset
from evaluate import load
from tqdm.notebook import tqdm

import random

There are different flavors of how QA datasets are formulated. Some are _extractive_, where the answers can be found in the context, and are represented by start/end spans. Others are _multiple choice_, where the model is given answer choices and asked to classify them. For today, we'll focus on extractive QA. 

We'll load [SQuADv1.1](https://rajpurkar.github.io/SQuAD-explorer/), which is a popularly used QA dataset from Stanford. Examples consist of a context, question, and answer.

In [4]:
squad_dataset = load_dataset('squad', split='validation') # Makes the process of loading datasets much easier than before
squad_dataset = squad_dataset.select(random.choices([i for i in range(len(squad_dataset))], k=1000))

print('Loaded {} examples'.format(len(squad_dataset)))
print(squad_dataset[0]['context'])
print('Q: ' + squad_dataset[0]['question'])
print('A: ' + squad_dataset[0]['answers']['text'][0])

# Load evaluation metric
squad_evaluate = load('squad')



Downloading builder script:   0%|          | 0.00/5.27k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.36k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.67k [00:00<?, ?B/s]

Downloading and preparing dataset squad/plain_text to C:/Users/julia/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/8.12M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.05M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

Dataset squad downloaded and prepared to C:/Users/julia/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453. Subsequent calls will reuse this data.
Loaded 1000 examples
The league announced on October 16, 2012, that the two finalists were Sun Life Stadium and Levi's Stadium. The South Florida/Miami area has previously hosted the event 10 times (tied for most with New Orleans), with the most recent one being Super Bowl XLIV in 2010. The San Francisco Bay Area last hosted in 1985 (Super Bowl XIX), held at Stanford Stadium in Stanford, California, won by the home team 49ers. The Miami bid depended on whether the stadium underwent renovations. However, on May 3, 2013, the Florida legislature refused to approve the funding plan to pay for the renovations, dealing a significant blow to Miami's chances.
Q: What was the other finalist besides Levi's Stadium?
A: Sun Life Stadium


Downloading builder script:   0%|          | 0.00/4.53k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.32k [00:00<?, ?B/s]

Define the model and tokenizer.

The library offers `pipelines` which handle the conversion of examples to model inputs for us. The `evaluate` library has a wrapped version of the official evaluation script which we can use. 

In [None]:
model_name = 'deepset/roberta-base-squad2'

def evaluate_hf_model(model_name):
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)       # Initialize the model
    tokenizer = AutoTokenizer.from_pretrained(model_name)                   # Initialize the tokenizer

    processor = pipeline('question-answering', model=model, tokenizer=tokenizer)

    def dataset_generator(dataset):
        for ex in dataset:
            yield (ex,
                {'question' : ex['question'], 'context': ex['context']})
            
    predictions = []
    references = []

    # Get predictions, and save corresponding reference (if we were using the whole dataset, we wouldn't need this step)
    for ex in tqdm(dataset_generator(squad_dataset), total=len(squad_dataset)):

        predictions.append({
                'id' : ex[0]['id'],
                'prediction_text' : processor(ex[1])['answer']
        }
        )

        # In each example, there are multiple possible answers which we compare to. Here we are converting from them from the datasets format to the one expected by the evaluation metric. 
        references.append({
            'id' : ex[0]['id'],
            'answers' : [{'text' : z[0], 'answer_start' : z[1]} for z in zip(ex[0]['answers']['text'], ex[0]['answers']['answer_start'])]
        })

    # Compute metrics
    print('Performance of {} : {}'.format(model_name, squad_evaluate.compute(predictions=predictions, references=references)))

evaluate_hf_model(model_name)

Downloading:   0%|          | 0.00/571 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Downloading:   0%|          | 0.00/496M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/772 [00:00<?, ?B/s]

  0%|          | 0/1000 [00:00<?, ?it/s]

Explore the [SQuAD dataset page](https://huggingface.co/datasets/squad) on Huggingface. There, you can find models which have been finetuned on the dataset. Load a 3+ models using `model_name` and evaluate them. Plot their performance below (F1 scores), and include a brief description of the model, and what sets it apart from the others you are evaluating (this could be the number of parameters, the training objective, training time, etc.). This information should be available on the model card in Huggingface, or in an associated paper if it exists.  

Some ideas: look at distilled/compressed models, models with different training objectives, models that use different training data, etc. 

In [None]:
'''
Code for plot here (10 pts.)

'''

_Describe models here (20 pts.)_