# HW 4 - All About Attention

Welcome to CS 287 HW4. To begin this assignment first turn on the Python 3 and GPU backend for this Colab by clicking `Runtime > Change Runtime Type` above.  

In this homework you will reproduce the decomposable attention model of Parikh et al. (https://aclweb.org/anthology/D16-1244). This is one of the models that inspired the development of the Transformer.



## Goal

We ask that you finish the following goals in PyTorch:

1. Implement the vanilla decomposable attention model as described in that paper.
2. Implement the decomposable attention model with intra attention or another extension.
3. Visualize the attention weights in the above two parts.
4. Implement a mixture of models with a uniform prior and train it with the exact log marginal likelihood (see below for detailed instructions).
5. Train the mixture of models from part 4 as a VAE. (This may not produce a better model; it is still an open research area.)
6. Use the posterior to interpret which component specializes in which types of examples.

Consult the paper for model architecture and hyperparameters, but you are also allowed to tune the hyperparameters yourself. 

## Setup

This notebook provides a working definition of the setup of the problem itself. You may construct your models inline or use an external setup (preferred) to build your system.

In [10]:
!pip install -q torch torchtext opt_einsum git+https://github.com/harvardnlp/namedtensor

In [2]:
import torch
# Text processing library and methods for pretrained word embeddings
import torchtext
from torchtext.vocab import Vectors, GloVe

# Named Tensor wrappers
from namedtensor import ntorch, NamedTensor
from namedtensor.text import NamedField

In [4]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

The dataset we will use for this problem is the Stanford Natural Language Inference (SNLI) Corpus ( https://nlp.stanford.edu/projects/snli/ ). It is a collection of 570k English sentence pairs labeled with the relationships entailment, contradiction, or neutral, supporting the task of natural language inference (NLI).

To start, `torchtext` requires that we define a mapping from the raw text data to featurized indices. These fields make it easy to map back and forth between readable data and math, which helps for debugging.

In [6]:
# Our input $x$
TEXT = NamedField(names=('seqlen',))

# Our labels $y$
LABEL = NamedField(sequential=False, names=())

Next we load the data. Here we use the standard SNLI train, validation, and test splits, and pass in the fields defined above.

In [7]:
train, val, test = torchtext.datasets.SNLI.splits(
    TEXT, LABEL)


Let's look at this data. It is still in its original form, and each example consists of a premise, a hypothesis, and a label.

In [8]:
print('len(train)', len(train))
print('vars(train[0])', vars(train[0]))

len(train) 549367
vars(train[0]) {'premise': ['A', 'person', 'on', 'a', 'horse', 'jumps', 'over', 'a', 'broken', 'down', 'airplane.'], 'hypothesis': ['A', 'person', 'is', 'training', 'his', 'horse', 'for', 'a', 'competition.'], 'label': 'neutral'}


In order to map this data to features, we need to assign an index to each word and label. The function `build_vocab` allows us to do this and provides useful options that we will need in future assignments.

In [9]:
TEXT.build_vocab(train)
LABEL.build_vocab(train)
print('len(TEXT.vocab)', len(TEXT.vocab))
print('LABEL.vocab', LABEL.vocab)

len(TEXT.vocab) 62998
LABEL.vocab <torchtext.vocab.Vocab object at 0x104429588>


Finally we are ready to create batches of our training data that can be used for training and validating the model. This function produces 3 iterators that will let us go through the train, val and test data. 

In [13]:
train_iter, val_iter, test_iter = torchtext.data.BucketIterator.splits(
    (train, val, test), batch_size=16, device=device, repeat=False)

Let's look at a single batch from one of these iterators.

In [14]:
batch = next(iter(train_iter))
print("Size of premise batch:", batch.premise.shape)
print("Size of hypothesis batch:", batch.hypothesis.shape)
premise = batch.premise.get("batch", 1)
print("Second premise in batch", premise)
print("Converted back to string:", " ".join([TEXT.vocab.itos[i] for i in premise.tolist()]))
hypothesis = batch.hypothesis.get("batch", 1)
print("Second hypothesis in batch", hypothesis)
print("Converted back to string:", " ".join([TEXT.vocab.itos[i] for i in hypothesis.tolist()]))

Size of premise batch: OrderedDict([('seqlen', 22), ('batch', 16)])
Size of hypothesis batch: OrderedDict([('seqlen', 17), ('batch', 16)])
Second premise in batch NamedTensor(
	tensor([  3,  37,  11,  18,  10, 251,   4,   6, 226, 190,  24, 741,   1,   1,
          1,   1,   1,   1,   1,   1,   1,   1]),
	('seqlen',))
Converted back to string: A group of people are laying in the grass under an umbrella. <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
Second hypothesis in batch NamedTensor(
	tensor([ 52, 251,   4,   6, 226,   4,   6, 748,   1,   1,   1,   1,   1,   1,
          1,   1,   1]),
	('seqlen',))
Converted back to string: People laying in the grass in the rain. <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


Similarly, the iterator produces a vector holding the label of each example in the batch.

In [None]:
print("Size of label batch:", batch.label.shape)
example = batch.label.get("batch", 1)
print("Second in batch", example.item())
print("Converted back to string:", LABEL.vocab.itos[example.item()])

Size of label batch: OrderedDict([('batch', 10)])
Second in batch 3
Converted back to string: neutral


Finally the Vocab object can be used to map pretrained word vectors to the indices in the vocabulary.  

In [None]:
# Build the vocabulary with word embeddings
# Out-of-vocabulary (OOV) words are each assigned one of 100 random embeddings
# (the paper hashes them; here one is picked at random), each embedding
# initialized with mean 0 and standard deviation 1 (Sec 5.1)
import random
unk_vectors = [torch.randn(300) for _ in range(100)]
TEXT.vocab.load_vectors(vectors='glove.6B.300d',
                        unk_init=lambda x:random.choice(unk_vectors))
# normalized to have l_2 norm of 1
vectors = TEXT.vocab.vectors
vectors = vectors / vectors.norm(dim=1,keepdim=True)
vectors = NamedTensor(vectors, ('word', 'embedding'))
TEXT.vocab.vectors = vectors
print("Word embeddings shape:", TEXT.vocab.vectors.shape)
print("Word embedding of 'follows', first 10 dim ",
      TEXT.vocab.vectors.get('word', TEXT.vocab.stoi['follows']) \
                        .narrow('embedding', 0, 10))

Word embeddings shape: OrderedDict([('word', 62998), ('embedding', 300)])
Word embedding of 'follows', first 10 dim  NamedTensor(
	tensor([-0.0452, -0.0213,  0.0814,  0.0006, -0.0474,  0.0151, -0.0625, -0.0058,
         0.0476, -0.1896]),
	('embedding',))


## Assignment

Now it is your turn to implement the models described at the top of the assignment using the data given by this iterator. 



### Instructions for the latent variable mixture model

For the last part of this assignment we will consider a latent variable version of this model. Here the latent variable serves as a form of ensembling.

Instead of a single model, we use $K$ models $p(y | \mathbf{a}, \mathbf{b}; \theta_k)$ ($k=1,\cdots,K$), where $\mathbf{a}$ and $\mathbf{b}$ are the two input sentences (premise and hypothesis) and $K$ is a hyperparameter. We introduce a discrete latent variable $c\sim \text{Uniform}(1,\cdots, K)$ denoting which model is used to produce the label $y$. The marginal likelihood is then


$$
p(y|\mathbf{a}, \mathbf{b}; \theta) = \sum_{c=1}^K p(c) p(y | \mathbf{a}, \mathbf{b}; \theta_c)
$$

When $K$ is small, we can *enumerate* all possible values of $c$ to maximize the log marginal likelihood. 
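
As a minimal sketch of this enumeration, assume the $K$ component models have already produced log-probabilities stacked into a plain tensor of shape `(K, batch, classes)`. The names `exact_log_marginal` and `component_log_probs` are hypothetical, and plain `torch` tensors stand in for `NamedTensor` to keep the example short. With the uniform prior $p(c) = 1/K$, the log marginal is a `logsumexp` over components minus $\log K$.

```python
import math
import torch

def exact_log_marginal(component_log_probs):
    """Log marginal likelihood under a uniform prior over K components.

    component_log_probs: tensor of shape (K, batch, classes) holding
        log p(y | a, b; theta_c) for every component c (assumed layout).
    Returns a tensor of shape (batch, classes) with log p(y | a, b; theta).
    """
    K = component_log_probs.shape[0]
    # log sum_c (1/K) p(y | a, b; theta_c) = logsumexp_c(log p_c) - log K
    return torch.logsumexp(component_log_probs, dim=0) - math.log(K)

# Toy usage: random logits stand in for K = 3 trained models.
torch.manual_seed(0)
logits = torch.randn(3, 4, 3)                        # (K, batch, classes)
component_log_probs = torch.log_softmax(logits, dim=-1)
log_marginal = exact_log_marginal(component_log_probs)
labels = torch.tensor([0, 2, 1, 1])                  # gold label per example
# Negative log marginal likelihood of the gold labels, averaged over the batch.
loss = -log_marginal.gather(1, labels.unsqueeze(1)).mean()
```

Working in log space with `logsumexp` avoids underflow when individual components assign very small probabilities to the gold label.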

We can also train more efficiently with a variational autoencoder (VAE) objective. We first introduce an inference network $q(c| y, \mathbf{a}, \mathbf{b})$, and the ELBO is

$$
\log p(y|\mathbf{a}, \mathbf{b}; \theta)  \ge \mathbb{E}_{c \sim q(c|y, \mathbf{a}, \mathbf{b})} \log p(y|\mathbf{a},\mathbf{b}; \theta_c) - KL(q(c|y, \mathbf{a}, \mathbf{b})|| p(c)),
$$

where $p(c)$ is the uniform prior. The $KL$ term can be computed in closed form. For the first term of the ELBO, however, $c$ is discrete, so we cannot use the reparameterization trick. Instead we use REINFORCE to estimate the gradients (see also the slides):

$$
\nabla \mathbb{E}_{c \sim q(c|y, \mathbf{a}, \mathbf{b})} \log p(y|\mathbf{a},\mathbf{b}; \theta_c) = \mathbb{E}_{c \sim q(c|y, \mathbf{a}, \mathbf{b})} \left [\nabla \log p(y|\mathbf{a},\mathbf{b}; \theta_c) + \log p(y|\mathbf{a},\mathbf{b}; \theta_c)  \nabla \log q(c|y, \mathbf{a}, \mathbf{b})\right]
$$
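
One way to obtain this gradient from automatic differentiation is a surrogate loss, sketched below under some assumptions. The function name `reinforce_elbo_loss` and the tensor layouts are hypothetical, `gold_log_probs` is assumed to hold $\log p(y|\mathbf{a},\mathbf{b}; \theta_c)$ at the observed label for every $c$, and a single sample of $c$ is drawn per example. The closed-form term uses $KL(q \| p) = \sum_c q(c) \log q(c) + \log K$ for the uniform prior.

```python
import math
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

def reinforce_elbo_loss(q_logits, gold_log_probs):
    """Negative surrogate ELBO whose gradient is the REINFORCE estimate.

    q_logits: (batch, K) unnormalized scores of q(c | y, a, b).
    gold_log_probs: (batch, K) log p(y | a, b; theta_c) at the observed y.
    Returns the scalar loss and the per-example KL term.
    """
    K = q_logits.shape[-1]
    q = Categorical(logits=q_logits)
    c = q.sample()                                   # one component per example
    log_p_c = gold_log_probs.gather(1, c.unsqueeze(1)).squeeze(1)
    log_q_c = q.log_prob(c)
    # KL(q || Uniform) = sum_c q(c) log q(c) + log K = log K - H(q)
    kl = math.log(K) - q.entropy()
    # log_p_c contributes grad log p; the detached copy times log_q_c
    # contributes the score-function term log p * grad log q.
    surrogate = log_p_c + log_p_c.detach() * log_q_c - kl
    return -surrogate.mean(), kl

# Toy usage: theta stands in for the parameters of the K models.
torch.manual_seed(0)
q_logits = torch.randn(4, 3, requires_grad=True)
theta = torch.randn(4, 3, requires_grad=True)
gold_log_probs = F.logsigmoid(theta)                 # any valid log-probabilities
loss, kl = reinforce_elbo_loss(q_logits, gold_log_probs)
loss.backward()
```

Only the gradient of this surrogate is meaningful; its value is not an estimate of the ELBO because of the extra detached term. In a real training loop, only the sampled component needs to be evaluated, which is where the efficiency gain over enumeration comes from.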


At inference time, we compute $p(y|\mathbf{a}, \mathbf{b}; \theta)$ exactly by enumeration. For posterior inference, we can either use $q(c| y, \mathbf{a}, \mathbf{b})$ to approximate the true posterior or use Bayes' rule to calculate the posterior exactly.
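
A minimal sketch of the exact posterior, assuming the per-component log-likelihoods of the observed label are available as a `(batch, K)` tensor (the name `exact_posterior` and this layout are hypothetical). Because the prior is uniform, it cancels in Bayes' rule, and $p(c | y, \mathbf{a}, \mathbf{b})$ reduces to a softmax over $c$ of $\log p(y|\mathbf{a},\mathbf{b}; \theta_c)$.

```python
import torch

def exact_posterior(gold_log_probs):
    """Exact posterior p(c | y, a, b) under a uniform prior.

    gold_log_probs: (batch, K) log p(y | a, b; theta_c) at the observed y.
    Returns a (batch, K) tensor whose rows sum to 1.
    """
    # p(c | y) = p(c) p(y | c) / sum_c' p(c') p(y | c'); the constant
    # prior 1/K cancels, leaving a softmax over the component axis.
    return torch.softmax(gold_log_probs, dim=-1)

# Toy usage: two examples, K = 3 components.
gold_log_probs = torch.log(torch.tensor([[0.9, 0.3, 0.3],
                                         [0.2, 0.2, 0.6]]))
posterior = exact_posterior(gold_log_probs)
# posterior is approximately [[0.6, 0.2, 0.2], [0.2, 0.2, 0.6]]
```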

To interpret what specialized knowledge each component $c$ learns, we can find those examples whose posterior reaches maximum at $c$. 
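
As one possible way to organize this analysis, the sketch below groups example indices by the component with the highest posterior. The helper `group_by_component` is hypothetical, and the posterior matrix here is made up for illustration.

```python
from collections import defaultdict
import torch

def group_by_component(posterior):
    """Map each component index to the examples whose posterior peaks there.

    posterior: (num_examples, K) tensor of p(c | y, a, b).
    Returns a dict {component index: [example indices]}.
    """
    groups = defaultdict(list)
    for example_idx, c in enumerate(posterior.argmax(dim=-1).tolist()):
        groups[c].append(example_idx)
    return dict(groups)

# Toy usage: four examples, K = 3 components.
posterior = torch.tensor([[0.6, 0.2, 0.2],
                          [0.1, 0.1, 0.8],
                          [0.7, 0.2, 0.1],
                          [0.2, 0.5, 0.3]])
groups = group_by_component(posterior)
# groups == {0: [0, 2], 2: [1], 1: [3]}
```

The grouped indices can then be used to look up the original premise, hypothesis, and label of each example and inspect what the examples assigned to one component have in common.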

Once a model is trained, use the following test function to produce predictions, and then upload your best result to the Kaggle competition:  https://www.kaggle.com/c/harvard-cs287-s19-hw4

In [None]:
def test_code(model):
    "All models should be able to be run with following command."
    upload = []
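    # Update: for Kaggle the bucket iterator needs to have batch_size 10
    test_iter = torchtext.data.BucketIterator(test, train=False, batch_size=10,
                                              device=device)
    for batch in test_iter:
        # Your prediction data here (don't cheat!)
        # SNLI batches expose `premise` and `hypothesis` (there is no `text`
        # field); this assumes your model takes both as arguments.
        probs = model(batch.premise, batch.hypothesis)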
        # here we assume that the name for dimension classes is `classes`
        _, argmax = probs.max('classes')
        upload += argmax.tolist()

    with open("predictions.txt", "w") as f:
        for u in upload:
            f.write(str(u) + "\n")

In addition, you should put up a (short) write-up following the template provided in the repository:  https://github.com/harvard-ml-courses/nlp-template