# CS585: Homework 3

**This HW is due on Dec 5th, 2019, submitted via Gradescope as a PDF (File > Print > Save as PDF). 90 points total.**


**IMPORTANT**: After copying this notebook to your Google Drive, please add a link to it below. To get a publicly-accessible link, hit the Share button at the top right, then click "Get shareable link" and copy over the result. If you fail to do this, you will receive no credit for this homework!
LINK: **paste your link here**



##### How to do this problem set:

- Most of these questions require writing Python code and computing results, and the rest of them have textual answers. To generate the answers, you will have to fill out all code-blocks that say `ENTER CODE HERE`.

- Write each textual answer under the placeholder text that says `Enter your answers here`.
 
- Some experiments may take a long time to run (those involving ELMo and BERT), so **start early**!

##### How to submit this problem set:
- Write all the answers in this IPython notebook. Once you are finished, generate a PDF via (File -> Print -> Save as PDF) and upload it to Gradescope.
  
- **Important:** check your PDF before you turn it in to Gradescope to make sure it exported correctly. If Colab gets confused about your syntax, it will sometimes terminate the PDF creation routine early.

- When creating your final version of the PDF to hand in, please do a fresh restart and execute every cell in order. Then you'll be sure it's actually right. One handy way to do this is by clicking `Runtime -> Run All` in the notebook menu.

##### Academic honesty 

- We will audit the Colab notebooks from a set number of students, chosen at random. The audits will check that the code you wrote actually generates the answers in your PDF. If you turn in correct answers on your PDF without code that actually generates those answers, we will consider this a serious case of cheating. See the course page for honesty policies.

- We will also run automatic checks of Colab notebooks for plagiarism. Copying code from others is also considered a serious case of cheating.

To get started, first run the following cell to create a PyDrive client and download data to your own Google Drive.






### Overview
In this homework, we will be studying probe tasks. Probe tasks are special tasks designed to interpret neural networks (especially, deep networks or word embeddings in NLP). Probe tasks generally use simple classifiers and special datasets to analyze the linguistic content stored in dense representations.

We will also use this assignment to learn AllenNLP (https://github.com/allenai/allennlp), an excellent tool to build deep models for NLP and seamlessly integrate pretrained word embeddings.
(NOTE - This assignment is written using `allennlp` as a Python library. A faster way of using `allennlp` is via JSONNET configuration files, as described [here](https://github.com/allenai/allennlp/blob/master/tutorials/getting_started/walk_through_allennlp/configuration.md). However, learning to use AllenNLP as a library is essential when you want to write custom AllenNLP modules. [This](https://allennlp.org/tutorials) is a good tutorial on using AllenNLP as a library.)

Let's start by setting up Google Drive to download data.

(*Run the cell below. No need to edit any code here.*)

In [None]:
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
print('success!')

### Install AllenNLP

Run the cell below once per session to install `allennlp` locally. You might need to restart the runtime after doing this. No need to re-run it for a different runtime in the same session.

(*Run the cell below. No need to edit any code here.*)

In [None]:
# can take about a minute
!pip install allennlp

### Subject-Verb Agreement

In this assignment, we will design a probe task to test whether word embeddings capture subject-verb agreement. For instance, English requires the singular verb in this sentence:


*   CORRECT - This assignment **is** very easy.
*   WRONG - This assignment **are** very easy.

Subject-verb agreement is an important probe task in the NLP literature (see [Linzen et al. 2016](https://arxiv.org/abs/1611.01368) and [Gulordava et al. 2018](https://arxiv.org/abs/1803.11138)). We will be using the evaluation set described in [Linzen et al. 2016](https://arxiv.org/abs/1611.01368), but formulating the problem in a different way.

We start by downloading our dataset. Each `inp` variable represents an English sentence. Each `out` variable has four parts, a) index of verb b) correct verb form c) wrong verb form d) number of agreement attractors (we will not need this data for our experiments, but you should read about them in [Linzen et al. 2016](https://arxiv.org/abs/1611.01368) to help you explain some of the results we obtain).
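
To make the record format concrete, here is a minimal sketch of how one `inp`/`out` pair could be unpacked. The sentence and values are made up for illustration, and the names `AgreementTarget` and `parse_target` are hypothetical helpers, not part of the assignment code. It assumes that `out` is tab-separated, that the attractor count is an integer, and that the last token of `inp` is an end marker that gets dropped.

```python
from typing import NamedTuple


class AgreementTarget(NamedTuple):
    position: int        # index of the verb in the tokenized sentence
    correct: str         # verb form that agrees with the subject
    wrong: str           # verb form that does not agree
    num_attractors: int  # number of agreement attractors (assumed to be an int)


def parse_target(out: str) -> AgreementTarget:
    """Split one tab-separated `out` record into its four parts."""
    position, correct, wrong, attractors = out.split('\t')
    return AgreementTarget(int(position), correct, wrong, int(attractors))


# Made-up record for illustration; real records come from agreement.pkl.
inp = "this assignment is very easy <eos>"
out = "2\tis\tare\t0"

target = parse_target(out)
tokens = inp.split()[:-1]  # drop the final token, assumed to be an end marker
assert tokens[target.position] == target.correct

# Swap in the wrong verb form to build the ungrammatical variant.
wrong_tokens = list(tokens)
wrong_tokens[target.position] = target.wrong
print(" ".join(wrong_tokens))  # this assignment are very easy
```

Each sentence therefore gives rise to one grammatical and one ungrammatical variant that differ only at the verb position.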

(*Run the cell below. No need to edit any code here.*)


In [None]:
import pickle

f_agreement = drive.CreateFile({'id': '1S1_RQQHTwwf0F6IM2MmbIr4inINyNbsQ'})
f_agreement.GetContentFile('./agreement.pkl') 


with open('./agreement.pkl', 'rb') as f_in:
    agreement_inp, agreement_output = pickle.load(f_in)


agreement_inp = agreement_inp.strip().split('\n')
agreement_output = agreement_output.strip().split('\n')

data_points = [(inp, out) for inp, out in zip(agreement_inp, agreement_output)]
print("Length of all data = %d" % len(data_points))

# Splitting into 10%-15% train-valid splits. We are not using the full dataset for computational reasons.
# We take a small training dataset since we want to assess the linguistic knowledge stored in pretrained embeddings.
len_data = len(data_points)
train_data = data_points[0 : int(0.1 * len_data)]
valid_data = data_points[int(0.1 * len_data) : int(0.25 * len_data)]
# Actual dataset sizes will be twice this, since each sentence yields both a positive and a negative example
print("train, valid data lengths = %d, %d" % (len(train_data), len(valid_data)))
# print(train_data)
print(train_data[1])


### AllenNLP DatasetReader
We next write a `DatasetReader` in `allennlp`, which is essentially a PyTorch `Dataset` with some syntactic sugar. The use of `Field` objects is essential to allow seamless padding and integration with the rest of the `allennlp` pipeline.

We will model our probe task as a binary classification task, where given an input sentence, a network has to determine whether the subject-verb agreement is correct or wrong. Notice how we generate two data points per sentence  in the `_read()` function, one with the correct verb (labelled `"correct"`) and the other with the wrong verb (labelled `"wrong"`).

(*Run the cell below. No need to edit any code here.*)

In [None]:
import allennlp
from allennlp.data.dataset_readers import DatasetReader

from allennlp.data import Instance
from allennlp.data.fields import TextField, LabelField

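# SingleIdTokenIndexer is the default indexer used in AgreementDatasetReader.__init__
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer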
from allennlp.data.tokenizers import Token

from allennlp.data.vocabulary import Vocabulary

from typing import Iterator, List, Dict

class AgreementDatasetReader(DatasetReader):
    def __init__(self, token_indexers: Dict[str, TokenIndexer] = None) -> None:
        super().__init__(lazy=False)
        self.token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}

    def text_to_instance(self, tokens: List[str], label: str) -> Instance:
        tokens = [Token(x) for x in tokens]
        fields = {
            "sentence": TextField(tokens, self.token_indexers),
            "labels": LabelField(label)
        }
        return Instance(fields)

    def _read(self, dataset) -> Iterator[Instance]:
        for inp, out in dataset:
            correct_input = [x for x in inp.split()[:-1]]
            position, correct, wrong, _ = out.split('\t')
            position = int(position)
            # verifying whether input is in correct form
            assert correct_input[position] == correct.lower()
            # yield both the correct and wrong forms of the agreement
            wrong_input = [x for x in correct_input]
            wrong_input[position] = wrong.lower()
            
            yield self.text_to_instance(correct_input, "correct")
            yield self.text_to_instance(wrong_input, "wrong")

### Question 1.1 - AllenNLP Model (20 points)

We will now build a simple one-layer classifier on top of average-pooled embeddings. Notice how the `word_embeddings` are passed as a parameter to the model. This helps abstract the word embeddings outside the model, so it is very simple to swap random vectors with word2vec, GloVe, ELMo or BERT (as we will do in this exercise).

Implement the `forward()` function for this model. Make sure you implement masking correctly, to avoid including masked vectors in the average pooling.

(*Implement the missing sections in the code below. This is the only code you need to implement, so be careful!*)

In [None]:
# requires code
import torch
from allennlp.models import Model
from allennlp.modules.text_field_embedders import TextFieldEmbedder
from allennlp.training.metrics import CategoricalAccuracy
from allennlp.nn.util import get_text_field_mask

class AgreementProbeTask(Model):
    def __init__(self,
                 word_embeddings: TextFieldEmbedder,
                 vocab: Vocabulary) -> None:
        super().__init__(vocab)
        self.word_embeddings = word_embeddings

        self.final_linear = torch.nn.Linear(
            in_features=word_embeddings.get_output_dim(),
            out_features=vocab.get_vocab_size('labels')
        )
        self.criterion = torch.nn.CrossEntropyLoss()
        self.accuracy = CategoricalAccuracy()

    def forward(self,
                sentence: Dict[str, torch.Tensor],
                labels: torch.Tensor = None) -> Dict[str, torch.Tensor]:

        mask = get_text_field_mask(sentence)
        embeddings = self.word_embeddings(sentence)

        # obtain a single vector for each element of the minibatch from the embeddings matrix
        if 'bert' in self.word_embeddings._token_embedders:
            # For BERT, we use the first token [CLS] for classification
            # construct logits using the first embedding vector of the sequence only
            #
            # ENTER CODE HERE 1
            pass
        else:
            # For other models, we will average the embeddings across the sequence dimension
            # Use the `mask` variable to correctly normalize the sum of embeddings by the sequence lengths
            #
            # ENTER CODE HERE 2
            pass

                        
        
        # Use the linear layer declared in __init__ to construct the logits
        # 
        # ENTER CODE HERE 3
        pass

        output = {"logits": None}

        if labels is not None:
            self.accuracy(logits, labels)
            output["loss"] = self.criterion(logits, labels)

        return output
    
    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        return {"accuracy": self.accuracy.get_metric(reset)}

### A Primer on Fine-Tuning

Before we analyze some word embeddings, we should understand some terminology used in the rest of the assignment. When using pre-trained weights for a downstream task, two approaches are typical: 1) the weights are kept constant during training, or 2) they are optimized jointly with the rest of the network on the downstream task objective. The first approach (we will call this **frozen embeddings**) saves training cost and is arguably a better indicator of the native linguistic knowledge stored in the original embeddings. The second approach (we will call this **tunable embeddings**) generally leads to the best performance on the downstream task, but often suffers from catastrophic forgetting.

It is an open research problem to fully understand the cases where fine-tuning word embeddings is useful.  [Peters et al. 2019](https://arxiv.org/abs/1903.05987) is a recent research paper on this topic.
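
The difference between frozen and tunable embeddings can be illustrated with a toy example in plain Python. This is only a sketch under simplifying assumptions: a two-dimensional made-up embedding table, a logistic-regression classifier, and one hand-written gradient step. The function `train_step` and all values are hypothetical. In the assignment itself, the same choice is controlled by the `tunable` entry of `config`.

```python
import math


def train_step(embeddings, weight, token, label, lr=0.1, tunable=False):
    """One SGD step of logistic regression on a single (token, label) pair.

    embeddings: dict mapping token -> list of floats (stand-in for pretrained vectors)
    weight:     list of floats (the classifier on top of the embeddings)
    tunable:    if False, the embedding is frozen and only the classifier is updated
    """
    vec = embeddings[token]
    score = sum(w * x for w, x in zip(weight, vec))
    prob = 1.0 / (1.0 + math.exp(-score))
    error = prob - label  # derivative of the logistic loss w.r.t. the score

    # Compute both gradients before changing any parameter.
    grad_weight = [error * x for x in vec]
    grad_vec = [error * w for w in weight]

    new_weight = [w - lr * g for w, g in zip(weight, grad_weight)]
    if tunable:
        # Tunable embeddings: the embedding vector also receives a gradient update.
        embeddings[token] = [x - lr * g for x, g in zip(vec, grad_vec)]
    return new_weight


# Made-up "pretrained" vectors, copied once per setting.
pretrained = {"is": [1.0, 0.0], "are": [0.0, 1.0]}
frozen = {k: list(v) for k, v in pretrained.items()}
tuned = {k: list(v) for k, v in pretrained.items()}

w_frozen = train_step(frozen, [0.5, -0.5], "is", 1, tunable=False)
w_tuned = train_step(tuned, [0.5, -0.5], "is", 1, tunable=True)

print(frozen["is"] == pretrained["is"])  # True: frozen vectors are untouched
print(tuned["is"] == pretrained["is"])   # False: tunable vectors drift from their pretrained values
```

In the frozen setting, any accuracy the classifier reaches has to come from information already present in the vectors, which is why it is the more direct probe.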

### Putting the Pieces Together ...

The code below is an implementation of the training infrastructure in AllenNLP, specific to our application. For the rest of the assignment, you only need to worry about modifying `config` and explaining the results you get. Run the cell below once to register the functions.

*(Run the cell below. No need to edit any code here.)*

In [None]:
from allennlp.modules.token_embedders import (
    Embedding,
    ElmoTokenEmbedder,
    PretrainedBertEmbedder
)
from allennlp.modules.token_embedders.embedding import _read_pretrained_embeddings_file
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.data.iterators import BucketIterator
from allennlp.training.trainer import Trainer

from allennlp.data.token_indexers import (
    SingleIdTokenIndexer,
    ELMoTokenCharactersIndexer,
    PretrainedBertIndexer
)

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

def reset_weights_fn(m):
    if hasattr(m, "reset_parameters"):
        m.reset_parameters()

def run_experiment(config):
    embedding_type = config['embedding_type']
    embedding_size = config['embedding_size']
    urls = config['urls']
    tunable = config['tunable']
    batch_size = config['batch_size']
    num_epochs = config['num_epochs']
    reset_weights = config['reset_weights']
    
    # A token indexer tells the DatasetReader how tokens should be indexed
    if embedding_type in ['random', 'glove']:
        logger.info("Using a single ID token indexer...")
        token_indexers = {
            'tokens': SingleIdTokenIndexer()
        }

    elif embedding_type == 'elmo':
        logger.info("Using elmo character token indexer...")
        token_indexers = {
            'elmo': ELMoTokenCharactersIndexer()
        }

    elif embedding_type == 'bert':
        logger.info("Using bert token indexer...")
        token_indexers = {
            'bert': PretrainedBertIndexer('bert-base-uncased')
        } 

    else:
        logger.info("Invalid embeddings type, quitting...")
        return

    # Loading training and validation datasets via our reader
    reader = AgreementDatasetReader(token_indexers)
    train_dataset = reader.read(train_data)
    valid_dataset = reader.read(valid_data)
    # `vocab` contains both the input vocab and output label space
    vocab = Vocabulary.from_instances(train_dataset + valid_dataset)

    if embedding_type == 'random':
        logger.info("Using random embeddings...")
        token_embedding = Embedding(
            num_embeddings=vocab.get_vocab_size('tokens'),
            embedding_dim=embedding_size,
            trainable=tunable
        )
        word_embeddings = BasicTextFieldEmbedder({"tokens": token_embedding})

    elif embedding_type == 'glove':
        logger.info("Using glove embeddings...")
        weight = _read_pretrained_embeddings_file(
            urls[0],
            embedding_size,
            vocab
        )
        token_embedding = Embedding(
            num_embeddings=vocab.get_vocab_size('tokens'),
            weight=weight,
            embedding_dim=embedding_size,
            trainable=tunable
        )
        word_embeddings = BasicTextFieldEmbedder({"tokens": token_embedding})   

    elif embedding_type == 'elmo':
        logger.info("Using elmo embeddings...")
        elmo_token_embedding = ElmoTokenEmbedder(
            urls[0], urls[1], dropout=0, requires_grad=tunable
        )
        if reset_weights is True:
            logger.info("Resetting elmo weights...")
            
            elmo_token_embedding.apply(reset_weights_fn)
        word_embeddings = BasicTextFieldEmbedder({"elmo": elmo_token_embedding})
    
    elif embedding_type == 'bert':
        logger.info("Using bert embeddings...")
        bert_token_embedding = PretrainedBertEmbedder(
            'bert-base-uncased', requires_grad=tunable
        )
        if reset_weights is True:
            logger.info("Resetting bert weights...")
            bert_token_embedding.apply(reset_weights_fn)
        word_embeddings = BasicTextFieldEmbedder(
            {"bert": bert_token_embedding},
            {"bert": ['bert']},
            allow_unmatched_keys=True
        )

    else:
        logger.info("Invalid embeddings type, quitting...")
        return


    model = AgreementProbeTask(word_embeddings, vocab)
    
    if torch.cuda.is_available():
        cuda_device = 0
        model = model.cuda(cuda_device)
    else:
        cuda_device = -1

    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    # A bucket iterator is needed to sort the data by sequence length and reduce padding overhead
    iterator = BucketIterator(batch_size=batch_size,
                              sorting_keys=[("sentence", "num_tokens")])
    iterator.index_with(vocab)

    # The Trainer wraps the whole training loop into a single object
    trainer = Trainer(model=model,
                      optimizer=optimizer,
                      iterator=iterator,
                      train_dataset=train_dataset,
                      validation_dataset=valid_dataset,
                      patience=20,
                      validation_metric='+accuracy',
                      num_epochs=num_epochs,
                      cuda_device=cuda_device)
    return trainer.train()
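
The comment on the `BucketIterator` above says that sorting by sequence length reduces padding overhead. The sketch below shows that effect on made-up sentence lengths. It is a simplified illustration in plain Python, not AllenNLP's implementation, and `padding_needed` is a hypothetical helper.

```python
def padding_needed(lengths, batch_size):
    """Count pad tokens needed when consecutive items are grouped into batches.

    Every sequence in a batch is padded to the length of the longest one.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


lengths = [3, 20, 4, 18, 5, 19]  # made-up sentence lengths

unsorted_pad = padding_needed(lengths, batch_size=2)
sorted_pad = padding_needed(sorted(lengths), batch_size=2)
print(unsorted_pad, sorted_pad)  # 45 15
```

Grouping sentences of similar length means fewer wasted positions per batch, which matters most for the slower ELMo and BERT runs.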

### Question 1.2 - GloVe Embeddings (20 points)

We are ready to train our model! We will start by training a baseline classification model which will use GloVe embeddings. We will test both cases, where the GloVe embeddings are trainable and where they are fixed.

1. Try varying the embedding size (50, 100, 300) and report **early stopping** validation accuracy in the text box below. AllenNLP will provide this information to you at the end of training. Keep the embeddings frozen.
2. Next, report the performance for tunable embeddings.
3. Explain your observations / trends. What do you notice about training time / number of epochs across different runs? You are encouraged to think critically about the trends you notice in this question and the following questions. Most of the points awarded will depend on your explanations.
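
For reference, the `Trainer` in `run_experiment` is configured with `patience=20` and `validation_metric='+accuracy'`. A rough sketch of what patience-based early stopping means is given below. It is a simplified illustration, not AllenNLP's actual implementation. The function `early_stopping_result` is hypothetical and the accuracy history is made up (it is not a result from this assignment).

```python
def early_stopping_result(val_accuracies, patience):
    """Return (best_epoch, best_accuracy, epochs_run) under patience-based stopping.

    Training stops once `patience` epochs have passed without a new best
    validation accuracy; the reported number is the best one seen.
    """
    best_epoch, best_acc = -1, float("-inf")
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_epoch, best_acc = epoch, acc
        elif epoch - best_epoch >= patience:
            return best_epoch, best_acc, epoch + 1
    return best_epoch, best_acc, len(val_accuracies)


# Made-up validation accuracies, one per epoch.
history = [0.50, 0.55, 0.70, 0.68, 0.69, 0.66, 0.65]
print(early_stopping_result(history, patience=3))  # (2, 0.7, 6)
```

With a patience of 3, training in this toy run ends after 6 epochs, and the value to report is the best accuracy (from epoch 2), not the last one.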

(NOTE - The loss and accuracy values are in a similar range in some parts of this assignment, so read the logs carefully!)

*Hint - As a verification of your implementation, your 300 dimensional tunable accuracy should be about 61%. This could vary slightly due to stochasticity.*


The three sizes of GloVe embeddings that we are going to use can be found here -


*  50 dimensional = https://s3-us-west-2.amazonaws.com/allennlp/datasets/glove/glove.6B.50d.txt.gz
*  100 dimensional = https://s3-us-west-2.amazonaws.com/allennlp/datasets/glove/glove.6B.100d.txt.gz
*  300 dimensional = https://s3-us-west-2.amazonaws.com/allennlp/datasets/glove/glove.840B.300d.txt.gz

**For your reference, the largest GloVe setting (300, tunable) takes about 3 minutes 30 seconds to finish finetuning**




### Enter your answers here


**Frozen Embeddings**  
50 dimensional = 

100 dimensional =  

300 dimensional =  




**Tunable Embeddings**  
50 dimensional =  

100 dimensional =  

300 dimensional =  

**Explain your results here**



In [None]:
config = {
    'embedding_type': 'glove',
    'embedding_size': 50,
    'urls': [
        'https://s3-us-west-2.amazonaws.com/allennlp/datasets/glove/glove.6B.50d.txt.gz'
    ],
    'tunable': True,
    'batch_size': 100,
    'num_epochs': 100,
    'reset_weights': False
}
run_experiment(config)

###  Question 1.3 - Contextualized Word Embeddings - ELMo (20 points)

Having studied standard word embeddings, we are now going to explore modern contextualized word embedding techniques such as ELMo.

1. Run the model with frozen and tunable ELMo embeddings and report the **early stopping** validation accuracy results as in Question 1.2.
2. Run a baseline ELMo model with randomly initialized weights (use the `reset_weights` configuration) and report the results. Consider both settings: frozen embeddings and tunable embeddings.
3. Explain your results and compare them with the GloVe embeddings. Why was this model so slow?

**For your reference, the worst case of ELMo finetuning (random weights, tunable) takes about 1 hour and 20 minutes.**

### Enter your answers here



**Frozen Embeddings**  
random weights =  

pretrained weights = 



**Tunable Embeddings**  
random weights =  

pretrained weights = 


**Explain your results here**





In [None]:
config = {
    'embedding_type': 'elmo',
    'embedding_size': 1024,
    'urls': [
        'https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json',
        'https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5'
    ],
    'tunable': True,
    'batch_size': 100,
    'num_epochs': 100,
    'reset_weights': True
}

run_experiment(config)


###  Question 1.4 - Contextualized Word Embeddings - BERT (20 points)

As our final step, we will analyze representations learnt from BERT.

1. Run the model with frozen BERT embeddings and report the **early stopping** validation accuracy as earlier.
2. Run a baseline BERT model with randomly initialized weights (use the `reset_weights` configuration). Consider only the frozen weight setting.
3. Explain your results and compare them with the GloVe and ELMo embeddings. Why was this faster (or slower) than ELMo?



(Note - Fine-tuning BERT was crashing Colab, probably due to a memory error. Feel free to check whether this is the case and look for workarounds, for example by varying the `batch_size`.)

**For your reference, finetuning BERT with randomly initialized weights takes about 30 minutes**

### Enter your answers here





**Frozen Embeddings**  
random weights =    

pretrained weights = 


**Explain your results here**




In [None]:
config = {
    'embedding_type': 'bert',
    'embedding_size': 768,
    'urls': [],
    'tunable': False,
    'batch_size': 100,
    'num_epochs': 100,
    'reset_weights': False
}
run_experiment(config)

###  Question 1.5 - Sentence Composition (10 points)

One drawback of our analysis is that the word embedding aggregation algorithm (averaging) used to build the sentence embedding is lossy. Is there a better way to aggregate word embeddings into a sentence embedding that is less lossy and improves the performance of frozen embedding models? (Please give at least three solutions to get full credit.)

### Enter your answer here



