In [0]:
!pip install transformers  # Used to compute BERT embeddings.
!pip install textblob # for sentiment classification

import IPython.display
IPython.display.clear_output()


In [3]:
import numpy as np
import copy
import csv
import random
import IPython.display

# from tqdm.autonotebook import tqdm
from tqdm import tqdm
import pickle
import nltk
import string
nltk.download('punkt')

import tensorflow as tf
tf.enable_v2_behavior()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# Natural Language Understanding
When you read a story, you understand many things about it.
You understand who the characters are, where the story is taking place, and the events that have happened so far. Given your comprehension of the story so far and your common-sense understanding of the world, you can often predict what will or will not happen next.

This skill is innate for you, but the ability to guess at likely future directions for a story is actually a really difficult task to teach computers. In this homework, you will be exploring two tasks that were designed to evaluate how well computers can tell a probable story continuation from an improbable one.


# ROCStories
The [ROCStories task](https://cs.rochester.edu/nlp/rocstories/) involves predicting which sentence best ends a short story. The stories look something like this:

**Story**
```
Dorothy's cat was pregnant.
She didn't know how it happened.
She convinced the family to keep the kittens.
It wound up having 7 kittens.
```
**Candidate Ending 1**
```
Dorothy made sure to buy lots of cat food.
```
**Candidate Ending 2**
```
Dorothy went to the pet store and bought a new hamster.
```

The bad ending sentences are designed to be on topic but clearly incorrect to a human. Despite Ending 2 mentioning a pet store, you should have quickly guessed that Ending 1 is the correct one.

The tricky part about ROCStories is that the training set only contains 5-sentence stories with good ending sentences.
However, at test time you see two possible 5th sentences and need to classify which is better.
You can read up on the dataset and how it was collected in the [paper introducing the dataset](https://www.aclweb.org/anthology/N16-1098.pdf).

In those homework, you will investigate how a very simple sentiment-based approach can get reasonable accuracy at this task, revealing the challenges behind designing datasets and tasks for evaluating natural language and commonsense understanding.

You will also train a neural network to perform the task, hopefully achieving higher accuracy.

Lastly, if you choose to, you can submit to the official leaderboard for extra credit.

### Download the data
There are two versions of the ROCStories dataset. After its original 2016 release, researchers noticed problematic biases in the data. The 2017/2018 version was an attempt to resolve these biases. You should report your results on both the 2016 and 2018 validation sets, as well as the 2016 test set. (The 2018 test set is available for download online as well, but the labels for it are still being withheld.)

In [0]:
### Download the data
!mkdir rocstories_data
!wget -nc -O rocstories_data/train2017.csv https://docs.google.com/spreadsheets/d/1emH8KL8NVCCumZc2oMu-3YqRWddD3AqZEHvNqMdfgKA/export?format=csv
!wget -nc -O rocstories_data/valid2018.csv https://docs.google.com/spreadsheets/d/1F9vtluzD3kZOn7ULKyMQZfoRnSRzRnnaePyswkRqIdY/export?format=csv
!wget -nc -O rocstories_data/valid2016.csv https://docs.google.com/spreadsheets/d/1FkdPMd7ZEw_Z38AsFSTzgXeiJoLdLyXY_0B_0JIJIbw/export?format=csv
!wget -nc -O rocstories_data/test2016.csv  https://docs.google.com/spreadsheets/d/11tfmMQeifqP-Elh74gi2NELp0rx9JMMjnQ_oyGKqCEg/export?format=csv

IPython.display.clear_output()  # Clear the stdout/

def read_rocstories_valid_csv(path):
  examples = []
  with open(path) as f:
    reader = csv.DictReader(f)
    for line in reader:
      context = [line['InputSentence1'], line['InputSentence2'],
                 line['InputSentence3'], line['InputSentence4']]
      option_0 = line['RandomFifthSentenceQuiz1']
      option_1 = line['RandomFifthSentenceQuiz2']
      label = int(line['AnswerRightEnding']) - 1
      examples.append({'context': context, 
                       'options': [option_0, option_1],
                       'label': label})
  return examples

def read_rocstories_train_csv(path):
  examples = []
  with open(path) as f:
    reader = csv.DictReader(f)
    for line in reader:
      story = [line['sentence1'], line['sentence2'],
               line['sentence3'], line['sentence4'],
               line['sentence5']]
      examples.append({'story': story})
  return examples

train_data = read_rocstories_train_csv('/content/rocstories_data/train2017.csv')
valid_2016_data = read_rocstories_valid_csv('/content/rocstories_data/valid2016.csv')
valid_2018_data = read_rocstories_valid_csv('/content/rocstories_data/valid2018.csv')
test_2016_data = read_rocstories_valid_csv('/content/rocstories_data/test2016.csv')

Here's what an example in the train dataset looks like:
```
> print(train_data[123])
{'story': ["Sam's dog Rex escaped from their yard.",
           'Sam was distraught.',
           'He went out calling for Rex.',
           'Then he saw Rex come running up the street!',
           'Sam was so relieved, he almost cried!']}
```
Here's what an example in one of the validation datasets looks like:
```
> print(valid_2016_data[123])
{'context': ["Jen got sent to her aunt's for the summer.",
            'She hated the thought of being away from her local library all summer.',
            'She took a few books with her but she would go through those quickly.',
            'When she arrived her aunt took her into a special room in her home.'],
 'label': 1,
 'options': ['Jen saw her books burning in the fireplace.',
             'The room was full of shelves of books that appealed to girls.']}
```

### Classify with sentiment analysis
After the ROCStories dataset was released in 2016, researchers soon realized that it has undesired biases. The correct next sentences tends to be more positive than the incorrect next sentences. An updated version of the dataset was released in 2018 that attempted to eliminate this bias.

Implement a function that makes a prediction based on sentiment.
You can use either [AllenNLP](https://demo.allennlp.org/sentiment-analysis) or [TextBlob](https://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis).
Your function should compute the sentiment of each 5th sentence option and predict the one with the more positive sentiment.

**In your report, list the validation and test accuracies you get with your sentiment classifier. Also show a couple of negative examples where the model incorrectly,**

*Hint: We were able to get over 60\% accuracy using only the sentiment of the 5th sentences, but you can also experiment with running sentiment analysis on the context sentences as well to see if you can improve upon this.*

In [0]:
# Computes an accuracy given the data dictionary and a list of [0, 1] predictions.

def compute_accuracy(data, predictions):
  ground_truth = np.array([ex['label'] for ex in data])
  predictions = np.array(predictions)
  assert len(ground_truth) == len(predictions)

  return np.sum(np.equal(ground_truth, predictions)) / float(len(ground_truth))

In [6]:
from random import choice
from textblob import TextBlob

def predict_based_on_sentiment(data):
  """Returns a list with one value per example in data.

  List values should either be 0 or 1 indicating which ending is predicted.
  """
  predictions = list()

  for datum in data:
    context = datum["context"]
    options = datum["options"]

    average_context_polarity = sum([TextBlob(text).sentiment.polarity for text in options]) / 4
    # todo: use averaged context polarity to make prediction

    sentiment1 = TextBlob(options[0]).sentiment.polarity
    sentiment2 = TextBlob(options[1]).sentiment.polarity

    if sentiment1 > sentiment2:
      predictions.append(0)
    elif sentiment1 < sentiment2:
      predictions.append(1)
    else:
      predictions.append(choice([0, 1]))

  return predictions

predictions_valid_2016 = predict_based_on_sentiment(valid_2016_data)
print('\n2016 validation accuracy: ' )
print(compute_accuracy(valid_2016_data, predictions_valid_2016))

predictions_valid_2018 = predict_based_on_sentiment(valid_2018_data)
print('\n2018 validation accuracy: ' )
print(compute_accuracy(valid_2018_data, predictions_valid_2018))

predictions_test_2016 = predict_based_on_sentiment(test_2016_data)
print('\n2016 test accuracy: ' )
print(compute_accuracy(test_2016_data, predictions_test_2016))


2016 validation accuracy: 
nan


  import sys



2018 validation accuracy: 
0.5747931253978358

2016 test accuracy: 
0.5611972207375735


# Now using context sentences

In [30]:
from random import choice
from textblob import TextBlob

def predict_based_on_sentiment(data):
  """Returns a list with one value per example in data.

  List values should either be 0 or 1 indicating which ending is predicted.
  """
  predictions = list()

  for datum in data:
    context = datum["context"]
    options = datum["options"]

    average_context_polarity = sum([TextBlob(text).sentiment.polarity for text in options]) / 4
    # todo: use averaged context polarity to make prediction

    sentiment1 = TextBlob(options[0]).sentiment.polarity
    sentiment2 = TextBlob(options[1]).sentiment.polarity

    delta1 = abs(sentiment1 - average_context_polarity)
    delta2 = abs(sentiment2 - average_context_polarity)

    if sentiment1 - sentiment2 > .1:
      predictions.append(0)
    elif sentiment2 - sentiment1 > .1:
      predictions.append(1)
    elif delta1 > delta2:
      predictions.append(0)
    else:
      predictions.append(1)

    print("Context " + str(context))
    print("Prediction " +  options[predictions[-1]])
    print("Option 0 " +  options[0])
    print("Option 1 " +  options[1])
    print("Sentiment of 0 " + str(sentiment1))
    print("Sentiment of 1 " + str(sentiment2))
    print("Average context sentiment " + str(average_context_polarity))

  return predictions

predictions_valid_2016 = predict_based_on_sentiment(valid_2016_data)
print('\n2016 validation accuracy: ' )
print(compute_accuracy(valid_2016_data, predictions_valid_2016))

predictions_valid_2018 = predict_based_on_sentiment(valid_2018_data)
print('\n2018 validation accuracy: ' )
print(compute_accuracy(valid_2018_data, predictions_valid_2018))

predictions_test_2016 = predict_based_on_sentiment(test_2016_data)
print('\n2016 test accuracy: ' )
print(compute_accuracy(test_2016_data, predictions_test_2016))


2016 validation accuracy: 
nan
Context ['Rick grew up in a troubled household.', 'He never found good support in family, and turned to gangs.', "It wasn't long before Rick got shot in a robbery.", 'The incident caused him to turn a new leaf.']
Prediction He is happy now.
Option 0 He is happy now.
Option 1 He joined a gang.
Sentiment of 0 0.8
Sentiment of 1 0.0
Average context sentiment 0.2
Context ["Laverne needs to prepare something for her friend's party.", 'She decides to bake a batch of brownies.', 'She chooses a recipe and follows it closely.', 'Laverne tests one of the brownies to make sure it is delicious.']
Prediction The brownies are so delicious Laverne eats two of them.
Option 0 The brownies are so delicious Laverne eats two of them.
Option 1 Laverne doesn't go to her friend's party.
Sentiment of 0 1.0
Sentiment of 1 0.0
Average context sentiment 0.25
Context ['Sarah had been dreaming of visiting Europe for years.', 'She had finally saved enough for the trip.', 'She landed 

  import sys


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Option 0 Patty had a great time with her friends.
Option 1 Patty decided to take everyone out to a restaurant.
Sentiment of 0 0.8
Sentiment of 1 0.0
Average context sentiment 0.2
Context ['Erica had a coupon for a dozen doughnuts.', 'Erica decided to take them to work.', "Erica's coworkers were surprised that she brought doughnuts.", 'The co workers thanked Erica for her kindness.']
Prediction Erica was happy to help her co workers.
Option 0 Erica felt bad for what she had done.
Option 1 Erica was happy to help her co workers.
Sentiment of 0 -0.6999999999999998
Sentiment of 1 0.8
Average context sentiment 0.02500000000000005
Context ['Maya was afraid that she was too old to have a family.', 'She dreamed of meeting the perfect man and having kids.', 'Finally she joined a dating service to try to meet the right man.', 'She met a great guy who eventually asked her to marry him.']
Prediction Maya decided to never get married.

### Train a classifier using BERT embeddings.
**Important: Go to `Runtime > Change runtime type` and make sure you have a GPU in your runtime before completing this section.**

In this section, you'll train a classifier to predict which ending is correct. Ideally, you'd finetune a large pre-trained language model (a.k.a. BERT) on the classification task, but since finetuning is pretty slow, we'll instead pre-compute BERT embeddings for each story context and for each possible ending. We'll then train a new model on top of these pre-computed embeddings.

**But how do we do classification if the training set only contains positive examples?**

We invent negative examples! At each training step, we pick a random set of 5th sentences from all of the 5th sentences in the training set to act as distractors.
The hyperparameter `NUM_CANDIDATES` sets the number of distractors that are chosen. If `NUM_CANDIDATES` is set to 50, that means we do 50-way classification in our loss.

**What should the neural network look like?**

The goal of the neural network is to project the embedding of the context into the embedding space of endings. 
This way at evaluation time, we can compute a score for each candidate ending by taking the dot-product between the predicted embedding returned by the neural network and the embeddings of each ending. Whichever ending has the highest score wins.

You are free to implement the neural network however you'd like, but you'll probably want to start with a simple [MLP](https://www.tensorflow.org/guide/keras/overview#sequential_model). (You'll want to omit the final softmax layer since that's taken care of for you in the train loop.)


**What you should complete in this section:**
* **Fill in the `get_model` function. Try at least two different architectures (varying the number of layers, hidden size, activation functions, etc.) and include a discussion of their relative performance in your report.**
* **Try training with at least two different hyperparameter settings to see if you can improve performance. Include a discussion of the experiments you tried and their performance in your report.**


#### Compute/retrieve BERT embeddings.

Note that we've commented out the lines to generate the BERT embeddings and instead provided you with pre-computed files since running BERT on 100k+ sequences can take a few hours.

However, you are welcome to uncomment and experiment with computing the embeddings yourself.

In [0]:
from transformers import BertTokenizer, BertModel
import torch

def load_bert():
  tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
  model = BertModel.from_pretrained('bert-base-uncased')
  return model, tokenizer

def bert_embedding(text):
  inputs_ids = TOKENIZER.encode(text)
  input_ids = torch.tensor(inputs_ids).unsqueeze(0)  # Batch size 1

  _, merged_embedding = BERT_MODEL(input_ids)
  return merged_embedding.detach().numpy()
  
def get_train_embeddings(data):
  """Computes embeddings for each example in the provided train set."""
  context_embeddings = []
  ending_embeddings = []
  print('Starting')
  # for example in tqdm(data, desc='Computing BERT embeddings '):
  for idx, example in enumerate(data):
    if idx % 20 == 0:
      print('{}/{}'.format(idx+1, len(data)))
      print(' '.join(example['story']))
    context_embedding = bert_embedding(' '.join(example['story'][:4]))
    ending_embedding = bert_embedding(example['story'][4])

    context_embeddings.append(context_embedding)
    ending_embeddings.append(ending_embedding)
  context_embeddings = np.concatenate(context_embeddings, axis=0)
  ending_embeddings = np.concatenate(ending_embeddings, axis=0)
  return context_embeddings, ending_embeddings

def get_valid_embeddings(data):
  """Computes embeddings for each example in the provided validation set."""
  context_embeddings = []
  ending_0_embeddings = []
  ending_1_embeddings = []
  for example in tqdm(data, desc='Computing BERT embeddings '):
    context_embedding = bert_embedding(' '.join(example['context'][:4]))
    ending_0_embedding = bert_embedding(example['options'][0])
    ending_1_embedding = bert_embedding(example['options'][1])

    context_embeddings.append(context_embedding)
    ending_0_embeddings.append(ending_0_embedding)
    ending_1_embeddings.append(ending_1_embedding)

  context_embeddings = np.concatenate(context_embeddings, axis=0)
  ending_0_embeddings = np.concatenate(ending_0_embeddings, axis=0)
  ending_1_embeddings = np.concatenate(ending_1_embeddings, axis=0)
  return context_embeddings, ending_0_embeddings, ending_1_embeddings

# These are the lines I used to generate BERT embeddings. Since, they are slow
# to compute, we've provided the outputs as .pkl files.
# BERT_MODEL, TOKENIZER = load_bert()
# train_context_embs, train_ending_embs = get_train_embeddings(train_data)
# valid_2016_context_embs, valid_2016_ending_0_embs, valid_2016_ending_1_embs = get_valid_embeddings(valid_2016_data)
# valid_2018_context_embs, valid_2018_ending_0_embs, valid_2018_ending_1_embs = get_valid_embeddings(valid_2018_data)
# test_2016_context_embs, test_2016_ending_0_embs, test_2016_ending_1_embs = get_valid_embeddings(test_2018_data)


In [0]:
!gsutil cp gs://cis700_shared_data/rocstories_data/rocstories_train.pkl /content/rocstories_train.pkl
with open('/content/rocstories_train.pkl', 'rb') as f:
  data = pickle.load(f)
  train_context_embs = data['contexts']
  train_ending_embs = data['endings']

!gsutil cp gs://cis700_shared_data/rocstories_data/rocstories_valid_2016.pkl /content/rocstories_valid_2016.pkl
with open('/content/rocstories_valid_2016.pkl', 'rb') as f:
  data = pickle.load(f)
  valid_2016_context_embs = data['contexts']
  valid_2016_ending_0_embs = data['endings_0']
  valid_2016_ending_1_embs = data['endings_1']

!gsutil cp gs://cis700_shared_data/rocstories_data/rocstories_valid_2018.pkl /content/rocstories_valid_2018.pkl
with open('/content/rocstories_valid_2018.pkl', 'rb') as f:
  data = pickle.load(f)
  valid_2018_context_embs = data['contexts']
  valid_2018_ending_0_embs = data['endings_0']
  valid_2018_ending_1_embs = data['endings_1']

!gsutil cp gs://cis700_shared_data/rocstories_data/rocstories_test_2016.pkl /content/rocstories_test_2016.pkl
with open('/content/rocstories_test_2016.pkl', 'rb') as f:
  data = pickle.load(f)
  test_2016_context_embs = data['contexts']
  test_2016_ending_0_embs = data['endings_0']
  test_2016_ending_1_embs = data['endings_1']

#### Train a model

In [0]:
def get_batch(batch_size, num_candidates):
  """Returns a single training batch.
  
  Returns:
  batch_inputs: [batch_size, embedding_size] matrix of context embeddings.
  batch_candidates: [num_candidates, embedding_size] matrix of embeddings of 
    candidate 5th sentence embeddings. The groundtruth 5th sentence for the ith
    example in batch_inputs is in the ith row of batch_candidates.
  labels: [batch_size] For each example in batch_inputs, the index of the true
    5th sentence in batch_candidates.
  """
  if num_candidates < batch_size:
    raise ValueError(
        'At minimum the number of candidates is at least all of the other 5th '
        'sentences in the batch.')
    
  batch_inputs = []
  batch_candidates = []
  batch_labels = []
  for i in range(batch_size):
    rand_ex_index = random.randint(0, train_context_embs.shape[0]-1)
    batch_inputs.append(train_context_embs[rand_ex_index, :])
    batch_candidates.append(train_ending_embs[rand_ex_index, :])
    # The true next embedding is in the ith position in the candidates
    batch_labels.append(i)

  # Increase the number of "distractor" candidates to num_candidates.
  for i in range(num_candidates - batch_size):
    rand_ex_index = random.randint(0, train_context_embs.shape[0]-1)
    batch_candidates.append(train_ending_embs[rand_ex_index, :])

  batch_inputs = np.stack(batch_inputs, axis=0)
  batch_candidates = np.stack(batch_candidates, axis=0)
  return batch_inputs, batch_candidates, batch_labels

def predict_based_on_bert_classifier(
    context_embs, ending_0_embs, ending_1_embs, model):
  """Returns a list of predictions based on model."""
  predicted_embs = model(context_embs)
  
  predictions = []
  for idx in range(predicted_embs.shape[0]):
    pred_emb = predicted_embs[idx, :]
    score_0 = np.dot(pred_emb, ending_0_embs[idx, :])
    score_1 = np.dot(pred_emb, ending_1_embs[idx, :])
    predictions.append(score_0 < score_1)
  return predictions
  
def get_model():
  """Returns a Keras model.
  The model should input a [batch_size, embedding_size] tensor and output a new
  [batch_size, embedding_size] tensor. At it's simplest, it could just be a
  single dense layer. You should experiment with adding layers, changing the
  activation function, or otherwise modifying the architecture defined below.
  See:
  https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense
  
  """

  # This is an example of a very simple network consisting of a single nonlinear
  # layer followed by a linear projection back to the BERT embedding size.
  model = tf.keras.Sequential()
  model.add(tf.keras.layers.Dense(512, activation="relu"))
  tf.keras.layers.Dropout(0.5, noise_shape=None, seed=None)
  model.add(tf.keras.layers.Dense(128, activation="relu"))
  model.add(tf.keras.layers.Dense(768, activation="linear"))
  
  return model


You should experiment with the hyperparamters (IN ALL_CAPS) below to see if you can improve performance. I was able to get 67% validation accuracy with the provided values and using a two-layer network with a ReLU nonlinearity between the layers. Training took about an hour.

In [0]:
#### HYPERPARAMETERS ####
NUM_TRAIN_STEPS = 40000  # How many step to train for.
BATCH_SIZE = 32  # Number of examples used in step of training.
NUM_CANDIDATES = 60  # Number of candidate 5th sentences classifier must decide between.
LEARNING_RATE = 0.002  # Learning rate.
# If your loss is barely going down, learning rate might be too small.
# If your loss is jumping around, it might be too big.

# You may experiment with other optimizers or loss functions if you'd like.
optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model = get_model()

# Iterate over the batches of a dataset.
for train_step in range(NUM_TRAIN_STEPS):
  with tf.GradientTape() as tape:
    batch_inputs, batch_candidates, batch_labels = get_batch(BATCH_SIZE, NUM_CANDIDATES)

    # Predicted 5th sentence embedding for each batch position/
    outputs = model(batch_inputs)
    # The logits will be batch_size * num_candidates, giving a score for each
    # candidate 5th sentence. We'd like the true 5th sentence to have the
    # highest score.
    logits = tf.matmul(outputs, batch_candidates, transpose_b=True)
    # Loss value for this minibatch
    loss_value = loss_fn(batch_labels, logits)

  grads = tape.gradient(loss_value, model.trainable_weights)
  optimizer.apply_gradients(zip(grads, model.trainable_weights))

  if train_step % 100 == 0:
    print('Step {}, batch_train_loss={}'.format(train_step, loss_value))
  if train_step % 1000 == 0:
    predictions_2016 = predict_based_on_bert_classifier(valid_2016_context_embs, valid_2016_ending_0_embs, valid_2016_ending_1_embs,model)
    predictions_2018 = predict_based_on_bert_classifier(valid_2018_context_embs, valid_2018_ending_0_embs, valid_2018_ending_1_embs,model)
    
    print('2016 validation accuracy: {}'.format(compute_accuracy(valid_2016_data, predictions_2016)))
    print('2018 validation accuracy: {}'.format(compute_accuracy(valid_2018_data, predictions_2018)))

**What is overfitting?**

You may have observed when training that your validation accuracy goes up for a while and then eventually starts going down. This is called overfitting, because your model is learning to be really good at classifying examples from the training set at the expense of dong a good job at classifying usneen exampes in the validation set. You could prevent overfitting by automatically stopping training when the validation accuracy has not improved in X steps (where X is another hyperparamter you'd have to decide upon).

#### Evaluate your model

In [0]:
predictions_2016 = predict_based_on_bert_classifier(
    valid_2016_context_embs, valid_2016_ending_0_embs, valid_2016_ending_1_embs,
    model)
print('\n2016 validation accuracy: ' )
print(compute_accuracy(valid_2016_data, predictions_2016))

predictions_2018 = predict_based_on_bert_classifier(
    valid_2018_context_embs, valid_2018_ending_0_embs, valid_2018_ending_1_embs,
    model)
print('\n2018 validation accuracy: ' )
print(compute_accuracy(valid_2018_data, predictions_2018))

predictions_2016 = predict_based_on_bert_classifier(
    test_2016_context_embs, test_2016_ending_0_embs, test_2016_ending_1_embs,
    model)
print('\n2016 test accuracy: ' )
print(compute_accuracy(test_2016_data, predictions_2016))

### Train classifier on validation set
Part of the difficulty (and interestingness) of the ROCStories task is that the training set contains only positive examples. However, researchers have found that accuracies as high as 90% are possible if you cheat and train a supervised classifier using the examples with both positive and negative examples found in the validation set.

**Run the train code below, experimenting with at least twos variant, either modifying the hyperparamters or model architecture. You could also try to find a way to take advantage of the large unlabeled dataset in addition to the labeled data. Include a discussion in your report.**

In [0]:
def get_batch_from_valid(batch_size, inputs, labels):
  """Returns a single training batch extracted form the validation set.

  Inputs:
  batch_size: The batch size.
  inputs: [dataset_size, 2*embedding_size] matrix of all inputs in the training
    set.
  labels: [dataset_size] for each example, 0 if example has the incorrect ending
    embedding, 1 if it has the correct ending embedding.
  
  Returns:
  batch_inputs: [batch_size, 2*embedding_size] matrix of embeddings (each
    embedding is a context embedding concatenated with an ending embedding).
  labels: [batch_size] For each example in batch_inputs, contains either 0 or 1,
    indicating whether the 5th ending is the correct one.
  """
  batch_inputs = []
  batch_labels = []
  for i in range(batch_size):
    rand_ex_index = random.randint(0, inputs.shape[0]-1)    
    batch_inputs.append(inputs[rand_ex_index, :])
    batch_labels.append(labels[rand_ex_index])
    
  batch_inputs = np.stack(batch_inputs, axis=0)
  return batch_inputs, batch_labels

# Each input example consists of a context_embedding concatenated with an ending embedding.
def build_dataset():
  """Builds a dataset out of the validation set examples.

  Each example in valid_2016 and valid_2018 becomes two exampes in this new 
  dataset:
  * one where ending_0's embedding is concatenated to the context embedding
  * one where ending_1's embedding is concatenated to the context embedding

  The label for each example is 1 if the correct ending's embedding is present,
  0 if the incorrect ending's embedding is present.

  Returns:
  all_inputs: [new_dataset_size, embedding_size*2]
  all_labels: [new_dataset_size]
  """
  inputs_2016 = tf.concat(
      [tf.concat([valid_2016_context_embs, valid_2016_ending_0_embs], axis=-1),
      tf.concat([valid_2016_context_embs, valid_2016_ending_1_embs], axis=-1)], axis=0)
  labels = [ex['label'] for ex in valid_2016_data]
  labels_2016 = labels + [1 - label for label in labels]

  inputs_2018 = tf.concat(
      [tf.concat([valid_2018_context_embs, valid_2018_ending_0_embs], axis=-1),
      tf.concat([valid_2018_context_embs, valid_2018_ending_1_embs], axis=-1)], axis=0)
  labels = [ex['label'] for ex in valid_2018_data]
  labels_2018 = labels + [1 - label for label in labels]

  all_inputs = tf.concat([inputs_2016, inputs_2018], axis=0)
  all_labels = labels_2016 + labels_2018

  return all_inputs, all_labels

def predict_based_on_bert_binary_classifier(
    context_embs, ending_0_embs, ending_1_embs, model):
  """Returns a list of predictions based on binary classification model."""
  scores_ending_0 = model(tf.concat([context_embs, ending_0_embs], -1))
  scores_ending_1 = model(tf.concat([context_embs, ending_1_embs], -1))
  predictions = tf.greater(scores_ending_0, scores_ending_1)[:, 1]
  return predictions

def get_binary_classifier():
  """Returns a Keras model.
  The model should input a [batch_size, 2*embedding_size] tensor and output a
  [batch_size, 2] tensor. The final final dimension needs to be 2 because we are
  doing binary classification.
  
  You should experiment with modifying the architecture below.
  See:
  https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense
  
  """

  model = tf.keras.Sequential()
  model.add(tf.keras.layers.Dense(1000, activation="relu"))
  model.add(tf.keras.layers.Dense(128, activation="relu"))
  tf.keras.layers.Dropout(0.5, noise_shape=None, seed=None)
  model.add(tf.keras.layers.Dense(2, activation="linear"))
  
  return model

In [0]:
NUM_TRAIN_STEPS = 30000  # How many step to train for.
BATCH_SIZE = 32  # Number of examples used in step of training.
LEARNING_RATE = 0.001  # Learning rate.

NUM_TRAIN_EXAMPLES = 6000 # How many examples from the valid set to use for training.
# The remainder will be placed into a new valid set.

# You should with varying NUM_TRAIN_EXAMPLES. If it is larger, you will train a 
# better model, but you will have fewer examples available your validation set
# for tuning other hyperparameters.
all_inputs, all_labels = build_dataset()
train_inputs = all_inputs[:NUM_TRAIN_EXAMPLES, :]
train_labels = all_labels[:NUM_TRAIN_EXAMPLES]
valid_inputs = all_inputs[NUM_TRAIN_EXAMPLES:, :]
valid_labels = all_labels[NUM_TRAIN_EXAMPLES:]

# You may experiment with other optimizers or loss functions if you'd like.
optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

model_2 = get_binary_classifier()

# Iterate over the batches of a dataset.
for train_step in range(NUM_TRAIN_STEPS):
  with tf.GradientTape() as tape:
    batch_inputs, batch_labels = get_batch_from_valid(
        BATCH_SIZE, train_inputs, train_labels)

    logits = model_2(batch_inputs)
    loss_value = loss_fn(batch_labels, logits)

  grads = tape.gradient(loss_value, model_2.trainable_weights)
  optimizer.apply_gradients(zip(grads, model_2.trainable_weights))

  if train_step % 100 == 0:
    batch_acc = sum(tf.equal(batch_labels, tf.argmax(logits, axis=-1)).numpy()) / BATCH_SIZE
    print('Step {0}, batch_loss={1:.5f}, batch_acc={2:.3f}'.format(
        train_step, loss_value, batch_acc))
  if train_step % 1000 == 0:
    valid_logits = model_2(valid_inputs)
    num_correct = sum(tf.equal(valid_labels, tf.argmax(valid_logits, axis=-1)).numpy())
    print('Validation accuracy: {0:.3f}'.format(num_correct / len(valid_labels)))

In [0]:
# We can no longer fairly evaluate on the 2016 and 2018 validation sets since
# they've been used for training. Instead, we only evaluate on the 2016 test set.

predictions_2016 = predict_based_on_bert_binary_classifier(
    test_2016_context_embs, test_2016_ending_0_embs, test_2016_ending_1_embs,
    model_2)
print('\n2016 test accuracy: ' )
print(compute_accuracy(test_2016_data, predictions_2016))

# Extra Credit
For extra credit, make an account on [Codalab](https://competitions.codalab.org/) and submit your best model's outputs to the [Winter 2018 leaderboard](https://competitions.codalab.org/competitions/15333#participate-submit_results). You'll need to download the Winter 2018 CSV and create BERT embeddings for it.

**If you choose to do the extra credit, please take a screenshot of the Codalab leaderboard  (including your submission) and paste it into your report. Your report should also include a description of the method that you used in your submission.**