# Homework and bake-off: pragmatic color descriptions

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Summer 2022"

## Contents

1. [Overview](#Overview)
1. [Set-up](#Set-up)
1. [All two-word examples as a dev corpus](#All-two-word-examples-as-a-dev-corpus)
1. [Dev dataset](#Dev-dataset)
1. [Random train–test split for development](#Random-train–test-split-for-development)
1. [Question 1: Improve the tokenizer [1 point]](#Question-1:-Improve-the-tokenizer-[1-point])
1. [Use the tokenizer](#Use-the-tokenizer)
1. [Question 2: Improve the color representations [1 point]](#Question-2:-Improve-the-color-representations-[1-point])
1. [Use the color representer](#Use-the-color-representer)
1. [Initial model](#Initial-model)
1. [Question 3: GloVe embeddings [1 point]](#Question-3:-GloVe-embeddings-[1-point])
1. [Try the GloVe representations](#Try-the-GloVe-representations)
1. [Question 4: Color context [3 points]](#Question-4:-Color-context-[3-points])
1. [Your original system [3 points]](#Your-original-system-[3-points])
1. [Bakeoff [1 point]](#Bakeoff-[1-point])
1. [Submission Instruction](#Submission-Instruction)

## Overview

This homework and associated bake-off are oriented toward building an effective system for generating color descriptions that are pragmatic in the sense that they would help a reader/listener figure out which color was being referred to in a shared context consisting of a target color (whose identity is known only to the describer/speaker) and a set of distractors.

The notebook [colors_overview.ipynb](colors_overview.ipynb) should be studied before work on this homework begins. That notebook provides backgroud on the task, the dataset, and the modeling code that you will be using and adapting.

The homework questions are more open-ended than previous ones have been. Rather than asking you to implement pre-defined functionality, they ask you to try to improve baseline components of the full system in ways that you find to be effective. As usual, this culminates in a prompt asking you to develop a novel system for entry into the bake-off. In this case, though, the work you do for the homework will likely be directly incorporated into that system (not required, but an efficient way to work at the very least).

## Set-up

See [colors_overview.ipynb](colors_overview.ipynb) for set-up in instructions and other background details.

In [2]:
from colors import ColorsCorpusReader
from nltk.translate.bleu_score import corpus_bleu
import numpy as np
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from torch_color_describer import ContextualColorDescriber
from torch_color_describer import create_example_dataset
import torch
import pickle

import utils
from utils import START_SYMBOL, END_SYMBOL, UNK_SYMBOL

In [3]:
utils.fix_random_seeds()

In [4]:
COLORS_SRC_FILENAME = os.path.join(
    "data", "colors", "filteredCorpus.csv")

## All two-word examples as a dev corpus

So that you don't have to sit through excessively long training runs during development, I suggest working with the two-word-only subset of the corpus until you enter into the late stages of system testing.

In [4]:
dev_corpus = ColorsCorpusReader(
    COLORS_SRC_FILENAME,
    word_count=2,
    normalize_colors=True)

In [5]:
dev_examples = list(dev_corpus.read())

This subset has about one-third the examples of the full corpus:

In [6]:
len(dev_examples)

13890

We __should__ worry that it's not a fully representative sample. Most of the descriptions in the full corpus are shorter, and a large proportion are longer. So this dataset is mainly for debugging, development, and general hill-climbing. All findings should be validated on the full dataset at some point.

## Dev dataset

The first step is to extract the raw color and raw texts from the corpus:

In [7]:
"""dev_rawcols, dev_texts = zip(*[[ex.colors, ex.contents] for ex in dev_examples])"""

'dev_rawcols, dev_texts = zip(*[[ex.colors, ex.contents] for ex in dev_examples])'

The raw color representations are suitable inputs to a model, but the texts are just strings, so they can't really be processed as-is. Question 1 asks you to do some tokenizing!

## Random train–test split for development

For the sake of development runs, we create a random train–test split:

In [8]:
"""dev_rawcols_train, dev_rawcols_test, dev_texts_train, dev_texts_test = \
    train_test_split(dev_rawcols, dev_texts)"""

'dev_rawcols_train, dev_rawcols_test, dev_texts_train, dev_texts_test =     train_test_split(dev_rawcols, dev_texts)'

## Question 1: Improve the tokenizer [1 point]

This is the first required question – the first required modification to the default pipeline.

The function `tokenize_example` simply splits its string on whitespace and adds the required start and end symbols:

In [6]:
from transformers import BertModel, BertTokenizer

weights_name = 'bert-base-uncased'
bert_tokenizer = BertTokenizer.from_pretrained(weights_name)

def tokenize_example(s, verbose=False):
    encoding = bert_tokenizer.encode(s)
    return bert_tokenizer.convert_ids_to_tokens(encoding)

"""def bert_tokenize(texts):
    example_ids = bert_tokenizer.batch_encode_plus(
        texts,
        add_special_tokens=True,
        return_attention_mask=True,
        padding = 'longest')
    return example_ids
    input_ids = example_ids['input_ids']
    attention = example_ids['attention_mask']
    return input_ids, attention
    # tokens = bert_tokenizer.convert_ids_to_tokens(input_ids), attention"""

"def bert_tokenize(texts):\n    example_ids = bert_tokenizer.batch_encode_plus(\n        texts,\n        add_special_tokens=True,\n        return_attention_mask=True,\n        padding = 'longest')\n    return example_ids\n    input_ids = example_ids['input_ids']\n    attention = example_ids['attention_mask']\n    return input_ids, attention\n    # tokens = bert_tokenizer.convert_ids_to_tokens(input_ids), attention"

## Use the tokenizer

Once the tokenizer is working, run the following cell to tokenize your inputs:

We use only the train set to derive a vocabulary for the model:

In [7]:
def get_vocab_from_bert(bert_tokens):
    vocab = {UNK_SYMBOL}
    for row in bert_tokens['input_ids']:
        vocab = vocab.union({tok for tok in bert_tokenizer.convert_ids_to_tokens(row)})
    vocab.remove('[PAD]')
    return list(sorted(vocab))

# bert_vocab = get_vocab_from_bert(bert_tokens_train)

It's important that the `UNK_SYMBOL` is included somewhere in this list. In test examples, words not seen in training will be mapped to `UNK_SYMBOL`. 

Conceptual note: If you model's vocab is the same as your train vocab, then `UNK_SYMBOL` will never be encountered during training, so it will be a random vector at test time.

## Question 2: Improve the color representations [1 point]

This is the second required pipeline improvement for the assignment. 

The following functions do nothing at all to the raw input colors we get from the corpus. 

In [8]:
from itertools import product

def represent_color_context(colors):

    # Improve me!

    return [represent_color(color) for color in colors]


def represent_color(color):
    
    [H, S, V] = convert_hls_to_hsv(color)
    
    representation = []
    for j, k, l in product((0, 1, 2), repeat = 3):
        f_jkl = np.exp(-2j*np.pi*(j*H + k*S + l*V))
        representation += [f_jkl.real, f_jkl.imag]

    return representation

def convert_hls_to_hsv(color):
    [H, L, S] = color
    
    H_new = H
    V_new = L + S * min(L, 1-L)
    S_new = 2 * (1 - L/V_new) if V_new != 0 else 0
    
    return [H_new, S_new, V_new]

The following test seeks to ensure only that the output of your `represent_color_context` will be compatible with the models we are creating:

## Use the color representer

The following cell just runs your `represent_color_context` on the train and test sets:

At this point, our preprocessing steps are complete, and we can fit a first model.

In [9]:
from torch_color_describer import Decoder
import torch
import torch.nn as nn

class BertColorContextDecoder(Decoder):
    def __init__(self, color_dim, idx2bertids, *args, **kwargs):
        self.color_dim = color_dim
        self.weights_name = 'bert-base-uncased'
        bert_model = BertModel.from_pretrained(self.weights_name)
        self.embed_dim = bert_model.embeddings.word_embeddings.embedding_dim
        self.idx2bertids = idx2bertids
        self.attention_mask = None
        super().__init__(*args, **kwargs)
        
        self.bert = bert_model
        self.bert.train()
        
        self.rnn = nn.GRU(
            input_size = self.embed_dim + color_dim,
            hidden_size = self.hidden_dim,
            batch_first = True)
        
    def get_embeddings(self, word_seqs, target_colors=None):
        # The key question is what is word_seqs, I think it should
        # probably be BERT tokens, then we can feed these into the
        # the bert model and get the reps, basically disconnect from
        # vocab, this means reworking the tokenize pipeline, the issue
        # before was just that we were tokenizing the vocab, not the
        # examples
        
        bert_id_seqs = self.idx2bertids[word_seqs]
        
        attention_mask = torch.zeros(word_seqs.shape)
        attention_mask[bert_id_seqs.nonzero(as_tuple=True)] = 1
        
        # with torch.no_grad():
        token_embeddings = self.bert(word_seqs, attention_mask=self.attention_mask).last_hidden_state
            
        color_embeddings = (torch.stack([
            torch.repeat_interleave(
                target_color.reshape(1, -1),
                token_embeddings.shape[1],
                dim=0)
            for target_color in target_colors], dim=0))
        
        # print(color_embeddings.shape)
        
        return torch.cat((token_embeddings, color_embeddings), dim=2)
        
        

class ColorContextDecoder(Decoder):
    def __init__(self, color_dim, *args, **kwargs):
        self.color_dim = color_dim
        super().__init__(*args, **kwargs)

        # Fix the `self.rnn` attribute:
        ##### YOUR CODE HERE
        self.rnn = nn.GRU(
            input_size = self.embed_dim + color_dim,
            hidden_size = self.hidden_dim,
            batch_first = True)


    def get_embeddings(self, word_seqs, target_colors=None, attention_masks=None):
        """
        You can assume that `target_colors` is a tensor of shape
        (m, n), where m is the length of the batch (same as
        `word_seqs.shape[0]`) and n is the dimensionality of the
        color representations the model is using. The goal is
        to attached each color vector i to each of the tokens in
        the ith sequence of (the embedded version of) `word_seqs`.

        """
        ##### YOUR CODE HERE
        
        word_embeddings = self.embedding(word_seqs)
        
        color_embeddings = (torch.stack([
            torch.repeat_interleave(
                target_color.reshape(1, -1),
                word_embeddings.shape[1],
                dim=0)
            for target_color in target_colors], dim=0))
        
        
        return torch.cat((word_embeddings, color_embeddings), dim=2)

Step 1 is the most demanding of the steps in terms of tensor wrangling. It's important to have a clear idea of what you are trying to achieve and to unit test `get_embeddings` so that you can check that it has realized your vision. The following test should help with that:

__Step 2__: Modify the `EncoderDecoder`. For this, you just need to make a small change to the `forward` method: extract the target colors from `color_seqs` and feed them to the decoder.

In [10]:
from torch_color_describer import EncoderDecoder

class ColorizedEncoderDecoder(EncoderDecoder):

    def forward(self,
            color_seqs,
            word_seqs,
            seq_lengths=None,
            hidden=None,
            targets=None):
        if hidden is None:
            hidden = self.encoder(color_seqs)

        # Extract the target colors from `color_seqs` and
        # feed them to the decoder, which already has a
        # `target_colors` keyword.

        ##### YOUR CODE HERE
        output, hidden = self.decoder(
            word_seqs,
            seq_lengths=seq_lengths,
            hidden=hidden,
            target_colors=color_seqs[:,2],
        )        



        # Your decoder will return `output, hidden` pairs; the
        # following will handle the two return situations that
        # the code needs to consider -- training and prediction.
        if self.training:
            return output
        else:
            return output, hidden
        
class BertEncoderDecoder(EncoderDecoder):
    
    def forward(self,
            color_seqs,
            word_seqs,
            # attention_masks,
            seq_lengths=None,
            hidden=None,
            targets=None):
        if hidden is None:
            hidden = self.encoder(color_seqs)

        # Extract the target colors from `color_seqs` and
        # feed them to the decoder, which already has a
        # `target_colors` keyword.
        
        ##### YOUR CODE HERE
        output, hidden = self.decoder(
            word_seqs,
            seq_lengths=seq_lengths,
            hidden=hidden,
            target_colors=color_seqs[:,2],
            # attention_masks=attention_masks
        )        



        # Your decoder will return `output, hidden` pairs; the
        # following will handle the two return situations that
        # the code needs to consider -- training and prediction.
        if self.training:
            return output
        else:
            return output, hidden

__Step 3__: Finally, as in the examples in [Modifying the core model](colors_overview.ipynb#Modifying-the-core-model), you need to modify the `build_graph` method of `ContextualColorDescriber` so that it uses your new `ColorContextDecoder` and `ColorizedEncoderDecoder`. Here's starter code:

In [11]:
from torch_color_describer import Encoder, ColorDataset

BERT_EMBED_DIM = 768

class ColorizedInputDescriber(ContextualColorDescriber):

    def build_graph(self):

        # We didn't modify the encoder, so this is
        # just copied over from the original:
        encoder = Encoder(
            color_dim=self.color_dim,
            hidden_dim=self.hidden_dim)

        # Use your `ColorContextDecoder`, making sure
        # to pass in all the keyword arguments coming
        # from `ColorizedInputDescriber`:

        ##### YOUR CODE HERE
        decoder = ColorContextDecoder(
            color_dim = self.color_dim,
            vocab_size=self.vocab_size,
            embed_dim=self.embed_dim,
            hidden_dim=self.hidden_dim
        )

        # Return a `ColorizedEncoderDecoder` that uses
        # your encoder and decoder:

        ##### YOUR CODE HERE
        return ColorizedEncoderDecoder(encoder, decoder)
    
class BertColorizedInputDescriber(ContextualColorDescriber):
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, *kwargs)
        self.idx2bertids = torch.tensor(bert_tokenizer.convert_tokens_to_ids(self.vocab))
        
    
    def build_graph(self):
        encoder = Encoder(
            color_dim=self.color_dim,
            hidden_dim=self.hidden_dim
        )
        
        decoder = BertColorContextDecoder(
            color_dim=self.color_dim,
            idx2bertids=self.idx2bertids,
            vocab_size=self.vocab_size,
            embed_dim=BERT_EMBED_DIM,
            hidden_dim=self.hidden_dim
        )
        
        return BertEncoderDecoder(encoder, decoder)

In [12]:
# toy_examples = dev_examples

toy_rawcols, toy_texts = zip(*[[ex.colors, ex.contents] for ex in dev_examples[:500]])

toy_rawcols_train, toy_rawcols_test, toy_texts_train, toy_texts_test = \
    train_test_split(toy_rawcols, toy_texts)

del toy_rawcols, toy_texts#, toy_examples

toy_seqs_train = [tokenize_example(s) for s in toy_texts_train]
toy_seqs_test = [tokenize_example(s) for s in toy_texts_test]

toy_cols_train = [represent_color_context(colors) for colors in toy_rawcols_train]
toy_cols_test = [represent_color_context(colors) for colors in toy_rawcols_test]

toy_vocab = ['[PAD]'] + sorted({w for toks in toy_seqs_train for w in toks}) + [UNK_SYMBOL]

In [None]:
shouldTrain = True

if shouldTrain:
    toy_model = BertColorizedInputDescriber(toy_vocab,
                                        early_stopping=True,
                                        freeze_embedding=True)

    %time _ = toy_model.fit(toy_cols_train, toy_seqs_train)

    with open("bert-finetuned-model.p", "wb") as f:
        pickle.dump(toy_model, f)
        
else:
    with open("bert-finetuned-model.p", "rb") as f:
        toy_model = pickle.load(f)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [14]:
toy_model.listener_accuracy(toy_cols_train, toy_seqs_train),\
toy_model.listener_accuracy(toy_cols_test, toy_seqs_test)

(0.8571428571428571, 1.0)

In [None]:
toy_seqs_train[0]
bert_tokenizer.convert_tokens_to_ids(toy_seqs_train[0])

In [19]:
preds = toy_model.predict(toy_cols_test)

In [20]:
preds

[['[CLS]', 'cam', '##o', 'green', '[SEP]'],
 ['[CLS]', 'neon', 'medium', 'pink', '[SEP]'],
 ['[CLS]', 'dark', '##ish', '##ish', 'red', '[SEP]']]

In [21]:
toy_seqs_test

[['[CLS]', 'cam', '##o', 'green', '[SEP]'],
 ['[CLS]', 'mint', 'green', '.', '[SEP]'],
 ['[CLS]', 'dark', '##ish', 'red', '[SEP]']]

That's it! Since these modifications are pretty intricate, you might want to use [a toy dataset](colors_overview.ipynb#Toy-problems-for-development-work) to debug it:

In [26]:
def test_full_system(describer_class):
    toy_color_seqs, toy_word_seqs, toy_vocab = create_example_dataset(
        group_size=50, vec_dim=2)

    toy_color_seqs_train, toy_color_seqs_test, toy_word_seqs_train, toy_word_seqs_test = \
        train_test_split(toy_color_seqs, toy_word_seqs)

    toy_mod = describer_class(toy_vocab)

    _ = toy_mod.fit(toy_color_seqs_train, toy_word_seqs_train)

    acc = toy_mod.listener_accuracy(toy_color_seqs_test, toy_word_seqs_test)

    return acc

In [30]:
test_full_system(ColorizedInputDescriber)

NameError: name 'test_full_system' is not defined

In [87]:
bert_tokenizer.convert_ids_to_tokens([0])
bert_tokenizer.convert_tokens_to_ids(['[PAD]'])

[0]

If that worked, then you can now try this model on SCC problems!

## Your original system [3 points]

There are many options for your original system, which consists of the full pipeline – all preprocessing and modeling steps. You are free to use any model you like, as long as you subclass `ContextualColorDescriber` in a way that allows its `evaluate` method to behave in the expected way.

So that we can evaluate models in a uniform way for the bake-off, we ask that you modify the function `evaluate_original_system` below so that it accepts a trained instance of your model and does any preprocessing steps required by your model.

If we seek to reproduce your results, we will rerun this entire notebook. Thus, it is fine if your `evaluate_original_system` makes use of functions you wrote or modified above this cell.

In [10]:
def evaluate_original_system(trained_model, color_seqs_test, texts_test):
    """
    Feel free to modify this code to accommodate the needs of
    your system. Just keep in mind that it will get raw corpus
    examples as inputs for the bake-off.

    """
    # `word_seqs_test` is a list of strings, so tokenize each of
    # its elements:
    tok_seqs = [tokenize_example(s) for s in texts_test]

    col_seqs = [represent_color_context(colors)
                for colors in color_seqs_test]


    # Optionally include other preprocessing steps here. Note:
    # DO NOT RETRAIN YOUR MODEL AS PART OF THIS EVALUATION!
    # It's a tempting step, but it's a mistake and will get
    # you disqualified!

    # The following core score calculations are required:
    evaluation = trained_model.evaluate(col_seqs, tok_seqs)

    return evaluation

If `evaluate_original_system` works on test sets you create from the corpus distribution, then it will work for the bake-off, so consider checking that. For example, this would check that `dev_mod` above passes muster:

In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies. We also ask that you report the best **listener_accuracy** score your system got during development, just to help us understand how systems performed overall.

<font color='red'>Please review the descriptions in the following comment and follow the instructions.</font>

In [None]:
# PLEASE MAKE SURE TO INCLUDE THE FOLLOWING BETWEEN THE START AND STOP COMMENTS:
#   1) Textual description of your system.
#   2) The code for your original system.
#   3) The score achieved by your system in place of MY_NUMBER.
#        With no other changes to that line.
#        You should report your score as a decimal value <=1.0
# PLEASE MAKE SURE NOT TO DELETE OR EDIT THE START AND STOP COMMENTS

# NOTE: MODULES, CODE AND DATASETS REQUIRED FOR YOUR ORIGINAL SYSTEM
# SHOULD BE ADDED BELOW THE 'IS_GRADESCOPE_ENV' CHECK CONDITION. DOING
# SO ABOVE THE CHECK MAY CAUSE THE AUTOGRADER TO FAIL.

# START COMMENT: Enter your system description in this cell.
# For our original model, we pieced together many aspects of what was
# taught in the homework into a fully working pipeline. We then trained
# the resulting model on the full colors in context corpus to produce
# our final model. Specifically, our model included the following:
#
# 1. We used the ideas from the Monroe et. al. 2017 paper to tokenize
#    our inputs. We lowercased the entire string, removed any punctation,
#    and then split on whitespace. We then removed the common endings of
#    "-er", "-est" and "-ish" from all of our tokens. Once we had fully
#    tokenized our training sequences, we then took another pass over
#    those tokens to replaced any that only occured with the UNK_TOKEN.
# 2. Instead of leaving our color representations as simple HLS values,
#    we first converted those to HSV values and then applied the fourier
#    transform described in Monroe et. al. 2016 to gives us some
#    representations that are mapped on to a space more intuitive for 
#    how human understand colors.
# 3. Next, we computed the glove embeddings for our vocab to give our
#    model reasonable starting embeddings for our tokens.
# 4. Finally, we fed our tokenized sequences, modified color
#    representations, and initial glove embeddings into the Colorized
#    Input Describer model that we built for Q4 of this homework.
#    Training this on the full Stanford Colors in Context corpus gave us
#    our final model that we submitted into the bakeoff.

# My peak score was: 0.838
if 'IS_GRADESCOPE_ENV' not in os.environ:
    from collections import Counter

    # Filters the training data so that any token that only appears once
    # gets replaced by the unk token
    def filter_vocab(seqs_train):
        flat_list = [token for sequence in seqs_train for token in sequence]
        vocab_count = Counter(flat_list)

        for sequence in seqs_train:
            for i, token in enumerate(sequence):
                if vocab_count[token] <= 1:
                    sequence[i] = UNK_SYMBOL
                    
    full_corpus = ColorsCorpusReader(
        COLORS_SRC_FILENAME,
        normalize_colors=True
    )
    full_examples = list(full_corpus.read())

    full_rawcols, full_texts = zip(*[[ex.colors, ex.contents] for ex in full_examples])

    full_rawcols_train, full_rawcols_test, full_texts_train, full_texts_test = \
        train_test_split(full_rawcols, full_texts)
    
    del full_rawcols, full_texts, full_examples

    full_seqs_train = [tokenize_example(s) for s in full_texts_train]
    full_seqs_test = [tokenize_example(s) for s in full_texts_test]

    full_cols_train = [represent_color_context(colors) for colors in full_rawcols_train]
    full_cols_test = [represent_color_context(colors) for colors in full_rawcols_test]

    full_vocab = ['[PAD]'] + sorted({w for toks in full_seqs_train for w in toks}) + [UNK_SYMBOL]

    shouldTrain = True
    if shouldTrain:
    
        full_colorized_model = BertColorizedInputDescriber(
            full_vocab,
            early_stopping=True,
            freeze_embedding=True)

        %time _ = full_colorized_model.fit(full_cols_train, full_seqs_train)
    
        with open("bert-finetuned-model.p", "wb") as f:
            pickle.dump(full_colorized_model, f)

    else:
        with open("bert-finetuned-model.p", "rb") as f:
            full_colorized_model = pickle.load(f)

    evaluation = evaluate_original_system(
            full_colorized_model,
            full_rawcols_test,
            full_texts_test
        )

    print(evaluation['listener_accuracy'])                    
                    
# STOP COMMENT: Please do not remove this comment.

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


KeyboardInterrupt: 

  perp = [np.prod(s)**(-1/len(s)) for s in scores]


## Bakeoff [1 point]

For the bake-off, we will use our original test set. The function you need to run for the submission is the following, which uses your `evaluate_original_system` from above:

In [None]:
def create_bakeoff_submission(
        trained_model,
        output_filename='cs224u-colors-bakeoff-entry.csv'):
    bakeoff_src_filename = os.path.join(
        "data", "colors", "cs224u-colors-test.csv")

    bakeoff_corpus = ColorsCorpusReader(bakeoff_src_filename)

    # This code just extracts the colors and texts from the new corpus:
    bakeoff_rawcols, bakeoff_texts = zip(*[
        [ex.colors, ex.contents] for ex in bakeoff_corpus.read()])

    # Original system function call; `trained_model` is your trained model:
    evaluation = evaluate_original_system(
        trained_model, bakeoff_rawcols, bakeoff_texts)

    evaluation['bakeoff_text'] = bakeoff_texts

    df = pd.DataFrame(evaluation)
    df.to_csv(output_filename)

In [None]:
# This check ensure that the following code only runs on the local environment only.
# The following call will not be run on the autograder environment.
if 'IS_GRADESCOPE_ENV' not in os.environ:
    pass
    create_bakeoff_submission(full_colorized_model)

This creates a file `cs224u-colors-bakeoff-entry.csv` in the current directory. That file should be uploaded as-is. Please do not change its name.

Only one upload per team is permitted, and you should do no tuning of your system based on what you see in the file – you should not study that file in anyway, beyond perhaps checking that it contains what you expected it to contain. The upload function will do some additional checking to ensure that your file is well-formed.

The nature of our evaluation is such that we have to release the full test set with all labels. Thus, we have to trust you not to make any use of the test set during development. Recall:

1. Only one evaluation is permitted.
1. No additional system tuning is permitted once the bake-off has started.

Systems will be ranked primarily by `listener_accuracy`, but we will also consider their `corpus_bleu` scores. However, the BLEU score is just a simple check that your system is speaking some version of English that corresponds in some meaningful way to the gold descriptions, so you should concentrate on `listener_accuracy`.

People who enter will receive the additional homework point, and people whose systems achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.

Late entries will be accepted, but they cannot earn the extra 0.5 points.

## Submission Instruction

Review and follow the [Homework and bake-off code: Formatting guide](hw_formatting_guide.ipynb).
Please do not change the file name as described below.

Submit the following files to Gradescope:

- `hw_colors.ipynb` (this notebook)
- `cs224u-colors-bakeoff-entry.csv` (bake-off output)