# Homework and bake-off: pragmatic color descriptions

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"

## Contents

1. [Overview](#Overview)
1. [Set-up](#Set-up)
1. [All two-word examples as a dev corpus](#All-two-word-examples-as-a-dev-corpus)
1. [Dev dataset](#Dev-dataset)
1. [Random train–test split for development](#Random-train–test-split-for-development)
1. [Question 1: Improve the tokenizer [1 point]](#Question-1:-Improve-the-tokenizer-[1-point])
1. [Use the tokenizer](#Use-the-tokenizer)
1. [Question 2: Improve the color representations [1 point]](#Question-2:-Improve-the-color-representations-[1-point])
1. [Use the color representer](#Use-the-color-representer)
1. [Initial model](#Initial-model)
1. [Question 3: GloVe embeddings [1 points]](#Question-3:-GloVe-embeddings-[1-points])
1. [Try the GloVe representations](#Try-the-GloVe-representations)
1. [Question 4: Color context [3 points]](#Question-4:-Color-context-[3-points])
1. [Your original system [3 points]](#Your-original-system-[3-points])
1. [Bakeoff [1 point]](#Bakeoff-[1-point])

## Overview

This homework and associated bake-off are oriented toward building an effective system for generating color descriptions that are pragmatic in the sense that they would help a reader/listener figure out which color was being referred to in a shared context consisting of a target color (whose identity is known only to the describer/speaker) and a set of distractors.

The notebook [colors_overview.ipynb](colors_overview.ipynb) should be studied before work on this homework begins. That notebook provides background on the task, the dataset, and the modeling code that you will be using and adapting.

The homework questions are more open-ended than previous ones have been. Rather than asking you to implement pre-defined functionality, they ask you to try to improve baseline components of the full system in ways that you find to be effective. As usual, this culminates in a prompt asking you to develop a novel system for entry into the bake-off. In this case, though, the work you do for the homework will likely be directly incorporated into that system.

## Set-up

See [colors_overview.ipynb](colors_overview.ipynb) for set-up instructions and other background details.

In [2]:
from colors import ColorsCorpusReader
import os
from sklearn.model_selection import train_test_split
from torch_color_describer import (
    ContextualColorDescriber, create_example_dataset)
import utils
from utils import START_SYMBOL, END_SYMBOL, UNK_SYMBOL

In [3]:
utils.fix_random_seeds()

In [4]:
COLORS_SRC_FILENAME = os.path.join(
    "data", "colors", "filteredCorpus.csv")

## All two-word examples as a dev corpus

So that you don't have to sit through excessively long training runs during development, I suggest working with the two-word-only subset of the corpus until you enter into the late stages of system testing.

In [5]:
dev_corpus = ColorsCorpusReader(
    COLORS_SRC_FILENAME, 
    word_count=2, 
    normalize_colors=True)

In [6]:
dev_examples = list(dev_corpus.read())

This subset has about one-third the examples of the full corpus:

In [7]:
len(dev_examples)

13890

We __should__ worry that it's not a fully representative sample. Most of the descriptions in the full corpus are shorter, and a large proportion are longer. So this dataset is mainly for debugging, development, and general hill-climbing. All findings should be validated on the full dataset at some point.
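One way to quantify how representative the two-word subset is would be to look at the distribution of description lengths. The snippet below is a minimal sketch of that check; the `texts` list is a hypothetical stand-in (not real corpus data), and `length_distribution` is an illustrative helper rather than part of the course code.

```python
from collections import Counter

# Hypothetical stand-ins for corpus descriptions (not real corpus data).
texts = ["blue", "bright green", "the darker purple one", "teal", "dull pink"]

def length_distribution(texts):
    """Map word count -> proportion of descriptions with that count."""
    counts = Counter(len(t.split()) for t in texts)
    total = sum(counts.values())
    return {n: c / total for n, c in sorted(counts.items())}

dist = length_distribution(texts)
print(dist)
```

Running the same helper over the `contents` of examples read with `word_count=None` would show how much of the full corpus the two-word subset leaves out.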

## Dev dataset

The first step is to extract the raw color and raw texts from the corpus:

In [8]:
dev_rawcols, dev_texts = zip(*[[ex.colors, ex.contents] for ex in dev_examples])

The raw color representations are suitable inputs to a model, but the texts are just strings, so they can't really be processed as-is. Question 1 asks you to do some tokenizing!

## Random train–test split for development

For the sake of development runs, we create a random train–test split:

In [9]:
dev_rawcols_train, dev_rawcols_test, dev_texts_train, dev_texts_test = \
    train_test_split(dev_rawcols, dev_texts)

## Question 1: Improve the tokenizer [1 point]

This is the first required question – the first required modification to the default pipeline.

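The baseline `tokenize_example` simply splits its string on whitespace and adds the required start and end symbols. The version below also lowercases the text, separates punctuation into its own tokens, and can replace a supplied set of words with `UNK_SYMBOL`: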

In [103]:
import re

def tokenize_example(s, occuredOnce_w=None):
    
    # Improve me!
    s = s.lower()
    # Raw strings avoid invalid-escape warnings; the character class is unchanged.
    s = re.sub(r"([\\'.,!?();=\-/*])", r' \1 ', s)
    s = re.sub(r'\s{2,}', ' ', s)
    
    if occuredOnce_w is None:
        return [START_SYMBOL] + s.split() + [END_SYMBOL]
    else:
        results = [START_SYMBOL] + s.split() + [END_SYMBOL]
        for w_ind, w in enumerate(results):
            if w in occuredOnce_w:
                results[w_ind] = UNK_SYMBOL
        return results

In [104]:
tokenize_example(dev_texts_train[376])

['<s>', 'aqua', ',', 'teal', '</s>']

__Your task__: Modify `tokenize_example` so that it does something more sophisticated with the input text. 

__Notes__:

* There are useful ideas for this in [Monroe et al. 2017](https://transacl.org/ojs/index.php/tacl/article/view/1142)
* There is no requirement that you do word-level tokenization. Sub-word and multi-word are options.
* This question can interact with the size of your vocabulary (see just below), and in turn with decisions about how to use `UNK_SYMBOL`.

__Important__: don't forget to add the start and end symbols, else the resulting models will definitely be terrible!
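As an illustration of the sub-word option mentioned in the notes, here is a rough sketch that splits off the endings `-er`, `-est`, and `-ish` as separate tokens. The function name `tokenize_with_suffixes`, the `+` prefix on suffix tokens, and the three-character minimum stem are all assumptions made for this example, and the symbol values are assumed to match those in `utils`. The heuristic is crude (it would also split a word like "water"), so treat it as a starting point only.

```python
import re

START_SYMBOL, END_SYMBOL = "<s>", "</s>"  # assumed to match `utils`

# A stem of at least three characters followed by one of the endings.
SUFFIX_RE = re.compile(r"^(\w{3,}?)(er|est|ish)$")

def tokenize_with_suffixes(s):
    """Lowercase, split off punctuation, and separate -er/-est/-ish."""
    s = s.lower()
    # Put spaces around anything that is neither a word character nor whitespace.
    s = re.sub(r"([^\w\s])", r" \1 ", s)
    toks = []
    for w in s.split():
        m = SUFFIX_RE.match(w)
        if m:
            toks.extend([m.group(1), "+" + m.group(2)])
        else:
            toks.append(w)
    return [START_SYMBOL] + toks + [END_SYMBOL]

print(tokenize_with_suffixes("Greenish, darker"))
```

Splitting suffixes this way lets "greenish" and "green" share a token, which may help with rare forms at the cost of longer sequences.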

## Use the tokenizer

Once the tokenizer is working, run the following cell to tokenize your inputs:

In [229]:
dev_seqs_train = [tokenize_example(s) for s in dev_texts_train]

dev_seqs_test = [tokenize_example(s) for s in dev_texts_test]

We use only the train set to derive a vocabulary for the model:

In [230]:
occuredOnce_w = set()
all_words_occs = set()
for toks in dev_seqs_train:
    for w in toks:
        if w in occuredOnce_w:
            occuredOnce_w.remove(w)
        elif w not in all_words_occs:
            occuredOnce_w.add(w)
        all_words_occs.add(w)
            
dev_seqs_train = [tokenize_example(s, occuredOnce_w) for s in dev_texts_train]

In [235]:
# Sorted list of training tokens; the union guarantees that `UNK_SYMBOL`
# appears exactly once. (`set.add` returns None, so it cannot be chained
# into the assignment.)
dev_vocab = sorted({w for toks in dev_seqs_train for w in toks} | {UNK_SYMBOL})

In [236]:
dev_vocab

['!',
 '###',
 '$UNK',
 '&',
 "'",
 '(',
 ')',
 '*',
 ',',
 '-',
 '.',
 '/',
 '1',
 '10',
 '15',
 '2',
 '20',
 '2nd',
 '3',
 '50',
 '50%',
 '6',
 ':',
 ':d',
 ':s',
 ';',
 '</s>',
 '<s>',
 '=',
 '>:o',
 '?',
 '\\',
 '_',
 '_____',
 'a',
 'about',
 'actual',
 'actually',
 'added',
 'after',
 'again',
 'ago',
 'agree',
 'ah',
 'ahaha',
 'ahh',
 'ahhhh',
 'airport',
 'alike',
 'all',
 'allowed',
 'almost',
 'almosy',
 'also',
 'am',
 'american',
 'amethyst',
 'amount',
 'an',
 'and',
 'angry',
 'another',
 'answer',
 'any',
 'anyone',
 'anything',
 'anyway',
 'apple',
 'appliance',
 'aqua',
 'aquaish',
 'aquamarine',
 'are',
 'aren',
 'argh',
 'army',
 'around',
 'as',
 'ash',
 'ashy',
 'ask',
 'at',
 'attention',
 'auqa',
 'avacado',
 'avocado',
 'awesome',
 'awwww',
 'azure',
 'b',
 'baby',
 'bad',
 'ball',
 'ballet',
 'bam',
 'banana',
 'barbie',
 'barely',
 'bark',
 'barney',
 'barnie',
 'barny',
 'basic',
 'basically',
 'battleship',
 'bbright',
 'be',
 'beautiful',
 'because',
 'bee

It's important that `UNK_SYMBOL` is included somewhere in this list. Test examples with words not seen in training will be mapped to `UNK_SYMBOL`. If your model's vocab is the same as your train vocab, then `UNK_SYMBOL` will never be encountered during training, so it will be a random vector at test time.
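One common way to make sure `UNK_SYMBOL` is seen during training is to replace rare training tokens with it. The sketch below does this with a frequency threshold; `build_vocab_with_unk` and `min_count` are hypothetical names introduced for illustration, and the value of `UNK_SYMBOL` is assumed to match the one in `utils`.

```python
from collections import Counter

UNK_SYMBOL = "$UNK"  # assumed to match `utils.UNK_SYMBOL`

def build_vocab_with_unk(train_seqs, min_count=2):
    """Replace tokens seen fewer than `min_count` times with UNK_SYMBOL
    and return the rewritten sequences plus a sorted vocabulary."""
    counts = Counter(w for seq in train_seqs for w in seq)
    keep = {w for w, c in counts.items() if c >= min_count}
    unked = [[w if w in keep else UNK_SYMBOL for w in seq]
             for seq in train_seqs]
    vocab = sorted(keep | {UNK_SYMBOL})
    return unked, vocab

toy_train = [["<s>", "blue", "</s>"], ["<s>", "blue", "teal", "</s>"]]
toy_unked, toy_vocab = build_vocab_with_unk(toy_train)
print(toy_unked[1], toy_vocab)
```

Raising `min_count` shrinks the vocabulary and gives the model more `UNK_SYMBOL` training examples, at the cost of losing rare but possibly informative words.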

In [237]:
len(dev_vocab)

1358

## Question 2: Improve the color representations [1 point]

This is the second required pipeline improvement for the assignment. 

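The baseline versions of `represent_color_context` and `represent_color` do nothing at all to the raw input colors we get from the corpus. The cells below replace them with a Fourier-based representation adapted from Monroe et al. 2017.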

In [148]:
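# Reference: [Monroe et al. 2017](https://github.com/futurulus/colors-in-context/blob/2e7b830668cd039830154e7e8f211c6d4415d30f/vectorizers.py) 
# Note: some methods copied from the reference refer to `T`, `InputLayer`,
# and `skimage`, which are not imported here. This notebook only uses the
# RGB path of `FourierVectorizer.vectorize`, which needs NumPy alone.
import numpy as np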

RANGES_RGB = (256.0, 256.0, 256.0)
RANGES_HSV = (361.0, 101.0, 101.0)
C_EPSILON = 1e-4

class ColorVectorizer(object):
    def vectorize_all(self, colors, hsv=None):
        '''
        :param colors: A sequence of length-3 vectors or 1D array-like objects containing
                      RGB coordinates in the range [0, 256).
        :param bool hsv: If `True`, input is assumed to be in HSV space in the range
                         [0, 360], [0, 100], [0, 100]; if `False`, input should be in RGB
                         space in the range [0, 256). `None` (default) means take the
                         color space from the value given to the constructor.
        :return np.ndarray: An array of the vectorized form of each color in `colors`
                            (first dimension is the index of the color in the `colors`).
        >>> BucketsVectorizer((2, 2, 2)).vectorize_all([(0, 0, 0), (255, 0, 0)])
        array([0, 4], dtype=int32)
        '''
        return np.array([self.vectorize(c, hsv=hsv) for c in colors])

    def unvectorize_all(self, colors, random=False, hsv=None):
        '''
        :param Sequence colors: An array or sequence of vectorized colors
        :param random: If true, sample a random color from each bucket. Otherwise,
                       return the center of the bucket. Some vectorizers map colors
                       one-to-one to vectorized versions; these vectorizers will
                       ignore the `random` argument.
        :param hsv: If `True`, return colors in HSV format; otherwise, RGB.
                    `None` (default) means take the color space from the value
                    given to the constructor.
        :return list(tuple(int)): The unvectorized version of each color in `colors`
        >>> BucketsVectorizer((2, 2, 2)).unvectorize_all([0, 4])
        [(64, 64, 64), (192, 64, 64)]
        >>> BucketsVectorizer((2, 2, 2)).unvectorize_all([0, 4], hsv=True)
        [(0, 0, 25), (0, 67, 75)]
        '''
        return [self.unvectorize(c, random=random, hsv=hsv) for c in colors]

    def visualize_distribution(self, dist):
        '''
        :param dist: A distribution over the buckets defined by this vectorizer
        :type dist: array-like with shape `(self.num_types,)``
        :return images: `list(`3-D `np.array` with `shape[2] == 3)`, three images
            with the last dimension being the channels (RGB) of cross-sections
            along each axis, showing the strength of the distribution as the
            intensity of the channel perpendicular to the cross-section.
        '''
        raise NotImplementedError

    def get_input_vars(self, id=None, recurrent=False):
        '''
        :param id: The string tag to use as a prefix in the variable names.
            If `None`, no prefix will be added. (Passing an empty string will
            result in adding a bare `'/'`, which is legal but probably not what
            you want.)
        :type id: str or None
        :param bool recurrent: If `True`, return input variables reflecting
            copying the input `k` times, where `k` is the recurrent sequence
            length. This means the input variables will have one more dimension
            than they would if they were input to a simple feed-forward layer.
        :return list(T.TensorVariable): The variables that should feed into the
            color component of the input layer of a neural network using this
            vectorizer.
        '''
        id_tag = (id + '/') if id else ''
        return [(T.itensor3 if recurrent else T.imatrix)(id_tag + 'colors')]

    def get_input_layer(self, input_vars, recurrent_length=0, cell_size=20, context_len=1, id=None):
        '''
        :param input_vars: The input variables returned from
            `get_input_vars`.
        :type input_vars: list(T.TensorVariable)
        :param recurrent_length: The number of steps to copy color representations
            for input to a recurrent unit. If `None`, allow variable lengths; if 0,
            produce output for a non-recurrent layer (this will create an input layer
            producing a tensor of rank one lower than the recurrent version).
        :type recurrent_length: int or None
        :param int cell_size: The number of dimensions of the final color representation.
        :param id: The string tag to use as a prefix in the layer names.
            If `None`, no prefix will be added. (Passing an empty string will
            result in adding a bare `'/'`, which is legal but probably not what
            you want.)
        :return Lasagne.Layer, list(Lasagne.Layer): The layer producing the color
            representation, and the list of input layers corresponding to each of
            the input variables (in the same order).
        '''
        raise NotImplementedError(self.get_input_layer)

class FourierVectorizer(ColorVectorizer):
    '''
    Vectorizes colors by converting them to a truncated frequency representation.
    This vectorizer can only vectorize, not unvectorize.
    '''
    def __init__(self, resolution, hsv=False):
        '''
        :param resolution: The number of dimensions to truncate the frequency
                           representation (the vectorized representation will be
                           *twice* this, because the frequency representation uses
                           complex numbers). Should be an even number between 0 and
                           the range of each internal color space dimension, or a
                           length-3 sequence of such numbers.
        :param bool hsv: If `True`, the internal color space used by the vectorizer
                         will be HSV. Input and output color spaces can be configured
                         on a per-call basis by using the `hsv` parameter of
                         `vectorize` and `unvectorize`.
        '''
        if len(resolution) == 1:
            resolution = resolution * 3
        self.resolution = resolution
        self.output_size = np.prod(resolution) * 2
        self.hsv = hsv

    def vectorize(self, color, hsv=None):
        '''
        :param color: A length-3 vector or 1D array-like object containing
                      color coordinates.
        :param bool hsv: If `True`, input is assumed to be in HSV space in the range
                         [0, 360], [0, 100], [0, 100]; if `False`, input should be in RGB
                         space in the range [0, 255]. `None` (default) means take the
                         color space from the value given to the constructor.
        :return np.ndarray: The color in the Fourier representation,
                            a vector of shape `(prod(resolution) * 2,)`.
        >>> normalize = lambda v: np.where(v.round(2) == 0.0, 0.0, v.round(2))
        >>> normalize(FourierVectorizer([2]).vectorize((255, 0, 0)))
        array([ 1.,  1.,  1.,  1., -1., -1., -1., -1.,  0.,  0.,  0.,  0.,  0.,
                0.,  0.,  0.], dtype=float32)
        >>> normalize(FourierVectorizer([2]).vectorize((180, 100, 100), hsv=True))
        array([ 1., -1., -1.,  1.,  1., -1., -1.,  1.,  0.,  0.,  0.,  0.,  0.,
                0.,  0.,  0.], dtype=float32)
        >>> normalize(FourierVectorizer([2], hsv=True).vectorize((0, 100, 100)))
        array([ 1., -1., -1.,  1.,  1., -1., -1.,  1.,  0.,  0.,  0.,  0.,  0.,
                0.,  0.,  0.], dtype=float32)
        >>> normalize(FourierVectorizer([2], hsv=True).vectorize((0, 255, 255), hsv=False))
        array([ 1., -1., -1.,  1., -1.,  1.,  1., -1.,  0.,  0.,  0.,  0.,  0.,
                0.,  0.,  0.], dtype=float32)
        '''
        return self.vectorize_all([color], hsv=hsv)[0]

    def vectorize_all(self, colors, hsv=None):
        '''
        >>> normalize = lambda v: np.where(v.round(2) == 0.0, 0.0, v.round(2))
        >>> normalize(FourierVectorizer([2]).vectorize_all([(255, 0, 0), (0, 255, 255)]))
        array([[ 1.,  1.,  1.,  1., -1., -1., -1., -1.,  0.,  0.,  0.,  0.,  0.,
                 0.,  0.,  0.],
               [ 1., -1., -1.,  1.,  1., -1., -1.,  1.,  0.,  0.,  0.,  0.,  0.,
                 0.,  0.,  0.]], dtype=float32)
        '''
        if hsv is None:
            hsv = self.hsv

        colors = np.array([colors])
        assert len(colors.shape) == 3, colors.shape
        assert colors.shape[2] == 3, colors.shape

        ranges = np.array(RANGES_HSV if self.hsv else RANGES_RGB)
        if hsv and not self.hsv:
            c_hsv = colors
            color_0_1 = skimage.color.hsv2rgb(c_hsv / (np.array(RANGES_HSV) - 1.0))
        elif not hsv and self.hsv:
            c_rgb = colors
            color_0_1 = skimage.color.rgb2hsv(c_rgb / (np.array(RANGES_RGB) - 1.0))
        else:
            color_0_1 = colors / (ranges - 1.0)

        # Using a Fourier representation causes colors at the boundary of the
        # space to behave as if the space is toroidal: red = 255 would be
        # about the same as red = 0. We don't want this...
        xyz = color_0_1[0] / 2.0
        if self.hsv:
            # ...*except* in the case of HSV: H is in fact a polar coordinate.
            xyz[:, 0] *= 2.0

        # ax, ay, az = [np.hstack([np.arange(0, g / 2), np.arange(r - g / 2, r)])
        #               for g, r in zip(self.resolution, ranges)]
        ax, ay, az = [np.arange(0, g) for g, r in zip(self.resolution, ranges)]
        gx, gy, gz = np.meshgrid(ax, ay, az)

        arg = (np.multiply.outer(xyz[:, 0], gx) +
               np.multiply.outer(xyz[:, 1], gy) +
               np.multiply.outer(xyz[:, 2], gz))
        assert arg.shape == (xyz.shape[0],) + tuple(self.resolution), arg.shape
        repr_complex = np.exp(-2j * np.pi * (arg % 1.0)).swapaxes(1, 2).reshape((xyz.shape[0], -1))
        result = np.hstack([repr_complex.real, repr_complex.imag]).astype(np.float32)
        return result

    def unvectorize(self, color, random='ignored', hsv=None):
        # Exact unvectorization for the frequency distribution is impossible
        # unless the representation is not truncated. For now this should
        # just be a speaker representation.
        raise NotImplementedError

    def get_input_vars(self, id=None, recurrent=False):
        id_tag = (id + '/') if id else ''
        return [(T.tensor3 if recurrent else T.matrix)(id_tag + 'colors')]

    def get_input_layer(self, input_vars, recurrent_length=0, cell_size=20,
                        context_len=1, id=None):
        id_tag = (id + '/') if id else ''
        (input_var,) = input_vars
        shape = ((None, self.output_size * context_len)
                 if recurrent_length == 0 else
                 (None, recurrent_length, self.output_size * context_len))
        l_color = InputLayer(shape=shape, input_var=input_var,
                             name=id_tag + 'color_input')
        return l_color, [l_color]

normalize = lambda v: np.where(v.round(2) == 0.0, 0.0, v.round(2))

In [158]:
# print(normalize(FourierVectorizer([2]).vectorize((255, 0, 0))))
# print(normalize(FourierVectorizer([2]).vectorize_all([(255, 0, 0), (0, 255, 255)])))

In [159]:
import colorsys

def represent_color_context(colors):
    
    # Improve me!
    
    return [represent_color(color) for color in colors]


def represent_color(color):
    # Improve me!
    color = np.array(colorsys.hls_to_rgb(*color))
    color *= 255.0
    
    return normalize(FourierVectorizer([2]).vectorize(color))

In [160]:
represent_color_context(dev_rawcols_train[0])

[array([ 1.  ,  0.17, -0.17, -1.  , -0.11, -1.  , -0.96,  0.11,  0.  ,
        -0.99, -0.99,  0.  , -0.99, -0.06,  0.28,  0.99], dtype=float32),
 array([ 1.  , -0.41,  0.41, -1.  ,  0.01, -0.92, -0.91, -0.01,  0.  ,
        -0.91, -0.91,  0.  , -1.  ,  0.4 , -0.42,  1.  ], dtype=float32),
 array([ 1.  ,  0.91, -0.91, -1.  ,  0.3 , -0.12, -0.67, -0.3 ,  0.  ,
        -0.41, -0.41,  0.  , -0.95, -0.99,  0.75,  0.95], dtype=float32)]

__Your task__: Modify `represent_color_context` and/or `represent_color` to represent colors in a new way.
    
__Notes__:

* The Fourier-transform method of [Monroe et al. 2017](https://transacl.org/ojs/index.php/tacl/article/view/1142) is a proven choice.
* You are not required to keep `represent_color`. This might be unnatural if you want to perform an operation on each color trio all at once.
* For that matter, if you want to process all of the color contexts in the entire data set all at once, that is fine too, as long as you can also perform the operation at test time with an unknown number of examples being tested.
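For readers who want the core idea of a Fourier-style color representation without the full vectorizer class, here is a compact sketch. It assumes each color is a triple of coordinates scaled to [0, 1], with the first coordinate treated as periodic (hue-like). The name `fourier_features` and the default `resolution=3` are choices made for this example, and it is a simplification rather than a reimplementation of the code in Monroe et al. 2017.

```python
import itertools
import numpy as np

def fourier_features(color, resolution=3):
    """Map a 3-coordinate color in [0, 1]^3 to real and imaginary parts
    of exp(-2*pi*i * (j*x + k*y + l*z)) for j, k, l in range(resolution)."""
    x, y, z = color
    # Halve the non-periodic coordinates so that 0 and 1 do not coincide;
    # the first coordinate is left periodic.
    coords = np.array([x, y / 2.0, z / 2.0])
    freqs = np.array(list(itertools.product(range(resolution), repeat=3)))
    arg = freqs @ coords
    rep = np.exp(-2j * np.pi * arg)
    return np.concatenate([rep.real, rep.imag]).astype(np.float32)

features = fourier_features((0.5, 1.0, 1.0))
print(features.shape)
```

The output has `2 * resolution**3` dimensions, so the `color_dim` seen by the model grows quickly with `resolution`.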

## Use the color representer

The following cell just runs your `represent_color_context` on the train and test sets:

In [161]:
dev_cols_train = [represent_color_context(colors) for colors in dev_rawcols_train]

dev_cols_test = [represent_color_context(colors) for colors in dev_rawcols_test]

At this point, our preprocessing steps are complete, and we can fit a first model.

## Initial model

The first model is configured right now to be a small model run for just a few iterations. It should be enough to get traction, but it's unlikely to be a great model. You are free to modify this configuration if you wish; it is here just for demonstration and testing:

In [188]:
dev_mod = ContextualColorDescriber(
    dev_vocab, 
    embed_dim=10, 
    hidden_dim=10, 
    max_iter=5, 
    batch_size=128)

In [189]:
_ = dev_mod.fit(dev_cols_train, dev_seqs_train)

Epoch 5; err = 116.67428576946259

As discussed in [colors_overview.ipynb](colors_overview.ipynb), our primary metric is `listener_accuracy`:

In [190]:
dev_mod.listener_accuracy(dev_cols_test, dev_seqs_test)

0.6547653325655053
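To build intuition for a listener-style accuracy, the following is a rough sketch of the underlying idea: score the observed utterance under each candidate color treated as the target, and count the example as correct when the true target scores highest. The `scores` matrix, the function name `listener_accuracy_sketch`, and the convention that the true target is in the last column are assumptions for this illustration; the actual computation in `torch_color_describer.py` may differ in its details.

```python
import numpy as np

def listener_accuracy_sketch(scores, target_index=-1):
    """`scores[i, j]` is a (hypothetical) score, e.g. a log-probability,
    of utterance i when color j is treated as the target."""
    scores = np.asarray(scores, dtype=float)
    target = target_index % scores.shape[1]
    predictions = scores.argmax(axis=1)
    return float((predictions == target).mean())

# Made-up scores for four utterances and three candidate colors.
toy_scores = [[-5.0, -4.0, -1.0],
              [-2.0, -6.0, -3.0],
              [-3.0, -3.0, -0.5],
              [-1.0, -2.0, -4.0]]
toy_acc = listener_accuracy_sketch(toy_scores)
print(toy_acc)
```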

We can also see the model's predicted sequences given color context inputs:

In [171]:
dev_mod.predict(dev_cols_test[:1])

[['<s>', 'bright', 'purple', '</s>']]

In [166]:
dev_seqs_test[:1]

[['<s>', 'right', 'side', '###', 'purple', 'pinkish', '</s>']]

## Question 3: GloVe embeddings [1 points]

The above model uses a random initial embedding, as configured by the decoder used by `ContextualColorDescriber`. This homework question asks you to consider using GloVe inputs. 

__Your task__: Complete `create_glove_embedding` so that it creates a GloVe embedding based on your model vocabulary. This isn't meant to be analytically challenging, but rather just to create a basis for you to try out other kinds of rich initialization.

In [172]:
GLOVE_HOME = os.path.join('data', 'glove.6B')

In [181]:
def create_glove_embedding(vocab, glove_base_filename='glove.6B.50d.txt'):
    
    # Use `utils.glove2dict` to read in the GloVe file:    
    ##### YOUR CODE HERE
    # Reuse the module-level `GLOVE_HOME` defined in the cell above.
    GLOVE = utils.glove2dict(os.path.join(GLOVE_HOME, glove_base_filename))
    
    # Use `utils.create_pretrained_embedding` to create the embedding.
    # This function will, by default, ensure that START_TOKEN, 
    # END_TOKEN, and UNK_TOKEN are included in the embedding.
    ##### YOUR CODE HERE
    embedding, vocab = utils.create_pretrained_embedding(GLOVE, vocab)

    
    # Be sure to return the embedding you create as well as the
    # vocabulary returned by `utils.create_pretrained_embedding`,
    # which is likely to have been modified from the input `vocab`.
    
    ##### YOUR CODE HERE

    return embedding, vocab


## Try the GloVe representations

Let's see if GloVe helped for our development data:

In [182]:
dev_glove_embedding, dev_glove_vocab = create_glove_embedding(dev_vocab)

The above might dramatically change your vocabulary, depending on how many items from your vocab are in the GloVe space:

In [183]:
len(dev_vocab)

485

In [184]:
len(dev_glove_vocab)

485

In [185]:
dev_mod_glove = ContextualColorDescriber(
    dev_glove_vocab, 
    embedding=dev_glove_embedding,
    hidden_dim=10, 
    max_iter=5, 
    batch_size=128)

In [186]:
_ = dev_mod_glove.fit(dev_cols_train, dev_seqs_train)

Epoch 5; err = 116.58577048778534

In [187]:
dev_mod_glove.listener_accuracy(dev_cols_test, dev_seqs_test)

0.7391304347826086

You probably saw a small boost, assuming your tokenization scheme leads to good overlap with the GloVe vocabulary. The input representations are larger than in our previous model (at least as I configured things), so we would need to do more runs with higher `max_iter` values to see whether this is worthwhile overall.
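Since the benefit depends on overlap with the GloVe vocabulary, it can be useful to measure that overlap directly. The sketch below assumes a word-to-vector dictionary like the one `utils.glove2dict` returns; `glove_coverage`, `toy_glove`, and the default special symbols are hypothetical names and values used only for illustration.

```python
def glove_coverage(vocab, glove_lookup, special=("<s>", "</s>", "$UNK")):
    """Return the fraction of non-special vocab items found in
    `glove_lookup`, along with the list of missing items."""
    words = [w for w in vocab if w not in special]
    missing = [w for w in words if w not in glove_lookup]
    coverage = 1.0 - len(missing) / len(words) if words else 0.0
    return coverage, missing

# Toy lookup standing in for a real GloVe dictionary.
toy_glove = {"blue": [0.0] * 50, "green": [0.0] * 50}
toy_vocab_items = ["<s>", "blue", "green", "greenish", "</s>", "$UNK"]
coverage, missing = glove_coverage(toy_vocab_items, toy_glove)
print(coverage, missing)
```

Inspecting `missing` is a quick way to spot tokenization choices (misspellings, fused punctuation, unsplit suffixes) that keep words out of the pretrained space.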

## Question 4: Color context [3 points]

The final required homework question is the most challenging, but it should set you up to think in much more flexible ways about the underlying model we're using.

The question asks you to modify various model components in `torch_color_describer.py`. The section called [Modifying the core model](colors_overview.ipynb#Modifying-the-core-model) from the core unit notebook provides a number of examples illustrating the basic techniques, so you might review that material if you get stuck here.

__Your task__: [Monroe et al. 2017](https://transacl.org/ojs/index.php/tacl/article/view/1142) append the target color (the final one in the context) to each input token that gets processed by the decoder. The question asks you to subclass the `Decoder` and `EncoderDecoder` from `torch_color_describer.py` so that you can build models that do this.

__Step 1__: Modify the `Decoder` so that the input vector to the model at each timestep is not just a token representation `x` but the concatenation of `x` with the representation of the target color.

__Notes__:

* You might notice at this point that the original `Decoder.forward` method has an optional keyword argument `target_colors` that is passed to `Decoder.get_embeddings`. Because this is already in place, all you have to do is modify the `get_embeddings` method to use this argument.

* The change affects the configuration of `self.rnn`, so you need to override the `__init__` method as well, so that its `input_size` argument accommodates the embedding as well as the color representations.

* You can do the relevant operations efficiently in pure PyTorch using `repeat_interleave` and `cat`, but the important thing is to get a working implementation – you can always optimize the code later if the ideas prove useful to you. 
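The shape bookkeeping behind the `repeat_interleave` and `cat` approach can be checked in isolation. This is a minimal sketch with made-up dimensions and placeholder tensors (`embedded`, `target_colors`), not the decoder itself.

```python
import torch

batch_size, seq_len, embed_dim, color_dim = 2, 4, 3, 5

# Placeholder for the embedded word sequences: shape (m, k, embed_dim).
embedded = torch.zeros(batch_size, seq_len, embed_dim)
# One target color per example: shape (m, color_dim).
target_colors = torch.arange(
    batch_size * color_dim, dtype=torch.float32
).reshape(batch_size, color_dim)

# (m, n) -> (m, 1, n) -> (m, k, n): repeat each color at every timestep.
tiled = torch.repeat_interleave(target_colors.unsqueeze(1), seq_len, dim=1)

# Concatenate along the feature dimension: (m, k, embed_dim + n).
combined = torch.cat((embedded, tiled), dim=2)
print(combined.shape)
```

Note that the color tensor needs a time dimension before it is repeated; repeating a 2D `(m, n)` tensor along `dim=1` would instead stretch the feature dimension.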

Here's skeleton code for you to flesh out:

In [216]:
import torch
import torch.nn as nn
from torch_color_describer import Decoder

class ColorContextDecoder(Decoder):    
    def __init__(self, color_dim, *args, **kwargs):
        self.color_dim = color_dim
        super().__init__(*args, **kwargs)
        
        # Fix the `self.rnn` attribute:
        ##### YOUR CODE HERE
        self.rnn = nn.GRU(
            input_size=self.embed_dim + self.color_dim,
            hidden_size=self.hidden_dim,
            batch_first=True)



        

    def get_embeddings(self, word_seqs, target_colors=None):  
        """You can assume that `target_colors` is a tensor of shape 
        (m, n), where m is the length of the batch (same as 
        `word_seqs.shape[0]`) and n is the dimensionality of the 
        color representations the model is using. The goal is
        to attach each color vector i to each of the tokens in
        the ith sequence of (the embedded version of) `word_seqs`.
        
        """        
        ##### YOUR CODE HERE
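        embedding = self.embedding(word_seqs)
        if target_colors is None:
            return embedding
        # Accept (m, n) as documented, or (m, 1, n) as passed by
        # `ColorizedEncoderDecoder.forward`.
        if target_colors.dim() == 2:
            target_colors = target_colors.unsqueeze(1)
        # Tile each target color across the time dimension: (m, k, n).
        tiled_colors = torch.repeat_interleave(
            target_colors, word_seqs.shape[1], dim=1)
        return torch.cat((embedding, tiled_colors), dim=2)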




__Step 2__: Modify the `EncoderDecoder`. For this, you just need to make a small change to the `forward` method: extract the target colors from `color_seqs` and feed them to the decoder.

In [206]:
from torch_color_describer import EncoderDecoder

class ColorizedEncoderDecoder(EncoderDecoder):
    
    def forward(self, 
            color_seqs, 
            word_seqs, 
            seq_lengths=None, 
            hidden=None, 
            targets=None):
        if hidden is None:
            hidden = self.encoder(color_seqs)
            
        # Extract the target colors from `color_seqs` and 
        # feed them to the decoder, which already has a
        # `target_colors` keyword.        
        ##### YOUR CODE HERE

        output, hidden = self.decoder(
            word_seqs, seq_lengths=seq_lengths, hidden=hidden, target_colors=color_seqs[:,-1:,:])
        
        return output, hidden, targets

__Step 3__: Finally, as in the examples in [Modifying the core model](colors_overview.ipynb#Modifying-the-core-model), you need to modify the `build_graph` method of `ContextualColorDescriber` so that it uses your new `ColorContextDecoder` and `ColorizedEncoderDecoder`. Here's starter code:

In [194]:
from torch_color_describer import Encoder

class ColorizedInputDescriber(ContextualColorDescriber):
        
    def build_graph(self):
        
        # We didn't modify the encoder, so this is
        # just copied over from the original:
        encoder = Encoder(
            color_dim=self.color_dim,
            hidden_dim=self.hidden_dim)

        # Use your `ColorContextDecoder`, making sure
        # to pass in all the keyword arguments coming
        # from `ColorizedInputDescriber`:
        
        ##### YOUR CODE HERE
        decoder = ColorContextDecoder(
            color_dim=self.color_dim,
            vocab_size=self.vocab_size,
            embed_dim=self.embed_dim,
            embedding=self.embedding,
            hidden_dim=self.hidden_dim)


        
        # Return a `ColorizedEncoderDecoder` that uses
        # your encoder and decoder:
        
        ##### YOUR CODE HERE

        return ColorizedEncoderDecoder(encoder, decoder)




That's it! Since these modifications are pretty intricate, you might want to use [a toy dataset](colors_overview.ipynb#Toy-problems-for-development-work) to debug it:

In [195]:
toy_color_seqs, toy_word_seqs, toy_vocab = create_example_dataset(
    group_size=50, vec_dim=2)

In [196]:
toy_color_seqs_train, toy_color_seqs_test, toy_word_seqs_train, toy_word_seqs_test = \
    train_test_split(toy_color_seqs, toy_word_seqs)

In [197]:
toy_mod = ColorizedInputDescriber(
    toy_vocab, 
    embed_dim=10, 
    hidden_dim=10, 
    max_iter=100, 
    batch_size=128)

In [217]:
_ = toy_mod.fit(toy_color_seqs_train, toy_word_seqs_train)

Epoch 100; err = 0.1454654335975647

In [218]:
toy_mod.listener_accuracy(toy_color_seqs_test, toy_word_seqs_test)

1.0

If that worked, then you can now try this model on SCC problems!

## Your original system [3 points]

There are many options for your original system, which consists of the full pipeline – all preprocessing and modeling steps. You are free to use any model you like, as long as you subclass `ContextualColorDescriber` in a way that allows its `listener_accuracy` method to behave in the expected way.

So that we can evaluate models in a uniform way for the bake-off, we ask that you modify the function `my_original_system` below so that it accepts a trained instance of your model and does any preprocessing steps required by your model.

If we seek to reproduce your results, we will rerun this entire notebook. Thus, it is fine if your `my_original_system` makes use of functions you wrote or modified above this cell.

In [219]:
def my_original_system(trained_model, color_seqs_test, texts_test): 
    """Feel free to modify this code to accommodate the needs of
    your system. Just keep in mind that it will get raw corpus
    examples as inputs for the bake-off.
    
    """    
    # `word_seqs_test` is a list of strings, so tokenize each of
    # its elements:    
    tok_seqs = [tokenize_example(s) for s in texts_test]
    
    col_seqs = [represent_color_context(colors) 
                for colors in color_seqs_test]

    # Return the `listener_accuracy` for your model:
    return trained_model.listener_accuracy(col_seqs, tok_seqs)

If `my_original_system` works on test sets you create from the corpus distribution, then it will work for the bake-off, so consider checking that. For example, this would check that `dev_mod` above passes muster:

In [220]:
my_original_system(dev_mod, dev_rawcols_test, dev_texts_test)

0.6547653325655053
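
For intuition about the number above, here is a minimal sketch of how a listener-style accuracy can be computed. It assumes the listener guesses whichever color makes the utterance least perplexing when that color is placed in the final (target) position, and that a guess is correct when it picks the true target. The scorer `toy_perplexity` and the use of color names in place of color vectors are illustrative assumptions, and this is not the `ContextualColorDescriber` implementation.

```python
def listener_accuracy_sketch(contexts, utterances, perplexity):
    """Fraction of examples where the true target (last color) wins.

    `perplexity(colors, utterance)` should return a score for the
    utterance given a context whose final element is treated as the
    target; lower means the utterance fits that target better.
    """
    correct = 0
    for colors, utt in zip(contexts, utterances):
        target_index = len(colors) - 1
        scores = []
        for i in range(len(colors)):
            # Move candidate i into the final (target) slot.
            reordered = [c for j, c in enumerate(colors) if j != i]
            reordered.append(colors[i])
            scores.append(perplexity(reordered, utt))
        predicted = min(range(len(colors)), key=scores.__getitem__)
        correct += int(predicted == target_index)
    return correct / len(contexts)


def toy_perplexity(colors, utterance):
    # Hypothetical scorer: low perplexity iff the utterance names the target.
    return 1.0 if utterance == colors[-1] else 10.0


contexts = [
    ["red", "green", "blue"],   # target: blue
    ["blue", "red", "green"],   # target: green
    ["green", "blue", "red"],   # target: red
]
utterances = ["blue", "green", "blue"]  # the last one is misleading

accuracy = listener_accuracy_sketch(contexts, utterances, toy_perplexity)
```

With these toy inputs, the first two utterances single out their targets and the third points to a distractor, so two of the three examples count as correct.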

In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies.

In [240]:
# Enter your system description in this cell.
# For the text tokenizer, we follow [Monroe et al. 2017] and preprocess the text by 1. lowercasing it; 
# 2. splitting punctuation from the words; 3. replacing words that occur only once with the UNKNOWN symbol. 
# For the color context preprocessing, we first convert the colors from HSL to RGB space and rescale the 
# data to the [0, 255] range; after that, we apply the Fourier transform from [Monroe et al. 2017]
# to the data. For the model, we concatenate the target color representation with every token's embedding. We also 
# increase the network depth to 2 and network width to 2. 

# My peak score was: 0.8266235424291429
# This is my code


dev_corpus = ColorsCorpusReader(
    COLORS_SRC_FILENAME, 
    word_count=None, 
    normalize_colors=True)
dev_examples = list(dev_corpus.read())
dev_rawcols, dev_texts = zip(*[[ex.colors, ex.contents] for ex in dev_examples])
dev_rawcols_train, dev_rawcols_test, dev_texts_train, dev_texts_test = \
    train_test_split(dev_rawcols, dev_texts)

# Text sequence
dev_seqs_train = [tokenize_example(s) for s in dev_texts_train]
dev_seqs_test = [tokenize_example(s) for s in dev_texts_test]

# dev_vocab = sorted({w for toks in dev_seqs_train for w in toks}) + [UNK_SYMBOL]
occuredOnce_w = set()
all_words_occs = set()
for toks in dev_seqs_train:
    for w in toks:
        if w in occuredOnce_w:
            occuredOnce_w.remove(w)
        elif w not in all_words_occs:
            occuredOnce_w.add(w)
        all_words_occs.add(w)
            
dev_seqs_train = [tokenize_example(s, occuredOnce_w) for s in dev_texts_train]
dev_vocab = sorted({w for toks in dev_seqs_train for w in toks})


# Color sequence
dev_cols_train = [represent_color_context(colors) for colors in dev_rawcols_train]
dev_cols_test = [represent_color_context(colors) for colors in dev_rawcols_test]

# Model
import torch
import torch.nn as nn
from torch_color_describer import Decoder
from torch_color_describer import EncoderDecoder
from torch_color_describer import Encoder

class DeepEncoder(Encoder):
    def __init__(self, *args, num_layers=2, **kwargs):
        super().__init__(*args, **kwargs)
        self.num_layers = num_layers
        self.rnn = nn.GRU(
            input_size=self.color_dim,
            hidden_size=self.hidden_dim,
            num_layers=self.num_layers,
            batch_first=True) 

class CustomizedColorContextDecoder(Decoder):    
    def __init__(self, color_dim, num_layers=2, *args, **kwargs):
        self.color_dim = color_dim
        self.num_layers = num_layers
        super().__init__(*args, **kwargs)
        
        # Fix the `self.rnn` attribute:
        ##### YOUR CODE HERE
        self.rnn = nn.GRU(
            input_size=self.embed_dim + self.color_dim,
            hidden_size=self.hidden_dim,
            num_layers=self.num_layers,
            batch_first=True)

    def get_embeddings(self, word_seqs, target_colors=None):  
        """You can assume that `target_colors` is a tensor of shape 
        (m, n), where m is the length of the batch (same as 
        `word_seqs.shape[0]`) and n is the dimensionality of the 
        color representations the model is using. The goal is
        to attach each color vector i to each of the tokens in
        the ith sequence of (the embedded version of) `word_seqs`.
        
        """        
        ##### YOUR CODE HERE
        embedding = self.embedding(word_seqs)
        if target_colors is not None:
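            # `forward` passes `target_colors` with shape (m, 1, n), so
            # repeat it along the sequence dimension to give every
            # token its own copy of the target color:
            interleaved_colors = torch.repeat_interleave(
                target_colors, word_seqs.shape[1], dim=1)
            return torch.cat((embedding, interleaved_colors), dim=2)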
        else:
            return embedding


class CustomizedColorizedEncoderDecoder(EncoderDecoder):
    
    def forward(self, 
            color_seqs, 
            word_seqs, 
            seq_lengths=None, 
            hidden=None, 
            targets=None):
        if hidden is None:
            hidden = self.encoder(color_seqs)
            
        # Extract the target colors from `color_seqs` and 
        # feed them to the decoder, which already has a
        # `target_colors` keyword.        
        ##### YOUR CODE HERE

        output, hidden = self.decoder(
            word_seqs, seq_lengths=seq_lengths, hidden=hidden, target_colors=color_seqs[:,-1:,:])
        
        return output, hidden, targets


class CustomizedColorizedInputDescriber(ContextualColorDescriber):
    def __init__(self, *args, num_layers=2, **kwargs):
        self.num_layers = num_layers
        super().__init__(*args, **kwargs)

    def build_graph(self):
        
        # Unlike the original, the encoder here is the
        # multi-layer `DeepEncoder` defined above:
        encoder = DeepEncoder(
            color_dim=self.color_dim,
            hidden_dim=self.hidden_dim,
            num_layers=self.num_layers)

        # Use your `ColorContextDecoder`, making sure
        # to pass in all the keyword arguments coming
        # from `ColorizedInputDescriber`:
        
        ##### YOUR CODE HERE
        decoder = CustomizedColorContextDecoder(
            color_dim=self.color_dim,
            vocab_size=self.vocab_size,
            embed_dim=self.embed_dim,
            embedding=self.embedding,
            hidden_dim=self.hidden_dim,
            num_layers=self.num_layers)


        
        # Return a `ColorizedEncoderDecoder` that uses
        # your encoder and decoder:
        
        ##### YOUR CODE HERE

        return CustomizedColorizedEncoderDecoder(encoder, decoder)


dev_mod = ColorizedInputDescriber(
    dev_vocab, 
    embed_dim=20, 
    hidden_dim=20, 
    max_iter=100, 
    batch_size=128)
_ = dev_mod.fit(dev_cols_train, dev_seqs_train)

my_original_system(dev_mod, dev_rawcols_test, dev_texts_test)

# Please do not remove this comment.

Epoch 100; err = 129.03375735878944

0.8266235424291429
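
The rare-word step in the system above (replacing words that occur only once in the training data with an unknown-word symbol) can also be written compactly with `collections.Counter`. This is a minimal sketch under stated assumptions: `replace_hapaxes` and the placeholder string `"<unk>"` are hypothetical names chosen for illustration, not the course's `UNK_SYMBOL` or the tokenizer used in this notebook.

```python
from collections import Counter

UNK_PLACEHOLDER = "<unk>"  # hypothetical symbol, for illustration only


def replace_hapaxes(token_seqs, unk=UNK_PLACEHOLDER):
    """Replace every token that occurs exactly once across `token_seqs`.

    Returns the rewritten sequences and the set of replaced tokens.
    """
    counts = Counter(w for toks in token_seqs for w in toks)
    rare = {w for w, c in counts.items() if c == 1}
    rewritten = [[unk if w in rare else w for w in toks]
                 for toks in token_seqs]
    return rewritten, rare


train_seqs = [
    ["light", "blue"],
    ["blue", "ish"],
    ["light", "teal"],
]

rewritten_seqs, rare_words = replace_hapaxes(train_seqs)
vocab = sorted({w for toks in rewritten_seqs for w in toks})
```

Counting over the training sequences only, and then building the vocabulary from the rewritten sequences, gives the model training examples of the unknown-word symbol, which it will also meet at test time.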

## Bakeoff [1 point]

For the bake-off, we will release a test set. The announcement will go out on the discussion forum. You will evaluate your custom model from the previous question on this new dataset using your `my_original_system` function. Rules:

1. Only one evaluation is permitted.
1. No additional system tuning is permitted once the bake-off has started.

The cells below this one constitute your bake-off entry.

People who enter will receive the additional homework point, and people whose systems achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.

Late entries will be accepted, but they cannot earn the extra 0.5 points. Similarly, you cannot win the bake-off unless your homework is submitted on time.

The announcement will include the details on where to submit your entry.

In [None]:
# Enter your bake-off assessment code in this cell. 
# Please do not remove this comment.


In [None]:
# On an otherwise blank line in this cell, please enter
# your listener_accuracy score as reported by the code
# above. Please enter only a number between 0 and 1 inclusive. 
# Please do not remove this comment.
