In [1]:
__authors__ = "Anton Gochev, Jaro Habr, Yan Jiang, Samuel Kahn"
__version__ = "XCS224u, Stanford, Spring 2021"

## Colours with static embeddings

1. [Setup](#Setup)
1. [Dataset](#Dataset)
    1. [Filtered Corpus](#Filtered-Corpus)
    1. [Bake-Off Corpus](#Bake-Off-Corpus)
1. [Baseline-System](#Baseline-System)
1. [Experiments](#Experiments)
    1. [BERT Embeddings](#BERT-Embeddings)
    1. [XLNet Embeddings](#XLNet-Embeddings)
    1. [RoBERTa Embeddings](#RoBERTa-Embeddings)
    1. [ELECTRA Embeddings](#ELECTRA-Embeddings)

## Setup

This notebook explores how the baseline model performs with different pre-trained static embeddings extracted from transformers: BERT, XLNet, RoBERTa and ELECTRA.
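As a rough illustration of what "static" means here: each token maps to one fixed vector regardless of its context, so a transformer's input embedding matrix can be treated as a lookup table. The sketch below uses a tiny hand-written table in place of a real model. `PRETRAINED`, `build_embedding` and the vector values are hypothetical, the value of `UNK_SYMBOL` is assumed, and the notebook's own `mu.extract_input_embeddings` may work differently. With Hugging Face models the real embedding layer is available through `model.get_input_embeddings()`.

```python
import random

UNK_SYMBOL = "$UNK"  # assumed value; the notebook imports UNK_SYMBOL from utils.utils

# Hypothetical stand-in for a transformer's input embedding table.
PRETRAINED = {
    "blue": [0.1, 0.9],
    "green": [0.2, 0.7],
    "dark": [-0.5, 0.0],
}


def build_embedding(token_seqs, table, dim=2, seed=0):
    """Collect the vocabulary of `token_seqs` and look up one fixed vector per token."""
    rng = random.Random(seed)
    vocab = sorted({tok for seq in token_seqs for tok in seq}) + [UNK_SYMBOL]
    embedding = []
    for tok in vocab:
        # Tokens missing from the table (here only UNK_SYMBOL) get a small random vector.
        vec = table.get(tok) or [rng.uniform(-0.5, 0.5) for _ in range(dim)]
        embedding.append(vec)
    return embedding, vocab


embedding, vocab = build_embedding([["dark", "blue"], ["blue", "green"]], PRETRAINED)
print(vocab)
print(embedding[vocab.index("blue")])
```

The returned pair mirrors the `(embeddings, vocab)` shape that the later cells pass to `BaselineDescriber`: row `i` of the embedding belongs to `vocab[i]`.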

In [2]:
from utils.colors import ColorsCorpusReader
import os
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
from utils.torch_color_describer import ContextualColorDescriber, create_example_dataset
import utils.utils
from utils.utils import UNK_SYMBOL, START_SYMBOL, END_SYMBOL
import matplotlib.pyplot as plt
import matplotlib.patches as mpatch
import numpy as np
from baseline.model import (
    BaselineTokenizer, BaselineColorEncoder,
    BaselineEmbedding, BaselineDescriber, GloVeEmbedding,
    ConvolutionalColorEncoder
)

from transformers import (
    BertTokenizer, BertModel,
    XLNetTokenizer, XLNetModel,
    RobertaTokenizer, RobertaModel,
    ElectraTokenizer, ElectraModel,    
)

import utils.model_utils as mu

## Dataset

This section explores the dataset: it counts the examples in each condition and looks at the word distribution of the descriptions to check for data imbalance.

### Filtered Corpus

The filtered corpus is the full dataset used in assignment 4. The following code looks at the composition of the dataset: the number of examples in each condition and the word counts of the color descriptions.

In [3]:
COLORS_SRC_FILENAME = os.path.join(
    "data", "colors", "filteredCorpus.csv"
)

In [4]:
corpus = ColorsCorpusReader(
    COLORS_SRC_FILENAME,
    word_count=None,
    normalize_colors=True
)

In [5]:
examples = list(corpus.read())

In [6]:
len(examples)

46994

In [7]:
subset_examples = mu.extract_colour_examples(examples, from_word_count=5)

In [8]:
close_examples = [example for example in examples if example.condition == "close"]
split_examples = [example for example in examples if example.condition == "split"]
far_examples = [example for example in examples if example.condition == "far"]

In [9]:
print(f"close: {len(close_examples)}")
print(f"split: {len(split_examples)}")
print(f"far: {len(far_examples)}")

close: 15519
split: 15693
far: 15782
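The per-condition counts above can also be computed in a single pass, together with the distribution of description lengths. This is a minimal sketch on toy data: the `Example` namedtuple and the four descriptions are hypothetical stand-ins for the corpus examples, of which only the `condition` and `contents` attributes are used.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the corpus examples; only two attributes are modelled.
Example = namedtuple("Example", ["condition", "contents"])

toy_examples = [
    Example("close", "the darker blue one"),
    Example("far", "purple"),
    Example("split", "lime green"),
    Example("far", "bright pink"),
]

# Number of examples per condition
condition_counts = Counter(ex.condition for ex in toy_examples)
# Number of descriptions per whitespace-separated word count
length_counts = Counter(len(ex.contents.split()) for ex in toy_examples)

for condition, n in sorted(condition_counts.items()):
    print(f"{condition}: {n}")
for n_words, n in sorted(length_counts.items()):
    print(f"{n_words} word(s): {n} description(s)")
```

Replacing `toy_examples` with the notebook's `examples` list should give the same per-condition totals as the three list comprehensions.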


For more detail on the datasets (training and bake-off), refer to [colors_in_context.ipynb](colors_in_context.ipynb), which shows how the colour examples are distributed among the different splits.

### Bake-Off Corpus

The following code analyses the bake-off dataset. We look at the number of examples in each condition as well as the word counts of the color descriptions.

In [10]:
BAKE_OFF_COLORS_SRC_FILENAME = os.path.join(
    "data", "colors", "cs224u-colors-bakeoff-data.csv"
)

In [11]:
bake_off_corpus = ColorsCorpusReader(
    BAKE_OFF_COLORS_SRC_FILENAME,
    word_count=None,
    normalize_colors=True
)

In [12]:
bake_off_examples = list(bake_off_corpus.read())

In [13]:
import pickle

emb_path = os.path.join(
    "data", "colors", "resnet18_color_embeddings.pickle"
)
# The context manager closes the file even if unpickling fails
with open(emb_path, "rb") as file:
    resnet_emb = pickle.load(file)


In [110]:
len(resnet_emb)

13890

## Baseline-System

This baseline system builds on the one from assignment 4, but uses different token embeddings and token sequences.

In [14]:
%load_ext autoreload
%autoreload 2

from baseline.model import (
    BaselineTokenizer, BaselineColorEncoder,
    BaselineEmbedding, BaselineDescriber, GloVeEmbedding,
    ConvolutionalColorEncoder
)



### Baseline system development - dev dataset

The tiny dataset is used for baseline model development and fast testing.

In [15]:
def create_dev_data():
    dev_color_seqs, dev_word_seqs, dev_vocab = create_example_dataset(
        group_size=50,
        vec_dim=2
    )

    dev_colors_train, dev_colors_test, dev_words_train, dev_words_test = \
        train_test_split(dev_color_seqs, dev_word_seqs)
    
    return dev_vocab, dev_colors_train, dev_words_train, dev_colors_test, dev_words_test

In [16]:
dev_vocab, dev_colors_train, dev_tokens_train, dev_colors_test, dev_texts_test = \
    create_dev_data()

### Model training - full dataset

The full color context dataset is used for final baseline model training.

In [56]:
def create_data(tokenizer, include_position=False, include_conv_embeddings=False):
    # NOTE: `include_position` and `include_conv_embeddings` are currently unused.
    # Colors are always encoded with the global `color_encoder`, which must be
    # defined before this function is called.
    rawcols, texts = zip(*[[ex.colors, ex.contents] for ex in examples])

    # Random train/test split of the raw colors and their descriptions
    raw_colors_train, raw_colors_test, texts_train, texts_test = \
        train_test_split(rawcols, texts)

    tokens_train = [
        mu.tokenize_colour_description(text, tokenizer) for text in texts_train
    ]
    colors_train = [
        color_encoder.encode_colors_from_convnet(colors) for colors in raw_colors_train
    ]

    # The test split is returned raw; `evaluate` tokenizes and encodes it
    return colors_train, tokens_train, raw_colors_test, texts_test

In [57]:
def create_bakeoff_data():    
    return zip(*[[ex.colors, ex.contents] for ex in bake_off_examples])
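Both `create_data` and `create_bakeoff_data` rely on the `zip(*...)` idiom to turn a list of `[colors, text]` pairs into two parallel tuples. A small sketch with placeholder values (the color names and descriptions are made up):

```python
# Each item pairs a color context with its description (toy values).
pairs = [
    [("c1", "c2", "c3"), "blue"],
    [("c4", "c5", "c6"), "dull green"],
]

# zip(*pairs) transposes the list: first all contexts, then all descriptions.
colors, texts = zip(*pairs)

print(colors)
print(texts)
```

Note that unpacking into two names only works when the list is non-empty and every item has exactly two elements; an empty list raises a `ValueError`.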

## Experiments

Results from the experiments:

| Model | Training log | Test-split results | Bake-off results |
| --- | --- | --- | --- |
| BERT `bert-base-cased` | Stopping after epoch 54. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 115.87914156913757. CPU times: user 2h 4min 43s, sys: 1h 1min 36s, total: 3h 6min 19s. Wall time: 1h 3min 47s. | {'listener_accuracy': 0.8282407013362839, 'corpus_bleu': 0.4614753491619744} | {'listener_accuracy': 0.9034958148695224, 'corpus_bleu': 0.6637981882180088} |
| XLNet `xlnet-base-cased` | Stopping after epoch 55. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 117.36168694496155. CPU times: user 2h 18min 38s, sys: 1h 7min 4s, total: 3h 25min 43s. Wall time: 1h 9min 34s. | {'listener_accuracy': 0.8243254745084688, 'corpus_bleu': 0.4290345340732608} | {'listener_accuracy': 0.9034958148695224, 'corpus_bleu': 0.6462354138567806} |
| RoBERTa `roberta-base` | Stopping after epoch 71. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 88.95393252372742. CPU times: user 2h 26min 29s, sys: 1h 11min 9s, total: 3h 37min 38s. Wall time: 1h 10min 30s. | {'listener_accuracy': 0.8160694527193804, 'corpus_bleu': 0.43914107019568926} | {'listener_accuracy': 0.8936484490398818, 'corpus_bleu': 0.6365891729377956} |
| ELECTRA `google/electra-small-discriminator` | Stopping after epoch 55. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 112.70061588287354. CPU times: user 1h 18min 13s, sys: 33min 6s, total: 1h 51min 20s. Wall time: 37min 34s. | {'listener_accuracy': 0.8459443356881436, 'corpus_bleu': 0.4808836633817978} | {'listener_accuracy': 0.914327917282127, 'corpus_bleu': 0.6813010156866868} |
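To compare the rows at a glance, the scores from the table can be ranked programmatically. The sketch below copies the numbers from the table; the `results` layout and the `rank` helper are illustrative and not part of the notebook's utilities.

```python
# Scores copied from the results table above.
results = {
    "BERT": {
        "test": {"listener_accuracy": 0.8282407013362839, "corpus_bleu": 0.4614753491619744},
        "bakeoff": {"listener_accuracy": 0.9034958148695224, "corpus_bleu": 0.6637981882180088},
    },
    "XLNet": {
        "test": {"listener_accuracy": 0.8243254745084688, "corpus_bleu": 0.4290345340732608},
        "bakeoff": {"listener_accuracy": 0.9034958148695224, "corpus_bleu": 0.6462354138567806},
    },
    "RoBERTa": {
        "test": {"listener_accuracy": 0.8160694527193804, "corpus_bleu": 0.43914107019568926},
        "bakeoff": {"listener_accuracy": 0.8936484490398818, "corpus_bleu": 0.6365891729377956},
    },
    "ELECTRA": {
        "test": {"listener_accuracy": 0.8459443356881436, "corpus_bleu": 0.4808836633817978},
        "bakeoff": {"listener_accuracy": 0.914327917282127, "corpus_bleu": 0.6813010156866868},
    },
}


def rank(results, split):
    """Order models by listener accuracy, breaking ties with corpus BLEU."""
    def key(model):
        scores = results[model][split]
        return (scores["listener_accuracy"], scores["corpus_bleu"])
    return sorted(results, key=key, reverse=True)


for split in ("test", "bakeoff"):
    print(split)
    for model in rank(results, split):
        scores = results[model][split]
        print(f"  {model:8s} acc={scores['listener_accuracy']:.3f} "
              f"bleu={scores['corpus_bleu']:.3f}")
```

On these numbers ELECTRA ranks first on both splits and both metrics, despite being the smallest of the four checkpoints and the fastest to train.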

In [58]:
def evaluate(trained_model, tokenizer, color_seqs_test, texts_test):
    tok_seqs = [mu.tokenize_colour_description(text, tokenizer) for text in texts_test]
    col_seqs = [color_encoder.encode_colors_from_convnet(colors) for colors in color_seqs_test]

    return trained_model.evaluate(col_seqs, tok_seqs)
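The `listener_accuracy` value returned by `evaluate` is, roughly, the share of examples in which the model, used as a listener, identifies the target color from its description. A minimal sketch of that idea, assuming the listener scores the description against each candidate color and picks the best-scoring one. The function names and toy scores are hypothetical, and the actual implementation behind `trained_model.evaluate` may differ.

```python
def listener_predict(scores):
    """Index of the candidate color with the highest score for the description."""
    return max(range(len(scores)), key=lambda i: scores[i])


def listener_accuracy(all_scores, target_indices):
    """Fraction of examples where the predicted index equals the target index."""
    correct = sum(
        listener_predict(scores) == target
        for scores, target in zip(all_scores, target_indices)
    )
    return correct / len(target_indices)


# Toy scores for three examples with three candidate colors each.
toy_scores = [
    [0.1, 0.2, 0.7],
    [0.5, 0.3, 0.2],
    [0.2, 0.2, 0.6],
]
toy_targets = [2, 2, 2]  # assumption: the target is the last color in each context

accuracy = listener_accuracy(toy_scores, toy_targets)
print(accuracy)
```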

### BERT Embeddings

In [59]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
bert_model = BertModel.from_pretrained('bert-base-cased')

In [60]:
from baseline.model import (
    BaselineTokenizer, BaselineColorEncoder,
    BaselineEmbedding, BaselineDescriber, GloVeEmbedding,
    ConvolutionalColorEncoder
)
color_encoder = ConvolutionalColorEncoder("resnet18",True)
%time  colors_train, tokens_train, raw_colors_test, texts_test = create_data(bert_tokenizer)

b_raw_colors_test, b_texts_test = create_bakeoff_data()

Using cache found in /Users/samuelkahn/.cache/torch/hub/pytorch_vision_v0.6.0


CPU times: user 3h 16min 18s, sys: 17min 58s, total: 3h 34min 17s
Wall time: 2h 42min 33s


In [61]:
%time bert_embeddings, bert_vocab = mu.extract_input_embeddings(texts_test, bert_model, bert_tokenizer)

CPU times: user 2.8 s, sys: 36.8 ms, total: 2.83 s
Wall time: 2.84 s


In [62]:
colors_train[0][0].shape

torch.Size([1, 566])

In [63]:
# %time bert_positional_embeddings, bert_positional_vocab = \
#    mu.extract_positional_embeddings(texts_test, bert_model, bert_tokenizer)

*Baseline model using BERT pretrained embeddings and vocab*

In [64]:
%load_ext autoreload
%autoreload 2
from baseline.model import (
    BaselineTokenizer, BaselineColorEncoder,
    BaselineEmbedding, BaselineDescriber, GloVeEmbedding,
    ConvolutionalColorEncoder
)
baseline_model = BaselineDescriber(
    bert_vocab,
    embedding=bert_embeddings,
    early_stopping=True
)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [None]:
%time _ = baseline_model.fit(colors_train, tokens_train)

  perp = [np.prod(s)**(-1/len(s)) for s in scores]
Finished epoch 80 of 1000; error is 78.556766748428347
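The `perp = [np.prod(s)**(-1/len(s)) for s in scores]` line echoed in the output above computes a per-sequence perplexity: the inverse geometric mean of the values in `s`. Assuming each `s` holds per-token probabilities, the same quantity can be computed in log space, which avoids underflow when many small probabilities are multiplied. The `perplexity` function and the toy probabilities are illustrative only.

```python
import math


def perplexity(token_probs):
    """Inverse geometric mean of the token probabilities, computed in log space."""
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))


probs = [0.25, 0.5, 0.125]

# Direct product form, as in the echoed line (math.prod needs Python 3.8+)
direct = math.prod(probs) ** (-1 / len(probs))

print(perplexity(probs), direct)
```

Both forms agree on short sequences; the log-space version stays finite for long sequences where the direct product would round to zero.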

Evaluate on test data

In [None]:
evaluate(baseline_model, bert_tokenizer, raw_colors_test, texts_test)

Evaluate on bake-off data

In [125]:
evaluate(baseline_model, bert_tokenizer, b_raw_colors_test, b_texts_test)

{'listener_accuracy': 0.3825701624815362, 'corpus_bleu': 0.5236180769568916}

### XLNet Embeddings

In [28]:
xlnet_tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
xlnet_model = XLNetModel.from_pretrained('xlnet-base-cased')

In [29]:
colors_train, tokens_train, raw_colors_test, texts_test = create_data(xlnet_tokenizer)

b_raw_colors_test, b_texts_test = create_bakeoff_data()

In [30]:
%time xlnet_embeddings, xlnet_vocab = mu.extract_input_embeddings(texts_test, xlnet_model, xlnet_tokenizer)

CPU times: user 3.35 s, sys: 41.3 ms, total: 3.39 s
Wall time: 3.42 s


In [31]:
#%time bert_positional_embeddings, bert_positional_vocab = \
#    mu.extract_positional_embeddings(texts_test, xlnet_model, xlnet_tokenizer)

*Baseline model using XLNet pretrained embeddings and vocab*

In [32]:
baseline_model = BaselineDescriber(
    xlnet_vocab,
    embedding=xlnet_embeddings,
    early_stopping=True
)

In [33]:
%time _ = baseline_model.fit(colors_train, tokens_train)

Stopping after epoch 55. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 117.36168694496155

CPU times: user 2h 18min 38s, sys: 1h 7min 4s, total: 3h 25min 43s
Wall time: 1h 9min 34s
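The "Stopping after epoch ..." message comes from early stopping (`early_stopping=True`): training ends once the validation score has failed to improve by at least `tol` for more than 10 epochs in a row. A minimal sketch of that rule, assuming higher scores are better. `early_stopping_epoch` and `n_iter_no_change` are illustrative names, and this is not the code of the model class itself.

```python
def early_stopping_epoch(scores, tol=1e-5, n_iter_no_change=10):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    no_improve = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best + tol:
            # Meaningful improvement: remember it and reset the counter.
            best = score
            no_improve = 0
        else:
            no_improve += 1
        if no_improve > n_iter_no_change:
            return epoch
    return None


# Three improving epochs followed by a plateau of eleven epochs.
plateau = [0.1, 0.2, 0.3] + [0.3] * 11
stop_epoch = early_stopping_epoch(plateau)
print(stop_epoch)
```

A plateau of exactly 10 epochs does not trigger the stop in this sketch; the counter has to exceed `n_iter_no_change`.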


Evaluate on test data

In [34]:
evaluate(baseline_model, xlnet_tokenizer, raw_colors_test, texts_test)

{'listener_accuracy': 0.8243254745084688, 'corpus_bleu': 0.4290345340732608}

Evaluate on bake-off data

In [35]:
evaluate(baseline_model, xlnet_tokenizer, b_raw_colors_test, b_texts_test)

{'listener_accuracy': 0.9034958148695224, 'corpus_bleu': 0.6462354138567806}

### RoBERTa Embeddings

In [36]:
roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
roberta_model = RobertaModel.from_pretrained('roberta-base')

In [37]:
colors_train, tokens_train, raw_colors_test, texts_test = create_data(roberta_tokenizer)

b_raw_colors_test, b_texts_test = create_bakeoff_data()

In [38]:
%time roberta_embeddings, roberta_vocab = mu.extract_input_embeddings(texts_test, roberta_model, roberta_tokenizer)

CPU times: user 2.32 s, sys: 35.4 ms, total: 2.36 s
Wall time: 2.37 s


In [39]:
#%time bert_positional_embeddings, bert_positional_vocab = \
#    mu.extract_positional_embeddings(texts_test, roberta_model, roberta_tokenizer)

*Baseline model using RoBERTa pretrained embeddings and vocab*

In [40]:
baseline_model = BaselineDescriber(
    roberta_vocab,
    embedding=roberta_embeddings,
    early_stopping=True
)

In [41]:
%time _ = baseline_model.fit(colors_train, tokens_train)

Stopping after epoch 71. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 88.95393252372742

CPU times: user 2h 26min 29s, sys: 1h 11min 9s, total: 3h 37min 38s
Wall time: 1h 10min 30s


Evaluate on test data

In [42]:
evaluate(baseline_model, roberta_tokenizer, raw_colors_test, texts_test)

{'listener_accuracy': 0.8160694527193804, 'corpus_bleu': 0.43914107019568926}

Evaluate on bake-off data

In [43]:
evaluate(baseline_model, roberta_tokenizer, b_raw_colors_test, b_texts_test)

{'listener_accuracy': 0.8936484490398818, 'corpus_bleu': 0.6365891729377956}

### ELECTRA Embeddings

In [44]:
electra_tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
electra_model = ElectraModel.from_pretrained('google/electra-small-discriminator')

In [45]:
colors_train, tokens_train, raw_colors_test, texts_test = create_data(electra_tokenizer)

b_raw_colors_test, b_texts_test = create_bakeoff_data()

In [46]:
%time electra_embeddings, electra_vocab = mu.extract_input_embeddings(texts_test, electra_model, electra_tokenizer)

CPU times: user 2.66 s, sys: 7.9 ms, total: 2.67 s
Wall time: 2.68 s


In [47]:
#%time bert_positional_embeddings, bert_positional_vocab = \
#    mu.extract_positional_embeddings(texts_test, electra_model, electra_tokenizer)

*Baseline model using ELECTRA pretrained embeddings and vocab*

In [48]:
baseline_model = BaselineDescriber(
    electra_vocab,
    embedding=electra_embeddings,
    early_stopping=True
)

In [49]:
%time _ = baseline_model.fit(colors_train, tokens_train)

Stopping after epoch 55. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 112.70061588287354

CPU times: user 1h 18min 13s, sys: 33min 6s, total: 1h 51min 20s
Wall time: 37min 34s


Evaluate on test data

In [50]:
evaluate(baseline_model, electra_tokenizer, raw_colors_test, texts_test)

{'listener_accuracy': 0.8459443356881436, 'corpus_bleu': 0.4808836633817978}

Evaluate on bake-off data

In [51]:
evaluate(baseline_model, electra_tokenizer, b_raw_colors_test, b_texts_test)

{'listener_accuracy': 0.914327917282127, 'corpus_bleu': 0.6813010156866868}