In [1]:
__authors__ = "Anton Gochev, Jaro Habr, Yan Jiang, Samuel Kahn"
__version__ = "XCS224u, Stanford, Spring 2021"

### Colours with contextual embeddings

This notebook uses the classes and helpers from the [model.py](experiment/model.py) library. Here we experiment with different combinations of contextual embeddings aiming to determine which will have a better performance than the basemodel or the static embeddings used in [colours_with_static_embeddings_with_convolutional_image_embeddings.ipynb](colours_with_static_embeddings_with_convolutional_image_embeddings).

## WIP

### Setup

In [2]:
from utils.colors import ColorsCorpusReader
import os
from sklearn.model_selection import train_test_split

from utils.torch_color_describer import (
    ContextualColorDescriber,
    create_example_dataset
)

from utils.utils import START_SYMBOL, END_SYMBOL, UNK_SYMBOL

from transformers import (
    BertTokenizer, BertModel,
    XLNetConfig, XLNetModel,
    XLNetTokenizer, XLNetForSequenceClassification,
    RobertaTokenizer, RobertaModel,
    ElectraTokenizer, ElectraModel,
    EncoderDecoderModel
)
import utils.model_utils as mu

### Dataset

This exploration of the dataset counts the examples for different classes and plots the words distribition in order to see any data imbalance issues.

#### Filtered Corpus

The filtered corpus is the full dataset used in assignment 4. The following code looks at the composition of the dataset, the number of example in each condition as well as the word count used in the color descriptions.

In [3]:
COLORS_SRC_FILENAME = os.path.join(
    "data", "colors", "filteredCorpus.csv"
)

In [4]:
corpus = ColorsCorpusReader(
    COLORS_SRC_FILENAME,
    word_count=None,
    normalize_colors=True
)

In [5]:
examples = list(corpus.read())

In [6]:
len(examples)

46994

To understand the datasets (training and bake-off) in more details refer to [colors_in_context.ipynb](colors_in_context.ipynb). The notebook shows the distribution of the colours examples among the different splits.

#### Bake-Off Corpus

The following code analyses the bake-off dataset. We will look at the number of examples for each of the conditions as well as the word count used to described the colors.

In [7]:
BAKE_OFF_COLORS_SRC_FILENAME = os.path.join(
    "data", "colors", "cs224u-colors-bakeoff-data.csv"
)

In [8]:
bake_off_corpus = ColorsCorpusReader(
    BAKE_OFF_COLORS_SRC_FILENAME,
    word_count=None,
    normalize_colors=True
)

In [9]:
bake_off_examples = list(bake_off_corpus.read())

In [10]:
len(bake_off_examples)

2031

### Baseline-System

This baseline system is based on assignment 4 and is enhanced with new classes that allow using different contextual embedding extrsactors for easier experiments and evaluation.

In [11]:
%load_ext autoreload
%autoreload 2

from baseline.model import (
#     BaselineTokenizer, 
    BaselineColorEncoder#, 
#     BaselineDecoder,
#     BaselineEmbedding, BaselineDescriber, GloVeEmbedding
)

from experiment.model import (
    TransformerEmbeddingDecoder, 
    TransformerEmbeddingDescriber,
    EmbeddingExtractorType,
    EmbeddingExtractor
)


In [12]:
color_encoder = BaselineColorEncoder()

#### Full dataset used for training and expeiments

In [13]:
def create_data(tokenizer):
    rawcols, texts = zip(*[[ex.colors, ex.contents] for ex in examples])
    
    raw_colors_train, raw_colors_test, texts_train, texts_test = \
        train_test_split(rawcols, texts)
    
    raw_colors_train = raw_colors_train
    texts_train = texts_train

    tokens_train = [
        mu.tokenize_colour_description(text, tokenizer) for text in texts_train
    ]
    colors_train = [
        color_encoder.encode_color_context(colors) for colors in raw_colors_train
    ]

    return colors_train, tokens_train, raw_colors_test, texts_test

In [14]:
def create_bakeoff_data():    
    return zip(*[[ex.colors, ex.contents] for ex in bake_off_examples])

### Experiments

Results from the experiments:

| Model | Hidden Layer | Protocol | Training Results | Bake-off Results |
| --- | --- | --- | --- | --- |
| BERT 'bert-base-cased' | Layer 12 | Stopping after epoch 13. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 6.030726432800293 CPU times: user 1h 43min 56s, sys: 54min 44s, total: 2h 38min 40s Wall time: 1h 22min 48s | {'listener_accuracy': 0.2905405405405405, 'corpus_bleu': 0.07466216216216218} | {'listener_accuracy': 0.32348596750369274, 'corpus_bleu': 0.050147710487444604} |
| BERT 'bert-base-cased' | Positional | Stopping after epoch 27. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 5.257309436798096 CPU times: user 1h 23min 10s, sys: 10min 23s, total: 1h 33min 33s Wall time: 58min 54s | {'listener_accuracy': 0.32432432432432434, 'corpus_bleu': 0.2281656643331077} | {'listener_accuracy': 0.3559822747415066, 'corpus_bleu': 0.39051130299796255} |
| BERT 'bert-base-cased' fixed padding and baseline special symbols | Layer 12 | Stopping after epoch 14. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 6.11735725402832 CPU times: user 1h 44min 47s, sys: 54min 54s, total: 2h 39min 42s Wall time: 1h 21min 23s | {'listener_accuracy': 0.36486486486486486, 'corpus_bleu': 0.035602020040005664} | {'listener_accuracy': 0.3658296405711472, 'corpus_bleu': 0.41658000242602683} |
| BERT 'bert-base-cased' fixed padding and BERT special symbols | Layer 12 | Stopping after epoch 15. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 5.931732177734375 CPU times: user 1h 53min 23s, sys: 59min 28s, total: 2h 52min 52s Wall time: 1h 29min 26s | {'listener_accuracy': 0.34459459459459457, 'corpus_bleu': 0.05016891891891893} | {'listener_accuracy': 0.3293943870014771, 'corpus_bleu': 0.050147710487444604} | 
| XLNet 'xlnet-base-cased' |  |  |  |  |
| RoBERTa 'roberta-base' |  |  |  |  |
| ELECTRA 'google/electra-small-discriminator' |  |  |  |  |

Initial experiments show that using just contextual or positional embeddings is certainly not improving the performance. It seems that those are very high-level and we are missing on some of the low level context static embeddings provide us with. Therefore, next steps are to implement and text embeddings extractors that make use of the static embeddings. 
- Do not throw away the static embeddings as they have good performance
- Combine the static embeddings with the [CLS] output to see what the result is (he had good experience with such approach)
- Combine the static embeddings with the output of the model (contextual embeddings/last hidden state) to see if there is any improvement

In additoin, we will try with different model complexity (increase the hidden_dim)

In [15]:
def evaluate(trained_model, tokenizer, color_seqs_test, texts_test):
    tok_seqs = [mu.tokenize_colour_description(text, tokenizer) for text in texts_test]
    col_seqs = [color_encoder.encode_color_context(colors) for colors in color_seqs_test]

    return trained_model.evaluate(col_seqs, tok_seqs)

In [16]:
b_raw_colors_test, b_texts_test = create_bakeoff_data()

In [42]:
def train_many(colors_train, tokens_train, extractors=None):
    assert extractors != None and type(extractors) is list, \
            "Expected a list of extractors but got something else"
    
    models = []
    for extractor in extractors:
        model = TransformerEmbeddingDescriber(
            vocab=bert_vocab,
            embedding=bert_embeddings,
            model=bert_model,
            tokenizer=bert_tokenizer,
            embed_extractor=embedding_extractor,
            early_stopping=True)
        %time model.fit(colors_train, tokens_train)
        models.append(model)
        
    return models

In [50]:
def evaluate_many(models, tokenizer, raw_colors_test, texts_test):
    assert models != None and type(models) is list, \
        "Expected a list of models but got something else"
    
    results = []
    for model in models:
        %time results.append(evaluate(model, tokenizer, raw_colors_test, texts_test))
        
    return results

In [36]:
embedding_extractors = [
    EmbeddingExtractor(EmbeddingExtractorType.STATIC),
    EmbeddingExtractor(EmbeddingExtractorType.POSITIONAL),
    EmbeddingExtractor(EmbeddingExtractorType.LAYER12)
]

### BERT Embeddings

In [17]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
bert_model = BertModel.from_pretrained('bert-base-cased')

In [18]:
%time  colors_train, tokens_train, raw_colors_test, texts_test = create_data(bert_tokenizer)

CPU times: user 8.05 s, sys: 111 ms, total: 8.16 s
Wall time: 8.18 s


In [19]:
bert_embeddings, bert_vocab = mu.extract_input_embeddings(texts_test, bert_model, bert_tokenizer)

In [51]:
models = train_many(colors_train[:5], tokens_train[:5], extractors=embedding_extractors)

Stopping after epoch 12. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 5.9898362159729

CPU times: user 1.05 s, sys: 267 ms, total: 1.32 s
Wall time: 1.15 s


Stopping after epoch 12. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 6.020166873931885

CPU times: user 1.12 s, sys: 224 ms, total: 1.35 s
Wall time: 1.09 s


Stopping after epoch 12. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 6.24141263961792

CPU times: user 1.17 s, sys: 233 ms, total: 1.41 s
Wall time: 890 ms


**Evaluate on test data**

In [54]:
evaluate_many(models, bert_tokenizer, raw_colors_test[:5], texts_test[:5])

CPU times: user 62.9 ms, sys: 9.52 ms, total: 72.5 ms
Wall time: 67 ms
CPU times: user 50.2 ms, sys: 1.99 ms, total: 52.2 ms
Wall time: 50.5 ms
CPU times: user 50 ms, sys: 2.14 ms, total: 52.1 ms
Wall time: 50.3 ms


[{'listener_accuracy': 0.4, 'corpus_bleu': 0.05000000000000001},
 {'listener_accuracy': 0.6, 'corpus_bleu': 0.05000000000000001},
 {'listener_accuracy': 0.2, 'corpus_bleu': 0.05000000000000001}]

**Evaluate with bake-off data the model hasn't seen**

In [55]:
evaluate_many(models, bert_tokenizer, b_raw_colors_test[:5], b_texts_test[:5])

CPU times: user 48.7 ms, sys: 3.86 ms, total: 52.6 ms
Wall time: 51.3 ms
CPU times: user 48.9 ms, sys: 2.68 ms, total: 51.6 ms
Wall time: 49.6 ms
CPU times: user 49 ms, sys: 3.68 ms, total: 52.7 ms
Wall time: 49.9 ms


[{'listener_accuracy': 0.6, 'corpus_bleu': 0.05000000000000001},
 {'listener_accuracy': 0.2, 'corpus_bleu': 0.05000000000000001},
 {'listener_accuracy': 0.6, 'corpus_bleu': 0.05000000000000001}]

### XLNet Embeddings

In [56]:
xlnet_tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
xlnet_model = XLNetModel.from_pretrained('xlnet-base-cased')

In [57]:
%time  colors_train, tokens_train, raw_colors_test, texts_test = create_data(xlnet_tokenizer)

CPU times: user 7.85 s, sys: 993 ms, total: 8.84 s
Wall time: 9.16 s


In [58]:
bert_embeddings, bert_vocab = mu.extract_input_embeddings(texts_test, xlnet_model, xlnet_tokenizer)

In [59]:
models = train_many(colors_train[:5], tokens_train[:5], extractors=embedding_extractors)

Stopping after epoch 21. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 6.09463357925415

CPU times: user 1.61 s, sys: 1.14 s, total: 2.75 s
Wall time: 2.23 s


Stopping after epoch 12. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 6.791963577270508

CPU times: user 1.06 s, sys: 178 ms, total: 1.24 s
Wall time: 756 ms


Stopping after epoch 12. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 6.765586853027344

CPU times: user 1.18 s, sys: 288 ms, total: 1.47 s
Wall time: 856 ms


**Evaluate on test data**

In [60]:
evaluate_many(models, xlnet_tokenizer, raw_colors_test[:5], texts_test[:5])

CPU times: user 57.1 ms, sys: 7.43 ms, total: 64.5 ms
Wall time: 67.3 ms
CPU times: user 51.7 ms, sys: 3.71 ms, total: 55.5 ms
Wall time: 55.1 ms
CPU times: user 49.5 ms, sys: 2.32 ms, total: 51.8 ms
Wall time: 49.7 ms


[{'listener_accuracy': 0.6, 'corpus_bleu': 0.05000000000000001},
 {'listener_accuracy': 0.6, 'corpus_bleu': 0.06},
 {'listener_accuracy': 0.6, 'corpus_bleu': 0.05000000000000001}]

**Evaluate with bake-off data the model hasn't seen**

In [61]:
evaluate_many(models, xlnet_tokenizer, b_raw_colors_test[:5], b_texts_test[:5])

CPU times: user 56.8 ms, sys: 3.01 ms, total: 59.8 ms
Wall time: 57.5 ms
CPU times: user 47.3 ms, sys: 3.31 ms, total: 50.6 ms
Wall time: 47.8 ms
CPU times: user 47.8 ms, sys: 2.79 ms, total: 50.6 ms
Wall time: 47.8 ms


[{'listener_accuracy': 0.2, 'corpus_bleu': 0.07058823529411766},
 {'listener_accuracy': 0.6, 'corpus_bleu': 0.05000000000000001},
 {'listener_accuracy': 0.0, 'corpus_bleu': 0.05000000000000001}]

### RoBERTa Embeddings

In [62]:
roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
roberta_model = RobertaModel.from_pretrained('roberta-base')

In [63]:
%time  colors_train, tokens_train, raw_colors_test, texts_test = create_data(roberta_tokenizer)

CPU times: user 8.38 s, sys: 235 ms, total: 8.61 s
Wall time: 8.66 s


In [64]:
bert_embeddings, bert_vocab = mu.extract_input_embeddings(texts_test, roberta_model, roberta_tokenizer)

In [65]:
models = train_many(colors_train[:5], tokens_train[:5], extractors=embedding_extractors)

Stopping after epoch 12. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 6.189422607421875

CPU times: user 1.15 s, sys: 201 ms, total: 1.35 s
Wall time: 792 ms


Stopping after epoch 12. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 6.499911785125732

CPU times: user 1.15 s, sys: 206 ms, total: 1.35 s
Wall time: 800 ms


Stopping after epoch 12. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 6.431423664093018

CPU times: user 1.14 s, sys: 199 ms, total: 1.34 s
Wall time: 778 ms


**Evaluate on test data**

In [66]:
evaluate_many(models, roberta_tokenizer, raw_colors_test[:5], texts_test[:5])

CPU times: user 57.6 ms, sys: 7.23 ms, total: 64.8 ms
Wall time: 62.5 ms
CPU times: user 50 ms, sys: 1.85 ms, total: 51.8 ms
Wall time: 50 ms
CPU times: user 50.1 ms, sys: 2.45 ms, total: 52.5 ms
Wall time: 50.3 ms


[{'listener_accuracy': 0.4, 'corpus_bleu': 0.05000000000000001},
 {'listener_accuracy': 0.4, 'corpus_bleu': 0.05000000000000001},
 {'listener_accuracy': 0.4, 'corpus_bleu': 0.05000000000000001}]

**Evaluate with bake-off data the model hasn't seen**

In [67]:
evaluate_many(models, roberta_tokenizer, b_raw_colors_test[:5], b_texts_test[:5])

CPU times: user 59.7 ms, sys: 3.43 ms, total: 63.1 ms
Wall time: 60.5 ms
CPU times: user 56.1 ms, sys: 2.45 ms, total: 58.5 ms
Wall time: 57 ms
CPU times: user 48.5 ms, sys: 2.14 ms, total: 50.6 ms
Wall time: 49 ms


[{'listener_accuracy': 0.4, 'corpus_bleu': 0.05000000000000001},
 {'listener_accuracy': 0.4, 'corpus_bleu': 0.05000000000000001},
 {'listener_accuracy': 0.4, 'corpus_bleu': 0.05000000000000001}]

### ELECTRA Embeddings

In [68]:
electra_tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
electra_model = ElectraModel.from_pretrained('google/electra-small-discriminator')

In [69]:
%time  colors_train, tokens_train, raw_colors_test, texts_test = create_data(electra_tokenizer)

CPU times: user 9.32 s, sys: 111 ms, total: 9.43 s
Wall time: 9.45 s


In [70]:
bert_embeddings, bert_vocab = mu.extract_input_embeddings(texts_test, electra_model, electra_tokenizer)

In [71]:
models = train_many(colors_train[:5], tokens_train[:5], extractors=embedding_extractors)

Stopping after epoch 12. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 6.825011253356934

CPU times: user 660 ms, sys: 144 ms, total: 804 ms
Wall time: 541 ms


Stopping after epoch 12. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 6.785768032073975

CPU times: user 686 ms, sys: 153 ms, total: 839 ms
Wall time: 575 ms


Stopping after epoch 20. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 6.495962619781494

CPU times: user 981 ms, sys: 361 ms, total: 1.34 s
Wall time: 1.22 s


**Evaluate on test data**

In [72]:
evaluate_many(models, electra_tokenizer, raw_colors_test[:5], texts_test[:5])

CPU times: user 54.2 ms, sys: 6.83 ms, total: 61.1 ms
Wall time: 57.9 ms
CPU times: user 47 ms, sys: 2.19 ms, total: 49.2 ms
Wall time: 47.1 ms
CPU times: user 47.7 ms, sys: 2.53 ms, total: 50.2 ms
Wall time: 47.9 ms


[{'listener_accuracy': 0.4, 'corpus_bleu': 0.05000000000000001},
 {'listener_accuracy': 0.0, 'corpus_bleu': 0.05000000000000001},
 {'listener_accuracy': 0.4, 'corpus_bleu': 0.1076923076923077}]

**Evaluate with bake-off data the model hasn't seen**

In [73]:
evaluate_many(models, electra_tokenizer, b_raw_colors_test[:5], b_texts_test[:5])

CPU times: user 50.1 ms, sys: 2.82 ms, total: 52.9 ms
Wall time: 50.8 ms
CPU times: user 50.6 ms, sys: 3.32 ms, total: 54 ms
Wall time: 50.8 ms
CPU times: user 46.8 ms, sys: 2.95 ms, total: 49.8 ms
Wall time: 47 ms


[{'listener_accuracy': 0.4, 'corpus_bleu': 0.05000000000000001},
 {'listener_accuracy': 0.2, 'corpus_bleu': 0.05000000000000001},
 {'listener_accuracy': 0.2, 'corpus_bleu': 0.05000000000000001}]