In [1]:
__authors__ = "Anton Gochev, Jaro Habr, Yan Jiang, Samuel Kahn"
__version__ = "XCS224u, Stanford, Spring 2021"

## Colours with static embeddings

1. [Setup](#Setup)
1. [Dataset](#Dataset)
    1. [Filtered Corpus](#Filtered-Corpus)
    1. [Bake-Off Corpus](#Bake-Off-Corpus)
1. [Baseline-System](#Baseline-System)
1. [Experiments](#Experiments)
    1. [BERT Embeddings](#BERT-Embeddings)
    1. [XLNet Embeddings](#XLNet-Embeddings)
    1. [RoBERTa Embeddings](#RoBERTa-Embeddings)
    1. [ELECTRA Embeddings](#ELECTRA-Embeddings)

## Setup

This notebook explores how the baseline model performs with different pre-trained static embeddings extracted from transformers such as BERT, XLNet, RoBERTa, and ELECTRA.

In [2]:
from utils.colors import ColorsCorpusReader
import os
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
from utils.torch_color_describer import ContextualColorDescriber, create_example_dataset
import utils.utils
from utils.utils import UNK_SYMBOL, START_SYMBOL, END_SYMBOL
import matplotlib.pyplot as plt
import matplotlib.patches as mpatch
import numpy as np
from baseline.model import (
    BaselineTokenizer, BaselineColorEncoder,
    BaselineEmbedding, BaselineDescriber, 
    GloVeEmbedding, BaselineLSTMDescriber    
)

from transformers import (
    BertTokenizer, BertModel,
    XLNetTokenizer, XLNetModel,
    RobertaTokenizer, RobertaModel,
    ElectraTokenizer, ElectraModel,    
)

import utils.model_utils as mu

## Dataset

This exploration of the dataset counts the examples in each class and plots the word distribution to reveal any data imbalance issues.

### Filtered Corpus

The filtered corpus is the full dataset used in assignment 4. The following code looks at the composition of the dataset, the number of examples in each condition, and the word counts of the color descriptions.
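
As a rough illustration of that kind of composition check, the sketch below counts examples per condition and description lengths on toy data. The `Example` tuple and its values are hypothetical stand-ins; the real objects come from `ColorsCorpusReader.read()` and are assumed here to expose `condition` and `contents` attributes.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for corpus examples (assumption: the real examples
# expose `condition` and `contents` attributes).
Example = namedtuple("Example", ["condition", "contents"])

toy_examples = [
    Example("far", "blue"),
    Example("close", "the darker blue one"),
    Example("split", "bright green"),
    Example("far", "purple"),
]


def summarize(examples):
    """Count examples per condition and description lengths in words."""
    per_condition = Counter(ex.condition for ex in examples)
    word_counts = Counter(len(ex.contents.split()) for ex in examples)
    return per_condition, word_counts


per_condition, word_counts = summarize(toy_examples)
print(per_condition)
print(word_counts)
```

A strongly skewed `per_condition` or a long tail in `word_counts` would be the kind of imbalance this section is looking for.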

In [3]:
COLORS_SRC_FILENAME = os.path.join(
    "data", "colors", "filteredCorpus.csv"
)

In [4]:
corpus = ColorsCorpusReader(
    COLORS_SRC_FILENAME,
    word_count=None,
    normalize_colors=True
)

In [5]:
examples = list(corpus.read())

In [6]:
len(examples)

46994

In [7]:
subset_examples = mu.extract_colour_examples(examples, from_word_count=5)

To understand the datasets (training and bake-off) in more detail, refer to [colors_in_context.ipynb](colors_in_context.ipynb). That notebook shows the distribution of the colour examples among the different splits.

### Bake-Off Corpus

The following code analyses the bake-off dataset. We will look at the number of examples for each of the conditions as well as the word counts used to describe the colors.

In [8]:
BAKE_OFF_COLORS_SRC_FILENAME = os.path.join(
    "data", "colors", "cs224u-colors-bakeoff-data.csv"
)

In [9]:
bake_off_corpus = ColorsCorpusReader(
    BAKE_OFF_COLORS_SRC_FILENAME,
    word_count=None,
    normalize_colors=True
)

In [10]:
bake_off_examples = list(bake_off_corpus.read())

In [11]:
len(bake_off_examples)

2031

## Baseline-System

This baseline system builds on assignment 4 and uses different token embeddings and sequences.

### Baseline system development - dev dataset

The tiny dataset is used for baseline model development and fast testing.

In [12]:
def create_dev_data():
    dev_color_seqs, dev_word_seqs, dev_vocab = create_example_dataset(
        group_size=50,
        vec_dim=2
    )

    dev_colors_train, dev_colors_test, dev_words_train, dev_words_test = \
        train_test_split(dev_color_seqs, dev_word_seqs)
    
    return dev_vocab, dev_colors_train, dev_words_train, dev_colors_test, dev_words_test

In [13]:
dev_vocab, dev_colors_train, dev_tokens_train, dev_colors_test, dev_texts_test = \
    create_dev_data()

### Model training - full dataset

The full color context dataset is used for final baseline model training.

In [14]:
color_encoder = BaselineColorEncoder()

In [15]:
def create_data(tokenizer, include_position=False, include_conv_embeddings=False):
    # NOTE: `include_position` and `include_conv_embeddings` are currently unused;
    # the raw colors and texts are always read from the filtered corpus.
    rawcols, texts = zip(*[[ex.colors, ex.contents] for ex in examples])

    # Hold out a test split (scikit-learn's default is 25% of the data).
    raw_colors_train, raw_colors_test, texts_train, texts_test = \
        train_test_split(rawcols, texts)

    tokens_train = [
        mu.tokenize_colour_description(text, tokenizer) for text in texts_train
    ]
    colors_train = [
        color_encoder.encode_color_context(colors) for colors in raw_colors_train
    ]

    return colors_train, tokens_train, raw_colors_test, texts_test
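
The helper `mu.tokenize_colour_description` is not shown in this notebook. A minimal sketch of what such a step might look like, assuming it wraps the tokenizer output in start and end symbols so the decoder knows where a description begins and ends; the symbol values and `simple_tokenize` are placeholders, not the project's actual code:

```python
# Assumed values; the notebook imports the real symbols from utils.utils.
START_SYMBOL = "<s>"
END_SYMBOL = "</s>"


def tokenize_description(text, tokenize):
    """Wrap the tokens of one description in start and end symbols."""
    return [START_SYMBOL] + tokenize(text) + [END_SYMBOL]


def simple_tokenize(text):
    # Placeholder for a transformer tokenizer's `tokenize` method.
    return text.lower().split()


tokens = tokenize_description("Dark blue", simple_tokenize)
print(tokens)  # ['<s>', 'dark', 'blue', '</s>']
```

With a Hugging Face tokenizer, `tokenizer.tokenize` could be passed in place of `simple_tokenize`.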

In [16]:
def create_bakeoff_data():    
    return zip(*[[ex.colors, ex.contents] for ex in bake_off_examples])

## Experiments

The notebook contains results produced with the baseline architecture, different static embeddings, and two RNN cells, GRU and LSTM. On the bake-off data, the LSTM gives slightly higher listener accuracy and slightly lower BLEU scores than the GRU for every model.

In order to run the experiments using different embeddings:
- for GRU - use BaselineDescriber
- for LSTM - use BaselineLSTMDescriber

This is the only change needed in the notebook's code to switch between the two.

Results from the experiments:

| Model | Cell | Training Log | Test Results | Bake-off Results |
| --- | --- | --- | --- | --- |
| BERT 'bert-base-cased' | GRU | Stopping after epoch 54. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 115.87914156913757 CPU times: user 2h 4min 43s, sys: 1h 1min 36s, total: 3h 6min 19s Wall time: 1h 3min 47s | {'listener_accuracy': 0.8282407013362839, 'corpus_bleu': 0.4614753491619744} | {'listener_accuracy': 0.9034958148695224, 'corpus_bleu': 0.6637981882180088} |
| BERT 'bert-base-cased' | LSTM | Stopping after epoch 83. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 76.14108896255493 CPU times: user 3h 3min 25s, sys: 1h 27min 37s, total: 4h 31min 2s  Wall time: 1h 30min 36s | {'listener_accuracy': 0.8422844497404034, 'corpus_bleu': 0.4628617489092576} | {'listener_accuracy': 0.9084194977843427, 'corpus_bleu': 0.663598646435816} |
| XLNet 'xlnet-base-cased' | GRU | Stopping after epoch 55. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 117.36168694496155 CPU times: user 2h 18min 38s, sys: 1h 7min 4s, total: 3h 25min 43s Wall time: 1h 9min 34s | {'listener_accuracy': 0.8243254745084688, 'corpus_bleu': 0.4290345340732608} | {'listener_accuracy': 0.9034958148695224, 'corpus_bleu': 0.6462354138567806} |
| XLNet 'xlnet-base-cased' | LSTM | Stopping after epoch 61. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 107.04569149017334 CPU times: user 2h 44min 52s, sys: 1h 13min 37s, total: 3h 58min 29s Wall time: 1h 20min 2s | {'listener_accuracy': 0.8394757000595795, 'corpus_bleu': 0.43106009125838857} | {'listener_accuracy': 0.9148202855736091, 'corpus_bleu': 0.6450470273203873} |
| RoBERTa 'roberta-base' | GRU | Stopping after epoch 71. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 88.95393252372742 CPU times: user 2h 26min 29s, sys: 1h 11min 9s, total: 3h 37min 38s Wall time: 1h 10min 30s | {'listener_accuracy': 0.8160694527193804, 'corpus_bleu': 0.43914107019568926} | {'listener_accuracy': 0.8936484490398818, 'corpus_bleu': 0.6365891729377956} |
| RoBERTa 'roberta-base' | LSTM | Stopping after epoch 61. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 111.48405265808105 CPU times: user 2h 41min 37s, sys: 1h 11min 45s, total: 3h 53min 23s Wall time: 1h 15min 32s | {'listener_accuracy': 0.8200697931738872, 'corpus_bleu': 0.39418980395247344} | {'listener_accuracy': 0.8941408173313639, 'corpus_bleu': 0.5649086521910678} |
| ELECTRA 'google/electra-small-discriminator' | GRU | Stopping after epoch 55. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 112.70061588287354 CPU times: user 1h 18min 13s, sys: 33min 6s, total: 1h 51min 20s Wall time: 37min 34s | {'listener_accuracy': 0.8459443356881436, 'corpus_bleu': 0.4808836633817978} | {'listener_accuracy': 0.914327917282127, 'corpus_bleu': 0.6813010156866868} |
| ELECTRA 'google/electra-small-discriminator' | LSTM | Stopping after epoch 79. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 80.59795188903809 CPU times: user 2h 2min 9s, sys: 47min 40s, total: 2h 49min 49s Wall time: 56min 50s | {'listener_accuracy': 0.8423695633670951, 'corpus_bleu': 0.47554009108512063} | {'listener_accuracy': 0.9187592319054653, 'corpus_bleu': 0.6793120721302932} |
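
To compare the runs at a glance, the bake-off figures from the table can be put into a small dictionary and ranked. This is only a convenience sketch; the numbers are copied from the table above, and the dictionary layout is an arbitrary choice.

```python
# (model, cell) -> (listener_accuracy, corpus_bleu) on the bake-off data,
# copied from the results table.
bakeoff = {
    ("BERT", "GRU"): (0.9034958148695224, 0.6637981882180088),
    ("BERT", "LSTM"): (0.9084194977843427, 0.663598646435816),
    ("XLNet", "GRU"): (0.9034958148695224, 0.6462354138567806),
    ("XLNet", "LSTM"): (0.9148202855736091, 0.6450470273203873),
    ("RoBERTa", "GRU"): (0.8936484490398818, 0.6365891729377956),
    ("RoBERTa", "LSTM"): (0.8941408173313639, 0.5649086521910678),
    ("ELECTRA", "GRU"): (0.914327917282127, 0.6813010156866868),
    ("ELECTRA", "LSTM"): (0.9187592319054653, 0.6793120721302932),
}

# Pick the best run for each metric.
best_accuracy = max(bakeoff, key=lambda run: bakeoff[run][0])
best_bleu = max(bakeoff, key=lambda run: bakeoff[run][1])

print("Best listener accuracy:", best_accuracy)  # ('ELECTRA', 'LSTM')
print("Best corpus BLEU:", best_bleu)            # ('ELECTRA', 'GRU')
```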

In [17]:
def evaluate(trained_model, tokenizer, color_seqs_test, texts_test):
    tok_seqs = [mu.tokenize_colour_description(text, tokenizer) for text in texts_test]
    col_seqs = [color_encoder.encode_color_context(colors) for colors in color_seqs_test]

    return trained_model.evaluate(col_seqs, tok_seqs)
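
The evaluation returns `listener_accuracy` and `corpus_bleu`. As a hedged sketch of the idea behind a listener-style score, assume the model assigns a probability to every token of the description for each candidate target color, and the candidate with the lowest perplexity is taken as the prediction. The functions and toy probabilities below are illustrative assumptions, not the describer's actual implementation.

```python
import math


def perplexity(token_probs):
    """Geometric-mean inverse probability, computed in log space."""
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))


def pick_target(probs_per_candidate):
    """Return the index of the candidate color with the lowest perplexity."""
    scores = [perplexity(probs) for probs in probs_per_candidate]
    return min(range(len(scores)), key=scores.__getitem__)


# Toy per-token probabilities of one description under three candidate targets.
candidates = [
    [0.1, 0.2, 0.1],  # description is unlikely given color 0
    [0.6, 0.7, 0.5],  # description fits color 1 best
    [0.3, 0.2, 0.4],
]

predicted = pick_target(candidates)
print(predicted)  # 1
```

Under this assumption, listener accuracy is the fraction of examples for which the predicted index matches the true target. Summing logarithms instead of multiplying raw probabilities avoids underflow on long descriptions.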

### BERT Embeddings

In [18]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
bert_model = BertModel.from_pretrained('bert-base-cased')

In [19]:
%time colors_train, tokens_train, raw_colors_test, texts_test = create_data(bert_tokenizer)

CPU times: user 7.88 s, sys: 111 ms, total: 7.99 s
Wall time: 8.01 s


In [20]:
%time b_raw_colors_test, b_texts_test = create_bakeoff_data()

CPU times: user 3.74 ms, sys: 106 µs, total: 3.85 ms
Wall time: 3.86 ms


In [21]:
%time bert_embeddings, bert_vocab = mu.extract_input_embeddings(texts_test, bert_model, bert_tokenizer)

CPU times: user 2.41 s, sys: 12.1 ms, total: 2.42 s
Wall time: 2.43 s
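
`mu.extract_input_embeddings` is not shown in this notebook; judging from the call, it returns an embedding matrix together with a matching vocabulary. The sketch below illustrates that idea under stated assumptions: a plain dictionary stands in for the transformer's input embedding table, whitespace splitting stands in for the tokenizer, and tokens without a vector (including the unknown symbol) get zeros. All names here are hypothetical.

```python
def build_static_embeddings(texts, tokenize, full_table, unk_symbol="$UNK"):
    """Collect the tokens seen in `texts` and look up one fixed vector each."""
    dim = len(next(iter(full_table.values())))
    vocab = sorted({tok for text in texts for tok in tokenize(text)})
    vocab.append(unk_symbol)
    # Assumption of this sketch: tokens missing from the table get a zero vector.
    matrix = [list(full_table.get(tok, [0.0] * dim)) for tok in vocab]
    return matrix, vocab


# Toy stand-in for a pre-trained input embedding table.
full_table = {
    "blue": [0.1, 0.2],
    "dark": [0.3, 0.4],
    "green": [0.5, 0.6],
}

matrix, vocab = build_static_embeddings(
    ["dark blue", "blue"], str.split, full_table
)
print(vocab)   # ['blue', 'dark', '$UNK']
print(matrix)  # [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
```

The embeddings are static in the sense that each token maps to one vector regardless of the surrounding words, in contrast to the contextual outputs of the transformer's upper layers.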


*Baseline model using BERT pretrained embeddings and vocab*

In [22]:
baseline_model = BaselineLSTMDescriber(
    bert_vocab,
    embedding=bert_embeddings,
    early_stopping=True
)

In [33]:
%time _ = baseline_model.fit(colors_train, tokens_train)

Stopping after epoch 83. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 76.14108896255493

CPU times: user 3h 3min 25s, sys: 1h 27min 37s, total: 4h 31min 2s
Wall time: 1h 30min 36s
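
The log message above reflects an early-stopping rule with a tolerance of `1e-05` and a patience of 10 epochs. A minimal sketch of such a rule, assuming a validation score where higher is better; this illustrates the general technique and is not the describer's own code:

```python
def early_stopping_epoch(val_scores, tol=1e-5, patience=10):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    stale = 0  # consecutive epochs without an improvement larger than tol
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + tol:
            best = score
            stale = 0
        else:
            stale += 1
            if stale > patience:
                return epoch
    return None


# Three improving epochs followed by a plateau of eleven epochs.
scores = [0.1, 0.2, 0.3] + [0.3] * 11
stop = early_stopping_epoch(scores)
print(stop)  # 14
```

Training stops at epoch 14 because the score has then failed to improve for more than 10 consecutive epochs.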


Evaluate on test data

In [34]:
%time evaluate(baseline_model, bert_tokenizer, raw_colors_test, texts_test)

CPU times: user 38.4 s, sys: 4.46 s, total: 42.8 s
Wall time: 37.4 s


{'listener_accuracy': 0.8422844497404034, 'corpus_bleu': 0.4628617489092576}

Evaluate on bake-off data

In [35]:
%time evaluate(baseline_model, bert_tokenizer, b_raw_colors_test, b_texts_test)

CPU times: user 6.23 s, sys: 455 ms, total: 6.69 s
Wall time: 6.05 s


{'listener_accuracy': 0.9084194977843427, 'corpus_bleu': 0.663598646435816}

### XLNet Embeddings

In [18]:
xlnet_tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
xlnet_model = XLNetModel.from_pretrained('xlnet-base-cased')

In [19]:
colors_train, tokens_train, raw_colors_test, texts_test = create_data(xlnet_tokenizer)

b_raw_colors_test, b_texts_test = create_bakeoff_data()

In [20]:
%time xlnet_embeddings, xlnet_vocab = mu.extract_input_embeddings(texts_test, xlnet_model, xlnet_tokenizer)

CPU times: user 2.38 s, sys: 18 ms, total: 2.4 s
Wall time: 2.41 s


*Baseline model using XLNet pretrained embeddings and vocab*

In [21]:
baseline_model = BaselineLSTMDescriber(
    xlnet_vocab,
    embedding=xlnet_embeddings,
    early_stopping=True
)

In [22]:
%time _ = baseline_model.fit(colors_train, tokens_train)

Stopping after epoch 61. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 107.04569149017334

CPU times: user 2h 44min 52s, sys: 1h 13min 37s, total: 3h 58min 29s
Wall time: 1h 20min 2s


Evaluate on test data

In [23]:
%time evaluate(baseline_model, xlnet_tokenizer, raw_colors_test, texts_test)

CPU times: user 42.1 s, sys: 5.94 s, total: 48 s
Wall time: 41.8 s


{'listener_accuracy': 0.8394757000595795, 'corpus_bleu': 0.43106009125838857}

Evaluate on bake-off data

In [24]:
%time evaluate(baseline_model, xlnet_tokenizer, b_raw_colors_test, b_texts_test)

CPU times: user 6.49 s, sys: 465 ms, total: 6.96 s
Wall time: 6.22 s


{'listener_accuracy': 0.9148202855736091, 'corpus_bleu': 0.6450470273203873}

### RoBERTa Embeddings

In [25]:
roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
roberta_model = RobertaModel.from_pretrained('roberta-base')

In [26]:
colors_train, tokens_train, raw_colors_test, texts_test = create_data(roberta_tokenizer)

b_raw_colors_test, b_texts_test = create_bakeoff_data()

In [27]:
%time roberta_embeddings, roberta_vocab = mu.extract_input_embeddings(texts_test, roberta_model, roberta_tokenizer)

CPU times: user 2.71 s, sys: 39.2 ms, total: 2.75 s
Wall time: 2.75 s


*Baseline model using RoBERTa pretrained embeddings and vocab*

In [28]:
baseline_model = BaselineLSTMDescriber(
    roberta_vocab,
    embedding=roberta_embeddings,
    early_stopping=True
)

In [29]:
%time _ = baseline_model.fit(colors_train, tokens_train)

Stopping after epoch 61. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 111.48405265808105

CPU times: user 2h 41min 37s, sys: 1h 11min 45s, total: 3h 53min 23s
Wall time: 1h 15min 32s


Evaluate on test data

In [30]:
%time evaluate(baseline_model, roberta_tokenizer, raw_colors_test, texts_test)

CPU times: user 38.4 s, sys: 4.02 s, total: 42.4 s
Wall time: 35.9 s


{'listener_accuracy': 0.8200697931738872, 'corpus_bleu': 0.39418980395247344}

Evaluate on bake-off data

In [31]:
%time evaluate(baseline_model, roberta_tokenizer, b_raw_colors_test, b_texts_test)

CPU times: user 5.98 s, sys: 511 ms, total: 6.49 s
Wall time: 5.65 s


{'listener_accuracy': 0.8941408173313639, 'corpus_bleu': 0.5649086521910678}

### ELECTRA Embeddings

In [32]:
electra_tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
electra_model = ElectraModel.from_pretrained('google/electra-small-discriminator')

In [33]:
colors_train, tokens_train, raw_colors_test, texts_test = create_data(electra_tokenizer)

b_raw_colors_test, b_texts_test = create_bakeoff_data()

In [34]:
%time electra_embeddings, electra_vocab = mu.extract_input_embeddings(texts_test, electra_model, electra_tokenizer)

CPU times: user 2.72 s, sys: 3.16 ms, total: 2.72 s
Wall time: 2.72 s


*Baseline model using ELECTRA pretrained embeddings and vocab*

In [35]:
baseline_model = BaselineLSTMDescriber(
    electra_vocab,
    embedding=electra_embeddings,
    early_stopping=True
)

In [36]:
%time _ = baseline_model.fit(colors_train, tokens_train)

Stopping after epoch 79. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 80.59795188903809

CPU times: user 2h 2min 9s, sys: 47min 40s, total: 2h 49min 49s
Wall time: 56min 50s


Evaluate on test data

In [37]:
%time evaluate(baseline_model, electra_tokenizer, raw_colors_test, texts_test)

CPU times: user 34.1 s, sys: 3.48 s, total: 37.6 s
Wall time: 32.8 s


{'listener_accuracy': 0.8423695633670951, 'corpus_bleu': 0.47554009108512063}

Evaluate on bake-off data

In [38]:
%time evaluate(baseline_model, electra_tokenizer, b_raw_colors_test, b_texts_test)

CPU times: user 5.45 s, sys: 451 ms, total: 5.9 s
Wall time: 5.27 s


{'listener_accuracy': 0.9187592319054653, 'corpus_bleu': 0.6793120721302932}