In [None]:
__authors__ = "Anton Gochev, Jaro Habr, Yan Jiang, Samuel Kahn"
__version__ = "XCS224u, Stanford, Spring 2021"

### Colours with contextual embeddings

This notebook uses the classes and helpers from the [model.py](experiment/model.py) library. We experiment with different combinations of contextual embeddings to determine which of them outperform the baseline model or the static embeddings used in [colours_with_static_embeddings_with_convolutional_image_embeddings.ipynb](colours_with_static_embeddings_with_convolutional_image_embeddings).

## WIP

### Setup

In [1]:
from utils.colors import ColorsCorpusReader
import os
from sklearn.model_selection import train_test_split

from utils.torch_color_describer import (
    ContextualColorDescriber,
    create_example_dataset
)

from utils.utils import START_SYMBOL, END_SYMBOL, UNK_SYMBOL

from transformers import (
    BertTokenizer, BertModel,
    XLNetConfig, XLNetModel,
    XLNetTokenizer, XLNetForSequenceClassification,
    RobertaTokenizer, RobertaModel,
    ElectraTokenizer, ElectraModel,
    EncoderDecoderModel
)
import utils.model_utils as mu
import experiment.helper as helper

### Dataset

This exploration of the dataset counts the examples for the different classes and plots the word distribution in order to reveal any data imbalance issues.

#### Filtered Corpus

The filtered corpus is the full dataset used in assignment 4. The following code looks at the composition of the dataset: the number of examples in each condition as well as the word counts of the color descriptions.

In [3]:
COLORS_SRC_FILENAME = os.path.join(
    "data", "colors", "filteredCorpus.csv"
)

In [4]:
corpus = ColorsCorpusReader(
    COLORS_SRC_FILENAME,
    word_count=None,
    normalize_colors=True
)

In [5]:
examples = list(corpus.read())

In [6]:
len(examples)

46994
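
The cells above only report the corpus size. A minimal sketch of the per-condition and word-count summary described earlier could look like the following. It assumes each example exposes a `condition` attribute and a `contents` string; the `Example` namedtuple and the `summarize_examples` helper are hypothetical stand-ins so that the snippet runs without the corpus file.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for corpus examples; only the two attributes
# used below are modelled.
Example = namedtuple("Example", ["condition", "contents"])


def summarize_examples(examples):
    """Count examples per condition and whitespace-separated words."""
    condition_counts = Counter(ex.condition for ex in examples)
    word_counts = Counter(
        word for ex in examples for word in ex.contents.lower().split()
    )
    return condition_counts, word_counts


# Toy data, not taken from the corpus.
toy_examples = [
    Example("close", "the darker blue"),
    Example("far", "blue"),
    Example("split", "Bright green"),
    Example("far", "the purple one"),
]
condition_counts, word_counts = summarize_examples(toy_examples)
print(condition_counts.most_common())
print(word_counts.most_common(3))
```

With the real data, calling `summarize_examples(examples)` and `summarize_examples(bake_off_examples)` would give the two summaries side by side, provided the attribute names match.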

To understand the datasets (training and bake-off) in more detail, refer to [colors_in_context.ipynb](colors_in_context.ipynb). That notebook shows the distribution of the colour examples among the different splits.

#### Bake-Off Corpus

The following code analyses the bake-off dataset. We will look at the number of examples for each of the conditions as well as the word counts of the color descriptions.

In [7]:
BAKE_OFF_COLORS_SRC_FILENAME = os.path.join(
    "data", "colors", "cs224u-colors-bakeoff-data.csv"
)

In [8]:
bake_off_corpus = ColorsCorpusReader(
    BAKE_OFF_COLORS_SRC_FILENAME,
    word_count=None,
    normalize_colors=True
)

In [9]:
bake_off_examples = list(bake_off_corpus.read())

In [10]:
len(bake_off_examples)

2031

### Baseline-System

This baseline system is based on assignment 4 and is enhanced with new classes that allow different contextual embedding extractors to be plugged in, which makes experiments and evaluation easier.

In [2]:
%load_ext autoreload
%autoreload 2

from baseline.model import BaselineColorEncoder

from experiment.model import (
    TransformerEmbeddingDecoder, 
    TransformerEmbeddingDescriber,
    EmbeddingExtractorType,
    EmbeddingExtractor
)


In [12]:
color_encoder = BaselineColorEncoder()

#### Full dataset used for training and experiments

In [13]:
def create_data(tokenizer):
    rawcols, texts = zip(*[[ex.colors, ex.contents] for ex in examples])
    
    # Hold out a test split (train_test_split defaults to 75% train / 25% test).
    raw_colors_train, raw_colors_test, texts_train, texts_test = \
        train_test_split(rawcols, texts)

    tokens_train = [
        mu.tokenize_colour_description(text, tokenizer) for text in texts_train
    ]
    colors_train = [
        color_encoder.encode_color_context(colors) for colors in raw_colors_train
    ]

    return colors_train, tokens_train, raw_colors_test, texts_test
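
The experiments below distinguish between the baseline special symbols and BERT's own. As a rough illustration of what that choice means for a tokenized description, the sketch below wraps a token list in either pair. `wrap_tokens` is a hypothetical helper, not the implementation of `mu.tokenize_colour_description`, and the baseline values `<s>` and `</s>` are assumed to match `START_SYMBOL` and `END_SYMBOL` in `utils.utils`.

```python
# Assumed values of START_SYMBOL / END_SYMBOL from utils.utils.
BASELINE_SYMBOLS = ("<s>", "</s>")
# Sequence boundary tokens used by BERT tokenizers.
BERT_SYMBOLS = ("[CLS]", "[SEP]")


def wrap_tokens(tokens, symbols):
    """Return a new list with the start and end symbols added."""
    start, end = symbols
    return [start, *tokens, end]


tokens = ["dark", "blue"]
baseline_tokens = wrap_tokens(tokens, BASELINE_SYMBOLS)
bert_tokens = wrap_tokens(tokens, BERT_SYMBOLS)
print(baseline_tokens)
print(bert_tokens)
```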

In [14]:
def create_bakeoff_data():    
    return zip(*[[ex.colors, ex.contents] for ex in bake_off_examples])

### Experiments

Results from the experiments:

| Model | Hidden Layer | Protocol | Training Results | Bake-off Results |
| --- | --- | --- | --- | --- |
| BERT 'bert-base-cased' | Layer 12 | Stopping after epoch 13. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 6.030726432800293 CPU times: user 1h 43min 56s, sys: 54min 44s, total: 2h 38min 40s Wall time: 1h 22min 48s | {'listener_accuracy': 0.2905405405405405, 'corpus_bleu': 0.07466216216216218} | {'listener_accuracy': 0.32348596750369274, 'corpus_bleu': 0.050147710487444604} |
| BERT 'bert-base-cased' | Positional | Stopping after epoch 27. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 5.257309436798096 CPU times: user 1h 23min 10s, sys: 10min 23s, total: 1h 33min 33s Wall time: 58min 54s | {'listener_accuracy': 0.32432432432432434, 'corpus_bleu': 0.2281656643331077} | {'listener_accuracy': 0.3559822747415066, 'corpus_bleu': 0.39051130299796255} |
| BERT 'bert-base-cased' fixed padding and baseline special symbols | Layer 12 | Stopping after epoch 14. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 6.11735725402832 CPU times: user 1h 44min 47s, sys: 54min 54s, total: 2h 39min 42s Wall time: 1h 21min 23s | {'listener_accuracy': 0.36486486486486486, 'corpus_bleu': 0.035602020040005664} | {'listener_accuracy': 0.3658296405711472, 'corpus_bleu': 0.41658000242602683} |
| BERT 'bert-base-cased' fixed padding and BERT special symbols | Layer 12 | Stopping after epoch 15. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 5.931732177734375 CPU times: user 1h 53min 23s, sys: 59min 28s, total: 2h 52min 52s Wall time: 1h 29min 26s | {'listener_accuracy': 0.34459459459459457, 'corpus_bleu': 0.05016891891891893} | {'listener_accuracy': 0.3293943870014771, 'corpus_bleu': 0.050147710487444604} | 
| XLNet 'xlnet-base-cased' |  |  |  |  |
| RoBERTa 'roberta-base' |  |  |  |  |
| ELECTRA 'google/electra-small-discriminator' |  |  |  |  |
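
To compare the completed runs at a glance, the bake-off scores from the table can be ranked programmatically. The sketch below copies the four reported bake-off results (rounded to four decimals) into a list; the short row labels and the `best_run` helper are introduced here for illustration only.

```python
# Bake-off results copied from the table above, rounded to four decimals.
# Each row: (label, listener_accuracy, corpus_bleu).
bakeoff_results = [
    ("BERT, Layer 12", 0.3235, 0.0501),
    ("BERT, Positional", 0.3560, 0.3905),
    ("BERT, Layer 12, fixed padding, baseline symbols", 0.3658, 0.4166),
    ("BERT, Layer 12, fixed padding, BERT symbols", 0.3294, 0.0501),
]


def best_run(results, metric_index):
    """Return the row with the highest value at `metric_index`."""
    return max(results, key=lambda row: row[metric_index])


best_by_accuracy = best_run(bakeoff_results, 1)
best_by_bleu = best_run(bakeoff_results, 2)
print("Best listener accuracy:", best_by_accuracy[0], best_by_accuracy[1])
print("Best corpus BLEU:", best_by_bleu[0], best_by_bleu[2])
```

On these numbers, the run with fixed padding and baseline special symbols leads on both bake-off metrics.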

Initial experiments suggest that using only contextual or positional embeddings does not improve performance. These representations seem to be very high-level, and we lose some of the low-level context that static embeddings provide. The next steps are therefore to implement and test embedding extractors that make use of the static embeddings:
- Do not throw away the static embeddings, as they perform well
- Combine the static embeddings with the [CLS] output to see what the result is (this approach is reported to have worked well elsewhere)
- Combine the static embeddings with the output of the model (contextual embeddings/last hidden state) to see if there is any improvement

In addition, we will try different levels of model complexity (for example, increasing `hidden_dim`).
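
The combinations listed above can be sketched as a simple concatenation along the feature dimension. The snippet below uses plain Python lists so that it runs without any dependencies; `combine_with_cls` and `combine_with_contextual` are hypothetical helpers, the vectors are toy values, and a real extractor would more likely operate on tensors (for example with `torch.cat`).

```python
def combine_with_cls(static_vectors, cls_vector):
    """Append the sentence-level [CLS] vector to every static token vector."""
    return [list(vec) + list(cls_vector) for vec in static_vectors]


def combine_with_contextual(static_vectors, contextual_vectors):
    """Concatenate each static vector with its contextual counterpart."""
    if len(static_vectors) != len(contextual_vectors):
        raise ValueError("Both inputs need one vector per token.")
    return [
        list(s) + list(c) for s, c in zip(static_vectors, contextual_vectors)
    ]


# Toy example: two tokens, static dimension 2, contextual dimension 3.
static_vectors = [[0.1, 0.2], [0.3, 0.4]]
contextual_vectors = [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2]]
cls_vector = [9.0, 9.1, 9.2]

with_cls = combine_with_cls(static_vectors, cls_vector)
with_contextual = combine_with_contextual(static_vectors, contextual_vectors)
print(len(with_cls[0]), len(with_contextual[0]))
```

Either variant widens the per-token input, here from 2 to 5 dimensions, so the embedding layer of the decoder would need to be sized accordingly.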

In [15]:
b_raw_colors_test, b_texts_test = create_bakeoff_data()

In [16]:
embedding_extractors = [
    EmbeddingExtractor(EmbeddingExtractorType.STATIC),
    EmbeddingExtractor(EmbeddingExtractorType.POSITIONAL),
    EmbeddingExtractor(EmbeddingExtractorType.LAYER12)
]

### BERT Embeddings

In [17]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
bert_model = BertModel.from_pretrained('bert-base-cased')

In [18]:
%time  colors_train, tokens_train, raw_colors_test, texts_test = create_data(bert_tokenizer)

CPU times: user 8.04 s, sys: 127 ms, total: 8.17 s
Wall time: 8.21 s


In [19]:
bert_embeddings, bert_vocab = mu.extract_input_embeddings(texts_test, bert_model, bert_tokenizer)

In [20]:
%time models = helper.train_many(colors_train[:5], tokens_train[:5], \
                                 bert_vocab, bert_embeddings, bert_model, bert_tokenizer, \
                                 extractors=embedding_extractors)

Stopping after epoch 17. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 7.168666839599609

CPU times: user 1min 29s, sys: 30 s, total: 1min 59s
Wall time: 53 s
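
The stopping message above suggests a patience-based early-stopping rule. The sketch below is one plausible reading of it, not the actual training loop: training halts once the validation score has failed to improve by more than `tol` for more than `patience` consecutive epochs. The function name and the toy scores are assumptions made for illustration.

```python
def epoch_of_early_stop(validation_scores, tol=1e-5, patience=10):
    """Return the 1-based epoch at which training would stop, or None."""
    best_score = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(validation_scores, start=1):
        if score > best_score + tol:
            # A real improvement resets the patience counter.
            best_score = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement > patience:
            return epoch
    return None


# Toy scores: three improving epochs followed by a plateau.
plateau_scores = [0.1, 0.2, 0.3] + [0.3] * 11
stop_epoch = epoch_of_early_stop(plateau_scores)
print(stop_epoch)
```

Under this reading, a run that stops after only a few more epochs than the patience value improved its validation score only in the first few epochs.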


**Evaluate on test data**

In [None]:
helper.evaluate_many(models, bert_tokenizer, raw_colors_test[:5], texts_test[:5])

**Evaluate with bake-off data the model hasn't seen**

In [None]:
helper.evaluate_many(models, bert_tokenizer, b_raw_colors_test[:5], b_texts_test[:5])
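
For reference, listener accuracy asks whether the model can pick out the target color from its own scoring of the description. The sketch below is a simplified illustration under stated assumptions: the target is taken to be the last color in each context, each color is moved into the target slot in turn, and `score_fn` is a hypothetical stand-in for the model's perplexity of the utterance given that ordering (lower is better). It is not the code behind `helper.evaluate_many`.

```python
def predict_target_index(context, utterance, score_fn):
    """Return the index of the color that best explains the utterance."""
    best_index, best_score = None, float("inf")
    for candidate in range(len(context)):
        # Move the candidate color into the final (target) position.
        others = [c for i, c in enumerate(context) if i != candidate]
        reordered = others + [context[candidate]]
        score = score_fn(reordered, utterance)
        if score < best_score:
            best_index, best_score = candidate, score
    return best_index


def listener_accuracy(contexts, utterances, score_fn):
    """Fraction of examples where the true target (last color) is chosen."""
    correct = sum(
        predict_target_index(context, utterance, score_fn) == len(context) - 1
        for context, utterance in zip(contexts, utterances)
    )
    return correct / len(contexts)


def toy_score(context, utterance):
    """Toy scorer: low score when the utterance names the last color."""
    return 0.0 if context[-1] == utterance else 1.0


# Color names stand in for the encoded color vectors.
contexts = [["red", "green", "blue"], ["pink", "grey", "teal"]]
utterances = ["blue", "pink"]
accuracy = listener_accuracy(contexts, utterances, toy_score)
print(accuracy)
```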

### XLNet Embeddings

In [None]:
xlnet_tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
xlnet_model = XLNetModel.from_pretrained('xlnet-base-cased')

In [None]:
%time  colors_train, tokens_train, raw_colors_test, texts_test = create_data(xlnet_tokenizer)

In [None]:
xlnet_embeddings, xlnet_vocab = mu.extract_input_embeddings(texts_test, xlnet_model, xlnet_tokenizer)

In [None]:
%time models = helper.train_many(colors_train[:5], tokens_train[:5], \
                                 xlnet_vocab, xlnet_embeddings, xlnet_model, xlnet_tokenizer, \
                                 extractors=embedding_extractors)

**Evaluate on test data**

In [None]:
%time helper.evaluate_many(models, xlnet_tokenizer, raw_colors_test[:5], texts_test[:5])

**Evaluate with bake-off data the model hasn't seen**

In [None]:
%time helper.evaluate_many(models, xlnet_tokenizer, b_raw_colors_test[:5], b_texts_test[:5])

### RoBERTa Embeddings

In [None]:
roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
roberta_model = RobertaModel.from_pretrained('roberta-base')

In [None]:
%time  colors_train, tokens_train, raw_colors_test, texts_test = create_data(roberta_tokenizer)

In [None]:
roberta_embeddings, roberta_vocab = mu.extract_input_embeddings(texts_test, roberta_model, roberta_tokenizer)

In [None]:
%time models = helper.train_many(colors_train[:5], tokens_train[:5], \
                                 roberta_vocab, roberta_embeddings, roberta_model, roberta_tokenizer, \
                                 extractors=embedding_extractors)

**Evaluate on test data**

In [None]:
%time helper.evaluate_many(models, roberta_tokenizer, raw_colors_test[:5], texts_test[:5])

**Evaluate with bake-off data the model hasn't seen**

In [None]:
%time helper.evaluate_many(models, roberta_tokenizer, b_raw_colors_test[:5], b_texts_test[:5])

### ELECTRA Embeddings WIP - doesn't work yet

In [None]:
# electra_tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')
# electra_model = ElectraModel.from_pretrained('google/electra-small-discriminator')

In [None]:
# %time  colors_train, tokens_train, raw_colors_test, texts_test = create_data(electra_tokenizer)

In [None]:
# electra_embeddings, electra_vocab = mu.extract_input_embeddings(texts_test, electra_model, electra_tokenizer)

In [None]:
#%time models = helper.train_many(colors_train[:5], tokens_train[:5], \
#                                 electra_vocab, electra_embeddings, electra_model, electra_tokenizer, \
#                                 extractors=embedding_extractors)

**Evaluate on test data**

In [None]:
#%time helper.evaluate_many(models, electra_tokenizer, raw_colors_test[:5], texts_test[:5])

**Evaluate with bake-off data the model hasn't seen**

In [None]:
#%time helper.evaluate_many(models, electra_tokenizer, b_raw_colors_test[:5], b_texts_test[:5])