# Summary of model
TL;DR: Our group noticed that most models from the course notebooks were critically weak at predicting SST-3 neutral labels, so we focused on boosting the prediction for this particular label.

Since this gap seemed to affect all the models, we were extremely interested in investigating how we could engineer a training set or training regime that could help (any) model overcome it. Thus, we started with a model that already seemed to do well and tried to optimize the training data to make the model better at its weakest prediction.

First, we surveyed the examples that were already included in the course notebooks in a [spreadsheet](https://docs.google.com/spreadsheets/d/18TpQ84CP4cQvLGLX-abzg06h3195AoURdKaGzb8wuU8/edit?usp=sharing&resourcekey=0-XMaX_Xj3pEve0AMA4-aH2g). We made three key observations:
- Training on DynaSent dev was very helpful, increasing the score of a particular model by 0.1 compared to training on SST-3 alone.
- BERT vectors offered a >0.1 point improvement over GloVe vectors; generally, the highest-scoring models so far were ones that used BERT encoding.
- The best-performing model so far uses BERT encoding and fine-tuning.

Based on these results, we set out to optimize a model built on BERT encoding and a fine-tuned classifier.

# Optimizing the training data

For all our experiments, we used SST-3-dev and DynaSent-dev as the assessment dataframes. We observed that our base BERT + RNN classifier struggled the most with predicting the neutral label on SST-3. Since the final score is the mean of the per-dataset macro-F1 scores, which weight each class and each dataset equally, this one weak class pulled the final score down substantially. We hoped that by focusing on the predictions in this category, we could reach a score in the 0.72 range, since the model achieves roughly this average score on the other labels and dataset.
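To make the effect of one weak class concrete, here is a minimal pure-Python sketch of macro-F1 and of the mean of macro-F1 scores across datasets. The helper names `macro_f1` and `mean_macro_f1` and the toy labels are our own illustration, not part of the course code, which reports these numbers through `classification_report`.

```python
from collections import Counter

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes present in `gold`."""
    labels = sorted(set(gold))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

def mean_macro_f1(datasets):
    """Average the macro-F1 of each (gold, pred) pair, one pair per dataset."""
    return sum(macro_f1(g, p) for g, p in datasets) / len(datasets)

# Hypothetical toy predictions in which every neutral example is missed.
gold = ["negative", "neutral", "positive", "neutral"]
pred = ["negative", "positive", "positive", "negative"]
toy_score = macro_f1(gold, pred)
print(round(toy_score, 3))
```

In this toy case half of the predictions are correct, yet the neutral class contributes an F1 of zero and carries a full third of the weight in the average, which is the effect we wanted to counteract.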

## Step 1: add DynaSent round 1 data
We noted that SST-3-train had a label distribution of (neg, neutral, pos)=(3310, 1624, 3610). We imported the DynaSent round 1 data and used a subset of it, with a label distribution of (neg, neutral, pos)=(2000, 4000, 2000). This allowed us to balance out the label distribution across the entire training set. Further, it meant we had roughly the same amount of SST-3 and DynaSent data to train with. (This is why we did not just use the entire round 1 dataset.) We chose the round 1 data because our BERT+RNN model is likely less robust than the RoBERTa model used in the DynaSent paper, and we reasoned that the round 1 data was sufficiently "difficult" for our model to learn and improve from.
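The balancing arithmetic can be checked with a short sketch. It uses only the label counts quoted above; the variable names are ours and purely illustrative.

```python
LABELS = ("negative", "neutral", "positive")

# Label counts as quoted in the text above.
sst_train_counts = dict(zip(LABELS, (3310, 1624, 3610)))
dynasent_subset_counts = dict(zip(LABELS, (2000, 4000, 2000)))

# Per-label totals after adding the DynaSent subset to SST-3-train.
combined = {
    label: sst_train_counts[label] + dynasent_subset_counts[label]
    for label in LABELS
}

# Share of neutral examples before and after the addition.
neutral_before = sst_train_counts["neutral"] / sum(sst_train_counts.values())
neutral_after = combined["neutral"] / sum(combined.values())

print(combined)
print(round(neutral_before, 2), round(neutral_after, 2))
```

With these counts, the neutral share rises from about 19% to about 34%, and the two sources contribute 8544 and 8000 examples respectively.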

We did not see a noticeable improvement in the neutral category from this addition, so...

## Step 2: add SST-3 subtree data
We decided to add more SST-3 data to the training set as well. Here, we loaded the subtree version of SST-3-train and took the first 3000 examples that had `label=neutral`. The idea was to boost the number of neutral SST-3 examples, without overwhelming the dataset with neutral labels, to avoid confusing the model with these more-difficult examples.

We noticed a 0.1 improvement in the neutral label F1 score after this addition. Furthermore, whereas the model we started with had roughly a 0.6 difference between its macro-average F1 scores on SST-3 and DynaSent, with this training data our model had very similar macro-average F1 scores on the two datasets. This was a positive change, potentially indicating a model with better domain-transfer abilities.

We tried increasing the number of subtree neutral examples to see if it would yield an even greater improvement, but the model's score on the negative label started to suffer, so we decided to stick with 3000.

Our final training data consisted of the SST-3 train root-level data, 3000 neutral SST-3 train subtree examples, and 8000 DynaSent round 1 examples.
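As a rough check on the final composition, the sketch below tallies the three sources using the counts stated in this write-up. The dictionary keys are illustrative names of ours, and the tally assumes the DynaSent subset follows the (2000, 4000, 2000) split described in Step 1.

```python
LABELS = ("negative", "neutral", "positive")

# Per-source (neg, neutral, pos) counts as stated in the text.
sources = {
    "sst_root": (3310, 1624, 3610),
    "sst_subtree_neutral": (0, 3000, 0),
    "dynasent_r1": (2000, 4000, 2000),
}

# Size of each source and of the whole training set.
source_sizes = {name: sum(counts) for name, counts in sources.items()}
total_size = sum(source_sizes.values())

# Column-wise sums give the final label distribution.
label_totals = dict(zip(LABELS, (sum(col) for col in zip(*sources.values()))))
neutral_share = label_totals["neutral"] / total_size

print(source_sizes, total_size)
print(label_totals, round(neutral_share, 2))
```

Under these assumptions, neutral becomes the largest class at roughly 44% of the data, which is consistent with our worry about adding even more neutral subtrees.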

## Step 3: explore different training regimes
We realized that by simply concatenating the above three data sources into our training set, we were presenting the training examples in a fixed order, one source after another.

In other papers, we've read about how different training regimes, for example showing the model easier examples first and harder ones later, or pretraining the model with a slightly different task, could affect its learning and performance. Therefore, some variations we tried were:
* Scrambling the entire dataframe.
  * Hypothesis: this could help bridge the gap between SST-3 neutral and DynaSent neutral label predictions, since all the data would be mixed up and the model would not be "biased" towards learning one of those representations.
  * Result: this decreased the performance.
* Training in different orders, such as [DynaSent, SST-3 root, SST-3 subtree neutrals] and [DynaSent, SST-3 subtree neutrals, SST-3 root].
  * Result: most of these permutations also decreased the score. Our final model is trained with data in the order [SST-3 root, SST-3 subtree, DynaSent], the permutation that gave the best results.
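The variations above can be sketched as follows. This is a minimal illustration with plain lists standing in for the dataframes; `build_training_order` and the block names are hypothetical. It only constructs the orderings, and whether a given order actually reaches the model depends on the training loop not reshuffling the examples, which we assume here.

```python
import random

def build_training_order(blocks, order, shuffle=False, seed=0):
    """Concatenate named blocks of examples in `order`; optionally shuffle the result."""
    examples = [ex for name in order for ex in blocks[name]]
    if shuffle:
        # A seeded generator keeps the scrambled order reproducible.
        random.Random(seed).shuffle(examples)
    return examples

# Tiny stand-ins for the three data sources.
blocks = {
    "sst_root": ["sst_root_0", "sst_root_1"],
    "sst_subtree": ["sst_sub_0", "sst_sub_1"],
    "dynasent": ["dyna_0", "dyna_1"],
}

final_order = build_training_order(blocks, ["sst_root", "sst_subtree", "dynasent"])
alt_order = build_training_order(blocks, ["dynasent", "sst_root", "sst_subtree"])
scrambled = build_training_order(
    blocks, ["sst_root", "sst_subtree", "dynasent"], shuffle=True, seed=42)

print(final_order)
print(alt_order)
```

Every variant contains exactly the same examples; only the presentation order differs, which is what lets the comparison isolate the effect of ordering.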

## Side note
Another reason for using subsets of the DynaSent and subtree data was that our group could not make use of Colab's GPU properly and each experiment took 40-60 minutes to run. Therefore, we wanted to keep the data size manageable while being able to observe some results.

# Hyperparameter tuning

From lecture, we noted that the choice of feature function often makes the most difference in the performance of the model. Since we wanted to use BERT features, there were few other ways to tweak this, which is another reason why we chose to focus on studying the effects of the training data.

However, we did perform some hyperparameter tuning on the model. The parameters we tuned were hidden dim, eta, batch size, gradient accumulation steps, set of BERT weights, and activation function. We found that, compared to the class defaults, a larger hidden dimension and a smaller eta were better. We reduced the batch size due to memory constraints, and correspondingly, the model performed better with a larger number of gradient accumulation steps, since this smooths out variability from the small batches. We found that the cased BERT weights worked better than the uncased weights, and that the Tanh activation function performed better than the others we tried.
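To illustrate why gradient accumulation compensates for small batches, here is a toy sketch on a one-parameter linear model. It assumes each micro-batch gradient is scaled by `1 / accumulation_steps` before being summed; we have not verified that the course implementation scales the loss in exactly this way, and the function names and data are hypothetical.

```python
def mse_grad(w, batch):
    """Gradient of mean((w * x - y) ** 2) over `batch` with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, batch_size, accumulation_steps):
    """Sum scaled micro-batch gradients, as done before a single optimizer step."""
    total = 0.0
    for step in range(accumulation_steps):
        micro = data[step * batch_size:(step + 1) * batch_size]
        total += mse_grad(w, micro) / accumulation_steps
    return total

# Toy data: 64 points on the line y = 3x + 1.
data = [(float(i), 3.0 * i + 1.0) for i in range(64)]
w = 0.5

full = mse_grad(w, data)
accum = accumulated_grad(w, data, batch_size=8, accumulation_steps=8)
effective_batch_size = 8 * 8

print(full, accum, effective_batch_size)
```

With a batch size of 8 and 8 accumulation steps, as in our final settings, each update behaves like one computed on 64 examples, while only 8 examples need to be in memory at a time.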

## Below is the code for our model.

## Imports

In [None]:
# this mounts your Google Drive to the Colab VM.
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# enter the foldername in your Drive where you have saved the unzipped
# assignment folder
FOLDERNAME = 'personal/CS224U/cs224u-kf/'
assert FOLDERNAME is not None, "[!] Enter the foldername."

# this ensures that the Python interpreter of the Colab VM can load
# python files from within it.
import sys
sys.path.append('/content/drive/My Drive/{}'.format(FOLDERNAME))

Mounted at /content/drive


In [None]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
Installing collected packages: sacremoses, tokenizers, transformers
Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.1


In [None]:
import os
from sklearn.metrics import classification_report
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

from torch_shallow_neural_classifier import TorchShallowNeuralClassifier
from torch_rnn_classifier import TorchRNNModel
from torch_rnn_classifier import TorchRNNClassifier
from torch_rnn_classifier import TorchRNNClassifierModel
import sst
import utils

import pandas as pd

## Training data setup

In [None]:
utils.fix_random_seeds()

In [None]:
SST_HOME = os.path.join(sys.path[-1], 'data/sentiment')

In [None]:
bakeoff_dev = sst.bakeoff_dev_reader(SST_HOME)
sst_dev = sst.dev_reader(SST_HOME)
sst_train = sst.train_reader(SST_HOME)

In [None]:
# load DynaSent dataset
# want format:
# (example_id, sentence, label, is_subtree)
# use text_id for example_id
# sentence = sentence
# gold_label -> label
# is_subtree = 0
import json

def load_dataset(*src_filenames, labels=None):
    data = []
    for filename in src_filenames:
        with open(filename) as f:
            for line in f:
                d = json.loads(line)
                if labels is None or d['gold_label'] in labels:
                    data.append(d)
    return data

dynasent_folder = os.path.join(sys.path[-1], 'data/dynasent-v1.1')
r1_train_filename = os.path.join(dynasent_folder, 'dynasent-v1.1-round01-yelp-train.jsonl')

In [None]:
# get a subsample of each label; neutral gets the larger share, matching the
# (neg, neutral, pos) = (2000, 4000, 2000) split described in Step 1
r1_train_neg = load_dataset(r1_train_filename, labels=('negative',))[:2000]
r1_train_neu = load_dataset(r1_train_filename, labels=('neutral',))[:4000]
r1_train_pos = load_dataset(r1_train_filename, labels=('positive',))[:2000]

def dynasent_to_df(examples):
    # Convert DynaSent records to the SST dataframe format; is_subtree is always 0.
    rows = [(d['text_id'], d['sentence'], d['gold_label'], 0) for d in examples]
    return pd.DataFrame(
        rows, columns=['example_id', 'sentence', 'label', 'is_subtree'])

df_neg = dynasent_to_df(r1_train_neg)
df_neu = dynasent_to_df(r1_train_neu)
df_pos = dynasent_to_df(r1_train_pos)

# concatenate all labels
df_whole = pd.concat([df_neg, df_neu, df_pos])
print(len(df_whole))


8000


In [None]:
# try using subtrees SST neutral examples --
# SST-train contains 8544 examples; (neg, neutral, pos)=(3310, 1624, 3610)
subtree_dedup_train_df = sst.train_reader(SST_HOME, include_subtrees=True, dedup=True)

In [None]:
# use 3000 additional neutral examples.
subtree_neutrals = subtree_dedup_train_df.loc[subtree_dedup_train_df['label'] == 'neutral'][:3000]
sst_boosted_ds_train = pd.concat([sst_train, subtree_neutrals, df_whole])

## Model

In [None]:
class HfBertClassifierModel(nn.Module):
    def __init__(self, n_classes, weights_name='bert-base-cased'):
        super().__init__()
        self.n_classes = n_classes
        self.weights_name = weights_name
        self.bert = BertModel.from_pretrained(self.weights_name)
        self.bert.train()
        self.hidden_dim = self.bert.embeddings.word_embeddings.embedding_dim
        # The only new parameters -- the classifier:
        self.classifier_layer = nn.Linear(
            self.hidden_dim, self.n_classes)

    def forward(self, indices, mask):
        reps = self.bert(
            indices, attention_mask=mask)
        return self.classifier_layer(reps.pooler_output)

In [None]:
class HfBertClassifier(TorchShallowNeuralClassifier):
    def __init__(self, weights_name, *args, **kwargs):
        self.weights_name = weights_name
        self.tokenizer = BertTokenizer.from_pretrained(self.weights_name)
        super().__init__(*args, **kwargs)
        self.params += ['weights_name']

    def build_graph(self):
        return HfBertClassifierModel(self.n_classes_, self.weights_name)

    def build_dataset(self, X, y=None):
        data = self.tokenizer.batch_encode_plus(
            X,
            max_length=None,
            add_special_tokens=True,
            padding='longest',
            return_attention_mask=True)
        indices = torch.tensor(data['input_ids'])
        mask = torch.tensor(data['attention_mask'])
        if y is None:
            dataset = torch.utils.data.TensorDataset(indices, mask)
        else:
            self.classes_ = sorted(set(y))
            self.n_classes_ = len(self.classes_)
            class2index = dict(zip(self.classes_, range(self.n_classes_)))
            y = [class2index[label] for label in y]
            y = torch.tensor(y)
            dataset = torch.utils.data.TensorDataset(indices, mask, y)
        return dataset

In [None]:
def bert_fine_tune_phi(text):
    return text

In [None]:
# This was first written as a version containing hyperparameter search
# and modified to use the final best version from the search
def fit_hf_bert_classifier(X, y):
    mod = HfBertClassifier(
        gradient_accumulation_steps=8,
        eta=0.0001,
        hidden_dim=300,
        weights_name='bert-base-cased', # also try bert-base-uncased
        batch_size=8,  # Small batches to avoid memory overload.
        max_iter=1,  # We'll search based on 1 iteration for efficiency.
        n_iter_no_change=5,   # Early-stopping params are for the
        early_stopping=True)  # final evaluation.

    mod.fit(X, y)

    return mod

In [None]:
sst_boosted_ds_train = pd.concat([sst_train, subtree_neutrals, df_whole]) # examples are NOT shuffled

# took around an hour maybe to run
# without gpu: still has not completed after 5h
bert_classifier_xval = sst.experiment(
    sst_boosted_ds_train,
    bert_fine_tune_phi,
    fit_hf_bert_classifier,
    assess_dataframes=[sst_dev, bakeoff_dev],
    vectorize=False)  # phi returns raw text; the classifier tokenizes it itself.

In [None]:
optimized_bert_classifier = bert_classifier_xval['model']

# Remove the rest of the experiment results to clear out some memory
del bert_classifier_xval

In [None]:
def fit_optimized_hf_bert_classifier(X, y):
    optimized_bert_classifier.max_iter = 1000
    optimized_bert_classifier.fit(X, y)
    return optimized_bert_classifier

In [None]:
# took 3.5h to run....
hfbert_experiment = sst.experiment(
    sst_boosted_ds_train, 
    bert_fine_tune_phi,
    fit_optimized_hf_bert_classifier,
    assess_dataframes=[sst_dev, bakeoff_dev],
    vectorize=False)  # phi returns raw text; the classifier tokenizes it itself.

Stopping after epoch 10. Validation score did not improve by tol=1e-05 for more than 5 epochs. Final error is 17.4847345644921

Assessment dataset 1
              precision    recall  f1-score   support

    negative      0.775     0.734     0.754       428
     neutral      0.427     0.345     0.382       229
    positive      0.732     0.842     0.783       444

    accuracy                          0.697      1101
   macro avg      0.645     0.640     0.640      1101
weighted avg      0.685     0.697     0.688      1101

Assessment dataset 2
              precision    recall  f1-score   support

    negative      0.606     0.703     0.651       565
     neutral      0.818     0.511     0.629      1019
    positive      0.594     0.817     0.688       777

    accuracy                          0.658      2361
   macro avg      0.673     0.677     0.656      2361
weighted avg      0.694     0.658     0.654      2361

Mean of macro-F1 scores: 0.648


In [None]:
def predict_one_rnn(text):
    # Singleton list holding the raw text (phi is the identity function):
    X = [hfbert_experiment['phi'](text)]
    # Standard `predict` step on a list of str; the classifier
    # tokenizes the text internally:
    preds = hfbert_experiment['model'].predict(X)
    # Be sure to return the only member of the predictions,
    # rather than the singleton list:
    return preds[0]

In [None]:
def create_bakeoff_submission(
        predict_one_func,
        output_filename='cs224u-sentiment-bakeoff-entry.csv'):

    bakeoff_test = sst.bakeoff_test_reader(SST_HOME)
    sst_test = sst.test_reader(SST_HOME)
    bakeoff_test['dataset'] = 'bakeoff'
    sst_test['dataset'] = 'sst3'
    df = pd.concat((bakeoff_test, sst_test))

    df['prediction'] = df['sentence'].apply(predict_one_func)

    df.to_csv(output_filename, index=None)

In [None]:
create_bakeoff_submission(predict_one_rnn)