This notebook is written based on [this reference implementation](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/6%20-%20Transformers%20for%20Sentiment%20Analysis.ipynb).

Other refs for model:
* https://stackoverflow.com/questions/65205582/how-can-i-add-a-bi-lstm-layer-on-top-of-bert-model
* https://discuss.pytorch.org/t/how-to-connect-hook-two-or-even-more-models-together/21033
* https://pytorch.org/tutorials/beginner/transformer_tutorial.html
* https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html

Other refs for torchtext:
* https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-i-5da6f1c89d84
* https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-ii-f146c8b9a496
* http://anie.me/On-Torchtext/

# Imports and setup

In [1]:
# Mount Google Drive.
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [2]:
!pip install transformers

Collecting transformers
[?25l  Downloading https://files.pythonhosted.org/packages/d5/43/cfe4ee779bbd6a678ac6a97c5a5cdeb03c35f9eaebbb9720b036680f9a2d/transformers-4.6.1-py3-none-any.whl (2.2MB)
[K     |████████████████████████████████| 2.3MB 14.2MB/s 
[?25hCollecting tokenizers<0.11,>=0.10.1
[?25l  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
[K     |████████████████████████████████| 3.3MB 52.6MB/s 
Collecting huggingface-hub==0.0.8
  Downloading https://files.pythonhosted.org/packages/a1/88/7b1e45720ecf59c6c6737ff332f41c955963090a18e72acbcbeac6b25e86/huggingface_hub-0.0.8-py3-none-any.whl
Collecting sacremoses
[?25l  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
[K     |████████████████████████████████| 901kB 49.9MB/s 
Installing 

In [3]:
import pandas as pd
import numpy as np
import os
import random
random.seed(1)
import re

# Data processing.
import torch
from torchtext.legacy import data 

# Model.
import torch.nn as nn
import torch.optim as optim
from transformers import DistilBertModel, DistilBertTokenizer

# Training.
from sklearn.model_selection import KFold

# Visualization.
import matplotlib.pyplot as plt

# Set working directory.
os.chdir('/content/gdrive/My Drive/personal/CS224U/project')

# Load a pre-trained BERT model

In [4]:
WEIGHTS_NAME = 'distilbert-base-uncased'

tokenizer = DistilBertTokenizer.from_pretrained(WEIGHTS_NAME)
bert = DistilBertModel.from_pretrained(WEIGHTS_NAME)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=28.0, style=ProgressStyle(description_w…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=466062.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=442.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=267967963.0, style=ProgressStyle(descri…




Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


# Read the data

In [5]:
# A utility function for reading data
# Takes the number of the study/sample and the label we want to extract (e.g., "Novelty_Combined")
# Return the a df with a column named 'text' and a column named 'label'
# Can also choose to shuffle
def get_data(study, metric, shuffle = True):

  sheet_df = pd.read_excel("Idea Ratings_Berg_2019_OBHDP.xlsx", sheet_name=study-1) 
  sheet_df.dropna(inplace=True)
  data_df = sheet_df[['Final_Idea', metric]].rename(columns={'Final_Idea': 'text', metric: 'label'})

  if shuffle:
    data_df = data_df.sample(frac=1)
  return data_df

# Take a list with the numbers of studies
# Extract multiple datasets with get_data and concatenate them
def get_multiple_datasets(study_list, metric, shuffle = True):
  dfs = [get_data(study, metric, shuffle) for study in study_list]
  return pd.concat(dfs)

In [6]:
data_df = get_multiple_datasets([0,1,2], 'Creativity_Combined', shuffle=True)

In [7]:
# For prototype purposes:
# split into train, test sets. (Train set will be further split into 
# train+validation sets, via k-fold CV.)
train_df = data_df[:1000]
test_df = data_df[1000:] # roughly 190 test examples set aside

# write them to CSV files
train_df.to_csv('ktrain.csv', index=False, header=False)
test_df.to_csv('ktest.csv', index=False, header=False)

## Preprocessing and transform into torchtext Dataset format.

From what I understand, some preprocessing is done when data.Field() is applied.

In [8]:
INIT_TOKEN_IDX = tokenizer.cls_token_id
EOS_TOKEN_IDX = tokenizer.sep_token_id
PAD_TOKEN_IDX = tokenizer.pad_token_id
UNK_TOKEN_IDX = tokenizer.unk_token_id

# BERT input can be at most 512 words
MAX_INPUT_LENGTH = tokenizer.max_model_input_sizes[WEIGHTS_NAME]

# Apply tokenization and some preprocessing steps to the input sentence.
# Namely, this trims examples down to MAX_INPUT_LENGTH. (There is a -2 
# since the [CLS] and [SEP] tokens will be added)
def tokenize_and_cut(sentence):
  sentence = sentence.replace('/', '') # remove slashes
  tokens = tokenizer.tokenize(sentence) 
  tokens = tokens[:MAX_INPUT_LENGTH-2]
  return tokens

# text_fields defines preprocessing and handling of the text of an example.
text_fields = data.Field(batch_first = True,
                  use_vocab = False,
                  tokenize = tokenize_and_cut,
                  preprocessing = tokenizer.convert_tokens_to_ids,
                  init_token = INIT_TOKEN_IDX, # add [CLS] token
                  eos_token = EOS_TOKEN_IDX, # add [SEP] token
                  pad_token = PAD_TOKEN_IDX,
                  unk_token = UNK_TOKEN_IDX)

# label_fields defines how to handle the label of an example.
# for regression, we do not need to build a vocabulary.
label_fields = data.LabelField(sequential=False, use_vocab=False, dtype = torch.float)
all_fields = [('text', text_fields), ('label', label_fields)] # must match order of cols in csv

train_dataset, test_dataset = data.TabularDataset.splits(
  path='', # path='' because the csvs are in the same directory
  train='ktrain.csv', test='ktest.csv', format='csv',
  fields=all_fields  
)

In [9]:
# # Just inspect what the tokenizer is doing
# # // and escape characters \ are kept. We may want to remove them
# print(data_df['text'][1])
# print(tokenize_and_cut(data_df['text'][1]))

In [10]:
# Transform train_dataset into an np array representation.
# This will be used for generating the K folds.
train_exs_arr = np.array(train_dataset.examples)

# Define the BERT-RNN model

In [11]:
class BERTRNN(nn.Module):
  def __init__(self,
               bert,
               hidden_dim,
               output_dim,
               n_layers,
               bidirectional,
               dropout):
    super().__init__()
    self.bert = bert
    # Modify this if we want to concatenate something onto BERT embedding
    # Note: 'dim' is equivalent of 'hidden_size' for BERT model
    embedding_dim = bert.config.to_dict()['dim']

    # TODO: change to lstm cells.
    # self.rnn = nn.GRU(embedding_dim,
    #                   hidden_dim,
    #                   num_layers = n_layers,
    #                   bidirectional = bidirectional,
    #                   batch_first = True,
    #                   dropout = 0 if n_layers < 2 else dropout)
    
    # TODO: need to modify this if we want to set bidirectional=True
    self.out = nn.Linear(hidden_dim, output_dim)
    self.dropout = nn.Dropout(dropout)
    # TODO: we probably need some regression output layer instead.

  def forward(self, text):
    # forward pass of bert; then take the output of CLS token
    embedded = self.bert(text)[0]

    _, hidden = self.rnn(embedded)

    # TODO: need to modify this if bidirectional=True
    # for prototype purposes, assume we won't use bidirectional
    hidden = self.dropout(hidden[-1,:,:])
    output = self.out(hidden)
    return output



In [12]:
import torch
torch.cuda.empty_cache()

In [13]:
# Instantiate the model
HIDDEN_DIM = 64
OUTPUT_DIM = 1
N_LAYERS = 1
BIDIRECTIONAL = False
DROPOUT = 0.25

model = BERTRNN(bert,
                HIDDEN_DIM,
                OUTPUT_DIM,
                N_LAYERS,
                BIDIRECTIONAL,
                DROPOUT)

# Training pipeline begins here


## Define training parameters

In [14]:
BATCH_SIZE = 4
N_EPOCHS = 2 # TODO we can increase this

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

optimizer = optim.Adam(model.parameters())
criterion = nn.MSELoss(size_average=False)

model = model.to(device)
criterion = criterion.to(device)



In [15]:
# model.train() # Uncomment to view structure of model.

## Define helper functions

In [16]:
def train(model, iterator, optimizer, criterion):
  epoch_loss = 0
  epoch_corr = 0
  
  model.train()

  for batch in iterator:
    optimizer.zero_grad()
    predictions = model(batch.text).squeeze(1)
    loss = criterion(predictions, batch.label)
    # need to use detach() since `predictions` requires gradient
    # alternative: scipy.stats.pearsonr? (might be more memory efficient,
    # but not sure which one is more efficient to compute)
    corr = np.corrcoef(batch.label.cpu().data.numpy(), predictions.detach().cpu().data.numpy())
    loss.backward()
    optimizer.step()

    epoch_loss += loss.item()
    # corr is a (2,2) matrix, so we just get the top right element.
    # If the correlation is a nan value, replace with 0, which means
    # no correlation.
    corr_value = corr[0][1].item()
    if np.isnan(corr[0][1]):
      corr_value = 0

    epoch_corr += corr_value

  return epoch_loss / len(iterator), epoch_corr / len(iterator)

In [17]:
def evaluate(model, iterator, criterion):
  epoch_loss = 0
  epoch_corr = 0

  model.eval()

  # i = 0
  with torch.no_grad():
    for batch in iterator:
      # print(i)
      # i += 1
      predictions = model(batch.text).squeeze(1)
      # print(predictions) # uncomment to see how the predictions look compared to labels
      # print(batch.label)
      loss = criterion(predictions, batch.label)
      corr = np.corrcoef(batch.label.cpu().data, predictions.cpu().data)
      epoch_loss += loss.item()

      # If the correlation is a nan value, replace with 0, which means
      # no correlation.
      corr_value = corr[0][1].item()
      if np.isnan(corr[0][1]):
        corr_value = 0

      epoch_corr += corr_value

  return epoch_loss / len(iterator), epoch_corr / len(iterator)

In [18]:
# Given train and validation datasets, returns 2 iterators.
def get_iterators(train_data, valid_data):
  return data.BucketIterator.splits(
    (train_data, valid_data),
    batch_size = BATCH_SIZE,
    device = device,
    # Below are needed to overcome error when calling evaluate():
    # TypeError: '<' not supported between instances of 'Example' and 'Example'
    sort_key = lambda x: len(x.text),
    sort_within_batch = False,
  )

## The cell where it actually trains!

In [19]:
# The main training loop
# TODO: add some sort of weights-saving, either periodically or at the end
# This way we can save our trained model and use it easily for downstream
# analysis without having to re-train.
# TODO: add some sort of timing info / progress bar.
def launch_experiment(train_data_df):
  best_valid_loss = float('inf') 
  
  kf = KFold(n_splits=5)
  for train_index, valid_index in kf.split(train_data_df):
    train_data = data.Dataset(train_exs_arr[train_index], all_fields)
    valid_data = data.Dataset(train_exs_arr[valid_index], all_fields)

    train_iterator, valid_iterator = get_iterators(train_data, valid_data)

    

    for epoch in range(N_EPOCHS):
      train_loss, train_corr = train(model, train_iterator, optimizer, criterion)
      valid_loss, valid_corr = evaluate(model, valid_iterator, criterion)

      if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss

      print(f'\tTrain Loss: {train_loss:.3f} | Train Corr: {train_corr:.2f}')
      print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Corr: {valid_corr:.2f}')
  
  return best_valid_loss 

launch_experiment(train_exs_arr)
#print(best_valid_loss)

AttributeError: ignored

# Test the trained model on held-out dataset.

In [None]:
# Get a test iterator
test_iterator = data.BucketIterator(
  test_dataset,
  batch_size = BATCH_SIZE,
  device = device,
  # Below are needed to overcome error when calling evaluate():
  # TypeError: '<' not supported between instances of 'Example' and 'Example'
  sort_key = lambda x: len(x.text),
  sort_within_batch = False,
)

In [None]:
test_loss, test_corr = evaluate(model, test_iterator, criterion)
print(test_loss)
print(test_corr)

  c /= stddev[:, None]
  c /= stddev[None, :]
  c = cov(x, y, rowvar)
  c *= np.true_divide(1, fact)
  c *= np.true_divide(1, fact)


2.312805041721141
-0.07726163023834338


# Misc other stuff

In [None]:
# When predictions are all identical, we will get a nan value (https://www.kaggle.com/general/186524)
k_pred = np.array([1.5818, 1.5818, 1.5818, 1.5818, 1.5818, 1.5818, 1.5818, 1.5818, 1.5818,
        1.5818, 1.5818, 1.5818, 1.5818, 1.5818, 1.5818, 1.5818])
k_label = np.array([3.0500, 4.5750, 3.8500, 2.2750, 4.1000, 3.9000, 3.3750, 2.9750, 4.0500,
        4.5000, 3.8500, 3.6250, 4.0000, 5.4750, 3.2250, 3.5500])
print(np.corrcoef(k_label, k_pred))

[[ 1. nan]
 [nan nan]]


  c /= stddev[:, None]
  c /= stddev[None, :]


In [None]:
# I use this chunk to collect code from happy transformer that may be relevant to our study
    def _get_training_args(args, output_path):
        """
        :param args: a dictionary of arguments for training
        :param output_path: A string to a temporary directory
        :return: A TrainingArguments object
        """
        return TrainingArguments(
            output_dir=output_path,
            learning_rate=args["learning_rate"],
            weight_decay=args["weight_decay"],
            adam_beta1=args["adam_beta1"],
            adam_beta2=args["adam_beta2"],
            adam_epsilon=args["adam_epsilon"],
            max_grad_norm=args["max_grad_norm"],
            num_train_epochs=args["num_train_epochs"],

        )

    def _run_train(self, dataset, args):
    """

    :param dataset: a child of torch.utils.data.Dataset
    :param args: a dictionary that contains settings
    :return: None
    """
    with tempfile.TemporaryDirectory() as tmp_dir_name:
        training_args = self._get_training_args(args, tmp_dir_name)
        trainer = Trainer( #This trainer class comes from hugging face
            model=self.model,
            args=training_args,
            train_dataset=dataset,
        )
        trainer.train()

# The actual class used in my study
class DistilBertForSequenceClassification(DistilBertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.config = config

        self.distilbert = DistilBertModel(config)
        self.pre_classifier = nn.Linear(config.dim, config.dim)
        self.classifier = nn.Linear(config.dim, config.num_labels)
        self.dropout = nn.Dropout(config.seq_classif_dropout)

        self.init_weights()

[DOCS]    @add_start_docstrings_to_model_forward(DISTILBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
    @add_code_sample_docstrings(
        tokenizer_class=_TOKENIZER_FOR_DOC,
        checkpoint=_CHECKPOINT_FOR_DOC,
        output_type=SequenceClassifierOutput,
        config_class=_CONFIG_FOR_DOC,
    )
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        r"""
        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
            Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ...,
            config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss),
            If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        distilbert_output = self.distilbert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        hidden_state = distilbert_output[0]  # (bs, seq_len, dim)
        pooled_output = hidden_state[:, 0]  # (bs, dim)
        pooled_output = self.pre_classifier(pooled_output)  # (bs, dim)
        pooled_output = nn.ReLU()(pooled_output)  # (bs, dim)
        pooled_output = self.dropout(pooled_output)  # (bs, dim)
        logits = self.classifier(pooled_output)  # (bs, num_labels)

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(logits, labels)

        if not return_dict:
            output = (logits,) + distilbert_output[1:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=distilbert_output.hidden_states,
            attentions=distilbert_output.attentions,
        )


Link to the trainer class: https://huggingface.co/transformers/main_classes/trainer.html



Default training arguments: https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments

Batch size per device: 8

Epoch: 3



This should be the model I used to generate my initial results: https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforsequenceclassification
"DistilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks."