This notebook is based on [this reference implementation](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/6%20-%20Transformers%20for%20Sentiment%20Analysis.ipynb).

Other refs for model:
* https://stackoverflow.com/questions/65205582/how-can-i-add-a-bi-lstm-layer-on-top-of-bert-model
* https://discuss.pytorch.org/t/how-to-connect-hook-two-or-even-more-models-together/21033
* https://pytorch.org/tutorials/beginner/transformer_tutorial.html
* https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html

Other refs for torchtext:
* https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-i-5da6f1c89d84
* https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-ii-f146c8b9a496
* http://anie.me/On-Torchtext/

# Imports and setup

In [None]:
# Mount Google Drive.
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [None]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d5/43/cfe4ee779bbd6a678ac6a97c5a5cdeb03c35f9eaebbb9720b036680f9a2d/transformers-4.6.1-py3-none-any.whl (2.2MB)
Collecting huggingface-hub==0.0.8
  Downloading https://files.pythonhosted.org/packages/a1/88/7b1e45720ecf59c6c6737ff332f41c955963090a18e72acbcbeac6b25e86/huggingface_hub-0.0.8-py3-none-any.whl
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
Installing c

In [None]:
import pandas as pd
import numpy as np
import os
import random
random.seed(1)
import re

# Data processing.
import torch
from torchtext.legacy import data 

# Model.
import torch.nn as nn
import torch.optim as optim
from transformers import DistilBertModel, DistilBertTokenizer

# Training.
from sklearn.model_selection import KFold

# Visualization.
import matplotlib.pyplot as plt

# Set working directory.
os.chdir('/content/gdrive/My Drive/personal/CS224U/project')

# Load a pre-trained BERT model

In [None]:
WEIGHTS_NAME = 'distilbert-base-uncased'

tokenizer = DistilBertTokenizer.from_pretrained(WEIGHTS_NAME)
bert = DistilBertModel.from_pretrained(WEIGHTS_NAME)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


# Read the data

In [None]:
# For illustrative purposes, reading just one study as the "dataset".
# TODO: read and combine the data from the rest of the studies.
sheet_df = pd.read_excel("Idea Ratings_Berg_2019_OBHDP.xlsx", sheet_name=0)
sheet_df.dropna(inplace=True) # For some reason, first sheet has an extra NaN row at the bottom. This makes sure it's removed.
data_df = sheet_df[['Final_Idea', 'Creativity_Combined']].rename(columns={'Final_Idea': 'text', 'Creativity_Combined': 'label'})

# shuffle the rows
data_df = data_df.sample(frac=1)

In [None]:
# A utility function for reading data.
# Takes the number of the study/sample and the label we want to extract (e.g., "Novelty_Combined").
# Returns a df with a column named 'text' and a column named 'label'.
# Can also choose whether to shuffle the rows.

def get_data(study, metric, shuffle = True):

  sheet_df = pd.read_excel("Idea Ratings_Berg_2019_OBHDP.xlsx", sheet_name=study-1) 
  sheet_df.dropna(inplace=True)
  # Select the requested metric column (previously hard-coded to 'Creativity_Combined').
  data_df = sheet_df[['Final_Idea', metric]].rename(columns={'Final_Idea': 'text', metric: 'label'})

  if shuffle:
    data_df = data_df.sample(frac=1)
  return data_df

# Takes a list with the numbers of the studies,
# extracts each dataset with get_data, and concatenates them.

def get_multiple_datasets(study_list, metric, shuffle = True):
  dfs = [get_data(study, metric, shuffle) for study in study_list]
  return pd.concat(dfs)

In [None]:
# illustrate the function
get_multiple_datasets([1,3], "Creativity_Combined")

Unnamed: 0,text,label
69,The idea is a product that can help with more ...,4.325
250,I think it would be good to ad some sort of fa...,2.975
121,A new kind of treadmill that is fully accessib...,4.000
252,The device would have straps that attached to ...,2.975
273,A foldable step machine that you can use as a ...,2.675
...,...,...
174,This is a European Tour done by train in which...,4.225
60,Take a Haunted Trip Destinations could be to A...,4.825
310,A train travel of Bernina Express between Chur...,2.175
278,a small train with about ten cabs behind the f...,3.300


In [None]:
# For prototype purposes:
# assign binary classification labels
# score <= 3.8 --> negative (not creative)
# score > 3.8 --> positive (creative)
data_df['label'] = data_df['label'].apply(lambda x: 0 if x <= 3.8 else 1)
print(data_df.head(1))

                                                  text  label
204  My idea is to make a bike that can be used for...      0
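
The thresholding rule above can be sketched in isolation. This is a minimal illustration, not part of the pipeline: the first two scores are taken from the table printed earlier, while the last two are hypothetical values chosen to show what happens at and just above the cut-off. The share of positive labels is worth checking because it sets the baseline against which accuracy is read later.

```python
THRESHOLD = 3.8  # same cut-off as in the cell above

def binarize(score, threshold=THRESHOLD):
  """Return 1 (creative) if score > threshold, else 0 (not creative)."""
  return 0 if score <= threshold else 1

# 4.325 and 2.975 appear in the table above; 3.8 and 3.825 are hypothetical.
scores = [4.325, 2.975, 3.8, 3.825]
labels = [binarize(s) for s in scores]

# Fraction of examples labelled positive under this threshold.
positive_rate = sum(labels) / len(labels)
print(labels, positive_rate)
```

A score exactly equal to the threshold falls in the negative class.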


In [None]:
# For prototype purposes:
# split into train, test sets. (Train set will be further split into 
# train+validation sets, via k-fold CV.)
train_df = data_df[:200]
test_df = data_df[200:]

# write them to CSV files
train_df.to_csv('train.csv', index=False, header=False)
test_df.to_csv('test.csv', index=False, header=False)
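
One caveat on reproducibility: `random.seed(1)` in the imports cell seeds only Python's built-in `random` module, and as far as I know `DataFrame.sample` draws from NumPy's generator unless `random_state` is passed. The train/test membership may therefore change from run to run. Below is a minimal standard-library sketch of a seeded shuffle-and-split; `shuffle_and_split` is a hypothetical helper, not something the notebook defines.

```python
import random

def shuffle_and_split(rows, n_train, seed=1):
  """Shuffle a copy of rows with a fixed seed, then cut it into train/test."""
  rng = random.Random(seed)  # private generator, independent of global state
  shuffled = list(rows)
  rng.shuffle(shuffled)
  return shuffled[:n_train], shuffled[n_train:]

rows = list(range(10))  # stand-in for the rows of data_df
train_a, test_a = shuffle_and_split(rows, n_train=7)
train_b, test_b = shuffle_and_split(rows, n_train=7)
print(train_a == train_b, len(train_a), len(test_a))
```

With pandas, passing a fixed `random_state` to `sample` (for example `data_df.sample(frac=1, random_state=1)`) should have the same effect.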

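## Preprocess and transform into torchtext Dataset format

From what I understand, `data.Field()` only declares the preprocessing (tokenization, conversion to ids, special tokens, padding); the steps are actually applied when the dataset is built and when batches are created.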

In [None]:
INIT_TOKEN_IDX = tokenizer.cls_token_id
EOS_TOKEN_IDX = tokenizer.sep_token_id
PAD_TOKEN_IDX = tokenizer.pad_token_id
UNK_TOKEN_IDX = tokenizer.unk_token_id

# BERT input can be at most 512 tokens (word pieces, not words)
MAX_INPUT_LENGTH = tokenizer.max_model_input_sizes[WEIGHTS_NAME]

# Apply tokenization and some preprocessing steps to the input sentence.
# Namely, this trims examples down to MAX_INPUT_LENGTH. (There is a -2 
# since the [CLS] and [SEP] tokens will be added)
def tokenize_and_cut(sentence):
  tokens = tokenizer.tokenize(sentence) 
  tokens = tokens[:MAX_INPUT_LENGTH-2]
  return tokens

# text_fields defines preprocessing and handling of the text of an example.
text_fields = data.Field(batch_first = True,
                  use_vocab = False,
                  tokenize = tokenize_and_cut,
                  preprocessing = tokenizer.convert_tokens_to_ids,
                  init_token = INIT_TOKEN_IDX, # add [CLS] token
                  eos_token = EOS_TOKEN_IDX, # add [SEP] token
                  pad_token = PAD_TOKEN_IDX,
                  unk_token = UNK_TOKEN_IDX)

# label_fields defines how to handle the label of an example.
label_fields = data.LabelField(dtype = torch.float)
all_fields = [('text', text_fields), ('label', label_fields)]

train_dataset, test_dataset = data.TabularDataset.splits(
  path='', # path='' because the csvs are in the same directory
  train='train.csv', test='test.csv', format='csv',
  fields=all_fields # must match order of cols in csv 
)
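
To make the `-2` in `tokenize_and_cut` concrete, here is a small sketch of what truncation, special tokens, and padding do to a batch of id sequences. It only mimics the behaviour configured in the field above. The ids 101, 102 and 0 are assumptions (the usual `[CLS]`, `[SEP]` and `[PAD]` ids for uncased BERT vocabularies); in the notebook they come from the tokenizer. `encode` and `pad_batch` are hypothetical helper names.

```python
# Assumed special-token ids; read them from the tokenizer in real code.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def encode(token_ids, max_len):
  """Truncate to max_len - 2 ids, then wrap with [CLS] ... [SEP]."""
  body = token_ids[:max_len - 2]
  return [CLS_ID] + body + [SEP_ID]

def pad_batch(sequences):
  """Right-pad every sequence to the length of the longest one."""
  width = max(len(s) for s in sequences)
  return [s + [PAD_ID] * (width - len(s)) for s in sequences]

long_example = encode([7, 8, 9, 10, 11], max_len=6)  # one id is cut off
short_example = encode([7, 8], max_len=6)            # nothing is cut off
batch = pad_batch([long_example, short_example])
print(batch)
```

Reserving two positions before adding the special tokens keeps the final sequence within the maximum length.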

In [None]:
# Inspect what the tokenizer is doing.
# Separators such as '/' and escape characters '\' are kept as tokens; we may want to remove them.
print(data_df['text'][1])
tokenize_and_cut(data_df['text'][1])

The "Real Row" is an exercise machine that simulates the real feeling of rowing down your favorite river. With the capability to change yaw, pitch, and roll (to a limited degree) you'll feel like you're outside enjoying the water. /  / Sit down, strap on the seatbelt, and choose a route from the monitor. The machine will them program all the motion that would occur on that river as you row down it. The resistance will change based on water conditions. The boat will twist and turn, rise and fall, as you cross simulated waves. /  / As you row, the monitor will display a beautifully rendered landscape along with the river you're on. You can actually see more challenging or less challenging paths, and steer towards your preference. /  / The two-person variant offers you a chance to work out with a partner, and will provide both visual and audio feedback on how well you are working together.


['the',
 '"',
 'real',
 'row',
 '"',
 'is',
 'an',
 'exercise',
 'machine',
 'that',
 'simulate',
 '##s',
 'the',
 'real',
 'feeling',
 'of',
 'rowing',
 'down',
 'your',
 'favorite',
 'river',
 '.',
 'with',
 'the',
 'capability',
 'to',
 'change',
 'ya',
 '##w',
 ',',
 'pitch',
 ',',
 'and',
 'roll',
 '(',
 'to',
 'a',
 'limited',
 'degree',
 ')',
 'you',
 "'",
 'll',
 'feel',
 'like',
 'you',
 "'",
 're',
 'outside',
 'enjoying',
 'the',
 'water',
 '.',
 '/',
 '/',
 'sit',
 'down',
 ',',
 'strap',
 'on',
 'the',
 'seat',
 '##belt',
 ',',
 'and',
 'choose',
 'a',
 'route',
 'from',
 'the',
 'monitor',
 '.',
 'the',
 'machine',
 'will',
 'them',
 'program',
 'all',
 'the',
 'motion',
 'that',
 'would',
 'occur',
 'on',
 'that',
 'river',
 'as',
 'you',
 'row',
 'down',
 'it',
 '.',
 'the',
 'resistance',
 'will',
 'change',
 'based',
 'on',
 'water',
 'conditions',
 '.',
 'the',
 'boat',
 'will',
 'twist',
 'and',
 'turn',
 ',',
 'rise',
 'and',
 'fall',
 ',',
 'as',
 'you',
 'cross',
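
The ` /  / ` sequences in the sample text look like paragraph separators, and each slash ends up as its own token. If we decide to remove them, a regular expression applied before tokenization is one option. The sketch below is an assumption about what the cleaning could look like, not something the notebook currently does; `strip_separators` is a hypothetical name. The pattern only matches two slashes separated by whitespace, so that strings like `and/or` or URLs are left alone.

```python
import re

# Two slashes with whitespace between them, plus any surrounding whitespace.
SEPARATOR = re.compile(r"\s*/\s+/\s*")

def strip_separators(text):
  """Replace ' /  / ' style separators with a single space."""
  text = SEPARATOR.sub(" ", text)
  # Collapse any remaining runs of whitespace.
  return re.sub(r"\s+", " ", text).strip()

sample = "enjoying the water. /  / Sit down, strap on the seatbelt"
cleaned = strip_separators(sample)
print(cleaned)
```

The function could be called at the top of `tokenize_and_cut`, or applied to the `text` column before writing the CSV files.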

In [None]:
# We have to build a 'vocabulary' for the labels.
label_fields.build_vocab(train_dataset)
# TODO: make this mapping less confusing: label '1' is assigned index 0 and label '0' index 1.
print(label_fields.vocab.stoi)

defaultdict(None, {'1': 0, '0': 1})
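
The mapping above means that the training target 0.0 corresponds to the original label `1` (creative) and the target 1.0 to the original label `0`. My understanding, which should be treated as an assumption, is that the legacy vocabulary assigns indices by descending frequency with ties broken alphabetically, so the more frequent label receives index 0. The sketch below reproduces that ordering with the standard library and shows how to map an index back to the original label; `build_label_index` is a hypothetical helper.

```python
from collections import Counter

def build_label_index(labels):
  """Assign indices by descending frequency, ties broken alphabetically."""
  counts = Counter(labels)
  ordered = sorted(counts.items(), key=lambda kv: kv[0])  # alphabetical
  ordered.sort(key=lambda kv: kv[1], reverse=True)        # stable, by count
  return {label: idx for idx, (label, _) in enumerate(ordered)}

# Hypothetical label column in which '1' is the more frequent class.
stoi = build_label_index(["1", "0", "1", "1", "0"])
itos = {idx: label for label, idx in stoi.items()}
print(stoi, itos[0])
```

Accuracy is unaffected by the swap, but any interpretation of predicted probabilities needs to go through `itos`.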


In [None]:
# Transform train_dataset into an np array representation.
# This will be used for generating the K folds.
train_exs_arr = np.array(train_dataset.examples)
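
For reference, this is roughly how the folds are carved out of the array when no shuffling is requested: each fold takes one contiguous block as validation data and the rest as training data. The generator below is a standard-library sketch of that scheme, written to mirror what I expect `KFold(n_splits=5)` to produce; it is not the library's implementation, and `kfold_indices` is a hypothetical name.

```python
def kfold_indices(n_samples, n_splits):
  """Yield (train_idx, valid_idx) lists for contiguous, unshuffled folds."""
  base, extra = divmod(n_samples, n_splits)
  start = 0
  for fold in range(n_splits):
    # The first `extra` folds get one additional sample.
    size = base + (1 if fold < extra else 0)
    valid = list(range(start, start + size))
    train = list(range(0, start)) + list(range(start + size, n_samples))
    yield train, valid
    start += size

folds = list(kfold_indices(200, 5))
for train_idx, valid_idx in folds:
  print(len(train_idx), len(valid_idx), valid_idx[0], valid_idx[-1])
```

With 200 training examples and 5 folds, every validation block has 40 examples and every training block 160.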

# Define the BERT-RNN model

In [None]:
class BERTRNN(nn.Module):
  def __init__(self,
               bert,
               hidden_dim,
               output_dim,
               n_layers,
               bidirectional,
               dropout):
    super().__init__()
    self.bert = bert
    # Modify this if we want to concatenate something onto BERT embedding
    # Note: 'dim' is equivalent of 'hidden_size' for BERT model
    embedding_dim = bert.config.to_dict()['dim']

    # TODO: change to lstm cells.
    self.rnn = nn.GRU(embedding_dim,
                      hidden_dim,
                      num_layers = n_layers,
                      bidirectional = bidirectional,
                      batch_first = True,
                      dropout = 0 if n_layers < 2 else dropout)
    
    # TODO: need to modify this if bidirectional=True
    self.out = nn.Linear(hidden_dim, output_dim)
    self.dropout = nn.Dropout(dropout)
    # TODO: we probably need some regression output layer instead.

  def forward(self, text):
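    # Forward pass of BERT; [0] is the last hidden state of every token,
    # shape (batch, seq_len, dim), not only the output of the [CLS] token.
    # TODO: pass an attention mask so that padding tokens are ignored.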
    embedded = self.bert(text)[0]

    _, hidden = self.rnn(embedded)

    # TODO: need to modify this if bidirectional=True
    # for prototype purposes, assume we won't use bidirectional
    hidden = self.dropout(hidden[-1,:,:])
    output = self.out(hidden)
    return output



In [None]:
# Instantiate the model
HIDDEN_DIM = 10 # TODO: this should be much bigger
OUTPUT_DIM = 1
N_LAYERS = 1
BIDIRECTIONAL = False
DROPOUT = 0.25

model = BERTRNN(bert,
                HIDDEN_DIM,
                OUTPUT_DIM,
                N_LAYERS,
                BIDIRECTIONAL,
                DROPOUT)

# Training pipeline begins here


## Define training parameters

In [None]:
BATCH_SIZE = 16 # TODO increase this
N_EPOCHS = 2 # TODO we can increase this

optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()
# TODO: place model + criterion onto GPU device.

In [None]:
# model.train() # Uncomment to view structure of model.

## Define helper functions

In [None]:
def binary_accuracy(preds, y):
  rounded_preds = torch.round(torch.sigmoid(preds))
  correct = (rounded_preds == y).float()
  acc = correct.sum() / len(correct)
  return acc
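
A worked example may help when checking `binary_accuracy` by hand. The sketch below redoes the same computation in plain Python on hypothetical logits: apply the sigmoid, round to 0 or 1, and compare with the targets. `binary_accuracy_py` is an illustrative name, not a replacement for the tensor version.

```python
import math

def sigmoid(x):
  return 1.0 / (1.0 + math.exp(-x))

def binary_accuracy_py(logits, targets):
  """Fraction of examples whose rounded sigmoid output equals the target."""
  preds = [round(sigmoid(z)) for z in logits]
  correct = sum(1 for p, t in zip(preds, targets) if p == t)
  return correct / len(targets)

# Hypothetical logits and targets.
logits = [2.0, -1.0, 0.3, -0.2]
targets = [1, 0, 0, 0]
acc = binary_accuracy_py(logits, targets)
print(acc)
```

Rounding the sigmoid output is the same as testing whether the logit is positive, so the third example (logit 0.3, target 0) is the single error here.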

In [None]:
def train(model, iterator, optimizer, criterion):
  epoch_loss = 0
  epoch_acc = 0
  
  model.train()

  for batch in iterator:
    optimizer.zero_grad()
    predictions = model(batch.text).squeeze(1)
    loss = criterion(predictions, batch.label)
    acc = binary_accuracy(predictions, batch.label)
    loss.backward()
    optimizer.step()

    epoch_loss += loss.item()
    epoch_acc += acc.item()

  return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
def evaluate(model, iterator, criterion):
  epoch_loss = 0
  epoch_acc = 0

  model.eval()

  with torch.no_grad():
    for batch in iterator:
      predictions = model(batch.text).squeeze(1)
      loss = criterion(predictions, batch.label)
      acc = binary_accuracy(predictions, batch.label)
      epoch_loss += loss.item()
      epoch_acc += acc.item()

  return epoch_loss / len(iterator), epoch_acc / len(iterator)
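
Note that `train` and `evaluate` return the mean of the per-batch means, which gives a smaller final batch the same weight as a full one. With 40 validation examples per fold and a batch size of 16, the batches have sizes 16, 16 and 8. The sketch below compares the two ways of averaging on hypothetical per-batch accuracies; the function names are illustrative only.

```python
def mean_of_batch_means(batch_values):
  """What train()/evaluate() compute: every batch counts equally."""
  return sum(batch_values) / len(batch_values)

def example_weighted_mean(batch_values, batch_sizes):
  """Alternative: every example counts equally."""
  total = sum(v * n for v, n in zip(batch_values, batch_sizes))
  return total / sum(batch_sizes)

batch_sizes = [16, 16, 8]     # 40 validation examples at batch size 16
batch_accs = [0.5, 0.5, 1.0]  # hypothetical per-batch accuracies

unweighted = mean_of_batch_means(batch_accs)
weighted = example_weighted_mean(batch_accs, batch_sizes)
print(round(unweighted, 4), round(weighted, 4))
```

The difference is small when batches are nearly equal in size, but it can be visible on folds as small as the ones used here.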

In [None]:
# Given train and validation datasets, returns 2 iterators.
def get_iterators(train_data, valid_data):
  return data.BucketIterator.splits(
    (train_data, valid_data),
    batch_size = BATCH_SIZE,
    # Below are needed to overcome error when calling evaluate():
    # TypeError: '<' not supported between instances of 'Example' and 'Example'
    sort_key = lambda x: len(x.text),
    sort_within_batch = False,
  )

## The cell where it actually trains!

In [None]:
best_valid_loss = float('inf')

# The main training loop
# TODO: add some sort of weights-saving, either periodically or at the end.
# This way we can save our trained model and use it easily for downstream
# analysis without having to re-train.
# TODO: add some sort of timing info / progress bar.
# TODO: re-initialize the model and optimizer for each fold; as written, the
# same weights carry over from one fold to the next.
def launch_experiment(train_data_df):
  # This best_valid_loss is local to the function; the module-level variable
  # of the same name is never updated by it.
  best_valid_loss = float('inf') 

  kf = KFold(n_splits=5)
  for train_index, valid_index in kf.split(train_data_df):
    train_data = data.Dataset(train_exs_arr[train_index], all_fields)
    valid_data = data.Dataset(train_exs_arr[valid_index], all_fields)

    train_iterator, valid_iterator = get_iterators(train_data, valid_data)

    for epoch in range(N_EPOCHS):
      train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
      valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

      if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
      
      # Report the metrics for this epoch.
      print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
      print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

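launch_experiment(train_exs_arr)
# Note: this prints the module-level best_valid_loss, which is still inf,
# because launch_experiment only updates its own local copy.
print(best_valid_loss)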

	Train Loss: 0.704 | Train Acc: 50.62%
	 Val. Loss: 0.692 |  Val. Acc: 52.08%
	Train Loss: 0.701 | Train Acc: 46.88%
	 Val. Loss: 0.692 |  Val. Acc: 52.08%
	Train Loss: 0.685 | Train Acc: 55.00%
	 Val. Loss: 0.688 |  Val. Acc: 56.25%
	Train Loss: 0.704 | Train Acc: 43.75%
	 Val. Loss: 0.688 |  Val. Acc: 56.25%
	Train Loss: 0.696 | Train Acc: 45.62%
	 Val. Loss: 0.685 |  Val. Acc: 60.42%
	Train Loss: 0.680 | Train Acc: 59.38%
	 Val. Loss: 0.686 |  Val. Acc: 60.42%
	Train Loss: 0.695 | Train Acc: 53.75%
	 Val. Loss: 0.683 |  Val. Acc: 66.67%
	Train Loss: 0.721 | Train Acc: 45.62%
	 Val. Loss: 0.684 |  Val. Acc: 66.67%
	Train Loss: 0.687 | Train Acc: 56.25%
	 Val. Loss: 0.694 |  Val. Acc: 50.00%
	Train Loss: 0.703 | Train Acc: 50.00%
	 Val. Loss: 0.694 |  Val. Acc: 50.00%
inf
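
The final `inf` is a scoping artifact: the function tracks the best loss in a local variable and never hands it back. One way to fix this, sketched below under simplifying assumptions, is to return the value and assign it at the call site. `run_folds` is a hypothetical stand-in for `launch_experiment` that takes already-computed validation losses instead of training anything; the numbers are the validation losses from the log above, to three decimals.

```python
def run_folds(fold_losses):
  """fold_losses: one list of per-epoch validation losses per fold."""
  best_valid_loss = float("inf")
  for epoch_losses in fold_losses:
    for valid_loss in epoch_losses:
      if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
  # Returning the value makes it visible to the caller.
  return best_valid_loss

# Validation losses copied from the training log above (5 folds x 2 epochs).
logged = [[0.692, 0.692], [0.688, 0.688], [0.685, 0.686],
          [0.683, 0.684], [0.694, 0.694]]
best = run_folds(logged)
print(best)
```

In the notebook the equivalent change would be a `return best_valid_loss` at the end of `launch_experiment` and `best_valid_loss = launch_experiment(train_exs_arr)` at the call site.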


# Test the trained model on held-out dataset.

In [None]:
# Get a test iterator
test_iterator = data.BucketIterator(
  test_dataset,
  batch_size = BATCH_SIZE,
  # Below are needed to overcome error when calling evaluate():
  # TypeError: '<' not supported between instances of 'Example' and 'Example'
  sort_key = lambda x: len(x.text),
  sort_within_batch = False,
)

In [None]:
# Accuracy is about chance right now.
test_loss, test_acc = evaluate(model, test_iterator, criterion)
print(test_loss)
print(test_acc)

0.6918129069464547
0.5267857142857143
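
To judge whether an accuracy is "about chance", it helps to compare it with the accuracy of always predicting the most frequent class, which is 50% only when the classes are perfectly balanced. A minimal sketch follows; the label list is hypothetical, and in the notebook it would be the `label` column of `test_df`.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
  """Accuracy obtained by always predicting the most frequent label."""
  counts = Counter(labels)
  most_common_count = counts.most_common(1)[0][1]
  return most_common_count / len(labels)

# Hypothetical test labels: 6 negatives and 4 positives.
example_labels = [0] * 6 + [1] * 4
baseline = majority_baseline_accuracy(example_labels)
print(baseline)
```

A model is only doing useful work once its test accuracy clearly exceeds this baseline.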
