This notebook is based on [this reference implementation](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/6%20-%20Transformers%20for%20Sentiment%20Analysis.ipynb).

Other refs for model:
* https://stackoverflow.com/questions/65205582/how-can-i-add-a-bi-lstm-layer-on-top-of-bert-model
* https://discuss.pytorch.org/t/how-to-connect-hook-two-or-even-more-models-together/21033
* https://pytorch.org/tutorials/beginner/transformer_tutorial.html
* https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html

Other refs for torchtext:
* https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-i-5da6f1c89d84
* https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-ii-f146c8b9a496
* http://anie.me/On-Torchtext/

# Imports and setup

In [72]:
import pandas as pd
import numpy as np
import os
import random
random.seed(1)
import re
import time

# Data processing.
import dataset # dataset.py
import torch
from torchtext.legacy import data 

# Model.
import models # models.py
import torch.nn as nn
import torch.optim as optim
from transformers import DistilBertModel, DistilBertTokenizer

# Training.
import training # training.py
from sklearn.model_selection import KFold
import utils # utils.py

# Visualization.
import matplotlib.pyplot as plt

# If a code change in one of the local modules isn't picked up by
# the Jupyter notebook, try reloading the module like this
# (importlib replaces the deprecated imp module):
# import importlib
# importlib.reload(training)

# Load a BERT tokenizer

In [73]:
WEIGHTS_NAME = 'distilbert-base-uncased'

tokenizer = DistilBertTokenizer.from_pretrained(WEIGHTS_NAME)

# Read the data

In [74]:
data_df = dataset.get_multiple_datasets([1,2,3], 'Creativity_Combined', shuffle=True)

In [75]:
# For prototype purposes:
# split into train, test sets. (Train set will be further split into 
# train+validation sets, via k-fold CV.)
train_df = data_df[:1000]
test_df = data_df[1000:] # roughly 190 test examples set aside

# write them to CSV files
train_df.to_csv('ktrain.csv', index=False, header=False)
test_df.to_csv('ktest.csv', index=False, header=False)

## Preprocessing and transform into torchtext Dataset format.

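As far as I understand, `data.Field()` only declares the preprocessing. The `tokenize` and `preprocessing` functions run when the examples are loaded, and the start/end tokens and padding are added when batches are built.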
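
To make those steps concrete, here is a minimal pure-Python sketch of the pipeline that the text field below chains together: tokenize, truncate, map to ids, wrap in start/end ids, and pad. The whitespace tokenizer, `TOY_VOCAB`, and `MAX_LEN` are hypothetical stand-ins, not the DistilBERT tokenizer or torchtext itself.

```python
# Toy stand-ins (hypothetical): a real run would use the DistilBERT tokenizer.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "a": 4, "folding": 5, "chair": 6}
MAX_LEN = 6  # stands in for the model's 512-token limit


def tokenize_and_cut(sentence, max_len=MAX_LEN):
    """Tokenize on whitespace and leave room for [CLS] and [SEP]."""
    tokens = sentence.replace("/", "").lower().split()
    return tokens[:max_len - 2]


def to_ids(tokens):
    """Map tokens to ids, falling back to the [UNK] id."""
    return [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]


def pad_batch(batch_of_ids):
    """Wrap each example in [CLS] ... [SEP], then right-pad to equal length."""
    wrapped = [[TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]
               for ids in batch_of_ids]
    width = max(len(ids) for ids in wrapped)
    return [ids + [TOY_VOCAB["[PAD]"]] * (width - len(ids)) for ids in wrapped]


# Per-example steps first, then the per-batch step.
examples = [to_ids(tokenize_and_cut(s))
            for s in ["a folding chair", "a very long folding chair idea"]]
batch = pad_batch(examples)
```

The second sentence is cut to four tokens before the two special ids are added, so no row exceeds `MAX_LEN`.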

In [76]:
INIT_TOKEN_IDX = tokenizer.cls_token_id
EOS_TOKEN_IDX = tokenizer.sep_token_id
PAD_TOKEN_IDX = tokenizer.pad_token_id
UNK_TOKEN_IDX = tokenizer.unk_token_id

# BERT input can be at most 512 tokens (not words), including [CLS] and [SEP]
MAX_INPUT_LENGTH = tokenizer.max_model_input_sizes[WEIGHTS_NAME]

# Apply tokenization and some preprocessing steps to the input sentence.
# Namely, this trims examples down to MAX_INPUT_LENGTH. (There is a -2 
# since the [CLS] and [SEP] tokens will be added)
def tokenize_and_cut(sentence):
  sentence = sentence.replace('/', '') # remove slashes
  tokens = tokenizer.tokenize(sentence) 
  tokens = tokens[:MAX_INPUT_LENGTH-2]
  return tokens

# text_fields defines preprocessing and handling of the text of an example.
text_fields = data.Field(batch_first = True,
                  use_vocab = False,
                  tokenize = tokenize_and_cut,
                  preprocessing = tokenizer.convert_tokens_to_ids,
                  init_token = INIT_TOKEN_IDX, # add [CLS] token
                  eos_token = EOS_TOKEN_IDX, # add [SEP] token
                  pad_token = PAD_TOKEN_IDX,
                  unk_token = UNK_TOKEN_IDX)

# label_fields defines how to handle the label of an example.
# for regression, we do not need to build a vocabulary.
label_fields = data.LabelField(sequential=False, use_vocab=False, dtype = torch.float)
all_fields = [('text', text_fields), ('label', label_fields)] # must match order of cols in csv

train_dataset, test_dataset = data.TabularDataset.splits(
  path='', # path='' because the csvs are in the same directory
  train='ktrain.csv', test='ktest.csv', format='csv',
  fields=all_fields  
)

In [77]:
print(data_df.head(1))

                                                  text  label
162  Passive Weighted Toning Child Attachments.  Li...   3.75


In [78]:
# Transform train_dataset into an np array representation.
# This will be used for generating the K folds.
train_exs_arr = np.array(train_dataset.examples)
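
The array form matters because each fold is later selected by a list of indices. The sketch below mimics, in plain Python, how an unshuffled K-fold split partitions indices into contiguous validation blocks. `kfold_indices` is a hypothetical helper written for illustration, not scikit-learn's `KFold`, although it follows the same contiguous-block scheme as I understand `KFold` to use when shuffling is off.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) lists using contiguous validation blocks."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample.
        size = base + (1 if fold < extra else 0)
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size


# With 1000 training examples and 5 folds, each fold trains on 800
# examples and validates on the remaining 200.
folds = list(kfold_indices(1000, 5))
fold_sizes = [(len(train), len(valid)) for train, valid in folds]
```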

# Training pipeline begins here


## Define parameters

In [89]:
# Model parameters
OUTPUT_DIM = 1
DROPOUT = 0.2 # DistilBERT's default dropout for the sequence classification head

# Training parameters
N_SPLITS = 5 # For cross validation
BATCH_SIZE = 1 # The Hugging Face Trainer default is 8 per device (see the notes at the end)
N_EPOCHS = 3 # This is the Hugging Face Trainer default

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
criterion = nn.MSELoss(reduction='sum') # replaces the deprecated size_average=False
criterion = criterion.to(device)
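
The criterion above sums the squared errors over a batch instead of averaging them (`size_average=False` in older PyTorch, `reduction='sum'` in newer versions). A small plain-Python sketch of the difference, using a few prediction/label pairs copied from the test output further down; the helper names are made up for illustration.

```python
def squared_errors(predictions, labels):
    """Element-wise squared differences."""
    return [(p - y) ** 2 for p, y in zip(predictions, labels)]


def mse_sum(predictions, labels):
    """Summed squared error: grows with the number of examples in the batch."""
    return sum(squared_errors(predictions, labels))


def mse_mean(predictions, labels):
    """Mean squared error: normalized by the number of examples."""
    errors = squared_errors(predictions, labels)
    return sum(errors) / len(errors)


predictions = [4.4527, 4.6467, 4.6444]
labels = [4.8500, 4.0250, 4.7750]
summed = mse_sum(predictions, labels)
averaged = mse_mean(predictions, labels)
```

With a batch size of 1 and a single output the two values coincide. With larger batches the summed loss is the mean multiplied by the batch size, which also scales the gradients.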

In [127]:
# For testing purposes, use a randomly generated tensor with 3 feature columns
ADDED_FEATURES = torch.rand(BATCH_SIZE, 3)
ADDED_DIM = ADDED_FEATURES.shape[1]
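
For orientation, here is a minimal sketch of what concatenating hand-picked features with a sentence representation can look like. `ConcatRegressionHead` is a hypothetical module, not the `models.BERTLinear` class from `models.py` (which is not shown here). The 768-dimensional input stands in for DistilBERT's hidden size, and random tensors replace real encoder output.

```python
import torch
import torch.nn as nn


class ConcatRegressionHead(nn.Module):
    """Hypothetical head: dropout + linear layer over [encoding ; extra features]."""

    def __init__(self, hidden_dim, added_dim, output_dim, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        # The linear layer sees the encoder dimension plus the extra features.
        self.out = nn.Linear(hidden_dim + added_dim, output_dim)

    def forward(self, encoding, added_features):
        # encoding: [batch, hidden_dim]; added_features: [batch, added_dim]
        combined = torch.cat([encoding, added_features], dim=1)
        return self.out(self.dropout(combined))


batch_size, hidden_dim, added_dim = 4, 768, 3
head = ConcatRegressionHead(hidden_dim, added_dim, output_dim=1, dropout=0.2)
head.eval()  # disable dropout for a deterministic forward pass

encoding = torch.rand(batch_size, hidden_dim)       # stand-in for the encoder output
added_features = torch.rand(batch_size, added_dim)  # stand-in for hand-picked features
predictions = head(encoding, added_features)
```

Both inputs must share the same batch dimension, so a fixed tensor of shape `(BATCH_SIZE, 3)` only lines up when every batch has exactly `BATCH_SIZE` rows.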

## The cell where it actually trains!

In [128]:
# The main training loop
def launch_experiment(train_data_df, n_splits=N_SPLITS):
  # valid_corrs = np.empty(n_splits)
  best_valid_loss = float('inf') 
  
  kf = KFold(n_splits=n_splits)
  fold = 0
  for train_index, valid_index in kf.split(train_data_df):
    print('training on fold {}'.format(fold))
    train_data = data.Dataset(train_exs_arr[train_index], all_fields)
    valid_data = data.Dataset(train_exs_arr[valid_index], all_fields)
    
    # Initialize a new model each fold.
    # https://ai.stackexchange.com/questions/18221/deep-learning-with-kfold-cross-validation-with-epochs
    # https://stats.stackexchange.com/questions/358380/neural-networks-epochs-with-10-fold-cross-validation-doing-something-wrong
    bert = DistilBertModel.from_pretrained(WEIGHTS_NAME)
    model = models.BERTLinear(bert,
                              OUTPUT_DIM,
                              DROPOUT,
                              # BERTLinear was modified so that the dimension of the additional
                              # features is added to the BERT hidden dimension (768).
                              added_dim = ADDED_DIM)
    # optimizer = optim.Adam(model.parameters())
    optimizer = optim.Adam(model.parameters(),lr=5e-05,betas=(0.9, 0.999),eps=1e-08) # lr is not Adam default
    model = model.to(device)

    train_iterator, valid_iterator = training.get_iterators(train_data, valid_data, BATCH_SIZE, device)

    for epoch in range(N_EPOCHS):
      start_time = time.time()
      # The training and model functions were modified so that hand-picked
      # features can be concatenated with the BERT representations.
      train_loss, train_corr = training.train(model, train_iterator, optimizer, criterion, added_features = ADDED_FEATURES)
      valid_loss, valid_corr = training.evaluate(model, valid_iterator, criterion, added_features = ADDED_FEATURES)
      # valid_corrs[fold] will end up holding the valid_corr from the last epoch for this fold
      # is this what we want?
      # valid_corrs[fold] = valid_corr
      end_time = time.time()
      epoch_mins, epoch_secs = utils.epoch_time(start_time, end_time)

      if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        # Save the weights of the model with the best valid loss
        print('updating saved weights of best model')
        torch.save(model.state_dict(), "best_valid_loss.pt")

      print(f'Epoch: {epoch:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
      print(f'\t Train Loss: {train_loss:.3f} | Train Corr: {train_corr:.2f}')
      print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Corr: {valid_corr:.2f}')
    
    fold += 1
    # return best_valid_loss # TODO REMOVE -- JUST TRAIN ONE FOLD
  
  return best_valid_loss # , valid_corrs
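
`utils.epoch_time` lives in `utils.py`, which is not shown in this notebook. A plausible sketch, assuming it follows the helper of the same name in the reference implementation linked at the top, converts an elapsed interval into whole minutes and seconds:

```python
import time


def epoch_time(start_time, end_time):
    """Split an elapsed interval (in seconds) into whole minutes and seconds."""
    elapsed = end_time - start_time
    elapsed_mins = int(elapsed // 60)
    elapsed_secs = int(elapsed - elapsed_mins * 60)
    return elapsed_mins, elapsed_secs


start = time.time()
end = start + 634.2  # pretend an epoch took a little over ten minutes
mins, secs = epoch_time(start, end)
```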

In [None]:
import importlib # the imp module is deprecated
importlib.reload(models)
importlib.reload(training)

best_valid_loss = launch_experiment(train_exs_arr)
print('best validation loss: {}'.format(best_valid_loss))
# print('validation correlations for each fold: {}'.format(valid_corrs))
# print('average of validation correlations: {}'.format(np.mean(valid_corrs)))

training on fold 0
updating saved weights of best model
Epoch: 00 | Epoch Time: 10m 34s
	 Train Loss: 0.610 | Train Corr: 0.38
	 Val. Loss: 0.372 |  Val. Corr: 0.63
Epoch: 01 | Epoch Time: 10m 30s
	 Train Loss: 0.320 | Train Corr: 0.68
	 Val. Loss: 0.593 |  Val. Corr: 0.64
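
The `Corr` columns are computed inside `training.py`, which is not shown. Assuming they report the Pearson correlation between predictions and labels (an assumption, not something the notebook states), the quantity can be sketched in plain Python as follows. `pearson_corr` is an illustrative helper.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Correlation ignores scale and offset: predictions that are a positive
# linear function of the labels correlate perfectly even if the loss is large.
labels = [2.0, 3.5, 4.0, 5.25]
shifted = [0.1 * y + 4.0 for y in labels]
corr = pearson_corr(shifted, labels)
```

This is why the loss and the correlation can move in different directions between epochs, as they do in the log above.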


# Test the trained model on held-out dataset.

In [82]:
# Get a test iterator
test_iterator = training.get_iterator(test_dataset, BATCH_SIZE, device)

In [88]:
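# Load the best saved model. The architecture must match the one that was
# saved, so added_dim and added_features are passed as they were during training.
bert = DistilBertModel.from_pretrained(WEIGHTS_NAME)
model = models.BERTLinear(bert, OUTPUT_DIM, DROPOUT, added_dim = ADDED_DIM)
model.load_state_dict(torch.load("best_valid_loss.pt", map_location=device))
model.to(device)
model.eval()
test_loss, test_corr = training.evaluate(model, test_iterator, criterion, added_features = ADDED_FEATURES, debug=True)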
print(test_loss)
print(test_corr)

predictions: tensor([4.4527, 4.6467, 4.6444, 4.6318, 4.6442, 4.6132, 4.5274, 4.6450])
true labels: tensor([4.8500, 4.0250, 4.7750, 4.8000, 4.3750, 5.2500, 4.4250, 2.1500])
predictions: tensor([4.6422, 4.6388, 4.5168, 4.4133, 4.6303, 4.5219, 4.6410, 4.6405])
true labels: tensor([4.2000, 3.5500, 3.8500, 3.7000, 3.5000, 5.7250, 4.2500, 5.0750])
predictions: tensor([4.6390, 4.6216, 4.6093, 4.4785, 4.6441, 4.6428, 4.5734, 4.4639])
true labels: tensor([4.5750, 4.3750, 5.5000, 4.8750, 4.5500, 4.7750, 4.9000, 2.7250])
predictions: tensor([4.6457, 4.6457, 4.6436, 4.6454, 4.6468, 4.6440, 4.6466, 4.4695])
true labels: tensor([3.8500, 3.5500, 3.8750, 3.4750, 4.2250, 5.0000, 3.9500, 4.9500])
predictions: tensor([4.6447, 4.6475, 4.6482, 4.6476, 4.6469, 4.6471, 4.6409, 4.6480])
true labels: tensor([3.5000, 4.0500, 4.8000, 4.6750, 4.5750, 3.2750, 5.2750, 4.2000])


KeyboardInterrupt: 

# Misc other stuff

Link to the Trainer class: https://huggingface.co/transformers/main_classes/trainer.html

Default training arguments: https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
* Batch size per device: 8
* Epochs: 3



This should be the model I used to generate my initial results: https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforsequenceclassification
"DistilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks."