This notebook is based on [this reference implementation](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/6%20-%20Transformers%20for%20Sentiment%20Analysis.ipynb).

Other references for the model:
* https://stackoverflow.com/questions/65205582/how-can-i-add-a-bi-lstm-layer-on-top-of-bert-model
* https://discuss.pytorch.org/t/how-to-connect-hook-two-or-even-more-models-together/21033
* https://pytorch.org/tutorials/beginner/transformer_tutorial.html
* https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html

Other references for torchtext:
* https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-i-5da6f1c89d84
* https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-ii-f146c8b9a496
* http://anie.me/On-Torchtext/

# Imports and setup

In [2]:
import pandas as pd
import numpy as np
import os
import random
# random.seed(1)
import re

# Data processing.
import constants # constants.py
import dataset # dataset.py
import torch

# Model.
import models # models.py
import torch.nn as nn
from transformers import DistilBertModel

# Training.
import training # training.py
import utils # utils.py

# If you make a code change that isn't picked up by the
# Jupyter notebook, try reloading the module like below:
# import importlib
# importlib.reload(training)

# Read the data

In [3]:
# data_df = dataset.get_multiple_datasets([1,2,3], 'Creativity_Combined', shuffle=True)

In [4]:
'''The code in this cell is commented out because the CSVs should already exist in the directory.
If you are running the notebook for the first time, uncomment and run it (along with the cell above) to generate the CSVs.'''
# Split into train and test sets. (The train set will be further split into
# train and validation sets via k-fold CV.)
# train_df = data_df[:1000]
# test_df = data_df[1000:] # roughly 203 test examples set aside

# write them to CSV files
# train_df.to_csv('ktrain.csv', index=False, header=False)
# test_df.to_csv('ktest.csv', index=False, header=False)

## Preprocess and transform into torchtext Dataset format

From what I understand, some preprocessing (e.g. tokenization) happens when the torchtext `data.Field()` is applied.

In [3]:
train_dataset, test_dataset = dataset.get_train_test_datasets()

In [4]:
# Transform train_dataset into an np array representation.
# This will be used for generating the K folds.
train_exs_arr = np.array(train_dataset.examples)
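
To illustrate how an array of examples could be split into folds, here is a minimal sketch in plain Python. The helper `k_fold_indices` and the choice of `k=5` are assumptions made for illustration; the actual splitting happens inside `training.py`, which is not shown in this notebook. The value 1000 matches the size of the training slice used when the CSVs were generated.

```python
import random


def k_fold_indices(n_examples, k, seed=0):
    """Yield (train_idx, val_idx) index lists for each of k folds.

    Hypothetical helper: not part of training.py.
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# k=5 is an assumed value; 1000 is the number of training examples.
folds = list(k_fold_indices(n_examples=1000, k=5))
for fold_id, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {fold_id}: {len(train_idx)} train, {len(val_idx)} validation")
```

With a NumPy array of examples, an index list can be used directly, as in `train_exs_arr[train_idx]`, which is presumably why the examples are converted to an array here.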

# Training pipeline begins here


In [13]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

param_grid = {
    'dropout': [0.2],
    'batch_size': [8],
    'max_epochs': [10],
    'lr': [5e-05],
    'hidden_dim': [256],
    'num_layers': [1],
    'bidirectional': [False],
}

results, best_model = training.perform_hyperparameter_search(param_grid, train_exs_arr, rnn=True, save_weights=True)
print(best_model)

eid 0, params {'batch_size': 8, 'bidirectional': False, 'dropout': 0.2, 'hidden_dim': 256, 'lr': 5e-05, 'max_epochs': 10, 'num_layers': 1}
training on fold 0


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


updating saved weights of best model
Epoch: 00 | Epoch Time: 0m 55s
	 Train Loss: 24.393 | Train Corr: -0.06
	 Val. Loss: 4.159 |  Val. Corr: 0.46
Epoch: 01 | Epoch Time: 0m 55s
	 Train Loss: 3.902 | Train Corr: 0.44
	 Val. Loss: 4.529 |  Val. Corr: 0.38
Epoch: 02 | Epoch Time: 0m 54s
	 Train Loss: 2.947 | Train Corr: 0.61
	 Val. Loss: 4.211 |  Val. Corr: 0.50
updating saved weights of best model
Epoch: 03 | Epoch Time: 0m 55s
	 Train Loss: 2.226 | Train Corr: 0.73
	 Val. Loss: 3.978 |  Val. Corr: 0.55
updating saved weights of best model
Epoch: 04 | Epoch Time: 0m 55s
	 Train Loss: 1.764 | Train Corr: 0.79
	 Val. Loss: 3.394 |  Val. Corr: 0.59
updating saved weights of best model
Epoch: 05 | Epoch Time: 0m 55s
	 Train Loss: 1.451 | Train Corr: 0.83
	 Val. Loss: 3.183 |  Val. Corr: 0.57
updating saved weights of best model
Epoch: 06 | Epoch Time: 0m 55s
	 Train Loss: 1.212 | Train Corr: 0.86
	 Val. Loss: 2.965 |  Val. Corr: 0.60
training on fold 1


Epoch: 00 | Epoch Time: 0m 56s
	 Train Loss: 27.073 | Train Corr: 0.00
	 Val. Loss: 4.820 |  Val. Corr: 0.22
Epoch: 01 | Epoch Time: 0m 56s
	 Train Loss: 4.040 | Train Corr: 0.39
	 Val. Loss: 5.187 |  Val. Corr: 0.59
training on fold 2


KeyboardInterrupt: 
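
For reference, a grid like `param_grid` can be expanded into one experiment per parameter combination with a few lines of standard-library Python. This is only a sketch: `expand_grid` is a hypothetical helper, and sorting the keys is an assumption based on the alphabetical key order in the `eid 0, params {...}` log line above. How `training.perform_hyperparameter_search` actually enumerates the grid is not shown in this notebook.

```python
from itertools import product

param_grid = {
    'dropout': [0.2],
    'batch_size': [8],
    'max_epochs': [10],
    'lr': [5e-05],
    'hidden_dim': [256],
    'num_layers': [1],
    'bidirectional': [False],
}


def expand_grid(grid):
    """Yield one dict per combination of values (hypothetical helper)."""
    keys = sorted(grid)  # assumed ordering, chosen to mirror the log output
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


combos = list(expand_grid(param_grid))
for eid, params in enumerate(combos):
    print(f"eid {eid}, params {params}")
```

Because every list in the grid above holds a single value, the search runs exactly one experiment. Adding a second value to any list doubles the number of experiments, and each experiment is trained once per fold.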

# Test the trained model on held-out dataset.

In [17]:
# Get a test iterator.
# Use the batch size from the best params!
test_iterator = training.get_iterator(test_dataset, 8, device)

In [19]:
# Load the best saved model.
bert = DistilBertModel.from_pretrained(constants.WEIGHTS_NAME)
# Use the params from the best model (here, the values from param_grid above).
model = models.BERTRNN(bert, constants.OUTPUT_DIM, 256, 1, False, 0.2)
# map_location lets weights saved on a GPU load on a CPU-only machine.
model.load_state_dict(torch.load("0_best_valid_loss.pt", map_location=device))
model.to(device)
model.eval()
# If you change the criterion, make sure it matches the training criterion in training.py.
# reduction='sum' sums the squared errors over the batch instead of averaging them.
criterion = nn.MSELoss(reduction='sum')
criterion = criterion.to(device)
test_loss, test_corr = training.evaluate(model, test_iterator, criterion, debug=True)
print(test_loss)
print(test_corr)

predictions: tensor([4.0583, 3.9499, 4.5969, 3.8813, 3.8024, 3.2636, 3.7395, 3.0295],
       device='cuda:0')
true labels: tensor([4.7500, 5.0250, 4.7000, 3.8000, 4.7000, 3.7000, 4.4000, 3.3250],
       device='cuda:0')
predictions: tensor([3.5669, 3.8882, 3.9488, 3.4002, 4.1824, 4.7027, 3.5549, 4.5723],
       device='cuda:0')
true labels: tensor([2.7250, 3.7250, 4.1000, 3.1500, 3.8750, 5.2750, 3.5500, 5.5250],
       device='cuda:0')
predictions: tensor([4.6455, 3.5152, 4.0440, 4.4155, 4.6087, 2.6608, 3.5462, 3.5407],
       device='cuda:0')
true labels: tensor([5.1500, 2.9000, 3.8750, 4.1500, 5.0500, 2.7500, 4.8250, 3.9500],
       device='cuda:0')
predictions: tensor([3.2251, 4.1936, 4.4317, 3.8372, 3.5440, 3.3435, 3.9743, 3.9061],
       device='cuda:0')
true labels: tensor([4.4500, 4.8750, 4.6000, 3.9500, 2.7750, 4.0750, 4.0500, 4.3250],
       device='cuda:0')
predictions: tensor([4.2032, 3.6779, 4.3451, 2.8282, 4.0200, 3.4766, 3.7172, 3.9679],
       device='cuda:0')
true label
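
As a sanity check on output like the above, the two reported quantities can be recomputed by hand for a single batch. The sketch below assumes that "Corr" means the Pearson correlation and that the loss is the summed squared error (as with a sum-reduced MSE criterion). Both helper functions are hypothetical, and `training.evaluate` may aggregate across batches differently. The numbers are the first batch of predictions and true labels printed above.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def summed_squared_error(preds, labels):
    """Sum of squared differences (what a sum-reduced MSE loss computes)."""
    return sum((p - t) ** 2 for p, t in zip(preds, labels))


# First batch from the debug output above.
preds = [4.0583, 3.9499, 4.5969, 3.8813, 3.8024, 3.2636, 3.7395, 3.0295]
labels = [4.7500, 5.0250, 4.7000, 3.8000, 4.7000, 3.7000, 4.4000, 3.3250]

corr = pearson_corr(preds, labels)
sse = summed_squared_error(preds, labels)
print(f"batch corr: {corr:.2f}")
print(f"batch summed squared error: {sse:.3f}")
```

A single batch of 8 examples gives a noisy estimate, so the per-batch correlation should not be expected to match the value computed over the whole test set.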

# Misc other stuff

Link to the trainer class: https://huggingface.co/transformers/main_classes/trainer.html



Default training arguments: https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments

Batch size per device: 8

Epochs: 3



This should be the model I used to generate my initial results: https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforsequenceclassification
"DistilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks."