This notebook is based on [this reference implementation](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/6%20-%20Transformers%20for%20Sentiment%20Analysis.ipynb).

Other refs for model:
* https://stackoverflow.com/questions/65205582/how-can-i-add-a-bi-lstm-layer-on-top-of-bert-model
* https://discuss.pytorch.org/t/how-to-connect-hook-two-or-even-more-models-together/21033
* https://pytorch.org/tutorials/beginner/transformer_tutorial.html
* https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html

Other refs for torchtext:
* https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-i-5da6f1c89d84
* https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-ii-f146c8b9a496
* http://anie.me/On-Torchtext/

# Imports and setup

In [1]:
import pandas as pd
import numpy as np
import os
import random
# random.seed(1)
import re

# Data processing.
import constants # constants.py
import dataset # dataset.py
import torch

# Model.
import models # models.py
import torch.nn as nn
from transformers import DistilBertModel

# Training.
import training # training.py
import utils # utils.py

# If a code change in one of the local modules isn't picked up by
# the Jupyter notebook, try reloading the module like below
# (importlib replaces the deprecated imp module):
# import importlib
# importlib.reload(training)

# Read the data

In [3]:
# data_df = dataset.get_multiple_datasets([1,2,3], 'Creativity_Combined', shuffle=True)

In [4]:
'''This cell is commented out because the CSVs should already exist in the directory.
If you are running the notebook for the first time, uncomment this cell and the one above and run them to generate the CSVs.'''
# split into train, test sets. (Train set will be further split into 
# train+validation sets, via k-fold CV.)
# train_df = data_df[:1000]
# test_df = data_df[1000:] # roughly 203 test examples set aside

# write them to CSV files
# train_df.to_csv('ktrain.csv', index=False, header=False)
# test_df.to_csv('ktest.csv', index=False, header=False)

## Preprocess and transform into torchtext Dataset format

As far as I understand, torchtext applies some preprocessing (such as tokenization) when the `data.Field()` definitions are used to build the dataset.

In [2]:
train_dataset, test_dataset = dataset.get_train_test_datasets()

In [3]:
# Transform train_dataset into an np array representation.
# This will be used for generating the K folds.
train_exs_arr = np.array(train_dataset.examples)
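
The fold generation itself lives in `training.py`, which is not shown in this notebook. As a rough sketch of the idea, assuming five folds (the run below trains on folds 0 to 4) and the 1000 training examples set aside earlier, the indices could be partitioned as follows. The helper name `k_fold_indices` and the fixed seed are hypothetical, chosen only for illustration.

```python
import random


def k_fold_indices(n_examples, k=5, seed=1):
    """Yield (train_idx, val_idx) index lists for each of k folds.

    Hypothetical helper for illustration; the real logic is in training.py.
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(1000, k=5))
```

Because `train_exs_arr` is a NumPy array, each index list can be used directly for indexing, e.g. `train_exs_arr[train_idx]`.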

# Training pipeline begins here


In [22]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

param_grid = {
    'dropout': [0.2],
    'batch_size': [8],
    'max_epochs': [10],
    'lr': [3e-05],
    'hidden_dim': [768],
    'num_layers': [1],
    'bidirectional': [True],
}

results, best_model = training.perform_hyperparameter_search(param_grid, train_exs_arr, rnn=True, save_weights=True)
print(best_model)

eid 0, params {'batch_size': 8, 'bidirectional': True, 'dropout': 0.2, 'hidden_dim': 768, 'lr': 3e-05, 'max_epochs': 10, 'num_layers': 1}
training on fold 0


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


updating saved weights of best model
Epoch: 00 | Epoch Time: 1m 7s
	 Train Loss: 11.542 | Train Corr: 0.19
	 Val. Loss: 4.599 |  Val. Corr: 0.11
Epoch: 01 | Epoch Time: 1m 8s
	 Train Loss: 2.988 | Train Corr: 0.60
	 Val. Loss: 4.689 |  Val. Corr: 0.45
updating saved weights of best model
Epoch: 02 | Epoch Time: 1m 7s
	 Train Loss: 2.479 | Train Corr: 0.69
	 Val. Loss: 3.765 |  Val. Corr: 0.57
updating saved weights of best model
Epoch: 03 | Epoch Time: 1m 7s
	 Train Loss: 1.594 | Train Corr: 0.81
	 Val. Loss: 2.987 |  Val. Corr: 0.59
updating saved weights of best model
Epoch: 04 | Epoch Time: 1m 7s
	 Train Loss: 0.995 | Train Corr: 0.89
	 Val. Loss: 2.725 |  Val. Corr: 0.62
training on fold 1




Epoch: 00 | Epoch Time: 1m 8s
	 Train Loss: 11.462 | Train Corr: 0.22
	 Val. Loss: 4.985 |  Val. Corr: 0.62
Epoch: 01 | Epoch Time: 1m 8s
	 Train Loss: 2.938 | Train Corr: 0.60
	 Val. Loss: 5.075 |  Val. Corr: 0.72
Epoch: 02 | Epoch Time: 1m 8s
	 Train Loss: 2.245 | Train Corr: 0.71
	 Val. Loss: 3.692 |  Val. Corr: 0.71
updating saved weights of best model
Epoch: 03 | Epoch Time: 1m 8s
	 Train Loss: 1.755 | Train Corr: 0.78
	 Val. Loss: 2.430 |  Val. Corr: 0.73
Epoch: 04 | Epoch Time: 1m 8s
	 Train Loss: 1.188 | Train Corr: 0.86
	 Val. Loss: 3.335 |  Val. Corr: 0.73
Epoch: 05 | Epoch Time: 1m 9s
	 Train Loss: 0.865 | Train Corr: 0.90
	 Val. Loss: 2.704 |  Val. Corr: 0.73
Epoch: 06 | Epoch Time: 1m 9s
	 Train Loss: 0.625 | Train Corr: 0.93
	 Val. Loss: 2.594 |  Val. Corr: 0.71
updating saved weights of best model
Epoch: 07 | Epoch Time: 1m 7s
	 Train Loss: 0.449 | Train Corr: 0.95
	 Val. Loss: 2.425 |  Val. Corr: 0.72
updating saved weights of best model
training on fold 2




Epoch: 00 | Epoch Time: 1m 7s
	 Train Loss: 10.870 | Train Corr: 0.23
	 Val. Loss: 3.114 |  Val. Corr: 0.62
updating saved weights of best model
Epoch: 01 | Epoch Time: 1m 8s
	 Train Loss: 2.875 | Train Corr: 0.63
	 Val. Loss: 2.090 |  Val. Corr: 0.80
Epoch: 02 | Epoch Time: 1m 8s
	 Train Loss: 1.886 | Train Corr: 0.78
	 Val. Loss: 2.350 |  Val. Corr: 0.80
updating saved weights of best model
Epoch: 03 | Epoch Time: 1m 8s
	 Train Loss: 1.495 | Train Corr: 0.83
	 Val. Loss: 1.696 |  Val. Corr: 0.78
updating saved weights of best model
Epoch: 04 | Epoch Time: 1m 7s
	 Train Loss: 1.192 | Train Corr: 0.87
	 Val. Loss: 1.682 |  Val. Corr: 0.80
updating saved weights of best model
Epoch: 05 | Epoch Time: 1m 9s
	 Train Loss: 0.792 | Train Corr: 0.91
	 Val. Loss: 1.554 |  Val. Corr: 0.81
Epoch: 06 | Epoch Time: 1m 8s
	 Train Loss: 0.572 | Train Corr: 0.94
	 Val. Loss: 1.868 |  Val. Corr: 0.78
Epoch: 07 | Epoch Time: 1m 7s
	 Train Loss: 0.467 | Train Corr: 0.95
	 Val. Loss: 1.890 |  Val. Corr: 

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Epoch: 00 | Epoch Time: 1m 6s
	 Train Loss: 11.167 | Train Corr: 0.21
	 Val. Loss: 4.185 |  Val. Corr: 0.55
Epoch: 01 | Epoch Time: 1m 5s
	 Train Loss: 2.762 | Train Corr: 0.63
	 Val. Loss: 2.976 |  Val. Corr: 0.72
Epoch: 02 | Epoch Time: 1m 7s
	 Train Loss: 2.359 | Train Corr: 0.70
	 Val. Loss: 2.733 |  Val. Corr: 0.76
Epoch: 03 | Epoch Time: 1m 6s
	 Train Loss: 1.610 | Train Corr: 0.80
	 Val. Loss: 2.071 |  Val. Corr: 0.77
Epoch: 04 | Epoch Time: 1m 6s
	 Train Loss: 1.184 | Train Corr: 0.86
	 Val. Loss: 1.926 |  Val. Corr: 0.79
Epoch: 05 | Epoch Time: 1m 7s
	 Train Loss: 0.819 | Train Corr: 0.91
	 Val. Loss: 1.977 |  Val. Corr: 0.78
Epoch: 06 | Epoch Time: 1m 7s
	 Train Loss: 0.599 | Train Corr: 0.93
	 Val. Loss: 2.152 |  Val. Corr: 0.78
Epoch: 07 | Epoch Time: 1m 7s
	 Train Loss: 0.429 | Train Corr: 0.95
	 Val. Loss: 1.996 |  Val. Corr: 0.78
Epoch: 08 | Epoch Time: 1m 5s
	 Train Loss: 0.363 | Train Corr: 0.96
	 Val. Loss: 2.153 |  Val. Corr: 0.77
training on fold 4




Epoch: 00 | Epoch Time: 1m 3s
	 Train Loss: 11.315 | Train Corr: 0.17
	 Val. Loss: 5.510 |  Val. Corr: 0.58
Epoch: 01 | Epoch Time: 1m 4s
	 Train Loss: 2.964 | Train Corr: 0.59
	 Val. Loss: 3.521 |  Val. Corr: 0.65
Epoch: 02 | Epoch Time: 1m 3s
	 Train Loss: 2.474 | Train Corr: 0.68
	 Val. Loss: 2.910 |  Val. Corr: 0.69
Epoch: 03 | Epoch Time: 1m 3s
	 Train Loss: 1.721 | Train Corr: 0.79
	 Val. Loss: 2.673 |  Val. Corr: 0.70
Epoch: 04 | Epoch Time: 1m 2s
	 Train Loss: 1.280 | Train Corr: 0.85
	 Val. Loss: 2.542 |  Val. Corr: 0.68
Epoch: 05 | Epoch Time: 1m 4s
	 Train Loss: 0.904 | Train Corr: 0.90
	 Val. Loss: 2.408 |  Val. Corr: 0.71
Epoch: 06 | Epoch Time: 1m 2s
	 Train Loss: 0.654 | Train Corr: 0.93
	 Val. Loss: 2.431 |  Val. Corr: 0.69
Epoch: 07 | Epoch Time: 1m 4s
	 Train Loss: 0.479 | Train Corr: 0.95
	 Val. Loss: 2.543 |  Val. Corr: 0.68
Epoch: 08 | Epoch Time: 1m 2s
	 Train Loss: 0.540 | Train Corr: 0.94
	 Val. Loss: 2.520 |  Val. Corr: 0.71
('batch_size_8; bidirectional_True; 
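
The search code is in `training.py` and is not reproduced here. A minimal sketch of how a grid like `param_grid` can be expanded into individual configurations, and how a result key in the format printed by this notebook can be built from one of them, might look like this. The function names `expand_grid` and `params_to_key` are hypothetical, not the ones used in `training.py`.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per combination of the listed values, keys sorted by name."""
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))


def params_to_key(params):
    """Format one combination in the 'name_value; name_value' style of the result keys."""
    return '; '.join('{}_{}'.format(k, v) for k, v in sorted(params.items()))


param_grid = {
    'dropout': [0.2],
    'batch_size': [8],
    'max_epochs': [10],
    'lr': [3e-05],
    'hidden_dim': [768],
    'num_layers': [1],
    'bidirectional': [True],
}

combos = list(expand_grid(param_grid))
for eid, params in enumerate(combos):
    print('eid {}, params {}'.format(eid, params))
```

With a single value per hyperparameter there is exactly one combination. Each extra value in a list multiplies the number of combinations, and every combination is trained on all folds, so the run time grows accordingly.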

In [23]:
for k in results:
    total = np.sum(results[k])
    print('{}: {}->sum={}\n'.format(k, results[k], total))

batch_size_8; bidirectional_True; dropout_0.2; hidden_dim_768; lr_3e-05; max_epochs_10; num_layers_1: [0.61492086 0.71203809 0.79055422 0.79251652 0.69029136]->sum=3.600321061082088
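
The sum is enough to rank configurations that all use the same number of folds, but the mean and spread are easier to compare against a single held-out score. A small sketch using the fold values printed above, which appear to be the per-fold validation correlations:

```python
from statistics import mean, stdev

# Per-fold scores copied from the output above (rounded as printed).
fold_corrs = [0.61492086, 0.71203809, 0.79055422, 0.79251652, 0.69029136]

avg = mean(fold_corrs)
spread = stdev(fold_corrs)  # sample standard deviation across folds
print('mean={:.3f}, stdev={:.3f}'.format(avg, spread))
```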



# Test the trained model on held-out dataset.

In [6]:
# Get a test iterator.
# Use the batch size from the best hyperparameters (8 here).
test_iterator = training.get_iterator(test_dataset, 8, device)

In [25]:
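# Load the pretrained DistilBERT encoder, then the best fine-tuned weights saved during training.
bert = DistilBertModel.from_pretrained(constants.WEIGHTS_NAME)
# Use the hyperparameters of the best model here
# (hidden_dim 768, 1 layer, bidirectional, dropout 0.2).
model = models.BERTGRU(bert, constants.OUTPUT_DIM, 768, 1, True, 0.2)
# map_location lets weights saved on a GPU load on a CPU-only machine.
model.load_state_dict(torch.load("0_best_valid_loss.pt", map_location=device))
model.to(device)
model.eval()
# If you change the criterion, make sure it matches the training criterion in training.py.
# reduction='sum' is the replacement for the deprecated size_average=False.
criterion = nn.MSELoss(reduction='sum')
criterion = criterion.to(device)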
test_loss, test_corr = training.evaluate(model, test_iterator, criterion, debug=False)
print(test_loss)
print(test_corr)



2.1075654236169963
0.7259270041616388
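
For reference, here is a minimal sketch of the two kinds of numbers reported above, computed on toy data. It assumes that `Corr` is the Pearson correlation between predictions and targets, and that the loss is a summed (not averaged) squared error, as the criterion above is configured. How `training.evaluate` aggregates over batches is not shown in this notebook, so the function names below are hypothetical.

```python
import math


def sum_squared_error(preds, targets):
    """Squared error summed over all examples (no averaging)."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets))


def pearson_corr(preds, targets):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(preds)
    mean_p = sum(preds) / n
    mean_t = sum(targets) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(preds, targets))
    var_p = sum((p - mean_p) ** 2 for p in preds)
    var_t = sum((t - mean_t) ** 2 for t in targets)
    return cov / math.sqrt(var_p * var_t)


# Toy values for illustration only; not taken from the dataset.
preds = [1.0, 2.0, 3.0, 4.0]
targets = [1.5, 1.5, 3.5, 3.5]

loss = sum_squared_error(preds, targets)
corr = pearson_corr(preds, targets)
print(loss, corr)
```

Correlation ignores scale and offset, so a model can reach a high correlation while still having a large squared error. Reporting both, as this notebook does, covers that case.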


# Misc other stuff

Link to the trainer class: https://huggingface.co/transformers/main_classes/trainer.html



Default training arguments: https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments

Batch size per device: 8

Epochs: 3



This should be the model I used to generate my initial results: https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforsequenceclassification
"DistilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks."