This notebook is based on [this reference implementation](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/6%20-%20Transformers%20for%20Sentiment%20Analysis.ipynb).

Other refs for model:
* https://stackoverflow.com/questions/65205582/how-can-i-add-a-bi-lstm-layer-on-top-of-bert-model
* https://discuss.pytorch.org/t/how-to-connect-hook-two-or-even-more-models-together/21033
* https://pytorch.org/tutorials/beginner/transformer_tutorial.html
* https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html

Other refs for torchtext:
* https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-i-5da6f1c89d84
* https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-ii-f146c8b9a496
* http://anie.me/On-Torchtext/

# Imports and setup

In [17]:
import pandas as pd
import numpy as np
import os
import random
random.seed(1)
import re
import time

# Data processing.
import dataset # dataset.py
import torch
from torchtext.legacy import data 

# Model.
import models # models.py
import torch.nn as nn
import torch.optim as optim
from transformers import DistilBertModel, DistilBertTokenizer

# Training.
import training # training.py
from sklearn.model_selection import KFold
import utils # utils.py

# Visualization.
import matplotlib.pyplot as plt

# If a code change in a local module is not picked up by
# Jupyter, try reloading the module as shown below
# (importlib replaces the deprecated imp module):
# import importlib
# importlib.reload(training)

# Load a BERT tokenizer

In [18]:
WEIGHTS_NAME = 'distilbert-base-uncased'

tokenizer = DistilBertTokenizer.from_pretrained(WEIGHTS_NAME)

# Read the data

In [19]:
data_df = dataset.get_multiple_datasets([1,2,3], 'Creativity_Combined', shuffle=True)

In [20]:
# For prototype purposes:
# split into train, test sets. (Train set will be further split into 
# train+validation sets, via k-fold CV.)
train_df = data_df[:1000]
test_df = data_df[1000:] # roughly 190 test examples set aside

# write them to CSV files
train_df.to_csv('ktrain.csv', index=False, header=False)
test_df.to_csv('ktest.csv', index=False, header=False)

## Preprocessing and transform into torchtext Dataset format.

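From what I understand, constructing `data.Field()` does not process anything by itself: the `tokenize` and `preprocessing` functions passed to it are applied to each example when the dataset is loaded.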

In [21]:
INIT_TOKEN_IDX = tokenizer.cls_token_id
EOS_TOKEN_IDX = tokenizer.sep_token_id
PAD_TOKEN_IDX = tokenizer.pad_token_id
UNK_TOKEN_IDX = tokenizer.unk_token_id

# BERT input can be at most 512 tokens (word pieces, not words)
MAX_INPUT_LENGTH = tokenizer.max_model_input_sizes[WEIGHTS_NAME]

# Apply tokenization and some preprocessing steps to the input sentence.
# Namely, this trims examples down to MAX_INPUT_LENGTH. (There is a -2 
# since the [CLS] and [SEP] tokens will be added)
def tokenize_and_cut(sentence):
  sentence = sentence.replace('/', '') # remove slashes
  tokens = tokenizer.tokenize(sentence) 
  tokens = tokens[:MAX_INPUT_LENGTH-2]
  return tokens

# text_fields defines preprocessing and handling of the text of an example.
text_fields = data.Field(batch_first = True,
                  use_vocab = False,
                  tokenize = tokenize_and_cut,
                  preprocessing = tokenizer.convert_tokens_to_ids,
                  init_token = INIT_TOKEN_IDX, # add [CLS] token
                  eos_token = EOS_TOKEN_IDX, # add [SEP] token
                  pad_token = PAD_TOKEN_IDX,
                  unk_token = UNK_TOKEN_IDX)

# label_fields defines how to handle the label of an example.
# for regression, we do not need to build a vocabulary.
label_fields = data.LabelField(sequential=False, use_vocab=False, dtype = torch.float)
all_fields = [('text', text_fields), ('label', label_fields)] # must match order of cols in csv

train_dataset, test_dataset = data.TabularDataset.splits(
  path='', # path='' because the csvs are in the same directory
  train='ktrain.csv', test='ktest.csv', format='csv',
  fields=all_fields  
)
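To make the preprocessing steps above concrete, here is a minimal pure-Python sketch of roughly what the text field does to each example: tokenize, cut to the maximum length minus 2, convert tokens to ids, add the [CLS] and [SEP] ids, and pad the batch. The toy vocabulary, the whitespace tokenizer, and all `toy_*` names are hypothetical stand-ins for the DistilBERT tokenizer, not the torchtext implementation.

```python
# Hypothetical toy vocabulary standing in for the DistilBERT word-piece vocabulary.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "this": 4, "exercise": 5, "works": 6}
PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 1, 2, 3
MAX_LEN = 6  # stands in for MAX_INPUT_LENGTH (512 for DistilBERT)

def toy_tokenize_and_cut(sentence):
    # Whitespace split instead of word-piece tokenization (assumption for the sketch).
    tokens = sentence.replace('/', '').lower().split()
    return tokens[:MAX_LEN - 2]  # leave room for [CLS] and [SEP]

def toy_convert_tokens_to_ids(tokens):
    return [VOCAB.get(token, UNK_ID) for token in tokens]

def encode(sentence):
    ids = toy_convert_tokens_to_ids(toy_tokenize_and_cut(sentence))
    return [CLS_ID] + ids + [SEP_ID]

def pad_batch(encoded):
    # Pad every sequence to the length of the longest one in the batch.
    longest = max(len(ids) for ids in encoded)
    return [ids + [PAD_ID] * (longest - len(ids)) for ids in encoded]

batch = pad_batch([
    encode("This exercise works"),
    encode("This exercise works and works and works"),
])
for row in batch:
    print(row)
```

The second sentence is cut to four tokens before the special tokens are added, so no encoded row exceeds `MAX_LEN`.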

In [12]:
print(data_df.head(1))

                                                  text  label
260  This exercise specially work for the patient w...  2.925


In [22]:
# Transform train_dataset into an np array representation.
# This will be used for generating the K folds.
train_exs_arr = np.array(train_dataset.examples)
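The array representation matters because k-fold splitting produces lists of integer indices, and a NumPy array can be indexed with such a list directly. The sketch below reproduces, in plain Python, the contiguous unshuffled folds that `KFold` yields with its default `shuffle=False`; `kfold_indices` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size

folds = list(kfold_indices(n_samples=1000, n_splits=5))
fold_sizes = [(len(train_idx), len(valid_idx)) for train_idx, valid_idx in folds]
print(fold_sizes)
```

With 1000 training examples and 5 splits, each fold trains on 800 examples and validates on 200.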

# Training pipeline begins here


## Define parameters

In [23]:
# Model parameters
OUTPUT_DIM = 1
DROPOUT = 0.2 # Matches DistilBERT's default classifier-head dropout (seq_classif_dropout)

# Training parameters
N_SPLITS = 5 # For cross validation
BATCH_SIZE = 1 # Note: the Hugging Face Trainer default is 8 per device
N_EPOCHS = 3 # Same as the Hugging Face Trainer default

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
criterion = nn.MSELoss(reduction='sum') # replaces the deprecated size_average=False
criterion = criterion.to(device)
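The criterion above sums the squared errors over a batch instead of averaging them (`size_average=False` is the deprecated spelling of `reduction='sum'`). A small worked example in plain Python, with made-up predictions and targets, shows the difference between the two reductions:

```python
def squared_errors(preds, targets):
    return [(p - t) ** 2 for p, t in zip(preds, targets)]

# Made-up values for illustration only.
preds = [3.0, 4.0, 2.0, 5.0]
targets = [2.0, 4.0, 4.0, 4.0]

errors = squared_errors(preds, targets)   # one squared error per example
sum_loss = sum(errors)                    # what a summed MSE reports for this batch
mean_loss = sum_loss / len(errors)        # what an averaged MSE would report

print(errors, sum_loss, mean_loss)
```

A summed loss grows with the number of examples in a batch. If `training.train` averages per-batch losses (an assumption about `training.py`), losses reported for different batch sizes are not directly comparable.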



In [24]:
# For testing purposes, use randomly generated features with 3 columns.
ADDED_FEATURES = torch.rand(BATCH_SIZE, 3)
ADDED_DIM = ADDED_FEATURES.shape[1]
# Move the added features to the same device as the model.
ADDED_FEATURES = ADDED_FEATURES.to(device)


## The cell where it actually trains!

In [36]:
# The main training loop
def launch_experiment(train_data_df, n_splits=N_SPLITS):
  # valid_corrs = np.empty(n_splits)
  best_valid_loss = float('inf') 
  
  kf = KFold(n_splits=n_splits)
  fold = 0
  for train_index, valid_index in kf.split(train_data_df):
    print('training on fold {}'.format(fold))
    # Index into the array passed as an argument rather than the global train_exs_arr.
    train_data = data.Dataset(train_data_df[train_index], all_fields)
    valid_data = data.Dataset(train_data_df[valid_index], all_fields)
    
    # Initialize a new model each fold.
    # https://ai.stackexchange.com/questions/18221/deep-learning-with-kfold-cross-validation-with-epochs
    # https://stats.stackexchange.com/questions/358380/neural-networks-epochs-with-10-fold-cross-validation-doing-something-wrong
    bert = DistilBertModel.from_pretrained(WEIGHTS_NAME)
    model = models.BERTLinear(bert,
                              OUTPUT_DIM,
                              DROPOUT,
                              # The model was modified so that the dimension of the additional
                              # features is added to the BERT hidden dimension (768).
                              added_dim = ADDED_DIM)
    # optimizer = optim.Adam(model.parameters())
    optimizer = optim.Adam(model.parameters(), lr=5e-05, betas=(0.9, 0.999), eps=1e-08) # lr is not the Adam default
    model = model.to(device)

    train_iterator, valid_iterator = training.get_iterators(train_data, valid_data, BATCH_SIZE, device)

    for epoch in range(N_EPOCHS):
      start_time = time.time()
      # The training and model functions were modified so that hand-picked
      # features can be concatenated with the BERT representations.
      train_loss, train_corr = training.train(model, train_iterator, optimizer, criterion, added_features = ADDED_FEATURES)
      valid_loss, valid_corr = training.evaluate(model, valid_iterator, criterion, debug=False, added_features = ADDED_FEATURES)
      # valid_corrs[fold] would end up holding the valid_corr from the last epoch for this fold.
      # Is this what we want?
      # valid_corrs[fold] = valid_corr
      end_time = time.time()
      epoch_mins, epoch_secs = utils.epoch_time(start_time, end_time)

      if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        # Save the weights of the model with the best valid loss
        print('updating saved weights of best model')
        torch.save(model.state_dict(), "best_valid_loss.pt")

      print(f'Epoch: {epoch:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
      print(f'\t Train Loss: {train_loss:.3f} | Train Corr: {train_corr:.2f}')
      print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Corr: {valid_corr:.2f}')
    
    fold += 1
    # return best_valid_loss # TODO REMOVE -- JUST TRAIN ONE FOLD
  
  return best_valid_loss # , valid_corrs

In [39]:
import importlib # imp is deprecated and was removed in Python 3.12
importlib.reload(models)
importlib.reload(training)

best_valid_loss = launch_experiment(train_exs_arr)
print('best validation loss: {}'.format(best_valid_loss))
# print('validation correlations for each fold: {}'.format(valid_corrs))
# print('average of validation correlations: {}'.format(np.mean(valid_corrs)))

training on fold 0


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


updating saved weights of best model
Epoch: 00 | Epoch Time: 1m 8s
	 Train Loss: 0.582 | Train Corr: 0.39
	 Val. Loss: 0.386 |  Val. Corr: 0.60
Epoch: 01 | Epoch Time: 1m 8s
	 Train Loss: 0.288 | Train Corr: 0.71
	 Val. Loss: 0.499 |  Val. Corr: 0.57
Epoch: 02 | Epoch Time: 1m 8s
	 Train Loss: 0.216 | Train Corr: 0.79
	 Val. Loss: 0.416 |  Val. Corr: 0.61
training on fold 1


Epoch: 00 | Epoch Time: 1m 8s
	 Train Loss: 0.722 | Train Corr: 0.20
	 Val. Loss: 0.425 |  Val. Corr: 0.66
updating saved weights of best model
Epoch: 01 | Epoch Time: 1m 8s
	 Train Loss: 0.377 | Train Corr: 0.60
	 Val. Loss: 0.319 |  Val. Corr: 0.70
updating saved weights of best model
Epoch: 02 | Epoch Time: 1m 8s
	 Train Loss: 0.265 | Train Corr: 0.74
	 Val. Loss: 0.274 |  Val. Corr: 0.72
training on fold 2


Epoch: 00 | Epoch Time: 1m 8s
	 Train Loss: 0.606 | Train Corr: 0.33
	 Val. Loss: 0.517 |  Val. Corr: 0.67
Epoch: 01 | Epoch Time: 1m 8s
	 Train Loss: 0.330 | Train Corr: 0.65
	 Val. Loss: 0.281 |  Val. Corr: 0.75
Epoch: 02 | Epoch Time: 1m 8s
	 Train Loss: 0.205 | Train Corr: 0.80
	 Val. Loss: 0.336 |  Val. Corr: 0.73
training on fold 3


Epoch: 00 | Epoch Time: 1m 8s
	 Train Loss: 0.566 | Train Corr: 0.43
	 Val. Loss: 0.305 |  Val. Corr: 0.73
updating saved weights of best model
Epoch: 01 | Epoch Time: 1m 8s
	 Train Loss: 0.318 | Train Corr: 0.68
	 Val. Loss: 0.270 |  Val. Corr: 0.71
Epoch: 02 | Epoch Time: 1m 8s
	 Train Loss: 0.224 | Train Corr: 0.79
	 Val. Loss: 0.278 |  Val. Corr: 0.78
training on fold 4


Epoch: 00 | Epoch Time: 1m 7s
	 Train Loss: 0.596 | Train Corr: 0.38
	 Val. Loss: 0.284 |  Val. Corr: 0.69
Epoch: 01 | Epoch Time: 1m 7s
	 Train Loss: 0.305 | Train Corr: 0.69
	 Val. Loss: 0.360 |  Val. Corr: 0.65
Epoch: 02 | Epoch Time: 1m 7s
	 Train Loss: 0.200 | Train Corr: 0.81
	 Val. Loss: 0.328 |  Val. Corr: 0.63
best validation loss: 0.27033982449457655


# Parameter Search

In [39]:
from sklearn.model_selection import ParameterGrid

# Takes a parameter grid and searches for the best model over all combinations of parameters.
# param_grid is a dictionary like the input to sklearn.model_selection.GridSearchCV,
# e.g. {'batch_size': [1, 8], 'lr': [5e-05, 1e-05]}.
# Each model is evaluated using k-fold cross-validation (default k = 5).
# The model with the highest average validation correlation across folds is selected as the best.
# Returns:
#   - the performance of all models (k-element arrays stored in a dictionary),
#     which can be used for model comparison (e.g., a Wilcoxon test)
#   - the best model (a tuple of the parameter string and its average correlation)
def DistilBERT_with_parameter_search(param_grid, cv = 5):
    
    # Set default arguments. If an argument is not given in the parameter grid, the default is used.
    default = {'dropout': [.2], 
              'batch_size': [8],
               'lr': [5e-05],
              'added_features': [None],
              'n_epoch': [3]}
    for argum in default:
        if argum not in param_grid:
            param_grid[argum] = default[argum]
    
    # Use this function to expand the parameter grid
    grid = ParameterGrid(param_grid)
    
    # Placeholders for model performance
    results = {}
    results_mean = {}
    
    for params in grid:
        print(params)
        # Index of the model, represents the parameters
        index = '; '.join(x + '_' + str(y) for x, y in params.items())
        
        # Evaluate the model using the current set of parameters
        result = evaluate_candidates(train_data_df = train_exs_arr
                                ,n_splits = cv
                                ,dropout = params['dropout']
                                ,added_features = params['added_features']
                                ,lr = params['lr']
                                ,batch_size = params['batch_size']
                                ,n_epoch = params['n_epoch']
                               )
        
        # Store the results
        results[index] = result
        results_mean[index] = np.mean(result)
    
    # Select the best results
    best_index = max(results_mean, key = results_mean.get)
    best_model = (best_index,results_mean[best_index])

    return results, best_model
        

# This function evaluates a model with a certain set of parameters
# Similar to launch_experiment, except that the validation correlations are returned 
def evaluate_candidates(train_data_df
                        ,dropout
                        ,added_features
                        ,lr
                        ,batch_size
                        ,n_epoch
                        ,n_splits
                       ):
  valid_corrs = np.empty(n_splits)
  added_dim = 0
  if added_features is not None:
    added_dim = added_features.shape[1]
    added_features = added_features.to(device)
  kf = KFold(n_splits=n_splits)
  fold = 0
    
  for train_index, valid_index in kf.split(train_data_df):
    print('training on fold {}'.format(fold))
    # Index into the array passed as an argument rather than the global train_exs_arr.
    train_data = data.Dataset(train_data_df[train_index], all_fields)
    valid_data = data.Dataset(train_data_df[valid_index], all_fields)
    
    # Initialize a new model each fold.
    # https://ai.stackexchange.com/questions/18221/deep-learning-with-kfold-cross-validation-with-epochs
    # https://stats.stackexchange.com/questions/358380/neural-networks-epochs-with-10-fold-cross-validation-doing-something-wrong
    bert = DistilBertModel.from_pretrained(WEIGHTS_NAME)
    model = models.BERTLinear(bert,
                              OUTPUT_DIM,
                              dropout,
                              # The model was modified so that the dimension of the additional
                              # features is added to the BERT hidden dimension (768).
                              added_dim = added_dim)
    # optimizer = optim.Adam(model.parameters())
    optimizer = optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999), eps=1e-08) # lr is not the Adam default
    model = model.to(device)

    train_iterator, valid_iterator = training.get_iterators(train_data, valid_data, batch_size, device)

    for epoch in range(n_epoch):
      start_time = time.time()
      # The training and model functions were modified so that hand-picked
      # features can be concatenated with the BERT representations.
      train_loss, train_corr = training.train(model, train_iterator, optimizer, criterion, added_features = added_features)
      valid_loss, valid_corr = training.evaluate(model, valid_iterator, criterion, debug=False, added_features = added_features)
      # Note: valid_corrs[fold] is set after the epoch loop, so it holds
      # the valid_corr from the last epoch of this fold.
      end_time = time.time()
      epoch_mins, epoch_secs = utils.epoch_time(start_time, end_time)


      print(f'Epoch: {epoch:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
      print(f'\t Train Loss: {train_loss:.3f} | Train Corr: {train_corr:.2f}')
      print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Corr: {valid_corr:.2f}')
    
    valid_corrs[fold] = valid_corr
    fold += 1
  return valid_corrs

param_grid = {'batch_size': [1,8]}
all_results, best_model = DistilBERT_with_parameter_search(param_grid, cv = 2)

{'added_features': None, 'batch_size': 1, 'dropout': 0.2, 'lr': 5e-05, 'n_epoch': 3}
training on fold 0


Epoch: 00 | Epoch Time: 0m 49s
	 Train Loss: 0.615 | Train Corr: 0.44
	 Val. Loss: 0.542 |  Val. Corr: 0.67
Epoch: 01 | Epoch Time: 0m 49s
	 Train Loss: 0.269 | Train Corr: 0.75
	 Val. Loss: 0.383 |  Val. Corr: 0.64
Epoch: 02 | Epoch Time: 0m 49s
	 Train Loss: 0.193 | Train Corr: 0.83
	 Val. Loss: 0.548 |  Val. Corr: 0.69
training on fold 1


Epoch: 00 | Epoch Time: 0m 48s
	 Train Loss: 0.788 | Train Corr: 0.19
	 Val. Loss: 0.480 |  Val. Corr: 0.67
Epoch: 01 | Epoch Time: 0m 48s
	 Train Loss: 0.420 | Train Corr: 0.53
	 Val. Loss: 0.450 |  Val. Corr: 0.64
Epoch: 02 | Epoch Time: 0m 48s
	 Train Loss: 0.305 | Train Corr: 0.69
	 Val. Loss: 0.306 |  Val. Corr: 0.71
{'added_features': None, 'batch_size': 8, 'dropout': 0.2, 'lr': 5e-05, 'n_epoch': 3}
training on fold 0


Epoch: 00 | Epoch Time: 0m 32s
	 Train Loss: 13.229 | Train Corr: 0.09
	 Val. Loss: 7.008 |  Val. Corr: 0.55
Epoch: 01 | Epoch Time: 0m 32s
	 Train Loss: 3.246 | Train Corr: 0.57
	 Val. Loss: 7.226 |  Val. Corr: 0.59
Epoch: 02 | Epoch Time: 0m 34s
	 Train Loss: 2.606 | Train Corr: 0.67
	 Val. Loss: 5.254 |  Val. Corr: 0.62
training on fold 1


Epoch: 00 | Epoch Time: 0m 28s
	 Train Loss: 13.256 | Train Corr: 0.17
	 Val. Loss: 5.339 |  Val. Corr: 0.63
Epoch: 01 | Epoch Time: 0m 29s
	 Train Loss: 3.375 | Train Corr: 0.51
	 Val. Loss: 3.874 |  Val. Corr: 0.69
Epoch: 02 | Epoch Time: 0m 29s
	 Train Loss: 2.767 | Train Corr: 0.62
	 Val. Loss: 3.457 |  Val. Corr: 0.72
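The grid expansion used in the parameter search can be sketched with the standard library alone. The snippet below merges user-supplied values over the defaults, iterates over every combination with keys in sorted order (which matches the key order in the parameter dictionaries printed above), and builds the same kind of label string used to index the results. `expand_grid` is a hypothetical helper for illustration, not the scikit-learn implementation.

```python
from itertools import product

def expand_grid(param_grid, defaults):
    """Yield one dict per parameter combination, with keys in sorted order."""
    merged = {**defaults, **param_grid}  # user values override defaults
    keys = sorted(merged)
    for values in product(*(merged[key] for key in keys)):
        yield dict(zip(keys, values))

defaults = {'dropout': [0.2], 'batch_size': [8], 'lr': [5e-05],
            'added_features': [None], 'n_epoch': [3]}

candidates = list(expand_grid({'batch_size': [1, 8]}, defaults))
labels = ['; '.join(f'{key}_{value}' for key, value in params.items())
          for params in candidates]
for label in labels:
    print(label)
```

Unlike the search function above, this sketch builds a merged copy and leaves the caller's `param_grid` unmodified.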


In [42]:
print(all_results)
print(best_model)

{'added_features_None; batch_size_1; dropout_0.2; lr_5e-05; n_epoch_3': array([0.69352679, 0.71319378]), 'added_features_None; batch_size_8; dropout_0.2; lr_5e-05; n_epoch_3': array([0.61965209, 0.71505918])}
('added_features_None; batch_size_1; dropout_0.2; lr_5e-05; n_epoch_3', 0.7033602868459449)
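The selection step can be checked by hand from the printed per-fold correlations. The sketch below uses shortened, hypothetical keys in place of the full parameter strings and recomputes the per-configuration means and the winner:

```python
from statistics import mean

# Per-fold validation correlations copied from the output above;
# the keys are shortened for readability.
fold_corrs = {
    'batch_size_1': [0.69352679, 0.71319378],
    'batch_size_8': [0.61965209, 0.71505918],
}

mean_corrs = {name: mean(values) for name, values in fold_corrs.items()}
best_name = max(mean_corrs, key=mean_corrs.get)

print(mean_corrs)
print(best_name)
```

With only two folds per configuration, a gap of this size is weak evidence. On the second fold, the batch size 8 model has the slightly higher correlation.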


# Test the trained model on held-out dataset.

In [11]:
# Get a test iterator
test_iterator = training.get_iterator(test_dataset, BATCH_SIZE, device)

In [12]:
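# Load the best saved model.
# Note: the architecture built here must match the one that was saved. If the
# checkpoint was trained with added features, pass the same added_dim to
# BERTLinear and the same added_features to training.evaluate.
bert = DistilBertModel.from_pretrained(WEIGHTS_NAME)
model = models.BERTLinear(bert, OUTPUT_DIM, DROPOUT)
model.load_state_dict(torch.load("best_valid_loss.pt", map_location=device))
model.to(device)
model.eval()
test_loss, test_corr = training.evaluate(model, test_iterator, criterion, debug=True)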
print(test_loss)
print(test_corr)

predictions: tensor([3.7420, 3.1371, 2.9669, 4.6738, 4.3069, 4.3898, 3.0036, 3.5888],
       device='cuda:0')
true labels: tensor([4.1750, 2.6750, 3.3500, 4.3250, 4.1750, 4.5750, 3.2500, 4.9500],
       device='cuda:0')
predictions: tensor([3.4298, 4.1198, 3.7915, 3.9154, 3.7454, 2.9854, 4.5348, 3.8341],
       device='cuda:0')
true labels: tensor([4.1250, 4.3500, 4.7000, 4.3000, 5.2750, 2.5500, 5.4250, 3.2750],
       device='cuda:0')
predictions: tensor([3.9121, 2.8993, 3.8251, 4.6524, 3.7348, 4.3551, 3.6334, 3.5260],
       device='cuda:0')
true labels: tensor([5.2750, 2.7750, 4.3750, 4.7750, 4.3750, 3.5500, 4.0500, 3.9250],
       device='cuda:0')
predictions: tensor([3.3758, 3.6365, 2.9481, 3.0048, 4.2327, 4.4699, 3.6686, 3.5043],
       device='cuda:0')
true labels: tensor([4.6000, 3.9250, 2.8500, 3.3000, 4.8000, 3.3250, 5.0500, 3.7250],
       device='cuda:0')
predictions: tensor([3.5478, 4.4148, 3.4240, 3.7057, 3.6455, 4.6826, 4.1336, 3.1818],
       device='cuda:0')
true label
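The `Corr` values reported throughout are computed in `training.py`, which is not shown here. Assuming they are Pearson correlations between predictions and true labels (an assumption), the metric can be sketched in plain Python and applied to the first batch printed above:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# First batch of predictions and true labels from the debug output above.
preds = [3.7420, 3.1371, 2.9669, 4.6738, 4.3069, 4.3898, 3.0036, 3.5888]
labels = [4.1750, 2.6750, 3.3500, 4.3250, 4.1750, 4.5750, 3.2500, 4.9500]

r = pearson_r(preds, labels)
print(f'{r:.3f}')
```

A correlation computed on a single batch of 8 examples is noisy; the notebook's reported value presumably aggregates over the whole test set.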

# Misc other stuff

Link to the Trainer class: https://huggingface.co/transformers/main_classes/trainer.html

Default training arguments: https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments

* Batch size per device: 8
* Epochs: 3

This should be the model I used to generate my initial results: https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforsequenceclassification

"DistilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks."