This notebook is based on [this reference implementation](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/6%20-%20Transformers%20for%20Sentiment%20Analysis.ipynb).

Other refs for model:
* https://stackoverflow.com/questions/65205582/how-can-i-add-a-bi-lstm-layer-on-top-of-bert-model
* https://discuss.pytorch.org/t/how-to-connect-hook-two-or-even-more-models-together/21033
* https://pytorch.org/tutorials/beginner/transformer_tutorial.html
* https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html

Other refs for torchtext:
* https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-i-5da6f1c89d84
* https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-ii-f146c8b9a496
* http://anie.me/On-Torchtext/

# Imports and setup

In [1]:
import pandas as pd
import numpy as np
import os
import random
# random.seed(1)
import re

# Data processing.
import constants # constants.py
import dataset # dataset.py
import torch

# Model.
import models # models.py
import torch.nn as nn
from transformers import DistilBertModel

# Training.
import training # training.py
import utils # utils.py

# If you make a code change that isn't picked up by the
# Jupyter notebook, try reloading the module like this
# (importlib replaces the deprecated imp module):
# import importlib
# importlib.reload(training)

# Read the data

In [3]:
# data_df = dataset.get_multiple_datasets([1,2,3], 'Creativity_Combined', shuffle=True)

In [4]:
# This cell is commented out because the CSVs should already exist in the directory.
# If you are running the notebook for the first time, uncomment and run the lines
# below (and the cell above) to generate the CSVs.
# Split into train and test sets. (The train set will be further split into
# train + validation sets via k-fold CV.)
# train_df = data_df[:1000]
# test_df = data_df[1000:] # roughly 203 test examples set aside

# write them to CSV files
# train_df.to_csv('ktrain.csv', index=False, header=False)
# test_df.to_csv('ktest.csv', index=False, header=False)

## Preprocess and transform into torchtext Dataset format

As far as I understand, some preprocessing (for example, tokenization) happens when `data.Field()` is applied.
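As a rough illustration of the kind of preprocessing a `data.Field` can apply, the sketch below walks one sentence through tokenizing, truncating, mapping to ids, and adding special tokens. It is loosely modeled on the reference implementation and is an assumption, not the contents of `dataset.py`: the whitespace tokenizer, `TOY_VOCAB`, and `MAX_INPUT_LENGTH` are hypothetical stand-ins. Only the special-token ids (0, 100, 101, 102) follow the `distilbert-base-uncased` vocabulary.

```python
# Special-token ids in the distilbert-base-uncased vocabulary.
PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 100, 101, 102

# Hypothetical toy vocabulary; real ids come from the pretrained tokenizer.
TOY_VOCAB = {"use": 1, "a": 2, "brick": 3, "as": 4, "doorstop": 5}
MAX_INPUT_LENGTH = 8  # hypothetical; counts the two special tokens


def tokenize_and_cut(text, max_len=MAX_INPUT_LENGTH):
    """Lowercase, split on whitespace, and leave room for [CLS] and [SEP]."""
    return text.lower().split()[: max_len - 2]


def numericalize(tokens):
    """Map tokens to ids (unknown words become UNK_ID) and wrap them in [CLS] ... [SEP]."""
    ids = [TOY_VOCAB.get(tok, UNK_ID) for tok in tokens]
    return [CLS_ID] + ids + [SEP_ID]


def pad(ids, length=MAX_INPUT_LENGTH):
    """Right-pad a sequence so every example in a batch has equal length."""
    return ids + [PAD_ID] * (length - len(ids))


example_ids = pad(numericalize(tokenize_and_cut("Use a brick as a paperweight")))
short_ids = pad(numericalize(tokenize_and_cut("a brick")))
print(example_ids)
print(short_ids)
```

The real tokenizer splits text into word pieces rather than on whitespace, so actual id sequences will differ.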

In [2]:
train_dataset, test_dataset = dataset.get_train_test_datasets()

In [3]:
# Transform train_dataset into an np array representation.
# This will be used for generating the K folds.
train_exs_arr = np.array(train_dataset.examples)
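To make the K-fold step concrete, here is a minimal sketch of splitting example indices into folds. The actual logic lives in `training.py` and is not shown in this notebook, so `k_fold_indices` and its arguments are hypothetical; `k=5` matches the five folds (0 to 4) in the training logs, and 1000 is the training-set size from the split.

```python
import random


def k_fold_indices(n_examples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each index lands in exactly one validation fold."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_examples=1000, k=5))
for fold_id, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {fold_id}: {len(train_idx)} train, {len(val_idx)} validation")
```

With a NumPy array such as `train_exs_arr`, indexing by `train_idx` and `val_idx` selects the examples for each fold.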

# Training pipeline begins here


In [16]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

param_grid = {
    'dropout': [0.2],
    'batch_size': [8],
    'max_epochs': [10],
    'lr': [3e-05, 1e-04],
    'hidden_dim': [768],
    'num_layers': [1],
    'bidirectional': [True],
}

results, best_model = training.perform_hyperparameter_search(param_grid, train_exs_arr, rnn=True, save_weights=True)
print(best_model)

eid 0, params {'batch_size': 8, 'bidirectional': True, 'dropout': 0.2, 'hidden_dim': 768, 'lr': 3e-05, 'max_epochs': 10, 'num_layers': 1}
training on fold 0


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


updating saved weights of best model
Epoch: 00 | Epoch Time: 1m 13s
	 Train Loss: 15.303 | Train Corr: 0.16
	 Val. Loss: 3.610 |  Val. Corr: 0.49
updating saved weights of best model
Epoch: 01 | Epoch Time: 1m 12s
	 Train Loss: 2.608 | Train Corr: 0.66
	 Val. Loss: 2.811 |  Val. Corr: 0.61
Epoch: 02 | Epoch Time: 1m 13s
	 Train Loss: 1.860 | Train Corr: 0.78
	 Val. Loss: 3.350 |  Val. Corr: 0.61
Epoch: 03 | Epoch Time: 1m 12s
	 Train Loss: 1.137 | Train Corr: 0.87
	 Val. Loss: 3.346 |  Val. Corr: 0.60
Epoch: 04 | Epoch Time: 1m 11s
	 Train Loss: 0.759 | Train Corr: 0.92
	 Val. Loss: 2.879 |  Val. Corr: 0.63
training on fold 1


Epoch: 00 | Epoch Time: 1m 13s
	 Train Loss: 14.815 | Train Corr: 0.18
	 Val. Loss: 4.956 |  Val. Corr: 0.47
Epoch: 01 | Epoch Time: 1m 13s
	 Train Loss: 3.045 | Train Corr: 0.57
	 Val. Loss: 5.996 |  Val. Corr: 0.67
Epoch: 02 | Epoch Time: 1m 13s
	 Train Loss: 2.238 | Train Corr: 0.71
	 Val. Loss: 4.764 |  Val. Corr: 0.71
updating saved weights of best model
Epoch: 03 | Epoch Time: 1m 13s
	 Train Loss: 1.749 | Train Corr: 0.78
	 Val. Loss: 2.683 |  Val. Corr: 0.70
Epoch: 04 | Epoch Time: 1m 13s
	 Train Loss: 1.202 | Train Corr: 0.86
	 Val. Loss: 3.654 |  Val. Corr: 0.70
training on fold 2


Epoch: 00 | Epoch Time: 1m 12s
	 Train Loss: 17.089 | Train Corr: 0.14
	 Val. Loss: 3.698 |  Val. Corr: 0.59
Epoch: 01 | Epoch Time: 1m 13s
	 Train Loss: 3.116 | Train Corr: 0.59
	 Val. Loss: 3.247 |  Val. Corr: 0.74
Epoch: 02 | Epoch Time: 1m 13s
	 Train Loss: 2.337 | Train Corr: 0.71
	 Val. Loss: 3.627 |  Val. Corr: 0.80
updating saved weights of best model
Epoch: 03 | Epoch Time: 1m 13s
	 Train Loss: 1.633 | Train Corr: 0.81
	 Val. Loss: 1.762 |  Val. Corr: 0.79
Epoch: 04 | Epoch Time: 1m 12s
	 Train Loss: 1.263 | Train Corr: 0.86
	 Val. Loss: 1.984 |  Val. Corr: 0.80
Epoch: 05 | Epoch Time: 1m 14s
	 Train Loss: 0.786 | Train Corr: 0.91
	 Val. Loss: 1.807 |  Val. Corr: 0.80
Epoch: 06 | Epoch Time: 1m 13s
	 Train Loss: 0.640 | Train Corr: 0.93
	 Val. Loss: 1.821 |  Val. Corr: 0.78
Epoch: 07 | Epoch Time: 1m 12s
	 Train Loss: 0.501 | Train Corr: 0.95
	 Val. Loss: 2.019 |  Val. Corr: 0.81
Epoch: 08 | Epoch Time: 1m 12s
	 Train Loss: 0.371 | Train Corr: 0.96
	 Val. Loss: 1.876 |  Val. C

Epoch: 00 | Epoch Time: 1m 11s
	 Train Loss: 15.009 | Train Corr: 0.16
	 Val. Loss: 5.074 |  Val. Corr: 0.46
Epoch: 01 | Epoch Time: 1m 10s
	 Train Loss: 2.860 | Train Corr: 0.61
	 Val. Loss: 3.680 |  Val. Corr: 0.62
Epoch: 02 | Epoch Time: 1m 12s
	 Train Loss: 2.303 | Train Corr: 0.71
	 Val. Loss: 2.643 |  Val. Corr: 0.74
Epoch: 03 | Epoch Time: 1m 11s
	 Train Loss: 1.596 | Train Corr: 0.81
	 Val. Loss: 2.188 |  Val. Corr: 0.75
Epoch: 04 | Epoch Time: 1m 11s
	 Train Loss: 1.033 | Train Corr: 0.88
	 Val. Loss: 1.969 |  Val. Corr: 0.78
Epoch: 05 | Epoch Time: 1m 12s
	 Train Loss: 0.816 | Train Corr: 0.91
	 Val. Loss: 1.922 |  Val. Corr: 0.78
Epoch: 06 | Epoch Time: 1m 12s
	 Train Loss: 0.576 | Train Corr: 0.93
	 Val. Loss: 1.892 |  Val. Corr: 0.78
Epoch: 07 | Epoch Time: 1m 12s
	 Train Loss: 0.466 | Train Corr: 0.95
	 Val. Loss: 1.867 |  Val. Corr: 0.79
Epoch: 08 | Epoch Time: 1m 10s
	 Train Loss: 0.412 | Train Corr: 0.95
	 Val. Loss: 2.060 |  Val. Corr: 0.77
Epoch: 09 | Epoch Time: 1m 

Epoch: 00 | Epoch Time: 1m 8s
	 Train Loss: 15.648 | Train Corr: 0.13
	 Val. Loss: 5.077 |  Val. Corr: 0.47
Epoch: 01 | Epoch Time: 1m 9s
	 Train Loss: 2.961 | Train Corr: 0.59
	 Val. Loss: 3.521 |  Val. Corr: 0.63
Epoch: 02 | Epoch Time: 1m 7s
	 Train Loss: 2.481 | Train Corr: 0.68
	 Val. Loss: 3.247 |  Val. Corr: 0.71
Epoch: 03 | Epoch Time: 1m 8s
	 Train Loss: 1.741 | Train Corr: 0.79
	 Val. Loss: 2.715 |  Val. Corr: 0.72
Epoch: 04 | Epoch Time: 1m 7s
	 Train Loss: 1.144 | Train Corr: 0.87
	 Val. Loss: 2.644 |  Val. Corr: 0.67
Epoch: 05 | Epoch Time: 1m 9s
	 Train Loss: 0.897 | Train Corr: 0.90
	 Val. Loss: 2.570 |  Val. Corr: 0.71
eid 1, params {'batch_size': 8, 'bidirectional': True, 'dropout': 0.2, 'hidden_dim': 768, 'lr': 0.0001, 'max_epochs': 10, 'num_layers': 1}
training on fold 0


updating saved weights of best model
Epoch: 00 | Epoch Time: 1m 12s
	 Train Loss: 8.444 | Train Corr: 0.23
	 Val. Loss: 3.374 |  Val. Corr: 0.52
Epoch: 01 | Epoch Time: 1m 13s
	 Train Loss: 3.132 | Train Corr: 0.58
	 Val. Loss: 3.572 |  Val. Corr: 0.52
Epoch: 02 | Epoch Time: 1m 12s
	 Train Loss: 2.225 | Train Corr: 0.73
	 Val. Loss: 4.800 |  Val. Corr: 0.58
updating saved weights of best model
Epoch: 03 | Epoch Time: 1m 12s
	 Train Loss: 1.409 | Train Corr: 0.84
	 Val. Loss: 2.975 |  Val. Corr: 0.59
Epoch: 04 | Epoch Time: 1m 12s
	 Train Loss: 0.901 | Train Corr: 0.90
	 Val. Loss: 3.261 |  Val. Corr: 0.60
updating saved weights of best model
training on fold 1


Epoch: 00 | Epoch Time: 1m 13s
	 Train Loss: 8.366 | Train Corr: 0.25
	 Val. Loss: 3.818 |  Val. Corr: 0.64
Epoch: 01 | Epoch Time: 1m 13s
	 Train Loss: 3.016 | Train Corr: 0.59
	 Val. Loss: 3.793 |  Val. Corr: 0.66
Epoch: 02 | Epoch Time: 1m 13s
	 Train Loss: 1.857 | Train Corr: 0.77
	 Val. Loss: 4.889 |  Val. Corr: 0.63
Epoch: 03 | Epoch Time: 1m 13s
	 Train Loss: 1.164 | Train Corr: 0.86
	 Val. Loss: 3.532 |  Val. Corr: 0.61
Epoch: 04 | Epoch Time: 1m 13s
	 Train Loss: 0.876 | Train Corr: 0.90
	 Val. Loss: 2.976 |  Val. Corr: 0.62
Epoch: 05 | Epoch Time: 1m 14s
	 Train Loss: 0.570 | Train Corr: 0.93
	 Val. Loss: 3.794 |  Val. Corr: 0.67
training on fold 2


Epoch: 00 | Epoch Time: 1m 12s
	 Train Loss: 8.724 | Train Corr: 0.22
	 Val. Loss: 4.361 |  Val. Corr: 0.57
updating saved weights of best model
Epoch: 01 | Epoch Time: 1m 13s
	 Train Loss: 3.446 | Train Corr: 0.53
	 Val. Loss: 2.886 |  Val. Corr: 0.71
updating saved weights of best model
Epoch: 02 | Epoch Time: 1m 13s
	 Train Loss: 2.210 | Train Corr: 0.73
	 Val. Loss: 2.116 |  Val. Corr: 0.74
Epoch: 03 | Epoch Time: 1m 13s
	 Train Loss: 1.763 | Train Corr: 0.79
	 Val. Loss: 2.147 |  Val. Corr: 0.72
updating saved weights of best model
Epoch: 04 | Epoch Time: 1m 12s
	 Train Loss: 1.206 | Train Corr: 0.86
	 Val. Loss: 2.085 |  Val. Corr: 0.74
updating saved weights of best model
Epoch: 05 | Epoch Time: 1m 14s
	 Train Loss: 0.850 | Train Corr: 0.91
	 Val. Loss: 1.823 |  Val. Corr: 0.77
Epoch: 06 | Epoch Time: 1m 13s
	 Train Loss: 0.566 | Train Corr: 0.94
	 Val. Loss: 2.446 |  Val. Corr: 0.75
Epoch: 07 | Epoch Time: 1m 12s
	 Train Loss: 0.439 | Train Corr: 0.95
	 Val. Loss: 1.959 |  Val.

Epoch: 00 | Epoch Time: 1m 11s
	 Train Loss: 8.565 | Train Corr: 0.18
	 Val. Loss: 3.720 |  Val. Corr: 0.64
Epoch: 01 | Epoch Time: 1m 10s
	 Train Loss: 3.012 | Train Corr: 0.59
	 Val. Loss: 2.495 |  Val. Corr: 0.75
Epoch: 02 | Epoch Time: 1m 11s
	 Train Loss: 1.914 | Train Corr: 0.76
	 Val. Loss: 3.955 |  Val. Corr: 0.73
Epoch: 03 | Epoch Time: 1m 11s
	 Train Loss: 1.503 | Train Corr: 0.82
	 Val. Loss: 2.146 |  Val. Corr: 0.77
Epoch: 04 | Epoch Time: 1m 11s
	 Train Loss: 0.779 | Train Corr: 0.91
	 Val. Loss: 2.018 |  Val. Corr: 0.78
Epoch: 05 | Epoch Time: 1m 12s
	 Train Loss: 0.602 | Train Corr: 0.93
	 Val. Loss: 1.945 |  Val. Corr: 0.79
Epoch: 06 | Epoch Time: 1m 12s
	 Train Loss: 0.444 | Train Corr: 0.95
	 Val. Loss: 1.916 |  Val. Corr: 0.78
Epoch: 07 | Epoch Time: 1m 12s
	 Train Loss: 0.335 | Train Corr: 0.96
	 Val. Loss: 1.914 |  Val. Corr: 0.78
training on fold 4


Epoch: 00 | Epoch Time: 1m 8s
	 Train Loss: 8.834 | Train Corr: 0.16
	 Val. Loss: 4.728 |  Val. Corr: 0.50
Epoch: 01 | Epoch Time: 1m 8s
	 Train Loss: 3.339 | Train Corr: 0.53
	 Val. Loss: 3.865 |  Val. Corr: 0.58
Epoch: 02 | Epoch Time: 1m 7s
	 Train Loss: 2.539 | Train Corr: 0.67
	 Val. Loss: 3.894 |  Val. Corr: 0.58
Epoch: 03 | Epoch Time: 1m 8s
	 Train Loss: 1.645 | Train Corr: 0.80
	 Val. Loss: 3.286 |  Val. Corr: 0.60
Epoch: 04 | Epoch Time: 1m 7s
	 Train Loss: 1.135 | Train Corr: 0.87
	 Val. Loss: 3.527 |  Val. Corr: 0.58
Epoch: 05 | Epoch Time: 1m 9s
	 Train Loss: 0.829 | Train Corr: 0.90
	 Val. Loss: 2.856 |  Val. Corr: 0.62
('batch_size_8; bidirectional_True; dropout_0.2; hidden_dim_768; lr_3e-05; max_epochs_10; num_layers_1', 0.7207434676264579)
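The two `eid` entries in the log correspond to the two combinations in `param_grid`, since only `lr` has more than one value. A minimal sketch of that expansion follows, assuming a plain Cartesian product over alphabetically sorted keys. `expand_grid` and `config_key` are hypothetical helpers, not functions from `training.py`, and the key format only mimics the string printed for the best model.

```python
from itertools import product


def expand_grid(param_grid):
    """Return one dict per combination of values, iterating keys in sorted order."""
    keys = sorted(param_grid)
    return [dict(zip(keys, values)) for values in product(*(param_grid[k] for k in keys))]


def config_key(params):
    """Build a readable identifier such as 'batch_size_8; ...; num_layers_1'."""
    return "; ".join(f"{name}_{value}" for name, value in sorted(params.items()))


param_grid = {
    "dropout": [0.2],
    "batch_size": [8],
    "max_epochs": [10],
    "lr": [3e-05, 1e-04],
    "hidden_dim": [768],
    "num_layers": [1],
    "bidirectional": [True],
}

configs = expand_grid(param_grid)
for eid, params in enumerate(configs):
    print(f"eid {eid}: {config_key(params)}")
```

The number of training runs is the number of combinations times the number of folds, so adding values to several keys at once grows the search quickly.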


In [14]:
for k in results:
    total = np.sum(results[k])
    print('{}: {}->sum={}\n'.format(k, results[k], total))

batch_size_8; bidirectional_False; dropout_0.2; hidden_dim_128; lr_5e-05; max_epochs_10; num_layers_1: [0.63336177 0.39367247 0.48705073 0.63487327 0.49103062]->sum=2.6399888551761626

batch_size_8; bidirectional_False; dropout_0.2; hidden_dim_128; lr_5e-05; max_epochs_10; num_layers_2: [0.37162964 0.66071486 0.34318829 0.29550682 0.4256426 ]->sum=2.0966822003214824

batch_size_8; bidirectional_False; dropout_0.2; hidden_dim_128; lr_5e-05; max_epochs_10; num_layers_3: [0.24622641 0.36365411 0.60378327 0.55231683 0.45307754]->sum=2.219058162342151

batch_size_8; bidirectional_False; dropout_0.2; hidden_dim_256; lr_5e-05; max_epochs_10; num_layers_1: [0.59652564 0.6254396  0.72505107 0.68787658 0.68638165]->sum=3.3212745482868824

batch_size_8; bidirectional_False; dropout_0.2; hidden_dim_256; lr_5e-05; max_epochs_10; num_layers_2: [ 0.59938873  0.56638524  0.76866984 -0.48525026  0.49190522]->sum=1.9410987781481384

batch_size_8; bidirectional_False; dropout_0.2; hidden_dim_256; lr_5e-0
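Each entry holds one score per fold (presumably the validation correlation), so dividing by the number of folds gives a per-configuration mean that can be ranked directly. The sketch below does this for two entries copied from the output; the shortened keys and the `rank_by_mean` helper are hypothetical and not part of `training.py`.

```python
from statistics import mean


def rank_by_mean(results):
    """Sort configurations by mean per-fold score, best first."""
    return sorted(((mean(scores), key) for key, scores in results.items()), reverse=True)


# Two entries copied from the printed results; keys shortened for readability.
results = {
    "hidden_dim_128; num_layers_1": [0.63336177, 0.39367247, 0.48705073, 0.63487327, 0.49103062],
    "hidden_dim_256; num_layers_1": [0.59652564, 0.6254396, 0.72505107, 0.68787658, 0.68638165],
}

ranked = rank_by_mean(results)
for avg, key in ranked:
    print(f"{key}: mean={avg:.4f}")
```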

# Test the trained model on held-out dataset.

In [6]:
# Get a test iterator.
# Use the batch size from the best hyperparameters.
test_iterator = training.get_iterator(test_dataset, 8, device)

In [None]:
# Load the best saved model.
bert = DistilBertModel.from_pretrained(constants.WEIGHTS_NAME)
# Use the hyperparameters of the best model.
model = models.BERTRNN(bert, constants.OUTPUT_DIM, 768, 1, True, 0.2)
# map_location lets weights saved on a GPU load on a CPU-only machine.
model.load_state_dict(torch.load("1_best_valid_loss.pt", map_location=device))
model.to(device)
model.eval()
# If you change the criterion, make sure it matches the training criterion in training.py.
# reduction='sum' is the current spelling of the deprecated size_average=False.
criterion = nn.MSELoss(reduction='sum')
criterion = criterion.to(device)
test_loss, test_corr = training.evaluate(model, test_iterator, criterion, debug=False)
print(test_loss)
print(test_corr)
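For reference, here is a plain-Python sketch of the two numbers this cell prints: a summed squared-error loss (what `nn.MSELoss` computes when it sums instead of averaging) and a correlation. That `training.evaluate` reports Pearson correlation is an assumption based on the `Corr` label in the logs, and the predictions and targets are made-up values.

```python
import math


def sum_squared_error(preds, targets):
    """Summed (not averaged) squared error over all examples."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets))


def pearson_corr(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up predictions and gold scores, for illustration only.
preds = [1.0, 2.0, 3.0, 4.0]
targets = [1.5, 1.5, 3.5, 4.5]

test_loss_sketch = sum_squared_error(preds, targets)
test_corr_sketch = pearson_corr(preds, targets)
print(test_loss_sketch)
print(test_corr_sketch)
```

Because the loss is summed, its magnitude depends on how many examples are accumulated, so losses are only comparable when that count is the same.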


# Misc other stuff

Link to the Trainer class: https://huggingface.co/transformers/main_classes/trainer.html

Default training arguments: https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments

* Batch size per device: 8
* Epochs: 3

This should be the model I used to generate my initial results: https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforsequenceclassification

"DistilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks."