This notebook is based on [this reference implementation](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/6%20-%20Transformers%20for%20Sentiment%20Analysis.ipynb).

Other refs for model:
* https://stackoverflow.com/questions/65205582/how-can-i-add-a-bi-lstm-layer-on-top-of-bert-model
* https://discuss.pytorch.org/t/how-to-connect-hook-two-or-even-more-models-together/21033
* https://pytorch.org/tutorials/beginner/transformer_tutorial.html
* https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html

Other refs for torchtext:
* https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-i-5da6f1c89d84
* https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-ii-f146c8b9a496
* http://anie.me/On-Torchtext/

# Imports and setup

In [1]:
import pandas as pd
import numpy as np
import os
import random
# random.seed(1)
import re

# Data processing.
import constants # constants.py
import dataset # dataset.py
import torch

# Model.
import models # models.py
import torch.nn as nn
from transformers import DistilBertModel

# Training.
import training # training.py
import utils # utils.py

# Manually created features
import semdis #semdis.py

# If you make a code change that doesn't get picked up by
# the Jupyter notebook, try reloading the module like this:
# import importlib
# importlib.reload(training)



# Read the data
Skip this section if you've already run the notebook once and have the CSVs locally.

In [17]:
data_df = dataset.read_multiple_datasets([1,2,3], 'Creativity_Combined', shuffle=True)
data_df['add'] = semdis.normalized_tfidf(data_df)

In [24]:
# Run this cell only when running the notebook for the first time, to generate
# the CSVs. Skip it if they already exist in the directory.
# Split into train and test sets. (The train set will be further split into
# train and validation sets via k-fold CV.)
train_df = data_df[:1000]
test_df = data_df[1000:] # roughly 203 test examples set aside

# write them to CSV files
train_df.to_csv('ktrain.csv', index=False, header=False)
test_df.to_csv('ktest.csv', index=False, header=False)

## Preprocessing and transformation into torchtext Dataset format

As far as I understand, some preprocessing happens when `data.Field()` is applied.
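
As a rough illustration of the kind of preprocessing a field can apply, here is a torchtext-free sketch. It assumes a pipeline of lowercasing, whitespace tokenization, truncation, special tokens, ID lookup and padding. The toy vocabulary, the token names and the `preprocess` helper are all hypothetical; what `dataset.py` actually does may differ.

```python
# Hypothetical toy vocabulary; a real pipeline would use the tokenizer's own vocab.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "a": 4, "brick": 5, "doorstop": 6}


def preprocess(text, vocab=TOY_VOCAB, max_len=8):
    """Lowercase, split on whitespace, truncate, add special tokens, map to IDs, pad."""
    tokens = text.lower().split()
    tokens = tokens[: max_len - 2]           # leave room for [CLS] and [SEP]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Unknown words fall back to the [UNK] ID.
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    # Right-pad so every example has the same length.
    ids += [vocab["[PAD]"]] * (max_len - len(ids))
    return ids


example_ids = preprocess("A brick doorstop holder")
print(example_ids)
```

The word "holder" is not in the toy vocabulary, so it maps to the `[UNK]` ID, and the sequence is padded up to `max_len`.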

In [2]:
train_dataset, test_dataset = dataset.get_train_test_datasets('ktrain.csv','ktest.csv', add=True)

In [3]:
# Transform train_dataset into an np array representation.
# This will be used for generating the K folds.
train_exs_arr = np.array(train_dataset.examples)
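
The fold generation itself lives in `training.py` and is not shown in this notebook. As a minimal sketch of the idea, assuming shuffled, non-overlapping folds, the following produces train/validation index lists. The helper name `k_fold_indices`, the choice of `k=5` and the seed are assumptions for illustration only.

```python
import random


def k_fold_indices(n_examples, k=5, seed=0):
    """Yield (train_idx, valid_idx) index lists for each of k folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size


folds = list(k_fold_indices(1000, k=5))
for fold_id, (train_idx, valid_idx) in enumerate(folds):
    print(fold_id, len(train_idx), len(valid_idx))
```

Because NumPy arrays accept lists of integer indices, an expression such as `train_exs_arr[train_idx]` would select the training examples for a fold.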

# Training pipeline begins here


In [14]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
import importlib  # `imp` is deprecated and was removed in Python 3.12
importlib.reload(training)
importlib.reload(dataset)
importlib.reload(models)

param_grid = {
    'dropout': [0.1,0.2],
    'batch_size': [1,4,8],
    'max_epochs': [10],
    'lr': [1e-05,3e-05, 5e-05, 1e-04],
    'model': ['linear']
}

results, best_model = training.perform_hyperparameter_search(param_grid, train_exs_arr, save_weights=True)
print(best_model)

# The commented-out block below runs a single experiment.
# params = {
#     'dropout': 0.2,
#     'batch_size': 8,
#     'max_epochs': 10,
#     'lr': 5e-05,
#     'model': 'linear'
# }

# valid_corrs = training.launch_experiment(1, train_exs_arr, params, save_weights=True, add=True)
# print('validation correlations: {}'.format(valid_corrs))

eid 0, params {'batch_size': 1, 'dropout': 0.1, 'lr': 1e-05, 'max_epochs': 10, 'model': 'linear'}
training on fold 0


KeyboardInterrupt: 
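
The internals of `training.perform_hyperparameter_search` are not shown here. A grid search typically enumerates every combination of the listed values, and the sketch below does that with `itertools.product`. Iterating over sorted keys is an assumption, chosen because it reproduces the key order in the `eid 0` log line; `expand_grid` is a hypothetical helper, not part of `training.py`.

```python
from itertools import product

param_grid = {
    'dropout': [0.1, 0.2],
    'batch_size': [1, 4, 8],
    'max_epochs': [10],
    'lr': [1e-05, 3e-05, 5e-05, 1e-04],
    'model': ['linear'],
}


def expand_grid(grid):
    """Yield one dict per combination of hyperparameter values."""
    keys = sorted(grid)  # assumption: keys are visited in sorted order
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


combos = list(expand_grid(param_grid))
print(len(combos))   # 2 * 3 * 1 * 4 * 1 combinations
print(combos[0])
```

With this grid there are 24 combinations, each of which would be trained once per fold, so narrowing the grid is the quickest way to shorten a run.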

# Test the trained model on the held-out dataset

In [6]:
# Get a test iterator
test_iterator = training.get_iterator(test_dataset, 8, device)

In [8]:
# Load the best saved model.
bert = DistilBertModel.from_pretrained(constants.WEIGHTS_NAME)
model = models.BERTLinear(bert, constants.OUTPUT_DIM, 0.2)
# map_location lets weights saved on a GPU load on a CPU-only machine.
model.load_state_dict(torch.load("1_best_valid_loss.pt", map_location=device))
model.to(device)
model.eval()
# If you change the criterion, make sure it matches the training criterion in training.py.
# reduction='sum' replaces the deprecated size_average=False.
criterion = nn.MSELoss(reduction='sum')
criterion = criterion.to(device)
test_loss, test_corr = training.evaluate(model, test_iterator, criterion)
print(test_loss)
print(test_corr)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


2.5071848928928375
0.6903010024542606
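
How `training.evaluate` aggregates these two numbers is defined in `training.py` and not shown here. As a sketch of what they may represent, the code below computes a summed squared error (the criterion above sums rather than averages) and a Pearson correlation between predictions and targets. Treating `test_corr` as Pearson's r is an assumption, and the toy values are made up for illustration.

```python
import math


def summed_squared_error(preds, targets):
    """Sum of squared differences, i.e. an MSE-style loss with sum reduction."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values, not taken from the dataset.
toy_preds = [1.0, 2.0, 3.0, 4.0]
toy_targets = [1.5, 2.5, 2.5, 4.5]

toy_loss = summed_squared_error(toy_preds, toy_targets)
toy_corr = pearson_r(toy_preds, toy_targets)
print(toy_loss, toy_corr)
```

A summed loss grows with the number of examples it covers, so it is only comparable across runs that use the same batch size and aggregation; the correlation does not have that dependence.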


# Miscellaneous notes

Link to the trainer class: https://huggingface.co/transformers/main_classes/trainer.html



Default training arguments: https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments

Batch size per device: 8

Epochs: 3



This should be the model I used to generate my initial results: https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforsequenceclassification

> "DistilBert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks."