### Coursework coding instructions (please also see full coursework spec)

Please choose if you want to do either Task 1 or Task 2. You should write your report about one task only.

For the task you choose you will need to do two approaches:
  - Approach 1, which can use pre-trained embeddings / models
  - Approach 2, which should not use any pre-trained embeddings or models
We should be able to run both approaches from the same Colab file.

#### Running your code:
  - Your models should run automatically when running your colab file without further intervention
  - For each task you should automatically output the performance of both models
  - Your code should automatically download any libraries required

#### Structure of your code:
  - You are expected to use the 'train', 'eval' and 'model_performance' functions, although you may edit these as required
  - Otherwise there are no restrictions on what you can do in your code

#### Documentation:
  - You are expected to produce a .README file summarising how you have approached both tasks

#### Reproducibility:
  - Your .README file should explain how to replicate the different experiments mentioned in your report

Good luck! We are really looking forward to seeing your reports and your model code!

In [1]:
# You will need to download any word embeddings required for your code, e.g.:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

In [1]:
!pip install transformers
!pip install torch
!pip install skorch

In [2]:
# Imports
import re
import torch
import torch.nn as nn
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from torch.utils.data import Dataset, random_split
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from transformers import BertTokenizer, RobertaTokenizer, AdamW, BertConfig
from transformers import BertModel, BertForSequenceClassification,  RobertaModel
from transformers import get_linear_schedule_with_warmup
import codecs
import gc 

In [3]:
# Setting random seed and device
SEED = 1

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

np.random.seed(SEED)

use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
if use_cuda:
  print("Using GPU.")
else:
  print("Using CPU.")

Using GPU.


In [4]:
# Load data

# Quote the URLs so the shell does not treat '&' as a background operator
!wget -O train.csv "https://drive.google.com/u/0/uc?id=1UgrdjcHHZmAthjusQDAKoSqd37up-41f&export=download"
!wget -O dev.csv "https://drive.google.com/u/0/uc?id=1rY6A0cN_cxAMK3aMHlTFWxhbcLFomvQL&export=download"

train_df = pd.read_csv('./train.csv')
test_df = pd.read_csv('./dev.csv')

# Training and Evaluation Helpers
These helper functions define the training and evaluation cycle. `train` and `eval` handle models that take the usual sentence + grade input, while `train_bert` is a specialised loop for models that need additional inputs (e.g. the attention mask that BERT expects).

In [5]:
# We define our training loop
def train(train_iter, dev_iter, model, number_epoch, loss_fn):
  """
  Training loop for the model, which calls on eval to evaluate after each epoch
  """
  training_stats = []
  print("Training model.")

  for epoch in range(1, number_epoch+1):
    model.train()
    epoch_loss = 0
    epoch_sse = 0
    no_observations = 0  # Observations used for training so far

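    for feature, target in train_iter:
      # Move the batch to the model's device (mirrors what eval does)
      feature, target = feature.to(device), target.to(device)

      # for RNN:
      model.batch_size = target.shape[0]
      no_observations = no_observations + target.shape[0]
      model.hidden = model.init_hidden()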

      predictions = model(feature).squeeze(1)
      optimizer.zero_grad()
      loss = loss_fn(predictions, target)
      sse, __ = model_performance(predictions.detach().cpu().numpy(), target.detach().cpu().numpy())
      loss.backward()
      optimizer.step()

      epoch_loss += loss.item()*target.shape[0]
      epoch_sse += sse

      loss = loss.detach().cpu().item()
      del loss
      gc.collect()

    valid_loss, valid_mse, __, __ = eval(dev_iter, model)
    epoch_loss, epoch_mse = epoch_loss / no_observations, epoch_sse / no_observations

    print('| Epoch: %.2d | Train Loss: %.4f | Train RMSE: %.4f | Val. Loss: %.4f | Val. RMSE: %.4f |' % (
      epoch, epoch_loss, epoch_mse**0.5, valid_loss, valid_mse**0.5
    ))

    training_stats.append({
            'epoch': epoch,
            'train_loss': epoch_loss,
            'train_rmse': epoch_mse**0.5,
            'val_loss': valid_loss,
            'val_rmse': valid_mse**0.5
    })

  return training_stats

In [6]:
# Train function specifically for Bert models (includes attention mask and token types during forward)
def train_bert(train_loader, val_loader, model, epochs):
  """
  Training loop for the Bert model. Each batch holds (input ids, labels); the attention mask
  is derived from the input ids, treating id 0 as padding.
  """
  training_stats = []
  print("Training model.")

  for epoch_i in range(epochs):
    total_train_loss = 0
    model.train()

    x_pred = np.array([])
    x_true = np.array([])

    # Training Loop
    for step, batch in enumerate(train_loader):
      b_input_ids = batch[0].to(device)
      b_labels = batch[1].to(device)

      model.zero_grad()
      out = model(b_input_ids, token_type_ids=None, attention_mask=((b_input_ids != 0).type(torch.long)), labels=b_labels)
      loss = out.loss
      logits = out.logits

      total_train_loss += loss.item()
      loss.backward()

      loss = loss.detach().cpu().item()
      del loss
      gc.collect()

      # Log predicted and true for MSE & RMSE
      logits = logits.detach().cpu().numpy()
      label_ids = b_labels.cpu().numpy()
      x_pred = np.append(x_pred, logits)
      x_true = np.append(x_true, label_ids)

      # Clip the norm of the gradients to 1.0 to help prevent the "exploding gradients" problem
      torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
      optimizer.step()
      scheduler.step() # Linear scheduler is based on steps rather than epochs

    avg_train_loss = total_train_loss / len(train_loader)
    train_rmse = np.sqrt(mean_squared_error(x_true, x_pred))  # RMSE without relying on the 'squared' argument

    # Validation 
    model.eval()

    # Tracking variables 
    total_eval_accuracy = 0
    total_eval_loss = 0
    nb_eval_steps = 0

    y_pred = np.array([])
    y_true = np.array([])

    # Evaluate data for one epoch
    for batch in val_loader:
      b_input_ids = batch[0].to(device)
      b_labels = batch[1].to(device)
        
      # Forward pass, calculate logit predictions
      with torch.no_grad():        
        # The model output object has no .cpu() method; its tensors are moved to the CPU individually below
        out = model(b_input_ids, token_type_ids=None, attention_mask=((b_input_ids != 0).type(torch.long)), labels=b_labels)
        loss = out.loss
        logits = out.logits
        total_eval_loss += loss.item()

        loss = loss.detach().cpu().item()
        del loss
        gc.collect()

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.cpu().numpy()
        y_pred = np.append(y_pred,logits)
        y_true = np.append(y_true,label_ids)
        
    val_rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # RMSE without relying on the 'squared' argument
    avg_val_loss = total_eval_loss / len(val_loader)
    
    print('| Epoch: %.2d | Train Loss: %.4f | Train RMSE: %.4f | Val. Loss: %.4f | Val. RMSE: %.4f |' % (
          epoch_i + 1, avg_train_loss, train_rmse, avg_val_loss, val_rmse
    ))

    training_stats.append({
            'epoch': epoch_i + 1,
            'train_loss': avg_train_loss,
            'train_rmse': train_rmse,
            'val_loss': avg_val_loss,
            'val_rmse': val_rmse
    })
  return training_stats
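
The gradient clipping step in `train_bert` rescales all gradients together whenever their combined L2 norm exceeds the threshold. Below is a minimal, framework-free sketch of that rule using plain lists in place of parameter gradients. The function name `clip_by_global_norm` and the sample values are illustrative assumptions, not part of the notebook.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale every gradient by one shared factor so the global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # The factor is capped at 1.0, so small gradients are left untouched
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm

# Two hypothetical "parameters"; global norm = sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

# A gradient that is already small enough is returned unchanged
small, _ = clip_by_global_norm([[0.3, 0.4]], max_norm=1.0)
print(norm_before, norm_after, small)
```

Because a single factor is shared across all parameters, the direction of the update is preserved and only its length changes.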

In [7]:
# We evaluate performance on our dev set
def eval(data_iter, model):
  """
  Evaluating model performance on the dev set
  """
  model.eval()
  epoch_loss = 0
  epoch_sse = 0
  pred_all = []
  trg_all = []
  no_observations = 0

  with torch.no_grad():
    for batch in data_iter:
      feature, target = batch
      feature, target = feature.to(device), target.to(device)

      # for RNN:
      model.batch_size = target.shape[0]
      no_observations = no_observations + target.shape[0]
      model.hidden = model.init_hidden()

      predictions = model(feature).squeeze(1)
      loss = loss_fn(predictions, target)

      # We get the mse
      pred, trg = predictions.detach().cpu().numpy(), target.detach().cpu().numpy()
      sse, __ = model_performance(pred, trg)

      epoch_loss += loss.item()*target.shape[0]
      epoch_sse += sse
      pred_all.extend(pred)
      trg_all.extend(trg)

  return epoch_loss/no_observations, epoch_sse/no_observations, np.array(pred_all), np.array(trg_all)

In [8]:
# How we print the model performance
def model_performance(output, target, print_output=False):
  """
  Returns SSE and MSE per batch (printing the MSE and the RMSE)
  """
  sq_error = (output - target)**2
  sse = np.sum(sq_error)
  mse = np.mean(sq_error)
  rmse = np.sqrt(mse)

  if print_output:
    print(f'| MSE: {mse:.2f} | RMSE: {rmse:.2f} |')

  return sse, mse
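
`train` and `eval` accumulate the SSE returned by `model_performance` and divide by the total number of observations, rather than averaging per-batch MSEs. The two only agree when every batch has the same size. A small worked sketch with plain lists (the helper `batch_sse` and the numbers are made up for illustration):

```python
def batch_sse(pred, target):
    """Sum of squared errors for one batch."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

batches = [
    ([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]),  # 3 observations, SSE = 0 + 1 + 4 = 5
    ([2.0], [0.0]),                      # 1 observation,  SSE = 4
]

total_sse = sum(batch_sse(p, t) for p, t in batches)
n_obs = sum(len(t) for _, t in batches)

epoch_mse = total_sse / n_obs            # 9 / 4: every observation weighted equally
epoch_rmse = epoch_mse ** 0.5

# Averaging per-batch MSEs over-weights the small final batch: (5/3 + 4/1) / 2
mean_of_batch_mses = sum(batch_sse(p, t) / len(t) for p, t in batches) / len(batches)

print(epoch_mse, epoch_rmse, mean_of_batch_mses)
```

Note that `train_bert` divides its summed batch losses by `len(train_loader)`, so its reported loss is a mean of batch means and can differ slightly when the last batch is smaller.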

# Helper Functions

Helper functions for preprocessing or during model training.

In [9]:
def create_vocab(data):
  """
  Creating a corpus of all the tokens used
  """
  tokenized_corpus = [] # Let us put the tokenized corpus in a list

  for sentence in data:
    tokenized_sentence = []

    for token in sentence.split(' '): # Simplest tokenisation: split on single spaces
      tokenized_sentence.append(token)

    tokenized_corpus.append(tokenized_sentence)

  # Create single list of all vocabulary
  vocabulary = []  # Let us put all the tokens (mostly words) appearing in the vocabulary in a list

  for sentence in tokenized_corpus:
    for token in sentence:
      if token not in vocabulary:
          vocabulary.append(token)

  return vocabulary, tokenized_corpus
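
`create_vocab` checks membership in a list, which becomes slow as the vocabulary grows. One possible alternative, sketched here under the assumption that first-seen token order should be preserved, deduplicates through dictionary keys instead. The name `build_vocab` is hypothetical and not used elsewhere in the notebook.

```python
def build_vocab(sentences):
    """Tokenise on single spaces and return (vocabulary, tokenized_corpus)."""
    tokenized_corpus = [sentence.split(' ') for sentence in sentences]
    # dict keys keep insertion order (Python 3.7+), so this removes duplicates
    # while keeping tokens in the order they were first seen
    vocabulary = list(dict.fromkeys(tok for sent in tokenized_corpus for tok in sent))
    return vocabulary, tokenized_corpus

vocab, corpus = build_vocab(["the cat sat", "the dog sat down"])
print(vocab)
print(corpus)
```

For these inputs the result has the same contents and order as `create_vocab` would produce.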

In [10]:
def get_sentences(df, og_label='original', edit_label='edit'):
  """
  Extract the original and new sentences + words from a dataframe
  """
  p = r"<(.*)\/>"
  replace_regex = re.compile(p)
  og_word = []
  new_word = []
  og_sentences = []
  new_sentences = []

  for s, w in df[[og_label, edit_label]].itertuples(index=False,name=None):
    m = replace_regex.search(str(s)) # Get the word to replace

    assert m is not None, "Couldn't regex match the replacement word"

    og_word.append(m.group(1))
    new_word.append(w)
    og_sentences.append(replace_regex.sub( m.group(1), s))
    new_sentences.append(replace_regex.sub(w, s))
  
  return og_sentences, new_sentences, og_word, new_word
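
To make the substitution in `get_sentences` concrete, here is a small sketch on a single made-up row. The headline and edit word are hypothetical, and the replacement is passed as a function so that the word is inserted literally; this is a slight variation on the notebook code, which passes the string directly.

```python
import re

pattern = re.compile(r"<(.*)/>")

headline = "Senate votes to <repeal/> the bill"  # hypothetical 'original' column value
edit = "frame"                                   # hypothetical 'edit' column value

match = pattern.search(headline)
og_word = match.group(1)

# Strip the markup to recover the original sentence, then swap in the edit word
og_sentence = pattern.sub(lambda _: og_word, headline)
new_sentence = pattern.sub(lambda _: edit, headline)

print(og_word)
print(og_sentence)
print(new_sentence)
```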


In [11]:
def softmax_mask(batch, mask):
    normalizing_mask = torch.Tensor([[float('-inf') if token == 0 else 0 for token in entry] for entry in mask]).to(device)
    return torch.nn.functional.softmax(batch + normalizing_mask, dim=-1)

def padd_mask(batch):
    return torch.Tensor([[0 if token == 0 else 1 for token in entry] for entry in batch]).to(device)

def collate_fn_padd(batch):
  '''
  We add padding to our minibatches and create tensors for our model
  '''
  batch_labels = [l for f, l in batch]
  batch_features = [f for f, l in batch]
  batch_features_len = [len(f) for f, l in batch]
  seq_tensor = torch.zeros((len(batch), max(batch_features_len))).long()

  for idx, (seq, seqlen) in enumerate(zip(batch_features, batch_features_len)):
    seq_tensor[idx, :seqlen] = torch.LongTensor(seq)

  batch_labels = torch.FloatTensor(batch_labels)
  return seq_tensor, batch_labels

class Task1Dataset(Dataset):
  def __init__(self, train_data, labels):
    self.x_train = train_data
    self.y_train = labels

  def __len__(self):
    return len(self.y_train)

  def __getitem__(self, item):
    return self.x_train[item], self.y_train[item]
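
`collate_fn_padd` and `padd_mask` both rely on id 0 being reserved for padding. A framework-free sketch of the same idea on plain lists (the helper name `pad_batch` and the token ids are illustrative assumptions):

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build a 0/1 mask of real tokens."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[0 if tok == pad_id else 1 for tok in row] for row in padded]
    return padded, mask

padded, mask = pad_batch([[5, 2, 9], [7], [4, 4]])
print(padded)
print(mask)
```

The mask is only correct if no real token is ever assigned id 0, which is why the vocabulary indices for the GloVe pipeline start at 1.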

In [12]:
def encode_edited(edited_sentence, grades, tokenizer, pad_len=None):
  pad_len = max([len(i) for i in edited_sentence]) if pad_len is None else pad_len
  encoded_data = []
  attention_data = []

  for sentence in edited_sentence:
    encoded = tokenizer.encode_plus(sentence, padding='max_length', max_length=pad_len, return_tensors='pt')
    encoded_data.append(encoded['input_ids'])

  # Split train dataset to train and validation sets
  train_data = torch.cat(encoded_data).to(device)
  encoded_grades  = torch.FloatTensor(grades).to(device)

  train_val_dataset = TensorDataset(train_data, encoded_grades)

  num_train = round(len(train_val_dataset) * train_proportion)
  num_val = len(train_val_dataset) - num_train
  train_dataset, val_dataset = random_split(train_val_dataset, (num_train, num_val))

  # Create dataloaders from the tokenised embeddings
  train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=batch_size)
  val_dataloader = DataLoader(val_dataset, sampler=RandomSampler(val_dataset), batch_size=batch_size)

  return train_dataloader, val_dataloader

def encode_both(og_sentence, edited_sentence, grades, tokenizer, with_attention=False, pad_len=None):
  pad_len = max([len(i) for i in edited_sentence]) if pad_len is None else pad_len
  encoded_data = []
  attention_data = []

  for og, new in zip(og_sentence, edited_sentence):
    encoded = tokenizer.encode_plus(og, text_pair=new, padding='max_length', max_length=pad_len, return_tensors='pt')
    encoded_data.append(encoded['input_ids'])

    if with_attention:
      attention_data.append(encoded['attention_mask'])

  # Split train dataset to train and validation sets
  train_data = torch.cat(encoded_data).to(device)
  encoded_grades  = torch.FloatTensor(grades).to(device)

  if with_attention:
    attention_data = torch.cat(attention_data).to(device)
    train_val_dataset = TensorDataset(train_data, attention_data, encoded_grades)
  else:
    train_val_dataset = TensorDataset(train_data, encoded_grades)

  num_train = round(len(train_val_dataset) * train_proportion)
  num_val = len(train_val_dataset) - num_train
  train_dataset, val_dataset = random_split(train_val_dataset, (num_train, num_val))

  # Create dataloaders from the tokenised embeddings
  train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=batch_size)
  val_dataloader = DataLoader(val_dataset, sampler=RandomSampler(val_dataset), batch_size=batch_size)

  return train_dataloader, val_dataloader
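
Both encoders split the data with `round(len(dataset) * train_proportion)` training items and put the remainder in validation. The sketch below reproduces that arithmetic with the standard library only. `split_indices` is a hypothetical helper; the notebook itself uses `random_split`, whose shuffling depends on the PyTorch seed set earlier.

```python
import random

def split_indices(n_items, train_proportion, seed=1):
    """Return shuffled (train, validation) index lists with a reproducible split."""
    num_train = round(n_items * train_proportion)
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)   # local RNG, so the split is repeatable
    return indices[:num_train], indices[num_train:]

train_idx, val_idx = split_indices(n_items=10, train_proportion=0.8)
print(len(train_idx), len(val_idx))
```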

# Models

Below are the model architectures that were tested at some point during development.

In [13]:
# Skeleton BiLSTM model
class BiLSTM(nn.Module):
  def __init__(self, embedding_dim, hidden_dim, vocab_size, batch_size, device):
    super(BiLSTM, self).__init__()
    self.hidden_dim = hidden_dim
    self.embedding_dim = embedding_dim
    self.device = device
    self.batch_size = batch_size
    self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

    # The LSTM takes word embeddings as inputs, and outputs hidden states
    # with dimensionality hidden_dim.
    self.lstm = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True)

    # The linear layer that maps from hidden state space to tag space
    self.hidden2label = nn.Linear(hidden_dim * 2, 1)
    self.hidden = self.init_hidden()

  def init_hidden(self):
    # Before we've done anything, we don't have any hidden state.
    # Refer to the Pytorch documentation to see exactly why they have this dimensionality.
    # The axes semantics are (num_layers * num_directions, minibatch_size, hidden_dim)
    return torch.zeros(2, self.batch_size, self.hidden_dim).to(self.device), \
            torch.zeros(2, self.batch_size, self.hidden_dim).to(self.device)

  def forward(self, sentence):
    embedded = self.embedding(sentence)
    embedded = embedded.permute(1, 0, 2)

    lstm_out, self.hidden = self.lstm(
        embedded.view(len(embedded), self.batch_size, self.embedding_dim), self.hidden)

    out = self.hidden2label(lstm_out[-1])
    return out

In [14]:
# RoBERTa
class RoBERTa(nn.Module):
  def __init__(self, embedding_dim, hidden_dim, batch_size, device, roberta_pretrained='roberta-base'):
    super(RoBERTa, self).__init__()
    self.hidden_dim = hidden_dim
    self.embedding_dim = embedding_dim
    self.device = device
    self.batch_size = batch_size
    self.roberta = RobertaModel.from_pretrained(roberta_pretrained)

    # The LSTM takes word embeddings as inputs, and outputs hidden states with hidden_dim dimensions
    self.lstm = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True, batch_first=True)
    self.hidden2label = nn.Linear(hidden_dim * 2, 1)
    self.hidden = self.init_hidden()

  def init_hidden(self):
    # The axes semantics are (num_layers * num_directions, minibatch_size, hidden_dim)
    return torch.zeros(2, self.batch_size, self.hidden_dim).to(self.device), \
            torch.zeros(2, self.batch_size, self.hidden_dim).to(self.device)

  def get_embedded(self, sentence):
    # Returns last hidden layer and pooled output (Linear & tanh) (we're only interested in the hidden layer)
    outputs = self.roberta(sentence)
    last_hidden_states = outputs[0]
    return last_hidden_states

  def forward(self, sentence):
    embedded = self.get_embedded(sentence)

    # Do I still need this if I'm also training the BERT model?
    lstm_out, self.hidden = self.lstm(embedded, self.hidden)
    
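    # With batch_first=True, lstm_out is (batch_size, max_len, directions * hidden_dim),
    # so the last time step of every example is lstm_out[:, -1]
    out = self.hidden2label(lstm_out[:, -1])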
    return out
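
With `batch_first=True`, the LSTM output has shape `(batch_size, max_len, directions * hidden_dim)`, so indexing along the first axis selects an example rather than a time step. The short sketch below, with arbitrary small dimensions chosen only for illustration, shows the difference between the two ways of indexing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, emb, hidden = 3, 5, 8, 4

lstm = nn.LSTM(emb, hidden, bidirectional=True, batch_first=True)
x = torch.randn(batch, seq_len, emb)
out, (h_n, c_n) = lstm(x)                 # out: (batch, seq_len, 2 * hidden)

all_steps_of_last_example = out[-1]       # (seq_len, 2 * hidden): one example only
last_step = out[:, -1]                    # (batch, 2 * hidden): one row per example

# The forward half of the last step matches the forward direction's final hidden state
forward_matches = torch.allclose(h_n[0], out[:, -1, :hidden])
print(tuple(out.shape), tuple(all_steps_of_last_example.shape), tuple(last_step.shape))
```

With padded batches the last position may be a padding token, so masking or pooling over time steps may be a better summary than the final step.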

In [15]:
# Bert
class BERT(nn.Module):
  def __init__(self, embedding_dim, hidden_dim, 
                batch_size, device, bert_model=None, bert_pretrained='bert-base-uncased'):
    super(BERT, self).__init__()
    self.hidden_dim = hidden_dim
    self.embedding_dim = embedding_dim
    self.device = device
    self.batch_size = batch_size
    self.bert = BertModel.from_pretrained(bert_pretrained).to(device) if bert_model is None else bert_model

    # The LSTM takes word embeddings as inputs, and outputs hidden states with hidden_dim dimensions
    self.lstm = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True, batch_first=True)
    self.hidden2label = nn.Linear(hidden_dim * 2, 1)
    self.hidden = self.init_hidden()

    # Attention Layer?
    self.attn_linear = nn.Linear(hidden_dim * 2, 1)

  def init_hidden(self):
    # The axes semantics are (num_layers * num_directions, minibatch_size, hidden_dim)
    return torch.zeros(2, self.batch_size, self.hidden_dim).to(self.device), \
            torch.zeros(2, self.batch_size, self.hidden_dim).to(self.device)

  def get_embedded(self, sentence, grad=True):
    mask = (sentence != 0).type(torch.long)
    # Run the text through BERT, tracking gradients only when grad is True
    if grad:
      out = self.bert(sentence, mask)
    else:
      with torch.no_grad():
        out = self.bert(sentence, mask)

    # A bare BertModel exposes 'last_hidden_state'; otherwise fall back to the
    # final entry of 'hidden_states' (available with output_hidden_states=True)
    if 'last_hidden_state' in out:
      last_hidden_states = out['last_hidden_state']
    elif 'hidden_states' in out:
      last_hidden_states = out['hidden_states'][-1]
    else:
      raise ValueError("Couldn't extract hidden state from the BERT output")

    return last_hidden_states.type(torch.float32)

  def forward(self, sentence):
    mask = (sentence != 0).type(torch.long)
    # Token embeddings from BERT; BERT base gives a 768-dimensional vector per token
    # (batch_size, max_len) -> (batch_size, max_len, 768)
    embedded = self.get_embedded(sentence)

    # (batch_size, max_len, 768) -> (batch_size, max_len, directions * hidden_dim)
    lstm_out, self.hidden = self.lstm(embedded, self.hidden)

    # Attention mechanism
    # Score each token with a learned linear projection
    # (batch_size, max_len, directions * hidden_dim) -> (batch_size, max_len)
    att_out = self.attn_linear(lstm_out).squeeze(-1)
    # Normalise the scores into attention weights, giving padding tokens zero weight
    # (batch_size, max_len) -> (batch_size, max_len)
    att_out = softmax_mask(att_out, mask)
    # Sentence vector: attention-weighted sum of the token hidden states
    # (batch_size, max_len, directions * hidden_dim) -> (batch_size, directions * hidden_dim)
    att_out = torch.sum(att_out.unsqueeze(-1) * lstm_out, dim=1)

    out = self.hidden2label(att_out)
    return out
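
The attention pooling in `BERT.forward` can be traced by hand on a tiny example. This is a framework-free sketch with made-up scores and two-dimensional hidden states; `masked_softmax` and `attention_pool` are illustrative names, and the sketch assumes at least one position in each sentence is unmasked.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over positions where mask == 1; masked positions get weight 0."""
    shifted = [s if m else float('-inf') for s, m in zip(scores, mask)]
    peak = max(shifted)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in shifted] # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, scores, mask):
    """Weighted sum of per-token hidden states using masked attention weights."""
    weights = masked_softmax(scores, mask)
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return pooled, weights

hidden_states = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]  # last row is a padding position
scores = [0.0, 0.0, 5.0]                              # padding has the highest raw score
mask = [1, 1, 0]

pooled, weights = attention_pool(hidden_states, scores, mask)
print(weights, pooled)
```

Even though the padding position has the largest raw score, masking drives its weight to zero, so it contributes nothing to the sentence vector.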

# Initialisation + Preprocessing
Prepare the data for training, assign hyperparameters, initialise the models.

### Prepare Bert and RoBERTa specific Preprocessing

In [16]:
# Hyperparameters
bert_type = 'bert-base-uncased'
rob_type  = 'roberta-base'
pad_len = 64

e_dim = 768
h_dim = 50

batch_size = 32
train_proportion = 0.8

In [17]:
# We set our training data and test data
training_data = train_df
test_data = test_df

# Parse training and test data to tuple of original sentences, new sentences, etc.
x_og, x_new, _, _ = get_sentences(training_data)
y_og, y_new, _, _ = get_sentences(test_data)
x_grades = train_df['meanGrade']

# Get Tokenizer for preprocessing
bert_tokenizer = BertTokenizer.from_pretrained(bert_type, do_lower_case=True)
rob_tokenizer = RobertaTokenizer.from_pretrained(rob_type)

# Get the dataloaders
bert_single_train_loader, bert_single_val_loader = encode_edited(x_new, x_grades, bert_tokenizer)
bert_both_train_loader, bert_both_val_loader = encode_both(x_og, x_new, x_grades, bert_tokenizer)
robert_single_train_loader, robert_single_val_loader = encode_edited(x_new, x_grades, rob_tokenizer)


### Prepare Glove Preprocessing

In [20]:
# Glove embedding based data loading
# Creating word vectors
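# create_vocab iterates over sentences, so pass the sentence column rather than
# the whole DataFrame (iterating a DataFrame yields its column names)
training_vocab, training_tokenized_corpus = create_vocab(training_data['original'])
test_vocab, test_tokenized_corpus = create_vocab(test_data['original'])

# Creating joint vocab from test and train:
joint_vocab, joint_tokenized_corpus = create_vocab(pd.concat([training_data['original'], test_data['original']]))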

# We create representations for our tokens
wvecs = [] # word vectors
word2idx = [] # word2index
idx2word = []

# This is a large file, it will take a while to load in the memory!
with codecs.open('glove.6B.100d.txt', 'r','utf-8') as f:
  index = 1
  for line in f.readlines():
    # Ignore the first line - first line typically contains vocab, dimensionality
    if len(line.strip().split()) > 3:
      word = line.strip().split()[0]
      if word in joint_vocab:
          (word, vec) = (word,
                     list(map(float,line.strip().split()[1:])))
          wvecs.append(vec)
          word2idx.append((word, index))
          idx2word.append((index, word))
          index += 1

wvecs = np.array(wvecs)
word2idx = dict(word2idx)
idx2word = dict(idx2word)

INPUT_DIM = len(word2idx)
vectorized_seqs = [[word2idx[tok] for tok in seq if tok in word2idx] for seq in training_tokenized_corpus]

# To avoid any sentences being empty (if no words match to our word embeddings)
feature = [x if len(x) > 0 else [0] for x in vectorized_seqs]

# 'feature' is a list of lists, each containing embedding IDs for word tokens
train_and_dev = Task1Dataset(feature, train_df['meanGrade'])

train_examples = round(len(train_and_dev)*train_proportion)
dev_examples = len(train_and_dev) - train_examples

train_dataset, dev_dataset = random_split(train_and_dev, (train_examples, dev_examples))

# batch_size is defined with the other hyperparameters above
train_loader = torch.utils.data.DataLoader(train_dataset, shuffle=True, batch_size=batch_size, collate_fn=collate_fn_padd)
dev_loader = torch.utils.data.DataLoader(dev_dataset, batch_size=batch_size, collate_fn=collate_fn_padd)

FileNotFoundError: ignored
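
The GloVe loading step can be tried without the 822M download by feeding it a few hand-written lines. The sketch below uses an in-memory file with made-up four-dimensional vectors; `load_vectors` is a hypothetical helper that follows the same logic as the cell above, but streams the file line by line and uses a set for the vocabulary lookup.

```python
import io

def load_vectors(lines, vocab):
    """Keep only vectors for words in vocab; indices start at 1 so 0 stays free for padding."""
    word2idx, vectors = {}, []
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) <= 3:          # skip header-like lines (vocab size, dimensionality)
            continue
        word, values = parts[0], parts[1:]
        if word in vocab:
            word2idx[word] = len(word2idx) + 1
            vectors.append([float(v) for v in values])
    return word2idx, vectors

# Stand-in for glove.6B.100d.txt with invented values
fake_glove = io.StringIO(
    "the 0.1 0.2 0.3 0.4\n"
    "cat 0.5 0.6 0.7 0.8\n"
    "dog 0.9 1.0 1.1 1.2\n"
)
word2idx, vectors = load_vectors(fake_glove, vocab={"the", "dog"})

# Out-of-vocabulary tokens are dropped; an empty result falls back to the padding id
ids = [word2idx[tok] for tok in "the purple dog".split(' ') if tok in word2idx] or [0]
empty_ids = [word2idx[tok] for tok in ["purple"] if tok in word2idx] or [0]
print(word2idx, ids, empty_ids)
```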

# Training
Initialise optimisers and start training!

In [23]:
bert_double_pretrained = BertForSequenceClassification.from_pretrained(
    bert_type, 
    num_labels = 1, # Regression model
)

bert_double_pretrained = bert_double_pretrained.to(device) # Move to cuda if possible
bert_double_pretrained = bert_double_pretrained.double() # Cast weights to float64 (num_labels=1 above is what makes this a regressor)
# Optimizer and Scheduler setup
optimizer = AdamW(bert_double_pretrained.parameters(), lr=0.0001)
scheduler = get_linear_schedule_with_warmup(
    optimizer, 
    num_warmup_steps=0, 
    num_training_steps=(3 * len(bert_both_train_loader))
)

results = train_bert(bert_both_train_loader, bert_both_val_loader, bert_double_pretrained, 3)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Training model.


RuntimeError: ignored
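
The scheduler is stepped once per batch, so the learning rate decays over `epochs * len(train_loader)` steps. As far as we understand it, `get_linear_schedule_with_warmup` multiplies the base learning rate by a factor that rises linearly during warmup and then falls linearly to zero. The sketch below is a plain-Python version of that factor; `linear_schedule_factor` is our own name, not a library function.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimiser step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-4
total_steps = 6   # e.g. 3 epochs over a loader with 2 batches

# No warmup: start at the full rate and decay linearly to zero
lrs = [base_lr * linear_schedule_factor(s, 0, total_steps) for s in range(total_steps + 1)]

# With 2 warmup steps the factor climbs 0 -> 0.5 -> 1 before decaying
warmup_factors = [linear_schedule_factor(s, 2, total_steps) for s in range(3)]
print(lrs)
print(warmup_factors)
```

With `num_warmup_steps=0`, as in the cells here, the very first update already uses the full learning rate.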

### Bert with both sentences

In [18]:
## Bert Training with Both Sentences
#=====================================#
# Define run specific hyperparameters #
#=====================================#
lr = 2e-5
epochs = 4
#=====================================#
# Load Pretrained Bert model for double sequence training
bert_double_pretrained = BertForSequenceClassification.from_pretrained(
    bert_type, 
    num_labels = 1,
    output_hidden_states = True
)
bert_double_pretrained = bert_double_pretrained.to(device) # Move to cuda if possible
bert_double_pretrained = bert_double_pretrained.double() # Cast weights to float64 (num_labels=1 above is what makes this a regressor)

# Initialise full model
model = BERT(e_dim, h_dim, batch_size, device, bert_double_pretrained).to(device)

# Optimizer and Scheduler setup
optimizer = AdamW(model.parameters(), lr=lr)
scheduler = get_linear_schedule_with_warmup(
    optimizer, 
    num_warmup_steps=0, 
    num_training_steps=(epochs * len(bert_both_train_loader))
)

print("Model initialised.")

# Define the loss function
loss_fn = nn.MSELoss()
loss_fn = loss_fn.to(device)

results = train(bert_both_train_loader, bert_both_val_loader, model, epochs, loss_fn)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Model initialised.
Training model.
| Epoch: 01 | Train Loss: 0.3317 | Train RMSE: 0.5759 | Val. Loss: 0.2848 | Val. RMSE: 0.5337 |
| Epoch: 02 | Train Loss: 0.2521 | Train RMSE: 0.5021 | Val. Loss: 0.2961 | Val. RMSE: 0.5441 |
| Epoch: 03 | Train Loss: 0.1681 | Train RMSE: 0.4100 | Val. Loss: 0.3228 | Val. RMSE: 0.5682 |
| Epoch: 04 | Train Loss: 0.1109 | Train RMSE: 0.3330 | Val. Loss: 0.3218 | Val. RMSE: 0.5673 |
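
In the log above, training RMSE keeps falling while validation RMSE is lowest after the first epoch, which suggests the model starts to overfit early. The list returned by `train` makes it easy to pick the best epoch afterwards. The sketch below copies the RMSE values from the log; the selection logic itself is an assumption about how one might use them, not something the notebook does.

```python
# RMSE values copied from the training log for the two-sentence BERT run
training_stats = [
    {'epoch': 1, 'train_rmse': 0.5759, 'val_rmse': 0.5337},
    {'epoch': 2, 'train_rmse': 0.5021, 'val_rmse': 0.5441},
    {'epoch': 3, 'train_rmse': 0.4100, 'val_rmse': 0.5682},
    {'epoch': 4, 'train_rmse': 0.3330, 'val_rmse': 0.5673},
]

# Epoch with the lowest validation RMSE
best = min(training_stats, key=lambda row: row['val_rmse'])

# Generalisation gap per epoch: a growing positive gap points to overfitting
gaps = [round(row['val_rmse'] - row['train_rmse'], 4) for row in training_stats]

print(best['epoch'], gaps)
```

Saving a checkpoint whenever the validation RMSE improves would let the final evaluation use the best epoch's weights rather than the last epoch's.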


In [18]:
## Bert Training with Single Sentence
#=====================================#
# Define run specific hyperparameters #
#=====================================#
lr = 2e-5
epochs = 4
#=====================================#
# Initialise full model
model = BERT(e_dim, h_dim, batch_size, device).to(device)

# Optimizer and Scheduler setup
optimizer = AdamW(model.parameters(), lr=lr)
scheduler = get_linear_schedule_with_warmup(
    optimizer, 
    num_warmup_steps=0, 
    num_training_steps=(epochs * len(bert_single_train_loader))
)
print("Model initialised.")

# Define the loss function
loss_fn = nn.MSELoss()
loss_fn = loss_fn.to(device)

results = train(bert_single_train_loader, bert_single_val_loader, model, epochs, loss_fn)

Model initialised.
Training model.
| Epoch: 01 | Train Loss: 0.3373 | Train RMSE: 0.5807 | Val. Loss: 0.2923 | Val. RMSE: 0.5407 |
| Epoch: 02 | Train Loss: 0.2529 | Train RMSE: 0.5029 | Val. Loss: 0.2808 | Val. RMSE: 0.5299 |
| Epoch: 03 | Train Loss: 0.1750 | Train RMSE: 0.4184 | Val. Loss: 0.3125 | Val. RMSE: 0.5590 |
| Epoch: 04 | Train Loss: 0.1152 | Train RMSE: 0.3394 | Val. Loss: 0.3220 | Val. RMSE: 0.5674 |
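
Both BERT cells above call `get_linear_schedule_with_warmup` with `num_warmup_steps=0`. As far as I understand it, the scheduler multiplies the base learning rate by a factor that rises linearly during warmup and then decays linearly to zero. The sketch below re-implements that factor in plain Python for illustration only; `linear_schedule_multiplier` is a hypothetical name, and the 100 batches per epoch are an assumed value, not the real loader length.

```python
def linear_schedule_multiplier(step, num_training_steps, num_warmup_steps=0):
    """Learning-rate factor: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5            # same value as `lr` in the cell above
total_steps = 4 * 100     # 4 epochs x an assumed 100 batches per epoch

for step in (0, 200, 400):
    factor = linear_schedule_multiplier(step, total_steps)
    print(f"step {step:3d}: lr = {base_lr * factor:.2e}")
```

With no warmup the learning rate starts at its full value on the first step and reaches zero on the last one.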


In [22]:
## Roberta Training
#=====================================#
# Define run specific hyperparameters #
#=====================================#
lr = 0.001
epochs = 3
#=====================================#
model = RoBERTa(e_dim, h_dim, batch_size, device).to(device)

# Pass the run-specific learning rate explicitly instead of relying on Adam's default
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

print("Model initialised.")

# Define the loss function
loss_fn = nn.MSELoss()
loss_fn = loss_fn.to(device)

print("Model initialised.")
results = train(robert_single_train_loader, robert_single_val_loader, model, epochs, loss_fn)

Model initialised.
Model initialised.
Training model.


  return F.mse_loss(input, target, reduction=self.reduction)


RuntimeError: ignored

In [None]:
## Approach 1 code, using functions defined above:

#=====================================#
# Define run specific hyperparameters #
#=====================================#
lr = 0.001
epochs = 6
#=====================================#

model = BiLSTM(e_dim, h_dim, INPUT_DIM, batch_size, device)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
model.to(device)

# We provide the model with our embeddings
model.embedding.weight.data.copy_(torch.from_numpy(wvecs))

# Define the loss function
loss_fn = nn.MSELoss()
loss_fn = loss_fn.to(device)

print("Model initialised.")
results = train(train_loader, dev_loader, model, epochs, loss_fn)

Vocab created.
Model initialised.
Dataloaders created.
Training model.
| Epoch: 01 | Train Loss: 0.36 | Train MSE: 0.36 | Train RMSE: 0.60 |         Val. Loss: 0.34 | Val. MSE: 0.34 |  Val. RMSE: 0.58 |
| Epoch: 02 | Train Loss: 0.34 | Train MSE: 0.34 | Train RMSE: 0.59 |         Val. Loss: 0.34 | Val. MSE: 0.34 |  Val. RMSE: 0.58 |
| Epoch: 03 | Train Loss: 0.34 | Train MSE: 0.34 | Train RMSE: 0.59 |         Val. Loss: 0.34 | Val. MSE: 0.34 |  Val. RMSE: 0.58 |
| Epoch: 04 | Train Loss: 0.34 | Train MSE: 0.34 | Train RMSE: 0.58 |         Val. Loss: 0.35 | Val. MSE: 0.35 |  Val. RMSE: 0.59 |
| Epoch: 05 | Train Loss: 0.33 | Train MSE: 0.33 | Train RMSE: 0.57 |         Val. Loss: 0.34 | Val. MSE: 0.34 |  Val. RMSE: 0.58 |
| Epoch: 06 | Train Loss: 0.28 | Train MSE: 0.28 | Train RMSE: 0.53 |         Val. Loss: 0.35 | Val. MSE: 0.35 |  Val. RMSE: 0.59 |
| Epoch: 07 | Train Loss: 0.26 | Train MSE: 0.26 | Train RMSE: 0.51 |         Val. Loss: 0.38 | Val. MSE: 0.38 |  Val. RMSE: 0.62 |
| Epo

#### Approach 2: No pre-trained representations

In [None]:
train_and_dev = train_df['edit']

training_data, dev_data, training_y, dev_y = train_test_split(train_df['edit'], train_df['meanGrade'],
                                                                        test_size=(1-train_proportion),
                                                                        random_state=42)

# We train a Tf-idf model
count_vect = CountVectorizer(stop_words='english')
train_counts = count_vect.fit_transform(training_data)
transformer = TfidfTransformer().fit(train_counts)
train_counts = transformer.transform(train_counts)
regression_model = LinearRegression().fit(train_counts, training_y)

# Train predictions
predicted_train = regression_model.predict(train_counts)

# Re-fit the Tf-idf weights on train and dev combined, then validate the model on dev.
# Note: these IDF weights differ from the ones the regression model was trained with.
train_and_dev_counts = count_vect.transform(train_and_dev)
transformer = TfidfTransformer().fit(train_and_dev_counts)

test_counts = count_vect.transform(dev_data)
test_counts = transformer.transform(test_counts)

# Dev predictions
predicted = regression_model.predict(test_counts)

# We run the evaluation:
print("\nTrain performance:")
sse, mse = model_performance(predicted_train, training_y, True)

print("\nDev performance:")
sse, mse = model_performance(predicted, dev_y, True)


Train performance:
| MSE: 0.13 | RMSE: 0.37 |

Dev performance:
| MSE: 0.36 | RMSE: 0.60 |
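
The cell above re-fits `TfidfTransformer` on train and dev before scoring dev, so the dev features use different IDF weights from the ones `LinearRegression` saw during training. One possible alternative, sketched here on made-up toy sentences and grades (not the coursework data), is to wrap the vectoriser and the regressor in a scikit-learn pipeline so the IDF weights are learned once, on the training split only. `TfidfVectorizer` is used here as a stand-in for `CountVectorizer` followed by `TfidfTransformer`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy data for illustration only (assumed, not taken from train_df)
train_texts = ["funny cat", "sad news", "funny dog", "boring report"]
train_y = [2.0, 0.5, 1.8, 0.2]
dev_texts = ["funny report", "sad cat"]

# The pipeline fits vocabulary, IDF weights and regression on the training split
pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), LinearRegression())
pipeline.fit(train_texts, train_y)

# Dev texts are transformed with the training IDF weights before prediction
dev_pred = pipeline.predict(dev_texts)
print(np.round(dev_pred, 2))
```

The same `pipeline.predict` call could then be passed to `model_performance` in place of the separate transform steps.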


#### Baseline for Task 2

In [None]:
# Baseline for the task
pred_baseline = torch.zeros(len(dev_y)) + np.mean(training_y)
print("\nBaseline performance:")
sse, mse = model_performance(pred_baseline, dev_y, True)


Baseline performance:
| MSE: 0.34 | RMSE: 0.58 |
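
The baseline above predicts the training-set mean grade for every dev example, so any useful model should score below its RMSE. A minimal sketch of the same calculation in plain Python, using made-up grades; `rmse` is a hypothetical helper standing in for `model_performance`.

```python
import math
from statistics import mean


def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(mean((p - t) ** 2 for p, t in zip(predictions, targets)))


# Made-up grades for illustration only
train_grades = [0.0, 1.0, 2.0]
dev_grades = [0.5, 1.0, 1.5]

# Constant prediction: the mean of the training grades
baseline_pred = [mean(train_grades)] * len(dev_grades)
baseline_rmse = rmse(baseline_pred, dev_grades)
print(f"| Baseline RMSE: {baseline_rmse:.2f} |")
```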


# Deprecated Code

Kept temporarily until the devs are ready to delete forever :D

In [20]:
# We set our training data and test data
training_data = train_df
test_data = test_df

# Parse training and test data to tuple of original sentences, new sentences, etc.
x_og, x_new, _, _ = get_sentences(training_data)
y_og, y_new, _, _ = get_sentences(test_data)

# Bert Preprocessing
tokenizer = BertTokenizer.from_pretrained(bert_type, do_lower_case=True)

# Tokenise sentences and add special Bert tokens as necessary
train_input = []
train_masks = []
test_input  = []
test_masks  = []
train_grades = train_df['meanGrade']
for og, new in zip(x_og, x_new):
  encoded_plus = tokenizer.encode_plus(og, text_pair=new, max_length=pad_len, 
                                       truncation=True, padding='max_length', return_tensors='pt')
  train_input.append(encoded_plus['input_ids'])
  train_masks.append(encoded_plus['attention_mask'])

# Tokenise Test set
for og, new in zip(y_og, y_new):
  encoded_plus = tokenizer.encode_plus(og, text_pair=new, max_length=pad_len, 
                                       truncation=True, padding='max_length', return_tensors='pt')
  test_input.append(encoded_plus['input_ids'])
  test_masks.append(encoded_plus['attention_mask'])

# Convert to Tensors
train_input = torch.cat(train_input)
train_masks = torch.cat(train_masks)
train_grades = torch.tensor(train_grades)
test_input  = torch.cat(test_input)
test_masks  = torch.cat(test_masks)


In [21]:
# Split train dataset to train and validation sets
train_val_dataset = TensorDataset(train_input, train_masks, train_grades)
num_train = round(len(train_val_dataset) * train_proportion)
num_val = len(train_val_dataset) - num_train
train_dataset, val_dataset = random_split(train_val_dataset, (num_train, num_val))

# Create dataloaders from the tokenised inputs (token ids, attention masks and grades)
train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=batch_size)
val_dataloader = DataLoader(val_dataset, sampler=RandomSampler(val_dataset), batch_size=batch_size)