### Coursework coding instructions (please also see full coursework spec)

Please choose either Task 1 or Task 2. You should write your report about one task only.

For the task you choose you will need to do two approaches:
  - Approach 1, which can use pre-trained embeddings / models
  - Approach 2, which should not use any pre-trained embeddings or models

We should be able to run both approaches from the same Colab file.

#### Running your code:
  - Your models should run automatically when running your colab file without further intervention
  - For each task you should automatically output the performance of both models
  - Your code should automatically download any libraries required

#### Structure of your code:
  - You are expected to use the 'train', 'eval' and 'model_performance' functions, although you may edit these as required
  - Otherwise there are no restrictions on what you can do in your code

#### Documentation:
  - You are expected to produce a .README file summarising how you have approached both tasks

#### Reproducibility:
  - Your .README file should explain how to replicate the different experiments mentioned in your report

Good luck! We are really looking forward to seeing your reports and your model code!

In [1]:
# You will need to download any word embeddings required for your code, e.g.:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

# For any packages that Colab does not provide automatically you will also need to install these below, e.g.:
#! pip install torch
! pip install transformers

--2021-02-20 11:17:19--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-02-20 11:17:19--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-02-20 11:17:20--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’

In [3]:
# Get test and train data files
import os
if not os.path.exists('dev.csv'):
  !wget -q --show-progress https://raw.githubusercontent.com/matt-malarkey/nlp-cw-data/master/dev.csv
  !wget -q --show-progress https://raw.githubusercontent.com/matt-malarkey/nlp-cw-data/master/train.csv

In [4]:
# Imports
import codecs
import os
import re

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
import tqdm
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from torch.utils.data import Dataset, random_split
from transformers import RobertaTokenizer, RobertaModel, BertPreTrainedModel

In [5]:
# Setting random seed and device
SEED = 1

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
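
The cell above seeds PyTorch only. A minimal sketch of also seeding Python's `random` module and NumPy, and of passing a seeded generator to `random_split` so that the train/dev partition is repeatable. The `seed_everything` helper and the toy dataset are hypothetical names used for illustration and are not part of the coursework template:

```python
import random

import numpy as np
import torch
from torch.utils.data import TensorDataset, random_split


def seed_everything(seed):
    """Hypothetical helper: seed the Python, NumPy and PyTorch RNGs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable


seed_everything(1)

# Toy dataset of 10 items standing in for the real Dataset object.
toy = TensorDataset(torch.arange(10))

# A dedicated generator makes the split independent of any other RNG use.
split_a = random_split(toy, [8, 2], generator=torch.Generator().manual_seed(1))
split_b = random_split(toy, [8, 2], generator=torch.Generator().manual_seed(1))

# Both splits select the same training indices.
same_split = list(split_a[0].indices) == list(split_b[0].indices)
print(same_split)
```

Whether full determinism is reachable on GPU also depends on the operations used, so this should be read as a starting point rather than a guarantee.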

In [6]:
# Load data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('dev.csv')

In [7]:
# Number of epochs
epochs = 10

# Proportion of training data for train compared to dev
train_proportion = 0.8

In [21]:
# We define our training loop
def train(train_iter, dev_iter, model, number_epoch):
    """
    Training loop for the model, which calls on eval to evaluate after each epoch
    """

    
    print("Training model.")
    
    loss_fn = nn.MSELoss()
    loss_fn = loss_fn.to(device)

    for epoch in range(1, number_epoch+1):

        model.train()
        epoch_loss = 0
        epoch_sse = 0
        no_observations = 0  # Observations used for training so far

        for batch in train_iter:
            # Drop only the singleton axis added by the tokenizer, keeping the batch axis
            ids = batch['ids'].to(device).squeeze(1)
            mask = batch['mask'].to(device).squeeze(1)
            target = batch['target'].to(device, dtype=torch.float)
            predictions = model(ids, mask).squeeze(1)
            optimizer.zero_grad()
            no_observations = no_observations + target.shape[0]
            loss = loss_fn(predictions, target)

            sse, __ = model_performance(predictions.detach().cpu().numpy(), target.detach().cpu().numpy())

            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()*target.shape[0]
            epoch_sse += sse

        valid_loss, valid_mse, __, __ = eval(dev_iter, model)

        epoch_loss, epoch_mse = epoch_loss / no_observations, epoch_sse / no_observations
        print(f'| Epoch: {epoch:02} | Train Loss: {epoch_loss:.2f} | Train MSE: {epoch_mse:.2f} | Train RMSE: {epoch_mse**0.5:.2f} | \
        Val. Loss: {valid_loss:.2f} | Val. MSE: {valid_mse:.2f} |  Val. RMSE: {valid_mse**0.5:.2f} |')

In [22]:
# We evaluate performance on our dev set
def eval(data_iter, model):
    """
    Evaluating model performance on the dev set
    """
    model.eval()
    epoch_loss = 0
    epoch_sse = 0
    pred_all = []
    trg_all = []
    no_observations = 0
    loss_fn = nn.MSELoss()
    loss_fn = loss_fn.to(device)

    with torch.no_grad():
        for batch in data_iter:
            # Drop only the singleton axis added by the tokenizer, keeping the batch axis
            ids = batch['ids'].to(device).squeeze(1)
            mask = batch['mask'].to(device).squeeze(1)
            target = batch['target'].to(device, dtype=torch.float)
            no_observations = no_observations + target.shape[0]
            predictions = model(ids, mask).squeeze(1)

            loss = loss_fn(predictions, target)

            # We get the mse
            pred, trg = predictions.detach().cpu().numpy(), target.detach().cpu().numpy()
            sse, __ = model_performance(pred, trg)

            epoch_loss += loss.item()*target.shape[0]
            epoch_sse += sse
            pred_all.extend(pred)
            trg_all.extend(trg)

    return epoch_loss/no_observations, epoch_sse/no_observations, np.array(pred_all), np.array(trg_all)
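
Because each dataset item is tokenised with `return_tensors='pt'`, the batched `ids` and `mask` tensors arrive with shape `(batch, 1, max_length)`. A small sketch, using illustrative all-zero tensors, of how `squeeze()` and `squeeze(1)` behave on such shapes when a batch happens to contain a single example:

```python
import torch

max_length = 128

# Shapes mimic what a DataLoader yields when each item is a (1, max_length) tensor.
full_batch = torch.zeros(32, 1, max_length, dtype=torch.long)
last_batch = torch.zeros(1, 1, max_length, dtype=torch.long)  # a batch of one example

# squeeze() removes every size-1 axis, including the batch axis of last_batch.
full_all = tuple(full_batch.squeeze().shape)
last_all = tuple(last_batch.squeeze().shape)

# squeeze(1) removes only the axis introduced by the tokenizer.
last_axis1 = tuple(last_batch.squeeze(1).shape)

print(full_all, last_all, last_axis1)
```

Targeting the axis explicitly keeps the input two-dimensional regardless of how many examples end up in the final batch.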

In [23]:
# How we print the model performance
def model_performance(output, target, print_output=False):
    """
    Returns SSE and MSE per batch (printing the MSE and the RMSE)
    """

    sq_error = (output - target)**2

    sse = np.sum(sq_error)
    mse = np.mean(sq_error)
    rmse = np.sqrt(mse)

    if print_output:
        print(f'| MSE: {mse:.2f} | RMSE: {rmse:.2f} |')

    return sse, mse
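
`train` and `eval` accumulate the per-batch SSE and divide by the total number of observations, rather than averaging the per-batch MSEs. A small sketch with made-up numbers showing why the two can differ when the final batch is smaller than the others:

```python
import numpy as np

# Made-up predictions and targets, split into batches of unequal size.
pred = np.array([0.0, 0.0, 0.0, 0.0, 0.0])
target = np.array([1.0, 1.0, 1.0, 1.0, 3.0])
batches = [slice(0, 4), slice(4, 5)]

sq_error = (pred - target) ** 2
overall_mse = sq_error.mean()

# Accumulate SSE per batch, then divide by the total count (as train/eval do).
sse_total = sum(sq_error[b].sum() for b in batches)
mse_from_sse = sse_total / len(pred)

# Averaging per-batch MSEs gives the single-example batch as much weight
# as the four-example batch.
mean_of_batch_mses = np.mean([sq_error[b].mean() for b in batches])

print(overall_mse, mse_from_sse, mean_of_batch_mses)
```

The SSE-based figure matches the MSE over all examples, while the mean of batch MSEs is pulled towards the error of the small batch.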

In [24]:
# Inherits from Dataset class so the DataLoader can use it
class Task1Dataset(Dataset):

    def __init__(self, tokenizer, train_data, labels):
        self.x_train = train_data
        self.y_train = labels
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.y_train)

    def __getitem__(self, item):
        headlines = self.x_train[item]
        target = self.y_train[item]
        encoding = self.tokenizer.encode_plus(headlines, return_tensors='pt',
                                              padding='max_length', truncation=True,
                                              max_length=128)
        # With return_tensors='pt' these are already tensors of shape (1, max_length)
        ids = encoding['input_ids']
        mask = encoding['attention_mask']
        return {
            'ids': ids,
            'mask': mask,
            'target': torch.tensor(target)
        }

In [25]:
class RobertaRegressionModel(BertPreTrainedModel):
    def __init__(self, config):
        super(RobertaRegressionModel, self).__init__(config)

        self.config = config
        self.roberta = RobertaModel.from_pretrained('roberta-base')
        self.linear = nn.Linear(768, 1)  # 768 is the hidden size of roberta-base
        self.loss = nn.MSELoss()

    def forward(self, ids, mask):
        # Index 1 of the RobertaModel output is the pooled representation
        outputs = self.roberta(input_ids=ids, attention_mask=mask)[1]
        x = self.linear(outputs)
        return x

In [None]:
## Approach 1 code, using functions defined above:
BATCH_SIZE = 32

# Initialise RoBERTa model and tokenizer
model = RobertaRegressionModel.from_pretrained('roberta-base').to(device)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# We set our training data and test data
training_data = train_df['original1']
test_data = test_df['original1']

train_and_dev = Task1Dataset(tokenizer, training_data, train_df['meanGrade1'])

train_examples = round(len(train_and_dev)*train_proportion)
dev_examples = len(train_and_dev) - train_examples

train_dataset, dev_dataset = random_split(train_and_dev,
                                           (train_examples,
                                            dev_examples))

train_loader = torch.utils.data.DataLoader(train_dataset, shuffle=True, batch_size=BATCH_SIZE)
dev_loader = torch.utils.data.DataLoader(dev_dataset, batch_size=BATCH_SIZE)

print("Dataloaders created.")


optimizer = torch.optim.Adam(model.parameters())
train(train_loader, dev_loader, model, epochs)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaRegressionModel: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaRegressionModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaRegressionModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaRegressionModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['linear.weight', 'linear.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and infer

Dataloaders created.
Training model.




| Epoch: 01 | Train Loss: 0.41 | Train MSE: 0.41 | Train RMSE: 0.64 |         Val. Loss: 0.38 | Val. MSE: 0.38 |  Val. RMSE: 0.62 |
| Epoch: 02 | Train Loss: 0.35 | Train MSE: 0.35 | Train RMSE: 0.59 |         Val. Loss: 0.37 | Val. MSE: 0.37 |  Val. RMSE: 0.61 |


#### Approach 2: No pre-trained representations

In [None]:
train_and_dev = train_df['edit']

training_data, dev_data, training_y, dev_y = train_test_split(train_df['edit'], train_df['meanGrade'],
                                                                        test_size=(1-train_proportion),
                                                                        random_state=42)

# We train a Tf-idf model
count_vect = CountVectorizer(stop_words='english')
train_counts = count_vect.fit_transform(training_data)
transformer = TfidfTransformer().fit(train_counts)
train_counts = transformer.transform(train_counts)
regression_model = LinearRegression().fit(train_counts, training_y)

# Train predictions
predicted_train = regression_model.predict(train_counts)

# Calculate Tf-idf using train and dev, and validate model on dev:
train_and_dev_counts = count_vect.transform(train_and_dev)
transformer = TfidfTransformer().fit(train_and_dev_counts)

test_counts = count_vect.transform(dev_data)
test_counts = transformer.transform(test_counts)

# Dev predictions
predicted = regression_model.predict(test_counts)

# We run the evaluation:
print("\nTrain performance:")
sse, mse = model_performance(predicted_train, training_y, True)

print("\nDev performance:")
sse, mse = model_performance(predicted, dev_y, True)


Train performance:
| MSE: 0.13 | RMSE: 0.37 |

Dev performance:
| MSE: 0.36 | RMSE: 0.60 |
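
One alternative way to wire up the same kind of bag-of-words regression is a scikit-learn `Pipeline`, which fits the vocabulary and IDF weights on the training split only and reuses them unchanged for dev. This is a sketch on made-up strings and grades, not the setup that produced the numbers above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Made-up edit words and grades, purely for illustration.
train_text = ["banana", "tax", "clown", "vote", "pizza", "budget"]
train_y = [1.8, 0.4, 2.0, 0.6, 1.6, 0.2]
dev_text = ["clown", "budget", "unseen"]

# TfidfVectorizer combines CountVectorizer and TfidfTransformer in one step.
pipeline = make_pipeline(TfidfVectorizer(), LinearRegression())
pipeline.fit(train_text, train_y)

# Dev text is transformed with the train-fitted vocabulary and IDF weights.
dev_pred = pipeline.predict(dev_text)
print(dev_pred.shape)
```

Keeping the vectoriser and the regressor in one object means the features seen at prediction time are always computed the same way as those seen during fitting.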


#### Baseline for task 2

In [None]:
# Baseline for the task
# Predict the training-set mean for every dev example
pred_baseline = np.full(len(dev_y), np.mean(training_y))
print("\nBaseline performance:")
sse, mse = model_performance(pred_baseline, dev_y, True)


Baseline performance:
| MSE: 0.34 | RMSE: 0.58 |
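
The baseline predicts the training-set mean for every dev example. The same idea can be sketched with scikit-learn's `DummyRegressor`; the grades below are made up for illustration:

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Made-up grades standing in for the real train and dev targets.
train_y = np.array([0.2, 1.0, 1.8])
dev_y = np.array([0.0, 2.0])

# The mean strategy ignores the features, so placeholder zeros are enough.
baseline = DummyRegressor(strategy="mean")
baseline.fit(np.zeros((len(train_y), 1)), train_y)
dev_pred = baseline.predict(np.zeros((len(dev_y), 1)))

mse = np.mean((dev_pred - dev_y) ** 2)
print(dev_pred, mse)
```

A learned model is only useful to the extent that its dev MSE falls below this constant-prediction figure.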
