# Notebook of Transformer Experiments for the Shared Task Challenge

Authors: Hadi Asghari & Freya Hewett
Date:   June 2022

V3 builds on v1/v2 and adds k-fold cross-validation.

**Story so far**:
- Our best transformer so far was an xlm-roberta-base model with a regression layer on top (notebooks v1/v2).
- Using an 80:20 split, the best model achieved an RMSE of 0.60. (This translated to an rmse_map of 0.48 on the public test data.)
- In this notebook we use k-fold cross-validation (with K=5), so we end up with K models.
- Future: let's also experiment with xlm-roberta-large; for now my GPU cannot handle more.


**Competition background**:
- https://github.com/babaknaderi/TextComplexityDE
- https://codalab.lisn.upsaclay.fr/competitions/4964

**Transformer finetuning inspiration**:
- https://github.com/kozodoi/Text_Readability_Prediction
- https://huggingface.co/course/chapter3/3?fw=pt


In [1]:
# This is just FYI; GPU details are handled by HF's Trainer 
import torch
print("GPU/CUDA available:", torch.cuda.is_available())

GPU/CUDA available: True


In [2]:
# STEP 1: LOAD THE TRAINING SET AND SLICE IT INTO K FOLDS
# note: the training set's CSV header needs to be: idx, sentence, label!!

from datasets import load_dataset, DatasetDict

dataset_all = load_dataset(
    "csv", 
    data_files= {"train": "public_data_text_complexity22/training_set.csv",},
    sep=","
)['train']

K = 5
fold_datasets = []
num_val_samples = len(dataset_all) // K

for i in range(K):
    # one may wish to add some shuffling somewhere somehow :)
    val_idx = list(range(i*num_val_samples, (i+1)*num_val_samples))
    train_idx = list(range(i*num_val_samples)) + list(range((i+1)*num_val_samples,len(dataset_all)))
    assert len(set(val_idx) & set(train_idx)) == 0   # sanity check :)

    ds = DatasetDict({
        "train":dataset_all.select(train_idx),
        "validation":dataset_all.select(val_idx),
    })
    fold_datasets.append(ds)

print(len(fold_datasets), "of:\n", fold_datasets[1])

Using custom data configuration default-523f5d74073b9b87
Reusing dataset csv (/home/hadi/.cache/huggingface/datasets/csv/default-523f5d74073b9b87/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)


5 of:
 DatasetDict({
    train: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 800
    })
    validation: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 200
    })
})
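
The folds above are contiguous slices of the CSV, and the loop comment hints that some shuffling might be worthwhile. Below is a minimal sketch of one way to do that, assuming a fixed seed for reproducibility. `shuffled_kfold_indices` is a hypothetical helper, not part of this notebook; its index lists could be passed to `dataset_all.select(...)` just like `train_idx` and `val_idx`. It also spreads any remainder over the first folds, so every sample lands in exactly one validation fold.

```python
import random

def shuffled_kfold_indices(n_samples, k, seed=42):
    """Yield (train_idx, val_idx) pairs over a shuffled index range."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    # The first (n_samples % k) folds get one extra sample each.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(shuffled_kfold_indices(1000, 5))
print([(len(train), len(val)) for train, val in folds])
```

With 1000 samples and K=5 this gives the same 800/200 sizes as the slicing above, only with the rows drawn in random order.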


In [3]:
# STEP 2A. LOAD THE PROPER TOKENIZER IN PREPARATION FOR APPLYING IT TO OUR DATA.

from transformers import XLMRobertaTokenizer 
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

# also needs a datacollator.
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")

# and a mapping function, to be called later. important: both padding and truncation are True.
def tokenize_function(x):
    return tokenizer(x["sentence"], padding=True, truncation=True)  # is_split_into_words=False?
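
To make the effect of `padding=True` concrete, here is a small pure-Python sketch of what padding a batch to its longest sequence produces. `pad_batch` is a hypothetical helper, the token ids are made up, and the pad id of 1 is an assumption (it should match XLM-R's `<pad>` token); the real work is of course done by the tokenizer.

```python
def pad_batch(sequences, pad_id=1):
    """Pad token-id lists to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up "sentences" of different lengths.
batch = pad_batch([[0, 581, 2], [0, 581, 7270, 83, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Since `map(..., batched=True)` hands the tokenizer up to 1000 rows at a time by default, each such chunk is padded to its own longest sentence.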

In [4]:
# STEP 2B. Construct our model: XLM-R with a new regression head
# Notes:
# - The base model's .forward() can take more parameters, but for training I think these are the ones that matter
# - One can make the layers more complex in future versions by accessing the hidden states and concatenating them
# - For speed/memory, one could also freeze some of the earlier layers (not necessary now)

import torch.nn as nn
from transformers import XLMRobertaModel
from transformers import Trainer    
    
    
class TheModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.base_model = XLMRobertaModel.from_pretrained("xlm-roberta-base")
        self.dropout = nn.Dropout(0.20)  # a bit slower at the start, but a less zigzaggy val-loss during training :)
        self.regressor = nn.Linear(768, 1)
        
    def forward(self, input_ids, attention_mask=None, return_dict=False):
        assert not return_dict
        raw_output = self.base_model(input_ids, attention_mask, return_dict=True)  
        output = raw_output["pooler_output"]  # shape is [batch_size, 768]
        output = self.dropout(output)
        output = self.regressor(output)
        return output
    
    
# we need a custom Trainer because our loss function is different (RMSE)
class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.get("labels")
        outputs = model(inputs["input_ids"], inputs["attention_mask"])  # forward pass
        outputs = outputs.squeeze(-1)  # [batch, 1] -> [batch]; necessary to match the labels' shape
        loss = torch.sqrt(nn.MSELoss()(outputs, labels)) 
        return (loss, outputs) if return_outputs else loss
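
The shape handling in the loss can be checked in isolation. The sketch below uses hand-written tensors in place of real model output and compares `squeeze()` with `squeeze(-1)`: the former collapses a single-sample batch to a scalar, the latter keeps the batch dimension. `rmse_loss` is a hypothetical helper, not something the Trainer provides.

```python
import torch
import torch.nn as nn

def rmse_loss(outputs, labels):
    """RMSE between model outputs of shape [batch, 1] and labels of shape [batch]."""
    preds = outputs.squeeze(-1)  # drop only the last dimension
    return torch.sqrt(nn.functional.mse_loss(preds, labels))

outputs = torch.tensor([[1.0], [2.0], [4.0]])
labels = torch.tensor([1.0, 2.0, 2.0])
print(rmse_loss(outputs, labels))  # sqrt(4/3), since only the last sample is off (by 2)

single = torch.tensor([[3.0]])  # a batch holding one sample
print(single.squeeze().shape, single.squeeze(-1).shape)
```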
         

In [5]:
# STEP 3: THE ACTUAL TRAINING!
# Notes:
# - we run this one fold at a time (n = 1..K; n = 0 skips training), since our GPU will probably run out of RAM and we want to restart in between
#   TODO: if I could get the GPU to release its memory at the end, we could easily automate this loop :)
# - training plateaus (in terms of val-loss/over-fitting) between epochs 10 and 20; the trainer saves the best model

from transformers import TrainingArguments

n = 0  # fold to train (1..K); 0 means training is skipped
EPOCHS = 10  # seems sufficient given the final RMSE

if n > 0:
    assert n <= K  # sanity
    fold_dataset = fold_datasets[n-1]
    tokenized_dataset = fold_dataset.map(tokenize_function, batched=True)

    model = TheModel()  # instantiate new one

    trainer = CustomTrainer(
        model=model,
        args=TrainingArguments(
                num_train_epochs=EPOCHS,  
                output_dir=f"checkpoints/train-xlmrreg-n{n}",     
                overwrite_output_dir=True,
                logging_strategy="epoch",
                evaluation_strategy="epoch",
                save_strategy="epoch",
                save_total_limit=2,
                load_best_model_at_end=True,
        ),
        train_dataset=tokenized_dataset["train"],
        eval_dataset=tokenized_dataset["validation"],
        data_collator=data_collator,
        tokenizer=tokenizer,
    )

    trainer.train()  # yay!

    # pickle it -- 1.1GB each :O
    import pickle
    print("pickling", end="..")
    model.to("cpu")  # move to CPU first, so the pickle can be loaded without a GPU
    with open(f"checkpoints/model-xlmrreg-n{n}.p", "wb") as f:
        pickle.dump(model, f)
    print(".!")
    
    
# Loading best model from checkpoints/train-xlmrreg-n1/checkpoint-300 (score: 0.5455247759819031). 
# Loading best model from checkpoints/train-xlmrreg-n2/checkpoint-900 (score: 0.6186634302139282).
# Loading best model from checkpoints/train-xlmrreg-n3/checkpoint-800 (score: 0.7035192847251892). 
# Loading best model from checkpoints/train-xlmrreg-n4/checkpoint-1000 (score: 0.5991728901863098). 
# Loading best model from checkpoints/train-xlmrreg-n5/checkpoint-900 (score: 0.645281970500946). 
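
The five best-checkpoint scores logged above can be condensed into a single cross-validation figure. A minimal sketch, assuming the logged `score` is the validation loss returned by `compute_loss` (i.e. an RMSE); the numbers are copied from the log lines, nothing is recomputed here.

```python
from statistics import mean, stdev

# Best validation score per fold, copied from the "Loading best model" lines.
fold_rmse = [
    0.5455247759819031,  # n1
    0.6186634302139282,  # n2
    0.7035192847251892,  # n3
    0.5991728901863098,  # n4
    0.645281970500946,   # n5
]

cv_mean = mean(fold_rmse)
cv_std = stdev(fold_rmse)  # sample standard deviation across folds
print(f"CV RMSE: {cv_mean:.3f} +/- {cv_std:.3f}")
```

The mean comes out at roughly 0.62, in the same range as the 0.60 from the earlier 80:20 split.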

In [19]:
# STEP 4A: LET'S RELOAD ALL K (FIVE) MODELS TO CALCULATE THE ENSEMBLE RMSE :)

assert n == 0  # don't run this cell until training fully done

import pickle
models = []

for i in range(K):
    with open(f"checkpoints-20220702/model-xlmrreg-n{i+1}.p", "rb") as f:
        model = pickle.load(f)
    model.eval()  # evaluation (inference) mode
    models.append(model)

    
print("Loaded", K, "models")

Loaded 5 models


In [8]:
# STEP 4B: CALCULATE THE RMSE FOR THE WHOLE SET.
# note: using method below, instead of `trainer.predict` (for some reason that removes the first element of each batch)
# more info: https://discuss.huggingface.co/t/evalprediction-returning-one-less-prediction-than-label-id-for-each-batch/6958/7

import torch  
import numpy as np

def root_mean_squared_error(y_true, y_pred):
    return np.sqrt(np.mean(np.square(y_pred - y_true)))    

tokenized_dataset = dataset_all.map(tokenize_function, batched=True)
feat_inputs = torch.tensor(tokenized_dataset['input_ids'])
feat_attns = torch.tensor(tokenized_dataset['attention_mask'])
truth = np.array(dataset_all["label"])
print(feat_inputs.shape, feat_attns.shape, truth.shape)

l_preds = []
for i in range(K):
    model = models[i]    
    print(i+1, end="..")
    # weird how slow this is?! it also uses about 70GB RAM :/ maybe needs some batching, or a GPU;
    # torch.no_grad() at least avoids keeping activations around for backprop
    with torch.no_grad():
        %time preds = model(feat_inputs, feat_attns)
    print(".", end="")
    preds = preds.squeeze().detach().numpy()
    l_preds.append(preds)
    print(".", end="")
    print(root_mean_squared_error(truth, preds))
    

# THEN TAKE THE MEAN OF THE PREDICTIONS
l_preds = np.array(l_preds).mean(0)  # this is now the mean of the ensemble predictions
print(root_mean_squared_error(truth, l_preds))
    
# 1....0.4972061220307628
# 2....0.35702121329234604
# 3....0.4064348121115841
# 4....0.3277875091084862
# 5....0.4385743332602079

Loading cached processed dataset at /home/hadi/.cache/huggingface/datasets/csv/default-523f5d74073b9b87/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519/cache-33d69d9f7f36f451.arrow


CPU times: user 17.7 ms, sys: 0 ns, total: 17.7 ms
Wall time: 15.6 ms
CPU times: user 32.6 ms, sys: 0 ns, total: 32.6 ms
Wall time: 33.1 ms
CPU times: user 31.2 ms, sys: 0 ns, total: 31.2 ms
Wall time: 31.2 ms
0....0.4972061220307628
1....0.35702121329234604
2....0.4064348121115841
3....0.3277875091084862
4....0.4385743332602079
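
Regarding the memory use noted in the cell above: a common remedy is to run inference in small batches with gradient tracking switched off. The sketch below shows the idea on a toy stand-in model. `predict_in_batches` and `ToyModel` are hypothetical names, and the toy model only mimics the `(input_ids, attention_mask)` call signature of `TheModel`.

```python
import torch
import torch.nn as nn

def predict_in_batches(model, input_ids, attention_mask, batch_size=32):
    """Run the model over the inputs in chunks, without tracking gradients."""
    model.eval()
    chunks = []
    with torch.no_grad():  # no autograd graph is built, so memory stays bounded
        for start in range(0, input_ids.shape[0], batch_size):
            end = start + batch_size
            out = model(input_ids[start:end], attention_mask[start:end])
            chunks.append(out.squeeze(-1))  # [batch, 1] -> [batch]
    return torch.cat(chunks)

class ToyModel(nn.Module):
    """Stand-in for TheModel, only for this demo."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 1)

    def forward(self, input_ids, attention_mask=None):
        return self.linear(input_ids.float() * attention_mask.float())

toy = ToyModel()
ids = torch.randint(0, 10, (10, 4))
mask = torch.ones(10, 4, dtype=torch.long)
preds = predict_in_batches(toy, ids, mask, batch_size=3)
print(preds.shape)
```

With the real models, the returned tensor could be converted with `.numpy()` and used in place of the full-set forward pass.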


In [17]:
# MEAN => 0.237 ... this looks insanely good, but it is optimistic: each model was trained on 80% of these samples :)
print(root_mean_squared_error(truth, l_preds))

0.2373300214258478
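
A fairer figure would score each sample only with the model whose validation fold contained it (an out-of-fold estimate). Below is a minimal sketch, assuming the same contiguous fold layout as in step 1. `out_of_fold_predictions` is a hypothetical helper, and the predictions are synthetic random numbers, not results from this notebook.

```python
import numpy as np

def out_of_fold_predictions(preds_per_model, fold_size):
    """For fold i, keep model i's predictions on its own validation slice."""
    n_used = fold_size * len(preds_per_model)  # samples covered by a validation fold
    oof = np.empty(n_used)
    for i, preds in enumerate(preds_per_model):
        val = slice(i * fold_size, (i + 1) * fold_size)
        oof[val] = preds[val]
    return oof

def root_mean_squared_error(y_true, y_pred):
    return np.sqrt(np.mean(np.square(y_pred - y_true)))

rng = np.random.default_rng(0)
truth = rng.uniform(1, 7, size=10)
# Synthetic stand-ins for each model's predictions over the whole set.
preds_per_model = [truth + rng.normal(0, 0.5, size=10) for _ in range(5)]

oof = out_of_fold_predictions(preds_per_model, fold_size=2)
print(root_mean_squared_error(truth, oof))
```

In this notebook the inputs would be the five per-model `preds` arrays from step 4B (collected before they are averaged) and `num_val_samples` as the fold size.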


In [21]:
# STEP 5: PREDICT THE COMPETITION (TEST) DATASET SCORES

test_dataset = load_dataset("csv", data_files= {"test": "./part2_public.csv"}, sep=",")["test"]
test_tokenized = test_dataset.map(tokenize_function, batched=True) 

l_preds = []
for i in range(K):
    model = models[i]
    print(i+1, end="..")
    with torch.no_grad():  # inference only, no need to track gradients
        preds = model(torch.tensor(test_tokenized['input_ids']), torch.tensor(test_tokenized['attention_mask']))
    print(".", end="")
    preds = preds.squeeze().detach().numpy()
    l_preds.append(preds)

# THEN TAKE THE MEAN OF THE PREDICTIONS
l_preds = np.array(l_preds).mean(0)

# save it!
import csv
ids = list()
with open("./part2_public.csv", 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        ids.append(row[0])
        #score.append(row[2])

with open("answer-xlmr-v3.csv", 'w') as ofile:
    print("ID,MOS", file=ofile)
    for m,i in enumerate(ids[1:]):
        print(str(i)+ ','+str(l_preds[m]), file=ofile)

Using custom data configuration default-3fae48d91a6d34b8


Downloading and preparing dataset csv/default to /home/hadi/.cache/huggingface/datasets/csv/default-3fae48d91a6d34b8/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Dataset csv downloaded and prepared to /home/hadi/.cache/huggingface/datasets/csv/default-3fae48d91a6d34b8/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.


1...2...3...4...5...
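
The submission file written in step 5 could also be produced with the `csv` module, which takes care of quoting and makes a length mismatch between ids and predictions easy to catch. A minimal sketch: `write_submission` is a hypothetical helper, and the ids and scores below are made up purely to exercise it.

```python
import csv
import os
import tempfile

def write_submission(path, ids, scores):
    """Write a header plus one 'ID,MOS' row per test sentence."""
    if len(ids) != len(scores):
        raise ValueError("ids and scores must have the same length")
    with open(path, "w", newline="") as ofile:
        writer = csv.writer(ofile, lineterminator="\n")
        writer.writerow(["ID", "MOS"])
        for sentence_id, score in zip(ids, scores):
            writer.writerow([sentence_id, score])

# Made-up ids and scores, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "answer-demo.csv")
write_submission(demo_path, ["101", "102"], [3.25, 1.5])
with open(demo_path, newline="") as f:
    print(f.read())
```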