# T5 Alternate Uses Task Scoring 

<a href="https://colab.research.google.com/github/massivetexts/llm_aut_study/blob/main/notebooks/T5 AUT Scoring.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook uses PyTorch Lightning and HuggingFace Transformers to evaluate transformer architectures for originality scoring. Currently, it evaluates *T5*, though *BERT*, *distilBERT*, and *RoBERTa* may also be sensible to measure.

The ground truth was processed in [Process_AUT_GT.ipynb](https://colab.research.google.com/github/massivetexts/llm_aut_study/blob/main/notebook/Process_AUT_GT.ipynb). See also [GPT-3 AUT Scoring](https://colab.research.google.com/github/massivetexts/llm_aut_study/blob/main/notebooks/GPT-3%20AUT%20Scoring.ipynb).

This is one of the experiments from Organisciak, P., Acar, S., Dumas, D., & Berthiaume, K. (2022). Beyond Semantic Distance: Automated Scoring of Divergent Thinking Greatly Improves with Large Language Models. http://dx.doi.org/10.13140/RG.2.2.32393.31840.

In [None]:
#@title Installs
# for TPU support in PyTorch - this is probably dead in the water for performance reasons
#!pip install -q google-api-python-client==1.12.1 google-cloud-pubsub
tpu_or_gpu = "gpu" #@param ["tpu", "gpu"]

if tpu_or_gpu == "tpu":
    !pip install -qq cloud-tpu-client==0.10 torch==1.11.0 https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-1.11-cp37-cp37m-linux_x86_64.whl
else:
    !nvcc --version
    import torch
    print("Torch version (check match to CUDA):", torch.__version__) # doublecheck that torch cu version is same as actual cuda version
    # if there's a mismatch - best to find the torch that matches the install CUDA at
    # https://pytorch.org/get-started/previous-versions/
    #!pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/torch_stable.html

!pip install -qq sentencepiece transformers pytorch-lightning wandb
#!pip install -q git+git://github.com/williamFalcon/pytorch-lightning.git@master --upgrade

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
Torch version (check match to CUDA): 1.12.0+cu113


In [None]:
# GPU Memory: K80: 12GB, P100: 16GB, V100: 16GB, P4: 8GB, T4: 16GB, A100: 40GB
!nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-c47bd875-bd49-e543-3ad9-2896f704f3d0)


In [None]:
#@title Imports
import torch
import wandb
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger
if tpu_or_gpu == 'tpu':
    import torch_xla
    import torch_xla.core.xla_model as xm

import warnings
#import logging
import os
import pandas as pd
import glob
import json
import numpy as np
import random
import re
import argparse
from functools import lru_cache
from sklearn.model_selection import train_test_split
from datetime import datetime

import shutil

from tqdm import tqdm
from pathlib import Path

from transformers import (AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer,
                          get_linear_schedule_with_warmup)
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW

In [None]:
#@title Params
base_dir = Path('drive/MyDrive/Grants/MOTES/') #@param { type: 'raw' }
gt_dir = base_dir / 'Data' / 'aut_ground_truth' #@param { type: 'raw' }
print("GT options", [x.name.split('.')[0] for x in gt_dir.glob('*tar.gz')])
data_subdir = "gt_byprompt" #@param ['gt_main2', 'gt_byparticipant', 'gt_byprompt', 'all'] {allow-input: true}

!cp "{gt_dir}/{data_subdir}.tar.gz" .
!rm -rf data
!tar -xf {data_subdir}.tar.gz
data_dir = Path(f"data/{data_subdir}")

random_seed = 987 #@param {type:'number'}
#@markdown [mt5-base](https://huggingface.co/google/mt5-base) is the multilingual extension of T5.
#@markdown [t5-v1.1](https://huggingface.co/google/t5-v1_1-base) is a slightly adjusted model, pre-trained on C4 only with no supervised tasks mixed in.
model_name_or_path = "t5-base" #@param ["t5-base", "t5-large", "google/t5-v1_1-large", "google/t5-v1_1-base", "google/mt5-base"]
if '-large' in model_name_or_path:
    warnings.warn("Large models likely won't fine-tune on Colab GPUs. Half-precision or TPU training can support them in memory, though each currently stalls.")

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

fname = f"{model_name_or_path.split('/')[-1]}-{data_subdir}-{datetime.now().strftime('%m-%d-%s')}"

print('name', fname)
wandb_logger = WandbLogger(fname, project='aut-t5')
# manual login when something was crashing - it resolved itself
#with open('/content/drive/MyDrive/keys/wandbkey.txt', mode='r') as f:
#    wandb.login(key = f.read()) 
set_seed(random_seed)

GT options ['gt_main', 'gt_bypart3', 'gt_byprompt4', 'gt_byparticipant', 'gt_byprompt', 'all', 'gt_main2', 'gt_main_std']
name t5-base-gt_byprompt-08-05-1659739678


[34m[1mwandb[0m: Currently logged in as: [33mporg[0m ([33mmassive-texts[0m). Use [1m`wandb login --relogin`[0m to force relogin


## Model

This model is a PyTorch Lightning module, adapted from [Patil Suraj's notebook](https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb), which in turn is based on the [Lightning docs](https://github.com/PytorchLightning/pytorch-lightning).

In [None]:
#@markdown *T5FineTuner* and Logger definitions
class T5FineTuner(pl.LightningModule):
    def __init__(self, model_name_or_path: str,
                tokenizer_name_or_path:str,
                data_dir:str,
                output_dir:str,
                max_seq_length: int=512,
                max_grad_norm: float = 1.0,
                gradient_accumulation_steps: int = 16,
                num_train_epochs: int = 2,
                learning_rate: float = 2e-5,
                adam_epsilon: float = 1e-8,
                warmup_steps: int = 0,
                early_stop_callback: bool = False, 
                weight_decay: float = 0.0,
                batch_size: int = 4,
                pin_dl_memory: bool = False,
                data_loader_workers: int = 4, 
                seed: int = 1234,
                **kwargs):
        super(T5FineTuner, self).__init__()
        self.save_hyperparameters()

        self.config = AutoConfig.from_pretrained(self.hparams.model_name_or_path)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(self.hparams.model_name_or_path,
                                                            config=self.config)
        self.tokenizer = AutoTokenizer.from_pretrained(self.hparams.model_name_or_path, model_max_length=512)
  
    def is_logger(self):
        return True

  #def on_post_move_to_device(self):
      # This is an attempt to adjust for a TPU issue, but could be source of a 
      # bug
  #  self.decoder.weight = self.encoder.weight
  
    #@torch.autocast(device_type="cuda")
    def forward(
        self, input_ids, attention_mask=None, decoder_input_ids=None, 
        decoder_attention_mask=None, labels=None
    ):
        return self.model(
            input_ids,
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
            decoder_attention_mask=decoder_attention_mask,
            labels=labels,
        )

    def _step(self, batch):
        labels = batch["target_ids"]
        labels[labels[:, :] == self.tokenizer.pad_token_id] = -100

        outputs = self(
            input_ids=batch["source_ids"],
            attention_mask=batch["source_mask"],
            labels=labels,
            decoder_attention_mask=batch['target_mask']
        )

        loss = outputs[0]

        return loss

    def training_step(self, batch, batch_idx):
        loss = self._step(batch)
        self.log('train_loss', loss)
        return loss

    def training_epoch_end(self, outputs):
        avg_train_loss = torch.stack([x["loss"] for x in outputs]).mean()
        self.log("avg_train_loss", avg_train_loss)

    def validation_step(self, batch, batch_idx):
        loss = self._step(batch)
        self.log('val_loss', loss, on_epoch=True)
        return loss

    @lru_cache()
    def total_steps(self):
        return len(self.train_dataloader()) // self.hparams.gradient_accumulation_steps * self.hparams.num_train_epochs
            
    def configure_optimizers(self):
        "Prepare optimizer and schedule (linear warmup and decay)"

        no_decay = ["bias", "LayerNorm.weight"]
        optimizer_grouped_parameters = [
            {
                "params": [p for n, p in self.model.named_parameters() if not any(nd in n for nd in no_decay)],
                "weight_decay": self.hparams.weight_decay,
            },
            {
                "params": [p for n, p in self.model.named_parameters() if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
            },
        ]
        
        optimizer = AdamW(optimizer_grouped_parameters,
                            lr=self.hparams.learning_rate, 
                            eps=self.hparams.adam_epsilon)
        self.opt = optimizer

        scheduler = get_linear_schedule_with_warmup(
                        self.opt,
                        num_warmup_steps=self.hparams.warmup_steps,
                        num_training_steps=self.total_steps()
        )
        self.lr_scheduler = scheduler

        # Return a scheduler dict, as a second list with step interval so
        # that the warmup works properly
        # https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#configure-optimizers
        return [self.opt], [{"scheduler": self.lr_scheduler, "interval": "step"}]
  
    def get_tqdm_dict(self):
        tqdm_dict = {"loss": "{:.3f}".format(self.trainer.avg_loss), "lr": self.lr_scheduler.get_last_lr()[-1]}
        return tqdm_dict

    def train_dataloader(self):
        train_dataset = get_dataset(tokenizer=self.tokenizer, type_path="train", 
                                    args=self.hparams)
        dataloader = DataLoader(train_dataset, batch_size=self.hparams.batch_size, 
                                drop_last=True, shuffle=True,
                                pin_memory=self.hparams.pin_dl_memory,
                                num_workers=self.hparams.data_loader_workers)

        return dataloader

    def val_dataloader(self):
        val_dataset = get_dataset(tokenizer=self.tokenizer, type_path="val", args=self.hparams)
        return DataLoader(val_dataset, batch_size=self.hparams.batch_size,
                          pin_memory=self.hparams.pin_dl_memory,
                          num_workers=self.hparams.data_loader_workers)


#logger = logging.getLogger(__name__)

#class LoggingCallback(pl.Callback):
#  def on_validation_end(self, trainer, pl_module):
#    logger.info("***** Validation results *****")
#    if pl_module.is_logger():
#      metrics = trainer.callback_metrics
#      # Log results
#      for key in sorted(metrics):
#        if key not in ["log", "progress_bar"]:
#          logger.info("{} = {}\n".format(key, str(metrics[key])))
#
#  def on_test_end(self, trainer, pl_module):
#    logger.info("***** Test results *****")
#
#    if pl_module.is_logger():
#      metrics = trainer.callback_metrics

      # Log and save results to file
#      output_test_results_file = os.path.join(pl_module.hparams.output_dir, "test_results.txt")
#      with open(output_test_results_file, "w") as writer:
#        for key in sorted(metrics):
#          if key not in ["log", "progress_bar"]:
#            logger.info("{} = {}\n".format(key, str(metrics[key])))
#            writer.write("{} = {}\n".format(key, str(metrics[key])))
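
The "linear warmup and decay" schedule that `configure_optimizers` requests from `get_linear_schedule_with_warmup` scales the base learning rate by a factor that depends only on the optimizer step. The following is a minimal, dependency-free sketch of that factor, not the library code itself; `lr_multiplier` is a hypothetical helper name, and `total_steps=84` is an illustrative value rather than a figure from this run.

```python
def lr_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # warmup: rise linearly from 0 to 1
        return step / max(1, warmup_steps)
    # decay: fall linearly from 1 at the end of warmup to 0 at total_steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 3e-4  # illustrative base learning rate
for step in (0, 10, 20, 52, 84):
    # total_steps=84 is an assumed value for illustration
    print(step, base_lr * lr_multiplier(step, warmup_steps=20, total_steps=84))
```

Because the scheduler is returned with `"interval": "step"`, the factor is updated after every optimizer step rather than once per epoch, which is what makes a short warmup meaningful.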

## Set up Dataset

T5 is text-to-text, so this class should simply format for that form of input.
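
As a rough illustration of that text-to-text framing, the sketch below turns one record into a source string and a target string without any tokenization. It is a simplified stand-in for the class that follows: `clean` and `format_example` are hypothetical helper names, the sample record is made up to resemble the data, and the explicit `</s>` markers that the class appends are left out.

```python
import re


def clean(text: str) -> str:
    """Trim whitespace and drop the punctuation that AutCorpus strips."""
    return re.sub(r"[.;:!'?,\"()\[\]]", "", text.strip())


def format_example(item: dict, prefix: str = "autscore") -> tuple:
    """Turn one ground-truth record into a (source, target) pair of strings."""
    # non-string responses (e.g. missing values) become a placeholder
    response = item["response"] if isinstance(item["response"], str) else "<unk>"
    source = f"{prefix} question: {clean(item['question'])} response: {clean(response)}"
    # records without a human rating get a placeholder target
    target = str(item["target"]) if "target" in item else "<unk>"
    return source, target


# hypothetical record, shaped like the JSON files the class reads
item = {"question": "What is a surprising use for a FORK?",
        "response": "as a homemade slingshot",
        "target": 3.5}
print(format_example(item))
```

The score is emitted as text (`"3.5"`), so the model learns to generate the digits of the rating rather than regress to a number.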

In [None]:
#@markdown *AUTCorpus* definition
class AutCorpus():
    def __init__(self, tokenizer, data_dir, type_path=None, max_len=512):
        if type_path:
            self.path = os.path.join(data_dir, type_path)
        else:
            self.path = data_dir
        self.all_files = os.listdir(self.path)
        self.max_len = max_len
        self.tokenizer = tokenizer
        self.inputs = []
        self.targets = []
        self.ids = []
        
        self._build()

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        source_ids = self.inputs[index]["input_ids"].squeeze()
        target_ids = self.targets[index]["input_ids"].squeeze()

        src_mask    = self.inputs[index]["attention_mask"].squeeze()  # might need to squeeze
        target_mask = self.targets[index]["attention_mask"].squeeze()  # might need to squeeze

        return {"source_ids": source_ids, "source_mask": src_mask, "target_ids": target_ids, "target_mask": target_mask}

    def _build(self):
        for fname in self.all_files:
            with open(os.path.join(self.path, fname), 'r') as f:
                item = json.load(f)
            
            prefix = "autscore"
            
            if not isinstance(item['response'], str):
                item['response'] = '<unk>'
            try:
                prompt = "question: " + self._process_text(item['question'])
                response = "response: " + self._process_text(item['response'])
            except Exception:
                # print the offending item for investigation, then re-raise
                print(item, fname)
                raise

            if 'target' in item:
                score = str(item['target'])
            else:
                # this data doesn't have ground truth
                score = '<unk>'

            input = self.tokenizer(f"{prefix} {prompt} {response}",
                                   max_length=self.max_len,
                                   truncation=True,
                                   padding="max_length",
                                   return_tensors="pt")
            
            target = self.tokenizer(score,
                                    truncation=True,
                                    padding="max_length",
                                    return_tensors="pt")
            
            self.ids.append(item['id'])
            self.inputs.append(input)
            self.targets.append(target)


    def _process_text(self, line):
        line = line.strip()
        # raw string avoids invalid-escape warnings for \[ and \]
        line = re.sub(r"[.;:!'?,\"()\[\]]", "", line)
        return line + ' </s>'

## Initialize T5 Model

In [None]:
#@markdown checkpoint params
temp_checkpoint_dir = 'checkpoints' #@param {type:'string'}
final_checkpoint_dir = base_dir / 'models' #@param {type:'raw'}
load_checkpoint = True #@param {type:'boolean'}
if load_checkpoint:
    chkpts = sorted(final_checkpoint_dir.glob(f"{model_name_or_path.split('/')[-1]}-{data_subdir}-*ckpt"))
    print("Checkpoints for this model+gt:", chkpts, "(Loading last)")
    if chkpts:
        checkpath = chkpts[-1]
    else:
        # no matching checkpoint: fall back to the pretrained model
        print("checkpoint can't be found")
        load_checkpoint = False

Checkpoints for this model+gt: [PosixPath('drive/MyDrive/Grants/MOTES/models/t5-base-gt_byprompt-08-05-1659715349.ckpt')] (Loading last)
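
Checkpoint names end in an epoch timestamp (the `%s` part of `fname`), while a plain name sort compares the month and day first, so "last" by name is not always the most recent run across a year boundary. The sketch below is one possible alternative that picks the checkpoint with the largest timestamp and returns `None` when nothing matches. `latest_checkpoint` is a hypothetical helper, and the second file name is invented for the demonstration.

```python
import tempfile
from pathlib import Path


def latest_checkpoint(folder: Path, model_name: str, split: str):
    """Return the newest matching .ckpt path, or None if there is none."""
    stem = model_name.split("/")[-1]
    candidates = list(folder.glob(f"{stem}-{split}-*.ckpt"))
    if not candidates:
        return None
    # names end in an epoch timestamp, e.g. ...-08-05-1659715349.ckpt
    return max(candidates, key=lambda p: int(p.stem.rsplit("-", 1)[-1]))


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # first name is from this notebook's output; the second is made up
    for name in ("t5-base-gt_byprompt-08-05-1659715349.ckpt",
                 "t5-base-gt_byprompt-12-30-1640900000.ckpt"):
        (folder / name).touch()
    print(latest_checkpoint(folder, "t5-base", "gt_byprompt").name)
    print(latest_checkpoint(folder, "t5-large", "gt_byprompt"))
```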


In [None]:
#@markdown ### Define train params and load model
epochs =  7#@param {type:"integer"}

#@markdown #### Batch Sizes
batch_size =  5#@param {type:"integer"}
if "-large" in model_name_or_path and tpu_or_gpu != 'tpu':
    warnings.warn("Large models likely need TPU.")

#@markdown `power` and `binsearch` scaling will start with a batch size of 1 and keep doubling until it reaches OOM
auto_scale_batch_size = None #@param ["None", "\"power\"", "\"binsearch\""] {type:"raw"}
#@markdown From [PyTorch Lightning training tips](https://pytorch-lightning.readthedocs.io/en/latest/advanced/training_tricks.html):
#@markdown accumulating gradients helps improve training by effectively
#@markdown mimicking a bigger batch size (16 is a good value)
gradient_accumulation_steps = 16 #@param {type:"integer"}

args = dict(
    data_dir=data_dir, # path for data files
    output_dir=temp_checkpoint_dir, # path to save the checkpoints
    model_name_or_path=model_name_or_path,
    tokenizer_name_or_path=model_name_or_path,
    max_seq_length=512,
    gradient_accumulation_steps=gradient_accumulation_steps,
    learning_rate=3e-4,
    weight_decay=0.0,
    adam_epsilon=1e-8,
    warmup_steps=20,
    batch_size=batch_size,
    num_train_epochs=epochs,
    # rule of thumb for data loader num_workers is 4 * num_GPU, unless memory is at a premium
    # https://www.pytorchlightning.ai/blog/7-tips-to-maximize-pytorch-performance
    data_loader_workers=4,
    max_grad_norm=1, # clip to avoid exploding gradients; 0 is off, 0.5 is sensible https://pytorch-lightning.readthedocs.io/en/latest/advanced/training_tricks.html#gradient-clipping
    seed=random_seed,
)


pin_dl_memory = True #@param {type:'boolean'}
args['pin_dl_memory'] = pin_dl_memory
#@markdown Use 16-bit precision. This can be done without apex in newer pytorch
fp_16 = False #@param {type:'boolean'}
args['fp_16'] = fp_16

checkpoint_callback = pl.callbacks.ModelCheckpoint(
    dirpath=args['output_dir'],
    filename=data_subdir+'-{epoch}-{val_loss:.2f}-{other_metric:.2f}',
    monitor="val_loss", mode="min", save_top_k=5
)

if load_checkpoint:
    model = T5FineTuner.load_from_checkpoint(checkpath)
else:
    model = T5FineTuner(**args)
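
To make the effect of gradient accumulation concrete, the following sketch computes the effective batch size and the number of optimizer steps in the same way as `T5FineTuner.total_steps`. The function names are hypothetical, and `n_examples=1000` is a made-up training-set size; the other numbers are the parameter values set in this cell.

```python
def effective_batch_size(batch_size: int, accumulation_steps: int) -> int:
    """Examples contributing to each optimizer update."""
    return batch_size * accumulation_steps


def total_optimizer_steps(n_examples: int, batch_size: int,
                          accumulation_steps: int, epochs: int) -> int:
    """Mirror of T5FineTuner.total_steps for a dataloader with drop_last=True."""
    batches_per_epoch = n_examples // batch_size  # incomplete last batch is dropped
    return batches_per_epoch // accumulation_steps * epochs


# batch_size=5, accumulation=16 and epochs=7 come from the cell above;
# n_examples=1000 is an assumed value for illustration only
print(effective_batch_size(5, 16))
print(total_optimizer_steps(1000, 5, 16, 7))
```

With these settings each update averages gradients over 80 examples, so `warmup_steps=20` covers a noticeable share of training when the dataset is small.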


In [None]:
corpus = AutCorpus(model.tokenizer, data_dir, "train")
print(model.tokenizer.decode(corpus[10]['source_ids'], skip_special_tokens=True))
print(model.tokenizer.decode(corpus[10]['target_ids'], skip_special_tokens=True))

autscore question: What is a surprising use for a FORK response: as a homemade slingshot
3.5


## Train

In [None]:
#@markdown Initialize Trainer
def get_dataset(tokenizer, type_path, args):
    # This is a generic function called in the Lightning module. 
    # return the training/validation dataset
    return AutCorpus(tokenizer=tokenizer, data_dir=args.data_dir, 
                          type_path=type_path,  max_len=args.max_seq_length)

#Initialize trainer
train_params = dict(
    accumulate_grad_batches=gradient_accumulation_steps,
    max_epochs=args['num_train_epochs'],
    precision= 16 if args['fp_16'] else 32,
    amp_backend='native',
    auto_scale_batch_size=auto_scale_batch_size,
    gradient_clip_val=args['max_grad_norm'],
    logger = wandb_logger,
    callbacks=[#LoggingCallback(),
               ],
)
if 'byprompt' not in data_subdir:
    train_params['callbacks'] += [pl.callbacks.EarlyStopping(monitor="val_loss"),
                                  checkpoint_callback]

tpu_cores = 1#@param {type:'integer'}
if tpu_or_gpu == "tpu":
    train_params['tpu_cores'] = tpu_cores 
elif tpu_or_gpu == "gpu":
    train_params['gpus'] = 1
else:
    raise ValueError(f"Unknown value for tpu_or_gpu: {tpu_or_gpu!r}")

trainer = pl.Trainer(**train_params)

  f"Setting `Trainer(gpus={gpus!r})` is deprecated in v1.7 and will be removed"
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [None]:
trainer.fit(model)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 222 M 
-----------------------------------------------------
222 M     Trainable params
0         Non-trainable params
222 M     Total params
891.614   Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

  f"Total length of `{dataloader.__class__.__name__}` across ranks is zero."


Training: 0it [00:00, ?it/s]

  rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown...")


In [None]:
wandb.finish()

In [None]:
# save checkpoint and logs
trainer.save_checkpoint(final_checkpoint_dir / (fname + '.ckpt'))

#logpath = Path('lightning_logs')
#last_version = max([int(x.name.split('version_')[1]) for x in logpath.glob("*version*")])
#last_logs = logpath / f"version_{last_version}"
#shutil.move(last_logs, base_dir / "Data/logs" / fname)

## Test

In [None]:
ex = "autscore question: What is a suprising use for an hammer  response: art"
inputs = model.tokenizer(ex, return_tensors="pt",
                         truncation=True, padding="max_length")
generation_output = model.model.generate(**inputs)
model.tokenizer.decode(generation_output[0], skip_special_tokens=True)



'2.8'

In [None]:
testdata = AutCorpus(model.tokenizer, data_dir, 'test')
loader = DataLoader(testdata,batch_size=256, shuffle=False, num_workers=4)

dec = []
texts = []
targets = []
for batch in tqdm(loader):
    print('.', end='')
    outs = model.model.generate(input_ids=batch['source_ids'], 
                              attention_mask=batch['source_mask'] ,
                              max_length=512)
    
    dec += [model.tokenizer.decode(ids,skip_special_tokens=True) for ids in outs]

    texts += [model.tokenizer.decode(ids, skip_special_tokens=True) for ids in batch['source_ids']]
    targets += [model.tokenizer.decode(ids, skip_special_tokens=True) for ids in batch['target_ids']]

print("Done")

100%|██████████| 14/14 [1:02:30<00:00, 267.92s/it]

Done





In [None]:
# load test data as a dataframe, convert output data to a dataframe, merge and save results
testdata_df = pd.DataFrame([pd.read_json(x, orient='index')[0] for x in (data_dir / 'test').glob('*json')])

outdata = pd.DataFrame(zip(testdata.ids, texts, targets, dec), columns=['id', 'prompt', 'target', 'predicted_raw'])
outdata['predicted'] = pd.to_numeric(outdata.predicted_raw, errors='coerce')
outdata.target = pd.to_numeric(outdata.target, errors='coerce')
outdata = outdata.rename(columns={'prompt':'t5-input'})
outdata['model'] = 't5-base'
outdata['split'] = data_subdir
x = outdata.drop('target', axis='columns').merge(testdata_df, how='left')
assert len(x) == len(outdata)
x = x[['id', 'model', 'participant', 'prompt', 'target', 'predicted', 'src', 'split']]
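out_dir = base_dir / 'Data' / 'evaluation' / data_subdir
out_dir.mkdir(parents=True, exist_ok=True)  # to_csv fails if the directory is missing
x.to_csv(out_dir / (fname + '.csv'))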

outdata.sample(5)

FileNotFoundError: ignored

In [None]:
x.corr()

In [None]:
x.groupby('src').corr()

src,,target,predicted
hmsl,target,1.0,0.703223
hmsl,predicted,0.703223,1.0
motes,target,1.0,0.320681
motes,predicted,0.320681,1.0
paca,target,1.0,0.773263
paca,predicted,0.773263,1.0
s08,target,1.0,0.498952
s08,predicted,0.498952,1.0
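
The values in this table are Pearson correlations between the human ratings and the model's parsed predictions, which is the default for pandas `corr()`. The sketch below shows the same two steps without pandas: coercing generated text to numbers and computing Pearson's r on the pairs that parse. The helper names are hypothetical and the rows are made-up values, not results from the study.

```python
import math


def to_number(text):
    """Parse generated text as a float; return None when it is not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# made-up (target, raw prediction) pairs, not results from the study
rows = [(3.5, "3.5"), (1.5, "2.0"), (4.5, "4.0"), (2.0, "n/a"), (1.0, "1.5")]
pairs = [(t, to_number(p)) for t, p in rows]
pairs = [(t, p) for t, p in pairs if p is not None]  # drop unparseable outputs
targets, preds = zip(*pairs)
print(round(pearson(targets, preds), 3))
```

Dropping unparseable generations mirrors what `pd.to_numeric(..., errors='coerce')` followed by `corr()` does, since `corr()` excludes missing values.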


In [None]:
for i in range(1, 5):
    subset = outdata[outdata.id.str.contains(f'-g{i}')]
    c = subset.corr().round(2)
    print(f"Game {i}")
    print("\tT5:\t", c.loc['predicted', 'target'], '\tExamples:', len(subset))

for src in ['paca', 'motes']:
    subset = outdata[outdata.id.str.contains(f'{src}-')]
    c = subset.corr().round(2)
    print(src.upper())
    print("\tT5:\t", c.loc['predicted', 'target'], '\tExamples:', len(subset))

Game 1
	T5:	 0.34 	Examples: 90
Game 2
	T5:	 0.51 	Examples: 72
Game 3
	T5:	 0.35 	Examples: 45
Game 4
	T5:	 0.41 	Examples: 54
PACA
	T5:	 0.68 	Examples: 665
MOTES
	T5:	 0.43 	Examples: 261
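
The subsets above are selected by substring matches on the item id, which encodes the data source and, for MOTES items, the game number. As a sketch of that convention, the hypothetical `parse_id` helper below extracts both parts; the sample ids are copied from rows shown elsewhere in this notebook, and the id layout is inferred from those examples rather than from a documented schema.

```python
import re
from collections import Counter


def parse_id(item_id: str):
    """Split an item id into (source, game); game is None without a -g<N> part."""
    source = item_id.split("-", 1)[0]
    match = re.search(r"-g(\d+)", item_id)
    return source, (int(match.group(1)) if match else None)


ids = ["motes-14ML-g4_library", "motes-1RG-g1_backpack",
       "paca-shoe-88852d0f1a994dae025ac0605ed08786"]
# count items per (source, game) group
print(Counter(parse_id(i) for i in ids))
```

Grouping on parsed fields rather than substrings avoids accidental matches, for example a source name that happens to appear inside another id.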


In [None]:
import difflib
import time
d = difflib.Differ()
with open(time.strftime("spelling-corrections-%m-%d-%Y.txt"), mode='w') as f:
    for i in range(len(texts)):
        #print("Input\t\t", texts[i])
        #print("Hand Fixed\t", targets[i])
        #print("Auto fixed\t", dec[i])
        diffs = ["\t\t\t"+x for x in d.compare([texts[i]+'\n'], [targets[i]+'\n'])]
        diffs[0] = "Original:" + diffs[0][2:]
        if len(diffs) > 1:
            j = 2 if diffs[1].startswith('\t\t\t?') else 1
            diffs = [diffs[0]] + diffs[j:]
            diffs[1] = "Hand-fixed:" + diffs[1][2:]
        else:
            diffs.append("(Hand-fix is unchanged)\n")
        diffs_auto = ["\t\t\t"+x for x in d.compare([texts[i]+'\n'], [dec[i]+'\n'])]
        if len(diffs_auto) > 2:
            j = 2 if diffs_auto[1].startswith('\t\t\t?') else 1
            diffs_auto[j] = "Auto-fixed:" + diffs_auto[j][2:]
            diffs += diffs_auto[j:]

        else:
            diffs.append("(Auto-fix is unchanged)\n")

        diffs.append("=================\n")
        
        if len(diffs) > 4:
            print("".join([x.replace(':\t', ':\t\t') for x in diffs]))
            f.write("".join(diffs))


## Compare to GloVe

In [None]:
glove = pd.read_csv('glove_test_data.csv')

In [None]:
a = glove[['id','prompt','question','response', 'truth', 'glove_norm']]
b = a.merge(outdata[['id', 't5', 'prompt']], how='inner', on='id')
b.to_csv('t5-and-glove.csv')
b

,id,prompt_x,question,response,truth,glove_norm,t5,prompt_y
0,motes-14ML-g4_library,library,When the kids were in the library they found...,a funny book and laugh but their voices where ...,5.0,4.5,5.5,autscore question: When the kids were in the l...
1,paca-shoe-88852d0f1a994dae025ac0605ed08786,shoe,What is a suprising use for a SHOE?,hold a door open,3.5,5.0,3.5,autscore question: What is a suprising use for...
2,paca-pants-e6a9d88312672fa11002a5c69c111eb0,pants,What is a suprising use for a PANTS?,keep someone warm,1.5,4.5,1.5,autscore question: What is a suprising use for...
3,motes-1RG-g1_backpack,backpack,What is a surprising use for a BACKPACK?,A halloween costume.,4.5,5.0,4.5,autscore question: What is a surprising use fo...
4,paca-rope-9f37391faf541c2e2a2a7dea1d470a97,rope,What is a suprising use for a ROPE?,use it to secure a boat,2.0,4.0,4.0,autscore question: What is a suprising use for...
...,...,...,...,...,...,...,...,...
741,paca-fork-8b38ed63b9c167401ba2a354f67016fd,fork,What is a suprising use for a FORK?,eating utensil,1.0,5.0,1.0,autscore question: What is a suprising use for...
742,paca-brick-f09328d495fa622700aabe5707edf00b,brick,What is a suprising use for a BRICK?,hit,4.0,4.5,4.0,autscore question: What is a suprising use for...
743,paca-fork-68de719c46275bf4cad9fbfe8c088cd3,fork,What is a suprising use for a FORK?,to put holes in wall,3.5,4.5,3.5,autscore question: What is a suprising use for...
744,paca-brick-976ce911df1da1cf7090aff8608f50b0,brick,What is a suprising use for a BRICK?,making house,1.5,4.0,1.0,autscore question: What is a suprising use for...


In [None]:
for i in range(1, 5):
    c = b[b.id.str.contains(f'-g{i}')].corr().round(2)
    print(f"Game {i}")
    print("\tGlove:\t", c.loc['glove_norm', 'truth'])
    print("\tT5:\t", c.loc['t5', 'truth'])

Game 1
	Glove:	 0.22
	T5:	 0.47
Game 2
	Glove:	 0.47
	T5:	 0.37
Game 3
	Glove:	 0.51
	T5:	 0.31
Game 4
	Glove:	 0.43
	T5:	 0.41


T5 

```
Game 1
	Glove:	0.22
	T5 1:	0.47
	T5 2:	0.39

Game 2
	Glove:	0.47
	T5 1:	0.37
	T5 2:	0.46

Game 3 (removed post-pilot)
	Glove:	0.51
	T5 1:	0.31
	T5 2:	0.48

Game 4 (game 3 post-pilot)
	Glove:	0.43
	T5 1:	0.41
	T5 2:	0.24
```

In [None]:
outdata.corr()

,truth,t5
truth,1.0,0.661971
t5,0.661971,1.0


In [None]:
b.to_csv('t5-and-glove.csv')

In [None]:
b.shape

(746, 8)

In [None]:
outdata.corr()

,truth,t5
truth,1.0,0.690679
t5,0.690679,1.0


## Run on Full MOTES Data

Data was pre-processed in [Prepare Motes Full Data.ipynb](https://colab.research.google.com/drive/17pUEqMHbhx8mnDg9EHiqeJ3QMc7QtMMO?usp=sharing)

In [None]:
fulldata = AutCorpus(model.tokenizer, base_dir/"Data/motes-full/json", None)
loader = DataLoader(fulldata,batch_size=128, shuffle=False, num_workers=4)

dec = []
texts = []
targets = []
for batch in tqdm(loader):
    print('.', end='')
    outs = model.model.generate(input_ids=batch['source_ids'], 
                              attention_mask=batch['source_mask'] ,
                              max_length=512)
    
    dec += [model.tokenizer.decode(ids,skip_special_tokens=True) for ids in outs]

    texts += [model.tokenizer.decode(ids, skip_special_tokens=True) for ids in batch['source_ids']]
    targets += [model.tokenizer.decode(ids, skip_special_tokens=True) for ids in batch['target_ids']]

print("Done")

100%|██████████| 50/50 [1:32:20<00:00, 110.82s/it]

Done





In [None]:
outdata = pd.DataFrame(zip(fulldata.ids, texts, targets, dec), columns=['id', 'prompt', 'target', 'predicted_raw'])
outdata['predicted'] = pd.to_numeric(outdata.predicted_raw, errors='coerce')
outdata.target = pd.to_numeric(outdata.target, errors='coerce')
outdata.to_csv(base_dir / 'Data' / 'evaluation' / ('motesfull-' + fname + '.csv'))
outdata.sample(5)

,id,prompt,target,predicted_raw,predicted
2007,motesfull-R_3Mu0oACM7BbSLFa-g2_wet,autscore question: What is a surprising exampl...,,6.5,6.5
2945,motesfull-R_Z33wMWuM1dL68gh-g3_schoolbus,autscore question: When I got on the school bu...,,4.5,4.5
1431,motesfull-R_1lawf8FaYk3WmPt-g3_library,autscore question: When the kids were in the l...,,5.5,5.5
3345,motesfull-R_1NrHH6CSGCHDezr-g2_red,autscore question: What is a surprising exampl...,,3.5,3.5
5558,motesfull-R_3QPCr3zlr9JKxe1-g1_spoon,autscore question: What is a surprising use fo...,,1.0,1.0
