<a href="https://colab.research.google.com/github/jsvan/MaskTests4LanguageModels/blob/main/Summerwinograndeattempt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

How does a language model understand words? It represents input text as a rich, high-dimensional feature vector. Simple word vectors learn co-occurrence probabilities between words, so each word is represented by the contexts it shares with every other word. Better embeddings use as much of the available space as they can, placing dissimilar words far away from each other.
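
To make the co-occurrence idea concrete, here is a minimal sketch in plain Python. The four-sentence corpus, the window size, and the helper names `cooccurrence_vectors` and `cosine` are assumptions made up for illustration; RoBERTa's learned embeddings are far richer than raw counts.

```python
from collections import Counter
from math import sqrt

# Toy corpus, invented for illustration only.
corpus = [
    "the employees threw a party",
    "the party was loud",
    "the employees drank alcohol",
    "the alcohol was strong",
]

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            vectors.setdefault(word, Counter()).update(context)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

vectors = cooccurrence_vectors(corpus)
party_vs_alcohol = cosine(vectors["party"], vectors["alcohol"])
party_vs_party = cosine(vectors["party"], vectors["party"])
print(party_vs_alcohol, party_vs_party)
```

Even in this tiny corpus the two words share only part of their contexts, so their vectors point in different directions. That difference is the kind of signal the masking experiments below take away.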

When RoBERTa trains on Winogrande, how is it able to tell apart the different words in its task? For example,
> "The employees threw a [party] and drank so much [alcohol] that they could not go into work the next day. The _ was loud. "

"Party" is represented very differently in vector space than "alcohol", so it should be no problem for the classifier to tell those objects apart. What happens if we mask the objects, thereby removing the representational advantage these words have? Will the classifier get confused about what "mask1" and "mask2" refer to? Do different masking techniques alter accuracy to any noticeable degree?

In [None]:
%%capture
!pip install transformers
!pip install datasets 

In [None]:
%%capture
from datasets import load_dataset, concatenate_datasets

# You can switch between these two datasets
dataset = load_dataset("winogrande", 'winogrande_debiased')
#dataset = load_dataset("winogrande", 'winogrande_l')

# dataset is dict with keys ['train', 'test', 'validation']
# Each with an enumerable of 
"""
{'answer': '2',
 'option1': 'Kyle',
 'option2': 'Logan',
 'sentence': "Kyle doesn't wear leg warmers to bed, while Logan almost always does. _ is more likely to live in a colder climate."}

"""
# Use Validation instead of Test because Test lacks labels.

In [None]:
dataset['validation'][4]

{'answer': '1',
 'option1': 'Jeffrey',
 'option2': 'Hunter',
 'sentence': 'At night, Jeffrey always stays up later than Hunter to watch TV because _ wakes up late.'}
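
As a rough illustration of how one such record becomes two labelled sentences for a sequence classifier, here is a minimal sketch using the record printed above. The helper name `expand` is hypothetical; the notebook's own `prepare_data` below does the equivalent over the whole dataset.

```python
def expand(record):
    """Turn one Winogrande record into two (sentence, label) pairs."""
    pairs = []
    for key, answer in (("option1", "1"), ("option2", "2")):
        # Fill the "_" placeholder with this option.
        filled = record["sentence"].replace("_", record[key])
        # Label is 1 when this option is the correct answer, else 0.
        pairs.append((filled, int(record["answer"] == answer)))
    return pairs

record = {
    "answer": "1",
    "option1": "Jeffrey",
    "option2": "Hunter",
    "sentence": "At night, Jeffrey always stays up later than Hunter "
                "to watch TV because _ wakes up late.",
}

pairs = expand(record)
for sentence, label in pairs:
    print(label, sentence)
```

Each record therefore contributes exactly one positive and one negative sentence, so the expanded dataset is balanced by construction.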

In [None]:
# Textizer
"""
    Each sentence was split on "_" placeholder symbol.
    Each option was concatenated with the second part of the split, thus transforming each example into two text segment pairs.
    Text segment pairs corresponding to correct and incorrect options were marked with True and False labels accordingly.
    Text segment pairs were shuffled thereafter.

"""

from datasets import Dataset

def prepare_data(dataset):

  # internal function
  def prep_ds(dataset):
    sentences, answers, o1, o2 = [], [], [], []
    for p in dataset:
      s1 = p['option1'].join(p['sentence'].split('_'))
      s2 = p['option2'].join(p['sentence'].split('_'))
      a1 = int(p['answer'] == '1')
      a2 = int(p['answer'] == '2')

      
      sentences.append(s1)
      answers.append(a1)
      sentences.append(s2)
      answers.append(a2)
      o1.append(p['option1'])
      o2.append(p['option2'])
      o1.append(p['option1'])
      o2.append(p['option2'])

    return {'sentence':sentences, 'labels':answers, 'option1':o1, 'option2':o2}
    # end internal function

  train = prep_ds(dataset["train"])
  test = prep_ds(dataset["validation"])
  trainds = Dataset.from_dict( train ).shuffle()
  testds = Dataset.from_dict( test ).shuffle()

  return {"train":trainds, "test":testds, "name":"Standard Dataset"}





def mask_copy(dataset):
  sentences = []
  toprint = 10

  for p in dataset:
    if toprint > 0:
      print(f"[{p['option1']}], [{p['option2']}], [{p['sentence']}]")

    sentences.append(p['sentence'].replace(p['option1'], 'option1').replace(p['option2'], 'option2'))
    
    if toprint > 0:
      print(sentences[-1])
      toprint -= 1

  build = {'sentence':sentences, 'labels':dataset['labels'], 'option1':dataset['option1'], 'option2':dataset['option2']}
  return Dataset.from_dict(build) # DON'T SHUFFLE





def mask_datasets(dataset):
  return {"train":mask_copy(dataset['train']), "test":mask_copy(dataset['test']), "name":"Masked Dataset"}




In [None]:
%%capture
#dicts of {'train', 'test'}
std_datasets = prepare_data(dataset)

masked_datasets = mask_datasets(std_datasets)

If we peek at the two datasets, we can see that the masking did indeed work.

In [None]:
std_datasets['train']['sentence'][:5]

['Robert lent the book to Ian for the reason that Robert already read the book.',
 'The woman needed to decide between the jacket and the sweater and ultimately chose the sweater because she disliked the color of the yarn.',
 'Dennis won a swimming race to Donald because Donald had shorter and weaker arms and legs.',
 'I had to communicate the news over the phone or at the restaurant tonight. I told them at the phone even though I wanted them to enjoy their evening.',
 'Jessica often experiences severe nausea, Victoria does not therefore Victoria often rides big roller coasters.']

In [None]:
masked_datasets['test']['sentence'][:5]

["option1 went to option2's house to play with the new dog, but there was no answer. option1 was at the park.",
 'No one would have noticed the option1 on that option2 because the option2 is small.',
 'Eating spicy foods better suited option1 and not option2 because option1 never got acid reflux from salsa.',
 'Timmy bought a option1 for his cat so he could take him on the option2 but the option1 was too small.',
 'option1 has dark lips unlike option2 due to option2 forgetting to put on chapsticks at night.']

This upcoming section looks at how the pretrained model performs on standard Winogrande versus the same data masked with 'option1' and 'option2'.

This model is a RoBERTa-large model that was trained on winogrande_xl. On the winogrande_m dataset:
the standard dataset receives 85% accuracy, and
the masked dataset receives 80% accuracy.

When we use the winogrande_debiased dataset, results fall to 69% and 68% respectively. Because these are so much lower, let's fine-tune the model on the debiased training set.
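
The accuracy figures here are sentence-level: each filled-in sentence is scored on its own as true or false. A minimal sketch of that metric in plain Python, with made-up logits and a hypothetical helper name `accuracy`, might look like this:

```python
def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Invented two-class logits for four sentences (illustration only).
logits = [[0.2, 1.3], [2.0, -1.0], [0.1, 0.4], [1.5, 0.5]]
labels = [1, 0, 0, 0]

score = accuracy(logits, labels)
print(score)  # 0.75
```

Because every record yields one true and one false sentence, a classifier that always answers the same way would score 50% under this metric.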

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from pprint import pprint
from tqdm import tqdm

tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/roberta-large-winogrande", model_max_length=64)

def test_datasets(tokenizer, std_datasets, masked_datasets, model=False,):

  print("Testing 1 2 3 ...")
  delete = False
  if not model:
    model = AutoModelForSequenceClassification.from_pretrained("DeepPavlov/roberta-large-winogrande")
    delete = True

  elif isinstance(model, str):
    delete = True
    with torch.no_grad():
      torch.cuda.empty_cache()

    model = torch.load(model) # open(model, "rb"))

    with torch.no_grad():
      torch.cuda.empty_cache()
  try:
    for ds in (std_datasets, masked_datasets):
      print("Performing", ds["name"])
      
      #combinedtraintest = concatenate_datasets([ds['train'], ds['test']])
      #encoded_train = combinedtraintest.map(lambda examples: tokenizer(examples['sentence'], padding='max_length'), batched=True) # , return_tensors='pt'
      encoded_test = ds['test'].map(lambda examples: tokenizer(examples['sentence'], padding='max_length'), batched=True)

      #encoded_train.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
      encoded_test.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
      #dataloader_train = torch.utils.data.DataLoader(encoded_train, batch_size=32)
      dataloader_test = torch.utils.data.DataLoader(encoded_test, batch_size=32)

      device = 'cuda' if torch.cuda.is_available() else 'cpu' 
      #model.train().to(device)
      model.to(device)
      #optimizer = torch.optim.AdamW(params=model.parameters(), lr=1e-5)

      correct = 0
      total = 0
      model.eval()  # disable dropout while scoring
      with torch.no_grad():  # gradients are not needed for evaluation
        for i, batch in enumerate(tqdm(dataloader_test)):
          batch = {k: v.to(device) for k, v in batch.items()}

          outputs = model(**batch)
          # Predicted class is the index of the largest logit per row.
          predictions = outputs.logits.argmax(dim=-1)
          correct += (predictions == batch['labels']).sum().item()
          total += len(batch['labels'])
        
        
      print("Score", correct / total, '/ 1.00')

    if delete:
      del model
      print("Deleted model")

  except Exception as e:
    del model
    print(e)
    print("Deleted model")


# test_datasets(tokenizer, std_datasets, masked_datasets)

We'll train a model on the unmasked debiased training set for one epoch.

After the first epoch, the standard test set reaches 75% accuracy and the masked test set 70% accuracy.

Further epochs bring no change (tested with 3).

In [None]:

def train_model(trainingset, compareset, tokenizer, model=False, copy=False):
  if not model:
    model = AutoModelForSequenceClassification.from_pretrained("DeepPavlov/roberta-large-winogrande")
  elif copy: # ie, if copy and model
    print('copy')
    with torch.no_grad():
      torch.cuda.empty_cache()
    model = torch.load(model) # open(model, "rb"))
    with torch.no_grad():
      torch.cuda.empty_cache()
    #model_copy = type(model)() # get a new instance
    #model_copy.load_state_dict(model.state_dict()) # copy weights and stuff
    #model = model_copy
    print('finished?')
  try:
    encoded_train = trainingset['train'].map(lambda examples: tokenizer(examples['sentence'], padding='max_length'), batched=True) # , return_tensors='pt'
    encoded_train.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
    dataloader_train = torch.utils.data.DataLoader(encoded_train, batch_size=32)

    device = 'cuda' if torch.cuda.is_available() else 'cpu' 
    model.train().to(device)
    optimizer = torch.optim.AdamW(params=model.parameters(), lr=1e-5)

    for epoch in range(1):
      print("Epoch", epoch)
      correct = 0
      total = 0
      for i, batch in enumerate(tqdm(dataloader_train)):
        batch = {k: v.to(device) for k, v in batch.items()}

        outputs = model(**batch)
        loss = outputs[0]
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if i % 10 == 0:
          print(f" loss: {loss}")
        #for oo, lab in zip(outputs.logits, batch['labels'] ):
        #  print(oo.argmax().item(), lab.items())
        correct += sum((int(x.argmax().item() == y.item()) for x, y in zip(outputs.logits, batch['labels'])))
        total += len(batch['labels'])
        
      print("Score", correct / total, '/ 1.00')
      test_datasets(tokenizer, trainingset, compareset, model)
      model.train().to(device)
  except Exception as e:
    print(e)
    model = None  # drop the reference so the caller gets None instead of a NameError
    print("deleted model")
  return model




In [None]:

# This model will become the base model for everything we build upon, 
# so we will save it to disk to load it fresh for each experiment. 
debiased_tuned_model = train_model(std_datasets, masked_datasets, tokenizer)
torch.save(debiased_tuned_model, f="debiased_model")
debiased_tuned_model = None



  0%|          | 0/19 [00:00<?, ?ba/s]

Epoch 0


  0%|          | 1/578 [00:00<07:44,  1.24it/s]

 loss: 0.30050966143608093


  2%|▏         | 11/578 [00:07<06:41,  1.41it/s]

 loss: 0.06143321096897125


  4%|▎         | 21/578 [00:14<06:34,  1.41it/s]

 loss: 0.09205632656812668


  5%|▌         | 31/578 [00:22<06:27,  1.41it/s]

 loss: 0.09935687482357025


  7%|▋         | 41/578 [00:29<06:20,  1.41it/s]

 loss: 0.06699220091104507


  9%|▉         | 51/578 [00:36<06:14,  1.41it/s]

 loss: 0.008995802141726017


 11%|█         | 61/578 [00:43<06:06,  1.41it/s]

 loss: 0.22418028116226196


 12%|█▏        | 71/578 [00:50<05:58,  1.41it/s]

 loss: 0.05139536410570145


 14%|█▍        | 81/578 [00:57<05:51,  1.41it/s]

 loss: 0.006116499658674002


 16%|█▌        | 91/578 [01:04<05:44,  1.41it/s]

 loss: 0.027860887348651886


 17%|█▋        | 101/578 [01:11<05:37,  1.41it/s]

 loss: 0.026967735961079597


 19%|█▉        | 111/578 [01:18<05:30,  1.41it/s]

 loss: 0.00401950441300869


 21%|██        | 121/578 [01:25<05:23,  1.41it/s]

 loss: 0.008318298496305943


 23%|██▎       | 131/578 [01:32<05:16,  1.41it/s]

 loss: 0.010341075249016285


 24%|██▍       | 141/578 [01:39<05:09,  1.41it/s]

 loss: 0.020266059786081314


 26%|██▌       | 151/578 [01:46<05:02,  1.41it/s]

 loss: 0.050350967794656754


 28%|██▊       | 161/578 [01:54<04:55,  1.41it/s]

 loss: 0.034362152218818665


 30%|██▉       | 171/578 [02:01<04:47,  1.41it/s]

 loss: 0.021577056497335434


 31%|███▏      | 181/578 [02:08<04:40,  1.41it/s]

 loss: 0.029037704691290855


 33%|███▎      | 191/578 [02:15<04:34,  1.41it/s]

 loss: 0.015812911093235016


 35%|███▍      | 201/578 [02:22<04:27,  1.41it/s]

 loss: 0.09099285304546356


 37%|███▋      | 211/578 [02:29<04:19,  1.41it/s]

 loss: 0.007730567827820778


 38%|███▊      | 221/578 [02:36<04:13,  1.41it/s]

 loss: 0.011599455960094929


 40%|███▉      | 231/578 [02:43<04:05,  1.41it/s]

 loss: 0.008352228440344334


 42%|████▏     | 241/578 [02:50<03:58,  1.41it/s]

 loss: 0.05046742036938667


 43%|████▎     | 251/578 [02:57<03:51,  1.41it/s]

 loss: 0.010022688657045364


 45%|████▌     | 261/578 [03:04<03:44,  1.41it/s]

 loss: 0.009866062551736832


 47%|████▋     | 271/578 [03:11<03:37,  1.41it/s]

 loss: 0.0257477518171072


 49%|████▊     | 281/578 [03:18<03:30,  1.41it/s]

 loss: 0.04880094900727272


 50%|█████     | 291/578 [03:26<03:23,  1.41it/s]

 loss: 0.013394075445830822


 52%|█████▏    | 301/578 [03:33<03:16,  1.41it/s]

 loss: 0.10113755613565445


 54%|█████▍    | 311/578 [03:40<03:09,  1.41it/s]

 loss: 0.035136688500642776


 56%|█████▌    | 321/578 [03:47<03:01,  1.41it/s]

 loss: 0.13019055128097534


 57%|█████▋    | 331/578 [03:54<02:54,  1.41it/s]

 loss: 0.11520209163427353


 59%|█████▉    | 341/578 [04:01<02:47,  1.41it/s]

 loss: 0.025441447272896767


 61%|██████    | 351/578 [04:08<02:40,  1.41it/s]

 loss: 0.04516652598977089


 62%|██████▏   | 361/578 [04:15<02:33,  1.41it/s]

 loss: 0.011285915970802307


 64%|██████▍   | 371/578 [04:22<02:26,  1.41it/s]

 loss: 0.16046884655952454


 66%|██████▌   | 381/578 [04:29<02:19,  1.41it/s]

 loss: 0.015912827104330063


 68%|██████▊   | 391/578 [04:36<02:12,  1.41it/s]

 loss: 0.004010608419775963


 69%|██████▉   | 401/578 [04:43<02:05,  1.41it/s]

 loss: 0.006898090243339539


 71%|███████   | 411/578 [04:50<01:58,  1.41it/s]

 loss: 0.0661335363984108


 73%|███████▎  | 421/578 [04:57<01:51,  1.41it/s]

 loss: 0.0048845992423594


 75%|███████▍  | 431/578 [05:05<01:44,  1.41it/s]

 loss: 0.023679763078689575


 76%|███████▋  | 441/578 [05:12<01:36,  1.41it/s]

 loss: 0.036163147538900375


 78%|███████▊  | 451/578 [05:19<01:29,  1.41it/s]

 loss: 0.0821063369512558


 80%|███████▉  | 461/578 [05:26<01:22,  1.41it/s]

 loss: 0.009664991870522499


 81%|████████▏ | 471/578 [05:33<01:15,  1.41it/s]

 loss: 0.006672048009932041


 83%|████████▎ | 481/578 [05:40<01:08,  1.41it/s]

 loss: 0.0695372074842453


 85%|████████▍ | 491/578 [05:47<01:01,  1.41it/s]

 loss: 0.0916987881064415


 87%|████████▋ | 501/578 [05:54<00:54,  1.41it/s]

 loss: 0.09198770672082901


 88%|████████▊ | 511/578 [06:01<00:47,  1.41it/s]

 loss: 0.03408629447221756


 90%|█████████ | 521/578 [06:08<00:40,  1.41it/s]

 loss: 0.0296997781842947


 92%|█████████▏| 531/578 [06:15<00:33,  1.41it/s]

 loss: 0.025811005383729935


 94%|█████████▎| 541/578 [06:22<00:26,  1.41it/s]

 loss: 0.05862284079194069


 95%|█████████▌| 551/578 [06:29<00:19,  1.41it/s]

 loss: 0.01408953033387661


 97%|█████████▋| 561/578 [06:37<00:12,  1.41it/s]

 loss: 0.003849643748253584


 99%|█████████▉| 571/578 [06:44<00:04,  1.41it/s]

 loss: 0.028261344879865646


100%|██████████| 578/578 [06:49<00:00,  1.41it/s]

Score 0.9824826989619377 / 1.00
Testing 1 2 3 ...
Performing Standard Dataset





  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 80/80 [00:18<00:00,  4.30it/s]

Score 0.7505919494869772 / 1.00
Performing Masked Dataset





  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 80/80 [00:18<00:00,  4.31it/s]


Score 0.7087608524072613 / 1.00
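
As a sanity check on the progress bars above, the batch counts can be reproduced from the split sizes. This sketch assumes that winogrande_debiased has 9,248 training records and 1,267 validation records (these sizes are not printed in the notebook), that each record expands into two sentences, and that `Dataset.map(batched=True)` uses its default chunk size of 1,000.

```python
import math

TRAIN_RECORDS = 9248        # assumed size of the winogrande_debiased train split
VALIDATION_RECORDS = 1267   # assumed size of the validation split
SENTENCES_PER_RECORD = 2    # one sentence per option
BATCH_SIZE = 32             # DataLoader batch size used in this notebook
MAP_CHUNK = 1000            # assumed default batch_size of Dataset.map(batched=True)

train_sentences = TRAIN_RECORDS * SENTENCES_PER_RECORD
test_sentences = VALIDATION_RECORDS * SENTENCES_PER_RECORD

train_batches = math.ceil(train_sentences / BATCH_SIZE)
test_batches = math.ceil(test_sentences / BATCH_SIZE)
train_map_chunks = math.ceil(train_sentences / MAP_CHUNK)
test_map_chunks = math.ceil(test_sentences / MAP_CHUNK)

print(train_batches, test_batches)        # 578 80
print(train_map_chunks, test_map_chunks)  # 19 3
```

Under those assumptions the numbers line up with the `578`, `80`, `19`, and `3` totals shown in the logs.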


Does training it on the masked training set change anything?

In [None]:
#Clear the GPU RAM...

with torch.no_grad():
    torch.cuda.empty_cache()

In [None]:
#masked_tuned_model = train_model(masked_datasets, std_datasets, tokenizer)


A model pre-tuned on winogrande_xl and then tuned on the masked debiased training set gets slightly better accuracy on the masked test set (73.2% accuracy for standard, 73.8% for masked).

The gap stays similarly small if we use the winogrande_l dataset instead of debiased: 72% accuracy on the standard set and 71.5% on the masked set.

What happens if we tune on the masked set on top of the model already tuned on the standard debiased set?

In [None]:
#with torch.no_grad():
#    torch.cuda.empty_cache()

In [None]:
#debiased_masked_tuned_model = train_model(masked_datasets, std_datasets, tokenizer, model = debiased_tuned_model)


In this case, the standard debiased dataset achieves 73.6% accuracy and the masked dataset 73.4% accuracy: practically the same.
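
One rough way to judge whether a 0.2-point gap means anything is to compare it with the binomial standard error of an accuracy estimate. The sketch below assumes a test set of 2,534 sentences (two per validation record, which is an assumption about the split size) and treats sentences as independent. That independence does not strictly hold, since sentences come in pairs, so read the result as an order-of-magnitude guide only.

```python
from math import sqrt

def standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return sqrt(accuracy * (1 - accuracy) / n)

N_TEST = 2534  # assumed number of test sentences (2 x 1267 validation records)

se = standard_error(0.736, N_TEST)
gap = 0.736 - 0.734

print(f"standard error: {se:.4f}")  # roughly 0.0088, i.e. about 0.9 points
print(f"observed gap:   {gap:.4f}")
```

The observed gap is several times smaller than one standard error, which is consistent with calling the two results practically the same.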

Let's try a different masking style. What happens if we cover up one of the options with an [unknown] mask? Can the model still treat the masked token as an object and work out what it refers to?

Take the following example sentence:

{
  "sentence": "The [plant] took up too much room in the [urn], because the [plant] was small.",
  "label": false
}

There are two ways to test the model. If we mask the plant, then there are two references to the same masked object, and the True/False question directly references that object. If we mask the urn, then the contextual object is hidden, but it is never the subject of inquiry. Let's test covering up each in turn. The first test is more interesting, but we might as well see if the second turns up anything.
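
The two variants can be sketched on the example sentence as follows. This is a simplified illustration, not the notebook's masking functions: the helper names `roles` and `mask_word` are hypothetical, `"<unk>"` is assumed to be the tokenizer's unknown token, and the "last-mentioned option is the question word" rule is the same heuristic used in the cells below.

```python
import re

UNK = "<unk>"  # assumed to match tokenizer.unk_token

def roles(sentence, option1, option2):
    """Return (question_word, context_word).

    Heuristic: the option mentioned last in the sentence is the one
    the True/False question is about.
    """
    lowered = sentence.lower()
    if lowered.rfind(option1.lower()) > lowered.rfind(option2.lower()):
        return option1, option2
    return option2, option1

def mask_word(sentence, word):
    """Replace every whole-word mention of `word` with the unknown token."""
    return re.sub(rf"\b{re.escape(word)}\b", UNK, sentence, flags=re.IGNORECASE)

sentence = "The plant took up too much room in the urn, because the plant was small."
question_word, context_word = roles(sentence, "plant", "urn")

masked_question = mask_word(sentence, question_word)  # first test
masked_context = mask_word(sentence, context_word)    # second test

print(masked_question)
print(masked_context)
```

In the first variant both mentions of the question word disappear together, so the model has to recognise that the two unknown tokens refer to the same thing.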

In [None]:
import re

def mask_copy_1(dataset, unk):
  sentences = []
  toprint = 5

  for p in dataset:
    # This is annoying because option1 can be substrings of other words, usually option2. They can also have uppercase letters. 
    # If one is a substring, then I will cover up the larger word with a temporary mask to not confuse anything else.
    sentence = p['sentence']
    option1, option2 = re.compile(p['option1'], re.IGNORECASE), re.compile(p['option2'], re.IGNORECASE)
    maskedoption1, maskedoption2 = "OPTION_ONE", "OPTION_TWO"
    first_search, second_search, first_mask, second_mask = None, None, None, None

    # "table" in "tablecloth" --> cover bigger one
    if len(p['option1']) > len(p['option2']):
      # cover option1 first
      first_search, second_search = option1, option2
      first_mask, second_mask = maskedoption1, maskedoption2
    else:
      first_search, second_search = option2, option1
      first_mask, second_mask = maskedoption2, maskedoption1
    
    sentence = first_search.sub(first_mask, sentence) #Mask the longer word with OPTION_ mask
    sentence = second_search.sub(second_mask, sentence) #then the shorter one

    # IT gets kinda confusing which word now to <IGNORE> because it's language and some words appear more than two, more than three, times. 
    # I think the smartest approach is to assume the final word used is the question word and mask that one. 

    # Find final word used
    # maskedoption1 is the final word
    if sentence.rfind(maskedoption1) > sentence.rfind(maskedoption2):
      # option1 is the final (question) word: mask it with unk
      # and restore option2 to its original text
      sentence = sentence.replace(maskedoption1, unk)
      sentence = sentence.replace(maskedoption2, p['option2'])
    else:
      # option2 is the final (question) word: mask it with unk
      # and restore option1 to its original text
      sentence = sentence.replace(maskedoption2, unk)
      sentence = sentence.replace(maskedoption1, p['option1'])


    if toprint > 0:
      print(f"[{p['option1']}], [{p['option2']}], [{p['sentence']}]")

    sentences.append(sentence)
    
    if toprint > 0:
      print(sentences[-1])
      toprint -= 1

  build = {'sentence':sentences, 'labels':dataset['labels'], 'option1':dataset['option1'], 'option2':dataset['option2']}
  return Dataset.from_dict(build) # DON'T SHUFFLE





def mask_datasets_1(dataset, tokenizer):
  unk = tokenizer.unk_token
  return {"train":mask_copy_1(dataset['train'], unk), "test":mask_copy_1(dataset['test'], unk), "name":"Double <Unk> Masked Dataset"}

In [None]:
#unk_datasets = mask_datasets_1(std_datasets, tokenizer)



In [None]:
#test_datasets(tokenizer, std_datasets, unk_datasets, debiased_tuned_model)

The [unk] tokens reduce performance from 75% accuracy to 65% accuracy. Let's tune the model on the [unk] masked dataset and see if it redeems itself. 

In [None]:
#unk_masked_tuned_model = train_model(unk_datasets, std_datasets, tokenizer)


After training on the debiased [unk]-masked set, we reach 73% accuracy on it, while accuracy on the standard set drops to 69%. It is strange that the standard dataset got worse. That may mean the dataset I created contains some bias that lets the model learn tricks.

Let's try masking only the single option that is not the subject of the question.

In [None]:
import re


# This puts the <unk> token on the option that is NOT involved in the question.
# If the option involved in the question is wrong, the model will need to identify the <unk> token as having importance.
def mask_copy_2(dataset, unk):
  sentences = []
  toprint = 5

  for p in dataset:
    # This is annoying because option1 can be substrings of other words, usually option2. They can also have uppercase letters. 
    # If one is a substring, then I will cover up the larger word with a temporary mask to not confuse anything else.
    sentence = p['sentence']
    option1, option2 = re.compile(p['option1'], re.IGNORECASE), re.compile(p['option2'], re.IGNORECASE)
    maskedoption1, maskedoption2 = "OPTION_ONE", "OPTION_TWO"
    first_search, second_search, first_mask, second_mask = None, None, None, None

    # "table" in "tablecloth" --> cover bigger one
    if len(p['option1']) > len(p['option2']):
      # cover option1 first
      first_search, second_search = option1, option2
      first_mask, second_mask = maskedoption1, maskedoption2
    else:
      first_search, second_search = option2, option1
      first_mask, second_mask = maskedoption2, maskedoption1
    
    sentence = first_search.sub(first_mask, sentence) #Mask the longer word with OPTION_ mask
    sentence = second_search.sub(second_mask, sentence) #then the shorter one

    # IT gets kinda confusing which word now to <IGNORE> because it's language and some words appear more than two, more than three, times. 
    # I think the smartest approach is to assume the final word used is the question word and mask that one. 

    # Find final word used
    # maskedoption1 is NOT the final word
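    if sentence.rfind(maskedoption1) < sentence.rfind(maskedoption2):
      # option2 is the final (question) word: mask option1, the context word,
      # and restore option2 to its original text
      sentence = sentence.replace(maskedoption1, unk)
      sentence = sentence.replace(maskedoption2, p['option2'])
    else:
      # option1 is the final (question) word: mask option2, the context word,
      # and restore option1 to its original text
      sentence = sentence.replace(maskedoption2, unk)
      sentence = sentence.replace(maskedoption1, p['option1'])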


    if toprint > 0:
      print(f"[{p['option1']}], [{p['option2']}], [{p['sentence']}]")

    sentences.append(sentence)
    
    if toprint > 0:
      print(sentences[-1])
      toprint -= 1

  build = {'sentence':sentences, 'labels':dataset['labels'], 'option1':dataset['option1'], 'option2':dataset['option2']}
  return Dataset.from_dict(build) # DON'T SHUFFLE





def mask_datasets_2(dataset, tokenizer):
  unk = tokenizer.unk_token
  return {"train":mask_copy_2(dataset['train'], unk), "test":mask_copy_2(dataset['test'], unk), "name":"Single <Unk> Masked Dataset"}

In [None]:
unk_dataset_2 = mask_datasets_2(std_datasets, tokenizer)

#test_datasets(tokenizer, std_datasets, unk_dataset_2, debiased_tuned_model)

[Robert], [Ian], [Robert lent the book to Ian for the reason that Robert already read the book.]
Ian lent the book to <unk> for the reason that Ian already read the book.
[sweater], [jacket], [The woman needed to decide between the jacket and the sweater and ultimately chose the sweater because she disliked the color of the yarn.]
The woman needed to decide between the <unk> and the jacket and ultimately chose the jacket because she disliked the color of the yarn.
[Dennis], [Donald], [Dennis won a swimming race to Donald because Donald had shorter and weaker arms and legs.]
<unk> won a swimming race to Donald because Donald had shorter and weaker arms and legs.
[phone], [restaurant], [I had to communicate the news over the phone or at the restaurant tonight. I told them at the phone even though I wanted them to enjoy their evening.]
I had to communicate the news over the restaurant or at the <unk> tonight. I told them at the restaurant even though I wanted them to enjoy their evening.


The standard set achieves 75% accuracy and the single \<unk\> set 68%.

Let's train the model on the single \<unk\> set. It achieves 75% on standard and 73% on \<unk\>, which is 5 percentage points higher than before tuning.


In [None]:
with torch.no_grad():
  torch.cuda.empty_cache()

In [None]:

#train_model(unk_dataset_2, std_datasets, tokenizer, model="debiased_model", copy = True)

What happens if we try a bunch of stupid masks? Can we make masks so linguistically nonsensical that the model's performance drops?

We can try masks such as the following:



In [None]:
stupid_tries = [
  ("dog", "doggy"),
  ("red", "blue"),
  ("flavor", "flavour"),
  ('A', 'B'),
  ('X', 'Y'),
  ('1', '2'),
  ('first', 'second'),
  ('alpha', 'beta'),
  ('#', '@'),
  ('primero', 'secundo'),
  ('Alice', 'Bob'),
  ('_', '__')
  
]


In [None]:
import re


# This replaces every mention of option1 with mask1 and every mention of option2 with mask2.
# Neither option is restored, so the model only sees the two arbitrary mask strings.
def stupid_masking(dataset, mask1, mask2):
  sentences = []
  toprint = 5

  for p in dataset:
    # This is annoying because option1 can be substrings of other words, usually option2. They can also have uppercase letters. 
    # If one is a substring, then I will cover up the larger word with a temporary mask to not confuse anything else.
    sentence = p['sentence']
    option1, option2 = re.compile(p['option1'], re.IGNORECASE), re.compile(p['option2'], re.IGNORECASE)
    maskedoption1, maskedoption2 = "OPTION_ONE", "OPTION_TWO"
    first_search, second_search, first_mask, second_mask = None, None, None, None

    # "table" in "tablecloth" --> cover bigger one
    if len(p['option1']) > len(p['option2']):
      # cover option1 first
      first_search, second_search = option1, option2
      first_mask, second_mask = maskedoption1, maskedoption2
    else:
      first_search, second_search = option2, option1
      first_mask, second_mask = maskedoption2, maskedoption1
    
    sentence = first_search.sub(first_mask, sentence) #Mask the longer word with OPTION_ mask
    sentence = second_search.sub(second_mask, sentence) #then the shorter one

    # Swap the temporary placeholders for the requested mask strings.
    sentence = sentence.replace(maskedoption1, mask1)
    sentence = sentence.replace(maskedoption2, mask2)


    if toprint > 0:
      print(f"[{p['option1']}], [{p['option2']}], [{p['sentence']}]")

    sentences.append(sentence)
    
    if toprint > 0:
      print(sentences[-1])
      toprint -= 1

  build = {'sentence':sentences, 'labels':dataset['labels'], 'option1':dataset['option1'], 'option2':dataset['option2']}
  return Dataset.from_dict(build) # DON'T SHUFFLE





def stupid_datasets(dataset, tokenizer, masks):
  mask1, mask2 = masks
  return {"train":stupid_masking(dataset['train'], mask1, mask2), "test":stupid_masking(dataset['test'], mask1, mask2), "name":f"({mask1}, {mask2}) Masked Dataset"}
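
The substring problem mentioned in the comments above ("table" inside "tablecloth") can also be handled in a single pass. The sketch below is an alternative to the two-step placeholder approach, not what the notebook runs: it builds one regular expression with the longer option first and picks the mask in a callback. The function name `apply_masks` and the example sentence are made up for illustration.

```python
import re

def apply_masks(sentence, option1, option2, mask1, mask2):
    """Replace option1 with mask1 and option2 with mask2 in one regex pass."""
    # Longer option first, so the alternation prefers "tablecloth" over "table".
    options = sorted(
        [(option1, mask1), (option2, mask2)],
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    lookup = {option.lower(): mask for option, mask in options}
    pattern = re.compile(
        "|".join(re.escape(option) for option, _ in options),
        re.IGNORECASE,
    )
    # The callback looks up which mask belongs to the matched option.
    return pattern.sub(lambda match: lookup[match.group(0).lower()], sentence)

example = "He put the tablecloth on the table because the Table was dirty."
masked = apply_masks(example, "table", "tablecloth", "A", "B")
print(masked)
```

Because the text is scanned only once, a mask string can never be mistaken for an option on a later replacement step, and `re.escape` keeps options containing punctuation from being read as regex syntax.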

In [None]:


for masks in stupid_tries[3:]:
  stupid_ds = stupid_datasets(std_datasets, tokenizer, masks)
  test_datasets(tokenizer, std_datasets, stupid_ds, model="debiased_model")
  train_model(stupid_ds, std_datasets, tokenizer, model="debiased_model", copy = True)


[Robert], [Ian], [Robert lent the book to Ian for the reason that Robert already read the book.]
A lent the book to B for the reason that A already read the book.
[sweater], [jacket], [The woman needed to decide between the jacket and the sweater and ultimately chose the sweater because she disliked the color of the yarn.]
The woman needed to decide between the B and the A and ultimately chose the A because she disliked the color of the yarn.
[Dennis], [Donald], [Dennis won a swimming race to Donald because Donald had shorter and weaker arms and legs.]
A won a swimming race to B because B had shorter and weaker arms and legs.
[phone], [restaurant], [I had to communicate the news over the phone or at the restaurant tonight. I told them at the phone even though I wanted them to enjoy their evening.]
I had to communicate the news over the A or at the B tonight. I told them at the A even though I wanted them to enjoy their evening.
[Jessica], [Victoria], [Jessica often experiences severe n

  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 80/80 [00:18<00:00,  4.32it/s]

Score 0.7375690607734806 / 1.00
Performing (A, B) Masked Dataset





  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 80/80 [00:18<00:00,  4.31it/s]


Score 0.7194159431728493 / 1.00
Deleted model
copy
finished?


  0%|          | 0/19 [00:00<?, ?ba/s]

Epoch 0


  0%|          | 1/578 [00:00<06:46,  1.42it/s]

 loss: 0.20177307724952698


  2%|▏         | 11/578 [00:07<06:41,  1.41it/s]

 loss: 0.06029757857322693


  4%|▎         | 21/578 [00:14<06:33,  1.41it/s]

 loss: 0.3357870578765869


  5%|▌         | 31/578 [00:21<06:26,  1.41it/s]

 loss: 0.217998206615448


  7%|▋         | 41/578 [00:28<06:19,  1.42it/s]

 loss: 0.06710170209407806


  9%|▉         | 51/578 [00:36<06:12,  1.41it/s]

 loss: 0.3497629463672638


 11%|█         | 61/578 [00:43<06:05,  1.41it/s]

 loss: 0.23354032635688782


 12%|█▏        | 71/578 [00:50<05:58,  1.41it/s]

 loss: 0.24676306545734406


 14%|█▍        | 81/578 [00:57<05:51,  1.41it/s]

 loss: 0.035707905888557434


 16%|█▌        | 91/578 [01:04<05:44,  1.41it/s]

 loss: 0.3410860300064087


 17%|█▋        | 101/578 [01:11<05:37,  1.41it/s]

 loss: 0.04445911571383476


 19%|█▉        | 111/578 [01:18<05:30,  1.41it/s]

 loss: 0.14096419513225555


 21%|██        | 121/578 [01:25<05:23,  1.41it/s]

 loss: 0.17945091426372528


 23%|██▎       | 131/578 [01:32<05:16,  1.41it/s]

 loss: 0.08919466286897659


 24%|██▍       | 141/578 [01:39<05:09,  1.41it/s]

 loss: 0.16019189357757568


 26%|██▌       | 151/578 [01:46<05:02,  1.41it/s]

 loss: 0.05117867514491081


 28%|██▊       | 161/578 [01:53<04:55,  1.41it/s]

 loss: 0.15521982312202454


 30%|██▉       | 171/578 [02:00<04:47,  1.41it/s]

 loss: 0.10067044943571091


 31%|███▏      | 181/578 [02:08<04:40,  1.41it/s]

 loss: 0.1326969414949417


 33%|███▎      | 191/578 [02:15<04:33,  1.41it/s]

 loss: 0.10816100239753723


 35%|███▍      | 201/578 [02:22<04:26,  1.41it/s]

 loss: 0.11588425189256668


 37%|███▋      | 211/578 [02:29<04:19,  1.41it/s]

 loss: 0.2975476384162903


 38%|███▊      | 221/578 [02:36<04:12,  1.41it/s]

 loss: 0.10748960077762604


 40%|███▉      | 231/578 [02:43<04:05,  1.41it/s]

 loss: 0.08939293771982193


 42%|████▏     | 241/578 [02:50<03:58,  1.41it/s]

 loss: 0.14057695865631104


 43%|████▎     | 251/578 [02:57<03:51,  1.41it/s]

 loss: 0.01794896461069584


 45%|████▌     | 261/578 [03:04<03:44,  1.41it/s]

 loss: 0.07193461060523987


 47%|████▋     | 271/578 [03:11<03:37,  1.41it/s]

 loss: 0.10655082017183304


 49%|████▊     | 281/578 [03:18<03:30,  1.41it/s]

 loss: 0.07115829735994339


 50%|█████     | 291/578 [03:25<03:23,  1.41it/s]

 loss: 0.15640373528003693


 52%|█████▏    | 301/578 [03:32<03:15,  1.41it/s]

 loss: 0.21091602742671967


 54%|█████▍    | 311/578 [03:40<03:09,  1.41it/s]

 loss: 0.16225148737430573


 56%|█████▌    | 321/578 [03:47<03:02,  1.41it/s]

 loss: 0.0822160542011261


 57%|█████▋    | 331/578 [03:54<02:54,  1.41it/s]

 loss: 0.12167328596115112


 59%|█████▉    | 341/578 [04:01<02:47,  1.41it/s]

 loss: 0.09321099519729614


 61%|██████    | 351/578 [04:08<02:42,  1.39it/s]

 loss: 0.0774698480963707


 62%|██████▏   | 361/578 [04:15<02:33,  1.41it/s]

 loss: 0.154220849275589


 64%|██████▍   | 371/578 [04:22<02:26,  1.41it/s]

 loss: 0.11230767518281937


 66%|██████▌   | 381/578 [04:29<02:19,  1.41it/s]

 loss: 0.07874473184347153


 68%|██████▊   | 391/578 [04:36<02:12,  1.41it/s]

 loss: 0.03789893537759781


 69%|██████▉   | 401/578 [04:43<02:05,  1.41it/s]

 loss: 0.03272530063986778


 71%|███████   | 411/578 [04:50<01:58,  1.41it/s]

 loss: 0.15136376023292542


 73%|███████▎  | 421/578 [04:58<01:51,  1.41it/s]

 loss: 0.09762292355298996


 75%|███████▍  | 431/578 [05:05<01:44,  1.41it/s]

 loss: 0.17440907657146454


 76%|███████▋  | 441/578 [05:12<01:36,  1.41it/s]

 loss: 0.12608219683170319


 78%|███████▊  | 451/578 [05:19<01:29,  1.41it/s]

 loss: 0.064091756939888


 80%|███████▉  | 461/578 [05:26<01:22,  1.41it/s]

 loss: 0.13963431119918823


 81%|████████▏ | 471/578 [05:33<01:15,  1.41it/s]

 loss: 0.052683889865875244


 83%|████████▎ | 481/578 [05:40<01:08,  1.41it/s]

 loss: 0.25518566370010376


 85%|████████▍ | 491/578 [05:47<01:01,  1.41it/s]

 loss: 0.0702907145023346


 87%|████████▋ | 501/578 [05:54<00:54,  1.41it/s]

 loss: 0.045923978090286255


 88%|████████▊ | 511/578 [06:01<00:47,  1.41it/s]

 loss: 0.030699584633111954


 90%|█████████ | 521/578 [06:08<00:40,  1.41it/s]

 loss: 0.14403744041919708


 92%|█████████▏| 531/578 [06:15<00:33,  1.41it/s]

 loss: 0.10503995418548584


 94%|█████████▎| 541/578 [06:22<00:26,  1.41it/s]

 loss: 0.03740633651614189


 95%|█████████▌| 551/578 [06:30<00:19,  1.40it/s]

 loss: 0.046660225838422775


 97%|█████████▋| 561/578 [06:37<00:12,  1.41it/s]

 loss: 0.26672232151031494


 99%|█████████▉| 571/578 [06:44<00:04,  1.41it/s]

 loss: 0.15308795869350433


100%|██████████| 578/578 [06:49<00:00,  1.41it/s]

Score 0.9442582179930796 / 1.00
Testing 1 2 3 ...
Performing (A, B) Masked Dataset





  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 80/80 [00:18<00:00,  4.30it/s]

Score 0.7324388318863457 / 1.00
Performing Standard Dataset





  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 80/80 [00:18<00:00,  4.31it/s]


Score 0.7494080505130228 / 1.00
[Robert], [Ian], [Robert lent the book to Ian for the reason that Robert already read the book.]
X lent the book to Y for the reason that X already read the book.
[sweater], [jacket], [The woman needed to decide between the jacket and the sweater and ultimately chose the sweater because she disliked the color of the yarn.]
The woman needed to decide between the Y and the X and ultimately chose the X because she disliked the color of the yarn.
[Dennis], [Donald], [Dennis won a swimming race to Donald because Donald had shorter and weaker arms and legs.]
X won a swimming race to Y because Y had shorter and weaker arms and legs.
[phone], [restaurant], [I had to communicate the news over the phone or at the restaurant tonight. I told them at the phone even though I wanted them to enjoy their evening.]
I had to communicate the news over the X or at the Y tonight. I told them at the X even though I wanted them to enjoy their evening.
[Jessica], [Victoria], [Je

  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 80/80 [00:18<00:00,  4.30it/s]

Score 0.7458563535911602 / 1.00
Performing (X, Y) Masked Dataset





  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 80/80 [00:18<00:00,  4.31it/s]


Score 0.7217837411207577 / 1.00
Deleted model
copy
finished?


  0%|          | 0/19 [00:00<?, ?ba/s]

Epoch 0


  0%|          | 1/578 [00:00<06:46,  1.42it/s]

 loss: 0.1254408061504364


  2%|▏         | 11/578 [00:07<06:41,  1.41it/s]

 loss: 0.10150214284658432


  4%|▎         | 21/578 [00:14<06:34,  1.41it/s]

 loss: 0.3952895700931549


  5%|▌         | 31/578 [00:21<06:27,  1.41it/s]

 loss: 0.1750134974718094


  7%|▋         | 41/578 [00:29<06:20,  1.41it/s]

 loss: 0.09750885516405106


  9%|▉         | 51/578 [00:36<06:13,  1.41it/s]

 loss: 0.48830050230026245


 11%|█         | 61/578 [00:43<06:06,  1.41it/s]

 loss: 0.25445330142974854


 12%|█▏        | 71/578 [00:50<05:59,  1.41it/s]

 loss: 0.13214655220508575


 14%|█▍        | 81/578 [00:57<05:51,  1.41it/s]

 loss: 0.025815449655056


 16%|█▌        | 91/578 [01:04<05:45,  1.41it/s]

 loss: 0.22708822786808014


 17%|█▋        | 101/578 [01:11<05:37,  1.41it/s]

 loss: 0.1918398141860962


 19%|█▉        | 111/578 [01:18<05:30,  1.41it/s]

 loss: 0.22300733625888824


 21%|██        | 121/578 [01:25<05:24,  1.41it/s]

 loss: 0.10376501083374023


 23%|██▎       | 131/578 [01:32<05:16,  1.41it/s]

 loss: 0.08194859325885773


 24%|██▍       | 141/578 [01:39<05:09,  1.41it/s]

 loss: 0.07449281215667725


 26%|██▌       | 151/578 [01:46<05:02,  1.41it/s]

 loss: 0.06633337587118149


 28%|██▊       | 161/578 [01:54<04:55,  1.41it/s]

 loss: 0.18086354434490204


 30%|██▉       | 171/578 [02:01<04:48,  1.41it/s]

 loss: 0.08359923213720322


 31%|███▏      | 181/578 [02:08<04:41,  1.41it/s]

 loss: 0.15361204743385315


 33%|███▎      | 191/578 [02:15<04:34,  1.41it/s]

 loss: 0.1632547378540039


 35%|███▍      | 201/578 [02:22<04:26,  1.41it/s]

 loss: 0.1035037413239479


 37%|███▋      | 211/578 [02:29<04:19,  1.41it/s]

 loss: 0.4071400761604309


 38%|███▊      | 221/578 [02:36<04:13,  1.41it/s]

 loss: 0.23665820062160492


 40%|███▉      | 231/578 [02:43<04:06,  1.41it/s]

 loss: 0.08902093768119812


 42%|████▏     | 241/578 [02:50<03:58,  1.41it/s]

 loss: 0.0677960142493248


 43%|████▎     | 251/578 [02:57<03:51,  1.41it/s]

 loss: 0.033918362110853195


 45%|████▌     | 261/578 [03:04<03:44,  1.41it/s]

 loss: 0.1334676742553711


 47%|████▋     | 271/578 [03:11<03:37,  1.41it/s]

 loss: 0.08598878234624863


 49%|████▊     | 281/578 [03:19<03:30,  1.41it/s]

 loss: 0.19540537893772125


 50%|█████     | 291/578 [03:26<03:23,  1.41it/s]

 loss: 0.2749294936656952


 52%|█████▏    | 301/578 [03:33<03:16,  1.41it/s]

 loss: 0.15613090991973877


 54%|█████▍    | 311/578 [03:40<03:09,  1.41it/s]

 loss: 0.09015282988548279


 56%|█████▌    | 321/578 [03:47<03:02,  1.41it/s]

 loss: 0.09878534078598022


 57%|█████▋    | 331/578 [03:54<02:54,  1.41it/s]

 loss: 0.3204070031642914


 59%|█████▉    | 341/578 [04:01<02:48,  1.41it/s]

 loss: 0.09173139929771423


 61%|██████    | 351/578 [04:08<02:40,  1.41it/s]

 loss: 0.03797321394085884


 62%|██████▏   | 361/578 [04:15<02:33,  1.41it/s]

 loss: 0.16325664520263672


 64%|██████▍   | 371/578 [04:22<02:26,  1.41it/s]

 loss: 0.17264021933078766


 66%|██████▌   | 381/578 [04:29<02:19,  1.41it/s]

 loss: 0.08397656679153442


 68%|██████▊   | 391/578 [04:36<02:12,  1.41it/s]

 loss: 0.09026843309402466


 69%|██████▉   | 401/578 [04:44<02:05,  1.41it/s]

 loss: 0.10749303549528122


 71%|███████   | 411/578 [04:51<01:58,  1.41it/s]

 loss: 0.041617024689912796


 73%|███████▎  | 421/578 [04:58<01:51,  1.41it/s]

 loss: 0.18860965967178345


 75%|███████▍  | 431/578 [05:05<01:44,  1.41it/s]

 loss: 0.2250489890575409


 76%|███████▋  | 441/578 [05:12<01:36,  1.41it/s]

 loss: 0.16676867008209229


 78%|███████▊  | 451/578 [05:19<01:29,  1.41it/s]

 loss: 0.07928349077701569


 80%|███████▉  | 461/578 [05:26<01:22,  1.41it/s]

 loss: 0.17867732048034668


 81%|████████▏ | 471/578 [05:33<01:15,  1.41it/s]

 loss: 0.06821432709693909


 83%|████████▎ | 481/578 [05:40<01:08,  1.41it/s]

 loss: 0.4143008887767792


 85%|████████▍ | 491/578 [05:47<01:01,  1.41it/s]

 loss: 0.11649227887392044


 87%|████████▋ | 501/578 [05:54<00:54,  1.41it/s]

 loss: 0.055190153419971466


 88%|████████▊ | 511/578 [06:01<00:47,  1.41it/s]

 loss: 0.22334344685077667


 90%|█████████ | 521/578 [06:08<00:40,  1.41it/s]

 loss: 0.050914082676172256


 92%|█████████▏| 531/578 [06:16<00:33,  1.41it/s]

 loss: 0.10586341470479965


 94%|█████████▎| 541/578 [06:23<00:26,  1.41it/s]

 loss: 0.12618392705917358


 95%|█████████▌| 551/578 [06:30<00:19,  1.41it/s]

 loss: 0.038934092968702316


 97%|█████████▋| 561/578 [06:37<00:12,  1.41it/s]

 loss: 0.23652388155460358


 99%|█████████▉| 571/578 [06:44<00:04,  1.41it/s]

 loss: 0.0887916311621666


100%|██████████| 578/578 [06:49<00:00,  1.41it/s]

Score 0.9459883217993079 / 1.00
Testing 1 2 3 ...
Performing (X, Y) Masked Dataset





  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 80/80 [00:18<00:00,  4.29it/s]

Score 0.7363851617995264 / 1.00
Performing Standard Dataset





  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 80/80 [00:18<00:00,  4.30it/s]


Score 0.7600631412786109 / 1.00
[Robert], [Ian], [Robert lent the book to Ian for the reason that Robert already read the book.]
1 lent the book to 2 for the reason that 1 already read the book.
[sweater], [jacket], [The woman needed to decide between the jacket and the sweater and ultimately chose the sweater because she disliked the color of the yarn.]
The woman needed to decide between the 2 and the 1 and ultimately chose the 1 because she disliked the color of the yarn.
[Dennis], [Donald], [Dennis won a swimming race to Donald because Donald had shorter and weaker arms and legs.]
1 won a swimming race to 2 because 2 had shorter and weaker arms and legs.
[phone], [restaurant], [I had to communicate the news over the phone or at the restaurant tonight. I told them at the phone even though I wanted them to enjoy their evening.]
I had to communicate the news over the 1 or at the 2 tonight. I told them at the 1 even though I wanted them to enjoy their evening.
[Jessica], [Victoria], [Je

  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 80/80 [00:18<00:00,  4.30it/s]

Score 0.7486187845303868 / 1.00
Performing (1, 2) Masked Dataset





  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 80/80 [00:18<00:00,  4.29it/s]


Score 0.7087608524072613 / 1.00
Deleted model
copy
finished?


  0%|          | 0/19 [00:00<?, ?ba/s]

Epoch 0


  0%|          | 1/578 [00:00<06:44,  1.43it/s]

 loss: 0.2769586443901062


  2%|▏         | 11/578 [00:07<06:42,  1.41it/s]

 loss: 0.16365773975849152


  4%|▎         | 21/578 [00:14<06:35,  1.41it/s]

 loss: 0.31001847982406616


  5%|▌         | 31/578 [00:21<06:27,  1.41it/s]

 loss: 0.17070962488651276


  7%|▋         | 41/578 [00:29<06:21,  1.41it/s]

 loss: 0.08926239609718323


  9%|▉         | 51/578 [00:36<06:13,  1.41it/s]

 loss: 0.2756010890007019


 11%|█         | 61/578 [00:43<06:06,  1.41it/s]

 loss: 0.21747969090938568


 12%|█▏        | 71/578 [00:50<05:59,  1.41it/s]

 loss: 0.1492706686258316


 14%|█▍        | 81/578 [00:57<05:51,  1.41it/s]

 loss: 0.04088940471410751


 16%|█▌        | 91/578 [01:04<05:45,  1.41it/s]

 loss: 0.2449897825717926


 17%|█▋        | 101/578 [01:11<05:37,  1.41it/s]

 loss: 0.1208687573671341


 19%|█▉        | 111/578 [01:18<05:30,  1.41it/s]

 loss: 0.13461905717849731


 21%|██        | 121/578 [01:25<05:23,  1.41it/s]

 loss: 0.2238522171974182


 23%|██▎       | 131/578 [01:32<05:16,  1.41it/s]

 loss: 0.08363597095012665


 24%|██▍       | 141/578 [01:39<05:09,  1.41it/s]

 loss: 0.10408834367990494


 26%|██▌       | 151/578 [01:46<05:02,  1.41it/s]

 loss: 0.1272539496421814


 28%|██▊       | 161/578 [01:54<04:56,  1.41it/s]

 loss: 0.020669281482696533


 30%|██▉       | 171/578 [02:01<04:49,  1.41it/s]

 loss: 0.18032310903072357


 31%|███▏      | 181/578 [02:08<04:41,  1.41it/s]

 loss: 0.0859593003988266


 33%|███▎      | 191/578 [02:15<04:34,  1.41it/s]

 loss: 0.1301630586385727


 35%|███▍      | 201/578 [02:22<04:27,  1.41it/s]

 loss: 0.3061490058898926


 37%|███▋      | 211/578 [02:29<04:20,  1.41it/s]

 loss: 0.46358078718185425


 38%|███▊      | 221/578 [02:36<04:13,  1.41it/s]

 loss: 0.10860510170459747


 40%|███▉      | 231/578 [02:43<04:06,  1.41it/s]

 loss: 0.02730378694832325


 42%|████▏     | 241/578 [02:50<03:59,  1.41it/s]

 loss: 0.17432965338230133


 43%|████▎     | 251/578 [02:57<03:52,  1.41it/s]

 loss: 0.045676082372665405


 45%|████▌     | 261/578 [03:04<03:44,  1.41it/s]

 loss: 0.09568622708320618


 47%|████▋     | 271/578 [03:12<03:37,  1.41it/s]

 loss: 0.07749681919813156


 49%|████▊     | 281/578 [03:19<03:30,  1.41it/s]

 loss: 0.1636265218257904


 50%|█████     | 291/578 [03:26<03:23,  1.41it/s]

 loss: 0.05404104292392731


 52%|█████▏    | 301/578 [03:33<03:16,  1.41it/s]

 loss: 0.2037254124879837


 54%|█████▍    | 311/578 [03:40<03:09,  1.41it/s]

 loss: 0.11452631652355194


 56%|█████▌    | 321/578 [03:47<03:02,  1.41it/s]

 loss: 0.07492773979902267


 57%|█████▋    | 331/578 [03:54<02:55,  1.41it/s]

 loss: 0.31932422518730164


 59%|█████▉    | 341/578 [04:01<02:48,  1.41it/s]

 loss: 0.13711398839950562


 61%|██████    | 351/578 [04:08<02:40,  1.41it/s]

 loss: 0.06749751418828964


 62%|██████▏   | 361/578 [04:15<02:33,  1.41it/s]

 loss: 0.2599323093891144


 64%|██████▍   | 371/578 [04:22<02:26,  1.41it/s]

 loss: 0.1476629078388214


 66%|██████▌   | 381/578 [04:29<02:19,  1.41it/s]

 loss: 0.0917249470949173


 68%|██████▊   | 391/578 [04:37<02:12,  1.41it/s]

 loss: 0.053204379975795746


 69%|██████▉   | 401/578 [04:44<02:05,  1.41it/s]

 loss: 0.16339044272899628


 71%|███████   | 411/578 [04:51<01:58,  1.41it/s]

 loss: 0.17781394720077515


 73%|███████▎  | 421/578 [04:58<01:51,  1.41it/s]

 loss: 0.11308039724826813


 75%|███████▍  | 431/578 [05:05<01:44,  1.41it/s]

 loss: 0.19233138859272003


 76%|███████▋  | 441/578 [05:12<01:37,  1.41it/s]

 loss: 0.15756067633628845


 78%|███████▊  | 451/578 [05:19<01:29,  1.41it/s]

 loss: 0.05884823575615883


 80%|███████▉  | 461/578 [05:26<01:22,  1.41it/s]

 loss: 0.13774830102920532


 81%|████████▏ | 471/578 [05:33<01:15,  1.41it/s]

 loss: 0.08152131736278534


 83%|████████▎ | 481/578 [05:40<01:08,  1.41it/s]

 loss: 0.22510379552841187


 85%|████████▍ | 491/578 [05:47<01:01,  1.41it/s]

 loss: 0.04936157912015915


 87%|████████▋ | 501/578 [05:55<00:54,  1.41it/s]

 loss: 0.01687091588973999


 88%|████████▊ | 511/578 [06:02<00:47,  1.41it/s]

 loss: 0.059844374656677246


 90%|█████████ | 521/578 [06:09<00:40,  1.41it/s]

 loss: 0.05685418099164963


 92%|█████████▏| 531/578 [06:16<00:33,  1.41it/s]

 loss: 0.17887310683727264


 94%|█████████▎| 541/578 [06:23<00:26,  1.41it/s]

 loss: 0.10020551085472107


 95%|█████████▌| 551/578 [06:30<00:19,  1.41it/s]

 loss: 0.1033797413110733


 97%|█████████▋| 561/578 [06:37<00:12,  1.41it/s]

 loss: 0.4621124565601349


 99%|█████████▉| 571/578 [06:44<00:04,  1.41it/s]

 loss: 0.09090244770050049


100%|██████████| 578/578 [06:49<00:00,  1.41it/s]

Score 0.9429065743944637 / 1.00
Testing 1 2 3 ...
Performing (1, 2) Masked Dataset





  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 80/80 [00:18<00:00,  4.26it/s]

Score 0.7430939226519337 / 1.00
Performing Standard Dataset





  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 80/80 [00:18<00:00,  4.26it/s]


Score 0.7655880031570639 / 1.00
[Robert], [Ian], [Robert lent the book to Ian for the reason that Robert already read the book.]
first lent the book to second for the reason that first already read the book.
[sweater], [jacket], [The woman needed to decide between the jacket and the sweater and ultimately chose the sweater because she disliked the color of the yarn.]
The woman needed to decide between the second and the first and ultimately chose the first because she disliked the color of the yarn.
[Dennis], [Donald], [Dennis won a swimming race to Donald because Donald had shorter and weaker arms and legs.]
first won a swimming race to second because second had shorter and weaker arms and legs.
[phone], [restaurant], [I had to communicate the news over the phone or at the restaurant tonight. I told them at the phone even though I wanted them to enjoy their evening.]
I had to communicate the news over the first or at the second tonight. I told them at the first even though I wanted th

  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 80/80 [00:18<00:00,  4.26it/s]

Score 0.7513812154696132 / 1.00
Performing (first, second) Masked Dataset





  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 80/80 [00:18<00:00,  4.25it/s]


Score 0.6882399368587214 / 1.00
Deleted model
copy
finished?


  0%|          | 0/19 [00:00<?, ?ba/s]

Epoch 0


  0%|          | 1/578 [00:00<06:46,  1.42it/s]

 loss: 0.24809610843658447


  2%|▏         | 11/578 [00:07<06:42,  1.41it/s]

 loss: 0.1638454794883728


  4%|▎         | 21/578 [00:14<06:35,  1.41it/s]

 loss: 0.37466657161712646


  5%|▌         | 31/578 [00:22<06:28,  1.41it/s]

 loss: 0.2888939678668976


  7%|▋         | 41/578 [00:29<06:21,  1.41it/s]

 loss: 0.056043241173028946


  9%|▉         | 51/578 [00:36<06:14,  1.41it/s]

 loss: 0.32509028911590576


 11%|█         | 61/578 [00:43<06:07,  1.41it/s]

 loss: 0.08668509125709534


 12%|█▏        | 71/578 [00:50<06:00,  1.40it/s]

 loss: 0.08418454974889755


 14%|█▍        | 81/578 [00:57<05:53,  1.41it/s]

 loss: 0.020438769832253456


 16%|█▌        | 91/578 [01:04<05:45,  1.41it/s]

 loss: 0.22133132815361023


 17%|█▋        | 101/578 [01:11<05:39,  1.41it/s]

 loss: 0.26572152972221375


 19%|█▉        | 111/578 [01:18<05:31,  1.41it/s]

 loss: 0.1758394092321396


 21%|██        | 121/578 [01:25<05:24,  1.41it/s]

 loss: 0.15913666784763336


 23%|██▎       | 131/578 [01:33<05:17,  1.41it/s]

 loss: 0.12762074172496796


 24%|██▍       | 141/578 [01:40<05:10,  1.41it/s]

 loss: 0.125023752450943


 26%|██▌       | 151/578 [01:47<05:03,  1.41it/s]

 loss: 0.08530529588460922


 28%|██▊       | 161/578 [01:54<04:56,  1.41it/s]

 loss: 0.06261514872312546


 30%|██▉       | 171/578 [02:01<04:48,  1.41it/s]

 loss: 0.05630635470151901


 31%|███▏      | 181/578 [02:08<04:41,  1.41it/s]

 loss: 0.23977142572402954


 33%|███▎      | 191/578 [02:15<04:34,  1.41it/s]

 loss: 0.13335829973220825


 35%|███▍      | 201/578 [02:22<04:27,  1.41it/s]

 loss: 0.19563771784305573


 37%|███▋      | 211/578 [02:29<04:20,  1.41it/s]

 loss: 0.3218815326690674


 38%|███▊      | 221/578 [02:36<04:13,  1.41it/s]

 loss: 0.18775790929794312


 40%|███▉      | 231/578 [02:44<04:06,  1.41it/s]

 loss: 0.05444183573126793


 42%|████▏     | 241/578 [02:51<03:59,  1.41it/s]

 loss: 0.18456047773361206


 43%|████▎     | 251/578 [02:58<03:52,  1.41it/s]

 loss: 0.03938994184136391


 45%|████▌     | 261/578 [03:05<03:45,  1.41it/s]

 loss: 0.09704793989658356


 47%|████▋     | 271/578 [03:12<03:38,  1.41it/s]

 loss: 0.18248336017131805


 49%|████▊     | 281/578 [03:19<03:31,  1.41it/s]

 loss: 0.33834701776504517


 50%|█████     | 291/578 [03:26<03:23,  1.41it/s]

 loss: 0.09243817627429962


 52%|█████▏    | 301/578 [03:33<03:16,  1.41it/s]

 loss: 0.16065046191215515


 54%|█████▍    | 311/578 [03:40<03:09,  1.41it/s]

 loss: 0.1752268224954605


 56%|█████▌    | 321/578 [03:47<03:02,  1.41it/s]

 loss: 0.06553418934345245


 57%|█████▋    | 331/578 [03:55<02:55,  1.41it/s]

 loss: 0.08334410190582275


 59%|█████▉    | 341/578 [04:02<02:48,  1.41it/s]

 loss: 0.17171600461006165


 61%|██████    | 351/578 [04:09<02:41,  1.41it/s]

 loss: 0.03274071589112282


 62%|██████▏   | 361/578 [04:16<02:34,  1.41it/s]

 loss: 0.09678369760513306


 64%|██████▍   | 371/578 [04:23<02:27,  1.41it/s]

 loss: 0.17614403367042542


 66%|██████▌   | 381/578 [04:30<02:19,  1.41it/s]

 loss: 0.09227553009986877


 68%|██████▊   | 391/578 [04:37<02:12,  1.41it/s]

 loss: 0.08839896321296692


 69%|██████▉   | 401/578 [04:44<02:05,  1.41it/s]

 loss: 0.1714429408311844


 71%|███████   | 411/578 [04:51<01:58,  1.41it/s]

 loss: 0.08469679206609726


 73%|███████▎  | 421/578 [04:59<01:51,  1.41it/s]

 loss: 0.11724445223808289


 75%|███████▍  | 431/578 [05:06<01:44,  1.41it/s]

 loss: 0.1922956109046936


 76%|███████▋  | 441/578 [05:13<01:37,  1.41it/s]

 loss: 0.39098432660102844


 78%|███████▊  | 451/578 [05:20<01:30,  1.41it/s]

 loss: 0.08754751831293106


 80%|███████▉  | 461/578 [05:27<01:23,  1.41it/s]

 loss: 0.1568981260061264


 81%|████████▏ | 471/578 [05:34<01:16,  1.41it/s]

 loss: 0.07945509254932404


 83%|████████▎ | 481/578 [05:41<01:08,  1.41it/s]

 loss: 0.32032066583633423


 85%|████████▍ | 491/578 [05:48<01:01,  1.41it/s]

 loss: 0.1047159731388092


 87%|████████▋ | 501/578 [05:55<00:54,  1.41it/s]

 loss: 0.054771777242422104


 88%|████████▊ | 511/578 [06:02<00:47,  1.41it/s]

 loss: 0.051744963973760605


 90%|█████████ | 521/578 [06:10<00:40,  1.41it/s]

 loss: 0.09641802310943604


 92%|█████████▏| 531/578 [06:17<00:33,  1.41it/s]

 loss: 0.15930140018463135


 94%|█████████▎| 541/578 [06:24<00:26,  1.41it/s]

 loss: 0.13206034898757935


 95%|█████████▌| 551/578 [06:31<00:19,  1.41it/s]

 loss: 0.01949036493897438


 97%|█████████▋| 561/578 [06:38<00:12,  1.41it/s]

 loss: 0.33317574858665466


 99%|█████████▉| 571/578 [06:45<00:04,  1.41it/s]

 loss: 0.12315405160188675


100%|██████████| 578/578 [06:50<00:00,  1.41it/s]

Score 0.9409602076124568 / 1.00
Testing 1 2 3 ...
Performing (first, second) Masked Dataset





  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 80/80 [00:18<00:00,  4.25it/s]

Score 0.7320441988950276 / 1.00
Performing Standard Dataset





  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 80/80 [00:18<00:00,  4.27it/s]


Score 0.7691397000789266 / 1.00
[Robert], [Ian], [Robert lent the book to Ian for the reason that Robert already read the book.]
alpha lent the book to beta for the reason that alpha already read the book.
[sweater], [jacket], [The woman needed to decide between the jacket and the sweater and ultimately chose the sweater because she disliked the color of the yarn.]
The woman needed to decide between the beta and the alpha and ultimately chose the alpha because she disliked the color of the yarn.
[Dennis], [Donald], [Dennis won a swimming race to Donald because Donald had shorter and weaker arms and legs.]
alpha won a swimming race to beta because beta had shorter and weaker arms and legs.
[phone], [restaurant], [I had to communicate the news over the phone or at the restaurant tonight. I told them at the phone even though I wanted them to enjoy their evening.]
I had to communicate the news over the alpha or at the beta tonight. I told them at the alpha even though I wanted them to enjo

  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 80/80 [00:18<00:00,  4.26it/s]

Score 0.755327545382794 / 1.00
Performing (alpha, beta) Masked Dataset





  0%|          | 0/3 [00:00<?, ?ba/s]

100%|██████████| 80/80 [00:18<00:00,  4.25it/s]


Score 0.7123125493291239 / 1.00
Deleted model
copy
finished?


  0%|          | 0/19 [00:00<?, ?ba/s]

Epoch 0


  0%|          | 1/578 [00:00<06:50,  1.40it/s]

 loss: 0.28132858872413635


  2%|▏         | 11/578 [00:07<06:42,  1.41it/s]

 loss: 0.09700728207826614


  4%|▎         | 21/578 [00:14<06:35,  1.41it/s]

 loss: 0.33251139521598816


  5%|▌         | 31/578 [00:22<06:28,  1.41it/s]

 loss: 0.17998211085796356


  7%|▋         | 41/578 [00:29<06:21,  1.41it/s]

 loss: 0.08641567081212997


  9%|▉         | 51/578 [00:36<06:14,  1.41it/s]

 loss: 0.3969529867172241


 11%|█         | 61/578 [00:43<06:07,  1.41it/s]

 loss: 0.403788298368454


 12%|█▏        | 71/578 [00:50<06:00,  1.41it/s]

 loss: 0.1341141164302826


 14%|█▍        | 81/578 [00:57<05:52,  1.41it/s]

 loss: 0.017577776685357094


 16%|█▌        | 91/578 [01:04<05:45,  1.41it/s]

 loss: 0.28189563751220703


 17%|█▋        | 101/578 [01:11<05:38,  1.41it/s]

 loss: 0.030814796686172485


 19%|█▉        | 111/578 [01:18<05:31,  1.41it/s]

 loss: 0.12938763201236725


 21%|██        | 121/578 [01:25<05:25,  1.40it/s]

 loss: 0.20520755648612976


 23%|██▎       | 131/578 [01:33<05:17,  1.41it/s]

 loss: 0.12438429892063141


 24%|██▍       | 141/578 [01:40<05:10,  1.41it/s]

 loss: 0.2050190269947052


 26%|██▌       | 151/578 [01:47<05:03,  1.41it/s]

 loss: 0.09619738906621933


 28%|██▊       | 161/578 [01:54<04:56,  1.41it/s]

 loss: 0.20255063474178314


 30%|██▉       | 171/578 [02:01<04:49,  1.41it/s]

 loss: 0.06224685534834862


 31%|███▏      | 181/578 [02:08<04:42,  1.41it/s]

 loss: 0.1306384801864624


 33%|███▎      | 191/578 [02:15<04:34,  1.41it/s]

 loss: 0.08846758306026459


 35%|███▍      | 201/578 [02:22<04:27,  1.41it/s]

 loss: 0.13563695549964905


 37%|███▋      | 211/578 [02:29<04:20,  1.41it/s]

 loss: 0.22712033987045288


 38%|███▊      | 221/578 [02:36<04:13,  1.41it/s]

 loss: 0.0902828574180603


 40%|███▉      | 231/578 [02:44<04:06,  1.41it/s]

 loss: 0.029887663200497627


 42%|████▏     | 241/578 [02:51<03:59,  1.41it/s]

 loss: 0.08502241969108582


 43%|████▎     | 251/578 [02:58<03:52,  1.41it/s]

 loss: 0.04349224641919136


 45%|████▌     | 261/578 [03:05<03:45,  1.40it/s]

 loss: 0.16291408240795135


 45%|████▌     | 262/578 [03:06<03:44,  1.41it/s]

Results:

| Dataset | Accuracy (before training on the masked set) | Accuracy (after training on the masked set) |
|:-------:|:------------------:|:----------------:|
|Standard set| 75% | 75% |
|(dog, doggy)| 65% | 73% | 
|(red, blue) | 71% | 73% |
|(darling, sweetheart)|  |  |
| (A, B) | 72% | 73% |
| (X, Y) | 72% | 74% |
| (1, 2) | 71% | 74% |
|(first, second)| 69% | 73% |
|(alpha, beta)| 71% |  |
|(#, @)|  |  |
|(primero, secundo)|  |  |
|(Alice, Bob)|  |  |
|(flavor, flavour)|  |  |
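
The masked sentences printed in the logs above (for example "X lent the book to Y for the reason that X already read the book.") can be reproduced with a small substitution step. The following is only a minimal sketch of that idea, not the notebook's actual implementation: the function name `mask_options`, its `masks` parameter, and the choice of case-insensitive whole-word matching are all assumptions made for illustration.

```python
import re


def mask_options(sentence, option1, option2, masks=("X", "Y")):
    """Replace whole-word occurrences of the two options with placeholder masks.

    Hypothetical helper: option1 becomes masks[0] and option2 becomes masks[1].
    Matching is case-insensitive and word-bounded (an assumption, not taken
    from the notebook).
    """
    lookup = {option1.lower(): masks[0], option2.lower(): masks[1]}
    # Try the longer option first so an option that contains the other
    # is not partially replaced.
    alternatives = sorted((option1, option2), key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(alt) for alt in alternatives) + r")\b",
        re.IGNORECASE,
    )
    # A single pass over the sentence, so an inserted mask is never re-matched.
    return pattern.sub(lambda match: lookup[match.group(0).lower()], sentence)


if __name__ == "__main__":
    example = (
        "Robert lent the book to Ian for the reason that "
        "Robert already read the book."
    )
    print(mask_options(example, "Robert", "Ian"))
    print(mask_options(example, "Robert", "Ian", masks=("alpha", "beta")))
```

Doing the substitution in one regex pass, rather than two consecutive `str.replace` calls, avoids a mask such as "first" being replaced again when it happens to equal the other option.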


