ASSIGNMENT 2 NLP - Question answering with transformers on CoQA

Authors:

*   Fabian Vincenzi fabian.vincenzi@studio.unibo.it
*   Davide Perozzi davide.perozzi@studio.unibo.it
*   Martina Ianaro martina.ianaro@studio.unibo.it

Link to github repo - https://github.com/martinaianaro99/Natural_Language_Processing/tree/main/Assignments/Assignment2



### Download data

In [None]:
import os
import urllib.request
from tqdm import tqdm

# typing
from typing import List, Callable, Dict

class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(url, output_path):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to)

def download_data(data_path, url_path, suffix):    
    if not os.path.exists(data_path):
        os.makedirs(data_path)
        
    data_path = os.path.join(data_path, f'{suffix}.json')

    if not os.path.exists(data_path):
        print(f"Downloading CoQA {suffix} data split... (it may take a while)")
        download_url(url=url_path, output_path=data_path)
        print("Download completed!")

In [None]:
# Train data
train_url = "https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json"
download_data(data_path='coqa', url_path=train_url, suffix='train')

# Test data
test_url = "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json"
download_data(data_path='coqa', url_path=test_url, suffix='test')  

Downloading CoQA train data split... (it may take a while)


coqa-train-v1.0.json: 49.0MB [00:18, 2.67MB/s]                            


Download completed!
Downloading CoQA test data split... (it may take a while)


coqa-dev-v1.0.json: 9.09MB [00:03, 2.57MB/s]                            

Download completed!





In [None]:
import json

data_path = "coqa"

train_path = os.path.join(data_path, 'train.json')
test_path = os.path.join(data_path, 'test.json')
train_dialogues = {}
test_dialogues = {}
with open(train_path) as f:
    train_dialogues = json.load(f)["data"]
    print("Train set loaded.")

with open(test_path) as f:
    test_dialogues = json.load(f)["data"]
    print("Test set loaded.")

Train set loaded.
Test set loaded.


In [None]:
print("Train dialogues: ", len(train_dialogues))
print("Test dialogues: ", len(test_dialogues))

Train dialogues:  7199
Test dialogues:  500


In [None]:
import pandas as pd

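def extract_data(dialogues):
    # one row per dialogue: the passage plus its lists of questions and answers
    # (the parameter is not called `json`, to avoid shadowing the json module)
    data = []
    for d in dialogues:
        row = {
            "passage" : d["story"],
            "question" : [q["input_text"] for q in d["questions"]],
            "answer" : [a["input_text"] for a in d["answers"]]
        }
        data.append(row)
    df = pd.DataFrame(data)
    return df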

In [None]:
# build dataframe with passage, question and answer features
train_df = extract_data(train_dialogues)
test_df = extract_data(test_dialogues)

# [Task 1] - Remove unanswerable QA pairs

In [None]:
def remove_unanswerable(df):
  deleted = 0

  for i in range(df.shape[0]): # update passages deleting unanswerable questions
    answers = df.iloc[i]["answer"]
    questions = df.iloc[i]["question"]
    
    to_delete = [index for index in range(len(answers)) if answers[index] == "unknown"]
    
    new_answers = [ answers[j] for j in range(len(answers)) if j not in to_delete ] 
    new_questions = [ questions[j] for j in range(len(questions)) if j not in to_delete ] 

    df.at[i,"answer"] = new_answers if len(new_answers)>0 else float('nan')
    df.at[i,"question"] = new_questions if len(new_answers)>0 else float('nan')

  df.dropna(subset=['answer'], inplace=True) # drop passages with only unanswerable questions

  return df

train_df = remove_unanswerable(train_df)
test_df = remove_unanswerable(test_df)
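
The filtering rule applied above can be illustrated on a single dialogue without pandas. The following is a minimal standalone sketch: the helper name `filter_unanswerable` and the toy questions and answers are hypothetical and are not part of the notebook.

```python
def filter_unanswerable(questions, answers, marker="unknown"):
    """Drop every (question, answer) pair whose answer equals `marker`."""
    kept = [(q, a) for q, a in zip(questions, answers) if a != marker]
    # A dialogue with only unanswerable questions yields two empty lists;
    # in the notebook such dialogues are dropped entirely.
    new_questions = [q for q, _ in kept]
    new_answers = [a for _, a in kept]
    return new_questions, new_answers


# Toy dialogue (made up for illustration)
questions = ["Who is Ann?", "Where does she live?", "How old is she?"]
answers = ["a teacher", "unknown", "30"]

kept_questions, kept_answers = filter_unanswerable(questions, answers)
print(kept_questions)
print(kept_answers)
```

Filtering questions and answers together keeps the two lists aligned, which is what the index-based deletion in `remove_unanswerable` achieves as well.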

# [Task 2] - Train, Validation and Test splits

### Split val set from train set at dialogue level

In [None]:
import random
import numpy as np
import tensorflow as tf

def set_reproducibility(seed):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'

set_reproducibility(42)

In [None]:
from sklearn.model_selection import train_test_split

# to be used in training loop
def split(train):

  train_df, val_df = train_test_split(train, test_size=0.2, shuffle=True)
  
  return train_df, val_df
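
Splitting at dialogue level means that whole dialogues are assigned to one side of the split before they are exploded into one row per question, so no passage ends up in both sets. The sketch below shows the idea with the standard library only; `split_dialogues`, `to_rows` and the toy dialogues are hypothetical names and data, and the shuffling does not reproduce the exact partition produced by `train_test_split`.

```python
import math
import random


def split_dialogues(dialogues, val_fraction=0.2, seed=42):
    """Shuffle whole dialogues and hold out a fraction of them for validation."""
    rng = random.Random(seed)
    shuffled = list(dialogues)
    rng.shuffle(shuffled)
    n_val = math.ceil(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def to_rows(dialogues):
    """Explode each dialogue into one (passage, question) row per turn."""
    return [(d["passage"], q) for d in dialogues for q in d["question"]]


# Ten toy dialogues with two questions each (made up for illustration)
dialogues = [
    {"passage": f"passage {i}", "question": [f"q{i}-1", f"q{i}-2"]}
    for i in range(10)
]

train, val = split_dialogues(dialogues)
train_rows, val_rows = to_rows(train), to_rows(val)
train_passages = {passage for passage, _ in train_rows}
val_passages = {passage for passage, _ in val_rows}
print(len(train_rows), len(val_rows), train_passages & val_passages)
```

Splitting the exploded rows instead would let questions about the same passage fall on both sides of the split.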



### Data pre-processing

In [None]:
def exploder( df, col ):
    return df.set_index(col).apply(pd.Series.explode).reset_index()

# build and add history column to dataframe
def add_history(df):
  history=[[]]
  for i in range(1,df.shape[0]):

    if df.iloc[i-1]["passage"] != df.iloc[i]["passage"]: # new passage group
      history.append([])
    else:
      latest = history[-1].copy() 
      question = df.iloc[i-1]["question"]
      answer = df.iloc[i-1]["answer"]
      if latest != [] :
        latest.extend([question, answer])
        history.append(latest)
      else: 
        history.append([question, answer])
  
  df["history"] = pd.Series(history)
  return df
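
The history attached to each turn is the flat list of all previous questions and answers of the same dialogue, so the first turn has an empty history. A minimal sketch of this construction for a single dialogue follows; `build_histories` and the toy turns are hypothetical and only mirror what `add_history` computes across the exploded dataframe.

```python
def build_histories(questions, answers):
    """Return, for each turn, the list [q_0, a_0, ..., q_{i-1}, a_{i-1}]."""
    histories = []
    running = []
    for question, answer in zip(questions, answers):
        histories.append(list(running))     # snapshot of the turns seen so far
        running.extend([question, answer])  # the current turn joins the history
    return histories


# Toy dialogue (made up for illustration)
questions = ["who is ann?", "where does she work?", "since when?"]
answers = ["a teacher", "in a school", "2010"]

histories = build_histories(questions, answers)
for question, history in zip(questions, histories):
    print(question, "<-", history)
```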

In [None]:
from functools import reduce
def lower(text: str) -> str:
    return text.lower()

# preprocessing operations to apply, in order
PREPROCESSING_PIPELINE = [lower]

# apply every operation of the pipeline to the text, in order
def text_prepare(text: str,
                 filter_methods: List[Callable[[str], str]] = None) -> str:

    filter_methods = filter_methods if filter_methods is not None else PREPROCESSING_PIPELINE
    return reduce(lambda txt, f: f(txt), filter_methods, text)

def text_preprocessing(df):
  for c in  df:
    df[c] = df[c].apply(lambda txt: text_prepare(txt))
  return df
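
The pipeline is a left fold over a list of string functions: each function receives the output of the previous one. The sketch below shows how a second step would compose with `lower`; `collapse_whitespace` and `apply_pipeline` are hypothetical helpers added only for illustration, since the notebook pipeline contains `lower` alone.

```python
from functools import reduce
from typing import Callable, List


def lower(text: str) -> str:
    return text.lower()


def collapse_whitespace(text: str) -> str:
    # Hypothetical extra step: trim the text and squeeze repeated spaces
    return " ".join(text.split())


def apply_pipeline(text: str, filters: List[Callable[[str], str]]) -> str:
    # reduce feeds the running text through each filter, left to right
    return reduce(lambda txt, f: f(txt), filters, text)


cleaned = apply_pipeline("  What  IS CoQA? ", [lower, collapse_whitespace])
print(repr(cleaned))
```

Because the steps run in list order, adding a new operation only requires appending it to the list.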


In [None]:
def preprocess(df):
  temp = exploder(df, ['passage'])
  df = text_preprocessing(temp)
  return df

# [Task 3] - Model definition


In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### [M1] DistilRoBERTa (distilroberta-base)

In [None]:
from transformers import AutoTokenizer, EncoderDecoderModel

# instantiate the model and the tokenizer
def get_m1():
  # tokenizer
  distilroberta_tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

  # model
  distilroberta_model = EncoderDecoderModel.from_encoder_decoder_pretrained("distilroberta-base", "distilroberta-base",
                                                                            tie_encoder_decoder=True)
  # config
  distilroberta_model.config.decoder_start_token_id = distilroberta_tokenizer.cls_token_id
  distilroberta_model.config.eos_token_id = distilroberta_tokenizer.sep_token_id
  distilroberta_model.config.pad_token_id = distilroberta_tokenizer.pad_token_id
  distilroberta_model.config.vocab_size = distilroberta_model.config.encoder.vocab_size
  distilroberta_model.config.max_length = 142
  distilroberta_model.config.min_length = 56
  distilroberta_model.config.no_repeat_ngram_size = 3
  distilroberta_model.config.early_stopping = True
  distilroberta_model.config.length_penalty = 2.0
  distilroberta_model.config.num_beams = 4

  return distilroberta_model, distilroberta_tokenizer

### [M2] BERTTiny (bert-tiny)

In [None]:
from transformers import AutoTokenizer, EncoderDecoderModel

# instantiate the model and the tokenizer
def get_m2():
  
  # tokenizer
  berttiny_tokenizer = AutoTokenizer.from_pretrained('prajjwal1/bert-tiny')

  # model
  berttiny_model = EncoderDecoderModel.from_encoder_decoder_pretrained('prajjwal1/bert-tiny', 'prajjwal1/bert-tiny',  
                                                                      tie_encoder_decoder=True).to("cuda")
  # config
  berttiny_model.config.decoder_start_token_id = berttiny_tokenizer.cls_token_id
  berttiny_model.config.eos_token_id = berttiny_tokenizer.sep_token_id
  berttiny_model.config.pad_token_id = berttiny_tokenizer.pad_token_id
  berttiny_model.config.vocab_size = berttiny_model.config.encoder.vocab_size
  berttiny_model.config.max_length = 142
  berttiny_model.config.min_length = 56
  berttiny_model.config.no_repeat_ngram_size = 3
  berttiny_model.config.early_stopping = True
  berttiny_model.config.length_penalty = 2.0
  berttiny_model.config.num_beams = 4
  
  return berttiny_model, berttiny_tokenizer

# [Task 4] Question generation with text passage $P$ and question $Q$

### [M1] DistilRoBERTa (distilroberta-base)

In [None]:
# generate an answer with M1 for the given context and question
def question_generation_m1(context, question):
  input_ids = distilroberta_tokenizer(context, question, return_tensors="pt").input_ids.to("cuda")

  generated_ids = distilroberta_model.generate(input_ids)
  generated_text = distilroberta_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
  
  return generated_text

### [M2] BERTTiny (bert-tiny)

In [None]:
# generate an answer with M2 for the given context and question
def question_generation_m2(context, question):
  input_ids = berttiny_tokenizer(context, question, return_tensors="pt").input_ids.to("cuda")

  generated_ids = berttiny_model.generate(input_ids)
  generated_text = berttiny_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
  
  return generated_text

# [Task 5] Question generation with text passage $P$, question $Q$ and dialogue history $H$

### [M1] DistilRoBERTa (distilroberta-base)

In [None]:
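# generate an answer with M1 for the given context, question and history
def question_generation_m1_h(context, question, history):
  # the tokenizer encodes at most a text and a text pair, so the history turns
  # are joined to the passage and question with the separator token, as in training
  sep = distilroberta_tokenizer.sep_token
  text = context + sep + question + sep + sep.join(history)
  input_ids = distilroberta_tokenizer(text, return_tensors="pt",  padding="max_length", truncation=True, max_length=encoder_max_length).input_ids.to("cuda")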
  generated_ids = distilroberta_model.generate(input_ids)
  generated_text = distilroberta_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
  
  return generated_text

### [M2] BERTTiny (bert-tiny)

In [None]:
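# generate an answer with M2 for the given context, question and history
def question_generation_m2_h(context, question, history):
  # the tokenizer encodes at most a text and a text pair, so the history turns
  # are joined to the passage and question with the separator token, as in training
  sep = berttiny_tokenizer.sep_token
  text = context + sep + question + sep + sep.join(history)
  input_ids = berttiny_tokenizer(text, return_tensors="pt",  padding="max_length", truncation=True, max_length=encoder_max_length).input_ids.to("cuda")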
  generated_ids = berttiny_model.generate(input_ids)
  generated_text = berttiny_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
  
  return generated_text
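
The training-time preprocessing in this notebook (`process_data_to_model_inputs`) builds the encoder input as one string: passage, question and, when present, the history turns, all joined by the tokenizer's separator token. The sketch below reproduces that layout with plain strings; `build_model_input` is a hypothetical helper and the separator is passed in explicitly rather than read from a tokenizer.

```python
def build_model_input(passage, question, history=None, sep="[SEP]"):
    """Concatenate passage, question and optional history with `sep`."""
    text = passage + sep + question
    if history is not None:
        # Same layout as the history branch of the preprocessing function;
        # an empty history leaves a trailing separator.
        text = text + sep + sep.join(history)
    return text


# Toy example (made up for illustration)
with_history = build_model_input(
    "ann is a teacher.", "where does she work?",
    history=["who is ann?", "a teacher"], sep="</s>",
)
without_history = build_model_input("ann is a teacher.", "who is ann?", sep="</s>")
print(with_history)
print(without_history)
```

Since the encoder input is truncated to `encoder_max_length` tokens, whatever comes last in this string (the history) is the first part to be cut off.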

# [Task 6] Train and evaluate $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$

### Training packages and functions

In [None]:
!pip install allennlp_models
!pip install datasets

!rm -f seq2seq_trainer.py
!rm -f seq2seq_training_args.py
!wget https://raw.githubusercontent.com/huggingface/transformers/main/examples/legacy/seq2seq/seq2seq_trainer.py
!wget https://raw.githubusercontent.com/huggingface/transformers/main/examples/legacy/seq2seq/seq2seq_training_args.py


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
--2023-01-09 13:44:45--  https://raw.githubusercontent.com/huggingface/transformers/main/examples/legacy/seq2seq/seq2seq_trainer.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11214 (11K) [text/plain]
Saving to: ‘seq2seq_trainer.py’


2023-01-09 13:44:45 (80.1 MB/s) - ‘seq2seq_trainer.py’ saved [11214/11214]

--2023-01-09 13:44:45--  https://raw.githubusercontent.com/huggingface/transformers/main/examples/legacy/seq2seq/seq2seq_training_args.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133,

In [None]:
# restart the runtime after the installation so that the squad import works
import os
os.kill(os.getpid(), 9)

In [None]:
from datasets import Dataset
from transformers import DataCollatorForSeq2Seq
from allennlp_models.rc.tools import squad

In [None]:
def process_data_to_model_inputs(batch, tokenizer, encoder_max_length, decoder_max_length):
  sep = tokenizer.sep_token

  if "history" in batch:
    # concatenate passage, question and history before encoding
    tmp = [batch["passage"][i]+sep+batch["question"][i]+sep+sep.join(batch["history"][i]) for i in range(len(batch["passage"])) ]
  else: 
    # concatenate passage and question before encoding
    tmp = [batch["passage"][i]+sep+batch["question"][i] for i in range(len(batch["passage"])) ]

  # encode inputs and labels
  inputs = tokenizer(tmp, padding="max_length", truncation=True, max_length=encoder_max_length)
  outputs = tokenizer(batch["answer"], padding="max_length", truncation=True, max_length=decoder_max_length)

  batch["input_ids"] = inputs.input_ids
  batch["attention_mask"] = inputs.attention_mask
  batch["decoder_input_ids"] = outputs.input_ids
  batch["decoder_attention_mask"] = outputs.attention_mask
  batch["labels"] = outputs.input_ids.copy()

  # We have to make sure that the PAD token is ignored
  batch["labels"] = [[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]]
  return batch

def preparation(df, process_function, tokenizer, encoder_max_length, decoder_max_length, n):
  if n is not None: 
    df = df[:n] # subset to train faster

  ds = Dataset.from_pandas(df)
  ds = ds.map(
      process_function,
      batched=True, 
      batch_size=batch_size, 
      remove_columns=["passage", "question", "answer"],
      fn_kwargs={"tokenizer": tokenizer, "encoder_max_length": encoder_max_length , "decoder_max_length": decoder_max_length }
    )
  ds.set_format(
      type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
  )
  return ds

###############################################

from seq2seq_training_args import Seq2SeqTrainingArguments
from seq2seq_trainer import Seq2SeqTrainer

batch_size = 256

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    output_dir="./",
    logging_steps=2,
    save_steps=10,
    eval_steps=4,
)
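
The training arguments above fix how often the trainer evaluates and saves. The sketch below works out the resulting schedule for the 5000-example subset used in the training loop, assuming a single device, no gradient accumulation, a last partial batch that is kept, and the default of 3 epochs; `training_schedule` is a hypothetical helper, not part of the trainer API.

```python
import math


def training_schedule(num_examples, batch_size, num_epochs, eval_steps, save_steps):
    """Count optimisation steps, evaluations and checkpoints of one run."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "evaluations": total_steps // eval_steps,
        "checkpoints": total_steps // save_steps,
    }


schedule = training_schedule(
    num_examples=5000, batch_size=256, num_epochs=3, eval_steps=4, save_steps=10
)
print(schedule)
```

With `eval_steps=4` the validation set is scored many times per run, so a large part of the training time is likely spent on generation during evaluation rather than on optimisation.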

### [M1] DistilRoBERTa (distilroberta-base)

#### Function for metrics

In [None]:
# compute the mean SQuAD F1-score over the decoded predictions
def compute_metrics_f1_m1(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = distilroberta_tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = distilroberta_tokenizer.pad_token_id
    label_str = distilroberta_tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
   
    squad_f1_output = [squad.compute_f1(a_pred=pred_str[i], a_gold=label_str[i]) for i in range(len(pred_str))]
    
    return {
        "squad_f1_precision": sum(squad_f1_output) / len(squad_f1_output), # do the average
    }
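
The metric averages a token-overlap F1 between each generated answer and its gold answer. The sketch below is a simplified standalone version that lowercases and splits on whitespace only; SQuAD-style evaluation scripts typically also strip punctuation and articles, which this sketch does not do, so it is not a drop-in replacement for `squad.compute_f1`. The name `token_f1` and the toy strings are hypothetical.

```python
from collections import Counter


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between two strings."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match, one empty as a miss
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Toy predictions and gold answers (made up for illustration)
predictions = ["in the garden", "blue"]
golds = ["the garden", "red"]
scores = [token_f1(p, g) for p, g in zip(predictions, golds)]
mean_f1 = sum(scores) / len(scores)
print(scores, mean_f1)
```

In the first pair two of the three predicted tokens are correct (precision 2/3) and both gold tokens are found (recall 1), so long generated answers are penalised through precision even when they contain the gold answer.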
    

#### Training loop

In [None]:
# Fine-tune with 3 different seeds and report the SQuAD F1-score
# computed on the validation and test sets for each run.
seeds = [42, 2022, 1337]

n = 5000 # subset length to train faster, "None" for whole set

# keep the dialogue-level frames untouched, so that every seed splits the
# full training set instead of the already split and exploded one
full_train_df, full_test_df = train_df, test_df

for seed in seeds:
    print(f'Running with seed: {seed}')
    set_reproducibility(seed)
    
    #with shuffle
    train_df, val_df = split(full_train_df)

    # text preprocess
    train_df = preprocess(train_df)
    val_df = preprocess(val_df)
    test_df = preprocess(full_test_df)
   
    # build df with history
    h_train_df = add_history(train_df.copy())
    h_val_df = add_history(val_df.copy())
    h_test_df = add_history(test_df.copy())
 
    df = pd.concat([train_df, val_df, test_df]) # DataFrame.append is deprecated

    encoder_max_length = 32 # int(pd.Series([len(df.iloc[i]["passage"]) for i in range(len(df["passage"]))]).quantile())
    decoder_max_length = int(pd.Series([len(df.iloc[i]["answer"]) for i in range(len(df["answer"]))]).quantile())

    print("Train df dialogues: ",train_df.shape,h_train_df.shape)
    print("Validation df dialogues: ",val_df.shape,h_val_df.shape)
    print("Test df dialogues: ",test_df.shape, h_test_df.shape)

####################################
# NO HISTORY
    
    # model and tokenizer
    distilroberta_model, distilroberta_tokenizer = get_m1()

    # process dataset to model input
    train_ds = preparation(train_df, process_data_to_model_inputs, distilroberta_tokenizer, encoder_max_length, decoder_max_length, n) 
    val_ds = preparation(val_df, process_data_to_model_inputs, distilroberta_tokenizer, encoder_max_length, decoder_max_length, n)
    test_ds = preparation(test_df, process_data_to_model_inputs, distilroberta_tokenizer, encoder_max_length, decoder_max_length, n)

    # data collator
    data_collator = DataCollatorForSeq2Seq(
        tokenizer = distilroberta_tokenizer,
        model = distilroberta_model,
        label_pad_token_id = -100,
        return_tensors = 'pt' )

    # trainer
    trainer = Seq2SeqTrainer( 
        model=distilroberta_model,
        tokenizer=distilroberta_tokenizer,
        data_collator=data_collator,
        args=training_args,
        compute_metrics=compute_metrics_f1_m1,
        train_dataset=train_ds,
        eval_dataset=val_ds
        )

# finetune for 3 epochs without history
    result = trainer.train()
    print(result)

# evaluate m1 - TEST SET
    eval_ts = trainer.evaluate(test_ds)
    print(eval_ts)

# evaluate m1 - VAL SET
    eval_vs = trainer.evaluate(val_ds)
    print(eval_vs)

####################################
# WITH HISTORY

    # model
    distilroberta_model,_ = get_m1()

    # process dataset to model input
    h_train_ds = preparation(h_train_df, process_data_to_model_inputs, distilroberta_tokenizer, encoder_max_length, decoder_max_length, n) 
    h_val_ds = preparation(h_val_df, process_data_to_model_inputs, distilroberta_tokenizer, encoder_max_length, decoder_max_length, n)
    h_test_ds = preparation(h_test_df, process_data_to_model_inputs, distilroberta_tokenizer, encoder_max_length, decoder_max_length, n)

    # data collator
    data_collator = DataCollatorForSeq2Seq(
        tokenizer = distilroberta_tokenizer,
        model = distilroberta_model,
        label_pad_token_id = -100,
        return_tensors = 'pt' )

    # trainer
    trainer = Seq2SeqTrainer( 
        model=distilroberta_model,
        tokenizer=distilroberta_tokenizer,
        data_collator=data_collator,
        args=training_args,
        compute_metrics=compute_metrics_f1_m1,
        train_dataset=h_train_ds,
        eval_dataset=h_val_ds
        )

# finetune for 3 epochs with history 
    result_h = trainer.train()
    print(result_h)

# evaluate m1 - TEST SET
    eval_h_ts = trainer.evaluate(h_test_ds)
    print(eval_h_ts)

# evaluate m1 - VAL SET
    eval_h_vs = trainer.evaluate(h_val_ds)
    print(eval_h_vs)
   
    print("-----------------------------------------------------------") 

Running with seed: 42
Train df dialogues:  (85806, 3) (85806, 4)
Validation df dialogues:  (21470, 3) (21470, 4)
Test df dialogues:  (7917, 3) (7917, 4)


Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForCausalLM were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['roberta.encoder.layer.3.crossattention.self.query.weight', 'roberta.encoder.layer.0.crossattention.self.key.weight', 'roberta.encoder.layer.0.crossattention.s

***** Running training *****
  Num examples = 5000
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 60
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc




Step,Training Loss,Validation Loss,Squad F1 Precision
4,7.1886,6.635922,0.011405
8,6.4423,6.333238,0.012007
12,6.2149,6.183811,0.019384
16,6.2285,6.081306,0.021004
20,5.9465,6.012136,0.021805
24,5.9305,5.997675,0.029804
28,5.7424,5.748796,0.018173
32,5.4672,5.51646,0.004673
36,5.527,5.385677,0.031815
40,5.4539,5.328685,0.0365


***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-10
Configuration saved in ./checkpoint-10/config.json
Model weights saved in ./checkpoint-10/pytorch_model.bin
tokenizer config file saved in ./checkpoint-10/tokenizer_config.json
Special tokens file saved in ./checkpoint-10/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-20
Configuration saved in ./checkpoint-20/config.json
Model weights saved in ./checkpoint-20/pytorch_model.bin
tokenizer config file saved in ./checkpoint-20/tokenizer_config.json
Special tokens file saved in ./checkpoint-20/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 5000
  Batch s

TrainOutput(global_step=60, training_loss=5.874481217066447, metrics={'train_runtime': 5586.0228, 'train_samples_per_second': 2.685, 'train_steps_per_second': 0.011, 'total_flos': 166882109760000.0, 'train_loss': 5.874481217066447, 'epoch': 3.0})




***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256


{'eval_loss': 5.04840087890625, 'eval_squad_f1_precision': 0.037119748845925664, 'eval_runtime': 354.4646, 'eval_samples_per_second': 14.106, 'eval_steps_per_second': 0.056, 'epoch': 3.0}
{'eval_loss': 5.1604204177856445, 'eval_squad_f1_precision': 0.0363510325948564, 'eval_runtime': 354.9534, 'eval_samples_per_second': 14.086, 'eval_steps_per_second': 0.056, 'epoch': 3.0}


Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/distilroberta-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/42d6b7c87cbac84fcdf35aa69504a5ccfca878fcee2a1a9b9ff7a3d1297f9094.aa95727ac70adfa1aaf5c88bea30a4f5e50869c68e68bce96ef1ec41b5facf46
Model config RobertaConfig {
  "_name_or_path": "distilroberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 1,


The following columns in the training set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5000
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 60
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Step,Training Loss,Validation Loss,Squad F1 Precision
4,6.871,6.405077,0.010631
8,6.2711,6.188337,0.025746
12,6.079,6.120788,0.01783
16,6.0939,6.021133,0.030075
20,5.8224,5.940138,0.023846
24,5.8013,5.805984,0.000634
28,5.5834,5.473731,0.035464
32,5.2843,5.400714,0.035128
36,5.4259,5.33089,0.020733
40,5.386,5.295061,0.040912


The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-10
Configuration saved in ./checkpoint-10/config.json
Model weights saved in ./checkpoint-10/pytorch_model.bin
tokenizer config file saved in ./checkpoint-10/tokenizer_config.json
Special tokens file saved in ./checkpoint-10/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument

TrainOutput(global_step=60, training_loss=5.761772123972575, metrics={'train_runtime': 5528.3256, 'train_samples_per_second': 2.713, 'train_steps_per_second': 0.011, 'total_flos': 166882109760000.0, 'train_loss': 5.761772123972575, 'epoch': 3.0})




The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256


{'eval_loss': 5.0364227294921875, 'eval_squad_f1_precision': 0.039271402567161155, 'eval_runtime': 364.6322, 'eval_samples_per_second': 13.712, 'eval_steps_per_second': 0.055, 'epoch': 3.0}
{'eval_loss': 5.146191596984863, 'eval_squad_f1_precision': 0.04019035096997942, 'eval_runtime': 363.2663, 'eval_samples_per_second': 13.764, 'eval_steps_per_second': 0.055, 'epoch': 3.0}
-----------------------------------------------------------
Running with seed: 2022
Train df dialogues:  (68644, 3) (68644, 4)
Validation df dialogues:  (17162, 3) (17162, 4)
Test df dialogues:  (7917, 3) (7917, 4)



***** Running training *****
  Num examples = 5000
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 60
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Step,Training Loss,Validation Loss,Squad F1 Precision
4,6.8214,6.394373,0.014534
8,6.2994,6.147851,0.015679
12,6.0036,6.038756,0.041786
16,5.8973,5.982933,0.015114
20,5.7309,5.860729,0.006848
24,5.7631,5.46441,0.005063
28,5.3129,5.363762,0.039939
32,5.3631,5.279062,0.022267
36,5.2022,5.210238,0.047481
40,5.1932,5.177122,0.04309


***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-10
Configuration saved in ./checkpoint-10/config.json
Model weights saved in ./checkpoint-10/pytorch_model.bin
tokenizer config file saved in ./checkpoint-10/tokenizer_config.json
Special tokens file saved in ./checkpoint-10/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-20
Configuration saved in ./checkpoint-20/config.json
Model weights saved in ./checkpoint-20/pytorch_model.bin
tokenizer config file saved in ./checkpoint-20/tokenizer_config.json
Special tokens file saved in ./checkpoint-20/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 5000
  Batch s

TrainOutput(global_step=60, training_loss=5.663397550582886, metrics={'train_runtime': 5561.6185, 'train_samples_per_second': 2.697, 'train_steps_per_second': 0.011, 'total_flos': 166882109760000.0, 'train_loss': 5.663397550582886, 'epoch': 3.0})




***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256


{'eval_loss': 4.982377529144287, 'eval_squad_f1_precision': 0.03153013710403823, 'eval_runtime': 362.7976, 'eval_samples_per_second': 13.782, 'eval_steps_per_second': 0.055, 'epoch': 3.0}
{'eval_loss': 5.084376335144043, 'eval_squad_f1_precision': 0.033533058925938344, 'eval_runtime': 362.6628, 'eval_samples_per_second': 13.787, 'eval_steps_per_second': 0.055, 'epoch': 3.0}



  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

The following columns in the training set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5000
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 60
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Step,Training Loss,Validation Loss,Squad F1 Precision
4,6.8214,6.394373,0.014534
8,6.2994,6.147851,0.015679
12,6.0036,6.038756,0.041786
16,5.8973,5.982933,0.015114
20,5.7309,5.860729,0.006848
24,5.7631,5.46441,0.005063
28,5.3129,5.363762,0.039939
32,5.3631,5.279062,0.022267
36,5.2022,5.210238,0.047481
40,5.1932,5.177122,0.04309


The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-10
Configuration saved in ./checkpoint-10/config.json
Model weights saved in ./checkpoint-10/pytorch_model.bin
tokenizer config file saved in ./checkpoint-10/tokenizer_config.json
Special tokens file saved in ./checkpoint-10/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument

TrainOutput(global_step=60, training_loss=5.663397550582886, metrics={'train_runtime': 5572.943, 'train_samples_per_second': 2.692, 'train_steps_per_second': 0.011, 'total_flos': 166882109760000.0, 'train_loss': 5.663397550582886, 'epoch': 3.0})




The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256


{'eval_loss': 4.982377529144287, 'eval_squad_f1_precision': 0.03153013710403823, 'eval_runtime': 358.0406, 'eval_samples_per_second': 13.965, 'eval_steps_per_second': 0.056, 'epoch': 3.0}
{'eval_loss': 5.084376335144043, 'eval_squad_f1_precision': 0.033533058925938344, 'eval_runtime': 363.6347, 'eval_samples_per_second': 13.75, 'eval_steps_per_second': 0.055, 'epoch': 3.0}
-----------------------------------------------------------
Running with seed: 1337
Train df dialogues:  (54915, 3) (54915, 4)
Validation df dialogues:  (13729, 3) (13729, 4)
Test df dialogues:  (7917, 3) (7917, 4)


Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/distilroberta-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/42d6b7c87cbac84fcdf35aa69504a5ccfca878fcee2a1a9b9ff7a3d1297f9094.aa95727ac70adfa1aaf5c88bea30a4f5e50869c68e68bce96ef1ec41b5facf46
Model config RobertaConfig {
  "_name_or_path": "distilroberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 1,


  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

***** Running training *****
  Num examples = 5000
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 60
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Step,Training Loss,Validation Loss,Squad F1 Precision
4,6.8659,6.392132,0.014384
8,6.3716,6.151516,0.027054
12,6.0768,6.062391,0.026656
16,5.9233,5.996062,0.019225
20,5.8737,5.918778,0.027512
24,5.696,5.770838,0.017191
28,5.4228,5.412414,0.008172
32,5.3781,5.370365,0.047083
36,5.4467,5.273441,0.040532
40,5.2823,5.185254,0.041614


***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-10
Configuration saved in ./checkpoint-10/config.json
Model weights saved in ./checkpoint-10/pytorch_model.bin
tokenizer config file saved in ./checkpoint-10/tokenizer_config.json
Special tokens file saved in ./checkpoint-10/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-20
Configuration saved in ./checkpoint-20/config.json
Model weights saved in ./checkpoint-20/pytorch_model.bin
tokenizer config file saved in ./checkpoint-20/tokenizer_config.json
Special tokens file saved in ./checkpoint-20/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 5000
  Batch s

TrainOutput(global_step=60, training_loss=5.702111879984538, metrics={'train_runtime': 5566.3933, 'train_samples_per_second': 2.695, 'train_steps_per_second': 0.011, 'total_flos': 166882109760000.0, 'train_loss': 5.702111879984538, 'epoch': 3.0})




***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256


{'eval_loss': 5.0075507164001465, 'eval_squad_f1_precision': 0.03933865963585794, 'eval_runtime': 359.2142, 'eval_samples_per_second': 13.919, 'eval_steps_per_second': 0.056, 'epoch': 3.0}
{'eval_loss': 5.0997467041015625, 'eval_squad_f1_precision': 0.03884002063983508, 'eval_runtime': 364.0368, 'eval_samples_per_second': 13.735, 'eval_steps_per_second': 0.055, 'epoch': 3.0}


Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/distilroberta-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/42d6b7c87cbac84fcdf35aa69504a5ccfca878fcee2a1a9b9ff7a3d1297f9094.aa95727ac70adfa1aaf5c88bea30a4f5e50869c68e68bce96ef1ec41b5facf46
Model config RobertaConfig {
  "_name_or_path": "distilroberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 1,


  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

The following columns in the training set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5000
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 60
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"




The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-10
Configuration saved in ./checkpoint-10/config.json
Model weights saved in ./checkpoint-10/pytorch_model.bin
tokenizer config file saved in ./checkpoint-10/tokenizer_config.json
Special tokens file saved in ./checkpoint-10/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument

Step,Training Loss,Validation Loss,Squad F1 Precision
4,6.8659,6.392132,0.014384
8,6.3716,6.151516,0.027054
12,6.0768,6.062391,0.026656
16,5.9233,5.996062,0.019225
20,5.8737,5.918778,0.027512
24,5.696,5.770838,0.017191
28,5.4228,5.412414,0.008172
32,5.3781,5.370365,0.047083
36,5.4467,5.273441,0.040532
40,5.2823,5.185254,0.041614


The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-20
Configuration saved in ./checkpoint-20/config.json
Model weights saved in ./checkpoint-20/pytorch_model.bin
tokenizer config file saved in ./checkpoint-20/tokenizer_config.json
Special tokens file saved in ./checkpoint-20/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
The following columns in the evaluation set don't have a corresponding argument

TrainOutput(global_step=60, training_loss=5.702111879984538, metrics={'train_runtime': 5507.5137, 'train_samples_per_second': 2.724, 'train_steps_per_second': 0.011, 'total_flos': 166882109760000.0, 'train_loss': 5.702111879984538, 'epoch': 3.0})




The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256


{'eval_loss': 5.0075507164001465, 'eval_squad_f1_precision': 0.03933865963585794, 'eval_runtime': 358.242, 'eval_samples_per_second': 13.957, 'eval_steps_per_second': 0.056, 'epoch': 3.0}
{'eval_loss': 5.0997467041015625, 'eval_squad_f1_precision': 0.03884002063983508, 'eval_runtime': 359.2745, 'eval_samples_per_second': 13.917, 'eval_steps_per_second': 0.056, 'epoch': 3.0}
-----------------------------------------------------------


####Observations
We compared squad_f1_precision on the test and validation sets for the No-H (no history) and W-H (with history) variants of model 1 across the three seeds, to find the best seed for each variant.

Test set evaluation without history (No-H):

*   seed 42: 0.037119748845925664
*   seed 2022: 0.03153013710403823
*   seed 1337: 0.03933865963585794

Test set evaluation with history (W-H):

*   seed 42: 0.039271402567161155
*   seed 2022: 0.03153013710403823
*   seed 1337: 0.03933865963585794

So the highest squad_f1_precision on the test set is obtained with seed 1337, for both variants.
On the validation set, seed 1337 also gives the highest squad_f1_precision for both the W-H and the No-H model.

Having chosen seed 1337, we continue task 7 with model 1 trained on seed 1337.
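
The seed selection above can be sketched in a few lines. This is a minimal illustration, not code from the notebook: the dictionary `test_f1` and the helper `best_seed` are hypothetical names, and the scores are the test-set values copied from the observations.

```python
# Test-set squad_f1_precision per seed, copied from the observations above.
# `test_f1` and `best_seed` are illustrative names, not part of the notebook.
test_f1 = {
    "No-H": {42: 0.037119748845925664, 2022: 0.03153013710403823, 1337: 0.03933865963585794},
    "W-H": {42: 0.039271402567161155, 2022: 0.03153013710403823, 1337: 0.03933865963585794},
}


def best_seed(scores):
    """Return the seed whose score is highest."""
    return max(scores, key=scores.get)


best_per_model = {name: best_seed(scores) for name, scores in test_f1.items()}

for name, seed in best_per_model.items():
    print(f"{name}: best seed = {seed} (F1 = {test_f1[name][seed]:.4f})")
```

With three seeds and differences this small, the ranking is only a rough guide; reporting the mean and spread across seeds would be a reasonable complement.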


### [M2] BERTTiny (bert-tiny)

#### Function for metrics

In [None]:
# Compute the mean SQuAD F1 score over a batch of predictions
def compute_metrics_f1_m2(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = berttiny_tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    # -100 marks label positions ignored by the loss; map them back to the pad token before decoding
    labels_ids[labels_ids == -100] = berttiny_tokenizer.pad_token_id
    label_str = berttiny_tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    squad_f1_output = [squad.compute_f1(a_pred=p, a_gold=g) for p, g in zip(pred_str, label_str)]
    
    return {
        "squad_f1_precision": sum(squad_f1_output) / len(squad_f1_output), # average F1 over all examples
    }
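
The `squad.compute_f1` helper is not shown in this section. As a hedged sketch, assuming it follows the token-overlap F1 of the official SQuAD evaluation script (lowercasing, removing punctuation and English articles, then comparing bags of tokens), the score for one prediction could be computed as below. `normalize_answer` and `token_f1` are names introduced here for illustration and may differ from the module actually imported by the notebook.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(a_gold, a_pred):
    """Token-overlap F1 between a gold answer and a predicted answer."""
    gold_tokens = normalize_answer(a_gold).split()
    pred_tokens = normalize_answer(a_pred).split()
    # If either side is empty, the score is 1 only when both are empty.
    if not gold_tokens or not pred_tokens:
        return float(gold_tokens == pred_tokens)
    common = Counter(gold_tokens) & Counter(pred_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("white cat", "white dog"))
```

Under this definition the value reported as `squad_f1_precision` is the F1 averaged over examples, not a precision in the strict sense.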
    

####Training loop
Batch size 256, max length 32.
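
The training logs for this run report 5000 examples, a batch size of 256, 3 epochs, and `Total optimization steps = 60`. As a rough sketch, assuming a single device and that each epoch runs ceil(examples / batch size) batches, the helper below reproduces that count. `total_optimization_steps` is a name introduced here for illustration and is not part of the Trainer API.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs, grad_accum_steps=1):
    """Approximate number of optimizer updates for a single-device run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # With gradient accumulation, several batches are merged into one update.
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs


steps = total_optimization_steps(num_examples=5000, batch_size=256, num_epochs=3)
print(steps)
```

With only 20 updates per epoch, the loss tables logged every 4 steps cover a very short training schedule, which is worth keeping in mind when reading the low F1 values.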

####SEED 42

In [None]:
#Eval on test and val set on SQUAD F1-score
#Report evaluation SQUAD F1-score computed on the validation and test sets.
#Perform multiple train and evaluation on test set and val set with 3 seeds.

seeds = [42, 2022, 1337]

n = 5000 # subset length to train faster, "None" for whole set

for seed in seeds:
    print(f'Running with seed: {seed}')
    set_reproducibility(seed)
    
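    # with shuffle
    # NOTE: this overwrites train_df, so every seed after the first splits an
    # already reduced training set (see the shrinking shapes in the output)
    train_df, val_df = split(train_df)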

    # text preprocess
    train_df = preprocess(train_df)
    val_df = preprocess(val_df)
    test_df = preprocess(test_df)

    # build df with history
    h_train_df = add_history(train_df.copy())
    h_val_df = add_history(val_df.copy())
    h_test_df = add_history(test_df.copy())

    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    df = pd.concat([train_df, val_df, test_df])
    encoder_max_length = 32 # int(pd.Series([len(df.iloc[i]["passage"]) for i in range(len(df["passage"]))]).quantile())
    # median answer length in characters (quantile() defaults to q=0.5)
    decoder_max_length = int(df["answer"].map(len).quantile())
    print("decoder_max_length" , decoder_max_length )

    print("Train df dialogues: ",train_df.shape,h_train_df.shape)
    print("Validation df dialogues: ",val_df.shape,h_val_df.shape)
    print("Test df dialogues: ",test_df.shape, h_test_df.shape)

####################################
# NO HISTORY
    
    # model and tokenizer
    berttiny_model, berttiny_tokenizer = get_m2()

    # process dataset to model input
    train_ds = preparation(train_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n) 
    val_ds = preparation(val_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)
    test_ds = preparation(test_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)

    # data collator
    data_collator = DataCollatorForSeq2Seq(
        tokenizer = berttiny_tokenizer,
        model = berttiny_model,
        label_pad_token_id = -100,
        return_tensors = 'pt' )

    # trainer
    trainer = Seq2SeqTrainer( 
        model=berttiny_model,
        tokenizer=berttiny_tokenizer,
        data_collator=data_collator,
        args=training_args,
        compute_metrics=compute_metrics_f1_m2,
        train_dataset=train_ds,
        eval_dataset=val_ds
        )

    # finetune for 3 epochs without history
    result = trainer.train()
    print("TRAIN ")
    print(result)

    # evaluate m2 - TEST SET
    eval_ts = trainer.evaluate(test_ds)
    print("EVAL TEST SET")
    print(eval_ts)

    # evaluate m2 - VAL SET
    eval_vs = trainer.evaluate(val_ds)
    print("EVAL VAL SET")
    print(eval_vs)

####################################
# WITH HISTORY

    # model
    berttiny_model,_ = get_m2()

    # process dataset to model input
    h_train_ds = preparation(h_train_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n) 
    h_val_ds = preparation(h_val_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)
    h_test_ds = preparation(h_test_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)

    # trainer
    trainer = Seq2SeqTrainer(
        model=berttiny_model,
        tokenizer=berttiny_tokenizer,
        data_collator=data_collator,
        args=training_args,
        compute_metrics=compute_metrics_f1_m2,
        train_dataset=h_train_ds, 
        eval_dataset=h_val_ds
        )

    # finetune for 3 epochs with history
    result_h = trainer.train()
    print("TRAIN H")
    print(result_h)

    # evaluate m2 - TEST SET
    eval_h_ts = trainer.evaluate(h_test_ds)
    print("EVAL H TEST SET")
    print(eval_h_ts)

    # evaluate m2 - VAL SET
    eval_h_vs = trainer.evaluate(h_val_ds)
    print("EVAL H VAL SET")
    print(eval_h_vs)
   
    print("-----------------------------------------------------------") 

Running with seed: 42
decoder_max_length 10
Train df dialogues:  (35145, 3) (35145, 4)
Validation df dialogues:  (8787, 3) (8787, 4)
Test df dialogues:  (7917, 3) (7917, 4)


Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/prajjwal1/bert-tiny/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3cf34679007e9fe5d0acd644dcc1f4b26bec5cbc9612364f6da7262aed4ef7a4.a5a11219cf90aae61ff30e1658ccf2cb4aa84d6b6e947336556f887c9828dc6d
Model config BertConfig {
  "_name_or_path": "prajjwal1/bert-tiny",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading file https://huggingface.co/prajjwal1/bert-ti

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

***** Running training *****
  Num examples = 5000
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 60
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Step,Training Loss,Validation Loss,Squad F1 Precision
4,12.4336,11.32,0.001508
8,11.0839,9.711552,0.001229
12,9.5493,8.434757,0.002039
16,8.5848,7.47822,0.001724
20,7.7585,6.72488,0.001876
24,6.9477,6.216929,0.001143
28,6.5941,6.039185,0.000249
32,6.5057,5.900805,0.000817
36,6.1449,5.687316,0.00092
40,6.0929,5.50463,0.000741


***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-10
Configuration saved in ./checkpoint-10/config.json
Model weights saved in ./checkpoint-10/pytorch_model.bin
tokenizer config file saved in ./checkpoint-10/tokenizer_config.json
Special tokens file saved in ./checkpoint-10/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-20
Configuration saved in ./checkpoint-20/config.json
Model weights saved in ./checkpoint-20/pytorch_model.bin
tokenizer config file saved in ./checkpoint-20/tokenizer_config.json
Special tokens file saved in ./checkpoint-20/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 5000
  Batch s

TRAIN 
TrainOutput(global_step=60, training_loss=7.442484966913859, metrics={'train_runtime': 12336.1632, 'train_samples_per_second': 1.216, 'train_steps_per_second': 0.005, 'total_flos': 1708444800000.0, 'train_loss': 7.442484966913859, 'epoch': 3.0})




***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256


EVAL TEST SET
{'eval_loss': 5.126132011413574, 'eval_squad_f1_precision': 0.0040592352923334235, 'eval_runtime': 762.0286, 'eval_samples_per_second': 6.561, 'eval_steps_per_second': 0.026, 'epoch': 3.0}
EVAL VAL SET
{'eval_loss': 5.241342067718506, 'eval_squad_f1_precision': 0.0034644237536942883, 'eval_runtime': 787.7256, 'eval_samples_per_second': 6.347, 'eval_steps_per_second': 0.025, 'epoch': 3.0}


Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/prajjwal1/bert-tiny/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3cf34679007e9fe5d0acd644dcc1f4b26bec5cbc9612364f6da7262aed4ef7a4.a5a11219cf90aae61ff30e1658ccf2cb4aa84d6b6e947336556f887c9828dc6d
Model config BertConfig {
  "_name_or_path": "prajjwal1/bert-tiny",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading file https://huggingface.co/prajjwal1/bert-ti

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

The following columns in the training set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5000
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 60
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Step,Training Loss,Validation Loss,Squad F1 Precision
4,12.4537,11.330351,0.001121
8,11.1101,9.723199,0.001069
12,9.5766,8.438064,0.002069
16,8.6118,7.478153,0.001527
20,7.7668,6.726127,0.00069
24,6.9542,6.218497,0.001453
28,6.6058,6.039907,0.001027
32,6.504,5.89933,0.001851
36,6.1315,5.680974,0.000434
40,6.0834,5.49578,0.000739


The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-10
Configuration saved in ./checkpoint-10/config.json
Model weights saved in ./checkpoint-10/pytorch_model.bin
tokenizer config file saved in ./checkpoint-10/tokenizer_config.json
Special tokens file saved in ./checkpoint-10/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument

TRAIN H
TrainOutput(global_step=60, training_loss=7.448433128992717, metrics={'train_runtime': 12624.5538, 'train_samples_per_second': 1.188, 'train_steps_per_second': 0.005, 'total_flos': 1708444800000.0, 'train_loss': 7.448433128992717, 'epoch': 3.0})




The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256


EVAL H TEST SET
{'eval_loss': 5.126367092132568, 'eval_squad_f1_precision': 0.005426307720284922, 'eval_runtime': 834.3064, 'eval_samples_per_second': 5.993, 'eval_steps_per_second': 0.024, 'epoch': 3.0}
EVAL H VAL SET
{'eval_loss': 5.23521089553833, 'eval_squad_f1_precision': 0.0049122834675423095, 'eval_runtime': 840.4239, 'eval_samples_per_second': 5.949, 'eval_steps_per_second': 0.024, 'epoch': 3.0}
-----------------------------------------------------------
Running with seed: 2022
decoder_max_length 10
Train df dialogues:  (28116, 3) (28116, 4)
Validation df dialogues:  (7029, 3) (7029, 4)
Test df dialogues:  (7917, 3) (7917, 4)


Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/prajjwal1/bert-tiny/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3cf34679007e9fe5d0acd644dcc1f4b26bec5cbc9612364f6da7262aed4ef7a4.a5a11219cf90aae61ff30e1658ccf2cb4aa84d6b6e947336556f887c9828dc6d
Model config BertConfig {
  "_name_or_path": "prajjwal1/bert-tiny",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading file https://huggingface.co/prajjwal1/bert-ti

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

***** Running training *****
  Num examples = 5000
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 60
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Step,Training Loss,Validation Loss,Squad F1 Precision
4,12.3451,11.30187,0.001182
8,10.9544,9.689857,0.00089
12,9.593,8.403697,0.001715
16,8.6944,7.441975,0.001378
20,7.708,6.693463,0.000546
24,7.0052,6.200339,0.001111
28,6.4376,6.02883,0.001179
32,6.3143,5.884699,0.001091
36,5.9831,5.663999,0.000554
40,5.8925,5.477365,0.001137


***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-10
Configuration saved in ./checkpoint-10/config.json
Model weights saved in ./checkpoint-10/pytorch_model.bin
tokenizer config file saved in ./checkpoint-10/tokenizer_config.json
Special tokens file saved in ./checkpoint-10/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-20
Configuration saved in ./checkpoint-20/config.json
Model weights saved in ./checkpoint-20/pytorch_model.bin
tokenizer config file saved in ./checkpoint-20/tokenizer_config.json
Special tokens file saved in ./checkpoint-20/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 5000
  Batch s

TRAIN 
TrainOutput(global_step=60, training_loss=7.407644017537435, metrics={'train_runtime': 12287.869, 'train_samples_per_second': 1.221, 'train_steps_per_second': 0.005, 'total_flos': 1708444800000.0, 'train_loss': 7.407644017537435, 'epoch': 3.0})




***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256


EVAL TEST SET
{'eval_loss': 5.137295722961426, 'eval_squad_f1_precision': 0.005341292841617165, 'eval_runtime': 747.4351, 'eval_samples_per_second': 6.69, 'eval_steps_per_second': 0.027, 'epoch': 3.0}
EVAL VAL SET
{'eval_loss': 5.212394714355469, 'eval_squad_f1_precision': 0.005745095045936686, 'eval_runtime': 756.2235, 'eval_samples_per_second': 6.612, 'eval_steps_per_second': 0.026, 'epoch': 3.0}


Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/prajjwal1/bert-tiny/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3cf34679007e9fe5d0acd644dcc1f4b26bec5cbc9612364f6da7262aed4ef7a4.a5a11219cf90aae61ff30e1658ccf2cb4aa84d6b6e947336556f887c9828dc6d
Model config BertConfig {
  "_name_or_path": "prajjwal1/bert-tiny",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading file https://huggingface.co/prajjwal1/bert-ti

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

The following columns in the training set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5000
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 60
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Step,Training Loss,Validation Loss,Squad F1 Precision
4,12.3451,11.30187,0.001182
8,10.9544,9.689857,0.00089
12,9.593,8.403697,0.001715
16,8.6944,7.441975,0.001378
20,7.708,6.693463,0.000546
24,7.0052,6.200339,0.001111
28,6.4376,6.02883,0.001179
32,6.3143,5.884699,0.001091
36,5.9831,5.663999,0.000554
40,5.8925,5.477365,0.001137


The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-10
Configuration saved in ./checkpoint-10/config.json
Model weights saved in ./checkpoint-10/pytorch_model.bin
tokenizer config file saved in ./checkpoint-10/tokenizer_config.json
Special tokens file saved in ./checkpoint-10/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument

#### SEED 2022

In [None]:
# Evaluate on the test and validation sets with the SQuAD F1-score.
# Train with seed=2022.
seed = 2022

n = 5000  # subset length to train faster, None for the whole set

print(f'Running with seed: {seed}')
set_reproducibility(seed)

# split with shuffle
train_df, val_df = split(train_df)

# text preprocess
train_df = preprocess(train_df)
val_df = preprocess(val_df)
test_df = preprocess(test_df)

# build df with history
h_train_df = add_history(train_df.copy())
h_val_df = add_history(val_df.copy())
h_test_df = add_history(test_df.copy())

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df = pd.concat([train_df, val_df, test_df])
encoder_max_length = 32  # alternative: median passage length, int(df["passage"].map(len).quantile())
# median answer length (Series.quantile() defaults to q=0.5)
decoder_max_length = int(df["answer"].map(len).quantile())
print("decoder_max_length", decoder_max_length)

print("Train df dialogues: ",train_df.shape,h_train_df.shape)
print("Validation df dialogues: ",val_df.shape,h_val_df.shape)
print("Test df dialogues: ",test_df.shape, h_test_df.shape)

####################################
# NO HISTORY
print("NO- HISTORY -------------------------")  
# model and tokenizer
berttiny_model, berttiny_tokenizer = get_m2()

# process dataset to model input
train_ds = preparation(train_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)
val_ds = preparation(val_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)
test_ds = preparation(test_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)

# data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer=berttiny_tokenizer,
    model=berttiny_model,
    label_pad_token_id=-100,
    return_tensors='pt',
)

# trainer
trainer = Seq2SeqTrainer(
    model=berttiny_model,
    tokenizer=berttiny_tokenizer,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics_f1_m2,
    train_dataset=train_ds,
    eval_dataset=val_ds,
)

# fine-tune for 3 epochs without history
print("TRAIN ")
result = trainer.train()
print(result)

# evaluate m2 - TEST SET
print("EVAL TEST SET")
eval_ts = trainer.evaluate(test_ds)
print(eval_ts)

# evaluate m2 - VAL SET
print("EVAL VAL SET")
eval_vs = trainer.evaluate(val_ds)
print(eval_vs)

####################################
# WITH HISTORY
print("WITHS- HISTORY -------------------------") 
# model (the tokenizer from the no-history run is reused)
berttiny_model, _ = get_m2()

# process dataset to model input
h_train_ds = preparation(h_train_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)
h_val_ds = preparation(h_val_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)
h_test_ds = preparation(h_test_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)

# trainer
trainer = Seq2SeqTrainer(
    model=berttiny_model,
    tokenizer=berttiny_tokenizer,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics_f1_m2,
    train_dataset=h_train_ds,
    eval_dataset=h_val_ds,
)

# fine-tune for 3 epochs with history
print("TRAIN H")
result_h = trainer.train()
print(result_h)

# evaluate m2 - TEST SET
print("EVAL H TEST SET")
eval_h_ts = trainer.evaluate(h_test_ds)
print(eval_h_ts)

# evaluate m2 - VAL SET
print("EVAL H VAL SET")
eval_h_vs = trainer.evaluate(h_val_ds)
print(eval_h_vs)

Running with seed: 2022
decoder_max_length 10
Train df dialogues:  (85685, 3) (85685, 4)
Validation df dialogues:  (21591, 3) (21591, 4)
Test df dialogues:  (7917, 3) (7917, 4)
NO- HISTORY -------------------------


Some weights of the model checkpoint at prajjwal1/bert-tiny were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at prajjwal1/bert-tiny were not used when initializing BertLMHeadModel: ['cls.seq_re

***** Running training *****
  Num examples = 5000
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 60
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


TRAIN 


Step,Training Loss,Validation Loss,Squad F1 Precision
4,12.6411,11.305256,0.00124
8,10.8335,9.677795,0.001131
12,9.5295,8.388158,0.001503
16,8.65,7.417842,0.001493
20,7.8208,6.658098,0.001325
24,6.9585,6.168995,0.001001
28,6.5114,6.016972,0.00087
32,6.2736,5.879708,0.001998
36,5.9778,5.661326,0.001095
40,5.8611,5.48358,0.000379


***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-10
Configuration saved in ./checkpoint-10/config.json
Model weights saved in ./checkpoint-10/pytorch_model.bin
tokenizer config file saved in ./checkpoint-10/tokenizer_config.json
Special tokens file saved in ./checkpoint-10/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-20
Configuration saved in ./checkpoint-20/config.json
Model weights saved in ./checkpoint-20/pytorch_model.bin
tokenizer config file saved in ./checkpoint-20/tokenizer_config.json
Special tokens file saved in ./checkpoint-20/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 5000
  Batch s

TrainOutput(global_step=60, training_loss=7.4087478478749595, metrics={'train_runtime': 10773.8456, 'train_samples_per_second': 1.392, 'train_steps_per_second': 0.006, 'total_flos': 1708444800000.0, 'train_loss': 7.4087478478749595, 'epoch': 3.0})
EVAL TEST SET




***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256


{'eval_loss': 5.162013053894043, 'eval_squad_f1_precision': 0.000975284131086072, 'eval_runtime': 542.8658, 'eval_samples_per_second': 9.21, 'eval_steps_per_second': 0.037, 'epoch': 3.0}
EVAL VAL SET
{'eval_loss': 5.23555326461792, 'eval_squad_f1_precision': 0.0011333311727720582, 'eval_runtime': 529.4842, 'eval_samples_per_second': 9.443, 'eval_steps_per_second': 0.038, 'epoch': 3.0}
WITHS- HISTORY -------------------------


Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/prajjwal1/bert-tiny/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3cf34679007e9fe5d0acd644dcc1f4b26bec5cbc9612364f6da7262aed4ef7a4.a5a11219cf90aae61ff30e1658ccf2cb4aa84d6b6e947336556f887c9828dc6d
Model config BertConfig {
  "_name_or_path": "prajjwal1/bert-tiny",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading file https://huggingface.co/prajjwal1/bert-ti

The following columns in the training set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5000
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 60
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


TRAIN H


Step,Training Loss,Validation Loss,Squad F1 Precision
4,12.6215,11.302345,0.001237
8,10.8293,9.67864,0.001095
12,9.5404,8.382896,0.001832
16,8.6472,7.419201,0.001423
20,7.8254,6.667495,0.000546
24,6.959,6.170409,0.000889
28,6.5129,6.001746,0.001686
32,6.2594,5.864589,0.001982
36,5.9676,5.646572,0.001084
40,5.8471,5.466915,0.001573


The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-10
Configuration saved in ./checkpoint-10/config.json
Model weights saved in ./checkpoint-10/pytorch_model.bin
tokenizer config file saved in ./checkpoint-10/tokenizer_config.json
Special tokens file saved in ./checkpoint-10/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument

TrainOutput(global_step=60, training_loss=7.399402030309042, metrics={'train_runtime': 11637.567, 'train_samples_per_second': 1.289, 'train_steps_per_second': 0.005, 'total_flos': 1708444800000.0, 'train_loss': 7.399402030309042, 'epoch': 3.0})
EVAL H TEST SET




The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256


{'eval_loss': 5.147007465362549, 'eval_squad_f1_precision': 0.0042358191953715, 'eval_runtime': 721.8053, 'eval_samples_per_second': 6.927, 'eval_steps_per_second': 0.028, 'epoch': 3.0}
EVAL H VAL SET
{'eval_loss': 5.217400074005127, 'eval_squad_f1_precision': 0.0035788208593569132, 'eval_runtime': 732.925, 'eval_samples_per_second': 6.822, 'eval_steps_per_second': 0.027, 'epoch': 3.0}
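
A note on the length settings used in the seed runs: `Series.quantile()` called without arguments returns the 0.5 quantile, so `decoder_max_length` is the median of `len(answer)`. The sketch below illustrates this on a few invented answers. It assumes that the `answer` column holds plain strings, in which case `len` counts characters rather than tokenizer tokens; the sample answers are not taken from the CoQA splits.

```python
import pandas as pd

# Invented answers for illustration only (assumption: answers are plain strings).
answers = pd.Series(["yes", "in a barn", "white", "Dennis Farina"])

lengths = answers.map(len)           # character lengths: 3, 9, 5, 13
median_length = lengths.quantile()   # default q=0.5, i.e. the median
decoder_max_length = int(median_length)

print(lengths.tolist(), median_length, decoder_max_length)
```

If the intended budget is a number of tokens, the same computation could be applied to tokenized lengths instead.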


#### SEED 1337

In [None]:
# Evaluate on the test and validation sets with the SQuAD F1-score.
# Train with seed=1337.
seeds = [1337]
avg_metric_info = {}

n = 5000 # subset length to train faster, "None" for whole set

for seed in seeds:
    print(f'Running with seed: {seed}')
    set_reproducibility(seed)
    
    #with shuffle
    train_df, val_df = split(train_df)

    # text preprocess
    train_df = preprocess(train_df)
    val_df = preprocess(val_df)
    test_df = preprocess(test_df)

    # build df with history
    h_train_df = add_history(train_df.copy())
    h_val_df = add_history(val_df.copy())
    h_test_df = add_history(test_df.copy())

    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
    df = pd.concat([train_df, val_df, test_df])

    encoder_max_length = 32  # alternative: median passage length, int(df["passage"].map(len).quantile())
    # median answer length (Series.quantile() defaults to q=0.5)
    decoder_max_length = int(df["answer"].map(len).quantile())

    print("Train df dialogues: ",train_df.shape,h_train_df.shape)
    print("Validation df dialogues: ",val_df.shape,h_val_df.shape)
    print("Test df dialogues: ",test_df.shape, h_test_df.shape)

####################################
# NO HISTORY
    
    # model and tokenizer
    berttiny_model, berttiny_tokenizer = get_m2()

    # process dataset to model input
    train_ds = preparation(train_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n) 
    val_ds = preparation(val_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)
    test_ds = preparation(test_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)

    # data collator
    data_collator = DataCollatorForSeq2Seq(
        tokenizer = berttiny_tokenizer,
        model = berttiny_model,
        label_pad_token_id = -100,
        return_tensors = 'pt' )

    # trainer
    trainer = Seq2SeqTrainer( 
        model=berttiny_model,
        tokenizer=berttiny_tokenizer,
        data_collator=data_collator,
        args=training_args,
        compute_metrics=compute_metrics_f1_m2,
        train_dataset=train_ds,
        eval_dataset=val_ds
        )

    # fine-tune for 3 epochs without history
    result = trainer.train()
    print(result)

    # evaluate m2 - TEST SET
    eval_ts = trainer.evaluate(test_ds)
    print(eval_ts)

    # evaluate m2 - VAL SET
    eval_vs = trainer.evaluate(val_ds)
    print(eval_vs)

    ####################################
    # WITH HISTORY

    # model
    berttiny_model,_ = get_m2()

    # process dataset to model input
    h_train_ds = preparation(h_train_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n) 
    h_val_ds = preparation(h_val_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)
    h_test_ds = preparation(h_test_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)

    # trainer
    trainer = Seq2SeqTrainer(
        model=berttiny_model,
        tokenizer=berttiny_tokenizer,
        data_collator=data_collator,
        args=training_args,
        compute_metrics=compute_metrics_f1_m2,
        train_dataset=h_train_ds, 
        eval_dataset=h_val_ds
        )

    # fine-tune for 3 epochs with history
    result_h = trainer.train()
    print(result_h)

    # evaluate m2 - TEST SET
    eval_h_ts = trainer.evaluate(h_test_ds)
    print(eval_h_ts)

    # evaluate m2 - VAL SET
    eval_h_vs = trainer.evaluate(h_val_ds)
    print(eval_h_vs)
   
    print("-----------------------------------------------------------") 

Running with seed: 1337
Train df dialogues:  (85722, 3) (85722, 4)
Validation df dialogues:  (21554, 3) (21554, 4)
Test df dialogues:  (7917, 3) (7917, 4)


Some weights of the model checkpoint at prajjwal1/bert-tiny were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at prajjwal1/bert-tiny were not used when initializing BertLMHeadModel: ['cls.seq_re

***** Running training *****
  Num examples = 5000
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 60
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Step,Training Loss,Validation Loss,Squad F1 Precision
4,12.4189,11.333225,0.001306
8,10.9131,9.745239,0.001334
12,9.7503,8.465384,0.00224
16,8.6879,7.499969,0.001423
20,7.8167,6.749461,0.001011
24,6.9931,6.255508,0.000909
28,6.6332,6.080179,0.000887
32,6.3438,5.933303,0.000483
36,6.1393,5.715269,0.001111
40,6.0235,5.53606,0.000546


***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-10
Configuration saved in ./checkpoint-10/config.json
Model weights saved in ./checkpoint-10/pytorch_model.bin
tokenizer config file saved in ./checkpoint-10/tokenizer_config.json
Special tokens file saved in ./checkpoint-10/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-20
Configuration saved in ./checkpoint-20/config.json
Model weights saved in ./checkpoint-20/pytorch_model.bin
tokenizer config file saved in ./checkpoint-20/tokenizer_config.json
Special tokens file saved in ./checkpoint-20/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 5000
  Batch s

TrainOutput(global_step=60, training_loss=7.481188535690308, metrics={'train_runtime': 11902.0068, 'train_samples_per_second': 1.26, 'train_steps_per_second': 0.005, 'total_flos': 1708444800000.0, 'train_loss': 7.481188535690308, 'epoch': 3.0})




***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256


{'eval_loss': 5.162091255187988, 'eval_squad_f1_precision': 0.0024022488186602286, 'eval_runtime': 725.2914, 'eval_samples_per_second': 6.894, 'eval_steps_per_second': 0.028, 'epoch': 3.0}


Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/prajjwal1/bert-tiny/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3cf34679007e9fe5d0acd644dcc1f4b26bec5cbc9612364f6da7262aed4ef7a4.a5a11219cf90aae61ff30e1658ccf2cb4aa84d6b6e947336556f887c9828dc6d
Model config BertConfig {
  "_name_or_path": "prajjwal1/bert-tiny",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading file https://huggingface.co/prajjwal1/bert-ti

{'eval_loss': 5.273911952972412, 'eval_squad_f1_precision': 0.002010672564513042, 'eval_runtime': 738.8288, 'eval_samples_per_second': 6.767, 'eval_steps_per_second': 0.027, 'epoch': 3.0}


loading file https://huggingface.co/prajjwal1/bert-tiny/resolve/main/tokenizer.json from cache at None
loading file https://huggingface.co/prajjwal1/bert-tiny/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/prajjwal1/bert-tiny/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/prajjwal1/bert-tiny/resolve/main/tokenizer_config.json from cache at None

The following columns in the training set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5000
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 60
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"



The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-10
Configuration saved in ./checkpoint-10/config.json
Model weights saved in ./checkpoint-10/pytorch_model.bin
tokenizer config file saved in ./checkpoint-10/tokenizer_config.json
Special tokens file saved in ./checkpoint-10/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument

Step,Training Loss,Validation Loss,Squad F1 Precision
4,12.4086,11.30468,0.00137
8,10.9021,9.707073,0.001201
12,9.7346,8.429567,0.001748
16,8.6651,7.475542,0.001515
20,7.7935,6.727597,0.000825
24,6.9661,6.22366,0.001113
28,6.6081,6.03802,0.001866
32,6.3117,5.891351,0.001428
36,6.1076,5.672801,0.000782
40,5.9862,5.494721,0.001869


The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-20
Configuration saved in ./checkpoint-20/config.json
Model weights saved in ./checkpoint-20/pytorch_model.bin
tokenizer config file saved in ./checkpoint-20/tokenizer_config.json
Special tokens file saved in ./checkpoint-20/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument

TrainOutput(global_step=60, training_loss=7.453497139612834, metrics={'train_runtime': 12046.0044, 'train_samples_per_second': 1.245, 'train_steps_per_second': 0.005, 'total_flos': 1708444800000.0, 'train_loss': 7.453497139612834, 'epoch': 3.0})




The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256


{'eval_loss': 5.130850791931152, 'eval_squad_f1_precision': 0.004460703282142007, 'eval_runtime': 799.0387, 'eval_samples_per_second': 6.258, 'eval_steps_per_second': 0.025, 'epoch': 3.0}
{'eval_loss': 5.239361763000488, 'eval_squad_f1_precision': 0.004205743734214949, 'eval_runtime': 789.0314, 'eval_samples_per_second': 6.337, 'eval_steps_per_second': 0.025, 'epoch': 3.0}
-----------------------------------------------------------


#### Observations
For each seed we compared the squad_f1_precision of the model without history (No-H) and with history (W-H) on the test and validation sets, in order to find the best seed value per model.
For model 2 we obtained:

Test set evaluation without H:

*   seed 42: 0.004059235292333423
*   seed 2022: 0.000975284131086072
*   seed 1337: 0.0024022488186602286

Test set evaluation with H:

*   seed 42: 0.005426307720284922
*   seed 2022: 0.0042358191953715
*   seed 1337: 0.004460703282142007

Seed 42 therefore gives the highest squad_f1_precision on the test set, both without and with H.
On the validation set, seed 42 also gives the highest squad_f1_precision for both the No-H and the W-H model.

We choose seed=42 and carry out Task 7 with model 2 trained on that seed.
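
The comparison can also be done programmatically. The following is a minimal sketch that copies the test-set values listed in these observations into a dictionary and picks the seed with the highest score per setting; the names `test_f1` and `best_seed` are introduced here for illustration only.

```python
# squad_f1_precision on the test set for model 2, copied from the observations.
test_f1 = {
    "no_history": {
        42: 0.004059235292333423,
        2022: 0.000975284131086072,
        1337: 0.0024022488186602286,
    },
    "with_history": {
        42: 0.005426307720284922,
        2022: 0.0042358191953715,
        1337: 0.004460703282142007,
    },
}

# For each setting, keep the seed whose score is the largest.
best_seed = {
    setting: max(scores, key=scores.get)
    for setting, scores in test_f1.items()
}

print(best_seed)
```

With three seeds and scores this close to zero, the ranking mainly serves to fix one model for the error analysis.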


# [Task 7] Error Analysis

Report the worst 5 model errors for each source w.r.t. SQUAD F1-score.

Because the models were trained on a small subset of the data (5000 examples), the errors give only limited indications of their actual capabilities and limitations.
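
One possible way to select the worst errors per source is sketched below. It is not the notebook's metric function (`compute_metrics_f1_m2` is defined elsewhere); instead it uses a hand-written SQuAD-style token F1, and the helper names `squad_f1` and `worst_errors_per_source` as well as the `prediction` column are assumptions made for this example.

```python
import re
import string
from collections import Counter

import pandas as pd


def normalize(text: str) -> list:
    """Lowercase, drop punctuation and articles, split on whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def squad_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens, ref_tokens = normalize(prediction), normalize(reference)
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def worst_errors_per_source(df: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Return the k lowest-F1 rows for each value of the 'source' column."""
    scored = df.assign(
        f1=[squad_f1(p, a) for p, a in zip(df["prediction"], df["answer"])]
    )
    return scored.sort_values(["source", "f1"]).groupby("source").head(k)


# Invented predictions, for illustration only.
toy = pd.DataFrame({
    "source": ["cnn", "cnn", "mctest"],
    "answer": ["Dennis Farina", "Actor", "white"],
    "prediction": ["dennis farina", "a singer", "the white one"],
})
worst = worst_errors_per_source(toy, k=1)
print(worst)
```

Sorting by `source` and `f1` before `groupby(...).head(k)` keeps the lowest-scoring rows of each source together in the result.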

Add the `source` column to the dataframe

In [None]:
import pandas as pd

# Extract the dataset into a dataframe that also keeps the "source" column.
# The parameter is named `dialogues` so that it does not shadow the json module.
def extract_data_src(dialogues):
    data = []
    for d in dialogues:
        row = {
            "passage": d["story"],
            "question": [q["input_text"] for q in d["questions"]],
            "answer": [a["input_text"] for a in d["answers"]],
            "source": d["source"],
        }
        data.append(row)
    return pd.DataFrame(data)

In [None]:
train_df_src = extract_data_src(train_dialogues)
test_df_src = extract_data_src(test_dialogues)
train_df_src

Unnamed: 0,passage,question,answer,source
0,"The Vatican Apostolic Library (), more commonl...","[When was the Vat formally opened?, what is th...","[It was formally established in 1475, research...",wikipedia
1,New York (CNN) -- More than 80 Michael Jackson...,"[Where was the Auction held?, How much did the...","[Hard Rock Cafe, $2 million., $120,000, Hoffma...",cnn
2,"CHAPTER VII. THE DAUGHTER OF WITHERSTEEN \n\n""...","[What did Venters call Lassiter?, Who asked La...","[gun-man, Jane, Yes, to take charge of her cat...",gutenberg
3,(CNN) -- The longest-running holiday special s...,"[Who is Rudolph's father?, Why does Rudolph ru...","[Donner, he felt like an outcast, his nose glo...",cnn
4,CHAPTER XXIV. THE INTERRUPTED MASS \n\nThe mor...,"[Who arrived at the church?, Who was followed ...","[the garrison first, Fra. Domenico, Valentina,...",gutenberg
...,...,...,...,...
7194,"CHAPTER XX \n\nFAST IN THE ICE \n\n""Well, ther...","[Who wanted to go to shore?, Did they go?, Wha...","[Andy and Chet, Yes, unknown, Barwell Dawson a...",gutenberg
7195,(CNN) -- The biological mother of a missing 7-...,"[When did the boy go missing?, How many weeks ...","[June 4, More than two weeks, His mother, Yes,...",cnn
7196,"By the time Rihanna was seventeen ,she had rel...","[what was the name of Rihanna's first album?, ...","[Pon de Replay, 2005, Def Jam Recordings, 1988...",race
7197,"Frankfurt, officially Frankfurt am Main (Liter...","[What is the largest city in Hesse?, Is it the...","[Frankfurt, no, four, a German state, the Bank...",wikipedia


In [None]:
test_df_src 

Unnamed: 0,passage,question,answer,source
0,"Once upon a time, in a barn near a farm house,...","[What color was Cotton?, Where did she live?, ...","[white, in a barn, no, with her mommy and 5 si...",mctest
1,Once there was a beautiful fish named Asta. As...,"[what was the name of the fish, What looked li...","[Asta., a bottle, Asta., Yes, Yes, a note, No,...",mctest
2,"My doorbell rings. On the step, I find the eld...","[Who is at the door?, Is she carrying somethin...","[An elderly Chinese lady and a little boy, Yes...",race
3,"(CNN) -- Dennis Farina, the dapper, mustachioe...","[Is someone in showbiz?, Whom?, What did he do...","[Yes., Dennis Farina, Actor, No, Yes, No, Fari...",cnn
4,Kendra and Quinton travel to and from school e...,[Where do Quinton and Kendra travel to and fro...,"[school, No, go to Quentin's house, No, No, st...",mctest
...,...,...,...,...
495,Alan worked in an office in the city. He worke...,"[Where does Alan decide to go?, How many activ...","[William Farm, Three, horse riding, walking, f...",race
496,The kitchen comes alive at night in the Sander...,"[What is the dad's name?, What is his last nam...","[Ryan, Sanderson, yes, Susan, go back to bed, ...",mctest
497,A440 or A4 (also known as the Stuttgart pitch)...,"[What entity standardized A4 on 440 Hertz?, Wh...",[International Organization for Standardizatio...,wikipedia
498,"The dog, called Prince, was an intelligent ani...","[What is the dog's name?, Who is his owner?, I...","[Prince, Williams, no, the general store, to g...",race
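
The dataframes above hold one row per dialogue, with the questions and answers stored as lists. To rank errors per source, one row per question-answer pair is more convenient. A minimal sketch, assuming pandas 1.3 or later (multi-column `explode`) and using an invented two-dialogue frame in the same layout:

```python
import pandas as pd

# Invented frame with the same columns as train_df_src / test_df_src.
dialogues = pd.DataFrame({
    "passage": ["toy passage A", "toy passage B"],
    "question": [["What color was Cotton?", "Where did she live?"], ["Whom?"]],
    "answer": [["white", "in a barn"], ["Dennis Farina"]],
    "source": ["mctest", "cnn"],
})

# Explode both list columns in lockstep; passage and source are repeated per pair.
qa_rows = dialogues.explode(["question", "answer"], ignore_index=True)
print(qa_rows[["question", "answer", "source"]])
```

Each question-answer pair then carries the `source` of its dialogue, so predictions can be grouped by source directly.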


Group the train and test dataframes by source type

In [None]:
# TRAINING SET
# Print the number of source classes of the passages in the CoQA train and test datasets
print("Source types in train dataset are",train_df_src["source"].unique().size, ":" ,  train_df_src["source"].unique())
print("")
print("Source types in test dataset are",test_df_src["source"].unique().size, ":" ,  test_df_src["source"].unique())
# Group the train dataframe by "source"
group_by_source = train_df_src.groupby("source")
# Count the number of elements in each "source" class
count_by_source_size = group_by_source.size()
print("--------------------------------------")
print("")
print("TRAIN SET")
print("")
print("Number of element in each class SOURCE",count_by_source_size)
#create domain-group dataframes  
grwiki=group_by_source.get_group('wikipedia')
grcnn=group_by_source.get_group('cnn')
grgut=group_by_source.get_group('gutenberg')
grrc=group_by_source.get_group('race')
grmct=group_by_source.get_group('mctest')

print("Element per group: ")

print("GR wiki", grwiki.count())
print("")
print("GR cnn", grcnn.count())
print("")
print("GR gutenberg", grgut.count())
print("")
print("GR race", grrc.count())
print("")
print("GR mctest", grmct.count())

print("--------------------------------------")
print("")
print("TEST SET")
#TEST SET
#Grouping the test dataframe by "source"
group_by_source_ts = test_df_src.groupby("source")
group_by_source_ts.indices
#Analyzing number of classes and number of element in each class "source"
count_by_source_ts_size = group_by_source_ts.size()
print("")
print("Number of element in each class SOURCE", count_by_source_ts_size)
#create domain-group dataframes  
grwiki_ts=group_by_source_ts.get_group('wikipedia')
grcnn_ts=group_by_source_ts.get_group('cnn')
grgut_ts=group_by_source_ts.get_group('gutenberg')
grrc_ts=group_by_source_ts.get_group('race')
grmct_ts=group_by_source_ts.get_group('mctest')
print("")
print("Element per group: ")
print("GR wiki", grwiki_ts.count())
print("")
print("GR cnn", grcnn_ts.count())
print("")
print("GR gutenberg", grgut_ts.count())
print("")
print("GR race", grrc_ts.count())
print("")
print("GR mctest",grmct_ts.count())

Source types in train dataset are 5 : ['wikipedia' 'cnn' 'gutenberg' 'race' 'mctest']

Source types in test dataset are 5 : ['mctest' 'race' 'cnn' 'wikipedia' 'gutenberg']
--------------------------------------

TRAIN SET

Number of element in each class SOURCE source
cnn          1702
gutenberg    1615
mctest        550
race         1711
wikipedia    1621
dtype: int64
Element per group: 
GR wiki passage     1621
question    1621
answer      1621
source      1621
dtype: int64

GR cnn passage     1702
question    1702
answer      1702
source      1702
dtype: int64

GR gutenberg passage     1615
question    1615
answer      1615
source      1615
dtype: int64

GR race passage     1711
question    1711
answer      1711
source      1711
dtype: int64

GR mctest passage     550
question    550
answer      550
source      550
dtype: int64
--------------------------------------

TEST SET

Number of element in each class SOURCE source
cnn          100
gutenberg    100
mctest       100
race      

In [None]:
#Preprocessing for a dataframe that also carries the source column
def preprocess_with_source(df):
  temp = exploder(df, ['passage','source'])
  df = text_preprocessing(temp)
  return df

### [M1] DistilRoBERTa (distilroberta-base)

In [None]:
# compute metric squad f1
def compute_metrics_f1_m1(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = distilroberta_tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = distilroberta_tokenizer.pad_token_id
    label_str = distilroberta_tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
   
    squad_f1_output = [squad.compute_f1(a_pred=pred_str[i], a_gold=label_str[i]) for i in range(len(pred_str))]
    
    return {
        "squad_f1_precision": sum(squad_f1_output) / len(squad_f1_output), # do the average
    }
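
The `squad.compute_f1` helper called above is not defined in this part of the notebook. Assuming it behaves like the token-overlap F1 of the official SQuAD v2.0 evaluation script, a minimal self-contained sketch could look like the following (the names `normalize_answer` and `token_f1` are ours, not the notebook's). Note that the value stored under the key `squad_f1_precision` is the mean of these F1 scores, not a precision.

```python
import collections
import re
import string

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(a_gold: str, a_pred: str) -> float:
    """Token-level F1 between a gold answer and a predicted answer."""
    gold_tokens = normalize_answer(a_gold).split()
    pred_tokens = normalize_answer(a_pred).split()
    if not gold_tokens or not pred_tokens:
        # If either side is empty, the score is 1 only when both are empty.
        return float(gold_tokens == pred_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears on both sides.
    common = collections.Counter(gold_tokens) & collections.Counter(pred_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Prediction covers 2 of the 3 gold tokens: precision 1.0, recall 2/3.
print(round(token_f1("new york city", "new york"), 3))
```

Under this definition a prediction that shares no token with the gold answer scores exactly 0.0, which is why very short gold answers such as "no" or "five" are scored all-or-nothing.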

#### Training

In [None]:
#to access drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#train with chosen seed=1337
seed = 1337

n = 5000 # subset length to train faster, "None" for whole set

print(f'Setting seed: {seed}')
set_reproducibility(seed)

#with shuffle
train_df, val_df = split(train_df)

# text preprocess
train_df = preprocess(train_df)
val_df = preprocess(val_df)
test_df = preprocess(test_df)

# build df with history
h_train_df = add_history(train_df.copy())
h_val_df = add_history(val_df.copy())
h_test_df = add_history(test_df.copy())

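# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat([train_df, val_df, test_df])

encoder_max_length = 32 # int(pd.Series([len(df.iloc[i]["passage"]) for i in range(len(df["passage"]))]).quantile())
# median answer length (quantile() defaults to q=0.5), measured in characters rather than tokens
decoder_max_length = int(pd.Series([len(df.iloc[i]["answer"]) for i in range(len(df["answer"]))]).quantile())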

print("Train df dialogues: ",train_df.shape,h_train_df.shape)
print("Validation df dialogues: ",val_df.shape,h_val_df.shape)
print("Test df dialogues: ",test_df.shape, h_test_df.shape)

Setting seed: 1337
Train df dialogues:  (68577, 3) (68577, 4)
Validation df dialogues:  (17145, 3) (17145, 4)
Test df dialogues:  (7917, 3) (7917, 4)


In [None]:
# NO HISTORY

# model and tokenizer
distilroberta_model, distilroberta_tokenizer = get_m1()

# process dataset to model input
train_ds = preparation(train_df, process_data_to_model_inputs, distilroberta_tokenizer, encoder_max_length, decoder_max_length, n) 
val_ds = preparation(val_df, process_data_to_model_inputs, distilroberta_tokenizer, encoder_max_length, decoder_max_length, n)
test_ds = preparation(test_df, process_data_to_model_inputs, distilroberta_tokenizer, encoder_max_length, decoder_max_length, n)

# data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer = distilroberta_tokenizer,
    model = distilroberta_model,
    label_pad_token_id = -100,
    return_tensors = 'pt' )

# trainer
trainer = Seq2SeqTrainer( 
    model=distilroberta_model,
    tokenizer=distilroberta_tokenizer,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics_f1_m1,
    train_dataset=train_ds,
    eval_dataset=val_ds
    )

# finetune for 3 epochs without history
result = trainer.train()
print(result)

#save model on drive
trainer.save_model ("/content/drive/MyDrive/Colab Notebooks/model_1_nohist")

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForCausalLM were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['roberta.encoder.layer.0.crossattention.output.dense.weight', 'roberta.encoder.layer.0.crossattention.self.key.bias', 'roberta.encoder.layer.1.crossattention.s

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

***** Running training *****
  Num examples = 5000
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 60
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"






Step,Training Loss,Validation Loss,Squad F1 Precision
4,7.034,6.591421,0.013425
8,6.3409,6.250375,0.019491
12,6.134,6.118921,0.029416
16,5.9547,6.051966,0.035476
20,5.9532,6.016149,0.025562
24,5.8133,5.957539,0.030179
28,5.7462,5.653718,0.012312
32,5.5733,5.462845,0.013473
36,5.4585,5.409096,0.006045
40,5.333,5.331836,0.022117


***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-10
Configuration saved in ./checkpoint-10/config.json
Model weights saved in ./checkpoint-10/pytorch_model.bin
tokenizer config file saved in ./checkpoint-10/tokenizer_config.json
Special tokens file saved in ./checkpoint-10/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-20
Configuration saved in ./checkpoint-20/config.json
Model weights saved in ./checkpoint-20/pytorch_model.bin
tokenizer config file saved in ./checkpoint-20/tokenizer_config.json
Special tokens file saved in ./checkpoint-20/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 5000
  Batch s

TrainOutput(global_step=60, training_loss=5.8137824217478435, metrics={'train_runtime': 5480.5965, 'train_samples_per_second': 2.737, 'train_steps_per_second': 0.011, 'total_flos': 166882109760000.0, 'train_loss': 5.8137824217478435, 'epoch': 3.0})


Configuration saved in /content/drive/MyDrive/Colab Notebooks/model_1_nohist/config.json
Model weights saved in /content/drive/MyDrive/Colab Notebooks/model_1_nohist/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/Colab Notebooks/model_1_nohist/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/Colab Notebooks/model_1_nohist/special_tokens_map.json
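
The step count and throughput reported in the training log can be checked by hand. The sketch below only re-derives numbers that already appear in the log (5000 examples, batch size 256, 3 epochs, gradient accumulation 1, runtime 5480.5965 s); the variable names are ours.

```python
import math

num_examples = 5000        # "Num examples" in the log (n = 5000)
batch_size = 256           # "Total train batch size" in the log
num_epochs = 3             # "Num Epochs" in the log
train_runtime = 5480.5965  # "train_runtime" in TrainOutput, in seconds

# With gradient accumulation 1, the last partial batch still counts as a step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Throughput: every example is seen once per epoch.
samples_per_second = num_examples * num_epochs / train_runtime

print(steps_per_epoch, total_steps, round(samples_per_second, 3))
# prints: 20 60 2.737
```

With only 60 optimization steps on a 5000-example subset, the model sees a small fraction of the 68577 available training rows.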


In [None]:
# WITH HISTORY

# model
distilroberta_model_h, distilroberta_tokenizer = get_m1()

# process dataset to model input
h_train_ds = preparation(h_train_df, process_data_to_model_inputs, distilroberta_tokenizer, encoder_max_length, decoder_max_length, n) 
h_val_ds = preparation(h_val_df, process_data_to_model_inputs, distilroberta_tokenizer, encoder_max_length, decoder_max_length, n)
h_test_ds = preparation(h_test_df, process_data_to_model_inputs, distilroberta_tokenizer, encoder_max_length, decoder_max_length, n)

# data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer = distilroberta_tokenizer,
    model = distilroberta_model_h,
    label_pad_token_id = -100,
    return_tensors = 'pt' )

# trainer
trainer_h = Seq2SeqTrainer( 
    model=distilroberta_model_h,
    tokenizer=distilroberta_tokenizer,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics_f1_m1,
    train_dataset=h_train_ds,
    eval_dataset=h_val_ds
    )

# finetune for 3 epochs with history 
result_h = trainer_h.train()
print(result_h)

#save model on drive
trainer_h.save_model ("/content/drive/MyDrive/Colab Notebooks/model_1_hist")

Downloading:   0%|          | 0.00/480 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.29M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/316M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForCausalLM were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['roberta.encoder.layer.2.crossattention.self.key.bias', 'roberta.encoder.layer.4.crossattention.self.query.bias', 'roberta.encoder.layer.5.crossattention.outpu

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

The following columns in the training set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5000
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 60
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"






Step,Training Loss,Validation Loss,Squad F1 Precision
4,6.8453,6.420094,0.016261
8,6.3793,6.124621,0.040456
12,6.0117,6.018114,0.037918
16,5.9708,5.977688,0.047915
20,5.8324,5.843851,0.017431
24,5.5301,5.424154,0.016871
28,5.37,5.326441,0.023939
32,5.3746,5.25121,0.03422
36,5.2724,5.138528,0.035962
40,5.2172,5.12968,0.018369


The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-10
Configuration saved in ./checkpoint-10/config.json
Model weights saved in ./checkpoint-10/pytorch_model.bin
tokenizer config file saved in ./checkpoint-10/tokenizer_config.json
Special tokens file saved in ./checkpoint-10/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument

TrainOutput(global_step=60, training_loss=5.668401257197062, metrics={'train_runtime': 5384.0584, 'train_samples_per_second': 2.786, 'train_steps_per_second': 0.011, 'total_flos': 166882109760000.0, 'train_loss': 5.668401257197062, 'epoch': 3.0})


Model weights saved in /content/drive/MyDrive/Colab Notebooks/model_1_hist/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/Colab Notebooks/model_1_hist/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/Colab Notebooks/model_1_hist/special_tokens_map.json
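
As the training log itself notes, Weights & Biases logging is enabled automatically and can be switched off through an environment variable. A minimal sketch, assuming experiment tracking is not needed for these runs:

```python
import os

# Taken from the hint printed in the training log; this has to run before
# the training arguments and the trainer are created.
os.environ["WANDB_DISABLED"] = "true"

print(os.environ["WANDB_DISABLED"])
```

Passing `report_to="none"` to the training arguments is, to our knowledge, an equivalent way to turn off all logging integrations.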


### Reports

In [None]:
# compute metric squad f1

_,distilroberta_tokenizer = get_m1()

def compute_metrics_report(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = distilroberta_tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = distilroberta_tokenizer.pad_token_id
    label_str = distilroberta_tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    
    squad_f1_output = [squad.compute_f1(a_pred=pred_str[i], a_gold=label_str[i]) for i in range(len(pred_str))]
    
    # pair each score with its gold answer and sort ascending by score
    merged_list = sorted(zip(squad_f1_output, label_str), key=lambda pair: pair[0])

    worst5_f1squad = [score for score, _ in merged_list[:5]]
    worst5_labels = [label for _, label in merged_list[:5]]

    print("Worst 5 f1_squad \n", 
          pd.DataFrame({
              "ANSWERS": worst5_labels, 
              "F1-SQUAD": worst5_f1squad
              })) 
    
    return {
        "squad_f1_precision": sum(squad_f1_output) / len(squad_f1_output), # do the average
    }

In [None]:
#prints the SQuAD F1 score on the test and validation sets of one source group,
# for both the no-history (no-H) and the with-history (W-H) model
def report(group, group_ts):   
  
  train_df_src, val_df_src = split(group)

  print("")
  print("-----------------------------------------------------------")
  print("MODEL NO-HISTORY")
  print("")

  train_df_src = preprocess_with_source(train_df_src)
  val_df_src = preprocess_with_source(val_df_src)
  test_df_src = preprocess_with_source(group_ts)

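  # NOTE: the labels below are hard-coded to "wikipedia", but the values refer to whichever source group was passed in;
  # .size counts cells (rows x columns), not rows
  print("DIM test_df source=wikipedia :",test_df_src.size)
  print("DIM val_df source=wikipedia :", val_df_src.size)

  # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
  df_src = pd.concat([train_df_src, val_df_src, test_df_src])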

  encoder_max_length = 32 # int(pd.Series([len(df.iloc[i]["passage"]) for i in range(len(df["passage"]))]).quantile())
  decoder_max_length = int(pd.Series([len(df_src.iloc[i]["answer"]) for i in range(len(df_src["answer"]))]).quantile())

  ####################################
  # NO HISTORY

  n = None # subset length to train faster, "None" for whole set 

  _,distilroberta_tokenizer = get_m1()

  # process dataset to model input
  val_ds_src = preparation(val_df_src, process_data_to_model_inputs, distilroberta_tokenizer, encoder_max_length, decoder_max_length, n)
  test_ds_src = preparation(test_df_src, process_data_to_model_inputs, distilroberta_tokenizer, encoder_max_length, decoder_max_length, n)

  # model loaded from drive
  model = EncoderDecoderModel.from_pretrained("/content/drive/MyDrive/Colab Notebooks/model_1_nohist", local_files_only=True)

  # data collator
  data_collator = DataCollatorForSeq2Seq(
    tokenizer = distilroberta_tokenizer,
    model = model,
    label_pad_token_id = -100,
    return_tensors = 'pt' )

  # trainer
  trainer = Seq2SeqTrainer( 
      model=model,
      tokenizer=distilroberta_tokenizer,
      args=training_args,
      data_collator=data_collator,
      compute_metrics=compute_metrics_report,
      )

  trainer.model = model.cuda()

  # evaluate m1 - TEST SET
  print("")
  print("Evaluation on test set")
  eval_ts_src = trainer.evaluate(test_ds_src)
  print(eval_ts_src)

  # evaluate m1 - VAL SET
  print("")
  print("Evaluation on validation set")
  eval_vs_src = trainer.evaluate(val_ds_src)
  print(eval_vs_src)

  ####################################
  #WITH HISTORY
  print("")
  print("-----------------------------------------------------------")
  print("MODEL WITH HISTORY")
  print("")

  # build df with history
  h_val_df_src = add_history(val_df_src.copy())
  h_test_df_src = add_history(test_df_src.copy())
  
  # WITH HISTORY

  # process dataset to model input
  h_val_ds_src = preparation(h_val_df_src ,process_data_to_model_inputs, distilroberta_tokenizer, encoder_max_length, decoder_max_length, n)
  h_test_ds_src = preparation(h_test_df_src, process_data_to_model_inputs, distilroberta_tokenizer, encoder_max_length, decoder_max_length, n)

  # model loaded from drive
  model_h = EncoderDecoderModel.from_pretrained("/content/drive/MyDrive/Colab Notebooks/model_1_hist", local_files_only=True)

  # data collator
  data_collator_h = DataCollatorForSeq2Seq(
    tokenizer = distilroberta_tokenizer,
    model = model_h,
    label_pad_token_id = -100,
    return_tensors = 'pt' )

  # trainer
  trainer_h = Seq2SeqTrainer( 
      model=model_h,
      tokenizer=distilroberta_tokenizer,
      args=training_args,
      data_collator=data_collator_h,
      compute_metrics=compute_metrics_report,
      )

  trainer_h.model = model_h.cuda()

  print("")
  print("Evaluation on test set")
  # evaluate m1 - TEST SET
  eval_h_ts_src = trainer_h.evaluate(h_test_ds_src)
  print(eval_h_ts_src)

  print("")
  print("Evaluation on validation set")
  # evaluate m1 - VAL SET
  eval_h_vs_src = trainer_h.evaluate(h_val_ds_src)
  print(eval_h_vs_src)



#### Source WIKIPEDIA

In [None]:
#SOURCE WIKI

report(grwiki, grwiki_ts)


-----------------------------------------------------------
MODEL NO-HISTORY

DIM test_df source=wikipedia : 6504
DIM val_df source=wikipedia : 20176


Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/distilroberta-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/42d6b7c87cbac84fcdf35aa69504a5ccfca878fcee2a1a9b9ff7a3d1297f9094.aa95727ac70adfa1aaf5c88bea30a4f5e50869c68e68bce96ef1ec41b5facf46
Model config RobertaConfig {
  "_name_or_path": "distilroberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 1,


  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

loading configuration file /content/drive/MyDrive/Colab Notebooks/model_1_nohist/config.json
Model config EncoderDecoderConfig {
  "architectures": [
    "EncoderDecoderModel"
  ],
  "decoder": {
    "_name_or_path": "distilroberta-base",
    "add_cross_attention": true,
    "architectures": [
      "RobertaForMaskedLM"
    ],
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "bos_token_id": 0,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": null,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
 


Evaluation on test set




Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source. If source are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5044
  Batch size = 256


Worst 5 f1_squad 
          answers  F1-SQUAD
0           five       0.0
1  new york city       0.0
2       new york       0.0
3        476,015       0.0
4             no       0.0
{'eval_loss': 5.49381685256958, 'eval_squad_f1_precision': 0.03690892046979118, 'eval_runtime': 114.2578, 'eval_samples_per_second': 14.231, 'eval_steps_per_second': 0.061}

Evaluation on validation set




Worst 5 f1_squad 
                                answers  F1-SQUAD
0                      a radio network       0.0
1   regular television news broadcasts       0.0
2                                daily       0.0
3  nbc conducted the split voluntarily       0.0
4    federal communications commission       0.0
{'eval_loss': 5.4936347007751465, 'eval_squad_f1_precision': 0.04099247459334903, 'eval_runtime': 355.096, 'eval_samples_per_second': 14.205, 'eval_steps_per_second': 0.056}

-----------------------------------------------------------
MODEL WITH HISTORY



  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

loading configuration file /content/drive/MyDrive/Colab Notebooks/model_1_hist/config.json
Model config EncoderDecoderConfig {
  "architectures": [
    "EncoderDecoderModel"
  ],
  "decoder": {
    "_name_or_path": "distilroberta-base",
    "add_cross_attention": true,
    "architectures": [
      "RobertaForMaskedLM"
    ],
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "bos_token_id": 0,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": null,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
   


Evaluation on test set




Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history, source. If history, source are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5044
  Batch size = 256


Worst 5 f1_squad 
          answers  F1-SQUAD
0           five       0.0
1  new york city       0.0
2       new york       0.0
3        476,015       0.0
4             no       0.0
{'eval_loss': 5.439587593078613, 'eval_squad_f1_precision': 0.04309605766256407, 'eval_runtime': 114.6925, 'eval_samples_per_second': 14.177, 'eval_steps_per_second': 0.061}

Evaluation on validation set




Worst 5 f1_squad 
                                answers  F1-SQUAD
0                      a radio network       0.0
1   regular television news broadcasts       0.0
2                                daily       0.0
3  nbc conducted the split voluntarily       0.0
4    federal communications commission       0.0
{'eval_loss': 5.440781116485596, 'eval_squad_f1_precision': 0.044984792120140366, 'eval_runtime': 356.7358, 'eval_samples_per_second': 14.139, 'eval_steps_per_second': 0.056}


For source type WIKIPEDIA, we evaluate on the test and validation sets with model M1 (DistilRoBERTa) trained with seed=1337. We sort the answers by the SQuAD F1 metric, print the 5 worst-scoring answers, and analyze the answer types on which the model performs worse.

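TEST SET

No-H model: eval_squad_f1_precision = 0.03690892046979118

W-H model: eval_squad_f1_precision = 0.04309605766256407

VAL SET

No-H model: eval_squad_f1_precision = 0.04099247459334903

W-H model: eval_squad_f1_precision = 0.044984792120140366

The model with history (W-H) performs slightly better than the model without history (No-H) on both sets (about +0.006 on the test set and +0.004 on the validation set), although both scores stay below 0.05.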

Answer type analysis:

Looking at Table 6 in https://arxiv.org/pdf/1808.07042.pdf

TEST SET

Both models, with and without history, perform worse on the multiple-choice and counting answer types.

VAL SET

Some errors appear in the fluency answer type.
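
Because `list.sort` is stable and every answer in the "worst 5" tables scores 0.0, those tables show the first five zero-score examples in dataset order rather than a ranking. One way to put the answer-type observations on firmer ground is to average F1 per answer type over all examples. The sketch below uses coarse, hypothetical buckets (`yes/no`, `counting/number`, `other`) chosen for illustration; they are not the taxonomy of the CoQA paper, and the function names are ours.

```python
from collections import defaultdict

# Hypothetical, deliberately small list of spelled-out numbers.
NUMBER_WORDS = {
    "zero", "one", "two", "three", "four", "five",
    "six", "seven", "eight", "nine", "ten",
}


def answer_type(gold: str) -> str:
    """Assign a gold answer to a coarse, illustrative bucket."""
    text = gold.strip().lower().rstrip(".")
    if text in {"yes", "no"}:
        return "yes/no"
    if text in NUMBER_WORDS or any(ch.isdigit() for ch in text):
        return "counting/number"
    return "other"


def mean_f1_by_type(golds, scores):
    """Average the per-example F1 scores inside each bucket."""
    buckets = defaultdict(list)
    for gold, score in zip(golds, scores):
        buckets[answer_type(gold)].append(score)
    return {kind: sum(vals) / len(vals) for kind, vals in buckets.items()}


# Gold answers and scores copied from the Wikipedia test-set table above.
golds = ["five", "new york city", "new york", "476,015", "no"]
scores = [0.0, 0.0, 0.0, 0.0, 0.0]
print(mean_f1_by_type(golds, scores))
```

Inside `compute_metrics_report`, the same function could be called with `label_str` and `squad_f1_output` to cover the whole evaluation set instead of five rows.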

#### Source CNN

In [None]:
#SOURCE CNN
report(grcnn, grcnn_ts)


-----------------------------------------------------------
MODEL NO-HISTORY

DIM test_df source=wikipedia : 6596
DIM val_df source=wikipedia : 20580


Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/distilroberta-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/42d6b7c87cbac84fcdf35aa69504a5ccfca878fcee2a1a9b9ff7a3d1297f9094.aa95727ac70adfa1aaf5c88bea30a4f5e50869c68e68bce96ef1ec41b5facf46
Model config RobertaConfig {
  "_name_or_path": "distilroberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 1,


  0%|          | 0/21 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

loading configuration file /content/drive/MyDrive/Colab Notebooks/model_1_nohist/config.json
Model config EncoderDecoderConfig {
  "architectures": [
    "EncoderDecoderModel"
  ],
  "decoder": {
    "_name_or_path": "distilroberta-base",
    "add_cross_attention": true,
    "architectures": [
      "RobertaForMaskedLM"
    ],
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "bos_token_id": 0,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": null,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
 


Evaluation on test set




Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source. If source are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5145
  Batch size = 256


Worst 5 f1_squad 
          answers  F1-SQUAD
0  dennis farina       0.0
1          actor       0.0
2             no       0.0
3             no       0.0
4   michael mann       0.0
{'eval_loss': 5.3134307861328125, 'eval_squad_f1_precision': 0.03658744584678688, 'eval_runtime': 116.6419, 'eval_samples_per_second': 14.137, 'eval_steps_per_second': 0.06}

Evaluation on validation set




Worst 5 f1_squad 
                       answers  F1-SQUAD
0                   hundreds.       0.0
1   the immigration counters.       0.0
2            boarding passes.       0.0
3          filling out forms.       0.0
4  making their lives better.       0.0
{'eval_loss': 5.326170921325684, 'eval_squad_f1_precision': 0.035145586604350916, 'eval_runtime': 365.0822, 'eval_samples_per_second': 14.093, 'eval_steps_per_second': 0.058}

-----------------------------------------------------------
MODEL WITH HISTORY



  0%|          | 0/21 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

loading configuration file /content/drive/MyDrive/Colab Notebooks/model_1_hist/config.json
Model config EncoderDecoderConfig {
  "architectures": [
    "EncoderDecoderModel"
  ],
  "decoder": {
    "_name_or_path": "distilroberta-base",
    "add_cross_attention": true,
    "architectures": [
      "RobertaForMaskedLM"
    ],
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "bos_token_id": 0,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": null,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
   


Evaluation on test set




Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history, source. If history, source are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5145
  Batch size = 256


Worst 5 f1_squad 
          answers  F1-SQUAD
0  dennis farina       0.0
1          actor       0.0
2             no       0.0
3             no       0.0
4   michael mann       0.0
{'eval_loss': 5.2329792976379395, 'eval_squad_f1_precision': 0.03791495818026575, 'eval_runtime': 116.318, 'eval_samples_per_second': 14.177, 'eval_steps_per_second': 0.06}

Evaluation on validation set




Worst 5 f1_squad 
                       answers  F1-SQUAD
0                   hundreds.       0.0
1   the immigration counters.       0.0
2            boarding passes.       0.0
3          filling out forms.       0.0
4  making their lives better.       0.0
{'eval_loss': 5.245533466339111, 'eval_squad_f1_precision': 0.036630308395240056, 'eval_runtime': 364.4962, 'eval_samples_per_second': 14.115, 'eval_steps_per_second': 0.058}


On source type CNN, we evaluate on the test set and the validation set with model 1 (DistilRoBERTa) trained with seed=42. We sort the answers by the F1-SQuAD metric (eval_squad_f1_precision), print the 5 worst-scoring answers, and analyze the answer types on which the model performs worst.

TEST SET

No-H model: eval_squad_f1_precision = 0.03658744584678688

W-H model: eval_squad_f1_precision = 0.03791495818026575

VAL SET

No-H model: eval_squad_f1_precision = 0.035145586604350916

W-H model: eval_squad_f1_precision = 0.036630308395240056

The model with history (W-H) performs slightly better on both sets.

Answer type analysis:

The answer types below refer to Table 6 in https://arxiv.org/pdf/1808.07042.pdf

TEST SET

Both models, with and without history, perform worst on the multiple-choice and counting answer types.

VAL SET

Some errors also appear in the fluency answer type.
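
The ranking described above can be sketched as follows. This is a minimal illustration, assuming that the metric follows the standard SQuAD token-overlap F1 (lowercasing, punctuation and article removal, then precision and recall over shared tokens). The notebook's own `compute_metrics` function is not shown in this section, so `normalize_answer`, `squad_f1` and `worst_k` are hypothetical names and the predictions passed to them are made up.

```python
import collections
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def squad_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match, one empty answer as a miss.
        return float(pred_tokens == ref_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def worst_k(predictions, references, k=5):
    """Return the k (reference, score) pairs with the lowest F1."""
    scored = [(ref, squad_f1(pred, ref)) for pred, ref in zip(predictions, references)]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical predictions paired with reference answers from the logs above.
print(worst_k(["yes", "dennis farina"], ["no", "dennis farina"], k=1))
```

Under this definition a one-word reference such as "no" scores exactly 0.0 whenever the prediction does not contain that token, which is consistent with short yes/no answers appearing among the worst cases.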

#### Source Gutenberg

In [None]:
#SOURCE GR gutenberg
report(grgut, grgut_ts)


-----------------------------------------------------------
MODEL NO-HISTORY

DIM test_df source=wikipedia : 6520
DIM val_df source=wikipedia : 20108


Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/distilroberta-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/42d6b7c87cbac84fcdf35aa69504a5ccfca878fcee2a1a9b9ff7a3d1297f9094.aa95727ac70adfa1aaf5c88bea30a4f5e50869c68e68bce96ef1ec41b5facf46
Model config RobertaConfig {
  "_name_or_path": "distilroberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 1,


  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

loading configuration file /content/drive/MyDrive/Colab Notebooks/model_1_nohist/config.json


Evaluation on test set




Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source. If source are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5027
  Batch size = 256


Worst 5 f1_squad 
        answers  F1-SQUAD
0  the _ariel_       0.0
1       lagoon       0.0
2           no       0.0
3      winters       0.0
4           no       0.0
{'eval_loss': 4.813294410705566, 'eval_squad_f1_precision': 0.041529489965927625, 'eval_runtime': 114.9853, 'eval_samples_per_second': 14.176, 'eval_steps_per_second': 0.061}

Evaluation on validation set




Worst 5 f1_squad 
          answers  F1-SQUAD
0  leif ericsson       0.0
1         biarne       0.0
2            yes       0.0
3      karlsefin       0.0
4           olaf       0.0
{'eval_loss': 4.74088191986084, 'eval_squad_f1_precision': 0.04224214726514703, 'eval_runtime': 356.2747, 'eval_samples_per_second': 14.11, 'eval_steps_per_second': 0.056}

-----------------------------------------------------------
MODEL WITH HISTORY



  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

loading configuration file /content/drive/MyDrive/Colab Notebooks/model_1_hist/config.json


Evaluation on test set




Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history, source. If history, source are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5027
  Batch size = 256


Worst 5 f1_squad 
        answers  F1-SQUAD
0  the _ariel_       0.0
1       lagoon       0.0
2           no       0.0
3      winters       0.0
4           no       0.0
{'eval_loss': 4.740674018859863, 'eval_squad_f1_precision': 0.04173359566035589, 'eval_runtime': 114.6661, 'eval_samples_per_second': 14.215, 'eval_steps_per_second': 0.061}

Evaluation on validation set




Worst 5 f1_squad 
          answers  F1-SQUAD
0  leif ericsson       0.0
1         biarne       0.0
2      karlsefin       0.0
3           olaf       0.0
4    he tripped.       0.0
{'eval_loss': 4.663809299468994, 'eval_squad_f1_precision': 0.04484802480317122, 'eval_runtime': 354.0904, 'eval_samples_per_second': 14.197, 'eval_steps_per_second': 0.056}


On source type Gutenberg, we evaluate on the test set and the validation set with model 1 (DistilRoBERTa) trained with seed=42. We sort the answers by the F1-SQuAD metric (eval_squad_f1_precision), print the 5 worst-scoring answers, and analyze the answer types on which the model performs worst.

TEST SET

No-H model: eval_squad_f1_precision = 0.041529489965927625

W-H model: eval_squad_f1_precision = 0.04173359566035589

VAL SET

No-H model: eval_squad_f1_precision = 0.04224214726514703

W-H model: eval_squad_f1_precision = 0.04484802480317122

The model with history (W-H) performs slightly better on both sets.

Answer type analysis:

The answer types below refer to Table 6 in https://arxiv.org/pdf/1808.07042.pdf

TEST SET

Both models, with and without history, perform worst on the multiple-choice and no answer types.

VAL SET

Some errors appear in the yes answer type, but otherwise the worst answers of the two models are very similar.
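
One way to make the answer-type analysis repeatable is to bucket each example automatically. The sketch below is only a rough heuristic under stated assumptions: `answer_type` and `NUMBER_WORDS` are hypothetical and are not part of the notebook, and the rules do not reproduce the annotation procedure of the CoQA paper. The example questions and rationales are made up; only some of the answers echo the logs above.

```python
import collections
import re

# Small, deliberately incomplete list of number words (assumption).
NUMBER_WORDS = {
    "zero", "one", "two", "three", "four", "five", "six",
    "seven", "eight", "nine", "ten", "eleven", "twelve",
}


def answer_type(question, answer, rationale):
    """Assign a coarse answer type using simple string heuristics."""
    q = question.lower()
    a = answer.lower().strip(" .!?")
    yes_no = re.match(r"(yes|no)\b", a)
    if yes_no:
        return yes_no.group(1)
    if a and a in rationale.lower():
        # The answer is copied verbatim from the supporting text.
        return "span"
    if "how many" in q or a.isdigit() or a in NUMBER_WORDS:
        return "counting"
    if re.search(r"\bor\b", q):
        return "multiple choice"
    # Free-form answer that rephrases the supporting text.
    return "fluency"


# Made-up (question, answer, rationale) triples for illustration only.
examples = [
    ("Was he an actor?", "No", "Dennis Farina was an actor."),
    ("How many candles were there?", "8 candles", "eight candles sat on the cake"),
    ("Was it red or white?", "white", "the wall looked pale"),
    ("Why did he fall?", "because he tripped", "He tripped over a root"),
]
counts = collections.Counter(answer_type(q, a, r) for q, a, r in examples)
print(counts)
```

Counting the types among the lowest-scoring answers of each model would give a table that can be compared across sources instead of an impression based on five rows.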

#### Source Race

In [None]:
#SOURCE GR race
report(grrc, grrc_ts)


-----------------------------------------------------------
MODEL NO-HISTORY

DIM test_df source=wikipedia : 6612
DIM val_df source=wikipedia : 20144


Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/distilroberta-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/42d6b7c87cbac84fcdf35aa69504a5ccfca878fcee2a1a9b9ff7a3d1297f9094.aa95727ac70adfa1aaf5c88bea30a4f5e50869c68e68bce96ef1ec41b5facf46


  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

loading configuration file /content/drive/MyDrive/Colab Notebooks/model_1_nohist/config.json


Evaluation on test set




Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source. If source are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5036
  Batch size = 256


Worst 5 f1_squad 
                answers  F1-SQUAD
0  a paper carrier bag       0.0
1               nicole       0.0
2             shanghai       0.0
3               mother       0.0
4                 food       0.0
{'eval_loss': 4.852605819702148, 'eval_squad_f1_precision': 0.041245931170807555, 'eval_runtime': 115.9844, 'eval_samples_per_second': 14.252, 'eval_steps_per_second': 0.06}

Evaluation on validation set




Worst 5 f1_squad 
       answers  F1-SQUAD
0  ted turner       0.0
1         cnn       0.0
2      forbes       0.0
3        navy       0.0
4     unknown       0.0
{'eval_loss': 4.994284629821777, 'eval_squad_f1_precision': 0.03983589365749629, 'eval_runtime': 354.816, 'eval_samples_per_second': 14.193, 'eval_steps_per_second': 0.056}

-----------------------------------------------------------
MODEL WITH HISTORY



  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

loading configuration file /content/drive/MyDrive/Colab Notebooks/model_1_hist/config.json


Evaluation on test set




Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history, source. If history, source are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5036
  Batch size = 256


Worst 5 f1_squad 
                answers  F1-SQUAD
0  a paper carrier bag       0.0
1               nicole       0.0
2             shanghai       0.0
3               mother       0.0
4                 food       0.0
{'eval_loss': 4.795536518096924, 'eval_squad_f1_precision': 0.04191623665420255, 'eval_runtime': 116.4074, 'eval_samples_per_second': 14.2, 'eval_steps_per_second': 0.06}

Evaluation on validation set




Worst 5 f1_squad 
       answers  F1-SQUAD
0  ted turner       0.0
1         cnn       0.0
2      forbes       0.0
3        navy       0.0
4     unknown       0.0
{'eval_loss': 4.941098213195801, 'eval_squad_f1_precision': 0.0419895638563551, 'eval_runtime': 353.6087, 'eval_samples_per_second': 14.242, 'eval_steps_per_second': 0.057}


On source type Race, we evaluate on the test set and the validation set with model 1 (DistilRoBERTa) trained with seed=42. We sort the answers by the F1-SQuAD metric (eval_squad_f1_precision), print the 5 worst-scoring answers, and analyze the answer types on which the model performs worst.

TEST SET

No-H model: eval_squad_f1_precision = 0.041245931170807555

W-H model: eval_squad_f1_precision = 0.04191623665420255

VAL SET

No-H model: eval_squad_f1_precision = 0.03983589365749629

W-H model: eval_squad_f1_precision = 0.0419895638563551

The model with history (W-H) performs slightly better on both sets.

Answer type analysis:

The answer types below refer to Table 6 in https://arxiv.org/pdf/1808.07042.pdf

TEST SET

For both models, with and without history, the majority of errors are on the multiple-choice and yes answer types.

VAL SET

The models with and without history yield the same worst answers. Some errors appear in the multiple-choice and yes answer types.
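
The claim that both models yield the same worst answers can be checked directly from the printed lists. The sketch below copies the "Worst 5" reference answers from the logs above; `compare_worst` is a hypothetical helper, not a function from the notebook. The Gutenberg validation lists are included as a contrasting case in which the two lists differ.

```python
def compare_worst(no_h, w_h):
    """Return answers shared by both lists and those unique to each list."""
    shared = [a for a in no_h if a in w_h]
    only_no_h = [a for a in no_h if a not in w_h]
    only_w_h = [a for a in w_h if a not in no_h]
    return shared, only_no_h, only_w_h


# Race, validation set: worst-5 answers without and with history.
race_val_no_h = ["ted turner", "cnn", "forbes", "navy", "unknown"]
race_val_w_h = ["ted turner", "cnn", "forbes", "navy", "unknown"]

# Gutenberg, validation set: worst-5 answers without and with history.
gut_val_no_h = ["leif ericsson", "biarne", "yes", "karlsefin", "olaf"]
gut_val_w_h = ["leif ericsson", "biarne", "karlsefin", "olaf", "he tripped."]

print(compare_worst(race_val_no_h, race_val_w_h))
print(compare_worst(gut_val_no_h, gut_val_w_h))
```

Note that many answers share a score of 0.0, so which five rows are printed depends on how ties are ordered after sorting; identical lists therefore do not by themselves show that the two models make identical predictions.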

#### Source MCTest

In [None]:
#SOURCE GR mctest
report(grmct, grmct_ts)


-----------------------------------------------------------
MODEL NO-HISTORY

DIM test_df source=wikipedia : 5700
DIM val_df source=wikipedia : 6124


Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/distilroberta-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/42d6b7c87cbac84fcdf35aa69504a5ccfca878fcee2a1a9b9ff7a3d1297f9094.aa95727ac70adfa1aaf5c88bea30a4f5e50869c68e68bce96ef1ec41b5facf46


  0%|          | 0/6 [00:00<?, ?ba/s]

  0%|          | 0/6 [00:00<?, ?ba/s]

loading configuration file /content/drive/MyDrive/Colab Notebooks/model_1_nohist/config.json


Evaluation on test set




Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source. If source are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1531
  Batch size = 256


Worst 5 f1_squad 
                answers  F1-SQUAD
0                white       0.0
1                   no       0.0
2                   no       0.0
3  she painted herself       0.0
4           the farmer       0.0
{'eval_loss': 4.369931697845459, 'eval_squad_f1_precision': 0.03979448197111554, 'eval_runtime': 101.2168, 'eval_samples_per_second': 14.079, 'eval_steps_per_second': 0.059}

Evaluation on validation set




Worst 5 f1_squad 
           answers  F1-SQUAD
0         bicycle       0.0
1             now       0.0
2         grandma       0.0
3  ran right into       0.0
4       8 candles       0.0
{'eval_loss': 4.450701713562012, 'eval_squad_f1_precision': 0.04416920460962787, 'eval_runtime': 108.8086, 'eval_samples_per_second': 14.071, 'eval_steps_per_second': 0.055}

-----------------------------------------------------------
MODEL WITH HISTORY



  0%|          | 0/6 [00:00<?, ?ba/s]

  0%|          | 0/6 [00:00<?, ?ba/s]

loading configuration file /content/drive/MyDrive/Colab Notebooks/model_1_hist/config.json


Evaluation on test set




Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history, source. If history, source are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1531
  Batch size = 256


Worst 5 f1_squad 
                answers  F1-SQUAD
0                white       0.0
1                   no       0.0
2                   no       0.0
3  she painted herself       0.0
4           the farmer       0.0
{'eval_loss': 4.322301387786865, 'eval_squad_f1_precision': 0.041214225517631015, 'eval_runtime': 101.2672, 'eval_samples_per_second': 14.072, 'eval_steps_per_second': 0.059}

Evaluation on validation set




Worst 5 f1_squad 
           answers  F1-SQUAD
0         bicycle       0.0
1             now       0.0
2         grandma       0.0
3  ran right into       0.0
4       8 candles       0.0
{'eval_loss': 4.404721736907959, 'eval_squad_f1_precision': 0.04482879144480557, 'eval_runtime': 108.0758, 'eval_samples_per_second': 14.166, 'eval_steps_per_second': 0.056}


On source type MCTest, we evaluate on the test set and the validation set with model 1 (DistilRoBERTa) trained with seed=42. We sort the answers by the F1-SQuAD metric (eval_squad_f1_precision), print the 5 worst-scoring answers, and analyze the answer types on which the model performs worst.

TEST SET

No-H model: eval_squad_f1_precision = 0.03979448197111554

W-H model: eval_squad_f1_precision = 0.041214225517631015

VAL SET

No-H model: eval_squad_f1_precision = 0.04416920460962787

W-H model: eval_squad_f1_precision = 0.04482879144480557

The model with history (W-H) performs slightly better on both sets.

Answer type analysis:

The answer types below refer to Table 6 in https://arxiv.org/pdf/1808.07042.pdf

TEST SET

For both models, with and without history, the majority of errors are on the multiple-choice and no answer types. The model with history shows slightly more errors of the fluency type.

VAL SET

The models with and without history yield the same worst answers, the majority of which are of the multiple-choice type.

### Conclusion
We consider the evaluation of the models without history (No-H) and with history (W-H) on the test set and the validation set, for each source, for model 1 (DistilRoBERTa).

For each source, the models with history obtain higher F1-SQuAD values.

For the model without history, the best values on the test and validation sets are obtained on Gutenberg, RACE and MCTest.

For the model with history, the best value on the test and validation sets is obtained on Wikipedia, followed by Gutenberg.
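
The comparison behind this conclusion can be tabulated with a few lines of Python. The sketch below uses only the eval_squad_f1_precision values quoted in the per-source summaries of this section (CNN, Gutenberg, Race, MCTest); Wikipedia is left out because its numbers are not repeated here. `scores` and `history_gain` are hypothetical names introduced for illustration.

```python
# (No-H, W-H) values of eval_squad_f1_precision, copied from the summaries above.
scores = {
    ("CNN", "test"): (0.03658744584678688, 0.03791495818026575),
    ("CNN", "val"): (0.035145586604350916, 0.036630308395240056),
    ("Gutenberg", "test"): (0.041529489965927625, 0.04173359566035589),
    ("Gutenberg", "val"): (0.04224214726514703, 0.04484802480317122),
    ("Race", "test"): (0.041245931170807555, 0.04191623665420255),
    ("Race", "val"): (0.03983589365749629, 0.0419895638563551),
    ("MCTest", "test"): (0.03979448197111554, 0.041214225517631015),
    ("MCTest", "val"): (0.04416920460962787, 0.04482879144480557),
}


def history_gain(no_h, w_h):
    """Absolute and relative (percent) change when history is added."""
    delta = w_h - no_h
    return delta, 100.0 * delta / no_h


for (source, split), (no_h, w_h) in scores.items():
    delta, pct = history_gain(no_h, w_h)
    print(f"{source:<10} {split:<5} {no_h:.4f} -> {w_h:.4f}  ({delta:+.4f}, {pct:+.1f}%)")
```

All eight differences are positive, but each is below 0.003 in absolute terms and comes from a single seed, so the advantage of the history model should be read as small.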

### [M2] BERTTiny (bert-tiny)

#### Training

In [None]:
#to access drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Evaluate on the test and validation sets with the SQuAD F1-score.
# Train with the chosen seed (42).
seed = 42

n = 5000  # subset length to train faster, None for the whole set

print(f'Running with seed: {seed}')
set_reproducibility(seed)

# split with shuffle
train_df, val_df = split(train_df)

# text preprocessing
train_df = preprocess(train_df)
val_df = preprocess(val_df)
test_df = preprocess(test_df)

# build dataframes with history
h_train_df = add_history(train_df.copy())
h_val_df = add_history(val_df.copy())
h_test_df = add_history(test_df.copy())

# DataFrame.append is deprecated in recent pandas; pd.concat gives the same result here
df = pd.concat([train_df, val_df, test_df])
encoder_max_length = 32  # alternative: int(df["passage"].map(len).quantile())
# median answer length in characters (quantile() defaults to q=0.5)
decoder_max_length = int(df["answer"].map(len).quantile())
print("decoder_max_length", decoder_max_length)

print("Train df dialogues: ", train_df.shape, h_train_df.shape)
print("Validation df dialogues: ", val_df.shape, h_val_df.shape)
print("Test df dialogues: ", test_df.shape, h_test_df.shape)

####################################
# NO HISTORY
print("NO- HISTORY -------------------------")
# model and tokenizer
berttiny_model, berttiny_tokenizer = get_m2()

# process dataset to model input
train_ds = preparation(train_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)
val_ds = preparation(val_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)
test_ds = preparation(test_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)

# data collator (label padding of -100 is ignored by the loss)
data_collator = DataCollatorForSeq2Seq(
    tokenizer=berttiny_tokenizer,
    model=berttiny_model,
    label_pad_token_id=-100,
    return_tensors='pt',
)

# trainer
trainer = Seq2SeqTrainer(
    model=berttiny_model,
    tokenizer=berttiny_tokenizer,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics_f1_m2,
    train_dataset=train_ds,
    eval_dataset=val_ds,
)

# finetune for 3 epochs without history
print("TRAIN ")
result = trainer.train()
print(result)

# evaluate m2 - TEST SET
print("EVAL TEST SET")
eval_ts = trainer.evaluate(test_ds)
print(eval_ts)

# evaluate m2 - VAL SET
print("EVAL VAL SET")
eval_vs = trainer.evaluate(val_ds)
print(eval_vs)

####################################

Running with seed: 42
decoder_max_length 10
Train df dialogues:  (68644, 3) (68644, 4)
Validation df dialogues:  (17162, 3) (17162, 4)
Test df dialogues:  (7917, 3) (7917, 4)
NO- HISTORY -------------------------


Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/prajjwal1/bert-tiny/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3cf34679007e9fe5d0acd644dcc1f4b26bec5cbc9612364f6da7262aed4ef7a4.a5a11219cf90aae61ff30e1658ccf2cb4aa84d6b6e947336556f887c9828dc6d
Model config BertConfig {
  "_name_or_path": "prajjwal1/bert-tiny",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading file https://huggingface.co/prajjwal1/bert-ti

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

***** Running training *****
  Num examples = 5000
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 60
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


TRAIN 




Step,Training Loss,Validation Loss,Squad F1 Precision
4,12.4185,11.237722,0.002024
8,10.9255,9.601772,0.001292
12,9.6541,8.303882,0.001614
16,8.5311,7.320415,0.00142
20,7.7507,6.553157,0.001296
24,6.8632,6.082583,0.00152
28,6.5706,5.962583,0.001533
32,6.3782,5.826556,0.001731
36,6.2174,5.598569,0.001503
40,6.0,5.41856,0.001268


***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-10
Configuration saved in ./checkpoint-10/config.json
Model weights saved in ./checkpoint-10/pytorch_model.bin
tokenizer config file saved in ./checkpoint-10/tokenizer_config.json
Special tokens file saved in ./checkpoint-10/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-20
Configuration saved in ./checkpoint-20/config.json
Model weights saved in ./checkpoint-20/pytorch_model.bin
tokenizer config file saved in ./checkpoint-20/tokenizer_config.json
Special tokens file saved in ./checkpoint-20/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 5000
  Batch s

TrainOutput(global_step=60, training_loss=7.468810065587362, metrics={'train_runtime': 11930.6444, 'train_samples_per_second': 1.257, 'train_steps_per_second': 0.005, 'total_flos': 1708444800000.0, 'train_loss': 7.468810065587362, 'epoch': 3.0})
EVAL TEST SET




***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256


{'eval_loss': 5.13712215423584, 'eval_squad_f1_precision': 0.0026732380978482582, 'eval_runtime': 737.8569, 'eval_samples_per_second': 6.776, 'eval_steps_per_second': 0.027, 'epoch': 3.0}
EVAL VAL SET


{'eval_loss': 5.169672012329102, 'eval_squad_f1_precision': 0.0026345761730913117, 'eval_runtime': 745.0264, 'eval_samples_per_second': 6.711, 'eval_steps_per_second': 0.027, 'epoch': 3.0}
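
The figures in the training log can be cross-checked with a little arithmetic. The sketch below assumes that the last partial batch of each epoch is kept and that one optimizer update happens per batch (gradient accumulation of 1, as the log states). `total_optimization_steps` is a hypothetical helper that approximates how the step count is derived; it is not a function of the transformers library.

```python
import math


def total_optimization_steps(num_examples, batch_size, epochs, grad_accum=1):
    """Number of optimizer updates over the whole training run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


# Values taken from the log above: 5000 examples, batch size 256, 3 epochs.
steps = total_optimization_steps(num_examples=5000, batch_size=256, epochs=3)
print("total optimization steps:", steps)

# Throughput: samples processed over all epochs divided by the reported runtime.
train_runtime = 11930.6444  # seconds, from TrainOutput
samples_per_second = 5000 * 3 / train_runtime
print("train samples per second:", round(samples_per_second, 3))
print("train runtime in hours:", round(train_runtime / 3600, 2))
```

With 20 batches per epoch the run performs 60 updates in total, which matches the "Total optimization steps = 60" line, and the throughput agrees with the reported train_samples_per_second of 1.257.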


In [None]:
trainer.save_model("/content/drive/MyDrive/Colab Notebooks/seed42/model_2_nohist")

Saving model checkpoint to /content/drive/MyDrive/Colab Notebooks/model_2_nohist
Configuration saved in /content/drive/MyDrive/Colab Notebooks/model_2_nohist/config.json
Model weights saved in /content/drive/MyDrive/Colab Notebooks/model_2_nohist/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/Colab Notebooks/model_2_nohist/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/Colab Notebooks/model_2_nohist/special_tokens_map.json


In [None]:
# WITH HISTORY
print("WITH- HISTORY -------------------------")
# fresh model; the tokenizer from the no-history run is reused
berttiny_model, _ = get_m2()

# process dataset to model input
h_train_ds = preparation(h_train_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)
h_val_ds = preparation(h_val_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)
h_test_ds = preparation(h_test_df, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)

# trainer (data_collator is the one built in the no-history cell)
trainer_h = Seq2SeqTrainer(
    model=berttiny_model,
    tokenizer=berttiny_tokenizer,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics_f1_m2,
    train_dataset=h_train_ds,
    eval_dataset=h_val_ds,
)

# finetune for 3 epochs with history
print("TRAIN H")
result_h = trainer_h.train()
print(result_h)

# evaluate m2 - TEST SET
print("EVAL H TEST SET")
eval_h_ts = trainer_h.evaluate(h_test_ds)
print(eval_h_ts)

# evaluate m2 - VAL SET
print("EVAL H VAL SET")
eval_h_vs = trainer_h.evaluate(h_val_ds)
print(eval_h_vs)
   
#print("-----------------------------------------------------------")

WITH- HISTORY -------------------------


Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file https://huggingface.co/prajjwal1/bert-tiny/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3cf34679007e9fe5d0acd644dcc1f4b26bec5cbc9612364f6da7262aed4ef7a4.a5a11219cf90aae61ff30e1658ccf2cb4aa84d6b6e947336556f887c9828dc6d

loading file https://huggingface.co/prajjwal1/bert-ti

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

The following columns in the training set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5000
  Num Epochs = 3
  Instantaneous batch size per device = 256
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 60
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


TRAIN H


Step,Training Loss,Validation Loss,Squad F1 Precision
4,12.4599,11.293757,0.001211
8,10.9747,9.663833,0.000927
12,9.7306,8.366546,0.00156
16,8.6242,7.400511,0.001418
20,7.8472,6.637125,0.001094
24,6.9462,6.110971,0.000797
28,6.6076,5.926317,0.000863
32,6.393,5.809954,0.002229
36,6.23,5.596219,0.001025
40,6.01,5.410874,0.000435


The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256
Saving model checkpoint to ./checkpoint-10
Configuration saved in ./checkpoint-10/config.json
Model weights saved in ./checkpoint-10/pytorch_model.bin
tokenizer config file saved in ./checkpoint-10/tokenizer_config.json
Special tokens file saved in ./checkpoint-10/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument

TrainOutput(global_step=60, training_loss=7.501890881856283, metrics={'train_runtime': 12061.735, 'train_samples_per_second': 1.244, 'train_steps_per_second': 0.005, 'total_flos': 1708444800000.0, 'train_loss': 7.501890881856283, 'epoch': 3.0})
EVAL H TEST SET




The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: history. If history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 256


{'eval_loss': 5.123477935791016, 'eval_squad_f1_precision': 0.006279346770625518, 'eval_runtime': 776.5766, 'eval_samples_per_second': 6.439, 'eval_steps_per_second': 0.026, 'epoch': 3.0}
EVAL H VAL SET
{'eval_loss': 5.153509616851807, 'eval_squad_f1_precision': 0.005910828171051174, 'eval_runtime': 789.802, 'eval_samples_per_second': 6.331, 'eval_steps_per_second': 0.025, 'epoch': 3.0}
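
The `Total optimization steps = 60` line in the training log above can be reproduced with a little arithmetic. The sketch below is a simplified version of that calculation, assuming a single device and no dataset-size limit on steps; the function name `total_optimization_steps` is hypothetical and is not part of the `transformers` API.

```python
import math


def total_optimization_steps(num_examples: int, batch_size: int, num_epochs: int,
                             grad_accum_steps: int = 1) -> int:
    """Number of optimizer updates for a single-device run (simplified sketch)."""
    # The last, partially filled batch of an epoch still counts as a batch.
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # One optimizer update every `grad_accum_steps` batches, at least one per epoch.
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs


# Values taken from the training log above.
steps = total_optimization_steps(num_examples=5000, batch_size=256, num_epochs=3)
print(steps)  # 60

# Average wall-clock time per optimization step, from 'train_runtime' in TrainOutput.
train_runtime_s = 12061.735
print(train_runtime_s / steps)
```

With 5000 examples and a batch size of 256, each epoch has 20 batches, so 3 epochs give the 60 steps reported as `global_step=60`.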


In [None]:
trainer_h.save_model("/content/drive/MyDrive/Colab Notebooks/seed42/model_2_hist")

Saving model checkpoint to /content/drive/MyDrive/Colab Notebooks/seed42/model_2_hist
Configuration saved in /content/drive/MyDrive/Colab Notebooks/seed42/model_2_hist/config.json
Model weights saved in /content/drive/MyDrive/Colab Notebooks/seed42/model_2_hist/pytorch_model.bin
tokenizer config file saved in /content/drive/MyDrive/Colab Notebooks/seed42/model_2_hist/tokenizer_config.json
Special tokens file saved in /content/drive/MyDrive/Colab Notebooks/seed42/model_2_hist/special_tokens_map.json


### Reports

In [None]:
# compute metric squad f1
def myFunc(e):
    # sort key: the F1 score of a (score, label) pair
    return e[0]


def compute_metrics_report(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = berttiny_tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    # -100 marks label positions ignored by the loss; restore the pad token before decoding
    labels_ids[labels_ids == -100] = berttiny_tokenizer.pad_token_id
    label_str = berttiny_tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    squad_f1_output = [squad.compute_f1(a_pred=pred_str[i], a_gold=label_str[i]) for i in range(len(pred_str))]

    # pair each score with its gold answer and sort ascending, so the worst answers come first
    merged_list = list(zip(squad_f1_output, label_str))
    merged_list.sort(key=myFunc)
    print('Sorted list:', merged_list)
    worst5_f1squad = [score for score, _ in merged_list[:5]]
    labels = [label for _, label in merged_list[:5]]

    print("  LABEL      |     F1-SQUAD  ")  # worst 5 f1_squad
    for i in range(len(worst5_f1squad)):
        print(labels[i], "   ", worst5_f1squad[i], end=' ')
        print()

    print("")
    return {
        "squad_f1_precision": sum(squad_f1_output) / len(squad_f1_output),  # average F1 over all examples
    }
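
To make the reported metric easier to interpret, here is a minimal sketch of a token-level F1 in the style of the official SQuAD evaluation script. It assumes that `squad.compute_f1` used above behaves like that script; the names `normalize_answer` and `token_f1` are illustrative and this is not the notebook's own code. Note that, despite the key name `squad_f1_precision`, the value returned by `compute_metrics_report` is the mean F1 over all examples, not a precision.

```python
import collections
import re
import string


def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation, drop English articles, collapse whitespace."""
    s = s.lower()
    punctuation = set(string.punctuation)
    s = "".join(ch for ch in s if ch not in punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def token_f1(a_gold: str, a_pred: str) -> float:
    """Harmonic mean of token precision and recall between gold and prediction."""
    gold_toks = normalize_answer(a_gold).split()
    pred_toks = normalize_answer(a_pred).split()
    if not gold_toks or not pred_toks:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(gold_toks == pred_toks)
    # Multiset intersection: each shared token counts at most min(gold, pred) times.
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


print(token_f1("new york city", "new york"))  # partial overlap
print(token_f1("yes", "no"))                  # no overlap
```

Under this definition a one-word gold answer such as "yes" or "no" scores either 0.0 or 1.0, which is one reason short answers dominate the zero-score entries printed by the report.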
   

In [None]:
# METHOD MODEL 2
# Prints the F1-SQuAD score on the test set and val set for the no-H and with-H models
def report_m2(group, group_ts):
    
    #with shuffle
    train_df_src, val_df_src = split(group)
    print("DIM train_df source=wikipedia :",train_df_src.size)
    print("DIM val_df source=wikipedia :", val_df_src.size)

    train_df_src = preprocess_with_source(train_df_src)
    val_df_src = preprocess_with_source(val_df_src)
    test_df_src = preprocess_with_source(group_ts)

    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    df_src = pd.concat([train_df_src, val_df_src, test_df_src])

    encoder_max_length = 32 # int(pd.Series([len(df.iloc[i]["passage"]) for i in range(len(df["passage"]))]).quantile())
    decoder_max_length = int(pd.Series([len(df_src.iloc[i]["answer"]) for i in range(len(df_src["answer"]))]).quantile())

    ####################################
    # NO HISTORY
    n = None # subset length to train faster, "None" for whole set 

    # process dataset to model input
    train_ds_src = preparation(train_df_src, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n) 
    val_ds_src = preparation(val_df_src, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)
    test_ds_src = preparation(test_df_src, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)
    
    print("")
    print("-----------------------------------------------------------")
    print("MODEL NO-HISTORY")
    print("")

    model = EncoderDecoderModel.from_pretrained("/content/drive/MyDrive/Colab Notebooks/seed42/model_2_nohist", local_files_only=True)

    # trainer
    trainer = Seq2SeqTrainer(
        model=model,
        tokenizer=berttiny_tokenizer,
        args=training_args,
        compute_metrics=compute_metrics_report,
    )

    trainer.model = model.cuda()



    print("evaluate m2 - TEST SET")
    # evaluate m2 - TEST SET
    eval_ts_src =trainer.evaluate(test_ds_src)
    print(eval_ts_src)

    print("evaluate m2 - VAL SET")
    # evaluate m2 - VAL SET
    eval_vs_src = trainer.evaluate(val_ds_src)
    print(eval_vs_src)

    ####################################
    #WITH HISTORY
    # build df with history
    h_train_df_src= add_history(train_df_src.copy())
    h_val_df_src = add_history(val_df_src.copy())
    h_test_df_src = add_history(test_df_src.copy())

    # process dataset to model input
    h_train_ds_src = preparation(h_train_df_src, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n) 
    h_val_ds_src = preparation(h_val_df_src, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)
    h_test_ds_src = preparation(h_test_df_src, process_data_to_model_inputs, berttiny_tokenizer, encoder_max_length, decoder_max_length, n)



    print("")
    print("-----------------------------------------------------------")
    print("MODEL WITH HISTORY")
    print("")

    model_h = EncoderDecoderModel.from_pretrained("/content/drive/MyDrive/Colab Notebooks/seed42/model_2_hist", local_files_only=True)

    # trainer
    trainer_h = Seq2SeqTrainer(
        model=model_h,
        tokenizer=berttiny_tokenizer,
        args=training_args,
        compute_metrics=compute_metrics_report,
    )

    # assign to `model` (the attribute the trainer actually uses), not `model_h`
    trainer_h.model = model_h.cuda()



    print("")
    print("evaluate m2 - TEST SET")
    # evaluate m2 - TEST SET
    eval_h_ts_src = trainer_h.evaluate(h_test_ds_src)
    print(eval_h_ts_src)

    print("")
    print("evaluate m2 -VAL SET")
    # evaluate m2 - VAL SET
    eval_h_vs_src = trainer_h.evaluate(h_val_ds_src)
    print(eval_h_vs_src)

   #return {}


#### Source WIKIPEDIA

In [None]:
#SOURCE WIKI
report_m2(grwiki, grwiki_ts)

DIM train_df source=wikipedia : 5184
DIM val_df source=wikipedia : 1300


  0%|          | 0/78 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

loading configuration file /content/drive/MyDrive/Colab Notebooks/seed42/model_2_nohist/config.json
Model config EncoderDecoderConfig {
  "architectures": [
    "EncoderDecoderModel"
  ],
  "decoder": {
    "_name_or_path": "prajjwal1/bert-tiny",
    "add_cross_attention": true,
    "architectures": null,
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": null,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 128,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_


-----------------------------------------------------------
MODEL NO-HISTORY



The following encoder weights were not tied to the decoder ['bert/pooler']
All model checkpoint weights were used when initializing EncoderDecoderModel.

All the weights of EncoderDecoderModel were initialized from the model checkpoint at /content/drive/MyDrive/Colab Notebooks/seed42/model_2_nohist.
If your task is similar to the task the model of the checkpoint was trained on, you can already use EncoderDecoderModel for predictions without further training.
The following encoder weights were not tied to the decoder ['bert/pooler']
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source. If source are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1626
  Batch size = 256


evaluate m2 - TEST SET


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source. If source are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5044
  Batch size = 256


Sorted list: [(0.0, 'five'), (0.0, 'new york city'), (0.0, 'new york'), (0.0, 'yes'), (0.0, 'in the southwest of the city'), (0.0, 'arthur kill and the kill van kull'), (0.0, '476, 015'), (0.0, 'no'), (0.0, 'non - hispanic white'), (0.0, 'the forgotten borough'), (0.0, 'because the inhabitants feel neglected by the city government'), (0.0, 'north shore'), (0.0, 'st. george, tompkinsville, clifton,'), (0.0, 'oclc'), (0.0, 'online computer library center'), (0.0, '1967'), (0.0, 'yes'), (0.0, 'ohio'), (0.0, 'ohio state university'), (0.0, 'frederick g. kilgour'), (0.0, 'he is not'), (0.0, 'medical school librarian'), (0.0, 'worldcat'), (0.0, 'july 5, 1967'), (0.0, 'ohio state university'), (0.0, 'alden library'), (0.0, 'ohio university'), (0.0, 'online cataloging'), (0.0, 'august 26, 1971'), (0.0, 'no'), (0.0, 'buckinghamshire'), (0.0, 'south east england'), (0.0, 'greater london'), (0.0, 'berkshire'), (0.0, 'oxfordshire'), (0.0, 'northamptonshire'), (0.0, 'hertfordshire'), (0.0, 'high wy

  0%|          | 0/78 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

loading configuration file /content/drive/MyDrive/Colab Notebooks/seed42/model_2_hist/config.json
Model config EncoderDecoderConfig {
  "architectures": [
    "EncoderDecoderModel"
  ],
  "decoder": {
    "_name_or_path": "prajjwal1/bert-tiny",
    "add_cross_attention": true,
    "architectures": null,
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": null,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 128,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_ra


-----------------------------------------------------------
MODEL WITH HISTORY



The following encoder weights were not tied to the decoder ['bert/pooler']
All model checkpoint weights were used when initializing EncoderDecoderModel.

All the weights of EncoderDecoderModel were initialized from the model checkpoint at /content/drive/MyDrive/Colab Notebooks/seed42/model_2_hist.
If your task is similar to the task the model of the checkpoint was trained on, you can already use EncoderDecoderModel for predictions without further training.
The following encoder weights were not tied to the decoder ['bert/pooler']
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source, history. If source, history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1626
  Batch size = 256



evaluate m2 - TEST SET


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source, history. If source, history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5044
  Batch size = 256


Sorted list: [(0.0, 'five'), (0.0, 'new york city'), (0.0, 'new york'), (0.0, 'in the southwest of the city'), (0.0, 'arthur kill and the kill van kull'), (0.0, '476, 015'), (0.0, 'non - hispanic white'), (0.0, 'the forgotten borough'), (0.0, 'because the inhabitants feel neglected by the city government'), (0.0, 'north shore'), (0.0, 'st. george, tompkinsville, clifton,'), (0.0, 'oclc'), (0.0, 'online computer library center'), (0.0, '1967'), (0.0, 'ohio'), (0.0, 'ohio state university'), (0.0, 'frederick g. kilgour'), (0.0, 'he is not'), (0.0, 'medical school librarian'), (0.0, 'worldcat'), (0.0, 'july 5, 1967'), (0.0, 'ohio state university'), (0.0, 'alden library'), (0.0, 'ohio university'), (0.0, 'online cataloging'), (0.0, 'august 26, 1971'), (0.0, 'buckinghamshire'), (0.0, 'south east england'), (0.0, 'greater london'), (0.0, 'berkshire'), (0.0, 'oxfordshire'), (0.0, 'northamptonshire'), (0.0, 'hertfordshire'), (0.0, 'high wycombe, amersham, che'), (0.0, 'london commuter belt'),

For source type WIKIPEDIA, we evaluate model 2 (trained with seed=42) on the test set and the val set. We sort the answers by the F1-SQuAD metric (reported as squad_f1_precision), print the 5 worst-scoring answers, and analyze the answer types on which the model performs worst.

TEST SET

No-H model:
eval_squad_f1_precision = 0.0023865781335645447

W-H model:
eval_squad_f1_precision = 0.005315890059511972

VAL SET

No-H model:
eval_squad_f1_precision = 0.003207742227256361

W-H model:
eval_squad_f1_precision = 0.005193049192316509

The model with history (W-H) scores higher than the model without history (No-H) on both sets, although both scores are very low in absolute terms.

Answer type analysis:

We refer to the answer types in Table 6 of https://arxiv.org/pdf/1808.07042.pdf

TEST SET

Both models, with and without history, perform worst on the multiple-choice and counting answer types.

VAL SET

Some errors also appear in the fluency answer type.
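
The answer-type analysis above was done by reading the printed lists. As a rough aid, the sketch below buckets gold answers into coarse types and averages F1 per bucket. The rules in `answer_type` are a hypothetical heuristic, not the annotation procedure of the CoQA paper, and they cannot detect the multiple-choice or fluency types, which require the question and the passage.

```python
import re
from collections import defaultdict

# Small, deliberately incomplete list of number words (an assumption of this sketch).
NUMBER_WORDS = {"zero", "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}


def answer_type(gold: str) -> str:
    """Assign a coarse type to a gold answer using simple string rules."""
    text = re.sub(r"[^\w\s]", "", gold.lower()).strip()
    if text == "yes":
        return "yes"
    if text == "no":
        return "no"
    # Treat digit-only strings (ignoring spaces, e.g. '476, 015') and number words as numbers.
    if text.replace(" ", "").isdigit() or text in NUMBER_WORDS:
        return "number"
    return "other"


def mean_f1_by_type(scored):
    """`scored` is a list of (f1, gold_answer) pairs, like merged_list in the report."""
    totals = defaultdict(lambda: [0.0, 0])
    for f1, gold in scored:
        bucket = totals[answer_type(gold)]
        bucket[0] += f1
        bucket[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}


# First few (f1, label) pairs from the no-history Wikipedia output above.
sample = [(0.0, "five"), (0.0, "new york city"), (0.0, "new york"),
          (0.0, "yes"), (0.0, "476, 015"), (0.0, "no")]
print(mean_f1_by_type(sample))
```

Passing the full `merged_list` from `compute_metrics_report` instead of `sample` would give a per-type average for a whole split.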


#### Source CNN

In [None]:
#SOURCE CNN
report_m2(grcnn, grcnn_ts)

DIM train_df source=wikipedia : 5444
DIM val_df source=wikipedia : 1364


  0%|          | 0/80 [00:00<?, ?ba/s]

  0%|          | 0/21 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

loading configuration file /content/drive/MyDrive/Colab Notebooks/seed42/model_2_nohist/config.json
Model config EncoderDecoderConfig {
  "architectures": [
    "EncoderDecoderModel"
  ],
  "decoder": {
    "_name_or_path": "prajjwal1/bert-tiny",
    "add_cross_attention": true,
    "architectures": null,
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": null,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 128,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_


-----------------------------------------------------------
MODEL NO-HISTORY



The following encoder weights were not tied to the decoder ['bert/pooler']
All model checkpoint weights were used when initializing EncoderDecoderModel.

All the weights of EncoderDecoderModel were initialized from the model checkpoint at /content/drive/MyDrive/Colab Notebooks/seed42/model_2_nohist.
If your task is similar to the task the model of the checkpoint was trained on, you can already use EncoderDecoderModel for predictions without further training.
The following encoder weights were not tied to the decoder ['bert/pooler']
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source. If source are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1649
  Batch size = 256


evaluate m2 - TEST SET


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source. If source are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5145
  Batch size = 256


Sorted list: [(0.0, 'yes.'), (0.0, 'dennis farina'), (0.0, 'actor'), (0.0, 'no'), (0.0, 'yes'), (0.0, 'no'), (0.0, 'michael mann'), (0.0, '" thief "'), (0.0, 'cops or gangsters'), (0.0, 'he joined a tv show cast.'), (0.0, '" law & order "'), (0.0, 'detective joe fontana'), (0.0, 'no'), (0.0, 'an expensive car'), (0.0, 'no'), (0.0, 'flashy'), (0.0, 'no'), (0.0, 'no'), (0.0, 'a cop'), (0.0, 'gary giordano'), (0.0, 'gaithersburg'), (0.0, 'montgomery county'), (0.0, 'maryland'), (0.0, 'aruban jail'), (0.0, 'suspect in the recent disappearance of an american woman'), (0.0, 'fbi'), (0.0, '15'), (0.0, 'aruban solicitor general taco stein'), (0.0, 'monday'), (0.0, 'robyn gardne'), (0.0, 'snorkeling'), (0.0, 'giordano'), (0.0, 'no, gardner was nowhere to be found'), (0.0, '50'), (0.0, 'august 5'), (0.0, '2, giordano told authorities that he'), (0.0, 'der spiegel'), (0.0, 'germany'), (0.0, 'posing over the bodies of dead afghans'), (0.0, 'bloody'), (0.0, 'propped up, back to back'), (0.0, 'milit

  0%|          | 0/80 [00:00<?, ?ba/s]

  0%|          | 0/21 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

loading configuration file /content/drive/MyDrive/Colab Notebooks/seed42/model_2_hist/config.json
Model config EncoderDecoderConfig {
  "architectures": [
    "EncoderDecoderModel"
  ],
  "decoder": {
    "_name_or_path": "prajjwal1/bert-tiny",
    "add_cross_attention": true,
    "architectures": null,
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": null,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 128,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_ra


-----------------------------------------------------------
MODEL WITH HISTORY



The following encoder weights were not tied to the decoder ['bert/pooler']
All model checkpoint weights were used when initializing EncoderDecoderModel.

All the weights of EncoderDecoderModel were initialized from the model checkpoint at /content/drive/MyDrive/Colab Notebooks/seed42/model_2_hist.
If your task is similar to the task the model of the checkpoint was trained on, you can already use EncoderDecoderModel for predictions without further training.
The following encoder weights were not tied to the decoder ['bert/pooler']
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source, history. If source, history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1649
  Batch size = 256



evaluate m2 - TEST SET


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source, history. If source, history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5145
  Batch size = 256


Sorted list: [(0.0, 'dennis farina'), (0.0, 'actor'), (0.0, 'michael mann'), (0.0, '" thief "'), (0.0, 'cops or gangsters'), (0.0, '" law & order "'), (0.0, 'detective joe fontana'), (0.0, 'an expensive car'), (0.0, 'flashy'), (0.0, 'a cop'), (0.0, 'gary giordano'), (0.0, 'gaithersburg'), (0.0, 'montgomery county'), (0.0, 'maryland'), (0.0, 'aruban jail'), (0.0, 'suspect in the recent disappearance of an american woman'), (0.0, 'fbi'), (0.0, '15'), (0.0, 'aruban solicitor general taco stein'), (0.0, 'monday'), (0.0, 'at least eight more days'), (0.0, 'robyn gardne'), (0.0, 'ast seen near baby beach'), (0.0, 'snorkeling'), (0.0, 'giordano'), (0.0, 'locals say is not a popular snorkel'), (0.0, '50'), (0.0, 'august 5'), (0.0, '2, giordano told authorities that he'), (0.0, 'der spiegel'), (0.0, 'germany'), (0.0, 'posing over the bodies of dead afghans'), (0.0, 'bloody'), (0.0, 'propped up, back to back'), (0.0, 'military vehicle.'), (0.0, 'taking or retaining individual souvenirs or trophi


For source type CNN, we evaluate model 2 (trained with seed=42) on the test set and the val set. We sort the answers by the F1-SQuAD metric (reported as squad_f1_precision), print the 5 worst-scoring answers, and analyze the answer types on which the model performs worst.

TEST SET

No-H model:
eval_squad_f1_precision = 0.0023112630044251044

W-H model:
eval_squad_f1_precision = 0.004832385128292988

VAL SET

No-H model:
eval_squad_f1_precision = 0.002447445505740101

W-H model:
eval_squad_f1_precision = 0.005215320080441891

The model with history (W-H) scores higher than the model without history (No-H) on both sets, although both scores are very low in absolute terms.

Answer type analysis:

We refer to the answer types in Table 6 of https://arxiv.org/pdf/1808.07042.pdf

TEST SET

Both models, with and without history, perform worst on the multiple-choice and counting answer types.

VAL SET

Some errors also appear in the fluency answer type.
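
Since the mean F1 is so close to zero, the great majority of examples score exactly 0.0, and the five "worst" answers are then decided by tie order: Python's `list.sort` is stable, so they are simply the first five zero-score examples in dataset order. The sketch below is one possible alternative summary; `summarize_zero_scores` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def summarize_zero_scores(scored, top_k=5):
    """Return the share of zero-F1 examples and their most frequent gold answers.

    `scored` is a list of (f1, gold_answer) pairs, like merged_list in the report.
    """
    zero_labels = [gold for f1, gold in scored if f1 == 0.0]
    share = len(zero_labels) / len(scored) if scored else 0.0
    return share, Counter(zero_labels).most_common(top_k)


# First few (f1, label) pairs from the no-history CNN output above.
sample = [(0.0, "yes."), (0.0, "dennis farina"), (0.0, "actor"),
          (0.0, "no"), (0.0, "yes"), (0.0, "no")]
share, most_common = summarize_zero_scores(sample, top_k=3)
print(share)
print(most_common)
```

Counting repeated gold answers in this way shows at a glance whether short answers such as "yes" and "no" dominate the failures, which is harder to see from five rows chosen by tie order.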


#### Source Gutenberg

In [None]:
#SOURCE GR gutenberg
report_m2(grgut, grgut_ts)

DIM train_df source=wikipedia : 5168
DIM val_df source=wikipedia : 1292


  0%|          | 0/79 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

loading configuration file /content/drive/MyDrive/Colab Notebooks/seed42/model_2_nohist/config.json
Model config EncoderDecoderConfig {
  "architectures": [
    "EncoderDecoderModel"
  ],
  "decoder": {
    "_name_or_path": "prajjwal1/bert-tiny",
    "add_cross_attention": true,
    "architectures": null,
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": null,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 128,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_


-----------------------------------------------------------
MODEL NO-HISTORY



The following encoder weights were not tied to the decoder ['bert/pooler']
All model checkpoint weights were used when initializing EncoderDecoderModel.

All the weights of EncoderDecoderModel were initialized from the model checkpoint at /content/drive/MyDrive/Colab Notebooks/seed42/model_2_nohist.
If your task is similar to the task the model of the checkpoint was trained on, you can already use EncoderDecoderModel for predictions without further training.
The following encoder weights were not tied to the decoder ['bert/pooler']
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source. If source are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1630
  Batch size = 256


evaluate m2 - TEST SET


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source. If source are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5027
  Batch size = 256


  LABEL      |     F1-SQUAD  
the _ ariel _     0.0 
lagoon     0.0 
no     0.0 
winters     0.0 
no     0.0 

{'eval_loss': 2.991503953933716, 'eval_squad_f1_precision': 0.0023275266492479795, 'eval_runtime': 231.8163, 'eval_samples_per_second': 7.031, 'eval_steps_per_second': 0.03}
evaluate m2 - VAL SET
  LABEL      |     F1-SQUAD  
leif ericsson     0.0 
biarne     0.0 
yes     0.0 
karlsefin     0.0 
olaf     0.0 

{'eval_loss': 2.9487688541412354, 'eval_squad_f1_precision': 0.002136517178278771, 'eval_runtime': 713.5283, 'eval_samples_per_second': 7.045, 'eval_steps_per_second': 0.028}


  0%|          | 0/79 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

loading configuration file /content/drive/MyDrive/Colab Notebooks/seed42/model_2_hist/config.json
Model config EncoderDecoderConfig {
  "architectures": [
    "EncoderDecoderModel"
  ],
  "decoder": {
    "_name_or_path": "prajjwal1/bert-tiny",
    "add_cross_attention": true,
    "architectures": null,
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": null,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 128,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_ra


-----------------------------------------------------------
MODEL WITH HISTORY



The following encoder weights were not tied to the decoder ['bert/pooler']
All model checkpoint weights were used when initializing EncoderDecoderModel.

All the weights of EncoderDecoderModel were initialized from the model checkpoint at /content/drive/MyDrive/Colab Notebooks/seed42/model_2_hist.
If your task is similar to the task the model of the checkpoint was trained on, you can already use EncoderDecoderModel for predictions without further training.
The following encoder weights were not tied to the decoder ['bert/pooler']
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source, history. If source, history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1630
  Batch size = 256



evaluate m2 - TEST SET


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source, history. If source, history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5027
  Batch size = 256


  LABEL      |     F1-SQUAD  
the _ ariel _     0.0 
lagoon     0.0 
winters     0.0 
malaita     0.0 
harley kennan     0.0 

{'eval_loss': 3.013112783432007, 'eval_squad_f1_precision': 0.007089013621277525, 'eval_runtime': 242.9904, 'eval_samples_per_second': 6.708, 'eval_steps_per_second': 0.029}

evaluate m2 -VAL SET
  LABEL      |     F1-SQUAD  
leif ericsson     0.0 
biarne     0.0 
karlsefin     0.0 
olaf     0.0 
he tripped.     0.0 

{'eval_loss': 2.965400218963623, 'eval_squad_f1_precision': 0.007133606960826107, 'eval_runtime': 761.7328, 'eval_samples_per_second': 6.599, 'eval_steps_per_second': 0.026}



For source type Gutenberg, we evaluate model 2 (trained with seed=42) on the test set and the val set. We sort the answers by the F1-SQuAD metric (reported as squad_f1_precision), print the 5 worst-scoring answers, and analyze the answer types on which the model performs worst.

TEST SET

No-H model:
eval_squad_f1_precision = 0.0023275266492479795

W-H model:
eval_squad_f1_precision = 0.007089013621277525

VAL SET

No-H model:
eval_squad_f1_precision = 0.002136517178278771

W-H model:
eval_squad_f1_precision = 0.007133606960826107

The model with history (W-H) scores higher than the model without history (No-H) on both sets, although both scores are very low in absolute terms.

Answer type analysis:

We refer to the answer types in Table 6 of https://arxiv.org/pdf/1808.07042.pdf

TEST SET

Both models, with and without history, perform worst on the multiple-choice and "no" answer types.

VAL SET

Some errors also appear in the "yes" answer type, but the results are very similar to those on the test set.
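
The per-source scores reported so far can be compared side by side. The sketch below only re-uses the numbers quoted above for Wikipedia, CNN and Gutenberg; the dictionary layout and the `history_gain` helper are illustrative choices, not part of the notebook.

```python
# (source, split) -> (no-history F1, with-history F1), copied from the reports above.
reported = {
    ("Wikipedia", "test"): (0.0023865781335645447, 0.005315890059511972),
    ("Wikipedia", "val"): (0.003207742227256361, 0.005193049192316509),
    ("CNN", "test"): (0.0023112630044251044, 0.004832385128292988),
    ("CNN", "val"): (0.002447445505740101, 0.005215320080441891),
    ("Gutenberg", "test"): (0.0023275266492479795, 0.007089013621277525),
    ("Gutenberg", "val"): (0.002136517178278771, 0.007133606960826107),
}


def history_gain(no_h: float, with_h: float) -> float:
    """Ratio of the with-history score to the no-history score."""
    return with_h / no_h


for (source, split), (no_h, with_h) in reported.items():
    gain = history_gain(no_h, with_h)
    print(f"{source:<10} {split:<5} no-H={no_h:.4f}  with-H={with_h:.4f}  ratio={gain:.2f}x")
```

On these six entries the with-history model scores roughly 1.6 to 3.3 times higher than the no-history model, but every score stays below 0.01, so the relative gain should be read with caution.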


#### Source Race

In [None]:
#SOURCE GR race
report_m2(grrc, grrc_ts)

DIM train_df source=wikipedia : 5472
DIM val_df source=wikipedia : 1372


  0%|          | 0/80 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

loading configuration file /content/drive/MyDrive/Colab Notebooks/seed42/model_2_nohist/config.json
Model config EncoderDecoderConfig {
  "architectures": [
    "EncoderDecoderModel"
  ],
  "decoder": {
    "_name_or_path": "prajjwal1/bert-tiny",
    "add_cross_attention": true,
    "architectures": null,
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": null,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 128,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_


-----------------------------------------------------------
MODEL NO-HISTORY



The following encoder weights were not tied to the decoder ['bert/pooler']
All model checkpoint weights were used when initializing EncoderDecoderModel.

All the weights of EncoderDecoderModel were initialized from the model checkpoint at /content/drive/MyDrive/Colab Notebooks/seed42/model_2_nohist.
If your task is similar to the task the model of the checkpoint was trained on, you can already use EncoderDecoderModel for predictions without further training.
The following encoder weights were not tied to the decoder ['bert/pooler']
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source. If source are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1653
  Batch size = 256


evaluate m2 - TEST SET


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source. If source are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5036
  Batch size = 256


Sorted list: [(0.0, 'yes'), (0.0, 'a paper carrier bag'), (0.0, 'yes'), (0.0, 'nicole'), (0.0, 'shanghai'), (0.0, 'mother'), (0.0, 'food'), (0.0, 'yes'), (0.0, 'i am having heart surgery soon, so'), (0.0, 'an ipad'), (0.0, 'i am now working on some more chinese'), (0.0, 'yes'), (0.0, '" thank you "'), (0.0, 'weather forecast'), (0.0, 'yes'), (0.0, 'firefighter'), (0.0, 'yes'), (0.0, 'flashlight'), (0.0, 'r. j.'), (0.0, 'joel'), (0.0, 'glass, wood, plaster, and maybe'), (0.0, 'no'), (0.0, 'eppes'), (0.0, 'the flashlight'), (0.0, 'great britain'), (0.0, 'india.'), (0.0, 'may be 30 feet tall'), (0.0, 'prune it'), (0.0, 'may prevent heart disease.'), (0.0, 'by accident'), (0.0, 'shen nong'), (0.0, 'about 2737 b. c'), (0.0, 'yes'), (0.0, 'unknown'), (0.0, 'no'), (0.0, 'they bought flowers.'), (0.0, "it's $ 15."), (0.0, 'no'), (0.0, "it doesn't look good."), (0.0, 'summer'), (0.0, '$ 15'), (0.0, 'no'), (0.0, 'a pen'), (0.0, 'she already has two blouses'), (0.0, "mother's birthday"), (0.0, 'a

  0%|          | 0/80 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

loading configuration file /content/drive/MyDrive/Colab Notebooks/seed42/model_2_hist/config.json
Model config EncoderDecoderConfig {
  "architectures": [
    "EncoderDecoderModel"
  ],
  "decoder": {
    "_name_or_path": "prajjwal1/bert-tiny",
    "add_cross_attention": true,
    "architectures": null,
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": null,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 128,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_ra


-----------------------------------------------------------
MODEL WITH HISTORY



The following encoder weights were not tied to the decoder ['bert/pooler']
All model checkpoint weights were used when initializing EncoderDecoderModel.

All the weights of EncoderDecoderModel were initialized from the model checkpoint at /content/drive/MyDrive/Colab Notebooks/seed42/model_2_hist.
If your task is similar to the task the model of the checkpoint was trained on, you can already use EncoderDecoderModel for predictions without further training.
The following encoder weights were not tied to the decoder ['bert/pooler']
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source, history. If source, history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1653
  Batch size = 256



evaluate m2 - TEST SET


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source, history. If source, history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5036
  Batch size = 256


Sorted list: [(0.0, 'an elderly chinese lady and a little boy'), (0.0, 'a paper carrier bag'), (0.0, 'nicole'), (0.0, 'shanghai'), (0.0, 'mother'), (0.0, 'food'), (0.0, 'i am having heart surgery soon, so'), (0.0, 'an ipad'), (0.0, 'hot soup and a container with rice,'), (0.0, 'i am now working on some more chinese'), (0.0, '" thank you "'), (0.0, 'weather forecast'), (0.0, 'yes'), (0.0, 'firefighter'), (0.0, 'yes'), (0.0, 'flashlight'), (0.0, 'r. j.'), (0.0, 'joel'), (0.0, 'eppes'), (0.0, 'the flashlight'), (0.0, 'great britain'), (0.0, 'india.'), (0.0, 'may be 30 feet tall'), (0.0, 'prune it'), (0.0, 'may prevent heart disease.'), (0.0, 'by accident'), (0.0, 'shen nong'), (0.0, 'about 2737 b. c'), (0.0, 'yes'), (0.0, 'unknown'), (0.0, 'they bought flowers.'), (0.0, "it's $ 15."), (0.0, "it doesn't look good."), (0.0, 'summer'), (0.0, '$ 15'), (0.0, 'a pen'), (0.0, 'she already has two blouses'), (0.0, "mother's birthday"), (0.0, 'at least $ 500'), (0.0, 'the hospital had been bombed.


For the Race source, we evaluate model 2 (trained with seed=42) on the test set and the validation set. We sort the answers by the SQuAD F1 metric (`eval_squad_f1_precision`), print the 5 worst errors, and analyze the answer types on which the model performs worst.

TEST SET

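No-H model:
eval_squad_f1_precision = 0.0026423082337127275

W-H model:
eval_squad_f1_precision = 0.0069145187714024035

VAL SET

No-H model:
eval_squad_f1_precision = 0.002843437803132904

W-H model:
eval_squad_f1_precision = 0.005964042734507267

The model with history (H) performs noticeably better: its F1 is between two and three times that of the model without history on both sets, although both values stay below 0.01.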

Answer type analysis:

following Table 6 in https://arxiv.org/pdf/1808.07042.pdf

TEST SET

For both models, with and without H, most errors are on multiple-choice and "yes" answers.

VAL SET

The models with and without H give the same answers.
Some errors appear on multiple-choice and "yes" answers.
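
The answer-type breakdown above was read off the sorted lists by hand. A rough way to automate part of it is sketched below, assuming a simple string heuristic: "yes", "no" and "unknown" are matched literally, and an answer is tagged as multiple choice only when the question text is supplied and contains "or". The function names and the heuristic itself are illustrative assumptions, not the categorization used in the CoQA paper.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def answer_type(answer: str, question: Optional[str] = None) -> str:
    """Assign a coarse answer type with a hypothetical string heuristic."""
    text = answer.strip().lower().rstrip(".")
    if text in ("yes", "no", "unknown"):
        return text
    # Without the question we cannot tell a multiple-choice answer apart.
    if question is not None and " or " in f" {question.lower()} ":
        return "multiple choice"
    return "other"


def count_zero_f1_types(scored: Iterable[Tuple[float, str]]) -> Counter:
    """Count coarse answer types among the (f1, label) pairs with F1 == 0."""
    return Counter(answer_type(label) for f1, label in scored if f1 == 0.0)


if __name__ == "__main__":
    # A few (f1, label) pairs copied from the no-history Race output above.
    sample = [
        (0.0, "yes"),
        (0.0, "a paper carrier bag"),
        (0.0, "yes"),
        (0.0, "nicole"),
        (0.0, "no"),
    ]
    for kind, count in count_zero_f1_types(sample).most_common():
        print(f"{kind}: {count}")
```

Since the sorted lists only store the reference answer, the multiple-choice and fluency categories still need the question or a manual check.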

#### Source Mctest

In [None]:
#SOURCE GR mctest
report_m2(grmct, grmct_ts)

DIM train_df source=wikipedia : 1760
DIM val_df source=wikipedia : 440


  0%|          | 0/24 [00:00<?, ?ba/s]

  0%|          | 0/6 [00:00<?, ?ba/s]

  0%|          | 0/6 [00:00<?, ?ba/s]

loading configuration file /content/drive/MyDrive/Colab Notebooks/seed42/model_2_nohist/config.json
Model config EncoderDecoderConfig {
  "architectures": [
    "EncoderDecoderModel"
  ],
  "decoder": {
    "_name_or_path": "prajjwal1/bert-tiny",
    "add_cross_attention": true,
    "architectures": null,
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": null,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 128,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_


-----------------------------------------------------------
MODEL NO-HISTORY



The following encoder weights were not tied to the decoder ['bert/pooler']
All model checkpoint weights were used when initializing EncoderDecoderModel.

All the weights of EncoderDecoderModel were initialized from the model checkpoint at /content/drive/MyDrive/Colab Notebooks/seed42/model_2_nohist.
If your task is similar to the task the model of the checkpoint was trained on, you can already use EncoderDecoderModel for predictions without further training.
The following encoder weights were not tied to the decoder ['bert/pooler']
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source. If source are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1425
  Batch size = 256


evaluate m2 - TEST SET


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source. If source are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1531
  Batch size = 256


  LABEL      |     F1-SQUAD  
white     0.0 
no     0.0 
with her mommy and 5 sisters     0.0 
orange and white     0.0 
no     0.0 

{'eval_loss': 2.9716672897338867, 'eval_squad_f1_precision': 0.0028342343560611655, 'eval_runtime': 204.4027, 'eval_samples_per_second': 6.972, 'eval_steps_per_second': 0.029}
evaluate m2 - VAL SET
Sorted list: [(0.0, 'bicycle'), (0.0, 'now'), (0.0, 'grandma'), (0.0, 'ran right into'), (0.0, '8 candles'), (0.0, 'because he was turning 8'), (0.0, 'yes'), (0.0, 'kramer'), (0.0, 'she was crying?'), (0.0, 'she was so scared'), (0.0, 'kramer chased him around the room'), (0.0, 'yes'), (0.0, 'to play'), (0.0, 'hide and seek'), (0.0, 'no'), (0.0, 'the mouse was'), (0.0, 'a little fake squeaky mouse'), (0.0, 'no'), (0.0, 'he was still a baby'), (0.0, 'no'), (0.0, 'no'), (0.0, 'a kitten'), (0.0, 'play'), (0.0, 'sleep'), (0.0, 'waking from a nap'), (0.0, 'best friends'), (0.0, 'no'), (0.0, 'mean'), (0.0, 'no'), (0.0, 'the sidewalk, the swings,'), (0.0, "mary's dog

  0%|          | 0/24 [00:00<?, ?ba/s]

  0%|          | 0/6 [00:00<?, ?ba/s]

  0%|          | 0/6 [00:00<?, ?ba/s]

loading configuration file /content/drive/MyDrive/Colab Notebooks/seed42/model_2_hist/config.json
Model config EncoderDecoderConfig {
  "architectures": [
    "EncoderDecoderModel"
  ],
  "decoder": {
    "_name_or_path": "prajjwal1/bert-tiny",
    "add_cross_attention": true,
    "architectures": null,
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": null,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 128,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_ra


-----------------------------------------------------------
MODEL WITH HISTORY



The following encoder weights were not tied to the decoder ['bert/pooler']
All model checkpoint weights were used when initializing EncoderDecoderModel.

All the weights of EncoderDecoderModel were initialized from the model checkpoint at /content/drive/MyDrive/Colab Notebooks/seed42/model_2_hist.
If your task is similar to the task the model of the checkpoint was trained on, you can already use EncoderDecoderModel for predictions without further training.
The following encoder weights were not tied to the decoder ['bert/pooler']
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source, history. If source, history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1425
  Batch size = 256



evaluate m2 - TEST SET


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: source, history. If source, history are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1531
  Batch size = 256


  LABEL      |     F1-SQUAD  
white     0.0 
in a barn     0.0 
with her mommy and 5 sisters     0.0 
orange and white     0.0 
she painted herself     0.0 

{'eval_loss': 2.977224826812744, 'eval_squad_f1_precision': 0.006025909391010019, 'eval_runtime': 208.0806, 'eval_samples_per_second': 6.848, 'eval_steps_per_second': 0.029}

evaluate m2 -VAL SET
Sorted list: [(0.0, 'bicycle'), (0.0, 'now'), (0.0, 'grandma'), (0.0, 'ran right into'), (0.0, '8 candles'), (0.0, 'because he was turning 8'), (0.0, 'yes'), (0.0, 'kramer'), (0.0, 'she was crying?'), (0.0, 'she was so scared'), (0.0, 'kramer chased him around the room'), (0.0, 'yes'), (0.0, 'to play'), (0.0, 'hide and seek'), (0.0, 'the mouse was'), (0.0, 'a little fake squeaky mouse'), (0.0, 'he was still a baby'), (0.0, 'a kitten'), (0.0, 'the hole in the wall'), (0.0, 'play'), (0.0, 'sleep'), (0.0, 'waking from a nap'), (0.0, 'best friends'), (0.0, 'mean'), (0.0, 'the sidewalk, the swings,'), (0.0, "mary's dog"), (0.0, 'giggle'), (0.0


For the Mctest source, we evaluate model 2 (trained with seed=42) on the test set and the validation set. We sort the answers by the SQuAD F1 metric (`eval_squad_f1_precision`), print the 5 worst errors, and analyze the answer types on which the model performs worst.

TEST SET

No-H model:
eval_squad_f1_precision = 0.0028342343560611655

W-H model:
eval_squad_f1_precision = 0.006025909391010019

VAL SET

No-H model:
eval_squad_f1_precision = 0.00250236439096673

W-H model:
eval_squad_f1_precision = 0.006531577148539349

The model with history (H) performs noticeably better: its F1 is between two and three times that of the model without history on both sets, although both values stay below 0.01.

Answer type analysis:

following Table 6 in https://arxiv.org/pdf/1808.07042.pdf

TEST SET

For both models, with and without H, most errors are on multiple-choice and "no" answers. The model with H makes slightly more errors on fluency answers.

VAL SET

The models with and without H give the same answers. Most of them are of the multiple-choice type.
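
The "worst 5 errors" tables shown in the outputs can be reproduced from a list of (F1, label) pairs such as the `Sorted list` printed above. The sketch below assumes that input shape; `worst_answers` and `format_report` are hypothetical helpers, not the functions called inside `report_m2`.

```python
from typing import List, Tuple

ScoredAnswer = Tuple[float, str]


def worst_answers(scored: List[ScoredAnswer], k: int = 5) -> List[ScoredAnswer]:
    """Return the k lowest-scoring (f1, label) pairs; ties keep input order."""
    # sorted() is stable, so answers with equal F1 stay in their original order.
    return sorted(scored, key=lambda pair: pair[0])[:k]


def format_report(rows: List[ScoredAnswer]) -> str:
    """Render rows in the same layout as the LABEL | F1-SQUAD tables above."""
    lines = ["  LABEL      |     F1-SQUAD  "]
    lines += [f"{label}     {f1} " for f1, label in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    scored = [
        (0.5, "hypothetical partial match"),  # made-up entry for illustration
        (0.0, "white"),
        (0.0, "in a barn"),
        (0.0, "with her mommy and 5 sisters"),
        (0.0, "orange and white"),
        (0.0, "she painted herself"),
    ]
    print(format_report(worst_answers(scored, k=5)))
```

Because almost every answer has an F1 of 0.0, the 5 rows that get printed mostly reflect the order of the examples in the dataset rather than a real ranking, so a sample of the zero-score answers may be more informative than the first five.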


### Conclusion

We evaluated model 2 (BERT-Tiny), without and with history (H), on the test set and the validation set of each source.

For every source, the model with history obtains a higher SQuAD F1 than the model without history.

For the model without H, on both the test set and the validation set, the best values are obtained on Race and Mctest.
The last three sources give quite similar results.

For the model with H, on both the test set and the validation set, the best value is obtained on Gutenberg, followed by Race and Mctest.
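
To make the comparison easier to check, the `eval_squad_f1_precision` values reported for Gutenberg, Race and Mctest can be collected in one place. The sketch below only covers these three sources; the dictionary layout and the helpers `ranking` and `history_gain` are illustrative, and the Race validation value for the model with history is assigned by position, since its label is missing in the notes.

```python
from typing import Dict, List, Tuple

# (source, split, model) -> eval_squad_f1_precision copied from the notes above.
REPORTED: Dict[Tuple[str, str, str], float] = {
    ("Gutenberg", "test", "no_history"): 0.0023275266492479795,
    ("Gutenberg", "test", "with_history"): 0.00708901362127752,
    ("Gutenberg", "val", "no_history"): 0.002136517178278771,
    ("Gutenberg", "val", "with_history"): 0.007133606960826107,
    ("Race", "test", "no_history"): 0.0026423082337127275,
    ("Race", "test", "with_history"): 0.0069145187714024035,
    ("Race", "val", "no_history"): 0.002843437803132904,
    # Label missing in the notes; assumed to be the with-history value.
    ("Race", "val", "with_history"): 0.005964042734507267,
    ("Mctest", "test", "no_history"): 0.0028342343560611655,
    ("Mctest", "test", "with_history"): 0.006025909391010019,
    ("Mctest", "val", "no_history"): 0.00250236439096673,
    ("Mctest", "val", "with_history"): 0.006531577148539349,
}

SOURCES = ["Gutenberg", "Race", "Mctest"]


def ranking(split: str, model: str) -> List[str]:
    """Sources ordered from best to worst F1 for one split and one model."""
    return sorted(SOURCES, key=lambda s: REPORTED[(s, split, model)], reverse=True)


def history_gain(source: str, split: str) -> float:
    """Ratio between the with-history and the no-history F1."""
    with_h = REPORTED[(source, split, "with_history")]
    without_h = REPORTED[(source, split, "no_history")]
    return with_h / without_h


if __name__ == "__main__":
    for split in ("test", "val"):
        for model in ("no_history", "with_history"):
            print(split, model, ranking(split, model))
        for source in SOURCES:
            print(f"  {source} {split}: x{history_gain(source, split):.2f}")
```

All scores are below 0.01, so the differences between sources are very small in absolute terms and the resulting rankings should be read with caution.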



