<a href="https://colab.research.google.com/github/giuseppe-tanzi/Question-Answering/blob/main/Question_Answering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Assignment 2

**Course** : Natural Language Processing

**Authors** : Yuri Noviello, Enrico Pallotta, Flavio Pinzarrone and Giuseppe Tanzi

## Installations and imports

In [None]:
import pandas as pd
import random
import numpy as np
from tqdm import tqdm

import os
import urllib.request

from sklearn.model_selection import GroupShuffleSplit 

# used to avoid out of memory problems
from torch.cuda import empty_cache
import torch
import gc

try:
  import transformers
except ImportError:
  print('Installing transformers')
  !pip -qqq install transformers
  import transformers

!pip install accelerate # used to auto find batch size during training (if needed)
from accelerate import find_executable_batch_size

try:
  from datasets import Dataset
except ImportError:
  print('Installing datasets')
  !pip -qqq install datasets
  from datasets import Dataset

# Downloading the script for squad f1 score computing
!rm squad.py
!wget https://raw.githubusercontent.com/allenai/allennlp-models/main/allennlp_models/rc/tools/squad.py

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq
from transformers import EncoderDecoderModel, AutoTokenizer
import squad

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
rm: cannot remove 'squad.py': No such file or directory
--2023-01-05 15:19:20--  https://raw.githubusercontent.com/allenai/allennlp-models/main/allennlp_models/rc/tools/squad.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2687 (2,6K) [text/plain]
Saving to: ‘squad.py’


2023-01-05 15:19:20 (52,4 MB/s) - ‘squad.py’ saved [2687/2687]



The following function seeds Python, NumPy and PyTorch and makes cuDNN deterministic, so that runs are reproducible.

In [None]:
def set_reproducibility(seed : int):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True   

seed = 42  # Tested seeds : 42, 2022, 1337
set_reproducibility(seed)

Set the Drive directory. Here we'll save only the evaluation results of the different models.

In [None]:
from google.colab import drive

drive.mount('/content/drive')

results_dir = "/content/drive/MyDrive/QA_results"
os.makedirs(results_dir, exist_ok = True)

Mounted at /content/drive


## Dataset Download


In [None]:
class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(url, output_path):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to)

def download_data(data_path, url_path, suffix):    
    if not os.path.exists(data_path):
        os.makedirs(data_path)
        
    data_path = os.path.join(data_path, f'{suffix}.json')

    if not os.path.exists(data_path):
        print(f"Downloading CoQA {suffix} data split... (it may take a while)")
        download_url(url=url_path, output_path=data_path)
        print("Download completed!")

In [None]:
# Train data
train_url = "https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json"
download_data(data_path='coqa', url_path=train_url, suffix='train')

# Test data
test_url = "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json"
download_data(data_path='coqa', url_path=test_url, suffix='test')

Downloading CoQA train data split... (it may take a while)


coqa-train-v1.0.json: 49.0MB [00:42, 1.16MB/s]                              


Download completed!
Downloading CoQA test data split... (it may take a while)


coqa-dev-v1.0.json: 9.09MB [00:07, 1.20MB/s]                            

Download completed!





In [None]:
train = pd.read_json("./coqa/train.json").drop('version', axis = 1)
test = pd.read_json("./coqa/test.json").drop('version', axis = 1)

## Data Inspection

Spend some time carefully checking the dataset format and how to retrieve the tasks' inputs and outputs.

In [None]:
train

Unnamed: 0,data
0,"{'source': 'wikipedia', 'id': '3zotghdk5ibi9ce..."
1,"{'source': 'cnn', 'id': '3wj1oxy92agboo5nlq4r7..."
2,"{'source': 'gutenberg', 'id': '3bdcf01ogxu7zdn..."
3,"{'source': 'cnn', 'id': '3ewijtffvo7wwchw6rtya..."
4,"{'source': 'gutenberg', 'id': '3urfvvm165iantk..."
...,...
7194,"{'source': 'gutenberg', 'id': '34j10vatjfyw0ao..."
7195,"{'source': 'cnn', 'id': '3vj40nv2qinjocrcy7k4z..."
7196,"{'source': 'race', 'id': '3rjsc4xj10uw0to3vq0v..."
7197,"{'source': 'wikipedia', 'id': '3gs6s824sqxty8v..."


In [None]:
test

Unnamed: 0,data
0,"{'source': 'mctest', 'id': '3dr23u6we5exclen4t..."
1,"{'source': 'mctest', 'id': '3azhrg4cu4ktme1zh7..."
2,"{'source': 'race', 'id': '3ioen3p9s7jsqm9zwse0..."
3,"{'source': 'cnn', 'id': '36v4q8r5zk0iwte84nbw2..."
4,"{'source': 'mctest', 'id': '3a1pq49wvhh8nbtgsb..."
...,...
495,"{'source': 'race', 'id': '3npfyt4izc42dgjyfy8t..."
496,"{'source': 'mctest', 'id': '3a7y0r2p2ooc4i9zn4..."
497,"{'source': 'wikipedia', 'id': '3o7l7bfshep737y..."
498,"{'source': 'race', 'id': '3i02618ya06g9pi2dcnt..."


### Conversion in DataFrame format

Let's fix the format of the dataset, converting it from JSON to a structured pandas DataFrame that contains only the useful information, i.e. context, question and first answer.

The history will be generated later, since this is simpler with a structured DataFrame and we first need to remove the unanswerable questions.

To make it easy to split the train set and to retrieve the history later, we also assign an *id* to each dialogue.

In [None]:
def rearrange_df(df : pd.DataFrame) -> pd.DataFrame:
  #required columns in our dataframe
  cols = ["source","dialogue id","context","question","answer"]
  #list of lists to create our dataframe
  comp_list = []
  for index, row in tqdm(df.iterrows()):
      for i in range(len(row["data"]["questions"])):
          temp_list = []
          temp_list.append(row["data"]["source"])
          temp_list.append(index)
          temp_list.append(row["data"]["story"])
          temp_list.append(row["data"]["questions"][i]["input_text"])
          temp_list.append(row["data"]["answers"][i]["input_text"])
          comp_list.append(temp_list)
  new_df = pd.DataFrame(comp_list, columns=cols) 
  return new_df
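
To make the conversion concrete, here is a minimal plain-Python sketch of the same flattening applied to a single toy record. The record content and the helper name `flatten_record` are made up for illustration; only the keys (`source`, `story`, `questions`, `answers`, `input_text`) are the ones accessed by the function above.

```python
# Toy record with the same keys that rearrange_df reads (values are invented).
record = {
    "source": "toy",
    "story": "Cotton is a white kitten. She lives in a barn.",
    "questions": [
        {"input_text": "What color was Cotton?"},
        {"input_text": "Where did she live?"},
    ],
    "answers": [
        {"input_text": "white"},
        {"input_text": "in a barn"},
    ],
}


def flatten_record(record: dict, dialogue_id: int) -> list:
    """Hypothetical helper: one row per dialogue turn, same columns as above."""
    rows = []
    for question, answer in zip(record["questions"], record["answers"]):
        rows.append({
            "source": record["source"],
            "dialogue id": dialogue_id,
            "context": record["story"],
            "question": question["input_text"],
            "answer": answer["input_text"],
        })
    return rows


rows = flatten_record(record, dialogue_id=0)
for row in rows:
    print(row["dialogue id"], "|", row["question"], "->", row["answer"])
```

Every turn of a dialogue repeats the full story in its `context` field, which is why the flattened train set has many more rows than the original one.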

In [None]:
print("Rearranging train..")
train_df = rearrange_df(train)
print("Rearranging test..")
test_df = rearrange_df(test)

Rearranging train..


7199it [00:01, 5777.27it/s]


Rearranging test..


500it [00:00, 6914.84it/s]


Let's see the final format

In [None]:
train_df

Unnamed: 0,source,dialogue id,context,question,answer
0,wikipedia,0,"The Vatican Apostolic Library (), more commonl...",When was the Vat formally opened?,It was formally established in 1475
1,wikipedia,0,"The Vatican Apostolic Library (), more commonl...",what is the library for?,research
2,wikipedia,0,"The Vatican Apostolic Library (), more commonl...",for what subjects?,"history, and law"
3,wikipedia,0,"The Vatican Apostolic Library (), more commonl...",and?,"philosophy, science and theology"
4,wikipedia,0,"The Vatican Apostolic Library (), more commonl...",what was started in 2014?,a project
...,...,...,...,...,...
108642,cnn,7198,(CNN) -- Cristiano Ronaldo provided the perfec...,Who was a sub?,Xabi Alonso
108643,cnn,7198,(CNN) -- Cristiano Ronaldo provided the perfec...,Was it his first game this year?,Yes
108644,cnn,7198,(CNN) -- Cristiano Ronaldo provided the perfec...,What position did the team reach?,third
108645,cnn,7198,(CNN) -- Cristiano Ronaldo provided the perfec...,Who was ahead of them?,Barca.


In [None]:
test_df

Unnamed: 0,source,dialogue id,context,question,answer
0,mctest,0,"Once upon a time, in a barn near a farm house,...",What color was Cotton?,white
1,mctest,0,"Once upon a time, in a barn near a farm house,...",Where did she live?,in a barn
2,mctest,0,"Once upon a time, in a barn near a farm house,...",Did she live alone?,no
3,mctest,0,"Once upon a time, in a barn near a farm house,...",Who did she live with?,with her mommy and 5 sisters
4,mctest,0,"Once upon a time, in a barn near a farm house,...",What color were her sisters?,orange and white
...,...,...,...,...,...
7978,wikipedia,499,"Las Vegas (, Spanish for ""The Meadows""), offic...","where does the nickname ""Sin City"" come from?",The city's tolerance for numerous forms of adu...
7979,wikipedia,499,"Las Vegas (, Spanish for ""The Meadows""), offic...",Which state is it in?,Nevada
7980,wikipedia,499,"Las Vegas (, Spanish for ""The Meadows""), offic...",Is it located in a desert?,Yes
7981,wikipedia,499,"Las Vegas (, Spanish for ""The Meadows""), offic...",what is the name of the desert?,Mojave Desert.


## [Task 1] Remove unanswerable QA pairs

Unanswerable questions in the CoQA dataset are the ones whose answer is "unknown".

Let's check some of them.

In [None]:
train_df[train_df['answer'] == "unknown"]

Unnamed: 0,source,dialogue id,context,question,answer
13,wikipedia,0,"The Vatican Apostolic Library (), more commonl...",what must be requested to view?,unknown
349,cnn,22,(CNN) -- The 54-year-old Michigan tree trimmer...,Were they the relatives of the kid?,unknown
352,cnn,22,(CNN) -- The 54-year-old Michigan tree trimmer...,Where did he go afterwards?,unknown
354,cnn,22,(CNN) -- The 54-year-old Michigan tree trimmer...,Who found them?,unknown
356,cnn,22,(CNN) -- The 54-year-old Michigan tree trimmer...,Did he have any siblings?,unknown
...,...,...,...,...,...
108230,gutenberg,7170,"CHAPTER 6 \n\nCan piety the discord heal, Or s...",Who wanted to end the conversation?,unknown
108408,race,7183,Edith Cavell was born in a little English vill...,What happened to him?,unknown
108581,gutenberg,7194,"CHAPTER XX \n\nFAST IN THE ICE \n\n""Well, ther...",What sort of official did they meet?,unknown
108632,wikipedia,7197,"Frankfurt, officially Frankfurt am Main (Liter...",Is it the world's biggest?,unknown


In [None]:
test_df[test_df['answer'] == "unknown"]

Unnamed: 0,source,dialogue id,context,question,answer
20,mctest,1,Once there was a beautiful fish named Asta. As...,What did they do with the note,unknown
22,mctest,1,Once there was a beautiful fish named Asta. As...,were they excited,unknown
130,race,8,Which country grows the most tea? The answer i...,How did his body react to the tea?,unknown
229,gutenberg,15,"CHAPTER XXII \n\nNorthward, along the leeward ...",Were there cannibals?,unknown
287,gutenberg,19,CHAPTER FIFTY FIVE. \n\nWAITING. \n\nThe lengt...,Why was he excommunicated then?,unknown
...,...,...,...,...,...
6789,wikipedia,423,"Futsal, (literally ""mini-football"", ""futebol d...",Does it have the same number of players as reg...,unknown
6972,gutenberg,433,"CHAPTER XXIV: A Merry Home-Going \n\n""The Laug...",Which one was made of paper?,unknown
7449,wikipedia,466,Guam (i/ˈɡwɑːm/ or /ˈɡwɒm/; Chamorro: Guåhån;[...,Is it the highest?,unknown
7509,wikipedia,469,The 2008 Summer Olympics torch relay was run f...,And did they climb any mountains?,unknown


Remove them.

In [None]:
train_df = train_df.drop(train_df[train_df['answer'] == "unknown"].index)
test_df = test_df.drop(test_df[test_df['answer'] == "unknown"].index)

## [Task 2] Train, Validation and Test splits

Let's split the training set into train and validation sets. We won't split the Q&A pairs related to the same story across the two sets, i.e. we split at the dialogue level.

In [None]:
splitter = GroupShuffleSplit(test_size=.2, n_splits=1) # 80% train, 20% val
split = splitter.split(train_df, groups=train_df['dialogue id'])
train_inds, test_inds = next(split)

train_set_df = train_df.iloc[train_inds]
val_set_df = train_df.iloc[test_inds]
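
The property we rely on is that no dialogue ends up in both sets. The sketch below illustrates it in plain Python on toy dialogue ids. It is not `GroupShuffleSplit` itself: `group_split` is a hypothetical helper that shuffles the unique groups and assigns whole groups to one side, which is the same idea.

```python
import random


def group_split(group_ids, val_fraction=0.2, seed=42):
    """Hypothetical helper: split row positions so each group stays in one set."""
    unique_groups = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    # The fraction applies to the number of groups, not to the number of rows.
    n_val = max(1, round(len(unique_groups) * val_fraction))
    val_groups = set(unique_groups[:n_val])
    train_idx = [i for i, g in enumerate(group_ids) if g not in val_groups]
    val_idx = [i for i, g in enumerate(group_ids) if g in val_groups]
    return train_idx, val_idx


# Toy 'dialogue id' column: 5 dialogues with 2 or 3 turns each.
dialogue_ids = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4]
train_idx, val_idx = group_split(dialogue_ids)

train_groups = {dialogue_ids[i] for i in train_idx}
val_groups = {dialogue_ids[i] for i in val_idx}
print("train dialogues:", sorted(train_groups))
print("val dialogues:", sorted(val_groups))
print("overlap:", train_groups & val_groups)
```

The same disjointness check can be run on the real split by comparing the sets of `dialogue id` values of `train_set_df` and `val_set_df`.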

In [None]:
train_set_df

Unnamed: 0,source,dialogue id,context,question,answer
0,wikipedia,0,"The Vatican Apostolic Library (), more commonl...",When was the Vat formally opened?,It was formally established in 1475
1,wikipedia,0,"The Vatican Apostolic Library (), more commonl...",what is the library for?,research
2,wikipedia,0,"The Vatican Apostolic Library (), more commonl...",for what subjects?,"history, and law"
3,wikipedia,0,"The Vatican Apostolic Library (), more commonl...",and?,"philosophy, science and theology"
4,wikipedia,0,"The Vatican Apostolic Library (), more commonl...",what was started in 2014?,a project
...,...,...,...,...,...
108642,cnn,7198,(CNN) -- Cristiano Ronaldo provided the perfec...,Who was a sub?,Xabi Alonso
108643,cnn,7198,(CNN) -- Cristiano Ronaldo provided the perfec...,Was it his first game this year?,Yes
108644,cnn,7198,(CNN) -- Cristiano Ronaldo provided the perfec...,What position did the team reach?,third
108645,cnn,7198,(CNN) -- Cristiano Ronaldo provided the perfec...,Who was ahead of them?,Barca.


In [None]:
val_set_df

Unnamed: 0,source,dialogue id,context,question,answer
130,race,8,Local businessmen are increasingly facing comp...,What is a valuable service?,brick and mortar stores
131,race,8,Local businessmen are increasingly facing comp...,What is the issue?,competition from online retailers
132,race,8,Local businessmen are increasingly facing comp...,When did it begin?,with mail-order catalogues
133,race,8,Local businessmen are increasingly facing comp...,Is is happening quickly?,yes
134,race,8,Local businessmen are increasingly facing comp...,What is a safety issue?,online security
...,...,...,...,...,...
108596,cnn,7195,(CNN) -- The biological mother of a missing 7-...,How old was her son?,Seven
108597,cnn,7195,(CNN) -- The biological mother of a missing 7-...,Where was he from?,Oregon
108598,cnn,7195,(CNN) -- The biological mother of a missing 7-...,What was his name?,Kyron Horman
108599,cnn,7195,(CNN) -- The biological mother of a missing 7-...,When was the letter released?,Monday


## [Task 3] Model definition

Here is the function that defines the models that will be trained on CoQA:
* **[M1]** DistilRoBERTa (distilroberta-base)
* **[M2]** BERTTiny (prajjwal1/bert-tiny)

From now on, we will refer to these models as **M1** and **M2**.

The following function returns the EncoderDecoderModel and its tokenizer for the model name given as input. <br>

We set the maximum number of tokens that the decoder can generate to the 99% quantile of the answers' lengths, and the minimum length to that of the shortest possible answer (3 tokens, i.e. start - word - end). <br>
Finally, we enabled beam search with 4 beams (the number of best partial solutions kept at each step) and set the no-repeat n-gram size to 3, so that no 3-gram can appear more than once in a generated answer.

In [None]:
def get_model_tok(bert_name : str):
  """
  Arguments:
    bert_name : model name from huggingface
  Returns:
    the transformer model and its corresponding tokenizer
  """
  print("Downloading :", bert_name)
  model = EncoderDecoderModel.from_encoder_decoder_pretrained(bert_name, bert_name, use_cache = False)
  tokenizer = AutoTokenizer.from_pretrained(bert_name, use_cache = False)

  # set special tokens
  tokenizer.bos_token = tokenizer.cls_token
  tokenizer.eos_token = tokenizer.sep_token
  model.config.decoder_start_token_id = tokenizer.bos_token_id
  model.config.eos_token_id = tokenizer.eos_token_id
  model.config.pad_token_id = tokenizer.pad_token_id
  model.config.cls_token_id = tokenizer.cls_token_id

  # sensible parameters for beam search
  answers_len = [len(tokenizer(answer).input_ids) for answer in train_df['answer']]
  model.config.vocab_size = model.config.decoder.vocab_size
  model.config.max_new_tokens = model.config.max_length = int(np.quantile(answers_len, 0.99))
  model.config.min_new_tokens = model.config.min_length = int(np.min(answers_len))
  model.config.num_beams = 4
  model.config.no_repeat_ngram_size = 3
  model.config.early_stopping = True
    
  return model, tokenizer

model_name_M1 = 'distilroberta-base'

model_name_M2 = 'prajjwal1/bert-tiny'
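
As a worked example of how the two length bounds are derived, the sketch below recomputes them on a handful of toy answers. Two assumptions are made for illustration: `toy_length` stands in for `len(tokenizer(answer).input_ids)` by counting one token per word plus a start and an end token, and `quantile` is a small hand-written helper using linear interpolation between ranks (the same idea as the default behaviour of `np.quantile`).

```python
import math


def quantile(values, q):
    """Quantile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def toy_length(answer: str) -> int:
    # Hypothetical stand-in for the real tokenizer: one token per word,
    # plus the start and end special tokens.
    return len(answer.split()) + 2


answers = [
    "white",
    "in a barn",
    "no",
    "with her mommy and 5 sisters",
    "orange and white",
]
lengths = [toy_length(a) for a in answers]

max_len = int(quantile(lengths, 0.99))  # upper bound for generation
min_len = min(lengths)                  # start - word - end for one-word answers

print("lengths:", lengths)
print("max length:", max_len)
print("min length:", min_len)
```

Using the 99% quantile instead of the maximum means that a few very long answers are truncated, in exchange for much shorter label sequences and faster generation.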

### Tokenization and conversion steps

We define the maximum number of tokens that the encoder can process for each training sample (i.e. max input dimension).

In [None]:
encoder_max_length = 512

The following function (inspired by [this](https://github.com/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb)) processes the elements of the dataset so that they can be used by a Seq2Seq model. <br>
We decided to tokenize the question first and then the context (as suggested by the Hugging Face tutorial for Q&A). <br>
Finally, the function replaces the PAD tokens in the labels with the value -100, a special value that the trainer automatically ignores when computing the loss.

In [None]:
def generate_input_output(batch,
                          tokenizer : AutoTokenizer,
                          model : EncoderDecoderModel, 
                          history : bool = False,
                          ):
  """
  Params:
    batch : a batch of data passed through the map function on a Dataset object.
    tokenizer : Tokenizer to use for processing data into input and output.
    model : Model that will use these data, used to access max output length.
    history : boolean value, use it to take into account also the history of the dialogue.
  Returns:
    batch of data processed by the tokenizer.
  """ 
  # removes possible left spaces.
  batch['question'] =[q.lstrip() for q in batch['question']] 
  
  if history:
    # concatenate the history to the context, separated by sep_token
    batch['context'] = [tokenizer.sep_token.join([context,history]) for context,history in zip(batch['context'],batch['history'])]

  # same tokenizer call with or without history: only the second sequence is truncated
  inputs = tokenizer(batch['question'], batch['context'], padding="max_length", truncation="only_second", max_length=encoder_max_length)
    
  outputs = tokenizer(batch["answer"], padding="max_length", truncation=True, max_length=model.config.max_new_tokens)

  batch["input_ids"] = inputs.input_ids
  batch["attention_mask"] = inputs.attention_mask
  batch["labels"] = outputs.input_ids.copy()

   # We have to make sure that the PAD token is ignored
  batch["labels"] = [[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]]

  return batch
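
The label masking step can be looked at in isolation. The following sketch applies it to two toy label sequences; the token ids and the pad id `0` are invented for illustration and do not come from a real tokenizer.

```python
PAD_ID = 0           # hypothetical pad token id
IGNORE_INDEX = -100  # value skipped by the loss, as described above


def mask_padding(label_batch, pad_id=PAD_ID):
    """Replace every pad id in the labels with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_id else token for token in labels]
        for labels in label_batch
    ]


# Two toy label sequences padded to the same length (ids are made up).
labels = [
    [101, 2317, 102, 0, 0],
    [101, 1999, 1037, 8659, 102],
]
masked = mask_padding(labels)
for row in masked:
    print(row)
```

Only the labels are masked. The encoder inputs keep their pad ids, since padding on the input side is already handled by the attention mask.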

The following function returns a copy of the input dataset with an added history attribute. <br>
The elements of the history ($Q_0, A_0, Q_1, A_1, \dots, Q_{N-1}, A_{N-1}$) are separated by the tokenizer's separator token.

In [None]:
def generate_history(df : pd.DataFrame,
                     sep_token : str) -> pd.DataFrame:
  """
  Function for history generation
  Params:
    df : CoQA DataFrame
    sep_token : string token used to separate Q&A in the history.
  Returns:
    new_df : new DataFrame with history.
  """
  new_df = df.copy()
  print("Generating history")
  dialogue_id = -1
  histories = [""] # first QA has void history
  for _, row in tqdm(new_df.iterrows()):
    if row['dialogue id'] == dialogue_id:
      # append the history of the next QA
      histories.append(sep_token.join([
          histories[-1],
          row['question'],
          row['answer']]))
    else:
      dialogue_id = row['dialogue id']
      histories[-1] = "" # set to "" the last history wrongly inserted, since it is related to the old dialogue
      histories.append(sep_token.join([
          histories[-1],
          row['question'],
          row['answer']]).replace(sep_token,"",1)) # last replace for removing sep_token from the start of the history
  histories.pop() # remove the last exceeding one.
  new_df['history'] = histories

  return new_df
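
To show what the resulting history strings look like, here is a small plain-Python sketch that builds them for a toy sequence of turns. It is a simplified re-implementation for illustration, not the function above: `build_histories` and the literal `[SEP]` separator are assumptions.

```python
SEP = "[SEP]"  # hypothetical separator, standing in for tokenizer.sep_token


def build_histories(turns, sep=SEP):
    """turns: list of (dialogue_id, question, answer), grouped by dialogue.

    Returns, for every turn, the previous Q&A pairs of the same dialogue
    joined by the separator (empty string for the first turn).
    """
    histories = []
    previous_id = None
    running = []
    for dialogue_id, question, answer in turns:
        if dialogue_id != previous_id:
            running = []  # a new dialogue starts with an empty history
            previous_id = dialogue_id
        histories.append(sep.join(running))
        running.extend([question, answer])
    return histories


turns = [
    (0, "What color was Cotton?", "white"),
    (0, "Where did she live?", "in a barn"),
    (0, "Did she live alone?", "no"),
    (1, "Which state is it in?", "Nevada"),
]
histories = build_histories(turns)
for (_, question, _), history in zip(turns, histories):
    print(repr(question), "<-", repr(history))
```

Note that the history of a turn never contains the turn's own question or answer, and that it is reset whenever the dialogue id changes.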

In [None]:
def get_processed_datasets(train_set_df : pd.DataFrame,
                           val_set_df : pd.DataFrame,
                           model : EncoderDecoderModel,
                           tokenizer : AutoTokenizer,
                           history : bool):
  """
  Params:
    train_set_df : Training set as pandas DataFrame object. Won't be modified since we apply copy.
    val_set_df : Validation set as pandas DataFrame object. Won't be modified since we apply copy.
    model : model that will use the datasets, used to access to its configurations.
    tokenizer : Tokenizer to use for processing data.
    history : whether you want to consider history or not.
  Returns:
    train and val Dataset object processed with the generate_input_output function.
  """
  train_set = train_set_df.copy()
  val_set = val_set_df.copy()
  
  print("Processing training data")

  if history:
    train_set = generate_history(train_set_df, tokenizer.sep_token)

  train_set_ds = Dataset.from_pandas(train_set)

  train_data = train_set_ds.map(
      generate_input_output,
      fn_kwargs = {'tokenizer' : tokenizer,'model' : model, 'history' : history},
      batched=True,
      remove_columns=train_set_ds.column_names
  )

  print("Processing validation data")
  if history:
    val_set = generate_history(val_set_df, tokenizer.sep_token)
  
  val_set_ds = Dataset.from_pandas(val_set)

  val_data = val_set_ds.map(
      generate_input_output,
      fn_kwargs = {'tokenizer' : tokenizer,'model' : model, 'history' : history},
      batched=True,
      remove_columns=val_set_ds.column_names
  )

  return train_data, val_data


## [Task 4] Answer generation with text passage $P$ and question $Q$

We want to define $f_\theta(P, Q)$. 

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$ and $Q_i$ and generate $A_i$.

In [None]:
def generate_answer(batch,
                    tokenizer : AutoTokenizer,
                    model : EncoderDecoderModel):
  """
  Generate answers without considering the history.
  Params:
    batch : batch of questions.
    tokenizer : used for tokenization.
    model : model that will generate the answers.
  """
  # truncate if needed only the context
  inputs = tokenizer(batch["question"], batch['context'], padding="max_length", truncation='only_second', max_length=encoder_max_length, return_tensors="pt")
  input_ids = inputs.input_ids.to("cuda")
  attention_mask = inputs.attention_mask.to("cuda")

  outputs = model.generate(input_ids, attention_mask=attention_mask)

  # all special tokens will be removed
  output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

  batch["pred"] = output_str

  return batch

## [Task 5] Answer generation with text passage $P$, question $Q$ and dialogue history $H$

We want to define $f_\theta(P, Q, H)$.

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$, $Q_i$, and $H = \{ Q_0, A_0, \dots, Q_{i-1}, A_{i-1} \}$ to generate $A_i$.

In [None]:
def generate_answer_wHistory(batch,
                             tokenizer : AutoTokenizer,
                             model : EncoderDecoderModel):
  """
  Generate answers considering the history.
  Params:
    batch : batch of questions.
    tokenizer : used for tokenization.
    model : model that will generate the answers
  """
  # extend the context with the history
  batch['context'] = [tokenizer.sep_token.join([context,history]) for context,history in zip(batch['context'],batch['history'])]
  
  # truncate if needed only the context
  inputs = tokenizer(batch["question"], batch['context'], padding="max_length", truncation='only_second', max_length=encoder_max_length, return_tensors="pt")
  input_ids = inputs.input_ids.to("cuda")
  attention_mask = inputs.attention_mask.to("cuda")

  outputs = model.generate(input_ids, attention_mask=attention_mask)

  # all special tokens will be removed
  output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

  batch["pred"] = output_str

  return batch
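
One consequence of appending the history after the context is worth illustrating. The sketch below is a word-level toy model of the `only_second` truncation, assuming that tokens are cut from the right end of the second sequence (we believe this is the default, but it is worth checking `truncation_side` on the tokenizer in use). The helper `truncate_only_second` and the value of `n_special` are hypothetical.

```python
def truncate_only_second(first_tokens, second_tokens, max_length, n_special=3):
    """Keep the first sequence whole and cut the END of the second one.

    n_special is a hypothetical count of special tokens added around the pair.
    """
    budget = max(max_length - len(first_tokens) - n_special, 0)
    return first_tokens, second_tokens[:budget]


question = ["who", "found", "them", "?"]
context = [f"c{i}" for i in range(10)]  # stand-in for a long passage
history = ["h0", "h1", "h2"]            # stand-in for previous Q&A tokens

# Same layout as above: context first, then the separator, then the history.
second = context + ["[SEP]"] + history
_, kept = truncate_only_second(question, second, max_length=16)

print("kept:", kept)
print("history tokens kept:", [t for t in kept if t.startswith("h")])
```

Under these assumptions, when the passage alone fills the encoder budget the history is the first part to be dropped, so the model with history sees the same input as the one without it for those samples.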

## [Task 6] Train and evaluate $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$


The validation results for each model can be seen in the training outputs. <br>
The test set results for each model are summarized in the next task.

**DISCLAIMER** : You may see some visualization artifacts in the training outputs. They are due to the long training time on Colab, which briefly pauses the process for the "not a robot" verification and then resumes it, re-printing the outputs.

We apologize for this, but retraining everything to remove them would have been impractical.

Now we'll define the output directories in which the models will be saved after the training.

In [None]:
M1_without_history = "trained_" + model_name_M1
M1_with_history = "trained_" + model_name_M1 + "_with_history"
M2_without_history = "trained_" + model_name_M2.replace("/","_")
M2_with_history = "trained_" + model_name_M2.replace("/","_") + "_with_history"

### Utility definitions

Definition of the general training arguments used in every training run. <br>
The $predict\_with\_generate$ argument is necessary to compute the SQuAD F1 score during evaluation. <br>
After some tests we decided to use a constant learning rate scheduler, which lets the model gain around 6-7 percentage points of F1-score compared to the default scheduler.

In [None]:
batch_size = 16 # definition of batch size used for all models

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="epoch",  
    save_strategy = "epoch",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16 = True, # speeds up training a bit
    output_dir="./", # this will be changed later for different model names
    save_steps=1, # only used when save_strategy is "steps"
    eval_steps=1, # only used when evaluation_strategy is "steps"; here we evaluate at each epoch
    learning_rate=2e-5,
    auto_find_batch_size = True, # Starts from given batch size, decreases it if needed
    report_to = 'all',
    optim = 'adamw_torch', # avoids optimizer warnings
    lr_scheduler_type = 'constant',
    full_determinism = True,
)
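
To give an idea of what the scheduler choice changes, the sketch below compares a constant learning rate with a linear decay to zero without warmup, which to our knowledge is what the default scheduler does. The two functions are simplified illustrations, not the library implementation. The number of examples and epochs are taken from the M1 training log further below.

```python
import math


def constant_lr(step, base_lr):
    """Constant schedule: the learning rate never changes."""
    return base_lr


def linear_decay_lr(step, base_lr, total_steps):
    """Simplified linear schedule: decay from base_lr to 0, no warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


num_examples = 85806        # 'Num examples' in the training log
per_device_batch_size = 16
num_epochs = 3              # 'Num Epochs' in the training log
base_lr = 2e-5

steps_per_epoch = math.ceil(num_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs
print("steps per epoch:", steps_per_epoch)
print("total optimization steps:", total_steps)

for epoch in range(num_epochs + 1):
    step = epoch * steps_per_epoch
    print(
        f"after epoch {epoch}:",
        f"constant={constant_lr(step, base_lr):.2e}",
        f"linear={linear_decay_lr(step, base_lr, total_steps):.2e}",
    )
```

The computed step counts agree with the log (16089 optimization steps, first checkpoint at step 5363). With only three epochs, the linear schedule spends the whole last epoch below one third of the base learning rate, which may partly explain why the constant schedule worked better here.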

Function used by the trainer to evaluate the model on the validation set after each epoch.

In [None]:
def compute_metrics(pred,
                    tokenizer : AutoTokenizer):
    """
    Function used during training to keep track of metrics on the validation set.

    Params:
      pred : batch of predictions
      tokenizer
    Returns:
      dict with average metrics computed on the batch
    """
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # all unnecessary tokens are removed
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)

    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    exact_match = [squad.compute_exact(pred,label) for pred, label in zip(pred_str, label_str)]
    f1_score = [squad.compute_f1(pred,label) for pred, label in zip(pred_str, label_str)]

    return {
        "squad f1": np.mean(f1_score) * 100,
        "squad match": np.mean(exact_match) * 100
    }
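
For reference, the two metrics can be sketched in a few lines of plain Python. This follows the usual SQuAD-style definition (lowercasing, removal of punctuation and articles, then token overlap); the downloaded `squad.py` may differ in some details, so treat it as an illustration rather than as the exact code used above.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> int:
    return int(normalize_answer(prediction) == normalize_answer(reference))


def f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # If either side is empty, the score is 1 only when both are empty.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Mojave Desert.", "mojave desert"))
print(round(f1("The cat sat.", "cat sat down"), 2))
print(f1("yes", "no"))
```

Both metrics are symmetric in their two arguments, so the order in which prediction and reference are passed does not change the score.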

The following function trains the model identified by model_name and saves it in the given output_dir.

In [None]:
from functools import partial

def exec_train_pipeline(model_name : str, 
                        output_dir : str,
                        history : bool):
  """
  Params:
    model_name : official name of the model
    output_dir : directory where the model will be saved
    history : if you want to consider also the history 
  """
  # removing cached file to avoid any possible conflict between consecutive model trainings
  print("Cleaning memory...")
  !rm -rf /root/.cache/huggingface/hub/model* 
  empty_cache()
  gc.collect()

  # load model and tokenizer
  model, tokenizer = get_model_tok(model_name) 
  
  # tokenization and conversion to dataset format
  train_data, val_data = get_processed_datasets(train_set_df, val_set_df, model, tokenizer, history = history)

  # set output dir to model name
  training_args.output_dir = output_dir
  
  # instantiate trainer
  trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    compute_metrics=partial(compute_metrics, tokenizer = tokenizer),
    train_dataset=train_data,
    eval_dataset=val_data,
    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model), # auto convert arrays to torch tensors.
    tokenizer=tokenizer
  )
  trainer.train()

  trainer.save_model()
  
  # important to avoid out of memory errors
  del model
  del tokenizer

The following function evaluates the trained model on the given DataFrame and stores the results in output_dir as a CSV file.

In [None]:
def exec_evaluation_pipeline(output_dir : str,
                             model_name : str,
                             test_set_df : pd.DataFrame,
                             history : bool)->pd.DataFrame :
  """
  Params:
    output_dir : directory where the model has been saved.
    model_name : official name of the model, used to load its tokenizer
    test_set_df : pandas DataFrame on which we want to compute metrics
    history : bool
  Returns:
    results : pandas DataFrame that contains the original dataset plus the columns of predictions and scores
    
  """
  # load trained model
  model_trained = EncoderDecoderModel.from_pretrained(output_dir)
  model_trained.to("cuda")
  
  # load tokenizer
  tokenizer = AutoTokenizer.from_pretrained(model_name, use_cache = False)
  # set special tokens
  tokenizer.bos_token = tokenizer.cls_token
  tokenizer.eos_token = tokenizer.sep_token

  test_set = test_set_df.copy() # avoids any possible conflict

  if history:
    test_set = generate_history(test_set, tokenizer.sep_token)
    test_set_ds = Dataset.from_pandas(test_set)

    results = test_set_ds.map(generate_answer_wHistory, 
                          batched=True,
                          batch_size=batch_size, # low number to avoid out of memory
                          fn_kwargs = {'tokenizer' : tokenizer, 'model' : model_trained},
                          )
  else:
    test_set_ds = Dataset.from_pandas(test_set)
    results = test_set_ds.map(generate_answer, 
                          batched=True,
                          batch_size=batch_size, # low number to avoid out of memory
                          fn_kwargs = {'tokenizer' : tokenizer, 'model' : model_trained},
                          )
  
  exact_match = [squad.compute_exact(result['pred'],result['answer']) for result in results]
  f1_score = [squad.compute_f1(result['pred'],result['answer']) for result in results]

  results = pd.DataFrame(results)
  results['squad match'] = exact_match
  results['squad f1'] = f1_score

  print("--------------------Evaluation results--------------------")
  print(f"Mean squad f1 : {round(results['squad f1'].mean() * 100,2)}%")
  print(f"Mean squad match : {round(results['squad match'].mean() * 100,2)}%")
  print("----------------------------------------------------------")
  # avoid out of memory problems
  del model_trained
  del tokenizer

  # write also the results to file
  results.to_csv(os.path.join(output_dir, "results.csv"))

  # if drive has been connected save also to it
  if 'results_dir' in globals() and os.path.exists(results_dir):
    os.makedirs(os.path.join(results_dir, output_dir), exist_ok = True)
    results.to_csv(os.path.join(results_dir, output_dir, "results.csv"))

  return results


### M1 - distilroberta-base

#### M1 without history

##### Training

In [None]:
exec_train_pipeline(model_name_M1, M1_without_history, False)

Cleaning memory...
Downloading : distilroberta-base


Downloading:   0%|          | 0.00/480 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/331M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForCausalLM were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['roberta.encoder.layer.2.crossattention.self.query.weight', 'roberta.encoder.layer.4.crossattention.output.LayerNorm.bias', 'roberta.encoder.layer.1.crossatten

Downloading:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Processing training data


  0%|          | 0/86 [00:00<?, ?ba/s]

Processing validation data


  0%|          | 0/22 [00:00<?, ?ba/s]

Using cuda_amp half precision backend
***** Running training *****
  Num examples = 85806
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 16089
  Number of trainable parameters = 178472025
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


***** Running Evaluation *****
  Num examples = 21470
  Batch size = 16
Saving model checkpoint to trained_distilroberta-base/checkpoint-5363
Configuration saved in trained_distilroberta-base/checkpoint-5363/config.json
Model weights saved in trained_distilroberta-base/checkpoint-5363/pytorch_model.bin
tokenizer config file saved in trained_distilroberta-base/checkpoint-5363/tokenizer_config.json
Special tokens file saved in trained_distilroberta-base/checkpoint-5363/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 21470
  Batch size = 16
Saving model checkpoint to trained_distilroberta-base/checkpoint-10726
Configuration saved in trained_distilroberta-base/checkpoint-10726/config.json
Model weights saved in trained_distilroberta-base/checkpoint-10726/pytorch_model.bin
tokenizer config file saved in trained_distilroberta-base/checkpoint-10726/tokenizer_config.json
Special tokens file saved in trained_distilroberta-base/checkpoint-10726/special_tokens_map.json
***

Epoch,Training Loss,Validation Loss,Squad f1,Squad match
1,2.8294,2.625568,30.380594,21.867722
2,2.1919,2.09854,45.098377,32.687471
3,1.788,1.948756,49.443938,36.47415


Saving model checkpoint to trained_distilroberta-base/checkpoint-16089
Configuration saved in trained_distilroberta-base/checkpoint-16089/config.json
Model weights saved in trained_distilroberta-base/checkpoint-16089/pytorch_model.bin
tokenizer config file saved in trained_distilroberta-base/checkpoint-16089/tokenizer_config.json
Special tokens file saved in trained_distilroberta-base/checkpoint-16089/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to trained_distilroberta-base
Configuration saved in trained_distilroberta-base/config.json
Model weights saved in trained_distilroberta-base/pytorch_model.bin
tokenizer config file saved in trained_distilroberta-base/tokenizer_config.json
Special tokens file saved in trained_distilroberta-base/special_tokens_map.json


##### Evaluation

In [None]:
M1_results = exec_evaluation_pipeline(M1_without_history, model_name_M1, test_df, False)

loading configuration file trained_distilroberta-base/config.json
Model config EncoderDecoderConfig {
  "_commit_hash": null,
  "architectures": [
    "EncoderDecoderModel"
  ],
  "cls_token_id": 0,
  "decoder": {
    "_name_or_path": "distilroberta-base",
    "add_cross_attention": true,
    "architectures": [
      "RobertaForMaskedLM"
    ],
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": 0,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": null,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "id2label": 

  0%|          | 0/495 [00:00<?, ?ba/s]



--------------------Evaluation results--------------------
Mean squad f1 : 51.52%
Mean squad match : 39.04%
----------------------------------------------------------


#### M1 with history

##### Training

In [None]:
exec_train_pipeline(model_name_M1, M1_with_history, True)

Cleaning memory...
Downloading : distilroberta-base


Downloading:   0%|          | 0.00/480 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/331M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForCausalLM were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['roberta.encoder.layer.4.crossattention.self.query.weight', 'roberta.encoder.layer.0.crossattention.self.value.bias', 'roberta.encoder.layer.3.crossattention.s

Downloading:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Processing training data
Generating history


85806it [00:03, 21770.19it/s]


  0%|          | 0/86 [00:00<?, ?ba/s]

Processing validation data
Generating history


21470it [00:00, 21599.40it/s]


  0%|          | 0/22 [00:00<?, ?ba/s]

Using cuda_amp half precision backend
***** Running training *****
  Num examples = 85806
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 16089
  Number of trainable parameters = 178472025
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Squad f1,Squad match
1,2.8024,2.5715,30.287875,21.537028
2,2.042,1.932074,49.243726,35.589194
3,1.615,1.745483,54.20022,39.119702


***** Running Evaluation *****
  Num examples = 21470
  Batch size = 16
Saving model checkpoint to trained_distilroberta-base_with_history/checkpoint-5363
Configuration saved in trained_distilroberta-base_with_history/checkpoint-5363/config.json
Model weights saved in trained_distilroberta-base_with_history/checkpoint-5363/pytorch_model.bin
tokenizer config file saved in trained_distilroberta-base_with_history/checkpoint-5363/tokenizer_config.json
Special tokens file saved in trained_distilroberta-base_with_history/checkpoint-5363/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 21470
  Batch size = 16
Saving model checkpoint to trained_distilroberta-base_with_history/checkpoint-10726
Configuration saved in trained_distilroberta-base_with_history/checkpoint-10726/config.json
Model weights saved in trained_distilroberta-base_with_history/checkpoint-10726/pytorch_model.bin
tokenizer config file saved in trained_distilroberta-base_with_history/checkpoint-10726/token

##### Evaluation

In [None]:
M1_wHistory_results = exec_evaluation_pipeline(M1_with_history, model_name_M1, test_df, True)

loading configuration file trained_distilroberta-base_with_history/config.json
Model config EncoderDecoderConfig {
  "_commit_hash": null,
  "architectures": [
    "EncoderDecoderModel"
  ],
  "cls_token_id": 0,
  "decoder": {
    "_name_or_path": "distilroberta-base",
    "add_cross_attention": true,
    "architectures": [
      "RobertaForMaskedLM"
    ],
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": 0,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": null,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
   

Generating history


7917it [00:00, 23093.76it/s]


  0%|          | 0/495 [00:00<?, ?ba/s]



--------------------Evaluation results--------------------
Mean squad f1 : 55.77%
Mean squad match : 41.09%
----------------------------------------------------------


### M2 - Bert tiny

#### M2 without history

##### Training

In [None]:
exec_train_pipeline(model_name_M2, M2_without_history, False)

Cleaning memory...
Downloading : prajjwal1/bert-tiny


Downloading:   0%|          | 0.00/285 [00:00<?, ?B/s]

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--prajjwal1--bert-tiny/snapshots/6f75de8b60a9f8a2fdf7b69cbd86d9e64bcb3837/config.json
Model config BertConfig {
  "_name_or_path": "prajjwal1/bert-tiny",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.25.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



Downloading:   0%|          | 0.00/17.8M [00:00<?, ?B/s]

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--prajjwal1--bert-tiny/snapshots/6f75de8b60a9f8a2fdf7b69cbd86d9e64bcb3837/pytorch_model.bin
Some weights of the model checkpoint at prajjwal1/bert-tiny were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassif

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--prajjwal1--bert-tiny/snapshots/6f75de8b60a9f8a2fdf7b69cbd86d9e64bcb3837/vocab.txt
loading file tokenizer.json from cache at None
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at None
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--prajjwal1--bert-tiny/snapshots/6f75de8b60a9f8a2fdf7b69cbd86d9e64bcb3837/config.json
Model config BertConfig {
  "_name_or_path": "prajjwal1/bert-tiny",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "pad_token_id": 0,
  "position_embedding_type": "abs

Processing training data


  0%|          | 0/86 [00:00<?, ?ba/s]

Processing validation data


  0%|          | 0/22 [00:00<?, ?ba/s]

Using cuda_amp half precision backend
***** Running training *****
  Num examples = 85806
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 16089
  Number of trainable parameters = 8935226
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Squad f1,Squad match
1,3.837,3.682175,11.78206,11.085235
2,3.5546,3.51605,13.585272,11.346064
3,3.4457,3.450705,14.442088,11.844434


***** Running Evaluation *****
  Num examples = 21470
  Batch size = 16
Saving model checkpoint to trained_prajjwal1_bert-tiny/checkpoint-5363
Configuration saved in trained_prajjwal1_bert-tiny/checkpoint-5363/config.json
Model weights saved in trained_prajjwal1_bert-tiny/checkpoint-5363/pytorch_model.bin
tokenizer config file saved in trained_prajjwal1_bert-tiny/checkpoint-5363/tokenizer_config.json
Special tokens file saved in trained_prajjwal1_bert-tiny/checkpoint-5363/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 21470
  Batch size = 16
Saving model checkpoint to trained_prajjwal1_bert-tiny/checkpoint-10726
Configuration saved in trained_prajjwal1_bert-tiny/checkpoint-10726/config.json
Model weights saved in trained_prajjwal1_bert-tiny/checkpoint-10726/pytorch_model.bin
tokenizer config file saved in trained_prajjwal1_bert-tiny/checkpoint-10726/tokenizer_config.json
Special tokens file saved in trained_prajjwal1_bert-tiny/checkpoint-10726/special_tokens_ma

##### Evaluation

In [None]:
M2_results = exec_evaluation_pipeline(M2_without_history, model_name_M2, test_df, False)

loading configuration file trained_prajjwal1_bert-tiny/config.json
Model config EncoderDecoderConfig {
  "_commit_hash": null,
  "architectures": [
    "EncoderDecoderModel"
  ],
  "cls_token_id": 101,
  "decoder": {
    "_name_or_path": "prajjwal1/bert-tiny",
    "add_cross_attention": true,
    "architectures": null,
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": null,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 128,
    "id2label": {
      "0": "LABEL_

  0%|          | 0/495 [00:00<?, ?ba/s]



--------------------Evaluation results--------------------
Mean squad f1 : 14.92%
Mean squad match : 12.43%
----------------------------------------------------------


#### M2 with history

##### Training

In [None]:
exec_train_pipeline(model_name_M2, M2_with_history, True)

Cleaning memory...
Downloading : prajjwal1/bert-tiny


Downloading:   0%|          | 0.00/285 [00:00<?, ?B/s]

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--prajjwal1--bert-tiny/snapshots/6f75de8b60a9f8a2fdf7b69cbd86d9e64bcb3837/config.json
Model config BertConfig {
  "_name_or_path": "prajjwal1/bert-tiny",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.25.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



Downloading:   0%|          | 0.00/17.8M [00:00<?, ?B/s]

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--prajjwal1--bert-tiny/snapshots/6f75de8b60a9f8a2fdf7b69cbd86d9e64bcb3837/pytorch_model.bin
Some weights of the model checkpoint at prajjwal1/bert-tiny were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassif

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--prajjwal1--bert-tiny/snapshots/6f75de8b60a9f8a2fdf7b69cbd86d9e64bcb3837/vocab.txt
loading file tokenizer.json from cache at None
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at None
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--prajjwal1--bert-tiny/snapshots/6f75de8b60a9f8a2fdf7b69cbd86d9e64bcb3837/config.json
Model config BertConfig {
  "_name_or_path": "prajjwal1/bert-tiny",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 2,
  "num_hidden_layers": 2,
  "pad_token_id": 0,
  "position_embedding_type": "abs

Processing training data
Generating history


85806it [00:03, 21708.53it/s]


  0%|          | 0/86 [00:00<?, ?ba/s]

Processing validation data
Generating history


21470it [00:00, 22914.48it/s]


  0%|          | 0/22 [00:00<?, ?ba/s]

Using cuda_amp half precision backend
***** Running training *****
  Num examples = 85806
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 16089
  Number of trainable parameters = 8935226
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Squad f1,Squad match
1,3.8229,3.681584,11.85289,10.903586
2,3.5591,3.520884,13.494428,11.248253
3,3.4515,3.455813,14.416498,11.853749


***** Running Evaluation *****
  Num examples = 21470
  Batch size = 16
Saving model checkpoint to trained_prajjwal1_bert-tiny_with_history/checkpoint-5363
Configuration saved in trained_prajjwal1_bert-tiny_with_history/checkpoint-5363/config.json
Model weights saved in trained_prajjwal1_bert-tiny_with_history/checkpoint-5363/pytorch_model.bin
tokenizer config file saved in trained_prajjwal1_bert-tiny_with_history/checkpoint-5363/tokenizer_config.json
Special tokens file saved in trained_prajjwal1_bert-tiny_with_history/checkpoint-5363/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 21470
  Batch size = 16
Saving model checkpoint to trained_prajjwal1_bert-tiny_with_history/checkpoint-10726
Configuration saved in trained_prajjwal1_bert-tiny_with_history/checkpoint-10726/config.json
Model weights saved in trained_prajjwal1_bert-tiny_with_history/checkpoint-10726/pytorch_model.bin
tokenizer config file saved in trained_prajjwal1_bert-tiny_with_history/checkpoint-10

##### Evaluation

In [None]:
M2_wHistory_results = exec_evaluation_pipeline(M2_with_history, model_name_M2, test_df, True)

loading configuration file trained_prajjwal1_bert-tiny_with_history/config.json
Model config EncoderDecoderConfig {
  "_commit_hash": null,
  "architectures": [
    "EncoderDecoderModel"
  ],
  "cls_token_id": 101,
  "decoder": {
    "_name_or_path": "prajjwal1/bert-tiny",
    "add_cross_attention": true,
    "architectures": null,
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": null,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 128,
    "id2label": {
     

Generating history


7917it [00:00, 21687.92it/s]


  0%|          | 0/495 [00:00<?, ?ba/s]



--------------------Evaluation results--------------------
Mean squad f1 : 14.61%
Mean squad match : 12.19%
----------------------------------------------------------


## [Task 7] Error Analysis

To evaluate our models we used the SQuAD **F1 score**, as instructed. In addition, we report the **exact match score**, which gives the **percentage of predictions that match the reference answer exactly** and therefore a quick view of the overall performance of the model.
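
To make the two metrics concrete, here is a minimal sketch of how a SQuAD-style exact match and token-level F1 can be computed. It is not the `squad.py` script used in this notebook: the function names `normalize_answer`, `exact_match` and `token_f1` are chosen for illustration, and the normalization (lowercasing, removing punctuation and English articles) is assumed to follow the usual SQuAD convention.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(gold: str, pred: str) -> int:
    """1 if the normalized strings are identical, 0 otherwise."""
    return int(normalize_answer(gold) == normalize_answer(pred))


def token_f1(gold: str, pred: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    gold_tokens = normalize_answer(gold).split()
    pred_tokens = normalize_answer(pred).split()
    if not gold_tokens or not pred_tokens:
        # Two empty answers count as a match, one empty answer as a miss.
        return float(gold_tokens == pred_tokens)
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# An answer that shares no token with the reference scores 0 on both metrics.
print(exact_match("In a barn", "a farm house."), token_f1("In a barn", "a farm house."))
# A partially correct answer gets partial F1 credit but no exact match.
print(exact_match("Harley Kennan", "Harley"), token_f1("Harley Kennan", "Harley"))
```

Because F1 gives credit for partial token overlap while exact match does not, F1 is always at least as high as exact match on the same predictions.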

Here we show only the results obtained with seed = 42; the report contains the results averaged over the 3 seeds (42, 1337, 2022).

Before digging into the models' errors, we report the F1 and exact match scores (in %) obtained on the test set:

---

| Model | F1 (no history) | Match (no history) | F1 (with history) | Match (with history) |
|----|:----------:|:-----:|:----------:|:-----:|
| M1 |     51.52     |   39.04  |     55.77    |   41.09  |
| M2 |     14.92     |   12.43  |     14.61     |   12.19  |

---

*Reminder: M1 = DistilRoBERTa, M2 = BERT-tiny*

Considering only the metrics, M1 benefits from the history (+4.25 F1 and +2.05 exact match points), whereas M2 seems indifferent to the additional information: its scores change by less than half a point.
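
The effect of the history can be read directly as a difference between the two configurations. The short sketch below recomputes it from the seed-42 test-set scores reported above; the dictionary layout and the helper name `history_gain` are only illustrative.

```python
# Test-set scores (seed 42) copied from the results reported above, in percent.
scores = {
    "M1": {
        "no_history": {"f1": 51.52, "match": 39.04},
        "with_history": {"f1": 55.77, "match": 41.09},
    },
    "M2": {
        "no_history": {"f1": 14.92, "match": 12.43},
        "with_history": {"f1": 14.61, "match": 12.19},
    },
}


def history_gain(model_scores: dict) -> dict:
    """Difference (with history - no history) for each metric, in percentage points."""
    return {
        metric: round(
            model_scores["with_history"][metric] - model_scores["no_history"][metric], 2
        )
        for metric in ("f1", "match")
    }


gains = {model: history_gain(model_scores) for model, model_scores in scores.items()}
for model, gain in gains.items():
    print(f"{model}: F1 {gain['f1']:+.2f}, match {gain['match']:+.2f} (percentage points)")
```

These are single-seed differences, so small changes such as those of M2 should be read together with the seed-averaged results in the report.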

Loading the stored results (in our case, we stored them on Google Drive).

In [None]:
# imports repeated here so that the notebook can be started from this cell just to check the results
import pandas as pd
import os

M1_results = pd.read_csv(os.path.join(results_dir, M1_without_history, "results.csv"))
M1_wHistory_results = pd.read_csv(os.path.join(results_dir, M1_with_history, "results.csv"))
M2_results = pd.read_csv(os.path.join(results_dir, M2_without_history, "results.csv"))
M2_wHistory_results = pd.read_csv(os.path.join(results_dir, M2_with_history, "results.csv"))

Now, as instructed, we visualize the 5 worst model errors (w.r.t. F1 score) for each source.


In [None]:
def worst_5_by_source(results_1 : pd.DataFrame, results_2 : pd.DataFrame):
  """
  Params:
    results_1 : results of an evaluation
    results_2 : results of another evaluation on the same test set; its predictions are reported next to the worst errors found in results_1
  Returns:
    pandas DataFrame with the 5 worst errors (lowest squad f1) for each source
  """
  # sort by f1 within each source and keep the first 5 rows per source;
  # reset_index stores the original row position in the 'index' column
  errors = results_1.sort_values(['source','squad f1']).groupby('source').head(5).reset_index().drop(columns = ["__index_level_0__","Unnamed: 0"])

  # add the predictions of results_2 for the same rows, placed before the two score columns
  # (the column name assumes results_2 comes from the model without history)
  errors.insert(len(errors.columns) - 2,'pred without history',list(results_2.iloc[errors['index']]['pred']))
  errors.drop(columns = ['index'], inplace = True)
  return errors
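
The selection logic of this function (sort by source and score, then keep the first rows of each group) can be illustrated without pandas. The sketch below is a plain-Python equivalent run on made-up toy records; the name `worst_k_by_group` and the toy data are hypothetical and are not part of the notebook.

```python
from collections import Counter


def worst_k_by_group(rows, group_key, score_key, k=5):
    """Return the k lowest-scoring rows of each group, ordered by group and score."""
    taken = Counter()
    worst = []
    # Python's sort is stable, so rows with equal scores keep their original order.
    for row in sorted(rows, key=lambda r: (r[group_key], r[score_key])):
        if taken[row[group_key]] < k:
            worst.append(row)
            taken[row[group_key]] += 1
    return worst


# Toy records standing in for evaluation results (not real data).
toy_rows = [
    {"source": "cnn", "question": "q1", "squad f1": 0.5},
    {"source": "cnn", "question": "q2", "squad f1": 0.0},
    {"source": "gutenberg", "question": "q3", "squad f1": 1.0},
    {"source": "cnn", "question": "q4", "squad f1": 0.2},
    {"source": "gutenberg", "question": "q5", "squad f1": 0.0},
]

worst = worst_k_by_group(toy_rows, "source", "squad f1", k=2)
for row in worst:
    print(row["source"], row["question"], row["squad f1"])
```

When many predictions share the lowest score (an F1 of 0, for instance), the rows shown for a source are simply the first ones in the sort order, so they are a sample of the errors rather than a ranking among them.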

### DistilRoberta error analysis

Below are the 5 worst errors for each source made by DistilRoBERTa **using the history** as additional information; **next to them** are the answers given to the same questions by the model that **does not use the history**.



In [None]:
worst_5_by_source(M1_wHistory_results, M1_results)

Unnamed: 0,source,dialogue id,context,question,answer,history,pred,pred without history,squad match,squad f1
0,cnn,3,"(CNN) -- Dennis Farina, the dapper, mustachioe...",What did he do?,Actor,Is someone in showbiz?</s>Yes.</s>Whom?</s>Den...,He was a cop,he was a consultant,0,0.0
1,cnn,3,"(CNN) -- Dennis Farina, the dapper, mustachioe...",Was he in movies?,Yes,Is someone in showbiz?</s>Yes.</s>Whom?</s>Den...,No,Yes,0,0.0
2,cnn,3,"(CNN) -- Dennis Farina, the dapper, mustachioe...",Anything recent?,No,Is someone in showbiz?</s>Yes.</s>Whom?</s>Den...,Yes,Yes,0,0.0
3,cnn,3,"(CNN) -- Dennis Farina, the dapper, mustachioe...",Which one?,"""Law & Order""",Is someone in showbiz?</s>Yes.</s>Whom?</s>Den...,"""Get Out""",Miami Vice,0,0.0
4,cnn,3,"(CNN) -- Dennis Farina, the dapper, mustachioe...",Did he have a beater for a car?,No,Is someone in showbiz?</s>Yes.</s>Whom?</s>Den...,Yes,yes,0,0.0
5,gutenberg,15,"CHAPTER XXII \n\nNorthward, along the leeward ...",What worked her way northward?,The _Ariel_,,the _ _ ____ _,The _Ariel___,0,0.0
6,gutenberg,15,"CHAPTER XXII \n\nNorthward, along the leeward ...",What lay between the shore-reefs and outer-reefs?,Lagoon,What worked her way northward?</s>The _Ariel_,the colour-yoonoonoon,blue-riboonoonoon,0,0.0
7,gutenberg,15,"CHAPTER XXII \n\nNorthward, along the leeward ...",What coast did the Ariel work her way up leisu...,Malaita,What worked her way northward?</s>The _Ariel_<...,Northawake,Northward,0,0.0
8,gutenberg,15,"CHAPTER XXII \n\nNorthward, along the leeward ...",Who argued to Villa that he must have a name?,Harley Kennan,What worked her way northward?</s>The _Ariel_<...,Her husband,Hannah,0,0.0
9,gutenberg,15,"CHAPTER XXII \n\nNorthward, along the leeward ...",Who was the husband she retorted something to?,Harley Kennan,What worked her way northward?</s>The _Ariel_<...,Herself,Villa,0,0.0


Even though their SQuAD F1 score is 0, the model does not answer at random: its answers at least fit the common Q&A patterns (Where: a location, What: an object, Has/Did: yes/no, How many: a number).

Examples (A = true answer, G = generated answer):

---
- Q : What did he do?
- A : Actor
- G : He was a cop.
---
- Q : Where did she live?
- A : In a barn
- G : a farm house.
---
- Q : Was he currently enrolled at the University?
- A : He is not.
- G : No.
---
- Q : How many burroughs are there?
- A : five
- G : 76
---

In some cases the answers are not entirely wrong. In the "farm" example above, "barn" and "farm" are related concepts; in the question about the University, the two answers have essentially the same meaning. Another such case is the following (C = context):
- C : I know what is inside the bag--a thermos with hot soup and a stainless-steel container with rice, vegetables and either chicken, meat or shrimp, sometimes with a kind of pancake.
- Q : What is in the bag?
- A : food
- G : a hot soup

The answers given with and without the history look quite similar, so the difference is hard to appreciate qualitatively, even though there is a gap of about 4 F1 points between the two configurations.
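
For reference, the `history` column in the error tables of this section looks like the previous questions and answers of the dialogue joined by the tokenizer's separator token (`</s>` for DistilRoBERTa, `[SEP]` for BERT-tiny). The sketch below shows how such a string could be built; it is inferred from that column only, and the function name `build_history` and its signature are assumptions, not the notebook's own implementation.

```python
def build_history(turns, sep_token):
    """Build one history string per turn from the preceding turns of a dialogue.

    turns: list of (question, answer) pairs of a single dialogue, in order.
    sep_token: separator placed between consecutive questions and answers.
    """
    histories = []
    previous = []
    for question, answer in turns:
        # The history of a turn contains only what was said before it.
        histories.append(sep_token.join(previous))
        previous.extend([question, answer])
    return histories


# First two turns of the gutenberg dialogue shown in the table above.
dialogue = [
    ("What worked her way northward?", "The _Ariel_"),
    ("What lay between the shore-reefs and outer-reefs?", "Lagoon"),
]
histories = build_history(dialogue, "</s>")
for (question, _), history in zip(dialogue, histories):
    print(repr(question), "->", repr(history))
```

Under this construction the first question of a dialogue has an empty history, which is consistent with the first gutenberg row of the table.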

### Bert-tiny error analysis

Below are the 5 worst errors for each source made by BERT-tiny **using the history** as additional information; **next to them** are the answers given to the same questions by the model that **does not use the history**.



In [None]:
worst_5_by_source(M2_wHistory_results, M2_results)

Unnamed: 0,source,dialogue id,context,question,answer,history,pred,pred without history,squad match,squad f1
0,cnn,3,"(CNN) -- Dennis Farina, the dapper, mustachioe...",Whom?,Dennis Farina,Is someone in showbiz?[SEP]Yes.,his father,his father,0,0.0
1,cnn,3,"(CNN) -- Dennis Farina, the dapper, mustachioe...",What did he do?,Actor,Is someone in showbiz?[SEP]Yes.[SEP]Whom?[SEP]...,he was a shot,he was a car accident,0,0.0
2,cnn,3,"(CNN) -- Dennis Farina, the dapper, mustachioe...",Anything recent?,No,Is someone in showbiz?[SEP]Yes.[SEP]Whom?[SEP]...,yes,yes,0,0.0
3,cnn,3,"(CNN) -- Dennis Farina, the dapper, mustachioe...",Who cast him?,Michael Mann,Is someone in showbiz?[SEP]Yes.[SEP]Whom?[SEP]...,his father,his father,0,0.0
4,cnn,3,"(CNN) -- Dennis Farina, the dapper, mustachioe...",What was the title of the movie?,"""Thief""",Is someone in showbiz?[SEP]Yes.[SEP]Whom?[SEP]...,the first - time,the film was shot,0,0.0
5,gutenberg,15,"CHAPTER XXII \n\nNorthward, along the leeward ...",What worked her way northward?,The _Ariel_,,a man,a man,0,0.0
6,gutenberg,15,"CHAPTER XXII \n\nNorthward, along the leeward ...",What lay between the shore-reefs and outer-reefs?,Lagoon,What worked her way northward?[SEP]The _Ariel_,the river,the river,0,0.0
7,gutenberg,15,"CHAPTER XXII \n\nNorthward, along the leeward ...",Who was the Captain?,Winters,What worked her way northward?[SEP]The _Ariel_...,his father,his father,0,0.0
8,gutenberg,15,"CHAPTER XXII \n\nNorthward, along the leeward ...",Did he have red hair?,No,What worked her way northward?[SEP]The _Ariel_...,yes,yes,0,0.0
9,gutenberg,15,"CHAPTER XXII \n\nNorthward, along the leeward ...",Were Harley and Villa in a hurry?,No,What worked her way northward?[SEP]The _Ariel_...,yes,no,0,0.0


BERT-tiny also seems to have learned the common Q&A patterns mentioned above but, unlike DistilRoBERTa, almost every wrong answer is completely wrong.
The stability of the metric values seen before is also reflected in these answers, which are basically the same with and without the history.

We can thus conclude that these models are not well suited to this kind of task: the state of the art reaches an F1 score of around 90%, while we reached only 55.77%. We still consider this a good result, given the relatively simple model used and the limited computational power available.

From the literature, the most used models for this task are T5, T0 and BART; testing them could be an interesting direction for future improvement.