# Assignment 2

**Credits**: Andrea Galassi, Federico Ruggeri, Paolo Torroni

**Keywords**: Transformers, Question Answering, CoQA

## Deadlines

* **December 11**, 2022: deadline for having assignments graded by January 11, 2023
* **January 11**, 2023: deadline for half-point speed bonus per assignment
* **After January 11**, 2023: assignments are still accepted, but there will be no speed bonus

## Overview

### Problem

Question Answering (QA) on [CoQA](https://stanfordnlp.github.io/coqa/) dataset: a conversational QA dataset.

### Task

Given a question $Q$, a text passage $P$, the task is to generate the answer $A$.<br>
$\rightarrow A$ can be: (i) a free-form text or (ii) unanswerable;

**Note**: an question $Q$ can refer to previous dialogue turns. <br>
$\rightarrow$ dialogue history $H$ may be a valuable input to provide the correct answer $A$.

### Models

We are going to experiment with transformer-based models to define the following models:

1.  $A = f_\theta(Q, P)$

2. $A = f_\theta(Q, P, H)$

where $f_\theta$ is the transformer-based model we have to define with $\theta$ parameters.

## The CoQA dataset

<center>
    <img src="https://drive.google.com/uc?export=view&id=16vrgyfoV42Z2AQX0QY7LHTfrgektEKKh" width="750"/>
</center>

For detailed information about the dataset, feel free to check the original [paper](https://arxiv.org/pdf/1808.07042.pdf).



## Rationales

Each QA pair is paired with a rationale $R$: it is a text span extracted from the given text passage $P$. <br>
$\rightarrow$ $R$ is not a requested output, but it can be used as an additional information at training time!

## Dataset Statistics

* **127k** QA pairs.
* **8k** conversations.
* **7** diverse domains: Children's Stories, Literature, Mid/High School Exams, News, Wikipedia, Reddit, Science.
* Average conversation length: **15 turns** (i.e., QA pairs).
* Almost **half** of CoQA questions refer back to **conversational history**.
* Only **train** and **validation** sets are available.

## Dataset snippet

The dataset is stored in JSON format. Each dialogue is represented as follows:

```
{
    "source": "mctest",
    "id": "3dr23u6we5exclen4th8uq9rb42tel",
    "filename": "mc160.test.41",
    "story": "Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. 
    Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. [...]" % <-- $P$
    "questions": [
        {
            "input_text": "What color was Cotton?",   % <-- $Q_1$
            "turn_id": 1
        },
        {
            "input_text": "Where did she live?",
            "turn_id": 2
        },
        [...]
    ],
    "answers": [
        {
            "span_start": 59,   % <-- $R_1$ start index
            "spand_end": 93,    % <-- $R_1$ end index
            "span_text": "a little white kitten named Cotton",   % <-- $R_1$
            "input_text" "white",   % <-- $A_1$      
            "turn_id": 1
        },
        [...]
    ]
}
```

### Simplifications

Each dialogue also contains an additional field ```additional_answers```. For simplicity, we **ignore** this field and only consider one groundtruth answer $A$ and text rationale $R$.

CoQA only contains 1.3% of unanswerable questions. For simplicity, we **ignore** those QA pairs.

## [Task 1] Remove unaswerable QA pairs

Write your own script to remove unaswerable QA pairs from both train and validation sets.

***Unknown is given in the data when it is an unanswerable question***

In [1]:
!pip install ktrain
!pip install datasets
!pip install transformers --upgrade

Collecting transformers==4.17.0
  Using cached transformers-4.17.0-py3-none-any.whl (3.8 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.25.1
    Uninstalling transformers-4.25.1:
      Successfully uninstalled transformers-4.25.1
Successfully installed transformers-4.17.0
Collecting transformers
  Using cached transformers-4.25.1-py3-none-any.whl (5.8 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.17.0
    Uninstalling transformers-4.17.0:
      Successfully uninstalled transformers-4.17.0
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ktrain 0.31.10 requires transformers==4.17.0, but you have transformers 4.25.1 which is incompatible.[0m[31m
[0mSuccessfully installed transformers-4.

In [2]:
import transformers

print(transformers.__version__)

  from .autonotebook import tqdm as notebook_tqdm


4.25.1


In [3]:
# imports and constants
from sklearn.model_selection import train_test_split
from datasets import load_dataset, load_metric
import json
import os
import urllib.request
from tqdm import tqdm
import random
import numpy as np
import tensorflow as tf
from tensorflow.keras import callbacks
from transformers import AutoTokenizer, EncoderDecoderModel, TFAutoModelForQuestionAnswering, create_optimizer
from datasets import Dataset
import pandas as pd
import re
# from google.colab import drive
import collections
from datasets import DatasetDict

def set_reproducibility(seed):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'

rs = 42
set_reproducibility(rs)

max_answer_length = 30

num_train_epochs = 3
batch_size = 128
squad_v2 = False

max_items_in_set = 5000

model_checkpoint = "prajjwal1/bert-tiny"#"distilroberta-base"

In [4]:
# drive.mount("/content/drive/")

## Dataset Download


In [5]:
class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(url, output_path):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=output_path, reporthook=t.update_to)

def download_data(data_path, url_path, suffix):    
    if not os.path.exists(data_path):
        os.makedirs(data_path)
        
    data_path = os.path.join(data_path, f'{suffix}.json')

    if not os.path.exists(data_path):
        print(f"Downloading CoQA {suffix} data split... (it may take a while)")
        download_url(url=url_path, output_path=data_path)
        print("Download completed!")

In [6]:
# Train data
train_url = "https://nlp.stanford.edu/data/coqa/coqa-train-v1.0.json"
download_data(data_path='coqa', url_path=train_url, suffix='train')

# Test data
test_url = "https://nlp.stanford.edu/data/coqa/coqa-dev-v1.0.json"
download_data(data_path='coqa', url_path=test_url, suffix='test')  # <-- Why test? See next slides for an answer!

In [7]:

pathtrain = "coqa/" + "train.json"
pathtest = "coqa/" + "test.json"

with open(pathtrain, "r") as train:
  train_data = json.load(train)

train_data = train_data["data"]

with open(pathtest, "r") as test:
  test_data = json.load(test)

test_data = test_data["data"]

###Unanswerable QA Pairs are being removed (Task 1)

In [8]:
def filter_unknowns(data):
  """This function removes unanswerable QA pairs by only adding the answers that are not 
     indicated as unknown to the data and then returns that dataset
  """
  for i in range(len(data)):
    am_questions = len(data[i]['questions'])

    omitted = 0
    temp_questions = dict()
    temp_answers = dict()

    for j in range(am_questions):
      if data[i]['answers'][j]['input_text'] != 'unknown':

        temp_questions[j - omitted] = data[i]['questions'][j]
        temp_answers[j - omitted] = data[i]['answers'][j]
      else:
        omitted += 1

    # assumes that there exist at least one question with an answer per context
    data[i]['questions'] = temp_questions
    data[i]['answers'] = temp_answers
  return data

In [9]:
train_data = filter_unknowns(train_data)
test_data  = filter_unknowns(test_data)

#### Data Inspection

Spend some time in checking accurately the dataset format and how to retrieve the tasks' inputs and outputs!

## [Task 2] Train, Validation and Test splits

CoQA only provides a train and validation set since the test set is hidden for evaluation purposes.

We'll consider the provided validation set as a test set. <br>
$\rightarrow$ Write your own script to:
* Split the train data in train and validation splits (80% train and 20% val)
* Perform splits such that a dialogue appears in one split only! (i.e., split at dialogue level)
* Perform splitting using the following seed for reproducibility: 42

#### Reproducibility Memo

Check back tutorial 2 on how to fix a specific random seed for reproducibility!

In [10]:
train, validation = train_test_split(train_data, test_size = 0.2, random_state=rs)
test = test_data

In [11]:
print(len(train))
print(len(validation))
print(len(test))
#As can be seen the data is split into train data (80%) and validation data (20%)
#STILL HAVE TO CHECK THAT A DIALOGUE ONLY APPEARS IN ONE SPLIT

5759
1440
500


In [12]:
def fancy_print_dialogue(dialogue):
  """This function helps the user to see the dialogues in a more clearer view
  """
  print(dialogue['story'])
  am_questions = len(dialogue['questions'])

  print()
  print("---")
  print()

  for i in range(am_questions):
    print(i)
    print("Q:", dialogue['questions'][i]['input_text'])
    print("A:", dialogue['answers'][i]['input_text'])
    print("R:", dialogue['answers'][i]['span_text'])
    print()

fancy_print_dialogue(train[0])

TUNIS, Tunisia (CNN) -- Polls closed late Sunday in Tunisia, the torchbearer of the so-called Arab Spring, but voters will not see results of national elections until Tuesday, officials said. 

On Sunday, long lines of voters snaked around schools-turned-polling-stations in Tunis's upscale Menzah neighborhood, some waiting for hours to cast a vote in the nation's first national elections since the country's independence in 1956. 

"It's a wonderful day. It's the first time we can choose our own representatives," said Walid Marrakchi, a civil engineer who waited more than two hours, and who brought along his 3-year-old son Ahmed so he could "get used to freedom and democracy." 

Tunisia's election is the first since a popular uprising in January overthrew long-time dictator Zine El Abidine Ben Ali and triggered a wave of revolutions -- referred to as the Arab Spring -- across the region. 

More than 60 political parties and thousands of independent candidates competed for 218 seats in a

# Reformating our data

In [13]:
def get_best_match(query, corpus):
  """Finding the best match as a string between the query and corpus which in this case
     is the answer and the answer in the context. This function returns the index
     of the answer in the context.
  """
  best_idx = 0
  best_am_mismatches = 0

  for j in range((len(query))):
    if j >= len(corpus):
      break
    if query[j] != corpus[j]:
      best_am_mismatches += 1

  for i in range(len(corpus)-len(query)):
    current_am_mismatches = 0
    for j in range((len(query))):
      if query[j] != corpus[j+i]:
        current_am_mismatches += 1
        if current_am_mismatches >= best_am_mismatches:
          break
    if current_am_mismatches <= best_am_mismatches:
      best_idx = i
      best_am_mismatches = current_am_mismatches
    
  return best_idx




In [14]:
def gen(data):
  """In this function the data is generated by giving the index of the answer in the context in the
     answer column.  
  """
  count = 0
    
  for i in range(len(data)):

    dialogue = data[i]

    am_questions = len(dialogue['questions'])

    context = dialogue['story']

    for i in range(am_questions):
      question_text = dialogue['questions'][i]['input_text']
      answer_text = dialogue['answers'][i]['input_text']

      #Finding the start of the answer string
      span_start = dialogue['answers'][i]['span_start']
      span_end = dialogue['answers'][i]['span_end']
      R = context[span_start:span_end]
      answer_start = get_best_match(answer_text, R) + span_start
      
      if answer_start == None:
        print(question_text)
        print(answer_text.lower())
        print(context)
        print(span_start, span_end)
        print(R)
        print(answer_start)

      assert answer_start != None

      answer = (dict({'text':[answer_text], 'answer_start' : [answer_start]}))

      yield context, question_text, answer,  dialogue["id"] + str(i), dialogue["source"], dialogue["filename"], dialogue["name"]

      context += ' ' + question_text + ' ' + answer_text

      count += 1

    if count >= max_items_in_set:
      break

#Using the function gen to generate the datafiles
df_train_final = pd.DataFrame(gen(train))
df_train_final.rename(columns = {0:'context', 1:'question', 2:'answers', 3:'id',4:'source',5:'filename',6:'name'}, inplace = True)

df_validation_final = pd.DataFrame(gen(validation))
df_validation_final.rename(columns = {0:'context', 1:'question', 2:'answers', 3:'id',4:'source',5:'filename',6:'name'}, inplace = True)

df_test_final = pd.DataFrame(gen(test))
df_test_final.rename(columns = {0:'context', 1:'question', 2:'answers', 3:'id',4:'source',5:'filename',6:'name'}, inplace = True)

train_dataset      = Dataset.from_pandas(df_train_final)
validation_dataset = Dataset.from_pandas(df_validation_final)
test_dataset       = Dataset.from_pandas(df_test_final)

#Making a dictionary consisting of the train, validation and test set.
datasets= DatasetDict({"train":train_dataset,"validation":validation_dataset,"test":test_dataset})
datasets.shuffle()

DatasetDict({
    train: Dataset({
        features: ['context', 'question', 'answers', 'id', 'source', 'filename', 'name'],
        num_rows: 5013
    })
    validation: Dataset({
        features: ['context', 'question', 'answers', 'id', 'source', 'filename', 'name'],
        num_rows: 5005
    })
    test: Dataset({
        features: ['context', 'question', 'answers', 'id', 'source', 'filename', 'name'],
        num_rows: 5003
    })
})

In [15]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    picks = [0,1,2,3,4,5] # I always want to see the first entry of the dataset
    for _ in range(num_examples-len(picks)):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(
                lambda x: [typ.feature.names[i] for i in x]
            )
    display(HTML(df.to_html()))

show_random_elements(datasets["train"])

Unnamed: 0,context,question,answers,id,source,filename,name
0,"TUNIS, Tunisia (CNN) -- Polls closed late Sunday in Tunisia, the torchbearer of the so-called Arab Spring, but voters will not see results of national elections until Tuesday, officials said. \n\nOn Sunday, long lines of voters snaked around schools-turned-polling-stations in Tunis's upscale Menzah neighborhood, some waiting for hours to cast a vote in the nation's first national elections since the country's independence in 1956. \n\n""It's a wonderful day. It's the first time we can choose our own representatives,"" said Walid Marrakchi, a civil engineer who waited more than two hours, and who brought along his 3-year-old son Ahmed so he could ""get used to freedom and democracy."" \n\nTunisia's election is the first since a popular uprising in January overthrew long-time dictator Zine El Abidine Ben Ali and triggered a wave of revolutions -- referred to as the Arab Spring -- across the region. \n\nMore than 60 political parties and thousands of independent candidates competed for 218 seats in a new Constitutional Assembly, which will be charged with writing a new constitution and laying the framework for a government system. \n\nVoters appeared jubilant on Sunday, taking photos of each other outside polling stations, some holding Tunisian flags. \n\n""It's a holiday,"" said housewife Maha Haubi, who had just taken her position at the end of the long line of more than 1,000 voters waiting outside an elementary school in Menzah. \n\n""Before we never even had the right to say 'yes' or 'no.'"" \n\nNearby, banker Aid Naghmaichi said she didn't mind the long wait to vote.",Where is this taking place?,"{'answer_start': [52], 'text': ['Tunisia']}",308q0pevb8dq8b7v262io567awb9is0,cnn,cnn_21eaf3eb9e3fc5140001e64d95533c88920bb425.story,cnn_21eaf3eb9e3fc5140001e64d95533c88920bb425.story
1,"TUNIS, Tunisia (CNN) -- Polls closed late Sunday in Tunisia, the torchbearer of the so-called Arab Spring, but voters will not see results of national elections until Tuesday, officials said. \n\nOn Sunday, long lines of voters snaked around schools-turned-polling-stations in Tunis's upscale Menzah neighborhood, some waiting for hours to cast a vote in the nation's first national elections since the country's independence in 1956. \n\n""It's a wonderful day. It's the first time we can choose our own representatives,"" said Walid Marrakchi, a civil engineer who waited more than two hours, and who brought along his 3-year-old son Ahmed so he could ""get used to freedom and democracy."" \n\nTunisia's election is the first since a popular uprising in January overthrew long-time dictator Zine El Abidine Ben Ali and triggered a wave of revolutions -- referred to as the Arab Spring -- across the region. \n\nMore than 60 political parties and thousands of independent candidates competed for 218 seats in a new Constitutional Assembly, which will be charged with writing a new constitution and laying the framework for a government system. \n\nVoters appeared jubilant on Sunday, taking photos of each other outside polling stations, some holding Tunisian flags. \n\n""It's a holiday,"" said housewife Maha Haubi, who had just taken her position at the end of the long line of more than 1,000 voters waiting outside an elementary school in Menzah. \n\n""Before we never even had the right to say 'yes' or 'no.'"" \n\nNearby, banker Aid Naghmaichi said she didn't mind the long wait to vote. Where is this taking place? Tunisia",What is being voted on?,"{'answer_start': [504], 'text': ['Representatives are being chosen']}",308q0pevb8dq8b7v262io567awb9is1,cnn,cnn_21eaf3eb9e3fc5140001e64d95533c88920bb425.story,cnn_21eaf3eb9e3fc5140001e64d95533c88920bb425.story
2,"TUNIS, Tunisia (CNN) -- Polls closed late Sunday in Tunisia, the torchbearer of the so-called Arab Spring, but voters will not see results of national elections until Tuesday, officials said. \n\nOn Sunday, long lines of voters snaked around schools-turned-polling-stations in Tunis's upscale Menzah neighborhood, some waiting for hours to cast a vote in the nation's first national elections since the country's independence in 1956. \n\n""It's a wonderful day. It's the first time we can choose our own representatives,"" said Walid Marrakchi, a civil engineer who waited more than two hours, and who brought along his 3-year-old son Ahmed so he could ""get used to freedom and democracy."" \n\nTunisia's election is the first since a popular uprising in January overthrew long-time dictator Zine El Abidine Ben Ali and triggered a wave of revolutions -- referred to as the Arab Spring -- across the region. \n\nMore than 60 political parties and thousands of independent candidates competed for 218 seats in a new Constitutional Assembly, which will be charged with writing a new constitution and laying the framework for a government system. \n\nVoters appeared jubilant on Sunday, taking photos of each other outside polling stations, some holding Tunisian flags. \n\n""It's a holiday,"" said housewife Maha Haubi, who had just taken her position at the end of the long line of more than 1,000 voters waiting outside an elementary school in Menzah. \n\n""Before we never even had the right to say 'yes' or 'no.'"" \n\nNearby, banker Aid Naghmaichi said she didn't mind the long wait to vote. Where is this taking place? Tunisia What is being voted on? Representatives are being chosen",What day of the week did they vote?,"{'answer_start': [42], 'text': ['Sunday']}",308q0pevb8dq8b7v262io567awb9is2,cnn,cnn_21eaf3eb9e3fc5140001e64d95533c88920bb425.story,cnn_21eaf3eb9e3fc5140001e64d95533c88920bb425.story
3,"TUNIS, Tunisia (CNN) -- Polls closed late Sunday in Tunisia, the torchbearer of the so-called Arab Spring, but voters will not see results of national elections until Tuesday, officials said. \n\nOn Sunday, long lines of voters snaked around schools-turned-polling-stations in Tunis's upscale Menzah neighborhood, some waiting for hours to cast a vote in the nation's first national elections since the country's independence in 1956. \n\n""It's a wonderful day. It's the first time we can choose our own representatives,"" said Walid Marrakchi, a civil engineer who waited more than two hours, and who brought along his 3-year-old son Ahmed so he could ""get used to freedom and democracy."" \n\nTunisia's election is the first since a popular uprising in January overthrew long-time dictator Zine El Abidine Ben Ali and triggered a wave of revolutions -- referred to as the Arab Spring -- across the region. \n\nMore than 60 political parties and thousands of independent candidates competed for 218 seats in a new Constitutional Assembly, which will be charged with writing a new constitution and laying the framework for a government system. \n\nVoters appeared jubilant on Sunday, taking photos of each other outside polling stations, some holding Tunisian flags. \n\n""It's a holiday,"" said housewife Maha Haubi, who had just taken her position at the end of the long line of more than 1,000 voters waiting outside an elementary school in Menzah. \n\n""Before we never even had the right to say 'yes' or 'no.'"" \n\nNearby, banker Aid Naghmaichi said she didn't mind the long wait to vote. Where is this taking place? Tunisia What is being voted on? Representatives are being chosen What day of the week did they vote? Sunday",When was the last one held?,"{'answer_start': [427], 'text': ['1956']}",308q0pevb8dq8b7v262io567awb9is3,cnn,cnn_21eaf3eb9e3fc5140001e64d95533c88920bb425.story,cnn_21eaf3eb9e3fc5140001e64d95533c88920bb425.story
4,"TUNIS, Tunisia (CNN) -- Polls closed late Sunday in Tunisia, the torchbearer of the so-called Arab Spring, but voters will not see results of national elections until Tuesday, officials said. \n\nOn Sunday, long lines of voters snaked around schools-turned-polling-stations in Tunis's upscale Menzah neighborhood, some waiting for hours to cast a vote in the nation's first national elections since the country's independence in 1956. \n\n""It's a wonderful day. It's the first time we can choose our own representatives,"" said Walid Marrakchi, a civil engineer who waited more than two hours, and who brought along his 3-year-old son Ahmed so he could ""get used to freedom and democracy."" \n\nTunisia's election is the first since a popular uprising in January overthrew long-time dictator Zine El Abidine Ben Ali and triggered a wave of revolutions -- referred to as the Arab Spring -- across the region. \n\nMore than 60 political parties and thousands of independent candidates competed for 218 seats in a new Constitutional Assembly, which will be charged with writing a new constitution and laying the framework for a government system. \n\nVoters appeared jubilant on Sunday, taking photos of each other outside polling stations, some holding Tunisian flags. \n\n""It's a holiday,"" said housewife Maha Haubi, who had just taken her position at the end of the long line of more than 1,000 voters waiting outside an elementary school in Menzah. \n\n""Before we never even had the right to say 'yes' or 'no.'"" \n\nNearby, banker Aid Naghmaichi said she didn't mind the long wait to vote. Where is this taking place? Tunisia What is being voted on? Representatives are being chosen What day of the week did they vote? Sunday When was the last one held? 1956",What else happened then?,"{'answer_start': [400], 'text': ['Country gained its independence']}",308q0pevb8dq8b7v262io567awb9is4,cnn,cnn_21eaf3eb9e3fc5140001e64d95533c88920bb425.story,cnn_21eaf3eb9e3fc5140001e64d95533c88920bb425.story
5,"TUNIS, Tunisia (CNN) -- Polls closed late Sunday in Tunisia, the torchbearer of the so-called Arab Spring, but voters will not see results of national elections until Tuesday, officials said. \n\nOn Sunday, long lines of voters snaked around schools-turned-polling-stations in Tunis's upscale Menzah neighborhood, some waiting for hours to cast a vote in the nation's first national elections since the country's independence in 1956. \n\n""It's a wonderful day. It's the first time we can choose our own representatives,"" said Walid Marrakchi, a civil engineer who waited more than two hours, and who brought along his 3-year-old son Ahmed so he could ""get used to freedom and democracy."" \n\nTunisia's election is the first since a popular uprising in January overthrew long-time dictator Zine El Abidine Ben Ali and triggered a wave of revolutions -- referred to as the Arab Spring -- across the region. \n\nMore than 60 political parties and thousands of independent candidates competed for 218 seats in a new Constitutional Assembly, which will be charged with writing a new constitution and laying the framework for a government system. \n\nVoters appeared jubilant on Sunday, taking photos of each other outside polling stations, some holding Tunisian flags. \n\n""It's a holiday,"" said housewife Maha Haubi, who had just taken her position at the end of the long line of more than 1,000 voters waiting outside an elementary school in Menzah. \n\n""Before we never even had the right to say 'yes' or 'no.'"" \n\nNearby, banker Aid Naghmaichi said she didn't mind the long wait to vote. Where is this taking place? Tunisia What is being voted on? Representatives are being chosen What day of the week did they vote? Sunday When was the last one held? 1956 What else happened then? Country gained its independence",Where are people voting?,"{'answer_start': [291], 'text': ['Menzah neighborhood']}",308q0pevb8dq8b7v262io567awb9is5,cnn,cnn_21eaf3eb9e3fc5140001e64d95533c88920bb425.story,cnn_21eaf3eb9e3fc5140001e64d95533c88920bb425.story
6,"CHAPTER XXV \n\nA CLUE \n\nThere was a touch of frost in the still air and the light was fading. A yellow glow lingered in the southwest beyond Criffell's sloping shoulder, which ran up against it, tinged a deep violet. Masses of soft, gray cloud floated above the mountain's summit; but the sky was clear overhead, and a thin new moon grew brighter in the east This was why the murmur of the sea came out of the distance in a muffled roar, for the tides run fast when the moon is young. \n\nElsie, walking homeward, vacantly noticed how bright the crescent gleamed above the dusky firs, as she entered the gloom of a straggling wood at the foot of the hill on which Appleyard was built. She had been out all the afternoon and now she shrank from going home, for she felt that a shadow rested upon the house. Dick had returned from a cruise with Andrew, looking dejected and unwell; and she was glad that Whitney had taken both away again, on his motorcycle, because Dick had lately had fits of moody restlessness when he was at home. Still, she missed them badly, for her mother was silent and preoccupied; and when Andrew was away, she found it hard to banish the troubles that seemed to be gathering round. They were worse for being very vaguely defined, but she felt convinced that something sinister was going on. \n\nAs she thought of Andrew, her face grew gentle and she smiled. She knew his worth and his limitations, and loved him for both. He had his suspicions, too, and would follow where they led. Andrew was not the man to shirk a painful duty, but she could not openly help him yet. That might come, and in the meanwhile she would at least put no obstacle in his way. Still, if her fears were justified, the situation was daunting and she might need all her courage. Where was Elsie walking to? Elsie, walking homeward what did she notice? how bright the crescent gleamed above the dusky firs What was at the foot of the hill? a straggling wood What was there? Appleyard Was she there long? all the afternoon did she want to go home after that? she shrank from going home why? she felt that a shadow rested upon the house Did she go on a cruise? no who did? Dick had returned from a cruise with Andrew",Were they rested and healthy?,"{'answer_start': [520], 'text': ['no']}",3qiyre09y3h0x7frv90he7k5x6d1nw9,gutenberg,data/gutenberg/txt/Harold Bindloss___Johnstone of the Border.txt/CHAPTER XXV_f228b331c65617bd55b3fce5410db8cbef8b07df1c016193a683f51,data/gutenberg/txt/Harold Bindloss___Johnstone of the Border.txt/CHAPTER XXV_f228b331c65617bd55b3fce5410db8cbef8b07df1c016193a683f51
7,"Chapter 1 \n\nKidnapped \n\n""The entire affair is shrouded in mystery,"" said D'Arnot. ""I have it on the best of authority that neither the police nor the special agents of the general staff have the faintest conception of how it was accomplished. All they know, all that anyone knows, is that Nikolas Rokoff has escaped."" \n\nJohn Clayton, Lord Greystoke--he who had been ""Tarzan of the Apes""--sat in silence in the apartments of his friend, Lieutenant Paul D'Arnot, in Paris, gazing meditatively at the toe of his immaculate boot. \n\nHis mind revolved many memories, recalled by the escape of his arch-enemy from the French military prison to which he had been sentenced for life upon the testimony of the ape-man. \n\nHe thought of the lengths to which Rokoff had once gone to compass his death, and he realized that what the man had already done would doubtless be as nothing by comparison with what he would wish and plot to do now that he was again free. \n\nTarzan had recently brought his wife and infant son to London to escape the discomforts and dangers of the rainy season upon their vast estate in Uziri--the land of the savage Waziri warriors whose broad African domains the ape-man had once ruled. \n\nHe had run across the Channel for a brief visit with his old friend, but the news of the Russian's escape had already cast a shadow upon his outing, so that though he had but just arrived he was already contemplating an immediate return to London. Who is known as Tarzan? John Clayton,",What did he do recently?,"{'answer_start': [973], 'text': ['brought his wife and infant son to London']}",30jnvc0or9kw4fdxdqvjaovhkhdqhb1,gutenberg,data/gutenberg/txt/Edgar Rice Burroughs___The Beasts of Tarzan.txt/Chapter 1_9ee2200982a44379fc652c479e5faffdef7fb0d07d07d51ab15c237,data/gutenberg/txt/Edgar Rice Burroughs___The Beasts of Tarzan.txt/Chapter 1_9ee2200982a44379fc652c479e5faffdef7fb0d07d07d51ab15c237
8,"The United States Department of Agriculture (USDA), also known as the Agriculture Department, is the U.S. federal executive department responsible for developing and executing federal laws related to farming, agriculture, forestry, and food. It aims to meet the needs of farmers and ranchers, promote agricultural trade and production, work to assure food safety, protect natural resources, foster rural communities and end hunger in the United States and internationally. \n\nApproximately 80% of the USDA's $140 billion budget goes to the Food and Nutrition Service (FNS) program. The largest component of the FNS budget is the Supplemental Nutrition Assistance Program (formerly known as the Food Stamp program), which is the cornerstone of USDA's nutrition assistance. \n\nAfter the resignation of Tom Vilsack on January 13, 2017, the Secretary of Agriculture is Sonny Perdue. \n\nMany of the programs concerned with the distribution of food and nutrition to people of America and providing nourishment as well as nutrition education to those in need are run and operated under the USDA Food and Nutrition Service. Activities in this program include the Supplemental Nutrition Assistance Program, which provides healthy food to over 40 million low-income and homeless people each month. USDA is a member of the United States Interagency Council on Homelessness, where it is committed to working with other agencies to ensure these mainstream benefits are accessed by those experiencing homelessness. What does USDA stand for? United States Department of Agriculture What percentage of the USDA budget goes to FNS? Approximately 80% What is the USDA also known as? the Agriculture Department is it responsible for executing federeal laws? yes relating to what? farming, agriculture, forestry, and food Whose needs does it try to meet? farmers and ranchers do they try to maintain the safety of food? yes Do they try to end hunger in the US? yes",How much is the budget?,"{'answer_start': [507], 'text': ['$140 billion']}",3o7l7bfshep737ycahi4gj7i1qliez8,wikipedia,United_States_Department_of_Agriculture.txt,United_States_Department_of_Agriculture.txt
9,"(CNN)His voice, his posture and his threats are menacingly familiar. \n\nThe black-clad ISIS militant shown in a video demanding a $200 million ransom to spare the lives of two Japanese citizens looks and sounds similar to the man who has appeared in at least five previous hostage videos. \n\nThe knife-wielding masked man with a London accent, nicknamed ""Jihadi John,"" has issued threats and overseen the beheadings of American and British captives. \n\n""You now have 72 hours to pressure your government in making a wise decision, by paying the $200 million to save the lives of your citizens,"" the man in the video that appeared Tuesday says in comments addressed to Japanese citizens. ""Otherwise, this knife will become your nightmare."" \n\nQ&A: Harsh realities of kidnappings, ransom \n\nThe amount of money is the same as that recently pledged by Japanese Prime Minister Shinzo Abe in humanitarian aid to Middle East countries that are affected by ISIS' bloody campaign in Iraq and Syria. \n\nJapan believes the deadline arrives Friday at 12:50 a.m. ET. And Chief Cabinet Minister Yoshihide Suga said Wednesday the country will do its best to communicate with ISIS through a third-party nation. \n\nBut mystery and confusion still surround the identity of Jihadi John. \n\nU.S. and British officials have said they believe they know who he is, but they haven't disclosed the information publicly. \n\nThat could be because Western intelligence agencies believe they have more to gain from keeping quiet, says Aki Peritz, a former CIA officer. \n\n""They can put pressure on his family, put pressure on his friends,"" he told CNN. ""Maybe they have a line to him. Maybe they know who his cousins are who are going to Syria who can identify him. However, if you publicly tell everybody who he is, his real identity, then maybe he'll go to ground and he'll disappear."" who had a accent ? Jihadi John how much time did he give ? 72 hours to do what ? pressure your government how much ramson ? $200 million hoe manu lives were at stake ? two from where ? Japan how many videos did he make before ? five of what kind ? hostage what is john group called ? ISIS what will become a bad dream ? knife who is the prime minister ? Shinzo last name ? Abe overseeing aid what areas where ? Middle East countries how many countrys are in danger ? Two when is the deadline ? Friday what time ? 12:50 who is the chief minister ? Yoshihide last name ? Suga does anyone beleive ibn his wearabouts ? they believe they know who he",who ?,"{'answer_start': [1264], 'text': ['U.S. and British officials']}",3dhe4r9ocwb1c0g1r9n0t6ldpd82g019,cnn,cnn_bb34e4f4e29ffa3fd7065c60739a95a8b659341a.story,cnn_bb34e4f4e29ffa3fd7065c60739a95a8b659341a.story


## [Task 3] Model definition

Write your own script to define the following transformer-based models from [huggingface](https://HuggingFace.co/).

* [M1] DistilRoBERTa (distilberta-base)
* [M2] BERTTiny (bert-tiny)

**Note**: Remember to install the ```transformers``` python package!

**Note**: We consider small transformer models for computational reasons!

## [Task 4] Question generation with text passage $P$ and question $Q$

We want to define $f_\theta(P, Q)$. 

Write your own script to implement $f_\theta$ for each model: M1 and M2.

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$ and $Q_i$ and generate $A_i$.

In [16]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
#Assuring that our tokenizer is a fast tokenizer
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

A long example in our dataset

In [17]:
# Max length could be 512 (as supported by the networks) and then stride could maybe be 256

max_length = 512 #384  # The maximum length of a feature (question and context)
doc_stride = 256 #128  # The allowed overlap between two part of the context when splitting is performed.

In [18]:
pad_on_right = tokenizer.padding_side == "right"
#Used since the model expects padding on the left

In [19]:
def prepare_train_features(examples):
    """An example is tokenized while also using truncation for the only second and padding with max length. This gives features
        overlapping a bit of each others context 
    """
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    #The sample mapping gives a map from a feature from its example 
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    
  #The offset mapping will compute the start position and end position of the answer
    offset_mapping = tokenized_examples["offset_mapping"]

    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):

        input_ids = tokenized_examples["input_ids"][i]
        #Index with the CLS token
        cls_index = input_ids.index(tokenizer.cls_token_id)

        #Here the sequence is assigned to a variable to understand the context and question
        sequence_ids = tokenized_examples.sequence_ids(i)

        
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        #The index of cls is given as answer if there is no answer yet 
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            #Start and end index of the answer in the context 
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            #Start token index of the context 
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            #End token index of the context 
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            #Check whether the feature is labeled with CLS index
            if not (
                offsets[token_start_index][0] <= start_char
                and offsets[token_end_index][1] >= end_char
            ):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                #If feature is not labeled with CLS index
                while (
                    token_start_index < len(offsets)
                    and offsets[token_start_index][0] <= start_char
                ):
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    #Same example id is being used here 
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        #The sequence is used to understand the context and the question
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        #Index of the span of the context 
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

       #the offset mapping is set to none if it is not part of the context. This helps to know whether in the context a token position is part of it or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

In [20]:
#Having the whole dataset being prepared for the train features 
tokenized_datasets = datasets.map(
    prepare_train_features, batched=True, remove_columns=datasets["train"].column_names
)

100%|██████████| 6/6 [00:03<00:00,  1.69ba/s]
100%|██████████| 6/6 [00:03<00:00,  1.90ba/s]
100%|██████████| 6/6 [00:02<00:00,  2.06ba/s]


## [Task 5] Question generation with text passage $P$, question $Q$ and dialogue history $H$

We want to define $f_\theta(P, Q, H)$. Write your own script to implement $f_\theta$ for each model: M1 and M2.

#### Formulation

Consider a dialogue on text passage $P$. 

For each question $Q_i$ at dialogue turn $i$, your model should take $P$, $Q_i$, and $H = \{ Q_0, A_0, \dots, Q_{i-1}, A_{i-1} \}$ to generate $A_i$.

In [21]:
model = TFAutoModelForQuestionAnswering.from_pretrained(model_checkpoint, from_pt=True)

Metal device set to: Apple M1 Pro

systemMemory: 16.00 GB
maxCacheSize: 5.33 GB



2022-12-06 18:03:37.429175: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-12-06 18:03:37.429723: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertForQuestionAnswering: ['bert.embeddings.position_ids']
- This IS expected if you are initializing TFBertForQuestionAnswering from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForQuestionAnswering from a PyTorch model that you ex

## [Task 6] Train and evaluate $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$

Write your own script to train and evaluate your $f_\theta(P, Q)$ and $f_\theta(P, Q, H)$ models.

### Instructions

* Perform multiple train/evaluation seed runs: [42, 2022, 1337].$^1$
* Evaluate your models with the following metrics: SQUAD F1-score.$^2$
* Fine-tune each transformer-based models for **3 epochs**.
* Report evaluation SQUAD F1-score computed on the validation and test sets.

$^1$ Remember what we said about code reproducibility in Tutorial 2!

$^2$ You can use ```allennlp``` python package for a quick implementation of SQUAD F1-score: ```from allennlp_models.rc.tools import squad```. 

In [22]:
#Preparing the datasets for the model putting it in a tensorflow pipeline
train_set = model.prepare_tf_dataset(
    tokenized_datasets["train"],
    shuffle=True,
    batch_size=batch_size,
)

validation_set = model.prepare_tf_dataset(
    tokenized_datasets["validation"],
    shuffle=False,
    batch_size=batch_size,
)

test_set = model.prepare_tf_dataset(
    tokenized_datasets["test"],
    shuffle=False,
    batch_size=batch_size,
)

In [31]:
#Using the model with an optimizer from Adam
model.compile(optimizer=tf.keras.optimizers.Adam(), metrics=["accuracy"])

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [32]:
model.summary()

Model: "tf_bert_for_question_answering"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  4369408   
                                                                 
 qa_outputs (Dense)          multiple                  258       
                                                                 
Total params: 4,369,666
Trainable params: 4,369,666
Non-trainable params: 0
_________________________________________________________________


In [33]:
# import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"

In [34]:
#Training the model on the training dataset with a certain amount of epochs 
model.fit(
    train_set,
    validation_data=validation_set,
    epochs=num_train_epochs,
    callbacks=[]#model_checkpoint_callback],
)

Epoch 1/3


InvalidArgumentError: Cannot assign a device for operation tf_bert_for_question_answering/bert/embeddings/Gather: Could not satisfy explicit device specification '' because the node {{colocation_node tf_bert_for_question_answering/bert/embeddings/Gather}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0]. 
Colocation Debug Info:
Colocation group had the following types and supported devices: 
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
AssignSubVariableOp: GPU CPU 
RealDiv: GPU CPU 
Sqrt: GPU CPU 
UnsortedSegmentSum: GPU CPU 
AssignVariableOp: GPU CPU 
ReadVariableOp: GPU CPU 
StridedSlice: CPU 
NoOp: GPU CPU 
Mul: GPU CPU 
Shape: GPU CPU 
_Arg: GPU CPU 
Unique: GPU CPU 
ResourceScatterAdd: GPU CPU 
AddV2: GPU CPU 
ResourceGather: GPU CPU 
Const: GPU CPU 

Colocation members, user-requested devices, and framework assigned devices, if any:
  tf_bert_for_question_answering_bert_embeddings_gather_resource (_Arg)  framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
  adam_adam_update_readvariableop_resource (_Arg)  framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
  adam_adam_update_readvariableop_2_resource (_Arg)  framework assigned device=/job:localhost/replica:0/task:0/device:GPU:0
  tf_bert_for_question_answering/bert/embeddings/Gather (ResourceGather) 
  Adam/Adam/update/Unique (Unique) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/Shape (Shape) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/strided_slice/stack (Const) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/strided_slice/stack_1 (Const) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/strided_slice/stack_2 (Const) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/strided_slice (StridedSlice) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/UnsortedSegmentSum (UnsortedSegmentSum) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/mul (Mul) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/ReadVariableOp (ReadVariableOp) 
  Adam/Adam/update/mul_1 (Mul) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/AssignVariableOp (AssignVariableOp) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/ResourceScatterAdd (ResourceScatterAdd) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/ReadVariableOp_1 (ReadVariableOp) 
  Adam/Adam/update/mul_2 (Mul) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/mul_3 (Mul) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/ReadVariableOp_2 (ReadVariableOp) 
  Adam/Adam/update/mul_4 (Mul) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/AssignVariableOp_1 (AssignVariableOp) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/ResourceScatterAdd_1 (ResourceScatterAdd) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/ReadVariableOp_3 (ReadVariableOp) 
  Adam/Adam/update/Sqrt (Sqrt) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/mul_5 (Mul) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/add (AddV2) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/truediv (RealDiv) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/AssignSubVariableOp (AssignSubVariableOp) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/group_deps/NoOp (NoOp) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/group_deps/NoOp_1 (NoOp) /job:localhost/replica:0/task:0/device:GPU:0
  Adam/Adam/update/group_deps (NoOp) /job:localhost/replica:0/task:0/device:GPU:0

	 [[{{node tf_bert_for_question_answering/bert/embeddings/Gather}}]] [Op:__inference_train_function_18197]

# Model evaluation

In [None]:
#Getting the predictions of the test set with the model
raw_predictions = model.predict(test_set)

In [None]:
def postprocess_qa_predictions(
    examples,
    features,
    all_start_logits,
    all_end_logits,
    n_best_size=20,
    max_answer_length=30,
    with_score=False,
):
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(
        f"Post-processing {len(examples)} example predictions split into {len(features)} features."
    )

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None  # Only used if squad_v2 is True.
        valid_answers = []

        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map some the positions in our logits to span of texts in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(
                tokenizer.cls_token_id
            )
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greater start and end logits.
            start_indexes = np.argsort(start_logits)[
                -1 : -n_best_size - 1 : -1
            ].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or not offset_mapping[start_index]
                        or not offset_mapping[end_index]
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue
                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char:end_char],
                        }
                    )

        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[
                0
            ]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}

        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not squad_v2:
            if not with_score:
              predictions[example["id"]] = best_answer["text"]
            else:
              predictions[example["id"]] = (best_answer["text"], best_answer["score"])
        else:
            answer = (
                best_answer["text"] if best_answer["score"] > min_null_score else ""
            )
            predictions[example["id"]] = answer

    return predictions

In [None]:
final_predictions = postprocess_qa_predictions(
    datasets["test"],
    tokenized_datasets["test"],
    raw_predictions["start_logits"],
    raw_predictions["end_logits"],
)

In [None]:
metric = load_metric("squad_v2" if squad_v2 else "squad")

formatted_predictions = [
        {"id": k, "prediction_text": v} for k, v in final_predictions.items()
]
references = [
    {"id": ex["id"], "answers": ex["answers"]} for ex in datasets["test"]
]
metric.compute(predictions=formatted_predictions, references=references)

## [Task 7] Error Analysis

Perform a simple and short error analysis as follows:
* Group dialogues by ```source``` and report the worst 5 model errors for each source (w.r.t. SQUAD F1-score).
* Inspect observed results and try to provide some comments (e.g., do the models make errors when faced with a particular question type?)$^1$

$^1$ Check the [paper](https://arxiv.org/pdf/1808.07042.pdf) for some valuable information about question/answer types (e.g., Table 6, Table 8) 

In [None]:
final_predictions_score = postprocess_qa_predictions(
    datasets["test"],
    tokenized_datasets["test"],
    raw_predictions["start_logits"],
    raw_predictions["end_logits"],
    with_score=True
)
final_predictions_score = list(final_predictions_score.items())

In [None]:
final_predictions_score.sort(key = lambda final_predictions : final_predictions[1][1])
print(final_predictions_score)
print(len(final_predictions_score))

In [None]:
x = 5
i = 0
sources = []

while len(sources) < x:

  j = 0
  
  while True:
    
    # I guess this loop takes a lot of time but I don't know how to do it faster
    if final_predictions_score[i][0] == datasets["test"]['id'][j]:
      break

    j += 1

  i += 1

  # If we already had a question from this source, we have to skip it
  seen = False
  for name in sources:
    if name == datasets["test"]["source"][j]:
      seen = True
      break
  
  if seen:
    continue
  else:
    sources.append(datasets["test"]["source"][j])

  print("Data:")
  print(datasets["test"]['context'][j])
  print(datasets["test"]['question'][j])
  print(datasets["test"]['answers'][j]['text'])

  print("Answer:")
  print(final_predictions_score[i][1][0], final_predictions_score[i][1][1])
  print("End")
  print("-----------")


# Assignment Evaluation

The following assignment points will be awarded for each task as follows:

* Task 1, Pre-processing $\rightarrow$ 0.5 points.
* Task 2, Dataset Splitting $\rightarrow$ 0.5 points.
* Task 3 and 4, Models Definition $\rightarrow$ 1.0 points.
* Task 5 and 6, Models Training and Evaluation $\rightarrow$ 2.0 points.
* Task 7, Analysis $\rightarrow$ 1.0 points.
* Report $\rightarrow$ 1.0 points.

**Total** = 6 points <br>

We may award an additional 0.5 points for outstanding submissions. 
 
**Speed Bonus** = 0.5 extra points <br>

# Report

We apply the rules described in Assignment 1 regarding the report.
* Write a clear and concise report following the given overleaf template (**max 2 pages**).
* Report validation and test results in a table.$^1$
* **Avoid reporting** code snippets or copy-paste terminal outputs $\rightarrow$ **Provide a clean schema** of what you want to show

# Comments and Organization

Remember to properly comment your code (it is not necessary to comment each single line) and don't forget to describe your work!

Structure your code for readability and maintenance. If you work with Colab, use sections. 

This allows you to build clean and modular code, as well as easy to read and to debug (notebooks can be quite tricky time to time).

# FAQ (READ THIS!)

---

**Question**: Does Task 3 also include data tokenization and conversion step?

**Answer:** Yes! These steps are usually straightforward since ```transformers``` also offers a specific tokenizer for each model.

**Example**: 

```
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_text = tokenizer(text)
%% Alternatively
inputs = tokenizer.tokenize(text, add_special_tokens=True, max_length=min(max_length, 512))
input_ids, attention_mask = inputs['input_ids'], inputs['attention_mask']
```

**Suggestion**: Hugginface's documentation is full of tutorials and user-friendly APIs.

---
---

**Question**: I'm hitting **out of memory error** when training my models, do you have any suggestions?

**Answer**: Here are some common workarounds:

1. Try decreasing the mini-batch size
2. Try applying a different padding strategy (if you are applying padding): e.g. use quantiles instead of maximum sequence length

---
---

# Contact

For any doubt, question, issue or help, you can always contact us at the following email addresses:

Teaching Assistants:

* Andrea Galassi -> a.galassi@unibo.it
* Federico Ruggeri -> federico.ruggeri6@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# The End!

Questions?