<a href="https://colab.research.google.com/github/pedro-pauletti/nlp-with-transformers/blob/main/Question_and_Answering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Requirements

In [34]:
!pip install transformers


### Intro to SQuAD 2.0

SQuAD (the Stanford Question Answering Dataset) is a hugely popular dataset of question-answer pairs based on Wikipedia articles, covering topics ranging from Beyoncé to physics. It is one of the most widely used Q&A datasets, so it is a natural choice here. Let's explore it.

First, we need to download the data. There are two JSON files we are interested in, train and dev, which we can download over HTTP. We will store the SQuAD data in the ./data/squad directory, creating it if it does not already exist.

In [1]:
url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/'
files = ['train-v2.0.json', 'dev-v2.0.json']

In [2]:
import os

squad_dir = './data/squad'

In [3]:
# exist_ok=True avoids an error if the directory is already there
os.makedirs(squad_dir, exist_ok=True)

In [4]:
import requests

In [8]:
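for file in files:
  # Stream each file to disk instead of holding the whole body in memory
  res = requests.get(url + file, stream=True)
  res.raise_for_status()  # stop early on an HTTP error
  with open(os.path.join(squad_dir, file), 'wb') as fp:
    for chunk in res.iter_content(chunk_size=8192):
      fp.write(chunk)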
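
Re-running the download cell fetches both files again. A minimal sketch, assuming a hypothetical helper `missing_files` (not part of the original notebook), of how one might list only the files that are not yet on disk before calling `requests.get`. The demonstration uses a throwaway temporary directory, so it does not touch the real data folder.

```python
import os
import tempfile

def missing_files(directory, filenames):
  """Return the names in `filenames` that do not yet exist in `directory`."""
  return [name for name in filenames
          if not os.path.exists(os.path.join(directory, name))]

# Demonstration in a temporary directory, so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
  names = ['train-v2.0.json', 'dev-v2.0.json']
  # Pretend the training file was downloaded earlier.
  with open(os.path.join(tmp_dir, names[0]), 'w') as fp:
    fp.write('{}')
  still_needed = missing_files(tmp_dir, names)

print(still_needed)
```

Looping over `missing_files(squad_dir, files)` instead of `files` would then skip anything already downloaded.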

In [9]:
import json

with open(os.path.join(squad_dir, files[0]), 'rb') as f:
  squad = json.load(f)

In [10]:
squad['data'][0]['paragraphs'][0]

{'qas': [{'question': 'When did Beyonce start becoming popular?',
   'id': '56be85543aeaaa14008c9063',
   'answers': [{'text': 'in the late 1990s', 'answer_start': 269}],
   'is_impossible': False},
  {'question': 'What areas did Beyonce compete in when she was growing up?',
   'id': '56be85543aeaaa14008c9065',
   'answers': [{'text': 'singing and dancing', 'answer_start': 207}],
   'is_impossible': False},
  {'question': "When did Beyonce leave Destiny's Child and become a solo singer?",
   'id': '56be85543aeaaa14008c9066',
   'answers': [{'text': '2003', 'answer_start': 526}],
   'is_impossible': False},
  {'question': 'In what city and state did Beyonce  grow up? ',
   'id': '56bf6b0f3aeaaa14008c9601',
   'answers': [{'text': 'Houston, Texas', 'answer_start': 166}],
   'is_impossible': False},
  {'question': 'In which decade did Beyonce become famous?',
   'id': '56bf6b0f3aeaaa14008c9602',
   'answers': [{'text': 'late 1990s', 'answer_start': 276}],
   'is_impossible': False},
  {'q
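
The output above shows the nesting: `data` is a list of articles, each article has `paragraphs`, and each paragraph has a list of `qas`. The sketch below walks that nesting on a tiny hand-made record in the same shape and counts the questions, including those flagged `is_impossible`. The toy record and the `count_questions` helper are illustrative assumptions, not taken from the dataset or the notebook.

```python
def count_questions(squad_dict):
  """Count all questions and the ones flagged is_impossible."""
  total = 0
  impossible = 0
  for group in squad_dict['data']:
    for paragraph in group['paragraphs']:
      for qa_pair in paragraph['qas']:
        total += 1
        if qa_pair.get('is_impossible', False):
          impossible += 1
  return total, impossible

# Hand-made toy record mimicking the SQuAD 2.0 layout (not real data).
toy = {'data': [{'paragraphs': [{
    'context': 'Paris is the capital of France.',
    'qas': [
        {'question': 'What is the capital of France?',
         'answers': [{'text': 'Paris', 'answer_start': 0}],
         'is_impossible': False},
        {'question': 'What is the capital of Spain?',
         'answers': [],
         'plausible_answers': [{'text': 'Paris', 'answer_start': 0}],
         'is_impossible': True},
    ]}]}]}

total, impossible = count_questions(toy)
print(total, impossible)
```

Calling `count_questions(squad)` on the loaded training file would give the same two counts for the full dataset.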

### Processing SQuAD Training Data

In [11]:
new_squad = []

In [12]:
for group in squad['data']:
  for paragraph in group['paragraphs']:
    context = paragraph['context']
    for qa_pair in paragraph['qas']:
      question = qa_pair['question']
      if 'answers' in qa_pair.keys() and len(qa_pair['answers']) > 0:
        answer =  qa_pair['answers'][0]['text']
      elif 'plausible_answers' in qa_pair.keys() and len(qa_pair['plausible_answers']) > 0:
        answer =  qa_pair['plausible_answers'][0]['text']
      else:
        answer = None
      new_squad.append({
          'question': question,
          'answer': answer,
          'context': context
      })
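
The branching above can be checked in isolation. A small sketch, assuming a hypothetical helper `first_answer` and hand-made question records (neither is in the original notebook), covering the three cases: a real answer, only a plausible answer, and no answer at all.

```python
def first_answer(qa_pair):
  """Return the first answer text, falling back to plausible_answers, else None."""
  for key in ('answers', 'plausible_answers'):
    candidates = qa_pair.get(key) or []
    if candidates:
      return candidates[0]['text']
  return None

# Hand-made records in the shape of SQuAD question entries.
examples = [
    {'question': 'q1',
     'answers': [{'text': 'in the late 1990s', 'answer_start': 269}]},
    {'question': 'q2',
     'answers': [],
     'plausible_answers': [{'text': 'a guess', 'answer_start': 0}]},
    {'question': 'q3', 'answers': []},
]

results = [first_answer(qa_pair) for qa_pair in examples]
print(results)
```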

In [13]:
new_squad[:2], new_squad[-2:]

([{'question': 'When did Beyonce start becoming popular?',
   'answer': 'in the late 1990s',
   'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'},
  {'question': 'What areas did Beyonce compete in when she was growing up?',
   'answer': 'singing and dancing',
   'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born

In [14]:
with open(os.path.join(squad_dir, 'train.json'), 'w') as f:
  json.dump(new_squad, f)

### (Optional) Processing SQuAD Training Data with Match-Case

In [15]:
match "test":
  case "test":
    print(True)

True


In [None]:
# Start from an empty list so results from the earlier cell are not duplicated
new_squad = []

for group in squad['data']:
  for paragraph in group['paragraphs']:
    context = paragraph['context']
    for qa_pair in paragraph['qas']:
      question = qa_pair['question']
      match qa_pair:
        # `*_` lets the list hold more than one answer; the first one is bound
        case {'answers': [{'text': answer}, *_]}:
          pass
        case {'plausible_answers': [{'text': answer}, *_]}:
          pass
        case _:
          answer = None
      new_squad.append({
          'question': question,
          'answer': answer,
          'context': context
      })

### Task dev-v2.0.json

In [16]:
import json

with open(os.path.join(squad_dir, files[1]), 'rb') as f:
  squad = json.load(f)

In [17]:
squad['data'][0]['paragraphs'][0]

{'qas': [{'question': 'In what country is Normandy located?',
   'id': '56ddde6b9a695914005b9628',
   'answers': [{'text': 'France', 'answer_start': 159},
    {'text': 'France', 'answer_start': 159},
    {'text': 'France', 'answer_start': 159},
    {'text': 'France', 'answer_start': 159}],
   'is_impossible': False},
  {'question': 'When were the Normans in Normandy?',
   'id': '56ddde6b9a695914005b9629',
   'answers': [{'text': '10th and 11th centuries', 'answer_start': 94},
    {'text': 'in the 10th and 11th centuries', 'answer_start': 87},
    {'text': '10th and 11th centuries', 'answer_start': 94},
    {'text': '10th and 11th centuries', 'answer_start': 94}],
   'is_impossible': False},
  {'question': 'From which countries did the Norse originate?',
   'id': '56ddde6b9a695914005b962a',
   'answers': [{'text': 'Denmark, Iceland and Norway', 'answer_start': 256},
    {'text': 'Denmark, Iceland and Norway', 'answer_start': 256},
    {'text': 'Denmark, Iceland and Norway', 'answer_star
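
Each `answer_start` is a character offset into the paragraph's `context`, so slicing the context at that offset should reproduce the answer text. A short check, using the opening sentence of the paragraph these questions belong to and two of the answers shown above (the `span_matches` helper is hypothetical):

```python
context = ('The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) '
           'were the people who in the 10th and 11th centuries gave their name '
           'to Normandy, a region in France.')

def span_matches(context, answer):
  """True if the context, sliced at answer_start, equals the answer text."""
  start = answer['answer_start']
  end = start + len(answer['text'])
  return context[start:end] == answer['text']

sample_answers = [
    {'text': 'France', 'answer_start': 159},
    {'text': 'in the 10th and 11th centuries', 'answer_start': 87},
]

checks = [span_matches(context, answer) for answer in sample_answers]
print(checks)
```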

In [24]:
new_squad = []

In [27]:
for group in squad['data']:
  for paragraph in group['paragraphs']:
    context = paragraph['context']
    for qa_pair in paragraph['qas']:
      question = qa_pair['question']
      answers = []
      if 'answers' in qa_pair.keys() and len(qa_pair['answers']) > 0:
        for answer in qa_pair['answers']:
          if answer['text'] not in answers:
            answers.append(answer['text'])
      elif 'plausible_answers' in qa_pair.keys() and len(qa_pair['plausible_answers']) > 0:
        for answer in qa_pair['plausible_answers']:
          if answer['text'] not in answers:
            answers.append(answer['text'])
      else:
        answers = []
      new_squad.append({
          'question': question,
          'answer': answers,
          'context': context
      })

In [28]:
new_squad[:2], new_squad[-2:]

([{'question': 'In what country is Normandy located?',
   'answer': ['France'],
   'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.'},
  {'question': 'When were the Normans in Normandy?',
   'answer': ['10th and 11th centuries', 'in the 10th and 11th centuries'],
   'context': 'The Normans (No

#### Solution

In [30]:
new_squad = []

for group in squad['data']:
  for paragraph in group['paragraphs']:
    context = paragraph['context']
    for qa_pair in paragraph['qas']:
      question = qa_pair['question']
      if 'answers' in qa_pair.keys() and len(qa_pair['answers']) > 0:
        answer_list =  qa_pair['answers']
      elif 'plausible_answers' in qa_pair.keys() and len(qa_pair['plausible_answers']) > 0:
        answer_list =  qa_pair['plausible_answers']
      else:
        answer_list = []

      answer_list = [item['text'] for item in answer_list]
      # Remove duplicates, keeping first-seen order
      answer_list = list(dict.fromkeys(answer_list))

      for answer in answer_list:
        new_squad.append({
            'question': question,
            'answer': answer,
            'context': context
        })
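
`list(set(...))` removes duplicates but makes no promise about the order of what remains. If order matters, `dict.fromkeys` keeps the first occurrence of each item in its original position. A minimal sketch using the answer texts of the second dev question:

```python
answer_texts = [
    '10th and 11th centuries',
    'in the 10th and 11th centuries',
    '10th and 11th centuries',
    '10th and 11th centuries',
]

# dict keys are unique and keep insertion order (Python 3.7+)
deduplicated = list(dict.fromkeys(answer_texts))
print(deduplicated)
```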

In [32]:
new_squad[:3]

[{'question': 'In what country is Normandy located?',
  'answer': 'France',
  'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.'},
 {'question': 'When were the Normans in Normandy?',
  'answer': '10th and 11th centuries',
  'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: No

In [33]:
with open(os.path.join(squad_dir, 'dev.json'), 'w') as f:
  json.dump(new_squad, f)

### Q&A Model

In [38]:
import json


with open('data/squad/dev.json', 'r') as f:
  squad = json.load(f)

In [36]:
from transformers import BertTokenizer, BertForQuestionAnswering

modelName = 'deepset/bert-base-cased-squad2'

tokenizer = BertTokenizer.from_pretrained(modelName)
model = BertForQuestionAnswering.from_pretrained(modelName)

Some weights of the model checkpoint at deepset/bert-base-cased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [40]:
from transformers import pipeline

In [42]:
qa = pipeline('question-answering', model=model, tokenizer=tokenizer)

In [43]:
squad[:2]

[{'question': 'In what country is Normandy located?',
  'answer': 'France',
  'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.'},
 {'question': 'When were the Normans in Normandy?',
  'answer': '10th and 11th centuries',
  'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: No

In [44]:
qa({
    'question': 'In what country is Normandy located?',
    'context' : 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.',
})

{'score': 0.9995271563529968, 'start': 159, 'end': 166, 'answer': 'France.'}

In [46]:
answers = []

for pair in squad[:5]:
  ans = qa({
      'question': pair['question'],
      'context': pair['context']
  })
  answers.append({
      'predicted': ans['answer'],
      'true': pair['answer']
  })

In [47]:
answers

[{'predicted': 'France.', 'true': 'France'},
 {'predicted': '10th and 11th centuries', 'true': '10th and 11th centuries'},
 {'predicted': '10th and 11th centuries',
  'true': 'in the 10th and 11th centuries'},
 {'predicted': 'Denmark, Iceland and Norway',
  'true': 'Denmark, Iceland and Norway'},
 {'predicted': 'Rollo,', 'true': 'Rollo'}]
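
Several predictions differ from the reference only by trailing punctuation ('France.' versus 'France'). A rough sketch of a normalised exact-match comparison, loosely modelled on the usual SQuAD-style normalisation (lowercase, strip punctuation and articles, collapse whitespace). The `normalize` and `exact_match` helpers are illustrative assumptions and not part of the original notebook.

```python
import re
import string

def normalize(text):
  """Lowercase, drop punctuation and English articles, collapse whitespace."""
  text = text.lower()
  text = ''.join(ch for ch in text if ch not in string.punctuation)
  text = re.sub(r'\b(a|an|the)\b', ' ', text)
  return ' '.join(text.split())

def exact_match(predicted, true):
  """Return 1 if the normalised strings are identical, else 0."""
  return int(normalize(predicted) == normalize(true))

# The five predicted/true pairs from the output above.
pairs = [
    {'predicted': 'France.', 'true': 'France'},
    {'predicted': '10th and 11th centuries', 'true': '10th and 11th centuries'},
    {'predicted': '10th and 11th centuries',
     'true': 'in the 10th and 11th centuries'},
    {'predicted': 'Denmark, Iceland and Norway',
     'true': 'Denmark, Iceland and Norway'},
    {'predicted': 'Rollo,', 'true': 'Rollo'},
]

scores = [exact_match(pair['predicted'], pair['true']) for pair in pairs]
print(scores, sum(scores) / len(scores))
```

The third pair still counts as a miss because the dev file was flattened to one row per distinct answer, so the same prediction is scored against each reference separately.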