<p> <center> <a href="../Start_Here.ipynb">Home Page</a> </center> </p>

 
<div>
    <span style="float: left; width: 49%; text-align: right;">
        <a >1</a>
        <a href="General_preprocessing.ipynb">2</a>
        <a href="QandA_data_processing.ipynb">3</a>
        <a href="Exercise.ipynb">4</a>
        <a href="Summary.ipynb">5</a>
    </span>
    <span style="float: left; width: 49%; text-align: right;"><a href="General_preprocessing.ipynb">Next Notebook</a></span>
</div>

#  Data Preprocessing Lab 
---

<p>This lab is part of the End-to-End NLP solution and focuses on data processing in the Question Answering (QA) domain. It covers how to use libraries and automated software to preprocess text data and convert it to the SQuAD dataset format. It also highlights common text preprocessing methods such as tokenization, normalization, noise removal, and stemming. The goal of the lab is to teach learners how to preprocess text data for the Question Answering domain.</p>

### Expected Knowledge
By the end of this lab, learners are expected to know how to build a text dataset in the `SQuAD` data format for any task in the `Question Answering` domain, and to have acquired the skills to preprocess raw text using well-known techniques.

## Overview of Question Answering Dataset 

The goal of this notebook is to explore three major large-scale QA datasets and learn about their internal structure.

### Introduction to NLP Question Answering  System  

NLP Question Answering (QA) is an information retrieval task in which a question is posed to a system and the system returns a corresponding answer. The QA system does this by retrieving the answer from a given context, such as a text or document. QA is considered `open domain` when the context (text dataset) covers several domains such as entertainment, art and culture, legal documents, or weather information. When it is restricted to a single domain, it is considered `closed domain`.


<img src="images/QA_illustration.png" width="800px" height="800px"/>
<center>Figure 1: An example of extractive QA</center>

Based on the input and output pattern, there are three types of QA:
- Extractive QA - extracts answers from a text or document referred to as the context.
- Open Generative QA - generates free-form answers using the given context.
- Closed Generative QA - generates answers without any given context.

Our focus is on Extractive QA, including examples of such datasets and how to build custom datasets for it. Before proceeding, let’s look at several existing QA datasets.

Please run the cell below to download the dataset for this lab.


In [None]:
!python3 ../source_code/dataset.py

## Brief on QA Dataset

### Stanford Question Answering Dataset (SQuAD) 
The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset that contains questions posed by crowdworkers on a set of Wikipedia articles. These questions are answerable from a text paragraph known as the context. In SQuAD 2.0, answers to some questions do not exist within the context, so those questions are unanswerable. The previous version of the dataset, `SQuAD 1.1`, contains 100k+ question-answer pairs on 500+ articles. The latest version, `SQuAD 2.0`, combines the questions from SQuAD 1.1 with more than 50k unanswerable questions written by crowdworkers in an adversarial manner to look similar to answerable ones. The official `SQuAD 2.0` dataset is split into train, dev, and test sets; only the train and dev sets are publicly available. It is distributed under the [CC BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/legalcode) and can be downloaded [here](https://rajpurkar.github.io/SQuAD-explorer/). Run the cell below to display the article on SQuAD by P. Rajpurkar et al. (Figure 2).


In [None]:
from IPython.display import IFrame
IFrame("../source_code/data/1806.03822.pdf", 960,900)

<center><a href="https://rajpurkar.github.io/SQuAD-explorer/"> Figure 2: Download SQuAD 2.0 dataset and article here </a></center>

**Extract from SQuAD 2.0 dataset**

```python
{'qas': [{'question': 'What fraction of New Yorkers in the private sector are employed by foreign companies?',
   'id': '56cf4722aab44d1400b88f06',
   'answers': [{'text': 'One out of ten', 'answer_start': 113}],
   'is_impossible': False},
  {'question': 'What publication ranked New York first in the 2013 American Cities of the Future rankings?',
   'id': '56cf4722aab44d1400b88f07',
   'answers': [{'text': 'FDi Magazine', 'answer_start': 372}],
   'is_impossible': False}],
 'context': 'Many Fortune 500 corporations are headquartered in New York City, as are a large number of foreign corporations. One out of ten private sector jobs in the city is with a foreign company. New York City has been ranked first among cities across the globe in attracting capital, business, and tourists. This ability to attract foreign investment helped New York City top the FDi Magazine American Cities of the Future ranking for 2013.'}
```


#### Data format
- **version**: represents the version of the SQuAD JSON dataset
- **data**: contains the actual data that includes titles and `paragraphs`
- **title**: represents the domain or topic of discussion, or the title of the document or webpage from which the text in `paragraphs` is drawn
- **paragraphs**: contains a list of `qas` and `context`
- **qas**: a list that contains the questions `(question)`, a unique id for each question `(id)`, and the corresponding answers `(answers)`. If a question is impossible to answer, the `is_impossible` flag is set to True; otherwise, it is set to False. In the answers list, `text` is the answer to the question, while `answer_start` is the character index where the answer starts within the context.
- **context**: represents a sentence or group of sentences in which the answer(s) to the question(s) lie. A single context can have one or more questions. In the example given above, two questions are drawn from a single context.
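
The relationship between `text`, `answer_start`, and `context` can be checked with a few lines of code. The sketch below is only an illustration: the helper name `check_answer_spans` and the toy paragraph are hypothetical and not part of the lab's source code. It assumes `answer_start` is a character offset into `context`, as in the extract above.

```python
def check_answer_spans(paragraph):
    """Return (question id, expected text, text found) for every mismatched span."""
    context = paragraph["context"]
    mismatches = []
    for qa in paragraph["qas"]:
        for answer in qa["answers"]:
            start = answer["answer_start"]
            # Slice the context at the recorded offset and compare with the answer text.
            found = context[start:start + len(answer["text"])]
            if found != answer["text"]:
                mismatches.append((qa["id"], answer["text"], found))
    return mismatches


# Toy paragraph in the same shape as the extract above (hypothetical content).
context = "SQuAD was released by Stanford. It is built from Wikipedia articles."
paragraph = {
    "context": context,
    "qas": [
        {"id": "q1", "question": "Who released SQuAD?",
         "answers": [{"text": "Stanford", "answer_start": context.index("Stanford")}],
         "is_impossible": False},
        {"id": "q2", "question": "What is SQuAD built from?",
         # Deliberately wrong offset so the check has something to report.
         "answers": [{"text": "Wikipedia articles", "answer_start": 0}],
         "is_impossible": False},
    ],
}

print(check_answer_spans(paragraph))
```

A check like this is useful when building a custom dataset, because a wrong `answer_start` silently points the model at the wrong span.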

Run the cell below to load the training `JSON` dataset for `SQuAD 2.0`. Details on this dataset will be discussed in the next notebook.


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing
import json # to read json

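verbose = 1
input_file_path = '../source_code/data/train-v2.0.json'
if verbose:
    print("Reading the json file")
# Use a context manager so the file handle is closed after reading
with open(input_file_path, encoding='utf-8') as f:
    file = json.load(f)
if verbose:
    print("processing...")

# Flatten all paragraphs (each holds a context and its qas) across every title
data_row = [row for topic in file['data'] for row in topic['paragraphs']]
print("total qas & context: ", len(data_row))
data_row[500]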


In [None]:
# test with index 0 & 100: data_row[0] & data_row[100]
data_row[0]
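
The nested structure explored above is often easier to inspect as one flat record per question. The following is a minimal sketch of that idea under stated assumptions: `flatten_squad` and `toy_squad` are hypothetical names, and only the first answer of each question is kept for brevity.

```python
def flatten_squad(squad):
    """Turn a SQuAD-style dict into one flat record per question."""
    records = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                answers = qa.get("answers", [])
                first = answers[0] if answers else None  # unanswerable questions have no answers
                records.append({
                    "title": article["title"],
                    "context": paragraph["context"],
                    "id": qa["id"],
                    "question": qa["question"],
                    "is_impossible": qa.get("is_impossible", False),
                    "answer_text": first["text"] if first else None,
                    "answer_start": first["answer_start"] if first else None,
                })
    return records


# Tiny stand-in for the real file (hypothetical content).
toy_squad = {
    "version": "v2.0",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "Paris is the capital of France.",
            "qas": [
                {"id": "a1", "question": "What is the capital of France?",
                 "answers": [{"text": "Paris", "answer_start": 0}],
                 "is_impossible": False},
                {"id": "a2", "question": "What is the capital of Spain?",
                 "answers": [], "is_impossible": True},
            ],
        }],
    }],
}

records = flatten_squad(toy_squad)
for record in records:
    print(record["id"], record["question"], record["answer_text"])
```

Because each record is a plain dictionary, the list can be passed to `pd.DataFrame(records)` if a tabular view is preferred.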

**List of Titles**

There are 442 topics/domains (titles) in the SQuAD 2.0 training JSON dataset, each with its own list of paragraphs. Across all titles, the dataset contains 19,035 paragraph entries, each pairing a `context` (text) with its `qas` (questions and answers).

```python
['Beyoncé',
 'Frédéric_Chopin',
 'Genome',
 'Comprehensive_school',
 'Republic_of_the_Congo',
 'Prime_minister',
 'Institute_of_technology',
 'Wayback_Machine',
 'Dutch_Republic',
 'Symbiosis',
 'Canadian_Armed_Forces',
 'Cardinal_(Catholicism)',
 ............................. 
 'House_music',
 'Letter_case',
 'Chihuahua_(state)',
 'Imamah_(Shia_doctrine)',
 'Pitch_(music)',
 .....................
 'Infection',
 'Hunting',
 'Kathmandu',
 'Myocardial_infarction',
 'Matter']
```

In [None]:
titles = [row['title'] for row in file['data']]
paragraphs = [row['paragraphs'] for row in file['data']]

# Both lists have one entry per title; each entry of `paragraphs` is itself a list
print("total number of paragraph lists:", len(paragraphs))
print("total number of titles:", len(titles))
titles
# see the last two titles
#file['data'][440:]
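
To see how titles, paragraphs, and questions relate, it can help to count each level separately. The sketch below is illustrative only: `squad_counts` and `toy_squad` are hypothetical names, and the toy data stands in for the real training file.

```python
def squad_counts(squad):
    """Count titles, paragraphs, and questions in a SQuAD-style dict."""
    paragraphs = [p for article in squad["data"] for p in article["paragraphs"]]
    questions = [qa for p in paragraphs for qa in p["qas"]]
    unanswerable = sum(1 for qa in questions if qa.get("is_impossible", False))
    return {
        "titles": len(squad["data"]),
        "paragraphs": len(paragraphs),
        "questions": len(questions),
        "unanswerable": unanswerable,
    }


def _qa(qid, impossible=False):
    # Helper that builds a minimal question entry for the toy data.
    return {"id": qid, "question": "?", "answers": [], "is_impossible": impossible}


toy_squad = {
    "data": [
        {"title": "A", "paragraphs": [{"context": "c1", "qas": [_qa("1"), _qa("2", True)]},
                                      {"context": "c2", "qas": [_qa("3")]}]},
        {"title": "B", "paragraphs": [{"context": "c3", "qas": [_qa("4")]}]},
    ]
}

counts = squad_counts(toy_squad)
print(counts)
```

With the training file loaded as in the earlier cell, calling `squad_counts(file)` would report the same kind of totals for the full dataset.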

### Natural Questions (NQ)
The Natural Questions (NQ) corpus is a large-scale dataset from Google that targets `open-domain question answering systems`. It contains questions issued to the Google search engine, with long and short answers annotated from Wikipedia pages. The full dataset is 42GB, including the HTML of the Wikipedia pages, and contains 307k training examples plus 8k examples each for the development and test sets. The simplified version of the NQ training dataset is 4GB and can be downloaded [here](https://ai.google.com/research/NaturalQuestions/download). The article on Natural Questions by `T. Kwiatkowski et al.` can be found [here](https://ai.googleblog.com/2019/01/natural-questions-new-corpus-and.html).


<img src="images/nq_paper.PNG" width="650px" height="300px" />

Figure 3: Natural Questions on the Google AI Blog. Natural Questions is released under the [Creative Commons Share-Alike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.


Each example in NQ contains a document paragraph `(document_text)`, long answer candidates `(long_answer_candidates)`, a question `(question_text)`, `annotations`, `document_url`, and `example_id`. A training example from the simplified version `(v1.0-simplified_simplified-nq-train.jsonl.gz)` is shown below:


```python
{'document_text': "Email marketing - Wikipedia <H1> Email marketing </H1> Jump to : navigation , search <Table> <Tr> <Td> </Td> <Td> ( hide ) This article has multiple issues . Please help improve it or discuss these issues on the talk page . ( Learn how and when to remove these template messages ) <Table> <Tr> <Td> </Td> <Td> This article needs additional citations for verification . Please help improve this article by adding citations to reliable sources . Unsourced material may be challenged and removed . ( September 2014 ) ( Learn how and when to remove this template message ) </Td> </Tr> </Table> <Table> <Tr> <Td> </Td> <Td> This article possibly contains original research . Please improve it by verifying the claims made and adding inline citations . Statements consisting only of original research should be removed . ( January 2015 ) ( Learn how and when to remove this template message ) </Td> </Tr> </Table> ( Learn how and when to remove this template message ) </Td> </Tr> </Table> <Table> <Tr> <Td> Part of a series on </Td> </Tr> <Tr> <Th> Internet marketing </Th> </Tr> <Tr> <Td> <Ul> <Li> Search engine optimization </Li> <Li> Local search engine optimisation </Li> <Li> Social media marketing </Li>........
This email resulted in $13 million worth of sales in DEC products , and highlighted the potential of marketing through mass emails . However , as email marketing developed as an effective means of direct communication , users began blocking out content from emails with filters and blocking programs . In order to effectively communicate a message through email , marketers had to develop a way of pushing content through to the end user , without being cut out by automatic filters and spam removing software .....   
</Li> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> </Ul> <Ul> <Li> </Li> <Li> </Li> </Ul>",

'long_answer_candidates': [{'start_token': 14, 'top_level': True, 'end_token': 170}, {'start_token': 15, 'top_level': False, 'end_token': 169}, {'start_token': 52, 'top_level': False, 'end_token': 103}, {'start_token': 53, 'top_level': False, 'end_token': 102}, {'start_token': 103, 'top_level': False, 'end_token': 156}, {'start_token': 104, 'top_level': False, 'end_token': 155}, {'start_token': 170, 'top_level': True, 'end_token': 321}, {'start_token': 171, 'top_level': False, 'end_token': 180}, {'start_token': 180, 'top_level': False, 'end_token': 186}, {'start_token': 186, 'top_level': False, 'end_token': 224}, {'start_token': 188, 'top_level': False, 'end_token': 222}, {'start_token': 189, 'top_level': False,.... }], 

'question_text': 'which is the most common use of opt-in e-mail marketing', 
'annotations': [{'yes_no_answer': 'NONE', 'long_answer': {'start_token': 1952, 'candidate_index': 54, 'end_token': 2019}, 'short_answers': [{'start_token': 1960, 'end_token': 1969}], 'annotation_id': 593165450220027640}],

'document_url': 'https://en.wikipedia.org//w/index.php?title=Email_marketing&amp;oldid=814071202',

'example_id': 5655493461695504401}

```

Please run the cell below to explore the dataset. You can increase the value of the count to see more samples. 


In [None]:
import json
import gzip
import pandas as pd


count = 2      # number of rows to display; increase to see more samples
num_row = 0
input_file_path = '../source_code/data/v1.0-simplified_simplified-nq-train.jsonl.gz'

data = []      # collects the parsed rows for the optional DataFrame below
# Open in text mode so each line is decoded from UTF-8 automatically
with gzip.open(input_file_path, 'rt', encoding='utf-8') as file:
    for line in file:
        data_rows = json.loads(line)   # each line is one JSON-encoded example
        data.append(data_rows)

        print(data_rows)
        num_row += 1
        if num_row == count:
            break

# Optional: view the collected rows as a table
#column = ['document_text','long_answer_candidates','question_text','annotations', 'document_url','example_id']
#df = pd.DataFrame(data=data, columns=column)
#df.head()

#### Data format
- **document_text**: represents the paragraph or context used to infer answers to questions. The document_text may contain text, HTML bounding boxes, and table rows from Wikipedia.
- **long_answer_candidates**: represents a list of nested long answers with start and end tokens that indicate the span of each possible answer. A token can be a word or an HTML tag that defines a heading, paragraph, table, or list. The `top_level` flag takes a Boolean value to indicate whether a candidate answer is nested below another. When it is nested, the value is set to False `('top_level': False)`; when it is not, it is set to True `('top_level': True)`.
- **question_text**: represents the question that requires an answer.
- **annotations**: records in the `yes_no_answer` field whether the question has a `yes` or `no` answer; otherwise the field is set to `NONE`. It also defines the start and end tokens for the short and long answers, while `candidate_index` is the index of the long answer in `long_answer_candidates`.
- **document_url**: indicates the Wikipedia URL from which the answers were drawn.
- **example_id**: denotes the unique id for the example within the dataset.
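
Since answers in NQ are stored as token spans rather than text, recovering the answer string means slicing the document tokens. The sketch below assumes that tokens in the simplified format are obtained by splitting `document_text` on single spaces and that `end_token` is exclusive; both are assumptions worth verifying against the real data. The helper names and the toy example are hypothetical.

```python
def token_span_text(document_text, start_token, end_token):
    """Return the text covered by tokens [start_token, end_token)."""
    tokens = document_text.split(" ")
    return " ".join(tokens[start_token:end_token])


def top_level_candidates(example):
    """Keep only long answer candidates that are not nested inside another."""
    return [c for c in example["long_answer_candidates"] if c["top_level"]]


# Toy example in the same shape as an NQ row (hypothetical content).
example = {
    "document_text": "<P> Email marketing is the act of sending a commercial message </P>",
    "long_answer_candidates": [
        {"start_token": 0, "end_token": 12, "top_level": True},
        {"start_token": 1, "end_token": 3, "top_level": False},
    ],
    "annotations": [{
        "yes_no_answer": "NONE",
        "long_answer": {"start_token": 0, "candidate_index": 0, "end_token": 12},
        "short_answers": [{"start_token": 1, "end_token": 3}],
    }],
}

annotation = example["annotations"][0]
short = annotation["short_answers"][0]
short_text = token_span_text(example["document_text"], short["start_token"], short["end_token"])
print("short answer:", short_text)
print("top-level candidates:", top_level_candidates(example))
```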



You can learn more by visiting the [page here](https://pythonlang.dev/repo/google-research-datasets-natural-questions/).


### Conversational Question Answering (CoQA)

Conversational Question Answering (CoQA) is a large-scale dataset for building conversational question-answering systems. The goal is to have a dataset that can measure the ability of machines to comprehend a text passage and correctly respond to a series of interconnected questions within a conversation. The dataset contains 127,000+ questions with answers collected from 8000+ conversations. In CoQA, the questions are conversational, the answers are free-form text, and the passages are collected from seven different domains. You can read more and download the CoQA dataset [here](https://stanfordnlp.github.io/coqa/).

<center><img src="images/coqa_tigram_prefixes.png" width="500px" height="500px" /></center>
<center> Figure 4: Distribution of trigram prefixes of questions in CoQA.  <a href="https://arxiv.org/pdf/1808.07042.pdf">  download source paper </a> by S. Reddy et al. 2019</center>

#### Data format
- **source**: represents the actual source of the story where the conversation was made.
- **id**: this is a unique identifier for each source or row in the dataset.
- **filename**: the name of the file that contains the source.
- **story**: the passage or paragraph where the conversation was initiated.
- **questions**: contains a list of question texts `(input_text)`, each with a unique id `(turn_id)` that maps to the `turn_id` of the corresponding answer.
- **answers**: contains a list of answers to the questions, with an indication of where each answer starts `(span_start)` and ends `(span_end)` within the `story`, and the actual text `(span_text)`.
- **name**: corresponds to `filename`.



```python

{'source': 'wikipedia',
 
 'id': '3zotghdk5ibi9cex97fepx7jetpso7',
 
 'filename': 'Vatican_Library.txt',
 
 'story': 'The Vatican Apostolic Library (), more commonly called the Vatican Library or simply the Vat, is the library of the Holy See, located in Vatican City. Formally established in 1475, although it is much older, it is one of the oldest libraries in the world and contains one of the most significant collections of historical texts. It has 75,000 codices from throughout history, as well as 1.1 million printed books, which include some 8,500 incunabula. \n\nThe Vatican Library is a research library for history, law, philosophy, science and theology. The Vatican Library is open to anyone who can document their qualifications and research needs. Photocopies for private study of pages from books published between 1801 and 1990 can be requested in person or by mail. \n\nIn March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts, to be made available online. \n\nThe Vatican Secret Archives were separated from the library at the beginning of the 17th century; they contain another 150,000 items. \n\nScholars have traditionally divided the history of the library into five periods, Pre-Lateran, Lateran, Avignon, Pre-Vatican and Vatican. \n\nThe Pre-Lateran period, comprising the initial days of the library, dated from the earliest days of the Church. Only a handful of volumes survive from this period, though some are very significant.',
 'questions': [{'input_text': 'When was the Vat formally opened?',
   'turn_id': 1},
  {'input_text': 'what is the library for?', 'turn_id': 2},
  {'input_text': 'for what subjects?', 'turn_id': 3},
  {'input_text': 'and?', 'turn_id': 4},
  {'input_text': 'what was started in 2014?', 'turn_id': 5},
  {'input_text': 'how do scholars divide the library?', 'turn_id': 6},
  ....................................................................             
  {'input_text': 'what will this allow?', 'turn_id': 20}],
 
 'answers': [{'span_start': 151,
   'span_end': 179,
   'span_text': 'Formally established in 1475',
   'input_text': 'It was formally established in 1475',
   'turn_id': 1},
  {'span_start': 454,
   'span_end': 494,
   'span_text': 'he Vatican Library is a research library',
   'input_text': 'research',
   'turn_id': 2},
  {'span_start': 457,
   'span_end': 511,
   'span_text': 'Vatican Library is a research library for history, law',
   'input_text': 'history, and law',
   'turn_id': 3},
  {'span_start': 457,
   'span_end': 545,
   'span_text': 'Vatican Library is a research library for history, law, philosophy, science and theology',
   'input_text': 'philosophy, science and theology',
   'turn_id': 4},
  {'span_start': 769,
   'span_end': 879,
   'span_text': 'March 2014, the Vatican Library began an initial four-year project of digitising its collection of manuscripts',
   'input_text': 'a  project',
   'turn_id': 5},
  {'span_start': 1048,
   'span_end': 1127,
   'span_text': 'Scholars have traditionally divided the history of the library into five period',
   'input_text': 'into periods',
   'turn_id': 6},
 ..................................................................................
  {'span_start': 868,
   'span_end': 910,
   'span_text': 'manuscripts, to be made available online. ',
   'input_text': 'them to be viewed online.',
   'turn_id': 20}],
 
 'name': 'Vatican_Library.txt'}

```
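
Questions and answers in CoQA are linked only through `turn_id`, so pairing them is a natural first processing step. The sketch below is a hedged illustration: `pair_turns` and the toy example are hypothetical, and it assumes, as the extract above suggests, that `span_text` equals `story[span_start:span_end]`.

```python
def pair_turns(example):
    """Pair each question with its answer by turn_id and verify the answer span."""
    answers_by_turn = {a["turn_id"]: a for a in example["answers"]}
    pairs = []
    for question in example["questions"]:
        answer = answers_by_turn[question["turn_id"]]
        # Slice the story with the recorded offsets to check the rationale span.
        span = example["story"][answer["span_start"]:answer["span_end"]]
        pairs.append({
            "turn_id": question["turn_id"],
            "question": question["input_text"],
            "answer": answer["input_text"],
            "span_matches": span == answer["span_text"],
        })
    return pairs


# Toy conversation in the same shape as a CoQA row (hypothetical content).
story = "The library was established in 1475. It holds many codices."
rationale = "established in 1475"
start = story.index(rationale)
example = {
    "story": story,
    "questions": [{"input_text": "When was it established?", "turn_id": 1}],
    "answers": [{"span_start": start, "span_end": start + len(rationale),
                 "span_text": rationale, "input_text": "In 1475", "turn_id": 1}],
}

pairs = pair_turns(example)
for pair in pairs:
    print(pair)
```

Note that the free-form answer `(input_text)` can differ from the span it is based on, which is why the two are kept separate here.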

Please run the cell below to explore the CoQA training dataset `(coqa-train-v1.0.json)`.

The CoQA dataset is licensed under the following licenses:

- Literature and Wikipedia passages are shared under the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.
- Children's stories are collected from MCTest, which comes with the [MSR-LA](https://github.com/mcobzarenco/mctest/blob/master/data/MCTest/LICENSE.pdf) license.
- Middle/High school exam passages are collected from [RACE](https://arxiv.org/abs/1704.04683), which comes with its [own](http://www.cs.cmu.edu/~glai1/data/race/) license.
- News passages are collected from the DeepMind CNN dataset, which comes with the [Apache](https://github.com/deepmind/rc-data/blob/master/LICENSE) license.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing
import json # to read json

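verbose = 1
input_file_path = '../source_code/data/coqa-train-v1.0.json'
if verbose:
    print("Reading the json file")
# Use a context manager so the file handle is closed after reading
with open(input_file_path, encoding='utf-8') as f:
    file = json.load(f)
if verbose:
    print("processing...")

# Each row is one conversation: a story with its questions and answers
data_row = [row for row in file['data']]
print("total rows: ", len(data_row))
data_row[0]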

### Other Datasets



|QA Dataset Name|Download Link|Paper Link|
|-|-|-|
|Explain Like I’m Five (ELI5)| https://github.com/facebookresearch/ELI5 | Long Form Question Answering:  https://arxiv.org/abs/1907.09190 |
| TriviaQA |http://nlp.cs.washington.edu/triviaqa/  | TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension: https://aclanthology.org/P17-1147.pdf |
|Question Answering in Context (QuAC)| https://quac.ai/ | Question Answering in Context: https://arxiv.org/abs/1808.07036 |
| TWEETQA  | https://aclanthology.org/P19-1496/  | TWEETQA: A Social Media Focused Question Answering Dataset: https://aclanthology.org/P19-1496.pdf |

<br/><br/>


For more on large and small Question Answering datasets, see `10 Question-Answering Datasets To Build Robust Chatbot Systems` by [Ambika Choudhury, 2019](https://analyticsindiamag.com/10-question-answering-datasets-to-build-robust-chatbot-systems/) and the `University of Freiburg: Algorithms and Data Structures Group` [large-qa-datasets GitHub page](https://github.com/ad-freiburg/large-qa-datasets).


**Having explored three major QA datasets, we proceed to the next notebook to learn some basic techniques on how to preprocess text data.** 


## References

- https://rajpurkar.github.io/SQuAD-explorer/
- https://ai.googleblog.com/2019/01/natural-questions-new-corpus-and.html
- https://ai.google.com/research/NaturalQuestions/download
- https://stanfordnlp.github.io/coqa/
---
## Licensing

Copyright © 2022 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.

<div>
    <span style="float: left; width: 51%; text-align: right;">
        <a >1</a>
        <a href="General_preprocessing.ipynb">2</a>
        <a href="QandA_data_processing.ipynb">3</a>
        <a href="Exercise.ipynb">4</a>
        <a href="Summary.ipynb">5</a>
    </span>
    <span style="float: left; width: 49%; text-align: right;"><a href="General_preprocessing.ipynb">Next Notebook</a></span>
</div>

<br>
<p> <center> <a href="../Start_Here.ipynb">Home Page</a> </center> </p>