<a href="https://colab.research.google.com/github/jeanlucjackson/w266_final_project/blob/main/expanded_datasets_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Datasets

## Downloading Datasets
Download the datasets using Hugging Face's `datasets` library.

In [1]:
import numpy as np
import pandas as pd

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import json

from pprint import pprint

In [2]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
! pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.5.2-py3-none-any.whl (432 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.10.0-py3-none-any.whl (163 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.13-py37-none-any.whl (115 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
Installing coll

In [4]:
from datasets import list_datasets, load_dataset_builder, get_dataset_config_names, load_dataset, load_from_disk

In [5]:
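def summarize_dataset(dataset, config=None):
    """Print a dataset's description and feature schema from its builder metadata."""
    # The builder exposes metadata without downloading the full dataset
    builder = load_dataset_builder(dataset, config)
    print(f"Description:\n {builder.info.description}")
    print("Features:")
    pprint(builder.info.features)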

In [6]:
dataset_list = list_datasets()

In [7]:
[dataset for dataset in dataset_list if "squad_v2" in dataset]

['squad_v2',
 'GEM/squad_v2',
 'caltonji/harrypotter_squad_v2',
 'caltonji/harrypotter_squad_v2_2',
 'susumu2357/squad_v2_sv',
 'sichenzhong/squad_v2_back_trans_aug',
 'sichenzhong/squad_v2_synonym_aug',
 'sichenzhong/squad_v2_word2vec_aug',
 'sichenzhong/squad_v2_context_aug',
 'sichenzhong/squad_v2_back_trans_synonym_aug',
 'sichenzhong/squad_v2_back_trans_possib_aug',
 'sichenzhong/squad_v2_back_trans_synonym_possib_aug',
 'pragnakalp/squad_v2_french_translated',
 'wiselinjayajos/squad_v2_modified_for_t5_qg',
 'autoevaluate/autoeval-staging-eval-project-squad_v2-7c1a5e5f-11505530',
 'autoevaluate/autoeval-staging-eval-project-squad_v2-6f87823c-11575537',
 'autoevaluate/autoeval-staging-eval-project-squad_v2-6f87823c-11575535',
 'autoevaluate/autoeval-staging-eval-project-squad_v2-94d8b010-11595541',
 'autoevaluate/autoeval-staging-eval-project-squad_v2-94d8b010-11595542',
 'autoevaluate/autoeval-staging-eval-project-squad_v2-94d8b010-11595543',
 'autoevaluate/autoeval-staging-eval-p

In [8]:
[dataset for dataset in dataset_list if "trivia_qa" in dataset]

['trivia_qa']

In [9]:
[dataset for dataset in dataset_list if "natural" in dataset]

['natural_questions',
 'vasudevgupta/bigbird-tokenized-natural-questions',
 'vasudevgupta/natural-questions-validation',
 'LeboNLP/toxic-natural-utterances',
 'pscotti/naturalscenesdataset',
 'Veldrovive/split-webdataset-naturalscenes']

In [10]:
[dataset for dataset in dataset_list if "quac" in dataset]

['quac', 'Zaid/quac_expanded']
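The four searches above repeat the same list comprehension. A small helper could make the pattern reusable. This is only a sketch: `find_datasets` is a hypothetical name (not part of `datasets`), it matches case-insensitively (unlike the cells above), and `sample_names` is a handful of names copied from the outputs above that stands in for `dataset_list`.

```python
def find_datasets(names, keyword):
    """Return the dataset names that contain `keyword` (case-insensitive)."""
    keyword = keyword.lower()
    return [name for name in names if keyword in name.lower()]


# A few names copied from the search results above, standing in for dataset_list.
sample_names = [
    "squad_v2",
    "GEM/squad_v2",
    "trivia_qa",
    "natural_questions",
    "quac",
    "Zaid/quac_expanded",
]

quac_matches = find_datasets(sample_names, "quac")
print(quac_matches)
```

In the notebook itself the call would be `find_datasets(dataset_list, "quac")`.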

### SQuAD 2.0

In [11]:
print(get_dataset_config_names("squad_v2"))

['squad_v2']


In [12]:
summarize_dataset("squad_v2")

Description:
 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers
 to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but
 also determine when no answer is supported by the paragraph and abstain from answering.

Features:
{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
 'context': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None)}


In [13]:
# SQuAD 2.0 is 40 MB and quick to download
# data_squad = load_dataset("squad_v2")

data_squad = load_from_disk("/content/drive/MyDrive/Colab Data/squad_v2.hf")

In [14]:
type(data_squad)

datasets.dataset_dict.DatasetDict

In [15]:
# Run once after the first download to cache the dataset on Drive:
# data_squad.save_to_disk("/content/drive/MyDrive/Colab Data/squad_v2.hf")
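The cells in this notebook switch between downloading and loading from Drive by commenting lines in and out. One way to automate that choice is a "load if cached, otherwise build and save" helper. The sketch below is an assumption about how this could be done, not something the notebook runs: `load_or_build` and the JSON stand-ins are hypothetical, and a small JSON file replaces the real dataset so the example runs without any download.

```python
import json
import tempfile
from pathlib import Path


def load_or_build(cache_path, build, load, save):
    """Load from cache_path if it exists; otherwise build, save, and return."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return load(cache_path)
    data = build()
    save(data, cache_path)
    return data


# Stand-in demo with a JSON file so the sketch runs without any downloads.
calls = []


def fake_download():
    calls.append("download")  # record every (expensive) build
    return {"num_rows": 3}


def save_json(data, path):
    path.write_text(json.dumps(data))


def load_json(path):
    return json.loads(path.read_text())


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "toy_dataset.json"
    first = load_or_build(cache, fake_download, load_json, save_json)   # builds and saves
    second = load_or_build(cache, fake_download, load_json, save_json)  # loads from cache
```

For the datasets here, `build` could be `lambda: load_dataset("squad_v2")`, `load` could be `lambda path: load_from_disk(str(path))`, and `save` could be `lambda ds, path: ds.save_to_disk(str(path))`.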

⚠️ WARNING ⚠️: TriviaQA is a large dataset (the `rc` configuration generates 17.4 GB).

### TriviaQA*

In [16]:
pprint(get_dataset_config_names("trivia_qa"))

['rc',
 'rc.nocontext',
 'unfiltered',
 'unfiltered.nocontext',
 'rc.web',
 'rc.web.nocontext',
 'rc.wikipedia',
 'rc.wikipedia.nocontext']


In [17]:
summarize_dataset("trivia_qa", "rc")

Description:
 TriviaqQA is a reading comprehension dataset containing over 650K
question-answer-evidence triples. TriviaqQA includes 95K question-answer
pairs authored by trivia enthusiasts and independently gathered evidence
documents, six per question on average, that provide high quality distant
supervision for answering the questions.

Features:
{'answer': {'aliases': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
            'matched_wiki_entity_name': Value(dtype='string', id=None),
            'normalized_aliases': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
            'normalized_matched_wiki_entity_name': Value(dtype='string', id=None),
            'normalized_value': Value(dtype='string', id=None),
            'type': Value(dtype='string', id=None),
            'value': Value(dtype='string', id=None)},
 'entity_pages': Sequence(feature={'doc_source': Value(dtype='string', id=None), 'filename': Value(dtype='string', id=None), '

In [18]:
# TriviaQA ("rc") downloads and preprocesses into 17.4 GB; this takes about 25 minutes in total
# data_trivia = load_dataset("trivia_qa", "rc")

data_trivia = load_from_disk("/content/drive/MyDrive/Colab Data/trivia_qa_rc.hf")

In [19]:
type(data_trivia)

datasets.dataset_dict.DatasetDict

In [20]:
# Run once after the first download to cache the dataset on Drive:
# data_trivia.save_to_disk("/content/drive/MyDrive/Colab Data/trivia_qa_rc.hf")

### Natural Questions**

In [21]:
pprint(get_dataset_config_names("natural_questions"))

['default', 'dev']


In [22]:
summarize_dataset("natural_questions", "default")

Description:
 
The NQ corpus contains questions from real users, and it requires QA systems to
read and comprehend an entire Wikipedia article that may or may not contain the
answer to the question. The inclusion of real user questions, and the
requirement that solutions should read an entire page to find the answer, cause
NQ to be a more realistic and challenging task than prior QA datasets.

Features:
{'annotations': Sequence(feature={'id': Value(dtype='string', id=None), 'long_answer': {'start_token': Value(dtype='int64', id=None), 'end_token': Value(dtype='int64', id=None), 'start_byte': Value(dtype='int64', id=None), 'end_byte': Value(dtype='int64', id=None), 'candidate_index': Value(dtype='int64', id=None)}, 'short_answers': Sequence(feature={'start_token': Value(dtype='int64', id=None), 'end_token': Value(dtype='int64', id=None), 'start_byte': Value(dtype='int64', id=None), 'end_byte': Value(dtype='int64', id=None), 'text': Value(dtype='string', id=None)}, length=-1, id=None), '

In [23]:
# Natural Questions is too big for memory at 138 GB
# data_nq = load_dataset("natural_questions")
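Because Natural Questions is not loaded in this notebook, one option for inspecting a dataset of this size is to iterate over it lazily and stop after a few records. `load_dataset` accepts `streaming=True`, which returns an iterable dataset instead of downloading everything first. The sketch below uses a plain generator as a stand-in so it runs offline; `peek` and `fake_stream` are hypothetical names, and the records are placeholders rather than real NQ data.

```python
from itertools import islice


def peek(records, n=3):
    """Return the first n records of any iterable, reading no more than n."""
    return list(islice(records, n))


def fake_stream():
    """Stand-in for a streamed dataset: yields records lazily, one at a time."""
    i = 0
    while True:  # effectively endless, like a dataset too large to materialize
        yield {"id": str(i), "question": f"placeholder question {i}"}
        i += 1


first_records = peek(fake_stream(), n=2)
print(first_records)
```

With the real library this might look like `peek(load_dataset("natural_questions", split="train", streaming=True))`. That call was not run here, so how well streaming behaves for this particular dataset is untested.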

### QuAC

In [24]:
pprint(get_dataset_config_names("quac"))

['plain_text']


In [25]:
summarize_dataset("quac")

Description:
 Question Answering in Context is a dataset for modeling, understanding,
and participating in information seeking dialog. Data instances consist
of an interactive dialog between two crowd workers: (1) a student who
poses a sequence of freeform questions to learn as much as possible
about a hidden Wikipedia text, and (2) a teacher who answers the questions
by providing short excerpts (spans) from the text. QuAC introduces
challenges not found in existing machine comprehension datasets: its
questions are often more open-ended, unanswerable, or only meaningful
within the dialog context.

Features:
{'answers': Sequence(feature={'texts': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'answer_starts': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None)}, length=-1, id=None),
 'background': Value(dtype='string', id=None),
 'context': Value(dtype='string', id=None),
 'dialogue_id': Value(dtype='string', id=None),
 'followups': Sequence(featur

In [26]:
# Question Answering in Context downloads 136 MB and is quick
# data_quac = load_dataset("quac")

data_quac = load_from_disk("/content/drive/MyDrive/Colab Data/quac.hf")

In [27]:
type(data_quac)

datasets.dataset_dict.DatasetDict

In [28]:
# Run once after the first download to cache the dataset on Drive:
# data_quac.save_to_disk("/content/drive/MyDrive/Colab Data/quac.hf")

## Getting Familiar

### SQuAD 2.0

In [29]:
data_squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

In [30]:
data_squad['train'].info.features

{'id': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None),
 'context': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}

In [31]:
# Look at first example (queen bey)
pprint(data_squad['train'][0])

{'answers': {'answer_start': [269], 'text': ['in the late 1990s']},
 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born '
            'September 4, 1981) is an American singer, songwriter, record '
            'producer and actress. Born and raised in Houston, Texas, she '
            'performed in various singing and dancing competitions as a child, '
            'and rose to fame in the late 1990s as lead singer of R&B '
            "girl-group Destiny's Child. Managed by her father, Mathew "
            "Knowles, the group became one of the world's best-selling girl "
            "groups of all time. Their hiatus saw the release of Beyoncé's "
            'debut album, Dangerously in Love (2003), which established her as '
            'a solo artist worldwide, earned five Grammy Awards and featured '
            'the Billboard Hot 100 number-one singles "Crazy in Love" and '
            '"Baby Boy".',
 'id': '56be85543aeaaa14008c9063',
 'question': 'When did
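The record above stores each answer as a text plus an `answer_start` character offset into `context`. A quick sanity check is to confirm that slicing the context at that offset reproduces the answer text. The sketch below uses a hypothetical, shortened record in the same layout; `answer_spans_match` is a made-up helper, and the empty-list form for unanswerable questions is an assumption about how such rows are stored.

```python
def answer_spans_match(example):
    """Check that each answer text equals the context slice starting at its answer_start."""
    context = example["context"]
    answers = example["answers"]
    return [
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    ]


# Hypothetical, shortened record in the same layout as data_squad['train'][0].
context = "She rose to fame in the late 1990s as lead singer of Destiny's Child."
answer = "in the late 1990s"
answerable = {
    "context": context,
    "answers": {"text": [answer], "answer_start": [context.index(answer)]},
}
# Assumption: an unanswerable question is stored with empty answer lists.
unanswerable = {"context": context, "answers": {"text": [], "answer_start": []}}

print(answer_spans_match(answerable))
print(answer_spans_match(unanswerable))
```

On the real data, the same function could be applied to `data_squad['train'][0]` or mapped over a split to count mismatched spans.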

### TriviaQA

In [32]:
data_trivia

DatasetDict({
    train: Dataset({
        features: ['question', 'question_id', 'question_source', 'entity_pages', 'search_results', 'answer'],
        num_rows: 138384
    })
    validation: Dataset({
        features: ['question', 'question_id', 'question_source', 'entity_pages', 'search_results', 'answer'],
        num_rows: 17944
    })
    test: Dataset({
        features: ['question', 'question_id', 'question_source', 'entity_pages', 'search_results', 'answer'],
        num_rows: 17210
    })
})

In [33]:
data_trivia['train'].info.features

{'question': Value(dtype='string', id=None),
 'question_id': Value(dtype='string', id=None),
 'question_source': Value(dtype='string', id=None),
 'entity_pages': Sequence(feature={'doc_source': Value(dtype='string', id=None), 'filename': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'wiki_context': Value(dtype='string', id=None)}, length=-1, id=None),
 'search_results': Sequence(feature={'description': Value(dtype='string', id=None), 'filename': Value(dtype='string', id=None), 'rank': Value(dtype='int32', id=None), 'title': Value(dtype='string', id=None), 'url': Value(dtype='string', id=None), 'search_context': Value(dtype='string', id=None)}, length=-1, id=None),
 'answer': {'aliases': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
  'normalized_aliases': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
  'matched_wiki_entity_name': Value(dtype='string', id=None),
  'normalized_matched_wiki_entity_name': Value(dtyp

In [34]:
# Look at the example at index 5
pprint(data_trivia['train'][5])

{'answer': {'aliases': ['Chicago Bears',
                        'Chicago Staleys',
                        'Decatur Staleys',
                        'Chicago Bears football',
                        'Chicago bears',
                        'Save Da Planet',
                        'Chicago Gators'],
            'matched_wiki_entity_name': '',
            'normalized_aliases': ['chicago bears',
                                   'chicago staleys',
                                   'chicago gators',
                                   'decatur staleys',
                                   'save da planet',
                                   'chicago bears football'],
            'normalized_matched_wiki_entity_name': '',
            'normalized_value': 'chicago bears',
            'type': 'WikipediaEntity',
            'value': 'Chicago Bears'},
 'entity_pages': {'doc_source': ['TagMe'],
                  'filename': ['Super_Bowl_XX.txt'],
                  'title': ['Super Bowl XX'],
 

In [35]:
# Look at another example
pprint(data_trivia['train'][1962])

{'answer': {'aliases': ['Hydrogen carbide',
                        'Metane',
                        'Carbon tetrahydride',
                        'CH₄',
                        'Liquid methane',
                        'CH4 (disambiguation)',
                        'Methane plume',
                        'Marsh Gas',
                        'Methane gas',
                        'Carburetted hydrogen',
                        'Ch4',
                        'Liquid methane rocket fuel',
                        'Methyl hydride',
                        'Methan',
                        'CH4',
                        'Marsh gas,firedamp',
                        'Methane'],
            'matched_wiki_entity_name': '',
            'normalized_aliases': ['ch₄',
                                   'methyl hydride',
                                   'ch4 disambiguation',
                                   'metane',
                                   'carburetted hydrogen',
               
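Each TriviaQA answer carries a list of `normalized_aliases`, which suggests scoring a prediction by normalizing it and checking membership in that list. The sketch below illustrates the idea with a simplified normalization (lowercase, punctuation replaced by spaces, articles dropped, whitespace squeezed). That normalization is an assumption and may differ from the one used to build the dataset; `normalize_answer` and `matches_alias` are hypothetical helpers.

```python
import re
import string


def normalize_answer(text):
    """Simplified normalization: lowercase, punctuation to spaces, drop articles, squeeze spaces."""
    text = text.lower()
    text = "".join(" " if ch in string.punctuation else ch for ch in text)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def matches_alias(prediction, answer):
    """True if the normalized prediction is one of the record's normalized aliases."""
    return normalize_answer(prediction) in set(answer["normalized_aliases"])


# Subset of the 'answer' field printed for data_trivia['train'][5] above.
answer = {
    "value": "Chicago Bears",
    "normalized_aliases": ["chicago bears", "chicago staleys", "decatur staleys"],
}

print(matches_alias("The Chicago Bears!", answer))
print(matches_alias("Green Bay Packers", answer))
```

Matching against all aliases rather than only `value` lets spelling variants of the same entity count as correct.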

### Natural Questions

### QuAC

In [36]:
data_quac

DatasetDict({
    train: Dataset({
        features: ['dialogue_id', 'wikipedia_page_title', 'background', 'section_title', 'context', 'turn_ids', 'questions', 'followups', 'yesnos', 'answers', 'orig_answers'],
        num_rows: 11567
    })
    validation: Dataset({
        features: ['dialogue_id', 'wikipedia_page_title', 'background', 'section_title', 'context', 'turn_ids', 'questions', 'followups', 'yesnos', 'answers', 'orig_answers'],
        num_rows: 1000
    })
})

In [37]:
data_quac['train']

Dataset({
    features: ['dialogue_id', 'wikipedia_page_title', 'background', 'section_title', 'context', 'turn_ids', 'questions', 'followups', 'yesnos', 'answers', 'orig_answers'],
    num_rows: 11567
})

In [38]:
# Look at the example at index 12
pprint(data_quac['train'][12])

{'answers': {'answer_starts': [[46], [391], [2887], [2887]],
             'texts': [['May 25, 1803,'],
                       ['John Clarke,'],
                       ['CANNOTANSWER'],
                       ['CANNOTANSWER']]},
 'background': 'Ralph Waldo Emerson (May 25, 1803 - April 27, 1882) was an '
               'American essayist, lecturer, philosopher and poet who led the '
               'transcendentalist movement of the mid-19th century. He was '
               'seen as a champion of individualism and a prescient critic of '
               'the countervailing pressures of society, and he disseminated '
               'his thoughts through dozens of published essays and more than '
               '1,500 public lectures across the United States. Emerson '
               'gradually moved away from the religious and social beliefs of '
               'his contemporaries, formulating and expressing the philosophy '
               'of transcendentalism in his 1836 essay "Nature". 
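A QuAC record holds a whole dialogue: `questions` and `answers['texts']` are parallel lists with one entry per turn, and unanswerable turns carry the text `CANNOTANSWER`. The sketch below flattens one record into per-turn pairs. `dialogue_turns` is a hypothetical helper; the answer texts are copied from the output above, while the questions are placeholders because the real ones are not shown.

```python
NO_ANSWER = "CANNOTANSWER"  # marker seen in the 'texts' output above


def dialogue_turns(example):
    """Pair each question with its first answer text and flag unanswerable turns."""
    turns = []
    for question, texts in zip(example["questions"], example["answers"]["texts"]):
        first = texts[0]
        turns.append({
            "question": question,
            "answer": None if first == NO_ANSWER else first,
            "answerable": first != NO_ANSWER,
        })
    return turns


# The answer texts are copied from data_quac['train'][12]; the questions are
# hypothetical placeholders because the real ones are not shown above.
example = {
    "questions": ["placeholder q1", "placeholder q2", "placeholder q3", "placeholder q4"],
    "answers": {
        "texts": [["May 25, 1803,"], ["John Clarke,"], ["CANNOTANSWER"], ["CANNOTANSWER"]],
    },
}

turns = dialogue_turns(example)
print(sum(t["answerable"] for t in turns), "of", len(turns), "turns are answerable")
```

Flattening turns this way makes QuAC easier to compare with SQuAD 2.0, where each row is a single question.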