# Exploring Datasets

## Downloading Datasets
Download the datasets with Hugging Face's `datasets` library.

In [1]:
import numpy as np
import pandas as pd

In [2]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
! pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.5.2-py3-none-any.whl (432 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.13-py37-none-any.whl (115 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.10.0-py3-none-any.whl (163 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
Installi

In [10]:
from datasets import list_datasets, load_dataset

In [11]:
dataset_list = list_datasets()

In [13]:
[dataset for dataset in dataset_list if "squad_v2" in dataset]

['squad_v2',
 'GEM/squad_v2',
 'caltonji/harrypotter_squad_v2',
 'caltonji/harrypotter_squad_v2_2',
 'susumu2357/squad_v2_sv',
 'sichenzhong/squad_v2_back_trans_aug',
 'sichenzhong/squad_v2_synonym_aug',
 'sichenzhong/squad_v2_word2vec_aug',
 'sichenzhong/squad_v2_context_aug',
 'sichenzhong/squad_v2_back_trans_synonym_aug',
 'sichenzhong/squad_v2_back_trans_possib_aug',
 'sichenzhong/squad_v2_back_trans_synonym_possib_aug',
 'pragnakalp/squad_v2_french_translated',
 'wiselinjayajos/squad_v2_modified_for_t5_qg',
 'autoevaluate/autoeval-staging-eval-project-squad_v2-7c1a5e5f-11505530',
 'autoevaluate/autoeval-staging-eval-project-squad_v2-6f87823c-11575537',
 'autoevaluate/autoeval-staging-eval-project-squad_v2-6f87823c-11575535',
 'autoevaluate/autoeval-staging-eval-project-squad_v2-94d8b010-11595541',
 'autoevaluate/autoeval-staging-eval-project-squad_v2-94d8b010-11595542',
 'autoevaluate/autoeval-staging-eval-project-squad_v2-94d8b010-11595543',
 'autoevaluate/autoeval-staging-eval-p

In [14]:
[dataset for dataset in dataset_list if "trivia_qa" in dataset]

['trivia_qa']

In [15]:
[dataset for dataset in dataset_list if "quac" in dataset]

['quac', 'Zaid/quac_expanded']

In [16]:
[dataset for dataset in dataset_list if "natural" in dataset]

['natural_questions',
 'vasudevgupta/bigbird-tokenized-natural-questions',
 'vasudevgupta/natural-questions-validation',
 'LeboNLP/toxic-natural-utterances',
 'pscotti/naturalscenesdataset',
 'Veldrovive/split-webdataset-naturalscenes']
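The four searches above repeat the same list comprehension, and a plain substring match also pulls in unrelated ids (for example `LeboNLP/toxic-natural-utterances` for "natural"). A minimal sketch of a reusable filter follows; `find_datasets` and `sample_ids` are hypothetical names introduced here, not part of the `datasets` library, and the sample is a hand-copied subset of the listings above.

```python
def find_datasets(names, keyword):
    """Return dataset ids containing `keyword`, ignoring case.

    Ids equal to the keyword (ignoring case) are listed first.
    """
    key = keyword.lower()
    matches = [name for name in names if key in name.lower()]
    # Stable sort: exact matches move to the front, the rest keep their order.
    return sorted(matches, key=lambda name: name.lower() != key)


# Hand-copied subset of the ids shown in the listings above.
sample_ids = [
    "GEM/squad_v2",
    "squad_v2",
    "trivia_qa",
    "quac",
    "Zaid/quac_expanded",
    "natural_questions",
]

print(find_datasets(sample_ids, "quac"))
print(find_datasets(sample_ids, "SQuAD_v2"))
```

In the notebook the same call would take the full `dataset_list` as its first argument.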

### SQuAD 2.0

In [6]:
# SQuAD 2.0 is a 44 MiB download and quick to fetch
data_squad = load_dataset("squad_v2")

Downloading and preparing dataset squad_v2/squad_v2 (download: 44.34 MiB, generated: 122.41 MiB, post-processed: Unknown size, total: 166.75 MiB) to /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d...


Generating train split:   0%|          | 0/130319 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11873 [00:00<?, ? examples/s]

Dataset squad_v2 downloaded and prepared to /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d. Subsequent calls will reuse this data.


In [7]:
type(data_squad)

datasets.dataset_dict.DatasetDict

In [8]:
data_squad.save_to_disk("/content/drive/MyDrive/Colab Data/squad_v2.hf")
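Once a dataset has been saved to Drive, a later session can reload it instead of downloading it again. The sketch below assumes the `datasets` library's `load_from_disk` function as the counterpart of `save_to_disk`; `load_or_download` is a hypothetical helper name, and the import sits inside the function so that defining it triggers no download.

```python
import os


def load_or_download(name, save_path, config=None):
    """Reload a saved dataset from `save_path`, downloading it only if needed."""
    # Imported lazily so the helper can be defined before `datasets` is installed.
    from datasets import load_dataset, load_from_disk

    if os.path.isdir(save_path):
        # A previous run already called save_to_disk() on this path.
        return load_from_disk(save_path)

    data = load_dataset(name, config) if config else load_dataset(name)
    data.save_to_disk(save_path)
    return data


# Example call, commented out so that running this cell downloads nothing:
# data_squad = load_or_download("squad_v2", "/content/drive/MyDrive/Colab Data/squad_v2.hf")
```

The directory check is deliberately simple: it does not verify that the folder holds a complete save, so a partially written folder would need to be deleted by hand.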

### TriviaQA*

⚠️ WARNING ⚠️: TriviaQA is a large dataset (2.48 GiB to download, 17.37 GiB in total once prepared).
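Because the `rc` configuration needs roughly 17.37 GiB in total, it may be worth checking free disk space before starting. This is a minimal sketch using only the standard library; `has_room_for` is a hypothetical helper, and checking the current directory (`"."`) is an assumption, since on Colab the cache lives under `/root/.cache/huggingface/datasets`.

```python
import shutil

GIB = 1024 ** 3


def has_room_for(required_gib, path="."):
    """Return True if the filesystem holding `path` has `required_gib` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gib * GIB


# 17.37 GiB is the total size load_dataset reports for trivia_qa/rc.
if has_room_for(17.37):
    print("Enough free space for TriviaQA (rc).")
else:
    print("Not enough free space; free some disk before downloading.")
```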

In [18]:
# TriviaQA (rc) downloads 2.48 GiB and prepares 17.37 GiB in total; the whole run took about 25 minutes
data_trivia = load_dataset("trivia_qa", "rc")

Downloading and preparing dataset trivia_qa/rc (download: 2.48 GiB, generated: 14.89 GiB, post-processed: Unknown size, total: 17.37 GiB) to /root/.cache/huggingface/datasets/trivia_qa/rc/1.2.0/e73c5e47a8704744fa9ded33504b35a6c098661813d1c2a09892eb9b9e9d59ae...


Generating train split:   0%|          | 0/138384 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/17944 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/17210 [00:00<?, ? examples/s]

Dataset trivia_qa downloaded and prepared to /root/.cache/huggingface/datasets/trivia_qa/rc/1.2.0/e73c5e47a8704744fa9ded33504b35a6c098661813d1c2a09892eb9b9e9d59ae. Subsequent calls will reuse this data.


In [19]:
type(data_trivia)

datasets.dataset_dict.DatasetDict

In [20]:
data_trivia.save_to_disk("/content/drive/MyDrive/Colab Data/trivia_qa_rc.hf")

### Natural Questions**

In [None]:
# Natural Questions is too big for memory at 138 GB, so the download is left commented out
# data_nq = load_dataset("natural_questions")
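One possible way to look at a dataset this large without preparing it in full is the `streaming=True` option of `load_dataset`, which yields examples lazily. The sketch below is untested against Natural Questions: whether its builder supports streaming in the installed `datasets` version is an assumption, and `take_first` and `stream_split` are hypothetical helper names.

```python
from itertools import islice


def take_first(examples, n):
    """Return the first `n` items of any iterable without consuming the rest."""
    return list(islice(examples, n))


def stream_split(name, split="train"):
    """Open a split lazily instead of downloading and preparing all of it."""
    from datasets import load_dataset  # lazy import; needs `pip install datasets`

    return load_dataset(name, split=split, streaming=True)


# Assumption: the "natural_questions" builder supports streaming here.
# Left commented out, like the original cell.
# nq_stream = stream_split("natural_questions")
# first_three = take_first(nq_stream, 3)
```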

### QuAC

In [21]:
# QuAC (Question Answering in Context) is a 73 MiB download (136 MiB in total) and is quick
data_quac = load_dataset("quac")

Downloading and preparing dataset quac/plain_text (download: 73.47 MiB, generated: 62.51 MiB, post-processed: Unknown size, total: 135.99 MiB) to /root/.cache/huggingface/datasets/quac/plain_text/1.1.0/4170258e7e72d7c81bd6441b3f3489ea1544f0ff226ce61e22bb00c6e9d01fb6...


Generating train split:   0%|          | 0/11567 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1000 [00:00<?, ? examples/s]

Dataset quac downloaded and prepared to /root/.cache/huggingface/datasets/quac/plain_text/1.1.0/4170258e7e72d7c81bd6441b3f3489ea1544f0ff226ce61e22bb00c6e9d01fb6. Subsequent calls will reuse this data.


In [22]:
type(data_quac)

datasets.dataset_dict.DatasetDict

In [23]:
data_quac.save_to_disk("/content/drive/MyDrive/Colab Data/quac.hf")

## Getting Familiar

### SQuAD 2.0

In [None]:
data_squad
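Since a `DatasetDict` behaves like a dictionary of splits, the split sizes can be collected in one line and compared with the counts in the download log (130319 train and 11873 validation examples for SQuAD 2.0). `split_sizes` is a hypothetical helper, and the stand-in dictionary of lists below only mimics the shape of a `DatasetDict`.

```python
def split_sizes(dataset_dict):
    """Map each split name to its number of examples."""
    return {split: len(split_data) for split, split_data in dataset_dict.items()}


# Stand-in with the same mapping shape as a DatasetDict (not real data).
toy_splits = {"train": ["a", "b", "c"], "validation": ["d"]}
print(split_sizes(toy_splits))

# In the notebook: split_sizes(data_squad)
```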

In [None]:
data_squad['train'].info.features

In [None]:
# Look at the first training example (about Beyoncé)
data_squad['train'][0]
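SQuAD 2.0 differs from version 1.1 by including unanswerable questions, which appear as examples whose `answers["text"]` list is empty. The sketch below shows two small checks on that structure; `is_answerable`, `answer_spans_match`, and the `toy` record are hypothetical and only mimic the shape of a `squad_v2` example.

```python
def is_answerable(example):
    """True if a SQuAD 2.0 style example has at least one gold answer."""
    return len(example["answers"]["text"]) > 0


def answer_spans_match(example):
    """Check that each answer_start offset points at its answer text."""
    answers = example["answers"]
    context = example["context"]
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


# Hypothetical record shaped like a squad_v2 example (not taken from the dataset).
context = "The notebook was written in Colab."
toy = {
    "id": "toy-0",
    "title": "Toy",
    "context": context,
    "question": "Where was the notebook written?",
    "answers": {"text": ["Colab"], "answer_start": [context.index("Colab")]},
}

print(is_answerable(toy), answer_spans_match(toy))
```

In the notebook, `is_answerable(data_squad['train'][0])` would apply the same check to the example displayed above.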

### TriviaQA

In [None]:
data_trivia

In [None]:
data_trivia['train'].info.features

In [None]:
# Look at one training example (index 5)
data_trivia['train'][5]
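Given the 14.89 GiB generated size of the `rc` configuration, a single TriviaQA example is likely to contain long text fields, so printing it raw can flood the output. A minimal sketch of a recursive shortener follows; `summarize` is a hypothetical helper, and the toy record is invented for illustration rather than taken from the dataset.

```python
def summarize(value, max_chars=60):
    """Recursively shorten long strings and lists so a nested example fits on screen."""
    if isinstance(value, dict):
        return {key: summarize(item, max_chars) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        # Keep the first two items and note how many were dropped.
        head = [summarize(item, max_chars) for item in value[:2]]
        if len(value) > 2:
            head.append(f"... ({len(value) - 2} more)")
        return head
    if isinstance(value, str) and len(value) > max_chars:
        return value[:max_chars] + "..."
    return value


# Invented nested record, only to show the effect.
toy_example = {"question": "x" * 100, "aliases": ["a", "b", "c", "d"]}
print(summarize(toy_example, max_chars=5))
```

In the notebook: `summarize(data_trivia['train'][5])`.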

### Natural Questions

### QuAC

In [None]:
data_quac

In [None]:
data_quac['train']

In [None]:
# Look at one training example (index 12)
data_quac['train'][12]