# HF Datasets

HF Datasets is an essential tool for NLP practitioners — hosting over 1.3K *(mostly)* high-quality language-focused datasets, and an easy-to-use set of functionalities for building efficient pre-processing pipelines.

Let's install and import `datasets`, then list all of the datasets available to us. We'll take a look at all of the `squad` datasets too.

```
!pip install datasets
```

In [1]:
import datasets

ds_list = datasets.list_datasets()
ds_list[:5]

['acronym_identification',
 'ade_corpus_v2',
 'adversarial_qa',
 'aeslc',
 'afrikaans_ner_corpus']

In [2]:
len(ds_list)  # the number of datasets increases almost every day

1435

In [3]:
[ds for ds in ds_list if 'squad' in ds.lower()]

['iapp_wiki_qa_squad',
 'squad',
 'squad_adversarial',
 'squad_es',
 'squad_it',
 'squad_kor_v1',
 'squad_kor_v2',
 'squad_v1_pt',
 'squad_v2',
 'squadshifts',
 'thaiqa_squad',
 'Gabriel/squad_v2_sv',
 'Wikidepia/IndoSQuAD',
 'lhoestq/custom_squad',
 'lhoestq/squad',
 'piEsposito/squad_20_ptbr',
 'qwant/squad_fr',
 'susumu2357/squad_v2_sv',
 'vershasaxena91/squad_multitask']

We have English, Thai \[`iapp_wiki_qa_squad`, `thaiqa_squad`\], Korean \[`squad_kor_v1`, `squad_kor_v2`\], Italian \[`squad_it`\], Spanish \[`squad_es`\], French \[`qwant/squad_fr`\], and more.

In [4]:
dataset = datasets.load_dataset('squad', streaming=True)
dataset

{'train': <datasets.iterable_dataset.IterableDataset at 0x7fb0701ee130>,
 'validation': <datasets.iterable_dataset.IterableDataset at 0x7fb0701ee7c0>}

In [5]:
datasets.load_dataset('squad', split='train', streaming=True)

<datasets.iterable_dataset.IterableDataset at 0x7fb072819940>

In [6]:
dataset.keys()

dict_keys(['train', 'validation'])

In [7]:
# both 'train' and 'validation' will output the same info
print(dataset['train'].dataset_size)
print(dataset['train'].description)
print(dataset['train'].features)

89819400
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

{'id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'context': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None), 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}


To access a single *record* we can use list indexing **only** when `streaming=False`, otherwise we return `TypeError: 'IterableDataset' object is not subscriptable`.

In [8]:
dataset['train'][0]  # only when 'streaming=False'

TypeError: 'IterableDataset' object is not subscriptable

To access a single record when `streaming=True` we need to iterate through the dataset:

In [9]:
for sample in dataset['train']:
    print(sample)
    break

{'id': '5733be284776f41900661182', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}, 'title': 'University_of_Notre_Dame'}


## Processing the data

We can use a variety of data processing functions provided by `datasets`. We'll start by modifying the `answers` feature, which contains both the `answer_start` positions and the answer `text`, but no `answer_end` position.

In [10]:
for i, sample in enumerate(dataset['train']):
    print(sample['answers'])
    if i > 4: break

{'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}
{'text': ['a copper statue of Christ'], 'answer_start': [188]}
{'text': ['the Main Building'], 'answer_start': [279]}
{'text': ['a Marian place of prayer and reflection'], 'answer_start': [381]}
{'text': ['a golden statue of the Virgin Mary'], 'answer_start': [92]}
{'text': ['September 1876'], 'answer_start': [248]}


We use the `map` method when modifying existing or creating new features. When using `streaming=True` we *must* include every feature in our index (otherwise they will be removed). If we were to use `streaming=False`, the below `map` would only need to include the `'answers'` part.

In [11]:
dataset['train'] = dataset['train'].map(
    lambda x: {
        'id': x['id'],
        'context': x['context'],
        'answers': {
            **x['answers'],
            **{'answer_end': [x['answers']['answer_start'][0] + len(x['answers']['text'][0])]}
        },
        'question': x['question'],
        'title': x['title']
    }
)

This will *lazily* load and perform the transformations on our dataset when it is needed.

In [12]:
for sample in dataset['train']:
    print(sample)
    break

{'id': '5733be284776f41900661182', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous'], 'answer_end': [541]}, 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'title': 'University_of_Notre_Dame'}


In [13]:
for i, sample in enumerate(dataset['train']):
    print(sample['answers'])
    if i > 4: break

{'answer_start': [515], 'text': ['Saint Bernadette Soubirous'], 'answer_end': [541]}
{'answer_start': [188], 'text': ['a copper statue of Christ'], 'answer_end': [213]}
{'answer_start': [279], 'text': ['the Main Building'], 'answer_end': [296]}
{'answer_start': [381], 'text': ['a Marian place of prayer and reflection'], 'answer_end': [420]}
{'answer_start': [92], 'text': ['a golden statue of the Virgin Mary'], 'answer_end': [126]}
{'answer_start': [248], 'text': ['September 1876'], 'answer_end': [262]}


We can see this is loaded lazily if we purposely create an error in the `map` function, the error will only appear once we `enumerate` through the dataset.

In [14]:
dataset['train'] = dataset['train'].map(
    lambda x: {'random': x['I do not exist']}
)

Although the feature `'I do not exist'` does not exist, we return no error, yet...

In [15]:
for i, sample in enumerate(dataset['train']):
    print(sample['answers'])
    if i > 4: break

KeyError: 'I do not exist'

But as soon as we begin iterating through the data, the error pops up - this is due to lazy loading!

I want to show off a few non-streaming features too, so we'll reload our dataset with `streaming=False` and also add the `answer_end` to the `answers` feature again - note that this time we don't need to include every other feature in the `map` method.

In [16]:
dataset = datasets.load_dataset('squad', streaming=False)

dataset['train'] = dataset['train'].map(
    lambda x: {
        'answers': {
            **x['answers'],
            **{'answer_end': [x['answers']['answer_start'][0] + len(x['answers']['text'][0])]}
        }
    }
)

Reusing dataset squad (/Users/jamesbriggs/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)
Loading cached processed dataset at /Users/jamesbriggs/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-491ac45325087252.arrow


And we can now access entries like so:

In [17]:
dataset['train'][0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'answer_end': [541],
  'answer_start': [515],
  'text': ['Saint Bernadette Soubirous']}}

Another operation that we will find ourselves needing to perform on the SQuAD dataset is the tokenization of our questions and contexts into tensors for our Q&A models. We'll want to *add a new* feature for this, we can call this the `token_ids` feature.

Typically tokenization is done in *batches* rather than row-by-row, as this usually speeds up the process. Fortunately, we can add batching to our `map` function with `batched=True` and even specify `batch_size`.

*(Note that this is the same syntax as used when `streaming=True`, but we would need to include mappings for all other features too)*

In [18]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

dataset['train'] = dataset['train'].map(
    lambda x: tokenizer(
            x['question'], x['context'],
            max_length=512, padding='max_length',
            truncation=True
        ), batched=True, batch_size=32
)

Loading cached processed dataset at /Users/jamesbriggs/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-9b14a74000c87a8d.arrow


Again, we can take a look at this by iterating through the first sample only.

In [19]:
for sample in dataset['train']:
    print(sample)
    break

{'answers': {'answer_end': [541], 'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}, 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

And we can see that each of our tensors `input_ids`, `token_type_ids`, and `attention_mask` have been added to the dataset.

We may also want to remove some sample (ideally we would probably remove records first to save some processing time). We can do this via the `filter` method. Let's say we don't want 'Beyoncé' samples anymore (sorry anyone who likes Beyoncé).

*(Note that we cannot use `filter` when `streaming=True`)*

We could write:

In [20]:
dataset['train'] = dataset['train'].filter(
    lambda x: x['title'] != 'Beyonce'
)

Loading cached processed dataset at /Users/jamesbriggs/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-c457888fffdc15fa.arrow


We no longer have Beyoncé samples:

In [21]:
for sample in dataset['train']:
    print(sample['title'])
    break

University_of_Notre_Dame


We might also decide we'd like to rename a column, which we can do with `rename_column`.

In [22]:
dataset['train'] = dataset['train'].rename_column('title', 'topic')
dataset['train']

Dataset({
    features: ['answers', 'attention_mask', 'context', 'id', 'input_ids', 'question', 'topic', 'token_type_ids'],
    num_rows: 87599
})

Finally, we may also want to remove certain columns - for Q&A we would need nothing more than a question and context (although for training we would also want token start and end positions). So let's go ahead and remove everything *but* our tokenized `input_ids`, `token_type_ids`, and `attention_mask`.

In [23]:
dataset['train'] = dataset['train'].remove_columns([
    'answers', 'context', 'id', 'question', 'topic'
])
dataset['train']

Dataset({
    features: ['attention_mask', 'input_ids', 'token_type_ids'],
    num_rows: 87599
})

And there we are, an introduction to HF `datasets`!