# Hugging Face Datasets

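Hugging Face Datasets provides easy access to datasets commonly used with transformer models, e.g. GLUE. As with TensorFlow Datasets (TFDS), a dataset is organized into splits ```(train, validation, test)```, and loading it without naming a split returns a dictionary of splits.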


* [Github - Hugging Face Datasets](https://github.com/huggingface/datasets)

> ```
> pip install datasets
> ```
> * ```datasets.list_datasets()``` to list the available datasets
> * ```datasets.load_dataset(dataset_name, **kwargs)``` to instantiate a dataset

* [Hugging Face Datasets Introduction](https://huggingface.co/course/chapter5/1?fw=pt)

* [Hugging Face Datasets - Quick Start](https://huggingface.co/docs/datasets/quickstart)

<img src="../image/huggingface_datasets.png" align="left" width=700/>

In [3]:
!pip install datasets --quiet

In [1]:
import numpy as np
import pandas as pd
import datasets
from datasets import (
    list_datasets, 
    load_dataset,
    get_dataset_split_names,
)


pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option("max_colwidth", None)
pd.set_option("max_seq_items", None)

# List available datasets

In [2]:
list_datasets()[:5]

['acronym_identification',
 'ade_corpus_v2',
 'adversarial_qa',
 'aeslc',
 'afrikaans_ner_corpus']

----

# Random access vs Iterable 

* [IterableDataset](https://huggingface.co/docs/datasets/access#iterabledataset)

> An IterableDataset is loaded when you set the streaming parameter to True in ```load_dataset()```:
> ```
> from datasets import load_dataset
> iterable_dataset = load_dataset("food101", split="train", streaming=True)
> ```


# Loading CSV, JSON, Parquet, and SQL results from local files and URLs

Like the pandas read functions, Datasets can load these formats from local disk or from URLs.

* [Local and remote files](https://huggingface.co/docs/datasets/loading#local-and-remote-files)

> Datasets can be loaded from local files stored on your computer and from remote files. The datasets are most likely stored as a csv, json, txt or parquet file. The load_dataset() function can load each of these file types.

---
# Methods

* [rotten_tomatoes](https://huggingface.co/datasets/rotten_tomatoes) movie review dataset

> Movie Review Dataset. This is a dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales", Proceedings of the ACL, 2005.

<img src="../image/huggingface_dataset_rotten_tomatos.png" align="left" width=700/>

In [3]:
dataset_name = "rotten_tomatoes"

## Get the names of splits

In [4]:
get_dataset_split_names(dataset_name)

['train', 'validation', 'test']

## Load a dataset

* [datasets.load_dataset(path: str, split: str, streaming: bool, num_proc: int)](https://huggingface.co/docs/datasets/package_reference/loading_methods#datasets.load_dataset)

Returns a [datasets.DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict) if ```split``` is ```None```, or a [datasets.Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset) if a split is specified. This is similar to the TFDS load method.

### Load multiple splits as Dataset

In [5]:
combined: datasets.Dataset = datasets.load_dataset(
    path=dataset_name, 
    split="train+test",
    num_proc=8
)

Found cached dataset rotten_tomatoes (/Users/oonisim/.cache/huggingface/datasets/rotten_tomatoes/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46)


In [6]:
type(combined)

datasets.arrow_dataset.Dataset

In [7]:
print(len(combined))
del combined

9596


### Load part of split

In [8]:
partial: datasets.Dataset = datasets.load_dataset(
    path=dataset_name, 
    split="train[10:20]",
    num_proc=8
)

Found cached dataset rotten_tomatoes (/Users/oonisim/.cache/huggingface/datasets/rotten_tomatoes/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46)


In [9]:
truncated = partial.map(lambda row: {"text": row['text'][:20], "label": row['label']})
print(truncated[0])
del truncated, partial

Loading cached processed dataset at /Users/oonisim/.cache/huggingface/datasets/rotten_tomatoes/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/cache-b579144e7150ad6b.arrow


{'text': 'this is a film well ', 'label': 1}


### Load entire dataset as DatasetDict

In [10]:
dataset: datasets.DatasetDict = load_dataset(dataset_name, streaming=False, num_proc=8)
dataset.keys()

Found cached dataset rotten_tomatoes (/Users/oonisim/.cache/huggingface/datasets/rotten_tomatoes/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46)



dict_keys(['train', 'validation', 'test'])

## Attributes
### Column names

In [11]:
dataset['train'].column_names

['text', 'label']

### Shape

In [12]:
dataset['train'].shape

(8530, 2)

### Features

In [14]:
dataset['train'].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None)}

## Random access

In [7]:
train: datasets.Dataset = dataset['train']

In [58]:
df = pd.DataFrame(train.shuffle()[:5])

In [59]:
df

| | text | label |
|---|---|---|
| 0 | phillip noyce and all of his actors -- as well as his cinematographer , christopher doyle -- understand the delicate forcefulness of greene's prose , and it's there on the screen in their version of the quiet american . | 1 |
| 1 | nettelbeck . . . has a pleasing way with a metaphor . | 1 |
| 2 | the issues are presented in such a lousy way , complete with some of the year's ( unintentionally ) funniest moments , that it's impossible to care . | 0 |
| 3 | as a rumor of angels reveals itself to be a sudsy tub of supernatural hokum , not even ms . redgrave's noblest efforts can redeem it from hopeless sentimentality . | 0 |
| 4 | mostly martha could have used a little trimming -- 10 or 15 minutes could be cut and no one would notice -- but it's a pleasurable trifle . the only pain you'll feel as the credits roll is your stomach grumbling for some tasty grub . | 1 |


## Select rows

* [Dataset.select(indices: Iterable)](https://huggingface.co/docs/datasets/v2.11.0/en/package_reference/main_classes#datasets.Dataset.select)

```select()``` returns a new dataset containing the rows at the given indices. The indices can be a range, list, iterable, ndarray, or Series.

In [73]:
selected: datasets.Dataset = train.shuffle().select([1,3,5,7,9])
print(selected[0])
del selected

{'text': 'too smart to ignore but a little too smugly superior to like , this could be a movie that ends up slapping its target audience in the face by shooting itself in the foot .', 'label': 0}


## map

* [Dataset.map(function: Callable, with_indices: bool, num_proc: int, remove_columns: Optional)](https://huggingface.co/docs/datasets/v2.11.0/en/package_reference/main_classes#datasets.Dataset.map)

In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path='bigscience/bloom-560m', 
    use_fast=True
)

In [8]:
mapped: datasets.Dataset = train.shuffle().select([1,3,5,7,9]).map(
    lambda row: {
        "tokens": tokenizer(row['text']),
        "label": row['label']
    }, 
    num_proc=8,
    remove_columns=['text']
)
pd.DataFrame(mapped)

num_proc must be <= 5. Reducing num_proc to 5 for dataset of size 5.



Unnamed: 0,label,tokens
0,1,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1], 'input_ids': [5984, 8874, 632, 12229, 461, 85753, 503]}"
1,1,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [5984, 8874, 632, 22566, 19639, 98804, 630, 530, 368, 93082, 461, 368, 20500, 47664, 1306, 105708, 999, 74950, 386, 503]}"
2,0,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [26676, 368, 198139, 1068, 471, 108401, 375, 23242, 368, 89430, 7025, 125347, 1163, 22646, 861, 368, 8874, 632, 11705, 197888, 630, 1640, 11168, 47768, 1256, 181629, 154854, 265, 226608, 632, 89122, 17303, 148424, 530, 3478, 90, 350, 11252, 630, 16997, 368, 212022, 78700, 18425, 1955, 3825, 92, 17303, 4340, 22285, 2926, 503]}"
3,1,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [2606, 632, 5067, 267, 101944, 99455, 51651, 4618, 368, 77016, 24824, 530, 368, 12792, 386, 1306, 1427, 43752, 630, 5268, 126042, 1320, 368, 26143, 1256, 267, 10512, 32352, 503]}"
4,1,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [32112, 267, 178584, 275, 88222, 504, 1859, 2040, 3816, 54645, 727, 6355, 427, 7010, 1119, 2592, 1800, 5908, 6648, 35895, 2256, 152901, 135056, 9066, 10749]}"


In [None]:
del mapped

## Flatten

* [flatten](https://huggingface.co/docs/datasets/v2.11.0/en/package_reference/main_classes#datasets.Dataset.flatten)

In [9]:
mapped: datasets.Dataset = train.shuffle().select([1,3,5]).map(
    lambda row: {
        "tokens": tokenizer(row['text']),
        "label": row['label']
    }, 
    num_proc=8
)
pd.DataFrame(mapped)

num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.



Unnamed: 0,text,label,tokens
0,"one problem with the movie , directed by joel schumacher , is that it jams too many prefabricated story elements into the running time .",0,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [1738, 6875, 1002, 368, 51651, 630, 61961, 1331, 4985, 343, 10257, 180190, 630, 632, 861, 718, 8501, 86, 10136, 7112, 45325, 107922, 1790, 26143, 12829, 3727, 368, 16118, 3509, 503]}"
1,"pipe dream does have its charms . the leads are natural and lovely , the pace is serene , the humor wry and sprightly .",1,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [115211, 42452, 4636, 1542, 3776, 5343, 3611, 503, 368, 40265, 1306, 8390, 530, 138799, 630, 368, 113737, 632, 959, 3479, 630, 368, 45554, 372, 3979, 530, 35752, 1295, 999, 503]}"
2,that it'll probably be the best and most mature comedy of the 2002 summer season speaks more of the season than the picture,1,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [19562, 114101, 15927, 722, 368, 7733, 530, 6084, 149939, 171895, 461, 368, 8122, 46693, 27700, 78739, 3172, 461, 368, 27700, 4340, 368, 33777]}"


In [10]:
mapped.flatten()

Dataset({
    features: ['text', 'label', 'tokens.attention_mask', 'tokens.input_ids'],
    num_rows: 3
})

In [11]:
del mapped

## Remove columns

* [Dataset.remove_columns(column_names: List)](https://huggingface.co/docs/datasets/v2.11.0/en/package_reference/main_classes#datasets.Dataset.remove_columns)

In [12]:
mapped: datasets.Dataset = train.shuffle().select([1,3]).map(
    lambda row: {
        "tokens": tokenizer(row['text']),
        "label": row['label']
    }, 
    num_proc=8
)
pd.DataFrame(mapped)

num_proc must be <= 2. Reducing num_proc to 2 for dataset of size 2.



Unnamed: 0,text,label,tokens
0,"the issues are presented in such a lousy way , complete with some of the year's ( unintentionally ) funniest moments , that it's impossible to care .",0,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [5984, 17327, 1306, 33296, 361, 5067, 267, 275, 1188, 92, 4676, 630, 21196, 1002, 3331, 461, 368, 206321, 375, 447, 966, 8909, 2194, 1163, 7793, 5137, 388, 38532, 630, 861, 6648, 28698, 427, 12963, 503]}"
1,"often gruelling and heartbreaking to witness , but seldahl and wollter's sterling performances raise this far above the level of the usual maudlin disease movie .",1,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [235154, 1186, 88, 21230, 530, 23724, 7608, 17444, 427, 53134, 630, 1965, 2118, 71, 37970, 530, 203961, 562, 1256, 49609, 8827, 93082, 29955, 1119, 8372, 9468, 368, 6626, 461, 368, 47305, 196340, 7253, 26700, 51651, 503]}"


In [13]:
pd.DataFrame(mapped.remove_columns(column_names=['text']))

Unnamed: 0,label,tokens
0,0,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [5984, 17327, 1306, 33296, 361, 5067, 267, 275, 1188, 92, 4676, 630, 21196, 1002, 3331, 461, 368, 206321, 375, 447, 966, 8909, 2194, 1163, 7793, 5137, 388, 38532, 630, 861, 6648, 28698, 427, 12963, 503]}"
1,1,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [235154, 1186, 88, 21230, 530, 23724, 7608, 17444, 427, 53134, 630, 1965, 2118, 71, 37970, 530, 203961, 562, 1256, 49609, 8827, 93082, 29955, 1119, 8372, 9468, 368, 6626, 461, 368, 47305, 196340, 7253, 26700, 51651, 503]}"


In [14]:
del mapped

---
# Stream Dataset

* [Stream](https://huggingface.co/docs/datasets/stream)


In [30]:
stream = load_dataset('ag_news', streaming=True)
stream['train'].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['World', 'Sports', 'Business', 'Sci/Tech'], id=None)}

In [31]:
train = stream['train']

In [23]:
df = pd.DataFrame(list(train.take(1)))
df

| | text | label |
|---|---|---|
| 0 | Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again. | 2 |


In [24]:
type(stream)

datasets.dataset_dict.IterableDatasetDict

In [25]:
type(stream['train'])

datasets.iterable_dataset.IterableDataset

In [27]:
del stream