# Hugging Face Datasets

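Hugging Face Datasets provides easy access to datasets commonly used with transformer models, e.g. GLUE. Like TFDS (TensorFlow Datasets), each dataset is organised into splits ```(train, validation, test)```, and ```load_dataset()``` returns a dictionary of splits by default.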


* [Github - Hugging Face Datasets](https://github.com/huggingface/datasets)

> ```
> pip install datasets
> ```
> * ```datasets.list_datasets()``` to list the available datasets
> * ```datasets.load_dataset(dataset_name, **kwargs)``` to instantiate a dataset

* [Hugging Face Datasets Introduction](https://huggingface.co/course/chapter5/1?fw=pt)

* [Hugging Face Datasets - Quick Start](https://huggingface.co/docs/datasets/quickstart)

<img src="../image/huggingface_datasets.png" align="left" width=700/>

In [3]:
!pip install datasets --quiet

In [26]:
import numpy as np
import pandas as pd
import datasets
from datasets import (
    list_datasets, 
    load_dataset,
    get_dataset_split_names,
)


pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option("max_colwidth", None)
pd.set_option("max_seq_items", None)

# List available datasets

In [4]:
list_datasets()[:5]

['acronym_identification',
 'ade_corpus_v2',
 'adversarial_qa',
 'aeslc',
 'afrikaans_ner_corpus']

Note: newer releases of ```datasets``` deprecate ```list_datasets()``` in favour of ```huggingface_hub.list_datasets()```, so check the installed version if the import above fails.

----

# Random access vs Iterable 

* [IterableDataset](https://huggingface.co/docs/datasets/access#iterabledataset)

> An IterableDataset is loaded when you set the streaming parameter to True in ```load_dataset()```:
> ```
> from datasets import load_dataset
> iterable_dataset = load_dataset("food101", split="train", streaming=True)
> ```


# Loading CSV, JSON, Parquet, and SQL results from local files and URLs

Like the pandas ```read_*``` functions, ```load_dataset()``` can load these file formats from local disk or from URLs.

* [Local and remote files](https://huggingface.co/docs/datasets/loading#local-and-remote-files)

> Datasets can be loaded from local files stored on your computer and from remote files. The datasets are most likely stored as a csv, json, txt or parquet file. The load_dataset() function can load each of these file types.

---
# Methods

* [rotten_tomatoes](https://huggingface.co/datasets/rotten_tomatoes) movie review dataset

> Movie Review Dataset. This is a dataset of containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, ``Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.'', Proceedings of the ACL, 2005.

<img src="../image/huggingface_dataset_rotten_tomatos.png" align="left" width=700/>

In [16]:
dataset_name = "rotten_tomatoes"

## Get the names of splits

In [12]:
get_dataset_split_names(dataset_name)

['train', 'validation', 'test']

## Load a dataset

* [datasets.load_dataset(path: str, split: str, streaming: bool, num_proc: int)](https://huggingface.co/docs/datasets/package_reference/loading_methods#datasets.load_dataset)

Returns a [datasets.DatasetDict](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict) if ```split``` is ```None```, otherwise a [datasets.Dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset). This is similar to the TFDS ```load``` method.

### Load multiple splits as Dataset

In [33]:
combined: datasets.Dataset = datasets.load_dataset(
    path=dataset_name, 
    split="train+test",
    num_proc=8
)

Found cached dataset rotten_tomatoes (/Users/oonisim/.cache/huggingface/datasets/rotten_tomatoes/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46)


In [34]:
type(combined)

datasets.arrow_dataset.Dataset

In [35]:
print(len(combined))
del combined

9596


### Load part of split

In [45]:
partial: datasets.Dataset = datasets.load_dataset(
    path=dataset_name, 
    split="train[10:20]",
    num_proc=8
)

Found cached dataset rotten_tomatoes (/Users/oonisim/.cache/huggingface/datasets/rotten_tomatoes/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46)


In [50]:
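# Truncate each review to its first 20 characters; the label is kept as-is.
truncated = partial.map(lambda row: {"text": row['text'][:20], "label": row['label']})
print(truncated[0])
del truncated, partial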

Loading cached processed dataset at /Users/oonisim/.cache/huggingface/datasets/rotten_tomatoes/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/cache-b579144e7150ad6b.arrow


{'text': 'this is a film well ', 'label': 1}


### Load entire dataset as DatasetDict

In [27]:
dataset: datasets.DatasetDict = load_dataset(dataset_name, streaming=False, num_proc=8)
dataset.keys()

Found cached dataset rotten_tomatoes (/Users/oonisim/.cache/huggingface/datasets/rotten_tomatoes/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46)


  0%|          | 0/3 [00:00<?, ?it/s]

dict_keys(['train', 'validation', 'test'])

## Attributes
### Column names

In [65]:
dataset['train'].column_names

['text', 'label']

### Shape

In [66]:
dataset['train'].shape

(8530, 2)

## Random access

In [57]:
train: datasets.Dataset = dataset['train']

In [58]:
df = pd.DataFrame(train.shuffle()[:5])

In [59]:
df

| | text | label |
|---|---|---|
| 0 | phillip noyce and all of his actors -- as well as his cinematographer , christopher doyle -- understand the delicate forcefulness of greene's prose , and it's there on the screen in their version of the quiet american . | 1 |
| 1 | nettelbeck . . . has a pleasing way with a metaphor . | 1 |
| 2 | the issues are presented in such a lousy way , complete with some of the year's ( unintentionally ) funniest moments , that it's impossible to care . | 0 |
| 3 | as a rumor of angels reveals itself to be a sudsy tub of supernatural hokum , not even ms . redgrave's noblest efforts can redeem it from hopeless sentimentality . | 0 |
| 4 | mostly martha could have used a little trimming -- 10 or 15 minutes could be cut and no one would notice -- but it's a pleasurable trifle . the only pain you'll feel as the credits roll is your stomach grumbling for some tasty grub . | 1 |


## Select rows

* [Dataset.select(indices: range, list, iterable, ndarray or Series)](https://huggingface.co/docs/datasets/v2.11.0/en/package_reference/main_classes#datasets.Dataset.select)

select() returns rows according to a list of indices.

In [73]:
selected: datasets.Dataset = train.shuffle().select([1,3,5,7,9])
print(selected[0])
del selected

{'text': 'too smart to ignore but a little too smugly superior to like , this could be a movie that ends up slapping its target audience in the face by shooting itself in the foot .', 'label': 0}


## map

* [Dataset.map(function: Callable, with_indices: bool, num_proc: int, remove_columns: Optional)](https://huggingface.co/docs/datasets/v2.11.0/en/package_reference/main_classes#datasets.Dataset.map)

In [36]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path='bigscience/bloom-560m', 
    use_fast=True
)

In [64]:
mapped: datasets.Dataset = train.shuffle().select([1,3,5,7,9]).map(
    lambda row: {
        "tokens": tokenizer(row['text']),
        "label": row['label']
    }, 
    num_proc=8
)
pd.DataFrame(mapped)

num_proc must be <= 5. Reducing num_proc to 5 for dataset of size 5.


Map (num_proc=5):   0%|          | 0/5 [00:00<?, ? examples/s]

Unnamed: 0,text,label,tokens
0,"donovan . . . squanders his main asset , jackie chan , and fumbles the vital action sequences .",0,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [24027, 135454, 10749, 18453, 392, 525, 3868, 4291, 67704, 630, 98467, 641, 49708, 630, 530, 319, 3677, 1336, 368, 35085, 9066, 60041, 503]}"
1,"outside of burger's desire to make some kind of film , it's really unclear why this project was undertaken",0,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [1199, 14812, 461, 5967, 3514, 1256, 64141, 427, 5219, 3331, 11596, 461, 8874, 630, 6648, 9780, 128183, 11257, 1119, 6671, 1620, 91763]}"
2,intelligent and moving .,1,"{'attention_mask': [1, 1, 1, 1, 1], 'input_ids': [966, 222114, 530, 34071, 503]}"
3,"the reason to see "" sade "" lay with the chemistry and complex relationship between the marquis ( auteil ) and emilie ( le besco ) .",1,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [5984, 13729, 427, 4913, 567, 272, 2160, 567, 8661, 1002, 368, 114138, 530, 11235, 23556, 5299, 368, 125004, 375, 1808, 154621, 1163, 530, 766, 167997, 375, 578, 8481, 1594, 28620]}"
4,"expect no major discoveries , nor any stylish sizzle , but the film sits with square conviction and touching good sense on the experience of its women .",1,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [68586, 654, 9718, 72595, 1071, 630, 11895, 2914, 97076, 1776, 272, 40245, 327, 630, 1965, 368, 8874, 161113, 1002, 31708, 84321, 530, 165502, 7220, 9482, 664, 368, 24575, 461, 3776, 14216, 503]}"


In [None]:
del mapped

## Flatten

* [flatten](https://huggingface.co/docs/datasets/v2.11.0/en/package_reference/main_classes#datasets.Dataset.flatten)

In [69]:
mapped: datasets.Dataset = train.shuffle().select([1,3,5]).map(
    lambda row: {
        "tokens": tokenizer(row['text']),
        "label": row['label']
    }, 
    num_proc=8
)
pd.DataFrame(mapped)

num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.


Map (num_proc=3):   0%|          | 0/3 [00:00<?, ? examples/s]

Unnamed: 0,text,label,tokens
0,an elegant and sly deadpan comedy .,1,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [256, 104189, 530, 272, 999, 19329, 27168, 171895, 503]}"
1,a soulless jumble of ineptly assembled cliches and pabulum that plays like a 95-minute commercial for nba properties .,0,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [68, 1427, 2347, 738, 42134, 1611, 461, 361, 25506, 999, 180056, 51624, 9971, 530, 269, 465, 37501, 861, 60846, 3269, 267, 21602, 178387, 22912, 613, 294, 2825, 18792, 503]}"
2,too simple for its own good .,0,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1], 'input_ids': [105961, 6353, 613, 3776, 9016, 7220, 503]}"


In [70]:
mapped.flatten()

Dataset({
    features: ['text', 'label', 'tokens.attention_mask', 'tokens.input_ids'],
    num_rows: 3
})

In [71]:
del mapped

## Remove columns

* [Dataset.remove_columns(column_names: List)](https://huggingface.co/docs/datasets/v2.11.0/en/package_reference/main_classes#datasets.Dataset.remove_columns)

In [74]:
mapped: datasets.Dataset = train.shuffle().select([1,3]).map(
    lambda row: {
        "tokens": tokenizer(row['text']),
        "label": row['label']
    }, 
    num_proc=8
)
pd.DataFrame(mapped)

num_proc must be <= 2. Reducing num_proc to 2 for dataset of size 2.


Map (num_proc=2):   0%|          | 0/2 [00:00<?, ? examples/s]

Unnamed: 0,text,label,tokens
0,"a fascinating , bombshell documentary that should shame americans , regardless of whether or not ultimate blame finally lies with kissinger . should be required viewing for civics classes and would-be public servants alike .",1,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [68, 74101, 4128, 630, 15204, 98049, 194996, 861, 3403, 124115, 27684, 703, 630, 76744, 461, 14600, 791, 1130, 127949, 97021, 31139, 53935, 1002, 101444, 24443, 503, 3403, 722, 13869, 132762, 613, 8187, 3958, 17733, 530, 3276, 55078, 2470, 154313, 164554, 503]}"
1,"the script is smart , not cloying .",1,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [5984, 15820, 632, 30479, 630, 1130, 1146, 2184, 386, 503]}"


In [76]:
pd.DataFrame(mapped.remove_columns(column_names=['text']))

Unnamed: 0,label,tokens
0,1,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [68, 74101, 4128, 630, 15204, 98049, 194996, 861, 3403, 124115, 27684, 703, 630, 76744, 461, 14600, 791, 1130, 127949, 97021, 31139, 53935, 1002, 101444, 24443, 503, 3403, 722, 13869, 132762, 613, 8187, 3958, 17733, 530, 3276, 55078, 2470, 154313, 164554, 503]}"
1,1,"{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [5984, 15820, 632, 30479, 630, 1130, 1146, 2184, 386, 503]}"


In [77]:
del mapped