# Loading and transforming HuggingFace datasets

The HuggingFace (HF) platform provides a wide variety of ML models, datasets, and transformers to the worldwide community.
These assets are easy to access through Python packages such as [datasets](https://pypi.org/project/datasets/) and [transformers](https://pypi.org/project/transformers/), both available on PyPI.

In this tutorial you will learn how to use HF datasets and tools with Grain: how to load HF datasets and how to use HF transformers in your Grain pipeline.

## Setup

To run the notebook you need a few packages installed in your environment: `grain`, `numpy`, and two HF packages: `datasets` and `transformers`. The code also imports `dateutil`, which is distributed on PyPI as `python-dateutil`.

In [None]:
!pip install grain
!pip install numpy datasets transformers

In [2]:
# Python standard library
from pprint import pprint

# Third-party packages
from dateutil.parser import parse

import grain
import numpy as np

# HF imports
from datasets import load_dataset
from transformers import AutoTokenizer

## Loading dataset

Let's first load an HF dataset. For simplicity, we use [lhoestq/demo1](https://huggingface.co/datasets/lhoestq/demo1), a minimal dataset whose `train` and `test` splits each contain five rows and six columns.

In [3]:
hf_dataset = load_dataset("lhoestq/demo1")
hf_train, hf_test = hf_dataset["train"], hf_dataset["test"]
hf_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'package_name', 'review', 'date', 'star', 'version_id'],
        num_rows: 5
    })
    test: Dataset({
        features: ['id', 'package_name', 'review', 'date', 'star', 'version_id'],
        num_rows: 5
    })
})

Each sample is a Python dictionary with string or integer data.

In [4]:
hf_train[0]

{'id': '7bd227d9-afc9-11e6-aba1-c4b301cdf627',
 'package_name': 'com.mantz_it.rfanalyzer',
 'review': "Great app! The new version now works on my Bravia Android TV which is great as it's right by my rooftop aerial cable. The scan feature would be useful...any ETA on when this will be available? Also the option to import a list of bookmarks e.g. from a simple properties file would be useful.",
 'date': 'October 12 2016',
 'star': 4,
 'version_id': 1487}

## Preprocessing

Let's assume that our preprocessing pipeline should convert the string `date` field to a timestamp and the whole sample to a NumPy array.

In [5]:
def process_date(sample: dict) -> dict:
    sample["date"] = parse(sample["date"]).timestamp()
    return sample

def process_sample_to_np(sample: dict) -> np.ndarray:
    return np.array([*sample.values()], dtype=object)
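
One caveat worth knowing: calling `timestamp()` on a naive datetime interprets it in the machine's local time zone, so the timestamps shown in this tutorial correspond to an environment running in UTC. The sketch below is a time-zone-independent variant that uses only the standard library. The function name `process_date_utc` and the fixed `"%B %d %Y"` format are assumptions made for illustration, not part of the original pipeline. It also returns a new dictionary instead of modifying its argument.

```python
from datetime import datetime, timezone


def process_date_utc(sample: dict) -> dict:
    """Hypothetical variant of process_date that pins the date to UTC."""
    # Assumes dates look like "October 12 2016" (English month names).
    parsed = datetime.strptime(sample["date"], "%B %d %Y")
    # Attach UTC explicitly so the result does not depend on the local time zone.
    timestamp = parsed.replace(tzinfo=timezone.utc).timestamp()
    # Return a copy rather than mutating the input sample.
    return {**sample, "date": timestamp}


example = {"date": "October 12 2016", "star": 4}
print(process_date_utc(example)["date"])  # 1476230400.0
print(example["date"])  # October 12 2016 (the input is unchanged)
```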

Building a pipeline is as simple as chaining `map` calls. An HF dataset supports random access, so we can pass it directly to the `source` method. The resulting object is a `grain.MapDataset`, which also supports random access.

In [6]:
dataset = (
    grain.MapDataset.source(hf_train)
    .shuffle(seed=10)  # shuffles globally
    .map(process_date)  # maps each element
    .map(process_sample_to_np)  # maps each element
)
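
What "random access" means here can be sketched without any library: an object that reports its length and returns the element stored at a given index. The class below, `InMemorySource`, is a hypothetical stand-in written for illustration; it is not part of Grain or `datasets`.

```python
from typing import Any, Sequence


class InMemorySource:
    """Hypothetical random-access source: supports len() and integer indexing."""

    def __init__(self, records: Sequence[dict]):
        self._records = list(records)

    def __len__(self) -> int:
        return len(self._records)

    def __getitem__(self, index: int) -> Any:
        # Return a copy so downstream transformations cannot alter the stored record.
        return dict(self._records[index])


source = InMemorySource([{"star": 4}, {"star": 5}, {"star": 2}])
print(len(source))  # 3
print(source[1])    # {'star': 5}
```

An HF `Dataset` offers the same two operations, which is why `hf_train` can be handed to `grain.MapDataset.source` directly.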

In [7]:
list(dataset)

[array(['7bd22aba-afc9-11e6-8293-c4b301cdf627', 'com.mantz_it.rfanalyzer',
        'Works well with my Hackrf Hopefully new updates will arrive for extra functions',
        1469145600.0, 5, 1487], dtype=object),
 array(['7bd227d9-afc9-11e6-aba1-c4b301cdf627', 'com.mantz_it.rfanalyzer',
        "Great app! The new version now works on my Bravia Android TV which is great as it's right by my rooftop aerial cable. The scan feature would be useful...any ETA on when this will be available? Also the option to import a list of bookmarks e.g. from a simple properties file would be useful.",
        1476230400.0, 4, 1487], dtype=object),
 array(['7bd22905-afc9-11e6-a5dc-c4b301cdf627', 'com.mantz_it.rfanalyzer',
        "Great It's not fully optimised and has some issues with crashing but still a nice app  especially considering the price and it's open source.",
        1471910400.0, 4, 1487], dtype=object),
 array(['7bd22a26-afc9-11e6-9309-c4b301cdf627', 'com.mantz_it.rfanalyzer',
        'The 

## Tokenizer

Next we would like to tokenize the `review` field. Language models operate on integers (token IDs) rather than raw strings. The generic `AutoTokenizer` class provides the `from_pretrained` method, which loads the tokenizer of a model hosted on HF services.

Let's use `bert-base-uncased`, a case-insensitive BERT transformer model.

In [8]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer

BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

Transforming a single review string yields a dictionary with three keys. We're only interested in `input_ids` since that is the encoded review.

In [9]:
review = hf_train[0]["review"]
pprint(review)
print("\n", tokenizer(review).keys(), "\n")
pprint(tokenizer(review)["input_ids"])

('Great app! The new version now works on my Bravia Android TV which is great '
 "as it's right by my rooftop aerial cable. The scan feature would be "
 'useful...any ETA on when this will be available? Also the option to import a '
 'list of bookmarks e.g. from a simple properties file would be useful.')

 dict_keys(['input_ids', 'token_type_ids', 'attention_mask']) 

[101,
 2307,
 10439,
 999,
 1996,
 2047,
 2544,
 2085,
 2573,
 2006,
 2026,
 11655,
 9035,
 11924,
 2694,
 2029,
 2003,
 2307,
 2004,
 2009,
 1005,
 1055,
 2157,
 2011,
 2026,
 23308,
 9682,
 5830,
 1012,
 1996,
 13594,
 3444,
 2052,
 2022,
 6179,
 1012,
 1012,
 1012,
 2151,
 27859,
 2006,
 2043,
 2023,
 2097,
 2022,
 2800,
 1029,
 2036,
 1996,
 5724,
 2000,
 12324,
 1037,
 2862,
 1997,
 2338,
 27373,
 1041,
 1012,
 1043,
 1012,
 2013,
 1037,
 3722,
 5144,
 5371,
 2052,
 2022,
 6179,
 1012,
 102]
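
To build intuition for the output above, here is a toy word-level encoder. It is only a sketch: the real BERT tokenizer uses WordPiece subword splitting and a vocabulary of 30522 entries, whereas `TOY_VOCAB` and `toy_encode` are hypothetical names holding a handful of IDs copied from the output above. IDs 100, 101, and 102 are `[UNK]`, `[CLS]`, and `[SEP]`, as listed in the tokenizer printout.

```python
import re

# Hypothetical toy vocabulary; the IDs are copied from the tokenizer output above.
TOY_VOCAB = {"great": 2307, "app": 10439, "!": 999, "the": 1996, "new": 2047, "version": 2544}
UNK_ID, CLS_ID, SEP_ID = 100, 101, 102


def toy_encode(text: str) -> list:
    """Lowercases, splits into words and punctuation, and maps each piece to an ID."""
    pieces = re.findall(r"\w+|[^\w\s]", text.lower())
    # Pieces missing from the toy vocabulary fall back to the [UNK] ID.
    ids = [TOY_VOCAB.get(piece, UNK_ID) for piece in pieces]
    # Like the BERT tokenizer, wrap the sequence in [CLS] ... [SEP].
    return [CLS_ID, *ids, SEP_ID]


print(toy_encode("Great app! The new version"))
# [101, 2307, 10439, 999, 1996, 2047, 2544, 102]
print(toy_encode("Great radio"))
# [101, 2307, 100, 102]
```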


Plugging in the selected tokenizer is as easy as before. We implement the `process_transformer` function and pass it to the `map` method.

In [10]:
def process_transformer(sample: dict) -> dict:
    sample["review"] = np.array(tokenizer(sample["review"])["input_ids"])
    return sample

dataset = (
    grain.MapDataset.source(hf_train)
    .shuffle(seed=10)
    .map(process_date)
    .map(process_transformer)
)

The samples are now less human-readable but more machine-friendly.

In [11]:
dataset[1]

{'id': '7bd227d9-afc9-11e6-aba1-c4b301cdf627',
 'package_name': 'com.mantz_it.rfanalyzer',
 'review': array([  101,  2307, 10439,   999,  1996,  2047,  2544,  2085,  2573,
         2006,  2026, 11655,  9035, 11924,  2694,  2029,  2003,  2307,
         2004,  2009,  1005,  1055,  2157,  2011,  2026, 23308,  9682,
         5830,  1012,  1996, 13594,  3444,  2052,  2022,  6179,  1012,
         1012,  1012,  2151, 27859,  2006,  2043,  2023,  2097,  2022,
         2800,  1029,  2036,  1996,  5724,  2000, 12324,  1037,  2862,
         1997,  2338, 27373,  1041,  1012,  1043,  1012,  2013,  1037,
         3722,  5144,  5371,  2052,  2022,  6179,  1012,   102]),
 'date': 1476230400.0,
 'star': 4,
 'version_id': 1487}
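
When a sample like the one above is later passed through `process_sample_to_np`, the tokenized review stays a nested array inside a one-dimensional object array. The short sketch below illustrates this with a made-up, shortened sample; the values are placeholders, and `dtype=object` is what allows strings, numbers, and an array to share one container.

```python
import numpy as np

# Made-up, shortened sample mixing a string, a token array, a float, and an int.
sample = {
    "id": "abc",
    "review": np.array([101, 2307, 102]),
    "date": 1476230400.0,
    "star": 4,
}

# Same conversion as process_sample_to_np: one object-typed slot per field.
row = np.array([*sample.values()], dtype=object)

print(row.shape)     # (4,)
print(row.dtype)     # object
print(type(row[1]))  # <class 'numpy.ndarray'>
```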

## Complete Pipeline

Time to build our final pipeline! The pipeline doesn't need to be restricted to `shuffle` and `map`. Grain has a rich API that offers further operations such as `filter`, `random_map`, and `repeat`. Check out the [Grain API](../../grain.rst) page to learn more.

In addition to tokenizing, we want to discard reviews rated three stars or fewer. Note that filtering changes the number of samples that reach the following steps, so those steps can no longer rely on random access. To batch as the final step, we add `.to_iter_dataset()`, which converts the `MapDataset` to an `IterDataset`, a dataset with an iterator-like interface.

In [12]:
dataset = (
    grain.MapDataset.source(hf_train)
    .shuffle(seed=10)
    .map(process_date)
    .map(process_transformer)
    .filter(lambda x: x["star"] > 3)  # filters samples
    .map(process_sample_to_np)
    .to_iter_dataset()
    .batch(batch_size=2)  # batches consecutive elements
)
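
The effect of filtering followed by batching can be pictured in plain Python. This is a conceptual sketch, not Grain's implementation: `keep_high_ratings` and `batch_consecutive` are hypothetical helpers, and the star ratings are made up for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator


def keep_high_ratings(samples: Iterable[dict]) -> Iterator[dict]:
    """Yields only samples rated above three stars."""
    return (sample for sample in samples if sample["star"] > 3)


def batch_consecutive(items: Iterable[dict], batch_size: int) -> Iterator[list]:
    """Groups consecutive items into lists of at most batch_size elements."""
    iterator = iter(items)
    # Keep pulling batch_size items until the iterator is exhausted.
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Made-up ratings; only three of the five samples pass the filter.
samples = [{"star": s} for s in (5, 2, 4, 3, 4)]
batches = list(batch_consecutive(keep_high_ratings(samples), batch_size=2))
print(batches)
# [[{'star': 5}, {'star': 4}], [{'star': 4}]]
```

Because the number of surviving samples is only known once the filter has run, the stream is consumed sequentially, which is the role `IterDataset` plays in the pipeline above.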

With `IterDataset` we can use Python built-ins, `iter` and `next`, to interact with the dataset.

In [13]:
next(iter(dataset))

array([['7bd22aba-afc9-11e6-8293-c4b301cdf627',
        'com.mantz_it.rfanalyzer',
        array([  101,  2573,  2092,  2007,  2026, 20578, 12881, 11504,  2047,
               14409,  2097,  7180,  2005,  4469,  4972,   102])             ,
        1469145600.0, 5, 1487],
       ['7bd227d9-afc9-11e6-aba1-c4b301cdf627',
        'com.mantz_it.rfanalyzer',
        array([  101,  2307, 10439,   999,  1996,  2047,  2544,  2085,  2573,
                2006,  2026, 11655,  9035, 11924,  2694,  2029,  2003,  2307,
                2004,  2009,  1005,  1055,  2157,  2011,  2026, 23308,  9682,
                5830,  1012,  1996, 13594,  3444,  2052,  2022,  6179,  1012,
                1012,  1012,  2151, 27859,  2006,  2043,  2023,  2097,  2022,
                2800,  1029,  2036,  1996,  5724,  2000, 12324,  1037,  2862,
                1997,  2338, 27373,  1041,  1012,  1043,  1012,  2013,  1037,
                3722,  5144,  5371,  2052,  2022,  6179,  1012,   102])      ,
        1476230400.0

And that's it! We ended up with a batch containing processed dates and tokenized reviews, restricted to reviews rated above three stars.