# Everything you need to know to work with the 🤗 Datasets library

This notebook compiles my notes on the most relevant parts of the 🤗 Datasets documentation for creating, processing, and sharing text datasets in the context of NLP and LLM applications.

Many snippets are copied from the documentation, sometimes with my own edits.
For the full official documentation, go [here](https://huggingface.co/docs/datasets/index).

In [1]:
# install
!pip install datasets --quiet

>[Everything you need to know to work with the 🤗 Datasets library](#scrollTo=ruV-FbZWfKrB)

>[Metadata](#scrollTo=kku8mTL_XfYR)

>[Dataset Loading](#scrollTo=9IMbg5ejXkUj)

>>[Know the dataset](#scrollTo=cqpMOKDAnsUx)

>>>[IterableDataset](#scrollTo=224XV36uoRni)

>[Preprocess](#scrollTo=YXnJ-kQqzaIS)

>>[3.1 Sort, Shuffle, Select, Split](#scrollTo=EKrM-5uTYXqu)

>>[3.2 Rename, Remove, Cast and Flatten](#scrollTo=2M5Ub5KLdHJ_)

>>[3.3 Transformations with map()](#scrollTo=zu--1ofcmGoT)

>>>[3.3.1 Tokenization example](#scrollTo=OzUKwc5rmbSD)

>>>[3.3.2 Adding some prefix](#scrollTo=8nsZ2W0OmxlU)

>>>[3.3.3 Creating new columns](#scrollTo=uIGMXSnWpMgO)

>>>[3.3.4 Using multiple processors](#scrollTo=hQ0WlE9npQ5v)

>>>[3.3.5 Example of Splitting long text](#scrollTo=dpglRoN8pIPp)

>>[3.4 Creating Batches](#scrollTo=31JGlYg5qC_G)

>>[3.5 Concatenate](#scrollTo=18ZiY_WqqpUs)

>>[3.6 Format Columns](#scrollTo=719iGWXdt4Pu)

>>[3.7 Align a dataset label id with label name](#scrollTo=9h4vLvqUY8fn)

>[Creating a dataset](#scrollTo=ELvWfNvR1IkG)

>[Slice splits](#scrollTo=9dLeDHGeTAb-)

>[Dataset features](#scrollTo=YHf6zdzeh5VW)

>[Share a dataset to the Hub](#scrollTo=XWWhPsJCwNVm)

>>[7.1 Push dataset to the hub](#scrollTo=yfEOSzfCcoQO)

>>[7.2 Creating a dataset card](#scrollTo=JKFNnh2lcp3j)



# 1. Metadata

A dataset’s information is stored inside `DatasetInfo` and can include information such as the dataset description, features, and dataset size.

Use the `load_dataset_builder()` function to load a dataset builder and inspect a dataset’s attributes without committing to downloading it:

In [2]:
from datasets import load_dataset_builder
ds_builder = load_dataset_builder('PolyAI/minds14', 'en-GB')
ds_builder.info

The repository for PolyAI/minds14 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/PolyAI/minds14.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


DatasetInfo(description='MINDS-14 is a dataset for the intent detection task with spoken data. It covers 14 intents extracted from a commercial system in the e-banking domain, associated with spoken examples in 14 diverse language varieties.', citation="@article{gerz2021multilingual,\n  title={Multilingual and cross-lingual intent detection from spoken data},\n  author={Gerz, Daniela and Su, Pei-Hao and Kusztos, Razvan and Mondal, Avishek and Lis, Michal and Singhal, Eshan and Mrk{\x0b{s}}i{'c}, Nikola and Wen, Tsung-Hsien and Vuli{'c}, Ivan},\n  journal={arXiv preprint arXiv:2104.08524},\n  year={2021}\n}\n", homepage='https://arxiv.org/abs/2104.08524', license='', features={'path': Value(dtype='string', id=None), 'audio': Audio(sampling_rate=8000, mono=True, decode=True, id=None), 'transcription': Value(dtype='string', id=None), 'english_transcription': Value(dtype='string', id=None), 'intent_class': ClassLabel(names=['abroad', 'address', 'app_error', 'atm_limit', 'balance', 'busines

In [3]:
ds_builder.info.description

'MINDS-14 is a dataset for the intent detection task with spoken data. It covers 14 intents extracted from a commercial system in the e-banking domain, associated with spoken examples in 14 diverse language varieties.'

In [4]:
ds_builder.info.citation

"@article{gerz2021multilingual,\n  title={Multilingual and cross-lingual intent detection from spoken data},\n  author={Gerz, Daniela and Su, Pei-Hao and Kusztos, Razvan and Mondal, Avishek and Lis, Michal and Singhal, Eshan and Mrk{\x0b{s}}i{'c}, Nikola and Wen, Tsung-Hsien and Vuli{'c}, Ivan},\n  journal={arXiv preprint arXiv:2104.08524},\n  year={2021}\n}\n"

In [5]:
ds_builder.info.features

{'path': Value(dtype='string', id=None),
 'audio': Audio(sampling_rate=8000, mono=True, decode=True, id=None),
 'transcription': Value(dtype='string', id=None),
 'english_transcription': Value(dtype='string', id=None),
 'intent_class': ClassLabel(names=['abroad', 'address', 'app_error', 'atm_limit', 'balance', 'business_loan', 'card_issues', 'cash_deposit', 'direct_debit', 'freeze', 'high_value_payment', 'joint_account', 'latest_transactions', 'pay_bill'], id=None),
 'lang_id': ClassLabel(names=['cs-CZ', 'de-DE', 'en-AU', 'en-GB', 'en-US', 'es-ES', 'fr-FR', 'it-IT', 'ko-KR', 'nl-NL', 'pl-PL', 'pt-PT', 'ru-RU', 'zh-CN'], id=None)}

A split is a specific subset of a dataset, like train and test. List a dataset’s split names with the `get_dataset_split_names()` function:

In [6]:
from datasets import get_dataset_split_names
get_dataset_split_names("glue", "mrpc")

['train', 'validation', 'test']

# 2. Dataset Loading

Text files are one of the most common file types for storing a dataset. By default, 🤗 Datasets samples a text file **line by line** to build the dataset.
```python
from datasets import load_dataset
dataset = load_dataset("text", data_files={"train": ["my_text_1.txt", "my_text_2.txt"], "test": "my_test_file.txt"})

dataset = load_dataset("text", data_dir="path/to/text/dataset")
```

You can choose to sample by paragraph or entire documents:
```python
dataset = load_dataset("text", data_files={"train": "my_train_file.txt", "test": "my_test_file.txt"}, sample_by="paragraph")

dataset = load_dataset("text", data_files={"train": "my_train_file.txt", "test": "my_test_file.txt"}, sample_by="document")
```

**CACHING**

When you download a dataset from Hugging Face, the data are stored locally on your computer, at `~/.cache/huggingface/hub` by default. Pass `cache_dir` to `load_dataset()` to store them somewhere else:
```python
dataset = load_dataset('username/dataset', cache_dir="/path/to/another/directory/datasets")
```

You can load a specific split with the `split` parameter. Loading a dataset split returns a [Dataset](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset) object:

In [None]:
# load the train split from the mrpc subset of the GLUE dataset
dataset = load_dataset("glue", "mrpc", split="train")
dataset

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
})

If you don’t specify a split, 🤗 Datasets returns a [DatasetDict](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.DatasetDict) object instead:

In [None]:
dataset = load_dataset("glue", "mrpc")
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

Some datasets contain several sub-datasets, called *subsets* or *configurations*.
- You must explicitly provide a configuration name when loading a dataset that has several subsets.
- If you don’t, 🤗 Datasets raises a ValueError and reminds you to choose a configuration.

Use the **`get_dataset_config_names()`** function to retrieve a list of all the possible configurations available to your dataset:

In [None]:
from datasets import get_dataset_config_names
# example: the GLUE benchmark groups several tasks, each exposed as a configuration
get_dataset_config_names("glue")

['ax',
 'cola',
 'mnli',
 'mnli_matched',
 'mnli_mismatched',
 'mrpc',
 'qnli',
 'qqp',
 'rte',
 'sst2',
 'stsb',
 'wnli']

In [None]:
dataset = load_dataset("glue", "mrpc")

## Know the dataset

In [None]:
dataset = load_dataset("glue", "mrpc", split="train")

In [None]:
# Get the first row in the dataset
dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [None]:
# Get the last row in the dataset
dataset[-1]

{'sentence1': "The 30-year bond US30YT = RR rose 22 / 32 for a yield of 4.31 percent , versus 4.35 percent at Wednesday 's close .",
 'sentence2': 'The 30-year bond US30YT = RR grew 1-3 / 32 for a yield of 4.30 percent , down from 4.35 percent late Wednesday .',
 'label': 0,
 'idx': 4075}

In [None]:
# Access a row within a column
dataset[0]["sentence1"]

'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .'

### IterableDataset

An IterableDataset is loaded when you set the streaming parameter to True in **`load_dataset()`**. An IterableDataset progressively iterates over a dataset one example at a time, so you don’t have to wait for the whole dataset to download before you can use it.

In [None]:
iterable_dataset = load_dataset("glue", "mrpc", split="train", streaming=True)
for example in iterable_dataset:
    print(example)
    break

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0}


You don’t get random access to examples in an IterableDataset. Instead, iterate over its elements, for example with `next(iter(iterable_dataset))` or with a for loop:

In [None]:
next(iter(iterable_dataset))

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [None]:
list(iterable_dataset.take(3))

[{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
  'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
  'label': 1,
  'idx': 0},
 {'sentence1': "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .",
  'sentence2': "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .",
  'label': 0,
  'idx': 1},
 {'sentence1': 'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .',
  'sentence2': "On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .",
  'label': 1,
  'idx': 2}]

# 3. Preprocess
In addition to loading datasets, the other main goal of 🤗 Datasets is to offer a diverse set of preprocessing functions to get a dataset into an appropriate format for training with your machine learning framework.

Here are some techniques for:
- Reorder rows and split dataset.
- Rename / Remove columns and common column operations.
- Apply processing functions like tokenization to each example in a dataset.
- Concatenate datasets
- Apply custom formatting transform
- Save and export processed datasets.


⚠️ All processing methods in this guide return a new Dataset object. Modification is not done in-place. Be careful about overriding your previous dataset!

## 3.1 Sort, Shuffle, Select, Split

In [None]:
from datasets import load_dataset
dataset = load_dataset("glue", "mrpc", split="train")

In [None]:
# Use sort() to sort column values according to their numerical values or alphabetical order
sorted_dataset = dataset.sort("label")
dataset = dataset.sort('sentence1')

# The shuffle() function randomly rearranges the column values.
shuffled_dataset = sorted_dataset.shuffle(seed=42).flatten_indices() # shuffle dataset + flatten indices to restore speed

# select() returns rows according to a list of indices:
small_dataset = dataset.select([0, 10, 20, 30, 40, 50])

# filter() returns rows that match a specified condition:
starts_with_fr = dataset.filter(lambda example: example["sentence1"].lower().startswith("fr"))

# The train_test_split() function creates train and test splits if your dataset doesn’t already have them: returns a DatasetDict
dataset_dict = dataset.train_test_split(train_size=1000, test_size=200, shuffle=True, seed=2025)

Filter:   0%|          | 0/3668 [00:00<?, ? examples/s]

## 3.2 Rename, Remove, Cast and Flatten

In [None]:
dataset = load_dataset("glue", "mrpc", split="train")

# Provide rename_column() with the name of the original column, and the new column name:
dataset = dataset.rename_column("sentence1", "sentenceA")
dataset = dataset.rename_column("sentence2", "sentenceB")

# Provide the column(s) name to remove to the remove_columns() function
dataset = dataset.remove_columns(["sentenceA"])

# Conversely, select_columns() selects one or more columns to keep and removes the rest
dataset = dataset.select_columns(['sentenceB', 'idx'])
dataset

Dataset({
    features: ['sentenceB', 'idx'],
    num_rows: 3668
})

The `cast()` function transforms the feature type of one or more columns. This function accepts your new Features as its argument. The example below demonstrates how to change the ClassLabel and Value features:

In [None]:
dataset.features

{'sentenceB': Value(dtype='string', id=None),
 'idx': Value(dtype='int32', id=None)}

In [None]:
from datasets import ClassLabel, Value
new_features = dataset.features.copy()
new_features["idx"] = Value("int64")
# We could rename the labels if we had some
# new_features["label"] = ClassLabel(names=["negative", "positive"])
dataset = dataset.cast(new_features)

Casting the dataset:   0%|          | 0/3668 [00:00<?, ? examples/s]

**Flatten**

Sometimes a column can be a nested structure of several types. Take a look at the nested structure below from the SQuAD dataset.
- The answers field contains two subfields: text and answer_start. Use the flatten() function to extract the subfields into their own separate columns.

In [38]:
from datasets import load_dataset
dataset = load_dataset("squad", split="train")
dataset.features

{'id': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None),
 'context': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}

In [39]:
dataset

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 87599
})

In [40]:
flat_dataset = dataset.flatten()
flat_dataset

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'],
    num_rows: 87599
})

## 3.3 Transformations with `map()`

Some of the more powerful applications of 🤗 Datasets come from using the map() function. The primary purpose of map() is to speed up processing functions. It allows you to apply a processing function to each example in a dataset, independently or in batches. This function can even create new rows and columns.

### 3.3.1 Tokenization example

In [None]:
from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("glue", "mrpc", split="train")

def tokenization(example):
    return tokenizer(example["sentence1"])

dataset = dataset.map(tokenization, batched=True) # batch to make it faster

### 3.3.2 Adding some prefix

In [None]:
def add_prefix(example):
    example["sentence1"] = 'My sentence: ' + example["sentence1"]
    return example

updated_dataset = small_dataset.map(add_prefix)
updated_dataset["sentence1"][:5]

['My sentence: " 30 years waiting for you , " read one banner hoisted over the crowd , as many wept tears of happiness at finally seeing and hearing their idol .',
 'My sentence: " And if that ain \'t a Democrat , then I must be at the wrong meeting . "',
 'My sentence: " At this point , Mr. Brando announced : \' Somebody ought to put a bullet \' " through her head , the motion continued .',
 'My sentence: " By its actions , the Bush administration threatens to give a bad name to a just war , " the Connecticut Democrat told a Capitol Hill news conference .',
 'My sentence: " During the investigation , Bryant was cooperative with investigators and remains cooperative with authorities , " the sheriff \'s office said .']

### 3.3.3 Creating new columns

In [None]:
# Specify the column to remove with the remove_columns parameter in map():
updated_dataset = dataset.map(lambda example: {"new_sentence": "new_prefix: ---> " + example["sentence1"]}, remove_columns=["sentence1"])
updated_dataset.column_names

Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

['sentence2',
 'label',
 'idx',
 'input_ids',
 'token_type_ids',
 'attention_mask',
 'new_sentence']

In [None]:
updated_dataset["new_sentence"][:5]

['new_prefix: ---> Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 "new_prefix: ---> Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .",
 'new_prefix: ---> They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .',
 'new_prefix: ---> Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .',
 'new_prefix: ---> The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange .']

### 3.3.4 Using multiple processors

In [None]:
# Multiprocessing significantly speeds up processing by parallelizing processes on the CPU.
# Set the num_proc parameter in map() to set the number of processes to use
updated_dataset = dataset.map(lambda example, idx: {"sentence2": f"{idx}: " + example["sentence2"]}, with_indices=True, num_proc=4)

Map (num_proc=4):   0%|          | 0/3668 [00:00<?, ? examples/s]

### 3.3.5 Example of Splitting long text
When examples are too long, you may want to split them into several smaller chunks. Begin by creating a function that:

- Splits the sentence1 field into chunks of 50 characters.
- Stacks all the chunks together to create the new dataset.

In [None]:
def chunk_examples(examples):
    chunks = []
    for sentence in examples["sentence1"]:
        chunks += [sentence[i:i + 50] for i in range(0, len(sentence), 50)]
    return {"chunks": chunks}

In [None]:
chunked_dataset = dataset.map(chunk_examples, batched=True, remove_columns=dataset.column_names)
chunked_dataset[:10]

Map:   0%|          | 0/3668 [00:00<?, ? examples/s]

{'chunks': ['Amrozi accused his brother , whom he called " the ',
  'witness " , of deliberately distorting his evidenc',
  'e .',
  "Yucaipa owned Dominick 's before selling the chain",
  ' to Safeway in 1998 for $ 2.5 billion .',
  'They had published an advertisement on the Interne',
  't on June 10 , offering the cargo for sale , he ad',
  'ded .',
  'Around 0335 GMT , Tab shares were up 19 cents , or',
  ' 4.4 % , at A $ 4.56 , having earlier set a record']}

In [None]:
dataset, chunked_dataset

(Dataset({
     features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
     num_rows: 3668
 }),
 Dataset({
     features: ['chunks'],
     num_rows: 10470
 }))
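The row count changes because a batched function may return more (or fewer) items than it receives. This plain-Python sketch calls a variant of the chunking function directly on a hand-made batch; the `size` parameter and the sample strings are assumptions added for illustration.

```python
def chunk_batch(batch, size=50):
    # Split every sentence of the batch into pieces of at most `size` characters
    chunks = []
    for sentence in batch["sentence1"]:
        chunks += [sentence[i:i + size] for i in range(0, len(sentence), size)]
    return {"chunks": chunks}

# A batch is a dict of lists, exactly what map(batched=True) passes in
batch = {"sentence1": ["abcdefg", "hij"]}
out = chunk_batch(batch, size=3)

print(len(batch["sentence1"]), "sentences in")
print(len(out["chunks"]), "chunks out:", out["chunks"])
```

Since the output length differs from the input length, the original columns can no longer be aligned with the new one, which is why the `map()` call above passes `remove_columns=dataset.column_names`.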

## 3.4 Creating Batches

The `batch()` method allows you to group samples from the dataset into batches. This is particularly useful when you want to create batches of data for training or evaluation, especially when working with deep learning models.

Here’s an example of how to use the batch() method:

In [None]:
batched_dataset = dataset.batch(batch_size=4)

Batching examples:   0%|          | 0/3668 [00:00<?, ? examples/s]

In [None]:
batched_dataset[0]['label']

[1, 0, 1, 0]

## 3.5 Concatenate

Separate datasets can be concatenated if they share the same column types. Concatenate datasets with concatenate_datasets():

In [None]:
from datasets import concatenate_datasets, load_dataset

bookcorpus = load_dataset("bookcorpus", split="train")
wiki = load_dataset("wikipedia", "20220301.en", split="train")
wiki = wiki.remove_columns([col for col in wiki.column_names if col != "text"])  # drop every column except 'text'

assert bookcorpus.features.type == wiki.features.type
bert_dataset = concatenate_datasets([bookcorpus, wiki])

To apply a dataset transformation, you typically define a function that takes an example (row) as input and then apply it to your dataset with `.map()`.

## 3.6 Format Columns

In [None]:
# Use the set_format() function to set the dataset format to be compatible with PyTorch for instance
dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "label"])
dataset.format

{'type': 'torch',
 'format_kwargs': {},
 'columns': ['input_ids', 'token_type_ids', 'attention_mask', 'label'],
 'output_all_columns': False}

## 3.7 Align a dataset label id with label name

Pass the dictionary of the label mappings to the align_labels_with_mapping() function, and the column to align on:


In [30]:
mnli = load_dataset("glue", "mnli", split="train")
mnli

Dataset({
    features: ['premise', 'hypothesis', 'label', 'idx'],
    num_rows: 392702
})

In [31]:
mnli.features['label']

ClassLabel(names=['entailment', 'neutral', 'contradiction'], id=None)

In [35]:
# We create a label2id dictionary
label2id = {"contradiction": 0, "neutral": 1, "entailment": 2}

# Apply it with the align_labels_with_mapping() function
mnli_aligned = mnli.align_labels_with_mapping(label2id=label2id, label_column='label')

In [37]:
# see how the order has been changed.
mnli_aligned.features['label']

ClassLabel(names=['contradiction', 'neutral', 'entailment'], id=None)

# 4. Creating a dataset
Two options:
- Folder and File-based builders
- `from_` methods

In [None]:
from datasets import load_dataset
# from a file
dataset = load_dataset("csv", data_files="my_file.csv")

Using `from_generator()`
- The most memory-efficient way. Especially for really large datasets.

Using `from_dict()`:
- keys are column names (strings)
- values are lists holding the entries of each column (for example, lists of strings)

In [None]:
from datasets import Dataset, IterableDataset, load_dataset

def gen():
    yield {"pokemon": "bulbasaur", "type": "grass"}
    yield {"pokemon": "squirtle", "type": "water"}

# can create a 'Dataset'
ds = Dataset.from_generator(gen)

# can create an 'IterableDataset'
ds = IterableDataset.from_generator(gen)

# A generator-based IterableDataset needs to be iterated over with a for loop for example:
for example in ds:
    print(example)

Generating train split: 0 examples [00:00, ? examples/s]

{'pokemon': 'bulbasaur', 'type': 'grass'}
{'pokemon': 'squirtle', 'type': 'water'}


In [None]:
# from dictionary
ds = Dataset.from_dict({"pokemon": ["bulbasaur", "squirtle"], "type": ["grass", "water"]})

In [None]:
# from a list of dict
my_list = [{"a": 1}, {"a": 2}, {"a": 3}]
dataset = Dataset.from_list(my_list)

In [None]:
# from pandas
import pandas as pd
import datasets
df = pd.DataFrame({"a": [1, 2, 3]})
dataset = Dataset.from_pandas(df)

# 5. Slice splits

In [None]:
# concatenate train and test splits
train_test_ds = datasets.load_dataset("bookcorpus", split="train+test")

# specific rows of the train split
train_10_20_ds = datasets.load_dataset("bookcorpus", split="train[10:20]")

# Select a percentage of a split
train_10pct_ds = datasets.load_dataset("bookcorpus", split="train[:10%]")

Finally, you can even create cross-validated splits. The example below creates 10-fold cross-validated splits. Each validation dataset is a 10% chunk, and the training dataset makes up the remaining complementary 90% chunk:

In [None]:
# val_ds becomes a list of 10 datasets, where each dataset is a contiguous 10% segment of the "train" split of BookCorpus
val_ds = datasets.load_dataset("bookcorpus", split=[f"train[{k}%:{k+10}%]" for k in range(0, 100, 10)])

# When k = 0: "train[:0%]+train[10%:]" (which essentially means all data except the first 10%).
# When k = 10: "train[:10%]+train[20%:]" (all data except the 10%-20% slice)… and so on.
train_ds = datasets.load_dataset("bookcorpus", split=[f"train[:{k}%]+train[{k+10}%:]" for k in range(0, 100, 10)])
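The two list comprehensions only build split strings, so they can be inspected without loading any data. A plain-Python sketch that prints the first three folds:

```python
# Build the 10 validation and training split specifications
val_specs = [f"train[{k}%:{k+10}%]" for k in range(0, 100, 10)]
train_specs = [f"train[:{k}%]+train[{k+10}%:]" for k in range(0, 100, 10)]

# Show how each validation slice pairs with its complementary training slice
for val_spec, train_spec in list(zip(val_specs, train_specs))[:3]:
    print(f"validation: {val_spec:<16} training: {train_spec}")
```

Each validation string selects one 10% window, and the matching training string selects everything before and after that window.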

# 6. Dataset features

[Features](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Features) is a special dictionary that defines the internal structure of a dataset.

Instantiated with a dictionary of type dict[str, FieldType], where
- keys are the desired column names, and
- values are the type of that column.

FieldType can be one of the following:

- **`Value`** feature specifies a single data type value, e.g. int64, bool, or string.

- **`ClassLabel`** feature specifies a predefined set of classes which can have labels associated to them and will be stored as integers in the dataset. When you retrieve the labels, `ClassLabel.int2str()` and `ClassLabel.str2int()` carry out the conversion from integer value to label name, and vice versa.

- Python `dict` specifies a composite feature containing a mapping of sub-fields to sub-features. It’s possible to have nested fields of nested fields in an arbitrary manner.

- Python `list`, `LargeList` or `Sequence` specifies a composite feature containing a sequence of sub-features, all of the same feature type.

📝 Note: A Sequence with an internal dictionary feature will be automatically converted into a dictionary of lists.

# 7. Share a dataset to the Hub

On the Hugging Face Hub, dataset repositories offer features such as:

- Free dataset hosting
- Dataset versioning
- Commit history and diffs
- Metadata for discoverability
- Dataset cards for documentation, licensing, limitations, etc.
- Dataset Viewer

Creating a dataset card is easy and can be done in just a few steps:

- Go to your dataset repository on the Hub and click on **Create Dataset Card** to create a new README.md file in your repository.

- Use the **Metadata UI** to select the tags that describe your dataset. You can add a license, language, pretty_name, the task_categories, size_categories, and any other tags that you think are relevant. These tags help users discover and find your dataset on the Hub.



## 7.1 Push dataset to the hub

`pip install huggingface_hub`

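Push your dataset to the Hub:

```python
import os

# Assumption: your access token is exposed through an HF_TOKEN environment variable
HF_TOKEN = os.environ.get("HF_TOKEN")

ds.push_to_hub(
    repo_id="username/mydataset",
    config_name="filtered_subset",
    split='test',
    private=True,
    set_default=True,
    token=HF_TOKEN,
)
```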

## 7.2 Creating a dataset card

A dataset card is a best practice in the community. The purpose is to describe the dataset from different perspectives:
- content
- intended use
- known limitations
- origin
- collection method
- curation
- annotation process
- authors / annotators
- language
- license
- known bias

You can start from this [template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md)


In [None]:
# save locally to disk
ds.save_to_disk("path/of/my/dataset/directory")

# read from disk
from datasets import load_from_disk
reloaded_dataset = load_from_disk("path/of/my/dataset/directory")

You can also export your dataset into the following formats:
- CSV with `.to_csv()`
- JSON with `.to_json()`
- Parquet with `.to_parquet()`
- SQL with `.to_sql()`
- pandas with `.to_pandas()`