# Hugging Face Datasets Demo

Hugging Face Datasets is a library for accessing and sharing datasets. It provides a simple way to load, preprocess, and use datasets in machine learning tasks. In this demo, we explore how to create custom datasets, specify feature types, switch between output formats, and save/load datasets with the library.

The main features of the library include:
- **Easy Dataset Creation**: Create datasets from dictionaries, lists, or files.
- **Feature Specification**: Define the format of features in the dataset for optimized loading.
- **Multiple Formats Support**: Support for various formats like NumPy, PyTorch, TensorFlow, JAX, and Pandas.
- **Save and Load Datasets**: Save datasets to disk and load them back easily.
- **Dataset Sharing**: Share datasets with the community or use datasets shared by others.
- **Integration with Hugging Face Ecosystem**: Seamless integration with other Hugging Face libraries like Transformers and Tokenizers.
- **Efficient Data Loading**: Efficiently load large datasets with lazy loading and caching.

# Installation 

In [1]:
!pip install datasets



# Custom Dataset Creation 

In [5]:
import numpy as np
import torch

from datasets import Dataset

In [17]:
# Make a simple dataset:

texts = ["Hi there", "How are you?", "Nice to meet you"]
labels = [0, 1, 0]
vectors = np.random.randint(0, 9, (3, 5))  # Random vectors of size 5

data = {"text": texts, "label": labels, 'vector': vectors}
dataset = Dataset.from_dict(data)
print(dataset)

Dataset({
    features: ['text', 'label', 'vector'],
    num_rows: 3
})


In [18]:
for item in dataset:
    print(item) 

{'text': 'Hi there', 'label': 0, 'vector': [5, 6, 0, 5, 2]}
{'text': 'How are you?', 'label': 1, 'vector': [5, 1, 7, 2, 0]}
{'text': 'Nice to meet you', 'label': 0, 'vector': [5, 6, 5, 6, 5]}


In [19]:
# Return items as PyTorch tensors (the object is still a `datasets.Dataset`, only the output format changes):
torch_dataset = dataset.with_format("torch")
print(torch_dataset)

Dataset({
    features: ['text', 'label', 'vector'],
    num_rows: 3
})


In [12]:
for item in torch_dataset:
    print(item) 

{'text': 'Hi there', 'label': tensor(0), 'vector': tensor([0, 1, 2, 2, 3])}
{'text': 'How are you?', 'label': tensor(1), 'vector': tensor([6, 7, 1, 3, 5])}
{'text': 'Nice to meet you', 'label': tensor(0), 'vector': tensor([4, 2, 0, 1, 4])}


## Feature Format
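We can declare the type and shape of each feature explicitly with `Features`. The data is then stored with a fixed schema instead of inferred types, which allows faster, optimized loading of multi-dimensional arrays.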

In [25]:
from datasets import Features, Value, Sequence, Array2D, Array3D

features = Features({
    "text": Value("string"),
    "label": Value("int64"),
    "vector": Array2D(shape=(5,1), dtype='float32'),  # 2D array with shape (5,1)
    "img": Array3D(shape=(3, 64, 64), dtype='uint8'),  # 3D array with shape (3, 64, 64)
})

data = {
    "text": ["Hi there", "How are you?", "Nice to meet you"],
    "label": [0, 1, 0],
    "vector": np.random.randint(0, 9, (3, 5, 1)).astype(np.float32),  # Random vectors of size (5,1)
    "img": np.random.randint(0, 256, (3, 3, 64, 64), dtype=np.uint8)  # 3 random images of shape (3, 64, 64); the upper bound is exclusive
}

dataset = Dataset.from_dict(data, features=features)
print(dataset)

Dataset({
    features: ['text', 'label', 'vector', 'img'],
    num_rows: 3
})


In [26]:
for item in dataset:
    print(item)

{'text': 'Hi there', 'label': 0, 'vector': [[3.0], [2.0], [4.0], [7.0], [5.0]], 'img': [[[191, 218, 114, 223, 22, 202, 88, 25, ...
(output truncated)

In [27]:
torch_dataset = dataset.with_format("torch")
print(torch_dataset)
for item in torch_dataset:
    for key, value in item.items():
        print(f"{key}: {value.shape if isinstance(value, torch.Tensor) else value}")

Dataset({
    features: ['text', 'label', 'vector', 'img'],
    num_rows: 3
})
text: Hi there
label: torch.Size([])
vector: torch.Size([5, 1])
img: torch.Size([3, 64, 64])
text: How are you?
label: torch.Size([])
vector: torch.Size([5, 1])
img: torch.Size([3, 64, 64])
text: Nice to meet you
label: torch.Size([])
vector: torch.Size([5, 1])
img: torch.Size([3, 64, 64])


# Multiple Formats
HF Datasets can return data in several formats, including NumPy, PyTorch, TensorFlow, JAX, and Pandas.

In [37]:
np_dataset = dataset.with_format("numpy")
print(np_dataset[0])

{'text': 'Hi there', 'label': 0, 'vector': array([[3.],
       [2.],
       [4.],
       [7.],
       [5.]], dtype=float32), 'img': array([[[191, 218, 114, ...,  95,  73, 128],
        [130, 250, 213, ...,  72,  27, 109],
        [ 48,  94,  81, ...,  89,  76, 185],
        ...,
        [ 24, 151, 153, ..., 150,   1, 160],
        [ 95, 175, 137, ...,  88,  55,   4],
        [134,  26, 185, ..., 238, 169, 162]],

       [[ 74, 168,   3, ...,  92, 248, 211],
        [ 37, 144, 248, ..., 143,  34, 241],
        [  4, 191, 179, ...,  69, 168, 176],
        ...,
        [101,  65, 166, ..., 169, 228, 235],
        [ 72,  81, 172, ...,  17,  14,  56],
        [ 95,  72, 205, ..., 148,  60, 225]],

       [[253,  42,   7, ...,   1,  91, 251],
        [101, 109, 127, ..., 157, 193,  63],
        [ 25, 155, 221, ..., 170,   5, 192],
        ...,
        [121, 192, 117, ..., 220, 158, 150],
        [ 34,  69, 190, ..., 250,   0, 119],
        [ 61, 203,  99, ..., 163,  75,  21]]])}


In [34]:
jax_dataset = dataset.with_format("jax")

ValueError: JAX needs to be installed to be able to return JAX arrays.
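
The error above appears because the requested framework is not installed. One way to avoid it is to check for the framework first and fall back to another format. This is a sketch only: `pick_format` is a hypothetical helper, not part of the library, and it assumes the format name matches the importable module name (as it does for `jax`, `torch`, `tensorflow`, and `numpy`).

```python
import importlib.util


def pick_format(preferred: str, fallback: str = "numpy") -> str:
    """Return `preferred` if a module of that name is importable, else `fallback`."""
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


fmt = pick_format("jax")
print(fmt)
# With the `dataset` from the cells above, this could be used as:
# formatted = dataset.with_format(fmt)
```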

# Save and Load Dataset
To save the dataset, we can use the `save_to_disk` method, and to load it back, we can use `load_from_disk`.

In [32]:
from datasets import load_from_disk


# Save the dataset to disk
dataset.save_to_disk("my_dataset")
# This creates a directory named "my_dataset" containing the dataset files.
# In recent versions of the library, the directory structure looks like this
# (one Arrow file per shard):
# my_dataset/
# ├── data-00000-of-00001.arrow
# ├── dataset_info.json
# └── state.json
# where `dataset_info.json` contains metadata about the dataset,
# the `.arrow` file contains the actual data, and `state.json` contains the state of the dataset.



In [29]:
# Load the dataset from disk
loaded_dataset = load_from_disk("my_dataset")

In [30]:
!pwd

/Users/mik/Developer/MMINT/mik_tools/notebooks


In [31]:
dataset.to_json('dataset.json')


170