<a href="https://colab.research.google.com/github/pablokris/scaif-hgdatasets-demo/blob/main/hugging_face_datasets_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Hugging Face Datasets**

Datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.

Load a dataset in a single line of code, and use the library's data processing methods to quickly get your dataset ready for training a deep learning model. Because the library is backed by the Apache Arrow format, it can process large datasets with zero-copy reads, which keeps memory use low and processing fast.

Let's start by installing the library with pip.


In [None]:
%pip install datasets

## **Loading a Dataset**

Loading a dataset from the [Hugging Face Datasets Hub](https://huggingface.co/?activityType=update-dataset&feedType=following) takes a single line of code.

We'll talk about splitting in the next section.


In [None]:
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes", split="train")

# **Splitting**

Splitting refers to how datasets are often partitioned into subsets, so you don't have to download all of the data when you only need part of it for training, testing, etc. If you don't specify a split, the library usually returns all of the data. Splits are commonly named "train", "test", and "validation".
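
To make the idea of a split concrete, here is a minimal pure-Python sketch that partitions a list of examples into named subsets. The helper name `make_splits`, the 80/10/10 ratios, and the toy rows are assumptions for illustration only. Datasets on the Hub come with splits already defined by their authors, so this is not how the library produces them.

```python
import random

def make_splits(examples, train_frac=0.8, validation_frac=0.1, seed=0):
    """Partition examples into train/validation/test subsets (hypothetical helper)."""
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * validation_frac)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],   # whatever remains
    }

# Toy data standing in for a real dataset.
examples = [{"text": f"review {i}", "label": i % 2} for i in range(100)]
splits = make_splits(examples)
print({name: len(rows) for name, rows in splits.items()})
```

Each example lands in exactly one subset, which is what lets you evaluate on data the model never saw during training.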


In [None]:
from datasets import get_dataset_split_names

get_dataset_split_names("rotten_tomatoes")

Then you can load a specific split with the split parameter. Loading a dataset split returns a Dataset object:

In [None]:
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes", split="train")
dataset

If you don’t specify a split, Datasets returns a DatasetDict object instead, with one Dataset per split:

In [None]:
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")
dataset

# **Configurations**
Some datasets contain several sub-datasets. For example, the MInDS-14 dataset has several sub-datasets, each one containing audio data in a different language. These sub-datasets are known as configurations, and you must explicitly select one when loading the dataset. If you don’t provide a configuration name, 🤗 Datasets will raise a ValueError and remind you to choose a configuration.

Use the get_dataset_config_names() function to retrieve a list of all the possible configurations available to your dataset:

In [None]:
from datasets import get_dataset_config_names

configs = get_dataset_config_names("PolyAI/minds14")
print(configs)

Then load the configuration you want:

In [None]:
from datasets import load_dataset

dataset = load_dataset("PolyAI/minds14", "en-US", split="train")

# **Know your dataset**

A Dataset contains columns of data, and each column can be a different type of data. The index, or axis label, is used to access examples from the dataset.

In [None]:
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes", split="train")
dataset[0]

In [None]:
dataset[-1]

Indexing by the column name returns a list of all the values in the column:

In [None]:
dataset["text"]

You can combine row and column name indexing to return a specific value at a position:

In [None]:
dataset[0]["text"]

# **Slicing**
Slicing returns a slice, or subset, of the dataset, which is useful for viewing several rows at once.

In [None]:
dataset[:3] # Get the first three rows

In [None]:
dataset[3:6] # Get the rows at indices 3, 4, and 5 (the end index is exclusive)
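
The indexing rules from the last two sections can be summarized with a small pure-Python model of a column-oriented table. `ToyTable` is a hypothetical class written for this tutorial, not part of 🤗 Datasets. It only mirrors the behavior described above: an integer returns one row as a dictionary, a column name returns that column's values, and a slice returns a dictionary with one (shorter) list per column.

```python
class ToyTable:
    """Hypothetical column-oriented table that mimics the indexing rules above."""

    def __init__(self, columns):
        self.columns = columns  # dict: column name -> list of values

    def __len__(self):
        return len(next(iter(self.columns.values())))

    def __getitem__(self, key):
        if isinstance(key, str):
            # Column name -> all values in that column.
            return self.columns[key]
        if isinstance(key, slice):
            # Slice -> dict of lists, one list per column.
            return {name: values[key] for name, values in self.columns.items()}
        # Integer (negative allowed) -> a single row as a dict.
        return {name: values[key] for name, values in self.columns.items()}


table = ToyTable({
    "text": ["great film", "dull plot", "a fun ride", "too long"],
    "label": [1, 0, 1, 0],
})

print(table[0])          # first row
print(table[-1])         # last row
print(table["text"])     # whole column
print(table[0]["text"])  # one value
print(table[1:3])        # rows at indices 1 and 2 (end index is exclusive)
```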


# **Tokenize your Text**

*   Models cannot process raw text, so you’ll need to convert the text into numbers.
*   Tokenization provides a way to do this by dividing text into smaller units called tokens, which can be whole words or pieces of words.
*   Tokens are finally converted to numbers.
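
The steps above can be sketched with a toy whitespace tokenizer. This is only an illustration: the `tokenize`, `build_vocab`, and `encode` helpers and the `[UNK]` fallback are assumptions made for this sketch, and pretrained tokenizers such as the BERT one used in this section typically split text into subword pieces and use a much larger, fixed vocabulary.

```python
def tokenize(text):
    """Split text into lowercase, whitespace-separated tokens."""
    return text.lower().split()

def build_vocab(texts):
    """Assign an integer id to every distinct token; id 0 is reserved for unknowns."""
    vocab = {"[UNK]": 0}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))  # only adds tokens not seen before
    return vocab

def encode(text, vocab):
    """Convert text to a list of token ids, falling back to [UNK]."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokenize(text)]

corpus = ["the rock is great", "the plot is thin"]
vocab = build_vocab(corpus)

print(tokenize("The plot is great"))
print(encode("The plot is great", vocab))
print(encode("the sequel is great", vocab))  # "sequel" is not in the vocabulary
```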

In [None]:
%pip install transformers

In [None]:
from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("rotten_tomatoes", split="train")

The tokenizer returns a dictionary with three items:

*   input_ids: the numbers representing the tokens in the text.
*   token_type_ids: indicates which sequence a token belongs to if there is more than one sequence.
*   attention_mask: indicates which tokens the model should attend to (1) and which are padding to be ignored (0).

In [None]:
tokenizer(dataset[0]["text"])
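
To see what each of the three fields means, here is a hand-built sketch that produces the same kind of lists for a pair of sequences padded to a fixed length. The `encode_pair` helper, the tiny vocabulary, and the ids are all made up for illustration; only the field names come from the tokenizer output. The `[CLS]`/`[SEP]` layout follows the convention used by BERT-style models.

```python
# Hypothetical toy vocabulary; real ids come from the pretrained tokenizer.
VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "good": 3, "movie": 4, "bad": 5, "ending": 6}

def encode_pair(first, second, max_length):
    """Build input_ids, token_type_ids, and attention_mask for two sequences."""
    tokens_a = ["[CLS]"] + first.split() + ["[SEP]"]
    tokens_b = second.split() + ["[SEP]"]
    input_ids = [VOCAB[token] for token in tokens_a + tokens_b]
    token_type_ids = [0] * len(tokens_a) + [1] * len(tokens_b)  # 0 = first, 1 = second
    attention_mask = [1] * len(input_ids)                       # 1 = real token
    padding = max_length - len(input_ids)
    if padding < 0:
        raise ValueError("max_length is too small for this pair")
    return {
        "input_ids": input_ids + [VOCAB["[PAD]"]] * padding,
        "token_type_ids": token_type_ids + [0] * padding,
        "attention_mask": attention_mask + [0] * padding,       # 0 = padding, ignored
    }

encoded = encode_pair("good movie", "bad ending", max_length=10)
for name, values in encoded.items():
    print(name, values)
```

All three lists have the same length, so position `i` in each list describes the same token.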

The fastest way to tokenize your entire dataset is to use the map() function. This function speeds up tokenization by applying the tokenizer to batches of examples instead of individual examples. To enable this, set the batched parameter to True:

In [None]:
def tokenization(batch):
    # With batched=True, batch["text"] is a list of strings, not a single string
    return tokenizer(batch["text"])

dataset = dataset.map(tokenization, batched=True)

dataset
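
The difference between per-example and batched processing can be sketched in plain Python. The `map_batched` helper and its `batch_size` argument are assumptions for illustration, not the library's implementation. The point is only that with batching the function receives a dictionary of lists and is called once per batch rather than once per row.

```python
def map_batched(columns, function, batch_size=2):
    """Apply `function` to successive batches of a dict-of-lists table (toy sketch)."""
    num_rows = len(next(iter(columns.values())))
    added = {}   # new columns produced by `function`
    calls = 0
    for start in range(0, num_rows, batch_size):
        # Each batch is a dict of lists holding up to `batch_size` rows.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        calls += 1
        for name, values in function(batch).items():
            added.setdefault(name, []).extend(values)
    # Returned columns are merged into the table; same-named columns are replaced.
    return {**columns, **added}, calls

def count_words(batch):
    # batch["text"] is a list of strings, so return a list of results.
    return {"num_words": [len(text.split()) for text in batch["text"]]}

columns = {"text": ["great film", "dull", "a fun ride", "too long", "fine"]}
mapped, calls = map_batched(columns, count_words, batch_size=2)
print(mapped["num_words"], "computed in", calls, "calls")
```

Fewer, larger calls matter most when the function has a fixed per-call overhead, which is the case for many tokenizers.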

Set the format of your dataset to be compatible with your machine learning framework:

In [None]:
dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "label"])
dataset.format['type']