[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gsarti/ik-nlp-tutorials/blob/main/notebooks/W2T_Intro_Transformers_Datasets.ipynb)

In [None]:
# Run in Colab to install local packages
!pip install transformers sentencepiece torch datasets

# Introduction to 🤗 Transformers and 🤗 Datasets

*This tutorial is based on some chapters of the [HuggingFace Course](https://huggingface.co/course/chapter1/1). Take a look at it for a more detailed overview!*

Transformer models are now the state of the art and the de facto standard for most NLP tasks, from tagging to machine translation and text classification.

The usage of these models has been greatly simplified and democratized by [HuggingFace](https://huggingface.co/) (🤗 for short), the startup behind the popular [🤗 Transformers library](https://huggingface.co/transformers/). The 🤗 Transformers library is completely open-source and provides a unified framework to create, train and use many transformer-based models. It is accompanied by a cloud-hosting service called the [Model hub](https://huggingface.co/models) (similar to the PyTorch and TensorFlow Hubs), where every user can host and share pre-trained and fine-tuned open-source models. The hub contains over 25k models (as of Jan 24, 2022).

We are going to start with a quick overview of the 🤗 Transformers library and its usage. We will then dive into the 🤗 Datasets library, the largest open collection of text datasets ready for use with 🤗 Transformers and other machine learning frameworks (also hosted on the [Dataset hub](https://huggingface.co/datasets)).

## Pipelines

The most basic object in the 🤗 Transformers library is the `pipeline` function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer:

In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for the IK-NLP course for my whole life.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'NEGATIVE', 'score': 0.9915691018104553}]

Multiple sentences can also be passed:

In [3]:
classifier(
    ["I've been waiting for this course for my whole life.", "I hate this course so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9807901382446289},
 {'label': 'NEGATIVE', 'score': 0.9996439218521118}]

By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English (`distilbert-base-uncased-finetuned-sst-2-english`). The model is downloaded and cached when you create the classifier object. If you rerun the command, the cached model will be used instead and there is no need to download the model again.

There are three main steps involved when you pass some text to a pipeline:

- The text is preprocessed into a format the model can understand.
- The preprocessed inputs are passed to the model.
- The predictions of the model are post-processed, so you can make sense of them.

Some of the [currently available pipelines](https://huggingface.co/transformers/main_classes/pipelines.html) are:

- `sentiment-analysis`
- `text-generation`
- `fill-mask` (filling a masked token or span with a predicted one)
- `text2text-generation`
- `ner` (named entity recognition)
- `question-answering`

Some pipelines, such as `sentiment-analysis`, `translation` and `summarization` are abstractions over other, more general ones (e.g. `summarization` and `translation` are abstractions over `text2text-generation`, `sentiment-analysis` over `text-classification`).

Let's see some examples of pipelines in action.

### Text generation 

The text generation (i.e. autoregressive language modeling) setting has been popularized by models such as [GPT-3](https://en.wikipedia.org/wiki/GPT-3). Given a prompt, the model auto-completes it by generating the remaining text. This is similar to the predictive text feature found on many phones. Here is an example using a small [GPT-2](https://huggingface.co/gpt2) model:

In [4]:
from transformers import pipeline

generator = pipeline("text-generation")
generator(
    "The goal of the course is to ensure that students are familiar with a number of fundamental "
    "techniques and algorithms in the area of natural language processing, such as: "
)

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The goal of the course is to ensure that students are familiar with a number of fundamental techniques and algorithms in the area of natural language processing, such as: vernacular translation for example\n\nsimplifiers for sentences\n\ninterlinear transformations\n'}]

>🤡 **Fun Fact**: Language models trained to perform autoregressive language modeling are already widely used in the industry. For example, [GitHub Copilot](https://copilot.github.com/), integrated into the Visual Studio Code editor, is a model trained to autocomplete text and code snippets, and it is currently helping me write these notebooks (the grey text shows model suggestions).

![Github Copilot autocompletion](https://github.com/gsarti/ik-nlp-tutorials/raw/main/img/copilot.png)

Try controlling how many different sequences are generated with the argument `num_return_sequences` and the total maximal length of the output text with the argument `max_length`.
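To make the role of these two arguments more concrete, here is a minimal sketch of an autoregressive sampling loop. It does not use GPT-2: the `BIGRAMS` table and the `generate` function are toy stand-ins invented for illustration, and only the argument names `max_length` and `num_return_sequences` mirror the ones accepted by the pipeline.

```python
import random

# Toy next-token table (hypothetical): each token maps to its possible successors.
# A real language model such as GPT-2 predicts a probability distribution instead.
BIGRAMS = {
    "<s>": ["the"],
    "the": ["course", "model"],
    "course": ["teaches", "is"],
    "model": ["generates", "is"],
    "teaches": ["nlp"],
    "generates": ["text"],
    "is": ["fun"],
}


def generate(prompt, max_length=5, num_return_sequences=2, seed=0):
    """Sample `num_return_sequences` continuations of at most `max_length` tokens in total."""
    rng = random.Random(seed)
    sequences = []
    for _ in range(num_return_sequences):
        tokens = list(prompt)
        # Autoregressive loop: each new token is chosen from what was generated so far
        while len(tokens) < max_length and tokens[-1] in BIGRAMS:
            tokens.append(rng.choice(BIGRAMS[tokens[-1]]))
        sequences.append(tokens)
    return sequences


samples = generate(["<s>", "the"], max_length=5, num_return_sequences=3)
for sample in samples:
    print(" ".join(sample))
```

As in the pipeline, `max_length` here counts the prompt tokens too, so a longer prompt leaves less room for newly generated text.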

We can use any custom model from the [Model hub](https://huggingface.co/models) by simply passing it (or its identifier) when creating the pipeline. We'll now use a [Dutch GPT-2 model](https://huggingface.co/GroNLP/gpt2-small-dutch) pretrained by our colleagues to generate some Dutch text:

In [5]:
from transformers import pipeline

generator = pipeline("text-generation", model="GroNLP/gpt2-small-dutch")
generator(
    "Het doel van de cursus is ervoor te zorgen dat studenten bekend zijn met een aantal "
    "fundamentele technieken en algoritmen op het gebied van natuurlijke taalverwerking, zoals:"
)



[{'generated_text': "Het doel van de cursus is ervoor te zorgen dat studenten bekend zijn met een aantal fundamentele technieken en algoritmen op het gebied van natuurlijke taalverwerking, zoals:\nBeslissingen die niet tot uitdrukking kunnen worden gebracht of door middel van andere methoden verkregen. In deze context wordt er meestal gebruikgemaakt van 'informatief' interpretaties (bijvoorbeeld bij zinnen uit verschillende talen). De belangrijkste manier om dergelijke interpretaties in praktijk te brengen is bijvoorbeeld als volgt:\nAls je wilt zeggen wat iemand zegt, kan"}]

You can refine your search for a model by clicking on the language tags, and pick a model that will generate text in another language. The Model Hub even contains checkpoints for multilingual models that support several languages.

Once you select a model by clicking on it, you’ll see that there is a widget enabling you to try it directly online. This way you can quickly test the model’s capabilities before downloading it. More info on text generation here: [https://huggingface.co/tasks/text-generation](https://huggingface.co/tasks/text-generation)

### Mask-filling

The `fill-mask` (i.e. masked language modeling) pipeline is used to fill masked tokens with a predicted token, which is a common pre-training task for encoder-only transformers leveraging bidirectional context such as [BERT](https://huggingface.co/bert-base-uncased). This is useful to fill gaps in the text with the most likely answer given the context.

In [6]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> language processing models.", top_k=3)

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)


[{'sequence': 'This course will teach you all about natural language processing models.',
  'score': 0.5701010823249817,
  'token': 1632,
  'token_str': ' natural'},
 {'sequence': 'This course will teach you all about functional language processing models.',
  'score': 0.06381034106016159,
  'token': 12628,
  'token_str': ' functional'},
 {'sequence': 'This course will teach you all about programming language processing models.',
  'score': 0.043610505759716034,
  'token': 8326,
  'token_str': ' programming'}]

The `top_k` argument controls how many possibilities you want to be displayed. Note that here the model fills in the special `<mask>` word, which is often referred to as a mask token. Other mask-filling models might have different mask tokens, so it’s always good to verify the proper mask word when exploring other models. One way to check it is by looking at the mask word used in the widget. More info on mask-filling here: [https://huggingface.co/tasks/fill-mask](https://huggingface.co/tasks/fill-mask)
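Since the mask token depends on the model, it can help to keep the prompt as a template and insert the token afterwards. The sketch below is plain Python; `build_masked_prompt` is a hypothetical helper, not part of the library. With a loaded pipeline, the right token should be available as `unmasker.tokenizer.mask_token`, so it does not need to be hard-coded.

```python
def build_masked_prompt(template, mask_token):
    """Insert a model-specific mask token into a template containing `{mask}`."""
    return template.format(mask=mask_token)


template = "This course will teach you all about {mask} language processing models."

# RoBERTa-style checkpoints (like the default distilroberta-base) use "<mask>",
# while BERT-style checkpoints use "[MASK]".
roberta_prompt = build_masked_prompt(template, "<mask>")
bert_prompt = build_masked_prompt(template, "[MASK]")
print(roberta_prompt)
print(bert_prompt)
```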

### Text2Text Generation

The `text2text-generation` pipeline encompasses all the sequence-to-sequence tasks on which a model was trained. We will cover Encoder-Decoder architectures in week 6, but in the meantime here are some examples of sequence-to-sequence tasks:

In [7]:
# Summarization: reduce a text to its summary
# The same result can be achieved using the `summarization` pipeline.
from transformers import pipeline

summarizer = pipeline("text2text-generation", model="sshleifer/distilbart-cnn-12-6")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

[{'generated_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

In [8]:
# Translation: Translate a sentence from French to English
# The same result can be achieved using the `translation` pipeline.
from transformers import pipeline

translator = pipeline("text2text-generation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce laboratoire a été adapté à partir du cours originel par HuggingFace.")

[{'generated_text': 'This laboratory was adapted from the original course by HuggingFace.'}]

This concludes our pipeline overview. Refer to the [documentation](https://huggingface.co/docs/transformers/v4.15.0/en/main_classes/pipelines) for additional information on pipeline types and parameters.

## Behind the Pipeline

As we saw in the previous section, a `pipeline` has a preprocessing step, a model inference step and a post-processing step:

<div>
<img src="https://huggingface.co/course/static/chapter2/full_nlp_pipeline.png" alt="Visual representation of a full NLP pipeline" width="80%"/>
</div>

Like other neural networks, Transformer models can’t process raw text directly, so the first step of our pipeline is to convert the text inputs into numbers that the model can make sense of. To do this we use a tokenizer, which will be responsible for:

- Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
- Mapping each token to an integer
- Adding additional inputs that may be useful to the model
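The three responsibilities above can be sketched with a toy tokenizer. This is only an illustration: `toy_tokenize` is a hypothetical function, it splits on words and punctuation instead of using the WordPiece subword algorithm of the real tokenizer, and its tiny vocabulary is copied from the real tokenizer output shown further down in this notebook.

```python
import re

# Tiny vocabulary copied from the tokenizer output shown later in this notebook;
# the real DistilBERT vocabulary has 30522 entries.
VOCAB = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "!": 999, "i": 1045,
         "this": 2023, "so": 2061, "much": 2172, "course": 2607, "hate": 5223}


def toy_tokenize(text):
    # 1. Split the lowercased text into words and punctuation symbols
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # 2. Add the special tokens expected by BERT-like models
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # 3. Map each token to its integer ID (unknown words would raise a KeyError here)
    return tokens, [VOCAB[t] for t in tokens]


tokens, ids = toy_tokenize("I hate this course so much!")
print(tokens)
print(ids)
```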

All this preprocessing needs to be done in exactly the same way as when the model was pretrained, so we first need to download that information from the Model Hub. To do this, we use the `AutoTokenizer` class and its `from_pretrained` method. Using the checkpoint name of our model, it will automatically fetch the data associated with the model’s tokenizer and cache it (so it’s only downloaded the first time you run the code below).

**Important:** Every type of transformer model has its own tokenizer and model classes (e.g. `BertTokenizer`, `GPT2Tokenizer`, `XLMTokenizer`, ...). The `AutoTokenizer` and `AutoModel` classes rely on the configuration saved alongside each model checkpoint to load it with the right class, so you don’t need to worry about this.

Let's try to load and use the tokenizer of the sentiment analysis model we tried at the beginning of this lab:

In [9]:
from transformers import AutoTokenizer

identifier = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(identifier)

print("Vocabulary size:", len(tokenizer), "tokens")
print("10 random tokens:", {k:v for i, (k,v) in enumerate(tokenizer.vocab.items()) if i < 10})

raw_inputs = ["I've been waiting for this course for my whole life.", "I hate this course so much!"]
inputs = tokenizer(raw_inputs, padding=True, return_tensors="pt")
print("="*20,"\n","Output:",inputs)

Vocabulary size: 30522 tokens
10 random tokens: {'chesapeake': 20867, 'incorporating': 13543, 'acted': 6051, 'prints': 11204, '##arium': 17285, 'ventura': 21151, 'groves': 21695, 'patent': 7353, '[unused699]': 704, 'nec': 26785}
 Output: {'input_ids': tensor([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 2023, 2607, 2005, 2026, 2878,
         2166, 1012,  102],
        [ 101, 1045, 5223, 2023, 2607, 2061, 2172,  999,  102,    0,    0,    0,
            0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]])}


The `padding` argument tells the tokenizer to pad all input sequences in the batch to the same length. This is important since the model needs batches of IDs having the same size to work properly. The `return_tensors="pt"` argument tells the tokenizer to return PyTorch tensors instead of plain Python lists of integers, since the model expects tensors as input.

The output itself is a dictionary containing two keys, `input_ids` and `attention_mask`. `input_ids` contains two rows of integers (one for each sentence) that are the unique identifiers of the tokens in each sentence. The `attention_mask` tells the model which positions are real tokens (1) and which are padding (0), so that the attention layers can ignore the padding tokens.
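To see how padding and the attention mask relate, here is a minimal pure-Python sketch that rebuilds both from the unpadded ID lists. `pad_batch` is a hypothetical helper written for this illustration; the tokenizer does the equivalent work internally when `padding=True` is passed.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that attention should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs of the two example sentences, taken from the output above
batch = pad_batch([
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 2023, 2607, 2005, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2607, 2061, 2172, 999, 102],
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```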

We can recover the original tokens from input ids as follows:

In [10]:
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][1]))

['[CLS]', 'i', "'", 've', 'been', 'waiting', 'for', 'this', 'course', 'for', 'my', 'whole', 'life', '.', '[SEP]']
['[CLS]', 'i', 'hate', 'this', 'course', 'so', 'much', '!', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']


The special `[CLS]` and `[SEP]` tokens are used by BERT-like models to delimit the sentences, and they are automatically added by the tokenizer.

We can now proceed to download the model and perform inference over the input ids:

In [11]:
from transformers import AutoModel

identifier = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(identifier)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.bias', 'pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


This architecture contains only the base Transformer module: given some inputs, it outputs what we’ll call **hidden states**, also known as features. For each input token, we retrieve a high-dimensional vector that is the contextual representation of that token in the model's learned embedding space.

While these hidden states can be useful on their own, they’re usually inputs to another part of the model, known as the **head**, which is responsible for performing the actual prediction associated with the target task.

The tensor output by the model is usually three-dimensional, with the following dimensions:

- Batch size: The number of sequences processed at a time (2 in our example).
- Sequence length: The length of the numerical representation of the sequence (15 in our example).
- Hidden size: The vector dimension of each model input. The vector is said to be “high dimensional” because of this last value. The hidden size can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more).

We can now feed the `inputs` produced by the tokenizer to the model:

In [12]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 15, 768])


These are the same values that could be extracted by the `feature-extraction` pipeline.

### Using Heads

Model heads take the high-dimensional vector of hidden states as input and project them onto a different dimension. They are usually composed of one or a few linear layers:

<div>
<img src="https://huggingface.co/course/static/chapter2/transformer_and_head.png" alt="A Transformer model" width="80%"/>
</div>

The output of the Transformer model is sent directly to the model head to be processed.

In this diagram, the model is represented by its embeddings layer and the subsequent layers. The embeddings layer converts each input ID in the tokenized input into a vector that represents the associated token. The subsequent layers manipulate those vectors using the attention mechanism to produce the final representation of the sentences.
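As a rough sketch of what a head does, the snippet below applies a single linear layer to one hidden-state vector. All numbers are made up for illustration (a hidden size of 4 instead of 768), and real heads may add extra layers and non-linearities before the final projection.

```python
def linear(vector, weights, bias):
    """Apply a linear layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


# Hypothetical hidden states for one sequence: 3 tokens, hidden size 4
hidden_states = [
    [1.0, 0.0, 2.0, -1.0],   # first token ([CLS] position)
    [0.5, 0.5, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
]

# Hypothetical head parameters projecting hidden size 4 onto 2 labels
weights = [
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, -0.5, 1.0],
]
bias = [0.0, 0.5]

# A sequence classification head typically reads a single pooled vector,
# here the hidden state of the first token
logits = linear(hidden_states[0], weights, bias)
print(logits)
```

The output has one value per label, whatever the hidden size was, which is why the head's output is much smaller than the hidden states it reads.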

There are many different architectures available in 🤗 Transformers, with each one designed around tackling a specific task. Here is a non-exhaustive list:

- *Model (default headless model)
- *ForCausalLM
- *ForMaskedLM
- *ForMultipleChoice
- *ForQuestionAnswering
- *ForSequenceClassification
- *ForTokenClassification
- *ForSeq2SeqLM

For our example, we will need a model with a sequence classification head (to be able to classify the sentences as positive or negative). So, we won’t actually use the `AutoModel` class, but `AutoModelForSequenceClassification`:

In [13]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

Now if we look at the shape of our outputs, the dimensionality will be much lower: the model head takes as input the high-dimensional vectors we saw before, and outputs vectors containing two values (one per label). Since we have just two sentences and two labels, the result we get from our model is of shape 2 x 2:

In [14]:
print(outputs.logits.shape)

torch.Size([2, 2])


### Postprocessing outputs

The values we get as output from our model don’t necessarily make sense by themselves. Let’s take a look:

In [15]:
print(outputs.logits)

tensor([[-2.0023,  1.9307],
        [ 4.4057, -3.5342]], grad_fn=<AddmmBackward0>)


Our model predicted [-2.0023, 1.9307] for the first sentence and [4.4057, -3.5342] for the second one. Those are not probabilities but logits, the raw, unnormalized scores outputted by the last layer of the model. To be converted to probabilities, they need to go through a SoftMax layer (all 🤗 Transformers models output the logits, as the loss function for training will generally fuse the last activation function, such as SoftMax, with the actual loss function, such as cross entropy):

In [16]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[1.9210e-02, 9.8079e-01],
        [9.9964e-01, 3.5611e-04]], grad_fn=<SoftmaxBackward0>)


Now we can see that the model predicted [0.0192, 0.9808] for the first sentence and [0.9996, 0.0004] for the second one. These are recognizable probability scores.
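For reference, the SoftMax computation itself is short enough to write by hand. The sketch below uses only the standard library and the logits printed above; `softmax` here is our own helper, not the PyTorch function.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable without changing the result
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits printed above for the two example sentences
for row in ([-2.0023, 1.9307], [4.4057, -3.5342]):
    print([round(p, 4) for p in softmax(row)])
```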

To get the labels corresponding to each position, we can inspect the `id2label` attribute of the model configuration:

In [17]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

Now we can conclude that the model predicted the following:

- First sentence: NEGATIVE: 0.0192, POSITIVE: 0.9808
- Second sentence: NEGATIVE: 0.9996, POSITIVE: 0.0004
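The `sentiment-analysis` pipeline we used at the beginning reports only the most likely label and its score. A minimal sketch of that last step, assuming the rounded probabilities and the `id2label` mapping shown above, could look like this:

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities from the softmax step above, rounded to four decimals
probabilities = [[0.0192, 0.9808], [0.9996, 0.0004]]

predictions = []
for probs in probabilities:
    # Index of the highest probability, i.e. the predicted class ID
    best = max(range(len(probs)), key=probs.__getitem__)
    predictions.append({"label": id2label[best], "score": probs[best]})

print(predictions)
```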

We have successfully reproduced the three steps of the pipeline: preprocessing with tokenizers, passing the inputs through the model, and postprocessing. Here is a summary of all the steps:

In [18]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

identifier = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(identifier)
model = AutoModelForSequenceClassification.from_pretrained(identifier)
sequences = ["I've been waiting for this course for my whole life.", "I hate this course so much!"]


tokens = tokenizer(sequences, padding=True, return_tensors="pt")
output = model(**tokens)
probs = torch.nn.functional.softmax(output.logits, dim=-1).tolist()
for i, p in enumerate(probs):
    print(sequences[i], {model.config.id2label[j]:v for j, v in enumerate(p)})

I've been waiting for this course for my whole life. {'NEGATIVE': 0.019209884107112885, 'POSITIVE': 0.9807901382446289}
I hate this course so much! {'NEGATIVE': 0.9996439218521118, 'POSITIVE': 0.0003561103658284992}


In the case of generative models (both causal like GPT-2 and sequence-to-sequence like BART or T5), we make use of the `model.generate` method to generate a sequence of ids representing the predicted output. We then use the `tokenizer.decode` method to convert the ids into the corresponding text. We will use a small transformer model trained to translate from Russian to English as an example.

In [19]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

identifier = "Helsinki-NLP/opus-mt-ru-en"
tokenizer = AutoTokenizer.from_pretrained(identifier)
model = AutoModelForSeq2SeqLM.from_pretrained(identifier)
sequences = ["Меня зовут Габриелэ и я живу в Гроннингене", "Надеюсь, вам нравится курс!"]


tokens = tokenizer(sequences, padding=True, return_tensors="pt")
output = model.generate(**tokens)
print(tokenizer.batch_decode(output, skip_special_tokens=True))

# Alternatively, use decode sentence-by-sentence
for s in output.tolist():
    print(tokenizer.decode(s, skip_special_tokens=True))

['My name is Gabriele and I live in Groningen.', 'I hope you like the course!']
My name is Gabriele and I live in Groningen.
I hope you like the course!


The `skip_special_tokens` argument removes all the special tokens like `</s>` and `<pad>` from the output. The `generate` method is highly versatile; refer to the [documentation](https://huggingface.co/docs/transformers/v4.15.0/en/main_classes/model#transformers.generation_utils.GenerationMixin.generate) for more details.

## Loading and Saving

As mentioned before, the `AutoModel` class and all of its relatives are actually simple wrappers over the wide variety of models available in the library. It’s a clever wrapper as it can automatically guess the appropriate model architecture for your checkpoint, and then instantiates a model with this architecture. However, if you know the type of model you want to use, you can use the class that defines its architecture directly. Let’s take a look at how this works with a BERT model.

In [20]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.12.5",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



The configuration contains many attributes that are used to build the model. While you haven’t seen what all of these attributes do yet, you should recognize some of them: the `hidden_size` attribute defines the size of the `hidden_states` vector, and `num_hidden_layers` defines the number of layers the Transformer model has.

Creating a model from the default configuration initializes it with random values. The model can be used in this state, but it will output gibberish; it needs to be trained first. However, this procedure requires a long time and a lot of data. To avoid unnecessary and duplicated effort, it’s imperative to be able to share and reuse models that have already been trained.

Loading a Transformer model that is already trained is simple — we can do this using the `from_pretrained` method:

In [21]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

This model is now initialized with all the weights of the checkpoint. It can be used directly for inference on the tasks it was trained on, and it can also be fine-tuned on a new task. By training with pretrained weights rather than from scratch, we can quickly achieve good results.

The weights have been downloaded and cached (so future calls to the `from_pretrained` method won’t re-download them) in the cache folder, which defaults to `~/.cache/huggingface/transformers` in the library version used here (more recent releases use `~/.cache/huggingface/hub`). You can customize your cache folder by setting the `HF_HOME` environment variable.
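A minimal sketch of setting the variable from Python is shown below. The path is a hypothetical example, and the assumption is that the variable is set before any 🤗 library is imported, since the cache location is determined at import time.

```python
import os

# Hypothetical location; any writable directory works
os.environ["HF_HOME"] = "/tmp/my_hf_cache"

# Import 🤗 libraries only after this point so they pick up the new location, e.g.:
# from transformers import BertModel
print(os.environ["HF_HOME"])
```

In a notebook, this means running the cell above first, or restarting the runtime if the libraries were already imported.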

The identifier used to load the model can be the identifier of any model on the Model Hub, as long as it is compatible with the BERT architecture.

Saving a model is as easy as loading one — we use the `save_pretrained` method, which is analogous to the `from_pretrained` method. This saves two files to your disk: the configuration file and the weights.

In [22]:
model.save_pretrained("directory_on_my_computer")

The same can be accomplished for a tokenizer. This saves the essential files to restore the tokenizer object using `from_pretrained("directory_on_my_computer")`.

In [23]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer.save_pretrained("directory_on_my_computer")

('directory_on_my_computer/tokenizer_config.json',
 'directory_on_my_computer/special_tokens_map.json',
 'directory_on_my_computer/vocab.txt',
 'directory_on_my_computer/added_tokens.json')

This concludes our quick overview of the `transformers` library. You can refer to the extended version of this introduction at the original [HuggingFace Course](https://huggingface.co/course) for more examples.

## Using the 🤗 Datasets library

🤗 Datasets provides loading scripts to handle the loading of local and remote datasets. It supports several common data formats, such as:

| Data format        | Loading script | Example                                                 |
|--------------------|----------------|---------------------------------------------------------|
| CSV & TSV          | csv            | `load_dataset("csv", data_files="my_file.csv")`         |
| Text files         | text           | `load_dataset("text", data_files="my_file.txt")`        |
| JSON & JSON Lines  | json           | `load_dataset("json", data_files="my_file.jsonl")`      |
| Pickled DataFrames | pandas         | `load_dataset("pandas", data_files="my_dataframe.pkl")` |

As shown in the table, for each data format we just need to specify the type of loading script in the `load_dataset` function, along with a `data_files` argument that specifies the path to one or more files. Let’s see how we can load some remote files from the [SQuAD-it](https://github.com/crux82/squad-it) dataset for question answering in Italian:

In [24]:
from datasets import load_dataset

url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

Using custom data configuration default-57dcee3ea6992346
Reusing dataset json (/home/gsarti/.cache/huggingface/datasets/json/default-57dcee3ea6992346/0.0.0/c2d554c3377ea79c7664b93dc65d0803b45e3279000f993c7bfd18937fd7f426)
100%|██████████| 2/2 [00:00<00:00, 103.16it/s]


We can see that the dataset contains two splits, `train` and `test`, which contain respectively 442 and 48 rows, each with a `title` and a `paragraphs` field.

In [25]:
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})
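The `field="data"` argument passed to `load_dataset` earlier is needed because files in the SQuAD layout keep their records under a top-level `data` key. The sketch below mimics this with the standard `json` module on a hypothetical miniature document; the titles and texts are invented for illustration.

```python
import json

# Hypothetical miniature file following the SQuAD layout
raw = json.dumps({
    "version": "toy",
    "data": [
        {"title": "Article A", "paragraphs": [{"context": "First text.", "qas": []}]},
        {"title": "Article B", "paragraphs": [{"context": "Second text.", "qas": []}]},
    ],
})

document = json.loads(raw)
# `field="data"` points the loader at this list; each element becomes one row
rows = document["data"]
print(len(rows), sorted(rows[0].keys()))
```

This is why each row of the loaded dataset exposes exactly the `title` and `paragraphs` features.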

We can now inspect the first element, showing that it contains a `title` and a list of `paragraphs`, each of which holds a `context` and a nested set of `qas`. To do so, we access the `train` dataset in the DatasetDict as we would access a normal dictionary, and select the first element from it:

In [None]:
squad_it_dataset["train"][0]

Let's now load the original [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset directly from the Datasets hub, using its identifier `squad`. We can see that the structure has been flattened and we now have a triplet (context, question, answer) per row, for a total of 87599 training examples and 10570 validation examples.

In [27]:
from datasets import load_dataset

squad = load_dataset("squad")
squad

Reusing dataset squad (/home/gsarti/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)
100%|██████████| 2/2 [00:00<00:00, 162.40it/s]


DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

To get a feel for the dataset, let's shuffle the `train` split, keep 1000 examples, and look at the first two. The result of the slicing operation is a dictionary of lists, holding the values of each field for the selected examples:

In [28]:
squad_train = squad["train"].shuffle(seed=42).select(range(1000))
squad_train[:2]

Loading cached shuffled indices for dataset at /home/gsarti/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-10de9997c4b83f65.arrow


{'id': ['573173d8497a881900248f0c', '57277e815951b619008f8b52'],
 'title': ['Egypt', 'Ann_Arbor,_Michigan'],
 'context': ['The Pew Forum on Religion & Public Life ranks Egypt as the fifth worst country in the world for religious freedom. The United States Commission on International Religious Freedom, a bipartisan independent agency of the US government, has placed Egypt on its watch list of countries that require close monitoring due to the nature and extent of violations of religious freedom engaged in or tolerated by the government. According to a 2010 Pew Global Attitudes survey, 84% of Egyptians polled supported the death penalty for those who leave Islam; 77% supported whippings and cutting off of hands for theft and robbery; and 82% support stoning a person who commits adultery.',
  'The Ann Arbor Hands-On Museum is located in a renovated and expanded historic downtown fire station. Multiple art galleries exist in the city, notably in the downtown area and around the University 

You may have noticed that the field `answers` contains one dictionary per example, each holding a list of `answer_start` values and a list of `text` values. The lists allow a question to have multiple reference answers, and `answer_start` (the character offset of the answer in the context) is stored because extractive QA systems use it as a prediction target.
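To make the role of `answer_start` concrete, here is a minimal sketch in plain Python. The record below is hypothetical (made up to mimic the shape of a SQuAD row), and `answer_span` is a helper name introduced only for this illustration:

```python
# Hypothetical record shaped like a SQuAD row; the text is made up for illustration.
context = "The museum is located in a renovated fire station downtown."
answer_text = "a renovated fire station"
example = {
    "context": context,
    "question": "Where is the museum located?",
    "answers": {
        "text": [answer_text],
        # Character offset of the answer inside the context
        "answer_start": [context.index(answer_text)],
    },
}


def answer_span(example, i=0):
    """Recover the i-th answer by slicing the context at its character offset."""
    start = example["answers"]["answer_start"][i]
    end = start + len(example["answers"]["text"][i])
    return example["context"][start:end]


span = answer_span(example)
print(span)
```

Slicing the context at the stored offset gives back the answer text, which is why an extractive model can be trained to predict the offset instead of generating the answer.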

Let's say we plan to train a sequence-to-sequence model to generate answers from paragraphs and questions, without extracting them directly from the context. In this case, the `answer_start` offset in the `answers` field is not relevant, and we can drop it. We want to adapt the dataset by extracting only the value of the `text` field from the first answer for each example, and by concatenating the context with the question to have a single source field. Let's define a custom function and apply it with the `Dataset.map` method:

In [29]:
def extract_text(example):
    return {
        "text_answer": example["answers"]["text"][0],
        "text_question": example["context"] + " Question: " + example["question"],
    }

squad_train = squad_train.map(
    extract_text, remove_columns=["title", "context", "question", "answers"]
)

squad_train[:2]

Loading cached processed dataset at /home/gsarti/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-b73576d9a85a1624.arrow


{'id': ['573173d8497a881900248f0c', '57277e815951b619008f8b52'],
 'text_answer': ['84%', 'books'],
 'text_question': ['The Pew Forum on Religion & Public Life ranks Egypt as the fifth worst country in the world for religious freedom. The United States Commission on International Religious Freedom, a bipartisan independent agency of the US government, has placed Egypt on its watch list of countries that require close monitoring due to the nature and extent of violations of religious freedom engaged in or tolerated by the government. According to a 2010 Pew Global Attitudes survey, 84% of Egyptians polled supported the death penalty for those who leave Islam; 77% supported whippings and cutting off of hands for theft and robbery; and 82% support stoning a person who commits adultery. Question: What percentage of Egyptians polled support death penalty for those leaving Islam?',
  'The Ann Arbor Hands-On Museum is located in a renovated and expanded historic downtown fire station. Multiple
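Since the function passed to `map` receives one example as a dictionary and returns a dictionary of new fields, it can be tried on a single record before running it on the whole dataset. The sketch below repeats `extract_text` from the cell above and calls it on a hypothetical record made up for illustration:

```python
def extract_text(example):
    return {
        "text_answer": example["answers"]["text"][0],
        "text_question": example["context"] + " Question: " + example["question"],
    }


# Hypothetical record, made up for illustration (two reference answers).
example = {
    "context": "Water boils at 100 degrees Celsius at sea level.",
    "question": "At what temperature does water boil at sea level?",
    "answers": {"text": ["100 degrees Celsius", "100 degrees"], "answer_start": [15, 15]},
}

new_fields = extract_text(example)
print(new_fields["text_answer"])
print(new_fields["text_question"])
```

Only the first reference answer is kept, and the returned keys become the new columns when the function is applied with `map`.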

Note that we used the `remove_columns` parameter of `map` to also drop the columns we no longer need. Let's add the finishing touches by keeping only examples whose concatenated context and question are shorter than 512 whitespace-separated words, and by renaming the two new fields to `source` and `target` for uniformity.

In [30]:
# We use a lambda expression here for simple whitespace tokenization, but a function
# using an AutoTokenizer could also be used.
filtered_train = squad_train.filter(lambda x: len(x["text_question"].split(" ")) < 512)
print("Length original sample:", len(squad_train), "Length filtered sample:", len(filtered_train))

filtered_train = filtered_train.rename_columns({
    "text_answer": "target",
    "text_question": "source"
})

filtered_train[:2]

Loading cached processed dataset at /home/gsarti/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-f61fc212407c59e3.arrow


Length original sample: 1000 Length filtered sample: 1000


{'id': ['573173d8497a881900248f0c', '57277e815951b619008f8b52'],
 'target': ['84%', 'books'],
 'source': ['The Pew Forum on Religion & Public Life ranks Egypt as the fifth worst country in the world for religious freedom. The United States Commission on International Religious Freedom, a bipartisan independent agency of the US government, has placed Egypt on its watch list of countries that require close monitoring due to the nature and extent of violations of religious freedom engaged in or tolerated by the government. According to a 2010 Pew Global Attitudes survey, 84% of Egyptians polled supported the death penalty for those who leave Islam; 77% supported whippings and cutting off of hands for theft and robbery; and 82% support stoning a person who commits adultery. Question: What percentage of Egyptians polled support death penalty for those leaving Islam?',
  'The Ann Arbor Hands-On Museum is located in a renovated and expanded historic downtown fire station. Multiple art galleri
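The whitespace split used in the filter above is only a rough proxy for the length seen by a model. As a small sketch in plain Python (the helper name `within_word_limit` and the sample string are made up for illustration), note that `split(" ")` and `split()` can count words differently when a text contains consecutive spaces:

```python
def within_word_limit(text, max_words=512):
    """Return True if the text has fewer than max_words whitespace-separated words."""
    return len(text.split()) < max_words


double_spaced = "What  is SQuAD?"  # note the two consecutive spaces
tokens_single_space = double_spaced.split(" ")  # keeps an empty string between the spaces
tokens_any_whitespace = double_spaced.split()   # treats any run of whitespace as one separator

print(tokens_single_space)
print(tokens_any_whitespace)
print(within_word_limit(double_spaced, max_words=4))
```

A predicate like this could be passed to `filter` through a lambda reading the relevant field. For a limit expressed in model tokens rather than words, a tokenizer-based count would be needed instead.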

To enable the conversion between various third-party libraries, 🤗 Datasets provides a `Dataset.set_format` method. It only changes the output format of the dataset, so you can easily switch to another format without affecting the underlying data format, which is Apache Arrow. The formatting is done in place. To demonstrate, let's convert our dataset to the format of pandas, the popular data science library:

In [31]:
# Can be reset using reset_format()
filtered_train.set_format("pandas")
filtered_train[:3]

| | id | target | source |
|---|---|---|---|
| 0 | 573173d8497a881900248f0c | 84% | The Pew Forum on Religion & Public Life ranks ... |
| 1 | 57277e815951b619008f8b52 | books | The Ann Arbor Hands-On Museum is located in a ... |
| 2 | 5727e2483acd2414000deef0 | the executive | One important aspect of the rule-of-law initia... |


We can now create a proper DataFrame from the dataset, and perform some additional transformations on it to see how the length of the answers (in words) is distributed:

In [32]:
train_df = filtered_train[:]
train_df["len_target"] = train_df["target"].apply(lambda x: len(x.split(" ")))

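# rename_axis + reset_index(name=...) yields the same column names
# on pandas 1.x and 2.x, where value_counts() labels its output differently
frequencies = (
    train_df["len_target"]
    .value_counts()
    .rename_axis("num_words")
    .reset_index(name="frequency")
)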
frequencies.head()

| | num_words | frequency |
|---|---|---|
| 0 | 1 | 341 |
| 1 | 2 | 246 |
| 2 | 3 | 168 |
| 3 | 4 | 77 |
| 4 | 5 | 42 |
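A frequency table like this one is also enough to derive the mean answer length, as a frequency-weighted average. The sketch below reproduces the idea in plain Python with `collections.Counter`; the answers are hypothetical and made up for illustration, so the numbers are not those of the SQuAD sample:

```python
from collections import Counter

# Hypothetical answers, made up for illustration.
answers = ["84%", "books", "the executive", "a renovated fire station", "New York City"]

# Length of each answer in whitespace-separated words
lengths = [len(answer.split()) for answer in answers]

# Number of answers for each length, similar in spirit to value_counts()
frequencies = Counter(lengths)

# Mean length as a frequency-weighted average
mean_length = sum(n * count for n, count in frequencies.items()) / sum(frequencies.values())

print(frequencies.most_common())
print(mean_length)
```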


We can now convert this new DataFrame back into a Dataset object using the `Dataset.from_pandas` function:

In [33]:
from datasets import Dataset

freq_dataset = Dataset.from_pandas(frequencies)
freq_dataset

Dataset({
    features: ['num_words', 'frequency'],
    num_rows: 26
})

To conclude this overview of the 🤗 Datasets library, let's take a look at how Datasets can be saved. The three main functions to save a Dataset are `save_to_disk`, `to_csv` and `to_json`. 

The first one creates a folder containing the data as a `dataset.arrow` table, together with some metadata in `dataset_info.json` and `state.json`. When a `DatasetDict` is saved, each split gets its own subfolder with these files. You can think of the Arrow format as a fancy table of columns and rows that is optimized for building high-performance applications that process and transport large datasets.

Once the dataset is saved, it can be loaded back using `load_from_disk`.

In [34]:
from datasets import load_from_disk

# Save the dataset to disk
freq_dataset.save_to_disk("frequencies")
del freq_dataset

# Reload it as a Dataset object
freq_dataset = load_from_disk("frequencies")
freq_dataset

Dataset({
    features: ['num_words', 'frequency'],
    num_rows: 26
})

For the CSV and JSON formats, we have to store each split as a separate file. One way to do this is by iterating over the keys and values in the DatasetDict object:

In [35]:
for split, dataset in squad.items():
    dataset.to_json(f"squad-{split}.jsonl")

Creating json from Arrow format: 100%|██████████| 9/9 [00:00<00:00, 14.15ba/s]
Creating json from Arrow format: 100%|██████████| 2/2 [00:00<00:00, 22.97ba/s]
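The `.jsonl` extension refers to the JSON Lines layout: one JSON object per line, one line per example. As far as we know this is what `to_json` writes by default, but the sketch below does not depend on it. It uses only the standard `json` module and two hypothetical rows, made up for illustration, to show the layout and how it is read back:

```python
import io
import json

# Hypothetical rows, made up for illustration.
rows = [
    {"id": "q1", "question": "What is sold there?", "answers": {"text": ["books"], "answer_start": [10]}},
    {"id": "q2", "question": "Who decides?", "answers": {"text": ["the executive"], "answer_start": [4]}},
]

# Write one JSON object per line (an in-memory buffer stands in for a file)
buffer = io.StringIO()
for row in rows:
    buffer.write(json.dumps(row) + "\n")

# Read the rows back line by line
lines = buffer.getvalue().splitlines()
restored = [json.loads(line) for line in lines]

print(len(lines))
print(restored == rows)
```

Because every line is an independent JSON document, nested fields such as `answers` survive the round trip, which a flat CSV file would not preserve as directly.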


We can then reload the dataset using the `load_dataset` function, as seen previously:

In [36]:
data_files = {
    "train": "squad-train.jsonl",
    "validation": "squad-validation.jsonl",
}
squad = load_dataset("json", data_files=data_files)

Using custom data configuration default-260a0a433a957006


Downloading and preparing dataset json/default to /home/gsarti/.cache/huggingface/datasets/json/default-260a0a433a957006/0.0.0/c2d554c3377ea79c7664b93dc65d0803b45e3279000f993c7bfd18937fd7f426...


100%|██████████| 2/2 [00:00<00:00, 1203.70it/s]
100%|██████████| 2/2 [00:00<00:00, 1218.57it/s]


Dataset json downloaded and prepared to /home/gsarti/.cache/huggingface/datasets/json/default-260a0a433a957006/0.0.0/c2d554c3377ea79c7664b93dc65d0803b45e3279000f993c7bfd18937fd7f426. Subsequent calls will reuse this data.


100%|██████████| 2/2 [00:00<00:00, 301.62it/s]


This concludes our quick overview of the 🤗 Datasets library. For more advanced concepts related to Datasets (memory mapping, interleaving, streaming, semantic search), refer to Chapters 5.4 and 5.5 of the [HuggingFace Course](https://huggingface.co/course/chapter5/4?fw=pt).