In [39]:
import pandas as pd

In [69]:
pd.set_option('display.max_columns', 75)

# Introduction

In Chapter 2 we explored how to use <i>tokenizers</i> and <i>pretrained models</i> to make predictions. But what if you want to <b>fine-tune a pretrained model for your own dataset</b>? That’s the topic of this chapter! You will learn:

- How to prepare a large dataset from the Hub
- How to use the high-level Trainer API to fine-tune a model
- How to use a custom training loop
- How to leverage the 🤗 Accelerate library to easily run that custom training loop on any distributed setup

# Processing the data (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
# !pip install datasets evaluate transformers[sentencepiece]

In [206]:
from transformers import logging
logging.set_verbosity_error()

Continuing with the example from the previous chapter, here is how we would train a sequence classifier on one batch in PyTorch:

In other words, this is a simple example of <span style="color:blue">fine-tuning a pre-trained BERT model for sequence classification</span> using the Hugging Face Transformers library.

In [207]:
# 1. Import necessary libraries
import torch
from torch.optim import AdamW  # the AdamW class exported by transformers is deprecated
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# 2. Load the pre-trained BERT model and tokenizer
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# 3. Define input sequences
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]

# 4. Tokenize and create a batch of the input sequences
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# 5. Add target labels to our batch
# This is new
batch["labels"] = torch.tensor([1, 1])

# 6. Initialize the optimizer
optimizer = AdamW(model.parameters())

# 7. Compute loss and perform backpropagation to compute gradients for each param
loss = model(**batch).loss
loss.backward()

# 8. Update model parameters (based on computed gradients)
optimizer.step()
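
To make steps 7 and 8 concrete, here is a minimal sketch of the same cycle (forward pass, loss, gradients, parameter update) on a tiny logistic-regression model in plain Python. It is only an illustration: it uses plain gradient descent rather than AdamW, and the features, learning rate, and helper names are made up for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_loss(w, b, xs, ys):
    """Mean binary cross-entropy of a logistic model over a batch."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)

def gradients(w, b, xs, ys):
    """Analytic gradients of the mean loss (the role played by loss.backward())."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        err = p - y
        for i, xi in enumerate(x):
            grad_w[i] += err * xi / len(xs)
        grad_b += err / len(xs)
    return grad_w, grad_b

xs = [[1.0, 2.0], [2.0, 1.0]]  # two made-up inputs, each with 2 features
ys = [1, 1]                    # both labelled as class 1, like the batch above
w, b = [0.0, 0.0], 0.0
lr = 0.1                       # hypothetical learning rate

loss_before = forward_loss(w, b, xs, ys)
grad_w, grad_b = gradients(w, b, xs, ys)
# The role played by optimizer.step(): move each parameter against its gradient.
w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
b = b - lr * grad_b
loss_after = forward_loss(w, b, xs, ys)
print(loss_before, loss_after)
```

A single update lowers the loss on this batch, which is all one step can promise; it says nothing about how the model behaves on other data.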

<span style="color:red">Of course, just training the model on two sentences is not going to yield very good results</span>. <span style="color:green">To get better results, you will need to prepare a bigger dataset</span>.

In this section we will use the MRPC (Microsoft Research Paraphrase Corpus) dataset as an example. It was introduced in a [paper](https://www.aclweb.org/anthology/I05-5002.pdf) by William B. Dolan and Chris Brockett. <i>The dataset consists of <b>5,801 pairs of sentences, with a label indicating whether they are paraphrases (i.e., whether both sentences mean the same thing)</b></i>. We’ve selected it for this chapter because it’s a small dataset, so it’s easy to experiment with training on it.

## Loading a dataset from the Hub

The Hub doesn’t just contain models; it also has multiple datasets in lots of different languages. You can browse the datasets [here](https://huggingface.co/datasets), and we recommend you try to load and process a new dataset once you have gone through this section (see the general documentation [here](https://huggingface.co/docs/datasets/loading)). But for now, let’s focus on the MRPC dataset! This is one of the 10 datasets that make up the General Language Understanding Evaluation ([GLUE benchmark](https://gluebenchmark.com/); see also [jiant](https://github.com/nyu-mll/jiant)), an academic benchmark used to measure the performance of ML models across 10 different text classification tasks.

The 🤗 Datasets library provides a very simple command to download and cache a dataset on the Hub. We can download the MRPC dataset like this:

In [208]:
from datasets import load_dataset

# load the MRPC dataset from the GLUE benchmark
raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [209]:
type(raw_datasets), type(raw_datasets['train']), type(raw_datasets['train'][0])

(datasets.dataset_dict.DatasetDict, datasets.arrow_dataset.Dataset, dict)

As you can see, we get a `DatasetDict` object which contains the training set, the validation set, and the test set. Each of those contains several columns (sentence1, sentence2, label, and idx) and a variable number of rows, which are the number of elements in each set (so, there are 3,668 pairs of sentences in the training set, 408 in the validation set, and 1,725 in the test set) $\Rightarrow$ 3668+408+1725 = 5801 pairs of sentences in total.

This command downloads and caches the dataset, by default in <code>~/.cache/huggingface/datasets</code>. Recall from Chapter 2 that you can customize your cache folder by setting the HF_HOME environment variable.
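
As a rough sketch of how that folder is chosen, the snippet below resolves the datasets cache path from a mapping of environment variables. The helper `datasets_cache_dir` is hypothetical and deliberately simplified: it only looks at `HF_HOME`, whereas the library also honors other variables such as `HF_DATASETS_CACHE`.

```python
import os

def datasets_cache_dir(environ):
    """Resolve the datasets cache folder from a mapping of environment variables."""
    # Assumption: without HF_HOME, the base folder is ~/.cache/huggingface.
    hf_home = environ.get("HF_HOME", os.path.join("~", ".cache", "huggingface"))
    return os.path.expanduser(os.path.join(hf_home, "datasets"))

print(datasets_cache_dir({}))                       # default location
print(datasets_cache_dir({"HF_HOME": "/data/hf"}))  # customized location
```

In a real session you would pass `os.environ` and set `HF_HOME` before importing the library.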

We can access each pair of sentences in our raw_datasets object by indexing, like with a dictionary:

In [210]:
raw_train_dataset = raw_datasets["train"]
print (len(raw_train_dataset))
raw_train_dataset[0]

3668


{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [211]:
raw_train_dataset[14]

{'sentence1': 'Gyorgy Heizler , head of the local disaster unit , said the coach was carrying 38 passengers .',
 'sentence2': 'The head of the local disaster unit , Gyorgy Heizler , said the coach driver had failed to heed red stop lights .',
 'label': 0,
 'idx': 15}

In [212]:
raw_datasets["validation"][86]

{'sentence1': 'He was arrested Friday night at an Alpharetta seafood restaurant while dining with his wife , singer Whitney Houston .',
 'sentence2': 'He was arrested again Friday night at an Alpharetta restaurant where he was having dinner with his wife .',
 'label': 1,
 'idx': 796}

We can see the labels are already integers, so we won’t have to do any preprocessing there. To know which integer corresponds to which label, we can inspect the `features` of our `raw_train_dataset`. This will tell us the type of each column:

In [213]:
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

Behind the scenes, `label` is of type `ClassLabel`, and the mapping of integers to label names is stored in its `names` attribute. `0` corresponds to `not_equivalent`, and `1` corresponds to `equivalent`.
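
The mapping is simply the position of each name in that list. Here is a minimal plain-Python sketch of the lookup. The helper names mirror the `int2str` and `str2int` methods of `ClassLabel`, but this stand-in is not the library's implementation.

```python
# Label names as shown in the `features` output above.
names = ["not_equivalent", "equivalent"]

def int2str(label_id):
    """Integer label -> human-readable name."""
    return names[label_id]

def str2int(label_name):
    """Human-readable name -> integer label."""
    return names.index(label_name)

print(int2str(1))
print(str2int("not_equivalent"))
```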

## Preprocessing a dataset

<img src="images/text-classification-for-pair-of-sentences.png.png" style="width:850px;" title="Text classification for pair of sentences">

<img src="images/GLUE-benchmark-10-datasets.png" style="width:850px;" title="GLUE benchmark: 10 datasets">

<span style="color:blue"><b>Models like BERT are often pretrained with a dual objective</b></span>
- <span style="color:blue">Objective 1: <b>Language modeling</b></span>
- <span style="color:blue">Objective 2: <b>Related to sentence pairs</b></span>

For instance, during pretraining, BERT is shown pairs of sentences and must predict
- the values of randomly masked tokens, and
- whether the second sentence follows the first

<b>Detailed explanation here:</b>
    
> BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model designed to capture deep contextualized representations of text. It is pre-trained on a large corpus of text using unsupervised learning, specifically with two main tasks:
> 
> 1. <b>Masked Language Modeling (MLM)</b>: In this task, BERT is trained to predict randomly masked words in a sentence. During training, a certain percentage of the input tokens are replaced with a special [MASK] token, and the model is tasked with predicting the original tokens based on the context provided by the surrounding unmasked tokens. This encourages the model to learn bidirectional contextual representations of words in a sentence.
>
> 2. <b>Next Sentence Prediction (NSP)</b>: In this task, BERT is trained to predict whether a given sentence follows another sentence. The model is provided with pairs of sentences and learns to classify whether the second sentence is a logical continuation of the first one or not. This helps the model to learn relationships between sentences and capture a broader understanding of the text.
>
> By pre-training BERT on these two tasks, the model learns a rich understanding of the language, including syntax, semantics, and context. Once pre-trained, BERT can be fine-tuned on a wide range of downstream tasks, such as text classification, named entity recognition, question-answering, and more, with relatively small amounts of labeled data. This process of fine-tuning allows BERT to leverage its pre-trained knowledge and adapt to the specific task at hand, often resulting in state-of-the-art performance.
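
To make the masking step tangible, here is a deliberately simplified sketch of how inputs and targets for masked language modeling could be built. The function name, the masking probability used in the call, and the seed are assumptions for this example. The original BERT recipe is more involved: it selects about 15% of tokens and sometimes substitutes a random token or leaves the token unchanged instead of always inserting `[MASK]`.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK]; return inputs and targets."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(tok)   # the model must recover this token
        else:
            masked.append(tok)
            targets.append(None)  # position is not scored
    return masked, targets

tokens = "the coach was carrying 38 passengers".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```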

<img src="images/BERT-can-recognize-relationships-between-two-sentences.png" style="width:850px;" title="BERT can recognize relationships between two sentences">

The tokenizer prepares the `token_type_ids` and the `attention_mask` that the model needs to interpret a pair of sentences.

<img src="images/tokenizer-for-pair-of-sentences.png" style="width:950px;" title="tokenizer for pair of sentences">

To preprocess the dataset, we need to convert the text to numbers the model can make sense of. As you saw in the previous chapter, this is done with a tokenizer. We can feed the tokenizer one sentence or a list of sentences, so we can directly tokenize all the first sentences and all the second sentences of each pair like this:

In [27]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

<span style="color:red">However, we can’t just pass two sequences to the model and get a prediction of whether the two sentences are paraphrases or not</span>. <span style="color:blue">We need to handle the two sequences as a pair, and apply the appropriate preprocessing.</span> <span style="color:green">Fortunately, the tokenizer can also take a pair of sequences and prepare it the way our BERT model expects</span>:

In [85]:
inputs = tokenizer("This is the first sentence.", "This is the second one.")
for key, value in inputs.items():
    print (f"{key}: {value}")

input_ids: [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102]
token_type_ids: [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


We discussed the `input_ids` and `attention_mask` keys in Chapter 2, but we put off talking about `token_type_ids`. In this example, this is what <span style="color:green">tells the model which part of the input is the first sentence and which is the second sentence.</span>

If we decode the IDs inside `input_ids` back to tokens and align them with `token_type_ids`, we can see the model expects the inputs to be of the form <span style="color:blue"><b>`[CLS]` sentence1 `[SEP]` sentence2 `[SEP]`</b></span> when there are two sentences.

In [87]:
df = pd.DataFrame([
    tokenizer.convert_ids_to_tokens(inputs["input_ids"]),
    inputs["token_type_ids"]
])
df.index = ["tokens", "token_type_ids"]

slice_ = [col for col in df.columns if (df.loc['token_type_ids', col]==1)]
df.style.set_properties(**{'background-color': '#ffffb3'}, subset=slice_)

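| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tokens | [CLS] | this | is | the | first | sentence | . | [SEP] | this | is | the | second | one | . | [SEP] |
| token_type_ids | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |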


In [97]:
for key, value in raw_datasets['train'][14].items():
    print (f"{key:<10s}: {value}")

inputs = tokenizer(raw_datasets['train'][14]['sentence1'], raw_datasets['train'][14]['sentence2'])
    
df = pd.DataFrame([
    tokenizer.convert_ids_to_tokens(inputs["input_ids"]),
    inputs["token_type_ids"]
])
df.index = ["tokens", "token_type_ids"]

slice_ = [col for col in df.columns if (df.loc['token_type_ids', col]==1)]
df.style.set_properties(**{'background-color': '#ffffb3'}, subset=slice_)

sentence1 : Gyorgy Heizler , head of the local disaster unit , said the coach was carrying 38 passengers .
sentence2 : The head of the local disaster unit , Gyorgy Heizler , said the coach driver had failed to heed red stop lights .
label     : 0
idx       : 15


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52
tokens,[CLS],g,##yo,##rgy,he,##iz,##ler,",",head,of,the,local,disaster,unit,",",said,the,coach,was,carrying,38,passengers,.,[SEP],the,head,of,the,local,disaster,unit,",",g,##yo,##rgy,he,##iz,##ler,",",said,the,coach,driver,had,failed,to,hee,##d,red,stop,lights,.,[SEP]
token_type_ids,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1


<span style="color:blue"><b>As you can see, the parts of the input corresponding to `[CLS]` `sentence1` `[SEP]` all have a token type ID of `0`, while the other parts, corresponding to `sentence2` `[SEP]`, all have a token type ID of `1`.</b></span>
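
The layout can be reproduced in a few lines of plain Python. This is only a sketch of the pattern seen in the output above, working on already-split tokens. `build_pair` is a hypothetical helper, not part of the tokenizer API.

```python
def build_pair(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] a [SEP] b [SEP] with matching segment IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS] + sentence1 + the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids

tokens, token_type_ids = build_pair(
    ["this", "is", "the", "first", "sentence", "."],
    ["this", "is", "the", "second", "one", "."],
)
print(tokens)
print(token_type_ids)
```

For the two example sentences this yields eight `0`s followed by seven `1`s, matching the `token_type_ids` printed by the real tokenizer.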

Note that if you select a different checkpoint, you won’t necessarily have the `token_type_ids` in your tokenized inputs (for instance, they’re not returned if you use a DistilBERT model). They are only returned when the model will know what to do with them, because it has seen them during its pretraining.

Here, BERT is pretrained with token type IDs, and on top of the masked language modeling objective we talked about in Chapter 1, it has an additional objective called <i>next sentence prediction</i>. The goal with this task is to model the relationship between pairs of sentences.

With next sentence prediction, the model is provided pairs of sentences (with randomly masked tokens) and asked to predict whether the second sentence follows the first. To make the task non-trivial, half of the time the sentences follow each other in the original document they were extracted from, and the other half of the time the two sentences come from two different documents.
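
A minimal sketch of how such training pairs could be assembled is shown below. The function name, the toy documents, and the seed are made up for illustration, and the real pretraining pipeline differs in many details (it also applies the token masking at the same time).

```python
import random

def make_nsp_examples(documents, seed=0):
    """Build (sentence_a, sentence_b, is_next) triples from a list of documents.

    Each document is a list of sentences. Roughly half of the examples use the
    true next sentence; the others use a sentence from a different document.
    """
    rng = random.Random(seed)
    examples = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                examples.append((doc[i], doc[i + 1], 1))  # true continuation
            else:
                other_idx = rng.choice(
                    [j for j in range(len(documents)) if j != doc_idx]
                )
                examples.append((doc[i], rng.choice(documents[other_idx]), 0))
    return examples

documents = [
    ["A storm hit the coast.", "Thousands lost power.", "Crews worked overnight."],
    ["The team won the final.", "Fans celebrated downtown."],
]
for example in make_nsp_examples(documents):
    print(example)
```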

<span style="color:blue">In general, you don’t need to worry about whether or not there are `token_type_ids` in your tokenized inputs: <b>as long as you use the same checkpoint for the tokenizer and the model, everything will be fine as the tokenizer knows what to provide to its model</b>.</span>

Now that we have seen how our tokenizer can deal with one pair of sentences, we can use it to tokenize our whole dataset: like in the previous chapter, we can feed the tokenizer a list of pairs of sentences by giving it the list of first sentences, then the list of second sentences. This is also compatible with the padding and truncation options we saw in Chapter 2. So, one way to preprocess the training dataset is:

In [101]:
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)
type(tokenized_dataset) 

transformers.tokenization_utils_base.BatchEncoding

In [102]:
tokenized_dataset.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [105]:
type(tokenized_dataset['input_ids']), type(tokenized_dataset['input_ids'][0])

(list, list)

This works well, but <span style="color:red">it has the disadvantage of returning a dictionary (with the keys `input_ids`, `attention_mask`, and `token_type_ids`, and values that are lists of lists). It will also only work if you have enough RAM to store your whole dataset during tokenization</span> (<span style="color:green">whereas the datasets from the 🤗 Datasets library are [Apache Arrow](https://arrow.apache.org/) files stored on disk, so you only keep the samples you ask for loaded in memory ➡️ <b>even if our dataset is huge, we won't run out of RAM</b></span>).

<span style="color:green">To keep the data as a dataset, we will use the [`Dataset.map()`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method. This also allows us some extra flexibility, if we need more preprocessing done than just tokenization</span>. The `map()` method works by applying a function on each element of the dataset, so let’s define a function that tokenizes our inputs:

In [106]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

<span style="color:green">This function takes a dictionary (like the items of our dataset) and returns a new dictionary with the keys `input_ids`, `attention_mask`, and `token_type_ids`</span>. Note that it also works if the example dictionary contains several samples (each key as a list of sentences) since the `tokenizer` works on lists of pairs of sentences, as seen before. This will allow us to use the option `batched=True` in our call to `map()`, which will greatly speed up the tokenization. The tokenizer is backed by a tokenizer written in Rust from the 🤗 Tokenizers library. This tokenizer can be very fast, but only if we give it lots of inputs at once.

<span style="color:blue">Note that we’ve left the `padding` argument out in our tokenization function for now</span>. <span style="color:red">This is because padding all the samples to the maximum length is not efficient</span>: <span style="color:green">it’s better to pad the samples when we’re building a batch, as then we only need to pad to the maximum length in that batch, and not the maximum length in the entire dataset. This can save a lot of time and processing power when the inputs have very variable lengths!</span>

Here is how we apply the tokenization function on all our datasets at once. <span style="color:green">We’re using `batched=True` in our call to map so <b>the function is applied to multiple elements of our dataset at once (i.e., process examples in batches)</b>, and not on each element separately. This allows for faster preprocessing <b>by taking advantage of vectorization and parallelization</b>.</span>

The way the 🤗 Datasets library applies this processing is by adding new fields to the datasets, one for each key in the dictionary returned by the preprocessing function:

In [113]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [134]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, num_proc=None)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

In [115]:
type(raw_datasets), type(tokenized_datasets)

(datasets.dataset_dict.DatasetDict, datasets.dataset_dict.DatasetDict)

<span style="color:green">You can even use multiprocessing when applying your preprocessing function with `map()` by passing along a `num_proc` argument. We didn’t do this here because the 🤗 Tokenizers library already uses multiple threads to tokenize our samples faster, but if you are not using a fast tokenizer backed by this library, this could speed up your preprocessing.</span>

Our `tokenize_function` returns a dictionary with the keys `input_ids`, `attention_mask`, and `token_type_ids`, so those three fields are added to all splits of our dataset. Note that we could also have changed existing fields if our preprocessing function returned a new value for an existing key in the dataset to which we applied `map()`.
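
To see why returning a dictionary of lists is enough, here is a simplified plain-Python stand-in for what a batched `map()` does conceptually: slice the rows into chunks, hand each chunk to the function as a dict of columns, and merge the returned columns back into the rows. `map_batched` and `toy_tokenize` are hypothetical helpers. The real method works on Arrow tables and caches its results, which this sketch ignores.

```python
def map_batched(rows, function, batch_size=1000):
    """Apply `function` to chunks of rows and merge the returned columns back in."""
    output = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # With batching, the function receives a dict of lists, one list per column.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = function(batch)
        for i, row in enumerate(chunk):
            merged = dict(row)
            for key, values in new_columns.items():
                merged[key] = values[i]  # adds a new field or overwrites an existing one
            output.append(merged)
    return output

def toy_tokenize(batch):
    # Stand-in for the tokenizer: whitespace-split each sentence pair.
    return {
        "tokens": [
            (s1 + " " + s2).split()
            for s1, s2 in zip(batch["sentence1"], batch["sentence2"])
        ]
    }

rows = [
    {"sentence1": "a b", "sentence2": "c", "label": 1},
    {"sentence1": "d", "sentence2": "e f g", "label": 0},
    {"sentence1": "h", "sentence2": "i", "label": 1},
]
mapped = map_batched(rows, toy_tokenize, batch_size=2)
print(mapped[1])
```

Each output row keeps its original fields and gains one field per key returned by the function, which is the behavior visible in the `features` lists above.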

The last thing we will need to do is pad all the examples to the length of the longest element when we batch elements together — a technique we refer to as <i>dynamic padding</i>.

## Dynamic padding

<img src="images/dynamic-padding-1.png" style="width:650px;" title="Padding is needed">

**Approach 1: Static padding**

<img src="images/dynamic-padding-2.png" style="width:750px;" title="Padding - Approach 1">

<img src="images/dynamic-padding-4.png" style="width:750px;" title="Padding - Approach 1 example">

**Approach 2: Dynamic padding**

<img src="images/dynamic-padding-3.png" style="width:750px;" title="Padding - Approach 2: Dynamic padding">

<img src="images/dynamic-padding-5.png" style="width:750px;" title="Padding - Approach 2 example">

<img src="images/dynamic-padding-6.png" style="width:750px;" title="Padding - Approach 2 example cont">

The function that is responsible for putting together samples inside a batch is called a <i>collate function</i>. It’s an argument you can pass when you build a `DataLoader`, the default being a function that will just convert your samples to PyTorch tensors and concatenate them (recursively if your elements are lists, tuples, or dictionaries). <span style="color:blue">This won’t be possible in our case since the inputs we have won’t all be of the same size. We have deliberately postponed the padding, to only apply it as necessary on each batch and avoid having over-long inputs with a lot of padding. This will speed up training by quite a bit</span>, but note that if you’re training on a TPU it can cause problems — TPUs prefer fixed shapes, even when that requires extra padding.
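
As a rough illustration of the idea, a collate function with dynamic padding could look like the plain-Python sketch below. It is a simplified, hypothetical helper: it pads on the right, assumes a padding token ID of `0` (the ID of `[PAD]` in `bert-base-uncased`), and returns lists rather than tensors.

```python
def pad_collate(samples, pad_token_id=0):
    """Pad `input_ids` to the longest sequence in this batch and build the mask."""
    max_len = max(len(s["input_ids"]) for s in samples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for s in samples:
        n_real = len(s["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(s["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)  # 0 marks padding
        batch["labels"].append(s["label"])
    return batch

samples = [
    {"input_ids": [101, 2023, 2003, 102], "label": 1},
    {"input_ids": [101, 2023, 102], "label": 0},
    {"input_ids": [101, 2023, 2003, 1996, 2034, 102], "label": 1},
]
batch = pad_collate(samples)
print([len(ids) for ids in batch["input_ids"]])
print(batch["attention_mask"][1])
```

Because the target length is recomputed for every batch, a batch of short samples stays short regardless of the longest sample elsewhere in the dataset.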

To do this in practice, we have to define a <span style="color:green">collate function that will apply the correct amount of padding to the items of the dataset we want to batch together. Fortunately, the 🤗 Transformers library provides us with such a function via `DataCollatorWithPadding`</span>. It takes a tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding to be on the left or on the right of the inputs) and will do everything you need:

In [136]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

To test this new toy, let’s grab a few samples from our training set that we would like to batch together. Here, we remove the columns `idx`, `sentence1`, and `sentence2` as they won’t be needed and contain strings (and we can’t create tensors with strings) and have a look at the <b>lengths of each entry (i.e. `input_ids`) in the batch</b>:

In [137]:
samples = tokenized_datasets["train"][:8] # Batch size of 8
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

[50, 59, 47, 67, 59, 50, 62, 32]

<span style="color:blue">No surprise, we get samples of varying length, from 32 to 67. <b>Dynamic padding</b> means the samples in this batch should all be padded to a length of 67, the maximum length inside the batch</span>. <span style="color:red"><b>Without dynamic padding</b>, all of the samples would have to be padded to the maximum length in the whole dataset, or the maximum length the model can accept</span>. Let’s double-check that our `data_collator` is dynamically padding the batch properly:

In [141]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}
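
A quick back-of-the-envelope check shows how much padding this saves for these eight samples. The figure of 512 is the maximum input length of `bert-base-uncased`; the maximum length over the whole dataset would fall somewhere in between and is not computed here.

```python
lengths = [50, 59, 47, 67, 59, 50, 62, 32]  # input_ids lengths of the 8 samples above
model_max_length = 512                      # maximum sequence length of bert-base-uncased

real_tokens = sum(lengths)
dynamic_total = len(lengths) * max(lengths)     # pad to the longest sample in the batch
static_total = len(lengths) * model_max_length  # pad to the model maximum

print(f"real tokens:     {real_tokens}")
print(f"dynamic padding: {dynamic_total} positions, {dynamic_total - real_tokens} are padding")
print(f"static padding:  {static_total} positions, {static_total - real_tokens} are padding")
```

With dynamic padding the batch holds 536 positions for 426 real tokens, whereas padding to the model maximum would need 4,096 positions for the same 426 tokens.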

In [153]:
df = pd.DataFrame(batch['attention_mask'])
df.style.applymap(lambda x: 'background-color : yellow' if x==0 else '')

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66
0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0
2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0
5,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0
7,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Looking good! Now that we’ve gone from raw text to batches our model can deal with, we’re ready to fine-tune it!

## Try it out!

> ✏️ Try it out! Replicate the preprocessing on the GLUE SST-2 dataset. It’s a little bit different since it’s composed of single sentences instead of pairs, but the rest of what we did should look the same. For a harder challenge, try to write a preprocessing function that works on any of the GLUE tasks.
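
For the harder challenge, one possible approach is a lookup table from task name to text column(s). The sketch below is a starting point under assumptions: the column names are the ones we understand the GLUE configurations on the Hub to use, so check them against `raw_datasets["train"].features` before relying on them; the diagnostic set `ax` is left out; and `fake_tokenizer` is a hypothetical stand-in so the snippet runs without downloading a checkpoint.

```python
# Text column(s) for each GLUE task; None means the task has a single sentence.
TASK_TO_KEYS = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

def make_tokenize_function(task, tokenizer):
    """Return a function suitable for `map(..., batched=True)` on the given task."""
    key1, key2 = TASK_TO_KEYS[task]

    def tokenize_function(example):
        if key2 is None:
            return tokenizer(example[key1], truncation=True)
        return tokenizer(example[key1], example[key2], truncation=True)

    return tokenize_function

def fake_tokenizer(first, second=None, truncation=False):
    # Stand-in that only reports how many text arguments it received.
    return {"n_text_args": 1 if second is None else 2}

single = make_tokenize_function("sst2", fake_tokenizer)({"sentence": ["good movie"]})
pair = make_tokenize_function("mrpc", fake_tokenizer)(
    {"sentence1": ["a"], "sentence2": ["b"]}
)
print(single, pair)
```

In the notebook you would pass the real `tokenizer` instead of `fake_tokenizer` and hand the returned function to `raw_datasets.map(..., batched=True)`.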

**Step 1: Loading the dataset from Hub**

In [190]:
from datasets import load_dataset

# load the SST-2 dataset from the GLUE benchmark
raw_datasets = load_dataset("glue", "sst2")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

In [191]:
raw_train_dataset = raw_datasets["train"]
print (len(raw_train_dataset))
raw_train_dataset[0]

67349


{'sentence': 'hide new secretions from the parental units ',
 'label': 0,
 'idx': 0}

In [192]:
# What do the label numbers mean? {0: negative, 1: positive}
raw_train_dataset.features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'positive'], id=None),
 'idx': Value(dtype='int32', id=None)}

**Step 2: Preprocessing a dataset**

In [193]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [194]:
inputs = tokenizer("This is the sample first sentence.")
for key, value in inputs.items():
    print (f"{key}: {value}")

input_ids: [101, 2023, 2003, 1996, 7099, 2034, 6251, 1012, 102]
token_type_ids: [0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1]


In [195]:
# Approach 1a
tokenized_sentences = tokenizer(raw_datasets["train"]["sentence"])
print (type(tokenized_sentences) )
print (tokenized_sentences.keys())

<class 'transformers.tokenization_utils_base.BatchEncoding'>
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])


In [196]:
# Approach 1b: Same as above approach, but with padding and truncation
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence"],
    padding=True,
    truncation=True,
)
print (type(tokenized_dataset) )
print (tokenized_dataset.keys())

<class 'transformers.tokenization_utils_base.BatchEncoding'>
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])


In [197]:
# Approach 2: Slightly different approach: With map function <<< Finalised approach >>>
def tokenize_function(example):
    return tokenizer(example["sentence"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, num_proc=None)
print (type(tokenized_datasets) )
print (tokenized_datasets.keys())

<class 'datasets.dataset_dict.DatasetDict'>
dict_keys(['train', 'validation', 'test'])


In [198]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1821
    })
})

In [202]:
# Data preprocessing
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, num_proc=None)
tokenized_datasets_clean = tokenized_datasets.remove_columns(['idx', 'sentence'])
tokenized_datasets_clean = tokenized_datasets_clean.rename_column("label", "labels")
tokenized_datasets_clean = tokenized_datasets_clean.with_format("torch")
tokenized_datasets_clean

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 872
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1821
    })
})

In [203]:
tokenized_datasets_clean['train']

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 67349
})

In [205]:
# generate a short sample of dataset
small_train_dataset_clean = tokenized_datasets_clean["train"].select(range(100))
small_train_dataset_clean

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 100
})

**Step 3: Dynamic padding**

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [187]:
raw_train_dataset['sentence'][:8]

['hide new secretions from the parental units ',
 'contains no wit , only labored gags ',
 'that loves its characters and communicates something rather beautiful about human nature ',
 'remains utterly satisfied to remain the same throughout ',
 'on the worst revenge-of-the-nerds clichés the filmmakers could dredge up ',
 "that 's far too tragic to merit such superficial treatment ",
 'demonstrates that the director of such hollywood blockbusters as patriot games can still turn out a small , personal film with an emotional wallop . ',
 'of saucy ']

In [180]:
samples = tokenized_datasets["train"][:8] # Batch size of 8
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence"]}
[len(x) for x in samples["input_ids"]]

[10, 11, 15, 10, 22, 13, 29, 6]

In [188]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([8, 29]),
 'token_type_ids': torch.Size([8, 29]),
 'attention_mask': torch.Size([8, 29]),
 'labels': torch.Size([8])}

In [189]:
df = pd.DataFrame(batch['attention_mask'])
df.style.applymap(lambda x: 'background-color : yellow' if x==0 else '')

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28
0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
5,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
7,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

