# Processing the data (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
!pip install datasets evaluate transformers[sentencepiece]



## Loading a dataset from the Hub

The datasets can be found on the [Hugging Face Hub](https://huggingface.co/datasets).

We focus on the **MRPC** dataset, one of the 10 datasets in the [GLUE benchmark](https://gluebenchmark.com/), an academic benchmark used to measure the performance of ML models across 10 different text classification tasks.

The first cell below recalls how to train a sequence classifier on a single batch. After that, we download the dataset with the 🤗 Datasets library.

In [2]:
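import torch
# AdamW is taken from PyTorch; the copy that shipped with transformers is deprecated
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification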

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# This is new
batch["labels"] = torch.tensor([1, 1])

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

We get a `DatasetDict` object with three splits:
- Training set
- Validation set
- Test set

Each of those contains the following columns:
- sentence1
- sentence2
- label
- idx

The number of rows differs per split: 3,668 pairs in the training set, 408 in the validation set, and 1,725 in the test set.

The dataset is downloaded and cached, by default in `~/.cache/huggingface/datasets`. Set the `HF_HOME` environment variable if you want to change the location.

We can access each pair of sentences in our `raw_datasets` object by indexing:

In [4]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

To know which integer corresponds to which label, we can inspect the **features** of our `raw_train_dataset` by accessing `raw_train_dataset.features`:

In [5]:
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}
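
The `names` list of the `ClassLabel` feature is ordered, so an integer label is simply an index into that list. A minimal plain-Python sketch of the lookup, with the names copied from the output above (the helper `label_name` is hypothetical; the `ClassLabel` object itself also offers `int2str()` and `str2int()` for the same purpose):

```python
# Label names copied from the ClassLabel feature printed above.
label_names = ["not_equivalent", "equivalent"]

def label_name(label_id):
    """Return the human-readable name for an integer label."""
    return label_names[label_id]

print(label_name(1))  # equivalent
print(label_name(0))  # not_equivalent
```

So the first training example, whose label is 1, is a pair of equivalent sentences.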

### Exercise

#### Look at element 15 of the training set and element 87 of the validation set. What are their labels?

In [6]:
from datasets import load_dataset

raw_dataset = load_dataset("glue","mrpc")
raw_training_dataset = raw_dataset["train"]
print(f"Raw Training Dataset element 15: {raw_training_dataset[14]}")
print(f"Raw Training Dataset element 15 labels: {raw_training_dataset.features}")
raw_validation_dataset = raw_dataset["validation"]
print(f"Raw Validation Dataset element 87: {raw_validation_dataset[86]}")
print(f"Raw Validation Dataset element 87 labels: {raw_validation_dataset.features}")

Raw Training Dataset element 15: {'sentence1': 'Gyorgy Heizler , head of the local disaster unit , said the coach was carrying 38 passengers .', 'sentence2': 'The head of the local disaster unit , Gyorgy Heizler , said the coach driver had failed to heed red stop lights .', 'label': 0, 'idx': 15}
Raw Training Dataset element 15 labels: {'sentence1': Value(dtype='string', id=None), 'sentence2': Value(dtype='string', id=None), 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None), 'idx': Value(dtype='int32', id=None)}
Raw Validation Dataset element 87: {'sentence1': 'He was arrested Friday night at an Alpharetta seafood restaurant while dining with his wife , singer Whitney Houston .', 'sentence2': 'He was arrested again Friday night at an Alpharetta restaurant where he was having dinner with his wife .', 'label': 1, 'idx': 796}
Raw Validation Dataset element 87 labels: {'sentence1': Value(dtype='string', id=None), 'sentence2': Value(dtype='string', id=None), 'label': Clas

## Preprocessing the Dataset

To preprocess the dataset:
- Use a tokenizer to convert the text into numbers the model can make sense of

In [7]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

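Tokenizing each column separately does not give the model the two sentences as a pair. We need to pass the sentences as **pairs**; the tokenizer accepts two sequences and prepares them the way the BERT model expects: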

In [8]:
inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [9]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '.',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 'one',
 '.',
 '[SEP]']

If you load another **checkpoint**, you will not necessarily see `token_type_ids` in the tokenized inputs.

They are only returned when the model knows what to do with them, because it has seen them during its pretraining.

`BERT` is pretrained with token type IDs: on top of masked language modeling, it has an objective called *next sentence prediction*, whose goal is to model the relationship between pairs of sentences.
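
To make the layout of a pair concrete, here is a plain-Python sketch of how a BERT-style input is assembled from two already-tokenized sentences. The helper `build_pair` is hypothetical and only mirrors the pattern visible in the output above; the real tokenizer does this internally.

```python
def build_pair(tokens_a, tokens_b):
    """Assemble a BERT-style pair: [CLS] A [SEP] B [SEP] plus token type IDs."""
    first = ["[CLS]"] + tokens_a + ["[SEP]"]
    second = tokens_b + ["[SEP]"]
    tokens = first + second
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * len(first) + [1] * len(second)
    return tokens, token_type_ids

tokens, token_type_ids = build_pair(
    ["this", "is", "the", "first", "sentence", "."],
    ["this", "is", "the", "second", "one", "."],
)
print(tokens)
print(token_type_ids)
```

The result has eight 0s followed by seven 1s, the same pattern as the `token_type_ids` returned by the tokenizer for this pair.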

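To preprocess a whole split, we can pass both sentence columns to the tokenizer along with padding and truncation options.

This works well, but it has a disadvantage:
- It returns a dictionary (with the keys `input_ids`, `token_type_ids`, and `attention_mask`, whose values are lists of lists).

This means that it will **only work** if you have enough RAM to store **the whole dataset during tokenization**.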

In [10]:
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

To keep the data as a **dataset**, we will use the `.map()` function. 

It works by applying a function on each element of the dataset. 

In [11]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

The previous function takes a dictionary (like the items of our dataset) and returns the dictionary produced by calling the tokenizer.

The `batched=True` option of `map()` will greatly speed up the tokenization, because the function then receives many examples at once instead of one at a time.
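
As a rough mental model (not the library's actual implementation), a batched `map()` hands the function a dictionary of column lists for a slice of the dataset and merges the returned columns back in. The helper `batched_map` and the toy `fake_tokenize` below are hypothetical names used only for illustration.

```python
def batched_map(columns, function, batch_size=1000):
    """Apply `function` to slices of a column dictionary and collect the new columns."""
    num_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, num_rows, batch_size):
        # Each batch is a dict mapping column name -> list of values.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Original columns are kept and the new ones are added alongside them.
    return {**columns, **new_columns}

def fake_tokenize(batch):
    # Toy stand-in for a tokenizer: whitespace-split and count the tokens of each pair.
    return {"num_tokens": [len(a.split()) + len(b.split())
                           for a, b in zip(batch["sentence1"], batch["sentence2"])]}

data = {"sentence1": ["a b c", "d e"], "sentence2": ["f", "g h i"]}
print(batched_map(data, fake_tokenize, batch_size=1))
```

The result is the same whatever the batch size; only the number of function calls changes, which is where the speed-up comes from when the function is efficient on lists.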

**Important**: we are leaving out the `padding` argument for now:
- Padding all the samples to the maximum length is not efficient.

It is much better to:
- Pad the samples when we are building a batch, so that we only pad to the maximum length within that batch.

In [12]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

You can use multiprocessing in `map()` by passing a `num_proc` argument.

### Exercise

Take element 15 of the training set and tokenize the two sentences separately and as a pair. What’s the difference between the two results?

In [13]:
from datasets import load_dataset
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_dataset = load_dataset("glue","mrpc")
print(raw_dataset)
print("\n")
raw_training_dataset = raw_dataset["train"]
print(f"Training Dataset: {raw_training_dataset[14]}\n")
tokenized_sentence_1 = tokenizer(raw_training_dataset[14]["sentence1"])
print(f"Tokenized sentence 1: {tokenized_sentence_1}\n")
tokenized_sentence_2 = tokenizer(raw_training_dataset[14]["sentence2"])
print(f"Tokenized sentence 2: {tokenized_sentence_2}\n")
tokenized_sentences_pair = tokenizer(raw_training_dataset[14]["sentence1"], raw_training_dataset[14]["sentence2"])
print(f"Tokenized sentence 1 & 2: {tokenized_sentences_pair}\n")

tokenizer.convert_ids_to_tokens(tokenized_sentences_pair["input_ids"])

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})


Training Dataset: {'sentence1': 'Gyorgy Heizler , head of the local disaster unit , said the coach was carrying 38 passengers .', 'sentence2': 'The head of the local disaster unit , Gyorgy Heizler , said the coach driver had failed to heed red stop lights .', 'label': 0, 'idx': 15}

Tokenized sentence 1: {'input_ids': [101, 1043, 7677, 22637, 2002, 10993, 3917, 1010, 2132, 1997, 1996, 2334, 7071, 3131, 1010, 2056, 1996, 2873, 2001, 4755, 4229, 5467, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

['[CLS]',
 'g',
 '##yo',
 '##rgy',
 'he',
 '##iz',
 '##ler',
 ',',
 'head',
 'of',
 'the',
 'local',
 'disaster',
 'unit',
 ',',
 'said',
 'the',
 'coach',
 'was',
 'carrying',
 '38',
 'passengers',
 '.',
 '[SEP]',
 'the',
 'head',
 'of',
 'the',
 'local',
 'disaster',
 'unit',
 ',',
 'g',
 '##yo',
 '##rgy',
 'he',
 '##iz',
 '##ler',
 ',',
 'said',
 'the',
 'coach',
 'driver',
 'had',
 'failed',
 'to',
 'hee',
 '##d',
 'red',
 'stop',
 'lights',
 '.',
 '[SEP]']

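The difference is in the:
- `token_type_ids`

Tokenized separately, each sentence gets its own `[CLS]` and `[SEP]` tokens and all of its `token_type_ids` are 0. Tokenized as a pair, the two sentences share a single sequence:

From `[CLS]` to the first `[SEP]` (inclusive), `token_type_ids = 0`

From there until the last `[SEP]`, `token_type_ids = 1`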

## Dynamic Padding

The function responsible for putting together samples inside a batch is the:
- *collate function*

You can pass it as an **argument** when you build a `DataLoader`.

If you use the default:
- It will convert your samples to PyTorch tensors and concatenate them. (Not possible in our case, since the inputs won't all have the same size!)

We postpone the **padding** as long as we can, so that it is applied as necessary on **each batch**. This is what we call dynamic padding.

This will speed up training on:
- CPUs
- GPUs

Avoid doing it on:
- TPUs (TPUs prefer fixed shapes, even when that requires extra padding)

### How to do it in practice

We have to define a collate function that will apply the correct amount of padding to the items of the dataset we want to batch together.

The 🤗 Transformers library provides us with such a function via `DataCollatorWithPadding`.

It takes a `tokenizer` when you instantiate it. This allows the collate function to: 
- Know which padding token to use
- Know if the model needs left or right padding
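
The two points above can be illustrated with a plain-Python sketch of the padding step. The function `pad_batch` and its parameters are hypothetical stand-ins, and the padding ID of 0 is an assumption for illustration; `DataCollatorWithPadding` gets the equivalent settings from the tokenizer and also returns tensors rather than lists.

```python
def pad_batch(features, pad_token_id=0, padding_side="right"):
    """Pad every sample in `features` to the longest `input_ids` in this batch."""
    max_length = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        missing = max_length - len(f["input_ids"])
        ids_pad = [pad_token_id] * missing
        mask_pad = [0] * missing  # padded positions must be ignored by attention
        mask = [1] * len(f["input_ids"])
        if padding_side == "right":
            batch["input_ids"].append(f["input_ids"] + ids_pad)
            batch["attention_mask"].append(mask + mask_pad)
        else:
            batch["input_ids"].append(ids_pad + f["input_ids"])
            batch["attention_mask"].append(mask_pad + mask)
    return batch

features = [{"input_ids": [101, 2023, 102]}, {"input_ids": [101, 2023, 2003, 1996, 102]}]
print(pad_batch(features))
```

Only the shorter sample is extended, and only up to the length of the longest sample in this particular batch.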


In [14]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

We test by taking a few samples.

With this code we check the lengths of each entry in the batch:

In [15]:
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

[50, 59, 47, 67, 59, 50, 62, 32]

With dynamic padding all the samples within the batch will be padded to length 67, the maximum in the batch.

Without dynamic padding, all the samples would have to be padded to the maximum length in the whole dataset.
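
The saving can be quantified with the lengths printed above. The sketch below counts padding tokens for this batch; the dataset-wide maximum length is not shown in this notebook, so `assumed_global_max` is a made-up value used for illustration only.

```python
lengths = [50, 59, 47, 67, 59, 50, 62, 32]  # lengths printed by the cell above

def padded_tokens(lengths, target_length):
    """Total number of token slots after padding every sample to target_length."""
    return len(lengths) * target_length

real_tokens = sum(lengths)
dynamic = padded_tokens(lengths, max(lengths))
print(real_tokens)            # 426
print(dynamic)                # 536
print(dynamic - real_tokens)  # 110 padding tokens with dynamic padding

# Hypothetical dataset-wide maximum, for illustration only.
assumed_global_max = 128
print(padded_tokens(lengths, assumed_global_max) - real_tokens)  # 598 padding tokens
```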

In [16]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}

### Exercise

Replicate the preprocessing on the GLUE SST-2 dataset. 
It’s a little bit different since it’s composed of single sentences instead of pairs, but the rest of what we did should look the same. 

For a harder challenge, try to write a preprocessing function that works on any of the GLUE tasks.

A **GLUE task** is one of the tasks, each with its own dataset, that make up the **GLUE (General Language Understanding Evaluation) benchmark**.

We have the following tasks in the GLUE benchmark:

- **SST-2 (Stanford Sentiment Treebank)**: Binary sentiment classification task where the goal is to predict whether a sentence expresses a positive or negative sentiment.
- **CoLA (Corpus of Linguistic Acceptability)**: A binary classification task to determine if a sentence is grammatically correct or not.
- **MRPC (Microsoft Research Paraphrase Corpus)**: Sentence pair classification task for determining if two sentences are paraphrases of each other.
- **QQP (Quora Question Pairs)**: Another paraphrase identification task that involves determining if two questions are semantically equivalent.
- **STS-B (Semantic Textual Similarity Benchmark)**: A regression task focusing on predicting the similarity score between sentence pairs.
- **MNLI (Multi-Genre Natural Language Inference)**: In this task, models are evaluated on their ability to perform three-way classification for sentence pairs: entailment, contradiction, or neutral.
- **QNLI (Question-answering Natural Language Inference)**: Adapted from the Stanford Question Answering Dataset (SQuAD), where the goal is to determine if a sentence contains the answer to a given question.
- **RTE (Recognizing Textual Entailment)**: Determining whether one sentence entails another or not.
- **WNLI (Winograd Natural Language Inference)**: Derived from the Winograd Schema Challenge, a coreference task about resolving pronouns, recast as sentence-pair classification.
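
One possible approach to the harder challenge is a lookup table from task name to its text columns. The column names below are assumptions about the GLUE configurations on the Hub and should be checked against `raw_dataset.column_names`; `make_tokenize_function` and the toy tokenizer are hypothetical helpers, and the toy tokenizer exists only to show the call pattern without downloading a model.

```python
# Text columns per GLUE task; None means the task has single sentences.
# Assumed column names: verify them with `raw_dataset.column_names`.
task_to_keys = {
    "ax": ("premise", "hypothesis"),
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

def make_tokenize_function(task, tokenizer):
    """Build a function suitable for a batched `map()` call on the given task."""
    first_key, second_key = task_to_keys[task]

    def tokenize_function(examples):
        if second_key is None:
            return tokenizer(examples[first_key], truncation=True)
        return tokenizer(examples[first_key], examples[second_key], truncation=True)

    return tokenize_function

# Toy tokenizer used only to demonstrate the call pattern.
def toy_tokenizer(first, second=None, truncation=True):
    second = second or [""] * len(first)
    return {"input_ids": [(a + " " + b).split() for a, b in zip(first, second)]}

print(make_tokenize_function("sst2", toy_tokenizer)({"sentence": ["a fine film"]}))
```

With a real tokenizer, the returned function could be passed to `map()` with `batched=True`, exactly like the `tokenize_function` used for MRPC earlier.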

In [17]:
def filter_dataset(dataset, keys_to_keep): 
    return {key:value for key, value in dataset.items() if key in keys_to_keep}

In [18]:
def list_splitter(items, size):
    # Yield consecutive chunks of `size` elements (`items` avoids shadowing the built-in `list`)
    for i in range(0, len(items), size):
        yield items[i: i + size]

In [19]:
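def tokenize_function(row, column_names):
    # Keep only the text columns (one for single sentences, two for pairs); assumes batched=True
    text_columns = [name for name in column_names if isinstance(row[name][0], str)][:2]
    return tokenizer(*(row[name] for name in text_columns), truncation=True)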

In [22]:
from transformers import AutoTokenizer, DataCollatorWithPadding
from datasets import load_dataset

glue_datasets = ['ax', 'cola', 'mnli', 'mnli_matched', 'mnli_mismatched', 'mrpc', 'qnli', 'qqp', 'rte', 'sst2', 'stsb', 'wnli']
glue_dataset = input('Enter GLUE dataset\n')

if glue_dataset in glue_datasets:
    # Step 1: Load dataset
    raw_dataset = load_dataset('glue', glue_dataset)
    # Use the first available split ('test' for ax); not every task has a split named 'test'
    split = next(iter(raw_dataset))
    # Step 2: Tokenize dataset
    checkpoint = 'bert-base-uncased'
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    tokenized_dataset = raw_dataset.map(
        lambda row: tokenize_function(row, raw_dataset.column_names[split]),
        batched=True)
    print(tokenized_dataset)
    print('\n')
    # Step 3: Chunking the dataset, filtering content and collating
    batch_size = 8
    keys_to_keep = ['input_ids', 'token_type_ids', 'attention_mask', 'label']
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    for chunk in list_splitter(tokenized_dataset[split], batch_size):
        # filtering: drop the raw text and idx columns, which cannot be turned into tensors
        filtered_chunk = filter_dataset(chunk, keys_to_keep)
        # dynamic padding: pad only to the longest sample in this chunk
        batch = data_collator(filtered_chunk)
        print("Dynamic Batch: \n")
        print({k: v.shape for k, v in batch.items()})
        print('\n')
else:
    # An unknown name never raises ValueError here, so it is handled explicitly
    print('Not a valid dataset\n')
    print(f'Available: {glue_datasets}')
    


Enter GLUE dataset
 ax


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


DatasetDict({
    test: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1104
    })
})


Dynamic Batch: 

{'input_ids': torch.Size([8, 78]), 'token_type_ids': torch.Size([8, 78]), 'attention_mask': torch.Size([8, 78]), 'labels': torch.Size([8])}


Dynamic Batch: 

{'input_ids': torch.Size([8, 86]), 'token_type_ids': torch.Size([8, 86]), 'attention_mask': torch.Size([8, 86]), 'labels': torch.Size([8])}


Dynamic Batch: 

{'input_ids': torch.Size([8, 45]), 'token_type_ids': torch.Size([8, 45]), 'attention_mask': torch.Size([8, 45]), 'labels': torch.Size([8])}


Dynamic Batch: 

{'input_ids': torch.Size([8, 69]), 'token_type_ids': torch.Size([8, 69]), 'attention_mask': torch.Size([8, 69]), 'labels': torch.Size([8])}


Dynamic Batch: 

{'input_ids': torch.Size([8, 70]), 'token_type_ids': torch.Size([8, 70]), 'attention_mask': torch.Size([8, 70]), 'labels': torch.Size([8])}


Dynamic Batch: 

{'input_i