# Processing Data

## Loading a dataset from the Hub

We will use the MRPC (Microsoft Research Paraphrase Corpus) dataset as an example
- Contains 5801 pairs of sentences, each with a label indicating whether the two sentences are paraphrases or not
- Is one of the 10 datasets that make up the GLUE (General Language Understanding Evaluation) benchmark, an academic benchmark used to measure the performance of ML models across 10 different text classification tasks

Datasets from the hub: https://huggingface.co/datasets

More on loading datasets: https://huggingface.co/docs/datasets/loading

In [1]:
# The Hugging Face Datasets library provides an API to easily download and cache a dataset from the Hub
from datasets import load_dataset
# Download the MRPC dataset
raw_datasets = load_dataset("glue", "mrpc")  # "glue" is the dataset identifier, "mrpc" the configuration name
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

We get a ```DatasetDict``` object: a dictionary containing each split of our dataset (the training set, the validation set and the test set)

In [2]:
"""
We can access each split by indexing with its name; each split is an instance of the Dataset class

Each split contains several columns and a variable number of rows (num_rows)

We can access each sentence pair (element) by its index
"""

raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

Hugging Face Datasets library: the data is stored on disk as Apache Arrow files, so even if the dataset is huge you won't run out of RAM. Only the elements you request are loaded into memory

In [5]:
""" 
You can access a slice of your dataset
"""

raw_train_dataset[:5]

{'sentence1': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
  "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .",
  'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .',
  'Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .',
  'The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange .'],
 'sentence2': ['Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
  "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .",
  "On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .",
  'Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at 

In [6]:
""" 
The features attribute of a Dataset gives us more information about its columns (name and type of each one)
"""

raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}
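
The `label` column stores integers, and the `names` list in the `ClassLabel` feature above gives their meaning: position 0 is `not_equivalent`, position 1 is `equivalent`. The snippet below is a plain-Python stand-in for that lookup, not the library code. The helper names `int2str` and `str2int` are modelled on the `ClassLabel` methods of the same name, and `label_names` is copied from the output above.

```python
# Plain-Python stand-in for the ClassLabel mapping shown in the features output.
label_names = ["not_equivalent", "equivalent"]  # copied from the output above


def int2str(label_id: int) -> str:
    """Return the label name stored at position label_id."""
    return label_names[label_id]


def str2int(name: str) -> int:
    """Return the integer id of a label name."""
    return label_names.index(name)


first_label = 1  # the 'label' value of raw_train_dataset[0] shown earlier
print(int2str(first_label))
print(str2int("not_equivalent"))
```

So the first training pair, with label 1, is a pair of paraphrases.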

## Preprocessing a dataset

### Convert the text to numbers that the model can make sense of, i.e. tokenize

In [7]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Tokenize all the first sentences of the paraphrase pair 
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
# Tokenize all the second sentences of the paraphrase pair 
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

The code above tokenizes the first and second sentences separately. The model needs each pair handled as a single input with the appropriate preprocessing, so we need to change that code

First, test how the tokenizer deals with one pair of sentences

Take a look at ```token_type_ids```: in this example, it tells the model which part of the input is the first sentence (0) and which is the second sentence (1)

```
['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']
[      0,      0,    0,     0,       0,          0,   0,       0,      1,    1,     1,        1,     1,   1,       1]
```

In [8]:
inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [9]:
"""
Convert the ids in input_ids back to tokens to see what they represent
"""
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '.',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 'one',
 '.',
 '[SEP]']
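
To make the layout of a sentence pair explicit, here is a small plain-Python sketch that rebuilds the token sequence and the `token_type_ids` shown above. It is only an illustration of the BERT-style pair format (`[CLS] A [SEP] B [SEP]`), not the tokenizer's actual implementation, and `build_pair` is a hypothetical helper name.

```python
def build_pair(tokens_a, tokens_b):
    """Assemble a BERT-style pair: [CLS] A [SEP] B [SEP]."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], the first sentence and the first [SEP];
    # segment 1 covers the second sentence and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Nothing is padded here, so every position is attended to.
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask


first = ["this", "is", "the", "first", "sentence", "."]
second = ["this", "is", "the", "second", "one", "."]
tokens, token_type_ids, attention_mask = build_pair(first, second)

for token, type_id in zip(tokens, token_type_ids):
    print(f"{token:>10} {type_id}")
```

The first `[SEP]` belongs to segment 0 and the last one to segment 1, which matches the tokenizer output above.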

Note: In general, you don’t need to worry about whether or not there are token_type_ids in your tokenized inputs: as long as you use the same checkpoint for the tokenizer and the model, everything will be fine as the tokenizer knows what to provide to its model.

Now that we have seen how our tokenizer can deal with one pair of sentences, we can use it to tokenize our whole dataset

We can feed the tokenizer the pairs of sentences by giving it the list of first sentences, then the list of second sentences

In [10]:
"""
Try 1
Disadvantages:
- Returns a dictionary (keys input_ids, attention_mask and token_type_ids, each holding a list of lists) instead of a Dataset
- Only works if we have enough RAM to hold the whole dataset during tokenization,
  whereas datasets from the Hugging Face Datasets library are Apache Arrow files stored on disk,
  so only the samples we ask for are loaded in memory
"""

tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

In [12]:
"""
Try 2: Use the Dataset.map() method
The map() method works by applying a function to each element of the dataset
Calling map() on the DatasetDict applies the function to all the splits at once
As long as the function returns a dictionary-like object (see above), the map method
will add new columns as needed
The Hugging Face Datasets library applies this processing by adding new fields to the datasets,
one new field for each key in the dictionary returned by the preprocessing function

The tokenizer is backed by Rust, which makes it fast when it is given many inputs at once

We can preprocess faster by using the option batched=True: the function is then applied to
multiple elements at the same time instead of to each element separately, so it
receives multiple examples at each call

Leave the padding argument out for now: padding all the samples to the maximum length is
not efficient. It is better to pad the samples when we are building a batch,
as then we only need to pad to the maximum length in that batch, and not to the
maximum length in the entire dataset

Let's define a function that tokenizes our inputs
"""

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

You can even use multiprocessing when applying your preprocessing function with map() by passing along a num_proc argument. 

We didn’t do this here because the HuggingFace Tokenizers library already uses multiple threads to tokenize our samples faster, but if you are not using a fast tokenizer backed by this library, this could speed up your preprocessing

Recap: Our tokenize_function returns a dictionary with the keys input_ids, attention_mask, and token_type_ids, so those three fields are added to all splits of our dataset. We could also have changed existing fields if our preprocessing function returned a new value for an existing key in the dataset to which we applied map().
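
The recap can be illustrated without the library. The sketch below is a simplified plain-Python imitation of a batched map over a dictionary of columns. It is not how `Dataset.map()` is implemented; `map_batched` and `toy_tokenize` are hypothetical names, and the whitespace "tokenizer" only stands in for a real one.

```python
def toy_tokenize(batch):
    """Receive a dict of lists (one list per column) and return new columns."""
    return {
        "n_tokens": [
            len(s1.split()) + len(s2.split())
            for s1, s2 in zip(batch["sentence1"], batch["sentence2"])
        ]
    }


def map_batched(columns, function, batch_size=2):
    """Apply `function` to slices of the columns and merge the returned keys."""
    num_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for key, values in function(batch).items():
            new_columns.setdefault(key, []).extend(values)
    result = {name: list(values) for name, values in columns.items()}
    # New keys become new columns; a key that already exists replaces that column.
    result.update(new_columns)
    return result


columns = {
    "sentence1": ["a b", "c", "d e f"],
    "sentence2": ["x", "y z", "w"],
    "label": [1, 0, 1],
}
mapped = map_batched(columns, toy_tokenize)
print(list(mapped))
print(mapped["n_tokens"])
```

The function is called once per batch rather than once per row, which is what allows a fast tokenizer to process many sentences in a single call.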

### Dynamic Padding

The function that is responsible for putting together samples inside a batch is called a ```collate function``` 

It is an argument we can pass when we build a ```DataLoader```. The default is a function that will convert our samples to PyTorch tensors and concatenate them

This is not possible in our case: our inputs won't all be the same size, because we deliberately postponed the padding. We did so in order to apply it only as necessary on each batch and avoid over-long inputs with a lot of padding

This will speed up training by quite a bit, but note that if you’re training on a TPU it can cause problems — TPUs prefer fixed shapes, even when that requires extra padding

To do this in practice, we have to define a collate function that will apply the correct amount of padding to the items of the dataset we want to batch together. Fortunately, the HuggingFace Transformers library provides us with such a function via ```DataCollatorWithPadding```
- Takes a tokenizer when you instantiate it, so that it knows which padding token to use and whether the model expects padding on the left or on the right of the inputs
- Pads every item in a batch to the length of the longest item in that batch
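
To show what such a collate function has to do, here is a minimal plain-Python sketch. It is an illustration under stated assumptions rather than the code of ```DataCollatorWithPadding```: `pad_batch` is a hypothetical name, the input is assumed to be a list of per-sample dictionaries, the padding id 0 is the `[PAD]` id of `bert-base-uncased`, and the result is left as Python lists instead of tensors.

```python
def pad_batch(features, pad_token_id=0, padding_side="right"):
    """Pad every sample to the longest input_ids in this batch only."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        ids_pad, mask_pad = [pad_token_id] * n_pad, [0] * n_pad
        if padding_side == "right":
            batch["input_ids"].append(f["input_ids"] + ids_pad)
            batch["attention_mask"].append(f["attention_mask"] + mask_pad)
        else:
            batch["input_ids"].append(ids_pad + f["input_ids"])
            batch["attention_mask"].append(mask_pad + f["attention_mask"])
    return batch


features = [
    {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 2023, 2003, 1996, 102], "attention_mask": [1, 1, 1, 1, 1]},
]
batch_sketch = pad_batch(features)
print(batch_sketch["input_ids"])
print(batch_sketch["attention_mask"])
```

The padded positions get an attention mask of 0, so the model ignores them.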

In [13]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

"""
Testing: Grab a few samples from the training set that we want to batch together
Remove the columns idx, sentence1 and sentence2: they are not needed, and sentence1 and sentence2 contain strings (we can't create tensors from strings)
"""
# Pick up a few samples
samples = tokenized_datasets["train"][:8]
# Remove the columns
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
# Check out the lengths
[len(x) for x in samples["input_ids"]]

[50, 59, 47, 67, 59, 50, 62, 32]

No surprise, we get samples of varying length, from 32 to 67.

```Dynamic Padding``` means the samples in this batch should all be padded to a length of 67, the maximum length inside the batch. Without dynamic padding, all of the samples would have to be padded to the maximum length in the whole dataset, or the maximum length the model can accept. 
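
A quick back-of-the-envelope calculation shows how much padding this saves for the eight samples above. The lengths are copied from the output of the previous cell; the fixed length of 512 is an assumption, taken as the maximum input length of a BERT base model.

```python
# Token counts of the 8 samples, copied from the output above.
lengths = [50, 59, 47, 67, 59, 50, 62, 32]
model_max_length = 512  # assumption: maximum input length of the model

real_tokens = sum(lengths)
dynamic_total = len(lengths) * max(lengths)     # pad to the longest sample in the batch
fixed_total = len(lengths) * model_max_length   # pad every sample to the model maximum

print(f"real tokens:     {real_tokens}")
print(f"dynamic padding: {dynamic_total} positions, {dynamic_total - real_tokens} of them padding")
print(f"fixed padding:   {fixed_total} positions, {fixed_total - real_tokens} of them padding")
```

With dynamic padding the batch holds 536 positions instead of 4096, so the model spends far less computation on padding tokens.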

Let’s double-check that our data_collator is dynamically padding the batch properly:

In [14]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}