# Processing the data (PyTorch)

This notebook is explained in the Hugging Face course, chapter 3, section 2: [Processing the data](https://huggingface.co/course/chapter3/2?fw=pt).

The original code for this notebook comes from the Hugging Face notebooks repository (this link opens it in SageMaker Studio Lab): [section2_pt.ipynb](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/en/chapter3/section2_pt.ipynb).

## Run conditions

This notebook has been tested in the following environment:
- Environment: Project created in [Paperspace Gradient](https://gradient.paperspace.com) with Python 3.9.13.
- Machine: P5000 (30GiB RAM 8 CPU 16GiB GPU) (more details on [Paperspace Machines](https://docs.paperspace.com/gradient/machines/)).
- IDE: Visual Studio Code using remote Jupyter server.

## Install dependencies

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
# Install the libraries datasets v2.7.1, evaluate v0.3.0, and transformers v4.25.1 with quiet and upgrade flags.
%pip install -q datasets==2.7.1 evaluate==0.3.0 transformers==4.25.1 --upgrade

Note: you may need to restart the kernel to use updated packages.


## Processing the data

Here is how we would train a sequence classifier on a single batch in PyTorch.

In [2]:
# Import PyTorch.
import torch

# Import AdamW, AutoTokenizer and AutoModelForSequenceClassification from Transformers.
from transformers import AdamW, AutoTokenizer, AutoModelForSequenceClassification

# Name of the pretrained checkpoint to load.
checkpoint = "bert-base-uncased"
# Load the tokenizer that matches the checkpoint.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Load the model with a sequence classification head on top of the checkpoint.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
# Define two example sentences.
sequences = ["I've been waiting for a HuggingFace course my whole life.", "This course is amazing!"]
# Tokenize the sentences into a padded batch of PyTorch tensors.
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
# Add a "labels" tensor of [1, 1] to the batch (both sequences are labeled positive).
batch["labels"] = torch.tensor([1, 1])
# Create an AdamW optimizer over the model parameters.
optimizer = AdamW(model.parameters())
# Run a forward pass and get the loss from the model.
loss = model(**batch).loss
# Backpropagate the loss.
loss.backward()
# Update the weights.
optimizer.step()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [3]:
print(torch.tensor([1, 1]))

tensor([1, 1])


## Loading a dataset from the Hub

In [4]:
# Import load_dataset from Datasets.
from datasets import load_dataset

# Load the MRPC configuration of the GLUE dataset into raw_datasets.
raw_datasets = load_dataset("glue", "mrpc")
# Print raw_datasets.
print(raw_datasets)

Found cached dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)



DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})


In [5]:
# Create the raw train dataset.
raw_train_dataset = raw_datasets["train"]
# Print the first row of the raw train dataset.
print(raw_train_dataset[0])

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0}


In [6]:
# Show features of the raw train dataset.
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}
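
The `ClassLabel` feature above is what ties the integer stored in `label` to a readable class name. As a minimal sketch of that mapping (not the Datasets implementation), the plain-Python snippet below uses the two names from the output; `LABEL_NAMES`, `int_to_name` and `name_to_int` are hypothetical helpers introduced only for illustration.

```python
# Hypothetical stand-in for the mapping held by the ClassLabel feature.
# The names are copied from the features output above; index = integer label.
LABEL_NAMES = ["not_equivalent", "equivalent"]


def int_to_name(label_id: int) -> str:
    """Return the class name for an integer label."""
    return LABEL_NAMES[label_id]


def name_to_int(name: str) -> int:
    """Return the integer label for a class name."""
    return LABEL_NAMES.index(name)


# The first training row shown earlier has 'label': 1.
first_label = 1
print(int_to_name(first_label))
print(name_to_int("not_equivalent"))
```

Read this way, the first training pair (`'label': 1`) is a pair of paraphrases.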

## Preprocessing a dataset

In [7]:
# Import AutoTokenizer from Transformers.
from transformers import AutoTokenizer

# Name of the pretrained checkpoint to load.
checkpoint = "bert-base-uncased"
# Load the tokenizer that matches the checkpoint.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Tokenize the first sentence of the first row of the raw train dataset.
tokenized_sentence_1 = tokenizer(raw_train_dataset[0]["sentence1"])
# Tokenize the second sentence of the first row of the raw train dataset.
tokenized_sentence_2 = tokenizer(raw_train_dataset[0]["sentence2"])



In [8]:
# Tokenize a pair of sentences in a single call.
inputs = tokenizer("This is the first sentence.", "This is the second one.")
# Print inputs.
print(inputs)

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [9]:
# Convert the input IDs back to tokens.
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '.',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 'one',
 '.',
 '[SEP]']
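
The token list above follows the layout `[CLS] sentence1 [SEP] sentence2 [SEP]`, and the `token_type_ids` printed earlier mark which sentence each position belongs to. The sketch below rebuilds that layout in plain Python, assuming the two sentences are already split into tokens; `build_pair` is a hypothetical helper, not part of Transformers.

```python
def build_pair(tokens_a, tokens_b):
    """Assemble a BERT-style sentence pair from two token lists.

    Returns the full token list, the token type IDs (0 for the first
    segment including [CLS] and its [SEP], 1 for the second segment
    including the final [SEP]) and an all-ones attention mask.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask


# Tokens taken from the convert_ids_to_tokens output above.
tokens_a = ["this", "is", "the", "first", "sentence", "."]
tokens_b = ["this", "is", "the", "second", "one", "."]

tokens, token_type_ids, attention_mask = build_pair(tokens_a, tokens_b)
print(tokens)
print(token_type_ids)
print(attention_mask)
```

With six tokens per sentence, this gives eight zeros followed by seven ones, which matches the `token_type_ids` in the tokenizer output.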

In [10]:
# Tokenize every sentence pair (sentence1, sentence2) of the raw train dataset in one call, with padding and truncation enabled.
tokenized_dataset = tokenizer(raw_train_dataset["sentence1"], raw_train_dataset["sentence2"], padding=True, truncation=True)
# Print the encoding of the first row of tokenized_dataset.
print(tokenized_dataset[0])



Encoding(num_tokens=103, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])


In [11]:
def tokenize_function(example):
    """
    Tokenize the sentence pairs (sentence1, sentence2) of a batch of examples.

    Truncation is enabled. Padding is left out here so that it can be applied later, per batch.

    Parameters
    ----------
    example: dict
        A batch of examples with the keys "sentence1" and "sentence2".

    Returns
    -------
    tokenized_examples: dict
        The tokenized examples.
    """
    # Tokenize the two sentences as pairs.
    tokenized_examples = tokenizer(example["sentence1"], example["sentence2"], truncation=True)
    # Return the tokenized examples.
    return tokenized_examples

In [12]:
# Apply tokenize_function to every split of raw_datasets, processing the rows in batches.
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
# Print the tokenized_datasets.
print(tokenized_datasets)

Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-271f2697dfd05c20.arrow



Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-372f1f19557f5a89.arrow


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})
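
With `batched=True`, the mapped function receives a dictionary of column lists rather than a single row, and the columns it returns are added next to the existing ones (hence the new `input_ids`, `token_type_ids` and `attention_mask` features above). The following plain-Python sketch mimics that behavior on two toy rows. `batched_map` and `toy_tokenize` are hypothetical stand-ins, not the Datasets API, and the whitespace split is only a placeholder for a real tokenizer.

```python
def batched_map(rows, function, batch_size=1000):
    """Apply `function` to column-wise batches and merge the new columns back.

    `rows` is a list of dicts (one per example). Each batch is converted to
    a dict of lists before being passed to `function`.
    """
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = function(batch)
        # Attach the i-th value of every new column to the i-th row.
        for i, row in enumerate(chunk):
            merged = dict(row)
            for key, values in new_columns.items():
                merged[key] = values[i]
            out.append(merged)
    return out


def toy_tokenize(batch):
    """Placeholder tokenizer: count whitespace-separated words per pair."""
    counts = [
        len(s1.split()) + len(s2.split())
        for s1, s2 in zip(batch["sentence1"], batch["sentence2"])
    ]
    return {"num_tokens": counts}


rows = [
    {"sentence1": "This is the first sentence.", "sentence2": "This is the second one.", "label": 1},
    {"sentence1": "A short one.", "sentence2": "Another.", "label": 0},
]
result = batched_map(rows, toy_tokenize, batch_size=2)
print(result)
```

The original columns stay in place and one new column appears per key returned by the function, which is the same pattern as the `DatasetDict` output above.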


## Dynamic padding

In [13]:
# Import DataCollatorWithPadding from Transformers.
from transformers import DataCollatorWithPadding

# Create a data collator.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [14]:
# Take the first eight examples of the tokenized train dataset.
samples = tokenized_datasets["train"][:8]
# Keep only the model inputs by dropping the columns idx, sentence1 and sentence2.
samples = {key: value for key, value in samples.items() if key not in ["idx", "sentence1", "sentence2"]}
# Print the length of each entry of input_ids in the batch.
print([len(entry) for entry in samples["input_ids"]])


[50, 59, 47, 67, 59, 50, 62, 32]


In [15]:
# Pad the samples to a common length and convert them to tensors with the data collator.
batch = data_collator(samples)
# Print the shape of each tensor in the batch.
print({k: v.shape for k, v in batch.items()})

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': torch.Size([8, 67]), 'token_type_ids': torch.Size([8, 67]), 'attention_mask': torch.Size([8, 67]), 'labels': torch.Size([8])}
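
The shapes above show every sequence padded to 67, the longest length in this batch, rather than to a dataset-wide maximum. The sketch below reproduces that idea in plain Python on placeholder sequences with the same lengths. `pad_batch` is a hypothetical helper, not the `DataCollatorWithPadding` implementation, and `pad_id=0` is assumed to match the `[PAD]` ID of `bert-base-uncased`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Lengths printed for the eight samples above; token ID 1 is a placeholder.
lengths = [50, 59, 47, 67, 59, 50, 62, 32]
fake_sequences = [[1] * n for n in lengths]

padded = pad_batch(fake_sequences)
shape = (len(padded["input_ids"]), len(padded["input_ids"][0]))
num_pad_tokens = sum(row.count(0) for row in padded["input_ids"])
print(shape)
print(num_pad_tokens)
```

For these eight samples, padding to the batch maximum adds 110 padding tokens. Padding the same samples to the 103 tokens reported for the fully padded dataset earlier would add 398.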
