# Preprocess the data

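We load the dataset from the Hugging Face Hub. It is MRPC (Microsoft Research Paraphrase Corpus) from the GLUE benchmark: pairs of sentences, each labeled as paraphrases of one another or not. The aim is to fine-tune a BERT model to classify each pair as equivalent or not equivalent.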

To preprocess the dataset, we need to convert the text to numbers the model can make sense of.

In [None]:
# !pip install transformers
# !pip install datasets
# !pip install accelerate
# !pip install evaluate

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("glue","mrpc")
print(raw_datasets)

raw_train_dataset = raw_datasets["train"]
print("Sample data:\n",raw_train_dataset[0],"\n")

print("Features:\n", raw_train_dataset.features,"\n")


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})
Sample data:
 {'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0} 

Features:
 {'sentence1': Value(dtype='string', id=None), 'sentence2': Value(dtype='string', id=None), 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None), 'idx': Value(dtype='int32', id=None)} 



However, the model does not take two separate sentences as input. We need to combine each pair into a single sequence, and the tokenizer supports this directly: pass the two sentences as two arguments and it joins them with special tokens.

In the output below, the `input_ids` for the `[CLS]` and `[SEP]` tokens are 101 and 102 respectively. Other models use different ids, but the idea is the same: special tokens get reserved ids in the vocabulary.
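To make the pair layout concrete, here is a minimal sketch in plain Python of how a BERT-style tokenizer arranges two sentences into one sequence. The `encode_pair` helper and its whitespace split are hypothetical stand-ins for illustration only; the real tokenizer uses WordPiece and returns vocabulary ids rather than strings.

```python
def encode_pair(sentence_a, sentence_b):
    """Toy illustration of BERT's sentence-pair layout (not the real tokenizer)."""
    tokens_a = sentence_a.lower().split()
    tokens_b = sentence_b.lower().split()
    # Layout: [CLS] sentence A [SEP] sentence B [SEP]
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # No padding yet, so every position is a real token.
    attention_mask = [1] * len(tokens)
    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


encoded = encode_pair("This is the first sentence", "This is the second one")
print(encoded["tokens"])
print(encoded["token_type_ids"])
```

With the real tokenizer, calling `tokenizer(sentence_a, sentence_b)` produces the same three kinds of fields, with `input_ids` in place of the token strings.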

In [None]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Tokenize each sentence column separately
tokenized_sentence_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentence_2 = tokenizer(raw_datasets["train"]["sentence2"])

inputs = tokenizer(["This is the first sentence", "This is the sencond one"])
print("Encoded Inputs:\n", inputs)
decoded = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Decoded Inputs:\n", decoded)


Encoded Inputs:
 {'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 102], [101, 2023, 2003, 1996, 12411, 8663, 2094, 2028, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1]]}
Decoded Inputs:
 ['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '[SEP]']


The `DatasetDict` is stored in a memory-efficient format based on Apache Arrow. To keep the data in that format, we use `Dataset.map()`, which applies a function to each element of the dataset, so let's define a function that tokenizes our inputs.

Because the tokenizer accepts lists of sentence pairs, as seen above, the function can process many examples at once. This lets us pass `batched=True` in our call to `map()`, which greatly speeds up tokenization.
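As a rough illustration of what `batched=True` changes, the sketch below shows the shape of the argument the mapped function receives: a dictionary of lists (one entry per example) rather than a single example. The `to_batch` helper and the toy rows are hypothetical and only mimic that shape; they are not part of the `datasets` library.

```python
rows = [
    {"sentence1": "a b c"},
    {"sentence1": "d e"},
    {"sentence1": "f"},
]


def to_batch(chunk):
    """Hypothetical helper: turn a list of row dicts into a dict of lists."""
    return {key: [row[key] for row in chunk] for key in chunk[0]}


def count_words_batched(batch):
    # With batched=True the function sees lists, one entry per example,
    # and must return lists of the same length.
    return {"n_words": [len(s.split()) for s in batch["sentence1"]]}


batch = to_batch(rows)
result = count_words_batched(batch)
print(batch)   # {'sentence1': ['a b c', 'd e', 'f']}
print(result)  # {'n_words': [3, 2, 1]}
```

A tokenizer call fits this pattern naturally, since it already accepts lists of sentences and returns lists of encodings.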

In [None]:
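# Option 1: tokenize the whole training split in one call. This works, but it
# returns a dictionary-like object rather than a Dataset, and padding=True pads
# every example to the longest one in the split.
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

# Option 2: tokenize through Dataset.map() so the result stays a Dataset.
# Padding is left out here and applied later, per batch.
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)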

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

To assemble samples into batches we use a collate function. It is an argument (`collate_fn`) you can pass when building a `DataLoader`. The default collate function simply converts your samples to PyTorch tensors and stacks them, which only works if all samples have the same length.

We have not applied padding yet because the inputs vary a lot in length, and padding everything to the longest example would over-pad the whole dataset. Instead, we pad each batch separately, to the length of its longest sample.
To do this in practice, we use `DataCollatorWithPadding`, a collate function that applies the right amount of padding to the items we want to batch together.
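The idea behind per-batch (dynamic) padding can be sketched in a few lines of plain Python. This is a simplified illustration, not the `DataCollatorWithPadding` implementation: the `pad_batch` helper and the toy id sequences are made up for the example, and `pad_id=0` assumes BERT's `[PAD]` id.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


toy_batch = [
    [101, 7, 8, 102],
    [101, 9, 102],
    [101, 5, 6, 4, 3, 102],
]
input_ids, attention_mask = pad_batch(toy_batch)
print(input_ids[1])       # [101, 9, 102, 0, 0, 0]
print(attention_mask[1])  # [1, 1, 1, 0, 0, 0]
```

Each batch ends up only as wide as its own longest sample, so a batch of short sentences costs less than one that contains a long outlier.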

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
print([len(x) for x in samples["input_ids"]],"\n")

batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

[50, 59, 47, 67, 59, 50, 62, 32] 



{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}

## Trainer API

In [None]:
from transformers import TrainingArguments
from transformers import AutoModelForSequenceClassification

training_args  = TrainingArguments("test-trainer")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels = 2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Training and Evaluation**

Note that when you pass the `tokenizer` as we do below, the default `data_collator` used by the `Trainer` is a `DataCollatorWithPadding` as defined previously, so you can skip the line `data_collator=data_collator` in this call.

We can also load the metric associated with the dataset using the `evaluate` library from Hugging Face.
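For MRPC, the GLUE metric reports accuracy and F1. The sketch below computes both by hand from logits, assuming label 1 (`equivalent`) is the positive class. The `accuracy_and_f1` helper and the toy logits are hypothetical and only meant to show what the numbers mean; in the notebook the `evaluate` library does this work.

```python
def accuracy_and_f1(logits, labels):
    """Turn per-class logits into predictions, then score them."""
    # argmax over the class dimension
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": correct / len(labels), "f1": f1}


toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
toy_labels = [1, 0, 0, 1]
scores = accuracy_and_f1(toy_logits, toy_labels)
print(scores)  # {'accuracy': 0.5, 'f1': 0.5}
```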

In [None]:
import numpy as np  # needed for np.argmax in compute_metrics
import evaluate
from transformers import Trainer

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

# DataLoaders

Before training, we need to set up a few things. The **first** is the `DataLoader`s we will iterate over, but before that we need to apply a little postprocessing to `tokenized_datasets`:
- Remove the columns the model does not expect, such as `sentence1`, `sentence2` and `idx`.
- Rename `label` to `labels`, because the model's forward method expects the argument to be named `labels`.
- Set the format of the datasets so they return PyTorch tensors.


In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

# see the shape of batches
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 57]),
 'token_type_ids': torch.Size([8, 57]),
 'attention_mask': torch.Size([8, 57])}

## Training Loop

In [None]:
from transformers import AutoModelForSequenceClassification
# The AdamW class in transformers is deprecated; use the PyTorch optimizer instead
from torch.optim import AdamW

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=5e-5)

outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tensor(1.4436, grad_fn=<NllLossBackward0>) torch.Size([8, 2])


The learning rate scheduler used by default is a linear decay from the maximum value (5e-5) to 0.

To define it properly, we need the total number of training steps, which is the number of epochs multiplied by the number of training batches (the length of the training dataloader). The `Trainer` uses 3 epochs by default, so we do the same here.
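The arithmetic is easy to check by hand. The sketch below recomputes the step count from the training set size (3668 examples) and batch size (8) used in this notebook, and evaluates a linear decay at a few steps. `linear_lr` is a hypothetical helper written for illustration under the assumption of a plain linear schedule with optional warmup; it is not the scheduler object itself.

```python
import math


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


num_examples, batch_size, num_epochs = 3668, 8, 3
# The last batch of each epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = num_epochs * steps_per_epoch
print(steps_per_epoch, total_steps)  # 459 1377

for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, 5e-5, total_steps))
```

The learning rate starts at 5e-5 on step 0 and reaches 0 on the final step.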

In [None]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer = optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

print(num_training_steps)

1377


In [None]:
# Training Loop
import torch
from tqdm.auto import tqdm
import evaluate

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
print(device)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        # Move every tensor in the batch to the same device as the model
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

# Evaluation: accumulate predictions batch by batch, then compute the metric
metric = evaluate.load("glue", "mrpc")
model.eval()

for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

cpu

