[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gsarti/ik-nlp-tutorials/blob/main/notebooks/W5T_Transformers_Finetune.ipynb)

In [None]:
# Run in Colab to install local packages
!pip install transformers datasets evaluate peft sentencepiece accelerate

# Finetuning and Inference with 🤗 Transformers

*Adapted from the [🤗 Transformers documentation](https://huggingface.co/transformers/training.html)*

In this notebook we will see how to use 🤗 Transformers to fine-tune a model on a downstream task and how to use it for inference. Starting from a pre-trained model reduces computation costs and your carbon footprint, and lets you use state-of-the-art models without training one from scratch. This procedure is known as fine-tuning.

## Preprocess

Let's start by loading the Yelp Reviews dataset:

In [1]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

dataset["train"][100]

  from .autonotebook import tqdm as notebook_tqdm
Found cached dataset yelp_review_full (/home/gsarti/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf)
100%|██████████| 2/2 [00:00<00:00, 159.06it/s]


{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. 

As you now know, you need a tokenizer to process the text, together with a padding and truncation strategy to handle variable sequence lengths. To process your dataset in one step, use the 🤗 Datasets `map` method to apply a preprocessing function over the entire dataset:

In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")


def tokenize_function(examples):

    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)

Loading cached processed dataset at /home/gsarti/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf/cache-44f5e8aeb7e65f9a.arrow
100%|██████████| 50/50 [00:09<00:00,  5.14ba/s]
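
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch of what happens to a list of token ids. The helper `pad_or_truncate` and the choice of `pad_id=0` are hypothetical and only for illustration; the real tokenizer also handles special tokens and uses its own padding id.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return fixed-length ids plus the matching attention mask (illustrative only)."""
    kept = token_ids[:max_length]                     # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)                    # padding: how many slots remain
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad    # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short = pad_or_truncate([5, 8, 13], max_length=6)
long = pad_or_truncate([5, 8, 13, 21, 34, 55, 89, 144], max_length=6)
print(short)  # three real tokens followed by three padding slots
print(long)   # only the first six tokens survive
```

Every example ends up with the same length, which is what lets the examples be stacked into a single batch tensor.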


If you like, you can create a smaller subset of the full dataset to fine-tune on to reduce the time it takes:

In [3]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))

small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

Loading cached shuffled indices for dataset at /home/gsarti/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf/cache-16b27b83f415f401.arrow


## Train

🤗 Transformers provides a [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) class optimized for training 🤗 Transformers models, making it easier to start training without manually writing your own training loop. The Trainer API supports a wide range of training options and features such as logging, gradient accumulation, and mixed precision. Start by loading your model and specifying the number of expected labels. From the Yelp Review dataset card, you know there are five labels.

>You will see a warning about some of the pretrained weights not being used and some weights being randomly initialized. Don’t worry, this is completely normal! The pretrained head of ALBERT is discarded, and replaced with a randomly initialized classification head. You will fine-tune this new model head on your sequence classification task, transferring the knowledge of the pretrained model to it.

In [4]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("albert-base-v2", num_labels=5)

Some weights of the model checkpoint at albert-base-v2 were not used when initializing AlbertForSequenceClassification: ['predictions.dense.bias', 'predictions.bias', 'predictions.LayerNorm.weight', 'predictions.LayerNorm.bias', 'predictions.dense.weight', 'predictions.decoder.bias', 'predictions.decoder.weight']
- This IS expected if you are initializing AlbertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You sho

Next, create a [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) class which contains all the hyperparameters you can tune as well as flags for activating different training options. For this tutorial you can start with the default training hyperparameters, but feel free to experiment with these to find your optimal settings.

Specify where to save the checkpoints from your training:

In [5]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

## Evaluate

Trainer does not automatically evaluate model performance during training. You’ll need to pass Trainer a function to compute and report metrics. The 🤗 Evaluate library provides a simple accuracy metric that you can load with the `evaluate.load` function (see this [quicktour](https://huggingface.co/docs/evaluate/a_quick_tour) for more information):

In [6]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

Call `compute` on the metric to calculate the accuracy of your predictions. Before passing your predictions to `compute`, you need to convert the logits to predictions (remember all 🤗 Transformers models return logits):

In [7]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
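
As a rough illustration of what `compute_metrics` does, the sketch below reproduces the argmax-then-compare logic in plain Python on made-up logits. The helper names and the toy values are hypothetical; this is not how 🤗 Evaluate is implemented internally.

```python
def argmax(row):
    """Index of the largest value in a list (plain-Python stand-in for np.argmax)."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Turn each row of logits into a class id, then count the matches."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Four made-up examples with five classes each
toy_logits = [
    [0.1, 2.0, -1.0, 0.3, 0.0],   # predicted class 1
    [1.5, 0.2, 0.1, 0.0, -0.4],   # predicted class 0
    [0.0, 0.1, 0.2, 0.3, 3.0],    # predicted class 4
    [0.9, 0.8, 0.7, 0.6, 0.5],    # predicted class 0
]
toy_labels = [1, 0, 3, 0]

result = accuracy(toy_logits, toy_labels)
print(result)  # {'accuracy': 0.75}
```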

If you’d like to monitor your evaluation metrics during fine-tuning, specify the `evaluation_strategy` parameter in your training arguments to report the evaluation metric at the end of each epoch (recent 🤗 Transformers releases rename this parameter to `eval_strategy`). The [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) class is highly customizable, with all common training parameters (e.g. learning rate, batch size) and flags for activating different training options (e.g. mixed precision, gradient accumulation). See the link for more information.

In [8]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch", per_device_train_batch_size=2)

Create a [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) object with your model, training arguments, training and test datasets, and evaluation function:

In [9]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Then fine-tune your model by calling `train()`:

In [10]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `AlbertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `AlbertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 1500
  Number of trainable parameters = 11687429
 33%|███▎      | 500/1500 [01:45<03:33,  4.69it/s]Saving model checkpoint to test_trainer/checkpoint-500
Configuration saved in test_trainer/checkpoint-500/config.json
Model weights saved in test_trainer/checkpoint-500/pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `AlbertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `AlbertForSequenceClassifica

{'loss': 1.6644, 'learning_rate': 3.3333333333333335e-05, 'epoch': 1.0}


                                                  
 33%|███▎      | 500/1500 [02:21<03:33,  4.69it/s]

{'eval_loss': 1.6014658212661743, 'eval_accuracy': 0.218, 'eval_runtime': 36.1239, 'eval_samples_per_second': 27.682, 'eval_steps_per_second': 3.46, 'epoch': 1.0}


 67%|██████▋   | 1000/1500 [04:08<01:48,  4.62it/s] Saving model checkpoint to test_trainer/checkpoint-1000
Configuration saved in test_trainer/checkpoint-1000/config.json
Model weights saved in test_trainer/checkpoint-1000/pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `AlbertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `AlbertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8


{'loss': 1.6188, 'learning_rate': 1.6666666666666667e-05, 'epoch': 2.0}


                                                   
 67%|██████▋   | 1000/1500 [04:47<01:48,  4.62it/s]

{'eval_loss': 1.6127325296401978, 'eval_accuracy': 0.214, 'eval_runtime': 38.8268, 'eval_samples_per_second': 25.755, 'eval_steps_per_second': 3.219, 'epoch': 2.0}


100%|██████████| 1500/1500 [06:44<00:00,  4.21it/s]  Saving model checkpoint to test_trainer/checkpoint-1500
Configuration saved in test_trainer/checkpoint-1500/config.json
Model weights saved in test_trainer/checkpoint-1500/pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `AlbertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `AlbertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8


{'loss': 1.6162, 'learning_rate': 0.0, 'epoch': 3.0}


                                                   
100%|██████████| 1500/1500 [07:27<00:00,  4.21it/s]

Training completed. Do not forget to share your model on huggingface.co/models =)


100%|██████████| 1500/1500 [07:27<00:00,  3.35it/s]

{'eval_loss': 1.608689308166504, 'eval_accuracy': 0.205, 'eval_runtime': 42.9809, 'eval_samples_per_second': 23.266, 'eval_steps_per_second': 2.908, 'epoch': 3.0}
{'train_runtime': 447.7088, 'train_samples_per_second': 6.701, 'train_steps_per_second': 3.35, 'train_loss': 1.6331433919270832, 'epoch': 3.0}





TrainOutput(global_step=1500, training_loss=1.6331433919270832, metrics={'train_runtime': 447.7088, 'train_samples_per_second': 6.701, 'train_steps_per_second': 3.35, 'train_loss': 1.6331433919270832, 'epoch': 3.0})

Now we can load the model from the checkpoint directory, reload the tokenizer from the original checkpoint, and use them to predict the sentiment of a new review:

In [10]:
model = AutoModelForSequenceClassification.from_pretrained("test_trainer/checkpoint-1500", num_labels=5)
# We didn't save tokenizer in the folder, so we need to reload it
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")

In [11]:
import torch

text = "Don't waste your time. We had two different people come to our house to give us estimates for a deck (one of them the OWNER). Both times, we never heard from them. Not a call, not the estimate, nothing."

tokens = tokenizer(text, padding=True, return_tensors="pt")
output = model(**tokens)
probs = torch.nn.functional.softmax(output.logits, dim=-1).tolist()
print(probs)

[[0.19721245765686035, 0.19987523555755615, 0.19545863568782806, 0.20727425813674927, 0.20017942786216736]]
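
The probabilities printed above are all close to 0.2, meaning the model is nearly indifferent between the five classes. The following sketch, using only the standard library, shows that a perfectly uniform five-way prediction has a cross-entropy of ln 5 ≈ 1.609, which is close to the `eval_loss` values of roughly 1.60 to 1.61 logged during training. The all-zero logits are an illustrative assumption, not values taken from the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[label])


flat_logits = [0.0, 0.0, 0.0, 0.0, 0.0]   # hypothetical: no class preferred
uniform = softmax(flat_logits)
uniform_loss = cross_entropy(uniform, label=2)

print(uniform)                  # every class gets probability 0.2
print(round(uniform_loss, 4))   # 1.6094
```

A loss that stays near this value is a sign that the classifier has not yet learned to separate the classes.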


## Training in native PyTorch

Trainer takes care of the training loop and allows you to fine-tune a model in a single line of code. For users who prefer to write their own training loop, you can also fine-tune a 🤗 Transformers model in native PyTorch.

In [12]:
from transformers import AutoTokenizer
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

  from .autonotebook import tqdm as notebook_tqdm
Found cached dataset yelp_review_full (/home/gsarti/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf)
100%|██████████| 2/2 [00:00<00:00, 177.29it/s]
Loading cached processed dataset at /home/gsarti/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf/cache-12b7931c9b1e5501.arrow
100%|██████████| 50/50 [00:09<00:00,  5.25ba/s]


In [13]:
# Remove the text column because the model does not accept raw text as an input:
tokenized_datasets = tokenized_datasets.remove_columns(["text"])

# Rename the label column to labels because the model expects the argument to be named labels:
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

# Set the format of the dataset to return PyTorch tensors instead of lists:
tokenized_datasets.set_format("torch")

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

Loading cached shuffled indices for dataset at /home/gsarti/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf/cache-e100a11ae18cbc88.arrow


In [14]:
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import get_scheduler
from transformers import AutoModelForSequenceClassification

# Specify device to use a GPU if you have access to one. Otherwise, training on a CPU may take several hours instead of a couple of minutes.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = AutoModelForSequenceClassification.from_pretrained("albert-base-v2", num_labels=5).to(device)

# Create a DataLoader for your training and test datasets so you can iterate over batches of data
train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=2)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=2)
optimizer = AdamW(model.parameters(), lr=5e-5)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
# Create the default learning rate scheduler from Trainer
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

Some weights of the model checkpoint at albert-base-v2 were not used when initializing AlbertForSequenceClassification: ['predictions.dense.weight', 'predictions.LayerNorm.bias', 'predictions.LayerNorm.weight', 'predictions.dense.bias', 'predictions.bias', 'predictions.decoder.bias', 'predictions.decoder.weight']
- This IS expected if you are initializing AlbertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.weight', 'classifier.bias']
You sho
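
The linear schedule created above can be sketched by hand. The helper `linear_lr` below is a hypothetical re-implementation written for illustration, assuming the schedule decays linearly from the base learning rate to zero after an optional warmup. With 1000 examples, a batch size of 2 and 3 epochs it gives 1500 steps, and learning rates of about 3.33e-05 and 1.67e-05 after one and two epochs, in line with the values logged by Trainer earlier.

```python
import math


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step for a linear warmup + linear decay schedule."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


num_examples, batch_size, num_epochs = 1000, 2, 3
steps_per_epoch = math.ceil(num_examples / batch_size)   # 500 batches per epoch
num_training_steps = num_epochs * steps_per_epoch        # 1500 optimizer steps

for step in (0, 500, 1000, 1500):
    print(step, linear_lr(step, 5e-5, num_training_steps))
```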

To keep track of your training progress, use the [tqdm](https://tqdm.github.io/) library to add a progress bar over the number of training steps:

In [15]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

100%|██████████| 1500/1500 [05:25<00:00,  4.31it/s]

Just as you added an evaluation function to Trainer, you need to do the same when you write your own training loop. But instead of calculating and reporting the metric at the end of each epoch, this time you’ll accumulate all the batches with `add_batch` and calculate the metric at the very end.

In [16]:
import evaluate

metric = evaluate.load("accuracy")
model.eval()

for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
metric.compute()

{'accuracy': 0.248}
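
The accumulate-then-compute pattern can be sketched with a tiny running counter. The class `RunningAccuracy` is hypothetical and only mimics the `add_batch` / `compute` interface for illustration; it makes no claim about how 🤗 Evaluate stores batches internally.

```python
class RunningAccuracy:
    """Accumulate correct/total counts batch by batch, then report once."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, predictions, references):
        # Update the running counts with one batch of predictions
        for p, y in zip(predictions, references):
            self.correct += int(p == y)
            self.total += 1

    def compute(self):
        if self.total == 0:
            raise ValueError("no batches were added")
        return {"accuracy": self.correct / self.total}


metric_sketch = RunningAccuracy()
metric_sketch.add_batch(predictions=[0, 4], references=[0, 3])   # 1 of 2 correct
metric_sketch.add_batch(predictions=[2, 2], references=[2, 2])   # 2 of 2 correct
final = metric_sketch.compute()
print(final)  # {'accuracy': 0.75}
```

Accumulating counts over all batches gives the accuracy over the whole evaluation set, which can differ from averaging per-batch accuracies when the last batch is smaller than the others.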

The poor performance of the model (random baseline: 20%) is likely due to *\<take a guess\>*? 🙂

The [examples/pytorch](https://github.com/huggingface/transformers/tree/main/examples/pytorch) folder in the main 🤗 Transformers repository contains many examples of how to use 🤗 Transformers with Trainer and PyTorch. You can also check out the [PyTorch Lightning integration](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#pytorch-lightning-integration) for more information.

## Advanced Topics: Parameter-efficient Training with 🤗 PEFT

*Adapted from the [🤗 PEFT blog post](https://huggingface.co/blog/peft)*

In this section we will briefly look at the [PEFT](https://github.com/huggingface/peft) (Parameter-Efficient Fine-Tuning) library, released in early 2023 to reduce the computational burden of fine-tuning large language models (LLMs). The two-step pre-training and fine-tuning procedure already reduces the amount of training needed to reach good performance, and fine-tuning on downstream datasets yields large performance gains over using pretrained LLMs out of the box (zero-shot inference, for example).

However, as models get larger and larger, full fine-tuning becomes infeasible on consumer hardware. In addition, storing and deploying a separate fine-tuned model for each downstream task becomes very expensive, because each fine-tuned model is the same size as the original pretrained model. Parameter-Efficient Fine-Tuning (PEFT) approaches are meant to address both problems!

PEFT approaches only fine-tune a small number of (extra) model parameters while freezing most parameters of the pretrained LLM, thereby greatly decreasing computational and storage costs. This also mitigates [catastrophic forgetting](https://arxiv.org/abs/1312.6211), a behaviour observed during the full fine-tuning of LLMs. PEFT approaches have also been shown to outperform full fine-tuning in low-data regimes and to generalize better to out-of-domain scenarios.

We will explore efficient NLP approaches like LoRA in the final lecture of this course. In the meantime, we will try out the PEFT library for fine-tuning a multilingual sequence-to-sequence model.

In [None]:
from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_model, LoraConfig, TaskType

model_name_or_path = "bigscience/mt0-base"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)

peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

After this ad-hoc loading procedure, the fine-tuning process follows the regular 🤗 Transformers approach. `print_trainable_parameters()` shows that the trainable parameters now correspond to only a small fraction of the total. The blog post reports about 2M trainable parameters out of 1.2B for the larger `bigscience/mt0-large` checkpoint, so expect smaller absolute counts for the `mt0-base` checkpoint loaded here. Refer to [this file](https://github.com/huggingface/peft/blob/main/examples/conditional_generation/peft_lora_seq2seq.ipynb) for a full example.
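
To see where the small trainable fraction comes from, recall that LoRA freezes a weight matrix of shape `d_out × d_in` and learns a low-rank update `B @ A`, with `B` of shape `d_out × r` and `A` of shape `r × d_in`. The sketch below counts the parameters for a single adapted matrix. The 1024 × 1024 projection size is a hypothetical value chosen for illustration, not a dimension read from the mt0 configuration.

```python
def lora_param_count(d_out, d_in, r):
    """Parameters added by one LoRA adapter: B is d_out x r, A is r x d_in."""
    return r * (d_out + d_in)


d_out = d_in = 1024   # hypothetical projection size, for illustration only
r = 8                 # same rank as in the LoraConfig used in this section

full = d_out * d_in                        # frozen weights of the original matrix
added = lora_param_count(d_out, d_in, r)   # trainable LoRA weights

print(full, added, f"{100 * added / full:.2f}%")  # 1048576 16384 1.56%
```

Because only a subset of the matrices in the network receives an adapter, the trainable share of the whole model is smaller still than this per-matrix ratio.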

Once the model is trained, you can proceed with saving as usual. Thanks to PEFT, only the small amount of fine-tuned parameters will be saved, since the rest can be reloaded from the pretrained model. This is a huge advantage when deploying models in production or working in low-resource settings, since you can save a lot of storage space and bandwidth.

In [18]:
model.save_pretrained("output_dir") 

Finally, here's an example of using the saved model for inference:

In [21]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel, PeftConfig

# For example, https://hf.co/smangrul/twitter_complaints_bigscience_T0_3B_LORA_SEQ_2_SEQ_LM
peft_model_id = "output_dir"
# The adapter config records which base model the adapter was trained on
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
# Attach the saved LoRA weights to the freshly loaded base model
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = model.to(device)
model.eval()
inputs = tokenizer("Tweet text : @HondaCustSvc Your customer service has been horrible during the recall process. I will never purchase a Honda again. Label :", return_tensors="pt")

with torch.no_grad():
    # Keep the inputs on the same device as the model instead of hard-coding "cuda"
    outputs = model.generate(input_ids=inputs["input_ids"].to(device), max_new_tokens=10)
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0])

# Output: complaint

See the [PEFT readme](https://github.com/huggingface/peft) for more information on supported efficient methods and models.

## Advanced Topics: Quantized Inference with `bitsandbytes`

*Adapted from the [bitsandbytes blog post](https://huggingface.co/blog/hf-bitsandbytes-integration)*

The `bitsandbytes` library is a new 🤗 Transformers integration that allows you to perform quantized inference on your models. Quantized inference is a technique that reduces the precision of the model weights and activations to reduce the memory footprint and computational cost of inference, which is especially useful for very large models.

While quantization usually implies a degradation in model performance, the `bitsandbytes` library allows you to quantize your models with next to no loss in accuracy. This is achieved by preserving so-called "outlier features" (i.e. large activations in the model) in full precision, while quantizing the majority of the model weights and activations to 8-bit precision. We'll briefly cover this in the last lecture, but in the meantime you can find more details in the blog post above or the [LLM.int8 paper](https://arxiv.org/abs/2208.07339).

<img src="https://huggingface.co/blog/assets/96_hf_bitsandbytes_integration/Mixed-int8.gif">
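
To build some intuition for why outliers are handled separately, here is a minimal sketch of plain absmax 8-bit quantization in pure Python. It is a simplified illustration with made-up values and is not the `bitsandbytes` implementation: it only shows that a single large value stretches the quantization range and increases the rounding error of all the small values.

```python
def absmax_quantize(values):
    """Scale values so the largest magnitude maps to 127, then round to integers."""
    scale = 127 / max(abs(v) for v in values)
    quantized = [round(v * scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the 8-bit integers back to approximate floats."""
    return [q / scale for q in quantized]


def max_abs_error(values):
    """Largest round-trip error introduced by quantizing and dequantizing."""
    quantized, scale = absmax_quantize(values)
    restored = dequantize(quantized, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


regular = [0.12, -0.45, 0.33, 0.08, -0.27]   # made-up "ordinary" activations
with_outlier = regular + [40.0]              # same values plus one large outlier

err_regular = max_abs_error(regular)
err_outlier = max_abs_error(with_outlier)
print(err_regular, err_outlier)   # the second error is far larger than the first
```

Keeping the outlier out of the 8-bit path lets the remaining values use the full integer range, which is the intuition behind the mixed-precision decomposition illustrated in the animation above.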

In practice, you can simply pass `load_in_8bit=True` to the `from_pretrained` method to load the model in quantized form:

In [20]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# We need sharded weights, otherwise we get CPU OOM errors
model_id = "ybelkada/t5-3b-sharded"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Needs at least 12GB of GPU memory
model_8bit = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)
model_8bit.get_memory_footprint()
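
As a rough sanity check on the value returned by `get_memory_footprint()`, the memory taken by the weights is approximately the number of parameters times the bytes used per parameter. The sketch below assumes about 3 billion parameters, as the name `t5-3b` suggests; the exact count is an assumption, and real footprints may differ somewhat because some tensors can stay in higher precision.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory needed to store the weights, in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


num_params = 3e9   # assumption: roughly 3 billion parameters

fp32_gb = weight_memory_gb(num_params, 4)   # 32-bit floats use 4 bytes each
int8_gb = weight_memory_gb(num_params, 1)   # 8-bit integers use 1 byte each

print(fp32_gb, int8_gb)  # 12.0 3.0
```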

For t5-3b the int8 model takes about 2.9GB, whereas the original model takes 11GB. For t5-11b the int8 model takes about 11GB, versus 42GB for the original model. We can use `generate` as usual:

In [None]:
max_new_tokens = 30

input_ids = tokenizer(
    "translate English to German: Hello my name is Younes and I am a Machine Learning Engineer at Hugging Face", return_tensors="pt"
).input_ids.to(model_8bit.device)  # move the inputs to the device that holds the quantized model

outputs = model_8bit.generate(input_ids, max_new_tokens=max_new_tokens)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))