<a href="https://colab.research.google.com/github/hyesunyun/huggingface-lab/blob/main/huggingface_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning HuggingFace 🤗 Transformer Models

*Adapted from https://huggingface.co/docs/transformers/en/training, https://huggingface.co/docs/transformers/en/tasks/sequence_classification, and https://huggingface.co/docs/transformers/en/tasks/summarization#load-billsum-dataset*

In this lab, we will go over how to fine-tune a model on text classification (sentiment analysis) and summarization tasks.

Transformers provides access to thousands of pretrained models for a wide range of tasks. When you use a pretrained model, you train it on a dataset specific to your task. This is known as fine-tuning, an incredibly powerful training technique.

We will also go over how to use HuggingFace's [Trainer](https://huggingface.co/docs/transformers/en/trainer) to run the complete training and evaluation loop for PyTorch models. You only need to pass it the necessary pieces for training (model, tokenizer, dataset, evaluation function, training hyperparameters, etc.), and the Trainer class takes care of the rest. This lets you start training quickly without writing your own training loop. At the same time, Trainer is highly customizable and offers many training options, so you can tailor it to your exact training needs.

Make sure your runtime type is set to GPU. You can pick the T4 GPU.

Run the following cell to verify your GPU setup. You should see information about the GPU available for your session.

In [None]:
! nvidia-smi

### Install Packages

This is only needed for Google Colab users.

In [None]:
# Transformers installation
! pip install transformers[torch] datasets evaluate accelerate rouge_score
# Install dependencies
! pip install torch

If you would like to upload your fine-tuned model to the HuggingFace Hub, I encourage you to create an account and log in.

In [None]:
## Uncomment and run the code below if you would like to log in to HuggingFace
# from huggingface_hub import notebook_login

## When prompted, enter your authentication token to log in
## You can generate a secret auth token from your HuggingFace account
# notebook_login()

## Text Classification (Sentiment Analysis)

In this section, we will finetune [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) on the [IMDb dataset](https://huggingface.co/datasets/stanfordnlp/imdb) to determine whether a movie review is positive or negative.

### Load IMDB dataset

In [None]:
from datasets import load_dataset

imdb = load_dataset("imdb")

print(f"Splits of IMDB dataset: {imdb.keys()}")

print(f"Number of training samples: {len(imdb['train'])}")
print(f"Number of testing samples: {len(imdb['test'])}")

print(f"Example data instance: {imdb['test'][0]}")

There are two fields in this dataset:

*   `text`: the movie review text.
*   `label`: a value that is either 0 for a negative review or 1 for a positive review.

In addition to loading datasets that are on HuggingFace's Hub, you can also load local or remote custom data files using `load_dataset` from the `datasets` library.

For example:

```python
from datasets import load_dataset

# for CSV file
# pass your CSV files as a list if you have several CSV files (ex: train.csv, val.csv, test.csv)
dataset = load_dataset("csv", data_files="my_file.csv")

# for JSON file
dataset = load_dataset("json", data_files="my_file.json")

# for remote JSON file via HTTP
base_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/"
dataset = load_dataset("json", data_files={"train": base_url + "train-v1.1.json", "validation": base_url + "dev-v1.1.json"}, field="data")
```

For more details, please refer to this documentation: https://huggingface.co/docs/datasets/v3.2.0/loading#load

In [None]:
# We will just use a subset of the dataset so we don't have to wait too long for fine-tuning to finish.
from datasets import DatasetDict, Dataset
import random

# randomly sample 2500 rows for train and 500 for test
train_indices = random.sample(range(len(imdb['train'])), 2500)
test_indices = random.sample(range(len(imdb['test'])), 500)
imdb_train = imdb['train'].select(train_indices)
imdb_test = imdb['test'].select(test_indices)

# create a new dataset in DatasetDict object
imdb = DatasetDict({
    'train': Dataset.from_list(imdb_train),
    'test':  Dataset.from_list(imdb_test)
})

print(f"Number of training samples: {len(imdb['train'])}")
print(f"Number of testing samples: {len(imdb['test'])}")
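The sampling above draws a different subset every time the cell runs. If you would like the same subset across runs, one option is to draw the indices from a seeded generator. The sketch below uses only the standard library; the seed value `42` and the helper name `sample_indices` are illustrative assumptions, not part of the original lab.

```python
import random

def sample_indices(dataset_size, n_samples, seed=42):
    """Return `n_samples` distinct indices in [0, dataset_size), reproducibly."""
    rng = random.Random(seed)  # private generator, leaves the global RNG untouched
    return rng.sample(range(dataset_size), n_samples)

# 25,000 is the size of each IMDb split; in the notebook you could use len(imdb["train"]) instead
train_indices = sample_indices(25_000, 2500)
print(len(train_indices), len(set(train_indices)))
```

The resulting list can be passed to `select` in the same way as the indices in the cell above.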

### Preprocess

The next step is to load a DistilBERT tokenizer to preprocess the text field:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

Create a preprocessing function to tokenize text and truncate sequences to be no longer than DistilBERT’s maximum input length:

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets `map` function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once:

In [None]:
tokenized_imdb = imdb.map(preprocess_function, batched=True)
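As a rough illustration of what `batched=True` changes: the function passed to `map` receives a dictionary of columns, each holding a list of values, rather than a single example. In the sketch below, `count_words` is a hypothetical stand-in for the tokenizer call, and the batch is written by hand instead of coming from the dataset.

```python
def count_words(examples):
    # With batched=True, `examples` maps each column name to a list of values,
    # so the function processes the whole batch in one call.
    return {"n_words": [len(text.split()) for text in examples["text"]]}

# Hand-written batch in the same shape that `map(..., batched=True)` would pass in
batch = {
    "text": ["great movie", "not my favorite film at all"],
    "label": [1, 0],
}

new_columns = count_words(batch)
print(new_columns)  # {'n_words': [2, 6]}
```

The returned dictionary has one list entry per input row, which is how `map` knows where to attach the new column values.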

Now create a batch of examples using [DataCollatorWithPadding](https://huggingface.co/docs/transformers/v4.47.1/en/main_classes/data_collator#transformers.DataCollatorWithPadding). It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

**Note**: Data Collator is used to form a batch by using a list of dataset elements as input.

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
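To make the idea of dynamic padding concrete, here is a minimal pure-Python sketch of what a padding collator does conceptually: it pads each batch only up to the longest sequence in that batch and builds a matching attention mask. The helper name `pad_batch`, the token IDs, and the pad ID of `0` are illustrative assumptions; this is not the library's implementation.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in `batch` to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up tokenized reviews of different lengths
batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 3231, 102]]
padded = pad_batch(batch)
print(padded["input_ids"][0])       # [101, 7592, 102, 0, 0, 0]
print(padded["attention_mask"][0])  # [1, 1, 1, 0, 0, 0]
```

Because the padded length is decided per batch, a batch of short reviews stays short instead of being stretched to the dataset-wide maximum.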

### Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the accuracy metric (see the Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

In [None]:
import evaluate

accuracy = evaluate.load("accuracy")
# you could also load "precision" or "f1" as well for this task

Then create a function that passes your predictions and labels to `compute` to calculate the accuracy:

In [None]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
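For intuition, the two steps inside `compute_metrics` (take the argmax over the logits, then compare with the labels) can be reproduced by hand. The sketch below uses plain Python and made-up logits; the helper name `argmax` is an assumption for illustration and is not how the Evaluate library computes accuracy internally.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

# Made-up logits for four reviews: one [negative, positive] pair per review
logits = [[2.1, -0.3], [0.2, 1.7], [-1.0, 0.5], [0.9, 0.4]]
labels = [0, 1, 0, 0]

predictions = [argmax(row) for row in logits]
accuracy_value = sum(p == l for p, l in zip(predictions, labels)) / len(labels)

print(predictions)     # [0, 1, 1, 0]
print(accuracy_value)  # 0.75
```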

### Train

Before you start training your model, create a map of the expected ids to their labels with `id2label` and `label2id`:

In [None]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

You're ready to start training your model now! Load DistilBERT with [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/v4.47.1/en/model_doc/auto#transformers.AutoModelForSequenceClassification) along with the number of expected labels, and the label mappings:

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

At this point, only three steps remain:


1.   Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/v4.47.1/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir` which specifies where to save your model. At the end of each epoch, the Trainer will evaluate the accuracy and save the training checkpoint. **If you want to push this model to the Hub, set `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model).**
2.   Pass the training arguments to Trainer along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/v4.47.1/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [None]:
output_dir = "./imdb_distilbert_model/"

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none", # you can use wandb, tensorboard, etc to log your train process
    push_to_hub=False, # set to True if you want to push this model to the Hub
)

# Trainer applies dynamic padding by default when you pass a tokenizer to it,
# so the explicit data_collator below is optional; it is passed here for clarity.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

## Once training is completed, upload your model to the Hub with the push_to_hub() method
# trainer.push_to_hub()

We can check with the `evaluate` method that our `Trainer` did reload the best model properly (if it was not the last one):

In [None]:
trainer.evaluate()

In [None]:
checkpoint_number = 314
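The value `314` is consistent with the settings used above, assuming a single GPU and no gradient accumulation: 2,500 training samples at a batch size of 16 give 157 optimizer steps per epoch (the last batch is partial), and two epochs give 314 steps. A quick check of that arithmetic, under those assumptions:

```python
import math

num_train_samples = 2500  # size of the training subset
batch_size = 16           # per_device_train_batch_size
num_epochs = 2            # num_train_epochs

# The last, partially filled batch still counts as one step
steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # 157 314
```

With `save_strategy="epoch"`, checkpoints would then be expected at `checkpoint-157` and `checkpoint-314`. If you change the subset size, the batch size, or the number of epochs, update `checkpoint_number` to match a directory that actually exists in your output folder.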

### Inference

Now that you've finetuned a model, you can use it for inference!



In [None]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu" # checks if gpu is available
pipeline_device = 0 if device == "cuda" else -1 # for determining if we want to load model in GPU or CPU

In [None]:
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

## Easiest way to do inference is to use pipeline()
from transformers import pipeline

##### ADD YOUR CODE BELOW #####
# load the fine-tuned model from locally saved model weights
# instantiate sentiment-analysis pipeline
# do inference on the above text


In [None]:
# Using AutoTokenizer and AutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

checkpoint_dir = f"/content/imdb_distilbert_model/checkpoint-{checkpoint_number}"
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint_dir).to(device)

# Tokenize the text and return PyTorch tensors
inputs = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
  # Pass your inputs to the model and return the logits
  logits = model(**inputs).logits

# Get the class with the highest probability, and use the model's id2label mapping to convert it to a text label
predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]
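The `argmax` above returns only the winning class. If you also want a confidence score, a softmax turns the logits into probabilities. The sketch below uses plain Python with made-up logit values; in the notebook you could apply `torch.softmax(logits, dim=-1)` to the real logits instead.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits in the order [NEGATIVE, POSITIVE]
example_logits = [-1.2, 2.3]
probabilities = softmax(example_logits)

predicted_id = probabilities.index(max(probabilities))
print(predicted_id, [round(p, 3) for p in probabilities])
```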

## Summarization

Summarization creates a shorter version of a document or an article that captures all the important information. Summarization can be:

*   Extractive: extract the most relevant information from a document.
*   Abstractive: generate new text that captures the most relevant information.

This section will show you how to:

1. Finetune [T5](https://huggingface.co/google-t5/t5-small) on the California state bill subset of the [BillSum](https://huggingface.co/datasets/billsum) dataset for abstractive summarization.
2. Use your finetuned model for inference.

### Load BillSum dataset

Start by loading the smaller California state bill subset of the BillSum dataset from the 🤗 Datasets library:

In [None]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

print(f"Number of samples: {len(billsum)}")

Split the dataset into a train and test set with the [train_test_split](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.train_test_split) method:

In [None]:
billsum = billsum.train_test_split(test_size=0.2)

print(f"Number of train samples: {len(billsum['train'])}")
print(f"Number of test samples: {len(billsum['test'])}")

In [None]:
# take a look at an example
billsum["train"][0]

There are two fields that you'll want to use:

* `text`: the text of the bill, which will be the input to the model.
* `summary`: a condensed version of `text`, which will be the model target.

### Preprocess

The next step is to load a T5 tokenizer to process text and summary:

In [None]:
from transformers import AutoTokenizer

checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The preprocessing function you want to create needs to:

1. Prefix the input with a prompt so T5 knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.
2. Use the `text_target` keyword argument when tokenizing labels.
3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.

In [None]:
prefix = "summarize: "

def preprocess_function(examples):
    """
    Preprocess function for the summarization dataset.

    Args:
        examples: dataset examples

    Returns:
        model_inputs: dictionary of model inputs
    """
    ##### ADD YOUR CODE BELOW #####

    # prefix the inputs (examples["text"]) with a prompt so T5 knows this is a summarization task.
    # tokenize the inputs using max_length=1024 and truncate
    # tokenize the target (examples["summary"]) to max_length=128 and truncate

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets `map` method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

In [None]:
##### ADD YOUR CODE BELOW #####

Now create a batch of examples using [DataCollatorForSeq2Seq](https://huggingface.co/docs/transformers/v4.47.1/en/main_classes/data_collator#transformers.DataCollatorForSeq2Seq). It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
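By default, `DataCollatorForSeq2Seq` pads the labels with `-100` rather than with the tokenizer's pad ID, so that padded positions are ignored by the loss. A minimal pure-Python sketch of that idea follows; the token IDs are made up and the helper name `pad_labels` is hypothetical.

```python
def pad_labels(label_batch, label_pad_id=-100):
    """Pad each label sequence to the longest one, using an 'ignore' ID."""
    max_len = max(len(labels) for labels in label_batch)
    return [labels + [label_pad_id] * (max_len - len(labels)) for labels in label_batch]

# Two made-up tokenized summaries of different lengths
label_batch = [[363, 19, 1], [8, 2876, 3, 9, 1]]
padded_labels = pad_labels(label_batch)
print(padded_labels)  # [[363, 19, 1, -100, -100], [8, 2876, 3, 9, 1]]
```

This is why label arrays seen at evaluation time can contain `-100` values, which have to be mapped back to a real token ID before decoding.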

### Evaluate

Including a metric during training is often helpful for evaluating your model's performance. For this task, load the [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge) metric.

In [None]:
import evaluate

rouge = evaluate.load("rouge")

Then create a function that passes your predictions and labels to `compute` to calculate the ROUGE metric:

In [None]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # During training, some tokens (typically padding tokens) in the labels might be set to -100.
    # The purpose of this line is to replace those -100 values with the actual padding token ID used by the tokenizer.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # stem the text before computing ROUGE
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # calculate the length of each generated summary (padding tokens are not counted)
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    # calculate the average length of the generated summaries
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
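To get a feel for what a ROUGE score measures, here is a simplified ROUGE-1 F1 computed by hand: it counts overlapping unigrams between a prediction and a reference. This is only a sketch under simplifying assumptions (lowercasing, whitespace tokenization, no stemming), so its numbers will not exactly match the `rouge` metric loaded above; the function name `rouge1_f1` is illustrative.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1 based on whitespace-separated, lowercased tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the bill funds schools", "the bill funds public schools")
print(round(score, 4))  # 0.8889
```

Here all four predicted words appear in the reference (precision 1.0), while the reference has five words (recall 0.8), which gives an F1 of about 0.89.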

### Train

Load T5 with AutoModelForSeq2SeqLM:

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

##### ADD CODE HERE #####

At this point, only three steps remain:

1. Define your training hyperparameters in [Seq2SeqTrainingArguments](https://huggingface.co/docs/transformers/v4.47.1/en/main_classes/trainer#transformers.Seq2SeqTrainingArguments). The only required parameter is `output_dir` which specifies where to save your model. At the end of each epoch, the Trainer will evaluate the ROUGE metric and save the training checkpoint.
2. Pass the training arguments to [Seq2SeqTrainer](https://huggingface.co/docs/transformers/v4.47.1/en/main_classes/trainer#transformers.Seq2SeqTrainer) along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call `train()` to finetune your model.

In [None]:
output_dir = "./billsum_t5_model/"

##### ADD CODE HERE #####

# add the missing arguments for Seq2SeqTrainingArguments
# follow the example from sentiment analysis to help you
training_args = Seq2SeqTrainingArguments(
    # ADD more arguments here
    save_total_limit=2,
    logging_steps=50,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True, #change to bf16=True for XPU
    report_to="none", # you can use wandb, tensorboard, etc to log your train process
    push_to_hub=False, # set to True if you want to push this model to the Hub
)

# load Seq2SeqTrainer with the correct arguments
# follow example from sentiment analysis
# then call train()
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

# OPTIONAL
# trainer.push_to_hub()

In [None]:
trainer.evaluate()

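The best model should be saved in a checkpoint directory such as `/content/billsum_t5_model/checkpoint-186` (the exact step number depends on your batch size and number of epochs).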

### Inference

Come up with some text you’d like to summarize. For T5, you need to prefix your input depending on the task you’re working on.

In [None]:
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

In [None]:
##### ADD CODE HERE #####
# either use pipeline() or AutoModelForSeq2SeqLM to do inference
# make sure you load the correct checkpoint directory

## Appendix

### Trainer Basic Tutorial

https://huggingface.co/docs/transformers/en/training#train-with-pytorch-trainer

### Parameter-Efficient Fine-Tuning

Even fine-tuning LLMs can become resource intensive. Hence, some researchers have come up with parameter-efficient fine-tuning (PEFT) methods.

LoRA (low-rank adaptation) is one of the most popular PEFT methods. It fine-tunes a model in a parameter-efficient way by training only a few small adapters while keeping the existing model weights untouched. LoRA is available in the [PEFT library](https://huggingface.co/docs/peft/index) by Hugging Face, which also supports various other PEFT methods.

QLoRA is another PEFT method that is even more memory-efficient. This method applies LoRA to a quantized model (such as an 8-bit or 4-bit model).

If "full fine-tuning" exceeds the GPU RAM, then these training methods can help.
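To see why low-rank adapters save so many trainable parameters, consider a single weight matrix of shape `d_out × d_in`. LoRA leaves it frozen and trains two small matrices of shapes `d_out × r` and `r × d_in` instead. The sketch below simply counts parameters; the dimensions (768, matching DistilBERT's hidden size) and the rank of 8 are illustrative choices, and `lora_param_counts` is a hypothetical helper, not a PEFT library function.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full fine-tuning params, LoRA adapter params) for one weight matrix."""
    full = d_out * d_in           # every entry of the original matrix is trained
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora

full, lora = lora_param_counts(768, 768, 8)
print(full, lora, f"{lora / full:.2%}")  # 589824 12288 2.08%
```

For this one matrix, the adapter trains roughly 2% of the parameters that full fine-tuning would update, and the saving grows as the matrix gets larger relative to the rank.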

### More Examples

For more notebooks with examples on fine-tuning, please refer to the following resources:

*   https://huggingface.co/docs/transformers/notebooks#pytorch-nlp
*   https://github.com/NielsRogge/Transformers-Tutorials

### Hyperparameter Search

`Trainer` supports four hyperparameter search backends currently: optuna, sigopt, raytune and wandb.

https://huggingface.co/docs/transformers/en/hpo_train