<a href="https://colab.research.google.com/github/heber-augusto/udacity-generative-ai-nanodegree/blob/main/lightweight-fine-tuning-foundation-model/gpt2_finetuning_using_lora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lightweight Fine-Tuning Project

* PEFT technique: [Lora](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora)
* Model: [GPT-2](https://huggingface.co/transformers/v3.0.2/model_doc/gpt2.html)
* Evaluation approach: Accuracy
* Fine-tuning dataset: [pii-masking](https://huggingface.co/datasets/ai4privacy/pii-masking-300k)

---

### Introduction

In this notebook, I aim to fine-tune the GPT-2 model to identify Personally Identifiable Information (PII) in text data, specifically focusing on phone numbers and social security numbers (SOCIALNUMBER). I utilize the [pii-masking-300k](https://huggingface.co/datasets/ai4privacy/pii-masking-300k) dataset for this task. The goal is to classify text spans as either containing PII or not containing PII.

### Dataset Preparation

1. **Dataset Loading**: I load the `pii-masking-300k` dataset and split it into training, validation, and test sets.
2. **Labeling**: Each text span is labeled as 1 (contains PII) or 0 (does not contain PII) based on the presence of keywords such as 'SOCIALNUMBER' and 'TEL'.
3. **Column Cleaning**: Irrelevant columns are removed to focus on the necessary text and label columns.
4. **Tokenization**: The text data is tokenized using the GPT-2 tokenizer, ensuring compatibility with the model.

### Model and Technique

- **Model**: I use the GPT-2 model for sequence classification.
- **Fine-Tuning Technique**: I employ the Parameter-Efficient Fine-Tuning (PEFT) technique called LoRA (Low-Rank Adaptation), which fine-tunes only a subset of the model parameters, making the process more efficient.

### Training Process

1. **Optimizer and Scheduler**: I use the AdamW optimizer with a linear learning rate scheduler that includes a warmup phase.
2. **Training Loop**: The model is trained over multiple epochs. In each epoch:
    - The model is set to training mode and trained on the training dataset.
    - The model is evaluated on the test dataset, and the accuracy is computed.
    - If the current model's accuracy is better than the previous best accuracy, the model is saved as the best model.

### Evaluation

- **Initial Accuracy Check**: Before fine-tuning, the model's accuracy is evaluated on the validation dataset.
- **Continuous Evaluation**: During the training loop, the model's accuracy on the test dataset is monitored.
- **Best Model Selection**: The model achieving the highest accuracy on the test dataset during training is selected as the best model.
- **Final Evaluation**: The best model's performance is evaluated on the validation dataset to ensure its effectiveness.

### Summary

This notebook demonstrates the process of fine-tuning GPT-2 to identify PII in text data using the LoRA technique. The effectiveness of the model is evaluated based on accuracy, with continuous monitoring and comparison against a baseline accuracy before fine-tuning.

The initial accuracy was 0.328 and the final accuracy, after the training process was *0.858*

---

## Libraries Instalation

In [1]:
!pip install -q peft transformers datasets evaluate

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/251.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m [32m245.8/251.6 kB[0m [31m8.2 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m251.6/251.6 kB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m542.1/542.1 kB[0m [31m13.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m7.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.6/302.6 kB[0m [31m24.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m15.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m64.9/64.9 kB[0m [31m7.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━

In [2]:
import argparse
import os

import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from peft import (
    get_peft_config,
    get_peft_model,
    get_peft_model_state_dict,
    set_peft_model_state_dict,
    LoraConfig,
    PeftType,
    PrefixTuningConfig,
    PromptEncoderConfig,
)

import evaluate
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup
from tqdm import tqdm

## Define configuration parameters

In [3]:
batch_size = 16
model_name_or_path = "gpt2"
task = "mrpc"
peft_type = PeftType.LORA
device = "cuda"
num_epochs = 20
lr = 3e-4
padding_side = "left"

## Configure PEFT using Lora

In [4]:
peft_config = LoraConfig(
    task_type="SEQ_CLS",
    inference_mode=False,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1)


## Load tokenizer


In [5]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path,
    padding_side=padding_side)
# Set the pad token if not already set
if getattr(tokenizer, "pad_token_id") is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

## Load GPT2 model and apply Lora

In [6]:
# Load the pre-trained GPT-2 model for sequence classification
gpt2_model = AutoModelForSequenceClassification.from_pretrained(
    model_name_or_path,
    return_dict=True)

# Apply PEFT configuration to the model
model = get_peft_model(gpt2_model, peft_config)
model.print_trainable_parameters()

# Set the pad token in the model configuration
model.config.pad_token_id = model.config.eos_token_id

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 296,448 || all params: 124,737,792 || trainable%: 0.2377




# Load PII Dataset

In [7]:
from datasets import load_dataset, DatasetDict, load_from_disk

# Function to add a 'label' column to the dataset indicating if there is PII (Personal Identifiable Information)
def add_is_pii(example):
    pii_mask = ['SOCIALNUMBER', 'TEL']
    example["label"] = 1 if any(word in example['span_labels'] for word in pii_mask)else 0
    return example

# Attempt to load a saved dataset; if it doesn't exist, load and process the original dataset

try:
    dataset = load_from_disk("saved_dataset")
    print("Loaded saved Dataset")
except:
    columns_to_remove = ['target_text', 'privacy_mask', 'span_labels', 'mbert_text_tokens', 'mbert_bio_labels', 'id', 'language', 'set']
    columns_renaming_dict = {'source_text':'text'}

    dataset_train = load_dataset("ai4privacy/pii-masking-300k", split='train[:1000]')
    dataset_train = dataset_train.map(add_is_pii)
    dataset_train = dataset_train.remove_columns(columns_to_remove)
    dataset_train = dataset_train.rename_columns(columns_renaming_dict)

    dataset_test  = load_dataset("ai4privacy/pii-masking-300k", split='train[-1000:-500]')
    dataset_test = dataset_test.map(add_is_pii)
    dataset_test = dataset_test.remove_columns(columns_to_remove)
    dataset_test = dataset_test.rename_columns(columns_renaming_dict)

    dataset_val   = load_dataset("ai4privacy/pii-masking-300k", split='train[-500:]')
    dataset_val = dataset_val.map(add_is_pii)
    dataset_val = dataset_val.remove_columns(columns_to_remove)
    dataset_val = dataset_val.rename_columns(columns_renaming_dict)

    datasets = DatasetDict(
        { 'train': dataset_train,
          'test': dataset_test,
          'validation': dataset_val}
    )

    datasets.save_to_disk("saved_dataset")
    print("Splitted and Saved Dataset")
    columns_to_remove_after_tokenizer = ["text",]

Downloading readme:   0%|          | 0.00/15.9k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/103M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/102M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/114M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/108M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/104M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/102M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/27.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/27.0M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/30.7M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/29.2M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/28.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/27.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/177677 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/47728 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/500 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/500 [00:00<?, ? examples/s]

Splitted and Saved Dataset


## Apply tokenizer

In [8]:
def tokenize_function(examples):
    # max_length=None => use the model max length (it's actually the default)
    outputs = tokenizer(examples["text"], truncation=True, max_length=None)
    return outputs

# Apply the tokenization function to the dataset
tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=columns_to_remove_after_tokenizer,
)

# Rename the 'label' column to 'labels' which is the expected name for labels by the models of the
# transformers library
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

## Load Accuracy metric

In [9]:
metric = evaluate.load("accuracy")

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

## Create DataLoaders for training, testing, and validation

In [10]:
# Collation function to standardize batch sizes
def collate_fn(examples):
    return tokenizer.pad(
        examples,
        padding="longest",
        return_tensors="pt")


# Instantiate dataloaders.
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=collate_fn,
    batch_size=batch_size)

test_dataloader = DataLoader(
    tokenized_datasets["test"],
    shuffle=False,
    collate_fn=collate_fn,
    batch_size=batch_size
)

val_dataloader = DataLoader(
    tokenized_datasets["validation"],
    shuffle=False,
    collate_fn=collate_fn,
    batch_size=batch_size
)

## Accuracy of the model, using foundation model and validation dataset

In [12]:
from evaluate import evaluator

task_evaluator = evaluator("text-classification")

results_val = task_evaluator.compute(
    model_or_pipeline=gpt2_model,
    tokenizer=tokenizer,
    data=datasets['validation'],
    input_column='text',
    label_column='label',
    metric="accuracy",
    label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
    strategy="simple",
    random_state=0
)

results_val

{'accuracy': 0.328,
 'total_time_in_seconds': 16.97992352099999,
 'samples_per_second': 29.446540167370188,
 'latency_in_seconds': 0.03395984704199998}

## Accuracy of the model, using foundation model and test dataset

In [13]:
results_test = task_evaluator.compute(
    model_or_pipeline=gpt2_model,
    tokenizer=tokenizer,
    data=datasets['test'],
    input_column='text',
    label_column='label',
    metric="accuracy",
    label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
    strategy="simple",
    random_state=0
)

results_test

{'accuracy': 0.332,
 'total_time_in_seconds': 7.879885030000082,
 'samples_per_second': 63.4527024311159,
 'latency_in_seconds': 0.015759770060000164}

## Save initial model and accuracy to use inside training loop

In [14]:
best_model_accuracy = results_test['accuracy']
best_model = gpt2_model

## Optimizer and Learning Rate Scheduler Setup

In this section, we initialize the optimizer and the learning rate scheduler for training the model. We use the AdamW optimizer, which is known for its efficiency in training deep learning models by adapting the learning rate for each parameter and incorporating weight decay to prevent overfitting.

The learning rate scheduler is configured to gradually increase the learning rate during the initial warmup phase (6% of the total training steps) and then decrease it linearly for the remainder of the training. This helps in stabilizing the training process and improving the model's performance.

- **Optimizer**: `AdamW` with a specified learning rate and model parameters.
- **Scheduler**: Linear schedule with a warmup period, ensuring smooth adjustments to the learning rate during training.


In [15]:
optimizer = AdamW(
    params=model.parameters(),
    lr=lr)

# Instantiate scheduler
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=0.06 * (len(train_dataloader) * num_epochs),
    num_training_steps=(len(train_dataloader) * num_epochs),
)

## Make train loop recording best model using test dataset accuracy as parameter

This code performs the training and evaluation of the model in a loop over multiple epochs. During each epoch, the model is trained on the training dataset and evaluated on the test dataset. The learning rate is adjusted dynamically using a scheduler. After each epoch, the model's performance is evaluated, and if the current model performs better than the previous best model (based on accuracy), the best model is updated. This process helps in finding and saving the best version of the model during training.

In [16]:
model.to(device)
for epoch in range(num_epochs):
    model.train()
    for step, batch in enumerate(tqdm(train_dataloader)):
        batch.to(device)
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    model.eval()
    for step, batch in enumerate(tqdm(test_dataloader)):
        batch.to(device)
        with torch.no_grad():
            outputs = model(**batch)
        predictions = outputs.logits.argmax(dim=-1)
        predictions, references = predictions, batch["labels"]
        metric.add_batch(
            predictions=predictions,
            references=references,
        )

    eval_metric = metric.compute()
    print(f"epoch {epoch}:", eval_metric)
    if eval_metric['accuracy'] > best_model_accuracy:
        print('updating best model')
        best_model_accuracy = eval_metric['accuracy']
        best_model = model



100%|██████████| 63/63 [00:41<00:00,  1.52it/s]
100%|██████████| 32/32 [00:09<00:00,  3.37it/s]


epoch 0: {'accuracy': 0.582}
updating best model


100%|██████████| 63/63 [00:39<00:00,  1.58it/s]
100%|██████████| 32/32 [00:09<00:00,  3.41it/s]


epoch 1: {'accuracy': 0.676}
updating best model


100%|██████████| 63/63 [00:40<00:00,  1.56it/s]
100%|██████████| 32/32 [00:09<00:00,  3.36it/s]


epoch 2: {'accuracy': 0.672}


100%|██████████| 63/63 [00:40<00:00,  1.57it/s]
100%|██████████| 32/32 [00:09<00:00,  3.40it/s]


epoch 3: {'accuracy': 0.586}


100%|██████████| 63/63 [00:40<00:00,  1.55it/s]
100%|██████████| 32/32 [00:09<00:00,  3.34it/s]


epoch 4: {'accuracy': 0.694}
updating best model


100%|██████████| 63/63 [00:40<00:00,  1.56it/s]
100%|██████████| 32/32 [00:09<00:00,  3.41it/s]


epoch 5: {'accuracy': 0.718}
updating best model


100%|██████████| 63/63 [00:40<00:00,  1.57it/s]
100%|██████████| 32/32 [00:09<00:00,  3.36it/s]


epoch 6: {'accuracy': 0.728}
updating best model


100%|██████████| 63/63 [00:40<00:00,  1.56it/s]
100%|██████████| 32/32 [00:09<00:00,  3.39it/s]


epoch 7: {'accuracy': 0.752}
updating best model


100%|██████████| 63/63 [00:40<00:00,  1.55it/s]
100%|██████████| 32/32 [00:09<00:00,  3.42it/s]


epoch 8: {'accuracy': 0.768}
updating best model


100%|██████████| 63/63 [00:40<00:00,  1.57it/s]
100%|██████████| 32/32 [00:09<00:00,  3.37it/s]


epoch 9: {'accuracy': 0.716}


100%|██████████| 63/63 [00:40<00:00,  1.57it/s]
100%|██████████| 32/32 [00:09<00:00,  3.39it/s]


epoch 10: {'accuracy': 0.828}
updating best model


100%|██████████| 63/63 [00:40<00:00,  1.54it/s]
100%|██████████| 32/32 [00:09<00:00,  3.37it/s]


epoch 11: {'accuracy': 0.842}
updating best model


100%|██████████| 63/63 [00:40<00:00,  1.56it/s]
100%|██████████| 32/32 [00:09<00:00,  3.39it/s]


epoch 12: {'accuracy': 0.834}


100%|██████████| 63/63 [00:41<00:00,  1.54it/s]
100%|██████████| 32/32 [00:09<00:00,  3.33it/s]


epoch 13: {'accuracy': 0.834}


100%|██████████| 63/63 [00:40<00:00,  1.57it/s]
100%|██████████| 32/32 [00:09<00:00,  3.34it/s]


epoch 14: {'accuracy': 0.766}


100%|██████████| 63/63 [00:40<00:00,  1.54it/s]
100%|██████████| 32/32 [00:09<00:00,  3.40it/s]


epoch 15: {'accuracy': 0.824}


100%|██████████| 63/63 [00:39<00:00,  1.58it/s]
100%|██████████| 32/32 [00:09<00:00,  3.41it/s]


epoch 16: {'accuracy': 0.816}


100%|██████████| 63/63 [00:40<00:00,  1.57it/s]
100%|██████████| 32/32 [00:09<00:00,  3.37it/s]


epoch 17: {'accuracy': 0.85}
updating best model


100%|██████████| 63/63 [00:40<00:00,  1.56it/s]
100%|██████████| 32/32 [00:09<00:00,  3.29it/s]


epoch 18: {'accuracy': 0.848}


100%|██████████| 63/63 [00:40<00:00,  1.57it/s]
100%|██████████| 32/32 [00:09<00:00,  3.42it/s]

epoch 19: {'accuracy': 0.844}





## Check Accuracy from the best model at the validation dataset

In [17]:
predictions_list = []
references_list  = []

for step, batch in enumerate(tqdm(val_dataloader)):
    batch.to(device)
    with torch.no_grad():
        outputs = best_model(**batch)
    predictions = outputs.logits.argmax(dim=-1)

    predictions, references = predictions, batch["labels"]
    metric.add_batch(
        predictions=predictions,
        references=references,
    )
    predictions_list += predictions.tolist()
    references_list += references.tolist()

eval_metric = metric.compute()
print(f"best model result:", eval_metric)

100%|██████████| 32/32 [00:08<00:00,  3.77it/s]

best model result: {'accuracy': 0.858}





## Show Classification Report and Confusion Matrix

In [18]:
# Import the necessary libraries
from sklearn.metrics import classification_report, confusion_matrix

# Calculate the classification_report
report = classification_report(
    references_list,
    predictions_list,
    target_names = ['without pii', 'with pii']
    )

# Print the classification_report
print(report)

              precision    recall  f1-score   support

 without pii       0.96      0.83      0.89       350
    with pii       0.70      0.91      0.79       150

    accuracy                           0.86       500
   macro avg       0.83      0.87      0.84       500
weighted avg       0.88      0.86      0.86       500



In [19]:
# Calculate the classification_report
report = confusion_matrix(
    references_list,
    predictions_list
    )

# Print the classification_report
print(report)

[[292  58]
 [ 13 137]]
