# Lightweight Fine-Tuning Project

Description of Choices

* PEFT technique: LoRA (Low-Rank Adaptation) is the selected PEFT technique because it is supported across a wide range of model architectures, including GPT-2.
* Model: GPT-2 is the selected model since it is relatively small and compatible with both sequence classification and LoRA.
* Evaluation approach: Evaluation is conducted using Hugging Face's `Trainer` class, which simplifies the training and evaluation workflow. Accuracy and F1 score are computed in a `compute_metrics` function, which allows the performance of the original model to be compared against that of the fine-tuned model.
* Fine-tuning dataset: The IMDb dataset from the Hugging Face `datasets` library is used for fine-tuning. This dataset is a standard benchmark for binary sentiment classification, making it a good fit in this context.
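
To illustrate the idea behind LoRA, the following is a minimal NumPy sketch rather than the PEFT implementation. The layer sizes, rank, and scaling value are hypothetical and chosen only for readability. It shows that the frozen weight `W` is left untouched while a scaled low-rank product `B @ A` is added on top, and that starting `B` at zero means the adapted layer initially behaves exactly like the frozen one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not GPT-2's real dimensions)
d_in, d_out, r = 64, 64, 4
lora_alpha = 8                  # hypothetical scaling numerator
scale = lora_alpha / r

W = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01       # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init

def lora_forward(x, W, A, B, scale):
    """Frozen path plus the scaled low-rank correction."""
    return x @ W.T + scale * (x @ A.T) @ B.T

x = rng.normal(size=(3, d_in))

# With B = 0 the adapted layer reproduces the frozen layer exactly
starts_identical = np.allclose(lora_forward(x, W, A, B, scale), x @ W.T)

# Once B is non-zero, the update B @ A still has rank at most r
B_trained = rng.normal(size=(d_out, r))
update_rank = np.linalg.matrix_rank(B_trained @ A)

full_params = W.size             # parameters in the frozen weight
lora_params = A.size + B.size    # parameters that would actually be trained
print(starts_identical, update_rank, full_params, lora_params)
```

In this toy setting only `A` and `B` would receive gradients, which is why the number of trained parameters grows with the rank `r` rather than with the full size of `W`.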

## Loading and Evaluating Foundation Model

Load the pre-trained GPT-2 model from Hugging Face and evaluate its performance prior to fine-tuning.

The following steps are taken:
- Load the GPT-2 model and tokenizer.
- Add a padding token to the tokenizer for compatibility.
- Preprocess the IMDb dataset for sequence classification.
- Evaluate the baseline performance using accuracy and F1-score.
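
The padding step matters because, as far as the Hugging Face GPT-2 classification head is documented, each sequence is scored from the position of its last non-padding token, which can only be located when a padding id is defined. The sketch below imitates that selection in NumPy. The token ids and logits are made-up values, and the helper `last_real_token_index` is a hypothetical name that assumes right-padded input.

```python
import numpy as np

PAD_ID = 50256  # GPT-2's end-of-text id, reused as the padding id in this notebook

# Hypothetical right-padded batch of token ids (illustrative values only)
input_ids = np.array([
    [11, 12, 13, PAD_ID, PAD_ID],
    [21, 22, 23, 24, 25],
    [31, PAD_ID, PAD_ID, PAD_ID, PAD_ID],
])

def last_real_token_index(input_ids, pad_id):
    """Index of the last non-padding token in each right-padded row."""
    non_pad = input_ids != pad_id
    return non_pad.sum(axis=1) - 1

# Hypothetical per-token logits with shape (batch, seq_len, num_labels)
rng = np.random.default_rng(0)
token_logits = rng.normal(size=(3, 5, 2))

idx = last_real_token_index(input_ids, PAD_ID)
# Pick one logit vector per sequence, taken at its last real token
pooled = token_logits[np.arange(len(idx)), idx]
print(idx, pooled.shape)
```

Without a padding id, the model cannot tell real tokens from padding in a batch, which is why the tokenizer and model configuration are both updated before evaluation.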

In [117]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, GPT2Tokenizer, GPT2ForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
from peft import get_peft_model, LoraConfig
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

In [118]:
# Load the tokenizer and model
model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1})

# Add padding token to the tokenizer if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.eos_token_id

# Load the IMDb dataset
dataset = load_dataset("imdb")
train_dataset = dataset["train"].shuffle(seed=42).select(range(1000)) 
test_dataset = dataset["test"].shuffle(seed=42).select(range(500))


# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Define the compute_metrics function
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average="weighted")
    return {"accuracy": acc, "f1": f1}

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    logging_dir="./logs",
    logging_steps=10,
    save_steps=10,
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

# Evaluate the model
baseline_metrics = trainer.evaluate()
print("Baseline Performance:", baseline_metrics)

#### Key Observation ####
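The model loads and evaluates without issue, but its accuracy and F1 score are close to chance level (accuracy is 51.2%, as noted in the fine-tuning observations). This is expected because the classification head is newly initialized, so there is clear room for improvement.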
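
To make the accuracy and weighted F1 figures easier to interpret, the sketch below recomputes both metrics by hand on hypothetical labels. It is not the scikit-learn implementation; `weighted_f1` is an illustrative helper that averages per-class F1 by class support. The example uses a degenerate classifier that always predicts the negative class, which reaches 50% accuracy on balanced data while its weighted F1 is much lower.

```python
import numpy as np

def weighted_f1(labels, preds, num_classes=2):
    """Support-weighted average of per-class F1 scores."""
    labels = np.asarray(labels)
    preds = np.asarray(preds)
    total = 0.0
    for c in range(num_classes):
        tp = np.sum((preds == c) & (labels == c))
        fp = np.sum((preds == c) & (labels != c))
        fn = np.sum((preds != c) & (labels == c))
        denom = 2 * tp + fp + fn
        f1_c = 2 * tp / denom if denom > 0 else 0.0
        total += f1_c * np.sum(labels == c)   # weight by class support
    return total / len(labels)

# Hypothetical balanced labels and a classifier that always says "negative"
labels = np.array([0, 1] * 50)
preds = np.zeros(100, dtype=int)

accuracy = float(np.mean(labels == preds))
f1 = float(weighted_f1(labels, preds))
print(f"accuracy={accuracy:.3f}, weighted_f1={f1:.3f}")
```

A gap of this kind between accuracy and weighted F1 is one sign that a model is favoring a single class rather than discriminating between the two.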

## Performing Parameter-Efficient Fine-Tuning

Create a PEFT model from the loaded model, run a training loop, and save the PEFT model weights.

In this section:
- A LoRA configuration is created with specified parameters for adaptation.
- The PEFT model is trained and fine-tuned parameters are saved for later use.
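
As a rough back-of-the-envelope check on how few parameters LoRA trains, the arithmetic below estimates the trainable parameter count. Every number here is an assumption rather than output from this notebook: a rank of 8, adapters applied only to GPT-2's fused attention projection (768 to 2304) in each of 12 layers, a bias-free classification head of size 768 x 2, and roughly 124.4 million parameters in the frozen base model. Calling `peft_model.print_trainable_parameters()` reports the actual figures.

```python
# All values below are assumptions for illustration, not measured output
hidden = 768                 # GPT-2 small hidden size
n_layers = 12                # GPT-2 small transformer blocks
c_attn_out = 3 * hidden      # fused query/key/value projection width
r = 8                        # assumed LoRA rank

# Each adapted layer adds A (r x hidden) and B (c_attn_out x r)
lora_per_layer = r * hidden + c_attn_out * r
lora_total = n_layers * lora_per_layer

head_params = hidden * 2     # classification head, assumed to have no bias
trainable = lora_total + head_params

base_params = 124_439_808    # assumed size of the frozen GPT-2 small backbone
fraction = trainable / (base_params + trainable)

print(f"LoRA parameters:      {lora_total:,}")
print(f"Trainable parameters: {trainable:,}")
print(f"Trainable fraction:   {fraction:.4%}")
```

Under these assumptions, well under 1% of the parameters are updated during training, which is what makes the approach lightweight.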

In [119]:
# Define LoRA configuration
lora_config = LoraConfig(
    task_type="SEQ_CLS", ## SEQ_CLS for sequence classification tasks
    inference_mode=False, 
    lora_dropout=0.1
)

# Create PEFT model
peft_model = get_peft_model(model, lora_config)

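# Create a new Trainer for the PEFT model
# (safer than swapping trainer.model in place on an existing Trainer)
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)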

# Train the model
trainer.train()

# Save the PEFT model weights
peft_model.save_pretrained("gpt-lora")

#### Key Observations ####
- The model improved substantially, with accuracy increasing from 51.2% to 83.6%.
- There are signs of overfitting in epoch 3, where the training loss dropped to 0.0001.
- Accuracy and F1 scores stabilized by epoch 3.

## Performing Inference with a PEFT Model

Load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Compare the results with those obtained prior to fine-tuning.

In [120]:
from peft import PeftModelForSequenceClassification

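# Reload the PEFT model on top of a freshly loaded base model, since `model`
# was already modified in place by get_peft_model during training
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1})
base_model.config.pad_token_id = tokenizer.pad_token_id
fine_tuned_model = PeftModelForSequenceClassification.from_pretrained(base_model, "gpt-lora")
fine_tuned_model.eval()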

# Create trainer for evaluation
fine_tuned_trainer = Trainer(
    model=fine_tuned_model,
    args=training_args,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

# Evaluate fine-tuned model
fine_tuned_metrics = fine_tuned_trainer.evaluate()

# Print comparison of metrics
print("\nPerformance Comparison:")
print("-" * 50)
print("Metric       | Baseline | Fine-tuned")
print("-" * 50)
print(f"Accuracy     | {baseline_metrics['eval_accuracy']:.4f}   | {fine_tuned_metrics['eval_accuracy']:.4f}")
print(f"F1 Score     | {baseline_metrics['eval_f1']:.4f}   | {fine_tuned_metrics['eval_f1']:.4f}")
print(f"Loss         | {baseline_metrics['eval_loss']:.4f}   | {fine_tuned_metrics['eval_loss']:.4f}")
print("-" * 50)


#### Key Observations ####
- The PEFT model significantly outperforms the base model in accuracy and F1 score on the held-out test subset, indicating that fine-tuning was effective.
- The PEFT model slightly reduces the evaluation loss compared to the base model.

In [123]:
# Demonstrate inference on sample texts
sample_texts = [
    "The movie was absolutely wonderful! A masterpiece.",
    "Terrible movie. I would not recommend it to anyone.",
    "It was just okay, nothing too special.",
]
inputs = tokenizer(sample_texts, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Move model and inputs to device
device = "cuda" if torch.cuda.is_available() else "cpu"
fine_tuned_model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

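# Perform inference without tracking gradients
with torch.no_grad():
    outputs = fine_tuned_model(**inputs)
# Move logits to the CPU before converting to NumPy (required when running on CUDA)
predictions = np.argmax(outputs.logits.cpu().numpy(), axis=-1)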

# Make a dataframe with the sample texts, predictions, and predicted labels
import pandas as pd
df = pd.DataFrame({
    "text": sample_texts,
    "prediction": predictions,
    "predicted_label": [model.config.id2label[p] for p in predictions]
}) 

# Show all the cells in the dataframe
pd.set_option("display.max_colwidth", None)
print(df)

#### Key Observations ####

- The model accurately classified one review as positive (1), which was a highly favorable comment about the movie.
- It correctly identified a negative review as negative (0), showcasing its ability to discern critical feedback.
- A neutral or mixed review was also classified as negative (0), which might indicate a lack of nuance in distinguishing between neutral and negative sentiments.

Overall, the model performed well in identifying clear positive and negative sentiments but might need refinement to handle more nuanced or neutral statements effectively.
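
One possible refinement for neutral statements, sketched here as an idea rather than something this notebook implements, is to convert the logits to probabilities and flag low-confidence predictions instead of forcing a positive or negative label. The logits below are hypothetical values, and both the `label_with_uncertainty` helper and the 0.75 threshold are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def label_with_uncertainty(logits, id2label, threshold=0.75):
    """Return probabilities and labels, using 'uncertain' below the threshold."""
    probs = softmax(np.asarray(logits, dtype=float))
    labels = []
    for row in probs:
        top = int(row.argmax())
        labels.append(id2label[top] if row[top] >= threshold else "uncertain")
    return probs, labels

id2label = {0: "negative", 1: "positive"}

# Hypothetical logits: clearly positive, clearly negative, and ambiguous
example_logits = [[-2.0, 3.0], [2.5, -1.5], [0.3, 0.1]]

probs, labels = label_with_uncertainty(example_logits, id2label)
print(labels)
```

With real model outputs, the same function could be applied to `outputs.logits` after moving them to the CPU, and the threshold would need to be tuned on held-out data.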