# Lightweight Fine-Tuning Project

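This project fine-tunes GPT-2 for binary sentiment classification with a LoRA adapter. The choices are:

* PEFT technique: LoRA (`r=8`, `lora_alpha=64`, `lora_dropout=0.2`) applied to the `c_attn` and `c_proj` modules
* Model: `gpt2`, loaded with a two-label sequence classification head via `AutoModelForSequenceClassification`
* Evaluation approach: accuracy plus macro-averaged F1, precision, and recall from `Trainer.evaluate`, measured before and after fine-tuning on the combined validation and test splits
* Fine-tuning dataset: `rotten_tomatoes` (8,530 training reviews; 2,132 evaluation reviews)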

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

### Rotten Tomatoes Sentiment Analysis with GPT-2 and PEFT

In [None]:
!pip install datasets=="3.2.0" transformers[torch] scikit-learn peft

In [1]:
# 1. Load and Prepare Dataset

from datasets import load_dataset, concatenate_datasets
import pandas as pd

# Load Rotten Tomatoes dataset
dataset = load_dataset("rotten_tomatoes")

In [2]:
# 2. Class Balance Analysis
def check_class_balance(dataset_split, split_name):
    """Analyze label distribution in a specific dataset split"""
    labels = dataset_split["label"]
    class_counts = {
        0: labels.count(0),
        1: labels.count(1)
    }
    total = len(labels)
    
    print(f"\n=== Class Distribution - {split_name} ===")
    print(f"Negative samples: {class_counts[0]} ({class_counts[0]/total:.2%})")
    print(f"Positive samples: {class_counts[1]} ({class_counts[1]/total:.2%})")

In [3]:
# Check balance for all splits
check_class_balance(dataset["train"], "Training Set")
check_class_balance(dataset["validation"], "Validation Set")
check_class_balance(dataset["test"], "Test Set")


=== Class Distribution - Training Set ===
Negative samples: 4265 (50.00%)
Positive samples: 4265 (50.00%)

=== Class Distribution - Validation Set ===
Negative samples: 533 (50.00%)
Positive samples: 533 (50.00%)

=== Class Distribution - Test Set ===
Negative samples: 533 (50.00%)
Positive samples: 533 (50.00%)
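
The `check_class_balance` helper above hard-codes the labels 0 and 1. A minimal sketch of a variant that handles any label set with `collections.Counter` follows; the name `class_distribution` and the toy label list are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: (count, fraction)} for any iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}

# Toy labels mirroring the balanced training split (4265 negative, 4265 positive)
toy_labels = [0] * 4265 + [1] * 4265
dist = class_distribution(toy_labels)
for label, (n, frac) in dist.items():
    print(f"label {label}: {n} ({frac:.2%})")
```

The same function would work unchanged on a multi-class dataset, since it never assumes which label values exist.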


In [4]:
# 3. Initialize Model and Tokenizer
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [5]:
# Load pre-trained model with sequence classification head
model = AutoModelForSequenceClassification.from_pretrained(
    'gpt2',
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1}
).to(device).eval()  # Keep in evaluation mode for baseline assessment

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
# 4. Dataset Preprocessing and Tokenization
def tokenize_function(examples):
    """Batch processing function for tokenization"""
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=128,
        return_tensors="pt"
    )

# Tokenize all splits
tokenized_datasets = dataset.map(tokenize_function, batched=True)
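
To make the effect of `truncation=True`, `padding="max_length"`, and `max_length` concrete, here is a toy sketch in plain Python. It does not call the Hugging Face tokenizer: `pad_or_truncate` and the token ids are hypothetical, and 50256 is used as the pad id because the notebook reuses GPT-2's EOS token for padding.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Mimic truncation plus right-padding to a fixed length."""
    ids = token_ids[:max_length]                       # truncation
    mask = [1] * len(ids)                              # 1 = real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 = padding

PAD_ID = 50256  # GPT-2 EOS id, reused as the pad token in this notebook
short_ids, short_mask = pad_or_truncate([11, 22, 33], max_length=5, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=PAD_ID)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example is already padded to 128 tokens at this stage, the dynamic padding performed by `DataCollatorWithPadding` later in the notebook has nothing left to add.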

In [7]:
# Prepare datasets splits
train_dataset = tokenized_datasets["train"]
validation_dataset = tokenized_datasets["validation"]
test_dataset = tokenized_datasets["test"]
eval_dataset = concatenate_datasets([validation_dataset, test_dataset])

In [8]:
# Verify post-tokenization balance
print("\nAfter Tokenization:")
check_class_balance(train_dataset, "Training Set")
check_class_balance(eval_dataset, "Combined Evaluation Set")


After Tokenization:

=== Class Distribution - Training Set ===
Negative samples: 4265 (50.00%)
Positive samples: 4265 (50.00%)

=== Class Distribution - Combined Evaluation Set ===
Negative samples: 1066 (50.00%)
Positive samples: 1066 (50.00%)


In [9]:
# 5. Baseline Model Evaluation
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    """Calculate comprehensive classification metrics"""
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    return {
        "accuracy": (predictions == labels).mean(),
        "f1_macro": f1_score(labels, predictions, average='macro'),
        "precision_macro": precision_score(labels, predictions, average='macro'),
        "recall_macro": recall_score(labels, predictions, average='macro')
    }

In [10]:
# Create data collator with correct tokenizer reference
# (DataCollatorWithPadding was already imported in the previous cell)
data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer,
    padding="longest",
    max_length=128,
    pad_to_multiple_of=8
)


In [11]:
# Configure evaluation trainer
eval_trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./baseline",
        per_device_eval_batch_size=16,
        disable_tqdm=True
    ),
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [12]:
# Evaluate baseline performance
print("\nEvaluating Baseline Model...")
baseline_metrics = eval_trainer.evaluate(eval_dataset)

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



Evaluating Baseline Model...
{'eval_loss': 0.8969319462776184, 'eval_accuracy': 0.49906191369606, 'eval_f1_macro': 0.34577175037352026, 'eval_precision_macro': 0.4850557954354287, 'eval_recall_macro': 0.49906191369606, 'eval_runtime': 18.2571, 'eval_samples_per_second': 116.777, 'eval_steps_per_second': 7.34}
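
The gap between accuracy (about 0.50) and macro F1 (about 0.35) is what one would expect from a classifier that almost always predicts the same class on a balanced dataset. A small worked sketch in plain Python, using made-up labels rather than the notebook's predictions:

```python
def f1_for_class(labels, preds, cls):
    """Per-class F1 from true positives, false positives, and false negatives."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(labels))
    return sum(f1_for_class(labels, preds, c) for c in classes) / len(classes)

labels = [0, 1] * 50          # balanced toy labels
preds = [1] * len(labels)     # always predict POSITIVE
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
print(accuracy, round(macro_f1(labels, preds), 4))  # prints: 0.5 0.3333
```

The positive class reaches an F1 of 2/3 and the negative class 0, so the macro average is 1/3. The recorded baseline value of 0.3458 sits just above that, which is consistent with a model that predicts the other class only occasionally.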


In [13]:
import pandas as pd
import random

def evaluate_and_show_samples(trainer, dataset, quantity=10, random_seed=42):
    """Run predictions on a dataset and return a random sample as a DataFrame."""
    
    # Run the predictions
    results = trainer.predict(dataset)

    # Build a DataFrame with the texts, predicted classes, and true labels
    df = pd.DataFrame({
        "text": dataset["text"],  # Column access, no explicit loop needed
        "predictions": results.predictions.argmax(axis=1),
        "labels": results.label_ids,
    })

    # Show the full text of each review in pandas output
    pd.set_option("display.max_colwidth", None)

    # Return a random sample of the DataFrame
    return df.sample(n=quantity, random_state=random_seed).reset_index(drop=True)

In [14]:
print("\nResults from model before PEFT:")
df_before = evaluate_and_show_samples(eval_trainer, eval_dataset, quantity=10)
df_before


Results from model before PEFT:


Unnamed: 0,text,predictions,labels
0,"cool gadgets and creatures keep this fresh . not as good as the original , but what is . . .",1,1
1,an awful movie that will only satisfy the most emotionally malleable of filmgoers .,1,0
2,. . . you can be forgiven for realizing that you've spent the past 20 minutes looking at your watch and waiting for frida to just die already .,1,0
3,"though uniformly well acted , especially by young ballesta and galan ( a first-time actor ) , writer/director achero manas's film is schematic and obvious .",1,0
4,absolutely ( and unintentionally ) terrifying .,1,0
5,"shanghai ghetto , much stranger than any fiction , brings this unknown slice of history affectingly to life .",1,1
6,"while hoffman's performance is great , the subject matter goes nowhere .",1,0
7,"works because we're never sure if ohlinger's on the level or merely a dying , delusional man trying to get into the history books before he croaks .",1,1
8,"as a science fiction movie , "" minority report "" astounds .",1,1
9,"what one is left with , even after the most awful acts are committed , is an overwhelming sadness that feels as if it has made its way into your very bloodstream .",1,1


## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [15]:
# 6. PEFT Configuration and Training
from peft import LoraConfig, get_peft_model

# Freeze base model parameters
for param in model.parameters():
    param.requires_grad = False

# Configure LoRA adapter
peft_config = LoraConfig(
    task_type="SEQ_CLS",
    r=8,
    lora_alpha=64,
    lora_dropout=0.2,
    target_modules=['c_attn', 'c_proj'],
    bias="none"
)
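
As a rough sketch of what these settings control: LoRA leaves a weight matrix `W` frozen and adds a low-rank update scaled by `lora_alpha / r`, which is 64 / 8 = 8 for the config above. The toy example below uses plain Python lists with tiny, made-up dimensions (it is not PEFT's implementation) and relies on the standard LoRA initialization in which `B` starts at zero, so the adapted layer initially reproduces the base layer.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, r, lora_alpha):
    """Compute W x + (lora_alpha / r) * B (A x)."""
    scaling = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]

r, lora_alpha = 2, 16                   # toy values with the same 8x scaling as r=8, lora_alpha=64
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # frozen base weight (2 x 3)
A = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # trainable, r x 3
B_init = [[0.0, 0.0], [0.0, 0.0]]       # trainable, 2 x r, zero at initialization
x = [1.0, 2.0, 3.0]

out_init = lora_forward(W, A, B_init, x, r, lora_alpha)
B_trained = [[0.5, 0.0], [0.0, 0.25]]   # hypothetical values after some training
out_trained = lora_forward(W, A, B_trained, x, r, lora_alpha)
print(out_init, out_trained)
```

With `B` at zero the output equals `W x`; once `B` moves away from zero, the update is amplified by the scaling factor, which is why a relatively high `lora_alpha` makes the adapter's contribution larger for a given rank.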

In [16]:
# Create PEFT model
peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()



trainable params: 814,080 || all params: 125,253,888 || trainable%: 0.6499438963523432
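
The trainable-parameter count can be reproduced by hand. The sketch below assumes the standard GPT-2 small dimensions (12 transformer blocks, hidden size 768, MLP size 3072) and that the `c_proj` pattern matches both the attention and the MLP output projections in each block. Attributing the remaining 3,072 parameters to two copies of the 768 × 2 `score` head is an inference from the numbers, not something the notebook states.

```python
R = 8                                   # LoRA rank from the config above
N_LAYERS, HIDDEN, MLP = 12, 768, 3072   # assumed GPT-2 small dimensions

def lora_params(in_features, out_features, r=R):
    """A is r x in_features, B is out_features x r."""
    return r * (in_features + out_features)

per_block = (
    lora_params(HIDDEN, 3 * HIDDEN)   # attn.c_attn: 768 -> 2304
    + lora_params(HIDDEN, HIDDEN)     # attn.c_proj: 768 -> 768
    + lora_params(MLP, HIDDEN)        # mlp.c_proj: 3072 -> 768
)
lora_total = N_LAYERS * per_block
score_head = HIDDEN * 2               # classification head, no bias
remainder = 814_080 - lora_total      # reported trainable params minus LoRA weights

print(per_block, lora_total, remainder, remainder // score_head)
```

Under these assumptions the LoRA matrices account for 811,008 parameters and the remainder is exactly twice the size of the classification head.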


In [17]:
# Configure training arguments
training_args = TrainingArguments(
    output_dir="./peft_results",
    learning_rate=3e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",
    num_train_epochs=1,
    logging_steps=10,
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none"
)

In [18]:
# Initialize PEFT trainer
peft_trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)

In [19]:
# Execute PEFT training
print("\nStarting PEFT Training...")
peft_trainer.train()


Starting PEFT Training...


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Precision Macro,Recall Macro
1,0.3673,0.329269,0.856004,0.855987,0.85617,0.856004


Checkpoint destination directory ./peft_results/checkpoint-534 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=534, training_loss=0.46670656213153167, metrics={'train_runtime': 193.8923, 'train_samples_per_second': 43.994, 'train_steps_per_second': 2.754, 'total_flos': 562538328883200.0, 'train_loss': 0.46670656213153167, 'epoch': 1.0})
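
The reported `global_step=534` follows from the dataset size and the batch size, assuming a single device and no gradient accumulation (the `TrainingArguments` above set neither explicitly). A quick check:

```python
import math

train_examples = 8530   # 4265 negative + 4265 positive
eval_examples = 2132    # combined validation and test splits
batch_size = 16

train_steps = math.ceil(train_examples / batch_size)
eval_steps = math.ceil(eval_examples / batch_size)
print(train_steps, eval_steps)  # prints: 534 134
```

The 134 evaluation steps over the baseline's 18.2571 s runtime also work out to roughly 7.34 steps per second, in line with the logged `eval_steps_per_second`.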

In [20]:
# 7. Final Evaluation and Comparison
print("\nEvaluating PEFT Model...")
peft_metrics = peft_trainer.evaluate()


Evaluating PEFT Model...


In [21]:
# Performance comparison table
print("\n=== Performance Comparison ===")
print(f"{'Metric':<20} | {'Baseline':<10} | {'PEFT':<10}")
print("-" * 45)
for key in baseline_metrics:
    if key in peft_metrics:
        base_val = f"{baseline_metrics[key]:.4f}" if isinstance(baseline_metrics[key], float) else str(baseline_metrics[key])
        peft_val = f"{peft_metrics[key]:.4f}" if isinstance(peft_metrics[key], float) else str(peft_metrics[key])
        print(f"{key.upper():<20} | {base_val:<10} | {peft_val:<10}")


=== Performance Comparison ===
Metric               | Baseline   | PEFT      
---------------------------------------------
EVAL_LOSS            | 0.8969     | 0.3293    
EVAL_ACCURACY        | 0.4991     | 0.8560    
EVAL_F1_MACRO        | 0.3458     | 0.8560    
EVAL_PRECISION_MACRO | 0.4851     | 0.8562    
EVAL_RECALL_MACRO    | 0.4991     | 0.8560    
EVAL_RUNTIME         | 18.2571    | 19.3549   
EVAL_SAMPLES_PER_SECOND | 116.7770   | 110.1530  
EVAL_STEPS_PER_SECOND | 7.3400     | 6.9230    
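
The longer metric names overflow the fixed 20-character column in the table above. A minimal sketch that sizes the first column from the longest key is shown below; the two dictionaries copy a few values from the recorded output, and `format_comparison` is a hypothetical helper name.

```python
def format_comparison(baseline, tuned):
    """Return aligned table lines for the metrics present in both dicts."""
    keys = [k for k in baseline if k in tuned]
    width = max([len("Metric")] + [len(k) for k in keys])
    lines = [f"{'Metric':<{width}} | {'Baseline':>10} | {'PEFT':>10}"]
    lines.append("-" * len(lines[0]))
    for k in keys:
        lines.append(f"{k.upper():<{width}} | {baseline[k]:>10.4f} | {tuned[k]:>10.4f}")
    return lines

baseline = {"eval_loss": 0.8969, "eval_accuracy": 0.4991, "eval_samples_per_second": 116.777}
tuned = {"eval_loss": 0.3293, "eval_accuracy": 0.8560, "eval_samples_per_second": 110.153}
table = format_comparison(baseline, tuned)
print("\n".join(table))
```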


In [22]:
peft_model.save_pretrained("./model/gpt2-rotten-tomatoes-lora")
tokenizer.save_pretrained("./model/gpt2-rotten-tomatoes-lora")

('./model/gpt2-rotten-tomatoes-lora/tokenizer_config.json',
 './model/gpt2-rotten-tomatoes-lora/special_tokens_map.json',
 './model/gpt2-rotten-tomatoes-lora/vocab.json',
 './model/gpt2-rotten-tomatoes-lora/merges.txt',
 './model/gpt2-rotten-tomatoes-lora/added_tokens.json',
 './model/gpt2-rotten-tomatoes-lora/tokenizer.json')

In [23]:
print("\nResults from model after PEFT:")
df_after = evaluate_and_show_samples(peft_trainer, eval_dataset, quantity=10)
df_after


Results from model after PEFT:


Unnamed: 0,text,predictions,labels
0,"cool gadgets and creatures keep this fresh . not as good as the original , but what is . . .",1,1
1,an awful movie that will only satisfy the most emotionally malleable of filmgoers .,0,0
2,. . . you can be forgiven for realizing that you've spent the past 20 minutes looking at your watch and waiting for frida to just die already .,0,0
3,"though uniformly well acted , especially by young ballesta and galan ( a first-time actor ) , writer/director achero manas's film is schematic and obvious .",1,0
4,absolutely ( and unintentionally ) terrifying .,1,0
5,"shanghai ghetto , much stranger than any fiction , brings this unknown slice of history affectingly to life .",1,1
6,"while hoffman's performance is great , the subject matter goes nowhere .",0,0
7,"works because we're never sure if ohlinger's on the level or merely a dying , delusional man trying to get into the history books before he croaks .",0,1
8,"as a science fiction movie , "" minority report "" astounds .",1,1
9,"what one is left with , even after the most awful acts are committed , is an overwhelming sadness that feels as if it has made its way into your very bloodstream .",1,1
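
For a quick read of the two sample tables, the predictions can be compared against the labels directly. The lists below are copied from the ten sampled rows shown before and after fine-tuning. Ten rows are far too few to estimate accuracy, so this only illustrates the shift away from a constant prediction.

```python
labels = [1, 0, 0, 0, 0, 1, 0, 1, 1, 1]
preds_before = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
preds_after = [1, 0, 0, 1, 1, 1, 0, 0, 1, 1]

def sample_accuracy(preds, labels):
    """Fraction of positions where prediction and label agree."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Row indices whose prediction changed after fine-tuning
changed = [i for i, (b, a) in enumerate(zip(preds_before, preds_after)) if b != a]
print(sample_accuracy(preds_before, labels), sample_accuracy(preds_after, labels), changed)
```

On these rows the model goes from 5 of 10 correct to 7 of 10. Rows 1, 2, and 6 flip to the correct NEGATIVE label, while row 7 flips from a correct to an incorrect prediction.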


## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [24]:
# Load saved PEFT model for inference
from peft import AutoPeftModelForSequenceClassification

peft_loaded = AutoPeftModelForSequenceClassification.from_pretrained("./model/gpt2-rotten-tomatoes-lora")

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [25]:
# Sample inference function: classify the sentiment of a single review.
# Note: this calls the in-memory `model`, which get_peft_model() wrapped in place,
# so it already carries the trained LoRA weights; `peft_loaded` is not used here.
def predict_sentiment(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
    inputs = {key: value.to(device) for key, value in inputs.items()}  # Move to the GPU if available

    with torch.no_grad():
        outputs = model(**inputs)  # Run inference

    prediction = torch.argmax(outputs.logits, dim=-1).item()  # Predicted class (0 or 1)
    return model.config.id2label[prediction]
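
`predict_sentiment` returns only the arg-max label. If a confidence score is also wanted, the logits can be passed through a softmax. A small sketch in plain Python with made-up logits follows (in the notebook one would more likely apply `torch.softmax` to `outputs.logits`).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}   # same mapping as the model config
logits = [-1.2, 2.3]                        # hypothetical logits for one review
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(ID2LABEL[best], round(probs[best], 3))
```

Subtracting the maximum logit before exponentiating leaves the result unchanged but avoids overflow for large logits.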


In [26]:
# Test with sample reviews
test_samples = [
    "A masterpiece of cinematic excellence!",
    "A tedious and poorly executed film.",
    "The actors delivered mediocre performances in a weak script.",
    
]

In [27]:
print("\nSample Predictions:")
for sample in test_samples:
    prediction = predict_sentiment(sample)  # Call the prediction function defined above
    print(f"- {sample}")
    print(f"  Prediction: {prediction}\n")



Sample Predictions:
- A masterpiece of cinematic excellence!
  Prediction: POSITIVE

- A tedious and poorly executed film.
  Prediction: NEGATIVE

- The actors delivered mediocre performances in a weak script.
  Prediction: NEGATIVE



In [28]:
# Positive review (likely misclassified as negative)
pos_review = "Everything about this film feels off—the pacing is slow, the dialogue awkward, and yet, by the end, it’s a stunning masterpiece that lingers in your mind."

# Negative review (likely misclassified as positive)
neg_review = "The film is beautifully shot and well-acted, but beneath its polished surface, it lacks any real heart or emotional depth."


In [29]:
print(f"Classification for positive review: {predict_sentiment(pos_review)}")
print(f"Classification for negative review: {predict_sentiment(neg_review)}")

Classification for positive review: POSITIVE
Classification for negative review: NEGATIVE


In [31]:
peft_model.print_trainable_parameters()

trainable params: 814,080 || all params: 125,253,888 || trainable%: 0.6499438963523432


In [32]:
peft_loaded.print_trainable_parameters()

trainable params: 3,072 || all params: 125,253,888 || trainable%: 0.002452618476801295


## 📊 Conclusion: Performance Comparison  

The results show a **significant improvement** in model performance after applying **PEFT (Parameter-Efficient Fine-Tuning)** with a LoRA adapter.  

### 🔹 **Key Improvements**  
- **Eval Loss** decreased from `0.8969` to `0.3293`, indicating that the fine-tuned model assigns much higher probability to the correct labels.  
- **Accuracy** increased from `49.91%` to `85.60%`. The baseline sits at chance level on this balanced dataset because its classification head was randomly initialized.  
- **F1-Score Macro** improved from `0.3458` to `0.8560`. The low baseline value reflects a model that predicts almost exclusively one class, as the sample predictions before fine-tuning show.  
- **Macro Precision and Recall** rose from `48.51%` and `49.91%` to `85.62%` and `85.60%`, so the gain is spread across both classes rather than concentrated in one.  

### 🔸 **Trade-offs**  
- The evaluation time increased **slightly** (`18.26s → 19.35s`), likely because the unmerged LoRA adapters add computation to each forward pass.  
- Throughput decreased slightly (`116.78 samples/s → 110.15 samples/s`), a drop of about 6% that is small compared to the improvement in prediction quality.  

### ✅ **Final Conclusion**  
Using **PEFT** resulted in a **much more accurate model**, with a **better balance between precision and recall**, while training only 814,080 parameters (about 0.65% of the total) for a single epoch. Despite the minor increase in evaluation runtime, the performance gains make this approach highly beneficial. 🚀  
