# Lightweight Fine-Tuning Project

## Description of Choices

* PEFT technique: LoRA (Low-Rank Adaptation) is the selected PEFT technique since it is widely supported across transformer architectures, including GPT-2.
* Model: GPT-2 is the selected model since it is relatively small and compatible with both sequence classification and LoRA.
* Evaluation approach: Evaluation is conducted using Hugging Face's `Trainer` class, which simplifies the training and evaluation workflow. Accuracy and F1-score are computed by a `compute_metrics` function, which allows the performance of the original model to be compared against that of the fine-tuned model.
* Fine-tuning dataset: The IMDb dataset from the Hugging Face `datasets` library is used for fine-tuning. This dataset is a standard benchmark for binary sentiment classification, which makes it a good fit in this context.
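
To make the LoRA choice above more concrete, the sketch below counts trainable parameters for a single adapted layer. It assumes the adapter is attached to GPT-2's fused attention projection (`c_attn`, 768 inputs and 2304 outputs in the smallest GPT-2) and uses a rank of `r = 8`. Both the target layer and the rank are illustrative assumptions, since the configuration used in this project is not shown here.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    LoRA freezes the original weight and learns two small matrices,
    A (rank x d_in) and B (d_out x rank), whose product is added to it.
    """
    full = d_in * d_out            # parameters updated by full fine-tuning
    lora = rank * (d_in + d_out)   # parameters in A and B combined
    return full, lora


# Assumed shapes: GPT-2 (small) c_attn layer; rank 8 is a hypothetical choice.
full, lora = lora_param_counts(d_in=768, d_out=2304, rank=8)
print(f"full fine-tuning: {full:,} weights")
print(f"LoRA (r=8):       {lora:,} weights ({lora / full:.2%} of full)")
```

Under these assumptions, the adapter trains only a small fraction of the weights in that layer, which is the main reason LoRA is practical on modest hardware.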

## Loading and Evaluating Foundation Model

Load the pre-trained GPT-2 model from Hugging Face and evaluate its performance prior to fine-tuning.

The following steps are taken:
- Load the GPT-2 model and tokenizer.
- Preprocess the IMDb dataset for sequence classification.
- Evaluate the baseline performance using accuracy and F1-score.
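
The accuracy and F1 evaluation in the last step can be sketched in plain Python. This is not the project's `compute_metrics` implementation, which lives in the `light_weight_finetuning` module and is not shown here. The sketch assumes the function receives a `(logits, labels)` pair and that F1 is averaged across classes weighted by support; both are assumptions.

```python
from collections import Counter


def compute_metrics(eval_pred):
    """Compute accuracy and support-weighted F1 from (logits, labels)."""
    logits, labels = eval_pred
    labels = list(labels)
    # Predicted class = index of the largest logit in each row.
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]

    n = len(labels)
    accuracy = sum(p == y for p, y in zip(preds, labels)) / n

    weighted_f1 = 0.0
    for cls, support in Counter(labels).items():
        tp = sum(p == cls and y == cls for p, y in zip(preds, labels))
        fp = sum(p == cls and y != cls for p, y in zip(preds, labels))
        fn = support - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        weighted_f1 += f1 * support / n  # weight each class by its share

    return {"accuracy": accuracy, "f1": weighted_f1}


# Toy example with made-up logits for four reviews.
metrics = compute_metrics(([[2, 1], [0, 3], [1, 0], [0, 1]], [0, 1, 1, 1]))
print(metrics)
```

A library routine such as scikit-learn's `f1_score` would normally replace the manual loop; it is written out here only to show what the metric measures.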

In [187]:
import importlib
import light_weight_finetuning as lft

importlib.reload(lft);

In [188]:
# Define the model and the output directory
MODEL_NAME = 'gpt2'
OUTPUT_DIR = 'gpt-lora'
DATA_SET = 'imdb'

base_model, tokenizer = lft.load_base_model(MODEL_NAME, num_labels=2)
train_dataset, test_dataset = lft.prepare_dataset(DATA_SET, tokenizer)
trainer = lft.create_trainer(base_model, train_dataset, test_dataset)

# Evaluate the model
baseline_metrics = trainer.evaluate()
print("Baseline Performance:", baseline_metrics)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Baseline Performance: {'eval_loss': 0.8299095630645752, 'eval_model_preparation_time': 0.0011, 'eval_accuracy': 0.51, 'eval_f1': 0.35017791538992193, 'eval_runtime': 62.2568, 'eval_samples_per_second': 8.031, 'eval_steps_per_second': 8.031}


#### Key Observation ####
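With an accuracy of 0.51 and an F1 score of 0.35, the base model performs at roughly chance level on this binary task. This is expected: as the warning above notes, the classification head (`score.weight`) is newly initialized and has not been trained yet, so there is substantial room for improvement.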

## Performing Parameter-Efficient Fine-Tuning

Create a PEFT model from the loaded model, run a training loop, and save the PEFT model weights.

In this section:
- A LoRA configuration is created with specified parameters for adaptation.
- The PEFT model is trained and fine-tuned parameters are saved for later use.
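
The `create_peft_model` helper used in this section is defined in the project's module, and its settings are not shown. A minimal sketch of what such a helper could look like with the `peft` library follows. The rank, scaling factor, dropout, and target module are illustrative assumptions, not the values behind the results reported here.

```python
# Hypothetical hyperparameters; the project's actual values are not shown.
LORA_KWARGS = {
    "r": 8,                        # rank of the low-rank update
    "lora_alpha": 16,              # scaling factor applied to the update
    "lora_dropout": 0.1,           # dropout on the adapter input
    "target_modules": ["c_attn"],  # GPT-2's fused attention projection
    "fan_in_fan_out": True,        # GPT-2 uses Conv1D layers with transposed weights
}


def create_peft_model(base_model, lora_kwargs=None):
    """Wrap a sequence-classification model with LoRA adapters (sketch)."""
    # Imported lazily so the module can be loaded without peft installed.
    from peft import LoraConfig, TaskType, get_peft_model

    config = LoraConfig(task_type=TaskType.SEQ_CLS, **(lora_kwargs or LORA_KWARGS))
    return get_peft_model(base_model, config)
```

With a configuration like this, only the adapter matrices and the classification head are trained, while the rest of GPT-2 stays frozen.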

In [189]:
# Create the PEFT model
peft_model = lft.create_peft_model(base_model)

# Update the Trainer to use the PEFT model
trainer.model = peft_model

# Train the model
trainer.train()

# Save the PEFT model weights
peft_model.save_pretrained(OUTPUT_DIR)



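| Epoch | Training Loss | Validation Loss | Model Preparation Time | Accuracy | F1 |
|-------|---------------|-----------------|------------------------|----------|----------|
| 1 | 1.477 | 1.14949 | 0.003 | 0.512 | 0.351106 |
| 2 | 0.7287 | 0.699181 | 0.003 | 0.814 | 0.812875 |
| 3 | 0.0014 | 0.782918 | 0.003 | 0.836 | 0.835419 |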


#### Key Observations ####
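- The model improved strongly during training, with accuracy rising from 51.2% after epoch 1 to 83.6% after epoch 3.
- There are signs of overfitting in epoch 3: training loss dropped to 0.0014 while validation loss rose from 0.699 to 0.783.
- Most of the gain in accuracy and F1 came between epochs 1 and 2; epoch 3 added only a modest improvement (accuracy 0.814 to 0.836).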
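
The overfitting signal in these observations can be checked directly from the logged values. The sketch below copies the three epochs from the training output and compares two ways of choosing a checkpoint; the selection rules are illustrative and were not part of the original run.

```python
# Per-epoch values copied from the training log above.
history = [
    {"epoch": 1, "train_loss": 1.477, "val_loss": 1.14949, "accuracy": 0.512},
    {"epoch": 2, "train_loss": 0.7287, "val_loss": 0.699181, "accuracy": 0.814},
    {"epoch": 3, "train_loss": 0.0014, "val_loss": 0.782918, "accuracy": 0.836},
]

# A widening gap between validation and training loss hints at overfitting.
for row in history:
    gap = row["val_loss"] - row["train_loss"]
    print(f"epoch {row['epoch']}: val - train loss = {gap:+.4f}")

best_by_loss = min(history, key=lambda row: row["val_loss"])
best_by_accuracy = max(history, key=lambda row: row["accuracy"])
print("lowest validation loss: epoch", best_by_loss["epoch"])
print("highest accuracy:       epoch", best_by_accuracy["epoch"])
```

The two criteria disagree here, so the choice of checkpoint depends on which metric matters more. `TrainingArguments` offers `load_best_model_at_end` and `metric_for_best_model` for automating this kind of selection, although whether this project used them is not shown.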

## Performing Inference with a PEFT Model

Load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Compare the results to those obtained before fine-tuning.

In [190]:
# Load and Evaluate saved fine-tuned model
fine_tuned_metrics, fine_tuned_model = lft.load_saved_model(base_model, OUTPUT_DIR, test_dataset)

# Print comparison of metrics
print("\nPerformance Comparison:")
print("-" * 50)
print("Metric       | Baseline | Fine-tuned")
print("-" * 50)
print(f"Accuracy     | {baseline_metrics['eval_accuracy']:.4f}   | {fine_tuned_metrics['eval_accuracy']:.4f}")
print(f"F1 Score     | {baseline_metrics['eval_f1']:.4f}   | {fine_tuned_metrics['eval_f1']:.4f}")
print(f"Loss         | {baseline_metrics['eval_loss']:.4f}   | {fine_tuned_metrics['eval_loss']:.4f}")
print("-" * 50)



Performance Comparison:
--------------------------------------------------
Metric       | Baseline | Fine-tuned
--------------------------------------------------
Accuracy     | 0.5100   | 0.8360
F1 Score     | 0.3502   | 0.8354
Loss         | 0.8299   | 0.7829
--------------------------------------------------


#### Key Observations ####
- The PEFT model significantly outperforms the base model on the test set: accuracy rises from 0.5100 to 0.8360 and F1 from 0.3502 to 0.8354.
- The evaluation loss decreases only slightly (0.8299 to 0.7829), so the improvement shows up mainly in the classification metrics rather than in the loss.
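
To judge how robust this comparison is, it helps to put a rough confidence interval around each accuracy. The baseline log reports about 8.03 samples per second over 62.3 seconds, which suggests an evaluation subset of roughly 500 reviews. That sample size is an inference from the log, not a stated figure, and the interval below uses a simple normal approximation.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Approximate 95% confidence interval for an accuracy over n samples."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - z * standard_error, accuracy + z * standard_error


N_EVAL = 500  # assumption: inferred from eval_runtime x eval_samples_per_second

baseline_ci = accuracy_interval(0.5100, N_EVAL)
fine_tuned_ci = accuracy_interval(0.8360, N_EVAL)
print(f"baseline:   {baseline_ci[0]:.3f} to {baseline_ci[1]:.3f}")
print(f"fine-tuned: {fine_tuned_ci[0]:.3f} to {fine_tuned_ci[1]:.3f}")
```

Under these assumptions, the baseline interval includes 0.5 (chance level for two classes), while the fine-tuned interval lies well above it, so the improvement is unlikely to be a sampling artifact.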

In [191]:
# Demonstrate inference on sample texts
sample_texts = [
    "The movie was absolutely wonderful! A masterpiece.",
    "Terrible movie. I would not recommend it to anyone.",
    "It was just okay, nothing too special.",
]
inputs = tokenizer(sample_texts, return_tensors="pt", padding=True, truncation=True, max_length=512)

predictions = lft.predictions(fine_tuned_model, inputs)

# Make a dataframe with the sample texts, predictions, and predicted labels
import pandas as pd
df = pd.DataFrame({
    "text": sample_texts,
    "prediction": predictions,
    "predicted_label": [base_model.config.id2label[p] for p in predictions]
})

# Show all the cells in the dataframe
pd.set_option("display.max_colwidth", None)
print(df)

                                                  text  prediction  \
0   The movie was absolutely wonderful! A masterpiece.           1   
1  Terrible movie. I would not recommend it to anyone.           0   
2               It was just okay, nothing too special.           0   

  predicted_label  
0        positive  
1        negative  
2        negative  


#### Key Observations ####

- The model accurately classified one review as positive (1), which was a highly favorable comment about the movie.
- It correctly identified a negative review as negative (0), showcasing its ability to discern critical feedback.
- A neutral or mixed review was also classified as negative (0). Because the model is trained on binary labels, it has no neutral class and must assign every input to positive or negative, so it cannot express mixed sentiment.

Overall, the model performed well in identifying clear positive and negative sentiments but might need refinement to handle more nuanced or neutral statements effectively.
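
One possible refinement for neutral statements is to look at the model's confidence rather than only its top label. The sketch below converts a pair of logits into probabilities with a softmax and reports "uncertain" when the top probability falls below a threshold. The logits and the 0.75 threshold are made up for illustration; they are not outputs of the fine-tuned model.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_abstain(logits, id2label, threshold=0.75):
    """Return (label, confidence), or ('uncertain', confidence) if unsure."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "uncertain", probs[best]
    return id2label[best], probs[best]


id2label = {0: "negative", 1: "positive"}
# Hypothetical logits: one confident prediction and one near the boundary.
print(label_with_abstain([-2.0, 3.0], id2label))
print(label_with_abstain([0.3, 0.1], id2label))
```

An abstain rule like this does not teach the model what neutral means, but it can flag borderline reviews for separate handling.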