# Lightweight Fine-Tuning Project

This is my submission for Project 1 - 20240330.

This project brings together all of the essential components of a PyTorch + Hugging Face training and inference process. Specifically:
* Load a pre-trained model and evaluate its performance
* Perform parameter-efficient fine-tuning using the pre-trained model
* Perform inference using the fine-tuned model and compare its performance to the original model

Choices for this project:
* PEFT technique: LoRA
* Model: GPT-2
* Evaluation approach: Hugging Face Trainer
* Fine-tuning dataset: Rotten Tomatoes (https://huggingface.co/datasets/rotten_tomatoes) 

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
# Install the required version of datasets in case you have an older version
! pip install -q "datasets==2.15.0"
#! pip install transformers
#! pip install peft
#! pip install datasets
#! pip install pandas
#! pip install numpy
! pip install scikit-learn
#! pip install tqdm




Defaulting to user installation because normal site-packages is not writeable


In [2]:
from datasets import load_dataset

# Number of epochs for this run
number_epochs = 4


# Use the rotten_tomatoes dataset from Hugging Face, which contains 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews.
# It already comes with train, test and validation splits of 8,530, 1,066 and 1,066 entries respectively.
dataset = load_dataset("rotten_tomatoes")


# Names of the splits to process: train, test and validation
splits = ["train", "test", "validation"]


#for testing, decrease the number of entries to 500.
#for split in splits: 
#    dataset[split] = dataset[split].shuffle(seed=42).select(range(500))



# View the dataset characteristics
dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [3]:
# Inspect the first example. Is this a positive or negative review?
dataset["train"][0]


{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'label': 1}



## Pre-process datasets
Now we are going to process our datasets by converting all the text into tokens for our models.

In [4]:
from transformers import AutoTokenizer

# Convert the text into tokens the model can understand
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# GPT-2 has no padding token by default, so reuse the end-of-text token
tokenizer.pad_token = tokenizer.eos_token

# Truncate each review to the model's maximum length. Padding is not applied here;
# DataCollatorWithPadding pads each batch to its longest review at training time.
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_dataset = {}

# Tokenize the datasets
for split in splits:
    tokenized_dataset[split] = dataset[split].map(preprocess_function, batched=True)

  
# Inspect the available columns in the dataset
tokenized_dataset["train"]


Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 8530
})
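
The tokenized reviews above all have different lengths, so they still need padding before they can be stacked into a batch. As a rough illustration of what per-batch ("dynamic") padding does, here is a minimal pure-Python sketch. It is not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and 50256 is GPT-2's end-of-text id, which this notebook reuses as the padding id.

```python
# Minimal sketch of per-batch padding; pad_batch is a hypothetical helper,
# not part of transformers.
PAD_ID = 50256  # GPT-2 end-of-text token id, reused here as the padding id


def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids, for illustration only
batch = [[11, 22, 33], [44], [55, 66]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding to the longest review in each batch, rather than to a fixed length, keeps batches of short reviews small.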

## Load and set up the model
In this case we are doing full fine-tuning, so we unfreeze all parameters.

In [5]:
from transformers import AutoModelForSequenceClassification, DataCollatorWithPadding


# Load GPT-2 with a sequence classification head that outputs two labels
model = AutoModelForSequenceClassification.from_pretrained(
    "gpt2",
    num_labels=2,
    id2label={0: "Negative review", 1: "Positive review"},
    label2id={"Negative review": 0, "Positive review": 1},
)
model.config.pad_token_id = tokenizer.pad_token_id
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)



# Unfreeze all the model parameters since we will train the whole model
for param in model.parameters():
    param.requires_grad = True
      
    
print(model)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=2, bias=False)
)


## Let's prepare the model to be trained, but...

... before training, let's evaluate its current accuracy.


In [6]:

import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support




# Define the metrics
def compute_metrics(eval_pred):
    predictions, labels = eval_pred   
    predictions = np.argmax(predictions, axis=1)    
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='weighted')
    return {"accuracy": accuracy_score(labels, predictions), "f1": f1, "precision": precision, "recall": recall}


# Define the training arguments
training_args = TrainingArguments(
        output_dir="./results",
        # Set the learning rate
        learning_rate = 2e-5,
        # Set the per device train batch size and eval batch size
        per_device_train_batch_size = 32,
        per_device_eval_batch_size = 64,
        # Evaluate and save the model after each epoch
        evaluation_strategy = "epoch", 
        save_strategy = "epoch",
        logging_dir="./logs",
        num_train_epochs=number_epochs,
        weight_decay=0.01,
        load_best_model_at_end=True,
        logging_steps=100,
        warmup_ratio=0.1,
    )

# The HuggingFace Trainer class handles the training and eval loop for PyTorch for us.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)



In [7]:
# Evaluate the model on the test split before training it on the labeled Rotten Tomatoes (train) data

trainer.evaluate()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 0.750957190990448,
 'eval_accuracy': 0.5393996247654784,
 'eval_f1': 0.5355540233924044,
 'eval_precision': 0.5407492354740061,
 'eval_recall': 0.5393996247654784,
 'eval_runtime': 4.7934,
 'eval_samples_per_second': 222.389,
 'eval_steps_per_second': 3.547}
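
Note that eval_recall and eval_accuracy are identical in the output above. With average='weighted' this is expected: each class's recall is weighted by its share of the labels, so the weighted sum collapses to "correct predictions / all predictions". The sketch below checks this on made-up labels using pure Python; `accuracy` and `weighted_recall` are illustrative helpers, not the scikit-learn functions used in the notebook.

```python
from collections import Counter


def accuracy(labels, preds):
    """Fraction of predictions that match the label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def weighted_recall(labels, preds):
    """Per-class recall, weighted by each class's share of the labels."""
    support = Counter(labels)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for l, p in zip(labels, preds) if l == cls and p == cls)
        total += (n / len(labels)) * (tp / n)
    return total


# Made-up labels and predictions, for illustration only
labels = [1, 1, 1, 0, 0, 1, 0, 1]
preds = [1, 0, 1, 0, 1, 1, 0, 0]
print(accuracy(labels, preds), weighted_recall(labels, preds))
```

So of the four metrics reported here, recall adds no information beyond accuracy; precision and F1 are the ones that can differ.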

In [8]:
# Train the model

trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.4426 | 0.369217 | 0.835835 | 0.835184 | 0.841227 | 0.835835 |
| 2 | 0.3211 | 0.34163 | 0.847092 | 0.847092 | 0.847093 | 0.847092 |
| 3 | 0.2642 | 0.342974 | 0.84803 | 0.848025 | 0.848074 | 0.84803 |
| 4 | 0.2248 | 0.354061 | 0.859287 | 0.859287 | 0.859292 | 0.859287 |


Checkpoint destination directory ./results/checkpoint-267 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./results/checkpoint-534 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=1068, training_loss=0.33878634663556845, metrics={'train_runtime': 413.8636, 'train_samples_per_second': 82.443, 'train_steps_per_second': 2.581, 'total_flos': 876339287654400.0, 'train_loss': 0.33878634663556845, 'epoch': 4.0})
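
The global_step=1068 above (and the checkpoint-267 / checkpoint-534 directory names) can be reproduced from the training arguments. The sketch below does that arithmetic and also outlines a linear warmup followed by a linear decay. It assumes a single device, no gradient accumulation, and the Trainer's default linear scheduler; `linear_lr` is an illustrative helper, not the library's scheduler.

```python
import math

train_rows = 8530     # size of the train split
batch_size = 32       # per_device_train_batch_size (assuming one device)
epochs = 4            # number_epochs
base_lr = 2e-5        # learning_rate
warmup_ratio = 0.1    # warmup_ratio

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
# Assumption: the warmup length is the ratio of total steps, rounded up
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_lr(step):
    """Learning rate at a given step: linear ramp up, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


print(steps_per_epoch, total_steps, warmup_steps)
print(linear_lr(0), linear_lr(warmup_steps), linear_lr(total_steps))
```

Under these assumptions one epoch is 267 steps, which matches the checkpoint numbering, and the peak learning rate of 2e-5 is only reached after roughly the first tenth of training.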

## View some data

Let's look at a few examples

In [9]:
# Make a dataframe with the predictions and the text and the labels
import pandas as pd

# Hand-picked entries from the test split
items_for_manual_review = tokenized_dataset["test"].select(
    [0, 1, 22, 31, 43, 29, 48, 287]
)

# Predict the results for entries above
results = trainer.predict(items_for_manual_review)
df = pd.DataFrame(
    {
        "review": [item["text"] for item in items_for_manual_review],
        "predictions": results.predictions.argmax(axis=1),
        "labels": results.label_ids,
    }
)

# Show the full text of each cell instead of truncating it
pd.set_option("display.max_colwidth", None)
df

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


|   | review | predictions | labels |
| --- | --- | --- | --- |
| 0 | lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness . | 1 | 1 |
| 1 | consistently clever and suspenseful . | 1 | 1 |
| 2 | grown-up quibbles are beside the point here . the little girls understand , and mccracken knows that's all that matters . | 1 | 1 |
| 3 | the main story . . . is compelling enough , but it's difficult to shrug off the annoyance of that chatty fish . | 0 | 1 |
| 4 | the film has a laundry list of minor shortcomings , but the numerous scenes of gory mayhem are worth the price of admission . . . if " gory mayhem " is your idea of a good time . | 1 | 1 |
| 5 | a soul-stirring documentary about the israeli/palestinian conflict as revealed through the eyes of some children who remain curious about each other against all odds . | 1 | 1 |
| 6 | the film truly does rescue [the funk brothers] from motown's shadows . it's about time . | 1 | 1 |
| 7 | ya-yas everywhere will forgive the flaws and love the film . | 1 | 1 |


## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [10]:
from peft import LoraConfig, PeftModelForSequenceClassification, TaskType, AutoPeftModelForSequenceClassification


# PEFT model configuration
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=4,
    lora_alpha=16,
    lora_dropout=0.1
)

# Load the pre-trained GPT-2 model
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = model.config.eos_token_id

peft_model = PeftModelForSequenceClassification(model, peft_config)

# Print how many parameters remain trainable with LoRA
peft_model.print_trainable_parameters()

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 150,528 || all params: 124,590,336 || trainable%: 0.1208183594592762
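
The 150,528 trainable parameters reported above can be roughly accounted for by hand. The sketch below assumes LoRA was attached only to the c_attn projection of each of the 12 blocks (to my knowledge PEFT's default target for GPT-2, but treat it as an assumption) and that c_attn maps 768 features to 3 x 768 = 2,304. The interpretation of the leftover parameters as two copies of the score head is a guess, not something the output confirms.

```python
hidden = 768            # GPT-2 hidden size, from the printed model
n_layers = 12           # number of GPT2Block layers, from the printed model
r = 4                   # LoRA rank from LoraConfig
qkv_out = 3 * hidden    # assumed output width of c_attn (fused query/key/value)

# Each adapted layer adds two small matrices: A (r x 768) and B (2304 x r)
lora_per_layer = r * hidden + qkv_out * r
lora_total = n_layers * lora_per_layer

head = hidden * 2       # score: Linear(768 -> 2, bias=False)

reported_trainable = 150_528
reported_all = 124_590_336

leftover = reported_trainable - lora_total
print("LoRA parameters:", lora_total)
print("Leftover:", leftover, "=", leftover / head, "x the score head")
print("Trainable %:", 100 * reported_trainable / reported_all)
```

Under these assumptions, the LoRA matrices account for 147,456 parameters, and the remaining 3,072 is exactly twice the size of the 768 x 2 classification head.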


In [11]:
from transformers import EvalPrediction


# Define metrics
def compute_metrics(eval_pred):
    predictions, labels = eval_pred   
    predictions = np.argmax(predictions, axis=1)    
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='weighted')
    return {"accuracy": accuracy_score(labels, predictions), "f1": f1, "precision": precision, "recall": recall}




# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    num_train_epochs=number_epochs,
    weight_decay=0.01,
    logging_dir='./logs',
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=100,
    warmup_ratio=0.1,
)

# Initialize the Trainer with compute_metrics and arguments
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
)


In [12]:
# Evaluate the model prior to training
evaluation_results = trainer.evaluate()
print("Evaluation Results:", evaluation_results)

Evaluation Results: {'eval_loss': 3.5329086780548096, 'eval_accuracy': 0.49906191369606, 'eval_f1': 0.33291614518147683, 'eval_precision': 0.24976525821596246, 'eval_recall': 0.49906191369606, 'eval_runtime': 3.8267, 'eval_samples_per_second': 278.571, 'eval_steps_per_second': 4.443}
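
An accuracy of about 0.499 combined with a precision of about 0.25 suggests the untrained model is putting nearly every review in the same class. The sketch below reproduces the reported numbers under two assumptions that the output itself does not confirm: the test split is balanced (533 reviews per class), and the model predicted one class for 1,065 of the 1,066 reviews. `weighted_prf` is an illustrative pure-Python helper, and which class plays the "always predicted" role is arbitrary here.

```python
def weighted_prf(labels, preds):
    """Support-weighted precision, recall and F1 over all classes."""
    n = len(labels)
    precision = recall = f1 = 0.0
    for c in sorted(set(labels)):
        tp = sum(1 for l, p in zip(labels, preds) if l == c and p == c)
        predicted = sum(1 for p in preds if p == c)
        support = sum(1 for l in labels if l == c)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support / n
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1


# Assumed balanced test split: 533 positive, then 533 negative reviews
labels = [1] * 533 + [0] * 533
# Hypothetical predictions: class 0 everywhere, except one negative review
# that is wrongly predicted as class 1
preds = [0] * 533 + [0] * 532 + [1]

accuracy = sum(l == p for l, p in zip(labels, preds)) / len(labels)
precision, recall, f1 = weighted_prf(labels, preds)
print(accuracy, precision, recall, f1)
```

If that reading is right, the pre-training F1 of about 33% is what a near-constant classifier scores on a balanced binary task, so it is a floor rather than a sign that the model has learned anything.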


In [13]:
# Start training
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1.8635 | 0.723009 | 0.602251 | 0.601745 | 0.602774 | 0.602251 |
| 2 | 0.7123 | 0.645214 | 0.660413 | 0.658489 | 0.66411 | 0.660413 |
| 3 | 0.6655 | 0.618797 | 0.683865 | 0.681937 | 0.688434 | 0.683865 |
| 4 | 0.6554 | 0.60912 | 0.697936 | 0.696628 | 0.70141 | 0.697936 |


Checkpoint destination directory ./results/checkpoint-267 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./results/checkpoint-534 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./results/checkpoint-801 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./results/checkpoint-1068 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=1068, training_loss=1.0434400187003032, metrics={'train_runtime': 273.7944, 'train_samples_per_second': 124.619, 'train_steps_per_second': 3.901, 'total_flos': 877874337331200.0, 'train_loss': 1.0434400187003032, 'epoch': 4.0})

In [14]:
# Save the PEFT model so it can be loaded later
peft_model.save_pretrained('model/peft_model')


## Performing Inference with a PEFT Model

In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [15]:
# Load the saved PEFT model
inference_model = AutoPeftModelForSequenceClassification.from_pretrained(
    "model/peft_model",
    num_labels=2
)
inference_model.config.pad_token_id = inference_model.config.eos_token_id

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
# Run the predictions using the test dataset
trainer = Trainer(
    model=inference_model,
    args=training_args,
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Evaluate the model
evaluation_results_lora = trainer.evaluate()
print("Evaluation Results:", evaluation_results_lora)


Evaluation Results: {'eval_loss': 0.6091195344924927, 'eval_accuracy': 0.6979362101313321, 'eval_f1': 0.6966280615419423, 'eval_precision': 0.7014101558442488, 'eval_recall': 0.6979362101313321, 'eval_runtime': 3.8692, 'eval_samples_per_second': 275.506, 'eval_steps_per_second': 4.394}
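
To make the before/after comparison explicit, the sketch below puts the LoRA model's metrics from before training next to the ones printed above and reports the change in each. The values are copied from the two evaluation outputs in this notebook; the dictionary names are just illustrative.

```python
# Metrics of the LoRA model before training (from the earlier evaluation output)
before = {
    "eval_loss": 3.5329086780548096,
    "eval_accuracy": 0.49906191369606,
    "eval_f1": 0.33291614518147683,
}
# Metrics of the reloaded LoRA model after training (from the output above)
after = {
    "eval_loss": 0.6091195344924927,
    "eval_accuracy": 0.6979362101313321,
    "eval_f1": 0.6966280615419423,
}

deltas = {name: after[name] - before[name] for name in before}
for name, delta in deltas.items():
    print(f"{name}: {before[name]:.4f} -> {after[name]:.4f} ({delta:+.4f})")
```

The reloaded model also matches the epoch 4 row of the LoRA training table, which suggests the saved adapter was restored as trained.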


## Make some predictions with the fine-tuned model


In [17]:
# Make a dataframe with the predictions and the text and the labels
import pandas as pd

# Select some random entries from the test dataset
items_for_manual_review = tokenized_dataset["test"].select(
    [4, 54, 222, 331, 413, 129, 8, 87]
)

results = trainer.predict(items_for_manual_review)
df = pd.DataFrame(
    {
        "review": [item["text"] for item in items_for_manual_review],
        "predictions": results.predictions.argmax(axis=1),
        "labels": results.label_ids,
    }
)
# Show the full text of each cell instead of truncating it
pd.set_option("display.max_colwidth", None)
df

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


|   | review | predictions | labels |
| --- | --- | --- | --- |
| 0 | red dragon " never cuts corners . | 0 | 1 |
| 1 | the piano teacher is not an easy film . it forces you to watch people doing unpleasant things to each other and themselves , and it maintains a cool distance from its material that is deliberately unsettling . | 0 | 1 |
| 2 | jolie gives it that extra little something that makes it worth checking out at theaters , especially if you're in the mood for something more comfortable than challenging . | 1 | 1 |
| 3 | how i killed my father would be a rarity in hollywood . it's an actor's showcase that accomplishes its primary goal without the use of special effects , but rather by emphasizing the characters -- including the supporting ones . | 0 | 1 |
| 4 | there is a general air of exuberance in all about the benjamins that's hard to resist . | 1 | 1 |
| 5 | this new zealand coming-of-age movie isn't really about anything . when it's this rich and luscious , who cares ? | 0 | 1 |
| 6 | a real audience-pleaser that will strike a chord with anyone who's ever waited in a doctor's office , emergency room , hospital bed or insurance company office . | 1 | 1 |
| 7 | [grant] goes beyond his usual fluttering and stammering and captures the soul of a man in pain who gradually comes to recognize it and deal with it . | 1 | 1 |
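
As a quick tally of the two manual-review samples in this notebook, the sketch below counts the correct predictions in each. The (prediction, label) pairs are copied from the two tables; the helper names are illustrative. Keep in mind that the two samples use different test indices, so this is not a like-for-like comparison.

```python
# (prediction, label) pairs copied from the two manual-review tables
full_finetune_sample = [(1, 1), (1, 1), (1, 1), (0, 1), (1, 1), (1, 1), (1, 1), (1, 1)]
lora_sample = [(0, 1), (0, 1), (1, 1), (0, 1), (1, 1), (0, 1), (1, 1), (1, 1)]


def tally(pairs):
    """Return (number of correct predictions, sample size)."""
    correct = sum(1 for pred, label in pairs if pred == label)
    return correct, len(pairs)


def labels_seen(pairs):
    """Distinct labels present in the sample."""
    return sorted({label for _, label in pairs})


print("Full fine-tuning:", tally(full_finetune_sample), labels_seen(full_finetune_sample))
print("LoRA:", tally(lora_sample), labels_seen(lora_sample))
```

Every review in both samples carries label 1, so these spot checks only show how often each model recognizes a positive review; picking some indices with label 0 would give a more balanced picture.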


## Final Summary

Situation: The objective of the tests is to measure how much performance changes between the pre-trained GPT-2 model and the same model after training, and whether PEFT with LoRA improves the performance of the model.

Tasks: For that, the notebook was executed 3 times, with different scenarios.

Activities: The following tests were performed:

* Test a) 2 epochs, with 500 rows each for TRAIN and TEST (1,000 samples in total)
* Test b) still 2 epochs, but with the TRAIN and TEST sample sizes increased to 8,530 and 1,066 rows respectively
* Test c) same as b), but with 4 epochs

In test a) (500 entries), the pre-trained GPT-2 model returned an F1 of 31%, which increased to 62.4% after training: a ~100% increase in performance. With LoRA, training did not improve the F1 and actually decreased it by 0.5 points (from 31% to 30.5%).

Then, with a larger sample (test b), things improved quite a bit: the pre-trained F1 was 33% and, after training, it increased by 53.5 points to 86.5%, a 162% improvement. LoRA did not respond as well, going from 33.5% to 52.8%, which is not a good performance.

Last, in test c), with 4 epochs and the full sample, the GPT-2 F1 after training landed at 85.9% (less than the 86.5% obtained with 2 epochs). LoRA, on the other hand, increased by 16.9 points after 4 epochs, from 52.8% to 69.7%, but still scored lower than the fully trained GPT-2 (86.5% with 2 epochs and 85.9% with 4 epochs).
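
The summary mixes two ways of expressing a change: percentage points (the plain difference between two F1 scores) and relative improvement (that difference divided by the starting score). The sketch below recomputes both for the figures quoted above; `change` is an illustrative helper. For the LoRA comparison it starts from the 52.8% reported for test b), which is the value that yields the stated +16.9 points.

```python
def change(before_pct, after_pct):
    """Return (difference in percentage points, relative change in %)."""
    points = after_pct - before_pct
    relative = 100 * points / before_pct
    return points, relative


comparisons = {
    "full training, test a": (31.0, 62.4),
    "full training, test b": (33.0, 86.5),
    "LoRA, 2 epochs vs 4 epochs": (52.8, 69.7),
}

for name, (before_pct, after_pct) in comparisons.items():
    points, relative = change(before_pct, after_pct)
    print(f"{name}: {points:+.1f} points, {relative:+.1f}% relative")
```

The same gain in points looks much larger in relative terms when the starting score is low, so the two figures should not be compared with each other directly.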


Main Findings:
1) Based on the above, we can conclude that fine-tuning does not necessarily improve the performance of the model: with LoRA on the 500-row sample (test a), the F1 did not improve.

2) We can also conclude that running more epochs (and spending more resources: processing power, energy, $$$) does not necessarily improve performance: with full training, the F1 was 86.5% after 2 epochs and 85.9% after 4.

3) We could also see that increasing the sample size did improve the model. How close could it get to 100% precision if we increased the number of entries, and how big would the training data have to be?



