# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

* PEFT technique: LoRA
* Model: GPT-2
* Evaluation approach: Hugging Face `Trainer` with an accuracy metric
* Fine-tuning dataset: Rotten Tomatoes (https://huggingface.co/datasets/rotten_tomatoes) 

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
# Install the required version of datasets in case you have an older version
! pip install -q "datasets==2.15.0"

In [2]:
from datasets import load_dataset


dataset = load_dataset("rotten_tomatoes")


# The rotten_tomatoes dataset contains 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews.
# It already has train, test and validation splits, with 8530, 1066 and 1066 entries respectively.

splits = ["train", "test", "validation"]


# For quick testing, reduce each split to 500 entries.
#for split in splits: 
#    dataset[split] = dataset[split].shuffle(seed=42).select(range(500))



# View the dataset characteristics
dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})
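
As a quick sanity check, the split sizes printed above can be compared against the 5,331 + 5,331 sentences mentioned in the dataset description. This is only a small arithmetic sketch using the numbers already shown; it does not touch the dataset itself.

```python
# Split sizes as reported in the DatasetDict output above
split_sizes = {"train": 8530, "validation": 1066, "test": 1066}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

# 5,331 positive + 5,331 negative sentences, per the dataset description
assert total == 2 * 5331

for name, frac in fractions.items():
    print(f"{name}: {split_sizes[name]} rows ({frac:.1%})")
```

The splits work out to roughly 80% / 10% / 10% of the data.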

In [3]:
dataset["train"]
dataset["test"]
dataset["validation"]

Dataset({
    features: ['text', 'label'],
    num_rows: 1066
})

In [4]:
# Inspect the first example. Do you think this review is positive or negative?
dataset["train"][0]


{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'label': 1}



## Pre-process datasets
Now we are going to process our datasets by converting all the text into tokens for our models.

In [5]:
from transformers import AutoTokenizer

# Convert the text to tokens so the model can process it
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# GPT-2 has no padding token, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token


def preprocess_function(examples):
    # Truncate only; DataCollatorWithPadding pads each batch later
    return tokenizer(examples["text"], truncation=True)


tokenized_dataset = {}

for split in splits:
    tokenized_dataset[split] = dataset[split].map(preprocess_function, batched=True)

  
# Inspect the available columns in the dataset
tokenized_dataset["train"]

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 8530
})
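
The `map` call above tokenizes without padding, so the sequences in a batch have different lengths until the data collator pads them. The sketch below illustrates the idea of per-batch (dynamic) padding in plain Python. `pad_batch` is a hypothetical helper written for illustration, not the library's implementation, and the token ids are made up; 50256 is GPT-2's end-of-sequence id, which this notebook reuses as the pad token.

```python
def pad_batch(batch_input_ids, pad_token_id):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        # Right-pad with the pad token; the mask marks real tokens with 1
        input_ids.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids, for illustration only
batch = pad_batch([[11, 22, 33], [44]], pad_token_id=50256)
print(batch)
```

Padding to the longest sequence in each batch, rather than to a fixed maximum length, keeps batches of short reviews small.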

## Load and set up the model
In this case we are doing full fine-tuning, so all of the model's parameters need to be unfrozen.

In [6]:
from transformers import AutoModelForSequenceClassification, DataCollatorWithPadding



model = AutoModelForSequenceClassification.from_pretrained(
    "gpt2",
    num_labels=2,
    id2label={0: "Negative review", 1: "Positive review"},
    label2id={"Negative review": 0, "Positive review": 1},
)
model.config.pad_token_id = tokenizer.pad_token_id
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)



# Unfreeze all the model parameters.
# Hint: Check the documentation at https://huggingface.co/transformers/v4.2.2/training.html
for param in model.parameters():
    param.requires_grad = True
      
    
print(model)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=2, bias=False)
)
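
To get a feel for how many parameters full fine-tuning updates, the count can be reconstructed from the architecture printed above. This is a rough sketch: the vocabulary, position and hidden sizes come from the printout, while the `Conv1D` widths (3x and 4x the hidden size) are not shown there and are assumed to be the standard GPT-2 small values.

```python
# Sizes read from the printed model: vocab, positions, hidden size, blocks, labels
vocab, n_pos, d, n_layers, n_labels = 50257, 1024, 768, 12, 2


def linear(n_in, n_out, bias=True):
    """Parameter count of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


layer_norm = 2 * d  # weight + bias

# Assumed widths: c_attn maps d -> 3d (fused Q, K, V), the MLP maps d -> 4d -> d
block = (
    layer_norm + linear(d, 3 * d) + linear(d, d)        # ln_1, c_attn, c_proj
    + layer_norm + linear(d, 4 * d) + linear(4 * d, d)  # ln_2, c_fc, c_proj
)

backbone = vocab * d + n_pos * d + n_layers * block + layer_norm  # wte, wpe, h, ln_f
head = linear(d, n_labels, bias=False)                            # score
total = backbone + head

print(f"backbone: {backbone:,}  head: {head:,}  total: {total:,}")
```

Under these assumptions the model has about 124 million parameters, all of which are trainable in the full fine-tuning run.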


## Let's train it!
Now it's time to train our model. We'll use the Trainer class.

First we'll define a function to compute our accuracy metric, then we'll create the Trainer.

In this instance, we will fill in some of the training arguments.

In [7]:
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

training_args = TrainingArguments(
        output_dir="./results",
        # Set the learning rate
        learning_rate = 2e-5,
        # Set the per device train batch size and eval batch size
        per_device_train_batch_size = 32,
        per_device_eval_batch_size = 64,
        # Evaluate and save the model after each epoch
        evaluation_strategy = "epoch", 
        save_strategy = "epoch",
        logging_dir="./logs",
        num_train_epochs=2,
        weight_decay=0.01,
        load_best_model_at_end=True,
        logging_steps=100,
        warmup_ratio=0.1,
    )

# The HuggingFace Trainer class handles the training and eval loop for PyTorch for us.
# Read more about it here https://huggingface.co/docs/transformers/main_classes/trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)
trainer.train()


You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.4461 | 0.385046 | 0.831144 |
| 2 | 0.3336 | 0.321643 | 0.863977 |


Checkpoint destination directory ./results/checkpoint-267 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./results/checkpoint-534 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=534, training_loss=0.6788276822379465, metrics={'train_runtime': 194.1197, 'train_samples_per_second': 87.884, 'train_steps_per_second': 2.751, 'total_flos': 439179617009664.0, 'train_loss': 0.6788276822379465, 'epoch': 2.0})
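
The step counts in the output above (`checkpoint-267`, `global_step=534`) follow from the training set size and the batch size. The sketch below reproduces them, assuming a single device and no gradient accumulation. The ceiling rounding used for the warmup steps is my assumption about how `warmup_ratio` is converted into steps; it is not shown in the output.

```python
import math

train_rows = 8530   # rows in the train split
batch_size = 32     # per_device_train_batch_size
epochs = 2          # num_train_epochs
warmup_ratio = 0.1

# The last, partially filled batch still counts as a step
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

# Assumption: warmup steps are the ceiling of ratio * total steps
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)
```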

## View some data

Let's look at a few examples.

In [8]:
# Make a dataframe with the predictions and the text and the labels
import pandas as pd

items_for_manual_review = tokenized_dataset["test"].select(
    [0, 1, 22, 31, 43, 29, 48, 287]
)

results = trainer.predict(items_for_manual_review)
df = pd.DataFrame(
    {
        "review": [item["text"] for item in items_for_manual_review],
        "predictions": results.predictions.argmax(axis=1),
        "labels": results.label_ids,
    }
)
# Show the full text of each cell instead of truncating it
pd.set_option("display.max_colwidth", None)
df

|   | review | predictions | labels |
|---|--------|-------------|--------|
| 0 | lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness . | 1 | 1 |
| 1 | consistently clever and suspenseful . | 1 | 1 |
| 2 | grown-up quibbles are beside the point here . the little girls understand , and mccracken knows that's all that matters . | 1 | 1 |
| 3 | the main story . . . is compelling enough , but it's difficult to shrug off the annoyance of that chatty fish . | 0 | 1 |
| 4 | the film has a laundry list of minor shortcomings , but the numerous scenes of gory mayhem are worth the price of admission . . . if " gory mayhem " is your idea of a good time . | 1 | 1 |
| 5 | a soul-stirring documentary about the israeli/palestinian conflict as revealed through the eyes of some children who remain curious about each other against all odds . | 1 | 1 |
| 6 | the film truly does rescue [the funk brothers] from motown's shadows . it's about time . | 1 | 1 |
| 7 | ya-yas everywhere will forgive the flaws and love the film . | 1 | 1 |
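
The accuracy on this hand-picked sample can be recomputed from the `predictions` and `labels` columns above. This is a plain-Python version of the same comparison that `compute_metrics` performs after the argmax step; the two lists are copied from the table.

```python
# Copied from the table above (rows 0 to 7)
predictions = [1, 1, 1, 0, 1, 1, 1, 1]
labels = [1, 1, 1, 1, 1, 1, 1, 1]


def accuracy(predictions, labels):
    """Fraction of predictions that match their labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


acc = accuracy(predictions, labels)
# Row indices where the model disagreed with the label
misclassified = [i for i, (p, y) in enumerate(zip(predictions, labels)) if p != y]

print(f"accuracy: {acc:.3f}, misclassified rows: {misclassified}")
```

All eight selected reviews carry a positive label, so this sample says nothing about how the model handles negative reviews and should not be read as an estimate of test accuracy.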


## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.
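
For intuition, LoRA keeps the pre-trained weight matrix frozen and learns a low-rank correction that is scaled by `lora_alpha / r`. The toy sketch below illustrates that idea in plain Python with made-up matrix sizes. `matmul` and `lora_forward` are hypothetical helpers, not the PEFT implementation, and the zero initialization of the second low-rank matrix follows the usual LoRA convention, which I am assuming applies here. Only `r=4` and `alpha=16` are taken from this notebook's `LoraConfig`.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, W, A, B, r, alpha):
    """Compute x @ W + (alpha / r) * (x @ A @ B) for row-vector inputs."""
    scaling = alpha / r
    base = matmul(x, W)               # frozen pre-trained path
    update = matmul(matmul(x, A), B)  # trainable low-rank path
    return [
        [b + scaling * u for b, u in zip(base_row, update_row)]
        for base_row, update_row in zip(base, update)
    ]


random.seed(0)
d_in, d_out = 8, 6   # toy sizes, for illustration only
r, alpha = 4, 16     # rank and scaling numerator from the LoraConfig

W = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]  # frozen
A = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d_in)]      # down-projection
B = [[0.0] * d_out for _ in range(r)]                                  # up-projection, starts at zero
x = [[random.gauss(0, 1) for _ in range(d_in)]]

out = lora_forward(x, W, A, B, r, alpha)
print(out == matmul(x, W))
```

Because the up-projection starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training only has to learn the correction.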

In [9]:
from peft import LoraConfig, PeftModelForSequenceClassification, TaskType, AutoPeftModelForSequenceClassification


# PEFT model configuration
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=4,
    lora_alpha=16,
    lora_dropout=0.1
)

# Load the pre-trained GPT-2 model
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = model.config.eos_token_id

peft_model = PeftModelForSequenceClassification(model, peft_config)

# Print the number of trainable parameters
peft_model.print_trainable_parameters()

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 150,528 || all params: 124,590,336 || trainable%: 0.1208183594592762
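
The trainable-parameter count reported above can be roughly reconstructed by hand. The sketch assumes that the default LoRA target for GPT-2 is only the fused attention projection `c_attn` (768 inputs, 3 x 768 outputs) in each of the 12 blocks; that default is my assumption and is not shown in the output. The remaining trainable parameters are consistent with the 768 x 2 classification head being counted twice, which is an interpretation rather than something the output confirms.

```python
d_model, n_layers, r = 768, 12, 4
c_attn_out = 3 * d_model  # assumed width of the fused Q, K, V projection

# Each adapted layer adds a (d_model x r) and an (r x c_attn_out) matrix
lora_per_layer = r * d_model + r * c_attn_out
lora_total = n_layers * lora_per_layer

head = d_model * 2  # score: Linear(768, 2, bias=False)

# Figures copied from print_trainable_parameters() above
reported_trainable = 150_528
reported_all = 124_590_336

remainder = reported_trainable - lora_total
pct = 100 * reported_trainable / reported_all

print(f"LoRA matrices: {lora_total:,}")
print(f"remainder: {remainder:,} (classification head is {head:,})")
print(f"trainable share: {pct:.4f}%")
```

Either way, the LoRA run updates roughly 0.12% of the parameters, compared with 100% in the full fine-tuning run.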


In [10]:
from transformers import EvalPrediction

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}


# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results/peft_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir='./logs/peft_model',
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=100,
    warmup_ratio=0.1,
)

# Initialize the Trainer with compute_metrics
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Start training
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.7624 | 0.707016 | 0.537523 |
| 2 | 0.6996 | 0.681623 | 0.602251 |


Checkpoint destination directory ./results/peft_model/checkpoint-267 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./results/peft_model/checkpoint-534 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=534, training_loss=0.8106115069728665, metrics={'train_runtime': 125.8323, 'train_samples_per_second': 135.577, 'train_steps_per_second': 4.244, 'total_flos': 439948910979072.0, 'train_loss': 0.8106115069728665, 'epoch': 2.0})

In [11]:
# Evaluate
evaluation_results = trainer.evaluate()
print("Evaluation Results:", evaluation_results)

Evaluation Results: {'eval_loss': 0.6816233396530151, 'eval_accuracy': 0.6022514071294559, 'eval_runtime': 3.4653, 'eval_samples_per_second': 307.619, 'eval_steps_per_second': 4.906, 'epoch': 2.0}


In [12]:
# Save the PEFT model
peft_model.save_pretrained('model/peft_model')


## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [13]:
inference_model = AutoPeftModelForSequenceClassification.from_pretrained(
    "model/peft_model",
    num_labels=2
)
inference_model.config.pad_token_id = inference_model.config.eos_token_id

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
trainer = Trainer(
    model=inference_model,
    args=training_args,
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Evaluate the model
evaluation_results_lora = trainer.evaluate()
print("Evaluation Results:", evaluation_results_lora)


Evaluation Results: {'eval_loss': 0.6816233396530151, 'eval_accuracy': 0.6022514071294559, 'eval_runtime': 3.4778, 'eval_samples_per_second': 306.517, 'eval_steps_per_second': 4.888}


## Make some predictions with the fine-tuned model


In [15]:
# Make a dataframe with the predictions and the text and the labels
import pandas as pd

items_for_manual_review = tokenized_dataset["test"].select(
    [4, 54, 222, 331, 413, 129, 8, 87]
)

results = trainer.predict(items_for_manual_review)
df = pd.DataFrame(
    {
        "review": [item["text"] for item in items_for_manual_review],
        "predictions": results.predictions.argmax(axis=1),
        "labels": results.label_ids,
    }
)
# Show the full text of each cell instead of truncating it
pd.set_option("display.max_colwidth", None)
df

|   | review | predictions | labels |
|---|--------|-------------|--------|
| 0 | red dragon " never cuts corners . | 0 | 1 |
| 1 | the piano teacher is not an easy film . it forces you to watch people doing unpleasant things to each other and themselves , and it maintains a cool distance from its material that is deliberately unsettling . | 1 | 1 |
| 2 | jolie gives it that extra little something that makes it worth checking out at theaters , especially if you're in the mood for something more comfortable than challenging . | 1 | 1 |
| 3 | how i killed my father would be a rarity in hollywood . it's an actor's showcase that accomplishes its primary goal without the use of special effects , but rather by emphasizing the characters -- including the supporting ones . | 1 | 1 |
| 4 | there is a general air of exuberance in all about the benjamins that's hard to resist . | 0 | 1 |
| 5 | this new zealand coming-of-age movie isn't really about anything . when it's this rich and luscious , who cares ? | 1 | 1 |
| 6 | a real audience-pleaser that will strike a chord with anyone who's ever waited in a doctor's office , emergency room , hospital bed or insurance company office . | 0 | 1 |
| 7 | [grant] goes beyond his usual fluttering and stammering and captures the soul of a man in pain who gradually comes to recognize it and deal with it . | 0 | 1 |


## Final Summary

First, I tested on a sample of 500 rows each for TRAIN and TEST (1,000 samples in total).

With that sample and **one epoch** of training, the fully fine-tuned GPT-2 reached an accuracy of 53.6%, while PEFT/LoRA reached 50.2%, so full fine-tuning was slightly more effective for these parameters and this sample size.
Both figures are close to the 50% chance level expected on a roughly balanced binary dataset, so we can also conclude that fine-tuning does not necessarily increase accuracy.

Then I increased the TRAIN and TEST sizes to 8,530 and 1,066 rows respectively.
That gave a significant bump in accuracy: +31.5 pts for the fully fine-tuned GPT-2 (from 53.6% to 85.1%) and a smaller one for LoRA: +13.1 pts (from 50.2% to 63.3%). That is, the larger dataset improved the accuracy of full fine-tuning 2.4x more than it improved the accuracy of LoRA.

Next, in the same scenario (TRAIN = 8,530 samples; TEST = 1,066 samples) but with 2 epochs, the accuracy of the fully fine-tuned GPT-2 was 83.1% after epoch 1 and 86.4% after epoch 2. For LoRA it was 53.8% after epoch 1 and 60.2% after epoch 2.

As shown above, despite doubling the epochs the gain was not significant: +1.3 pts for the fully fine-tuned GPT-2 (from 85.1% to 86.4%) and actually a decrease of 3.1 pts for LoRA (from 63.3% to 60.2%).

This also suggests that running more epochs (and spending more time and processing power) does not necessarily improve performance.
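
The differences discussed in this summary can be recomputed from the accuracies it quotes. The sketch below is only arithmetic on those figures; the dictionary keys are labels I chose for the three runs, and the middle run is taken to be a single epoch, as the comparison with the 2-epoch run implies.

```python
# Accuracy (%) figures quoted in this summary
full_ft = {"500 rows, 1 epoch": 53.6, "full data, 1 epoch": 85.1, "full data, 2 epochs": 86.4}
lora = {"500 rows, 1 epoch": 50.2, "full data, 1 epoch": 63.3, "full data, 2 epochs": 60.2}


def delta(results, before, after):
    """Change in accuracy, in percentage points, rounded to one decimal."""
    return round(results[after] - results[before], 1)


# Effect of moving from the 500-row sample to the full dataset
data_gain_full = delta(full_ft, "500 rows, 1 epoch", "full data, 1 epoch")
data_gain_lora = delta(lora, "500 rows, 1 epoch", "full data, 1 epoch")
ratio = round(data_gain_full / data_gain_lora, 1)

# Effect of training for a second epoch on the full dataset
epoch_gain_full = delta(full_ft, "full data, 1 epoch", "full data, 2 epochs")
epoch_gain_lora = delta(lora, "full data, 1 epoch", "full data, 2 epochs")

print(f"more data:   full {data_gain_full:+} pts, LoRA {data_gain_lora:+} pts, ratio {ratio}x")
print(f"more epochs: full {epoch_gain_full:+} pts, LoRA {epoch_gain_lora:+} pts")
```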

