_________________________________________
This notebook is an example of how to fine-tune a model with LoRA.  Fine-tuning adapts an already pretrained model to a specific task.  Instead of training a model from scratch, with the enormous cost that process involves, LoRA fine-tuning trains only a small fraction of the parameters and can still improve results on the task dramatically.
_________________________________________
  

In [1]:
# Step 1: Install Necessary Libraries not already installed
!pip install datasets   
!pip install evaluate
!pip install transformers
!pip install --upgrade huggingface_hub
!pip install peft==0.13.0

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3
Collecting huggingface_hub
  Downloading huggingface_hub-0.27.0-py3-none-any.whl.metadata (13 kB)
Downloading huggingface_hub-0.27.0-py3-none-any.whl (450 kB)
Installing collected packages: huggingface_hub
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.24.7
    Uninstalling huggingface-hub-0.24.7:
      Successfully uninstalled huggingface-hub-0.24.7
Successfully installed huggingface_hub-0.27.0
Collecting peft==0.13.0
  Downloading peft-0.13.0-py3-none-any.whl.metadat

In [2]:
# Step 2: Import Required Libraries
from datasets import load_dataset, DatasetDict, Dataset

import os

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)

import huggingface_hub
import peft

import evaluate
import numpy as np
import torch
from torch.nn.functional import softmax

from peft import PeftModel, PeftConfig, get_peft_model, LoraConfig

In [3]:
#This code block loads a pretrained model.  We use roberta-base
  #because it is relatively small and efficient.  Note that roberta-base is
  #case-sensitive.  Other BERT-style models should also work, with slightly
  #better or worse results.

model_chkpt = 'roberta-base'

#define label maps.  This changes binary to text to characterize a review as positive or negative
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE":0, "POSITIVE":1}

#Load the base model before fine tuning
model = AutoModelForSequenceClassification.from_pretrained(
    model_chkpt, num_labels=2, id2label=id2label, label2id=label2id
)

config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
# Step 3: Load and Prepare Dataset
#load dataset of movie reviews from IMDB
dataset = load_dataset('imdb')
dataset

README.md:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [5]:
# Step 4: Tokenize Dataset
# Tokenize the dataset using a pretrained tokenizer

tokenizer = AutoTokenizer.from_pretrained(model_chkpt, add_prefix_space=True)

# create tokenize function
def tokenize_function(examples):
  #extract text
  text = examples["text"]

  #tokenize and truncate text; truncating on the left keeps the end of long reviews
  tokenizer.truncation_side = "left"
  tokenized_inputs = tokenizer(text,
                                return_tensors= 'np',
                                max_length=512,
                                truncation=True)

  return tokenized_inputs

if tokenizer.pad_token is None:
  tokenizer.add_special_tokens({'pad_token': '[PAD]'})
  model.resize_token_embeddings(len(tokenizer))

# tokenize every split of the dataset (train, test, unsupervised)
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]



Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 50000
    })
})
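
Setting truncation_side = "left" means that reviews longer than max_length lose tokens from the beginning rather than the end. The sketch below illustrates the idea on a plain list of words. Note that truncate is a hypothetical helper written for this illustration, not the tokenizer's own implementation, and it ignores details such as special tokens.

```python
def truncate(tokens, max_length, side="right"):
    """Conceptual stand-in for tokenizer truncation; ignores special tokens."""
    if len(tokens) <= max_length:
        return list(tokens)
    if side == "left":
        return list(tokens[-max_length:])   # drop tokens from the start
    return list(tokens[:max_length])        # drop tokens from the end

# Made-up review, split on whitespace instead of using a real tokenizer.
review = "the first hour drags but the ending is absolutely wonderful".split()

print(truncate(review, 4, side="right"))
print(truncate(review, 4, side="left"))
```

For reviews, the verdict often comes at the end, which is one plausible reason to keep the tail of a long review rather than its opening.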

In [6]:
# create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [7]:
#This creates a metric for determining the accuracy of the model

accuracy = evaluate.load("accuracy")

# define an evaluation function to pass into trainer later
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": accuracy.compute(predictions=predictions, references=labels)}

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]
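
accuracy.compute already returns a dictionary of the form {"accuracy": value}, so the function above nests one dictionary inside another. That is why the Accuracy column in the training log further down shows entries such as {'accuracy': 0.94572}. If a plain number is preferred, one possible variant is sketched below. It computes accuracy directly with NumPy instead of the evaluate library; compute_metrics_flat and the sample arrays are illustrative names and values, not part of the original notebook.

```python
import numpy as np

def compute_metrics_flat(eval_pred):
    """Return accuracy as a plain float so it logs as a number, not a nested dict."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)   # pick the highest-scoring class per row
    return {"accuracy": float((predictions == labels).mean())}

# Tiny made-up batch: 4 examples, 2 classes.
fake_logits = np.array([[0.2, 0.9], [1.5, -0.3], [0.1, 0.4], [2.0, 1.0]])
fake_labels = np.array([1, 0, 0, 0])

print(compute_metrics_flat((fake_logits, fake_labels)))
```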

In [8]:
# define list of examples for the model to determine if the review is positive or negative


test_list = [
    "It was good",
    "Not a fan, don't recommend",
    "Better than the first one",
    "This is not worth watching once",
    "This one is a pass",
    "Loved it, highly recommend!",
    "Terrible plot, skip it.",
    "Surprisingly entertaining!",
    "A dull and predictable mess.",
    "Decent, but nothing special.",
    "Fresh and fun to watch!",
    "Confusing from start to end.",
    "Better than I expected.",
    "Lacked heart and emotion.",
    "Perfect for a lazy afternoon.",
    "Not great, but not awful.",
    "A total waste of two hours.",
    "Visually stunning, great vibes!",
    "Nothing memorable here.",
    "A sequel that actually works!"
]

print("Untrained model predictions plus confidence of prediction:")
print("_____________________________")

for text in test_list:
    inputs=tokenizer.encode(text, return_tensors='pt')

    # Ensure the model is in evaluation mode
    model.eval()

    with torch.no_grad():
        logits = model(inputs).logits

    probabilities = softmax(logits, dim=1)  # Convert logits to probabilities
    predicted_class = torch.argmax(probabilities, dim=1).item()
    confidence = probabilities[0][predicted_class].item() * 100

    print(f"{text} - {id2label[predicted_class]} - {confidence:.2f}%")



Untrained model predictions plus confidence of prediction:
_____________________________
It was good - POSITIVE - 53.74%
Not a fan, don't recommend - POSITIVE - 53.57%
Better than the first one - POSITIVE - 53.80%
This is not worth watching once - POSITIVE - 53.89%
This one is a pass - POSITIVE - 53.63%
Loved it, highly recommend! - POSITIVE - 53.73%
Terrible plot, skip it. - POSITIVE - 54.25%
Surprisingly entertaining! - POSITIVE - 53.90%
A dull and predictable mess. - POSITIVE - 54.10%
Decent, but nothing special. - POSITIVE - 54.67%
Fresh and fun to watch! - POSITIVE - 53.95%
Confusing from start to end. - POSITIVE - 54.00%
Better than I expected. - POSITIVE - 54.43%
Lacked heart and emotion. - POSITIVE - 54.11%
Perfect for a lazy afternoon. - POSITIVE - 54.32%
Not great, but not awful. - POSITIVE - 54.35%
A total waste of two hours. - POSITIVE - 54.46%
Visually stunning, great vibes! - POSITIVE - 53.77%
Nothing memorable here. - POSITIVE - 54.13%
A sequel that actually works! - POS

_________________________________________
Here are the results generated

Untrained model predictions:
_____________________________
It was good - POSITIVE
Not a fan, don't recommend - POSITIVE
Better than the first one - POSITIVE
This is not worth watching once - POSITIVE
This one is a pass - POSITIVE
Loved it, highly recommend! - POSITIVE
Terrible plot, skip it. - POSITIVE
Surprisingly entertaining! - POSITIVE
A dull and predictable mess. - POSITIVE
Decent, but nothing special. - POSITIVE
Fresh and fun to watch! - POSITIVE
Confusing from start to end. - POSITIVE
Better than I expected. - POSITIVE
Lacked heart and emotion. - POSITIVE
Perfect for a lazy afternoon. - POSITIVE
Not great, but not awful. - POSITIVE
A total waste of two hours. - POSITIVE
Visually stunning, great vibes! - POSITIVE
Nothing memorable here. - POSITIVE
A sequel that actually works! - POSITIVE
_________________________________________
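
The confidence values printed by the code cell come from a softmax over the two logits. With only two classes, a confidence near 54% means the two logits are almost equal, which is what a freshly initialized classification head tends to produce. The sketch below shows the relationship with the standard library only; the logit values are made up for illustration and were not taken from the model.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order [NEGATIVE, POSITIVE].
examples = {
    "untrained-like": [0.00, 0.15],   # nearly equal scores
    "trained-like": [-2.50, 3.00],    # clearly separated scores
}

for name, logits in examples.items():
    probs = softmax(logits)
    predicted = probs.index(max(probs))
    print(f"{name}: class {predicted}, confidence {100 * probs[predicted]:.2f}%")
```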

In [9]:
#This is the configuration for fine-tuning the model.  You can change these
  #parameters to balance speed with accuracy.  r (the LoRA rank) and
  #target_modules determine how many parameters are trainable; lora_alpha
  #scales the size of the LoRA update.  Note the trainable% output in the next
  #code cell.  A higher trainable% gives the adapter more capacity, which often
  #helps accuracy, at the cost of slower training.


peft_config = LoraConfig(
    task_type="SEQ_CLS",  # Sequence Classification
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query", "key", "value"],  # Modules specific to RoBERTa
)
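
As a rough illustration of what r and lora_alpha control, the sketch below applies a LoRA-style update to a single linear layer with NumPy. The frozen weight W stays untouched; only the two small matrices A and B would be trained, and their product is scaled by lora_alpha / r. The hidden size of 768 and all array values are assumptions for illustration, and this is a simplified sketch rather than PEFT's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, lora_alpha = 768, 16, 32   # d is assumed; r and lora_alpha match peft_config

W = rng.normal(size=(d, d))               # frozen pretrained weight (stand-in values)
A = rng.normal(scale=0.01, size=(r, d))   # trainable, randomly initialized
B = np.zeros((d, r))                      # trainable, zero-initialized so the update starts at zero

scaling = lora_alpha / r
x = rng.normal(size=(d,))

# Adapted forward pass: original output plus the scaled low-rank correction.
h = W @ x + scaling * (B @ (A @ x))

print("scaling:", scaling)
print("matches frozen layer at init:", np.allclose(h, W @ x))
print(A.size + B.size, "trainable vs", W.size, "frozen")
```

Because B starts at zero, the adapted layer initially behaves exactly like the pretrained one, and training only has to learn the correction.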


In [10]:
#Load model and output the trainable parameters vs total parameters

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 1,476,866 || all params: 126,124,036 || trainable%: 1.1710


_________________________________________
Here are the results.  

*trainable params: 1,476,866 || all params: 126,124,036 || trainable%: 1.1710*

Note the trainable percentage.  This indicates that the fine tuning is changing just over 1% of the model parameters.  This is much cheaper and faster than retraining 100% of the model.
________________________________________  
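
The trainable count can be reproduced by hand. The sketch below assumes the roberta-base dimensions (hidden size 768, 12 encoder layers) and assumes that the classification head is trained alongside the LoRA matrices for a SEQ_CLS task. The variable names are illustrative and nothing here is read from the model object.

```python
# Rough reconstruction of the number reported by print_trainable_parameters().
hidden_size = 768      # roberta-base hidden dimension (assumed)
num_layers = 12        # roberta-base encoder layers (assumed)
r = 16                 # LoRA rank from peft_config
num_targets = 3        # query, key, value
num_labels = 2

# Each adapted linear layer gets two low-rank matrices: A (r x d) and B (d x r).
lora_per_module = r * hidden_size + hidden_size * r
lora_params = lora_per_module * num_targets * num_layers

# Classification head: dense (d x d plus bias) and out_proj (d x num_labels plus bias).
head_params = (hidden_size * hidden_size + hidden_size) + (hidden_size * num_labels + num_labels)

trainable = lora_params + head_params
all_params = 126_124_036   # total taken from the output above

print(f"LoRA adapter params:    {lora_params:,}")
print(f"Classifier head params: {head_params:,}")
print(f"Trainable params:       {trainable:,}")
print(f"Trainable %:            {100 * trainable / all_params:.4f}")
```

Doubling r would double only the adapter term, which is why r changes the trainable percentage while lora_alpha does not.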

In [11]:
#This sets the hyperparameters for the fine-tuned model.

#hyperparameters
lr= 3e-5
batch_size=4
num_epochs=10

#training arguments

training_args = TrainingArguments(
    output_dir= model_chkpt + "-lora-text-classification",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none"
)

In [12]:
#This trains the model on just the portion set to fine tuning.
  #This will be very slow and perhaps unusable without a GPU.

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    # The Trainer arguments have changed: newer transformers versions take
    # processing_class=tokenizer, older versions take tokenizer=tokenizer as below.
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

#Train model
trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.2107,0.180416,{'accuracy': 0.94572}
2,0.2015,0.163299,{'accuracy': 0.95036}
3,0.1893,0.171868,{'accuracy': 0.95088}
4,0.1699,0.172385,{'accuracy': 0.95268}
5,0.1677,0.186032,{'accuracy': 0.95124}
6,0.1489,0.182251,{'accuracy': 0.95396}
7,0.1413,0.183439,{'accuracy': 0.95436}
8,0.1314,0.189029,{'accuracy': 0.95392}
9,0.1367,0.186107,{'accuracy': 0.9544}
10,0.1393,0.190127,{'accuracy': 0.95376}




TrainOutput(global_step=31250, training_loss=0.17179314123535155, metrics={'train_runtime': 18965.3193, 'train_samples_per_second': 13.182, 'train_steps_per_second': 1.648, 'total_flos': 6.238103014602144e+16, 'train_loss': 0.17179314123535155, 'epoch': 10.0})

This run took roughly 5.3 hours (the train_runtime above is about 18,965 seconds) to fine-tune with Kaggle's GPU T4 x2 accelerator.  That is still much faster and cheaper than Facebook/Meta's original pretraining.
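
The global_step in the TrainOutput follows from the dataset size and batch size. The sketch below assumes both GPUs of the T4 x2 accelerator were used, so the effective batch size is twice the per-device value. That assumption is consistent with the checkpoint names listed in the next cell, which step in multiples of 3125.

```python
import math

train_examples = 25_000       # size of the IMDB train split
per_device_batch_size = 4     # batch_size from the hyperparameters cell
num_devices = 2               # assumption: both GPUs were used
num_epochs = 10

effective_batch = per_device_batch_size * num_devices
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs

train_runtime_s = 18965.3193  # train_runtime from the TrainOutput above
samples_per_second = train_examples * num_epochs / train_runtime_s

print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print(f"samples per second: {samples_per_second:.3f}")
```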

In [14]:


fine_tuned_model_path = "roberta-base-lora-text-classification"

# Check that the output directory exists before listing its checkpoints,
# since os.listdir raises FileNotFoundError on a missing path.
if os.path.exists(fine_tuned_model_path):
    print(os.listdir(fine_tuned_model_path))
    print(f"The directory '{fine_tuned_model_path}' exists.")
else:
    print(f"The directory '{fine_tuned_model_path}' does NOT exist.")



['checkpoint-12500', 'checkpoint-6250', 'checkpoint-21875', 'checkpoint-25000', 'checkpoint-9375', 'checkpoint-15625', 'checkpoint-18750', 'checkpoint-28125', 'checkpoint-3125', 'checkpoint-31250']
The directory 'roberta-base-lora-text-classification' exists.


In [15]:
trainer.save_model("roberta-base-lora-text-classification")


In [16]:
#This tests the new fine tuned model on the example from earlier to see if there
  #is an improvement.

# Load the fine-tuned model
model = AutoModelForSequenceClassification.from_pretrained(fine_tuned_model_path)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_chkpt, add_prefix_space=True)

# Define the list of examples
test_list = [
    "It was good",
    "Not a fan, don't recommend",
    "Better than the first one",
    "This is not worth watching once",
    "This one is a pass",
    "Loved it, highly recommend!",
    "Terrible plot, skip it.",
    "Surprisingly entertaining!",
    "A dull and predictable mess.",
    "Decent, but nothing special.",
    "Fresh and fun to watch!",
    "Confusing from start to end.",
    "Better than I expected.",
    "Lacked heart and emotion.",
    "Perfect for a lazy afternoon.",
    "Not great, but not awful.",
    "A total waste of two hours.",
    "Visually stunning, great vibes!",
    "Nothing memorable here.",
    "A sequel that actually works!"
]

print("Trained model predictions:")
print("_____________________________")

# Make predictions using the trained model
for text in test_list:
    inputs = tokenizer.encode(text, return_tensors='pt')

    # Ensure the model is in evaluation mode
    model.eval()

    with torch.no_grad():
        logits = model(inputs).logits

    probabilities = softmax(logits, dim=1)  # Convert logits to probabilities
    predicted_class = torch.argmax(probabilities, dim=1).item()
    confidence = probabilities[0][predicted_class].item() * 100

    print(f"{text} - {id2label[predicted_class]} - {confidence:.2f}%")

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trained model predictions:
_____________________________
It was good - POSITIVE - 88.97%
Not a fan, don't recommend - NEGATIVE - 72.04%
Better than the first one - POSITIVE - 92.99%
This is not worth watching once - NEGATIVE - 98.20%
This one is a pass - POSITIVE - 59.27%
Loved it, highly recommend! - POSITIVE - 99.76%
Terrible plot, skip it. - NEGATIVE - 99.56%
Surprisingly entertaining! - POSITIVE - 98.17%
A dull and predictable mess. - NEGATIVE - 99.43%
Decent, but nothing special. - NEGATIVE - 98.07%
Fresh and fun to watch! - POSITIVE - 99.41%
Confusing from start to end. - NEGATIVE - 96.02%
Better than I expected. - POSITIVE - 95.50%
Lacked heart and emotion. - NEGATIVE - 93.41%
Perfect for a lazy afternoon. - POSITIVE - 93.03%
Not great, but not awful. - NEGATIVE - 74.36%
A total waste of two hours. - NEGATIVE - 99.10%
Visually stunning, great vibes! - POSITIVE - 99.61%
Nothing memorable here. - NEGATIVE - 95.18%
A sequel that actually works! - POSITIVE - 95.86%


Trained model predictions from RoBERTa (* marks an incorrect prediction, $ marks a neutral statement):

_____________________________

It was good - POSITIVE - 88.97%

Not a fan, don't recommend - NEGATIVE - 72.04%

Better than the first one - POSITIVE - 92.99%

This is not worth watching once - NEGATIVE - 98.20%

*This one is a pass - POSITIVE - 59.27%

Loved it, highly recommend! - POSITIVE - 99.76%

Terrible plot, skip it. - NEGATIVE - 99.56%

Surprisingly entertaining! - POSITIVE - 98.17%

A dull and predictable mess. - NEGATIVE - 99.43%

$Decent, but nothing special. - NEGATIVE - 98.07%

Fresh and fun to watch! - POSITIVE - 99.41%

Confusing from start to end. - NEGATIVE - 96.02%

Better than I expected. - POSITIVE - 95.50%

Lacked heart and emotion. - NEGATIVE - 93.41%

Perfect for a lazy afternoon. - POSITIVE - 93.03%

$Not great, but not awful. - NEGATIVE - 74.36%

A total waste of two hours. - NEGATIVE - 99.10%

Visually stunning, great vibes! - POSITIVE - 99.61%

Nothing memorable here. - NEGATIVE - 95.18%

A sequel that actually works! - POSITIVE - 95.86%

Of the 20 statements, 2 are neutral.  Of the remaining 18, the model got 17 correct and 1 incorrect, an accuracy of about 94% (17/18).  In addition, the one mistake had a confidence of only about 59%.  You could design a system where any review with a confidence below some threshold, maybe 75%, is flagged for human review.

The one thing the system did have trouble with is neutral or inconclusive reviews: with only two labels, it has to call each one either POSITIVE or NEGATIVE.  Future improvements could work on that aspect.
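
The idea of flagging low-confidence reviews can be sketched in a few lines. The predictions are copied from the RoBERTa results above; the 75% cutoff and the needs_human_review helper are assumptions for illustration, not a tuned value or an existing API.

```python
# (text, predicted label, confidence %) copied from the RoBERTa results above.
predictions = [
    ("It was good", "POSITIVE", 88.97),
    ("Not a fan, don't recommend", "NEGATIVE", 72.04),
    ("Better than the first one", "POSITIVE", 92.99),
    ("This is not worth watching once", "NEGATIVE", 98.20),
    ("This one is a pass", "POSITIVE", 59.27),
    ("Loved it, highly recommend!", "POSITIVE", 99.76),
    ("Terrible plot, skip it.", "NEGATIVE", 99.56),
    ("Surprisingly entertaining!", "POSITIVE", 98.17),
    ("A dull and predictable mess.", "NEGATIVE", 99.43),
    ("Decent, but nothing special.", "NEGATIVE", 98.07),
    ("Fresh and fun to watch!", "POSITIVE", 99.41),
    ("Confusing from start to end.", "NEGATIVE", 96.02),
    ("Better than I expected.", "POSITIVE", 95.50),
    ("Lacked heart and emotion.", "NEGATIVE", 93.41),
    ("Perfect for a lazy afternoon.", "POSITIVE", 93.03),
    ("Not great, but not awful.", "NEGATIVE", 74.36),
    ("A total waste of two hours.", "NEGATIVE", 99.10),
    ("Visually stunning, great vibes!", "POSITIVE", 99.61),
    ("Nothing memorable here.", "NEGATIVE", 95.18),
    ("A sequel that actually works!", "POSITIVE", 95.86),
]

REVIEW_THRESHOLD = 75.0  # assumed cutoff, in percent

def needs_human_review(confidence, threshold=REVIEW_THRESHOLD):
    """Flag a prediction whose confidence falls below the threshold."""
    return confidence < threshold

flagged = [p for p in predictions if needs_human_review(p[2])]
for text, label, confidence in flagged:
    print(f"REVIEW: {text} - {label} - {confidence:.2f}%")
```

With this cutoff, three reviews are flagged: the one incorrect prediction ("This one is a pass"), one of the two neutral statements ("Not great, but not awful."), and one correct prediction ("Not a fan, don't recommend"). The other neutral statement, "Decent, but nothing special.", was scored at 98.07% and would not be caught, so a threshold alone does not solve the neutral-review problem.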

Trained model predictions from distilbert:
*-Incorrect
$-Neutral Statement
_____________________________
It was good - POSITIVE - 90.81%

Not a fan, don't recommend - NEGATIVE - 93.56%

Better than the first one - POSITIVE - 92.05%

*This is not worth watching once - POSITIVE - 60.50%

*This one is a pass - POSITIVE - 83.91%

Loved it, highly recommend! - POSITIVE - 98.26%

Terrible plot, skip it. - NEGATIVE - 99.56%

Surprisingly entertaining! - POSITIVE - 93.83%

A dull and predictable mess. - NEGATIVE - 99.36%

$Decent, but nothing special. - NEGATIVE - 89.03%

Fresh and fun to watch! - POSITIVE - 99.79%

*Confusing from start to end. - POSITIVE - 52.08%

Better than I expected. - POSITIVE - 95.88%

Lacked heart and emotion. - NEGATIVE - 62.29%

Perfect for a lazy afternoon. - POSITIVE - 82.45%

$Not great, but not awful. - POSITIVE - 77.93%

A total waste of two hours. - NEGATIVE - 97.45%

Visually stunning, great vibes! - POSITIVE - 99.17%

Nothing memorable here. - NEGATIVE - 98.00%

A sequel that actually works! - POSITIVE - 73.38%

_____________________________________________________

This is from an earlier run.  The only difference is that we used distilbert, a lighter, cheaper model, as the base model.  The fine-tuning took about half the time, but the results were not as impressive: 15 correct, 3 incorrect, and 2 relatively neutral statements.  That's an accuracy of 83% vs the 