# Lightweight Fine-Tuning Project

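This notebook fine-tunes a sequence classifier for financial sentiment with the following choices:

* PEFT technique: LoRA (`r=8`, `lora_alpha=16`, `lora_dropout=0.1`) applied to the `query`, `key`, `value`, and `dense` linear layers, with the classification head trained in full
* Model: `bert-base-uncased` with a 3-label sequence classification head
* Evaluation approach: accuracy plus weighted precision, recall, and F1 via `Trainer.evaluate`, measured before and after fine-tuning
* Fine-tuning dataset: `financial_phrasebank` (`sentences_allagree` configuration), 2,264 sentences split 80/20 into training and validation sets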

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
from datasets import load_dataset
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)
from sklearn.metrics import (accuracy_score, 
                             precision_recall_fscore_support,
                             confusion_matrix, 
                             classification_report)

In [2]:
# Load a small financial sentiment dataset (positive, negative, neutral)
dataset = load_dataset("financial_phrasebank", "sentences_allagree", trust_remote_code=True)

In [3]:
model_name="bert-base-uncased"


In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [5]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, 
                                                           num_labels=3) 
# 3 labels = positive, negative, neutral
print(f"Base model parameters: {model.num_parameters():,}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Base model parameters: 109,484,547


In [None]:
# classifier = pipeline(
#         "text-classification",
#         model=model,
#         tokenizer=tokenizer,
#         top_k=None
#     )

In [6]:
# Explore the data and collect the sentences and labels used for the baseline evaluation
import numpy as np
# Use the full dataset so that every label type is represented
test_data = dataset["train"]  # FinancialPhraseBank only has a train split
sentences = test_data["sentence"]
true_labels = test_data["label"]

print(f"Dataset size: {len(sentences)} sentences")
print(f"Label distribution: {np.bincount(true_labels)}")

Dataset size: 2264 sentences
Label distribution: [ 303 1391  570]
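
The label distribution above is imbalanced, so any accuracy figure is easier to read next to a majority-class baseline: the accuracy of always predicting the most frequent label. The sketch below computes it from the counts printed above. The variable names are illustrative and not part of the notebook.

```python
# Label counts copied from the output above (list index = label id).
label_counts = [303, 1391, 570]

total = sum(label_counts)
majority_count = max(label_counts)

# Accuracy of a classifier that always predicts the most frequent label.
majority_baseline = majority_count / total

print(f"Total sentences: {total}")
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

A model that scores below this value on the same data is doing worse than a constant prediction, which is expected for a classification head that has not been trained yet.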


In [7]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    accuracy = accuracy_score(labels, predictions)
    
    # Add zero_division parameter to handle undefined metrics
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average='weighted', zero_division=0
    )
    
    return {
        'accuracy': accuracy,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
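
To make the `average='weighted'` setting concrete, here is a minimal pure-Python sketch of a support-weighted F1 score. It mirrors the idea behind the scikit-learn call above rather than its implementation, and the labels in the toy example are made up. An undefined per-class score is treated as zero, as with `zero_division=0`.

```python
from collections import Counter


def weighted_f1(labels, predictions):
    """Mean of per-class F1 scores, weighted by each class's true count."""
    support = Counter(labels)
    weighted_sum = 0.0
    for cls, n_true in support.items():
        true_pos = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        n_pred = sum(1 for p in predictions if p == cls)
        precision = true_pos / n_pred if n_pred else 0.0  # zero_division=0 behaviour
        recall = true_pos / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weighted_sum += f1 * n_true
    return weighted_sum / len(labels)


# Toy example with made-up labels: class 0 is never predicted.
toy_labels = [0, 1, 1, 1, 2, 2]
toy_predictions = [1, 1, 1, 1, 2, 1]
toy_score = weighted_f1(toy_labels, toy_predictions)
print(f"Weighted F1: {toy_score:.4f}")
```

One consequence of weighting by support is that weighted recall always equals accuracy, which is why the Recall and Accuracy columns in the training log are identical.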

In [8]:
def tokenize_function(examples):
    # Tokenize the sentences
    return tokenizer(
        examples["sentence"], 
        truncation=True, 
        padding=True, 
        max_length=512
    )


In [9]:
from datasets import Dataset

eval_data = {
    "sentence": list(sentences),  # Ensure it's a proper list
    "label": list(true_labels)    # Ensure it's a proper list
}


eval_dataset = Dataset.from_dict(eval_data)
eval_dataset = eval_dataset.map(tokenize_function, batched=True)

# IMPORTANT: Remove the original text column and rename label to labels
eval_dataset = eval_dataset.remove_columns(['sentence'])
eval_dataset = eval_dataset.rename_column('label', 'labels')

# Set up trainer for evaluation
training_args = TrainingArguments(
    output_dir='./temp_eval',
    per_device_eval_batch_size=8,  # Small batch size; the evaluation set has 2,264 examples
    dataloader_drop_last=False,
    remove_unused_columns=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Evaluate the model
print("Evaluating model...")
eval_results = trainer.evaluate()
print(f"Accuracy: {eval_results['eval_accuracy']:.4f}")

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Evaluating model...


Accuracy: 0.5777


## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

###  ⚠️ IMPORTANT ⚠️

Due to workspace storage constraints, do not store model weights in the workspace directory. Always save them under `/tmp` to avoid workspace crashes, which are irrecoverable.

In [10]:
from peft import (
    get_peft_model, 
    LoraConfig, 
    TaskType,
    PeftModel,
    PeftConfig
)

In [11]:
# Find candidate target modules (linear layers) in our model (bert-base-uncased)
import torch
linear_cls = torch.nn.Linear
names = []
for name, module in model.named_modules():
    if isinstance(module, linear_cls):
        names.append(name.split('.')[-1])
print(list(set(names)))

['dense', 'value', 'key', 'classifier', 'query']


In [12]:
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=[
        "query",
        "value", 
        "key",
        "dense"
    ],
    bias="none",
    modules_to_save=["classifier"],
)
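
The short names in `target_modules` select more layers than the list suggests. The sketch below assumes that PEFT matches a list-valued `target_modules` against the end of each module's full name. It uses a handful of module names from `bert-base-uncased` (layer 0 plus the pooler and the head), and the helper `is_target` is hypothetical, written only for this illustration.

```python
# A few full module names as they appear in bert-base-uncased.
module_names = [
    "bert.encoder.layer.0.attention.self.query",
    "bert.encoder.layer.0.attention.self.key",
    "bert.encoder.layer.0.attention.self.value",
    "bert.encoder.layer.0.attention.output.dense",
    "bert.encoder.layer.0.intermediate.dense",
    "bert.encoder.layer.0.output.dense",
    "bert.pooler.dense",
    "classifier",
]
target_modules = ["query", "value", "key", "dense"]


def is_target(name, targets):
    """Assumed matching rule: exact name, or the last dotted component matches."""
    return any(name == t or name.endswith("." + t) for t in targets)


matched = [name for name in module_names if is_target(name, target_modules)]
for name in matched:
    print(name)
```

Under this assumption, `"dense"` covers the attention output projection, both feed-forward projections, and the pooler, so LoRA adapters are attached to every linear layer except the classification head.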

In [13]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
peft_model = get_peft_model(model, peft_config)
peft_model = peft_model.float()  # Ensure float32
peft_model = peft_model.to(device)

Using device: cuda


In [14]:
print(f"PEFT model parameters: {peft_model.num_parameters():,}")
trainable_params = peft_model.get_nb_trainable_parameters()
trainable_params = trainable_params[0]  # Extract the number from tuple

print(f"Trainable parameters: {trainable_params:,}")
print(f"Percentage of trainable parameters: {trainable_params / peft_model.num_parameters() * 100:.2f}%")


PEFT model parameters: 110,826,246
Trainable parameters: 1,344,006
Percentage of trainable parameters: 1.21%
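
The parameter counts above can be checked by hand. A LoRA adapter of rank `r` on a linear layer adds `r * (in_features + out_features)` weights. The sketch below applies this to the standard BERT-base dimensions (12 layers, hidden size 768, feed-forward size 3072), assuming adapters on `query`, `key`, `value`, and every `dense` layer including the pooler, plus one saved copy of the 3-label classifier.

```python
HIDDEN = 768          # BERT-base hidden size
INTERMEDIATE = 3072   # BERT-base feed-forward size
NUM_LAYERS = 12
NUM_LABELS = 3
RANK = 8              # r in the LoraConfig above


def lora_params(in_features, out_features, r=RANK):
    """Weights added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


per_layer = (
    4 * lora_params(HIDDEN, HIDDEN)         # query, key, value, attention.output.dense
    + lora_params(HIDDEN, INTERMEDIATE)     # intermediate.dense
    + lora_params(INTERMEDIATE, HIDDEN)     # output.dense
)
pooler = lora_params(HIDDEN, HIDDEN)        # bert.pooler.dense
lora_total = NUM_LAYERS * per_layer + pooler

classifier_copy = HIDDEN * NUM_LABELS + NUM_LABELS  # weight + bias of the saved head
added = lora_total + classifier_copy

# Totals reported by the notebook before and after wrapping the model.
reported_added = 110_826_246 - 109_484_547

print(f"LoRA adapter parameters: {lora_total:,}")
print(f"Classifier copy:         {classifier_copy:,}")
print(f"Computed added total:    {added:,}")
print(f"Reported added total:    {reported_added:,}")
```

The computed total matches the growth in the reported model size. The reported trainable count (1,344,006) is 2,307 higher, which is exactly the size of one classifier head. One possible reading is that the PEFT version used here counts both the original head and its saved copy as trainable, but the notebook does not confirm this.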


In [15]:
def prepare_dataset(tokenizer, test_size=0.2):
    dataset = load_dataset("financial_phrasebank", "sentences_allagree", trust_remote_code=True)
    train_val_split = dataset["train"].train_test_split(test_size=test_size, seed=42)

    def tokenize_function(examples):
        return tokenizer(
            examples["sentence"],
            truncation=True,
            padding='max_length',  # Pad every example to max_length up front, so the data collator has nothing left to pad
            max_length=512
        )

    # Tokenize datasets
    tokenized_train = train_val_split["train"].map(tokenize_function, batched=True)
    tokenized_val = train_val_split["test"].map(tokenize_function, batched=True)

    print(f"Training samples: {len(tokenized_train)}")
    print(f"Validation samples: {len(tokenized_val)}")

    # Check label distribution
    train_labels = tokenized_train["label"]
    print(f"Label distribution: {np.bincount(train_labels)}")

    return tokenized_train, tokenized_val

In [16]:
train_dataset, val_dataset = prepare_dataset(tokenizer)

Training samples: 1811
Validation samples: 453
Label distribution: [ 230 1111  470]


In [17]:
data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer,
    padding=True,
    #max_length=512,
    pad_to_multiple_of=8,  # For efficiency on GPU
    return_tensors="pt"
)
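
As a rough illustration of what dynamic padding with `pad_to_multiple_of=8` does, the pure-Python sketch below pads a batch to the length of its longest sequence, rounded up to the next multiple of 8. The function `pad_batch` and the token ids are hypothetical. The real collator works on tokenizer outputs and returns tensors.

```python
def pad_batch(sequences, pad_id=0, multiple_of=8):
    """Pad token-id lists to the batch maximum, rounded up to a multiple."""
    longest = max(len(seq) for seq in sequences)
    target = -(-longest // multiple_of) * multiple_of  # ceiling to the next multiple
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Two made-up sequences of length 3 and 10.
batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 3231, 6251, 2005, 1996, 2944, 102]]
input_ids, attention_mask = pad_batch(batch)
print(len(input_ids[0]), len(input_ids[1]))
print(sum(attention_mask[0]), sum(attention_mask[1]))
```

Dynamic padding only saves work when the tokenizer leaves sequences unpadded. In this notebook `prepare_dataset` already pads every example to 512 tokens, which is a multiple of 8, so the collator adds no further padding.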

In [18]:
torch.cuda.empty_cache() # empty cache
training_args = TrainingArguments(
    output_dir="/tmp/peft_financial_sentiment_fixed",  # Checkpoints go to /tmp, per the storage warning above
    num_train_epochs=2,  # Reduced epochs
    per_device_train_batch_size=4,  # Smaller batch size
    per_device_eval_batch_size=8,
    warmup_steps=50,  # Reduced warmup
    weight_decay=0.01,
    learning_rate=3e-4,  # IMPORTANT: Add explicit learning rate
    logging_dir="./logs",
    logging_steps=50,
    evaluation_strategy="steps",
    eval_steps=100,  # More frequent evaluation
    save_strategy="steps", 
    save_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    report_to="none",  # The string "none" disables logging integrations; None falls back to the default
    gradient_checkpointing=False,
    dataloader_pin_memory=False,
    remove_unused_columns=True,
    dataloader_drop_last=False,
    group_by_length=False,
    fp16=False,  # Explicitly disable mixed precision
    dataloader_num_workers=0,  # Avoid multiprocessing issues
)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
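
The schedule implied by these arguments can be worked out in advance. The sketch below assumes a single device, no gradient accumulation, and a final partial batch that is kept (`dataloader_drop_last=False`). The sample count is the training split size printed earlier.

```python
import math

train_samples = 1811   # training split size printed by prepare_dataset
batch_size = 4         # per_device_train_batch_size
epochs = 2             # num_train_epochs
eval_every = 100       # eval_steps and save_steps
warmup_steps = 50

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Steps at which evaluation and checkpointing run.
eval_points = list(range(eval_every, total_steps + 1, eval_every))
warmup_fraction = warmup_steps / total_steps

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total optimizer steps: {total_steps}")
print(f"Evaluations during training: {len(eval_points)}")
print(f"Warmup share of training: {warmup_fraction:.1%}")
```

Under these assumptions the total comes to 906 steps with nine evaluations, which agrees with the step count and the nine logged rows in the training output.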

In [19]:
# Train the model
print("Starting training...")
train_result = trainer.train()

# Print training results
print("\nTraining completed!")
print(f"Training loss: {train_result.training_loss:.4f}")
print(f"Training steps: {train_result.global_step}")

# Evaluate the model
print("\nEvaluating model...")
eval_result = trainer.evaluate()

print("Evaluation results:")
for key, value in eval_result.items():
    if key.startswith('eval_'):
        print(f"  {key}: {value:.4f}")

Starting training...


| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|------|---------------|-----------------|----------|----|-----------|--------|
| 100 | 0.7049 | 0.540271 | 0.754967 | 0.717461 | 0.798167 | 0.754967 |
| 200 | 0.484 | 0.434057 | 0.777042 | 0.727658 | 0.702129 | 0.777042 |
| 300 | 0.2297 | 0.305738 | 0.931567 | 0.929712 | 0.933842 | 0.931567 |
| 400 | 0.1943 | 0.426732 | 0.898455 | 0.901859 | 0.916757 | 0.898455 |
| 500 | 0.1293 | 0.201719 | 0.95585 | 0.955711 | 0.955628 | 0.95585 |
| 600 | 0.1037 | 0.246559 | 0.951435 | 0.950838 | 0.950947 | 0.951435 |
| 700 | 0.1225 | 0.147053 | 0.971302 | 0.971251 | 0.971215 | 0.971302 |
| 800 | 0.1557 | 0.171835 | 0.969095 | 0.969014 | 0.969554 | 0.969095 |
| 900 | 0.1255 | 0.161317 | 0.969095 | 0.969124 | 0.969641 | 0.969095 |



Training completed!
Training loss: 0.2805
Training steps: 906

Evaluating model...


Evaluation results:
  eval_loss: 0.1471
  eval_accuracy: 0.9713
  eval_f1: 0.9713
  eval_precision: 0.9712
  eval_recall: 0.9713
  eval_runtime: 16.4393
  eval_samples_per_second: 27.5560
  eval_steps_per_second: 3.4670
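
The final evaluation matches the step-700 row of the training log rather than the last step. That is consistent with `load_best_model_at_end=True` restoring the checkpoint with the highest `accuracy`. The sketch below reproduces that selection from the logged values, assuming the best checkpoint is simply the row with the maximum metric.

```python
# (step, validation accuracy) pairs copied from the training log above.
eval_log = [
    (100, 0.754967),
    (200, 0.777042),
    (300, 0.931567),
    (400, 0.898455),
    (500, 0.95585),
    (600, 0.951435),
    (700, 0.971302),
    (800, 0.969095),
    (900, 0.969095),
]

# greater_is_better=True, so the best checkpoint has the highest accuracy.
best_step, best_accuracy = max(eval_log, key=lambda row: row[1])

print(f"Best checkpoint: step {best_step} (accuracy {best_accuracy:.4f})")
```

The selected row also has the validation loss (0.147053) that the final evaluation reports as `eval_loss: 0.1471`.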


In [20]:
# Saving the model
peft_model.save_pretrained("/tmp/my_finsentiment_pretrained_model")

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [21]:
from peft import AutoPeftModelForSequenceClassification
peft_model_saved = AutoPeftModelForSequenceClassification.from_pretrained(
    "/tmp/my_finsentiment_pretrained_model",
    num_labels=3  # Explicitly specify 3 classes for financial sentiment
)
peft_model_saved = peft_model_saved.to(device)  # Reuse the device selected earlier instead of hardcoding 'cuda'

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
print("Evaluating PEFT model...")

peft_training_args = TrainingArguments(
    output_dir='./temp_peft_eval',
    per_device_eval_batch_size=8,
    dataloader_drop_last=False,
    remove_unused_columns=False,
)

peft_trainer = Trainer(
    model=peft_model_saved,
    args=peft_training_args,
    eval_dataset=eval_dataset,  # Same full dataset as the baseline; it includes the training sentences
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Evaluate the PEFT model
peft_eval_results = peft_trainer.evaluate()

# Print PEFT model results
print("\nPEFT Model Results:")
print(f"Accuracy: {peft_eval_results['eval_accuracy']:.4f}")

Evaluating PEFT model...



PEFT Model Results:
Accuracy: 0.9784


In [23]:
accuracy_before = eval_results['eval_accuracy']
accuracy_after = peft_eval_results['eval_accuracy']
print(f"Accuracy before: {accuracy_before:.4f}")
print(f"Accuracy after: {accuracy_after:.4f}")
# Relative improvement: the gain expressed as a percentage of the baseline accuracy
print(f"Relative improvement in accuracy: {(accuracy_after / accuracy_before - 1) * 100:.2f}%")


Accuracy before: 0.5777
Accuracy after: 0.9784
Relative improvement in accuracy: 69.34%
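
The improvement printed above is relative to the baseline accuracy, which can be mistaken for a gain in percentage points. The sketch below recovers the fine-tuned accuracy from the two unrounded values in the original output and reports the absolute gain alongside the relative one. The variable names are illustrative.

```python
# Unrounded values from the notebook's original output.
accuracy_before = 0.5777385159010601
relative_improvement_pct = 69.34250764525993

# Invert the relative-improvement formula to recover the fine-tuned accuracy.
accuracy_after = accuracy_before * (1 + relative_improvement_pct / 100)

# Absolute gain, in percentage points.
absolute_gain_points = (accuracy_after - accuracy_before) * 100

print(f"Accuracy after: {accuracy_after:.4f}")
print(f"Absolute gain: {absolute_gain_points:.2f} percentage points")
print(f"Relative gain: {relative_improvement_pct:.2f}%")
```

Both figures are measured on all 2,264 sentences, 1,811 of which were used for training. The validation accuracy of 0.9713 on the 453 held-out sentences is therefore the more conservative estimate of the fine-tuned model's performance.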
