# Lightweight Fine-Tuning Project

**Dataset**: FinanceInc/auditor_sentiment: auditor review sentiment collected by the News Department. The dataset consists of several thousand sentences from English-language financial news, each labeled by sentiment.

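Features: `sentence` (the news sentence, a string) and `label` (an integer class id: 0 = Negative, 1 = Neutral, 2 = Positive).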

* PEFT technique: LoRA
* Model: GPT-2
* Evaluation approach: Hugging Face `Trainer` with accuracy and weighted F1, precision and recall
* Fine-tuning dataset: Auditor Sentiment

## Loading and Evaluating a Foundation Model

The cells below load the pre-trained GPT-2 model together with its tokenizer and the dataset, evaluate the model before any fine-tuning, and then fully fine-tune it for one epoch as a reference point.

In [1]:
!pip install scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-learn
  Downloading scikit_learn-1.4.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
Collecting joblib>=1.2.0
  Downloading joblib-1.4.0-py3-none-any.whl (301 kB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-3.4.0-py3-none-any.whl (17 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn
Successfully installed joblib-1.4.0 scikit-learn-1.4.2 threadpoolctl-3.4.0


In [2]:
import transformers
import torch

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

from peft import LoraConfig, PeftModelForSequenceClassification, TaskType, AutoPeftModelForSequenceClassification

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

In [3]:
dataset = load_dataset('FinanceInc/auditor_sentiment', split = 'train').train_test_split(
    test_size = 0.2, shuffle = True, seed = 42
)

Downloading readme:   0%|          | 0.00/3.71k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/800 [00:00<?, ?B/s]

Downloading data: 100%|██████████| 327k/327k [00:00<00:00, 1.02MB/s]
Downloading data: 100%|██████████| 80.9k/80.9k [00:00<00:00, 452kB/s]


Generating train split:   0%|          | 0/3877 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/969 [00:00<?, ? examples/s]

In [4]:
dataset['train']

Dataset({
    features: ['sentence', 'label'],
    num_rows: 3101
})

In [5]:
dataset['test']

Dataset({
    features: ['sentence', 'label'],
    num_rows: 776
})
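
The two row counts can be checked by hand. The sketch below assumes `train_test_split` rounds the test share up and gives the remainder to the training set, as scikit-learn's splitter does; that rounding rule is an assumption, not something the output states. Note that the 969-row test split published with the dataset is not loaded here: the notebook's `'test'` set is carved out of the 3,877-row train split.

```python
import math

n_rows = 3877      # rows in the published train split, from the download log above
test_size = 0.2    # fraction passed to train_test_split

# ASSUMPTION: the test share is rounded up and the remainder goes to train.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```

Under that assumption the result agrees with the 3101 and 776 rows reported above.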

In [6]:
dataset['train'][0]

{'sentence': "The financial impact is estimated to be some 1.5 MEUR annual improvement in the division 's result , starting from fiscal year 2007 .",
 'label': 2}

In [7]:
model_name = 'gpt2'

In [8]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels = 3,
    id2label = {0: 'Negative', 1: 'Neutral', 2: 'Positive'},
    label2id = {'Negative': 0, 'Neutral': 1, 'Positive': 2}
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
tokenized_dataset = {}
splits = ['train', 'test']
for split in splits:
    tokenized_dataset[split] = dataset[split].map(
        lambda x: tokenizer(x['sentence'], padding = 'max_length', truncation=True), batched=True
    )

Map:   0%|          | 0/3101 [00:00<?, ? examples/s]

Map:   0%|          | 0/776 [00:00<?, ? examples/s]

In [10]:
tokenized_dataset['train'][0]

{'sentence': "The financial impact is estimated to be some 1.5 MEUR annual improvement in the division 's result , starting from fiscal year 2007 .",
 'label': 2,
 'input_ids': [464, 3176, 2928, 318, 6108, 284, 307, 617, 352, 13, 20, 11948, 4261, 5079,
  9025, 287, 262, 7297, 705, 82, 1255, 837, 3599, 422, 9068, 614, 4343, 764,
  50256, 50256, 50256, 50256, ...
(the pad token id 50256 repeats until the output is truncated)
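
The example above shows how much of each input is padding when `padding = 'max_length'` is used. The sketch below counts it for this one sentence. It assumes the tokenizer pads to 1024 positions, which matches the `wpe` embedding size GPT-2 reports but is not shown in the truncated output.

```python
# Token ids of the example sentence, copied from the output above (pad ids excluded).
real_ids = [464, 3176, 2928, 318, 6108, 284, 307, 617, 352, 13, 20, 11948, 4261, 5079,
            9025, 287, 262, 7297, 705, 82, 1255, 837, 3599, 422, 9068, 614, 4343, 764]

PAD_ID = 50256      # GPT-2's eos token id, reused as the pad token in this notebook
MAX_LENGTH = 1024   # ASSUMPTION: the length 'max_length' padding expands to for 'gpt2'

n_real = len(real_ids)
n_pad = MAX_LENGTH - n_real
padded = real_ids + [PAD_ID] * n_pad   # what the padded input_ids would look like

pad_fraction = n_pad / MAX_LENGTH
print(f"{n_real} real tokens, {n_pad} pad tokens, {pad_fraction:.1%} padding")
```

If most sentences are this short, tokenizing with truncation only and letting `DataCollatorWithPadding` pad each batch to its longest member would likely reduce compute considerably. That is a suggestion; the notebook does not measure it.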

In [11]:
print(model)

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=3, bias=False)
)


In [12]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='weighted')
    return {"accuracy": (predictions == labels).mean(), "f1": f1, "precision": precision, "recall": recall}

trainer = Trainer(
    model = model,
    args = TrainingArguments(
        output_dir = './model_output/financial_sentiment_analysis_1',
        learning_rate = 2e-4,
        per_device_train_batch_size = 1,
        per_device_eval_batch_size = 1,
        num_train_epochs = 1,
        warmup_ratio=0.03,
        evaluation_strategy = 'epoch',
        save_strategy = 'epoch',
   ),
    train_dataset = tokenized_dataset['train'],
    eval_dataset = tokenized_dataset['test'],
    tokenizer = tokenizer,
    data_collator = DataCollatorWithPadding(tokenizer = tokenizer),
    compute_metrics = compute_metrics
)

In [13]:
evaluation_results_before_training = trainer.evaluate()
print("Evaluation Results before Training:", evaluation_results_before_training)

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Evaluation Results before Training: {'eval_loss': 7.094663619995117, 'eval_accuracy': 0.6018041237113402, 'eval_f1': 0.4522003632714334, 'eval_precision': 0.36216820331597405, 'eval_recall': 0.6018041237113402, 'eval_runtime': 66.5501, 'eval_samples_per_second': 11.66, 'eval_steps_per_second': 11.66}


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
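
The metrics before training follow a recognizable pattern: recall equals accuracy, and precision is almost exactly accuracy squared. Weighted averaging produces this pattern when a classifier predicts the same class for every example. The sketch below assumes the untrained model always predicts a class that covers 467 of the 776 test rows; this is inferred from the numbers, not checked by the notebook. It would also fit the truncated `_warn_prf` line, which looks like the tail of scikit-learn's warning about precision being undefined for labels that are never predicted.

```python
n_test = 776
n_majority = 467   # ASSUMPTION: 467 / 776 reproduces the reported accuracy

p = n_majority / n_test   # share of the class that is always predicted

accuracy = p
# Weighted average over classes; never-predicted classes contribute 0 to each metric.
precision = p * p                    # predicted class: precision p, weight p
recall = p * 1.0                     # predicted class: recall 1, weight p
f1 = p * (2 * p * 1.0 / (p + 1.0))   # predicted class: F1 = 2p / (p + 1), weight p

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The four values agree with the reported `eval_accuracy`, `eval_precision`, `eval_recall` and `eval_f1` to several decimal places, which supports reading the 60% accuracy as a majority-class baseline rather than as learned behavior.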


In [14]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,1.9065,1.475056,0.708763,0.667667,0.632333,0.708763


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


TrainOutput(global_step=3101, training_loss=2.2058815757753925, metrics={'train_runtime': 1005.3364, 'train_samples_per_second': 3.085, 'train_steps_per_second': 3.085, 'total_flos': 1620577079525376.0, 'train_loss': 2.2058815757753925, 'epoch': 1.0})
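
The figures in `TrainOutput` follow directly from the training arguments. The sketch below recomputes them, assuming a single device and assuming warmup steps are the warmup ratio times the total step count, rounded up.

```python
import math

n_train = 3101              # training rows
batch_size = 1              # per_device_train_batch_size (single device assumed)
epochs = 1
warmup_ratio = 0.03
train_runtime = 1005.3364   # seconds, from TrainOutput

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
# ASSUMPTION: warmup steps = ceil(total steps * warmup ratio).
warmup_steps = math.ceil(total_steps * warmup_ratio)
samples_per_second = n_train * epochs / train_runtime

print(total_steps, warmup_steps, round(samples_per_second, 3))
```

With a batch size of 1, every training row is its own optimizer step, which is why `global_step` equals the number of training rows.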

In [15]:
evaluation_results = trainer.evaluate()
print("Evaluation Results:", evaluation_results)

Evaluation Results: {'eval_loss': 1.4750558137893677, 'eval_accuracy': 0.7087628865979382, 'eval_f1': 0.6676674278248577, 'eval_precision': 0.6323330126724229, 'eval_recall': 0.7087628865979382, 'eval_runtime': 69.8872, 'eval_samples_per_second': 11.104, 'eval_steps_per_second': 11.104, 'epoch': 1.0}


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


## Performing Parameter-Efficient Fine-Tuning

The cells below wrap a freshly loaded GPT-2 model in a LoRA adapter, train it for one epoch, and save the PEFT model weights.

In [16]:
def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [17]:
# PEFT model configuration

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=4,
    lora_alpha=16,
    lora_dropout=0.1
)

# Load the pre-trained GPT-2 model
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=3)
model.config.pad_token_id = model.config.eos_token_id

peft_model = PeftModelForSequenceClassification(model, peft_config)

# Print
peft_model.print_trainable_parameters()

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 152,064 || all params: 124,591,872 || trainable%: 0.12204969518396834
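
The trainable-parameter count can be reconstructed from the configuration. The sketch below assumes LoRA is attached only to `c_attn` (PEFT's default target for GPT-2, since `target_modules` is not set) and that the reported count includes two copies of the classification head. The second assumption is one plausible reading of the leftover 4,608 parameters, not something the output confirms.

```python
hidden = 768           # GPT-2 hidden size, from print(model)
qkv_out = 3 * hidden   # c_attn projects to concatenated query, key and value
n_layers = 12
r = 4                  # LoRA rank from LoraConfig
num_labels = 3

# ASSUMPTION: LoRA is applied only to c_attn in each of the 12 blocks.
lora_per_layer = r * hidden + qkv_out * r   # A is r x 768, B is 2304 x r
lora_total = n_layers * lora_per_layer

score_head = hidden * num_labels            # Linear(768, 3, bias=False)
# ASSUMPTION: the count includes the original score head plus the trainable
# copy that PEFT keeps for sequence-classification tasks.
trainable = lora_total + 2 * score_head

all_params = 124_591_872                    # from the output above
pct = 100 * trainable / all_params
print(lora_total, trainable, pct)
```

Under these assumptions the adapter itself accounts for 147,456 parameters, and the total matches the 152,064 trainable parameters (about 0.12%) printed above.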


In [18]:
trainer_peft = Trainer(
    model = peft_model,
    args = TrainingArguments(
        output_dir = './model_output/financial_sentiment_analysis_2',
        learning_rate = 2e-4,
        per_device_train_batch_size = 1,
        per_device_eval_batch_size = 1,
        num_train_epochs = 1,
        warmup_ratio=0.03,
        evaluation_strategy = 'epoch',
        save_strategy = 'epoch',
   ),
    train_dataset = tokenized_dataset['train'],
    eval_dataset = tokenized_dataset['test'],
    tokenizer = tokenizer,
    data_collator = DataCollatorWithPadding(tokenizer = tokenizer),
    compute_metrics = compute_metrics
)

In [19]:
trainer_peft.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,1.1441,1.055765,0.811856,0.804465,0.811592,0.811856


TrainOutput(global_step=3101, training_loss=1.3936155504351544, metrics={'train_runtime': 765.4683, 'train_samples_per_second': 4.051, 'train_steps_per_second': 4.051, 'total_flos': 1623430388514816.0, 'train_loss': 1.3936155504351544, 'epoch': 1.0})

In [20]:
peft_evaluation_results = trainer_peft.evaluate()
print("Evaluation Results with PEFT:", peft_evaluation_results)

Evaluation Results with PEFT: {'eval_loss': 1.055765151977539, 'eval_accuracy': 0.8118556701030928, 'eval_f1': 0.8044647347458389, 'eval_precision': 0.811592268341624, 'eval_recall': 0.8118556701030928, 'eval_runtime': 72.353, 'eval_samples_per_second': 10.725, 'eval_steps_per_second': 10.725, 'epoch': 1.0}


In [21]:
peft_model.save_pretrained('model/peft_model')

## Performing Inference with a PEFT Model

The cells below load the saved PEFT model weights, run an evaluation pass, and try the model on a few sample sentences. The results are then compared with those from before fine-tuning.

In [22]:
inf_model = AutoPeftModelForSequenceClassification.from_pretrained(
    "model/peft_model",
    num_labels = 3
)
inf_model.config.pad_token_id = inf_model.config.eos_token_id

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
trainer_inf_model = Trainer(
    model = inf_model,
    args = TrainingArguments(
        output_dir = './model_output/financial_sentiment_analysis_3',
        learning_rate = 2e-4,
        per_device_train_batch_size = 1,
        per_device_eval_batch_size = 1,
        num_train_epochs = 1,
        weight_decay = 0.01,
        max_grad_norm=0.3,
        warmup_ratio=0.03,
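        label_names = [0, 1, 2],  # note: label_names expects input key names (e.g. 'labels'), not class ids; likely why evaluate() below reports no metrics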
        evaluation_strategy = 'epoch',
        save_strategy = 'epoch',
        load_best_model_at_end = True
    ),
    eval_dataset = tokenized_dataset['test'],
    tokenizer = tokenizer,
    data_collator = DataCollatorWithPadding(tokenizer = tokenizer),
    compute_metrics = compute_metrics
)

In [24]:
inf_evaluation_results = trainer_inf_model.evaluate()
print("Inference Evaluation Results:", inf_evaluation_results)

Inference Evaluation Results: {'eval_runtime': 72.4522, 'eval_samples_per_second': 10.711, 'eval_steps_per_second': 10.711}
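
The result above contains only timing keys: no `eval_loss` and no accuracy. A likely cause (an inference, not confirmed by the notebook) is `label_names = [0, 1, 2]`. `label_names` is meant to list the keys of the input dictionary that hold labels, and `Trainer` skips loss and metric computation when those keys are missing from a batch. The function below is a simplified stand-in for that check, written for illustration only; it is not the library's code, and `batch_has_labels` is a hypothetical name.

```python
def batch_has_labels(batch: dict, label_names: list) -> bool:
    """Simplified stand-in: a batch counts as labeled only if every
    configured label key is present with a non-None value."""
    if not label_names:
        return False
    return all(batch.get(name) is not None for name in label_names)


# A collated batch exposes the class ids under the key 'labels'.
batch = {"input_ids": [[464, 3176]], "attention_mask": [[1, 1]], "labels": [2]}

print(batch_has_labels(batch, [0, 1, 2]))    # keys 0, 1, 2 are not in the batch
print(batch_has_labels(batch, ["labels"]))   # the key the collator produces
```

Under that reading, `label_names = ['labels']` would be the setting to try so that `evaluate()` reports metrics for the reloaded model.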


In [25]:
id2label = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}

def predict_sentiment(sentence: str) -> str:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    inf_model.to(device)

    # Prepare the input text
    inputs = tokenizer(sentence, return_tensors="pt").to(device)

    # Get predictions
    with torch.no_grad():
        outputs = inf_model(**inputs)
        logits = outputs.logits

    probabilities = torch.nn.functional.softmax(logits, dim = 1)
    predicted_class_id = probabilities.argmax().item()
    predicted_label = id2label[predicted_class_id]

    return predicted_label

# Example usage
sentence = "Operating profit plunged 30% over the previous year"
predicted_sentiment = predict_sentiment(sentence)
print(f"Sentence: '{sentence}'\nPredicted sentiment: {predicted_sentiment}")

Sentence: 'Operating profit plunged 30% over the previous year'
Predicted sentiment: Negative


In [26]:
sentence = "Operating profit increased 30% over the previous year"
predicted_sentiment = predict_sentiment(sentence)
print(f"Sentence: '{sentence}'\nPredicted sentiment: {predicted_sentiment}")

Sentence: 'Operating profit increased 30% over the previous year'
Predicted sentiment: Positive


In [27]:
sentence = "Corp has generated 10 consecutive years of positive cash flow"
predicted_sentiment = predict_sentiment(sentence)
print(f"Sentence: '{sentence}'\nPredicted sentiment: {predicted_sentiment}")

Sentence: 'Corp has generated 10 consecutive years of positive cash flow'
Predicted sentiment: Positive


In [28]:
sentence = "Operating cash flow has been flat for 10 consecutive quarters"
predicted_sentiment = predict_sentiment(sentence)
print(f"Sentence: '{sentence}'\nPredicted sentiment: {predicted_sentiment}")

Sentence: 'Operating cash flow has been flat for 10 consecutive quarters'
Predicted sentiment: Negative
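
`predict_sentiment` returns only the top label. Softmax preserves the ordering of the logits, so the argmax is the same with or without it. The probabilities are mainly useful for reporting how confident the model is, which helps with borderline cases such as the "flat" cash-flow sentence above. The following is a dependency-free sketch using made-up logits, not model output; `label_with_confidence` is a hypothetical helper.

```python
import math

id2label = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_confidence(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

# Hypothetical logits for one sentence, chosen only for illustration.
example_logits = [1.2, 0.9, -0.5]
label, confidence = label_with_confidence(example_logits)
print(label, round(confidence, 3))
```

A top probability close to the runner-up, as in this made-up case, is a signal that the predicted label should be treated with caution.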


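**Inferences**:

GPT-2 before any training:
* Evaluation accuracy: 60.18%
* F1: 0.45
* Precision: 0.36
* Recall: 0.60

GPT-2 after one epoch of full fine-tuning:
* Evaluation accuracy: 70.88%
* F1: 0.67
* Precision: 0.63
* Recall: 0.71

GPT-2 with PEFT (LoRA, one epoch):
* Evaluation accuracy: 81.19%
* F1: 0.80
* Precision: 0.81
* Recall: 0.81

With LoRA fine-tuning, accuracy is about 21 percentage points higher than for the untrained model and about 10 points higher than for the full fine-tuning run, while only about 0.12% of the parameters are trained. F1, precision and recall also increased. These figures come from the evaluation at the end of each training run; the evaluation of the reloaded model in cell 24 reported no metrics. In this particular case, PEFT has helped GPT-2 achieve better results on the auditor_sentiment dataset.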
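
Using the values printed by the notebook's `evaluate()` calls, the accuracy gaps can be recomputed as percentage-point differences. The run names in the sketch are labels chosen here for readability.

```python
# Metrics copied from the three evaluate() outputs in this notebook.
runs = {
    "GPT-2, untrained":      {"accuracy": 0.6018041237113402, "f1": 0.4522003632714334},
    "GPT-2, full fine-tune": {"accuracy": 0.7087628865979382, "f1": 0.6676674278248577},
    "GPT-2 + LoRA":          {"accuracy": 0.8118556701030928, "f1": 0.8044647347458389},
}

lora = runs["GPT-2 + LoRA"]
gains = {}
for name, metrics in runs.items():
    if name == "GPT-2 + LoRA":
        continue
    # Difference in percentage points, not relative percent.
    gains[name] = 100 * (lora["accuracy"] - metrics["accuracy"])
    print(f"LoRA vs {name}: +{gains[name]:.1f} accuracy points")
```

Each configuration was run once for a single epoch, so differences of this size are indicative rather than conclusive.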
