<h1 style='text-align:center; font-size:30px; font-weight:bold; '>Fine-Tuning DistilBERT on the FinancialPhraseBank Dataset</h1>

# Introduction
- What's going on
- Recommended running environment (Colab with GPU)

## Table of Contents

1. Introduction and Objective
2. Dataset Loading and Preprocessing
3. Exploratory Data Analysis
4. Baseline Classifier Head Only (Frozen Encoder)
5. Fine-Tune All Weights (Using Trained Classifier Head)
6. Fine-Tune All Weights from Scratch
7. Evaluation & Comparison
8. Discussion & Reflection


Installing libraries & adding imports

In [1]:
!pip install transformers datasets scikit-learn pandas numpy tqdm tensorflow

import pandas as pd
import numpy as np
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, DataCollatorWithPadding
from datasets import load_dataset # Hugging Face
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from tqdm import tqdm
import random
import os

import warnings
warnings.filterwarnings("ignore")

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
[?25hDownloading fsspec-2024.12.0-py3-none-any.

# Load Financial PhraseBank Dataset & Tokenizer

### Alternative Model Considerations

| Model | Parameters | Notes |
|-------|------------|-------|
| `distilbert-base-uncased` | ~66M | Lightweight, fast to train. Chosen as baseline and required for Part 2. |
| `bert-base-uncased` | ~110M | More expressive, but slower. Considered for extra experiments. |
| `albert-base-v2` | ~12M | Extremely compact due to weight sharing. Good for parameter-efficiency testing. |
| `roberta-base` | ~125M | High-performing, but uses a different tokenizer. Reserved for advanced exploration. |
| `electra-small-discriminator` | ~14M | Fast and efficient, but less common in TensorFlow workflows. Not chosen for core tasks. |

**`distilbert-base-uncased`** is selected due to:
- Smaller size compared to `bert-base-uncased`, allowing for a broader range of ablation experiments
- Solid performance on general sentiment tasks
- Compatibility with Part 2 requirements, which involve applying LoRA adapters to the FFN (`lin1`, `lin2`) layers of `distilbert-base-uncased` to evaluate parameter-efficient fine-tuning

Other models may be explored in separate sections to assess the impact of architecture and scale on downstream performance.

**Add link here**


In [2]:
# Load the "all agree" subset: sentences on which 100% of annotators agreed on the sentiment
dataset = load_dataset("financial_phrasebank", "sentences_allagree")

# Peek at the data (a bare expression in the middle of a cell displays nothing, so print it)
print(dataset["train"][0])

# Load the tokenizer that matches the chosen checkpoint
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

README.md:   0%|          | 0.00/8.88k [00:00<?, ?B/s]

financial_phrasebank.py:   0%|          | 0.00/6.04k [00:00<?, ?B/s]

The repository for financial_phrasebank contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/financial_phrasebank.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


FinancialPhraseBank-v1.0.zip:   0%|          | 0.00/682k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2264 [00:00<?, ? examples/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

# 1.1 Fine-Tuning: Classifier Head Only

DistilBERT is pretrained on general-purpose English text (Wikipedia and books). Training only the classifier head therefore tests one question: is a general understanding of English "good enough" to detect sentiment in financial text?

The process here is to freeze the DistilBERT encoder (the embeddings and transformer layers) and train only the classification head, which maps the encoder's sentence representation to class logits.
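To give a sense of how little is trained in this setup, the arithmetic below counts the parameters of the classification head. It is a rough sketch under stated assumptions: the head is taken to be the two dense layers `pre_classifier` (768 to 768) and `classifier` (768 to 3) used by the Transformers DistilBERT sequence-classification model, and the total of ~66M parameters is the approximate figure from the model table above, not a measured value.

```python
# Parameter count of the classification head that stays trainable
# when the encoder is frozen.
HIDDEN_DIM = 768   # hidden size of distilbert-base-uncased
NUM_LABELS = 3     # negative / neutral / positive


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


pre_classifier = dense_params(HIDDEN_DIM, HIDDEN_DIM)  # 768 -> 768
classifier = dense_params(HIDDEN_DIM, NUM_LABELS)      # 768 -> 3
head_params = pre_classifier + classifier

# Assumption: approximate model size taken from the table above (~66M).
APPROX_TOTAL_PARAMS = 66_000_000

print(f"Trainable head parameters: {head_params:,}")
print(f"Approximate share of the full model: {head_params / APPROX_TOTAL_PARAMS:.2%}")
```

Under these assumptions the head has 592,899 parameters, i.e. under 1% of the model, so this experiment mostly measures the quality of the frozen pretrained representations.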

## Baseline
To establish a clean and interpretable baseline, this configuration fine-tunes only the classification head of a pretrained DistilBERT model while keeping the encoder frozen. The goal is to evaluate the out-of-the-box transferability of general language representations to a financial sentiment classification task using the "all agree" subset of the Financial PhraseBank.

#### Key parameter choices:

- **Tokenization**: Sentences were tokenized using a fixed `max_length=512` with padding and truncation enabled to preserve maximum context while maintaining a uniform input shape.

- **Data splitting**: The dataset was split into 80% training, 10% validation, and 10% test to enable model selection and unbiased performance evaluation.

- **Batch size**: A batch size of 8 was used to maintain training stability and fit within memory constraints when using a GPU.

- **Frozen encoder**: Only the classifier head was trained by setting `model.distilbert.trainable = False`, allowing for a controlled assessment of the baseline model's pretrained representations.

- **Optimizer and loss**: The model was compiled with the Adam optimizer and `SparseCategoricalCrossentropy` loss, appropriate for multi-class classification with integer labels.

- **Epochs**: Training was performed for 3 epochs to minimize overfitting while providing enough iterations to observe general learning behavior.
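As a quick sanity check on the 80/10/10 split and the batch size, the sketch below works out how many examples land in each split and how many optimizer steps one epoch takes. The helper `split_sizes` is hypothetical and assumes the held-out share is rounded up, with the remainder going to the other split; the dataset size of 2264 is the count reported when the "all agree" subset is loaded.

```python
import math


def split_sizes(n_examples: int, test_fraction: float) -> tuple[int, int]:
    """Return (n_kept, n_held_out), rounding the held-out share up (assumed rule)."""
    n_held_out = math.ceil(n_examples * test_fraction)
    return n_examples - n_held_out, n_held_out


N_EXAMPLES = 2264  # size of the "all agree" subset
BATCH_SIZE = 8

# First split: 80% train, 20% held out.
n_train, n_holdout = split_sizes(N_EXAMPLES, 0.2)
# Second split: the held-out 20% is halved into validation and test.
n_val, n_test = split_sizes(n_holdout, 0.5)

steps_per_epoch = math.ceil(n_train / BATCH_SIZE)

print(f"train={n_train}, validation={n_val}, test={n_test}")
print(f"steps per epoch at batch size {BATCH_SIZE}: {steps_per_epoch}")
```

With these assumptions the split comes out to 1811 training, 226 validation, and 227 test examples, and each epoch runs 227 batches.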

In [3]:
# Tokenization
def tokenize_function(examples):
    return tokenizer(
        examples["sentence"],
        padding="max_length",
        truncation=True,
        max_length=512
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Train-test-validation split
train_val_split = tokenized_datasets["train"].train_test_split(test_size=0.2, seed=42)
val_test_split = train_val_split['test'].train_test_split(test_size=0.5, seed=42)

# Convert to TensorFlow dataset
def to_tf_dataset(split, shuffle=False):
    return split.to_tf_dataset(
        columns=["input_ids", "attention_mask"],
        label_cols=["label"],
        shuffle=shuffle,
        batch_size=8,
        collate_fn=None
    )

tf_train_dataset = to_tf_dataset(train_val_split['train'], shuffle=True)
tf_validation_dataset = to_tf_dataset(val_test_split['train'], shuffle=False)  # evaluation data does not need shuffling
tf_test_dataset = to_tf_dataset(val_test_split['test'], shuffle=False)

# Load model
model = TFAutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=3
)

# Freeze DistilBERT encoder
model.distilbert.trainable = False

# Compile
optimizer = tf.keras.optimizers.Adam()
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

# Train
history = model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=3
)

# Evaluate
eval_loss, eval_accuracy = model.evaluate(tf_test_dataset)
print(f"Test Loss: {eval_loss:.4f}, Test Accuracy: {eval_accuracy:.4f}")

# Confusion Matrix + Classification Report
y_pred_logits = model.predict(tf_test_dataset).logits
y_pred = np.argmax(y_pred_logits, axis=1)

y_true = np.concatenate([y for x, y in tf_test_dataset], axis=0)

print(classification_report(y_true, y_pred, target_names=["Negative", "Neutral", "Positive"]))

print(confusion_matrix(y_true, y_pred))


Map:   0%|          | 0/2264 [00:00<?, ? examples/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

Epoch 1/3
Epoch 2/3
Epoch 3/3
Test Loss: 0.3374, Test Accuracy: 0.8590
              precision    recall  f1-score   support

    Negative       0.79      0.63      0.70        30
     Neutral       0.97      0.92      0.94       142
    Positive       0.67      0.84      0.74        55

    accuracy                           0.86       227
   macro avg       0.81      0.80      0.80       227
weighted avg       0.87      0.86      0.86       227

[[ 19   0  11]
 [  0 130  12]
 [  5   4  46]]
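
The classification report can be cross-checked by hand from the confusion matrix. The sketch below recomputes per-class precision, recall, and F1 in plain Python, assuming the scikit-learn convention that rows are true classes and columns are predicted classes; the helper name `per_class_metrics` is illustrative only.

```python
LABELS = ["Negative", "Neutral", "Positive"]

# Confusion matrix from the run above: rows = true class, columns = predicted class.
CM = [
    [19, 0, 11],
    [0, 130, 12],
    [5, 4, 46],
]


def per_class_metrics(cm, labels):
    """Return {label: (precision, recall, f1)} computed from a confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positives = cm[i][i]
        predicted_as_class = sum(row[i] for row in cm)  # column sum
        actually_in_class = sum(cm[i])                  # row sum
        precision = true_positives / predicted_as_class
        recall = true_positives / actually_in_class
        f1 = 2 * precision * recall / (precision + recall)
        metrics[label] = (precision, recall, f1)
    return metrics


metrics = per_class_metrics(CM, LABELS)
total = sum(sum(row) for row in CM)
accuracy = sum(CM[i][i] for i in range(len(CM))) / total

for label, (precision, recall, f1) in metrics.items():
    print(f"{label:>8}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.4f} over {total} examples")
```

The recomputed values agree with the report. The matrix also shows where the errors concentrate: 11 of the 30 negative sentences and 12 of the 142 neutral sentences were predicted as positive.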


## Results 1.1



In [4]:
# Save weights

# Fine-Tune All Layers Using Previous Classifier Head

In [5]:
#

# Entire Model: Pre-Trained + Classifier Head