# Class Imbalance Experiments

This notebook explores strategies to mitigate the effects of class imbalance in the Financial PhraseBank classification task. As demonstrated in the exploratory data analysis and the benchmark baseline (Section 3.1), the dataset is skewed toward Neutral sentiment, both in overall counts and particularly among short-length inputs. This imbalance leads to uneven model performance, with significantly lower recall for the Negative class.

The goal of this notebook is to evaluate whether adjusting for class imbalance can improve macro-averaged performance and minority-class recall without degrading overall accuracy. Ideally, the adjustments will lead to a fine-tuned classifier head with better performance.

## Experiment Plan

1. **Baseline Reproduction (Section 3.1 Reference)**  
   Load and replicate the benchmark model with no class weighting to serve as a direct comparison.

2. **Static Class Weights**  
   Apply inverse-frequency class weights during training to emphasize minority classes.

3. **Length-Aware Class Reweighting**  
   Explore dynamic class weights or sampling techniques based on token length bins (e.g., over-weight short Negative examples).

4. **Short Sequence Filtering**  
   Evaluate the effect of removing or down-weighting very short, Neutral-dominant sequences.

5. **Combined Strategy**  
   Combine class weighting with input filtering to assess cumulative gains.

6. **Performance Comparison and Summary**  
   Compare all approaches to the benchmark using macro F1, per-class recall, and confusion matrices.
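
The metrics named in step 6 can all be derived from a confusion matrix. The following is a minimal sketch in plain Python that makes the definitions explicit. The `y_true` and `y_pred` lists are invented for illustration and are not results from this notebook; in practice, `classification_report` and `confusion_matrix` from scikit-learn would normally do the same job.

```python
def confusion_counts(y_true, y_pred, num_classes):
    """Return matrix[t][p] = number of examples with true class t predicted as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def per_class_recall(matrix):
    """Recall of class c = correct predictions for c / all true examples of c."""
    return [row[c] / sum(row) if sum(row) else 0.0 for c, row in enumerate(matrix)]

def macro_f1(matrix):
    """Unweighted mean of per-class F1, so every class counts equally."""
    k = len(matrix)
    scores = []
    for c in range(k):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp
        fp = sum(matrix[r][c] for r in range(k)) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / k

# Illustrative labels only (hypothetical data, not notebook output)
y_true = [0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 1, 1, 2, 1]

cm = confusion_counts(y_true, y_pred, num_classes=3)
recalls = per_class_recall(cm)
score = macro_f1(cm)
print(cm, recalls, round(score, 4))
```

In this toy case accuracy is 0.75, yet recall for classes 0 and 2 is only 0.5, which is the kind of gap that macro-averaged metrics are meant to expose.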



## 1. Baseline Reproduction (Section 3.1 Benchmark)

In [None]:
!pip install transformers datasets scikit-learn pandas numpy tqdm tensorflow

import pandas as pd
import numpy as np
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, DataCollatorWithPadding
from datasets import load_dataset # Hugging Face
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight
from tqdm import tqdm
import random
import os
import matplotlib.pyplot as plt
from collections import Counter
import scipy.stats
import seaborn as sns
from collections import defaultdict  # used for the per-bin class counts in Section 3

import warnings
warnings.filterwarnings("ignore")

# Load the "all agree" subset
dataset = load_dataset("financial_phrasebank", "sentences_allagree") # All agree signifies 100% of annotators agreed on sentiment of this subset

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Tokenization
def tokenize_function(examples):
    return tokenizer(
        examples["sentence"],
        padding="max_length",
        truncation=True,
        max_length=256  # Based on EDA to cover 100% of samples
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True)

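# Train/validation/test split (80/10/10). Note: as written, the split is random, not stratified;
# pass stratify_by_column="label" to train_test_split to stratify by class.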
train_val_split = tokenized_datasets["train"].train_test_split(test_size=0.2, seed=42)
val_test_split = train_val_split["test"].train_test_split(test_size=0.5, seed=42)

# Convert to TensorFlow datasets
def to_tf_dataset(split, shuffle=False):
    return split.to_tf_dataset(
        columns=["input_ids", "attention_mask"],
        label_cols=["label"],
        shuffle=shuffle,
        batch_size=16,
        collate_fn=None
    )

tf_train_dataset = to_tf_dataset(train_val_split["train"], shuffle=True)
tf_validation_dataset = to_tf_dataset(val_test_split["train"], shuffle=True)
tf_test_dataset = to_tf_dataset(val_test_split["test"], shuffle=False)

# Load model
model = TFAutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=3
)

# Freeze encoder
model.distilbert.trainable = False  # frozen encoder

# Compile model
optimizer = tf.keras.optimizers.Adam()
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])

# Train model
history = model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=3
)

## 2. Static Class Weighting

This experiment applies inverse-frequency class weights during training to penalize errors on underrepresented classes. The class weights are computed based on the training split label distribution. This provides a simple and interpretable baseline for imbalance-aware training.
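
For reference, the "balanced" heuristic that scikit-learn documents for `compute_class_weight` assigns each class the weight `n_samples / (n_classes * count)`. The sketch below reproduces that arithmetic in plain Python on invented counts (the label list is hypothetical, not the Financial PhraseBank distribution) to show how rarer classes receive larger weights.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c]) for classes present in labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in sorted(counts.items())}

# Hypothetical label list: 2 of class 0, 6 of class 1, 4 of class 2
toy_labels = [0] * 2 + [1] * 6 + [2] * 4
toy_weights = balanced_class_weights(toy_labels)
print(toy_weights)
```

Under this scheme each class contributes the same total weight (`n_samples / n_classes`), so a class holding exactly its proportional share of the data gets a weight of 1.0.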

In [None]:


# Compute class weights from training set labels
train_labels = [example['label'] for example in train_val_split['train']]
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(train_labels), y=train_labels)
class_weight_dict = dict(enumerate(class_weights))

print("Class Weights:", class_weight_dict)

# Reload a fresh model (untrained classifier head) and compile it with the weighted loss
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
model.distilbert.trainable = False

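# reduction="none" keeps one loss value per example; with the default reduction the loss
# is already averaged to a scalar, so per-class weights would not be applied per example.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction="none")

# Weight vector indexed by class id
class_weight_tensor = tf.constant([class_weight_dict[i] for i in range(3)], dtype=tf.float32)

# Wrap with weighted loss
def weighted_loss(y_true, y_pred):
    # Flatten labels to shape (batch,) so the weights line up with the per-example losses
    y_true = tf.reshape(tf.cast(y_true, tf.int32), [-1])
    weights = tf.gather(class_weight_tensor, y_true)
    per_example_loss = loss_fn(y_true, y_pred)
    return tf.reduce_mean(per_example_loss * weights)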

model.compile(optimizer=tf.keras.optimizers.Adam(), loss=weighted_loss, metrics=["accuracy"])
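
To make the intended behaviour of a class-weighted loss concrete, here is a small numerical sketch in plain Python (no TensorFlow). The logits, labels, and weights are invented for illustration, and the helper names are hypothetical. The point it illustrates is that each example's own loss is multiplied by the weight of its true class before the batch is averaged.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy of one example computed from raw logits (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def weighted_mean_loss(batch_logits, labels, class_weights):
    """Mean over the batch of (per-example loss * weight of its true class)."""
    per_example = [softmax_cross_entropy(z, y) for z, y in zip(batch_logits, labels)]
    weights = [class_weights[y] for y in labels]
    return sum(l * w for l, w in zip(per_example, weights)) / len(labels)

# Illustrative batch: two confident correct predictions and one uncertain class-0 example
toy_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
toy_batch_labels = [0, 1, 0]
toy_class_weights = {0: 2.0, 1: 0.5, 2: 1.0}  # hypothetical weights

uniform = weighted_mean_loss(toy_logits, toy_batch_labels, {0: 1.0, 1: 1.0, 2: 1.0})
weighted = weighted_mean_loss(toy_logits, toy_batch_labels, toy_class_weights)
print(uniform, weighted)
```

Because the uncertain example belongs to the up-weighted class 0, the weighted loss comes out larger than the uniform one, which is what pushes training toward the minority class.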


## 3. Dynamic Class Weighting Based on Token Length

This experiment explores a more granular weighting scheme that accounts not only for class imbalance but also for how that imbalance varies across sentence lengths.

A quantile-based binning strategy is used to split the training data into length-based bins. Within each bin, class proportions are computed and used to generate local class weights. These weights are then dynamically applied during training based on the token length of each input.


In [None]:
# Example: Bin training set into short/mid/long based on quantiles
lengths = [len(tokenizer.tokenize(ex['sentence'])) for ex in train_val_split['train']]
q1, q2 = np.percentile(lengths, [33, 66])

def assign_length_bin(length):
    if length <= q1:
        return 'short'
    elif length <= q2:
        return 'medium'
    else:
        return 'long'

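# Attach bin labels. Iterating over a Hugging Face Dataset yields copies of each row,
# so in-place assignment would be lost; add the bins as a new column instead.
length_bins = [assign_length_bin(n) for n in lengths]
train_val_split['train'] = train_val_split['train'].add_column('length_bin', length_bins)

# Count class labels within each length bin
bin_class_counts = defaultdict(Counter)
for bin_name, label in zip(train_val_split['train']['length_bin'], train_val_split['train']['label']):
    bin_class_counts[bin_name][label] += 1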

# Convert to weights (inverse frequency)
dynamic_class_weights = {}
for bin_name, counts in bin_class_counts.items():
    total = sum(counts.values())
    weights = {cls: total / (len(counts) * count) for cls, count in counts.items()}
    dynamic_class_weights[bin_name] = weights

print(dynamic_class_weights)
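
The weights printed above are keyed first by length bin and then by class, so applying them "dynamically" amounts to a per-example lookup. One possible way to do this, sketched below in plain Python, is to turn each (token length, label) pair into a sample weight. The cut points, the weight table, and the `assign_bin` and `lookup_sample_weight` helpers are all hypothetical stand-ins rather than values or functions from this notebook.

```python
def assign_bin(length, q1, q2):
    """Map a token length to 'short', 'medium' or 'long' using two cut points."""
    if length <= q1:
        return "short"
    if length <= q2:
        return "medium"
    return "long"

def lookup_sample_weight(length, label, weights_by_bin, q1, q2, default=1.0):
    """Return the weight for this example's (length bin, class) pair.

    Falls back to `default` when a class never appeared in a bin, since no
    inverse-frequency weight can be computed from a zero count.
    """
    return weights_by_bin.get(assign_bin(length, q1, q2), {}).get(label, default)

# Hypothetical cut points and weight table (illustrative numbers only)
Q1, Q2 = 15, 25
toy_weights_by_bin = {
    "short": {0: 4.0, 1: 0.5, 2: 1.5},
    "medium": {0: 2.0, 1: 0.6, 2: 1.2},
    "long": {1: 0.8, 2: 1.1},  # class 0 absent from this bin
}

examples = [(10, 0), (20, 1), (40, 0), (15, 2)]  # (token length, label)
sample_weights = [
    lookup_sample_weight(n, y, toy_weights_by_bin, Q1, Q2) for n, y in examples
]
print(sample_weights)
```

In Keras, per-example weights like these can be supplied to `fit` through the `sample_weight` argument or as a third element of each dataset batch; whether that is the best route for this model is left open here.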
