<a href="https://colab.research.google.com/github/nycoder103/financial-sentiment-analyzer-r-and-d/blob/main/notebooks/01_Benchmark_Framework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Experiment 1: The "Reality Check" Benchmark
**Hypothesis:** A model pre-trained on financial news will perform equally well on social media financial discourse.

This notebook loads multiple industry-standard NLP models and tests them against two distinct datasets:
1. **Financial Phrasebank (News):** Formal, structured financial language.
2. **Twitter Financial News (Social):** Informal, sarcastic, and noisy financial discourse.


In [1]:
# 1. SETUP & INSTALLATION
# We install datasets<3.0.0 to support the loading scripts required for Financial Phrasebank.
# IMPORTANT: After running this cell, you MUST restart the Colab Runtime (Runtime > Restart Session).
# Without a restart, Python keeps using the newer, preinstalled version of datasets that is already loaded,
# which blocks loading scripts and makes the financial_phrasebank download fail with a RuntimeError.
!pip install transformers pandas plotly scipy torch "datasets<3.0.0" huggingface_hub

Collecting datasets<3.0.0
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting fsspec>=0.8.5 (from torch)
  Downloading fsspec-2024.6.1-py3-none-any.whl.metadata (11 kB)
Downloading datasets-2.21.0-py3-none-any.whl (527 kB)
Downloading fsspec-2024.6.1-py3-none-any.whl (177 kB)
Installing collected packages: fsspec, datasets
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.3.0
    Uninstalling fsspec-2025.3.0:
      Successfully uninstalled fsspec-2025.3.0
  Attempting uninstall: datasets
    Found existing installation: datasets 4.0.0
    Uninstalling datasets-4.0.0:
      Successfully uninstalled datasets-4.0.0
ERROR: pip's dependency resolver does not currently take into account all 

In [2]:
# 2. IMPORTS & AUTHENTICATION
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import plotly.graph_objects as go
from datasets import load_dataset
import datasets
from google.colab import userdata
from huggingface_hub import login

# Authenticate with Hugging Face (Required for gated datasets like Financial Phrasebank)
try:
    hf_token = userdata.get('HF_TOKEN')
    login(hf_token)
    print("Logged in to Hugging Face successfully.")
except Exception as e:
    print(f"Error logging in: {e}. Please ensure 'HF_TOKEN' is set in Colab Secrets.")

Logged in to Hugging Face successfully.
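
After the restart, it can be worth confirming that the session is really running a `datasets` release that matches the `datasets<3.0.0` pin from the setup cell. The sketch below is one minimal way to do that with plain string handling; `satisfies_pin` and `major_version` are hypothetical helper names, and the two version strings are the ones that appear in the pip log above.

```python
def major_version(version_string):
    """Return the leading integer of a version string such as '2.21.0'."""
    return int(version_string.split(".")[0])


def satisfies_pin(version_string, max_major_exclusive=3):
    """True if the version falls under the notebook's `datasets<3.0.0` pin."""
    return major_version(version_string) < max_major_exclusive


# The two versions seen in the pip output: the downgraded one and the preinstalled one.
for candidate in ("2.21.0", "4.0.0"):
    verdict = "matches the pin" if satisfies_pin(candidate) else "too new, restart or reinstall"
    print(f"datasets {candidate}: {verdict}")
```

In the notebook itself, calling `satisfies_pin(datasets.__version__)` would check the module that is actually imported in the running session, which is the case the restart warning is about.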


In [3]:
# 3. CONFIGURATION: THE MODEL ZOO
# We select 5 representative models to test the "One Model Fits All" hypothesis.
MODELS = {
    "FinBERT (ProsusAI)": "ProsusAI/finbert",
    "FinBERT-Tone (Yiyang)": "yiyanghkust/finbert-tone",
    "Roberta-Social (CardiffNLP)": "cardiffnlp/twitter-roberta-base-sentiment-latest",
    "DistilBERT-SST2 (Generalist)": "distilbert-base-uncased-finetuned-sst-2-english",
    "Roberta-Large (Siebert)": "siebert/sentiment-roberta-large-english"
}

print(f"Loaded configuration for {len(MODELS)} models.")

Loaded configuration for 5 models.


In [4]:
# 4. DATA INGESTION & ALIGNMENT
# We load the two datasets and standardize their labels to: "positive", "negative", "neutral"

print("Loading datasets...")

# --- Dataset A: Formal News (Financial Phrasebank) ---
try:
    # trust_remote_code=True is needed for this specific dataset script
    ds_news = load_dataset("financial_phrasebank", "sentences_allagree", split="train", trust_remote_code=True)
except RuntimeError as e:
    if "Dataset scripts are no longer supported" in str(e):
        raise RuntimeError("Please restart your runtime (Runtime > Restart Session) to use the installed datasets<3.0.0 library.") from e
    raise

df_news = pd.DataFrame(ds_news).sample(50, random_state=42) # Sample 50 for speed
df_news = df_news.rename(columns={"sentence": "text"})
df_news['source'] = 'News (Formal)'
# Map FPB labels: 0=Negative, 1=Neutral, 2=Positive
df_news['label_text'] = df_news['label'].map({0: "negative", 1: "neutral", 2: "positive"})

# --- Dataset B: Social Media (Twitter Financial News) ---
ds_social = load_dataset("zeroshot/twitter-financial-news-sentiment", split="validation")
df_social = pd.DataFrame(ds_social).sample(50, random_state=42) # Sample 50 for speed
df_social['source'] = 'Social (Informal)'
# Map Twitter labels: 0=Bearish, 1=Bullish, 2=Neutral
df_social['label_text'] = df_social['label'].map({0: "negative", 1: "positive", 2: "neutral"})

# Combine for Benchmarking
df = pd.concat([
    df_news[['text', 'source', 'label_text']],
    df_social[['text', 'source', 'label_text']]
])
df.rename(columns={'label_text': 'label'}, inplace=True)

print(f"Data Loaded: {len(df)} samples total.")
display(df.head())

Loading datasets...



Data Loaded: 100 samples total.


| | text | source | label |
|---|---|---|---|
| 1755 | The contract value amounts to EUR 2.4 million . | News (Formal) | neutral |
| 1281 | Kemira shares closed at ( x20ac ) 16.66 ( $ 2... | News (Formal) | neutral |
| 350 | The company slipped to an operating loss of EU... | News (Formal) | negative |
| 420 | According to Atria 's President and CEO Matti ... | News (Formal) | positive |
| 56 | In 2009 , Fiskars ' cash flow from operating a... | News (Formal) | positive |
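
A random sample of 50 rows per source will not necessarily preserve the class balance of the full dataset, and a skewed sample can make the two sources harder to compare. The following is a minimal sketch for inspecting label shares using only the standard library; `label_shares` is a hypothetical helper, and the label list is invented for illustration rather than taken from either dataset.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the sample as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Invented labels for illustration only (not drawn from the notebook's data).
example_labels = ["neutral"] * 6 + ["positive"] * 3 + ["negative"] * 1
shares = label_shares(example_labels)
for label, share in sorted(shares.items()):
    print(f"{label}: {share:.0%}")
```

Because `Counter` accepts any iterable, passing a column such as `df_news['label_text']` should work the same way; pandas' own `value_counts(normalize=True)` is an alternative.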


In [5]:
# 5. BENCHMARKING ENGINE
# A reusable function to normalize output labels from different model architectures.
from functools import lru_cache

@lru_cache(maxsize=None)
def load_model(model_name):
    # Load each tokenizer/model pair once and reuse it for every sample,
    # instead of re-instantiating the model on every call.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()
    return tokenizer, model

def get_sentiment(text, model_name):
    tokenizer, model = load_model(model_name)

    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    prediction_idx = torch.argmax(logits, dim=-1).item()

    # --- Normalization Logic ---
    # Index order follows each model's config; inspect model.config.id2label if results look off.
    name = model_name.lower()

    # 1. FinBERT variants (3-class, but with different index orders)
    if "finbert" in name:
        # Yiyang: 0=neutral, 1=positive, 2=negative
        if "yiyang" in name:
            return ["neutral", "positive", "negative"][prediction_idx]
        # ProsusAI: 0=positive, 1=negative, 2=neutral
        return ["positive", "negative", "neutral"][prediction_idx]

    # 2. Twitter-Roberta -> [Negative, Neutral, Positive]
    elif "twitter-roberta" in name:
        return ["negative", "neutral", "positive"][prediction_idx]

    # 3. DistilBERT-SST2 -> [Negative, Positive] (binary; can never predict "neutral")
    elif "distilbert" in name:
        return ["negative", "positive"][prediction_idx]

    # 4. Siebert -> [Negative, Positive] (binary; can never predict "neutral")
    elif "siebert" in name:
        return ["negative", "positive"][prediction_idx]

    return "unknown"

# Run the Loop
results = []
print("Starting Benchmark run...")

for index, row in df.iterrows():
    for friendly_name, model_path in MODELS.items():
        try:
            pred = get_sentiment(row['text'], model_path)
            results.append({
                "Model": friendly_name,
                "Source": row['source'],
                "Correct": (pred == row['label'])
            })
        except Exception as e:
            print(f"Error with {friendly_name}: {e}")

results_df = pd.DataFrame(results)
print("Benchmark Complete.")

Starting Benchmark run...



Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Benchmark Complete.
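
Two of the five models (DistilBERT-SST2 and Siebert) only output negative or positive, so the loop above counts them as wrong on every neutral sample. One way to separate that structural penalty from genuine misclassifications is to also report accuracy restricted to the labels a model is able to emit. The sketch below assumes that approach; `accuracy_within_label_space` is a hypothetical helper and the example predictions are invented, not benchmark results.

```python
def accuracy_within_label_space(predictions, gold_labels, label_space):
    """Accuracy over only those samples whose gold label the model can emit.

    Returns None when no sample falls inside the model's label space.
    """
    pairs = [(p, g) for p, g in zip(predictions, gold_labels) if g in label_space]
    if not pairs:
        return None
    return sum(p == g for p, g in pairs) / len(pairs)


# Invented example: a binary model scored on four samples, two of them neutral.
gold = ["positive", "neutral", "negative", "neutral"]
pred = ["positive", "positive", "negative", "negative"]

overall = accuracy_within_label_space(pred, gold, {"positive", "negative", "neutral"})
binary_only = accuracy_within_label_space(pred, gold, {"positive", "negative"})
print(f"all labels: {overall:.2f}, binary labels only: {binary_only:.2f}")
```

Reporting both numbers side by side keeps the headline comparison intact while showing how much of a binary model's gap comes from the missing neutral class.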


In [6]:
# 6. VISUALIZATION: THE ACCURACY GAP
# This chart compares each model's accuracy on the two sources; a large gap between them would contradict the hypothesis.

accuracy_df = results_df.groupby(['Model', 'Source'])['Correct'].mean().reset_index()
accuracy_df['Correct'] = accuracy_df['Correct'] * 100

fig = go.Figure()

for model in accuracy_df['Model'].unique():
    subset = accuracy_df[accuracy_df['Model'] == model]
    fig.add_trace(go.Bar(
        x=subset['Source'],
        y=subset['Correct'],
        name=model,
        text=subset['Correct'].apply(lambda x: f'{x:.1f}%'),
        textposition='auto'
    ))

fig.update_layout(
    title="The Reality Check: Model Accuracy vs. Data Source",
    yaxis_title="Accuracy (%)",
    barmode='group',
    template="plotly_white"
)
fig.show()
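
With only 50 samples per source, each accuracy bar carries substantial sampling uncertainty, so small gaps between bars may not be meaningful. A Wilson score interval is one common way to quantify this. The sketch below uses only the standard library; `wilson_interval` is a hypothetical helper, and the count of 25 correct out of 50 is a made-up input, not a result from this notebook.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Hypothetical input: 25 correct predictions out of 50 samples.
low, high = wilson_interval(correct=25, total=50)
print(f"accuracy 50.0%, interval {low:.1%} to {high:.1%}")
```

Applied per model and source, the bounds could be passed to the bar traces as error bars, or the sample size could be raised until the intervals are narrow enough for the comparison at hand.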