<a href="https://colab.research.google.com/github/nycoder103/financial-sentiment-analyzer-r-and-d/blob/main/notebooks/02_Agent_Quorum_POC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# %% [markdown]
# # Project ARD: Financial Sentiment Analysis
# ## Part 1: Setup and Configuration
# Install necessary libraries and import standard packages.

# %% [code]
# Install datasets<3.0.0 to allow loading scripts (required for financial_phrasebank)
# IMPORTANT: You must restart the runtime (Runtime > Restart Session) after running this cell for the first time!
!pip install transformers pandas plotly scipy torch "datasets<3.0.0" huggingface_hub

import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax
import plotly.graph_objects as go
from datasets import load_dataset
import datasets
from google.colab import userdata
from huggingface_hub import login

# Authenticate with Hugging Face using Colab Secrets
# Ensure you have a secret named 'HF_TOKEN' in the key icon on the left sidebar
try:
    hf_token = userdata.get('HF_TOKEN')
    login(hf_token)
    print("Logged in to Hugging Face successfully.")
except Exception as e:
    print(f"Error logging in: {e}. Please ensure 'HF_TOKEN' is set in Colab Secrets.")

# Configuration: Define the Models we want to test
MODELS = {
    "FinBERT (The Banker)": "ProsusAI/finbert",
    "Roberta (The Socialite)": "cardiffnlp/twitter-roberta-base-sentiment-latest",
    "DistilBERT (The Generalist)": "distilbert-base-uncased-finetuned-sst-2-english"
}

# Configuration: Load Real Benchmarking Datasets
# We use 'financial_phrasebank' for formal news and 'twitter-financial-news-sentiment' for social.
print("Loading datasets...")

# 1. Formal Data: Financial PhraseBank (sentences_allagree = 100% annotator agreement)
# NOTE: trust_remote_code=True is required for this dataset as it uses a loading script
try:
    ds_news = load_dataset("financial_phrasebank", "sentences_allagree", split="train", trust_remote_code=True)
except RuntimeError as e:
    # Catch the specific error regarding dataset scripts in newer library versions
    if "Dataset scripts are no longer supported" in str(e):
        raise RuntimeError(f"Current datasets version ({datasets.__version__}) does not support scripts. Please restart your runtime (Runtime > Restart Session) to use the downgraded version installed above.") from e
    raise  # re-raise any other RuntimeError unchanged

df_news = pd.DataFrame(ds_news).sample(50, random_state=42) # Sample 50 rows for speed
# RENAME 'sentence' to 'text' to match the other dataset structure
df_news = df_news.rename(columns={"sentence": "text"})
df_news['source'] = 'News (Formal)'
# Map FPB labels: 0=Negative, 1=Neutral, 2=Positive
map_fpb = {0: "negative", 1: "neutral", 2: "positive"}
df_news['label_text'] = df_news['label'].map(map_fpb)

# 2. Social Data: Twitter Financial News
ds_social = load_dataset("zeroshot/twitter-financial-news-sentiment", split="validation")
df_social = pd.DataFrame(ds_social).sample(50, random_state=42) # Sample 50 rows for speed
df_social['source'] = 'Social (Informal)'
# Map Twitter labels: 0=Bearish, 1=Bullish, 2=Neutral
map_twitter = {0: "negative", 1: "positive", 2: "neutral"}
df_social['label_text'] = df_social['label'].map(map_twitter)

# Combine into one test dataframe
df = pd.concat([
    df_news[['text', 'source', 'label_text']],
    df_social[['text', 'source', 'label_text']]
])
df.rename(columns={'label_text': 'label'}, inplace=True)

print(f"Loaded {len(df)} total samples for benchmarking.")
df.head()

# %% [markdown]
# ## Part 2: The "Reality Check" (Benchmarking)
# We run all three models, plus the quorum, on all data to visualize the "Accuracy Gap" described in the README.

# %% [code]
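from functools import lru_cache

@lru_cache(maxsize=None)
def load_model(model_name):
    """Load each tokenizer/model pair once and reuse it on later calls."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    return tokenizer, model

def get_sentiment(text, model_name):
    """
    Runs a specific model on a piece of text and returns the label.
    Includes logic to normalize different model output formats.
    """
    tokenizer, model = load_model(model_name)

    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits

    # Get the predicted index (one input per call, so argmax over the class dimension)
    prediction_idx = torch.argmax(logits, dim=-1).item()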

    # MAPPING LOGIC: Standardize outputs to [positive, negative, neutral]

    # Logic for FinBERT (ProsusAI)
    # Config: 0=positive, 1=negative, 2=neutral
    if "finbert" in model_name.lower():
        labels = ["positive", "negative", "neutral"]
        return labels[prediction_idx]

    # Logic for Twitter-Roberta (CardiffNLP)
    # Config: 0=negative, 1=neutral, 2=positive
    elif "twitter-roberta" in model_name.lower():
        labels = ["negative", "neutral", "positive"]
        return labels[prediction_idx]

    # Logic for DistilBERT (The Generalist)
    # Config: 0=negative, 1=positive (No neutral class)
    elif "distilbert" in model_name.lower():
        labels = ["negative", "positive"]
        return labels[prediction_idx]

    else:
        # Fallback: no label mapping is defined for this model
        return "unknown"

from collections import Counter

def get_majority_vote(votes):
    """Simple majority vote; ties go to the label that was cast first."""
    return Counter(votes).most_common(1)[0][0]

# Run the Benchmark
results = []
print("Running Benchmark (this may take a minute)...")

for index, row in df.iterrows():
    row_votes = []

    # 1. Run Individual Agents
    for friendly_name, model_path in MODELS.items():
        prediction = get_sentiment(row['text'], model_path)
        row_votes.append(prediction) # Collect vote

        is_correct = (prediction == row['label'])
        results.append({
            "Text": row['text'],
            "Source": row['source'],
            "Model": friendly_name,
            "Prediction": prediction,
            "Actual": row['label'],
            "Correct": is_correct
        })

    # 2. Run Quorum (Meta-Agent) using the collected votes
    quorum_prediction = get_majority_vote(row_votes)
    quorum_correct = (quorum_prediction == row['label'])

    results.append({
        "Text": row['text'],
        "Source": row['source'],
        "Model": "Agent Quorum (Consensus)", # This will appear as a 4th bar in the chart
        "Prediction": quorum_prediction,
        "Actual": row['label'],
        "Correct": quorum_correct
    })

results_df = pd.DataFrame(results)
print("Benchmark Complete.")
results_df.head()

# %% [markdown]
# ## Part 3: Visualization
# Show the accuracy gap between formal and social data.

# %% [code]
# Calculate accuracy per model per source
accuracy_df = results_df.groupby(['Model', 'Source'])['Correct'].mean().reset_index()
accuracy_df['Correct'] = accuracy_df['Correct'] * 100  # Convert to percentage

# Plot
fig = go.Figure()

for model in accuracy_df['Model'].unique():
    subset = accuracy_df[accuracy_df['Model'] == model]
    fig.add_trace(go.Bar(
        x=subset['Source'],
        y=subset['Correct'],
        name=model,
        text=subset['Correct'].apply(lambda x: f'{x:.1f}%'),
        textposition='auto'
    ))

fig.update_layout(
    title="The Reality Check: Model Accuracy by Data Source (Including Quorum)",
    yaxis_title="Accuracy (%)",
    barmode='group'
)
fig.show()

# %% [markdown]
# ## Part 4: Testing the Quorum Live
# We feed it a tricky example to see the agents debate.

# %% [code]
class AgentQuorum:
    def __init__(self):
        self.agents = MODELS  # Maps agent names to model IDs; get_sentiment loads the models

    def analyze(self, text):
        votes = []
        print(f"--- Analyzing: '{text}' ---")

        for agent_name, model_path in self.agents.items():
            # Get the opinion
            opinion = get_sentiment(text, model_path)
            print(f"  > {agent_name} says: {opinion}")
            votes.append(opinion)

        # Simple Quorum Logic: Majority Vote
        final_verdict = get_majority_vote(votes)

        # Conflict Detection
        if len(set(votes)) > 1:
            print(f"  [!] Conflict Detected. Quorum resolves to: {final_verdict}")
        else:
            print(f"  [+] Unanimous Decision: {final_verdict}")

        return final_verdict

# Initialize the System
quorum = AgentQuorum()

# Test Case: A slang-heavy tweet of the kind a formal-news model such as FinBERT may misread
tricky_text = "My portfolio is bleeding out but I'm still holding because I have diamond hands 💎"
final_decision = quorum.analyze(tricky_text)

Logged in to Hugging Face successfully.
Loading datasets...
Loaded 100 total samples for benchmarking.
Running Benchmark (this may take a minute)...


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Benchmark Complete.


--- Analyzing: 'My portfolio is bleeding out but I'm still holding because I have diamond hands 💎' ---
  > FinBERT (The Banker) says: neutral


  > Roberta (The Socialite) says: positive
  > DistilBERT (The Generalist) says: positive
  [!] Conflict Detected. Quorum resolves to: positive
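
The 2-to-1 split above has a clear winner, but three agents choosing among three labels can also split 1-1-1, in which case a majority vote has no real majority. The following is a minimal sketch, not part of the notebook, of a tally that reports the winning share and flags ties. The `tally_votes` name and the `"no-consensus"` fallback label are assumptions made for illustration.

```python
from collections import Counter


def tally_votes(votes):
    """Return (verdict, share, tied) for a list of label strings.

    verdict is the most common label, or the hypothetical placeholder
    "no-consensus" when two or more labels share the top count.
    """
    if not votes:
        raise ValueError("votes must not be empty")

    # most_common() sorts labels by count, highest first
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]

    # A tie exists when the runner-up has the same count as the leader
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    verdict = "no-consensus" if tied else top_label
    return verdict, top_count / len(votes), tied


# The split from the live test: one neutral vote, two positive votes
print(tally_votes(["neutral", "positive", "positive"]))

# A three-way split, where no label has a majority
print(tally_votes(["negative", "neutral", "positive"]))
```

Reporting the share alongside the verdict makes it possible to treat a 2-of-3 result differently from a unanimous one.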


# Summary:
### The Quorum is meant to act as a safety net. A specialist's strengths (like FinBERT on news) carry through whenever at least one other agent agrees, while a single outlier model is outvoted, so our dashboard does not rely on a single, potentially flawed perspective.
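
Whether the quorum actually behaves as a safety net can be checked by comparing its accuracy with the best single agent on each source. Below is a minimal sketch, assuming records shaped like the `results` list built in Part 2 (keys `Source`, `Model`, `Correct`). The helper names are hypothetical and the demo rows are made-up placeholders, not benchmark results.

```python
from collections import defaultdict

QUORUM_NAME = "Agent Quorum (Consensus)"  # label used for the quorum rows in Part 2


def accuracy_by_model(records):
    """Map (source, model) -> accuracy in [0, 1]."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        key = (rec["Source"], rec["Model"])
        totals[key] += 1
        hits[key] += bool(rec["Correct"])
    return {key: hits[key] / totals[key] for key in totals}


def quorum_gap(records, quorum_name=QUORUM_NAME):
    """Per source: quorum accuracy minus the best single-model accuracy."""
    acc = accuracy_by_model(records)
    gaps = {}
    for source in {src for src, _ in acc}:
        singles = [
            a for (src, model), a in acc.items()
            if src == source and model != quorum_name
        ]
        if singles and (source, quorum_name) in acc:
            gaps[source] = acc[(source, quorum_name)] - max(singles)
    return gaps


# Placeholder rows for illustration only (hypothetical models "A" and "B")
demo = [
    {"Source": "News (Formal)", "Model": "A", "Correct": True},
    {"Source": "News (Formal)", "Model": "A", "Correct": False},
    {"Source": "News (Formal)", "Model": "B", "Correct": True},
    {"Source": "News (Formal)", "Model": "B", "Correct": True},
    {"Source": "News (Formal)", "Model": QUORUM_NAME, "Correct": True},
    {"Source": "News (Formal)", "Model": QUORUM_NAME, "Correct": False},
]
print(quorum_gap(demo))
```

A negative gap means the best single agent beat the quorum on that source; a gap at or above zero is what the safety-net claim predicts.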