# FinBERT Sentiment Pipeline for Numerai Signals

## Context
Numerai Signals requires per-stock predictions scored on their **orthogonality** to existing factors. This notebook builds the simplest possible text→feature pipeline: financial sentiment analysis using FinBERT.

## Pipeline
Text headlines → FinBERT sentiment scores → Per-stock aggregation → Numerai Signals format

## Why Start Here
- Establishes the end-to-end pipeline architecture
- All other approaches (embeddings, graphs, probing) plug into this same pipeline
- FinBERT (ProsusAI) achieves ~89% F1 on financial sentiment, far better than VADER (~70%)

## Limitations (important for interview)
- Sentiment alone is probably NOT orthogonal enough — Numerai likely already has sentiment features
- Real value comes from combining with other approaches (notebooks 02-08)

In [1]:
import torch
from transformers import AutoModelForSequenceClassification, pipeline, AutoTokenizer
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## A: What will I have to do?

Simple sentiment analysis with FinBERT. Grab FinBERT from Hugging Face and load it as an `AutoModelForSequenceClassification` (the `pipeline` helper does this under the hood). No training docs are needed for this baseline: the model is already fine-tuned and classifies each input text as a whole, so here each headline gets one label.

So, the pipeline is:
- collect text (headlines here; longer documents would need summarizing or chunking first)
- sentiment scores from text
- associate each text with a stock, aggregate sentiment for stock
- convert to Numerai Signals format

## 1. Load FinBERT
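ProsusAI/finbert is a BERT model further pre-trained on a financial corpus and fine-tuned for sentiment classification on Financial PhraseBank.
The model outputs positive, negative, and neutral probabilities per text input; by default the `pipeline` call returns only the top label and its score.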

In [2]:
# Load the FinBERT tokenizer and wrap the model in a sentiment-analysis pipeline
model = "ProsusAI/finbert"
tokenizer = AutoTokenizer.from_pretrained(model)
pipe = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)



BertForSequenceClassification LOAD REPORT from: ProsusAI/finbert
Key                          | Status     |  | 
-----------------------------+------------+--+-
bert.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


In [5]:
pipe([
    "APPL acquires OpenAI, exceeds revenue expectations",
    "NVDA drops 10% after disappointing earnings",
])

[{'label': 'positive', 'score': 0.8593773245811462},
 {'label': 'negative', 'score': 0.9692803621292114}]

## 2. Sample Financial Headlines
In production, these would come from news APIs (Tiingo, Polygon.io, Yahoo Finance) or Common Crawl.
For this demo, we use realistic synthetic headlines mapped to tickers.

In [26]:
# Synthetic financial headlines with tickers and dates
# In production: replace with news API (Tiingo, Polygon.io, Yahoo Finance, or Common Crawl)
import json
from pathlib import Path

with Path("../headlines_data.json").open("r") as f:
    headlines_data = json.load(f)


df = pd.DataFrame(headlines_data)
df.head()

Unnamed: 0,ticker,date,headline
0,AAPL,2024-01-15,Apple reports record Q1 revenue driven by iPho...
1,AAPL,2024-01-16,Apple's Vision Pro pre-orders exceed analyst e...
2,AAPL,2024-01-17,Apple faces antitrust scrutiny from EU regulat...
3,AAPL,2024-01-18,Apple cuts prices in China amid fierce competi...
4,AAPL,2024-01-19,Apple announces expanded share buyback program


## 3. Score Headlines with FinBERT

Score every headline with FinBERT, aggregate the scores by stock ticker, then convert the aggregated scores into a submission format.

In [28]:
# Score all headlines with FinBERT

scores = pipe(df["headline"].tolist())

# extract sentiment scores
df["sentiment_label"] = [r["label"] for r in scores]
df["sentiment_score"] = [r["score"] for r in scores]
# Map labels to signs so the score is signed: +confidence, 0 (neutral), or -confidence
mapping = {"positive": 1, "neutral": 0, "negative": -1}

df["sentiment_label_numeric"] = df["sentiment_label"].map(mapping)
df["sentiment_numeric"] = df["sentiment_label_numeric"] * df["sentiment_score"]


print(df.head())

  ticker        date                                           headline  \
0   AAPL  2024-01-15  Apple reports record Q1 revenue driven by iPho...   
1   AAPL  2024-01-16  Apple's Vision Pro pre-orders exceed analyst e...   
2   AAPL  2024-01-17  Apple faces antitrust scrutiny from EU regulat...   
3   AAPL  2024-01-18  Apple cuts prices in China amid fierce competi...   
4   AAPL  2024-01-19     Apple announces expanded share buyback program   

  sentiment_label  sentiment_score  sentiment_label_numeric  sentiment_numeric  
0        positive         0.931807                        1           0.931807  
1        positive         0.927904                        1           0.927904  
2        negative         0.949503                       -1          -0.949503  
3        negative         0.957857                       -1          -0.957857  
4         neutral         0.713084                        0           0.000000  
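
The label-times-confidence score above sets every neutral headline to exactly 0 and ignores the two non-winning class probabilities. One alternative, sketched here, is P(positive) - P(negative) over the full distribution. It assumes the pipeline is called with `top_k=None` so that each headline returns a label/score dict for all three classes; `signed_sentiment` is a hypothetical helper, and the probabilities below are made up for illustration.

```python
def signed_sentiment(class_scores):
    """Collapse one headline's class probabilities into P(positive) - P(negative).

    `class_scores` is a list of {"label": ..., "score": ...} dicts covering
    all three classes (assumed shape of `pipe(text, top_k=None)` output).
    """
    probs = {item["label"]: item["score"] for item in class_scores}
    return probs.get("positive", 0.0) - probs.get("negative", 0.0)


# Made-up probabilities for two headlines, for illustration only.
example_outputs = [
    [{"label": "positive", "score": 0.70},
     {"label": "neutral", "score": 0.25},
     {"label": "negative", "score": 0.05}],
    [{"label": "neutral", "score": 0.55},
     {"label": "negative", "score": 0.40},
     {"label": "positive", "score": 0.05}],
]

signed = [signed_sentiment(scores) for scores in example_outputs]
```

In the second example the top label is neutral, so the label-based score would be 0, while the signed score is about -0.35 and keeps the negative lean.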


## 4. Aggregate Per-Stock Features

For Numerai Signals, we need a single prediction per stock. We aggregate headline-level sentiment into stock-level features using multiple statistics.

In [29]:
# Aggregate sentiment features per ticker

agg_features = (
    df
    .groupby("ticker")
    .agg(
        n_headlines=("headline", "count"),
        mean_sentiment=("sentiment_numeric", "mean"),
        std_sentiment=("sentiment_numeric", "std"),
        min_sentiment=("sentiment_numeric", "min"),
        max_sentiment=("sentiment_numeric", "max"),
        pct_positive=("sentiment_label_numeric", lambda x: (x == 1).mean()),
        pct_negative=("sentiment_label_numeric", lambda x: (x == -1).mean()),
        sentiment_range=(
            "sentiment_label_numeric",
            lambda x: x.max() - x.min(),
        ),
    )
    .reset_index()
)

In [30]:
agg_features

Unnamed: 0,ticker,n_headlines,mean_sentiment,std_sentiment,min_sentiment,max_sentiment,pct_positive,pct_negative,sentiment_range
0,AAPL,5,-0.00953,0.941789,-0.957857,0.931807,0.4,0.4,2
1,AMZN,4,-0.236972,0.911346,-0.958162,0.953333,0.25,0.5,2
2,GOOGL,4,-0.03799,0.714928,-0.949934,0.797975,0.25,0.25,2
3,JPM,4,-0.242033,0.91356,-0.959657,0.948946,0.25,0.5,2
4,META,4,-0.243983,0.897991,-0.956758,0.923607,0.25,0.5,2
5,MSFT,4,0.162655,0.833288,-0.937173,0.952163,0.5,0.25,2
6,NVDA,4,0.246268,0.77125,-0.733701,0.935749,0.5,0.25,2
7,PFE,4,0.37535,0.906086,-0.973435,0.933383,0.75,0.25,2
8,TSLA,5,-0.757631,0.423847,-0.966551,0.0,0.0,0.8,1
9,XOM,4,-0.215015,0.799013,-0.954127,0.816907,0.25,0.5,2


## 5. Format as Numerai Signals Submission

Numerai Signals expects: ticker (from their universe of ~5,000 stocks) + prediction (0 to 1).
We rank-normalize the mean sentiment to [0, 1].

In [31]:
# Create Numerai Signals-style submission.
# Rank-normalize mean sentiment across tickers; (rank - 0.5) / n keeps every
# value strictly between 0 and 1.
ranks = agg_features["mean_sentiment"].rank(method="average")
signal = (ranks - 0.5) / len(agg_features)


In [32]:
submission = agg_features[["ticker"]].assign(signal=signal)
submission

Unnamed: 0,ticker,signal
0,AAPL,0.65
1,AMZN,0.35
2,GOOGL,0.55
3,JPM,0.25
4,META,0.15
5,MSFT,0.75
6,NVDA,0.85
7,PFE,0.95
8,TSLA,0.05
9,XOM,0.45


## 6. Temporal Rolling Features

In production, sentiment should be aggregated over rolling windows to capture momentum and mean-reversion in sentiment.

In [33]:
# Rolling sentiment aggregation per ticker over a 2-headline window (row-based, not time-based)

df.groupby("ticker").rolling(2).agg(
    n_headlines=("headline", "count"),
    mean_sentiment=("sentiment_numeric", "mean"),
    std_sentiment=("sentiment_numeric", "std"),
    min_sentiment=("sentiment_numeric", "min"),
    max_sentiment=("sentiment_numeric", "max"),
    pct_positive=("sentiment_label_numeric", lambda x: (x == 1).mean()),
    pct_negative=("sentiment_label_numeric", lambda x: (x == -1).mean()),
    sentiment_range=(
        "sentiment_label_numeric",
        lambda x: x.max() - x.min(),
    ),
).dropna().reset_index()

Unnamed: 0,ticker,level_1,n_headlines,mean_sentiment,std_sentiment,min_sentiment,max_sentiment,pct_positive,pct_negative,sentiment_range
0,AAPL,1,2.0,0.929855,0.00276,0.927904,0.931807,1.0,0.0,0.0
1,AAPL,2,2.0,-0.0108,1.327527,-0.949503,0.927904,0.5,0.5,2.0
2,AAPL,3,2.0,-0.95368,0.005907,-0.957857,-0.949503,0.0,1.0,0.0
3,AAPL,4,2.0,-0.478929,0.677307,-0.957857,0.0,0.0,0.5,1.0
4,AMZN,23,2.0,-0.002414,1.351632,-0.958162,0.953333,0.5,0.5,2.0
5,AMZN,24,2.0,-0.479081,0.677523,-0.958162,0.0,0.0,0.5,1.0
6,AMZN,25,2.0,-0.471529,0.666843,-0.943058,0.0,0.0,0.5,1.0
7,GOOGL,31,2.0,-0.07598,1.235958,-0.949934,0.797975,0.5,0.5,2.0
8,GOOGL,32,2.0,-0.474967,0.671705,-0.949934,0.0,0.0,0.5,1.0
9,GOOGL,33,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
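
A row-count window rolls over headlines in whatever order they sit in `df`, regardless of how far apart they were published. A time-based window makes the lookback explicit. This is a minimal sketch, assuming `date` has been parsed to datetimes; the 3-day window and the toy values are arbitrary illustration choices, not settings from this notebook.

```python
import pandas as pd

# Toy frame standing in for the scored headlines (values are made up).
toy = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "AAPL", "MSFT", "MSFT"],
    "date": ["2024-01-15", "2024-01-16", "2024-01-19", "2024-01-15", "2024-01-16"],
    "sentiment_numeric": [0.9, -0.9, 0.3, 0.5, 0.1],
})
toy["date"] = pd.to_datetime(toy["date"])

# Sort by date so rows are in chronological order, then take a trailing
# 3-calendar-day mean per ticker (the window includes the current row).
rolling_mean = (
    toy.sort_values("date", kind="stable")
    .set_index("date")
    .groupby("ticker")["sentiment_numeric"]
    .rolling("3D")
    .mean()
)
```

The result is indexed by `(ticker, date)`. The AAPL headline on 2024-01-19 has no other AAPL headline within the previous three days, so its rolling mean is just its own score.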


## Discussion & Interview Talking Points

### Strengths
- **Simple, interpretable pipeline**: Easy to debug, explain, and iterate
- **Fast to deploy**: FinBERT inference is fast (~100 headlines/sec on GPU)
- **Well-validated**: FinBERT achieves 89% F1 on Financial PhraseBank

### Weaknesses & Why This Alone Won't Win
- **Probably not orthogonal**: Numerai likely already has sentiment features. After neutralization, most of this signal will be removed.
- **Coarse granularity**: Positive/negative/neutral misses nuance (e.g., "beat estimates by 1%" vs "beat estimates by 50%")
- **Headline-only**: Headlines are noisy and may not capture the full story
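
To make the orthogonality concern concrete, here is a minimal sketch of what neutralization does to a signal that mostly duplicates a known factor. It uses plain least-squares residualization, which is an assumption for illustration and not Numerai's exact procedure; `neutralize`, `existing_factor`, and the simulated data are all hypothetical.

```python
import numpy as np


def neutralize(signal, factors):
    """Return the part of `signal` that is linearly unexplained by `factors`.

    signal:  shape (n_stocks,)
    factors: shape (n_stocks, n_factors)
    An intercept column is added so the residual is also mean-zero.
    """
    signal = np.asarray(signal, dtype=float)
    exposures = np.column_stack(
        [np.ones(len(signal)), np.asarray(factors, dtype=float)]
    )
    coef, *_ = np.linalg.lstsq(exposures, signal, rcond=None)
    return signal - exposures @ coef


rng = np.random.default_rng(0)
existing_factor = rng.normal(size=200)   # stand-in for a known sentiment factor
noise = rng.normal(size=200)
new_signal = 0.9 * existing_factor + 0.1 * noise  # mostly a copy of the known factor

residual = neutralize(new_signal, existing_factor.reshape(-1, 1))
```

After neutralization the residual is uncorrelated with the existing factor and has only a small fraction of the original signal's variance, which is the sense in which most of a redundant sentiment signal gets removed.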

### What Makes This Valuable
- **It's the pipeline, not the model**: This same pipeline (text → score → aggregate → submit) is reused by every other notebook
- **Baseline to beat**: If embeddings (NB02), graphs (NB04), or probing (NB06) can't beat FinBERT sentiment, they're not worth the complexity

### Numerai-Specific Considerations
- **Point-in-time**: Only use headlines published BEFORE the prediction date
- **Low churn**: Smooth features over time (exponential decay) to avoid high-turnover signals
- **Ticker mapping**: In production, must map company names to Numerai's stock universe (~5,000 tickers)
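
The point-in-time and low-churn items can be combined in one aggregation step. The sketch below keeps only headlines dated strictly before the prediction date and weights them by an exponential decay in age. It assumes `date` is a datetime column; `decayed_sentiment`, the 3-day half-life, and the toy rows are hypothetical illustration choices.

```python
import numpy as np
import pandas as pd


def decayed_sentiment(scored, as_of, halflife_days=3.0):
    """Per-ticker sentiment as of `as_of`, weighting recent headlines more.

    Keeps only rows dated strictly before `as_of` (point-in-time), then takes
    a weighted mean with weights 0.5 ** (age_in_days / halflife_days).
    """
    as_of = pd.Timestamp(as_of)
    past = scored[scored["date"] < as_of].copy()
    age_days = (as_of - past["date"]).dt.days
    past["weight"] = np.power(0.5, age_days / halflife_days)
    past["weighted"] = past["weight"] * past["sentiment_numeric"]
    sums = past.groupby("ticker")[["weighted", "weight"]].sum()
    return sums["weighted"] / sums["weight"]


# Toy rows (made up): the 2024-01-20 headline is in the future and must be ignored.
toy = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "AAPL"],
    "date": pd.to_datetime(["2024-01-15", "2024-01-18", "2024-01-20"]),
    "sentiment_numeric": [1.0, -1.0, 1.0],
})
result = decayed_sentiment(toy, as_of="2024-01-19", halflife_days=3.0)
```

With a 3-day half-life, the one-day-old negative headline carries twice the weight of the four-day-old positive one, so the AAPL value comes out at -1/3 rather than 0.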

### Extensions (TODO)
- [ ] Replace synthetic headlines with real news API (Tiingo, Polygon.io, Yahoo Finance)
- [ ] Add exponential decay weighting (recent headlines matter more)
- [ ] Compare FinBERT vs VADER vs LLM-based sentiment
- [ ] Backtest against actual stock returns
- [ ] Submit to Numerai Signals and measure orthogonality score