 Dataset A (S&P500 + headlines)

I build a next-day direction label from S&P 500 close prices and compare two simple text-based approaches:

BOW baseline: hashed bag-of-words from daily headlines, then logistic regression

FinBERT features: daily mean sentiment probabilities and headline count, then logistic regression

Leakage note:
To avoid look-ahead leakage I use a time split (train ≤ 2018, val 2019–2021, test ≥ 2022) instead of a random split.
If any AI tools were used, I will cite them in the written report.
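
The year boundaries of the split can be sketched in plain Python. This is only an illustration of the rule, not the pandas code used further down; `split_for` is a hypothetical helper, and the four dates are the boundary dates from the printed date ranges.

```python
from datetime import date

def split_for(d: date) -> str:
    """Assign a date to a split using the year boundaries stated above."""
    if d.year <= 2018:
        return "train"
    if d.year <= 2021:
        return "val"
    return "test"

# Boundary dates of the three splits.
days = [date(2018, 12, 31), date(2019, 1, 2), date(2021, 12, 31), date(2022, 1, 3)]
splits = [split_for(d) for d in days]
print(splits)  # ['train', 'val', 'val', 'test']
```

Because the boundaries fall on calendar years, no validation or test day can precede a training day.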


Imports

In [47]:
import os
import re
import hashlib

import numpy as np
import pandas as pd
import torch

from transformers import AutoTokenizer, AutoModelForSequenceClassification


Tokenization + stable hashing for BOW

Baseline tokenizer: keep only [A-Za-z] tokens

For hashing, I avoid Python's built-in hash() because string hashes are randomized per process (unless PYTHONHASHSEED is fixed), so the bins change between runs. That took longer to realize than I am proud of.
Instead I use md5 and map tokens into a fixed number of bins.
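
A small experiment that shows the difference, assuming a fresh interpreter can be started with `subprocess`. The helper `run_with_seed` and the token `'market'` are made up for this illustration; the md5 expression mirrors the one used in the cell below.

```python
import os
import subprocess
import sys

def run_with_seed(code: str, seed: str) -> str:
    """Run a Python one-liner in a fresh interpreter with a given hash seed."""
    env = {**os.environ, "PYTHONHASHSEED": seed}
    out = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

builtin_code = "print(hash('market'))"
md5_code = (
    "import hashlib;"
    "print(int(hashlib.md5('market'.encode('utf-8')).hexdigest()[:8], 16) % 5000)"
)

# Collect the distinct results seen across three differently seeded processes.
builtin_values = {run_with_seed(builtin_code, s) for s in ("1", "2", "3")}
md5_bins = {run_with_seed(md5_code, s) for s in ("1", "2", "3")}

print("distinct built-in hashes:", len(builtin_values))
print("distinct md5 bins:", len(md5_bins))
```

The md5 bin is the same in every process, while the built-in hash depends on the seed, so features built with hash() would not line up with a model or cache from an earlier run.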


In [48]:
word_re = re.compile(r"[a-zA-Z]+")

def toks(s):
    if not isinstance(s, str):
        return []
    return word_re.findall(s.lower())

def stable_hash(word: str, bins: int) -> int:
    # stable across runs, unlike Python's built-in hash()
    h = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(h[:8], 16) % bins

def make_hash_X(texts, bins=5000):
    # hashed bag-of-words with log1p counts
    X = np.zeros((len(texts), bins), dtype=np.float32)
    for i, t in enumerate(texts):
        for w in toks(t):
            X[i, stable_hash(w, bins)] += 1.0
    X = np.log1p(X)
    return torch.tensor(X, dtype=torch.float32)


Logistic regression in PyTorch + evaluation setups

Model: one Linear layer trained with BCEWithLogitsLoss (binary classification). Each epoch is a single full-batch SGD step.

I print validation accuracy each epoch as a sanity check that training works as it should.
I had problems in the beginning getting this right.


print_balance shows the label balance (up-rate) in each split.
majority_baseline_acc predicts the most common train label for every test day.


In [49]:
def train_lr(Xtr, ytr, Xva, yva, epochs=15, lr=0.2):
    dev = "cuda" if torch.cuda.is_available() else "cpu"
    Xtr, ytr = Xtr.to(dev), ytr.to(dev)
    Xva, yva = Xva.to(dev), yva.to(dev)

    m = torch.nn.Linear(Xtr.shape[1], 1).to(dev)
    opt = torch.optim.SGD(m.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()

    # training loop: one full-batch gradient step per epoch
    for ep in range(1, epochs + 1):
        m.train()
        opt.zero_grad()
        logits = m(Xtr).squeeze(1)
        loss = loss_fn(logits, ytr)
        loss.backward()
        opt.step()

        m.eval()
        with torch.no_grad():
            p = torch.sigmoid(m(Xva).squeeze(1))
            pred = (p >= 0.5).float()
            acc = (pred == yva).float().mean().item()

        print(f"ep {ep:02d} | loss {loss.item():.4f} | val_acc {acc:.4f}")

    return m

def acc(m, X, y, name="test"):
    dev = next(m.parameters()).device
    X, y = X.to(dev), y.to(dev)
    m.eval()
    with torch.no_grad():
        p = torch.sigmoid(m(X).squeeze(1))
        pred = (p >= 0.5).float()
        a = (pred == y).float().mean().item()
    print(f"{name}_acc {a:.4f}")
    return a

def print_balance(y, name):
    y = np.asarray(y).astype(float)
    if len(y) == 0:
        print(f"{name}: n=0")
        return
    print(f"{name}: n={len(y)} | up_rate={y.mean():.3f} | down_rate={1-y.mean():.3f}")

def majority_baseline_acc(y_train, y_test, name="baseline"):
    # predict the majority class from train
    p = 1.0 if np.mean(y_train) >= 0.5 else 0.0
    pred = np.full_like(y_test, p, dtype=float)
    a = (pred == y_test).mean()
    print(f"{name}_acc {a:.4f} (predict={int(p)})")
    return a
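
A quick way to read these numbers side by side, sketched in plain Python on toy labels (not taken from the dataset). `positive_rate` is a hypothetical helper, not part of the cell above: if a model's accuracy equals the majority baseline, checking the share of "up" predictions tells whether it has collapsed to a constant classifier.

```python
def majority_label(y_train):
    """Most common label in train; ties go to 1, as in majority_baseline_acc."""
    return 1.0 if sum(y_train) / len(y_train) >= 0.5 else 0.0

def accuracy(pred, y):
    """Fraction of predictions that match the labels."""
    return sum(p == t for p, t in zip(pred, y)) / len(y)

def positive_rate(pred):
    """Share of predictions equal to 1; 0.0 or 1.0 means a constant classifier."""
    return sum(pred) / len(pred)

# Toy labels for illustration only.
y_train = [1.0, 1.0, 0.0, 1.0]
y_test = [1.0, 0.0, 0.0, 1.0]

baseline_pred = [majority_label(y_train)] * len(y_test)
model_pred = [1.0, 1.0, 1.0, 1.0]  # a model that always predicts "up"

print(accuracy(baseline_pred, y_test))  # 0.5
print(accuracy(model_pred, y_test))     # 0.5
print(positive_rate(model_pred))        # 1.0
```

When the two accuracies are identical and the positive rate is 0.0 or 1.0, the model has learned nothing beyond the class prior.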


FinBERT daily features + caching

For each date:
run FinBERT on each headline
average the predicted probabilities across headlines
features per day: p_neg, p_neu, p_pos
plus n_headlines as a simple intensity feature

Note: the tokenizer for FinBERT is loaded with AutoTokenizer.
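
The aggregation step on its own, without loading the model, could look like this. It is a sketch with made-up probabilities; `daily_features` is a hypothetical helper that keys the classes by name so the column order cannot be mixed up.

```python
def daily_features(headline_probs):
    """Average per-headline class probabilities and count headlines for one day.

    headline_probs: list of dicts with keys "neg", "neu", "pos".
    """
    n = len(headline_probs)
    if n == 0:
        # Same fallback as in finbert_feats: a day without headlines is neutral.
        return {"p_neg": 0.0, "p_neu": 1.0, "p_pos": 0.0, "n_headlines": 0}
    return {
        "p_neg": sum(p["neg"] for p in headline_probs) / n,
        "p_neu": sum(p["neu"] for p in headline_probs) / n,
        "p_pos": sum(p["pos"] for p in headline_probs) / n,
        "n_headlines": n,
    }

# Made-up probabilities for two headlines on one day.
probs = [
    {"neg": 0.5, "neu": 0.25, "pos": 0.25},
    {"neg": 0.0, "neu": 0.25, "pos": 0.75},
]
feats = daily_features(probs)
print(feats)  # {'p_neg': 0.25, 'p_neu': 0.25, 'p_pos': 0.5, 'n_headlines': 2}
```

Averaging treats every headline as equally important, which is part of why n_headlines is kept as a separate feature.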


In [50]:
def finbert_feats(day_to_titles, model_name="ProsusAI/finbert", bs=16, max_len=64):
    dev = "cuda" if torch.cuda.is_available() else "cpu"
    tokz = AutoTokenizer.from_pretrained(model_name)
    mdl = AutoModelForSequenceClassification.from_pretrained(model_name).to(dev)
    mdl.eval()

    dates = []
    feats = []
    n_news = []

    for d, titles in day_to_titles.items():
        titles = [t for t in titles if isinstance(t, str) and t.strip() != ""]
        n_news.append(len(titles))

        if len(titles) == 0:
            dates.append(d)
            feats.append([0.0, 1.0, 0.0])  # default neutral
            continue

        probs_all = []
        for i in range(0, len(titles), bs):
            batch = titles[i:i + bs]
            enc = tokz(
                batch,
                padding=True,
                truncation=True,
                max_length=max_len,
                return_tensors="pt",
            )
            enc = {k: v.to(dev) for k, v in enc.items()}

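            with torch.no_grad():
                out = mdl(**enc)
                probs = torch.softmax(out.logits, dim=1)  # columns follow mdl.config.id2label
                probs_all.append(probs.detach().cpu())

        probs_all = torch.cat(probs_all, dim=0)
        # reorder to neg/neu/pos by label name; ProsusAI/finbert's config lists
        # positive/negative/neutral, so index order alone would mislabel the columns
        label2id = {name.lower(): idx for idx, name in mdl.config.id2label.items()}
        order = [label2id["negative"], label2id["neutral"], label2id["positive"]]
        mean_probs = probs_all.mean(dim=0)[order].tolist()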

        dates.append(d)
        feats.append(mean_probs)

    f = pd.DataFrame(feats, columns=["p_neg", "p_neu", "p_pos"])
    f["date"] = dates
    f["n_headlines"] = n_news
    return f

def get_finbert_daily(day_df, cache_path="finbert_daily_cache.csv"):
    # expects columns: date, title_list
    if os.path.exists(cache_path):
        f = pd.read_csv(cache_path)
        f["date"] = pd.to_datetime(f["date"]).dt.date
        return f

    day_map = dict(zip(day_df["date"], day_df["title_list"]))
    f = finbert_feats(day_map)
    f.to_csv(cache_path, index=False)
    return f


Load data + build daily dataset

Note: set MODE to "bow" or "finbert" in the cell below.

Columns in the CSV:
- Title = headline
- Date = headline date
- CP = close price for the day

Daily aggregation:
text: headlines joined for BOW
title_list: list of headlines for FinBERT

Label:
y = 1 if the next row's close (the next date that has headlines) is strictly higher than today's close; ties count as 0
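
The labeling rule on the first closes from the printed sample further down, as a plain-Python sketch of what shift(-1) plus the comparison does:

```python
# First closes from the notebook's printed sample (2008-01-02 onward).
closes = [1447.16, 1447.16, 1416.18, 1409.13]

# Pair each close with the next row's close; the last row gets no label.
labels = [1.0 if nxt > cur else 0.0 for cur, nxt in zip(closes, closes[1:])]
print(labels)  # [0.0, 0.0, 0.0]
```

The first pair is a tie (1447.16 followed by 1447.16), and the strict comparison labels it 0.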


In [None]:
CSV_PATH = "sp500_headlines_2008_2024.csv" 
MODE = "finbert"  # "bow" or "finbert"
SEED = 42

np.random.seed(SEED)
torch.manual_seed(SEED)

df = pd.read_csv(CSV_PATH)

df["date"]  = pd.to_datetime(df["Date"], errors="coerce").dt.date
df["close"] = pd.to_numeric(df["CP"], errors="coerce")
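df["title"] = df["Title"]

# drop missing values before casting; astype(str) would turn NaN into the string "nan"
df = df.dropna(subset=["date", "close", "title"]).copy()
df["title"] = df["title"].astype(str)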

g = df.groupby("date")
day_text  = g["title"].apply(lambda s: " . ".join(s.tolist())).reset_index(name="text")
day_list  = g["title"].apply(lambda s: s.tolist()).reset_index(name="title_list")
day_close = g["close"].last().reset_index(name="close")

day = day_text.merge(day_close, on="date").merge(day_list, on="date")
day = day.sort_values("date").reset_index(drop=True)

# label: next day up/down
day["close_next"] = day["close"].shift(-1)
day = day.dropna(subset=["close_next"]).copy()
day["y"] = (day["close_next"] > day["close"]).astype(np.float32)

day["year"] = pd.Series(day["date"]).apply(lambda d: d.year)

train = day[day["year"] <= 2018].copy()
val   = day[(day["year"] >= 2019) & (day["year"] <= 2021)].copy()
test  = day[day["year"] >= 2022].copy()

print("rows:", len(train), len(val), len(test))
print("date ranges:",
      min(train["date"]), "→", max(train["date"]),
      "|", min(val["date"]), "→", max(val["date"]),
      "|", min(test["date"]), "→", max(test["date"]))

print_balance(train["y"].values, "train")
print_balance(val["y"].values, "val")
print_balance(test["y"].values, "test")

majority_baseline_acc(train["y"].values, test["y"].values, name="majority_baseline")

# tensors
ytr = torch.tensor(train["y"].values, dtype=torch.float32)
yva = torch.tensor(val["y"].values, dtype=torch.float32)
yte = torch.tensor(test["y"].values, dtype=torch.float32)

# quick sanity check of the labels
print(day[["date", "close", "close_next", "y"]].head(3))


rows: 2222 747 537
date ranges: 2008-01-02 → 2018-12-31 | 2019-01-02 → 2021-12-31 | 2022-01-03 → 2024-03-01
train: n=2222 | up_rate=0.543 | down_rate=0.457
val: n=747 | up_rate=0.577 | down_rate=0.423
test: n=537 | up_rate=0.495 | down_rate=0.505
majority_baseline_acc 0.4953 (predict=1)
         date    close  close_next    y
0  2008-01-02  1447.16     1447.16  0.0
1  2008-01-03  1447.16     1416.18  0.0
2  2008-01-07  1416.18     1409.13  0.0




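If MODE == "bow": hashed bag-of-words (5000 bins) + logistic regression  
If MODE == "finbert": daily FinBERT sentiment probs (+ headline count) + logistic regression

The log below has 15 epochs, which matches the "bow" settings (the "finbert" branch trains for 30).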





In [54]:
if MODE == "bow":
    Xtr = make_hash_X(train["text"].tolist(), bins=5000)
    Xva = make_hash_X(val["text"].tolist(), bins=5000)
    Xte = make_hash_X(test["text"].tolist(), bins=5000)

    m = train_lr(Xtr, ytr, Xva, yva, epochs=15, lr=0.2)
    acc(m, Xte, yte, name="test")

elif MODE == "finbert":
    all_f = get_finbert_daily(day[["date", "title_list"]].copy(), cache_path="finbert_daily_cache.csv")

    ftr = train[["date"]].merge(all_f, on="date", how="left")
    fva = val[["date"]].merge(all_f, on="date", how="left")
    fte = test[["date"]].merge(all_f, on="date", how="left")

    Xtr = torch.tensor(ftr[["p_neg", "p_neu", "p_pos", "n_headlines"]].values, dtype=torch.float32)
    Xva = torch.tensor(fva[["p_neg", "p_neu", "p_pos", "n_headlines"]].values, dtype=torch.float32)
    Xte = torch.tensor(fte[["p_neg", "p_neu", "p_pos", "n_headlines"]].values, dtype=torch.float32)

    m = train_lr(Xtr, ytr, Xva, yva, epochs=30, lr=0.5)
    acc(m, Xte, yte, name="test")

else:
    raise ValueError(f'MODE must be "bow" or "finbert", got {MODE!r}')


ep 01 | loss 0.6928 | val_acc 0.5502
ep 02 | loss 0.6920 | val_acc 0.5622
ep 03 | loss 0.6914 | val_acc 0.5676
ep 04 | loss 0.6908 | val_acc 0.5783
ep 05 | loss 0.6903 | val_acc 0.5770
ep 06 | loss 0.6899 | val_acc 0.5770
ep 07 | loss 0.6895 | val_acc 0.5770
ep 08 | loss 0.6891 | val_acc 0.5770
ep 09 | loss 0.6888 | val_acc 0.5770
ep 10 | loss 0.6884 | val_acc 0.5770
ep 11 | loss 0.6881 | val_acc 0.5770
ep 12 | loss 0.6878 | val_acc 0.5770
ep 13 | loss 0.6875 | val_acc 0.5770
ep 14 | loss 0.6872 | val_acc 0.5770
ep 15 | loss 0.6869 | val_acc 0.5770
test_acc 0.4953



Note to self

Why the time split matters

Why daily aggregation is a simplification

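Why accuracy is close to 50%: test_acc (0.4953) equals the majority baseline and val_acc (0.5770) equals the val up-rate, so the model appears to predict "up" for almost every day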

