# Forecasting Showdown: Kalshi Weather Markets vs Frontier LLMs

## Overview
We evaluate whether frontier LLMs can beat **Kalshi daily high-temperature markets**
across six US cities.

### Experimental Design

| Forecaster | Prompt date | Tool access |
|---|---|---|
| **AI @ T-7** | event_date − 7 days | None — pure climatological priors |
| **AI @ T-1** | event_date − 1 day | `get_recent_weather` via Open-Meteo |
| **Kalshi Market** | Final traded price (`last_price`) | Live crowd wisdom |

**Resolution**: Kalshi's own `expiration_value` field — the exact NWS-recorded
temperature used to settle each contract.

**Metric**: Brier Score  $BS = \frac{1}{N}\sum_{i=1}^{N}(p_i - o_i)^2$, where $p_i$ is the
forecast probability and $o_i \in \{0, 1\}$ is the realised outcome
(0 = perfect, 0.25 = always-50% baseline, 1 = maximally wrong).
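
As a quick check on the three reference values above, here is a minimal pure-Python sketch of the metric. The helper name `brier_score` and the toy outcomes are illustrative assumptions, not part of the notebook's pipeline.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Toy outcomes (illustrative only): two YES and two NO resolutions.
outcomes = [1, 0, 1, 0]

perfect = brier_score([1.0, 0.0, 1.0, 0.0], outcomes)   # always right
coin    = brier_score([0.5, 0.5, 0.5, 0.5], outcomes)   # always says 50%
wrong   = brier_score([0.0, 1.0, 0.0, 1.0], outcomes)   # always confidently wrong

print(perfect, coin, wrong)   # 0.0 0.25 1.0
```

Any forecaster scoring above 0.25 on a balanced set of markets is doing worse than someone who always answers 50%.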

**Train / Validate / Test split** (fixed chronological boundaries):
- **Train**: before 2024-01-01 — pre-cutoff for all three models
- **Validate**: 2024-01-01 – 2025-08-31 — straddles all three per-model cutoffs (GPT-4o crosses at Jun 2024; Gemini 2.5 Flash at Jan 2025; Claude Sonnet 4.6 at Aug 2025)
- **Test**: 2025-09-01 onward — post-cutoff for all three models

The fixed splits enable a chronological hold-out evaluation. Within each split the
`post_cutoff` column (per row in results) provides the precise per-model boundary:
a market is post-cutoff for a given model if its event date falls on or after that
model's knowledge cutoff date.
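
The interaction between the fixed split and the per-model flag can be sketched as follows. The boundary and cutoff dates are copied from this notebook's configuration; the helper names `split_of` and `is_post_cutoff` are hypothetical and exist only for this illustration.

```python
from datetime import date

# Boundaries and cutoffs as configured in this notebook.
TRAIN_END = date(2024, 1, 1)
VALIDATE_END = date(2025, 9, 1)
CUTOFFS = {
    "GPT-4o": date(2024, 6, 1),
    "Gemini 2.5 Flash": date(2025, 1, 1),
}


def split_of(event_date):
    """Fixed chronological split, identical for every model."""
    if event_date < TRAIN_END:
        return "train"
    if event_date < VALIDATE_END:
        return "validate"
    return "test"


def is_post_cutoff(event_date, model):
    """Per-model flag: True when the event falls on or after the model's cutoff."""
    return event_date >= CUTOFFS[model]


d = date(2024, 6, 15)
print(split_of(d), is_post_cutoff(d, "GPT-4o"), is_post_cutoff(d, "Gemini 2.5 Flash"))
# validate True False
```

The same validate-split market is therefore post-cutoff for one model and pre-cutoff for another, which is why the flag is stored per result row rather than per market.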

**Model Knowledge Cutoffs** (training data upper bounds):

| Model | Knowledge Cutoff | First post-cutoff split |
|---|---|---|
| GPT-4o | June 2024 | Validate (Jun 2024 onward) |
| Gemini 2.5 Flash | January 2025 | Validate (Jan 2025 onward) |
| Claude Sonnet 4.6 | August 2025 | Validate (Aug 2025 onward) |

In [None]:
# ── Install dependencies (safe to re-run; skips already-installed packages) ──
%pip install -q \
    python-dotenv \
    langchain-core langchain-google-genai langchain-openai langchain-anthropic \
    pandas pyarrow numpy matplotlib seaborn requests tqdm
print("Dependencies ready.")

In [None]:
# ── Environment Setup ─────────────────────────────────────────────────────
import os
from pathlib import Path

try:
    from dotenv import load_dotenv
    if Path(".env").exists():
        load_dotenv(".env", override=True)
        print("Loaded .env")
except ImportError:
    print("python-dotenv not installed — using environment variables directly")

KALSHI_API_KEY    = os.environ.get("KALSHI_API_KEY",    "")
OPENAI_API_KEY    = os.environ.get("OPENAI_API_KEY",    "")
GOOGLE_API_KEY    = os.environ.get("GOOGLE_API_KEY",    "")
ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY", "")

for k, v in {
    "KALSHI_API_KEY":    KALSHI_API_KEY,
    "OPENAI_API_KEY":    OPENAI_API_KEY,
    "GOOGLE_API_KEY":    GOOGLE_API_KEY,
    "ANTHROPIC_API_KEY": ANTHROPIC_API_KEY,
}.items():
    print(f"  {'SET    ' if v else 'MISSING'}  {k}")


In [None]:
# ── Imports & Constants ───────────────────────────────────────────────────
import hashlib
import json
import re
import time
from datetime import date, datetime, timedelta

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns
from tqdm.auto import tqdm

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage, ToolMessage
from langchain_core.tools import tool
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_openai import ChatOpenAI

# ── Kalshi API ────────────────────────────────────────────────────────────
KALSHI_BASE    = "https://api.elections.kalshi.com/trade-api/v2"
KALSHI_HEADERS = {"Authorization": f"Bearer {KALSHI_API_KEY}"}

# ── Open-Meteo endpoints (no key required) ────────────────────────────────
OPEN_METEO_ARCHIVE  = "https://archive-api.open-meteo.com/v1/archive"
OPEN_METEO_FORECAST = "https://api.open-meteo.com/v1/forecast"

# ── City config: Kalshi series + coordinates ─────────────────────────────
# Primary series (KXHIGH*) includes all historical HIGHXX markets via pagination
CITY_SERIES = {
    "New York City": {"series": "KXHIGHNY",   "lat": 40.71, "lon": -74.01},
    "Chicago":       {"series": "KXHIGHCHI",  "lat": 41.85, "lon": -87.65},
    "Miami":         {"series": "KXHIGHMIA",  "lat": 25.77, "lon": -80.19},
    "Los Angeles":   {"series": "KXHIGHLAX",  "lat": 34.05, "lon": -118.24},
    "Denver":        {"series": "KXHIGHDEN",  "lat": 39.74, "lon": -104.98},
    "Seattle":       {"series": "KXHIGHTSEA", "lat": 47.61, "lon": -122.33},
}

# ── LLM models ────────────────────────────────────────────────────────────
# knowledge_cutoff: upper bound of each model's training data (exclusive).
# A market with event_date >= knowledge_cutoff is "post-cutoff" for that model —
# the model cannot have seen patterns from that period during training.
MODELS = {
    "gpt":    {"name": "GPT-4o",            "provider": "openai",    "model_id": "gpt-4o",            "knowledge_cutoff": date(2024, 6, 1)},
    "gemini": {"name": "Gemini 2.5 Flash",  "provider": "google",    "model_id": "gemini-2.5-flash",  "knowledge_cutoff": date(2025, 1, 1)},
    "claude": {"name": "Claude Sonnet 4.6", "provider": "anthropic", "model_id": "claude-sonnet-4-6", "knowledge_cutoff": date(2025, 8, 1)},
}

# ── Split boundaries ──────────────────────────────────────────────────────
TRAIN_END    = date(2024, 1, 1)   # train: before 2024
VALIDATE_END = date(2025, 9, 1)   # validate: 2024-Jan through 2025-Aug
                                   # test: 2025-Sep onward

# ── Kalshi date parsing ───────────────────────────────────────────────────
MONTH_MAP = {m: i for i, m in enumerate(
    ["JAN","FEB","MAR","APR","MAY","JUN","JUL","AUG","SEP","OCT","NOV","DEC"], start=1)}

# ── File paths ────────────────────────────────────────────────────────────
Path("cache").mkdir(exist_ok=True)
CACHE_FILE    = Path("cache/response_cache.json")
MARKETS_CACHE = Path("cache/markets.parquet")
RESULTS_FILE  = Path("cache/results.parquet")

print("Config loaded.")
print(f"Models:  {[MODELS[k]['name'] for k in MODELS]}")
print(f"Cities:  {list(CITY_SERIES.keys())}")
print(f"Splits:  train < 2024 | validate 2024–Aug-2025 | test Sep-2025+")
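# Build the cutoff summary first: escaped quotes inside an f-string expression
# are a SyntaxError.
cutoffs = ", ".join(
    f"{cfg['name']}: {cfg['knowledge_cutoff'].strftime('%b %Y')}" for cfg in MODELS.values()
)
print(f"Cutoffs: {cutoffs}")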


In [None]:
# ── Caching Helpers ───────────────────────────────────────────────────────
def load_cache():
    if CACHE_FILE.exists():
        return json.loads(CACHE_FILE.read_text(encoding="utf-8"))
    return {}

def save_cache(cache):
    CACHE_FILE.write_text(json.dumps(cache, indent=2, ensure_ascii=False),
                          encoding="utf-8")

def ck(*args):
    """Stable 16-char cache key from arbitrary args."""
    return hashlib.sha256(":".join(str(a) for a in args).encode()).hexdigest()[:16]

print("Cache ready.")


In [None]:
# ── Fetch Settled Kalshi Markets ──────────────────────────────────────────

def parse_event_date(event_ticker: str):
    """Parse event date from Kalshi event ticker (e.g. KXHIGHNY-26FEB21)."""
    m = re.search(r"-(\d{2})([A-Z]{3})(\d{2})$", event_ticker)
    if not m:
        return None
    yy, mon, dd = m.groups()
    mo = MONTH_MAP.get(mon)
    if not mo:
        return None
    try:
        return date(2000 + int(yy), mo, int(dd))
    except ValueError:
        return None


def fetch_settled_series(series_ticker: str, max_pages: int = 80,
                          delay: float = 0.15) -> list:
    """Page through all settled markets for a series."""
    markets, cursor = [], None
    for _ in range(max_pages):
        params = {"series_ticker": series_ticker,
                  "status": "settled", "limit": 200}
        if cursor:
            params["cursor"] = cursor
        try:
            r = requests.get(KALSHI_BASE + "/markets",
                             headers=KALSHI_HEADERS, params=params, timeout=20)
            r.raise_for_status()
            data = r.json()
        except Exception as e:
            print(f"    warn: {e}")
            break
        batch = data.get("markets", [])
        if not batch:
            break
        markets.extend(batch)
        cursor = data.get("cursor")
        if not cursor:
            break
        time.sleep(delay)
    return markets


def market_to_row(m: dict, city: str, info: dict) -> dict | None:
    """Convert a raw Kalshi market dict to a flat row. Returns None to skip."""
    ev = m.get("event_ticker", "")
    event_date = parse_event_date(ev)
    if event_date is None:
        return None

    floor = m.get("floor_strike")
    cap   = m.get("cap_strike")
    threshold = floor if floor is not None else cap
    direction = m.get("strike_type", "greater")   # "greater" or "less"

    try:
        actual_temp = float(m.get("expiration_value") or "nan")
    except (ValueError, TypeError):
        actual_temp = np.nan

    lp = m.get("last_price")
    kalshi_prob = float(lp) / 100.0 if lp is not None else np.nan

    result  = (m.get("result") or "").lower()   # tolerate a null result field
    outcome = 1 if result == "yes" else (0 if result == "no" else np.nan)

    return {
        "ticker":        m.get("ticker", ""),
        "event_ticker":  ev,
        "series":        info["series"],
        "city":          city,
        "lat":           info["lat"],
        "lon":           info["lon"],
        "event_date":    event_date,
        "threshold_f":   float(threshold) if threshold is not None else np.nan,
        "direction":     direction,
        "title":         m.get("title", ""),
        "rules_primary": m.get("rules_primary", ""),
        "result":        result,
        "outcome":       outcome,
        "actual_temp_f": actual_temp,
        "kalshi_prob":   kalshi_prob,
        "volume":        int(m.get("volume", 0) or 0),
        "open_time":     m.get("open_time", ""),
        "close_time":    m.get("close_time", ""),
    }


def build_markets_df() -> pd.DataFrame:
    rows = []
    for city, info in CITY_SERIES.items():
        print(f"  {info['series']:12s} ({city})...", end=" ", flush=True)
        raw = fetch_settled_series(info["series"])
        print(f"{len(raw):,}")
        for m in raw:
            row = market_to_row(m, city, info)
            if row:
                rows.append(row)

    df = pd.DataFrame(rows)
    df["event_date"] = pd.to_datetime(df["event_date"])
    df = df.drop_duplicates(subset="ticker")
    df = df.sort_values(["city", "event_date", "threshold_f"]).reset_index(drop=True)
    return df


def assign_split(event_date) -> str:
    d = pd.Timestamp(event_date).date()
    if d < TRAIN_END:
        return "train"
    if d < VALIDATE_END:
        return "validate"
    return "test"


# ── Load or fetch ─────────────────────────────────────────────────────────
if MARKETS_CACHE.exists():
    print("Loading markets from parquet cache...")
    markets_df = pd.read_parquet(MARKETS_CACHE)
    print(f"  {len(markets_df):,} markets loaded from {MARKETS_CACHE}")
else:
    print("Fetching all settled markets from Kalshi (~2 min)...")
    markets_df = build_markets_df()
    markets_df.to_parquet(MARKETS_CACHE, index=False)
    print(f"\nCached {len(markets_df):,} markets → {MARKETS_CACHE}")

markets_df["split"] = markets_df["event_date"].apply(assign_split)

print(f"\nDate range: {markets_df['event_date'].min().date()} "
      f"→ {markets_df['event_date'].max().date()}")
print("\nMarket counts by city × split:")
print(markets_df.groupby(["city","split"]).size().unstack(fill_value=0).to_string())


## Section 1: Market Data Overview

Loads all settled Kalshi high-temperature markets across 6 US cities, assigns
train / validate / test splits by event date, and computes the Kalshi market
Brier score as a baseline. Then draws a stratified sample (up to 50 markets per
split, highest-volume per city × month) for LLM evaluation.
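
The selection rule can be illustrated on a handful of made-up records. This is a pure-Python sketch of the same idea (one highest-volume market per city and month, then a recency cap), not the pandas code used in the notebook. The function name `pick_sample` and the toy tickers are assumptions for illustration.

```python
from datetime import date


def pick_sample(markets, target):
    """Keep the highest-volume market per (city, year, month), then the `target` most recent."""
    best = {}
    for m in markets:
        key = (m["city"], m["event_date"].year, m["event_date"].month)
        if key not in best or m["volume"] > best[key]["volume"]:
            best[key] = m
    # Prefer recency when the cap is binding.
    picked = sorted(best.values(), key=lambda m: m["event_date"], reverse=True)
    return picked[:target]


# Made-up records, illustrative only.
toy = [
    {"ticker": "A", "city": "Miami",  "event_date": date(2025, 9, 3),  "volume": 10},
    {"ticker": "B", "city": "Miami",  "event_date": date(2025, 9, 20), "volume": 90},
    {"ticker": "C", "city": "Denver", "event_date": date(2025, 9, 5),  "volume": 40},
    {"ticker": "D", "city": "Denver", "event_date": date(2025, 10, 1), "volume": 5},
]

sample = pick_sample(toy, target=2)
print([m["ticker"] for m in sample])   # ['D', 'B']
```

Note that the recency cap can drop whole city-months (here Denver in September), so the final sample is not guaranteed to be balanced across cities.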

In [None]:
# ── Descriptive Statistics ────────────────────────────────────────────────
print(f"Total settled markets : {len(markets_df):,}")
print(f"With actual_temp_f    : {markets_df['actual_temp_f'].notna().sum():,}")
print(f"With kalshi_prob      : {markets_df['kalshi_prob'].notna().sum():,}")
print(f"Direction breakdown   : {markets_df['direction'].value_counts().to_dict()}")

valid = markets_df.dropna(subset=["kalshi_prob","outcome"])
print(f"\nKalshi market Brier (full dataset, N={len(valid):,}):")
print(f"  Overall : {((valid['kalshi_prob'] - valid['outcome'])**2).mean():.4f}")
for split, g in valid.groupby("split"):
    bs = ((g["kalshi_prob"] - g["outcome"])**2).mean()
    n  = len(g)
    print(f"  {split:10s}: {bs:.4f}  (N={n:,})")

print("\nOutcome base rates:")
for direction, g in valid.groupby("direction"):
    print(f"  {direction}: YES rate = {g['outcome'].mean():.3f}  (N={len(g):,})")

print("\nActual temperature sample (first 6):")
print(valid[["ticker","city","event_date","threshold_f","direction",
             "actual_temp_f","result","kalshi_prob"]].head(6).to_string(index=False))


In [None]:
# ── Stratified Sample for AI Evaluation ──────────────────────────────────
# Per city × calendar-month: take the highest-volume market.
# Then cap each split at TARGET_PER_SPLIT to limit LLM API cost.
# Total ≈ 3 × TARGET_PER_SPLIT rows → 3 models × 2 snapshots × that many AI calls.

TARGET_PER_SPLIT = 50   # markets per split (train / validate / test)

eligible = markets_df.dropna(
    subset=["kalshi_prob","outcome","actual_temp_f","rules_primary"]
).copy()
# Must have real resolution rules (not empty / placeholder)
eligible = eligible[eligible["rules_primary"].str.len() > 30]
eligible["ym"] = eligible["event_date"].dt.to_period("M")

parts = []
for split in ["train", "validate", "test"]:
    sp = eligible[eligible["split"] == split]
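    # Best (highest-volume) market per city × month. drop_duplicates keeps whole
    # rows, whereas GroupBy.first() takes the first non-null value per column and
    # can combine fields from different markets.
    best = (sp.sort_values("volume", ascending=False)
              .drop_duplicates(subset=["city","ym"], keep="first"))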
    if len(best) > TARGET_PER_SPLIT:
        # Prefer recency: keep most-recent months
        best = best.nlargest(TARGET_PER_SPLIT, "event_date")
    parts.append(best)

sample_df = (pd.concat(parts, ignore_index=True)
               .sort_values(["city","event_date"])
               .reset_index(drop=True))

print(f"Sample: {len(sample_df)} markets  "
      f"(→ {len(sample_df) * len(MODELS) * 2:,} AI calls total, all cached after first run)")
print()
print(sample_df.groupby(["city","split"]).size().unstack(fill_value=0).to_string())
print()
print("Example rows:")
cols = ["ticker","city","event_date","threshold_f","direction",
        "title","result","actual_temp_f","kalshi_prob","split"]
print(sample_df[cols].head(6).to_string(index=False))


## Section 2: LLM Forecasting

Each sampled market is evaluated by three frontier LLMs at two snapshot dates:

- **T-7 (vanilla)**: prompt date is 7 days before resolution; no tools. The model
  relies entirely on its training knowledge of climatological patterns — making
  knowledge cutoff the key variable.
- **T-1 (tool-augmented)**: prompt date is 1 day before resolution; the model can
  call `get_recent_weather` to fetch the last 10 days of observed highs, giving
  it live situational awareness that partially offsets any cutoff disadvantage.

The exact **Kalshi market title** and **resolution criteria** (`rules_primary`) are
fed verbatim into every prompt — no paraphrasing. This means:

1. The AI sees the same question the market participants traded on.
2. Any ambiguity in the market language affects AI and market equally.
3. The same prompting pipeline can be reused unchanged for any Kalshi binary market, not just weather.
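
The two snapshot dates and the verbatim insertion of market text can be sketched as below. The template is a shortened stand-in for the notebook's actual prompts, and the market record is made up. `build_prompt` and `SNAPSHOT_OFFSETS` are hypothetical names.

```python
from datetime import date, timedelta

SNAPSHOT_OFFSETS = {"T-7": 7, "T-1": 1}   # days before the event date


def build_prompt(market, snapshot):
    """Assemble a user prompt; title and rules are inserted verbatim."""
    forecast_date = market["event_date"] - timedelta(days=SNAPSHOT_OFFSETS[snapshot])
    return (
        f"Today is {forecast_date.isoformat()}.\n\n"
        f"Market title: {market['title']}\n\n"
        f"Resolution criteria: {market['rules_primary']}\n\n"
        "Estimate the probability this market resolves YES."
    )


# Made-up market record for illustration only (not real Kalshi data).
market = {
    "event_date": date(2025, 9, 15),
    "title": "Example title text",
    "rules_primary": "Example resolution rules text.",
}

prompt_t7 = build_prompt(market, "T-7")
prompt_t1 = build_prompt(market, "T-1")
print(prompt_t7.splitlines()[0])   # Today is 2025-09-08.
print(prompt_t1.splitlines()[0])   # Today is 2025-09-14.
```

Only the stated date differs between the two snapshots; the market text is identical in both.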

In [None]:
# ── Prompts & Model Factory ───────────────────────────────────────────────
SYSTEM_PROMPT = (
    "You are an expert weather forecaster and superforecaster specialising in "
    "prediction market calibration. Your sole task is to estimate the probability "
    "that a specific Kalshi binary weather market resolves YES.\n\n"
    "Guidelines:\n"
    "- Ground your estimate in historical base rates for the city and month.\n"
    "- Consider the specific threshold relative to seasonal climatology.\n"
    "- Be well-calibrated: 70% probability should resolve YES ~70% of the time.\n"
    "- Avoid round-number anchoring (0.25, 0.50, 0.75) unless clearly justified.\n\n"
    "End your response with EXACTLY one line:\n"
    "PROBABILITY: X.XX\n"
    "where X.XX is a decimal in [0.00, 1.00]."
)

USER_T7 = (
    "Today is {forecast_date} — the market resolves in 7 days.\n\n"
    "Market title: {title}\n\n"
    "Resolution criteria: {rules_primary}\n\n"
    "Using historical climatological patterns for {city} in {month_name}, "
    "estimate the probability this market resolves YES. "
    "Think step by step, then state your probability."
)

USER_T1 = (
    "Today is {forecast_date} — the market resolves tomorrow.\n\n"
    "Market title: {title}\n\n"
    "Resolution criteria: {rules_primary}\n\n"
    "Call get_recent_weather for {city} to check current conditions, "
    "then estimate the probability this market resolves YES. "
    "Think step by step, then state your probability."
)


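def parse_prob(text: str) -> float:
    """Extract the stated probability from a model response; NaN if none is found."""
    text = str(text)
    if text.startswith("ERROR:"):
        return np.nan   # failed call: never mine an error message for numbers
    # Accept 0, 1, 0.xx and 1.0…; reject values such as 1.5 or 10.
    m = re.search(r"PROBABILITY:\s*(0(?:\.\d+)?|1(?:\.0+)?)(?!\.?\d)", text)
    if m:
        return float(m.group(1))
    # Fallback: last probability-like number near the end of the response.
    hits = re.findall(r"\b(0\.\d+|1\.0)\b", text[-500:])
    return float(hits[-1]) if hits else np.nan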


def get_llm(model_key: str):
    cfg = MODELS[model_key]
    if cfg["provider"] == "openai":
        return ChatOpenAI(model=cfg["model_id"], temperature=0,
                          api_key=OPENAI_API_KEY)
    if cfg["provider"] == "google":
        return ChatGoogleGenerativeAI(model=cfg["model_id"], temperature=0,
                                      google_api_key=GOOGLE_API_KEY)
    if cfg["provider"] == "anthropic":
        return ChatAnthropic(model=cfg["model_id"], temperature=0,
                             api_key=ANTHROPIC_API_KEY)
    raise ValueError(f"Unknown provider: {cfg['provider']}")


@tool
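def get_recent_weather(city: str, days_back: int = 10, as_of: str = "") -> str:
    """Fetch recent daily maximum temperatures (degF) for a US city from Open-Meteo.

    Args:
        city: One of: New York City, Chicago, Miami, Los Angeles, Denver, Seattle.
        days_back: How many past days to return (default 10).
        as_of: Today's date as stated in the prompt (YYYY-MM-DD). The window ends
            on this date; if omitted, it ends on the real current date.

    Returns:
        Daily high temperatures as a text table.
    """
    info = None
    city_lower = city.lower().strip()
    for k, v in CITY_SERIES.items():
        if city_lower in k.lower() or k.lower() in city_lower:
            info = v
            break
    if info is None:
        return f"City not recognised. Available: {list(CITY_SERIES.keys())}"

    # Anchor the window on the prompt's date rather than the wall clock, so a
    # historical market is not shown weather from the day the notebook runs.
    try:
        end = datetime.strptime(as_of, "%Y-%m-%d") if as_of else datetime.now()
    except ValueError:
        return "as_of must be a date formatted as YYYY-MM-DD."
    start = end - timedelta(days=max(days_back, 1))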
    params = {
        "latitude":          info["lat"],
        "longitude":         info["lon"],
        "start_date":        start.strftime("%Y-%m-%d"),
        "end_date":          end.strftime("%Y-%m-%d"),
        "daily":             "temperature_2m_max",
        "temperature_unit":  "fahrenheit",
        "timezone":          "auto",
    }
    for url in [OPEN_METEO_ARCHIVE, OPEN_METEO_FORECAST]:
        try:
            r = requests.get(url, params=params, timeout=12)
            if r.ok:
                data  = r.json()
                times = data["daily"]["time"]
                temps = data["daily"]["temperature_2m_max"]
                lines = [f"Recent daily highs for {city} (degF):"]
                for t, temp in zip(times, temps):
                    if temp is not None:
                        lines.append(f"  {t}: {float(temp):.1f}")
                return "\n".join(lines)
        except Exception:
            continue
    return f"Could not fetch weather data for {city}."


TOOLS    = [get_recent_weather]
TOOL_MAP = {t.name: t for t in TOOLS}

print(f"Setup complete. Tools: {[t.name for t in TOOLS]}")
print(f"Models: {[MODELS[k]['name'] for k in MODELS]}")


In [None]:
# ── T-7 Vanilla Forecasting (parallel, no tools, pure climatological priors) ──
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

CACHE_LOCK  = threading.Lock()
T7_WORKERS  = 10   # parallel LLM calls — lower if you hit rate limits
SAVE_EVERY  = 10   # flush cache to disk every N new completions


def run_t7(df: pd.DataFrame, cache: dict) -> pd.DataFrame:
    # Create one LLM client per model; LangChain clients are thread-safe.
    llm_pool = {mk: get_llm(mk) for mk in MODELS}
    tasks    = [(mk, row) for mk in MODELS for _, row in df.iterrows()]
    new_count = [0]   # mutable so the closure can update it

    def forecast_one(args):
        model_key, row = args
        cfg = MODELS[model_key]
        key = ck("t7", model_key, row["ticker"])

        with CACHE_LOCK:
            if key in cache:
                output = cache[key]["output"]
                return {"ticker":      row["ticker"],
                        "model":       cfg["name"],
                        "model_key":   model_key,
                        "snapshot":    "T-7",
                        "method":      "vanilla",
                        "probability": parse_prob(output),
                        "raw_output":  str(output)[-400:]}, None

        fd  = (pd.Timestamp(row["event_date"]) - timedelta(days=7)).strftime("%Y-%m-%d")
        mon = pd.Timestamp(row["event_date"]).strftime("%B")
        msgs = [
            SystemMessage(content=SYSTEM_PROMPT),
            HumanMessage(content=USER_T7.format(
                forecast_date=fd,
                title=row["title"],
                rules_primary=row["rules_primary"],
                city=row["city"],
                month_name=mon,
            )),
        ]
        try:
            output = llm_pool[model_key].invoke(msgs).content
            update = (key, output)
        except Exception as e:
            output = f"ERROR: {e}"
            update = None   # do not cache failures, so a re-run retries them

        return {"ticker":      row["ticker"],
                "model":       cfg["name"],
                "model_key":   model_key,
                "snapshot":    "T-7",
                "method":      "vanilla",
                "probability": parse_prob(output),
                "raw_output":  str(output)[-400:]}, update

    results = []
    with ThreadPoolExecutor(max_workers=T7_WORKERS) as executor:
        futures = {executor.submit(forecast_one, t): t for t in tasks}
        for future in tqdm(as_completed(futures), total=len(futures), desc="T-7 forecasts"):
            result, update = future.result()
            results.append(result)
            if update:
                key, output = update
                with CACHE_LOCK:
                    cache[key] = {"output": output}
                    new_count[0] += 1
                    if new_count[0] % SAVE_EVERY == 0:
                        save_cache(cache)

    save_cache(cache)   # final flush

    t7_results = pd.DataFrame(results)
    valid_t7   = t7_results.dropna(subset=["probability"])
    print(f"\nT-7 forecasts: {len(t7_results)} | Parse success: {len(valid_t7)}/{len(t7_results)}")
    print(t7_results.groupby("model")["probability"].describe().round(3))
    return t7_results


cache      = load_cache()
t7_results = run_t7(sample_df, cache)


In [None]:
# ── T-1 Tool-Augmented Forecasting (parallel) ────────────────────────────
T1_WORKERS = 10   # parallel LLM calls — lower if you hit rate limits


def run_t1(df: pd.DataFrame, cache: dict) -> pd.DataFrame:
    llm_pool            = {mk: get_llm(mk) for mk in MODELS}
    llm_with_tools_pool = {mk: llm_pool[mk].bind_tools(TOOLS) for mk in MODELS}
    tasks    = [(mk, row) for mk in MODELS for _, row in df.iterrows()]
    new_count = [0]

    def forecast_one(args):
        model_key, row = args
        cfg = MODELS[model_key]
        key = ck("t1", model_key, row["ticker"])

        with CACHE_LOCK:
            if key in cache:
                output = cache[key]["output"]
                return {"ticker":      row["ticker"],
                        "model":       cfg["name"],
                        "model_key":   model_key,
                        "snapshot":    "T-1",
                        "method":      "tool_augmented",
                        "probability": parse_prob(output),
                        "raw_output":  str(output)[-400:]}, None

        fd   = (pd.Timestamp(row["event_date"]) - timedelta(days=1)).strftime("%Y-%m-%d")
        msgs = [
            SystemMessage(content=SYSTEM_PROMPT),
            HumanMessage(content=USER_T1.format(
                forecast_date=fd,
                title=row["title"],
                rules_primary=row["rules_primary"],
                city=row["city"],
            )),
        ]
        output = "ERROR: max_iterations"
        for _ in range(6):
            try:
                resp = llm_with_tools_pool[model_key].invoke(msgs)
            except Exception as e:
                output = f"ERROR: {e}"
                break
            msgs.append(resp)
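            if resp.tool_calls:
                for tc in resp.tool_calls:
                    fn = TOOL_MAP.get(tc["name"])
                    # Answer every tool call, even if the tool is unknown or raises;
                    # chat APIs generally reject a tool call left without a reply.
                    try:
                        tool_out = fn.invoke(tc["args"]) if fn else f"Unknown tool: {tc['name']}"
                    except Exception as e:
                        tool_out = f"Tool error: {e}"
                    msgs.append(ToolMessage(
                        content=str(tool_out),
                        tool_call_id=tc["id"],
                    ))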
            else:
                output = resp.content
                break

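        # Cache only successful completions so that failed calls are retried on re-run.
        update = None if str(output).startswith("ERROR:") else (key, output)
        return {"ticker":      row["ticker"],
                "model":       cfg["name"],
                "model_key":   model_key,
                "snapshot":    "T-1",
                "method":      "tool_augmented",
                "probability": parse_prob(output),
                "raw_output":  str(output)[-400:]}, update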

    results = []
    with ThreadPoolExecutor(max_workers=T1_WORKERS) as executor:
        futures = {executor.submit(forecast_one, t): t for t in tasks}
        for future in tqdm(as_completed(futures), total=len(futures), desc="T-1 forecasts"):
            result, update = future.result()
            results.append(result)
            if update:
                key, output = update
                with CACHE_LOCK:
                    cache[key] = {"output": output}
                    new_count[0] += 1
                    if new_count[0] % SAVE_EVERY == 0:
                        save_cache(cache)

    save_cache(cache)   # final flush

    t1_results = pd.DataFrame(results)
    valid_t1   = t1_results.dropna(subset=["probability"])
    print(f"\nT-1 forecasts: {len(t1_results)} | Parse success: {len(valid_t1)}/{len(t1_results)}")
    print(t1_results.groupby("model")["probability"].describe().round(3))
    return t1_results


t1_results = run_t1(sample_df, cache)


In [None]:
# ── Assemble Master Results DataFrame ────────────────────────────────────
META_COLS = [
    "ticker","event_ticker","city","event_date","threshold_f","direction",
    "title","result","outcome","actual_temp_f","kalshi_prob","volume","split",
]

all_ai     = pd.concat([t7_results, t1_results], ignore_index=True)
results_df = all_ai.merge(sample_df[META_COLS], on="ticker", how="left")

results_df["brier"] = np.where(
    results_df[["probability","outcome"]].notna().all(axis=1),
    (results_df["probability"] - results_df["outcome"]) ** 2,
    np.nan,
)

# ── Add per-model post-cutoff flag ────────────────────────────────────────
# post_cutoff=True means event_date >= that model's knowledge_cutoff,
# i.e. the model had no training data from this time period.
cutoff_map = {cfg["name"]: cfg["knowledge_cutoff"] for cfg in MODELS.values()}
results_df["post_cutoff"] = results_df.apply(
    lambda row: (pd.Timestamp(row["event_date"]).date() >= cutoff_map[row["model"]])
    if row["model"] in cutoff_map else False,
    axis=1,
)

# ── Add Kalshi market as its own forecaster row ───────────────────────────
# Kalshi final price is used as the market baseline at both snapshots.
kalshi_rows = []
for _, row in sample_df.dropna(subset=["kalshi_prob","outcome"]).iterrows():
    for snap in ["T-7", "T-1"]:
        kalshi_rows.append({
            **{c: row[c] for c in META_COLS},
            "model":        "Kalshi Market",
            "model_key":    "kalshi",
            "snapshot":     snap,
            "method":       "prediction_market",
            "probability":  row["kalshi_prob"],
            "raw_output":   "",
            "brier":        (row["kalshi_prob"] - row["outcome"]) ** 2,
            "post_cutoff":  False,
        })

results_df = pd.concat([results_df, pd.DataFrame(kalshi_rows)], ignore_index=True)

# ── Persist ───────────────────────────────────────────────────────────────
results_df.to_parquet(RESULTS_FILE, index=False)
(results_df.drop(columns=["raw_output"], errors="ignore")
           .to_csv("cache/results.csv", index=False))

print(f"Saved: {RESULTS_FILE}  ({len(results_df):,} rows)")
print(f"       cache/results.csv")
print(f"\nColumns: {list(results_df.columns)}")
print(f"\nForecasters: {sorted(results_df['model'].unique())}")

print(f"\nPost-cutoff market counts per AI model (T-7 snapshot):")
t7_ai = results_df[(results_df["snapshot"] == "T-7") & (results_df["model"] != "Kalshi Market")]
print(t7_ai.groupby("model")["post_cutoff"].value_counts().unstack(fill_value=0).to_string())

print()
print(results_df[["ticker","city","event_date","snapshot","model",
                  "probability","kalshi_prob","outcome","brier","split","post_cutoff"]
                ].head(12).to_string(index=False))


## Section 3: Brier Score Analysis

Five views of forecast quality (lower Brier = better; always-0.5 baseline = 0.25):

1. **Overall** — mean Brier by model × snapshot across all splits.
2. **By split** — train / validate / test breakdown revealing chronological degradation.
3. **By city** — geographic variation in forecast difficulty.
4. **Generalisation gap** — test minus validate Brier; the primary out-of-distribution metric.
5. **Knowledge cutoff effect** — for T-7 vanilla forecasts, pre-cutoff vs post-cutoff
   Brier per model using each model's own training-data boundary (`post_cutoff` flag).
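
Views 4 and 5 both reduce to a difference of two group means. A minimal sketch on made-up forecasts follows; the helper `brier_gap` and the toy rows are assumptions for illustration, and the same function applies to the `post_cutoff` flag by passing that field instead of `split`.

```python
from statistics import mean


def brier_gap(rows, field, base, other):
    """Mean Brier where row[field] == other, minus mean Brier where row[field] == base."""
    def group_mean(value):
        return mean((r["p"] - r["o"]) ** 2 for r in rows if r[field] == value)
    return group_mean(other) - group_mean(base)


# Made-up forecasts: p = predicted probability, o = realised outcome.
rows = [
    {"split": "validate", "p": 0.8, "o": 1},
    {"split": "validate", "p": 0.3, "o": 0},
    {"split": "test",     "p": 0.6, "o": 1},
    {"split": "test",     "p": 0.5, "o": 0},
]

gap = brier_gap(rows, "split", base="validate", other="test")
print(f"gap={gap:+.4f}")   # gap=+0.1400
```

A positive gap means the forecaster scored worse on the later period; with samples this small, real gaps should be read alongside their N.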

In [None]:
# ── Brier Score Analysis ──────────────────────────────────────────────────
scored = results_df.dropna(subset=["brier"]).copy()
print(f"Scored rows: {len(scored):,}  (of {len(results_df):,} total)")

SEP = "=" * 72

# ── Overall ───────────────────────────────────────────────────────────────
print(f"\n{SEP}")
print("OVERALL BRIER SCORES  (lower = better | baseline always-0.5 = 0.2500)")
print(SEP)
overall = (
    scored.groupby(["snapshot","model"])["brier"]
    .agg(["mean","std","count"])
    .rename(columns={"mean":"Mean Brier","std":"Std","count":"N"})
    .round(4)
    .sort_values(["snapshot","Mean Brier"])
)
print(overall.to_string())

# ── By split ──────────────────────────────────────────────────────────────
print(f"\n{SEP}")
print("BRIER BY DATA SPLIT")
print(SEP)
by_split = (
    scored.groupby(["split","snapshot","model"])["brier"]
    .agg(["mean","count"])
    .rename(columns={"mean":"Mean Brier","count":"N"})
    .round(4)
    .sort_index()
)
print(by_split.to_string())

# ── By city ───────────────────────────────────────────────────────────────
print(f"\n{SEP}")
print("MEAN BRIER BY CITY (all snapshots)")
print(SEP)
print(
    scored.groupby(["city","model"])["brier"]
    .mean().round(4).unstack().to_string()
)

# ── Generalisation gap ────────────────────────────────────────────────────
print(f"\n{SEP}")
print("GENERALISATION GAP  (test Brier − validate Brier)")
print(SEP)
for (snap, model), grp in scored.groupby(["snapshot","model"]):
    val = grp.loc[grp["split"] == "validate", "brier"].mean()
    tst = grp.loc[grp["split"] == "test",     "brier"].mean()
    if pd.notna(val) and pd.notna(tst):
        print(f"  {model:25s} @ {snap}: "
              f"validate={val:.4f}  test={tst:.4f}  gap={tst-val:+.4f}")

# ── Knowledge cutoff effect (T-7 vanilla, per-model) ─────────────────────
print(f"\n{SEP}")
print("KNOWLEDGE CUTOFF EFFECT  (T-7 vanilla — pre-cutoff vs post-cutoff per model)")
print(SEP)
t7_ai = scored[(scored["snapshot"] == "T-7") & (scored["model"] != "Kalshi Market")]
for model, grp in t7_ai.groupby("model"):
    cutoff  = cutoff_map.get(model, "N/A")
    # Cast defensively: `~` only negates correctly on a boolean-dtype column.
    is_post = grp["post_cutoff"].astype(bool)
    pre     = grp.loc[~is_post, "brier"]
    post    = grp.loc[ is_post, "brier"]
    print(f"  {model}  (cutoff: {cutoff})")
    if len(pre):
        print(f"    pre-cutoff  N={len(pre):3d}  Brier={pre.mean():.4f}")
    if len(post):
        # The gap is only defined when both sides have data.
        gap_txt = f"  gap={post.mean() - pre.mean():+.4f}" if len(pre) else ""
        print(f"    post-cutoff N={len(post):3d}  Brier={post.mean():.4f}{gap_txt}")


## Section 4: Visualisations

Four panels saved to `cache/results_plots.png`:

1. **Mean Brier by model × snapshot**: bar chart comparing T-7 and T-1 performance against the no-skill baseline (0.25).
2. **Brier by split**: grouped bar chart of mean Brier on train / validate / test for each LLM (Kalshi excluded, both snapshots pooled); a rise from validate to test indicates a generalisation gap.
3. **AI @ T-1 vs Kalshi**: scatter of AI probabilities against Kalshi final prices; points near the diagonal indicate agreement with the market.
4. **Calibration curves**: observed YES rate vs mean predicted probability in ten equal-width bins (bins with fewer than 3 forecasts are omitted); a well-calibrated forecaster tracks the diagonal.
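
To make the binning behind panel 4 concrete, here is a minimal sketch on invented toy data. The helper name `calibration_table` and the sample forecasts are hypothetical, and the sketch uses `pd.cut` rather than the loop in the plotting cell, so its bins are closed on the right (`(a, b]`), not on the left.

```python
import numpy as np
import pandas as pd


def calibration_table(prob, outcome, n_bins=10, min_count=3):
    """Compare mean forecast with observed YES rate per probability bin.

    Hypothetical helper for illustration; not part of the notebook.
    """
    df = pd.DataFrame({"probability": prob, "outcome": outcome})
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # include_lowest=True keeps p == 0.0 in the first bin; because bins are
    # right-closed, p == 1.0 lands in the last bin.
    df["bin"] = pd.cut(df["probability"], bins=edges, include_lowest=True)
    table = df.groupby("bin", observed=True).agg(
        mean_pred=("probability", "mean"),
        obs_rate=("outcome", "mean"),
        n=("outcome", "size"),
    )
    # Drop sparse bins, mirroring the >= 3 rule used for the plot.
    return table[table["n"] >= min_count].reset_index(drop=True)


# Toy forecasts (invented): three low, four high, one mid-range.
toy_prob = [0.05, 0.05, 0.05, 0.95, 0.95, 0.95, 1.0, 0.5]
toy_outcome = [0, 0, 1, 1, 1, 1, 1, 0]
table = calibration_table(toy_prob, toy_outcome)
print(table)
```

In this toy case the low bin forecasts about 5% but resolves YES one time in three, so its point would sit well above the diagonal (under-confident), while the single mid-range forecast is dropped for having too few rows.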

In [None]:
# ── Plots ─────────────────────────────────────────────────────────────────
fig, axs = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle("Forecasting Showdown: Kalshi Weather Markets vs LLMs", fontsize=14)

# ── 1. Mean Brier by model × snapshot ────────────────────────────────────
ax = axs[0, 0]
piv = (scored.groupby(["model","snapshot"])["brier"]
       .mean().unstack().round(4))
piv.plot(kind="bar", ax=ax, rot=30, width=0.65)
ax.axhline(0.25, color="red", ls="--", alpha=0.6, label="No-skill (0.25)")
ax.set_title("Mean Brier: Model × Snapshot")
ax.set_ylabel("Mean Brier Score")
ax.set_xlabel("")
ax.legend(fontsize=8)

# ── 2. Generalisation gap (split × model) ────────────────────────────────
ax = axs[0, 1]
gap_data = (scored[scored["model"] != "Kalshi Market"]
            .groupby(["split","model"])["brier"]
            .mean().reset_index())
sns.barplot(data=gap_data, x="split", y="brier", hue="model",
            order=["train","validate","test"], ax=ax)
ax.axhline(0.25, color="red", ls="--", alpha=0.6)
ax.set_title("Brier by Split — Generalisation Gap")
ax.set_ylabel("Mean Brier Score")
ax.set_xlabel("")
ax.legend(fontsize=7)

# ── 3. AI T-1 prob vs Kalshi final price ─────────────────────────────────
ax = axs[1, 0]
markers = {"GPT-4o": "o", "Gemini 2.5 Flash": "s", "Claude Sonnet 4.6": "^"}
for model_key, cfg in MODELS.items():
    sub = scored[(scored["model"] == cfg["name"]) & (scored["snapshot"] == "T-1")]
    if len(sub):
        mk = markers.get(cfg["name"], "o")
        ax.scatter(sub["kalshi_prob"], sub["probability"],
                   label=cfg["name"], alpha=0.4, s=22, marker=mk)
ax.plot([0, 1], [0, 1], "k--", lw=1, label="Perfect agreement")
ax.set_xlabel("Kalshi Final Price")
ax.set_ylabel("AI Probability @ T-1")
ax.set_title("AI @ T-1  vs  Kalshi Market Consensus")
ax.legend(fontsize=7)

# ── 4. Calibration curves ─────────────────────────────────────────────────
ax = axs[1, 1]
bins = np.linspace(0, 1, 11)

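def calibration_curve(sub):
    """Return (mean predicted prob, observed YES rate) per bin with >= 3 rows."""
    means, freqs = [], []
    n_bins = len(bins) - 1
    for i in range(n_bins):
        lo, hi = bins[i], bins[i + 1]
        # Close the last bin on the right so probability == 1.0 is not dropped.
        below_hi = (sub["probability"] <= hi) if i == n_bins - 1 else (sub["probability"] < hi)
        mask = (sub["probability"] >= lo) & below_hi
        if mask.sum() >= 3:
            means.append(sub.loc[mask, "probability"].mean())
            freqs.append(sub.loc[mask, "outcome"].mean())
    return means, freqs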

for model_key, cfg in MODELS.items():
    for snap, ls in [("T-7", "--"), ("T-1", "-")]:
        sub = scored[(scored["model"] == cfg["name"]) &
                     (scored["snapshot"] == snap)].dropna(subset=["outcome"])
        if len(sub) < 10:
            continue
        mx, fy = calibration_curve(sub)
        if mx:
            ax.plot(mx, fy, marker="o", ls=ls, ms=4,
                    label=f"{cfg['name']} ({snap})")

# Kalshi calibration
ksub = scored[(scored["model"] == "Kalshi Market") &
              (scored["snapshot"] == "T-1")].dropna(subset=["outcome"])
if len(ksub) >= 10:
    km, kf = calibration_curve(ksub)
    if km:
        ax.plot(km, kf, marker="s", lw=2, color="black", label="Kalshi Market")

ax.plot([0, 1], [0, 1], "k--", lw=1, alpha=0.35)
ax.set_xlabel("Predicted Probability")
ax.set_ylabel("Observed Frequency")
ax.set_title("Calibration  (solid=T-1, dashed=T-7)")
ax.legend(fontsize=6, ncol=2)

plt.tight_layout()
plt.savefig("cache/results_plots.png", dpi=150, bbox_inches="tight")
plt.show()
print("Saved: cache/results_plots.png")


## Section 5: Summary

This section prints the final leaderboard, which ranks all forecasters by mean
Brier score at each snapshot, followed by the best forecaster per city at T-1 and
instructions for reloading the persisted results from `cache/results.parquet` or
`cache/results.csv`.
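
As a sketch of the reload step, the hypothetical helper below (`load_results` is not defined elsewhere in this notebook) prefers the Parquet file and falls back to CSV. It assumes both files sit in the same directory and that the CSV contains an `event_date` column, as listed under the key columns.

```python
from pathlib import Path

import pandas as pd


def load_results(cache_dir="cache"):
    """Load persisted results, preferring Parquet and falling back to CSV.

    Hypothetical helper; the file names follow the paths quoted above.
    """
    cache = Path(cache_dir)
    parquet_path = cache / "results.parquet"
    csv_path = cache / "results.csv"
    if parquet_path.exists():
        # Parquet keeps dtypes (dates, booleans) as they were written.
        return pd.read_parquet(parquet_path)
    if csv_path.exists():
        # CSV stores dates as text, so re-parse event_date on load.
        return pd.read_csv(csv_path, parse_dates=["event_date"])
    raise FileNotFoundError(f"No results file found in {cache}")
```

Calling `results_df = load_results()` from the notebook's working directory should then restore the frame used by the analysis cells; Parquet is tried first because it round-trips column types without re-parsing.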

In [None]:
# ── Final Summary ─────────────────────────────────────────────────────────
SEP = "=" * 72
print(SEP)
print("FORECASTING SHOWDOWN — FINAL LEADERBOARD")
print(SEP)
print(f"Markets: {sample_df['ticker'].nunique():,} | "
      f"Cities: {sample_df['city'].nunique()} | "
      f"Span: {sample_df['event_date'].min().date()} "
      f"→ {sample_df['event_date'].max().date()}")
print()

leaderboard = (
    scored.groupby(["snapshot","model"])["brier"]
    .mean().round(4).reset_index()
    .rename(columns={"brier":"Mean Brier"})
    .sort_values(["snapshot","Mean Brier"])
)
print(leaderboard.to_string(index=False))
print()
print("  Baseline (always 0.50):  0.2500")
print("  Perfect forecaster:      0.0000")

print(f"\n{SEP}")
print("BEST FORECASTER PER CITY  (T-1 snapshot)")
print(SEP)
t1 = scored[scored["snapshot"] == "T-1"]
for city, g in t1.groupby("city"):
    # Compute per-model means once, then pick the lowest (best) Brier score.
    city_means = g.groupby("model")["brier"].mean()
    best = city_means.idxmin()
    bs   = city_means.min()
    print(f"  {city:15s}  {best:25s}  Brier={bs:.4f}")

print(f"\n{SEP}")
print("HOW TO RELOAD RESULTS")
print(SEP)
print("  import pandas as pd")
print("  results_df = pd.read_parquet('cache/results.parquet')")
print("  # or:  pd.read_csv('cache/results.csv')")
print()
print("Key columns: ticker, city, event_date, threshold_f, direction,")
print("             actual_temp_f, kalshi_prob, outcome, result,")
print("             snapshot, model, method, probability, brier, split, post_cutoff")
print()
print("post_cutoff: True when event_date >= that model's knowledge cutoff date.")
print("             Always False for Kalshi Market rows (no training cutoff).")
