# Forecasting Showdown: LLMs vs Kalshi Weather Prediction Markets

## Overview
We evaluate whether frontier LLMs are well-calibrated on **Kalshi daily
high-temperature binary markets** across six US cities, and benchmark them
against the **Kalshi prediction market itself** at the same T-1 snapshot.

### Experimental Design

| Forecaster | Snapshot | Weather context |
|---|---|---|
| **GPT-4o** | T-1 (eve of event) | 10-day historical highs injected into prompt |
| **Gemini 2.0 Flash** | T-1 | Same |
| **Claude 3.5 Sonnet** | T-1 | Same |
| **Kalshi T-1** | T-1 (last pre-midnight candle) | N/A — live prediction market |
| **Baseline: Always-50%** | — | None |
| **Baseline: City-Month Rate** | — | Historical YES rate for that city × month |
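
The City-Month Rate baseline can be sketched as a lookup of historical YES frequencies. The snippet below is a minimal illustration, not the notebook's implementation: the `(city, event_date, outcome)` tuple layout, the helper names, and the 0.5 fallback for unseen city-month pairs are all assumptions, and the toy rows are made up.

```python
from collections import defaultdict
from datetime import date


def city_month_rates(history):
    """Map (city, month) -> historical YES rate.

    `history` is an iterable of (city, event_date, outcome) tuples with
    outcome in {0, 1}. This layout is an assumption of the sketch.
    """
    totals = defaultdict(lambda: [0, 0])  # (city, month) -> [yes_count, n]
    for city, event_date, outcome in history:
        bucket = totals[(city, event_date.month)]
        bucket[0] += outcome
        bucket[1] += 1
    return {key: yes / n for key, (yes, n) in totals.items()}


def baseline_forecast(rates, city, event_date, default=0.5):
    """Look up the city-month rate; fall back to `default` for unseen pairs."""
    return rates.get((city, event_date.month), default)


# Hypothetical toy history, not real Kalshi data.
toy_history = [
    ("Chicago", date(2023, 1, 5), 1),
    ("Chicago", date(2023, 1, 9), 0),
    ("Chicago", date(2023, 1, 20), 0),
    ("Chicago", date(2023, 1, 28), 0),
    ("Miami",   date(2023, 7, 4), 1),
]
toy_rates = city_month_rates(toy_history)
print(baseline_forecast(toy_rates, "Chicago", date(2024, 1, 15)))  # 0.25
print(baseline_forecast(toy_rates, "Denver", date(2024, 3, 1)))    # 0.5 (fallback)
```

For a fair comparison the rates would normally be estimated only from markets that settled before the one being scored, so the baseline does not see its own outcome.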

**Why T-1?**  Kalshi weather markets open ~10 hours before event_date midnight
and close ~29 hours *after* event_date midnight — by which point the NWS
official temperature is already on record.  The final `last_price` is **NOT**
a pre-event forecast (it reflects post-outcome knowledge).

Instead, we use the **Kalshi T-1 price**: the close of the last hourly
candlestick that ended at or before midnight *city-local time* on event_date,
fetched from Kalshi's historical candlestick API.  This is the market's last
pre-event consensus, a genuine crowd forecast made before the temperature was
observed, and it is the primary market benchmark that all three AI models
compete against.
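
The notebook's fetch code takes the T-1 cutoff at city-local midnight and keeps only candles that end at or before it. A minimal sketch of that selection rule follows; the flat candle dicts (`end_period_ts`, `close`) are a simplification of the API response, the helper names are hypothetical, and the candle values are invented for illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_midnight_ts(event_date: str, tz_name: str) -> int:
    """Unix timestamp of 00:00 local time on event_date (YYYY-MM-DD)."""
    local = datetime.strptime(event_date, "%Y-%m-%d").replace(tzinfo=ZoneInfo(tz_name))
    return int(local.timestamp())


def last_pre_event_close(candles, cutoff_ts):
    """Close of the latest candle ending at or before cutoff_ts, else None."""
    pre = [c for c in candles
           if c["end_period_ts"] <= cutoff_ts and c["close"] is not None]
    if not pre:
        return None  # no genuine pre-event price; do not substitute another candle
    return max(pre, key=lambda c: c["end_period_ts"])["close"]


cutoff = local_midnight_ts("2024-02-21", "America/New_York")
# Midnight Eastern Standard Time corresponds to 05:00 UTC.
assert cutoff == int(datetime(2024, 2, 21, 5, 0, tzinfo=timezone.utc).timestamp())

# Hypothetical hourly candles (prices in cents), not real market data.
toy_candles = [
    {"end_period_ts": cutoff - 3600, "close": 40},
    {"end_period_ts": cutoff,        "close": 45},
    {"end_period_ts": cutoff + 3600, "close": 97},  # post-midnight, must be ignored
]
t1_prob = last_pre_event_close(toy_candles, cutoff) / 100.0
print(t1_prob)  # 0.45
```

Returning `None` when no candle qualifies mirrors the no-fallback choice: a missing price is safer than one that may already reflect the outcome.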

**Why competitive markets only?**  Roughly 87% of all settled Kalshi weather
markets have `last_price` below 0.10 or above 0.90 (about 71% and 16%
respectively; see Section 1).  These are trivially certain outcomes such as
"Will NYC high exceed 100°F in January?", and including them makes every
forecaster look artificially accurate.  We restrict to markets where
`last_price` fell in [0.10, 0.90], where the outcome was genuinely near-threshold.

**Weather context:**  For each T-1 prompt the model receives the 10 observed
daily high temperatures immediately preceding the forecast date, fetched from
the Open-Meteo ERA5 reanalysis archive.  The window is anchored to each
market's own date, not to today's date.

**Metric:** Brier Score  $BS = \frac{1}{N}\sum(p_i - o_i)^2$
(0 = perfect · 0.25 = always-50% baseline · 1 = maximally wrong)
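
As a quick check of the reference points quoted above, here is a small pure-Python sketch of the Brier score. The function name and the toy forecasts are illustrative only.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


toy_outcomes = [1, 0, 0, 1]
print(brier_score([0.5, 0.5, 0.5, 0.5], toy_outcomes))  # 0.25 (always-50% baseline)
print(brier_score([1.0, 0.0, 0.0, 1.0], toy_outcomes))  # 0.0  (perfect)
print(brier_score([0.0, 1.0, 1.0, 0.0], toy_outcomes))  # 1.0  (maximally wrong)
print(brier_score([0.8, 0.3, 0.1, 0.6], toy_outcomes))  # roughly 0.075
```

Because errors are squared, a confident miss costs far more than a hedged one, which is why the always-50% forecaster sits at 0.25 regardless of the outcomes.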

**Knowledge-cutoff analysis:**

| Model | Knowledge Cutoff | Provider |
|---|---|---|
| GPT-4o | October 2023 | OpenAI |
| Claude 3.5 Sonnet | April 2024 | Anthropic |
| Gemini 2.0 Flash | August 2024 | Google |

All three cutoffs fall within the Kalshi data window (Aug 2021 – present),
providing substantial data *both before and after* each model's cutoff.
GPT-4o has the most balanced pre/post split: its Oct 2023 cutoff sits near the midpoint of the Aug 2021–present data window.
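
How balanced each split is depends on where "present" falls. The sketch below measures the share of calendar days in the window that precede each cutoff; the `as_of` date is a hypothetical stand-in for the end of the window, and calendar days are only a proxy for the number of markets on each side.

```python
from datetime import date

WINDOW_START = date(2021, 8, 1)  # start of the Kalshi data window stated above

CUTOFFS = {
    "GPT-4o":            date(2023, 10, 1),
    "Claude 3.5 Sonnet": date(2024, 4, 1),
    "Gemini 2.0 Flash":  date(2024, 8, 1),
}


def pre_cutoff_fraction(cutoff, window_end, window_start=WINDOW_START):
    """Fraction of calendar days in [window_start, window_end) before cutoff."""
    total = (window_end - window_start).days
    if total <= 0:
        raise ValueError("window_end must be after window_start")
    clamped = min(max(cutoff, window_start), window_end)
    return (clamped - window_start).days / total


as_of = date(2025, 6, 1)  # hypothetical end of window; substitute the real run date
fractions = {name: pre_cutoff_fraction(c, as_of) for name, c in CUTOFFS.items()}
for name, frac in fractions.items():
    print(f"{name:18s} pre-cutoff share: {frac:.1%}")

# The model whose share is closest to 50% has the most balanced split.
most_balanced = min(fractions, key=lambda n: abs(fractions[n] - 0.5))
print("Most balanced:", most_balanced)
```

The later the run date, the closer the newer models' cutoffs move toward the midpoint, so this ranking is worth recomputing whenever the data is refreshed.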

In [None]:
# ── Install dependencies (safe to re-run) ────────────────────────────────
%pip install -q \
    python-dotenv \
    langchain-core langchain-google-genai langchain-openai langchain-anthropic \
    pandas pyarrow numpy matplotlib seaborn requests tqdm
print("Dependencies ready.")


In [None]:
# ── Environment Setup ─────────────────────────────────────────────────────
import os
from pathlib import Path

try:
    from dotenv import load_dotenv
    if Path(".env").exists():
        load_dotenv(".env", override=True)
        print("Loaded .env")
except ImportError:
    pass

KALSHI_API_KEY    = os.environ.get("KALSHI_API_KEY",    "")
OPENAI_API_KEY    = os.environ.get("OPENAI_API_KEY",    "")
GOOGLE_API_KEY    = os.environ.get("GOOGLE_API_KEY",    "")
ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY", "")

for k, v in {"KALSHI_API_KEY": KALSHI_API_KEY, "OPENAI_API_KEY": OPENAI_API_KEY,
             "GOOGLE_API_KEY": GOOGLE_API_KEY, "ANTHROPIC_API_KEY": ANTHROPIC_API_KEY}.items():
    print(f"  {'SET    ' if v else 'MISSING'}  {k}")


In [None]:
# ── Imports & Constants ───────────────────────────────────────────────────
import hashlib, json, re, time, threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import date, datetime, timedelta

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns
from tqdm.auto import tqdm

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_openai import ChatOpenAI

# ── Kalshi API ────────────────────────────────────────────────────────────
KALSHI_BASE    = "https://api.elections.kalshi.com/trade-api/v2"
KALSHI_HEADERS = {"Authorization": f"Bearer {KALSHI_API_KEY}"}

# ── Open-Meteo (no key required) ─────────────────────────────────────────
OPEN_METEO_ARCHIVE = "https://archive-api.open-meteo.com/v1/archive"

# ── Cities ────────────────────────────────────────────────────────────────
CITY_SERIES = {
    "New York City": {"series": "KXHIGHNY",   "lat": 40.71, "lon": -74.01},
    "Chicago":       {"series": "KXHIGHCHI",  "lat": 41.85, "lon": -87.65},
    "Miami":         {"series": "KXHIGHMIA",  "lat": 25.77, "lon": -80.19},
    "Los Angeles":   {"series": "KXHIGHLAX",  "lat": 34.05, "lon": -118.24},
    "Denver":        {"series": "KXHIGHDEN",  "lat": 39.74, "lon": -104.98},
    "Seattle":       {"series": "KXHIGHTSEA", "lat": 47.61, "lon": -122.33},
}

# ── LLM models  (cutoffs chosen to be close & straddle the data window) ──
MODELS = {
    "gpt":    {"name": "GPT-4o",           "provider": "openai",    "model_id": "gpt-4o",
               "knowledge_cutoff": date(2023, 10, 1)},
    "gemini": {"name": "Gemini 2.0 Flash", "provider": "google",    "model_id": "gemini-2.0-flash",
               "knowledge_cutoff": date(2024, 8, 1)},
    "claude": {"name": "Claude 3.5 Sonnet","provider": "anthropic", "model_id": "claude-3-5-sonnet-20241022",
               "knowledge_cutoff": date(2024, 4, 1)},
}

# ── Sample cap ────────────────────────────────────────────────────────────
# Max competitive markets per city to send to LLMs.
# Lower to reduce first-run API cost; raise for richer analysis.
TARGET_PER_CITY = 150

# ── Competitive-market filter ─────────────────────────────────────────────
# Only markets where Kalshi's final price was in [0.10, 0.90].
# These are the near-threshold events that actually test forecasting skill.
COMP_MIN, COMP_MAX = 0.10, 0.90

# ── File paths ────────────────────────────────────────────────────────────
Path("cache").mkdir(exist_ok=True)
CACHE_FILE      = Path("cache/response_cache.json")
MARKETS_CACHE   = Path("cache/markets.parquet")
WEATHER_CACHE   = Path("cache/weather.parquet")
RESULTS_FILE    = Path("cache/results.parquet")

MONTH_MAP = {m: i for i, m in enumerate(
    ["JAN","FEB","MAR","APR","MAY","JUN","JUL","AUG","SEP","OCT","NOV","DEC"], start=1)}

print("Config loaded.")
print(f"Models: {[MODELS[k]['name'] for k in MODELS]}")
cutoffs_str = ", ".join(
    MODELS[k]["name"] + ": " + MODELS[k]["knowledge_cutoff"].strftime("%b %Y")
    for k in MODELS
)
print(f"Cutoffs: {cutoffs_str}")
print(f"Cities: {list(CITY_SERIES.keys())}")
print(f"Competitive filter: kalshi_prob in [{COMP_MIN}, {COMP_MAX}]")


In [None]:
# ── Caching Helpers ───────────────────────────────────────────────────────
CACHE_LOCK = threading.Lock()

def load_cache():
    if CACHE_FILE.exists():
        return json.loads(CACHE_FILE.read_text(encoding="utf-8"))
    return {}

def save_cache(cache):
    CACHE_FILE.write_text(json.dumps(cache, indent=2, ensure_ascii=False),
                          encoding="utf-8")

def ck(*args):
    """16-char SHA-256 cache key."""
    return hashlib.sha256(":".join(str(a) for a in args).encode()).hexdigest()[:16]

print("Cache ready.")


In [None]:
# ── Fetch Settled Kalshi Markets ──────────────────────────────────────────

def parse_event_date(event_ticker: str):
    """Parse event date from ticker like KXHIGHNY-24FEB21."""
    m = re.search(r"-(\d{2})([A-Z]{3})(\d{2})$", event_ticker)
    if not m:
        return None
    yy, mon, dd = m.groups()
    mo = MONTH_MAP.get(mon)
    if not mo:
        return None
    try:
        return date(2000 + int(yy), mo, int(dd))
    except ValueError:
        return None


def fetch_settled_series(series_ticker: str, max_pages: int = 80,
                         delay: float = 0.15) -> list:
    """Page through all settled markets for a Kalshi series."""
    markets, cursor = [], None
    for _ in range(max_pages):
        params = {"series_ticker": series_ticker, "status": "settled", "limit": 200}
        if cursor:
            params["cursor"] = cursor
        try:
            r = requests.get(KALSHI_BASE + "/markets",
                             headers=KALSHI_HEADERS, params=params, timeout=20)
            r.raise_for_status()
            data = r.json()
        except Exception as e:
            print(f"    warn: {e}")
            break
        batch = data.get("markets", [])
        if not batch:
            break
        markets.extend(batch)
        cursor = data.get("cursor")
        if not cursor:
            break
        time.sleep(delay)
    return markets


def market_to_row(m: dict, city: str, info: dict):
    """Convert raw Kalshi market dict to a flat row. Returns None to skip."""
    ev         = m.get("event_ticker", "")
    event_date = parse_event_date(ev)
    if event_date is None:
        return None

    floor     = m.get("floor_strike")
    cap       = m.get("cap_strike")
    threshold = floor if floor is not None else cap
    direction = m.get("strike_type", "greater")

    try:
        actual_temp = float(m.get("expiration_value") or "nan")
    except (ValueError, TypeError):
        actual_temp = np.nan

    lp          = m.get("last_price")
    kalshi_prob = float(lp) / 100.0 if lp is not None else np.nan

    result  = (m.get("result") or "").lower()   # tolerate a null result field
    outcome = 1 if result == "yes" else (0 if result == "no" else np.nan)

    return {
        "ticker":        m.get("ticker", ""),
        "event_ticker":  ev,
        "series":        info["series"],
        "city":          city,
        "lat":           info["lat"],
        "lon":           info["lon"],
        "event_date":    event_date,
        "threshold_f":   float(threshold) if threshold is not None else np.nan,
        "direction":     direction,
        "title":         m.get("title", ""),
        "rules_primary": m.get("rules_primary", ""),
        "result":        result,
        "outcome":       outcome,
        "actual_temp_f": actual_temp,
        "kalshi_prob":   kalshi_prob,
        "volume":        int(m.get("volume", 0) or 0),
        "open_time":     m.get("open_time", ""),
        "close_time":    m.get("close_time", ""),
    }


def build_markets_df() -> pd.DataFrame:
    rows = []
    for city, info in CITY_SERIES.items():
        print(f"  {info['series']:12s} ({city})...", end=" ", flush=True)
        raw = fetch_settled_series(info["series"])
        print(f"{len(raw):,}")
        for m in raw:
            row = market_to_row(m, city, info)
            if row:
                rows.append(row)
    df = pd.DataFrame(rows)
    df["event_date"] = pd.to_datetime(df["event_date"])
    df = df.drop_duplicates(subset="ticker")
    df = df.sort_values(["city", "event_date", "threshold_f"]).reset_index(drop=True)
    return df


if MARKETS_CACHE.exists():
    print("Loading markets from parquet cache...")
    markets_df = pd.read_parquet(MARKETS_CACHE)
    print(f"  {len(markets_df):,} markets loaded")
else:
    print("Fetching all settled markets from Kalshi (~2 min)...")
    markets_df = build_markets_df()
    markets_df.to_parquet(MARKETS_CACHE, index=False)
    print(f"Cached {len(markets_df):,} markets")

print(f"Date range: {markets_df['event_date'].min().date()} to "
      f"{markets_df['event_date'].max().date()}")


In [None]:
# ── Fetch Historical Weather from Open-Meteo ERA5 Archive ─────────────────
# Pre-fetch full daily high temperature history per city.
# Injected into T-1 prompts as accurate historical context —
# avoids the tool-call problem where get_recent_weather() returns today's
# temperatures for a December 2022 question.

def fetch_city_weather(city: str, info: dict,
                       start: str = "2021-07-01") -> pd.DataFrame:
    """Fetch daily max temperatures from Open-Meteo archive."""
    end = datetime.now().strftime("%Y-%m-%d")
    params = {
        "latitude":          info["lat"],
        "longitude":         info["lon"],
        "start_date":        start,
        "end_date":          end,
        "daily":             "temperature_2m_max",
        "temperature_unit":  "fahrenheit",
        "timezone":          "auto",
    }
    r = requests.get(OPEN_METEO_ARCHIVE, params=params, timeout=30)
    r.raise_for_status()
    data = r.json()
    return pd.DataFrame({
        "city": city,
        "date": pd.to_datetime(data["daily"]["time"]),
        "temp_f": data["daily"]["temperature_2m_max"],
    })


if WEATHER_CACHE.exists():
    print("Loading weather from cache...")
    weather_df = pd.read_parquet(WEATHER_CACHE)
else:
    print("Fetching historical weather from Open-Meteo archive (6 cities)...")
    frames = []
    for city, info in CITY_SERIES.items():
        print(f"  {city}...", end=" ", flush=True)
        try:
            df = fetch_city_weather(city, info)
            frames.append(df)
            print(f"{len(df)} days")
        except Exception as e:
            print(f"ERROR: {e}")
    weather_df = pd.concat(frames, ignore_index=True)
    weather_df.to_parquet(WEATHER_CACHE, index=False)
    print(f"Cached {len(weather_df):,} rows")


def get_weather_context(city: str, forecast_date: str, days_back: int = 10) -> str:
    """Return the N observed daily highs immediately before forecast_date."""
    fd  = pd.Timestamp(forecast_date)
    cw  = weather_df[(weather_df["city"] == city) &
                     (weather_df["date"] < fd)].sort_values("date").tail(days_back)
    if len(cw) == 0:
        return f"(No weather history available for {city} before {forecast_date})"
    lines = [f"Observed daily high temperatures for {city} (degF):"]
    for _, row in cw.iterrows():
        if pd.notna(row["temp_f"]):
            lines.append(f"  {row['date'].strftime('%Y-%m-%d')}: {row['temp_f']:.1f}")
    return "\n".join(lines)


print(f"Weather coverage: {weather_df['date'].min().date()} to "
      f"{weather_df['date'].max().date()}")
print(f"Cities: {sorted(weather_df['city'].unique())}")
# Quick sanity check
sample_ctx = get_weather_context("New York City", "2024-02-15", days_back=5)
print("\nSample context (NYC, 5 days before 2024-02-15):")
print(sample_ctx)


## Section 1: Market Data

### Why Kalshi last_price cannot be used as a forecaster

Kalshi daily high-temperature markets close approximately **29 hours after
the event_date midnight** — early the next morning, after the NWS official
daily maximum has been recorded.  At close time, any trader can look up the
exact temperature on weather.com or the NWS website.  As a result:

- **~71% of markets** settle with `last_price < 0.10` (near-certain NO).
- **~16% settle** with `last_price > 0.90` (near-certain YES).
- Kalshi's overall Brier score across all markets is **≈ 0.014** — not "crowd
  wisdom" but post-outcome pricing.

### Competitive market selection

We restrict to markets where `last_price ∈ [0.10, 0.90]` — the cases where
the temperature was genuinely near the threshold.  This ensures we test
*calibration skill* rather than the ability to distinguish 95°F from 50°F in
January.

Note: Kalshi weather markets come in three direction types:
- **greater** — "Will high exceed X°F?"
- **less** — "Will high stay below X°F?"
- **between** — "Will high be in the X–Y°F range?" (2-degree bins)

All three are included; the exact `rules_primary` text is fed verbatim to the
LLM so it sees the same question market participants traded on.
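
To make the three direction types concrete, the sketch below resolves a market from an observed high and its strikes. It uses the `floor_strike` / `cap_strike` idea from the market data, but the boundary handling (strict for greater/less, inclusive for between) is an assumption; the authoritative rule is each market's `rules_primary` text. All values are hypothetical.

```python
def resolves_yes(direction, observed_high, floor=None, cap=None):
    """Return True if a market of the given direction resolves YES.

    Strict vs inclusive boundaries are assumptions of this sketch.
    """
    if direction == "greater":
        return observed_high > floor
    if direction == "less":
        return observed_high < cap
    if direction == "between":
        return floor <= observed_high <= cap
    raise ValueError(f"Unknown direction: {direction!r}")


# Hypothetical strikes and an observed high of 72 degF.
print(resolves_yes("greater", 72, floor=70))          # True
print(resolves_yes("less", 72, cap=70))               # False
print(resolves_yes("between", 72, floor=71, cap=72))  # True
```

Note that a single observed high settles a whole ladder of markets at once, so outcomes within one event date are strongly correlated.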


In [None]:
# ── Market Data Overview ──────────────────────────────────────────────────
print(f"Total settled markets : {len(markets_df):,}")
print(f"Direction breakdown   : {markets_df['direction'].value_counts().to_dict()}")

# ── Timing: last_price is post-event ─────────────────────────────────────
df_t = markets_df.copy()
df_t["close_dt"]  = pd.to_datetime(df_t["close_time"], utc=True, errors="coerce")
df_t["event_dt"]  = pd.to_datetime(df_t["event_date"], utc=True)
df_t["hrs_after"] = (df_t["close_dt"] - df_t["event_dt"]).dt.total_seconds() / 3600
print(f"\nHours market closes AFTER event_date midnight:")
print(f"  median = {df_t['hrs_after'].median():.1f} h  "
      f"(range {df_t['hrs_after'].min():.0f}–{df_t['hrs_after'].max():.0f} h)")
print("  => last_price reflects post-outcome knowledge; not used as a forecaster.")

# ── Probability distribution ─────────────────────────────────────────────
valid = markets_df.dropna(subset=["kalshi_prob","outcome"])
print(f"\nkalshi_prob distribution (all {len(valid):,} settled markets):")
buckets = pd.cut(valid["kalshi_prob"],
                 bins=[0, 0.10, 0.20, 0.40, 0.60, 0.80, 0.90, 1.0],
                 include_lowest=True)   # keep markets priced at exactly 0
print(buckets.value_counts().sort_index().to_string())
print(f"\nKalshi Brier on ALL markets: {((valid['kalshi_prob']-valid['outcome'])**2).mean():.4f}"
      f"  (deceptively low — post-event pricing)")

# ── Competitive subset ────────────────────────────────────────────────────
comp = valid[(valid["kalshi_prob"] >= COMP_MIN) & (valid["kalshi_prob"] <= COMP_MAX)]
print(f"\nCompetitive markets ({COMP_MIN}–{COMP_MAX}): {len(comp):,} "
      f"({100*len(comp)/len(valid):.1f}% of all)")
print(f"Kalshi Brier on competitive only: {((comp['kalshi_prob']-comp['outcome'])**2).mean():.4f}")
print(f"YES rate (competitive): {comp['outcome'].mean():.3f}")
print("\nCompetitive markets per city:")
print(comp.groupby("city").size().sort_values(ascending=False).to_string())
print("\nCompetitive markets per year:")
comp2 = comp.copy()
comp2["year"] = pd.to_datetime(comp2["event_date"]).dt.year
print(comp2.groupby(["city","year"]).size().unstack(fill_value=0).to_string())


In [None]:
# ── Competitive Market Sample for LLM Evaluation ─────────────────────────
# Filter to competitive markets, then cap per city for API cost control.
# All directions (greater / less / between) are included — the LLM sees
# the exact rules_primary text, so "between X-Y" markets are handled.

eligible = markets_df[
    markets_df["kalshi_prob"].between(COMP_MIN, COMP_MAX) &
    markets_df["outcome"].notna() &
    markets_df["rules_primary"].str.len().gt(30) &
    markets_df["actual_temp_f"].notna()
].copy()

# Assign knowledge-cutoff split labels
eligible["event_dt"] = pd.to_datetime(eligible["event_date"], utc=True)

earliest_cutoff = min(cfg["knowledge_cutoff"] for cfg in MODELS.values())   # Oct 2023
latest_cutoff   = max(cfg["knowledge_cutoff"] for cfg in MODELS.values())   # Aug 2024

def cutoff_period(event_date):
    d = pd.Timestamp(event_date).date()
    if d < earliest_cutoff:
        return "pre_all"         # before ALL model cutoffs
    if d >= latest_cutoff:
        return "post_all"        # after ALL model cutoffs
    return "transition"          # in the Oct-2023 to Aug-2024 window

eligible["cutoff_period"] = eligible["event_date"].apply(cutoff_period)

# Per-city cap: sample by highest volume within competitive range
# Prefer even spread across cutoff_period labels
parts = []
for city, grp in eligible.groupby("city"):
    if len(grp) == 0:
        continue
    # Take the top-volume markets within each cutoff_period, evenly.
    # GroupBy.head keeps every column (including cutoff_period) on all pandas versions.
    sampled = (grp.sort_values("volume", ascending=False)
                  .groupby("cutoff_period")
                  .head(TARGET_PER_CITY // 3)
                  .reset_index(drop=True))
    # Top up with remaining if we have room
    already = set(sampled["ticker"])
    remaining = grp[~grp["ticker"].isin(already)].sort_values("volume", ascending=False)
    fill = max(0, TARGET_PER_CITY - len(sampled))
    sampled = pd.concat([sampled, remaining.head(fill)], ignore_index=True)
    parts.append(sampled)

sample_df = (pd.concat(parts, ignore_index=True)
               .sort_values(["city", "event_date"])
               .reset_index(drop=True))

# Add per-model post_cutoff flag
for mk, cfg in MODELS.items():
    col = f"post_cutoff_{mk}"
    sample_df[col] = pd.to_datetime(sample_df["event_date"]).dt.date >= cfg["knowledge_cutoff"]

print(f"Sample: {len(sample_df)} markets  x  {len(MODELS)} models  = "
      f"{len(sample_df)*len(MODELS):,} LLM calls (all cached after first run)")
print()
print("Per city:")
print(sample_df.groupby("city").size().sort_values(ascending=False).to_string())
print()
print("Per cutoff_period:")
print(sample_df["cutoff_period"].value_counts().to_string())
print()
print("Per-model post-cutoff counts:")
for mk in MODELS:
    col = f"post_cutoff_{mk}"
    post = sample_df[col].sum()
    print(f"  {MODELS[mk]['name']:22s}: post={post}  pre={len(sample_df)-post}")
print()
print("Example rows:")
cols = ["ticker","city","event_date","direction","threshold_f","title",
        "outcome","actual_temp_f","kalshi_prob","cutoff_period"]
print(sample_df[cols].head(6).to_string(index=False))


In [None]:
# -- Fetch Kalshi T-1 Prices via Candlestick API (v2: timezone-aware) ------
# Bug fix: the original code used midnight UTC as the T-1 cutoff.
# US Kalshi weather markets measure temperature in LOCAL time, so the correct
# T-1 cutoff is local midnight (04:00-08:00 UTC depending on city/DST).
# With midnight UTC no pre-event candles were found, and the old fallback
# (any_[0]) grabbed the opening candle, a naive initial price that was
# anti-correlated with outcomes (Brier > 0.42).  Fix: city-local midnight via
# zoneinfo.  There is no fallback: a market with no candle before local
# midnight gets NaN rather than corrupt data.
# Rate limiting: 2 workers + exponential backoff; 429s are NOT cached.

from zoneinfo import ZoneInfo
from datetime import datetime as _dt

CITY_TZ = {
    "New York City": "America/New_York",
    "Chicago":       "America/Chicago",
    "Miami":         "America/New_York",
    "Los Angeles":   "America/Los_Angeles",
    "Denver":        "America/Denver",
    "Seattle":       "America/Los_Angeles",
}

KALSHI_T1_CACHE = Path("cache/kalshi_t1_prices.parquet")


def fetch_t1_price(row: dict) -> tuple:
    """Return (result_dict, should_cache) for one market T-1 price."""
    ticker         = row["ticker"]
    series         = row["series"]
    city           = row.get("city", "Chicago")
    event_date_str = str(row["event_date"])[:10]  # "YYYY-MM-DD"

    # Compute local-midnight timestamp for this city
    tz   = ZoneInfo(CITY_TZ.get(city, "America/Chicago"))
    lmid = _dt.strptime(event_date_str, "%Y-%m-%d").replace(tzinfo=tz)
    t1_ts    = int(lmid.timestamp())                          # midnight LOCAL
    open_ts  = int((lmid - timedelta(hours=48)).timestamp())  # 48h look-back
    close_ts = int((lmid + timedelta(hours=36)).timestamp())  # 36h post

    url    = f"{KALSHI_BASE}/series/{series}/markets/{ticker}/candlesticks"
    params = {"start_ts": open_ts, "end_ts": close_ts, "period_interval": 60}

    for attempt in range(3):
        try:
            r = requests.get(url, headers=KALSHI_HEADERS,
                             params=params, timeout=15)
            if r.status_code == 429:
                time.sleep(5 * 2 ** attempt)
                continue   # retry; do NOT cache 429
            r.raise_for_status()
            candles = r.json().get("candlesticks", [])
            break
        except requests.exceptions.HTTPError as e:
            if attempt < 2:
                time.sleep(3)
                continue
            return {"ticker": ticker, "kalshi_t1_prob": float("nan"),
                    "t1_error": str(e)[:80]}, True
        except Exception as e:
            return {"ticker": ticker, "kalshi_t1_prob": float("nan"),
                    "t1_error": str(e)[:80]}, True
    else:
        # All 3 attempts 429 -- do NOT cache, will retry next run
        return {"ticker": ticker, "kalshi_t1_prob": float("nan"),
                "t1_error": "429_rate_limit"}, False

    # Last candle ending at or before local midnight
    pre = [c for c in candles
           if c.get("end_period_ts", 0) <= t1_ts
           and c.get("price", {}).get("close") is not None]

    if not pre:
        # No genuine pre-event candle -- do NOT use opening-day price as proxy
        return {"ticker": ticker, "kalshi_t1_prob": float("nan"),
                "t1_error": "no_pre_candle"}, True

    raw = pre[-1]["price"]["close"]
    return {"ticker": ticker, "kalshi_t1_prob": float(raw) / 100.0,
            "t1_error": None}, True


# -- Load or fetch T-1 prices ------------------------------------------
if KALSHI_T1_CACHE.exists():
    print("Loading Kalshi T-1 prices from cache...")
    t1_prices_df = pd.read_parquet(KALSHI_T1_CACHE)
    fetched_tickers = set(t1_prices_df["ticker"])
    # Permanent NaN = no_pre_candle or HTTP error (don't retry)
    permanent_nan = set(
        t1_prices_df.loc[
            t1_prices_df["kalshi_t1_prob"].isna() &
            (t1_prices_df["t1_error"] != "429_rate_limit"),
            "ticker"
        ]
    )
    # Retry 429s and markets not yet attempted
    retry_429       = set(t1_prices_df.loc[t1_prices_df["t1_error"] == "429_rate_limit", "ticker"])
    missing_tickers = set(sample_df["ticker"]) - fetched_tickers - permanent_nan
    to_fetch = missing_tickers | retry_429
    if to_fetch:
        print(f"  Fetching {len(to_fetch)} missing/429-limited markets...")
        fetch_recs = (sample_df[sample_df["ticker"].isin(to_fetch)]
                      [["ticker", "series", "event_date", "city"]].to_dict("records"))
        new_rows = []
        with ThreadPoolExecutor(max_workers=2) as executor:
            futs = {executor.submit(fetch_t1_price, r): r for r in fetch_recs}
            for f in tqdm(as_completed(futs), total=len(futs), desc="T-1 prices"):
                result, should_cache = f.result()
                if should_cache:
                    new_rows.append(result)
        # Drop old 429 rows for tickers we just retried
        retried = {r["ticker"] for r in new_rows if r["ticker"] in retry_429}
        t1_prices_df = t1_prices_df[~t1_prices_df["ticker"].isin(retried)]
        if new_rows:
            t1_prices_df = pd.concat([t1_prices_df, pd.DataFrame(new_rows)],
                                      ignore_index=True)
        t1_prices_df.to_parquet(KALSHI_T1_CACHE, index=False)
    print(f"  {len(t1_prices_df):,} prices loaded")
else:
    print(f"Fetching Kalshi T-1 prices for {len(sample_df):,} markets (2 workers)...")
    records = sample_df[["ticker", "series", "event_date", "city"]].to_dict("records")
    t1_raw  = []
    with ThreadPoolExecutor(max_workers=2) as executor:
        futs = {executor.submit(fetch_t1_price, r): r for r in records}
        for f in tqdm(as_completed(futs), total=len(futs), desc="T-1 prices"):
            result, should_cache = f.result()
            if should_cache:
                t1_raw.append(result)
    t1_prices_df = pd.DataFrame(t1_raw)
    t1_prices_df.to_parquet(KALSHI_T1_CACHE, index=False)
    print(f"Cached {len(t1_prices_df):,} prices")

# -- Coverage & sanity checks -------------------------------------------
n_ok  = t1_prices_df["kalshi_t1_prob"].notna().sum()
n_tot = len(t1_prices_df)
print(f"\nT-1 price coverage: {n_ok}/{n_tot} ({100*n_ok/n_tot:.1f}%)")

diag = (t1_prices_df
        .merge(sample_df[["ticker", "kalshi_prob", "outcome"]], on="ticker")
        .dropna(subset=["kalshi_t1_prob", "outcome"]))
delta_mean = (diag["kalshi_t1_prob"] - diag["kalshi_prob"]).abs().mean()
corr       = diag["kalshi_t1_prob"].corr(diag["kalshi_prob"])
print(f"Mean |T-1 - final|: {delta_mean:.3f}   T-1 vs final corr: {corr:.3f}")
t1_brier  = ((diag["kalshi_t1_prob"] - diag["outcome"]) ** 2).mean()
fin_brier = ((diag["kalshi_prob"]    - diag["outcome"]) ** 2).mean()
print(f"Kalshi T-1 Brier   : {t1_brier:.4f}  (pre-event market price)")
print(f"Kalshi final Brier : {fin_brier:.4f}  (post-event -- NOT used as a forecast)")
print()
print("T-1 price distribution:")
bins_ = pd.cut(diag["kalshi_t1_prob"],
               bins=[0, .10, .20, .40, .60, .80, .90, 1.0],
               include_lowest=True).value_counts().sort_index()
print(bins_.to_string())
print()
print("t1_error breakdown (for NaN markets):")
print(t1_prices_df[t1_prices_df["kalshi_t1_prob"].isna()]["t1_error"].value_counts().to_string())


## Section 2: LLM Forecasting

### What the model sees

Each model receives:
1. **System prompt** — role as a calibrated superforecaster, format instructions.
2. **Exact Kalshi market title** — verbatim, no paraphrasing.
3. **Exact resolution criteria** (`rules_primary`) — the NWS source and threshold.
4. **10-day weather context** — observed daily high temperatures immediately
   before the forecast date, fetched from the Open-Meteo ERA5 archive.

No tools are used.  Weather context is pre-fetched and injected directly,
ensuring historical accuracy (a December 2022 market gets December 2022
temperatures, not today's).
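
The system prompt asks each model to end its response with exactly one `PROBABILITY: X.XX` line. Below is a minimal, deliberately strict sketch of reading that line back. `extract_probability` is a hypothetical helper, separate from the notebook's `parse_prob`, and it returns NaN rather than guessing when the line is missing, decorated, or out of range.

```python
import math
import re

# A line consisting only of "PROBABILITY: <number>", with optional whitespace.
_PROB_LINE = re.compile(r"^\s*PROBABILITY:\s*([01](?:\.\d+)?)\s*$", re.MULTILINE)


def extract_probability(text):
    """Return the last PROBABILITY value in text, or NaN if absent or outside [0, 1]."""
    matches = _PROB_LINE.findall(str(text))
    if not matches:
        return math.nan
    value = float(matches[-1])
    return value if 0.0 <= value <= 1.0 else math.nan


print(extract_probability("Recent highs sit near the threshold.\nPROBABILITY: 0.37"))  # 0.37
print(extract_probability("PROBABILITY: 1"))                                            # 1.0
print(extract_probability("ERROR: 429 rate limit, retry in 0.5 s"))                     # nan
print(extract_probability("PROBABILITY: 1.50"))                                         # nan
```

Strict parsing trades coverage for safety: a response that ignores the format is dropped from scoring instead of contributing a number scraped from its reasoning.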

### Knowledge-cutoff hypothesis

For weather forecasting, the cutoff effect should be **small**: seasonal
temperature patterns for these cities are stable and well-represented in all
models' training data regardless of cutoff.  The weather context further
reduces the advantage of recent training data.  If we find no significant
cutoff effect, that is itself an interesting result — LLM weather calibration
appears robust to training-data recency.
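
One simple way to look for a cutoff effect is to score each model separately on markets before and after its own cutoff. The sketch below assumes `(event_date, probability, outcome)` rows and treats a market as post-cutoff when `event_date >= cutoff`; the helper name and the toy rows are hypothetical.

```python
from datetime import date


def brier_by_cutoff(rows, cutoff):
    """Split (event_date, prob, outcome) rows at cutoff; return Brier per side.

    Returns {"pre": float | None, "post": float | None}; None marks an empty side.
    """
    sq_err = {"pre": [], "post": []}
    for event_date, prob, outcome in rows:
        side = "post" if event_date >= cutoff else "pre"
        sq_err[side].append((prob - outcome) ** 2)
    return {side: (sum(v) / len(v) if v else None) for side, v in sq_err.items()}


# Hypothetical forecasts, not real model output.
toy_rows = [
    (date(2023, 6, 1), 0.5, 1),
    (date(2023, 7, 1), 0.5, 0),
    (date(2024, 1, 1), 1.0, 1),
    (date(2024, 2, 1), 0.0, 1),
]
toy_split = brier_by_cutoff(toy_rows, cutoff=date(2023, 10, 1))
print(toy_split)  # {'pre': 0.25, 'post': 0.5}
```

A raw gap between the two sides is not evidence on its own; with samples this size it would need an uncertainty estimate (for example a bootstrap interval) before being read as a cutoff effect.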


In [None]:
# ── Prompts & Model Factory ───────────────────────────────────────────────
SYSTEM_PROMPT = (
    "You are an expert weather forecaster and superforecaster specialising in "
    "prediction market calibration. Your sole task is to estimate the probability "
    "that a specific Kalshi binary weather market resolves YES.\n\n"
    "Guidelines:\n"
    "- Use the provided recent temperature observations as your primary evidence.\n"
    "- Ground your estimate in seasonal climatology for the city and month.\n"
    "- Be well-calibrated: a 70% probability should resolve YES ~70% of the time.\n"
    "- Avoid anchoring to round numbers (0.25, 0.50, 0.75) unless clearly justified.\n\n"
    "End your response with EXACTLY one line:\n"
    "PROBABILITY: X.XX\n"
    "where X.XX is a decimal in [0.00, 1.00]."
)

USER_T1 = (
    "Today is {forecast_date} — the market resolves tomorrow.\n\n"
    "Market title: {title}\n\n"
    "Resolution criteria: {rules_primary}\n\n"
    "{weather_context}\n\n"
    "Based on the temperature observations above and your knowledge of "
    "seasonal patterns for {city} in {month_name}, "
    "estimate the probability this market resolves YES. "
    "Think step by step, then state your probability."
)


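def parse_prob(text: str) -> float:
    """Extract the forecast probability from a model response (NaN if none found)."""
    text = str(text)
    # Preferred: the explicit "PROBABILITY: X.XX" line requested by the system prompt
    m = re.search(r"PROBABILITY:\s*(0\.\d+|1\.0+)", text)
    if m:
        return float(m.group(1))
    # Fallback: last probability-like decimal in the final 500 characters
    hits = re.findall(r"\b(0\.\d+|1\.0)\b", text[-500:])
    return float(hits[-1]) if hits else np.nan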


def get_llm(model_key: str):
    cfg = MODELS[model_key]
    if cfg["provider"] == "openai":
        return ChatOpenAI(model=cfg["model_id"], temperature=0,
                          api_key=OPENAI_API_KEY)
    if cfg["provider"] == "google":
        return ChatGoogleGenerativeAI(model=cfg["model_id"], temperature=0,
                                      google_api_key=GOOGLE_API_KEY)
    if cfg["provider"] == "anthropic":
        return ChatAnthropic(model=cfg["model_id"], temperature=0,
                             api_key=ANTHROPIC_API_KEY)
    raise ValueError(f"Unknown provider: {cfg['provider']}")


print("AI setup complete.")
print(f"Models: {[MODELS[k]['name'] for k in MODELS]}")
print("No tools — weather context injected directly into prompt.")


In [None]:
# ── T-1 Forecasting (parallel, weather context in prompt) ─────────────────
N_WORKERS  = 4     # conservative: avoids GPT-4o 30k TPM rate limit
SAVE_EVERY = 10    # flush cache every N new completions


def run_t1(df, cache):
    llm_pool  = {mk: get_llm(mk) for mk in MODELS}
    tasks     = [(mk, row) for mk in MODELS for _, row in df.iterrows()]
    new_count = [0]

    def forecast_one(args):
        model_key, row = args
        cfg = MODELS[model_key]
        key = ck("t1v2", model_key, row["ticker"])

        with CACHE_LOCK:
            if key in cache:
                output = cache[key]["output"]
                return {
                    "ticker": row["ticker"], "model": cfg["name"],
                    "model_key": model_key, "snapshot": "T-1",
                    "probability": parse_prob(output), "raw_output": str(output)[-400:],
                }, None

        fd  = (pd.Timestamp(row["event_date"]) - timedelta(days=1)).strftime("%Y-%m-%d")
        mon = pd.Timestamp(row["event_date"]).strftime("%B")
        wx  = get_weather_context(row["city"], fd)

        msgs = [
            SystemMessage(content=SYSTEM_PROMPT),
            HumanMessage(content=USER_T1.format(
                forecast_date=fd, title=row["title"],
                rules_primary=row["rules_primary"],
                city=row["city"], month_name=mon, weather_context=wx,
            )),
        ]

        # Up to 3 attempts in total, backing off between them on rate-limit errors
        output = None
        for attempt in range(3):
            try:
                output = llm_pool[model_key].invoke(msgs).content
                break
            except Exception as e:
                err = str(e)
                if attempt < 2 and ("rate_limit" in err.lower() or "429" in err):
                    time.sleep(5 * (2 ** attempt))   # 5s then 10s
                else:
                    output = f"ERROR: {e}"
                    break
        if output is None:
            output = "ERROR: max retries exceeded"

        # Never cache error strings — let them retry on the next run
        save_entry = None if str(output).startswith("ERROR:") else (key, output)

        return {
            "ticker": row["ticker"], "model": cfg["name"],
            "model_key": model_key, "snapshot": "T-1",
            "probability": parse_prob(output), "raw_output": str(output)[-400:],
        }, save_entry

    results = []
    with ThreadPoolExecutor(max_workers=N_WORKERS) as executor:
        futures = {executor.submit(forecast_one, t): t for t in tasks}
        for future in tqdm(as_completed(futures), total=len(futures), desc="T-1 forecasts"):
            result, update = future.result()
            results.append(result)
            if update:
                k, out = update
                with CACHE_LOCK:
                    cache[k] = {"output": out}
                    new_count[0] += 1
                    if new_count[0] % SAVE_EVERY == 0:
                        save_cache(cache)
    save_cache(cache)

    t1 = pd.DataFrame(results)
    valid = t1.dropna(subset=["probability"])
    print(f"T-1 forecasts: {len(t1)} | parse success: {len(valid)}/{len(t1)}")
    print(t1.groupby("model")["probability"].describe().round(3))
    return t1


cache      = load_cache()
t1_results = run_t1(sample_df, cache)

In [None]:
# ── Assemble Master Results DataFrame ─────────────────────────────────────
META_COLS = [
    "ticker","city","event_date","direction","threshold_f","title",
    "rules_primary","result","outcome","actual_temp_f","kalshi_prob",
    "volume","cutoff_period",
] + [f"post_cutoff_{mk}" for mk in MODELS]

results_df = t1_results.merge(sample_df[META_COLS], on="ticker", how="left")

results_df["brier"] = np.where(
    results_df[["probability","outcome"]].notna().all(axis=1),
    (results_df["probability"] - results_df["outcome"]) ** 2,
    np.nan,
)

# ── Add per-model post_cutoff column (denormalised for easy filtering) ────
results_df["post_cutoff"] = results_df.apply(
    lambda r: bool(r.get(f"post_cutoff_{r['model_key']}", False))
              if r["model_key"] in MODELS else False,
    axis=1,
)

# ── Compute city × month historical YES base rates (pre_all period only) ──
# Global YES rate over all settled markets; used wherever a cell is too sparse.
global_yes_rate = sample_df["outcome"].dropna().mean()

pre_data = sample_df[sample_df["cutoff_period"] == "pre_all"].dropna(subset=["outcome"])
pre_data = pre_data.copy()
pre_data["month"] = pd.to_datetime(pre_data["event_date"]).dt.month
base_rate_map = {}
for (city, month), g in pre_data.groupby(["city","month"]):
    # Require at least 3 settled markets per cell; otherwise use the global rate.
    base_rate_map[(city, month)] = g["outcome"].mean() if len(g) >= 3 else global_yes_rate

def get_base_rate(city, event_date):
    month = pd.Timestamp(event_date).month
    return base_rate_map.get((city, month), global_yes_rate)

# ── Add baseline rows ─────────────────────────────────────────────────────
baseline_rows = []
for _, row in sample_df.dropna(subset=["outcome"]).iterrows():
    p50    = 0.50
    p_base = get_base_rate(row["city"], row["event_date"])
    for model_name, prob in [("Baseline: Always-50%", p50),
                              ("Baseline: City-Month Rate", p_base)]:
        baseline_rows.append({
            **{c: row[c] for c in META_COLS if c in row.index},
            "model":       model_name,
            "model_key":   "baseline",
            "snapshot":    "T-1",
            "probability": prob,
            "raw_output":  "",
            "brier":       (prob - row["outcome"]) ** 2,
            "post_cutoff": False,
        })

# ── Add Kalshi T-1 rows ───────────────────────────────────────────────────
# Kalshi T-1 = last pre-midnight hourly candlestick price.
# This is the market's genuine pre-event consensus, fetched via the
# /series/{series}/markets/{ticker}/candlesticks endpoint.
t1_map = t1_prices_df.set_index("ticker")["kalshi_t1_prob"].to_dict()
kalshi_t1_rows = []
for _, row in sample_df.dropna(subset=["outcome"]).iterrows():
    prob = t1_map.get(row["ticker"], np.nan)
    if pd.isna(prob):
        continue
    kalshi_t1_rows.append({
        **{c: row[c] for c in META_COLS if c in row.index},
        "model":       "Kalshi T-1",
        "model_key":   "kalshi_t1",
        "snapshot":    "T-1",
        "probability": prob,
        "raw_output":  "",
        "brier":       (prob - row["outcome"]) ** 2,
        "post_cutoff": False,
    })

results_df = pd.concat(
    [results_df, pd.DataFrame(baseline_rows), pd.DataFrame(kalshi_t1_rows)],
    ignore_index=True,
)

# ── Persist ───────────────────────────────────────────────────────────────
results_df.to_parquet(RESULTS_FILE, index=False)
results_df.drop(columns=["raw_output","rules_primary"], errors="ignore").to_csv(
    "cache/results.csv", index=False)

print(f"Saved: {RESULTS_FILE}  ({len(results_df):,} rows)")
print(f"Forecasters: {sorted(results_df['model'].unique())}")
print(f"\nBase rates (city x month, from pre_all period):")
print(f"  Used {len(base_rate_map)} city-month combos  |  global fallback={global_yes_rate:.3f}")
print(f"Kalshi T-1 rows: {len(kalshi_t1_rows)}")
print()
print(results_df[["ticker","city","event_date","direction","model","probability",
                  "outcome","brier","cutoff_period","post_cutoff"]
                ].head(12).to_string(index=False))

## Section 3: Brier Score Analysis

Six views of forecast quality, all on **competitive markets only**
(`last_price` in 10–90%), so results measure genuine calibration skill:

1. **Overall** — mean Brier across all models and periods.
2. **AI vs Kalshi T-1** — head-to-head at the same T-1 snapshot.
3. **By city** — geographic variation in forecast difficulty.
4. **Pre- vs post-cutoff** — per-model comparison using each model's own
   knowledge-cutoff date.
5. **Cutoff-period cohorts** — pre_all / transition / post_all analysis.
6. **Calibration** — reliability diagram; a well-calibrated forecaster lies on
   the diagonal.
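
As a rough illustration of the two quantities behind these views, the sketch below computes per-forecast Brier scores and a small reliability table. The `toy` DataFrame and its values are invented for the example and are not results from this notebook; the five-bin layout is likewise an assumption chosen to keep the output short.

```python
import pandas as pd

# Toy forecasts and outcomes, invented purely for illustration.
toy = pd.DataFrame({
    "probability": [0.9, 0.7, 0.5, 0.3, 0.1, 0.8],
    "outcome":     [1,   1,   0,   0,   0,   0],
})

# Brier score: squared error between the forecast probability and the 0/1 outcome.
toy["brier"] = (toy["probability"] - toy["outcome"]) ** 2
mean_brier = toy["brier"].mean()

# A constant 0.5 forecast scores (0.5 - y)^2 = 0.25 whatever the outcome,
# which is why 0.25 serves as the no-skill reference line.
always_50 = ((0.5 - toy["outcome"]) ** 2).mean()

# Reliability table: bin the forecasts, then compare the mean predicted
# probability with the observed YES rate inside each bin.
edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
toy["bin"] = pd.cut(toy["probability"], bins=edges, include_lowest=True)
reliability = (
    toy.groupby("bin", observed=True)
       .agg(predicted=("probability", "mean"),
            observed=("outcome", "mean"),
            n=("outcome", "size"))
)

print(f"mean Brier = {mean_brier:.4f}, always-50% = {always_50:.4f}")
print(reliability)
```

A forecaster whose `predicted` and `observed` columns agree in every bin is well calibrated, yet it can still have a non-zero Brier score if its forecasts sit near 0.5 rather than near 0 or 1.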

In [None]:
# ── Brier Score Analysis ───────────────────────────────────────────────────
scored = results_df.dropna(subset=["brier"]).copy()
SEP = "=" * 72

# ── Overall ───────────────────────────────────────────────────────────────
print(f"\n{SEP}")
print("OVERALL BRIER SCORES  (lower = better | always-50% baseline = 0.2500)")
print(SEP)
overall = (
    scored.groupby("model")["brier"]
    .agg(["mean","std","count"])
    .rename(columns={"mean":"Mean Brier","std":"Std","count":"N"})
    .round(4)
    .sort_values("Mean Brier")
)
print(overall.to_string())

# ── AI models vs Kalshi T-1 ───────────────────────────────────────────────
print(f"\n{SEP}")
print("AI MODELS vs KALSHI T-1  (same pre-event T-1 snapshot)")
print(SEP)
kt1_briers = scored[scored["model"] == "Kalshi T-1"]["brier"]
print(f"  Kalshi T-1 (market): {kt1_briers.mean():.4f}  (N={len(kt1_briers)})")
for mk, cfg in MODELS.items():
    ai_grp  = scored[scored["model_key"] == mk]["brier"]
    diff    = ai_grp.mean() - kt1_briers.mean()
    verdict = ("BEATS market"   if diff < -0.005
               else "matches market" if abs(diff) <= 0.005
               else "trails market")
    print(f"  {cfg['name']:22s}: {ai_grp.mean():.4f}  "
          f"delta={diff:+.4f}  [{verdict}]")

# ── By city ───────────────────────────────────────────────────────────────
print(f"\n{SEP}")
print("MEAN BRIER BY CITY  (AI models + Kalshi T-1)")
print(SEP)
city_tab = (
    scored[~scored["model"].str.startswith("Baseline")]
    .groupby(["city","model"])["brier"]
    .mean().round(4).unstack()
)
if len(city_tab):
    print(city_tab.to_string())

# ── Per-model pre vs post cutoff ─────────────────────────────────────────
print(f"\n{SEP}")
print("PER-MODEL PRE- vs POST-CUTOFF BRIER  (T-1 snapshot, AI models only)")
print(SEP)
ai_scored = scored[scored["model_key"].isin(MODELS.keys())]
for model_key, cfg in MODELS.items():
    grp  = ai_scored[ai_scored["model_key"] == model_key]
    pre  = grp.loc[~grp["post_cutoff"], "brier"]
    post = grp.loc[ grp["post_cutoff"], "brier"]
    print(f"  {cfg['name']}  (cutoff: {cfg['knowledge_cutoff']})")
    if len(pre):
        print(f"    PRE-cutoff:  N={len(pre):4d}  Brier={pre.mean():.4f}  "
              f"std={pre.std():.4f}")
    if len(post):
        gap = post.mean() - pre.mean() if len(pre) else float("nan")
        print(f"    POST-cutoff: N={len(post):4d}  Brier={post.mean():.4f}  "
              f"std={post.std():.4f}  delta={gap:+.4f}")

# ── Cutoff-period cohorts ─────────────────────────────────────────────────
print(f"\n{SEP}")
print("BRIER BY CUTOFF-PERIOD COHORT  (pre_all / transition / post_all)")
print(SEP)
cohort_tab = (
    scored.groupby(["cutoff_period","model"])["brier"]
    .agg(["mean","count"])
    .rename(columns={"mean":"Brier","count":"N"})
    .round(4)
)
print(cohort_tab.to_string())

## Section 4: Visualisations

In [None]:
# ── Plots ─────────────────────────────────────────────────────────────────
from matplotlib.patches import Patch

fig, axs = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle("Forecasting Showdown: LLMs vs Kalshi Prediction Market (T-1 snapshot)",
             fontsize=14)

# ── 1. Overall Brier by model ─────────────────────────────────────────────
ax = axs[0, 0]
means = scored.groupby("model")["brier"].mean().sort_values()
colors = []
for m in means.index:
    if m == "Kalshi T-1":
        colors.append("darkorange")
    elif not m.startswith("Baseline"):
        colors.append("steelblue")
    else:
        colors.append("lightgray")
means.plot(kind="barh", ax=ax, color=colors, edgecolor="gray", linewidth=0.5)
ax.axvline(0.25, color="red", ls="--", alpha=0.7)
legend_els = [
    Patch(facecolor="steelblue",  label="AI model"),
    Patch(facecolor="darkorange", label="Kalshi T-1 market"),
    Patch(facecolor="lightgray",  label="Naive baseline"),
    plt.Line2D([0],[0], color="red", ls="--", alpha=0.7, label="No-skill (0.25)"),
]
ax.legend(handles=legend_els, fontsize=8)
ax.set_title("Mean Brier Score (competitive markets, T-1 snapshot)")
ax.set_xlabel("Mean Brier Score")

# ── 2. Pre vs post cutoff per model ──────────────────────────────────────
ax = axs[0, 1]
ai_sc = scored[scored["model_key"].isin(MODELS.keys())].copy()
pp_data = (ai_sc.groupby(["model","post_cutoff"])["brier"]
           .mean().unstack())
if True in pp_data.columns and False in pp_data.columns:
    pp_data[[False, True]].rename(columns={False:"Pre-cutoff", True:"Post-cutoff"}
                                  ).plot(kind="bar", ax=ax, rot=20, width=0.7)
    ax.axhline(0.25, color="red", ls="--", alpha=0.7, label="No-skill (0.25)")
    if "Kalshi T-1" in scored["model"].values:
        kt1_mean = scored[scored["model"] == "Kalshi T-1"]["brier"].mean()
        ax.axhline(kt1_mean, color="darkorange", ls="-.", lw=1.8,
                   label=f"Kalshi T-1 ({kt1_mean:.3f})")
    ax.set_title("Pre- vs Post-Cutoff Brier per AI Model")
    ax.set_ylabel("Mean Brier")
    ax.set_xlabel("")
    ax.legend(fontsize=8)
else:
    ax.text(0.5, 0.5, "Insufficient pre/post data", ha="center", va="center",
            transform=ax.transAxes)
    ax.set_title("Pre- vs Post-Cutoff (no data)")

# ── 3. Brier by cutoff-period cohort ─────────────────────────────────────
ax = axs[1, 0]
cohort_order = ["pre_all","transition","post_all"]
cohort_data  = (scored[scored["model_key"].isin(MODELS.keys())]
                .groupby(["cutoff_period","model"])["brier"]
                .mean().reset_index())
cohort_data["cutoff_period"] = pd.Categorical(cohort_data["cutoff_period"],
                                               categories=cohort_order, ordered=True)
cohort_data = cohort_data.sort_values("cutoff_period")
sns.barplot(data=cohort_data, x="cutoff_period", y="brier", hue="model", ax=ax)
ax.axhline(0.25, color="red", ls="--", alpha=0.7)
if "Kalshi T-1" in scored["model"].values:
    kt1_mean = scored[scored["model"] == "Kalshi T-1"]["brier"].mean()
    ax.axhline(kt1_mean, color="darkorange", ls="-.", lw=1.8,
               label=f"Kalshi T-1 ({kt1_mean:.3f})")
ax.set_title("Brier by Temporal Cohort (AI models; Kalshi T-1 = dash-dot orange)")
ax.set_ylabel("Mean Brier")
ax.set_xlabel("")
ax.legend(fontsize=7)

# ── 4. Calibration curves ─────────────────────────────────────────────────
ax = axs[1, 1]
bins = np.linspace(0, 1, 11)

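def calibration_curve(sub):
    """Mean predicted probability vs observed YES rate per bin (bins with >= 5 forecasts)."""
    mx, fy = [], []
    last = len(bins) - 2
    for i in range(len(bins) - 1):
        lo, hi = bins[i], bins[i+1]
        # Close the final bin on the right so forecasts of exactly 1.0 are counted.
        upper = (sub["probability"] <= hi) if i == last else (sub["probability"] < hi)
        mask = (sub["probability"] >= lo) & upper
        if mask.sum() >= 5:
            mx.append(sub.loc[mask, "probability"].mean())
            fy.append(sub.loc[mask, "outcome"].mean())
    return mx, fy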

markers = {"GPT-4o": "o", "Gemini 2.0 Flash": "s", "Claude 3.5 Sonnet": "^"}
for mk, cfg in MODELS.items():
    sub = scored[scored["model_key"] == mk].dropna(subset=["outcome"])
    if len(sub) < 10:
        continue
    mx, fy = calibration_curve(sub)
    if mx:
        ax.plot(mx, fy, marker=markers.get(cfg["name"],"o"), lw=1.5,
                label=cfg["name"])

# Kalshi T-1 calibration
kt1_sub = scored[scored["model"] == "Kalshi T-1"].dropna(subset=["outcome"])
if len(kt1_sub) >= 10:
    mx, fy = calibration_curve(kt1_sub)
    if mx:
        ax.plot(mx, fy, marker="D", lw=2, color="darkorange", label="Kalshi T-1")

# Baselines
for bname, ls_ in [("Baseline: City-Month Rate", "--"), ("Baseline: Always-50%", ":")]:
    sub = scored[scored["model"] == bname].dropna(subset=["outcome"])
    if len(sub) >= 10:
        mx, fy = calibration_curve(sub)
        if mx:
            ax.plot(mx, fy, ls=ls_, lw=1, color="gray", label=bname)

ax.plot([0,1],[0,1], "k--", lw=0.8, alpha=0.4)
ax.set_xlabel("Predicted Probability")
ax.set_ylabel("Observed YES Rate")
ax.set_title("Calibration Curves (T-1 snapshot, all forecasters)")
ax.legend(fontsize=7)

plt.tight_layout()
plt.savefig("cache/results_plots.png", dpi=150, bbox_inches="tight")
plt.show()
print("Saved: cache/results_plots.png")

## Section 5: Summary

In [None]:
# ── Final Summary ─────────────────────────────────────────────────────────
SEP = "=" * 72
print(SEP)
print("FORECASTING SHOWDOWN — FINAL LEADERBOARD")
print("LLMs (T-1) vs Kalshi Prediction Market (T-1) vs Baselines")
print(SEP)
print(f"Markets evaluated : {sample_df['ticker'].nunique():,} competitive markets")
print(f"Cities            : {', '.join(sorted(sample_df['city'].unique()))}")
ev_dates = pd.to_datetime(sample_df['event_date'])
print(f"Event date range  : {ev_dates.min().date()} to {ev_dates.max().date()}")
print()

leaderboard = (
    scored.groupby("model")["brier"]
    .mean().round(4).reset_index()
    .rename(columns={"brier":"Mean Brier"})
    .sort_values("Mean Brier")
)
print(leaderboard.to_string(index=False))
print()
print(f"  Always-50% baseline:     0.2500")
print(f"  Perfect forecast:        0.0000")

print(f"\n{SEP}")
print("AI MODELS vs KALSHI PREDICTION MARKET  (both at T-1 snapshot)")
print(SEP)
kt1_brier = scored[scored["model"] == "Kalshi T-1"]["brier"].mean()
print(f"  Kalshi T-1 (market): {kt1_brier:.4f}  ← benchmark")
for mk, cfg in MODELS.items():
    ai_brier = scored[scored["model_key"] == mk]["brier"].mean()
    diff     = ai_brier - kt1_brier
    verdict  = ("BEATS market"   if diff < -0.005
                else "matches market" if abs(diff) <= 0.005
                else "trails market")
    print(f"  {cfg['name']:22s}: {ai_brier:.4f}  "
          f"(delta={diff:+.4f}  {verdict})")

print(f"\n{SEP}")
print("KNOWLEDGE CUTOFF EFFECT SUMMARY")
print(SEP)
ai_sc2 = scored[scored["model_key"].isin(MODELS.keys())]
for mk, cfg in MODELS.items():
    grp  = ai_sc2[ai_sc2["model_key"] == mk]
    pre  = grp.loc[~grp["post_cutoff"], "brier"].mean()
    post = grp.loc[ grp["post_cutoff"], "brier"].mean()
    gap  = post - pre if pd.notna(pre) and pd.notna(post) else float("nan")
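    # A NaN gap (no pre- or post-cutoff rows) must not fall through to "(improved?)".
    flag = ("(n/a)" if pd.isna(gap) else "(degraded)" if gap > 0.005
            else "(stable)" if abs(gap) <= 0.005 else "(improved?)")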
    print(f"  {cfg['name']:22s}: pre={pre:.4f}  post={post:.4f}  "
          f"delta={gap:+.4f}  {flag}")

print(f"\n{SEP}")
print("INTERPRETATION")
print(SEP)
print("  Core question: Can frontier LLMs match a liquid prediction market")
print("  at forecasting binary weather outcomes, given the same information")
print("  horizon (T-1 = eve of event)?")
print()
print("  Kalshi T-1 = market's last hourly candle before event_date midnight,")
print("  fetched via the series-level candlestick API.  This is 21–22 hours")
print("  before the NWS official high temperature is recorded.")
print()
print("  Knowledge cutoff: for weather markets, cutoff effect should be small.")
print("  Seasonal temperature patterns are stable; 10-day weather context")
print("  provides the key signal regardless of training-data recency.")
print()
print("  A large positive delta = performance degrades post-cutoff.")
print("  A near-zero delta = model skill is robust to training-data age.")

print(f"\n{SEP}")
print("HOW TO RELOAD RESULTS")
print(SEP)
print("  import pandas as pd")
print("  results_df = pd.read_parquet('cache/results.parquet')")
print("  # key columns: ticker, city, event_date, direction, model, model_key,")
print("  #   probability, outcome, brier, cutoff_period, post_cutoff")