2026-05-12 audit: attribution + hedge + KXHIGHDEN block + σ residuals + μ-running-high shadow by ijlu · Pull Request #3 · ijlu/autoagent

ijlu · 2026-05-12T22:47:20Z

Full work from the 2026-05-12 audit thread. Seven commits, all deployed live and observed working. Re-homing to ijlu/autoagent (was inadvertently opened against kevinrgu/autoagent as cross-fork kevinrgu#12).

Commits

commit	summary
`433ceb7`	fills_writer: recover client_order_id via posted_orders on Kalshi field drift
`b4bf869`	record_settlements: derive revenue locally; Kalshi /settlements returns 0
`17ce086`	cross_bracket: hard-block KXHIGHDEN until σ inflation lands
`27fe563`	tools: one-time backfill scripts for hedge settlements + fills source
`deb4d07`	fills_writer: drift-alert when manual fill matches a recently-posted decision
`0985437`	cross_bracket: per-family σ floor from empirical residuals (D.1)
`07339bc`	metar_observations: shadow-log running-high-only μ alternative (F.4)

Headline findings

Kalshi removed client_order_id from /portfolio/fills around 2026-05-10. Every fill since was tagging as manual, silently breaking per-strategy attribution. Same field-drift class as the 2026-05-03 count_fp / *_price_dollars fix. Recovered via local posted_orders writer + fall-back lookup in fills_writer.
Kalshi /portfolio/settlements has been returning revenue=0 for valid winning settlements since at least 2026-04-12. Hedged cross_bracket positions (1 YES + 1 NO) were silently reported as ~$0.90 losses when they were actually ~$0.10 wins. Cross_bracket's headline -$13.39 P&L was closer to -$3.35 once hedge accounting was honest.
The METAR Gaussian's μ blends NWP forecast_high into running_high, contaminating the observation channel. On days when NWP misses (cold fronts in DEN, heat domes in AUS), μ carries a +5–15°F bias which the post-peak fast-path then locks in at σ=1.0°F. All 5 of 5 KXHIGHDEN directional cross_bracket bets in the audit window lost. The shadow-log path (commit 07339bc) emits a metar_running_only row alongside the live metar row for offline calibration comparison.
A structural-invariant test was not enforcing the rule it claimed to enforce (test_client_order_id_coverage.py was missing cross_bracket files from its FILES_THAT_POST_ORDERS list — that's why this class of bug shipped).

Backfill applied to live DB

tools/backfill_hedge_settlements.py --apply → 14 historical settlement rows corrected (+$14.00 P&L on cross_bracket's historical total)
tools/backfill_fills_source.py --apply → 26 cross_bracket + 5 cross_bracket_exit fills re-tagged from manual

What is intentionally NOT in this PR (calendar-blocked)

F.5 — analysis tool comparing metar vs metar_running_only Brier scores against settled outcomes. Run after ~3 days of shadow data.
F.6 — flip WEATHER_METAR_USE_RUNNING_HIGH_ONLY=true and revert commit 0985437 (per-family σ floors), conditional on F.5 outcome.

Test plan

pytest tests/ — 2,361 passing, 2 unrelated env-skipped (cryptography ABI on local Mac, no_secrets_in_repo)
Deploy to VPS via bash deploy/04_redeploy.sh 45.55.79.193 — daemon healthy
Verified posted_orders writer fires on first new cross_bracket POST after deploy
Verified metar_running_only shadow rows emit at 1:1 with combined_v2
cross_bracket_live:KXHIGHDEN kv flipped to false on prod
After ~3 days: run F.5 analysis, decide on F.6

🤖 Generated with Claude Code

…ld drift

The 2026-05-12 audit caught a second instance of the same Kalshi format-drift pattern that triggered the 2026-05-03 dual-format fix (commit 6a39dd7). Kalshi's /portfolio/fills response stopped echoing back ``client_order_id`` around 2026-05-10:

    Old keys: action, count_fp, created_time, fee_cost, fill_id, is_taker, market_ticker, no_price_dollars, order_id, side, subaccount_number, ticker, trade_id, ts, yes_price_dollars
    Missing: client_order_id

Without it, ``default_source_tagger`` returns ``manual`` for every fill, silently breaking per-strategy attribution: ``weather_mm_shadow.live_pnl_cents`` back-fill joins zero rows, ``mm_promotion`` graduation can never fire (CANARY becomes terminal), and ``backtest_comprehensive.py`` strategy slices are wrong for everything since May 11.

The 2026-05-12 audit found the bot had logged 9 cross_bracket_live POSTED decisions on May 11 against 0 ``cross_bracket`` source rows in ``fills_ledger``: the 12 corresponding fills had all collapsed into ``source='manual'``.

Recovery: record ``(order_id, client_order_id, source_hint)`` at post time in ``posted_orders``, fall back to that lookup at fill-ingest time when Kalshi's payload omits client_order_id.

Changes:

* bot/db.py: ``posted_orders`` table definition with order_id PK and indexes on posted_ts_unix and client_order_id. The table existed on prod (30 historical rows from an aborted backfill on May 7–10) but was never tracked in the schema. CREATE TABLE IF NOT EXISTS is idempotent against the existing prod table.
* bot/daemon/fills_writer.py: ``record_posted_order(conn, **kwargs)`` helper, idempotent via INSERT OR IGNORE on order_id, holds ``DB_WRITE_LOCK`` via db_write_ctx. ``ingest_page`` falls back to posted_orders lookup when fill.client_order_id is missing. Adds ``cross_bracket`` and ``cross_bracket_exit`` to ALLOWED_SOURCES and routes ``mm_xb_*`` / ``mm_xb_exit_*`` prefixes in default_source_tagger (was collapsing to ``legacy``).
* bot/daemon/cross_bracket_shadow.py: call record_posted_order after successful api_post in _post_live_order.
* bot/daemon/cross_bracket_exit.py: same wiring + rename exit prefix from ``mm_xbexit_`` to ``mm_xb_exit_`` so the tagger can distinguish exits from entries. has_pending_exit accepts both prefixes for in-flight orders.
* tests/test_client_order_id_coverage.py: structural-invariant test now covers cross_bracket_shadow and cross_bracket_exit (was missing them, which is why this bug shipped silently). Allowed prefixes expanded with ``mm_xb_exit_`` and ``mm_xb_``.
* tests/test_posted_orders_fallback.py: new; pins (a) tagger routing for both cross_bracket prefixes, (b) record_posted_order idempotence + partial-row refusal, (c) end-to-end recovery when Kalshi omits client_order_id from the fill payload, (d) external fills still fall through to ``manual`` correctly.

Going forward the daemon recovers attribution automatically. A follow-up offline backfill is required to repair existing May 11+ ``fills_ledger`` rows that already tagged ``manual`` (out of scope here: the live trading path is fixed but historical data is not).

2327 tests passing (2 unrelated env-blocked: cryptography ABI on local Mac, no_secrets_in_repo unrelated).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
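
The recovery path described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in bot/daemon/fills_writer.py: the column types, the `tag_source` and `resolve_source` helper names, and the simplified prefix routing are assumptions; only the table name, the order_id primary key, the two indexes, the INSERT OR IGNORE idempotence, and the ``mm_xb_`` / ``mm_xb_exit_`` prefixes come from the message above.

```python
import sqlite3
import time

# Assumed schema: the commit states only an order_id PK plus indexes on
# posted_ts_unix and client_order_id; column types here are guesses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS posted_orders (
    order_id        TEXT PRIMARY KEY,
    client_order_id TEXT NOT NULL,
    source_hint     TEXT,
    posted_ts_unix  INTEGER NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_posted_orders_ts ON posted_orders (posted_ts_unix);
CREATE INDEX IF NOT EXISTS idx_posted_orders_coid ON posted_orders (client_order_id);
"""


def record_posted_order(conn, order_id, client_order_id, source_hint=None, posted_ts_unix=None):
    """Record an order at post time; a repeat call for the same order_id is a no-op."""
    if not order_id or not client_order_id:
        return False  # refuse partial rows
    ts = int(time.time()) if posted_ts_unix is None else posted_ts_unix
    cur = conn.execute(
        "INSERT OR IGNORE INTO posted_orders VALUES (?, ?, ?, ?)",
        (order_id, client_order_id, source_hint, ts),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 when the row already existed


def tag_source(client_order_id):
    """Simplified tagger (hypothetical): the longer exit prefix must be tested first."""
    if not client_order_id:
        return "manual"
    if client_order_id.startswith("mm_xb_exit_"):
        return "cross_bracket_exit"
    if client_order_id.startswith("mm_xb_"):
        return "cross_bracket"
    return "legacy"


def resolve_source(conn, fill):
    """Tag a fill, falling back to posted_orders when the payload omits client_order_id."""
    coid = fill.get("client_order_id")
    if not coid:
        row = conn.execute(
            "SELECT client_order_id FROM posted_orders WHERE order_id = ?",
            (fill.get("order_id"),),
        ).fetchone()
        coid = row[0] if row else None
    return tag_source(coid)
```

A fill whose order_id was never recorded still resolves to ``manual``, which matches case (d) in the new test file.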

…ns 0

Third instance of the Kalshi field-drift pattern (after 2026-05-03 count_fp / *_price_dollars and 2026-05-12 client_order_id removal). Kalshi's /portfolio/settlements endpoint has been intermittently returning ``revenue=0`` for valid winning settlements since at least 2026-04-12, and as of 2026-05-12 returns 0 for every settlement. Verified against live API on 2026-05-12: every settlement in the 500-row page reports ``revenue: 0`` regardless of contracts held or market_result.

The bot's record_settlements() used this field directly in ``profit = revenue - cost - fees``, so:

* Pure winners with no hedge: profit = 0 - cost - fees (looks like a total loss). For directional buys at low prices this was a small rounding error, but for high-priced legs it understated profit.
* Hedged positions (cross_bracket_exit pattern: 1 YES + 1 NO on same ticker): the hedge GUARANTEES a $1 payout on the winning leg regardless of outcome. ``revenue=0`` made each hedge show as a ~$0.90 phantom loss when it was actually a ~$0.10 win.

The 2026-05-12 audit found 13 cross-bracket hedged positions all mis-reported this way; the strategy's headline -$13.39 P&L was actually closer to -$3.35 after correcting hedge accounting.

Fix: compute revenue locally via ``settlement_revenue_cents(yes_count, no_count, market_result)`` in bot/core/money.py. The formula is the canonical Kalshi binary contract: each winning-side contract pays $1.00, losing-side contracts pay $0. Identical for pure and hedged positions.

This is BUG #5 in the test_money.py regression watchlist (added).

11 new tests covering: pure YES/NO winner+loser, balanced hedge under both outcomes, asymmetric hedge, zero position, unknown result, fractional-count rounding, and the exact record_settlements call shape with a Kalshi payload that reports ``revenue=0``.

2339 tests passing (2 skipped: cryptography ABI on local Mac and no_secrets_in_repo, both unrelated).

Going-forward correctness only: historical settlements in the DB remain mis-reported (separate one-time backfill needed, out of scope here). The fix kicks in for every settlement recorded after deploy.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
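
The payout rule stated in the fix (each winning-side contract pays $1.00, losing-side contracts pay $0) can be sketched as below. This is an illustration under assumptions rather than the contents of bot/core/money.py: the handling of an unknown result (return 0) and the rounding of fractional counts are guesses, and the `profit_cents` helper is hypothetical.

```python
def settlement_revenue_cents(yes_count, no_count, market_result):
    """Revenue in cents for a settled binary market: 100 cents per winning-side contract."""
    result = (market_result or "").strip().lower()
    if result == "yes":
        winners = yes_count
    elif result == "no":
        winners = no_count
    else:
        return 0  # assumption: an unknown or missing result pays nothing
    return int(round(winners * 100))  # assumption: fractional counts round to whole cents


def profit_cents(revenue_cents, cost_cents, fees_cents):
    """Same shape as the commit's profit = revenue - cost - fees."""
    return revenue_cents - cost_cents - fees_cents


# Balanced hedge (1 YES + 1 NO) bought for 90 cents in total, zero fees for simplicity.
hedge_revenue = settlement_revenue_cents(1, 1, "yes")
honest_profit = profit_cents(hedge_revenue, 90, 0)   # small win
phantom_profit = profit_cents(0, 90, 0)              # what revenue=0 from the API implied
```

The hedge pays the same under either outcome, which is why the upstream ``revenue=0`` turned every such position into an apparent loss of roughly its full cost.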

2026-05-12 audit Phase C.3 finding: the combined-Gaussian σ for KXHIGHDEN is 1.4–4.2°F across recent decisions, while actual high RMSE was 11–12°F across 5 directional losses on May 6 (actual 38.5, model 49.5, σ=1.38) and May 11 (actual 85.5, model 73.0, σ=4.23). Both are 3+σ misses; 5/5 KXHIGHDEN directional cross-bracket bets resolved against us.

KXHIGHDEN is already blocked in DIRECTIONAL_BLOCKLIST and MM_BLOCKED_SERIES for the same reason (Brier 0.316 vs other weather families ≈ 0.10). The cross-bracket loop was the remaining gap; this closes it.

Changes:

* bot/config.py: ``CROSS_BRACKET_BLOCKLIST`` (env-overridable, default ``KXHIGHDEN``), a frozenset of family prefixes that are hard-banned regardless of kv-cache toggles. Distinct from the canary kv toggle (``cross_bracket_live:<family>``): blocklist is for known-broken families, kv is for staged rollout / temporary pauses.
* bot/daemon/cross_bracket_shadow.py: ``_is_family_live`` consults the blocklist before the kv lookup. Short-circuits to False for KXHIGHDEN even if env=true and kv=true (defends against accidental re-arm).
* bot/daemon/main.py: ``_run_cross_bracket_rearm`` explicitly writes False for blocked families so a leftover True from a prior arm decays out immediately rather than waiting on the 24h TTL. Log message now reports ``armed=`` / ``blocked=`` lists.
* tests/test_cross_bracket_live_gates.py: two new regression tests (blocklist overrides kv-true; case-insensitive family compare).

VPS kv override applied alongside deploy:

    UPDATE kv_cache SET value='false', expires_at=now+86400 WHERE key='cross_bracket_live:KXHIGHDEN'

Plan is to remove KXHIGHDEN from the blocklist once σ inflation lands (Phase D.1: per-family RMS-of-residuals as a floor on the Gaussian σ used at cross-bracket decision time).

2341 tests passing (2 unrelated skipped).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
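
The ordering of the two gates (blocklist first, kv toggle second) might look something like this sketch. The comma-separated env format, the dict standing in for the kv cache, and both function names are assumptions for illustration; the real ``_is_family_live`` also consults an env toggle that is omitted here.

```python
import os


def load_blocklist(env=None):
    """Parse CROSS_BRACKET_BLOCKLIST; a comma-separated format is assumed."""
    env = os.environ if env is None else env
    raw = env.get("CROSS_BRACKET_BLOCKLIST", "KXHIGHDEN")
    return frozenset(part.strip().upper() for part in raw.split(",") if part.strip())


def is_family_live(family, kv, blocklist):
    """Blocklist wins over the kv toggle; the comparison is case-insensitive."""
    if family.upper() in blocklist:
        return False  # hard ban: a stale or accidental kv 'true' cannot re-arm it
    # kv is a plain dict here, standing in for the real kv cache
    value = kv.get(f"cross_bracket_live:{family.upper()}", "false")
    return str(value).lower() == "true"


blocklist = load_blocklist(env={})  # empty env -> default blocklist
kv = {
    "cross_bracket_live:KXHIGHDEN": "true",  # leftover from a prior arm
    "cross_bracket_live:KXHIGHNY": "true",
}
```

Keeping the blocklist check ahead of the kv lookup means the ban holds even when the kv row has not yet been overwritten or expired.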

Two companion scripts for the 2026-05-12 audit aftermath. Going-forward correctness was landed in 433ceb7 (posted_orders writer) and b4bf869 (settlement_revenue_cents). These tools sweep up the historical data.

tools/backfill_hedge_settlements.py: recompute settlements rows that were written before BUG #5 was fixed:

- For every settlement, look up the bot's (yes_qty, no_qty, fees) from fills_ledger and the authoritative market_result from alpha_backtest.
- Recompute revenue via settlement_revenue_cents, then profit and won. UPDATE only if any field differs.
- Idempotent re-run via field-equality check.
- Dry-run by default; --apply writes.
- Applied live on VPS: 14 rows corrected. Net P&L correction +$14.00 on historical cross_bracket totals (the 13 hedged winners that were silently bleeding $1 each, plus one asymmetric hedge that goes from -$1.21 to -$0.21).

tools/backfill_fills_source.py: re-tag fills_ledger rows that landed as ``source='manual'`` because posted_orders wasn't being written during the May 11+ window:

- Rule 1: match by (ticker, side, ts ±60s) against alpha_backtest cross_bracket_live posted decisions → ``source='cross_bracket'``.
- Rule 2: YES buy at ≤15¢ following a same-ticker NO entry within 12h → ``source='cross_bracket_exit'`` (heuristic; the exit code path doesn't log to alpha_backtest, but the price+timing is diagnostic).
- Anything that doesn't match either rule stays as ``manual``.
- Applied live on VPS: 26 cross_bracket + 5 cross_bracket_exit recovered. 3 left as manual (likely real human-placed fills).

7-test regression suite for the hedge-settlements backfill covering: hedged winner correction, hedged loser with corrected (still-negative) profit, pure winner, pure loser unchanged (rows already correct stay that way), no-bot-fills rows skipped, dry-run writes nothing, idempotent re-apply.

2348 tests passing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
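
The two re-tagging rules can be expressed as a small classifier. This sketch works on plain dicts rather than the real fills_ledger and alpha_backtest rows; every field name (`ticker`, `side`, `action`, `price_cents`, `ts`) and the `retag_fill` function itself are hypothetical, and only the thresholds (±60s, ≤15¢, 12h) come from the message above.

```python
RULE1_WINDOW_S = 60          # Rule 1: ts within ±60s of a posted decision
RULE2_MAX_PRICE_CENTS = 15   # Rule 2: YES buy at <= 15 cents
RULE2_LOOKBACK_S = 12 * 3600 # Rule 2: NO entry on the same ticker within 12h


def retag_fill(fill, posted_decisions, no_entries):
    """Return the recovered source for a fill currently tagged 'manual'."""
    # Rule 1: same ticker and side, timestamp within the window of a posted decision.
    for decision in posted_decisions:
        if (decision["ticker"] == fill["ticker"]
                and decision["side"] == fill["side"]
                and abs(decision["ts"] - fill["ts"]) <= RULE1_WINDOW_S):
            return "cross_bracket"
    # Rule 2: cheap YES buy that follows a NO entry on the same ticker.
    if (fill["side"] == "yes" and fill["action"] == "buy"
            and fill["price_cents"] <= RULE2_MAX_PRICE_CENTS):
        for entry in no_entries:
            elapsed = fill["ts"] - entry["ts"]
            if entry["ticker"] == fill["ticker"] and 0 < elapsed <= RULE2_LOOKBACK_S:
                return "cross_bracket_exit"
    return "manual"  # neither rule matched: leave the fill alone


posted = [{"ticker": "T1", "side": "no", "ts": 1000}]
entries = [{"ticker": "T1", "ts": 1030}]
entry_fill = {"ticker": "T1", "side": "no", "action": "buy", "price_cents": 80, "ts": 1030}
exit_fill = {"ticker": "T1", "side": "yes", "action": "buy", "price_cents": 10, "ts": 5000}
human_fill = {"ticker": "T2", "side": "yes", "action": "buy", "price_cents": 50, "ts": 5000}
```

Rule 1 is evaluated first because it rests on a logged decision; Rule 2 is the weaker price-and-timing heuristic and only applies to what Rule 1 leaves behind.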

…decision

The 2026-05-12 audit's most insidious finding: Kalshi removed client_order_id from /portfolio/fills around 2026-05-10, and the bot silently tagged every cross_bracket fill as 'manual' for two days before anyone noticed. Same class as the 2026-05-03 dual-format fix: field drift on a "stable" Kalshi field, no warning, just attribution quietly going dark.

This adds the canary that would have caught it on day one. When FillsWriter tags a fill as 'manual', it checks alpha_backtest for a posted decision on the same (ticker, side) within the last 10 minutes. If found, log a loud DRIFT ALERT pointing at the audit Phase A so the next instance of the same bug is visible the moment it lands.

Heuristic carefully tuned:

* Window: 600s. Covers post→fill latency plus retries, narrow enough to avoid false positives from unrelated bot+human activity on the same ticker hours apart.
* Same-side match required: a fill on the opposite side could be a hedge exit or unrelated trade.
* decision_outcome='posted' filter: shadow decisions don't matter.
* Logged at WARNING with the exact phrase "DRIFT ALERT" so log greps and ops alerts can pattern-match without false-positive risk.
* Lookup wrapped in try/except: a broken alert must not break the writer.

3 new regression tests:

- Fires when a posted decision sits within window of a manual fill
- Silent for genuine external fills (no matching decision)
- Silent for old decisions outside the 10-min window

2351 tests passing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
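
A rough sketch of the drift-alert check follows. The alpha_backtest column names (`ticker`, `side`, `decision_outcome`, `ts_unix`), the fill dict keys, and the `maybe_alert_drift` name are assumptions; the 600s window, same-side match, 'posted' filter, WARNING level, "DRIFT ALERT" phrase, and the swallow-all-errors wrapper are taken from the list above.

```python
import logging
import sqlite3

log = logging.getLogger("fills_writer")
DRIFT_WINDOW_S = 600  # post-to-fill latency plus retries


def maybe_alert_drift(conn: sqlite3.Connection, fill: dict) -> bool:
    """Warn when a fill tagged 'manual' matches a recently posted bot decision."""
    try:
        row = conn.execute(
            "SELECT 1 FROM alpha_backtest "
            "WHERE ticker = ? AND side = ? AND decision_outcome = 'posted' "
            "AND ts_unix BETWEEN ? AND ? LIMIT 1",
            (fill["ticker"], fill["side"], fill["ts"] - DRIFT_WINDOW_S, fill["ts"]),
        ).fetchone()
    except Exception:
        # A broken alert must not break the writer: log and carry on.
        log.exception("drift-alert lookup failed; continuing")
        return False
    if row is None:
        return False
    log.warning(
        "DRIFT ALERT: manual fill on %s/%s matches a posted decision within %ds",
        fill["ticker"], fill["side"], DRIFT_WINDOW_S,
    )
    return True
```

Returning a boolean is a convenience for testing in this sketch; the commit only describes the log line as the observable effect.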

2026-05-12 audit Phase D.1 finding: the combined-Gaussian σ that cross_bracket scores brackets against collapses to ~1°F post-peak (METAR fast-path tightens it once the daily high looks locked), but the empirical actual-vs-predicted RMSE over the last 18 days is much wider: 1.34–2.39°F across the five non-Denver weather families.

Per-family RMSE (tools/sigma_residuals.py, n=10–16 per family):

    LAX  1.34°F
    MIA  1.73°F
    NY   1.86°F
    AUS  1.95°F
    CHI  2.39°F
    DEN  6.55°F (blocked)

With σ collapsed to 1°F, cross_bracket's conviction gate (p ≥ 0.65 either way) fires on any bracket >1.4°F from μ, but the ACTUAL high routinely lands 1–2σ from μ at that scale. So the strategy was confidently shorting NO on brackets that genuinely had ~30% YES probability, picking up 0/29 directional bets through the audit window.

Fix: add ``CROSS_BRACKET_FAMILY_SIGMA_FLOORS`` (per-family table, env-overridable), and clamp σ in ``_score_one_settlement`` to ``max(combined.sigma_f, family_floor, physical_floor)``. Wider σ → flatter bracket probabilities → conviction gate fires only on genuine model-market disagreements, not on routine variance the model already knows about.

Concrete impact: with σ=2.0 instead of σ=1.0 for KXHIGHNY, a bracket that previously scored p_yes=0.005 (the conviction-gate-passing "definitely not this bracket") now scores ~0.04: still skew, but no longer triggering the strategy's "I'm certain" branch.

The σ inflation also changes shadow scoring (deliberate: future calibration training sees the new regime), so the calibration table will drift over the next few days as samples accumulate under the new σ. Watch the per-family Brier in backtest_comprehensive.

KXHIGHDEN keeps its hard block in CROSS_BRACKET_BLOCKLIST: 6.5°F empirical floor would essentially flatten every bracket probability, preventing the strategy from firing anyway. Block is cleaner than "infinite σ" as a way of saying "we don't trade this".

Companion tool: tools/sigma_residuals.py recomputes the per-family RMSE from weather_forecast_snapshots × alpha_backtest. Re-run periodically; if the floors drift materially, update the constants (test_cross_bracket_live_gates.py pins them and will fail loudly).

Test plan:

- test_family_sigma_floors_table_defaults: pins the constants
- test_family_sigma_floor_env_override: env var path works

2353 tests passing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
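
To make the "wider σ → flatter bracket probabilities" step concrete, here is a small sketch of the clamp and its effect on a single bracket. The floor table reuses the RMSE figures reported above as stand-in values (the actual constants in ``CROSS_BRACKET_FAMILY_SIGMA_FLOORS`` are not shown in this commit), the short family keys, the bracket bounds, and the function names are illustrative, and the physical floor is passed in because its value is not stated.

```python
import math

# Stand-in floors in °F, copied from the RMSE table above; the real constants may differ.
FAMILY_SIGMA_FLOORS_F = {"LAX": 1.34, "MIA": 1.73, "NY": 1.86, "AUS": 1.95, "CHI": 2.39}


def clamped_sigma(combined_sigma_f, family, physical_floor_f):
    """max(combined.sigma_f, family_floor, physical_floor) as described in the fix."""
    family_floor = FAMILY_SIGMA_FLOORS_F.get(family, 0.0)
    return max(combined_sigma_f, family_floor, physical_floor_f)


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def bracket_p_yes(low_f, high_f, mu, sigma):
    """Probability that the daily high lands inside [low_f, high_f) under N(mu, sigma^2)."""
    return normal_cdf(high_f, mu, sigma) - normal_cdf(low_f, mu, sigma)


mu = 70.0
low_f, high_f = 73.0, 75.0   # a bracket three degrees above mu (illustrative)
sigma_before = 1.0           # post-peak fast-path value
sigma_after = clamped_sigma(sigma_before, "NY", physical_floor_f=0.0)

p_before = bracket_p_yes(low_f, high_f, mu, sigma_before)
p_after = bracket_p_yes(low_f, high_f, mu, sigma_after)
```

With the floor applied, the same off-center bracket carries noticeably more YES probability, so a NO bet on it looks less like a sure thing.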

The 2026-05-12 audit's deepest finding: the bot's ``metar`` source isn't actually a METAR observation channel: its μ (``expected_eventual_high``) blends running_high with the NWP-derived forecast_high and a diurnal projection. On days when NWP misses (cold fronts in DEN, heat domes in AUS), the blended μ carries a +5–15°F bias, which the post-peak fast-path then locks in by collapsing σ to 1.0°F. Cross-bracket fires confidently against well-priced markets and loses (5/5 KXHIGHDEN directional losses in the audit window).

Single concrete example, KXHIGHDEN-26MAY06 at decision time:

    actual METAR running max = 38°F  (cold front held the day's high)
    bot's `metar` source μ   = 48°F  (NWP forecast was 54°F, blended in)
    bot's combined_v2 μ      = 48°F, σ = 1.0°F (fast-path locked)
    actual day's high        = 38°F  → bracket B38.5 won
    bot bought NO at 5¢ on B38.5 thinking p_yes = 0.005, lost.

Josh's intuition for the fix: each source should represent ONE thing and combine_gaussian should do the weighting. The METAR channel should carry the running max with diurnal-RMSE σ; NWP feeds 5+ other sources (nbm/hrrr/nws_point/ecmwf/gem/icon/etc) which the combine already weights. Removing NWP from the METAR channel eliminates the double-counting and lets the combine widen σ when NWP and obs disagree.

Shipped behavior here is shadow-only:

* New side-channel ``_ALT_MU_RUNNING_HIGH`` stashes (μ=running_high, σ=diurnal_rmse_or_2.0°F_fallback) on every cycle
* Snapshot writer emits a parallel ``metar_running_only`` row in weather_forecast_snapshots alongside the existing ``metar`` row
* No change to the live Gaussian path (default flag off)
* Live cutover is gated by ``WEATHER_METAR_USE_RUNNING_HIGH_ONLY=true``; flip after ~1 week of shadow data confirms the alt is better-calibrated against settled outcomes
* When flag flips on, the existing D.1 per-family σ floors (commit 0985437) should probably come out; both target the same symptom; the F.4 cure is upstream and cleaner

The σ fallback (2.0°F when no diurnal fit) is wider than the existing past-peak clamp (0.3°F) and the post-peak fast-path floor (1.0°F). This is deliberate: claim less precision when only the running max grounds μ, not more.

8 new regression tests pin:

- stash semantics (with/without diurnal fit, zero RMSE → fallback)
- pop semantics (prevents stale reads on a missed cycle)
- flag-off path: live μ uses blended forecast, alt still stashed
- flag-on path: live μ == running_high
- past-peak path: stashes the alt with the wider fallback σ
- default flag is False (regression guard against premature cutover)

2361 tests passing.

Follow-up plan (not landed yet):

F.5: analysis tool over weather_forecast_snapshots comparing the new ``metar_running_only`` rows vs ``metar`` rows against settled outcomes. Run after 7 days of shadow data.
F.6: once shadow validates, flip ``WEATHER_METAR_USE_RUNNING_HIGH_ONLY`` and revert commit 0985437's D.1 per-family σ floors.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
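
The stash and pop semantics pinned by the new tests could be sketched as follows. Only the ``_ALT_MU_RUNNING_HIGH`` name, the 2.0°F fallback, and the flag behavior come from the commit; keying the side-channel by station and the three function names are assumptions made for this illustration.

```python
FALLBACK_SIGMA_F = 2.0      # used when there is no diurnal fit (or its RMSE is zero)
_ALT_MU_RUNNING_HIGH = {}   # side-channel; keyed by station here (assumed)


def stash_alt(station, running_high_f, diurnal_rmse_f=None):
    """Stash (mu=running_high, sigma=diurnal RMSE or the fallback) for the snapshot writer."""
    has_fit = diurnal_rmse_f is not None and diurnal_rmse_f > 0
    sigma_f = diurnal_rmse_f if has_fit else FALLBACK_SIGMA_F
    _ALT_MU_RUNNING_HIGH[station] = (running_high_f, sigma_f)


def pop_alt(station):
    """Read-and-remove, so a missed cycle yields None instead of a stale value."""
    return _ALT_MU_RUNNING_HIGH.pop(station, None)


def live_mu(running_high_f, blended_mu_f, use_running_high_only=False):
    """Flag off (default): blended mu stays live. Flag on: mu is the running high."""
    return running_high_f if use_running_high_only else blended_mu_f


# One cycle for the KXHIGHDEN example above: running max 38°F, blended mu 48°F.
stash_alt("KDEN", 38.0)
shadow_row = pop_alt("KDEN")   # feeds the metar_running_only snapshot row
stale_read = pop_alt("KDEN")   # nothing left until the next cycle stashes again
```

Popping rather than reading keeps the shadow row tied to the cycle that produced it, which is the stale-read guard the tests describe.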

ijlu and others added 7 commits May 12, 2026 10:06

ijlu merged commit 59f8de4 into main May 12, 2026

ijlu mentioned this pull request May 12, 2026

fills_writer: recover client_order_id via posted_orders on Kalshi field drift kevinrgu/autoagent#12

Closed

4 tasks

ijlu deleted the claude/sleepy-hofstadter-1d66bc branch May 12, 2026 22:53

ijlu merged 7 commits into main from claude/sleepy-hofstadter-1d66bc
