2026-05-12 audit: attribution + hedge + KXHIGHDEN block + σ residuals + μ-running-high shadow #3
Merged
…ld drift

The 2026-05-12 audit caught a second instance of the same Kalshi
format-drift pattern that triggered the 2026-05-03 dual-format fix
(commit 6a39dd7). Kalshi's /portfolio/fills response stopped echoing
back ``client_order_id`` around 2026-05-10:

    Old keys: action, count_fp, created_time, fee_cost, fill_id,
              is_taker, market_ticker, no_price_dollars, order_id,
              side, subaccount_number, ticker, trade_id, ts,
              yes_price_dollars
    Missing:  client_order_id

Without it, ``default_source_tagger`` returns ``manual`` for every
fill, silently breaking per-strategy attribution:
``weather_mm_shadow.live_pnl_cents`` back-fill joins zero rows,
``mm_promotion`` graduation can never fire (CANARY becomes terminal),
and ``backtest_comprehensive.py`` strategy slices are wrong for
everything since May 11.

The 2026-05-12 audit found the bot had logged 9 cross_bracket_live
POSTED decisions on May 11 against 0 ``cross_bracket`` source rows in
``fills_ledger``: the 12 corresponding fills had all collapsed into
``source='manual'``.

Recovery: record ``(order_id, client_order_id, source_hint)`` at post
time in ``posted_orders``, then fall back to that lookup at fill-ingest
time when Kalshi's payload omits client_order_id.

Changes:

* bot/db.py: ``posted_orders`` table definition with order_id PK and
  indexes on posted_ts_unix and client_order_id. The table existed on
  prod (30 historical rows from an aborted backfill on May 7–10) but
  was never tracked in the schema. CREATE TABLE IF NOT EXISTS is
  idempotent against the existing prod table.
* bot/daemon/fills_writer.py: ``record_posted_order(conn, **kwargs)``
  helper, idempotent via INSERT OR IGNORE on order_id, holds
  ``DB_WRITE_LOCK`` via db_write_ctx. ``ingest_page`` falls back to
  posted_orders lookup when fill.client_order_id is missing. Adds
  ``cross_bracket`` and ``cross_bracket_exit`` to ALLOWED_SOURCES and
  routes ``mm_xb_*`` / ``mm_xb_exit_*`` prefixes in
  default_source_tagger (was collapsing to ``legacy``).
* bot/daemon/cross_bracket_shadow.py: call record_posted_order after
  successful api_post in _post_live_order.
* bot/daemon/cross_bracket_exit.py: same wiring + rename exit prefix
  from ``mm_xbexit_`` to ``mm_xb_exit_`` so the tagger can distinguish
  exits from entries. has_pending_exit accepts both prefixes for
  in-flight orders.
* tests/test_client_order_id_coverage.py: structural-invariant test
  now covers cross_bracket_shadow and cross_bracket_exit (it was
  missing them, which is why this bug shipped silently). Allowed
  prefixes expanded with ``mm_xb_exit_`` and ``mm_xb_``.
* tests/test_posted_orders_fallback.py: new. Pins (a) tagger routing
  for both cross_bracket prefixes, (b) record_posted_order idempotence
  + partial-row refusal, (c) end-to-end recovery when Kalshi omits
  client_order_id from the fill payload, (d) external fills still fall
  through to ``manual`` correctly.

Going forward the daemon recovers attribution automatically. A
follow-up offline backfill is required to repair existing May 11+
``fills_ledger`` rows that already tagged ``manual`` (out of scope
here: the live trading path is fixed but historical data is not).

2327 tests passing (2 unrelated env-blocked: cryptography ABI on local
Mac, no_secrets_in_repo unrelated).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
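The recovery path in the commit above can be sketched as follows. This is a minimal illustration, not the code in bot/daemon/fills_writer.py: the column types, the ``resolve_source`` helper, and the use of a plain sqlite3 connection without ``DB_WRITE_LOCK`` are assumptions. The prefixes, source names, and the ``legacy`` / ``manual`` fall-throughs are taken from the commit message.

```python
import sqlite3

# Longest prefix first so exit orders are not mistaken for entries.
PREFIX_TO_SOURCE = (
    ("mm_xb_exit_", "cross_bracket_exit"),
    ("mm_xb_", "cross_bracket"),
)


def tag_source(client_order_id):
    """Map a client_order_id to a source label."""
    if not client_order_id:
        return "manual"  # no id at all: treated as an external fill
    for prefix, source in PREFIX_TO_SOURCE:
        if client_order_id.startswith(prefix):
            return source
    return "legacy"  # unknown prefix (the old collapse target)


def ensure_schema(conn):
    # Column types are assumed; the commit only names the PK and indexes.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS posted_orders (
               order_id        TEXT PRIMARY KEY,
               client_order_id TEXT NOT NULL,
               source_hint     TEXT,
               posted_ts_unix  INTEGER NOT NULL
           )"""
    )


def record_posted_order(conn, order_id, client_order_id, source_hint, posted_ts_unix):
    """Insert once per order_id; refuse partial rows. Returns True if a row was written."""
    if not order_id or not client_order_id:
        return False
    cur = conn.execute(
        "INSERT OR IGNORE INTO posted_orders VALUES (?, ?, ?, ?)",
        (order_id, client_order_id, source_hint, posted_ts_unix),
    )
    conn.commit()
    return cur.rowcount == 1


def resolve_source(conn, fill):
    """Tag a fill, falling back to posted_orders when the payload omits client_order_id."""
    coid = fill.get("client_order_id")
    if not coid:
        row = conn.execute(
            "SELECT client_order_id FROM posted_orders WHERE order_id = ?",
            (fill.get("order_id"),),
        ).fetchone()
        coid = row[0] if row else None
    return tag_source(coid)
```

A fill whose ``order_id`` was never recorded still resolves to ``manual``, which matches the "external fills still fall through" case pinned by the new tests.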
…ns 0

Third instance of the Kalshi field-drift pattern (after 2026-05-03
count_fp / *_price_dollars and 2026-05-12 client_order_id removal).
Kalshi's /portfolio/settlements endpoint has been intermittently
returning ``revenue=0`` for valid winning settlements since at least
2026-04-12, and as of 2026-05-12 returns 0 for every settlement.

Verified against live API on 2026-05-12: every settlement in the
500-row page reports ``revenue: 0`` regardless of contracts held or
market_result.

The bot's record_settlements() used this field directly in
``profit = revenue - cost - fees``, so:

* Pure winners with no hedge: profit = 0 - cost - fees (looks like a
  total loss). For directional buys at low prices this was a small
  rounding error, but for high-priced legs it understated profit.
* Hedged positions (cross_bracket_exit pattern: 1 YES + 1 NO on same
  ticker): the hedge GUARANTEES a $1 payout on the winning leg
  regardless of outcome. ``revenue=0`` made each hedge show as a
  ~$0.90 phantom loss when it was actually a ~$0.10 win. The
  2026-05-12 audit found 13 cross-bracket hedged positions all
  mis-reported this way; the strategy's headline -$13.39 P&L was
  actually closer to -$3.35 after correcting hedge accounting.

Fix: compute revenue locally via
``settlement_revenue_cents(yes_count, no_count, market_result)`` in
bot/core/money.py. The formula is the canonical Kalshi binary
contract: each winning-side contract pays $1.00, losing-side contracts
pay $0. Identical for pure and hedged positions.

This is BUG #5 in the test_money.py regression watchlist (added).

11 new tests covering: pure YES/NO winner+loser, balanced hedge under
both outcomes, asymmetric hedge, zero position, unknown result,
fractional-count rounding, and the exact record_settlements call shape
with a Kalshi payload that reports ``revenue=0``.

2339 tests passing (2 skipped: cryptography ABI on local Mac and
no_secrets_in_repo, both unrelated).

Going-forward correctness only: historical settlements in the DB
remain mis-reported (separate one-time backfill needed, out of scope
here). The fix kicks in for every settlement recorded after deploy.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
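The binary-contract payout rule described above fits in a few lines. This is a sketch under assumptions, not the function in bot/core/money.py: returning 0 for an unknown ``market_result`` and rounding fractional counts with ``round()`` are guesses at behavior the commit only names as test cases, and ``profit_cents`` is a hypothetical helper.

```python
def settlement_revenue_cents(yes_count, no_count, market_result):
    """Revenue in cents: each winning-side contract pays 100, losing side pays 0."""
    result = (market_result or "").strip().lower()
    if result == "yes":
        winners = yes_count
    elif result == "no":
        winners = no_count
    else:
        return 0  # assumption: unknown or unsettled result yields no revenue
    return int(round(winners * 100))  # assumption: fractional counts are rounded


def profit_cents(yes_count, no_count, market_result, cost_cents, fees_cents=0):
    """Hypothetical helper: profit = revenue - cost - fees, with revenue computed locally."""
    revenue = settlement_revenue_cents(yes_count, no_count, market_result)
    return revenue - cost_cents - fees_cents
```

For a balanced hedge (1 YES + 1 NO) that cost 90¢ in total, ``profit_cents(1, 1, "yes", 90)`` and ``profit_cents(1, 1, "no", 90)`` both come to 10¢, whereas plugging in the API's ``revenue=0`` would give -90¢. That is the ~$0.90 phantom loss versus ~$0.10 win described in the commit.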
2026-05-12 audit Phase C.3 finding: the combined-Gaussian σ for
KXHIGHDEN is 1.4–4.2°F across recent decisions, while actual high RMSE
was 11–12°F across 5 directional losses on May 6 (actual 38.5, model
49.5, σ=1.38) and May 11 (actual 85.5, model 73.0, σ=4.23). Both are
3+σ misses; 5/5 KXHIGHDEN directional cross-bracket bets resolved
against us.

KXHIGHDEN is already blocked in DIRECTIONAL_BLOCKLIST and
MM_BLOCKED_SERIES for the same reason (Brier 0.316 vs other weather
families ≈ 0.10). The cross-bracket loop was the remaining gap; this
closes it.

Changes:

* bot/config.py: ``CROSS_BRACKET_BLOCKLIST`` (env-overridable, default
  ``KXHIGHDEN``), a frozenset of family prefixes that are hard-banned
  regardless of kv-cache toggles. Distinct from the canary kv toggle
  (``cross_bracket_live:<family>``): blocklist is for known-broken
  families, kv is for staged rollout / temporary pauses.
* bot/daemon/cross_bracket_shadow.py: ``_is_family_live`` consults the
  blocklist before the kv lookup. Short-circuits to False for
  KXHIGHDEN even if env=true and kv=true (defends against accidental
  re-arm).
* bot/daemon/main.py: ``_run_cross_bracket_rearm`` explicitly writes
  False for blocked families so a leftover True from a prior arm
  decays out immediately rather than waiting on the 24h TTL. Log
  message now reports ``armed=`` / ``blocked=`` lists.
* tests/test_cross_bracket_live_gates.py: two new regression tests
  (blocklist overrides kv-true; case-insensitive family compare).

VPS kv override applied alongside deploy:

    UPDATE kv_cache SET value='false', expires_at=now+86400
    WHERE key='cross_bracket_live:KXHIGHDEN'

Plan is to remove KXHIGHDEN from the blocklist once σ inflation lands
(Phase D.1: per-family RMS-of-residuals as a floor on the Gaussian σ
used at cross-bracket decision time).

2341 tests passing (2 unrelated skipped).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
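The gate ordering described above (blocklist first, then env, then kv) might look like the sketch below. The comma-separated env format, the dict standing in for the kv cache, and the ``env_enabled`` parameter are assumptions for illustration; only the blocklist name, its default, and the ``cross_bracket_live:<family>`` key shape come from the commit message.

```python
import os


def load_blocklist(env=None):
    """Parse the hard-ban list; assumes a comma-separated env value."""
    env = os.environ if env is None else env
    raw = env.get("CROSS_BRACKET_BLOCKLIST", "KXHIGHDEN")
    return frozenset(p.strip().upper() for p in raw.split(",") if p.strip())


def is_family_live(family, kv, blocklist, env_enabled=True):
    """Return True only if the family is not hard-banned and both toggles are on."""
    if family.upper() in blocklist:
        return False  # known-broken family: no toggle can re-arm it
    if not env_enabled:
        return False
    # kv is a plain dict here; the real cache presumably has TTL semantics.
    return kv.get(f"cross_bracket_live:{family}", "false") == "true"
```

Checking the blocklist before any toggle is what makes an accidental re-arm harmless: a stale ``true`` in the kv cache is never even read for a blocked family.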
Two companion scripts for the 2026-05-12 audit aftermath. Going-forward
correctness was landed in 433ceb7 (posted_orders writer) and b4bf869
(settlement_revenue_cents). These tools sweep up the historical data.

tools/backfill_hedge_settlements.py: recompute settlements rows that
were written before BUG #5 was fixed:

- For every settlement, look up the bot's (yes_qty, no_qty, fees) from
  fills_ledger and the authoritative market_result from
  alpha_backtest.
- Recompute revenue via settlement_revenue_cents, then profit and won.
  UPDATE only if any field differs.
- Idempotent re-run via field-equality check.
- Dry-run by default; --apply writes.
- Applied live on VPS: 14 rows corrected. Net P&L correction +$14.00
  on historical cross_bracket totals (the 13 hedged winners that were
  silently bleeding $1 each, plus one asymmetric hedge that goes from
  -$1.21 to -$0.21).

tools/backfill_fills_source.py: re-tag fills_ledger rows that landed
as ``source='manual'`` because posted_orders wasn't being written
during the May 11+ window:

- Rule 1: match by (ticker, side, ts ±60s) against alpha_backtest
  cross_bracket_live posted decisions → ``source='cross_bracket'``.
- Rule 2: YES buy at ≤15¢ following a same-ticker NO entry within 12h
  → ``source='cross_bracket_exit'`` (heuristic; the exit code path
  doesn't log to alpha_backtest, but the price+timing is diagnostic).
- Anything that doesn't match either rule stays as ``manual``.
- Applied live on VPS: 26 cross_bracket + 5 cross_bracket_exit
  recovered. 3 left as manual (likely real human-placed fills).

7-test regression suite for the hedge-settlements backfill covering:
hedged winner correction, hedged loser with corrected (still-negative)
profit, pure winner, pure loser unchanged (rows already correct stay
that way), no-bot-fills rows skipped, dry-run writes nothing,
idempotent re-apply.

2348 tests passing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
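The two re-tagging rules for ``manual`` fills can be expressed as a small pure function. This is an illustrative sketch, not tools/backfill_fills_source.py itself: the dict field names (``ticker``, ``side``, ``action``, ``price_cents``, ``ts``) and the shape of the two lookup arguments are assumptions. The thresholds (±60s, ≤15¢, 12h) come from the commit message.

```python
MATCH_WINDOW_S = 60            # Rule 1: fill within ±60s of a posted decision
EXIT_MAX_PRICE_CENTS = 15      # Rule 2: cheap YES buy
EXIT_LOOKBACK_S = 12 * 3600    # Rule 2: NO entry at most 12h earlier


def retag_manual_fill(fill, posted_decisions, no_entry_times):
    """Return the recovered source for a fill currently tagged 'manual'.

    posted_decisions: iterable of dicts with ticker, side, ts (posted decisions).
    no_entry_times:   dict mapping ticker -> list of NO-entry timestamps.
    """
    # Rule 1: same ticker and side, timestamps within the match window.
    for d in posted_decisions:
        if (
            d["ticker"] == fill["ticker"]
            and d["side"] == fill["side"]
            and abs(d["ts"] - fill["ts"]) <= MATCH_WINDOW_S
        ):
            return "cross_bracket"

    # Rule 2: cheap YES buy that follows a same-ticker NO entry.
    if (
        fill["side"] == "yes"
        and fill["action"] == "buy"
        and fill["price_cents"] <= EXIT_MAX_PRICE_CENTS
    ):
        for entry_ts in no_entry_times.get(fill["ticker"], ()):
            if 0 < fill["ts"] - entry_ts <= EXIT_LOOKBACK_S:
                return "cross_bracket_exit"

    return "manual"  # neither rule matched: leave the tag alone
```

Keeping the classifier free of database access makes a dry-run trivial: the caller can compute the new tag for every row and only issue UPDATEs when ``--apply`` is set and the tag actually changed.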
…decision

The 2026-05-12 audit's most insidious finding: Kalshi removed
client_order_id from /portfolio/fills around 2026-05-10, and the bot
silently tagged every cross_bracket fill as 'manual' for two days
before anyone noticed. Same class as the 2026-05-03 dual-format fix:
field drift on a "stable" Kalshi field, no warning, just attribution
quietly going dark.

This adds the canary that would have caught it on day one. When
FillsWriter tags a fill as 'manual', it checks alpha_backtest for a
posted decision on the same (ticker, side) within the last 10 minutes.
If found, log a loud DRIFT ALERT pointing at the audit Phase A so the
next instance of the same bug is visible the moment it lands.

Heuristic carefully tuned:

* Window: 600s. Covers post→fill latency plus retries, narrow enough
  to avoid false positives from unrelated bot+human activity on the
  same ticker hours apart.
* Same-side match required: a fill on the opposite side could be a
  hedge exit or unrelated trade.
* decision_outcome='posted' filter: shadow decisions don't matter.
* Logged at WARNING with the exact phrase "DRIFT ALERT" so log greps
  and ops alerts can pattern-match without false-positive risk.
* Lookup wrapped in try/except: a broken alert must not break the
  writer.

3 new regression tests:

- Fires when a posted decision sits within window of a manual fill
- Silent for genuine external fills (no matching decision)
- Silent for old decisions outside the 10-min window

2351 tests passing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
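A sketch of the canary check described above, assuming alpha_backtest exposes ``ticker``, ``side``, ``decision_outcome`` and a unix-seconds ``ts_unix`` column (the last one is a guess; the real column names are not given in the commit). The function name and logger name are hypothetical.

```python
import logging
import sqlite3

log = logging.getLogger("fills_writer_sketch")
DRIFT_WINDOW_S = 600  # post→fill latency plus retries


def drift_suspected(conn, ticker, side, fill_ts_unix):
    """Return True (and log a DRIFT ALERT) if a 'manual' fill matches a recent posted decision."""
    try:
        row = conn.execute(
            """SELECT 1 FROM alpha_backtest
               WHERE ticker = ? AND side = ?
                 AND decision_outcome = 'posted'
                 AND ts_unix BETWEEN ? AND ?
               LIMIT 1""",
            (ticker, side, fill_ts_unix - DRIFT_WINDOW_S, fill_ts_unix),
        ).fetchone()
    except Exception:
        # A broken alert must never break the writer.
        log.debug("drift lookup failed", exc_info=True)
        return False
    if row is None:
        return False
    log.warning(
        "DRIFT ALERT: fill on %s/%s tagged 'manual' but a posted decision "
        "exists within %ss",
        ticker, side, DRIFT_WINDOW_S,
    )
    return True
```

The fixed phrase in the WARNING line is the contract with ops tooling, so it is worth pinning in a test rather than leaving it to the formatter.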
2026-05-12 audit Phase D.1 finding: the combined-Gaussian σ that
cross_bracket scores brackets against collapses to ~1°F post-peak
(METAR fast-path tightens it once the daily high looks locked), but
the empirical actual-vs-predicted RMSE over the last 18 days is
much wider: 1.34–2.39°F across the five non-Denver weather families.

Per-family RMSE (tools/sigma_residuals.py, n=10–16 per family):

    LAX 1.34°F    MIA 1.73°F    NY  1.86°F
    AUS 1.95°F    CHI 2.39°F    DEN 6.55°F (blocked)

With σ collapsed to 1°F, cross_bracket's conviction gate
(p ≥ 0.65 either way) fires on any bracket >1.4°F from μ, but
the ACTUAL high routinely lands 1–2σ from μ at that scale. So the
strategy was confidently shorting NO on brackets that genuinely had
~30% YES probability, winning 0/29 directional bets through the
audit window.

Fix: add ``CROSS_BRACKET_FAMILY_SIGMA_FLOORS`` (per-family table,
env-overridable), and clamp σ in ``_score_one_settlement`` to
``max(combined.sigma_f, family_floor, physical_floor)``. Wider σ →
flatter bracket probabilities → conviction gate fires only on
genuine model-market disagreements, not on routine variance the
model already knows about.

Concrete impact: with σ=2.0 instead of σ=1.0 for KXHIGHNY, a bracket
that previously scored p_yes=0.005 (the conviction-gate-passing
"definitely not this bracket") now scores ~0.04: still skewed, but
no longer triggering the strategy's "I'm certain" branch.

The σ inflation also changes shadow scoring (deliberately, so future
calibration training sees the new regime), so the calibration table
will drift over the next few days as samples accumulate under the
new σ. Watch the per-family Brier in backtest_comprehensive.

KXHIGHDEN keeps its hard block in CROSS_BRACKET_BLOCKLIST: a 6.5°F
empirical floor would essentially flatten every bracket probability,
preventing the strategy from firing anyway. Block is cleaner than
"infinite σ" as a way of saying "we don't trade this".

Companion tool: tools/sigma_residuals.py recomputes the per-family
RMSE from weather_forecast_snapshots × alpha_backtest. Re-run
periodically; if the floors drift materially, update the constants
(test_cross_bracket_live_gates.py pins them and will fail loudly).

Test plan:

- test_family_sigma_floors_table_defaults: pins the constants
- test_family_sigma_floor_env_override: env var path works

2353 tests passing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
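To make the σ clamp in the D.1 commit concrete, here is a small sketch of how a floor flattens bracket probabilities. The floor table below is a placeholder: only the σ=2.0 figure for KXHIGHNY appears in the commit message, and the ``physical_floor`` default of 0.3°F, the bracket edges, and the function names are assumptions for illustration.

```python
import math

# Placeholder table; the real CROSS_BRACKET_FAMILY_SIGMA_FLOORS values are not listed in the commit.
FAMILY_SIGMA_FLOORS_F = {"KXHIGHNY": 2.0}


def clamp_sigma(sigma_f, family, physical_floor=0.3):
    """max(combined σ, per-family floor, physical floor), as described in the commit."""
    return max(sigma_f, FAMILY_SIGMA_FLOORS_F.get(family, 0.0), physical_floor)


def normal_cdf(x, mu, sigma):
    """Gaussian CDF via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def bracket_prob(lo, hi, mu, sigma):
    """P(lo <= high < hi) under N(mu, sigma^2)."""
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)


# A bracket 3-5°F above μ: compare the collapsed σ with the clamped one.
mu_f = 70.0
p_collapsed = bracket_prob(73.0, 75.0, mu_f, 1.0)
p_clamped = bracket_prob(73.0, 75.0, mu_f, clamp_sigma(1.0, "KXHIGHNY"))
```

With the wider σ, ``p_clamped`` is larger than ``p_collapsed``, so ``1 - p_yes`` moves away from certainty and a far-from-μ bracket is less likely to clear the conviction gate on model overconfidence alone.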
The 2026-05-12 audit's deepest finding: the bot's ``metar`` source
isn't actually a METAR observation channel. Its μ
(``expected_eventual_high``) blends running_high with the NWP-derived
forecast_high and a diurnal projection. On days when NWP misses (cold
fronts in DEN, heat domes in AUS), the blended μ carries a +5–15°F
bias, which the post-peak fast-path then locks in by collapsing σ to
1.0°F. Cross-bracket fires confidently against well-priced markets and
loses (5/5 KXHIGHDEN directional losses in the audit window).

Single concrete example, KXHIGHDEN-26MAY06 at decision time:

    actual METAR running max = 38°F  (cold front held the day's high)
    bot's `metar` source μ   = 48°F  (NWP forecast was 54°F, blended in)
    bot's combined_v2 μ      = 48°F, σ = 1.0°F (fast-path locked)
    actual day's high        = 38°F → bracket B38.5 won
    bot bought NO at 5¢ on B38.5 thinking p_yes = 0.005, lost.

Josh's intuition for the fix: each source should represent ONE thing
and combine_gaussian should do the weighting. The METAR channel should
carry the running max with diurnal-RMSE σ; NWP feeds 5+ other sources
(nbm/hrrr/nws_point/ecmwf/gem/icon/etc) which the combine already
weights. Removing NWP from the METAR channel eliminates the
double-counting and lets the combine widen σ when NWP and obs
disagree.

Shipped behavior here is shadow-only:

* New side-channel ``_ALT_MU_RUNNING_HIGH`` stashes (μ=running_high,
  σ=diurnal_rmse_or_2.0°F_fallback) on every cycle
* Snapshot writer emits a parallel ``metar_running_only`` row in
  weather_forecast_snapshots alongside the existing ``metar`` row
* No change to the live Gaussian path (default flag off)
* Live cutover is gated by ``WEATHER_METAR_USE_RUNNING_HIGH_ONLY=true``;
  flip after ~1 week of shadow data confirms the alt is
  better-calibrated against settled outcomes
* When flag flips on, the existing D.1 per-family σ floors (commit
  0985437) should probably come out: both target the same symptom;
  the F.4 cure is upstream and cleaner

The σ fallback (2.0°F when no diurnal fit) is wider than the existing
past-peak clamp (0.3°F) and the post-peak fast-path floor (1.0°F).
This is deliberate: claim less precision when only the running max
grounds μ, not more.

8 new regression tests pin:

- stash semantics (with/without diurnal fit, zero RMSE → fallback)
- pop semantics (prevents stale reads on a missed cycle)
- flag-off path: live μ uses blended forecast, alt still stashed
- flag-on path: live μ == running_high
- past-peak path: stashes the alt with the wider fallback σ
- default flag is False (regression guard against premature cutover)

2361 tests passing.

Follow-up plan (not landed yet):

F.5: analysis tool over weather_forecast_snapshots comparing the new
``metar_running_only`` rows vs ``metar`` rows against settled
outcomes. Run after 7 days of shadow data.

F.6: once shadow validates, flip
``WEATHER_METAR_USE_RUNNING_HIGH_ONLY`` and revert commit 0985437's
D.1 per-family σ floors.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
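The stash/pop semantics listed in the tests above could be modeled as below. This is a guess at the shape, not the shipped code: keying the side-channel by a station identifier, the function names, and passing the flag as a parameter instead of reading ``WEATHER_METAR_USE_RUNNING_HIGH_ONLY`` from the environment are all assumptions. The 2.0°F fallback and the zero-RMSE rule come from the commit message.

```python
FALLBACK_SIGMA_F = 2.0  # used when there is no diurnal fit (or the fit reports zero RMSE)

# Side-channel holding the alternative Gaussian per station (keying is assumed).
_ALT_MU_RUNNING_HIGH = {}


def stash_alt(station, running_high_f, diurnal_rmse_f=None):
    """Record (μ=running_high, σ=diurnal RMSE or fallback) for the snapshot writer."""
    if diurnal_rmse_f is not None and diurnal_rmse_f > 0:
        sigma = diurnal_rmse_f
    else:
        sigma = FALLBACK_SIGMA_F
    _ALT_MU_RUNNING_HIGH[station] = (running_high_f, sigma)


def pop_alt(station):
    """Read-and-clear so a missed cycle cannot serve a stale value."""
    return _ALT_MU_RUNNING_HIGH.pop(station, None)


def live_mu(running_high_f, blended_mu_f, use_running_high_only=False):
    """Flag off (default): blended μ stays live. Flag on: running max only."""
    return running_high_f if use_running_high_only else blended_mu_f
```

Using pop rather than a plain read means the snapshot writer emits a ``metar_running_only`` row only for cycles that actually produced one, at the cost of needing exactly one consumer per cycle.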
Full work from the 2026-05-12 audit thread. Seven commits, all deployed live and observed working. Re-homing to ijlu/autoagent (was inadvertently opened against kevinrgu/autoagent as cross-fork kevinrgu#12).
Commits
Headline findings
- Kalshi removed ``client_order_id`` from /portfolio/fills around 2026-05-10. Every fill since was tagging as ``manual``, silently breaking per-strategy attribution. Same field-drift class as the 2026-05-03 ``count_fp`` / ``*_price_dollars`` fix. Recovered via local ``posted_orders`` writer + fall-back lookup in fills_writer.
- Kalshi's /portfolio/settlements has been returning ``revenue=0`` for valid winning settlements since at least 2026-04-12. Hedged cross_bracket positions (1 YES + 1 NO) were silently reported as ~$0.90 losses when they were actually ~$0.10 wins. Cross_bracket's headline -$13.39 P&L was closer to -$3.35 once hedge accounting was honest.
- … ``metar_running_only`` row alongside the live ``metar`` row for offline calibration comparison.
- … (``test_client_order_id_coverage.py`` was missing cross_bracket files from its ``FILES_THAT_POST_ORDERS`` list; that's why this class of bug shipped).

Backfill applied to live DB

- ``tools/backfill_hedge_settlements.py --apply`` → 14 historical settlement rows corrected (+$14.00 P&L on cross_bracket's historical total)
- ``tools/backfill_fills_source.py --apply`` → 26 cross_bracket + 5 cross_bracket_exit fills re-tagged from ``manual``

What is intentionally NOT in this PR (calendar-blocked)

- … ``metar`` vs ``metar_running_only`` Brier scores against settled outcomes. Run after ~3 days of shadow data.
- Flip ``WEATHER_METAR_USE_RUNNING_HIGH_ONLY=true`` and revert commit 0985437 (per-family σ floors), conditional on F.5 outcome.

Test plan

- ``pytest tests/``: 2,361 passing, 2 unrelated env-skipped (cryptography ABI on local Mac, no_secrets_in_repo)
- ``bash deploy/04_redeploy.sh 45.55.79.193``: daemon healthy
- ``metar_running_only`` shadow rows emit at 1:1 with ``combined_v2``
- ``cross_bracket_live:KXHIGHDEN`` kv flipped to ``false`` on prod

🤖 Generated with Claude Code