
2026-05-12 audit: attribution + hedge + KXHIGHDEN block + σ residuals + μ-running-high shadow#3

Merged
ijlu merged 7 commits into main from claude/sleepy-hofstadter-1d66bc
May 12, 2026

Conversation

@ijlu ijlu commented May 12, 2026

Full work from the 2026-05-12 audit thread. Seven commits, all deployed live and observed working. Re-homing to ijlu/autoagent (was inadvertently opened against kevinrgu/autoagent as cross-fork kevinrgu#12).

Commits

commit summary
433ceb7 fills_writer: recover client_order_id via posted_orders on Kalshi field drift
b4bf869 record_settlements: derive revenue locally; Kalshi /settlements returns 0
17ce086 cross_bracket: hard-block KXHIGHDEN until σ inflation lands
27fe563 tools: one-time backfill scripts for hedge settlements + fills source
deb4d07 fills_writer: drift-alert when manual fill matches a recently-posted decision
0985437 cross_bracket: per-family σ floor from empirical residuals (D.1)
07339bc metar_observations: shadow-log running-high-only μ alternative (F.4)

Headline findings

  1. Kalshi removed client_order_id from /portfolio/fills around 2026-05-10. Every fill since was tagging as manual, silently breaking per-strategy attribution. Same field-drift class as the 2026-05-03 count_fp / *_price_dollars fix. Recovered via local posted_orders writer + fall-back lookup in fills_writer.
  2. Kalshi /portfolio/settlements has been returning revenue=0 for valid winning settlements since at least 2026-04-12. Hedged cross_bracket positions (1 YES + 1 NO) were silently reported as ~$0.90 losses when they were actually ~$0.10 wins. Cross_bracket's headline -$13.39 P&L was closer to -$3.35 once hedge accounting was honest.
  3. The METAR Gaussian's μ blends NWP forecast_high into running_high, contaminating the observation channel. On days when NWP misses (cold fronts in DEN, heat domes in AUS), μ carries a +5–15°F bias, which the post-peak fast-path then locks in at σ=1.0°F. Result: 5/5 KXHIGHDEN directional cross_bracket losses in the audit window. The shadow-log path (commit 07339bc) emits a metar_running_only row alongside the live metar row for offline calibration comparison.
  4. A structural-invariant test was not enforcing the rule it claimed to enforce (test_client_order_id_coverage.py was missing cross_bracket files from its FILES_THAT_POST_ORDERS list — that's why this class of bug shipped).

Backfill applied to live DB

  • tools/backfill_hedge_settlements.py --apply → 14 historical settlement rows corrected (+$14.00 P&L on cross_bracket's historical total)
  • tools/backfill_fills_source.py --apply → 26 cross_bracket + 5 cross_bracket_exit fills re-tagged from manual

What is intentionally NOT in this PR (calendar-blocked)

  • F.5 — analysis tool comparing metar vs metar_running_only Brier scores against settled outcomes. Run after ~3 days of shadow data.
  • F.6 — flip WEATHER_METAR_USE_RUNNING_HIGH_ONLY=true and revert commit 0985437 (per-family σ floors), conditional on F.5 outcome.

Test plan

  • pytest tests/ — 2,361 passing, 2 unrelated env-skipped (cryptography ABI on local Mac, no_secrets_in_repo)
  • Deploy to VPS via bash deploy/04_redeploy.sh 45.55.79.193 — daemon healthy
  • Verified posted_orders writer fires on first new cross_bracket POST after deploy
  • Verified metar_running_only shadow rows emit at 1:1 with combined_v2
  • cross_bracket_live:KXHIGHDEN kv flipped to false on prod
  • After ~3 days: run F.5 analysis, decide on F.6

🤖 Generated with Claude Code

ijlu and others added 7 commits May 12, 2026 10:06
…ld drift

The 2026-05-12 audit caught a second instance of the same Kalshi format-
drift pattern that triggered the 2026-05-03 dual-format fix (commit
6a39dd7). Kalshi's /portfolio/fills response stopped echoing back
``client_order_id`` around 2026-05-10:

    Old keys: action, count_fp, created_time, fee_cost, fill_id,
              is_taker, market_ticker, no_price_dollars, order_id,
              side, subaccount_number, ticker, trade_id, ts,
              yes_price_dollars
    Missing:  client_order_id

Without it, ``default_source_tagger`` returns ``manual`` for every fill,
silently breaking per-strategy attribution: the
``weather_mm_shadow.live_pnl_cents`` back-fill joins zero rows,
``mm_promotion`` graduation can never fire (CANARY becomes terminal), and
``backtest_comprehensive.py`` strategy slices are wrong for everything
since May 11. The 2026-05-12
audit found the bot had logged 9 cross_bracket_live POSTED decisions on
May 11 against 0 ``cross_bracket`` source rows in ``fills_ledger`` —
the 12 corresponding fills had all collapsed into ``source='manual'``.

Recovery: record ``(order_id, client_order_id, source_hint)`` at post
time in ``posted_orders``, fall back to that lookup at fill-ingest time
when Kalshi's payload omits client_order_id.
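
The recovery path above can be sketched roughly as follows. This is a minimal illustration, not the code in bot/daemon/fills_writer.py: the column list, the positional signature, and the ``resolve_client_order_id`` helper name are assumptions made for the example, and the real helper also takes ``DB_WRITE_LOCK``, which is omitted here.

```python
import sqlite3
import time

# Hypothetical minimal schema; the real posted_orders table may carry more columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS posted_orders (
    order_id        TEXT PRIMARY KEY,
    client_order_id TEXT NOT NULL,
    source_hint     TEXT NOT NULL,
    posted_ts_unix  INTEGER NOT NULL
)
"""


def record_posted_order(conn, order_id, client_order_id, source_hint,
                        posted_ts_unix=None):
    """Remember which client_order_id was sent for an exchange order_id.

    Returns True if a new row was written, False if the row was refused
    (partial data) or already present (INSERT OR IGNORE made it a no-op).
    """
    if not (order_id and client_order_id and source_hint):
        return False  # refuse partial rows
    ts = int(time.time()) if posted_ts_unix is None else posted_ts_unix
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO posted_orders VALUES (?, ?, ?, ?)",
            (order_id, client_order_id, source_hint, ts),
        )
    return cur.rowcount == 1


def resolve_client_order_id(conn, fill):
    """Prefer the id echoed in the fill payload; otherwise look it up locally."""
    cid = fill.get("client_order_id")
    if cid:
        return cid
    row = conn.execute(
        "SELECT client_order_id FROM posted_orders WHERE order_id = ?",
        (fill.get("order_id"),),
    ).fetchone()
    return row[0] if row else None
```

Keying the table on ``order_id`` is what makes the fallback possible: the fill payload still carries ``order_id`` even though ``client_order_id`` is gone.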

Changes:
* bot/db.py: ``posted_orders`` table definition with order_id PK and
  indexes on posted_ts_unix and client_order_id. The table existed on
  prod (30 historical rows from an aborted backfill on May 7–10) but
  was never tracked in the schema. CREATE TABLE IF NOT EXISTS is
  idempotent against the existing prod table.
* bot/daemon/fills_writer.py: ``record_posted_order(conn, **kwargs)``
  helper, idempotent via INSERT OR IGNORE on order_id, holds
  ``DB_WRITE_LOCK`` via db_write_ctx. ``ingest_page`` falls back to
  posted_orders lookup when fill.client_order_id is missing. Adds
  ``cross_bracket`` and ``cross_bracket_exit`` to ALLOWED_SOURCES and
  routes ``mm_xb_*`` / ``mm_xb_exit_*`` prefixes in default_source_tagger
  (was collapsing to ``legacy``).
* bot/daemon/cross_bracket_shadow.py: call record_posted_order after
  successful api_post in _post_live_order.
* bot/daemon/cross_bracket_exit.py: same wiring + rename exit prefix
  from ``mm_xbexit_`` to ``mm_xb_exit_`` so the tagger can distinguish
  exits from entries. has_pending_exit accepts both prefixes for
  in-flight orders.
* tests/test_client_order_id_coverage.py: structural-invariant test
  now covers cross_bracket_shadow and cross_bracket_exit (was missing
  them — which is why this bug shipped silently). Allowed prefixes
  expanded with ``mm_xb_exit_`` and ``mm_xb_``.
* tests/test_posted_orders_fallback.py: new — pins (a) tagger routing
  for both cross_bracket prefixes, (b) record_posted_order
  idempotence + partial-row refusal, (c) end-to-end recovery when
  Kalshi omits client_order_id from the fill payload, (d) external
  fills still fall through to ``manual`` correctly.

Going forward the daemon recovers attribution automatically. A
follow-up offline backfill is required to repair existing May 11+
``fills_ledger`` rows that already tagged ``manual`` (out of scope
here — the live trading path is fixed but historical data is not).

2327 tests passing (2 skipped, both unrelated and env-blocked:
cryptography ABI on local Mac, and no_secrets_in_repo).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ns 0

Third instance of the Kalshi field-drift pattern (after 2026-05-03
count_fp / *_price_dollars and 2026-05-12 client_order_id removal).

Kalshi's /portfolio/settlements endpoint has been intermittently
returning ``revenue=0`` for valid winning settlements since at least
2026-04-12, and as of 2026-05-12 returns 0 for every settlement.
Verified against live API on 2026-05-12: every settlement in the
500-row page reports ``revenue: 0`` regardless of contracts held or
market_result. The bot's record_settlements() used this field directly
in ``profit = revenue - cost - fees``, so:

* Pure winners with no hedge: profit = 0 - cost - fees, so a win is
  recorded as a total loss of the stake. For directional buys at low
  prices the recorded loss is small in dollar terms, but in every case
  profit is understated by the full $1.00 payout per winning contract.
* Hedged positions (cross_bracket_exit pattern: 1 YES + 1 NO on same
  ticker): the hedge GUARANTEES a $1 payout on the winning leg
  regardless of outcome. ``revenue=0`` made each hedge show as a
  ~$0.90 phantom loss when it was actually a ~$0.10 win. The
  2026-05-12 audit found 13 cross-bracket hedged positions all
  mis-reported this way; the strategy's headline -$13.39 P&L was
  actually closer to -$3.35 after correcting hedge accounting.

Fix: compute revenue locally via ``settlement_revenue_cents(yes_count,
no_count, market_result)`` in bot/core/money.py. The formula is the
canonical Kalshi binary contract: each winning-side contract pays $1.00,
losing-side contracts pay $0. Identical for pure and hedged positions.
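
A minimal sketch of that formula, assuming the function signature named above. The handling of unknown results (pay nothing) and fractional counts (round to the nearest cent) is a guess at what the real bot/core/money.py does, and ``profit_cents`` is a hypothetical helper that restates ``profit = revenue - cost - fees``.

```python
def settlement_revenue_cents(yes_count, no_count, market_result):
    """Revenue in cents for a settled binary market.

    Each contract on the winning side pays 100 cents; the losing side pays 0.
    Assumption: an unknown or missing market_result yields 0 revenue.
    """
    result = (market_result or "").strip().lower()
    if result == "yes":
        winners = yes_count
    elif result == "no":
        winners = no_count
    else:
        return 0
    return int(round(winners * 100))


def profit_cents(revenue, cost, fees):
    """Same shape as the record_settlements formula: revenue - cost - fees."""
    return revenue - cost - fees


# Balanced hedge (1 YES + 1 NO): exactly one leg wins under either outcome.
# The 90-cent combined cost is an illustrative figure, not audit data.
hedge_cost = 90
honest = profit_cents(settlement_revenue_cents(1, 1, "no"), hedge_cost, 0)
phantom = profit_cents(0, hedge_cost, 0)  # what revenue=0 from the API implied
```

With these illustrative numbers ``honest`` is a 10-cent win and ``phantom`` a 90-cent loss, which is the shape of the mis-report described above.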

This is BUG #5 in the test_money.py regression watchlist (added).

11 new tests covering: pure YES/NO winner+loser, balanced hedge under
both outcomes, asymmetric hedge, zero position, unknown result,
fractional-count rounding, and the exact record_settlements call
shape with a Kalshi payload that reports ``revenue=0``.

2339 tests passing (2 skipped: cryptography ABI on local Mac and
no_secrets_in_repo, both unrelated).

Going-forward correctness only — historical settlements in the DB
remain mis-reported (separate one-time backfill needed, out of scope
here). The fix kicks in for every settlement recorded after deploy.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-12 audit Phase C.3 finding: the combined-Gaussian σ for KXHIGHDEN
is 1.4–4.2°F across recent decisions, while actual high RMSE was 11–12°F
across 5 directional losses on May 6 (actual 38.5, model 49.5, σ=1.38) and
May 11 (actual 85.5, model 73.0, σ=4.23). These are roughly 8σ and 3σ misses; 5/5
KXHIGHDEN directional cross-bracket bets resolved against us.

KXHIGHDEN is already blocked in DIRECTIONAL_BLOCKLIST and MM_BLOCKED_SERIES
for the same reason (Brier 0.316 vs other weather families ≈ 0.10). The
cross-bracket loop was the remaining gap; this closes it.

Changes:
* bot/config.py: ``CROSS_BRACKET_BLOCKLIST`` (env-overridable, default
  ``KXHIGHDEN``) — frozenset of family prefixes that are hard-banned
  regardless of kv-cache toggles. Distinct from the canary kv toggle
  (``cross_bracket_live:<family>``): blocklist is for known-broken
  families, kv is for staged rollout / temporary pauses.
* bot/daemon/cross_bracket_shadow.py: ``_is_family_live`` consults the
  blocklist before the kv lookup. Short-circuits to False for
  KXHIGHDEN even if env=true and kv=true (defends against accidental
  re-arm).
* bot/daemon/main.py: ``_run_cross_bracket_rearm`` explicitly writes
  False for blocked families so a leftover True from a prior arm
  decays out immediately rather than waiting on the 24h TTL. Log
  message now reports ``armed=`` / ``blocked=`` lists.
* tests/test_cross_bracket_live_gates.py: two new regression tests
  (blocklist overrides kv-true; case-insensitive family compare).
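
The gate ordering described in these changes might look like the sketch below. It is illustrative only: the comma-separated env format, the ``load_blocklist`` / ``is_family_live`` names, and the ``kv_lookup`` callable are assumptions standing in for the real config and kv-cache code.

```python
import os


def load_blocklist(env=None):
    """Parse CROSS_BRACKET_BLOCKLIST into an upper-cased frozenset.

    Assumption: the env value is a comma-separated list of family prefixes.
    """
    env = os.environ if env is None else env
    raw = env.get("CROSS_BRACKET_BLOCKLIST", "KXHIGHDEN")
    return frozenset(p.strip().upper() for p in raw.split(",") if p.strip())


def is_family_live(family, blocklist, kv_lookup):
    """Blocklist first, kv toggle second.

    kv_lookup is a hypothetical callable returning the cached string value
    for a key, or None when the key is absent or expired.
    """
    if family.upper() in blocklist:
        return False  # hard ban wins even if the kv toggle says true
    return kv_lookup(f"cross_bracket_live:{family}") == "true"
```

Checking the blocklist before the kv lookup means a stale or accidental ``true`` in the cache cannot re-arm a blocked family.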

VPS kv override applied alongside deploy:
  UPDATE kv_cache SET value='false', expires_at=now+86400
   WHERE key='cross_bracket_live:KXHIGHDEN'

Plan is to remove KXHIGHDEN from the blocklist once σ inflation lands
(Phase D.1 — per-family RMS-of-residuals as a floor on the Gaussian
σ used at cross-bracket decision time).

2341 tests passing (2 unrelated skipped).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two companion scripts for the 2026-05-12 audit aftermath. Going-forward
correctness was landed in 433ceb7 (posted_orders writer) and b4bf869
(settlement_revenue_cents). These tools sweep up the historical data.

tools/backfill_hedge_settlements.py — recompute settlements rows
that were written before BUG #5 was fixed:
  - For every settlement, look up the bot's (yes_qty, no_qty, fees)
    from fills_ledger and the authoritative market_result from
    alpha_backtest.
  - Recompute revenue via settlement_revenue_cents, then profit and
    won. UPDATE only if any field differs.
  - Idempotent re-run via field-equality check.
  - Dry-run by default; --apply writes.
  - Applied live on VPS: 14 rows corrected. Net P&L correction
    +$14.00 on historical cross_bracket totals (the 13 hedged winners
    that were silently bleeding $1 each, plus one asymmetric hedge
    that goes from -$1.21 to -$0.21).

tools/backfill_fills_source.py — re-tag fills_ledger rows that
landed as ``source='manual'`` because posted_orders wasn't being
written during the May 11+ window:
  - Rule 1: match by (ticker, side, ts ±60s) against
    alpha_backtest cross_bracket_live posted decisions →
    ``source='cross_bracket'``.
  - Rule 2: YES buy at ≤15¢ following a same-ticker NO entry within
    12h → ``source='cross_bracket_exit'`` (heuristic; the exit code
    path doesn't log to alpha_backtest, but the price+timing is
    diagnostic).
  - Anything that doesn't match either rule stays as ``manual``.
  - Applied live on VPS: 26 cross_bracket + 5 cross_bracket_exit
    recovered. 3 left as manual (likely real human-placed fills).
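
The two re-tagging rules can be sketched as a pure function over in-memory rows. The dict keys (``ticker``, ``side``, ``action``, ``ts``, ``price_cents``) and the ``retag_fill`` name are hypothetical; the real script reads fills_ledger and alpha_backtest directly.

```python
RULE1_WINDOW_S = 60            # ±60s match against posted decisions
RULE2_WINDOW_S = 12 * 3600     # exit must follow a NO entry within 12h
RULE2_MAX_PRICE_CENTS = 15     # YES buy at <= 15 cents


def retag_fill(fill, posted_decisions, no_entries):
    """Return the recovered source for a fill currently tagged 'manual'."""
    # Rule 1: same ticker and side, timestamps within 60 seconds.
    for d in posted_decisions:
        if (d["ticker"] == fill["ticker"]
                and d["side"] == fill["side"]
                and abs(d["ts"] - fill["ts"]) <= RULE1_WINDOW_S):
            return "cross_bracket"
    # Rule 2: cheap YES buy after a same-ticker NO entry (heuristic).
    if (fill["side"] == "yes" and fill["action"] == "buy"
            and fill["price_cents"] <= RULE2_MAX_PRICE_CENTS):
        for e in no_entries:
            if (e["ticker"] == fill["ticker"]
                    and 0 < fill["ts"] - e["ts"] <= RULE2_WINDOW_S):
                return "cross_bracket_exit"
    # Anything else is left alone.
    return "manual"
```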

7-test regression suite for the hedge-settlements backfill covering:
hedged winner correction, hedged loser with corrected (still-negative)
profit, pure winner, pure loser unchanged (rows already correct stay
that way), no-bot-fills rows skipped, dry-run writes nothing,
idempotent re-apply.

2348 tests passing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…decision

The 2026-05-12 audit's most insidious finding: Kalshi removed
client_order_id from /portfolio/fills around 2026-05-10, and the bot
silently tagged every cross_bracket fill as 'manual' for two days
before anyone noticed. Same class as the 2026-05-03 dual-format fix —
field drift on a "stable" Kalshi field, no warning, just attribution
quietly going dark.

This adds the canary that would have caught it on day one. When
FillsWriter tags a fill as 'manual', it checks alpha_backtest for a
posted decision on the same (ticker, side) within the last 10 minutes.
If found, log a loud DRIFT ALERT pointing at the audit Phase A so the
next instance of the same bug is visible the moment it lands.

Heuristic carefully tuned:
* Window: 600s — covers post→fill latency plus retries, narrow enough
  to avoid false positives from unrelated bot+human activity on the
  same ticker hours apart.
* Same-side match required — a fill on the opposite side could be a
  hedge exit or unrelated trade.
* decision_outcome='posted' filter — shadow decisions don't matter.
* Logged at WARNING with the exact phrase "DRIFT ALERT" so log greps
  and ops alerts can pattern-match without false-positive risk.
* Lookup wrapped in try/except — a broken alert must not break the
  writer.
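
A rough sketch of the check, assuming a SQLite connection and a ``decision_ts_unix`` column on alpha_backtest (the column name and the ``maybe_drift_alert`` function name are hypothetical; ``decision_outcome='posted'`` and the 600s window come from the list above).

```python
import logging
import sqlite3

log = logging.getLogger("fills_writer")
DRIFT_WINDOW_S = 600


def maybe_drift_alert(conn, ticker, side, fill_ts_unix):
    """Warn if a 'manual' fill lines up with a decision the bot just posted.

    Returns True when the alert fired. Never raises on a lookup failure.
    """
    try:
        row = conn.execute(
            """
            SELECT decision_ts_unix FROM alpha_backtest
             WHERE ticker = ? AND side = ?
               AND decision_outcome = 'posted'
               AND decision_ts_unix BETWEEN ? AND ?
             LIMIT 1
            """,
            (ticker, side, fill_ts_unix - DRIFT_WINDOW_S, fill_ts_unix),
        ).fetchone()
    except sqlite3.Error:
        # A broken alert must not break the writer.
        log.exception("drift-alert lookup failed; continuing")
        return False
    if row is None:
        return False
    log.warning(
        "DRIFT ALERT: manual fill on %s %s matches a posted decision %ds earlier",
        ticker, side, fill_ts_unix - row[0],
    )
    return True
```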

3 new regression tests:
  - Fires when a posted decision sits within window of a manual fill
  - Silent for genuine external fills (no matching decision)
  - Silent for old decisions outside the 10-min window

2351 tests passing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-12 audit Phase D.1 finding: the combined-Gaussian σ that
cross_bracket scores brackets against collapses to ~1°F post-peak
(METAR fast-path tightens it once the daily high looks locked), but
the empirical actual-vs-predicted RMSE over the last 18 days is
much wider — 1.34–2.39°F across the five non-Denver weather families.

Per-family RMSE (tools/sigma_residuals.py, n=10–16 per family):
    LAX   1.34°F    MIA   1.73°F    NY    1.86°F
    AUS   1.95°F    CHI   2.39°F    DEN   6.55°F (blocked)

With σ collapsed to 1°F, cross_bracket's conviction gate
(p ≥ 0.65 either way) fires on any bracket >1.4°F from μ — but
the ACTUAL high routinely lands 1–2σ from μ at that scale. So the
strategy was confidently buying NO on brackets that genuinely had
~30% YES probability, winning 0 of 29 directional bets through the
audit window.

Fix: add ``CROSS_BRACKET_FAMILY_SIGMA_FLOORS`` (per-family table,
env-overridable), and clamp σ in ``_score_one_settlement`` to
``max(combined.sigma_f, family_floor, physical_floor)``. Wider σ →
flatter bracket probabilities → conviction gate fires only on
genuine model-market disagreements, not on routine variance the
model already knows about.
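
To make the clamp concrete, here is a small sketch of the ``max(...)`` floor and its effect on a bracket probability. The table entries reuse RMSE figures from this message under assumed ``KXHIGH*`` family keys, the 0.3°F physical floor is an assumed value, and ``bracket_p_yes`` is a plain Gaussian-CDF difference that may not match the scoring in ``_score_one_settlement``.

```python
import math

# Illustrative floors only; the shipped constants may be rounded differently.
FAMILY_SIGMA_FLOORS_F = {"KXHIGHNY": 1.86, "KXHIGHCHI": 2.39}


def clamp_sigma(combined_sigma_f, family, physical_floor_f=0.3):
    """sigma = max(model sigma, per-family empirical floor, physical floor)."""
    family_floor = FAMILY_SIGMA_FLOORS_F.get(family, 0.0)
    return max(combined_sigma_f, family_floor, physical_floor_f)


def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def bracket_p_yes(lo_f, hi_f, mu, sigma):
    """Probability that the daily high lands in [lo_f, hi_f)."""
    return normal_cdf(hi_f, mu, sigma) - normal_cdf(lo_f, mu, sigma)


# A bracket 3-5 degrees above mu: the tail probability grows as sigma widens.
p_tight = bracket_p_yes(73.0, 75.0, mu=70.0, sigma=1.0)
p_wide = bracket_p_yes(73.0, 75.0, mu=70.0, sigma=clamp_sigma(1.0, "KXHIGHNY"))
```

Here ``p_wide`` exceeds ``p_tight``, so a bracket that looked near-impossible at σ=1°F becomes less extreme once the floor applies.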

Concrete impact: with σ=2.0 instead of σ=1.0 for KXHIGHNY, a bracket
that previously scored p_yes=0.005 (the conviction-gate-passing
"definitely not this bracket") now scores ~0.04: still skewed, but
no longer triggering the strategy's "I'm certain" branch.

The σ inflation also changes shadow scoring (deliberate — future
calibration training sees the new regime), so the calibration table
will drift over the next few days as samples accumulate under the
new σ. Watch the per-family Brier in backtest_comprehensive.

KXHIGHDEN keeps its hard block in CROSS_BRACKET_BLOCKLIST: 6.5°F
empirical floor would essentially flatten every bracket probability,
preventing the strategy from firing anyway. Block is cleaner than
"infinite σ" as a way of saying "we don't trade this".

Companion tool: tools/sigma_residuals.py — recomputes the per-family
RMSE from weather_forecast_snapshots × alpha_backtest. Re-run
periodically; if the floors drift materially, update the constants
(test_cross_bracket_live_gates.py pins them and will fail loudly).
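
For reference, the per-family RMSE is the usual root-mean-square of (actual - predicted). A minimal sketch, assuming the join has already produced ``(family, predicted_high_f, actual_high_f)`` tuples; the ``rmse_by_family`` name and the tuple shape are hypothetical and not taken from tools/sigma_residuals.py.

```python
import math
from collections import defaultdict


def rmse_by_family(rows):
    """rows: iterable of (family, predicted_high_f, actual_high_f)."""
    squared_errors = defaultdict(list)
    for family, predicted_f, actual_f in rows:
        squared_errors[family].append((actual_f - predicted_f) ** 2)
    return {
        family: math.sqrt(sum(errs) / len(errs))
        for family, errs in squared_errors.items()
    }
```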

Test plan:
  - test_family_sigma_floors_table_defaults — pins the constants
  - test_family_sigma_floor_env_override — env var path works

2353 tests passing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 2026-05-12 audit's deepest finding: the bot's ``metar`` source
isn't actually a METAR observation channel — its μ
(``expected_eventual_high``) blends running_high with the NWP-derived
forecast_high and a diurnal projection. On days when NWP misses (cold
fronts in DEN, heat domes in AUS), the blended μ carries a +5–15°F
bias, which the post-peak fast-path then locks in by collapsing σ to
1.0°F. Cross-bracket fires confidently against well-priced markets
and loses (5/5 KXHIGHDEN directional losses in the audit window).

Single concrete example — KXHIGHDEN-26MAY06 at decision time:
  actual METAR running max  = 38°F (cold front held the day's high)
  bot's `metar` source μ    = 48°F (NWP forecast was 54°F, blended in)
  bot's combined_v2 μ       = 48°F, σ = 1.0°F (fast-path locked)
  actual day's high         = 38°F → bracket B38.5 won
  bot bought NO at 5¢ on B38.5 thinking p_yes = 0.005, lost.

Josh's intuition for the fix: each source should represent ONE thing
and combine_gaussian should do the weighting. The METAR channel
should carry the running max with diurnal-RMSE σ; NWP feeds 5+ other
sources (nbm/hrrr/nws_point/ecmwf/gem/icon/etc) which the combine
already weights. Removing NWP from the METAR channel eliminates the
double-counting and lets the combine widen σ when NWP and obs
disagree.

Shipped behavior here is shadow-only:
* New side-channel ``_ALT_MU_RUNNING_HIGH`` stashes
  (μ=running_high, σ=diurnal_rmse_or_2.0°F_fallback) on every cycle
* Snapshot writer emits a parallel ``metar_running_only`` row in
  weather_forecast_snapshots alongside the existing ``metar`` row
* No change to the live Gaussian path (default flag off)
* Live cutover is gated by ``WEATHER_METAR_USE_RUNNING_HIGH_ONLY=true``
  — flip after ~1 week of shadow data confirms the alt is better-
  calibrated against settled outcomes
* When flag flips on, the existing D.1 per-family σ floors
  (commit 0985437) should probably come out — both target the same
  symptom; the F.4 cure is upstream and cleaner

The σ fallback (2.0°F when no diurnal fit) is wider than the existing
past-peak clamp (0.3°F) and the post-peak fast-path floor (1.0°F).
This is deliberate — claim less precision when only the running max
grounds μ, not more.
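
The stash/pop semantics can be sketched as below. Keying the side-channel by station and the ``stash_alt`` / ``pop_alt`` names are assumptions for illustration; only ``_ALT_MU_RUNNING_HIGH`` and the 2.0°F fallback come from this message.

```python
_ALT_MU_RUNNING_HIGH = {}
FALLBACK_SIGMA_F = 2.0


def stash_alt(station, running_high_f, diurnal_rmse_f=None):
    """Record (mu, sigma) for the running-high-only alternative.

    A missing or zero diurnal RMSE falls back to the wider 2.0 F sigma.
    """
    has_fit = diurnal_rmse_f is not None and diurnal_rmse_f > 0
    sigma = diurnal_rmse_f if has_fit else FALLBACK_SIGMA_F
    _ALT_MU_RUNNING_HIGH[station] = (running_high_f, sigma)


def pop_alt(station):
    """Consume the stashed value so a missed cycle cannot serve a stale read."""
    return _ALT_MU_RUNNING_HIGH.pop(station, None)
```

Popping rather than reading means the snapshot writer sees ``None`` if the current cycle never stashed a value, instead of silently re-emitting the previous one.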

8 new regression tests pin:
- stash semantics (with/without diurnal fit, zero RMSE → fallback)
- pop semantics (prevents stale reads on a missed cycle)
- flag-off path: live μ uses blended forecast, alt still stashed
- flag-on path: live μ == running_high
- past-peak path: stashes the alt with the wider fallback σ
- default flag is False (regression guard against premature cutover)

2361 tests passing.

Follow-up plan (not landed yet):
  F.5 — analysis tool over weather_forecast_snapshots comparing the
        new ``metar_running_only`` rows vs ``metar`` rows against
        settled outcomes. Run after 7 days of shadow data.
  F.6 — once shadow validates, flip ``WEATHER_METAR_USE_RUNNING_HIGH_ONLY``
        and revert commit 0985437's D.1 per-family σ floors.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ijlu ijlu merged commit 59f8de4 into main May 12, 2026
@ijlu ijlu deleted the claude/sleepy-hofstadter-1d66bc branch May 12, 2026 22:53