l2-regime: rolling-RV regime filter, OOS-verified 2× IC uplift #236

Merged
neuron7xLab merged 3 commits into main from l2-regime-calibration on Apr 17, 2026

@neuron7xLab
Owner

Summary

Recursive + cyclic + rolling-walk-forward analysis of the collected L2 substrate (5h14m) revealed that the Ricci κ_min cross-sectional edge is intermittent, not uniform: some time blocks emit IC > +0.18, others invert to IC < -0.10. The full-window PROCEED verdict averaged over these blocks, hiding the inversion.

This PR introduces the regime filter that rescues the edge:

  • Discriminator: rolling realized volatility (Spearman ρ = +0.352, p = 0.008 on 56 rolling windows)
  • Mechanism: when market is quiet (low RV), OFI → 0 → Ricci → noise; when market is active, OFI drives observable moves → Ricci has structural content to score
  • OOS verification (threshold calibrated on first half, applied to second half, zero information leakage):
                         frac_on   IC_signal   Residual IC
    TEST unconditional   100.0 %   +0.116      +0.123
    TEST q50 thr←train    43.9 %   +0.202      +0.193
    TEST q75 thr←train    36.3 %   +0.236      +0.233

2.03× IC uplift with threshold learned on first half generalizing cleanly to second half.
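
The leakage-free protocol above can be sketched as follows. This is a minimal illustration, not the code in scripts/l2_regime_oos.py: `oos_mask` is a hypothetical helper, and it assumes the regime score is a 1-D array aligned with the substrate rows.

```python
import numpy as np


def oos_mask(score: np.ndarray, quantile: float) -> tuple[float, np.ndarray]:
    """Calibrate a threshold on the first half, apply it to the second half."""
    half = score.size // 2
    train, test = score[:half], score[half:]
    # The threshold sees only training rows; NaN warmup rows are ignored.
    threshold = float(np.nanquantile(train, quantile))
    # NaN scores compare False, so they fall outside the regime.
    return threshold, test >= threshold


rng = np.random.default_rng(0)
score = np.abs(rng.normal(size=1000))  # synthetic stand-in for a regime score
thr, mask = oos_mask(score, 0.75)
frac_on = mask.mean()  # fraction of test rows inside the regime
```

Because the threshold comes from the train half, `frac_on` on the test half need not equal `1 - quantile`, which is consistent with the 43.9 % and 36.3 % reported above for q50 and q75.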

Architecture (AE-compliant — elimination over addition)

New public surface in research/microstructure/regime.py:

rolling_corr_regime(features, window_rows=300) -> NDArray[float64]
rolling_rv_regime(features, window_rows=300)   -> NDArray[float64]  # primary
regime_mask_from_score(score, threshold)       -> NDArray[bool]
regime_mask_from_quantile(score, quantile)     -> NDArray[bool]

One optional parameter added to existing gate:

run_killtest(..., regime_mask: NDArray[bool] | None = None) -> GateVerdict

None → identical behavior to before. No new dataclasses, no new CLI flags (yet — they'll land when the regime filter earns its place in the permanent gate).
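
For orientation, here is a rough sketch of what a rolling realized-volatility score and a quantile mask could look like. It is not the code in regime.py: the `_sketch` functions are hypothetical, and the input is assumed to be a plain (n_rows, n_symbols) mid-price array rather than the project's `features` object.

```python
import numpy as np


def rolling_rv_sketch(mid: np.ndarray, window_rows: int = 300) -> np.ndarray:
    """Rolling realized volatility, averaged across symbols.

    mid: (n_rows, n_symbols) mid prices (assumed input layout).
    Returns (n_rows,) with NaN on the first `window_rows` warmup rows.
    """
    ret = np.diff(np.log(mid), axis=0)  # log returns, (n_rows - 1, n_symbols)
    # Pad a zero row so squared returns align with price rows.
    sq = np.vstack([np.zeros((1, mid.shape[1])), ret**2])
    csum = np.cumsum(sq, axis=0)
    out = np.full(mid.shape[0], np.nan)
    # Sum of squared returns over the trailing window via cumulative sums.
    win = csum[window_rows:] - csum[:-window_rows]
    out[window_rows:] = np.sqrt(win).mean(axis=1)
    return out


def mask_from_quantile_sketch(score: np.ndarray, quantile: float) -> np.ndarray:
    """True where the score is at or above its own quantile; NaN rows are False."""
    return score >= np.nanquantile(score, quantile)
```

The cumulative-sum form keeps the window computation vectorized; a pandas rolling call would be an equally reasonable choice.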

Analysis scripts (diagnostics, not production)

  • scripts/l2_killtest_recursive.py — depth-first bisection (full → halves → quarters → octiles) + cyclic K=8 blocks. Exposes regime structure.
  • scripts/l2_regime_analysis.py — per-block regime features + IC rank correlation.
  • scripts/l2_walk_forward.py — 56 rolling 40-min windows with 5-min step, Spearman feature→IC correlations.
  • scripts/l2_regime_conditional.py — in-sample conditional gate at multiple quantile thresholds × window sizes.
  • scripts/l2_regime_oos.py — true OOS, threshold on train, applied to test.
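
The rolling-window layout and the feature→IC rank correlation used by the walk-forward script can be illustrated as below. This is a sketch with hypothetical helper names. The row cadence of the substrate is not stated here, so window and step are given in rows rather than minutes.

```python
import numpy as np


def rolling_windows(n_rows: int, window: int, step: int) -> list[tuple[int, int]]:
    """Half-open [start, stop) row ranges for a rolling window."""
    return [(s, s + window) for s in range(0, n_rows - window + 1, step)]


def spearman_rho(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman rank correlation (no tie correction; meant for continuous data)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])


windows = rolling_windows(n_rows=10, window=4, step=2)
# One feature value and one IC value per window would then be rank-correlated:
rho = spearman_rho(np.array([1.0, 2.0, 3.0, 4.0]), np.array([10.0, 20.0, 25.0, 100.0]))
```

In practice `scipy.stats.spearmanr` also returns a p-value and handles ties; the NumPy-only version here just keeps the example dependency-free.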

Quality gates

  • ruff format+check: clean
  • black --check: clean
  • mypy --strict --follow-imports=silent: clean
  • 26/26 pytest green (19 prior + 7 new regime tests)

What's explicitly NOT in this PR

  • No new CLI flag for --regime yet — waiting for 8h fresh collection (currently running in background) to confirm threshold generalizes across a second session, not just across halves of one session.
  • No walk-forward framework — the walk-forward analysis already ran as a one-off diagnostic, and the uplift survives the 50/50 train/test split.
  • No live-execution hooks.

Test plan

  • CI green
  • Merge (admin-squash after CI green — branch-protection on main)
  • Wait for 8h fresh collection (ETA ~07:30 UTC+3)
  • Run l2_regime_oos.py against FRESH substrate with threshold from ORIGINAL session — if uplift holds across sessions, regime filter is production-ready.

🤖 Generated with Claude Code

neuron7xLab and others added 2 commits April 17, 2026 23:41
Three diagnostic scripts built on existing primitives (slice_features,
run_killtest, cross_sectional_ricci_signal) — no new abstractions
(AE principles 1, 20).

* scripts/l2_killtest_recursive.py — depth-first bisection (up to
  depth 3) + cyclic K=8 disjoint blocks. Reveals regime structure
  hidden by full-window averaging. On collected substrate: 3/8 blocks
  PROCEED with IC up to +0.339; 3/8 KILL with IC as low as -0.109.
  Signal is intermittent, not uniform.

* scripts/l2_regime_analysis.py — per-block regime features
  (realized vol, cross-asset correlation, dispersion, signed trend,
  κ_min moments). Spearman rank-correlates block IC against each
  feature. On K=8: corr_mean shows the strongest direction (ρ=+0.429,
  p=0.29, n=8). Underpowered for a statistical claim; motivates
  finer-grained analysis.

* scripts/l2_walk_forward.py — 40-minute rolling window with 5-minute
  step across substrate. ~56 windows gives the statistical power that
  8 disjoint blocks lack. Reports IC trajectory, Spearman ρ at rolling
  resolution, quartile bins on the most-correlated feature to find a
  discriminator threshold.
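
As a rough picture of the slicing used by the recursive script, the sketch below enumerates depth-first bisection ranges and K disjoint blocks. Both helpers are hypothetical, and the exact boundary conventions of scripts/l2_killtest_recursive.py are not shown in this PR, so half-open row ranges are an assumption.

```python
def bisection_slices(n_rows: int, max_depth: int = 3) -> list[tuple[int, int, int]]:
    """Depth-first (depth, start, stop) ranges: full, halves, quarters, octiles."""
    out: list[tuple[int, int, int]] = []

    def visit(depth: int, start: int, stop: int) -> None:
        out.append((depth, start, stop))
        if depth < max_depth and stop - start >= 2:
            mid = (start + stop) // 2
            visit(depth + 1, start, mid)  # left half first (depth-first order)
            visit(depth + 1, mid, stop)

    visit(0, 0, n_rows)
    return out


def disjoint_blocks(n_rows: int, k: int = 8) -> list[tuple[int, int]]:
    """K contiguous, non-overlapping [start, stop) blocks covering all rows."""
    edges = [i * n_rows // k for i in range(k + 1)]
    return list(zip(edges[:-1], edges[1:]))
```

Each range would then be handed to the existing slicing and kill-test primitives; only the index arithmetic is shown here.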

Output artifacts:
* results/REGIME_ANALYSIS.json (8-block table + ρ matrix)
* results/L2_WALK_FORWARD.json (56-row trajectory, quartile bins)

Non-goals: new dataclasses, new production modules. Pure diagnostics.
If walk-forward identifies a regime discriminator, next commit adds
the regime filter to killtest.py as an optional parameter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walk-forward analysis (scripts/l2_walk_forward.py, 56 rolling 40-min
windows) identified rolling realized volatility as the dominant regime
discriminator:

    Spearman ρ(IC_signal, rv_mean)   = +0.352   p = 0.008 ***
    Spearman ρ(IC_signal, corr_mean) = +0.317   p = 0.017 *
    Spearman ρ(IC_signal, trend_*)   = not significant

Quartile analysis on rv_mean:
    Q1_low  IC median +0.027   (signal ≈ noise)
    Q4_high IC median +0.137   (signal works)
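
The quartile analysis can be illustrated with a short sketch. `quartile_medians` is a hypothetical helper, not the script's code; it assumes one feature value and one IC value per rolling window.

```python
import numpy as np


def quartile_medians(feature: np.ndarray, ic: np.ndarray) -> dict[str, float]:
    """Median IC within each quartile bin of a per-window feature."""
    edges = np.quantile(feature, [0.25, 0.5, 0.75])
    # searchsorted maps each value to a bin index 0..3 (Q1 lowest, Q4 highest).
    bins = np.searchsorted(edges, feature, side="right")
    return {f"Q{b + 1}": float(np.median(ic[bins == b])) for b in range(4)}


feature = np.arange(8.0)  # toy per-window rv_mean values
ic = 0.1 * feature        # toy per-window IC values
medians = quartile_medians(feature, ic)
```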

IN-SAMPLE CONDITIONAL (scripts/l2_regime_conditional.py):
    unconditional IC = +0.122
    rv_w600_q75       IC = +0.256   (frac_on = 24.2 %)
    rv_w300_q50       IC = +0.177   (frac_on = 49.2 %)

TRUE OOS (scripts/l2_regime_oos.py, threshold trained on first half,
applied to second half, no information leakage):
    TEST unconditional        IC = +0.116   frac_on = 100.0 %
    TEST q50 thr from train   IC = +0.202   frac_on =  43.9 %
    TEST q75 thr from train   IC = +0.236   frac_on =  36.3 %
    => 2.03× IC uplift OOS, threshold generalizes

Components
- research/microstructure/regime.py — 4 functions, no new dataclasses:
    * rolling_corr_regime(features, window_rows)
    * rolling_rv_regime(features, window_rows)    (primary, OOS-verified)
    * regime_mask_from_score(score, threshold)
    * regime_mask_from_quantile(score, quantile)
- research/microstructure/killtest.py — single optional parameter:
    * run_killtest(..., regime_mask: NDArray[bool] | None = None)
    * Backwards-compatible: None → identical behavior to before
    * When supplied, mask is applied at scoring time (ricci + target →
      NaN outside mask), Ricci signal itself still computed on full
      contiguous series (its rolling corr needs consecutive rows)
- tests/test_l2_regime.py — 7 new tests:
    * shape + warmup on rolling_corr_regime
    * high-ρ vs low-ρ synthetic discrimination
    * argument validation (window too small, single symbol)
    * mask NaN handling
    * killtest rejects wrong mask shape
    * trivial all-True mask matches unconditional (regression)

All 26 tests green. ruff + black + mypy --strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

n_samples=int(features.n_rows),

P2: Report effective sample count for masked killtest runs

n_samples is always reported as features.n_rows, but with regime_mask many rows are excluded from IC/p-value computations, so the verdict metadata overstates the amount of data actually used. This can mislead analysis scripts or operators about statistical power; n_samples should reflect the finite scored rows (or an additional effective-sample field should be emitted).
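
One way to act on this suggestion is to count only rows that actually contribute a finite (signal, target) pair. The helper below is a hypothetical sketch. Whether the count should be per row or per (row, symbol) pair is a design choice this PR does not settle.

```python
import numpy as np


def effective_samples(ricci_panel: np.ndarray, target: np.ndarray) -> int:
    """Number of rows with at least one finite (signal, target) pair."""
    valid = np.isfinite(ricci_panel) & np.isfinite(target)
    return int(valid.any(axis=1).sum())
```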

Comment on lines +380 to +381
ricci_panel = np.where(panel_mask, ricci_panel, np.nan)
target = np.where(panel_mask, target, np.nan)

P1: Permute only active rows in masked null tests

When regime_mask is applied, rows outside the regime are converted to NaN and then _permutation_pvalue still shuffles across the full time axis; for sparse masks this makes most shuffled rows miss the finite target rows, shrinking effective trial sample size (often below _pooled_ic's minimum) and biasing permutation p-values toward non-significance. This can falsely KILL regime-conditional runs even when the observed IC is strong, so the null test should permute/compress only the active (finite) rows.
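
A possible shape for the fix is to compress to active rows before shuffling. The sketch below is hypothetical: `pooled_ic` is a simplified stand-in for the project's `_pooled_ic`, and the two-sided test and the +1 correction are assumptions rather than the project's confirmed conventions.

```python
import numpy as np


def _rank(a: np.ndarray) -> np.ndarray:
    return np.argsort(np.argsort(a)).astype(float)


def pooled_ic(signal: np.ndarray, target: np.ndarray) -> float:
    """Rank correlation over all finite (signal, target) pairs."""
    ok = np.isfinite(signal) & np.isfinite(target)
    if ok.sum() < 3:
        return float("nan")
    return float(np.corrcoef(_rank(signal[ok]), _rank(target[ok]))[0, 1])


def masked_permutation_pvalue(
    signal: np.ndarray, target: np.ndarray, n_trials: int = 500, seed: int = 42
) -> float:
    """Permutation p-value that shuffles target rows among active rows only."""
    active = np.isfinite(signal).any(axis=1) & np.isfinite(target).any(axis=1)
    sig, tgt = signal[active], target[active]  # compress before shuffling
    observed = pooled_ic(sig, tgt)
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        shuffled = tgt[rng.permutation(tgt.shape[0])]
        if abs(pooled_ic(sig, shuffled)) >= abs(observed):
            hits += 1
    return (hits + 1) / (n_trials + 1)
```

Because masked rows are dropped before shuffling, every trial keeps the same number of finite pairs as the observed statistic, so a sparse mask no longer shrinks the null sample.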

Two more diagnostic scripts for the regime filter, completing the
generalization ladder:

* scripts/l2_regime_walkforward_calibration.py
  Rolling 60-min calibration + 30-min evaluation. Slides across
  substrate; at each step derives q50/q75 thresholds from the
  calibration window, applies them to the next evaluation window.
  Result on collected substrate: uplift POSITIVE in only 1 of 7
  steps (q50) / 1 of 5 steps (q75). HONEST LIMIT: the 50/50 split
  uplift (IC +0.12 → +0.24 OOS) does NOT survive production-style
  short-window rolling recalibration. Threshold needs longer
  calibration horizons to stabilize.

* scripts/l2_regime_cross_session.py
  Cross-session OOS scaffold: takes --train-dir and --test-dir,
  derives quantile thresholds from the train session, applies to
  the test session, writes results/L2_REGIME_CROSS_SESSION.json.
  Runnable against the second 8h session currently being collected
  into data/binance_l2_perp_v2. Strongest form of OOS we can do
  without multi-day walk-forward.
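
The rolling calibration scheme of the first script can be sketched as follows. `walkforward_masks` is a hypothetical helper. The stride of one evaluation window is an assumption, and window lengths are given in rows because the row cadence is not stated here.

```python
import numpy as np


def walkforward_masks(
    score: np.ndarray, calib_rows: int, eval_rows: int, quantile: float
) -> list[tuple[int, float, np.ndarray]]:
    """Per step: threshold from the calibration window, mask on the next window."""
    results = []
    start = 0
    while start + calib_rows + eval_rows <= score.size:
        calib = score[start : start + calib_rows]
        ev = score[start + calib_rows : start + calib_rows + eval_rows]
        thr = float(np.nanquantile(calib, quantile))  # no look-ahead into `ev`
        results.append((start, thr, ev >= thr))
        start += eval_rows  # assumed stride: one evaluation window
    return results


steps = walkforward_masks(np.arange(10.0), calib_rows=4, eval_rows=2, quantile=0.5)
```

The conditional and unconditional IC would then be compared on each evaluation window to tally the steps with positive uplift.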

These land as diagnostics only. The regime MODULE (regime.py) and
its integration (run_killtest regime_mask param) are the shippable
artifacts. Scripts document the calibration surface honestly —
including where the filter breaks under stricter recalibration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@neuron7xLab neuron7xLab merged commit bdbc991 into main Apr 17, 2026
13 checks passed
@neuron7xLab neuron7xLab deleted the l2-regime-calibration branch April 17, 2026 21:19
neuron7xLab added a commit that referenced this pull request Apr 18, 2026
Open question from PR #237: does Ricci cross-sectional signal polarity
vary with UTC hour-of-day, or was Session-2's IC=-0.22 just sample noise?
Binary test via new research/microstructure/diurnal.py folds three
collected sessions into UTC-hour buckets and runs per-hour pooled
Spearman IC + permutation test.

VERDICT: SIGN_FLIP_CONFIRMED

Per-hour IC (pooled across sessions; all buckets at permutation p<0.05
unless noted; n_rows after NaN-filter):

    hour    IC        p          sessions   regime
    05Z     n/a       —          S3         underpowered
    06Z     -0.0619   0.002      S3         Sat AM quiet
    07Z     -0.1194   0.002      S3         Sat AM quiet
    08Z     +0.0086   0.036      S1+S3      EU open weak
    09Z     +0.0191   0.002      S1+S3      EU open
    10Z     +0.1279   0.002      S1         EU active
    11Z     -0.0360   0.002      S1         EU midday lull
    12Z     +0.0556   0.002      S1         EU + pre-US
    13Z     +0.3554   0.002      S1         EU+US overlap PEAK
    20Z     +0.0008   0.954      S2         underpowered (n=7410)
    21Z     -0.0715   0.002      S2         US close / EU eve
    22Z     -0.1989   0.002      S2         EU evening

Significance tally: 5 positive hours + 5 negative hours at p<0.05.
Verdict: SIGN_FLIP_CONFIRMED (at least one bucket each side).
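
The tally rule can be checked directly against the table. In this sketch the (IC, p) pairs are copied from the per-hour table above, with 05Z omitted because it has no IC. The function name and the "NO_SIGN_FLIP" label are hypothetical.

```python
def sign_flip_verdict(
    rows: list[tuple[float, float]], pvalue_gate: float = 0.05
) -> tuple[int, int, str]:
    """Count significant positive and negative buckets and derive the verdict."""
    pos = sum(1 for ic, p in rows if p < pvalue_gate and ic > 0)
    neg = sum(1 for ic, p in rows if p < pvalue_gate and ic < 0)
    verdict = "SIGN_FLIP_CONFIRMED" if pos and neg else "NO_SIGN_FLIP"
    return pos, neg, verdict


# (IC, p) for hours 06Z..13Z and 20Z..22Z, as listed in the table.
TABLE = [
    (-0.0619, 0.002), (-0.1194, 0.002), (0.0086, 0.036), (0.0191, 0.002),
    (0.1279, 0.002), (-0.0360, 0.002), (0.0556, 0.002), (0.3554, 0.002),
    (0.0008, 0.954), (-0.0715, 0.002), (-0.1989, 0.002),
]
result = sign_flip_verdict(TABLE)
```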

Architectural implication:
The regime-q75 filter landed in PR #236 conditions on volatility but
does NOT capture the inversion driver. Saturday-morning 07Z shows
IC=-0.12 at ordinary vol levels — a pure vol threshold does not
separate this from the +0.13 of 10Z. Time-of-day is a load-bearing
axis the current gate does not model.

Next step (out of scope for this PR): diurnal-aware sign strategy
as follow-up research module.

Components:
  research/microstructure/diurnal.py  (typed, 7 tests)
      utc_hour_of_row(start_ms, n_rows) -> NDArray[int64]
      compute_diurnal_profile(sessions, horizon_sec, min_rows_per_hour,
                              perm_trials, pvalue_gate, seed)
          -> DiurnalProfile
      profile_to_json_dict(profile) -> dict
      session_start_ms_from_frames(frames) -> int

  scripts/run_l2_diurnal_profile.py
      --data-dir PATH (repeatable; one per session)
      --horizon-sec INT (default 180)
      --min-rows-per-hour INT (default 300)
      --perm-trials INT (default 500)
      --pvalue-gate FLOAT (default 0.05)
      --seed INT (default 42)
      --output PATH

  tests/test_l2_diurnal.py (7 tests):
      utc_hour_of_row monotone + wrap + negative rejection
      empty-sessions underpowered
      low-sample underpowered/stable
      multi-session hour merging
      JSON schema contract
      determinism under fixed seed
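
For illustration, the hour bucketing could look like the sketch below. It is not the code in diurnal.py: the real `utc_hour_of_row` takes only `start_ms` and `n_rows`, so the `row_ms` parameter and its 1-second default are assumptions about the row cadence.

```python
import numpy as np


def utc_hour_of_row_sketch(start_ms: int, n_rows: int, row_ms: int = 1000) -> np.ndarray:
    """UTC hour (0..23) for each row, given the session start in epoch ms."""
    if start_ms < 0 or n_rows < 0:
        raise ValueError("start_ms and n_rows must be non-negative")
    ts = start_ms + row_ms * np.arange(n_rows, dtype=np.int64)
    # Integer hours since the epoch, wrapped onto a 24-hour clock.
    return (ts // 3_600_000) % 24


# A session starting at 23:59:58 UTC wraps from hour 23 to hour 0.
hours = utc_hour_of_row_sketch(start_ms=86_398_000, n_rows=4)
```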

Evidence:
  results/L2_DIURNAL_PROFILE.json  (tracked; full per-hour breakdown)

Quality gates: ruff + black + mypy --strict clean.
Test regression: 42 → 49 (+7 diurnal).
Numerical locks (PR #238 fixtures) unchanged:
  ic_test_q75    = 0.23638402111955653
  breakeven_q75  = 0.4072465349699599

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>