l2-regime: rolling-RV regime filter, OOS-verified 2× IC uplift #236
neuron7xLab merged 3 commits into main
Three diagnostic scripts built on existing primitives (slice_features, run_killtest, cross_sectional_ricci_signal), with no new abstractions (AE principles 1, 20).

* scripts/l2_killtest_recursive.py: depth-first bisection (up to depth 3) + cyclic K=8 disjoint blocks. Reveals regime structure hidden by full-window averaging. On collected substrate: 3/8 blocks PROCEED with IC up to +0.339; 3/8 KILL with IC as low as -0.109. Signal is intermittent, not uniform.
* scripts/l2_regime_analysis.py: per-block regime features (realized vol, cross-asset correlation, dispersion, signed trend, κ_min moments). Spearman rank-correlates block IC against each feature. On K=8, corr_mean is the strongest direction (ρ=+0.429, p=0.29, n=8). Underpowered for a statistical claim; motivates finer-grained analysis.
* scripts/l2_walk_forward.py: 40-minute rolling window with 5-minute step across substrate. ~56 windows give the statistical power that 8 disjoint blocks lack. Reports IC trajectory, Spearman ρ at rolling resolution, and quartile bins on the most-correlated feature to find a discriminator threshold.

Output artifacts:
* results/REGIME_ANALYSIS.json (8-block table + ρ matrix)
* results/L2_WALK_FORWARD.json (56-row trajectory, quartile bins)

Non-goals: new dataclasses, new production modules. Pure diagnostics. If walk-forward identifies a regime discriminator, the next commit adds the regime filter to killtest.py as an optional parameter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
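The rolling-window layout described for l2_walk_forward.py can be sketched in a few lines. This is a minimal illustration, not the script's code: `rolling_windows` is a hypothetical helper, offsets are in whole minutes, and the 314-minute length is an assumption taken from the 5h14m substrate quoted in the PR summary.

```python
def rolling_windows(total_min: int, window_min: int, step_min: int) -> list[tuple[int, int]]:
    """Return [start, end) minute offsets of every full window that fits."""
    windows = []
    start = 0
    while start + window_min <= total_min:
        windows.append((start, start + window_min))
        start += step_min
    return windows


# Assumed substrate length: 5h14m = 314 minutes.
windows = rolling_windows(total_min=314, window_min=40, step_min=5)
print(len(windows), windows[0], windows[-1])
```

Under these assumptions a 314-minute substrate holds 55 full windows, close to the ~56 quoted above; the exact count depends on the substrate's true length in rows and on how the script treats the final partial window.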
Walk-forward analysis (scripts/l2_walk_forward.py, 56 rolling 40-min
windows) identified rolling realized volatility as the dominant regime
discriminator:
Spearman ρ(IC_signal, rv_mean) = +0.352 p = 0.008 ***
Spearman ρ(IC_signal, corr_mean) = +0.317 p = 0.017 *
Spearman ρ(IC_signal, trend_*) = not significant
Quartile analysis on rv_mean:
Q1_low IC median +0.027 (signal ≈ noise)
Q4_high IC median +0.137 (signal works)
IN-SAMPLE CONDITIONAL (scripts/l2_regime_conditional.py):
unconditional IC = +0.122
rv_w600_q75 IC = +0.256 (frac_on = 24.2 %)
rv_w300_q50 IC = +0.177 (frac_on = 49.2 %)
TRUE OOS (scripts/l2_regime_oos.py, threshold trained on first half,
applied to second half, no information leakage):
TEST unconditional IC = +0.116 frac_on = 100.0 %
TEST q50 thr from train IC = +0.202 frac_on = 43.9 %
TEST q75 thr from train IC = +0.236 frac_on = 36.3 %
=> 2.03× IC uplift OOS, threshold generalizes
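The train/test protocol above can be illustrated as follows. This is a hedged sketch on synthetic data, not the code in scripts/l2_regime_oos.py: `oos_quantile_mask` is a hypothetical helper, the score is a stand-in for rolling realized volatility, and the `>=` comparison is an assumption.

```python
import numpy as np


def oos_quantile_mask(score: np.ndarray, quantile: float) -> tuple[float, np.ndarray]:
    """Fit a quantile threshold on the first half of `score`, apply it to the second half."""
    half = score.size // 2
    train, test = score[:half], score[half:]
    # The threshold sees only training rows; warm-up NaNs are ignored.
    threshold = float(np.nanquantile(train, quantile))
    # Rows with a non-finite score are treated as regime-off.
    mask = np.isfinite(test) & (test >= threshold)
    return threshold, mask


rng = np.random.default_rng(0)
score = np.abs(rng.normal(size=1000))  # synthetic stand-in for rolling RV
score[:10] = np.nan                    # mimic rolling-window warm-up
thr, mask = oos_quantile_mask(score, 0.75)
print(f"threshold={thr:.3f} frac_on={mask.mean():.1%}")
```

Because the threshold is fixed on the training half, the fraction of test rows it switches on need not equal 1 - quantile. That is consistent with the q75 threshold reporting frac_on = 36.3 % on the test half rather than 25 %.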
Components
- research/microstructure/regime.py — 4 functions, no new dataclasses:
* rolling_corr_regime(features, window_rows)
* rolling_rv_regime(features, window_rows) (primary, OOS-verified)
* regime_mask_from_score(score, threshold)
* regime_mask_from_quantile(score, quantile)
- research/microstructure/killtest.py — single optional parameter:
* run_killtest(..., regime_mask: NDArray[bool] | None = None)
* Backwards-compatible: None → identical behavior to before
* When supplied, mask is applied at scoring time (ricci + target →
NaN outside mask), Ricci signal itself still computed on full
contiguous series (its rolling corr needs consecutive rows)
- tests/test_l2_regime.py — 7 new tests:
* shape + warmup on rolling_corr_regime
* high-ρ vs low-ρ synthetic discrimination
* argument validation (window too small, single symbol)
* mask NaN handling
* killtest rejects wrong mask shape
* trivial all-True mask matches unconditional (regression)
All 26 tests green. ruff + black + mypy --strict clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
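The "mask applied at scoring time" behavior listed under Components can be sketched as below. This is an illustration under stated assumptions, not the body of run_killtest: `masked_rank_ic` and `_rank` are hypothetical names, the signal is assumed to be precomputed on the full contiguous series, and the rank correlation ignores tie correction.

```python
import numpy as np


def _rank(x: np.ndarray) -> np.ndarray:
    """Ordinal ranks (ties broken by position; adequate for continuous data)."""
    ranks = np.empty(x.size, dtype=float)
    ranks[np.argsort(x, kind="stable")] = np.arange(x.size)
    return ranks


def masked_rank_ic(signal, target, regime_mask=None) -> float:
    """Rank IC over rows inside the regime; None means unconditional scoring."""
    if regime_mask is not None:
        if regime_mask.shape != signal.shape:
            raise ValueError("regime_mask shape must match signal shape")
        # Mask only at scoring time: rows outside the regime become NaN.
        signal = np.where(regime_mask, signal, np.nan)
        target = np.where(regime_mask, target, np.nan)
    finite = np.isfinite(signal) & np.isfinite(target)
    if finite.sum() < 3:
        return float("nan")
    return float(np.corrcoef(_rank(signal[finite]), _rank(target[finite]))[0, 1])


rng = np.random.default_rng(1)
target = rng.normal(size=400)
signal = target + rng.normal(size=400)
ic_none = masked_rank_ic(signal, target)
ic_all = masked_rank_ic(signal, target, np.ones(400, dtype=bool))
print(ic_none, ic_all)
```

The last two calls mirror the regression test in the list above: a trivial all-True mask should reproduce the unconditional result.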
💡 Codex Review
GeoSync/research/microstructure/killtest.py
Line 459 in 8c31f7b
n_samples is always reported as features.n_rows, but with regime_mask many rows are excluded from IC/p-value computations, so the verdict metadata overstates the amount of data actually used. This can mislead analysis scripts or operators about statistical power; n_samples should reflect the finite scored rows (or an additional effective-sample field should be emitted).
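One way to report the effective sample size the review asks for is to count the rows that survive masking. The sketch below is an assumption about shapes, not the repository's fix: `effective_n_samples` is a hypothetical helper and the panel is assumed to be laid out as (n_rows, n_symbols).

```python
import numpy as np


def effective_n_samples(ricci_panel: np.ndarray, target: np.ndarray) -> int:
    """Count rows that contribute at least one finite (signal, target) pair."""
    finite = np.isfinite(ricci_panel) & np.isfinite(target)
    if finite.ndim == 1:
        return int(finite.sum())
    # Assumed layout: rows are timestamps, columns are symbols.
    return int(finite.any(axis=1).sum())


nan = np.nan
ricci = np.array([[1.0, nan], [nan, nan], [2.0, 3.0]])
tgt = np.array([[1.0, 1.0], [1.0, 1.0], [nan, 4.0]])
n_eff = effective_n_samples(ricci, tgt)
print(n_eff)
```

Emitting this count next to the raw row count would let downstream scripts see how much data the regime mask left for the IC and p-value computations.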
    ricci_panel = np.where(panel_mask, ricci_panel, np.nan)
    target = np.where(panel_mask, target, np.nan)
Permute only active rows in masked null tests
When regime_mask is applied, rows outside the regime are converted to NaN and then _permutation_pvalue still shuffles across the full time axis; for sparse masks this makes most shuffled rows miss the finite target rows, shrinking effective trial sample size (often below _pooled_ic's minimum) and biasing permutation p-values toward non-significance. This can falsely KILL regime-conditional runs even when the observed IC is strong, so the null test should permute/compress only the active (finite) rows.
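The suggested remedy, compressing to the active rows before shuffling, might look like the sketch below. It does not reproduce `_permutation_pvalue` or `_pooled_ic`: all names here are hypothetical, the series are 1-D, and the two-sided (hits + 1) / (n_trials + 1) estimator is an assumption.

```python
import numpy as np


def _rank(x: np.ndarray) -> np.ndarray:
    """Ordinal ranks (ties broken by position; adequate for continuous data)."""
    ranks = np.empty(x.size, dtype=float)
    ranks[np.argsort(x, kind="stable")] = np.arange(x.size)
    return ranks


def _spearman(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.corrcoef(_rank(a), _rank(b))[0, 1])


def permutation_pvalue_active(signal, target, n_trials=500, seed=42) -> float:
    """Two-sided permutation p-value that shuffles only the finite (active) rows."""
    finite = np.isfinite(signal) & np.isfinite(target)
    s, t = signal[finite], target[finite]  # compress first, then permute
    if s.size < 3:
        return float("nan")
    rng = np.random.default_rng(seed)
    observed = abs(_spearman(s, t))
    hits = sum(abs(_spearman(s, rng.permutation(t))) >= observed for _ in range(n_trials))
    return (hits + 1) / (n_trials + 1)


rng = np.random.default_rng(7)
target = rng.normal(size=1000)
signal = target + 0.3 * rng.normal(size=1000)
mask = np.zeros(1000, dtype=bool)
mask[::10] = True  # sparse regime: 10 % of rows active
p = permutation_pvalue_active(
    np.where(mask, signal, np.nan), np.where(mask, target, np.nan), n_trials=200
)
print(p)
```

Every shuffled trial here keeps the same number of finite pairs as the observed statistic, so a sparse mask no longer shrinks the null sample.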
Two more diagnostic scripts for the regime filter, completing the generalization ladder:

* scripts/l2_regime_walkforward_calibration.py
  Rolling 60-min calibration + 30-min evaluation. Slides across substrate; at each step derives q50/q75 thresholds from the calibration window and applies them to the next evaluation window. Result on collected substrate: uplift POSITIVE in only 1 of 7 steps (q50) / 1 of 5 steps (q75). HONEST LIMIT: the 50/50 split uplift (IC +0.12 → +0.24 OOS) does NOT survive production-style short-window rolling recalibration. Threshold needs longer calibration horizons to stabilize.
* scripts/l2_regime_cross_session.py
  Cross-session OOS scaffold: takes --train-dir and --test-dir, derives quantile thresholds from the train session, applies them to the test session, and writes results/L2_REGIME_CROSS_SESSION.json. Runnable against the second 8h session currently being collected into data/binance_l2_perp_v2. Strongest form of OOS we can do without multi-day walk-forward.

These land as diagnostics only. The regime MODULE (regime.py) and its integration (run_killtest regime_mask param) are the shippable artifacts. The scripts document the calibration surface honestly, including where the filter breaks under stricter recalibration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
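The rolling calibrate-then-evaluate loop can be sketched as follows. This is an illustration only: `rolling_calibration` is a hypothetical helper, one row is assumed to stand for one minute, and the step size (equal to the evaluation length) is an assumption the commit message does not state.

```python
import numpy as np


def rolling_calibration(score: np.ndarray, calib_rows: int, eval_rows: int, quantile: float):
    """Return one (threshold, eval_mask) pair per step; thresholds see only calibration rows."""
    results = []
    start = 0
    while start + calib_rows + eval_rows <= score.size:
        calib = score[start : start + calib_rows]
        evalw = score[start + calib_rows : start + calib_rows + eval_rows]
        thr = float(np.nanquantile(calib, quantile))
        results.append((thr, evalw >= thr))
        start += eval_rows  # assumption: slide by one evaluation window
    return results


rng = np.random.default_rng(3)
score = np.abs(rng.normal(size=300))  # synthetic stand-in, one row per minute
steps = rolling_calibration(score, calib_rows=60, eval_rows=30, quantile=0.5)
for thr, on in steps:
    print(f"thr={thr:.3f} frac_on={on.mean():.1%}")
```

With short calibration windows the threshold moves from step to step, which is one plausible reading of why the uplift is unstable under this scheme.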
Open question from PR #237: does Ricci cross-sectional signal polarity vary with UTC hour-of-day, or was Session-2's IC=-0.22 just sample noise?

Binary test via new research/microstructure/diurnal.py folds three collected sessions into UTC-hour buckets and runs per-hour pooled Spearman IC + permutation test.

VERDICT: SIGN_FLIP_CONFIRMED

Per-hour IC (pooled across sessions; all buckets at permutation p<0.05 unless noted; n_rows after NaN-filter):

  hour  IC       p      sessions  regime
  05Z   n/a      -      S3        underpowered
  06Z   -0.0619  0.002  S3        Sat AM quiet
  07Z   -0.1194  0.002  S3        Sat AM quiet
  08Z   +0.0086  0.036  S1+S3     EU open weak
  09Z   +0.0191  0.002  S1+S3     EU open
  10Z   +0.1279  0.002  S1        EU active
  11Z   -0.0360  0.002  S1        EU midday lull
  12Z   +0.0556  0.002  S1        EU + pre-US
  13Z   +0.3554  0.002  S1        EU+US overlap PEAK
  20Z   +0.0008  0.954  S2        underpowered (n=7410)
  21Z   -0.0715  0.002  S2        US close / EU eve
  22Z   -0.1989  0.002  S2        EU evening

Significance tally: 5 positive hours + 5 negative hours at p<0.05.
Verdict: SIGN_FLIP_CONFIRMED (at least one bucket each side).

Architectural implication:
The regime-q75 filter landed in PR #236 conditions on volatility but does NOT capture the inversion driver. Saturday-morning 07Z shows IC=-0.12 at ordinary vol levels; a pure vol threshold does not separate this from the +0.13 of 10Z. Time-of-day is a load-bearing axis the current gate does not model.

Next step (out of scope for this PR): diurnal-aware sign strategy as follow-up research module.

Components:

research/microstructure/diurnal.py (typed, 7 tests)
  utc_hour_of_row(start_ms, n_rows) -> NDArray[int64]
  compute_diurnal_profile(sessions, horizon_sec, min_rows_per_hour, perm_trials, pvalue_gate, seed) -> DiurnalProfile
  profile_to_json_dict(profile) -> dict
  session_start_ms_from_frames(frames) -> int

scripts/run_l2_diurnal_profile.py
  --data-dir PATH (repeatable; one per session)
  --horizon-sec INT (default 180)
  --min-rows-per-hour INT (default 300)
  --perm-trials INT (default 500)
  --pvalue-gate FLOAT (default 0.05)
  --seed INT (default 42)
  --output PATH

tests/test_l2_diurnal.py (7 tests):
  utc_hour_of_row monotone + wrap + negative rejection
  empty-sessions underpowered
  low-sample underpowered/stable
  multi-session hour merging
  JSON schema contract
  determinism under fixed seed

Evidence: results/L2_DIURNAL_PROFILE.json (tracked; full per-hour breakdown)

Quality gates: ruff + black + mypy --strict clean.
Test regression: 42 → 49 (+7 diurnal).
Numerical locks (PR #238 fixtures) unchanged:
  ic_test_q75 = 0.23638402111955653
  breakeven_q75 = 0.4072465349699599

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
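The hour-bucketing step behind the per-hour table can be sketched as below. This is not the body of `utc_hour_of_row`: the function name `utc_hour_sketch` and the `row_ms` parameter are hypothetical, and the one-row-per-second cadence is an assumption, since the source gives only the `(start_ms, n_rows)` signature.

```python
import numpy as np

MS_PER_HOUR = 3_600_000


def utc_hour_sketch(start_ms: int, n_rows: int, row_ms: int = 1000) -> np.ndarray:
    """Map each row of a session to its UTC hour (0..23) from a Unix-epoch start in ms."""
    if start_ms < 0 or n_rows < 0:
        raise ValueError("start_ms and n_rows must be non-negative")
    # Assumed cadence: consecutive rows are `row_ms` milliseconds apart.
    ts = start_ms + row_ms * np.arange(n_rows, dtype=np.int64)
    return (ts // MS_PER_HOUR) % 24


# A session starting at 23:59:58 UTC wraps from hour 23 to hour 0 after two rows.
start = 23 * MS_PER_HOUR + 59 * 60_000 + 58_000
hours = utc_hour_sketch(start, 4)
print(hours)
```

Rows from several sessions that map to the same hour can then be pooled into one bucket before the per-hour IC and permutation test are run.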
Summary
Recursive + cyclic + rolling-walk-forward analysis of the 5h14m collected L2 substrate revealed that the Ricci κ_min cross-sectional edge is intermittent, not uniform — some time blocks emit IC > +0.18, others invert to IC < -0.10. Full-window PROCEED averaged these.
This PR introduces the regime filter that rescues the edge:
2.03× IC uplift with threshold learned on first half generalizing cleanly to second half.
Architecture (AE-compliant: elimination over addition)
New public surface in research/microstructure/regime.py (the four functions listed under Components above).
One optional parameter added to the existing gate (regime_mask on run_killtest):
None → identical behavior to before. No new dataclasses, no new CLI flags (yet; they'll land when the regime filter earns its place in the permanent gate).
Analysis scripts (diagnostics, not production)
* scripts/l2_killtest_recursive.py: depth-first bisection (full → halves → quarters → octiles) + cyclic K=8 blocks. Exposes regime structure.
* scripts/l2_regime_analysis.py: per-block regime features + IC rank correlation.
* scripts/l2_walk_forward.py: 56 rolling 40-min windows with 5-min step, Spearman feature→IC correlations.
* scripts/l2_regime_conditional.py: in-sample conditional gate at multiple quantile thresholds × window sizes.
* scripts/l2_regime_oos.py: true OOS, threshold fitted on train, applied to test.
Quality gates
What's explicitly NOT in this PR
* --regime flag yet: waiting for the 8h fresh collection (currently running in background) to confirm the threshold generalizes across a second session, not just across halves of one session.
Test plan
* l2_regime_oos.py against FRESH substrate with threshold from ORIGINAL session: if uplift holds across sessions, the regime filter is production-ready.
🤖 Generated with Claude Code