docs(dro-ara): RFC — align stationarity convention (ADF on returns) #345
neuron7xLab merged 1 commit into main
…ot prices)

Walk-forward H × rs grid search (7×11 cells, 69 folds on SPDR S&P 500) produced a degenerate all-zero Sharpe matrix. A diagnostic scan across 8 askar assets confirmed that the binding constraint is not H_CRITICAL or RS_LONG_THRESH but the upstream `_adf_stationary(raw_prices)` test: equity/FX/commodity prices are canonical I(1) processes, so ADF fails to reject the unit-root null on ≥93% of train windows → the engine's regime observer returns INVALID by construction on live market data.

The Hurst path already uses log-returns internally (engine.py:138). The ADF path is the only site still running on raw prices, which is a convention mismatch.

This commit ships:
- docs/RFC_DRO_ARA_STATIONARITY_CONVENTION.md: PROPOSED, awaiting sign-off. Root cause, invariant audit, test impact, alternatives, rollout, fail-closed audit requirement. No code change in this PR.
- docs/DRO_ARA_CALIBRATION_REPORT.md: empirical evidence motivating the RFC.
- experiments/dro_ara_calibration/: grid search script (mypy --strict clean), CSV, summary.json, heatmap.png. Reproducible via `python -m experiments.dro_ara_calibration.run_grid_search --data <parquet>`.

Decision required in RFC §9 before any engine patch.
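The I(1) argument above can be written out in two lines. This is a textbook sketch under a geometric-random-walk assumption for prices; the symbols $\mu$, $\sigma$, $\varepsilon_t$, $\alpha$, $\rho$, $\gamma_i$ are notation introduced here, not taken from the RFC.

$$
\log p_t = \log p_{t-1} + \mu + \sigma\,\varepsilon_t
\quad\Longrightarrow\quad
r_t = \log p_t - \log p_{t-1} = \mu + \sigma\,\varepsilon_t
$$

$$
\Delta y_t = \alpha + \rho\, y_{t-1} + \sum_{i=1}^{k} \gamma_i\, \Delta y_{t-i} + e_t,
\qquad H_0:\ \rho = 0 \ \text{(unit root)}
$$

With $y_t = p_t$ the null is true by construction, so ADF rarely rejects it and the window is labelled non-stationary. With $y_t = r_t$ the series is i.i.d. around $\mu$, $\rho$ is close to $-1$, and the null is rejected in most windows.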
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 13fa06fd1a
```python
first = MIN_HISTORY
starts: list[int] = []
t = first
while t + TRAIN_WINDOW + TEST_WINDOW <= n:
```
Stop truncating walk-forward folds prematurely
The fold loop currently requires t + TRAIN_WINDOW + TEST_WINDOW <= n, but t is already the test-window start and training data is taken from prices[t-TRAIN_WINDOW:t]; this adds an unnecessary future-history requirement and drops valid late folds. With the current constants, this excludes the final 12 folds (69 produced vs 81 available), which materially biases the calibration summary and crisis-window statistics toward older periods.
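A minimal sketch of the fold arithmetic described in this comment. The constants below (`MIN_HISTORY`, `TRAIN_WINDOW`, `TEST_WINDOW`, `STEP`, `n`) are hypothetical values picked only so that the counts land on the 69 vs 81 figures quoted above; the real values in `run_grid_search.py` may differ, and a step of one test window is an assumption.

```python
# Hypothetical constants, NOT taken from the repository.
MIN_HISTORY = 252
TRAIN_WINDOW = 252
TEST_WINDOW = 21
STEP = TEST_WINDOW  # assumption: folds advance by one test window


def fold_starts(n: int, strict: bool) -> list[int]:
    """Return test-window start indices t.

    Train slice is prices[t - TRAIN_WINDOW : t], test slice is
    prices[t : t + TEST_WINDOW]. `strict=True` mimics the condition
    under review, which also demands TRAIN_WINDOW bars *after* t.
    """
    horizon = TRAIN_WINDOW + TEST_WINDOW if strict else TEST_WINDOW
    starts: list[int] = []
    t = max(MIN_HISTORY, TRAIN_WINDOW)  # enough history for the train slice
    while t + horizon <= n:
        starts.append(t)
        t += STEP
    return starts


n = 1953  # hypothetical series length
current = fold_starts(n, strict=True)
proposed = fold_starts(n, strict=False)
print(len(current), len(proposed), len(proposed) - len(current))
```

Under these assumptions the relaxed condition recovers exactly `TRAIN_WINDOW / STEP` late folds, all of which still have a full train and test slice inside the series.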
```python
delta_H = abs(H_opt - H_CRITICAL)
delta_rs = abs(rs_opt - RS_LONG_THRESH)
```
Keep signed deltas in the threshold comparison table
Using abs(...) for delta_H and delta_rs removes directionality, so the report can state positive deltas even when the optimal threshold is lower (e.g., current 0.45 vs optimal 0.30 prints +0.15). This can mislead operator decisions in non-degenerate runs because it obscures whether parameters should be increased or decreased.
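The example in this comment (current 0.45, optimal 0.30) can be checked directly. This is a small sketch; `current` and `optimal` are illustrative names, not identifiers from the script.

```python
current = 0.45  # threshold currently in force (value from the comment above)
optimal = 0.30  # grid-search optimum (value from the comment above)

unsigned = abs(optimal - current)   # what the report prints today
signed = optimal - current          # keeps the direction of the change

# The "+" format flag forces an explicit sign on both values.
unsigned_cell = f"{unsigned:+.2f}"
signed_cell = f"{signed:+.2f}"
print(unsigned_cell, signed_cell)
```

The unsigned cell reads as "raise the threshold by 0.15", while the signed cell shows that the optimum sits below the current value. If a magnitude is still needed for tiering decisions, it can be derived from the signed value at the point of use.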
… 2) (#349)

* fix(dro-ara): apply ADF to log-returns, not raw prices

Engine previously mixed stationarity conventions: DFA computes Hurst on diff(log(price)) (engine.py:138), but ADF ran on raw prices. For canonical I(1) asset prices, ADF failed to reject the unit-root null ≈93–99% of the time across all askar assets, making INV-DRO3 a near-tautology and reducing INV-DRO4's LONG-gate to permanent INVALID.

Patch (4 lines in State.from_window):
- compute log_returns = np.diff(np.log(np.abs(arr) + 1e-12))
- run ADF on log_returns instead of raw arr

Conventions are now consistent with DFA.

Empirical impact (SPDR S&P 500, 69 walk-forward folds):
- stationary rate: 1.4 % → 100 % (binding constraint relaxed)
- rs_train_max: 0.298 (now bound by RS_LONG_THRESH)
- gate-on at current θ: 0 → still 0 (rs threshold now binding)

The patch unblocks calibration: with stationarity no longer the dominant filter, threshold tuning of (H_CRITICAL, RS_LONG_THRESH) becomes informative, so the next step (T3 grid re-run, T4 calibration) is now meaningful.

Test impact: 4 tests rewritten to encode post-RFC semantics, 1 added.
- tests/core/dro_ara/test_falsification.py:
  * test_random_walk_is_invalid_or_transition → test_random_walk_returns_are_stationary_no_long (RW returns i.i.d. → stationary; INV-DRO4 forbids LONG)
  * test_gbm_with_drift_is_non_stationary → test_gbm_with_drift_returns_are_stationary_no_long (GBM returns N(μ,σ²) → stationary; INV-DRO4 forbids LONG)
- tests/core/dro_ara/test_invariants.py:
  * NEW test_inv_dro3_tightening_post_rfc_ou_stationary_rate (INV-DRO3 tightening guard: OU stationary rate > 50 % across 30 seeds)
- tests/core/strategies/test_dro_ara_filter.py:
  * test_apply_on_gbm_drifts_to_zero → test_apply_on_gbm_is_systematically_reduced (statistical: mean filter mult ≤ 0.55 across 30 seeds, ≥ 80 % reduced)
- tests/research/dro_ara/test_backtest_smoke.py:
  * test_backtest_on_gbm_yields_flat_positions → test_backtest_on_gbm_has_flat_and_active_bars_mix (filter still zeroes DRIFT path → flat-frac > 10 %)
- tests/research/dro_ara/test_power_mc.py:
  * test_gbm_drift_classifies_as_invalid_majority → test_gbm_drift_not_classified_as_critical_majority (false-positive guard: p_critical(GBM) ≤ 0.40)

Fail-closed audit (per RFC §8 + feedback_fail_closed_audit), all PASS:
1. tests/core/dro_ara/test_properties.py: 8/8 green
2. Hypothesis fuzz (14 @given strategies): green, no crash
3. SPDR 69-fold smoke, gate-on/stat ≥ 20 %: PASS (100 %)
4. Full repo regression: 11274 passed, 0 failed

Quality gates: ruff clean · black clean · mypy --strict clean.

Refs: PR #345 (RFC), docs/RFC_DRO_ARA_STATIONARITY_CONVENTION.md §3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style(dro-ara): black-format 2 test files (CI python-quality)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
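The effect of the patch can be illustrated on a synthetic random walk. The sketch below is not the engine's `_adf_stationary`: it swaps in a plain Dickey-Fuller regression with a constant and no lag augmentation, written in NumPy so that it is self-contained. Only the log-return transform is copied from the commit message; `df_tstat`, `to_log_returns`, the seed, and the series parameters are assumptions made for illustration.

```python
import numpy as np

# Asymptotic 5 % Dickey-Fuller critical value (constant, no trend).
DF_CRIT_5PCT = -2.86


def df_tstat(x: np.ndarray) -> float:
    """t-statistic of rho in  dx_t = alpha + rho * x_{t-1} + e_t.

    Simplified stand-in for ADF: no lagged-difference terms.
    """
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)
    lag = x[:-1]
    X = np.column_stack([np.ones_like(lag), lag])
    beta, *_ = np.linalg.lstsq(X, dx, rcond=None)
    resid = dx - X @ beta
    dof = len(dx) - X.shape[1]
    sigma2 = float(resid @ resid) / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)  # OLS covariance of (alpha, rho)
    return float(beta[1] / np.sqrt(cov[1, 1]))


def to_log_returns(prices: np.ndarray) -> np.ndarray:
    """Transform quoted in the commit message above."""
    return np.diff(np.log(np.abs(prices) + 1e-12))


# Synthetic geometric random walk (hypothetical parameters).
rng = np.random.default_rng(0)
prices = 100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.01, size=1000)))

stat_prices = df_tstat(prices)
stat_returns = df_tstat(to_log_returns(prices))
print(f"prices:  {stat_prices:.2f}")
print(f"returns: {stat_returns:.2f} (reject unit root if < {DF_CRIT_5PCT})")
```

On i.i.d. returns the statistic is on the order of minus the square root of the sample size, far below the critical value, which is the mechanism behind the stationary rate moving to 100 % in the smoke run. A production check would normally use a full ADF implementation with lag selection.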
…nce (#351)

T3 (grid re-run) + T4 (calibration proposal) of the DRO-ARA RFC rollout.

Engine patch (PR #349) delivered as intended: stationarity rate on live assets jumped from 1–7 % (pre-patch) to 100 % (post-patch) across SPDR, USA 500, XAUUSD, EURGBP, EURUSD. INV-DRO3 is now a non-trivial filter.

With the upstream filter unblocked, threshold tuning of (H_CRITICAL, RS_LONG_THRESH) becomes the binding question. This commit ships the empirical answer on five assets:

| asset | n_folds | active_cells | best (H, rs) | best mean Sharpe | passing |
|---|---|---|---|---|---|
| SPDR S&P500 | 69 | 159 | (0.40, 0.10) | −0.0114 | 0 |
| XAUUSD | 286 | 619 | (0.30, 0.10) | +0.0051 | 0 |
| USA 500 Idx | 150 | 562 | (0.50, 0.35) | −0.0081 | 0 |
| EURGBP | 297 | 1 251 | (0.35, 0.10) | −0.0199 | 0 |
| EURUSD | 301 | 599 | (0.50, 0.35) | −0.0096 | 0 |

Zero (H, rs) pairs pass the rejection filters (mean Sharpe ≥ 0.80, worst DD ≤ 0.25, mean trades ≥ 20) on any asset. Best active-cell Sharpe is +0.005 on one single fold of XAUUSD, which is statistically zero.

Verdict: STRATEGY_UNPROFITABLE / REJECT. Do not modify H_CRITICAL or RS_LONG_THRESH. Root cause is not threshold choice but upstream integration: combo_v1 (AMMComboStrategy) runs with constant R/κ stubs and daily-bar granularity that fail to exercise its HFT-designed state machine.

Script improvements:
- Ranking by (mean_sharpe, gate_on_folds) with active-cells-first filter, so degenerate 0-Sharpe cells no longer mask the real best active pair.
- New verdict branch STRATEGY_UNPROFITABLE for the "grid activates but never profits" case (distinct from NO_ACTIVITY).
- Asymmetric HALT tiers: ΔH>0.20 ∨ Δrs>0.30 → escalate re-spec; ΔH>0.10 ∨ Δrs>0.10 → await operator.
- `hurst_on_train` now runs ADF on log-returns to match engine fix.

Fail-closed invariants preserved:
- INV-DRO1..DRO5 unchanged (no engine modification).
- PR #349 regression suite (11 274 tests) remains green.
- Zero threshold constants modified in core/dro_ara/engine.py.

Artefacts:
- docs/DRO_ARA_CALIBRATION_REPORT.md: canonical SPDR run
- docs/DRO_ARA_CALIBRATION_REPORT_v2.md: multi-asset synthesis
- experiments/dro_ara_calibration/results/: SPDR CSV + summary + heatmap
- experiments/.../results/multi_asset/: 5 per-asset CSV + heatmap + summary + aggregate.json

Closes T3 + T4 of the RFC rollout. Demo-ready as diagnostic evidence that combo_v1 × DRO-ARA is NOT a calibratable edge on daily OHLC; the honest finding replaces the pre-patch degenerate all-zero artefact.

Refs: PR #345 (RFC), PR #349 (engine patch), docs/RFC_DRO_ARA_STATIONARITY_CONVENTION.md.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
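The ranking and verdict logic described under "Script improvements" could look roughly like the sketch below. The rejection thresholds and the verdict names `NO_ACTIVITY` and `STRATEGY_UNPROFITABLE` come from the commit message; `CellStats`, `passes`, `verdict`, the `CANDIDATE` label, and the sample cells are hypothetical and are not taken from the script or its results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CellStats:
    """One (H, rs) grid cell, aggregated over folds (hypothetical container)."""
    H: float
    rs: float
    mean_sharpe: float
    worst_dd: float
    mean_trades: float
    gate_on_folds: int


def passes(c: CellStats) -> bool:
    # Rejection filters quoted in the commit message.
    return c.mean_sharpe >= 0.80 and c.worst_dd <= 0.25 and c.mean_trades >= 20


def verdict(cells: list[CellStats]) -> tuple[str, Optional[CellStats]]:
    # Active-cells-first: cells whose gate never opened are excluded before
    # ranking, so their trivial 0.0 Sharpe cannot outrank a negative active cell.
    active = [c for c in cells if c.gate_on_folds > 0]
    if not active:
        return "NO_ACTIVITY", None
    best = max(active, key=lambda c: (c.mean_sharpe, c.gate_on_folds))
    if not any(passes(c) for c in active):
        return "STRATEGY_UNPROFITABLE", best
    return "CANDIDATE", best  # hypothetical label for the passing branch


# Made-up cells for illustration only.
cells = [
    CellStats(0.45, 0.33, 0.0, 0.0, 0.0, 0),      # gate never opens
    CellStats(0.40, 0.10, -0.011, 0.12, 3.0, 5),  # active but unprofitable
    CellStats(0.30, 0.10, -0.020, 0.15, 4.0, 9),
]
label, best = verdict(cells)
print(label, (best.H, best.rs) if best else None)
```

Without the active-cells-first step, the first cell would win the ranking with a Sharpe of exactly zero, which is the masking effect the commit message describes.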
…t strengthened (#352)

Upgrades the v2 descriptive REJECT (#351) to a frontier-grade inferential claim by layering five statistical attachments on every (H, rs) grid cell and two per-asset baselines.

Rigor layer (experiments/dro_ara_calibration/rigor.py):
* Block-bootstrap (Politis–Romano) 95 % CI on mean Sharpe.
* Sign-flip surrogate null → empirical two-sided p-value.
* Lopez-de-Prado Deflated Sharpe → P(edge real | N trials).
* 80 % power / 5 % α → min detectable Sharpe per cell.
* Buy-and-hold and random-gate-at-matched-rate baselines.

Report pipeline (experiments/dro_ara_calibration/rigor_report.py):
* Reads v2 multi-asset grid CSVs → attaches per-cell rigor metrics.
* Benjamini-Hochberg FDR correction across the (H, rs) grid.
* Auto-generated docs/DRO_ARA_CALIBRATION_v3_RIGOR.md with empirical findings (not boilerplate).

Empirical findings (5 assets):

| asset | n_active | n_fdr_pass | bh_sharpe | rand_gate | best_sharpe | P(real) |
|---|---|---|---|---|---|---|
| spdr_sp500 | 20 | 0 | +1.40 | -0.28 | -0.26 | 0.003 |
| xauusd | 49 | 0 | +0.60 | -0.51 | +1.45 | 0.162 |
| usa500 | 36 | 7 | +1.21 | -0.53 | -0.40 | 0.001 |
| eurgbp | 30 | 20 | +0.01 | -1.14 | -0.84 | 3e-6 |
| eurusd | 36 | 20 | +0.00 | -0.75 | -1.07 | 3e-29 |

Stronger-than-v2 claim:
* 47 (H, rs) pairs survive BH-FDR, but as significantly NEGATIVE, not positive.
* On 3/5 assets the filter underperforms a random gate at matched rate: the DRO-ARA composition is a *reverse-indicator* on these assets.
* Zero cells clear DSR P(real) > 0.5 after Lopez-de-Prado deflation.
* XAUUSD +1.45 best cell has DSR prob 0.16, below the credibility threshold.

Packaging:
* experiments/__init__.py added (cleans up namespace package discovery; resolves mypy "source found twice" under --strict when submodules cross-reference via relative imports).
* tests/experiments/ mirrors the structure with __init__.py shims.

Tests (tests/experiments/dro_ara_calibration/test_rigor.py, 16 passing):
* Bootstrap CI coverage + degenerate sample handling.
* Sign-flip null p-value monotonic in effect size.
* DSR expected-max grows with n_trials; P(real) bounded [0, 1].
* Min detectable Sharpe shrinks with n_observations.
* Buy-hold Sharpe positive on upward drift, zero on constant.
* Random-gate Sharpe finite on synthetic GBM.
* BH-FDR rejects all when no significance, accepts clear signals, NaN-safe.
* End-to-end rigor_for_grid produces expected columns on synthetic grid.

Invariants preserved:
* Zero modifications to core/dro_ara/engine.py (no constants touched).
* No existing tests modified.
* 11 290 tests pass (v2 had 11 274; +16 new).
* ruff clean, black clean, mypy --strict clean on all new files.

Refs: PR #345 (RFC), PR #349 (engine patch), PR #351 (v2 calibration), docs/DRO_ARA_CALIBRATION_v3_RIGOR.md.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
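For readers unfamiliar with the Benjamini-Hochberg step applied across the (H, rs) grid, a minimal NaN-tolerant sketch follows. It is the standard step-up procedure written from the textbook definition; the function name `bh_reject`, the 5 % level, and the p-values are assumptions, and the code in `rigor_report.py` may be organised differently.

```python
import numpy as np


def bh_reject(pvals, alpha: float = 0.05) -> np.ndarray:
    """Benjamini-Hochberg step-up procedure.

    Returns a boolean mask of discoveries. NaN p-values (e.g. cells with
    no trades) are never flagged and do not count toward the number of tests.
    """
    p = np.asarray(pvals, dtype=float)
    reject = np.zeros(p.shape, dtype=bool)
    valid_idx = np.flatnonzero(~np.isnan(p))
    m = valid_idx.size
    if m == 0:
        return reject
    order = valid_idx[np.argsort(p[valid_idx])]      # indices by ascending p
    thresholds = alpha * np.arange(1, m + 1) / m     # alpha * k / m
    below = p[order] <= thresholds
    if below.any():
        k = int(np.flatnonzero(below).max())         # largest rank that passes
        reject[order[: k + 1]] = True                # flag it and all smaller p
    return reject


# Made-up p-values for illustration only.
mask = bh_reject([0.001, 0.8, float("nan"), 0.9])
print(mask)
```

A surviving cell only says that its mean Sharpe differs from zero at the chosen FDR level; the sign has to be read from the Sharpe itself, which is why the findings above report the survivors as significantly negative.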
Summary
- The binding constraint is not `H_CRITICAL` or `RS_LONG_THRESH`, but the upstream `_adf_stationary(raw_prices)` test: asset prices are canonical I(1) processes, so ≥93 % of train windows fail ADF across 8 askar assets → `Regime.INVALID` by construction on live market data.
- The Hurst path already uses `diff(log(price))` internally (engine.py:138); only the ADF path still runs on raw prices, which is a convention mismatch.
- No code change to `core/dro_ara/engine.py` in this PR.

Artefacts shipped
- `docs/RFC_DRO_ARA_STATIONARITY_CONVENTION.md`
- `docs/DRO_ARA_CALIBRATION_REPORT.md`
- `experiments/dro_ara_calibration/run_grid_search.py` (`mypy --strict` clean, ruff clean, `joblib n_jobs=-1`)
- `experiments/dro_ara_calibration/results/`

Stationarity scan across askar assets

Invariant audit (from RFC §4)

Test impact (from RFC §5)
The falsification battery in `tests/core/dro_ara/test_falsification.py` would require rewrites for the GBM / random-walk cases, because those tests currently encode price-level ADF expectations. Explicitly flagged; not loosened.

Decision required (RFC §9)

Test plan
- Compare `engine.py:138` (DFA on log-returns) vs `engine.py:81` (ADF on raw input).
- Run `python -m experiments.dro_ara_calibration.run_grid_search --data data/askar/archive/EURGBP_GMT+0_NO-DST.parquet` and confirm gate-on folds remain <5 % even on the most stationary asset.

🤖 Generated with Claude Code