docs(dro-ara): RFC — align stationarity convention (ADF on returns) by neuron7xLab · Pull Request #345 · neuron7xLab/GeoSync

neuron7xLab · 2026-04-21T12:48:51Z

Summary

Walk-forward H × rs grid search (7×11, 69 folds on SPDR S&P 500) returned all-zero Sharpe across the entire grid. The root cause is not H_CRITICAL or RS_LONG_THRESH but the upstream _adf_stationary(raw_prices) test: asset prices are canonical I(1) processes, so ADF fails to reject the unit-root null on ≥93 % of train windows across 8 askar assets → Regime.INVALID by construction on live market data.
The Hurst path already runs on diff(log(price)) internally (engine.py:138); only the ADF path still runs on raw prices, which is a convention mismatch.
This PR is docs-only — RFC + evidence. No change to core/dro_ara/engine.py.
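
The level-versus-returns contrast described above can be reproduced on synthetic data. The sketch below is not the engine's `_adf_stationary`; it is a simplified Dickey-Fuller regression with no lag terms, written with NumPy only, and the names `df_tstat`, `rejection_rates`, and the critical value constant are assumptions made for illustration. It simulates driftless geometric random walks and counts how often the unit-root null is rejected on price levels and on log-returns.

```python
import numpy as np

# Approximate asymptotic 5 % critical value for the constant-only
# Dickey-Fuller regression (assumption: no lag augmentation, no trend term).
DF_CRIT_5PCT = -2.86


def df_tstat(x: np.ndarray) -> float:
    """t-statistic of rho in dx_t = alpha + rho * x_{t-1} + e_t."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)
    design = np.column_stack([np.ones(dx.size), x[:-1]])
    beta, *_ = np.linalg.lstsq(design, dx, rcond=None)
    resid = dx - design @ beta
    sigma2 = resid @ resid / (dx.size - 2)
    se_rho = np.sqrt(sigma2 * np.linalg.inv(design.T @ design)[1, 1])
    return float(beta[1] / se_rho)


def rejection_rates(n_paths: int = 200, n_obs: int = 500,
                    sigma: float = 0.01, seed: int = 0) -> tuple[float, float]:
    """Share of synthetic paths on which the unit-root null is rejected."""
    rng = np.random.default_rng(seed)
    reject_levels = 0
    reject_returns = 0
    for _ in range(n_paths):
        # Synthetic I(1) price path: exponentiated Gaussian random walk.
        prices = 100.0 * np.exp(np.cumsum(rng.normal(0.0, sigma, n_obs)))
        log_returns = np.diff(np.log(prices))
        reject_levels += int(df_tstat(prices) < DF_CRIT_5PCT)
        reject_returns += int(df_tstat(log_returns) < DF_CRIT_5PCT)
    return reject_levels / n_paths, reject_returns / n_paths


rate_levels, rate_returns = rejection_rates()
print(f"unit root rejected on price levels: {rate_levels:.1%}")
print(f"unit root rejected on log-returns:  {rate_returns:.1%}")
```

On levels the rejection rate should sit near the nominal test size, while on log-returns it should be close to 100 %, which is the same qualitative pattern the scan reports on live assets.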

Artefacts shipped

| path | purpose |
| --- | --- |
| `docs/RFC_DRO_ARA_STATIONARITY_CONVENTION.md` | PROPOSED RFC: root cause, invariant audit, test-impact matrix, alternatives, rollout, fail-closed audit requirement |
| `docs/DRO_ARA_CALIBRATION_REPORT.md` | Empirical walk-forward report motivating the RFC |
| `experiments/dro_ara_calibration/run_grid_search.py` | Reproducible grid runner (`mypy --strict` clean, ruff clean, `joblib n_jobs=-1`) |
| `experiments/dro_ara_calibration/results/` | CSV (5313 rows), summary.json, heatmap.png |

Stationarity scan across askar assets

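| asset | folds | stationary folds | stationary % |
| --- | --- | --- | --- |
| SPDR S&P 500 | 69 | 1 | 1.4 % |
| USA_500_Index | 150 | 0 | 0 % |
| XAUUSD | 286 | 4 | 1.4 % |
| 20y Treasury ETF | 70 | 1 | 1.4 % |
| AUDUSD | 297 | 14 | 4.7 % |
| EURUSD | 301 | 15 | 5.0 % |
| EURGBP | 297 | 21 | 7.1 % |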
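
The percentages in the scan can be recomputed from the fold counts. The snippet below is only arithmetic on the rows listed above (variable names are illustrative); it also derives the pooled stationary share and the smallest per-asset non-stationary share, which is where the "≥93 %" figure in the summary comes from after rounding.

```python
# (folds, stationary folds) per asset, copied from the scan table.
scan = {
    "SPDR S&P 500": (69, 1),
    "USA_500_Index": (150, 0),
    "XAUUSD": (286, 4),
    "20y Treasury ETF": (70, 1),
    "AUDUSD": (297, 14),
    "EURUSD": (301, 15),
    "EURGBP": (297, 21),
}

for asset, (folds, stationary) in scan.items():
    print(f"{asset:<18} {100.0 * stationary / folds:4.1f} % stationary")

total_folds = sum(folds for folds, _ in scan.values())
total_stationary = sum(stationary for _, stationary in scan.values())
pooled_pct = 100.0 * total_stationary / total_folds

# Smallest share of windows that fail ADF on any single listed asset.
min_nonstationary_pct = min(
    100.0 * (1.0 - stationary / folds) for folds, stationary in scan.values()
)

print(f"pooled stationary share: {pooled_pct:.1f} % of {total_folds} folds")
print(f"minimum non-stationary share: {min_nonstationary_pct:.1f} %")
```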

Invariant audit (from RFC §4)

INV-DRO1, INV-DRO2, INV-DRO5 — unchanged.
INV-DRO3 — semantics tightened (stationarity now tested on returns, not levels).
INV-DRO4 — downstream gate unchanged; more windows become admissible, but H + rs + trend still fully gate LONG.

Test impact (from RFC §5)

Falsification battery in tests/core/dro_ara/test_falsification.py would require rewrites for GBM / random-walk cases — those tests currently encode price-level ADF expectations. Explicitly flagged; not loosened.

Decision required (RFC §9)

Approve — proceed to Step 2 engine-patch PR.
Amend — specify alternative / tighter scope.
Reject — close RFC, document high INVALID rate as intentional.

Test plan

Reviewer reads RFC §1–§6 end-to-end.
Reviewer confirms the convention-mismatch claim by inspecting engine.py:138 (DFA on log-returns) vs engine.py:81 (ADF on raw input).
Reviewer runs `python -m experiments.dro_ara_calibration.run_grid_search --data data/askar/archive/EURGBP_GMT+0_NO-DST.parquet` and confirms gate-on folds remain <5 % even on the most stationary asset.
Reviewer returns decision in RFC §9 checklist.

🤖 Generated with Claude Code

…ot prices)

Walk-forward H × rs grid search (7×11 cells, 69 folds on SPDR S&P 500) produced a degenerate all-zero Sharpe matrix. Diagnostic scan across 8 askar assets confirmed the binding constraint is not H_CRITICAL or RS_LONG_THRESH but the upstream `_adf_stationary(raw_prices)` test: equity/FX/commodity prices are canonical I(1) processes, so ADF fails to reject the unit-root null on ≥93% of train windows → the engine's regime observer returns INVALID by construction on live market data.

Hurst path already uses log-returns internally (engine.py:138). The ADF path is the only site still running on raw prices, which is a convention mismatch.

This commit ships:
- docs/RFC_DRO_ARA_STATIONARITY_CONVENTION.md: PROPOSED, awaiting sign-off. Root cause, invariant audit, test impact, alternatives, rollout, fail-closed audit requirement. No code change in this PR.
- docs/DRO_ARA_CALIBRATION_REPORT.md: empirical evidence motivating the RFC.
- experiments/dro_ara_calibration/: grid search script (mypy --strict clean), CSV, summary.json, heatmap.png. Reproducible via `python -m experiments.dro_ara_calibration.run_grid_search --data <parquet>`.

Decision required in RFC §9 before any engine patch.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 13fa06fd1a


chatgpt-codex-connector · 2026-04-21T12:53:36Z

+    first = MIN_HISTORY
+    starts: list[int] = []
+    t = first
+    while t + TRAIN_WINDOW + TEST_WINDOW <= n:


Stop truncating walk-forward folds prematurely

The fold loop currently requires t + TRAIN_WINDOW + TEST_WINDOW <= n, but t is already the test-window start and training data is taken from prices[t-TRAIN_WINDOW:t]; this adds an unnecessary future-history requirement and drops valid late folds. With the current constants, this excludes the final 12 folds (69 produced vs 81 available), which materially biases the calibration summary and crisis-window statistics toward older periods.
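
The effect of the extra `TRAIN_WINDOW` term can be seen by enumerating fold starts under both loop conditions. The sketch below is illustrative only: `fold_starts` is a hypothetical helper, and the series length, `MIN_HISTORY`, window sizes, and step are assumed values picked so that the counts match the 69 and 81 quoted in this comment; the constants actually used in `run_grid_search.py` are not shown in this thread.

```python
def fold_starts(n: int, first: int, train_window: int, test_window: int,
                step: int, require_future_train: bool) -> list[int]:
    """Enumerate test-window start indices t.

    Training data is prices[t - train_window : t], so the only forward-looking
    requirement is that the test window [t, t + test_window) fits inside n.
    With require_future_train=True the loop mimics the reviewed condition
    t + train_window + test_window <= n.
    """
    horizon = test_window + (train_window if require_future_train else 0)
    starts: list[int] = []
    t = first
    while t + horizon <= n:
        starts.append(t)
        t += step
    return starts


# Assumed, illustrative constants (not taken from the repository).
N_BARS = 1953
MIN_HISTORY = 252
TRAIN_WINDOW = 252
TEST_WINDOW = 21
STEP = 21

reviewed = fold_starts(N_BARS, MIN_HISTORY, TRAIN_WINDOW, TEST_WINDOW, STEP, True)
suggested = fold_starts(N_BARS, MIN_HISTORY, TRAIN_WINDOW, TEST_WINDOW, STEP, False)

print(len(reviewed), len(suggested), len(suggested) - len(reviewed))
```

Under these assumptions the dropped folds are exactly the last `TRAIN_WINDOW / STEP` starts, all of which still have a full training window behind them and a full test window ahead.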


chatgpt-codex-connector · 2026-04-21T12:53:36Z

+    delta_H = abs(H_opt - H_CRITICAL)
+    delta_rs = abs(rs_opt - RS_LONG_THRESH)


Keep signed deltas in the threshold comparison table

Using abs(...) for delta_H and delta_rs removes directionality, so the report can state positive deltas even when the optimal threshold is lower (e.g., current 0.45 vs optimal 0.30 prints +0.15). This can mislead operator decisions in non-degenerate runs because it obscures whether parameters should be increased or decreased.
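
The example in this comment can be checked directly. The snippet uses the 0.45 and 0.30 values quoted above; the variable names `current` and `optimal` are placeholders rather than identifiers from the script.

```python
current = 0.45   # threshold currently configured (value from the comment)
optimal = 0.30   # threshold the grid search would prefer (value from the comment)

abs_delta = abs(optimal - current)   # what the reviewed code computes
signed_delta = optimal - current     # keeps the direction of the change

abs_label = f"{abs_delta:+.2f}"
signed_label = f"{signed_delta:+.2f}"

print("abs   :", abs_label)     # direction lost
print("signed:", signed_label)  # negative: the threshold should be lowered
```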



… 2) (#349)

* fix(dro-ara): apply ADF to log-returns, not raw prices

Engine previously mixed stationarity conventions: DFA computes Hurst on diff(log(price)) (engine.py:138), but ADF ran on raw prices. For canonical I(1) asset prices, ADF failed to reject the unit-root null ≈93–99% of the time across all askar assets, making INV-DRO3 a near-tautology and reducing INV-DRO4's LONG-gate to permanent INVALID.

Patch (4 lines in State.from_window):
- compute log_returns = np.diff(np.log(np.abs(arr) + 1e-12))
- run ADF on log_returns instead of raw arr

Conventions now consistent with DFA.

Empirical impact (SPDR S&P 500, 69 walk-forward folds):
- stationary rate: 1.4 % → 100 % (binding constraint relaxed)
- rs_train_max: 0.298 (now bound by RS_LONG_THRESH)
- gate-on at current θ: 0 → still 0 (rs threshold now binding)

The patch unblocks calibration: with stationarity no longer the dominant filter, threshold tuning of (H_CRITICAL, RS_LONG_THRESH) becomes informative, so the next step (T3 grid re-run, T4 calibration) is now meaningful.

Test impact: 4 tests rewritten to encode post-RFC semantics, 1 added.
- tests/core/dro_ara/test_falsification.py:
  * test_random_walk_is_invalid_or_transition → test_random_walk_returns_are_stationary_no_long (RW returns i.i.d. → stationary; INV-DRO4 forbids LONG)
  * test_gbm_with_drift_is_non_stationary → test_gbm_with_drift_returns_are_stationary_no_long (GBM returns N(μ,σ²) → stationary; INV-DRO4 forbids LONG)
- tests/core/dro_ara/test_invariants.py:
  * NEW test_inv_dro3_tightening_post_rfc_ou_stationary_rate (INV-DRO3 tightening guard: OU stationary rate > 50 % across 30 seeds)
- tests/core/strategies/test_dro_ara_filter.py:
  * test_apply_on_gbm_drifts_to_zero → test_apply_on_gbm_is_systematically_reduced (statistical: mean filter mult ≤ 0.55 across 30 seeds, ≥ 80 % reduced)
- tests/research/dro_ara/test_backtest_smoke.py:
  * test_backtest_on_gbm_yields_flat_positions → test_backtest_on_gbm_has_flat_and_active_bars_mix (filter still zeroes DRIFT path → flat-frac > 10 %)
- tests/research/dro_ara/test_power_mc.py:
  * test_gbm_drift_classifies_as_invalid_majority → test_gbm_drift_not_classified_as_critical_majority (false-positive guard: p_critical(GBM) ≤ 0.40)

Fail-closed audit (per RFC §8 + feedback_fail_closed_audit), all PASS:
1. tests/core/dro_ara/test_properties.py 8/8 green
2. Hypothesis fuzz (14 @given strategies) green, no crash
3. SPDR 69-fold smoke gate-on/stat ≥ 20 % PASS (100 %)
4. Full repo regression 11274 passed, 0 failed

Quality gates: ruff clean · black clean · mypy --strict clean.

Refs: PR #345 (RFC), docs/RFC_DRO_ARA_STATIONARITY_CONVENTION.md §3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style(dro-ara): black-format 2 test files (CI python-quality)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
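
The transformation named in the patch description can be tried in isolation. In the sketch below only the expression inside the function is taken from the commit message; the wrapper name `log_returns_from_window` and the sample inputs are assumptions, and the surrounding `State.from_window` code is not reproduced here.

```python
import numpy as np


def log_returns_from_window(window) -> np.ndarray:
    """Log-returns as written in the patch description.

    np.abs and the 1e-12 offset keep np.log finite when a window contains
    zeros or negative values; for ordinary positive prices their effect is
    negligible.
    """
    arr = np.asarray(window, dtype=float)
    return np.diff(np.log(np.abs(arr) + 1e-12))


# Prices growing by exactly 1 % per bar give a constant log-return of log(1.01).
prices = 100.0 * 1.01 ** np.arange(4)
returns = log_returns_from_window(prices)
print(returns)

# A zero in the window yields large but finite values instead of -inf or NaN.
guarded = log_returns_from_window([1.0, 0.0, 1.0])
print(guarded)
```

The output has one element fewer than the input window, so any ADF minimum-sample check applies to `len(window) - 1` observations.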

…nce (#351)

T3 (grid re-run) + T4 (calibration proposal) of the DRO-ARA RFC rollout.

Engine patch (PR #349) delivered as intended: stationarity rate on live assets jumped from 0–7 % (pre-patch) to 100 % (post-patch) across SPDR, USA 500, XAUUSD, EURGBP, EURUSD. INV-DRO3 is now a non-trivial filter.

With the upstream filter unblocked, threshold tuning of (H_CRITICAL, RS_LONG_THRESH) becomes the binding question. This commit ships the empirical answer on five assets:

| asset | n_folds | active_cells | best (H, rs) | best mean Sharpe | passing |
| --- | --- | --- | --- | --- | --- |
| SPDR S&P500 | 69 | 159 | (0.40, 0.10) | −0.0114 | 0 |
| XAUUSD | 286 | 619 | (0.30, 0.10) | +0.0051 | 0 |
| USA 500 Idx | 150 | 562 | (0.50, 0.35) | −0.0081 | 0 |
| EURGBP | 297 | 1 251 | (0.35, 0.10) | −0.0199 | 0 |
| EURUSD | 301 | 599 | (0.50, 0.35) | −0.0096 | 0 |

Zero (H, rs) pairs pass the rejection filters (mean Sharpe ≥ 0.80, worst DD ≤ 0.25, mean trades ≥ 20) on any asset. Best active-cell Sharpe is +0.005 on one single fold of XAUUSD, which is statistically zero.

Verdict: STRATEGY_UNPROFITABLE / REJECT. Do not modify H_CRITICAL or RS_LONG_THRESH. Root cause is not threshold choice but upstream integration: combo_v1 (AMMComboStrategy) runs with constant R/κ stubs and daily-bar granularity that fail to exercise its HFT-designed state machine.

Script improvements:
- Ranking by (mean_sharpe, gate_on_folds) with active-cells-first filter, so degenerate 0-Sharpe cells no longer mask the real best active pair.
- New verdict branch STRATEGY_UNPROFITABLE for the "grid activates but never profits" case (distinct from NO_ACTIVITY).
- Asymmetric HALT tiers: ΔH>0.20 ∨ Δrs>0.30 → escalate re-spec; ΔH>0.10 ∨ Δrs>0.10 → await operator.
- `hurst_on_train` now runs ADF on log-returns to match engine fix.

Fail-closed invariants preserved:
- INV-DRO1..DRO5 unchanged (no engine modification).
- PR #349 regression suite (11 274 tests) remains green.
- Zero threshold constants modified in core/dro_ara/engine.py.

Artefacts:
- docs/DRO_ARA_CALIBRATION_REPORT.md: canonical SPDR run
- docs/DRO_ARA_CALIBRATION_REPORT_v2.md: multi-asset synthesis
- experiments/dro_ara_calibration/results/: SPDR CSV + summary + heatmap
- experiments/.../results/multi_asset/: 5 per-asset CSV + heatmap + summary + aggregate.json

Closes T3 + T4 of the RFC rollout. Demo-ready as diagnostic evidence that combo_v1 × DRO-ARA is NOT a calibratable edge on daily OHLC; the honest finding replaces the pre-patch degenerate all-zero artefact.

Refs: PR #345 (RFC), PR #349 (engine patch), docs/RFC_DRO_ARA_STATIONARITY_CONVENTION.md.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
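
The rejection filters, the active-cells-first ranking, and the two verdict labels named in this commit message can be sketched as follows. This is a hedged illustration, not the script's code: `CellStats`, `passes_filters`, `best_active`, `verdict`, the "gate_on_folds > 0 means active" rule, and the `CANDIDATE` label are all assumptions, and the sample cells are synthetic.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds quoted in the commit message.
MIN_MEAN_SHARPE = 0.80
MAX_WORST_DRAWDOWN = 0.25
MIN_MEAN_TRADES = 20


@dataclass(frozen=True)
class CellStats:
    """Hypothetical per-(H, rs) summary row."""
    mean_sharpe: float
    worst_drawdown: float
    mean_trades: float
    gate_on_folds: int


def passes_filters(cell: CellStats) -> bool:
    return (cell.mean_sharpe >= MIN_MEAN_SHARPE
            and cell.worst_drawdown <= MAX_WORST_DRAWDOWN
            and cell.mean_trades >= MIN_MEAN_TRADES)


def best_active(cells: list[CellStats]) -> Optional[CellStats]:
    # Assumption: a cell is "active" when the gate opened on at least one fold.
    active = [c for c in cells if c.gate_on_folds > 0]
    return max(active, key=lambda c: (c.mean_sharpe, c.gate_on_folds), default=None)


def verdict(cells: list[CellStats]) -> str:
    active = [c for c in cells if c.gate_on_folds > 0]
    if not active:
        return "NO_ACTIVITY"
    if not any(passes_filters(c) for c in active):
        return "STRATEGY_UNPROFITABLE"
    return "CANDIDATE"  # hypothetical label for a passing cell


# Synthetic grid: one never-active cell with Sharpe 0 and two active losers.
grid = [
    CellStats(0.0, 0.0, 0.0, 0),
    CellStats(-0.011, 0.12, 30.0, 5),
    CellStats(-0.020, 0.15, 28.0, 9),
]
print(verdict(grid), best_active(grid))
```

Ranking on Sharpe alone would put the inactive 0-Sharpe cell first; filtering to active cells before ranking surfaces the least-bad active pair instead.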

…t strengthened (#352)

Upgrades the v2 descriptive REJECT (#351) to a frontier-grade inferential claim by layering five statistical attachments on every (H, rs) grid cell and two per-asset baselines.

Rigor layer (experiments/dro_ara_calibration/rigor.py):
* Block-bootstrap (Politis–Romano) 95 % CI on mean Sharpe.
* Sign-flip surrogate null → empirical two-sided p-value.
* Lopez-de-Prado Deflated Sharpe → P(edge real | N trials).
* 80 % power / 5 % α → min detectable Sharpe per cell.
* Buy-and-hold and random-gate-at-matched-rate baselines.

Report pipeline (experiments/dro_ara_calibration/rigor_report.py):
* Reads v2 multi-asset grid CSVs → attaches per-cell rigor metrics.
* Benjamini-Hochberg FDR correction across the (H, rs) grid.
* Auto-generated docs/DRO_ARA_CALIBRATION_v3_RIGOR.md with empirical findings (not boilerplate).

Empirical findings (5 assets):

| asset | n_active | n_fdr_pass | bh_sharpe | rand_gate | best_sharpe | P(real) |
| --- | --- | --- | --- | --- | --- | --- |
| spdr_sp500 | 20 | 0 | +1.40 | -0.28 | -0.26 | 0.003 |
| xauusd | 49 | 0 | +0.60 | -0.51 | +1.45 | 0.162 |
| usa500 | 36 | 7 | +1.21 | -0.53 | -0.40 | 0.001 |
| eurgbp | 30 | 20 | +0.01 | -1.14 | -0.84 | 3e-6 |
| eurusd | 36 | 20 | +0.00 | -0.75 | -1.07 | 3e-29 |

Stronger-than-v2 claim:
* 47 (H, rs) pairs survive BH-FDR, but as significantly NEGATIVE, not positive.
* On 3/5 assets the filter underperforms a random gate at matched rate: the DRO-ARA composition is a *reverse-indicator* on these assets.
* Zero cells clear DSR P(real) > 0.5 after Lopez-de-Prado deflation.
* XAUUSD +1.45 best cell has DSR prob 0.16, below credibility threshold.

Packaging:
* experiments/__init__.py added (cleans up namespace package discovery; resolves mypy "source found twice" under --strict when submodules cross-reference via relative imports).
* tests/experiments/ mirrors the structure with __init__.py shims.

Tests (tests/experiments/dro_ara_calibration/test_rigor.py, 16 passing):
* Bootstrap CI coverage + degenerate sample handling.
* Sign-flip null p-value monotonic in effect size.
* DSR expected-max grows with n_trials; P(real) bounded [0, 1].
* Min detectable Sharpe shrinks with n_observations.
* Buy-hold Sharpe positive on upward drift, zero on constant.
* Random-gate Sharpe finite on synthetic GBM.
* BH-FDR rejects all when no significance, accepts clear signals, NaN-safe.
* End-to-end rigor_for_grid produces expected columns on synthetic grid.

Invariants preserved:
* Zero modifications to core/dro_ara/engine.py (no constants touched).
* No existing tests modified.
* 11 290 tests pass (v2 had 11 274; +16 new).
* ruff clean, black clean, mypy --strict clean on all new files.

Refs: PR #345 (RFC), PR #349 (engine patch), PR #351 (v2 calibration), docs/DRO_ARA_CALIBRATION_v3_RIGOR.md.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
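
For readers unfamiliar with the Benjamini-Hochberg step mentioned in the report pipeline, here is a minimal NumPy sketch of the standard step-up procedure applied to a flat list of per-cell p-values. It is not the code in `rigor_report.py`; the function name `bh_reject`, the choice to leave NaN entries unrejected and out of the test count, and the sample p-values are assumptions.

```python
import numpy as np


def bh_reject(pvals, alpha: float = 0.05) -> np.ndarray:
    """Boolean mask of hypotheses rejected by the Benjamini-Hochberg step-up rule.

    NaN p-values are never rejected and do not count toward the number of
    tests m (an assumption made for this sketch).
    """
    p = np.asarray(pvals, dtype=float)
    reject = np.zeros(p.shape, dtype=bool)
    valid = np.flatnonzero(~np.isnan(p))
    m = valid.size
    if m == 0:
        return reject
    order = valid[np.argsort(p[valid])]           # indices sorted by p-value
    thresholds = alpha * np.arange(1, m + 1) / m  # k/m * alpha for rank k
    below = p[order] <= thresholds
    if below.any():
        k = int(np.flatnonzero(below).max())      # largest rank meeting its bound
        reject[order[: k + 1]] = True             # reject everything up to that rank
    return reject


# Synthetic p-values for eight grid cells plus one undefined cell.
cell_pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, float("nan")]
mask = bh_reject(cell_pvals)
print(mask)
```

A rejection only says a cell's mean Sharpe differs from zero at the chosen false-discovery rate; the sign of the effect has to be read separately, which is how cells can survive the correction as significantly negative.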

chatgpt-codex-connector Bot reviewed Apr 21, 2026


neuron7xLab merged commit c5c0f3b into main Apr 21, 2026
10 checks passed

neuron7xLab deleted the docs/dro-ara-stationarity-rfc branch April 21, 2026 12:58

neuron7xLab mentioned this pull request Apr 21, 2026

fix(dro-ara): apply ADF to log-returns, not raw prices (RFC #345 Step 2) #349

Merged

4 tasks

neuron7xLab mentioned this pull request Apr 21, 2026

docs(dro-ara): calibration v2 post-patch — REJECT tune, 5-asset evidence (T3+T4) #351

Merged

5 tasks

neuron7xLab mentioned this pull request Apr 21, 2026

feat(dro-ara): v3 rigor — null, DSR, power, baselines; strategy reject strengthened #352

Merged

5 tasks

neuron7xLab merged 1 commit into main from docs/dro-ara-stationarity-rfc
