docs(dro-ara): RFC — align stationarity convention (ADF on returns)#345

Merged: neuron7xLab merged 1 commit into main from docs/dro-ara-stationarity-rfc on Apr 21, 2026

@neuron7xLab
Owner

Summary

  • Walk-forward H × rs grid search (7×11 cells, 69 folds on SPDR S&P 500) returned all-zero Sharpe across the entire grid. The root cause is not H_CRITICAL or RS_LONG_THRESH but the upstream _adf_stationary(raw_prices) test: asset prices are canonical I(1) processes, so ADF fails to reject the unit-root null in ≥93 % of train windows across 8 askar assets, and the engine returns Regime.INVALID by construction on live market data.
  • Hurst path already runs on diff(log(price)) internally (engine.py:138); only the ADF path still runs on raw prices — a convention mismatch.
  • This PR is docs-only — RFC + evidence. No change to core/dro_ara/engine.py.
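
The mismatch between levels and returns can be illustrated with a minimal sketch. It is not the engine's `_adf_stationary`: it assumes a plain Dickey-Fuller regression with no lag augmentation, the asymptotic 5 % critical value of -2.86 (constant, no trend), and synthetic GBM paths. The names `df_tstat`, `looks_stationary`, `gbm_prices`, `log_returns`, and `stationary_rate` are hypothetical; only the `log(abs(price) + 1e-12)` transform follows the convention described in this thread.

```python
import math
import random

DF_CRIT_5PCT = -2.86  # asymptotic 5 % critical value (constant, no trend); an assumption


def df_tstat(y):
    """t-statistic of rho in dy_t = a + rho * y_{t-1} + e_t (no lagged differences)."""
    x = y[:-1]
    dy = [b - a for a, b in zip(y[:-1], y[1:])]
    n = len(x)
    mx, mdy = sum(x) / n, sum(dy) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (di - mdy) for xi, di in zip(x, dy))
    rho = sxy / sxx
    a = mdy - rho * mx
    rss = sum((di - a - rho * xi) ** 2 for xi, di in zip(x, dy))
    se = math.sqrt(rss / (n - 2) / sxx)
    return rho / se


def looks_stationary(y):
    """Reject the unit-root null when the t-statistic is below the critical value."""
    return df_tstat(y) < DF_CRIT_5PCT


def gbm_prices(n, seed, mu=0.0003, sigma=0.01, p0=100.0):
    """Synthetic geometric Brownian motion path (hypothetical parameters)."""
    rng = random.Random(seed)
    log_p = math.log(p0)
    out = [p0]
    for _ in range(n - 1):
        log_p += rng.gauss(mu, sigma)
        out.append(math.exp(log_p))
    return out


def log_returns(prices):
    """diff(log(|price| + 1e-12)), the convention the Hurst path is said to use."""
    logs = [math.log(abs(p) + 1e-12) for p in prices]
    return [b - a for a, b in zip(logs[:-1], logs[1:])]


def stationary_rate(transform, n_windows=200, n=250):
    """Share of synthetic windows flagged stationary after applying `transform`."""
    hits = sum(looks_stationary(transform(gbm_prices(n, seed))) for seed in range(n_windows))
    return hits / n_windows


rate_levels = stationary_rate(lambda p: p)
rate_returns = stationary_rate(log_returns)
print(f"stationary on raw prices:  {rate_levels:.1%}")
print(f"stationary on log-returns: {rate_returns:.1%}")
```

On synthetic unit-root paths the level test should flag only a small share of windows as stationary, while the return test should flag nearly all of them. That is the same qualitative pattern the scan reports on the askar assets.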

Artefacts shipped

| path | purpose |
|---|---|
| docs/RFC_DRO_ARA_STATIONARITY_CONVENTION.md | PROPOSED RFC: root cause, invariant audit, test-impact matrix, alternatives, rollout, fail-closed audit requirement |
| docs/DRO_ARA_CALIBRATION_REPORT.md | Empirical walk-forward report motivating the RFC |
| experiments/dro_ara_calibration/run_grid_search.py | Reproducible grid runner (mypy --strict clean, ruff clean, joblib n_jobs=-1) |
| experiments/dro_ara_calibration/results/ | CSV (5313 rows), summary.json, heatmap.png |

Stationarity scan across askar assets

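| asset | folds | stationary folds | stationary % |
|---|---|---|---|
| SPDR S&P 500 | 69 | 1 | 1.4 % |
| USA_500_Index | 150 | 0 | 0 % |
| XAUUSD | 286 | 4 | 1.4 % |
| 20y Treasury ETF | 70 | 1 | 1.4 % |
| AUDUSD | 297 | 14 | 4.7 % |
| EURUSD | 301 | 15 | 5.0 % |
| EURGBP | 297 | 21 | 7.1 % |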

Invariant audit (from RFC §4)

  • INV-DRO1, INV-DRO2, INV-DRO5 — unchanged.
  • INV-DRO3 — semantics tightened (stationarity now tested on returns, not levels).
  • INV-DRO4 — downstream gate unchanged; more windows become admissible, but H + rs + trend still fully gate LONG.

Test impact (from RFC §5)

Falsification battery in tests/core/dro_ara/test_falsification.py would require rewrites for GBM / random-walk cases — those tests currently encode price-level ADF expectations. Explicitly flagged; not loosened.

Decision required (RFC §9)

  • Approve — proceed to Step 2 engine-patch PR.
  • Amend — specify alternative / tighter scope.
  • Reject — close RFC, document high INVALID rate as intentional.

Test plan

  • Reviewer reads RFC §1–§6 end-to-end.
  • Reviewer confirms the convention-mismatch claim by inspecting engine.py:138 (DFA on log-returns) vs engine.py:81 (ADF on raw input).
  • Reviewer runs python -m experiments.dro_ara_calibration.run_grid_search --data data/askar/archive/EURGBP_GMT+0_NO-DST.parquet and confirms gate-on folds remain <5 % even on the most stationary asset.
  • Reviewer returns decision in RFC §9 checklist.

🤖 Generated with Claude Code

…ot prices)

Walk-forward H × rs grid search (7×11 cells, 69 folds on SPDR S&P 500) produced
a degenerate all-zero Sharpe matrix. Diagnostic scan across 8 askar assets
confirmed the binding constraint is not H_CRITICAL or RS_LONG_THRESH but the
upstream `_adf_stationary(raw_prices)` test: equity/FX/commodity prices are
canonical I(1) processes, so ADF fails to reject the unit-root null in ≥93% of
train windows → the engine's regime observer returns INVALID by construction on
live market data.

Hurst path already uses log-returns internally (engine.py:138). The ADF path
is the only site still running on raw prices — a convention mismatch.

This commit ships:
- docs/RFC_DRO_ARA_STATIONARITY_CONVENTION.md — PROPOSED, awaiting sign-off.
  Root cause, invariant audit, test impact, alternatives, rollout, fail-closed
  audit requirement. No code change in this PR.
- docs/DRO_ARA_CALIBRATION_REPORT.md — empirical evidence motivating the RFC.
- experiments/dro_ara_calibration/ — grid search script (mypy --strict clean),
  CSV, summary.json, heatmap.png. Reproducible via
  `python -m experiments.dro_ara_calibration.run_grid_search --data <parquet>`.

Decision required in RFC §9 before any engine patch.

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 13fa06fd1a


first = MIN_HISTORY
starts: list[int] = []
t = first
while t + TRAIN_WINDOW + TEST_WINDOW <= n:

P1: Stop truncating walk-forward folds prematurely

The fold loop currently requires t + TRAIN_WINDOW + TEST_WINDOW <= n, but t is already the test-window start and training data is taken from prices[t-TRAIN_WINDOW:t]; this adds an unnecessary future-history requirement and drops valid late folds. With the current constants, this excludes the final 12 folds (69 produced vs 81 available), which materially biases the calibration summary and crisis-window statistics toward older periods.
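
The effect of the extra `TRAIN_WINDOW` term can be sketched as follows. The constants and the series length `n` are hypothetical: they are not read from `run_grid_search.py` and were picked only so that the two loop conditions differ by `TRAIN_WINDOW / STEP = 12` folds and yield the 69 and 81 counts quoted above. `STEP` and `fold_starts` are illustrative names.

```python
# Hypothetical constants (assumptions, not the script's real values).
TRAIN_WINDOW = 252
TEST_WINDOW = 21
STEP = 21
MIN_HISTORY = 252  # first test-window start; must be >= TRAIN_WINDOW


def fold_starts(n, require_future_train):
    """Enumerate test-window starts t; training data is prices[t - TRAIN_WINDOW:t]."""
    extra = TRAIN_WINDOW if require_future_train else 0
    starts = []
    t = MIN_HISTORY
    # Reviewed condition: t + TRAIN_WINDOW + TEST_WINDOW <= n
    # Suggested condition: t + TEST_WINDOW <= n
    while t + extra + TEST_WINDOW <= n:
        starts.append(t)
        t += STEP
    return starts


n = 1953  # hypothetical series length
legacy = fold_starts(n, require_future_train=True)
fixed = fold_starts(n, require_future_train=False)
print(len(legacy), len(fixed))  # 69 81
```

Every fold in `fixed` still has a full training slice behind it and a full test slice ahead of it. The extra folds are the late ones that the stricter condition discards.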

Comment on lines +326 to +327
delta_H = abs(H_opt - H_CRITICAL)
delta_rs = abs(rs_opt - RS_LONG_THRESH)

P2: Keep signed deltas in the threshold comparison table

Using abs(...) for delta_H and delta_rs removes directionality, so the report can state positive deltas even when the optimal threshold is lower (e.g., current 0.45 vs optimal 0.30 prints +0.15). This can mislead operator decisions in non-degenerate runs because it obscures whether parameters should be increased or decreased.
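
A small sketch of the difference, using the 0.45 and 0.30 values from the example in this comment. `format_delta` is a hypothetical helper, not a function from the report script.

```python
def format_delta(current, optimal, signed=True):
    """Format optimal - current with an explicit sign; signed=False mimics abs(...)."""
    delta = optimal - current
    if not signed:
        delta = abs(delta)
    return f"{delta:+.2f}"


current_h, optimal_h = 0.45, 0.30  # values from the review example

print(format_delta(current_h, optimal_h, signed=False))  # +0.15 (direction lost)
print(format_delta(current_h, optimal_h, signed=True))   # -0.15 (threshold should go down)
```

If the magnitude is still needed for tier checks, it can be derived from the signed value at the point of comparison rather than stored in the table.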

@neuron7xLab neuron7xLab merged commit c5c0f3b into main Apr 21, 2026
10 checks passed
@neuron7xLab neuron7xLab deleted the docs/dro-ara-stationarity-rfc branch April 21, 2026 12:58
neuron7xLab added a commit that referenced this pull request Apr 21, 2026
… 2) (#349)

* fix(dro-ara): apply ADF to log-returns, not raw prices

Engine previously mixed stationarity conventions: DFA computes Hurst on
diff(log(price)) (engine.py:138), but ADF ran on raw prices. For canonical
I(1) asset prices, ADF failed to reject the unit-root null in ≈93–99% of
windows across all askar assets, making INV-DRO3 a near-tautology and
reducing INV-DRO4's LONG-gate to permanent INVALID.

Patch (4 lines in State.from_window):
- compute log_returns = np.diff(np.log(np.abs(arr) + 1e-12))
- run ADF on log_returns instead of raw arr
Conventions now consistent with DFA.

Empirical impact (SPDR S&P 500, 69 walk-forward folds):
  stationary rate:        1.4 % → 100 %    (binding constraint relaxed)
  rs_train_max:           0.298              (now bound by RS_LONG_THRESH)
  gate-on at current θ:   0 → still 0       (rs threshold now binding)

The patch unblocks calibration: with stationarity no longer the dominant
filter, threshold tuning of (H_CRITICAL, RS_LONG_THRESH) becomes
informative — the next step (T3 grid re-run, T4 calibration) is now meaningful.

Test impact: 4 tests rewritten to encode post-RFC semantics, 1 added.
- tests/core/dro_ara/test_falsification.py:
  * test_random_walk_is_invalid_or_transition →
      test_random_walk_returns_are_stationary_no_long
    (RW returns i.i.d. → stationary; INV-DRO4 forbids LONG)
  * test_gbm_with_drift_is_non_stationary →
      test_gbm_with_drift_returns_are_stationary_no_long
    (GBM returns N(μ,σ²) → stationary; INV-DRO4 forbids LONG)
- tests/core/dro_ara/test_invariants.py:
  * NEW test_inv_dro3_tightening_post_rfc_ou_stationary_rate
    (INV-DRO3 tightening guard: OU stationary rate > 50 % across 30 seeds)
- tests/core/strategies/test_dro_ara_filter.py:
  * test_apply_on_gbm_drifts_to_zero →
      test_apply_on_gbm_is_systematically_reduced
    (statistical: mean filter mult ≤ 0.55 across 30 seeds, ≥ 80 % reduced)
- tests/research/dro_ara/test_backtest_smoke.py:
  * test_backtest_on_gbm_yields_flat_positions →
      test_backtest_on_gbm_has_flat_and_active_bars_mix
    (filter still zeroes DRIFT path → flat-frac > 10 %)
- tests/research/dro_ara/test_power_mc.py:
  * test_gbm_drift_classifies_as_invalid_majority →
      test_gbm_drift_not_classified_as_critical_majority
    (false-positive guard: p_critical(GBM) ≤ 0.40)

Fail-closed audit (per RFC §8 + feedback_fail_closed_audit) — all PASS:
  1. tests/core/dro_ara/test_properties.py            8/8 green
  2. Hypothesis fuzz (14 @given strategies)           green, no crash
  3. SPDR 69-fold smoke gate-on/stat ≥ 20 %           PASS (100 %)
  4. Full repo regression                              11274 passed, 0 failed

Quality gates: ruff clean · black clean · mypy --strict clean.

Refs: PR #345 (RFC), docs/RFC_DRO_ARA_STATIONARITY_CONVENTION.md §3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style(dro-ara): black-format 2 test files (CI python-quality)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
neuron7xLab added a commit that referenced this pull request Apr 21, 2026
…nce (#351)

T3 (grid re-run) + T4 (calibration proposal) of the DRO-ARA RFC rollout.
Engine patch (PR #349) delivered as intended: stationarity rate on live
assets jumped from 0–7 % (pre-patch) to 100 % (post-patch) across SPDR,
USA 500, XAUUSD, EURGBP, EURUSD. INV-DRO3 is now a non-trivial filter.

With the upstream filter unblocked, threshold tuning of
(H_CRITICAL, RS_LONG_THRESH) becomes the binding question. This commit
ships the empirical answer on five assets:

  asset       n_folds  active_cells  best (H, rs)  best mean Sharpe  passing
  SPDR S&P500      69           159  (0.40, 0.10)          −0.0114        0
  XAUUSD          286           619  (0.30, 0.10)          +0.0051        0
  USA 500 Idx     150           562  (0.50, 0.35)          −0.0081        0
  EURGBP          297         1 251  (0.35, 0.10)          −0.0199        0
  EURUSD          301           599  (0.50, 0.35)          −0.0096        0

Zero (H, rs) pairs pass the rejection filters (mean Sharpe ≥ 0.80, worst
DD ≤ 0.25, mean trades ≥ 20) on any asset. Best active-cell Sharpe is
+0.005 on one single fold of XAUUSD — statistically zero.
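
The three rejection filters can be written as one predicate. This is a sketch: `CellSummary` and `passes_rejection_filters` are hypothetical names, and the drawdown and trade-count values in the example are placeholders, since only the Sharpe figure is given above.

```python
from dataclasses import dataclass

# Thresholds as stated in the commit message.
MIN_MEAN_SHARPE = 0.80
MAX_WORST_DD = 0.25
MIN_MEAN_TRADES = 20


@dataclass(frozen=True)
class CellSummary:
    """Aggregated walk-forward statistics for one (H, rs) grid cell (hypothetical type)."""
    h: float
    rs: float
    mean_sharpe: float
    worst_dd: float
    mean_trades: float


def passes_rejection_filters(cell):
    """A cell passes only if all three criteria hold at once."""
    return (
        cell.mean_sharpe >= MIN_MEAN_SHARPE
        and cell.worst_dd <= MAX_WORST_DD
        and cell.mean_trades >= MIN_MEAN_TRADES
    )


# Best XAUUSD cell: (H, rs) and mean Sharpe from the table; worst_dd and
# mean_trades are placeholder values for illustration only.
xau_best = CellSummary(h=0.30, rs=0.10, mean_sharpe=0.0051, worst_dd=0.10, mean_trades=25)
print(passes_rejection_filters(xau_best))  # False: the Sharpe criterion alone fails
```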

Verdict: STRATEGY_UNPROFITABLE / REJECT. Do not modify H_CRITICAL or
RS_LONG_THRESH. Root cause is not threshold choice but upstream integration:
combo_v1 (AMMComboStrategy) runs with constant R/κ stubs and daily-bar
granularity that fail to exercise its HFT-designed state machine.

Script improvements:
- Ranking by (mean_sharpe, gate_on_folds) with active-cells-first filter,
  so degenerate 0-Sharpe cells no longer mask the real best active pair.
- New verdict branch STRATEGY_UNPROFITABLE for the
  "grid activates but never profits" case (distinct from NO_ACTIVITY).
- Asymmetric HALT tiers: ΔH>0.20 ∨ Δrs>0.30 → escalate re-spec;
  ΔH>0.10 ∨ Δrs>0.10 → await operator.
- `hurst_on_train` now runs ADF on log-returns to match engine fix.
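
The two HALT tiers described above can be sketched as a small classifier. The function name and the returned labels are hypothetical, as is the assumption that the tiers compare magnitudes of the deltas; the commit message gives only the thresholds.

```python
def halt_tier(delta_h, delta_rs):
    """Map threshold deltas to a HALT tier; the stricter tier is checked first."""
    dh, drs = abs(delta_h), abs(delta_rs)  # assumption: tiers compare magnitudes
    if dh > 0.20 or drs > 0.30:
        return "ESCALATE_RESPEC"   # hypothetical label for "escalate re-spec"
    if dh > 0.10 or drs > 0.10:
        return "AWAIT_OPERATOR"    # hypothetical label for "await operator"
    return "NO_HALT"               # hypothetical label; this case is not named in the source


print(halt_tier(0.25, 0.00))
print(halt_tier(0.05, 0.20))
print(halt_tier(0.05, 0.05))
```

The tiers are asymmetric because the upper tier uses different cut-offs for ΔH (0.20) and Δrs (0.30), while the lower tier uses 0.10 for both.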

Fail-closed invariants preserved:
- INV-DRO1..DRO5 unchanged (no engine modification).
- PR #349 regression suite (11 274 tests) remains green.
- Zero threshold constants modified in core/dro_ara/engine.py.

Artefacts:
- docs/DRO_ARA_CALIBRATION_REPORT.md         — canonical SPDR run
- docs/DRO_ARA_CALIBRATION_REPORT_v2.md      — multi-asset synthesis
- experiments/dro_ara_calibration/results/   — SPDR CSV + summary + heatmap
- experiments/.../results/multi_asset/       — 5 per-asset CSV + heatmap +
                                               summary + aggregate.json

Closes T3 + T4 of the RFC rollout. Demo-ready as diagnostic evidence that
combo_v1 × DRO-ARA is NOT a calibratable edge on daily OHLC; the honest
finding replaces the pre-patch degenerate all-zero artefact.

Refs: PR #345 (RFC), PR #349 (engine patch),
docs/RFC_DRO_ARA_STATIONARITY_CONVENTION.md.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
neuron7xLab added a commit that referenced this pull request Apr 21, 2026
…t strengthened (#352)

Upgrades the v2 descriptive REJECT (#351) to a frontier-grade inferential
claim by layering five statistical attachments on every (H, rs) grid cell
and two per-asset baselines.

Rigor layer (experiments/dro_ara_calibration/rigor.py):
  * Block-bootstrap (Politis–Romano) 95 % CI on mean Sharpe.
  * Sign-flip surrogate null → empirical two-sided p-value.
  * Lopez-de-Prado Deflated Sharpe → P(edge real | N trials).
  * 80 % power / 5 % α → min detectable Sharpe per cell.
  * Buy-and-hold and random-gate-at-matched-rate baselines.
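
One item of the list above, the sign-flip surrogate null, can be sketched in a few lines. This is not the code in `rigor.py`: it assumes the test statistic is the mean of per-fold values and that the null is symmetric around zero. `sign_flip_pvalue` is a hypothetical name.

```python
import random


def sign_flip_pvalue(values, n_perm=2000, seed=0):
    """Two-sided empirical p-value for the mean under random sign flips."""
    rng = random.Random(seed)
    n = len(values)
    observed = abs(sum(values) / n)
    exceed = 0
    for _ in range(n_perm):
        # Flip each observation's sign with probability 1/2.
        flipped = sum(v if rng.random() < 0.5 else -v for v in values) / n
        if abs(flipped) >= observed:
            exceed += 1
    # The +1 terms keep the estimate strictly positive.
    return (1 + exceed) / (1 + n_perm)


consistent_edge = [1.0] * 20      # every fold has the same sign
no_edge = [1.0, -1.0] * 10        # folds cancel exactly
print(sign_flip_pvalue(consistent_edge))
print(sign_flip_pvalue(no_edge))  # 1.0: every surrogate is at least as extreme as 0
```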

Report pipeline (experiments/dro_ara_calibration/rigor_report.py):
  * Reads v2 multi-asset grid CSVs → attaches per-cell rigor metrics.
  * Benjamini-Hochberg FDR correction across the (H, rs) grid.
  * Auto-generated docs/DRO_ARA_CALIBRATION_v3_RIGOR.md with empirical
    findings (not boilerplate).
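
The Benjamini-Hochberg step-up procedure applied across the grid can be sketched as below. The NaN policy (NaN cells are excluded from the count and never rejected) is an assumption, and `bh_reject` is a hypothetical name rather than the function in `rigor_report.py`.

```python
import math


def bh_reject(pvalues, alpha=0.05):
    """Return a list of booleans: True where BH rejects the null at level alpha."""
    valid = sorted((p, i) for i, p in enumerate(pvalues) if not math.isnan(p))
    m = len(valid)
    cutoff_rank = 0
    for rank, (p, _) in enumerate(valid, start=1):
        # Largest rank k with p_(k) <= alpha * k / m defines the rejection set.
        if p <= alpha * rank / m:
            cutoff_rank = rank
    reject = [False] * len(pvalues)
    for _, i in valid[:cutoff_rank]:
        reject[i] = True
    return reject


pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(bh_reject(pvals))  # only the two smallest p-values are rejected
```

BH controls the false-discovery rate of the rejections but says nothing about direction. The sign of each surviving cell has to be read from its statistic.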

Empirical findings (5 assets):

  asset       n_active  n_fdr_pass  bh_sharpe  rand_gate  best_sharpe  P(real)
  spdr_sp500        20           0     +1.40      -0.28       -0.26    0.003
  xauusd            49           0     +0.60      -0.51       +1.45    0.162
  usa500            36           7     +1.21      -0.53       -0.40    0.001
  eurgbp            30          20     +0.01      -1.14       -0.84    3e-6
  eurusd            36          20     +0.00      -0.75       -1.07    3e-29

Stronger-than-v2 claim:
  * 47 (H, rs) pairs survive BH-FDR — as significantly NEGATIVE, not positive.
  * On 3/5 assets the filter underperforms a random gate at matched rate —
    the DRO-ARA composition is a *reverse-indicator* on these assets.
  * Zero cells clear DSR P(real) > 0.5 after Lopez-de-Prado deflation.
  * XAUUSD +1.45 best cell has DSR prob 0.16 — below credibility threshold.
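
For readers unfamiliar with the deflation step, here is a minimal sketch of the published Deflated Sharpe Ratio formulas: the expected maximum Sharpe under N independent trials, then the probability that the observed Sharpe exceeds it. It is not the code in `rigor.py`. Function names and example inputs are hypothetical, except `n_trials = 77`, which matches the 7×11 grid. Sharpe values are assumed to be per-period, not annualized.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()
EULER_GAMMA = 0.5772156649015329


def expected_max_sharpe(sharpe_var, n_trials):
    """Approximate E[max Sharpe] over n_trials trials with zero true edge."""
    if n_trials < 2:
        return 0.0
    z1 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    z2 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sharpe_var) * ((1.0 - EULER_GAMMA) * z1 + EULER_GAMMA * z2)


def deflated_sharpe_prob(sr, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    """P(true Sharpe > sr_benchmark), adjusting for sample length, skew and kurtosis."""
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    return _STD_NORMAL.cdf((sr - sr_benchmark) * math.sqrt(n_obs - 1) / denom)


# Hypothetical inputs: 77 trials (7 x 11 grid), placeholder variance and Sharpe.
sr0 = expected_max_sharpe(sharpe_var=0.01, n_trials=77)
p_real = deflated_sharpe_prob(sr=0.10, sr_benchmark=sr0, n_obs=250)
print(f"benchmark Sharpe from {77} trials: {sr0:.3f}")
print(f"P(edge real): {p_real:.3f}")
```

The benchmark grows with the number of trials, so a larger grid needs a stronger cell before its probability clears 0.5.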

Packaging:
  * experiments/__init__.py added (cleans up namespace package discovery;
    resolves mypy "source found twice" under --strict when submodules
    cross-reference via relative imports).
  * tests/experiments/ mirrors the structure with __init__.py shims.

Tests (tests/experiments/dro_ara_calibration/test_rigor.py, 16 passing):
  * Bootstrap CI coverage + degenerate sample handling.
  * Sign-flip null p-value monotonic in effect size.
  * DSR expected-max grows with n_trials; P(real) bounded [0, 1].
  * Min detectable Sharpe shrinks with n_observations.
  * Buy-hold Sharpe positive on upward drift, zero on constant.
  * Random-gate Sharpe finite on synthetic GBM.
  * BH-FDR rejects all when no significance, accepts clear signals, NaN-safe.
  * End-to-end rigor_for_grid produces expected columns on synthetic grid.
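
The property "min detectable Sharpe shrinks with n_observations" follows from the usual normal-approximation power formula. The sketch below assumes a two-sided test, 5 % α, 80 % power, and unit dispersion of per-fold Sharpe by default. `min_detectable_sharpe` is a hypothetical name and may differ from the definition in `rigor.py`.

```python
import math
from statistics import NormalDist


def min_detectable_sharpe(n_obs, sigma=1.0, alpha=0.05, power=0.80):
    """Smallest mean detectable with the given power: (z_{1-a/2} + z_power) * sigma / sqrt(n)."""
    z = NormalDist()
    return (z.inv_cdf(1.0 - alpha / 2.0) + z.inv_cdf(power)) * sigma / math.sqrt(n_obs)


# Fold counts taken from the tables in this thread (SPDR: 69, EURUSD: 301).
for n_folds in (69, 301):
    print(n_folds, round(min_detectable_sharpe(n_folds), 3))
```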

Invariants preserved:
  * Zero modifications to core/dro_ara/engine.py (no constants touched).
  * No existing tests modified.
  * 11 290 tests pass (v2 had 11 274; +16 new).
  * ruff clean, black clean, mypy --strict clean on all new files.

Refs: PR #345 (RFC), PR #349 (engine patch), PR #351 (v2 calibration),
docs/DRO_ARA_CALIBRATION_v3_RIGOR.md.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>