
fix(orchestrate): #1781 apr serve startup-ready timeout — configurable + size-aware#1782

Merged: noahgift merged 2 commits into main from fix/apr-serve-ready-timeout-1781 on May 18, 2026


@noahgift

Summary

Closes #1781. The hardcoded Duration::from_secs(30) in AprServeDriver::wait_for_ready blocked large MoE GGUFs (Qwen3-Coder-30B, 18.5 GB) from ever becoming ready — cold-cache load exceeds 30s; warm-cache is ~1s.

Root cause (5-whys)

  1. Why did large MoE startups fail? → apr serve did not become ready within 30s
  2. Why? → Cold-cache load of 18.5 GB Qwen3-MoE GGUF exceeds 30s
  3. Why the 30s? → Hardcoded Duration::from_secs(30) at apr_serve.rs:143
  4. Why hardcoded? → No env-var or model-size scaling
  5. Why? → Designed for sub-2GB models that load in <5s

Fix

Two-axis: env override + size-aware default.

pub fn compute_ready_timeout_secs(
    model_size_bytes: Option<u64>,
    env_override: Option<&str>,
) -> u64 { ... }

Resolution order:

  1. APR_SERVE_READY_TIMEOUT_S=N env var (operator escape hatch; clamped ≥1s)
  2. Size-aware default: 30s baseline + 1s per 500 MB above 2 GB
  3. Unknown size → 30s baseline

Per-model budgets under the size-aware default:

| Model size | Budget |
|---|---|
| 1 GB | 30s (unchanged) |
| 4 GB | 34s |
| 18 GB (Qwen3-Coder-30B) | 62s |
| 30 GB | 87s |
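
The resolution order and budget table above can be sketched as follows. This is a minimal illustration, not the merged implementation: it assumes that "2 GB" and "500 MB" are binary units (GiB and MiB) and that the per-step division rounds down, since that combination reproduces every budget in the table. The constant names and the whitespace trimming of the override are assumptions as well.

```rust
/// Baseline budget in seconds, as stated in the PR description.
const BASELINE_SECS: u64 = 30;
/// Assumption: "2 GB" means 2 GiB. Sizes at or below this keep the baseline.
const SIZE_THRESHOLD_BYTES: u64 = 2 * 1024 * 1024 * 1024;
/// Assumption: "500 MB" means 500 MiB. Each full step above the threshold adds 1s.
const STEP_BYTES: u64 = 500 * 1024 * 1024;

pub fn compute_ready_timeout_secs(
    model_size_bytes: Option<u64>,
    env_override: Option<&str>,
) -> u64 {
    // 1. The env override wins when it parses as an unsigned integer.
    //    It is clamped to at least 1s. Trimming whitespace is an assumption.
    if let Some(secs) = env_override.and_then(|raw| raw.trim().parse::<u64>().ok()) {
        return secs.max(1);
    }
    match model_size_bytes {
        // 2. Size-aware default: baseline plus 1s per full step above the threshold.
        Some(size) => BASELINE_SECS + size.saturating_sub(SIZE_THRESHOLD_BYTES) / STEP_BYTES,
        // 3. Unknown size falls back to the baseline.
        None => BASELINE_SECS,
    }
}
```

Under these assumptions an unparsable override such as `abc` is ignored and the size-aware default applies, which matches the "invalid fallback" case named in the test plan.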

Error message updated:

apr serve did not become ready within Ns (override via APR_SERVE_READY_TIMEOUT_S)
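
For context, a readiness wait that produces this message might look like the sketch below. The function name `wait_until_ready`, the closure-based probe, and the `String` error type are hypothetical; the PR does not show how `AprServeDriver::wait_for_ready` actually probes the server.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Polls `is_ready` until it returns true or `timeout_secs` elapses.
/// Hypothetical helper: the probe is injected so the loop can be exercised
/// without spawning a subprocess.
pub fn wait_until_ready<F>(
    mut is_ready: F,
    timeout_secs: u64,
    poll_interval: Duration,
) -> Result<(), String>
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + Duration::from_secs(timeout_secs);
    loop {
        if is_ready() {
            return Ok(());
        }
        if Instant::now() >= deadline {
            // Message text taken from the PR description, with N filled in.
            return Err(format!(
                "apr serve did not become ready within {}s (override via APR_SERVE_READY_TIMEOUT_S)",
                timeout_secs
            ));
        }
        sleep(poll_interval);
    }
}
```

Naming the environment variable in the error means an operator who hits the timeout can see the escape hatch without reading the source.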

Test plan

  • 8 new unit tests in apr_serve_tests.rs (env precedence, clamping, invalid fallback, baseline, scaling, unknown size, env-when-size-unknown, real Qwen3-Coder-30B 18.5 GB)
  • apr_serve module: 17 → 25 tests GREEN
  • cargo check -p aprender-orchestrate clean
  • cargo clippy -p aprender-orchestrate --tests -- -D warnings clean
  • cargo fmt --check clean

Empirical evidence

  • paiml/claude-code-parity-apr M260 dispatch produced 15/15 student-side driver_error with this timeout
  • time apr serve run Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf against already-warm-cache model → ready in ~1s
  • Companion-side pre-warm workaround at paiml/claude-code-parity-apr M262 (PR #239); this PR obsoletes the workaround for hosts that set APR_SERVE_READY_TIMEOUT_S

Doctrine

Toyota Way: fixed at root cause instead of accepting the 30s budget as a hard constraint.

…e + size-aware

Closes #1781. Hardcoded `Duration::from_secs(30)` in
`AprServeDriver::wait_for_ready` blocked large MoE GGUFs from ever
becoming ready — Qwen3-Coder-30B at 18.5 GB exceeds 30s on cold-cache
loads. Empirical evidence:

- paiml/claude-code-parity-apr M260 dispatch: 15/15 student-side
  driver_error with "apr serve did not become ready within 30s"
- Warm-cache load (after the failed bench had mmap'd the GGUF into
  page cache): time-to-ready ~1s

Root cause (5-whys):
1. Why apr 0/15? → apr serve did not become ready in 30s
2. Why? → Cold-cache load of 18.5GB Qwen3-MoE GGUF exceeds 30s
3. Why the 30s? → Hardcoded Duration::from_secs(30) at line 143
4. Why hardcoded? → No env-var or model-size scaling
5. Why? → Designed for sub-2GB models that load in <5s

Two-axis fix:

1. APR_SERVE_READY_TIMEOUT_S env override (operator escape hatch):
   `APR_SERVE_READY_TIMEOUT_S=120 apr code ...` sets the budget to
   120s verbatim. Clamped to minimum 1s to avoid pathological zero.
   Non-integer values fall through to the size-aware default.

2. Size-aware default (auto-scale by model file size):
   - 30s baseline + 1s per 500 MB above 2 GB
   - 1 GB model → 30s (unchanged)
   - 4 GB → 34s
   - 18 GB Qwen3-Coder-30B → 62s
   - 30 GB → 87s
   - Unknown size (stat failed) → 30s baseline

`AprServeDriver` gains a `model_size_bytes: Option<u64>` field
populated via `std::fs::metadata(&model_path)` at launch. Resolution
extracted to free `pub fn compute_ready_timeout_secs(...)` so the
logic is unit-testable without spawning a subprocess.
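
A minimal sketch of how the size could be read at launch, assuming a hypothetical free helper named `model_size_bytes` (the commit only states that the field is populated via `std::fs::metadata`):

```rust
use std::path::Path;

/// Returns the model file size in bytes, or None when the stat fails.
/// A None result corresponds to the "unknown size" case, which keeps the
/// 30s baseline.
pub fn model_size_bytes(model_path: &Path) -> Option<u64> {
    std::fs::metadata(model_path).ok().map(|meta| meta.len())
}
```

Mapping a failed stat to `None` rather than an error keeps launch behavior unchanged for paths that cannot be inspected.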

Error message updated to mention the env override:
  "apr serve did not become ready within Ns (override via APR_SERVE_READY_TIMEOUT_S)"

8 new tests (env override precedence, clamping, invalid override
fallback, small-model baseline, size-aware scaling, unknown size,
env-override-when-size-unknown, real Qwen3-Coder-30B 18.5 GB size).
apr_serve module: 17 → 25 tests GREEN. Workspace cargo check clean;
clippy clean; fmt clean (with project's nightly-required fmt config).

Companion-side workaround at paiml/claude-code-parity-apr M262 (PR #239)
shipped pre-warm step in bench scripts as an immediate measure;
this PR is the proper upstream fix that obsoletes the need for the
workaround on hosts where APR_SERVE_READY_TIMEOUT_S can be set.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 18, 2026 07:19
@noahgift noahgift merged commit ff9d0c9 into main May 18, 2026
10 checks passed
@noahgift noahgift deleted the fix/apr-serve-ready-timeout-1781 branch May 18, 2026 08:45
noahgift added a commit that referenced this pull request May 18, 2026
…eights (#1790)

Empty or undersized `weight.data` would cause a cryptic panic deep
in `fused_matmul_f32`:

  thread '<unnamed>' panicked at matmul_fused.rs:211:54:
  index out of bounds: the len is 0 but the index is 56311808

Stack traces fire on every rayon worker simultaneously, with no
indication that the root cause is an upstream tensor-loading bug.

Most-likely root cause (per #1789): Qwen3-MoE-style models where the
parent FFN tensor is registered with an empty data buffer because the
actual weights live in per-expert slices (`ffn_up_exps`,
`ffn_gate_exps`, `ffn_down_exps`) the GGUF loader hasn't wired in.

This PR ships the DEFENSIVE GUARD only — it does NOT fix the
underlying MoE F32 routing path (which is the deeper issue tracked
in #1789). Instead it converts the cryptic panic into an actionable
`RealizarError::InvalidShape` so the next investigator sees:

  matmul weight has EMPTY data buffer (in_dim=N, out_dim=M, qtype=0);
  likely a MoE per-expert tensor was registered with len-0 data —
  see aprender#1789

Two guards:
1. `weight.data.is_empty()` → InvalidShape with the empty-data hint
2. `weight.qtype == F32 && weight.data.len() < out_dim*in_dim*4` →
   InvalidShape with concrete have/need byte counts

Guard logic extracted to free `fn validate_matmul_weight_shape(...)`
so it's unit-testable without constructing a full
`OwnedQuantizedModel`. 6 new unit tests covering empty data,
undersized F32, correctly-sized F32, oversized F32 (padding allowed),
non-F32 only-checks-emptiness, and usize-overflow protection.
matmul_fused module: 0 → 6 tests GREEN. `cargo check -p aprender-serve`
clean; clippy clean on lib.
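
The two guards could be sketched roughly as below. The signature, the stand-in `ShapeError` enum (used here in place of `RealizarError`), the wording of the have/need message, and the `QTYPE_F32 = 0` constant are all assumptions for illustration; only the guard conditions and the empty-data message come from this commit message.

```rust
/// Stand-in for RealizarError::InvalidShape (hypothetical local type).
#[derive(Debug, PartialEq)]
pub enum ShapeError {
    InvalidShape(String),
}

/// Assumption: F32 is encoded as qtype 0, consistent with "qtype=0" in the message.
pub const QTYPE_F32: u32 = 0;

pub fn validate_matmul_weight_shape(
    data_len: usize,
    in_dim: usize,
    out_dim: usize,
    qtype: u32,
) -> Result<(), ShapeError> {
    // Guard 1: an empty buffer is never valid, whatever the quantization type.
    if data_len == 0 {
        return Err(ShapeError::InvalidShape(format!(
            "matmul weight has EMPTY data buffer (in_dim={}, out_dim={}, qtype={})",
            in_dim, out_dim, qtype
        )));
    }
    // Guard 2: F32 weights need at least out_dim * in_dim * 4 bytes.
    if qtype == QTYPE_F32 {
        // checked_mul protects against usize overflow on absurd dimensions.
        let need = out_dim
            .checked_mul(in_dim)
            .and_then(|elems| elems.checked_mul(4))
            .ok_or_else(|| {
                ShapeError::InvalidShape(format!(
                    "matmul weight size overflows usize (in_dim={}, out_dim={})",
                    in_dim, out_dim
                ))
            })?;
        if data_len < need {
            return Err(ShapeError::InvalidShape(format!(
                "matmul F32 weight too small: have {} bytes, need {} bytes",
                data_len, need
            )));
        }
    }
    Ok(())
}
```

Oversized F32 buffers pass because the check is `<` rather than `!=`, which matches the "padding allowed" test case listed above.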

Empirical evidence: paiml/claude-code-parity-apr M260 dispatch +
the post-#1782 re-dispatch both hit this panic. The timeout fix
in #1782 unblocked startup but exposed this downstream MoE-weight
bug. Filed as #1789 for the deeper MoE F32 routing fix.

Does NOT fix Qwen3-Coder-30B inference yet — needs the MoE per-expert
weight slicing fix tracked in #1789. This PR only stops the cryptic
panic and gives actionable diagnostics.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 18, 2026
… PROPOSED (#1794)

Two-axis bump: catch up to companion-led v1.31.0 + ship Phase 6
gate in one PR. Gate registry: 18 → 20 entries.

v1.31.0 SKIPPED (companion-led at companion-repo M236 / PR #221
squash 188a328 without aprender-side authoring); v1.30.0 → v1.32.0
directly, same SKIP pattern v1.28.0 → v1.30.0 used for the
auto-closed aprender#1705 PR.

## FALSIFY-CCPA-019 calibration_required_before_verdict (PROPOSED)

Codifies the M196-M224 4-bug-stack lesson. Any future verdict on
CCPA-016/017/018 — promotion PROPOSED → ACTIVE_RUNTIME OR treating
an evidence file as discharging the gate — requires a fresh
calibration record (identity_pass + regression_fail, ≤30 days old)
at evidence/calibration/calibration-runs.json.

Bidirectional-sensitivity: a meter that ALWAYS-passes would pass
identity but also pass regression (caught); a meter that
ALWAYS-fails would fail regression correctly but also fail identity
(caught). Freshness window catches infrastructure drift (rustc
bumps, apr CLI changes, claude CLI changes) without weekly runs.
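
As an illustration of the rule, the check might reduce to the predicate below. The `CalibrationRecord` struct and its field names are hypothetical; the actual schema of calibration-runs.json is not shown in this commit message.

```rust
/// Hypothetical in-memory view of one calibration record.
pub struct CalibrationRecord {
    /// The meter passed on the identity fixture.
    pub identity_pass: bool,
    /// The meter correctly failed the known regression.
    pub regression_fail: bool,
    /// Age of the record in days at evaluation time.
    pub age_days: u32,
}

/// Freshness window from the gate description (30 days or less).
pub const MAX_AGE_DAYS: u32 = 30;

/// True only when the meter showed sensitivity in both directions and the
/// record is still inside the freshness window.
pub fn calibration_discharges_gate(record: &CalibrationRecord) -> bool {
    record.identity_pass && record.regression_fail && record.age_days <= MAX_AGE_DAYS
}
```

An always-passing meter yields `regression_fail == false` and an always-failing meter yields `identity_pass == false`, so both degenerate meters are rejected by the same conjunction.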

Test scaffold: companion-repo
crates/ccpa-differ/tests/falsify_ccpa_019_calibration.rs
(7 active synthetic + 1 #[ignore]'d live-evidence).

The M234 calibration evidence
(evidence/calibration/calibration-runs.json) records both the trivial
in-house identity fixture and the decy#39 regression dispatch; it
currently discharges the gate.

## FALSIFY-CCPA-020 contract_compliance_per_turn (PROPOSED)

Codifies the Phase 6 operator-directive (companion-repo M250+):
the right experiment for paiml-org is
claude-bound-by-pmat-comply-and-pv vs
apr-bound-by-pmat-comply-and-pv, NOT raw-vs-raw. Every paiml commit
must pass pmat comply + pv validate to merge.

Per-turn pmat comply check --strict + pv validate fire on every
Write/Edit in the under-contract regime
(ArenaSession::with_compliance(N)). Compound oracle (cargo test +
pmat comply + pv validate) gates OraclePassed.

Bidirectional sensitivity:
- Identity: clean-history-with-pass MUST satisfy
- Regression: pass-with-failing-compliance-turn MUST be falsified
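
One way to read the compound oracle is sketched below. The `TurnCompliance` struct and the `oracle_passed` function are hypothetical names; how the companion repo actually records per-turn results is not shown here.

```rust
/// Hypothetical record of the compliance checks run after one Write/Edit turn.
pub struct TurnCompliance {
    pub pmat_comply_ok: bool,
    pub pv_validate_ok: bool,
}

/// Compound oracle: the final cargo test result must be green AND every
/// recorded turn must have passed both compliance checks.
pub fn oracle_passed(cargo_test_ok: bool, turns: &[TurnCompliance]) -> bool {
    cargo_test_ok
        && turns
            .iter()
            .all(|turn| turn.pmat_comply_ok && turn.pv_validate_ok)
}
```

In this reading a session with passing tests but one non-compliant turn is falsified, which is the regression case listed above.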

Test scaffold: companion-repo
crates/ccpa-arena/tests/falsify_ccpa_020_contract_compliance.rs
(7 active synthetic + 1 #[ignore]'d live-evidence).

## Companion-side ship trail (M250-M264)

M250 plan + n=20 corpus; M252 schema; M254 dispatch hook + trap;
M256 compound oracle; M258 CCPA-020 gate; M260 first valid n=15
calibration evidence; M262 Toyota-Way root-cause + upstream fixes
(#1782 timeout + #1790 matmul guard, both MERGED); M264 P6.6 bench
runner (operator-dispatchable end-to-end).

## Activation path

CCPA-019 + CCPA-020 stay PROPOSED until first operator-dispatched
Phase 6 bench produces evidence/under-contract/scores.json AND a
fresh calibration record. ACTIVE_RUNTIME flip awaits both.

`pv validate contracts/claude-code-parity-apr-v1.yaml` clean.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

apr serve: 30s startup-readiness timeout is too short for large MoE GGUFs