Fix: relax spmd_paged_attention tolerance to 5e-3 #825
Merged
ChaoWao merged 1 commit on May 20, 2026
Code Review
This pull request updates the relative and absolute tolerance values (RTOL and ATOL) from 2e-3 to 5e-3 in the TestPagedAttentionUnrollTpushPop test class. A review comment suggests adding a code comment to explain that these tolerances were relaxed to accommodate hardware numerical variations, which improves maintainability and provides context for future changes.
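One way to act on the review suggestion is sketched below. The class body and the `within_tolerance` helper are assumptions for illustration, since the actual test harness is not shown in this PR; the helper uses the common allclose-style bound `|actual - expected| <= atol + rtol * |expected|`, and the reference value 0.5 is made up.

```python
class TestPagedAttentionUnrollTpushPop:
    # Relaxed from 2e-3: on hardware, online-softmax + bf16 rounding can
    # drift up to ~4.1e-3 depending on AIC/AIV interleave (see #825).
    RTOL = 5e-3
    ATOL = 5e-3


def within_tolerance(actual, expected, rtol, atol):
    """Hypothetical allclose-style check; the real harness may differ."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Illustrative numbers only: an assumed reference value plus the worst
# max_diff reported in the repro.
expected = 0.5
actual = expected + 0.0041

passes_old = within_tolerance(actual, expected, rtol=2e-3, atol=2e-3)
passes_new = within_tolerance(
    actual,
    expected,
    rtol=TestPagedAttentionUnrollTpushPop.RTOL,
    atol=TestPagedAttentionUnrollTpushPop.ATOL,
)
```

Under these assumed numbers the old bound is 0.002 + 0.002 * 0.5 = 0.003, which the 0.0041 diff exceeds, while the new bound is 0.005 + 0.005 * 0.5 = 0.0075, which it does not.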
Test is flaky on hardware. Standalone repro (8 runs at 2e-3) showed
two distinct failure modes:
A. Numerical mismatch — observed max_diff 0.0038..0.0041, only
~2x the previous RTOL/ATOL=2e-3 bound. Online-softmax + bf16
rounding can drift this far when AIC/AIV interleave varies
run-to-run.
B. AICPU stream timeout (aclrtSynchronizeStreamWithTimeout, rc=507018)
— consistent with a cross-core deadlock in the TPUSH/TPOP path.
This commit relaxes RTOL/ATOL to 5e-3, which covers mode (A): 8/8
runs pass after the change.
Mode (B) is not addressed here; it points at the disabled back-
pressure on the sij/pij/oi pipes in paged_attention_parallel.cpp
(see the "Disable reverse-dependency sync" block) and needs a
separate investigation.
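To make the mode (B) concern concrete, here is a toy Python model of a single-producer/single-consumer ring buffer. It is not the TPUSH/TPOP kernel code, and every name in it is hypothetical; it only illustrates why a pipe without back-pressure depends on the producer never getting more than the queue depth ahead of the consumer. The toy shows stale or overwritten data rather than a hang, so it models the broken invariant, not the observed timeout itself.

```python
def run_pipe(depth, producer_lead, items, back_pressure):
    """Toy ring buffer with a producer that may run ahead of the consumer.

    `producer_lead` (>= 1) is how many items the producer is allowed to be
    ahead, standing in for scheduling jitter. Returns what the consumer reads.
    """
    slots = [None] * depth
    produced = 0
    consumed = 0
    out = []
    while consumed < items:
        # Producer phase: push until `producer_lead` ahead, or out of items.
        while produced < items and produced - consumed < producer_lead:
            if back_pressure and produced - consumed >= depth:
                break  # slot not yet freed by the consumer, so wait
            slots[produced % depth] = produced  # may clobber unread data
            produced += 1
        # Consumer phase: pop exactly one item.
        out.append(slots[consumed % depth])
        consumed += 1
    return out


expected = list(range(6))
no_bp_shallow = run_pipe(depth=2, producer_lead=4, items=6, back_pressure=False)
with_bp = run_pipe(depth=2, producer_lead=4, items=6, back_pressure=True)
no_bp_deep = run_pipe(depth=4, producer_lead=4, items=6, back_pressure=False)
```

In this model the shallow queue without back-pressure returns overwritten values, while either restoring back-pressure or making the queue at least as deep as the producer's lead returns the items in order.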
fd8f386 to 72ad98d
ChaoWao approved these changes on May 20, 2026
ChaoWao added a commit that referenced this pull request on May 25, 2026
A2/A3 onboard `spmd_paged_attention` flake recurred after #825 had relaxed RTOL/ATOL from 2e-3 to 5e-3. PR #839 CI showed two consecutive failures at max_diff 5.35e-3 and 5.54e-3, just past the 5e-3 bound.

Online-softmax + bf16 rounding in the AIC/AIV cooperative TPUSH/TPOP pipeline is sensitive to run-to-run interleave; the prior bump was sized to the then-observed ~4.1e-3 drift, leaving little headroom.

Bump to 1e-2 to match the prior 2-2.5x ratio so a single additional sample of drift doesn't trip the test again.

Does not address the underlying numerical-drift source or the mode-B stream-timeout described in #825; both remain follow-ups.

Closes #848
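The headroom argument can be checked with a few lines of arithmetic. The figures are the ones quoted in #825 and in this commit message; the `headroom` helper and the reading of "ratio" as new bound over old bound are assumptions.

```python
# (label, old tolerance, new tolerance, worst max_diff seen before the bump)
steps = [
    ("2e-3 -> 5e-3", 2e-3, 5e-3, 4.1e-3),
    ("5e-3 -> 1e-2", 5e-3, 1e-2, 5.54e-3),
]


def headroom(new_tol, worst_diff):
    """Ratio of the new bound to the worst diff observed before the bump."""
    return new_tol / worst_diff


for label, old_tol, new_tol, worst in steps:
    print(
        f"{label}: bound raised x{new_tol / old_tol:.2f}, "
        f"worst diff was x{worst / old_tol:.2f} of the old bound, "
        f"headroom after bump x{headroom(new_tol, worst):.2f}"
    )
```

On these numbers the first bump left roughly 1.22x headroom over the worst observed diff, and the second leaves roughly 1.81x.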
Summary
Relax `RTOL/ATOL` on `TestPagedAttentionUnrollTpushPop` from `2e-3` to `5e-3`. The test is flaky on hardware at the tighter bound: `max_diff` of `0.0038..0.0041` was observed over 8 standalone runs, about 2x the previous tolerance.

Failure modes seen in repro
A. Numerical mismatch: `max_diff ≈ 0.0038..0.0041` vs `2e-3` tolerance
B. Stream timeout: `aclrtSynchronizeStreamWithTimeout` rc=507018 (cross-core deadlock)

This PR addresses only mode A. After the change, 8/8 standalone runs pass.
Mode B is a real bug: it points at the disabled back-pressure on the `sij/pij/oi` pipes in `paged_attention_parallel.cpp` (see the "Disable reverse-dependency sync" block in `run_aic`/`run_aiv`). The forward-dependency invariant the kernel relies on can break under scheduling jitter. That needs a separate investigation; bumping `FIFO_DEPTH` or restoring back-pressure are the obvious knobs.

Testing
`python test_spmd_paged_attention.py -p a2a3`: 8/8 PASS at 5e-3 (was 2/5 PASS at 2e-3)
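A repeat-run count like the one above can be gathered with a small wrapper. This is a sketch rather than the script used for the PR: `count_passes` is a hypothetical helper, and it assumes the test script exits non-zero on failure (including the mode B stream timeout).

```python
import subprocess
import sys


def count_passes(cmd, runs):
    """Run `cmd` `runs` times and count the runs that exit with status 0."""
    passes = 0
    for _ in range(runs):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            passes += 1
    return passes


# Example invocation (requires the target hardware, so it is left commented):
# cmd = [sys.executable, "test_spmd_paged_attention.py", "-p", "a2a3"]
# print(f"{count_passes(cmd, 8)}/8 PASS")
```

Capturing output keeps the summary readable; for triage it may be more useful to also save `result.stdout` from failing runs so that mode A and mode B failures can be told apart.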