Platform
a2a3 (Ascend 910B/C hardware)
Runtime Variant
tensormap_and_ringbuffer
Description
#839 introduced dynamic post-init callable register/unregister coverage under:
tests/st/a2a3/tensormap_and_ringbuffer/dynamic_register/test_dynamic_register.py
The dynamic register/unregister path is still unstable in #861 CI. Two PR861
CI runs exposed failures in the same dynamic-register ST file on the same CI
job family:
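- PR861 CI #2723, st-sim-a2a3 (ubuntu-latest, 3.10):
  test_register_after_init_parallel_broadcast failed with rc=-11, a Python
  segmentation fault in simpler/worker.py, line 2369, during worker.run(...).
- PR861 CI #2749, st-sim-a2a3 (ubuntu-latest, 3.10):
  test_register_unregister_register_runs_each_time was reported as hung after
  490.1s; the process exited with code 124.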
The two failure signatures differ: CI #2723 failed with a segmentation fault in
the two-device parallel dynamic-register case, while CI #2749 passed that case
and instead hung in the single-device unregister/re-register reuse case. They
should still be tracked together as one PR839 dynamic register/unregister
stability issue because both failures occur in the same feature area and the
same ST file.
Steps to Reproduce
Run PR861 CI on the host-device_mapped-region branch with the standard CI
workflow and inspect the Ubuntu A2A3 simulation ST job:
CI / st-sim-a2a3 (ubuntu-latest, 3.10)
The relevant full CI invocations were:
PR861 CI #2723:
https://github.com/hw-native-sys/simpler/actions/runs/26559663566
PR861 CI #2749:
https://github.com/hw-native-sys/simpler/actions/runs/26575577396
For local focused reproduction, run the dynamic-register ST cases on A2A3 sim:
pytest tests/st/a2a3/tensormap_and_ringbuffer/dynamic_register/test_dynamic_register.py \
--platform a2a3sim --device 0-1 -p no:xdist --pto-session-timeout 600
The two observed failing standalone cases can also be targeted directly:
pytest tests/st/a2a3/tensormap_and_ringbuffer/dynamic_register/test_dynamic_register.py::test_register_after_init_parallel_broadcast \
--platform a2a3sim --device 0-1 -p no:xdist --pto-session-timeout 600
pytest tests/st/a2a3/tensormap_and_ringbuffer/dynamic_register/test_dynamic_register.py::test_register_unregister_register_runs_each_time \
--platform a2a3sim --device 0 -p no:xdist --pto-session-timeout 600
Because the failures appear intermittent, a single local run may pass. Looping
these focused cases is likely needed to reproduce the instability.
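The looping suggested above could be scripted roughly as follows. This is a minimal sketch, not part of the repository: the helper name run_until_failure and its parameters are hypothetical, and the example command in the comment is copied from the steps above. It stops at the first non-zero exit or at the first run that exceeds the timeout.

```python
import subprocess
import sys
from typing import List, Optional, Tuple


def run_until_failure(
    cmd: List[str], max_iters: int = 50, timeout_s: float = 600.0
) -> Optional[Tuple[int, str]]:
    """Re-run cmd until it fails or hangs.

    Returns (iteration, reason) for the first bad run, or None if every
    run exited with status 0. Hypothetical helper for local reproduction.
    """
    for i in range(1, max_iters + 1):
        try:
            # subprocess.run kills the direct child when the timeout expires;
            # grandchildren (e.g. per-device worker processes) may survive
            # and need manual cleanup.
            proc = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return i, f"hang: no exit within {timeout_s}s"
        if proc.returncode != 0:
            # A negative value means the child was killed by that signal.
            return i, f"rc={proc.returncode}"
    return None


# Example command, copied from the reproduction steps above:
# cmd = [
#     "pytest",
#     "tests/st/a2a3/tensormap_and_ringbuffer/dynamic_register/"
#     "test_dynamic_register.py::test_register_unregister_register_runs_each_time",
#     "--platform", "a2a3sim", "--device", "0",
#     "-p", "no:xdist", "--pto-session-timeout", "600",
# ]
# print(run_until_failure(cmd, max_iters=50, timeout_s=700.0))
```

Setting timeout_s slightly above the 600s session timeout lets pytest's own timeout report fire first, with the outer timeout acting only as a backstop.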
Expected Behavior
Dynamic post-init register and unregister should be deterministic and safe in
A2A3 simulation:
- test_register_after_init_parallel_broadcast should successfully broadcast a
  post-init CTRL_REGISTER to both chip children, return only after each child
  has prepared the callable, and then run the dynamically registered cid on
  both chips without crashing.
- test_register_unregister_register_runs_each_time should successfully run a
  dynamically registered cid, unregister it, reuse the freed cid slot on a
  subsequent register, and run the re-registered callable without hanging.
- The full st-sim-a2a3 (ubuntu-latest, 3.10) CI job should complete without
  segfaults, hangs, or session-level timeouts.
Actual Behavior
Observed in PR861 CI #2723:
[scheduler] START standalone test_register_after_init_parallel_broadcast
(rt=tensormap_and_ringbuffer, dev=2) pid=8668 devices=[6, 10]
standalone test_register_after_init_parallel_broadcast
(rt=tensormap_and_ringbuffer, dev=2) [FAIL rc=-11 57.8s, devices=[6, 10]]
tests/st/a2a3/tensormap_and_ringbuffer/dynamic_register/
test_dynamic_register.py Fatal Python error: Segmentation fault
File ".../site-packages/simpler/worker.py", line 2369 in run
File ".../dynamic_register/test_dynamic_register.py", line 234
in test_register_after_init_parallel_broadcast
Process completed with exit code 1.
Observed in PR861 CI #2749:
[scheduler] START standalone test_register_after_init_parallel_broadcast
(rt=tensormap_and_ringbuffer, dev=2) pid=8857 devices=[6, 10]
standalone test_register_after_init_parallel_broadcast
(rt=tensormap_and_ringbuffer, dev=2) [PASS 21.4s, devices=[6, 10]]
[scheduler] START standalone test_register_unregister_register_runs_each_time
(rt=tensormap_and_ringbuffer, dev=1) pid=9389 devices=[8]
[pytest] TIMEOUT: session exceeded 600s (10min) limit
HUNG standalone test_register_unregister_register_runs_each_time
(rt=tensormap_and_ringbuffer, dev=1) pid=9389 devices=[8]
elapsed=490.1s descendants=[9565, 9566]
Process completed with exit code 124.
This indicates that the PR839 dynamic register/unregister path can fail in at
least two ways under CI load: a post-register worker.run(...) segfault in the
two-device broadcast case, and a hang in the unregister/re-register cid reuse
case.
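For triage, the two exit values in the logs above can be decoded mechanically. The sketch below is illustrative only: describe_exit is a hypothetical helper, and it assumes that the scheduler's rc follows Python's subprocess convention (a negative value is the number of the signal that killed the process) and that exit code 124 follows the GNU timeout convention for a timed-out command.

```python
import signal


def describe_exit(rc: int) -> str:
    """Turn a process return code into a short human-readable label.

    Assumptions (not confirmed by the CI source): negative rc means
    "killed by signal -rc", and 124 means the run was ended by a timeout.
    """
    if rc == 0:
        return "passed"
    if rc < 0:
        try:
            return f"killed by {signal.Signals(-rc).name}"
        except ValueError:
            # Signal number not known on this platform.
            return f"killed by unknown signal {-rc}"
    if rc == 124:
        return "timed out (exit status 124)"
    return f"failed with exit status {rc}"


print(describe_exit(-11))
print(describe_exit(124))
```

Under these assumptions, rc=-11 from CI #2723 corresponds to SIGSEGV on Linux, which matches the "Fatal Python error: Segmentation fault" line, and exit code 124 from CI #2749 matches the session timeout.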
Git Commit ID
825f0fd
CANN Version
No response
Driver Version
No response
Host Platform
Linux (x86_64)
Additional Context
No response