
fix(emitc): drain MTE before pto.comm.tnotify lowering#718

Merged
zhangstevenunity merged 1 commit into main from fix/issue711-tnotify-mte-drain
May 28, 2026


@zhangstevenunity
Collaborator

Summary

pto.comm.tnotify lowering currently emits pto::comm::TNOTIFY(...) with no MTE-side drain. Inside the runtime, TNOTIFY_IMPL writes the signal on the scalar pipe and only issues pipe_barrier(PIPE_ALL) after the store. Any prior pto.tload / pto.tstore (local or peer-addressed) can still be in flight on MTE2/MTE3 when the signal lands, so the receiver's matching TWAIT returns before the data is visible -- breaking the notify/wait contract.

This change inserts pipe_barrier(PIPE_ALL) immediately before the generated pto::comm::TNOTIFY(...) so the lowering itself honors the contract and callers do not need any manual sync.
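The ordering problem can be illustrated with a toy model. This is only a sketch under stated assumptions, not the runtime's code: `ToyPeer`, `ToySender`, and `receiverViewAtSignal` are hypothetical names, MTE writes are modeled as a queue that becomes visible only when drained, and the scalar-pipe signal store is modeled as immediately visible.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical receiver-side state: a data window and a signal word.
struct ToyPeer {
    std::vector<int32_t> window;
    int32_t signal = 0;
};

// Hypothetical sender: MTE stores stay "in flight" until a barrier drains them.
class ToySender {
public:
    explicit ToySender(ToyPeer& peer) : peer_(peer) {}

    // Models pto.tstore: the write is queued, not yet visible to the peer.
    void tstore(std::size_t index, int32_t value) {
        inFlight_.push_back({index, value});
    }

    // Models pipe_barrier(PIPE_ALL): every queued write becomes visible.
    void pipeBarrierAll() {
        for (const Pending& p : inFlight_) peer_.window[p.index] = p.value;
        inFlight_.clear();
    }

    // Models the ordering described above: signal store first, barrier after.
    // onSignalLanded stands in for the receiver's wait returning.
    void tnotify(int32_t value, const std::function<void()>& onSignalLanded) {
        peer_.signal = value;
        onSignalLanded();
        pipeBarrierAll();
    }

private:
    struct Pending { std::size_t index; int32_t value; };
    ToyPeer& peer_;
    std::deque<Pending> inFlight_;
};

// Returns what the receiver reads from window[0] when the signal lands.
int32_t receiverViewAtSignal(bool drainBeforeNotify) {
    ToyPeer peer;
    peer.window.assign(1, 0);
    ToySender sender(peer);
    sender.tstore(0, 42);
    if (drainBeforeNotify) sender.pipeBarrierAll();  // the fix in this PR
    int32_t seen = -1;
    sender.tnotify(1, [&] { seen = peer.window[0]; });
    return seen;
}
```

In this model, calling `receiverViewAtSignal(false)` reads the stale value because the store is still queued when the signal lands, while `receiverViewAtSignal(true)` reads the stored value. Real hardware timing is nondeterministic; the model only fixes one unlucky interleaving to make the contract violation visible.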

Changes

  • lib/PTO/Transforms/PTOToEmitC.cpp: add emitTNotifyMteDrain helper and invoke it from PTOSignalCommToEmitC on the TNotifyOp branch only (TWAIT/TTEST do not need the drain).
  • test/lit/pto/issue711_tnotify_mte_drain.pto: new regression covering tstore -> tnotify, tload -> tnotify, and asserting twait does not receive an extra drain.
  • docs/PTO_IR_manual.md: document the lowering ordering guarantee under pto.comm.tnotify.
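The shape of the lowering change might be sketched as follows. This is a string-based toy, not the code in `PTOToEmitC.cpp`: the real helper works on MLIR/EmitC operations, and `SignalOpKind`, `emitDrainSketch`, `lowerSignalOpSketch`, and the `pto::comm::TWAIT` / `pto::comm::TTEST` spellings are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the three signal ops handled by one pattern.
enum class SignalOpKind { TNotify, TWait, TTest };

// Toy counterpart of the drain helper: emits the barrier statement.
void emitDrainSketch(std::vector<std::string>& out) {
    out.push_back("pipe_barrier(PIPE_ALL);");
}

// Toy lowering: only the notify branch is preceded by the drain.
void lowerSignalOpSketch(SignalOpKind kind, std::vector<std::string>& out) {
    switch (kind) {
    case SignalOpKind::TNotify:
        emitDrainSketch(out);  // drain MTE before the signal store
        out.push_back("pto::comm::TNOTIFY(/* operands elided */);");
        break;
    case SignalOpKind::TWait:
        out.push_back("pto::comm::TWAIT(/* operands elided */);");
        break;
    case SignalOpKind::TTest:
        out.push_back("pto::comm::TTEST(/* operands elided */);");
        break;
    }
}
```

Keeping the drain inside the notify branch means wait and test lowerings emit exactly what they did before, which is what the regression test asserts for `twait`.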

Fixes #711

Test plan

  • ninja -C build-main-wsl tools/ptoas/ptoas (clean build with the new helper).
  • build-main-wsl/tools/ptoas/ptoas --pto-arch=a3 test/lit/pto/issue711_tnotify_mte_drain.pto -o - shows pipe_barrier(PIPE_ALL); directly before pto::comm::TNOTIFY(...) in the emitted EmitC.
  • llvm-lit -sv on issue711_tnotify_mte_drain.pto, comm_p2p_emitc.pto, and comm_collective_emitc.pto -- all pass.
  • Full ninja check-pto -- only the two pre-existing untracked-branch failures remain (graph_sync_solver_section_scope.pto, multi_buffer_gss_dyn_event_id.pto); confirmed they also fail against pristine lib/PTO/Transforms/PTOToEmitC.cpp from main.
  • End-to-end verification on real a3 hardware: receiver's window contains the expected bytes after TWAIT, including the two-back-to-back-tstore reproducer from the issue.
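The second check in the plan (barrier directly before the notify call) could also be expressed as a small text scan over the emitted output. The function below is a hypothetical sketch, not part of the repository's test suite; it assumes one statement per line and only looks at the immediately preceding line.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns true when every line containing "pto::comm::TNOTIFY(" is
// immediately preceded by a line containing "pipe_barrier(PIPE_ALL);".
// Output with no TNOTIFY call at all trivially passes.
bool everyNotifyIsDrained(const std::string& emitted) {
    std::istringstream in(emitted);
    std::string prev;
    std::string line;
    while (std::getline(in, line)) {
        const bool isNotify =
            line.find("pto::comm::TNOTIFY(") != std::string::npos;
        const bool prevIsBarrier =
            prev.find("pipe_barrier(PIPE_ALL);") != std::string::npos;
        if (isNotify && !prevIsBarrier) return false;
        prev = line;
    }
    return true;
}
```

A lit test would normally encode the same ordering with FileCheck directives in the `.pto` file itself; the scan above is just a way to state the expected property precisely.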

Notes

🤖 Generated with Claude Code

TNOTIFY_IMPL writes the signal on the scalar pipe and only issues
pipe_barrier(PIPE_ALL) *after* the store. Prior pto.tload / pto.tstore
ops (local or peer-addressed) can still be in flight on MTE2/MTE3
when the signal lands, breaking the notify/wait handshake -- the
receiver's TWAIT returns before the data is visible.

Emit pipe_barrier(PIPE_ALL) right before the pto::comm::TNOTIFY call
in PTOToEmitC so the lowering itself honors the contract, with no
caller-side workaround required.

- lib/PTO/Transforms/PTOToEmitC.cpp: add emitTNotifyMteDrain helper
  and call it from PTOSignalCommToEmitC for the TNotifyOp branch
  only (TWAIT/TTEST do not need the extra drain).
- test/lit/pto/issue711_tnotify_mte_drain.pto: regression covering
  tstore->tnotify, tload->tnotify, and confirming twait does not get
  the new drain.
- docs/PTO_IR_manual.md: document the lowering ordering guarantee.

Fixes #711

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request addresses Issue #711 by ensuring that the lowering of pto.comm.tnotify drains MTE-side pipes before emitting the pto::comm::TNOTIFY call. This is achieved by inserting a pipe_barrier(PIPE_ALL) immediately before the call, preventing race conditions where a signal store on the scalar pipe could overtake in-flight pto.tload or pto.tstore operations. The changes include updates to the IR manual, the EmitC lowering implementation, and a new regression test suite. There are no review comments to address.

@reedhecre

reedhecre commented May 28, 2026

Codex Review

This comment is updated automatically by the review bot.

  • PR: #718 fix(emitc): drain MTE before pto.comm.tnotify lowering
  • Author: zhangstevenunity
  • Base/Head: main / fix/issue711-tnotify-mte-drain
  • Head SHA: ce5d99515c28
  • Trigger: new open PR detected
  • Generated At: 2026-05-28T02:35:58Z
  • Status: failed at codex-review (exit=1)

Summary

Review failed at stage codex-review: exit=1

Findings

No structured findings were generated because the review failed early.

Log Tail

git fetch origin 'refs/pull/718/head:pr-718' --depth 50
git fetch origin 'main' --depth 50 || true
git checkout -f 'pr-718'
git rev-parse HEAD
git diff --stat 'origin/main...HEAD' || true
Cloning into '/tmp/ptoas-pr-review-monitor/runs/20260528_103526_pr718/repo'...
From https://github.com/hw-native-sys/PTOAS
 * [new ref]         refs/pull/718/head -> pr-718
From https://github.com/hw-native-sys/PTOAS
 * branch            main       -> FETCH_HEAD
Switched to branch 'pr-718'
ce5d99515c28e90129b48979fc021cc5a4edebca
 docs/PTO_IR_manual.md                       |  10 +++
 lib/PTO/Transforms/PTOToEmitC.cpp           |  17 ++++
 test/lit/pto/issue711_tnotify_mte_drain.pto | 129 ++++++++++++++++++++++++++++
 3 files changed, 156 insertions(+)
===== END STAGE clone rc=0 @ 2026-05-28 10:35:30 =====

===== STAGE codex-review @ 2026-05-28 10:35:30 =====
set -euo pipefail
cd '/tmp/ptoas-pr-review-monitor/runs/20260528_103526_pr718/repo'
'codex' exec -C '/tmp/ptoas-pr-review-monitor/runs/20260528_103526_pr718/repo' -s read-only -c 'model_provider="codereview"' -c 'model="gpt-5.4"' -c 'model_reasoning_effort="xhigh"' --output-schema '/tmp/ptoas-pr-review-monitor/runs/20260528_103526_pr718/review_schema.json' -o '/tmp/ptoas-pr-review-monitor/runs/20260528_103526_pr718/codex_last_message.json' --color never - < '/tmp/ptoas-pr-review-monitor/runs/20260528_103526_pr718/review_prompt.txt'
OpenAI Codex v0.115.0 (research preview)
--------
workdir: /tmp/ptoas-pr-review-monitor/runs/20260528_103526_pr718/repo
model: gpt-5.4
provider: codereview
approval: never
sandbox: read-only
reasoning effort: xhigh
reasoning summaries: none
session id: 019e6c6f-df4b-7a92-ad3f-5bfe5c621660
--------
user
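You are now reviewing a GitHub PR.

Repository: hw-native-sys/PTOAS
PR: #718 fix(emitc): drain MTE before pto.comm.tnotify lowering
Author: zhangstevenunity
base branch: origin/main
head branch: HEAD (already checked out to the PR head)

Requirements:
1. Review only this PR's changes relative to origin/main; consult context files when necessary.
2. Focus on real correctness / regression / contract mismatch / CI / runtime / compatibility problems.
3. Do not raise pure style suggestions or low-value speculation.
4. Output strictly by priority:
   - P1: highly likely to cause wrong results, build/run failures, severe regressions, or a release blocker
   - P2: important defects, behavior regressions, missing validation/tests, or major compatibility problems
   - P3: minor but clearly fixable problems
5. If there are no problems, write the summary as "No problems found in PR #718" and return findings=[].
6. If there are problems, summarize them concisely in summary, and give the following for every entry in findings:
   - severity
   - title
   - body (explain why it is a problem, as specifically as possible)
   - file (a relative path where possible)
   - line (an integer if it can be determined, otherwise null)

Suggested first steps:
- git status --short
- git diff --stat origin/main...HEAD
- git diff --unified=80 origin/main...HEAD

The final output must strictly match the JSON schema.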

mcp startup: no servers
Reconnecting... 1/5 (unexpected status 503 Service Unavailable: {"error":{"message":"auth_unavailable: no auth available (providers=codex, model=gpt-5.4)","type":"server_error","code":"internal_server_error"}}event: response.failed
data: {"type":"response.failed","response":{"id":"resp_f20b0ed6d1624240a4b5bd12786ba610","object":"response","model":"gpt-5.4","status":"failed","output":[],"error":{"code":"upstream_error","message":"Upstream request failed"}}}, url: https://codex.0u0o.com/responses, request id: f20b0ed6-d162-4240-a4b5-bd12786ba610)
Reconnecting... 2/5 (unexpected status 503 Service Unavailable: {"error":{"message":"auth_unavailable: no auth available (providers=codex, model=gpt-5.4)","type":"server_error","code":"internal_server_error"}}event: response.failed
data: {"type":"response.failed","response":{"id":"resp_73a495d215bc4876890725a948ac0e82","object":"response","model":"gpt-5.4","status":"failed","output":[],"error":{"code":"upstream_error","message":"Upstream request failed"}}}, url: https://codex.0u0o.com/responses, request id: 73a495d2-15bc-4876-8907-25a948ac0e82)
Reconnecting... 3/5 (unexpected status 503 Service Unavailable: {"error":{"message":"auth_unavailable: no auth available (providers=codex, model=gpt-5.4)","type":"server_error","code":"internal_server_error"}}event: response.failed
data: {"type":"response.failed","response":{"id":"resp_86cef5fa52ca46c0a208b691ab730479","object":"response","model":"gpt-5.4","status":"failed","output":[],"error":{"code":"upstream_error","message":"Upstream request failed"}}}, url: https://codex.0u0o.com/responses, request id: 86cef5fa-52ca-46c0-a208-b691ab730479)
Reconnecting... 4/5 (unexpected status 503 Service Unavailable: {"error":{"message":"auth_unavailable: no auth available (providers=codex, model=gpt-5.4)","type":"server_error","code":"internal_server_error"}}event: response.failed
data: {"type":"response.failed","response":{"id":"resp_260a25725782424ab59ae484c30bd5f8","object":"response","model":"gpt-5.4","status":"failed","output":[],"error":{"code":"upstream_error","message":"Upstream request failed"}}}, url: https://codex.0u0o.com/responses, request id: 260a2572-5782-424a-b59a-e484c30bd5f8)
Reconnecting... 5/5 (unexpected status 503 Service Unavailable: {"error":{"message":"auth_unavailable: no auth available (providers=codex, model=gpt-5.4)","type":"server_error","code":"internal_server_error"}}event: response.failed
data: {"type":"response.failed","response":{"id":"resp_0895b73c12874a1ba1cc94abc9582eca","object":"response","model":"gpt-5.4","status":"failed","output":[],"error":{"code":"upstream_error","message":"Upstream request failed"}}}, url: https://codex.0u0o.com/responses, request id: 0895b73c-1287-4a1b-a1cc-94abc9582eca)
ERROR: unexpected status 503 Service Unavailable: {"error":{"message":"auth_unavailable: no auth available (providers=codex, model=gpt-5.4)","type":"server_error","code":"internal_server_error"}}event: response.failed
data: {"type":"response.failed","response":{"id":"resp_24480195106c4d538e8ecd1df312f4cd","object":"response","model":"gpt-5.4","status":"failed","output":[],"error":{"code":"upstream_error","message":"Upstream request failed"}}}, url: https://codex.0u0o.com/responses, request id: 24480195-106c-4d53-8e8e-cd1df312f4cd
Warning: no last agent message; wrote empty content to /tmp/ptoas-pr-review-monitor/runs/20260528_103526_pr718/codex_last_message.json
===== END STAGE codex-review rc=1 @ 2026-05-28 10:35:58 =====

@zhangstevenunity zhangstevenunity merged commit 648672c into main May 28, 2026
14 checks passed
@reedhecre

A5 board test passed

  • Trigger: merged
  • Source commit: 648672ce306d
  • Result summary: OK 21 / FAIL 0 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260528_110105_merged_pr718.log
  • Result TSV: /root/ptoas-board-monitor-a5/logs/20260528_110105_merged_pr718.tsv

@reedhecre

A3 board test failed

  • Trigger: merged
  • Source commit: 648672ce306d
  • Result summary: OK 215 / FAIL 4 / SKIP 1
  • Log: /home/zhongxuan/ptoas-board-monitor/runtime/logs/20260528_130005_merged_pr718.log
  • Failed stage: board-validation / exit=1

Failed cases

  • down_proj_residual (run, exit=1)
  • out_proj_residual (run, exit=1)
  • syncall_binding (run, exit=1)
  • tprefetch_async_binding (run, exit=1)

@reedhecre

A3 board test failure details: PR #718

down_proj_residual

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507015 (/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260528_130005_merged_pr718/npu_validation/Qwen3DecodeA3/down_proj_residual/main.cpp:116)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 420419] 2026-05-28-13:23:34.517.024 (EZ9999):  The error from device(chipId:2, dieId:0), serial number is 260, there is an exception of fftsplus aicore error, core id is 0, error code = 0, dump info: pc start: 0x124800000394, current: 0x124800000654, vec error info: 0, mte error info: 0xc503000030, ifu error info: 0x1000000000000, ccu error info: 0x40a0190000000000, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000080.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:645]
        TraceBack (most recent call last):
       The extend info: errcode:(0, 0x8000, 0) errorStr: When the D-cache reads and writes data to the UB, the response value returned by the bus is a non-zero value. fixp_error0 info: 0x3000030, fixp_error1 info: 0xc5, fsmId:0, tslot:7, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:658]
       Kernel task happen error, retCode=0x26, [aicore exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1729]
       AICORE Kernel task happen error, retCode=0x26.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [AIC_INFO] after execute:mixCtx print end[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [DFX_INFO]Aicore kernel execute failed, device_id=4, stream_id=46, report_stream_id=46, task_id=0, flip_num=0, fault kernel_name=_Z18down_proj_residualPu6__bf16PfS_S_S0_i, fault kernel info ext=_Z18down_proj_residualPu6__bf16PfS_S_S0_i, program id=0, hash=11728358990213155584.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       rtStreamSynchronize execution failed, reason=aicore exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507015[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-05-28 13:23:36] ERROR: testcase failed (exit 1): down_proj_residual

out_proj_residual

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507015 (/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260528_130005_merged_pr718/npu_validation/Qwen3DecodeA3/out_proj_residual/main.cpp:116)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 421639] 2026-05-28-13:23:41.126.851 (EZ9999):  The error from device(chipId:2, dieId:0), serial number is 261, there is an exception of fftsplus aicore error, core id is 1, error code = 0, dump info: pc start: 0x124800000394, current: 0x1248000006e8, vec error info: 0, mte error info: 0xc503000030, ifu error info: 0x1000000000000, ccu error info: 0x40a0190000000000, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000080.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:645]
        TraceBack (most recent call last):
       The extend info: errcode:(0, 0x8000, 0) errorStr: When the D-cache reads and writes data to the UB, the response value returned by the bus is a non-zero value. fixp_error0 info: 0x3000030, fixp_error1 info: 0xc5, fsmId:0, tslot:7, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:658]
       Kernel task happen error, retCode=0x26, [aicore exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1729]
       AICORE Kernel task happen error, retCode=0x26.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [AIC_INFO] after execute:mixCtx print end[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [DFX_INFO]Aicore kernel execute failed, device_id=4, stream_id=46, report_stream_id=46, task_id=0, flip_num=0, fault kernel_name=_Z17out_proj_residualPfPu6__bf16S0_S0_S_i, fault kernel info ext=_Z17out_proj_residualPfPu6__bf16S0_S0_S_i, program id=0, hash=12135711328462851009.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       rtStreamSynchronize execution failed, reason=aicore exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507015[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-05-28 13:23:43] ERROR: testcase failed (exit 1): out_proj_residual

syncall_binding

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507014 (/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260528_130005_merged_pr718/npu_validation/SyncAll/syncall_binding/main.cpp:84)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 434094] 2026-05-28-13:42:55.151.745 (EZ9999):  The error from device(chipId:2, dieId:0), serial number is 262, there is an exception of aicore error, core id is 7, error code = 0, dump info: pc start: 0x124800000000, current: 0x124800000188, vec error info: 0, mte error info: 0xc503000030, ifu error info: 0x212c200098b00, ccu error info: 0x48000009, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:645]
        TraceBack (most recent call last):
       The extend info: errcode:(0, 0, 0) errorStr: timeout or trap error. fixp_error0 info: 0x3000030, fixp_error1 info: 0xc5, fsmId:1, tslot:7, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:658]
       Kernel task happen error, retCode=0x25, [aicore timeout].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1729]
       AICORE Kernel task happen error, retCode=0x25.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [DFX_INFO]Aicore kernel execute failed, device_id=4, stream_id=46, report_stream_id=46, task_id=0, flip_num=0, fault kernel_name=_Z22syncall_binding_kernelPii, fault kernel info ext=_Z22syncall_binding_kernelPii, program id=0, hash=3129332313788381512.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       rtStreamSynchronize execution failed, reason=aicore timeout[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507014[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-05-28 13:42:56] ERROR: testcase failed (exit 1): syncall_binding

tprefetch_async_binding

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260528_130005_merged_pr718/npu_validation/TPrefetchAsync/tprefetch_async_binding/main.cpp:91)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 595263] 2026-05-28-13:43:31.308.300 (EZ9999):  The error from device(chipId:2, dieId:0), serial number is 263, there is an exception of aivec error, core id is 13, error code = 0, dump info: pc start: 0x124800000000, current: 0x124800000160, vec error info: 0xf023, mte error info: 0xa50312808b, ifu error info: 0x200003f800000, ccu error info: 0x52, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100000000.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:645]
        TraceBack (most recent call last):
       The extend info: errcode:(0, 0x200000000000000, 0) errorStr: The MPU address access is invalid. fixp_error0 info: 0x312808b, fixp_error1 info: 0xa5, fsmId:1, tslot:7, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:PrintCoreInfo][FILE:device_error_core_proc.cc][LINE:658]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1729]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1475]
       [DFX_INFO]Aicore kernel execute failed, device_id=4, stream_id=46, report_stream_id=46, task_id=0, flip_num=0, fault kernel_name=_Z30tprefetch_async_binding_kernelPfPa, fault kernel info ext=_Z30tprefetch_async_binding_kernelPfPa, program id=0, hash=8435686547367685641.[FUNC:GetError][FILE:stream.cc][LINE:1475]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-05-28 13:43:33] ERROR: testcase failed (exit 1): tprefetch_async_binding

Development

Successfully merging this pull request may close these issues.

[Bug] No pipe sync inserted between MTE-pipe ops (pto.tstore / pto.tload, local or remote) and pto.comm.tnotify — signal can overtake in-flight data

2 participants