fix(codegen,memory): fix matmulacc output mismatch on Ascend NPU by wangqin1723-max · Pull Request #537 · hw-native-sys/pypto

wangqin1723-max · 2026-03-16T02:45:34Z

The CUBE engine's TMATMUL_ACC instruction always reads the accumulator
from the OUTPUT buffer, ignoring any separate accumulator input parameter.
Three changes ensure correct behavior:

1. Memory reuse pass: allow "touching" lifetimes (last_use == def_point)
   to share buffers, since within a single statement inputs are consumed
   before outputs are produced. This enables the acc input and output of
   matmul_acc to share the same physical buffer. Also fix transitive
   reuse chain tracking by following reuse chains to the root MemRef
   owner when checking for conflicts.
2. PTO codegen: add custom codegen for tile.matmul_acc/tile.gemv_acc
   that emits only lhs and rhs as ins() operands (not the accumulator),
   and inserts a pto.tmov when acc and dst resolve to different buffers.
3. CCE codegen: use 3-arg TMATMUL_ACC(dst, lhs, rhs) form instead of
   4-arg, since the ISA cannot TMOV between two Acc-space tiles.
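
To make the failure mode concrete, here is a toy model of the in-place rule in plain Python. It is only a sketch: `tmatmul_acc` and the list-of-lists "tiles" are illustrative stand-ins, not the Ascend instruction or any pypto API. It shows why the result is wrong when the accumulator lives in a different buffer from the output and nothing copies it across.

```python
# Toy model only: names and data layout are assumptions, not Ascend/pypto code.

def matmul(lhs, rhs):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*rhs)]
            for row in lhs]


def tmatmul_acc(dst, lhs, rhs):
    """Model of the 3-operand form: dst = dst + lhs @ rhs, in place.

    The accumulator is whatever already sits in `dst`; there is no
    separate accumulator operand to read from.
    """
    prod = matmul(lhs, rhs)
    for i, row in enumerate(prod):
        for j, value in enumerate(row):
            dst[i][j] += value


lhs = [[1, 2], [3, 4]]
rhs = [[5, 6], [7, 8]]
acc = [[10, 10], [10, 10]]        # partial sum from an earlier matmul

# Case 1: acc and dst are different buffers and nothing copies acc over.
stale_dst = [[0, 0], [0, 0]]
tmatmul_acc(stale_dst, lhs, rhs)  # the partial sum in acc is never seen

# Case 2: acc and dst share one buffer, as the reuse pass arranges.
shared = acc
tmatmul_acc(shared, lhs, rhs)     # accumulates on top of the partial sum
```

In this model `stale_dst` ends up holding only `lhs @ rhs`, while `shared` holds the partial sum plus `lhs @ rhs`, which is the intended result.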

gemini-code-assist · 2026-03-16T02:45:51Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an output mismatch issue with the matmul_acc instruction on Ascend NPU by refining memory management and code generation strategies. The changes ensure that accumulation operations correctly utilize the hardware's in-place accumulation capabilities, leading to accurate results and optimized resource usage for matrix multiplication with accumulation.

Highlights

- Memory Reuse Pass Enhancement: Modified the memory reuse pass to allow 'touching' lifetimes (where last_use == def_point) to share buffers. This ensures that within a single statement, input accumulators and output destinations for operations like matmul_acc can share the same physical buffer. Additionally, improved transitive reuse chain tracking by following reuse chains to the root MemRef owner to prevent conflicts.
- PTO Codegen for Accumulation Operations: Implemented custom codegen for tile.matmul_acc and tile.gemv_acc in the PTO backend. This codegen now emits only the lhs and rhs as input operands and inserts a pto.tmov instruction to copy the accumulator input to the output buffer if they resolve to different physical buffers, ensuring correct in-place accumulation behavior.
- CCE Codegen for TMATMUL_ACC: Updated the CCE codegen for tile.matmul_acc to use the 3-argument TMATMUL_ACC(dst, lhs, rhs) form. This change is necessary because the CUBE engine on Ascend NPUs always reads the accumulator from the output buffer, and the ISA does not support TMOV between two Acc-space tiles for a separate accumulator input.
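
The PTO rule in the highlights can be sketched as a small emitter. This is a hedged illustration in Python, not the C++ in `pto_ops_common.cpp`: the function name, the `pto.tmatmul.acc` mnemonic, the `outs(...)` clause, and the buffer names are all placeholders made up for the example. Only `pto.tmov` and the "lhs and rhs only in `ins()`" rule come from the PR text.

```python
# Hypothetical emitter: mnemonics other than pto.tmov are placeholders.

def emit_matmul_acc(acc_buf, dst_buf, lhs_buf, rhs_buf):
    """Return the instruction lines for one accumulate step."""
    lines = []
    if acc_buf != dst_buf:
        # acc and dst resolved to different buffers: copy acc into dst
        # first, because the accumulate step only reads from dst.
        lines.append(f"pto.tmov ins({acc_buf}) outs({dst_buf})")
    # Only lhs and rhs appear as inputs; the accumulator is implicit in dst.
    lines.append(f"pto.tmatmul.acc ins({lhs_buf}, {rhs_buf}) outs({dst_buf})")
    return lines


same_buffer = emit_matmul_acc("%c", "%c", "%a", "%b")
split_buffer = emit_matmul_acc("%acc", "%c", "%a", "%b")
```

When memory reuse has already merged the two buffers, the extra move disappears and a single instruction is emitted.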

Changelog

examples/language/beginner/matmul.py
- Added a new MatmulaccProgram class to demonstrate matrix multiplication with accumulation, splitting the K-dimension into two chunks using pl.matmul and pl.matmul_acc.
src/backend/910B_CCE/backend_910b_cce_ops.cpp
- Modified the tile.matmul_acc codegen to emit the 3-argument TMATMUL_ACC(dst, lhs, rhs) instruction, removing the explicit accumulator input argument.
src/backend/common/pto_ops_common.cpp
- Removed tile.matmul_acc and tile.gemv_acc from the list of simple PTO operations, indicating they now have custom codegen.
- Introduced custom codegen for tile.matmul_acc and tile.gemv_acc to handle in-place accumulation, including an explicit pto.tmov instruction when the accumulator input and output buffers are distinct.
src/ir/transforms/basic_memory_reuse_pass.cpp
- Adjusted the lifetime overlap check to consider 'touching' lifetimes (where last_use_point == def_point) as non-overlapping, enabling more aggressive memory reuse.
- Enhanced the transitive reuse conflict detection by traversing the reuse chain to the root MemRef owner to ensure accurate conflict checking.
tests/st/runtime/test_matmul.py
- Imported the new MatmulaccProgram example.
- Added TestMatmulAcc and TestMatmulAccPTO classes to test the matmul_acc functionality with and without the PTO backend.
tests/ut/ir/transforms/test_basic_memory_reuse.py
- Updated comments in test_simple and test_with_dependencies to clarify the behavior of touching lifetimes in memory reuse.
- Revised the description and assertions in test_transitive_conflict to accurately reflect the updated transitive reuse logic.
- Modified the assertion in test_partial_reuse_with_overlapping_lifetimes to reflect the expected number of alloc statements after reuse.
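
The transitive conflict check described for `basic_memory_reuse_pass.cpp` can be pictured with a small model. The sketch below is an assumption about the idea, not a port of the pass: `find_root`, `can_reuse`, the `reuse_of` map, and the tuple lifetimes are hypothetical names for illustration. A candidate is compared against every tile already sharing the physical buffer, found by following reuse links to the root owner.

```python
# Illustrative model; data structures and names are assumptions.

def find_root(reuse_of, name):
    """Follow name -> reused-buffer links until reaching the owner."""
    while name in reuse_of:
        name = reuse_of[name]
    return name


def lifetimes_overlap(a, b):
    """Lifetimes are (def_point, last_use_point); touching is not overlap."""
    return not (a[1] <= b[0] or b[1] <= a[0])


def can_reuse(candidate, target, lifetimes, reuse_of):
    """True if `candidate` may share the buffer that `target` lives in."""
    root = find_root(reuse_of, target)
    # Every tile already placed in that physical buffer, root included.
    occupants = [name for name in lifetimes
                 if name != candidate and find_root(reuse_of, name) == root]
    return all(not lifetimes_overlap(lifetimes[candidate], lifetimes[name])
               for name in occupants)


lifetimes = {"a": (0, 2), "b": (2, 4), "c": (3, 5), "d": (4, 6)}
reuse_of = {"b": "a"}  # b already reuses a's buffer (lifetimes touch at 2)

c_ok = can_reuse("c", "a", lifetimes, reuse_of)  # c clashes with b, not a
d_ok = can_reuse("d", "b", lifetimes, reuse_of)  # d only touches b at 4
```

Checking `c` against `a` alone would wrongly allow the reuse; following the chain to the root brings `b` into the comparison and rejects it.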

Activity

Initial commit of changes.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

coderabbitai · 2026-03-16T02:45:51Z

Walkthrough

Adds a new MatmulaccProgram implementing two-stage K-splitting accumulation for 64x64 matmul, updates PTO and 910B_CCE codegen for in-place 3-argument TMATMUL_ACC, refines memory-reuse semantics to allow reuse when lifetimes touch, and adds tests exercising the accumulation path and updated reuse behavior.

Changes

Cohort / File(s)	Summary
Matmul accumulation feature `examples/language/beginner/matmul.py`, `tests/st/runtime/test_matmul.py`	Introduces `MatmulaccProgram` with `matmul_acc` and `orchestrator` implementing K-splitting (K[0:32], K[32:64]) accumulation flow; adds `TestMatmulAcc` and `TestMatmulAccPTO` tests and exposes new test cases.
PTO & 910B_CCE codegen `src/backend/910B_CCE/backend_910b_cce_ops.cpp`, `src/backend/common/pto_ops_common.cpp`	Converts TMATMUL_ACC emission to a 3-argument in-place form (dst, lhs, rhs); implements shared in-place accumulation codegen (`make_acc_codegen`) and registers `tile.matmul_acc`/`tile.gemv_acc` with custom PTO emission.
Memory reuse pass & tests `src/ir/transforms/basic_memory_reuse_pass.cpp`, `tests/ut/ir/transforms/test_basic_memory_reuse.py`	Changes overlap logic to treat touching lifetimes as non-overlapping (uses <=), moves reuse tracking to root-based grouping with transitive propagation, updates debug/logging and test expectations to reflect chain reuse and alloc changes.

Sequence Diagram

sequenceDiagram
    participant Prog as MatmulaccProgram
    participant LMem as LeftMemory
    participant RMem as RightMemory
    participant Compute as ComputeEngine
    participant Out as OutputTensor

    Prog->>LMem: load A tile (K:0-32)
    Prog->>RMem: load B tile (K:0-32)
    LMem-->>Compute: move A tile to compute
    RMem-->>Compute: move B tile to compute
    Compute->>Compute: matmul (K:0-32) -> initial dst
    Compute->>Out: store initial dst

    Prog->>LMem: load A tile (K:32-64)
    Prog->>RMem: load B tile (K:32-64)
    LMem-->>Compute: move A tile to compute
    RMem-->>Compute: move B tile to compute
    Out-->>Compute: load dst as accumulator
    Compute->>Compute: matmul_acc (K:32-64) -> accumulate into dst
    Compute->>Out: store final dst
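
The K-splitting flow in the diagram relies on the identity that a product over K=64 equals the product over K[0:32] plus the product over K[32:64]. A quick check in plain Python, assuming nothing about pypto: `matmul` and `matmul_acc` here are ordinary list-based stand-ins, not the `pl.matmul` / `pl.matmul_acc` ops, and integer inputs are used so the comparison is exact.

```python
import random

# Stand-in helpers on lists of rows; not the pypto ops of the same name.

def matmul(lhs, rhs):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*rhs)]
            for row in lhs]


def matmul_acc(acc, lhs, rhs):
    """Return acc + lhs @ rhs as a new matrix."""
    prod = matmul(lhs, rhs)
    return [[c + p for c, p in zip(acc_row, prod_row)]
            for acc_row, prod_row in zip(acc, prod)]


def k_slice(a, b, lo, hi):
    """Columns lo:hi of a and the matching rows lo:hi of b."""
    return [row[lo:hi] for row in a], b[lo:hi]


random.seed(0)
SIZE = 64
a = [[random.randint(-3, 3) for _ in range(SIZE)] for _ in range(SIZE)]
b = [[random.randint(-3, 3) for _ in range(SIZE)] for _ in range(SIZE)]

a0, b0 = k_slice(a, b, 0, 32)
a1, b1 = k_slice(a, b, 32, 64)

partial = matmul(a0, b0)              # first chunk initialises the result
result = matmul_acc(partial, a1, b1)  # second chunk accumulates onto it
full = matmul(a, b)                   # reference: one product over K=64
```

`result` and `full` are equal element for element, which is the property the new runtime tests compare against the golden output.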

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

fix(pass): allow memory reuse when producer last_use equals consumer def #503: Modifies basic_memory_reuse_pass.cpp to allow touching lifetimes to be considered non-overlapping — closely related to the reuse semantics change.
fix(passes/codegen): Add memory reuse shape check and col_major layout for unit-dim tile #331: Changes to basic_memory_reuse_pass.cpp (shape-compatibility gating) that touch the same reuse logic and root-owner handling.
refactor(ir): Decouple alloc op creation from address allocation #283: Prior edits to the basic memory reuse pass affecting reuse and alloc cleanup behavior.

Suggested reviewers

Hzfengsy

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 52.17% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main fix: resolving a matmul_acc output mismatch on Ascend NPU, which is the core objective of this pull request.
Description check	✅ Passed	The description comprehensively explains the three key changes (memory reuse pass, PTO codegen, CCE codegen) and their rationale, directly relating to the changeset across multiple files.


gemini-code-assist

Code Review

This pull request provides a comprehensive fix for an output mismatch issue with matmul_acc on Ascend NPUs. The changes span multiple layers, from the memory reuse pass to backend-specific codegen, and are well-supported by new examples and tests. The core logic appears sound and correctly addresses the hardware constraints described. My review includes a few suggestions to enhance code maintainability by reducing code duplication and improving naming for better clarity.

gemini-code-assist · 2026-03-16T02:47:44Z

src/backend/common/pto_ops_common.cpp

+      std::string ins_types;
+      if (!dst_type.empty()) ins_types += dst_type;
+      if (!lhs_type.empty()) {
+        if (!ins_types.empty()) ins_types += ", ";
+        ins_types += lhs_type;
+      }
+      if (!rhs_type.empty()) {
+        if (!ins_types.empty()) ins_types += ", ";
+        ins_types += rhs_type;
+      }
+      if (!ins_types.empty()) acc_inst << " : " << ins_types;


The logic for constructing the ins_types string is a bit repetitive. This pattern of conditionally appending to a string with a separator can be simplified to improve readability and reduce duplication.

You could use a helper function or a more direct approach, like collecting non-empty types into a std::vector<std::string> and then joining them.
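
The collect-then-join idea reads as follows in Python. This is a language-neutral sketch of the suggestion rather than the C++ change itself; `format_type_suffix` and the sample type strings are made-up names for illustration.

```python
# Sketch of "collect non-empty types, then join"; names are hypothetical.

def format_type_suffix(*types):
    """Build the ' : t1, t2' suffix, skipping empty type strings."""
    present = [t for t in types if t]
    return (" : " + ", ".join(present)) if present else ""


all_present = format_type_suffix("dst_t", "lhs_t", "rhs_t")
only_lhs = format_type_suffix("", "lhs_t", "")
none_present = format_type_suffix("", "", "")
```

The separator logic lives in one place, so adding a fourth operand type would not need another `if` block.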

gemini-code-assist · 2026-03-16T02:47:44Z

src/ir/transforms/basic_memory_reuse_pass.cpp

        bool overlaps_with_source = !(prev_lifetime.last_use_point <= curr_lifetime.def_point ||
-                                      curr_lifetime.last_use_point < prev_lifetime.def_point);
+                                      curr_lifetime.last_use_point <= prev_lifetime.def_point);


The logic to check for overlapping lifetimes is duplicated in three places in this function (here, lines 368-369, and lines 382-383). To improve readability and maintainability, consider extracting this into a small helper function.

For example:

static bool LifetimesOverlap(const LifetimeInterval& a, const LifetimeInterval& b) {
  // Lifetimes do not overlap if one ends before or at the same time the other starts.
  return !(a.last_use_point <= b.def_point || b.last_use_point <= a.def_point);
}

This would make the main logic cleaner and less error-prone.

References

This rule encourages extracting duplicated code into a private helper method to improve maintainability and prevent inconsistencies, which directly applies to the repeated lifetime overlap check.
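
The effect of the one-character change in the diff above can be checked with a small Python model of the two predicates. The `Lifetime` tuple and function names are assumptions for the sketch; the comparisons mirror the removed and added lines.

```python
from collections import namedtuple

# Model of the lifetime interval; field names follow the diff above.
Lifetime = namedtuple("Lifetime", "def_point last_use_point")


def overlaps_old(prev, curr):
    """Check before the change: second clause uses a strict '<'."""
    return not (prev.last_use_point <= curr.def_point
                or curr.last_use_point < prev.def_point)


def overlaps_new(prev, curr):
    """Check after the change: both clauses use '<='."""
    return not (prev.last_use_point <= curr.def_point
                or curr.last_use_point <= prev.def_point)


# curr ends exactly where prev is defined: the lifetimes only touch.
prev = Lifetime(def_point=3, last_use_point=5)
curr = Lifetime(def_point=0, last_use_point=3)

old_result = overlaps_old(prev, curr)
new_result = overlaps_new(prev, curr)
```

With the strict comparison the touching pair is reported as overlapping; with `<=` on both sides the check is symmetric and the pair is free to share a buffer.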

gemini-code-assist · 2026-03-16T02:47:44Z

tests/ut/ir/transforms/test_basic_memory_reuse.py


-    def test_partial_reuse_with_overlapping_lifetimes(self):
-        """Producer-consumer reuse still works even when some lifetimes overlap.
+    def test_no_alloc_removed_when_no_reuse(self):


The test name test_no_alloc_removed_when_no_reuse is slightly misleading. The docstring and implementation show that reuse does happen (tile_c reuses tile_a), but not for all tiles due to overlapping lifetimes. The old name test_partial_reuse_with_overlapping_lifetimes was more descriptive. Consider renaming the test for clarity to better reflect that partial reuse is being tested.

Suggested change

-    def test_no_alloc_removed_when_no_reuse(self):
+    def test_partial_reuse_with_overlapping_lifetimes(self):

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

examples/language/beginner/matmul.py (1)

54-58: Nitpick: Consider replacing × with x in docstring.

Static analysis flagged the multiplication sign × (U+00D7) on line 57 as ambiguous. While it renders correctly, using ASCII x is more conventional in code documentation.

📝 Suggested fix

-    ``matmul_acc``.  The final result equals the full 64×64 matrix product.
+    ``matmul_acc``.  The final result equals the full 64x64 matrix product.
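
To confirm that no other non-ASCII characters remain after a fix like this, a short scan is enough. A minimal sketch, assuming the text to check is already loaded as a string; `find_non_ascii` is a hypothetical helper, not part of the repository or its lint setup.

```python
# Hypothetical helper for spotting characters such as U+00D7 in a docstring.

def find_non_ascii(text):
    """Return (line, column, codepoint) for every non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, f"U+{ord(ch):04X}"))
    return hits


before = "The final result equals the full 64\u00d764 matrix product."
after = "The final result equals the full 64x64 matrix product."

hits_before = find_non_ascii(before)
hits_after = find_non_ascii(after)
```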

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/backend/910B_CCE/backend_910b_cce_ops.cpp`:
- Around line 303-318: The lambda that implements .f_codegen currently emits
TMATMUL_ACC(dst, lhs, rhs) but never verifies that the acc input (op->args_[0])
and the current result target (dst from codegen.GetCurrentResultTarget()) are
aliased by memory reuse; add an explicit CHECK before emitting the instruction
that the acc buffer and dst are the same (e.g., compare
codegen.GetExprAsCode(op->args_[0]) or the buffer/target identifiers for
op->args_[0] with dst) and fail with a clear error message if they differ so we
never silently emit TMATMUL_ACC when acc and dst are not merged.
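
The guard requested in this comment can be pictured as below. It is a Python sketch of the idea only: the real codegen is C++ and would use its own CHECK macro, and `emit_cce_matmul_acc` plus the target names are hypothetical.

```python
# Sketch of the aliasing guard; function and buffer names are assumptions.

def emit_cce_matmul_acc(acc_target, dst_target, lhs, rhs):
    """Emit the 3-argument form only when acc and dst are the same buffer."""
    if acc_target != dst_target:
        # The 3-arg form reads the accumulator from dst, so emitting it
        # with an unmerged acc would silently drop the partial sum.
        raise ValueError(
            f"matmul_acc: acc '{acc_target}' and dst '{dst_target}' "
            "must share a buffer for the 3-argument TMATMUL_ACC form"
        )
    return f"TMATMUL_ACC({dst_target}, {lhs}, {rhs});"


merged = emit_cce_matmul_acc("c_tile", "c_tile", "a_tile", "b_tile")
```

Failing loudly at codegen time turns a silent numerical mismatch into a build-time error that points at the memory reuse result.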

📥 Commits

Reviewing files that changed from the base of the PR and between d01baff and c95ea8f.

📒 Files selected for processing (6)

examples/language/beginner/matmul.py
src/backend/910B_CCE/backend_910b_cce_ops.cpp
src/backend/common/pto_ops_common.cpp
src/ir/transforms/basic_memory_reuse_pass.cpp
tests/st/runtime/test_matmul.py
tests/ut/ir/transforms/test_basic_memory_reuse.py

src/backend/910B_CCE/backend_910b_cce_ops.cpp

coderabbitai

🧹 Nitpick comments (1)

examples/language/beginner/matmul.py (1)

54-58: Replace Unicode multiplication sign with ASCII x.

The docstring uses × (U+00D7, MULTIPLICATION SIGN) which can cause issues with text encoding, search, and copy-paste. Use the standard ASCII letter x instead.

Proposed fix

     """Matrix multiply with accumulation — K=64 split into two K=32 chunks.
 
     First chunk initialises L0C via ``matmul``; second chunk accumulates via
-    ``matmul_acc``.  The final result equals the full 64×64 matrix product.
+    ``matmul_acc``.  The final result equals the full 64x64 matrix product.
     """

📥 Commits

Reviewing files that changed from the base of the PR and between c95ea8f and 084cf14.

📒 Files selected for processing (6)

examples/language/beginner/matmul.py
src/backend/910B_CCE/backend_910b_cce_ops.cpp
src/backend/common/pto_ops_common.cpp
src/ir/transforms/basic_memory_reuse_pass.cpp
tests/st/runtime/test_matmul.py
tests/ut/ir/transforms/test_basic_memory_reuse.py

🚧 Files skipped from review as they are similar to previous changes (1)

src/backend/910B_CCE/backend_910b_cce_ops.cpp

wangqin1723-max force-pushed the issue-411-fix-matmulacc-output-v2 branch from c95ea8f to 084cf14 on March 16, 2026 02:47

gemini-code-assist bot reviewed Mar 16, 2026

coderabbitai bot reviewed Mar 16, 2026

coderabbitai bot reviewed Mar 16, 2026

fix(codegen,memory): fix matmulacc output mismatch on Ascend NPU

88b0dc5

wangqin1723-max force-pushed the issue-411-fix-matmulacc-output-v2 branch from ef32e6b to 88b0dc5 on March 16, 2026 06:07

lyfne123 approved these changes Mar 16, 2026

lyfne123 merged commit b0de7d8 into hw-native-sys:main Mar 16, 2026
7 checks passed

coderabbitai bot mentioned this pull request Mar 16, 2026

fix(ir): prevent in-place memory reuse for ops that do not support src==dst #557

Merged

wangqin1723-max mentioned this pull request Mar 16, 2026

[Bug] matmulacc 64x64x64 output does not match golden — all 4096 elements mismatched #411

Closed

coderabbitai bot mentioned this pull request Mar 18, 2026

fix(ir): fix BasicMemoryReuse aliasing of simultaneously live tiles #598

Open
