fix(codegen,memory): fix matmulacc output mismatch on Ascend NPU #537

Merged
lyfne123 merged 1 commit into hw-native-sys:main from wangqin1723-max:issue-411-fix-matmulacc-output-v2
Mar 16, 2026

Conversation

@wangqin1723-max
Contributor

The CUBE engine's TMATMUL_ACC instruction always reads the accumulator
from the OUTPUT buffer, ignoring any separate accumulator input parameter.
Three changes ensure correct behavior:

  1. Memory reuse pass: allow "touching" lifetimes (last_use == def_point)
    to share buffers, since within a single statement inputs are consumed
    before outputs are produced. This enables the acc input and output of
    matmul_acc to share the same physical buffer. Also fix transitive
    reuse chain tracking by following reuse chains to the root MemRef
    owner when checking for conflicts.

  2. PTO codegen: add custom codegen for tile.matmul_acc/tile.gemv_acc
    that emits only lhs and rhs as ins() operands (not the accumulator),
    and inserts a pto.tmov when acc and dst resolve to different buffers.

  3. CCE codegen: use 3-arg TMATMUL_ACC(dst, lhs, rhs) form instead of
    4-arg, since the ISA cannot TMOV between two Acc-space tiles.
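
To make the failure mode concrete, here is a minimal Python model of the behavior described above. It is only a sketch, not the real ISA or runtime: the `tmatmul_acc` function and the buffer dictionary are hypothetical, and the only property taken from the description is that the accumulator is read from the output buffer.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def tmatmul_acc(buffers, dst, lhs, rhs):
    """Hypothetical model: dst = dst + lhs @ rhs, accumulator read from dst."""
    prod = matmul(buffers[lhs], buffers[rhs])
    old = buffers[dst]
    buffers[dst] = [[o + p for o, p in zip(orow, prow)]
                    for orow, prow in zip(old, prod)]


buffers = {
    "lhs": [[1, 2], [3, 4]],
    "rhs": [[1, 0], [0, 1]],
    "acc": [[10, 10], [10, 10]],  # partial result from an earlier matmul
    "dst": [[0, 0], [0, 0]],      # separate output buffer without the partial result
}

# acc and dst are different buffers: the partial result in "acc" is ignored.
tmatmul_acc(buffers, "dst", "lhs", "rhs")
wrong = buffers["dst"]

# acc and dst share one buffer, which is what the memory reuse change enables.
tmatmul_acc(buffers, "acc", "lhs", "rhs")
right = buffers["acc"]
```

In this model `wrong` misses the earlier partial sum while `right` contains it, which mirrors the output mismatch the PR title refers to.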

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an output mismatch issue with the matmul_acc instruction on Ascend NPU by refining memory management and code generation strategies. The changes ensure that accumulation operations correctly utilize the hardware's in-place accumulation capabilities, leading to accurate results and optimized resource usage for matrix multiplication with accumulation.

Highlights

  • Memory Reuse Pass Enhancement: Modified the memory reuse pass to allow 'touching' lifetimes (where last_use == def_point) to share buffers. This ensures that within a single statement, input accumulators and output destinations for operations like matmul_acc can share the same physical buffer. Additionally, improved transitive reuse chain tracking by following reuse chains to the root MemRef owner to prevent conflicts.
  • PTO Codegen for Accumulation Operations: Implemented custom codegen for tile.matmul_acc and tile.gemv_acc in the PTO backend. This codegen now emits only the lhs and rhs as input operands and inserts a pto.tmov instruction to copy the accumulator input to the output buffer if they resolve to different physical buffers, ensuring correct in-place accumulation behavior.
  • CCE Codegen for TMATMUL_ACC: Updated the CCE codegen for tile.matmul_acc to use the 3-argument TMATMUL_ACC(dst, lhs, rhs) form. This change is necessary because the CUBE engine on Ascend NPUs always reads the accumulator from the output buffer, and the ISA does not support TMOV between two Acc-space tiles for a separate accumulator input.
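
The PTO codegen rule in the highlights can be sketched roughly as follows. This is an illustration under assumptions, not the backend's actual emitter: the function name, the buffer names, the textual `ins(...)`/`outs(...)` layout, and the `acc_op` placeholder for the accumulate instruction are all hypothetical. Only `pto.tmov` and the "lhs and rhs only as inputs" rule come from the description.

```python
def emit_matmul_acc(acc_buf, dst_buf, lhs, rhs, op_name="acc_op"):
    """Return pseudo-instructions for an in-place accumulate (hypothetical syntax)."""
    lines = []
    if acc_buf != dst_buf:
        # The accumulator lives in a different buffer: copy it into the
        # output buffer first so the hardware finds it there.
        lines.append(f"pto.tmov ins({acc_buf}) outs({dst_buf})")
    # Only lhs and rhs are listed as inputs; the accumulator is implicit in dst.
    lines.append(f"{op_name} ins({lhs}, {rhs}) outs({dst_buf})")
    return lines


# Memory reuse merged acc and dst: a single instruction is enough.
shared = emit_matmul_acc("%c0", "%c0", "%a", "%b")

# acc and dst resolved to different buffers: a copy is emitted first.
split = emit_matmul_acc("%c0", "%c1", "%a", "%b")
```

The copy is a fallback; when the memory reuse pass merges the two buffers, the `shared` case applies and no extra move is needed.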

Changelog
  • examples/language/beginner/matmul.py
    • Added a new MatmulaccProgram class to demonstrate matrix multiplication with accumulation, splitting the K-dimension into two chunks using pl.matmul and pl.matmul_acc.
  • src/backend/910B_CCE/backend_910b_cce_ops.cpp
    • Modified the tile.matmul_acc codegen to emit the 3-argument TMATMUL_ACC(dst, lhs, rhs) instruction, removing the explicit accumulator input argument.
  • src/backend/common/pto_ops_common.cpp
    • Removed tile.matmul_acc and tile.gemv_acc from the list of simple PTO operations, indicating they now have custom codegen.
    • Introduced custom codegen for tile.matmul_acc and tile.gemv_acc to handle in-place accumulation, including an explicit pto.tmov instruction when the accumulator input and output buffers are distinct.
  • src/ir/transforms/basic_memory_reuse_pass.cpp
    • Adjusted the lifetime overlap check to consider 'touching' lifetimes (where last_use_point == def_point) as non-overlapping, enabling more aggressive memory reuse.
    • Enhanced the transitive reuse conflict detection by traversing the reuse chain to the root MemRef owner to ensure accurate conflict checking.
  • tests/st/runtime/test_matmul.py
    • Imported the new MatmulaccProgram example.
    • Added TestMatmulAcc and TestMatmulAccPTO classes to test the matmul_acc functionality with and without the PTO backend.
  • tests/ut/ir/transforms/test_basic_memory_reuse.py
    • Updated comments in test_simple and test_with_dependencies to clarify the behavior of touching lifetimes in memory reuse.
    • Revised the description and assertions in test_transitive_conflict to accurately reflect the updated transitive reuse logic.
    • Modified the assertion in test_partial_reuse_with_overlapping_lifetimes to reflect the expected number of alloc statements after reuse.
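
The transitive reuse check mentioned for basic_memory_reuse_pass.cpp can be pictured with a small Python sketch. It is not the pass's implementation: `find_root`, `conflicts`, the `reuse_of` map, and the `(def_point, last_use_point)` tuples are names assumed here for illustration. The idea shown is that a candidate must be checked against every buffer already sharing the root owner's memory, not only against the root itself.

```python
def find_root(reuse_of, buf):
    """Follow reuse links until reaching a buffer that owns its own memory."""
    while buf in reuse_of:
        buf = reuse_of[buf]
    return buf


def overlaps(a, b):
    """Touching lifetimes (last_use == def_point) count as non-overlapping."""
    return not (a[1] <= b[0] or b[1] <= a[0])


def conflicts(lifetimes, reuse_of, candidate, target):
    """True if candidate overlaps any buffer already sharing target's root."""
    root = find_root(reuse_of, target)
    sharers = [root] + [b for b in reuse_of if find_root(reuse_of, b) == root]
    return any(overlaps(lifetimes[candidate], lifetimes[b]) for b in sharers)


# (def_point, last_use_point) per buffer; "b" already reuses "a".
lifetimes = {"a": (0, 2), "b": (2, 5), "c": (4, 6), "d": (5, 7)}
reuse_of = {"b": "a"}

naive_c = overlaps(lifetimes["c"], lifetimes["a"])  # looks safe in isolation
chain_c = conflicts(lifetimes, reuse_of, "c", "a")  # but "b" is still live
chain_d = conflicts(lifetimes, reuse_of, "d", "b")  # touches "b", so it can join
```

Checking only the direct target would let `c` share memory with `a` while `b` still occupies it; following the chain to the root catches that case.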
Activity
  • Initial commit of changes.

@coderabbitai

coderabbitai bot commented Mar 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

📝 Walkthrough

Adds a new MatmulaccProgram implementing two-stage K-splitting accumulation for 64x64 matmul, updates PTO and 910B_CCE codegen for in-place 3-argument TMATMUL_ACC, refines memory-reuse semantics to allow reuse when lifetimes touch, and adds tests exercising the accumulation path and updated reuse behavior.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Matmul accumulation feature: examples/language/beginner/matmul.py, tests/st/runtime/test_matmul.py | Introduces MatmulaccProgram with matmul_acc and orchestrator implementing K-splitting (K[0:32], K[32:64]) accumulation flow; adds TestMatmulAcc and TestMatmulAccPTO tests and exposes new test cases. |
| PTO & 910B_CCE codegen: src/backend/910B_CCE/backend_910b_cce_ops.cpp, src/backend/common/pto_ops_common.cpp | Converts TMATMUL_ACC emission to a 3-argument in-place form (dst, lhs, rhs); implements shared in-place accumulation codegen (make_acc_codegen) and registers tile.matmul_acc/tile.gemv_acc with custom PTO emission. |
| Memory reuse pass & tests: src/ir/transforms/basic_memory_reuse_pass.cpp, tests/ut/ir/transforms/test_basic_memory_reuse.py | Changes overlap logic to treat touching lifetimes as non-overlapping (uses <=), moves reuse tracking to root-based grouping with transitive propagation, updates debug/logging and test expectations to reflect chain reuse and alloc changes. |

Sequence Diagram

sequenceDiagram
    participant Prog as MatmulaccProgram
    participant LMem as LeftMemory
    participant RMem as RightMemory
    participant Compute as ComputeEngine
    participant Out as OutputTensor

    Prog->>LMem: load A tile (K:0-32)
    Prog->>RMem: load B tile (K:0-32)
    LMem-->>Compute: move A tile to compute
    RMem-->>Compute: move B tile to compute
    Compute->>Compute: matmul (K:0-32) -> initial dst
    Compute->>Out: store initial dst

    Prog->>LMem: load A tile (K:32-64)
    Prog->>RMem: load B tile (K:32-64)
    LMem-->>Compute: move A tile to compute
    RMem-->>Compute: move B tile to compute
    Out-->>Compute: load dst as accumulator
    Compute->>Compute: matmul_acc (K:32-64) -> accumulate into dst
    Compute->>Out: store final dst
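
The K-splitting flow in the diagram relies on the identity that a matrix product over the full K range equals the sum of the products over disjoint K chunks. A scaled-down Python sketch (K=4 split into two chunks of 2 instead of K=64 into two chunks of 32; the helper names are illustrative and unrelated to the `pl.matmul` / `pl.matmul_acc` API) shows the check:

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matmul_acc(acc, a, b):
    """Return acc + a @ b (functional stand-in for in-place accumulation)."""
    prod = matmul(a, b)
    return [[c + p for c, p in zip(crow, prow)] for crow, prow in zip(acc, prod)]


M, K, N = 2, 4, 3
A = [[i * K + k + 1 for k in range(K)] for i in range(M)]
B = [[k * N + j + 1 for j in range(N)] for k in range(K)]

half = K // 2
A_lo, A_hi = [row[:half] for row in A], [row[half:] for row in A]
B_lo, B_hi = B[:half], B[half:]

partial = matmul(A_lo, B_lo)              # first chunk initialises the result
result = matmul_acc(partial, A_hi, B_hi)  # second chunk accumulates into it
full = matmul(A, B)                       # reference: unsplit product
```

With integer inputs the two results match exactly; with floating-point tiles on real hardware a tolerance-based comparison would be the usual choice.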

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested reviewers

  • Hzfengsy

Poem

🐰
I hop through tiles with numbers bright,
Split K in two to spark the fight,
matmul then matmul_acc in play,
Roots entwine and reuse finds sway,
Backend hums — the sums unite.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 52.17% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main fix: resolving a matmul_acc output mismatch on Ascend NPU, which is the core objective of this pull request. |
| Description check | ✅ Passed | The description comprehensively explains the three key changes (memory reuse pass, PTO codegen, CCE codegen) and their rationale, directly relating to the changeset across multiple files. |


@wangqin1723-max force-pushed the issue-411-fix-matmulacc-output-v2 branch from c95ea8f to 084cf14 on March 16, 2026 02:47

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request provides a comprehensive fix for an output mismatch issue with matmul_acc on Ascend NPUs. The changes span multiple layers, from the memory reuse pass to backend-specific codegen, and are well-supported by new examples and tests. The core logic appears sound and correctly addresses the hardware constraints described. My review includes a few suggestions to enhance code maintainability by reducing code duplication and improving naming for better clarity.

Comment on lines +1008 to +1018
std::string ins_types;
if (!dst_type.empty()) ins_types += dst_type;
if (!lhs_type.empty()) {
  if (!ins_types.empty()) ins_types += ", ";
  ins_types += lhs_type;
}
if (!rhs_type.empty()) {
  if (!ins_types.empty()) ins_types += ", ";
  ins_types += rhs_type;
}
if (!ins_types.empty()) acc_inst << " : " << ins_types;

medium

The logic for constructing the ins_types string is a bit repetitive. This pattern of conditionally appending to a string with a separator can be simplified to improve readability and reduce duplication.

You could use a helper function or a more direct approach, like collecting non-empty types into a std::vector<std::string> and then joining them.
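
The "collect then join" idea can be shown in a few lines of Python. This is only an illustration of the pattern: the code under review is C++, and the helper names and type strings below are made up for the example.

```python
def join_types(*types):
    """Join the non-empty type strings with ', ' in their original order."""
    return ", ".join(t for t in types if t)


def type_suffix(dst_type, lhs_type, rhs_type):
    """Return ' : <types>' or an empty string when no type is known."""
    joined = join_types(dst_type, lhs_type, rhs_type)
    return f" : {joined}" if joined else ""


all_known = type_suffix("tile<acc>", "tile<left>", "tile<right>")
one_missing = type_suffix("tile<acc>", "", "tile<right>")
none_known = type_suffix("", "", "")
```

The separator logic lives in one place, so adding a fourth operand type would not require another hand-written `if` block.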

Comment on lines +338 to +339
  bool overlaps_with_source = !(prev_lifetime.last_use_point <= curr_lifetime.def_point ||
-                               curr_lifetime.last_use_point < prev_lifetime.def_point);
+                               curr_lifetime.last_use_point <= prev_lifetime.def_point);

medium

The logic to check for overlapping lifetimes is duplicated in three places in this function (here, lines 368-369, and lines 382-383). To improve readability and maintainability, consider extracting this into a small helper function.

For example:

static bool LifetimesOverlap(const LifetimeInterval& a, const LifetimeInterval& b) {
  // Lifetimes do not overlap if one ends before or at the same time the other starts.
  return !(a.last_use_point <= b.def_point || b.last_use_point <= a.def_point);
}

This would make the main logic cleaner and less error-prone.
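
For readers who want to see how the `<=` comparison treats touching lifetimes, here is a Python rendering of the same predicate with a few sample intervals. The interval values and variable names are invented for illustration; only the comparison itself mirrors the helper suggested above.

```python
from collections import namedtuple

LifetimeInterval = namedtuple("LifetimeInterval", ["def_point", "last_use_point"])


def lifetimes_overlap(a, b):
    """Lifetimes do not overlap if one ends before or exactly when the other starts."""
    return not (a.last_use_point <= b.def_point or b.last_use_point <= a.def_point)


# The accumulator input is last used at statement 3, where the output is defined.
acc_in = LifetimeInterval(def_point=0, last_use_point=3)
acc_out = LifetimeInterval(def_point=3, last_use_point=5)
other = LifetimeInterval(def_point=2, last_use_point=4)

touching = lifetimes_overlap(acc_in, acc_out)  # shared endpoint only
crossing = lifetimes_overlap(acc_in, other)    # genuinely live at the same time
```

Because the check is symmetric in its two arguments, a single helper can serve every call site regardless of which interval is the "previous" one.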

References
  1. This rule encourages extracting duplicated code into a private helper method to improve maintainability and prevent inconsistencies, which directly applies to the repeated lifetime overlap check.


-    def test_partial_reuse_with_overlapping_lifetimes(self):
-        """Producer-consumer reuse still works even when some lifetimes overlap.
+    def test_no_alloc_removed_when_no_reuse(self):

medium

The test name test_no_alloc_removed_when_no_reuse is slightly misleading. The docstring and implementation show that reuse does happen (tile_c reuses tile_a), but not for all tiles due to overlapping lifetimes. The old name test_partial_reuse_with_overlapping_lifetimes was more descriptive. Consider renaming the test for clarity to better reflect that partial reuse is being tested.

Suggested change
-    def test_no_alloc_removed_when_no_reuse(self):
+    def test_partial_reuse_with_overlapping_lifetimes(self):

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
examples/language/beginner/matmul.py (1)

54-58: Nitpick: Consider replacing × with x in docstring.

Static analysis flagged the multiplication sign × (U+00D7) on line 57 as ambiguous. While it renders correctly, using ASCII x is more conventional in code documentation.

📝 Suggested fix
-    ``matmul_acc``.  The final result equals the full 64×64 matrix product.
+    ``matmul_acc``.  The final result equals the full 64x64 matrix product.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/language/beginner/matmul.py` around lines 54 - 58, The docstring in
examples/language/beginner/matmul.py contains the Unicode multiplication sign
`×` in the sentence "The final result equals the full 64×64 matrix product.";
replace that character with the ASCII letter `x` so it reads "64x64" to avoid
the non-ASCII symbol. Update the triple-quoted docstring where the phrase
appears (inside the module/function docstring around the `matmul`/`matmul_acc`
description) and ensure no other occurrences of U+00D7 remain.
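
One way to confirm that no U+00D7 characters remain is a short scan like the one below. It is a generic sketch operating on a string; the function name is hypothetical and the sample text only imitates the docstring sentence quoted above.

```python
def find_multiplication_signs(text):
    """Return 1-based (line, column) positions of every U+00D7 in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == "\u00d7":
                hits.append((lineno, col))
    return hits


sample = "First line\nfull 64\u00d764 matrix product\n"
hits = find_multiplication_signs(sample)

# Replace the sign with ASCII "x" and scan again.
fixed = sample.replace("\u00d7", "x")
remaining = find_multiplication_signs(fixed)
```

To check a real file, the same function could be applied to the result of reading it with an explicit `encoding="utf-8"`.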
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/backend/910B_CCE/backend_910b_cce_ops.cpp`:
- Around line 303-318: The lambda that implements .f_codegen currently emits
TMATMUL_ACC(dst, lhs, rhs) but never verifies that the acc input (op->args_[0])
and the current result target (dst from codegen.GetCurrentResultTarget()) are
aliased by memory reuse; add an explicit CHECK before emitting the instruction
that the acc buffer and dst are the same (e.g., compare
codegen.GetExprAsCode(op->args_[0]) or the buffer/target identifiers for
op->args_[0] with dst) and fail with a clear error message if they differ so we
never silently emit TMATMUL_ACC when acc and dst are not merged.
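
The guard requested here can be sketched as follows. This is a Python stand-in for illustration, not the C++ lambda in backend_910b_cce_ops.cpp: the function name and buffer names are hypothetical, and only the 3-argument `TMATMUL_ACC(dst, lhs, rhs)` form comes from the PR.

```python
def emit_cce_matmul_acc(acc_buf, dst_buf, lhs, rhs):
    """Emit the 3-argument form, refusing to proceed if acc and dst differ."""
    if acc_buf != dst_buf:
        # The hardware reads the accumulator from dst, so a separate acc
        # buffer would be silently ignored; fail loudly instead.
        raise ValueError(
            f"matmul_acc: accumulator '{acc_buf}' and output '{dst_buf}' "
            "must share one buffer"
        )
    return f"TMATMUL_ACC({dst_buf}, {lhs}, {rhs});"


ok_line = emit_cce_matmul_acc("c", "c", "a", "b")

try:
    emit_cce_matmul_acc("acc", "c", "a", "b")
    rejected = False
except ValueError:
    rejected = True
```

Failing at codegen time turns a silent numerical mismatch into an explicit error that points at the unmerged buffers.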

ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between d01baff and c95ea8f.

📒 Files selected for processing (6)
  • examples/language/beginner/matmul.py
  • src/backend/910B_CCE/backend_910b_cce_ops.cpp
  • src/backend/common/pto_ops_common.cpp
  • src/ir/transforms/basic_memory_reuse_pass.cpp
  • tests/st/runtime/test_matmul.py
  • tests/ut/ir/transforms/test_basic_memory_reuse.py

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
examples/language/beginner/matmul.py (1)

54-58: Replace Unicode multiplication sign with ASCII x.

The docstring uses × (U+00D7, MULTIPLICATION SIGN) which can cause issues with text encoding, search, and copy-paste. Use the standard ASCII letter x instead.

Proposed fix
     """Matrix multiply with accumulation — K=64 split into two K=32 chunks.
 
     First chunk initialises L0C via ``matmul``; second chunk accumulates via
-    ``matmul_acc``.  The final result equals the full 64×64 matrix product.
+    ``matmul_acc``.  The final result equals the full 64x64 matrix product.
     """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/language/beginner/matmul.py` around lines 54 - 58, In the docstring
in examples/language/beginner/matmul.py (the module-level docstring describing
"Matrix multiply with accumulation" that references K=64 and the functions
matmul and matmul_acc), replace the Unicode multiplication sign "×" with the
ASCII letter "x" so the text reads e.g. "64x64" and "K=32 chunks" to avoid
encoding/search/copy-paste issues; ensure both occurrences in that docstring are
updated and preserve surrounding wording and backticks.

ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between c95ea8f and 084cf14.

📒 Files selected for processing (6)
  • examples/language/beginner/matmul.py
  • src/backend/910B_CCE/backend_910b_cce_ops.cpp
  • src/backend/common/pto_ops_common.cpp
  • src/ir/transforms/basic_memory_reuse_pass.cpp
  • tests/st/runtime/test_matmul.py
  • tests/ut/ir/transforms/test_basic_memory_reuse.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/backend/910B_CCE/backend_910b_cce_ops.cpp

@wangqin1723-max force-pushed the issue-411-fix-matmulacc-output-v2 branch from ef32e6b to 88b0dc5 on March 16, 2026 06:07

2 participants