fix(codegen,memory): fix matmulacc output mismatch on Ascend NPU#537
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses an output mismatch issue with the Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a new MatmulaccProgram implementing two-stage K-splitting accumulation for 64x64 matmul, updates PTO and 910B_CCE codegen for in-place 3-argument TMATMUL_ACC, refines memory-reuse semantics to allow reuse when lifetimes touch, and adds tests exercising the accumulation path and updated reuse behavior. Changes
Sequence DiagramsequenceDiagram
participant Prog as MatmulaccProgram
participant LMem as LeftMemory
participant RMem as RightMemory
participant Compute as ComputeEngine
participant Out as OutputTensor
Prog->>LMem: load A tile (K:0-32)
Prog->>RMem: load B tile (K:0-32)
LMem-->>Compute: move A tile to compute
RMem-->>Compute: move B tile to compute
Compute->>Compute: matmul (K:0-32) -> initial dst
Compute->>Out: store initial dst
Prog->>LMem: load A tile (K:32-64)
Prog->>RMem: load B tile (K:32-64)
LMem-->>Compute: move A tile to compute
RMem-->>Compute: move B tile to compute
Out-->>Compute: load dst as accumulator
Compute->>Compute: matmul_acc (K:32-64) -> accumulate into dst
Compute->>Out: store final dst
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. 📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
c95ea8f to
084cf14
Compare
There was a problem hiding this comment.
Code Review
This pull request provides a comprehensive fix for an output mismatch issue with matmul_acc on Ascend NPUs. The changes span multiple layers, from the memory reuse pass to backend-specific codegen, and are well-supported by new examples and tests. The core logic appears sound and correctly addresses the hardware constraints described. My review includes a few suggestions to enhance code maintainability by reducing code duplication and improving naming for better clarity.
| std::string ins_types; | ||
| if (!dst_type.empty()) ins_types += dst_type; | ||
| if (!lhs_type.empty()) { | ||
| if (!ins_types.empty()) ins_types += ", "; | ||
| ins_types += lhs_type; | ||
| } | ||
| if (!rhs_type.empty()) { | ||
| if (!ins_types.empty()) ins_types += ", "; | ||
| ins_types += rhs_type; | ||
| } | ||
| if (!ins_types.empty()) acc_inst << " : " << ins_types; |
There was a problem hiding this comment.
The logic for constructing the ins_types string is a bit repetitive. This pattern of conditionally appending to a string with a separator can be simplified to improve readability and reduce duplication.
You could use a helper function or a more direct approach, like collecting non-empty types into a std::vector<std::string> and then joining them.
| bool overlaps_with_source = !(prev_lifetime.last_use_point <= curr_lifetime.def_point || | ||
| curr_lifetime.last_use_point < prev_lifetime.def_point); | ||
| curr_lifetime.last_use_point <= prev_lifetime.def_point); |
There was a problem hiding this comment.
The logic to check for overlapping lifetimes is duplicated in three places in this function (here, lines 368-369, and lines 382-383). To improve readability and maintainability, consider extracting this into a small helper function.
For example:
static bool LifetimesOverlap(const LifetimeInterval& a, const LifetimeInterval& b) {
// Lifetimes do not overlap if one ends before or at the same time the other starts.
return !(a.last_use_point <= b.def_point || b.last_use_point <= a.def_point);
}This would make the main logic cleaner and less error-prone.
References
- This rule encourages extracting duplicated code into a private helper method to improve maintainability and prevent inconsistencies, which directly applies to the repeated lifetime overlap check.
|
|
||
| def test_partial_reuse_with_overlapping_lifetimes(self): | ||
| """Producer-consumer reuse still works even when some lifetimes overlap. | ||
| def test_no_alloc_removed_when_no_reuse(self): |
There was a problem hiding this comment.
The test name test_no_alloc_removed_when_no_reuse is slightly misleading. The docstring and implementation show that reuse does happen (tile_c reuses tile_a), but not for all tiles due to overlapping lifetimes. The old name test_partial_reuse_with_overlapping_lifetimes was more descriptive. Consider renaming the test for clarity to better reflect that partial reuse is being tested.
| def test_no_alloc_removed_when_no_reuse(self): | |
| def test_partial_reuse_with_overlapping_lifetimes(self): |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
examples/language/beginner/matmul.py (1)
54-58: Nitpick: Consider replacing×withxin docstring.Static analysis flagged the multiplication sign
×(U+00D7) on line 57 as ambiguous. While it renders correctly, using ASCIIxis more conventional in code documentation.📝 Suggested fix
- ``matmul_acc``. The final result equals the full 64×64 matrix product. + ``matmul_acc``. The final result equals the full 64x64 matrix product.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/language/beginner/matmul.py` around lines 54 - 58, The docstring in examples/language/beginner/matmul.py contains the Unicode multiplication sign `×` in the sentence "The final result equals the full 64×64 matrix product."; replace that character with the ASCII letter `x` so it reads "64x64" to avoid the non-ASCII symbol. Update the triple-quoted docstring where the phrase appears (inside the module/function docstring around the `matmul`/`matmul_acc` description) and ensure no other occurrences of U+00D7 remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/backend/910B_CCE/backend_910b_cce_ops.cpp`:
- Around line 303-318: The lambda that implements .f_codegen currently emits
TMATMUL_ACC(dst, lhs, rhs) but never verifies that the acc input (op->args_[0])
and the current result target (dst from codegen.GetCurrentResultTarget()) are
aliased by memory reuse; add an explicit CHECK before emitting the instruction
that the acc buffer and dst are the same (e.g., compare
codegen.GetExprAsCode(op->args_[0]) or the buffer/target identifiers for
op->args_[0] with dst) and fail with a clear error message if they differ so we
never silently emit TMATMUL_ACC when acc and dst are not merged.
---
Nitpick comments:
In `@examples/language/beginner/matmul.py`:
- Around line 54-58: The docstring in examples/language/beginner/matmul.py
contains the Unicode multiplication sign `×` in the sentence "The final result
equals the full 64×64 matrix product."; replace that character with the ASCII
letter `x` so it reads "64x64" to avoid the non-ASCII symbol. Update the
triple-quoted docstring where the phrase appears (inside the module/function
docstring around the `matmul`/`matmul_acc` description) and ensure no other
occurrences of U+00D7 remain.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 5d5a894c-4b99-46d9-9d8c-654345b7335e
📒 Files selected for processing (6)
examples/language/beginner/matmul.pysrc/backend/910B_CCE/backend_910b_cce_ops.cppsrc/backend/common/pto_ops_common.cppsrc/ir/transforms/basic_memory_reuse_pass.cpptests/st/runtime/test_matmul.pytests/ut/ir/transforms/test_basic_memory_reuse.py
There was a problem hiding this comment.
🧹 Nitpick comments (1)
examples/language/beginner/matmul.py (1)
54-58: Replace Unicode multiplication sign with ASCIIx.The docstring uses
×(U+00D7, MULTIPLICATION SIGN) which can cause issues with text encoding, search, and copy-paste. Use the standard ASCII letterxinstead.Proposed fix
"""Matrix multiply with accumulation — K=64 split into two K=32 chunks. First chunk initialises L0C via ``matmul``; second chunk accumulates via - ``matmul_acc``. The final result equals the full 64×64 matrix product. + ``matmul_acc``. The final result equals the full 64x64 matrix product. """🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/language/beginner/matmul.py` around lines 54 - 58, In the docstring in examples/language/beginner/matmul.py (the module-level docstring describing "Matrix multiply with accumulation" that references K=64 and the functions matmul and matmul_acc), replace the Unicode multiplication sign "×" with the ASCII letter "x" so the text reads e.g. "64x64" and "K=32 chunks" to avoid encoding/search/copy-paste issues; ensure both occurrences in that docstring are updated and preserve surrounding wording and backticks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@examples/language/beginner/matmul.py`:
- Around line 54-58: In the docstring in examples/language/beginner/matmul.py
(the module-level docstring describing "Matrix multiply with accumulation" that
references K=64 and the functions matmul and matmul_acc), replace the Unicode
multiplication sign "×" with the ASCII letter "x" so the text reads e.g. "64x64"
and "K=32 chunks" to avoid encoding/search/copy-paste issues; ensure both
occurrences in that docstring are updated and preserve surrounding wording and
backticks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: f416c50b-e998-415c-83b0-74f8a9bca8ed
📒 Files selected for processing (6)
examples/language/beginner/matmul.pysrc/backend/910B_CCE/backend_910b_cce_ops.cppsrc/backend/common/pto_ops_common.cppsrc/ir/transforms/basic_memory_reuse_pass.cpptests/st/runtime/test_matmul.pytests/ut/ir/transforms/test_basic_memory_reuse.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/backend/910B_CCE/backend_910b_cce_ops.cpp
ef32e6b to
88b0dc5
Compare
The CUBE engine's TMATMUL_ACC instruction always reads the accumulator
from the OUTPUT buffer, ignoring any separate accumulator input parameter.
Three changes ensure correct behavior:
Memory reuse pass: allow "touching" lifetimes (last_use == def_point)
to share buffers, since within a single statement inputs are consumed
before outputs are produced. This enables the acc input and output of
matmul_acc to share the same physical buffer. Also fix transitive
reuse chain tracking by following reuse chains to the root MemRef
owner when checking for conflicts.
PTO codegen: add custom codegen for tile.matmul_acc/tile.gemv_acc
that emits only lhs and rhs as ins() operands (not the accumulator),
and inserts a pto.tmov when acc and dst resolve to different buffers.
CCE codegen: use 3-arg TMATMUL_ACC(dst, lhs, rhs) form instead of
4-arg, since the ISA cannot TMOV between two Acc-space tiles.