Summary
Add per-input memory space declarations to `OpConversionRegistry` so that the framework automatically bridges memory space mismatches (via `tile.load` or `tile.move`), eliminating special-case pattern matchers like `MatmulSlicePatternCollector` and the hardcoded `kSelfLoadingOps` exclusion list.
Motivation
Today, memory space handling in `convert_tensor_to_tile_ops` is spread across three independent mechanisms that don't talk to each other:

| Mechanism | What it does | Location |
|---|---|---|
| `TensorArgsInConvertedOpsCollector` | Inserts entry `tile.load(Vec)` for tensor params used by converted ops; hardcodes a `kSelfLoadingOps` exclusion list | `convert_tensor_to_tile_ops_pass.cpp:80` |
| `MatmulSlicePatternCollector` | Pre-scans for `tensor.slice → tensor.matmul` and overrides slice to `tile.load(Mat)` | `convert_tensor_to_tile_ops_pass.cpp:229` |
| `LoadOperandToMat()` inside matmul/matmul_acc converters | If input is `TensorType`, loads to Mat; if `TileType`, passes through (ignoring whether it's in the right space) | `op_conversion_registry.cpp:84` |
Problems with the current approach:
- Doesn't scale: every new op that needs a non-Vec input space requires a new special-case collector or converter hack.
- Fragile exclusion list: `kSelfLoadingOps` must be manually kept in sync with which ops manage their own loads.
- Incomplete space checking: `LoadOperandToMat()` passes through any `TileType` without checking its actual memory space — a tile in Vec space feeding matmul silently produces wrong code.
- Cross-op patterns are hardcoded: only `slice → matmul` is handled; `slice → future_mat_op` would need yet another collector.
The root insight: each tensor op has memory space requirements for its inputs. If an input's memory space already matches the requirement, no load/move is needed; otherwise, the framework should bridge the gap. This belongs in the converter registration, not in ad-hoc pattern matchers.
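The bridging rule above reduces to a single pure decision per input. The sketch below illustrates it in isolation; `MemorySpace`, `InputKind`, `BridgeAction`, and `DecideBridge` are illustrative stand-ins, not the real IR types introduced later in this doc:

```cpp
#include <cassert>

enum class MemorySpace { Vec, Mat };
enum class InputKind { Tensor, Tile };
enum class BridgeAction { Load, Move, PassThrough };

// Decide how the framework bridges one input to the space its consumer requires.
BridgeAction DecideBridge(InputKind kind, MemorySpace current, MemorySpace required) {
  if (kind == InputKind::Tensor) return BridgeAction::Load;  // tile.load(required)
  if (current != required) return BridgeAction::Move;        // tile.move(required)
  return BridgeAction::PassThrough;                          // already correct
}
```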
Design
New Types in op_conversion_registry.h
```cpp
/// Per-input memory space requirement for a converter.
struct InputSpaceReq {
  MemorySpace space;                       ///< Required memory space
  std::optional<std::string> trans_kwarg;  ///< Read transpose flag from this kwarg (if any)
};

/// Full conversion entry: converter function + input requirements.
struct ConversionEntry {
  ConversionFunc func;
  /// Per-input space requirements (key = arg index).
  /// Inputs not listed here have no space requirement (framework won't auto-load them).
  std::unordered_map<size_t, InputSpaceReq> input_reqs;
};
```
Updated OpConversionRegistry
```cpp
class OpConversionRegistry {
 public:
  // --- New registration methods ---

  /// Register simple 1:1 conversion with per-input space requirements.
  void RegisterSimple(const std::string& from_op, const std::string& to_op,
                      std::unordered_map<size_t, InputSpaceReq> input_reqs = {});

  /// Register custom conversion with per-input space requirements.
  void RegisterCustom(const std::string& from_op, ConversionFunc func,
                      std::unordered_map<size_t, InputSpaceReq> input_reqs = {});

  /// Look up a conversion entry (function + requirements).
  [[nodiscard]] const ConversionEntry* Lookup(const std::string& op_name) const;

  /// Query just input requirements for an op (used by ConsumerSpaceCollector).
  [[nodiscard]] const std::unordered_map<size_t, InputSpaceReq>*
  GetInputReqs(const std::string& op_name) const;

 private:
  std::unordered_map<std::string, ConversionEntry> conversions_;
};
```
Registration Examples
```cpp
// Elementwise: inputs need Vec (default for most ops)
RegisterSimple("tensor.add", "tile.add",
               {{0, {MemorySpace::Vec}}, {1, {MemorySpace::Vec}}});

// Matmul: inputs need Mat, transpose read from kwargs
RegisterCustom("tensor.matmul", matmul_converter,
               {{0, {MemorySpace::Mat, "a_trans"}},
                {1, {MemorySpace::Mat, "b_trans"}}});
RegisterCustom("tensor.matmul_acc", matmul_acc_converter,
               {{1, {MemorySpace::Mat, "a_trans"}},    // lhs at index 1
                {2, {MemorySpace::Mat, "b_trans"}}});  // rhs at index 2
// acc at index 0: no space req (passed through from IterArg)

// Slice: no input space requirements (it IS the load mechanism)
RegisterCustom("tensor.slice", slice_converter);  // No input_reqs

// Scatter_update: Vec for index and src (when target is local tile)
RegisterCustom("tensor.scatter_update", scatter_converter,
               {{1, {MemorySpace::Vec}},    // index
                {2, {MemorySpace::Vec}}});  // src
```
Framework: Automatic Space Bridging in TensorToTileMutator
Before calling any converter, the mutator checks each input against declared requirements:
```cpp
StmtPtr TensorToTileMutator::VisitStmt_(const AssignStmtPtr& op) {
  // ... existing expr visitation ...
  auto* entry = conv_registry_.Lookup(call->op_->name_);
  if (!entry) return HandlePassThroughAssign(op, new_value);

  // --- NEW: Auto-bridge memory space mismatches ---
  std::vector<StmtPtr> bridge_prologue;
  auto bridged_args = call->args_;  // copy
  for (const auto& [idx, req] : entry->input_reqs) {
    if (idx >= bridged_args.size()) continue;
    auto& arg = bridged_args[idx];
    bool transpose = req.trans_kwarg
                         ? GetKwargOr<bool>(call->kwargs_, *req.trans_kwarg, false)
                         : false;
    if (auto tensor_type = As<TensorType>(arg->GetType())) {
      // TensorType → tile.load(required space)
      arg = EmitLoad(arg, tensor_type, req.space, transpose,
                     bridge_prologue, call->span_);
    } else if (auto tile_type = As<TileType>(arg->GetType())) {
      if (tile_type->memory_space_ != req.space) {
        // TileType in wrong space → tile.move(required space)
        arg = EmitMove(arg, req.space, bridge_prologue, call->span_);
      }
      // else: already in the right space → pass through
    }
  }

  // Call converter with properly-spaced args
  auto conv_result = entry->func(bridged_args, call->kwargs_, call->span_);
  // ... existing prologue/result handling ...
}
```
Consumer-Driven Look-Ahead: ConsumerSpaceCollector
To avoid redundant `load(Vec) + move(Mat)` sequences (e.g., `tensor.slice → tensor.matmul`), a general pre-scan collects consumer space requirements, driven entirely by registered metadata:
```cpp
/// General pre-scan: for each variable, collect the memory space needed by its consumers.
/// Replaces the special-purpose MatmulSlicePatternCollector.
class ConsumerSpaceCollector : public IRVisitor {
 public:
  explicit ConsumerSpaceCollector(const OpConversionRegistry& registry)
      : registry_(registry) {}

  /// Get consumer space requirement for a variable (if any consumer has one).
  [[nodiscard]] std::optional<InputSpaceReq> GetConsumerReq(const Var* var) const {
    auto it = consumer_reqs_.find(var);
    return it != consumer_reqs_.end() ? std::optional{it->second} : std::nullopt;
  }

 protected:
  void VisitStmt_(const AssignStmtPtr& op) override {
    if (!op) return;
    auto call = As<Call>(op->value_);
    if (!call) { IRVisitor::VisitStmt_(op); return; }
    auto* entry = registry_.Lookup(call->op_->name_);
    if (!entry) { IRVisitor::VisitStmt_(op); return; }
    // For each input with a space requirement, record against the source variable
    for (const auto& [idx, req] : entry->input_reqs) {
      if (idx >= call->args_.size()) continue;
      if (auto var = As<Var>(call->args_[idx])) {
        // If multiple consumers want different spaces, the first one wins
        // (the framework will move for subsequent consumers)
        consumer_reqs_.emplace(var.get(), req);
      }
    }
    IRVisitor::VisitStmt_(op);
  }

 private:
  const OpConversionRegistry& registry_;
  std::unordered_map<const Var*, InputSpaceReq> consumer_reqs_;
};
```
Load-like ops (`tensor.slice` on `TensorType`, entry parameter loads) check this map to produce the right space directly. Note that `InputSpaceReq::trans_kwarg` names a kwarg on the *consumer* op, so the collector must resolve it against the consumer's kwargs when recording the requirement; the snippet below assumes `GetConsumerReq` returns the already-resolved transpose flag:

```cpp
// In tensor.slice converter (or the framework's entry-load logic):
if (auto consumer_req = consumer_space_collector.GetConsumerReq(result_var)) {
  // Consumer wants Mat → load directly to Mat
  target_memory = consumer_req->space;
  transpose = consumer_req->transpose;  // resolved from the consumer's kwargs
} else {
  // Default: load to Vec
  target_memory = MemorySpace::Vec;
  transpose = false;
}
```
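The consumer-driven fallback can be sketched on its own: if some consumer recorded a requirement, it dictates the load target; otherwise default to Vec. Here `std::string` stands in for `Var*`, and `ResolvedReq`/`ChooseLoadTarget` are hypothetical names for illustration:

```cpp
#include <string>
#include <unordered_map>

enum class MemorySpace { Vec, Mat };

// Requirement with trans_kwarg already resolved against the consumer's kwargs.
struct ResolvedReq {
  MemorySpace space;
  bool transpose;
};

using ConsumerReqs = std::unordered_map<std::string, ResolvedReq>;

// Pick the memory space a load-like op (slice, entry load) should target.
ResolvedReq ChooseLoadTarget(const ConsumerReqs& reqs, const std::string& result_var) {
  auto it = reqs.find(result_var);
  if (it != reqs.end()) return it->second;  // a consumer dictates the space
  return {MemorySpace::Vec, false};         // default: Vec, no transpose
}
```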
What Gets Removed
| Current Code | Status |
|---|---|
| `MatmulSlicePatternCollector` (70 lines) | Removed — replaced by general `ConsumerSpaceCollector` |
| `MatmulSliceInfo` struct | Removed |
| `HandleMatmulSlice()` in `TensorToTileMutator` | Removed |
| `LoadOperandToMat()` helper | Removed — framework auto-bridges |
| `kSelfLoadingOps` hardcoded set in `TensorArgsInConvertedOpsCollector` | Removed — ops that manage their own loads simply declare no `input_reqs` |
| Imperative `LoadOperandToMat` calls inside matmul/matmul_acc converters | Simplified — converters only emit the compute op |
Simplified Matmul Converter (After)
Before (current):
```cpp
RegisterCustom("tensor.matmul",
    [](const std::vector<ExprPtr>& args, ...) -> ConversionResult {
      bool a_trans = GetKwargOr<bool>(kwargs, "a_trans", false);
      bool b_trans = GetKwargOr<bool>(kwargs, "b_trans", false);
      std::vector<StmtPtr> prologue;
      auto lhs_mat = LoadOperandToMat(args[0], a_trans, "lhs_mat", prologue, span);
      auto rhs_mat = LoadOperandToMat(args[1], b_trans, "rhs_mat", prologue, span);
      auto matmul_call = OpRegistry::GetInstance().Create("tile.matmul",
                                                          {lhs_mat, rhs_mat}, span);
      return ConversionResult{std::move(prologue), matmul_call};
    });
```
After (proposed):
```cpp
// Framework handles loads via input_reqs — converter is just the compute op
RegisterSimple("tensor.matmul", "tile.matmul",
               {{0, {MemorySpace::Mat, "a_trans"}},
                {1, {MemorySpace::Mat, "b_trans"}}});
```
Similarly, `tensor.scatter_update` can drop its imperative load-index/load-src code and declare `{{1, Vec}, {2, Vec}}`.
Affected Files
- `include/pypto/ir/transforms/op_conversion_registry.h` — Modify — add `InputSpaceReq` and `ConversionEntry`, update registry API
- `src/ir/transforms/op_conversion_registry.cpp` — Modify — update registrations with `input_reqs`, remove `LoadOperandToMat`, simplify matmul/scatter_update converters
- `src/ir/transforms/convert_tensor_to_tile_ops_pass.cpp` — Modify — replace `MatmulSlicePatternCollector` with `ConsumerSpaceCollector`, add auto-bridging in `TensorToTileMutator`, simplify `TensorArgsInConvertedOpsCollector`
Testing Plan
- Existing tests must pass unchanged — the refactor should produce identical IR output for all existing test cases in `tests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py`
- Specific tests to verify:
  - `test_matmul_conversion` — matmul inputs auto-loaded to Mat
  - `test_matmul_b_trans_conversion` — transpose propagated through `input_reqs`
  - `test_matmul_acc_conversion` — acc passed through, lhs/rhs loaded to Mat
  - `TestSliceMatmulConversion` — slice produces Mat-space load (via `ConsumerSpaceCollector`), no extra move
  - `test_scatter_update_*` — index/src auto-loaded to Vec
- New tests to add:
  - Test that a tile in the wrong space triggers `tile.move` (e.g., construct IR where a Vec tile feeds an op requiring Mat)
  - Test that a tile in the correct space passes through (no extra load/move emitted)
Alternatives Considered
Option A: No Look-Ahead (Pure Local Decisions)
Each converter only declares input requirements. The framework inserts tile.move when spaces mismatch. No pre-scan needed.
- Pro: Simplest implementation, purely local decisions
- Con: Produces `load(Vec) + move(Mat)` instead of `load(Mat)` for `slice → matmul` — extra data movement on hardware
Rejected because the extra move has real hardware cost, and the general look-ahead (ConsumerSpaceCollector) is straightforward.
Option B: Keep MatmulSlicePatternCollector, Only Add input_reqs
Add input_reqs to the registry for auto-bridging, but keep the existing MatmulSlicePatternCollector for the look-ahead optimization.
- Pro: Smaller change, lower risk
- Con: Doesn't fully realize the generalization — still has a special-case collector alongside the general system
Rejected because ConsumerSpaceCollector is a strict generalization with comparable complexity.
Open Questions
- Multiple consumers with different space requirements: when a variable is consumed by both a Vec op and a Mat op, `ConsumerSpaceCollector` currently takes the first one encountered. Should we prioritize "higher" spaces (Mat > Vec) to minimize moves? Or is this rare enough not to matter?
- Default `input_reqs` for `RegisterSimple`: should simple 1:1 conversions default to {all inputs: Vec} or require explicit declaration? Defaulting to Vec would eliminate boilerplate for elementwise ops but might be surprising.
- Should `TensorArgsInConvertedOpsCollector` be fully replaced? With `input_reqs` metadata, the entry-load logic could also be driven by the registry (ops with no `input_reqs` don't need entry loads). But this is a bigger change — it might be better as a follow-up.
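For the first question, one candidate policy is a join that always keeps the "higher" space. This is a sketch of that idea only; `JoinConsumerSpaces` is a hypothetical name, and the Mat > Vec ordering is the assumption named in the question, not a decided design:

```cpp
#include <cassert>

// Assumed ordering: Mat "higher" than Vec (per the open question above).
enum class MemorySpace { Vec = 0, Mat = 1 };

// When two consumers of the same variable want different spaces, keep the
// higher one; lower-space consumers then get a tile.move from the framework.
MemorySpace JoinConsumerSpaces(MemorySpace a, MemorySpace b) {
  return static_cast<int>(a) >= static_cast<int>(b) ? a : b;
}
```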