[RFC] Unify memory space requirements into OpConversionRegistry #972

@Hzfengsy

Summary

Add per-input memory space declarations to OpConversionRegistry so that the framework automatically bridges memory space mismatches (via tile.load or tile.move), eliminating special-case pattern matchers like MatmulSlicePatternCollector and the hardcoded kSelfLoadingOps exclusion list.

Motivation

Today, memory space handling in convert_tensor_to_tile_ops is spread across three independent mechanisms that don't talk to each other:

| Mechanism | What it does | Location |
| --- | --- | --- |
| TensorArgsInConvertedOpsCollector | Inserts entry tile.load(Vec) ops for tensor params used by converted ops; hardcodes the kSelfLoadingOps exclusion list | convert_tensor_to_tile_ops_pass.cpp:80 |
| MatmulSlicePatternCollector | Pre-scans for tensor.slice → tensor.matmul and overrides the slice to tile.load(Mat) | convert_tensor_to_tile_ops_pass.cpp:229 |
| LoadOperandToMat() inside the matmul/matmul_acc converters | Loads TensorType inputs to Mat; passes TileType inputs through, ignoring whether they are in the right space | op_conversion_registry.cpp:84 |

Problems with the current approach:

  1. Doesn't scale: Every new op that needs a non-Vec input space requires a new special-case collector or converter hack.
  2. Fragile exclusion list: kSelfLoadingOps must be manually kept in sync with which ops manage their own loads.
  3. Incomplete space checking: LoadOperandToMat() passes through any TileType without checking its actual memory space — a tile in Vec space feeding matmul silently produces wrong code.
  4. Cross-op patterns are hardcoded: Only slice→matmul is handled. slice→future_mat_op would need yet another collector.

The root insight: each tensor op has memory space requirements for its inputs. If an input's memory space already matches the requirement, no load or move is needed; otherwise, the framework should bridge the gap. This knowledge belongs in the converter registration, not in ad-hoc pattern matchers.
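As a minimal sketch of that rule (the names below are illustrative stand-ins, not the actual pass code), the per-input decision reduces to:

```cpp
#include <optional>

// Hedged sketch of the bridging rule described above. MemorySpace is a
// stand-in enum here; the real pass works on TensorType/TileType IR nodes.
enum class MemorySpace { Vec, Mat };
enum class BridgeAction { kNone, kLoad, kMove };

// A TensorType input has no memory space yet (nullopt); a TileType input
// carries the space it currently lives in.
BridgeAction DecideBridge(std::optional<MemorySpace> input_space,
                          MemorySpace required) {
  if (!input_space) return BridgeAction::kLoad;              // tensor: tile.load
  if (*input_space != required) return BridgeAction::kMove;  // wrong space: tile.move
  return BridgeAction::kNone;                                // already correct
}
```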

Design

New Types in op_conversion_registry.h

/// Per-input memory space requirement for a converter.
struct InputSpaceReq {
  MemorySpace space;                       ///< Required memory space
  std::optional<std::string> trans_kwarg;  ///< Read transpose flag from this kwarg (if any)
};

/// Full conversion entry: converter function + input requirements.
struct ConversionEntry {
  ConversionFunc func;
  /// Per-input space requirements (key = arg index).
  /// Inputs not listed here have no space requirement (framework won't auto-load them).
  std::unordered_map<size_t, InputSpaceReq> input_reqs;
};

Updated OpConversionRegistry

class OpConversionRegistry {
 public:
  // --- New registration methods ---

  /// Register simple 1:1 conversion with per-input space requirements.
  void RegisterSimple(const std::string& from_op, const std::string& to_op,
                      std::unordered_map<size_t, InputSpaceReq> input_reqs = {});

  /// Register custom conversion with per-input space requirements.
  void RegisterCustom(const std::string& from_op, ConversionFunc func,
                      std::unordered_map<size_t, InputSpaceReq> input_reqs = {});

  /// Look up a conversion entry (function + requirements).
  [[nodiscard]] const ConversionEntry* Lookup(const std::string& op_name) const;

  /// Query just input requirements for an op (used by ConsumerSpaceCollector).
  [[nodiscard]] const std::unordered_map<size_t, InputSpaceReq>*
  GetInputReqs(const std::string& op_name) const;

 private:
  std::unordered_map<std::string, ConversionEntry> conversions_;
};

Registration Examples

// Elementwise: inputs need Vec (default for most ops)
RegisterSimple("tensor.add", "tile.add",
    {{0, {MemorySpace::Vec}}, {1, {MemorySpace::Vec}}});

// Matmul: inputs need Mat, transpose read from kwargs
RegisterCustom("tensor.matmul", matmul_converter,
    {{0, {MemorySpace::Mat, "a_trans"}},
     {1, {MemorySpace::Mat, "b_trans"}}});

RegisterCustom("tensor.matmul_acc", matmul_acc_converter,
    {{1, {MemorySpace::Mat, "a_trans"}},   // lhs at index 1
     {2, {MemorySpace::Mat, "b_trans"}}});  // rhs at index 2
    // acc at index 0: no space req (passed through from IterArg)

// Slice: no input space requirements (it IS the load mechanism)
RegisterCustom("tensor.slice", slice_converter);  // No input_reqs

// Scatter_update: Vec for index and src (when target is local tile)
RegisterCustom("tensor.scatter_update", scatter_converter,
    {{1, {MemorySpace::Vec}},   // index
     {2, {MemorySpace::Vec}}});  // src

Framework: Automatic Space Bridging in TensorToTileMutator

Before calling any converter, the mutator checks each input against declared requirements:

StmtPtr TensorToTileMutator::VisitStmt_(const AssignStmtPtr& op) {
  // ... existing expr visitation ...
  auto* entry = conv_registry_.Lookup(call->op_->name_);
  if (!entry) return HandlePassThroughAssign(op, new_value);

  // --- NEW: Auto-bridge memory space mismatches ---
  std::vector<StmtPtr> bridge_prologue;
  auto bridged_args = call->args_;  // copy

  for (const auto& [idx, req] : entry->input_reqs) {
    if (idx >= bridged_args.size()) continue;
    auto& arg = bridged_args[idx];
    bool transpose = req.trans_kwarg
        ? GetKwargOr<bool>(call->kwargs_, *req.trans_kwarg, false)
        : false;

    if (auto tensor_type = As<TensorType>(arg->GetType())) {
      // TensorType → tile.load(required space)
      arg = EmitLoad(arg, tensor_type, req.space, transpose,
                     bridge_prologue, call->span_);
    } else if (auto tile_type = As<TileType>(arg->GetType())) {
      if (tile_type->memory_space_ != req.space) {
        // TileType in wrong space → tile.move(required space)
        arg = EmitMove(arg, req.space, bridge_prologue, call->span_);
      }
      // else: already in the right space → pass through
    }
  }

  // Call converter with properly-spaced args
  auto conv_result = entry->func(bridged_args, call->kwargs_, call->span_);
  // ... existing prologue/result handling ...
}

Consumer-Driven Look-Ahead: ConsumerSpaceCollector

To avoid redundant load(Vec) + move(Mat) sequences (e.g., tensor.slice → tensor.matmul), a general pre-scan collects consumer space requirements, driven entirely by registered metadata:

/// Consumer requirement with the transpose flag already resolved from the
/// consumer call's kwargs (InputSpaceReq only names the kwarg to read).
struct ResolvedSpaceReq {
  MemorySpace space;
  bool transpose = false;
};

/// General pre-scan: for each variable, collect the memory space needed by its consumers.
/// Replaces the special-purpose MatmulSlicePatternCollector.
class ConsumerSpaceCollector : public IRVisitor {
 public:
  explicit ConsumerSpaceCollector(const OpConversionRegistry& registry)
      : registry_(registry) {}

  /// Get the consumer space requirement for a variable (if any consumer has one).
  [[nodiscard]] std::optional<ResolvedSpaceReq> GetConsumerReq(const Var* var) const {
    auto it = consumer_reqs_.find(var);
    return it != consumer_reqs_.end() ? std::optional{it->second} : std::nullopt;
  }

 protected:
  void VisitStmt_(const AssignStmtPtr& op) override {
    if (!op) return;
    auto call = As<Call>(op->value_);
    if (!call) { IRVisitor::VisitStmt_(op); return; }

    const auto* reqs = registry_.GetInputReqs(call->op_->name_);
    if (!reqs) { IRVisitor::VisitStmt_(op); return; }

    // For each input with a space requirement, record it against the source variable.
    for (const auto& [idx, req] : *reqs) {
      if (idx >= call->args_.size()) continue;
      if (auto var = As<Var>(call->args_[idx])) {
        bool transpose = req.trans_kwarg
            ? GetKwargOr<bool>(call->kwargs_, *req.trans_kwarg, false)
            : false;
        // If multiple consumers want different spaces, the first one wins
        // (the framework will move for subsequent consumers).
        consumer_reqs_.emplace(var.get(), ResolvedSpaceReq{req.space, transpose});
      }
    }
    IRVisitor::VisitStmt_(op);
  }

 private:
  const OpConversionRegistry& registry_;
  std::unordered_map<const Var*, ResolvedSpaceReq> consumer_reqs_;
};

Load-like ops (tensor.slice on TensorType, entry parameter loads) check this map to produce the right space directly:

// In tensor.slice converter (or the framework's entry-load logic):
if (auto consumer_req = consumer_space_collector.GetConsumerReq(result_var)) {
  // Consumer wants Mat → load directly to Mat
  target_memory = consumer_req->space;
  transpose = consumer_req->transpose;
} else {
  // Default: load to Vec
  target_memory = MemorySpace::Vec;
  transpose = false;
}

What Gets Removed

| Current Code | Status |
| --- | --- |
| MatmulSlicePatternCollector (70 lines) | Removed; replaced by the general ConsumerSpaceCollector |
| MatmulSliceInfo struct | Removed |
| HandleMatmulSlice() in TensorToTileMutator | Removed |
| LoadOperandToMat() helper | Removed; the framework auto-bridges |
| kSelfLoadingOps hardcoded set in TensorArgsInConvertedOpsCollector | Removed; ops that manage their own loads simply declare no input_reqs |
| Imperative LoadOperandToMat calls inside the matmul/matmul_acc converters | Simplified; converters only emit the compute op |

Simplified Matmul Converter (After)

Before (current):

RegisterCustom("tensor.matmul",
    [](const std::vector<ExprPtr>& args, ...) -> ConversionResult {
      bool a_trans = GetKwargOr<bool>(kwargs, "a_trans", false);
      bool b_trans = GetKwargOr<bool>(kwargs, "b_trans", false);
      std::vector<StmtPtr> prologue;
      auto lhs_mat = LoadOperandToMat(args[0], a_trans, "lhs_mat", prologue, span);
      auto rhs_mat = LoadOperandToMat(args[1], b_trans, "rhs_mat", prologue, span);
      auto matmul_call = OpRegistry::GetInstance().Create("tile.matmul", {lhs_mat, rhs_mat}, span);
      return ConversionResult{std::move(prologue), matmul_call};
    });

After (proposed):

// Framework handles loads via input_reqs — converter is just the compute op
RegisterSimple("tensor.matmul", "tile.matmul",
    {{0, {MemorySpace::Mat, "a_trans"}},
     {1, {MemorySpace::Mat, "b_trans"}}});

Similarly, tensor.scatter_update can drop its imperative load-index/load-src code and declare {{1, Vec}, {2, Vec}}.

Affected Files

  • include/pypto/ir/transforms/op_conversion_registry.h (modify): add InputSpaceReq and ConversionEntry, update the registry API
  • src/ir/transforms/op_conversion_registry.cpp (modify): update registrations with input_reqs, remove LoadOperandToMat, simplify the matmul/scatter_update converters
  • src/ir/transforms/convert_tensor_to_tile_ops_pass.cpp (modify): replace MatmulSlicePatternCollector with ConsumerSpaceCollector, add auto-bridging in TensorToTileMutator, simplify TensorArgsInConvertedOpsCollector

Testing Plan

  1. Existing tests must pass unchanged — the refactor should produce identical IR output for all existing test cases in tests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py
  2. Specific tests to verify:
    • test_matmul_conversion — matmul inputs auto-loaded to Mat
    • test_matmul_b_trans_conversion — transpose propagated through input_reqs
    • test_matmul_acc_conversion — acc passed through, lhs/rhs loaded to Mat
    • TestSliceMatmulConversion — slice produces Mat-space load (via ConsumerSpaceCollector), no extra move
    • test_scatter_update_* — index/src auto-loaded to Vec
  3. New tests to add:
    • Test that a tile in wrong space triggers tile.move (e.g., construct IR where a Vec tile feeds an op requiring Mat)
    • Test that a tile in the correct space passes through (no extra load/move emitted)

Alternatives Considered

Option A: No Look-Ahead (Pure Local Decisions)

Each converter only declares input requirements. The framework inserts tile.move when spaces mismatch. No pre-scan needed.

  • Pro: Simplest implementation, purely local decisions
  • Con: Produces load(Vec) + move(Mat) instead of load(Mat) for slice→matmul — extra data movement on hardware

Rejected because the extra move has real hardware cost, and the general look-ahead (ConsumerSpaceCollector) is straightforward.

Option B: Keep MatmulSlicePatternCollector, Only Add input_reqs

Add input_reqs to the registry for auto-bridging, but keep the existing MatmulSlicePatternCollector for the look-ahead optimization.

  • Pro: Smaller change, lower risk
  • Con: Doesn't fully realize the generalization — still has a special-case collector alongside the general system

Rejected because ConsumerSpaceCollector is a strict generalization with comparable complexity.

Open Questions

  1. Multiple consumers with different space requirements: When a variable is consumed by both a Vec op and a Mat op, ConsumerSpaceCollector currently takes the first one encountered. Should we prioritize "higher" spaces (Mat > Vec) to minimize moves? Or is this rare enough not to matter?
  2. Default input_reqs for RegisterSimple: Should simple 1:1 conversions default to {all inputs: Vec} or require explicit declaration? Defaulting to Vec would eliminate boilerplate for elementwise ops but might be surprising.
  3. Should TensorArgsInConvertedOpsCollector be fully replaced? With input_reqs metadata, the entry-load logic could also be driven by the registry (ops with no input_reqs don't need entry loads). But this is a bigger change — it might be better as a follow-up.
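For Open Question 1, the "prioritize higher spaces" alternative could look like the following hedged sketch, where Rank and MergeConsumerReqs are hypothetical names and the InputSpaceReq stand-in mirrors the struct defined earlier in this RFC:

```cpp
#include <optional>
#include <string>

// Illustrative stand-ins for the RFC's types (not the actual headers).
enum class MemorySpace { Vec, Mat };
struct InputSpaceReq {
  MemorySpace space;
  std::optional<std::string> trans_kwarg;
};

// Hypothetical "highest space wins" policy: Mat outranks Vec, so a variable
// consumed by both a Vec op and a Mat op would be loaded straight to Mat,
// and the framework would insert a move only for the Vec consumers.
int Rank(MemorySpace s) { return s == MemorySpace::Mat ? 1 : 0; }

InputSpaceReq MergeConsumerReqs(const InputSpaceReq& a, const InputSpaceReq& b) {
  return Rank(b.space) > Rank(a.space) ? b : a;
}
```

The collector would fold each newly seen requirement into the map with MergeConsumerReqs instead of emplace-and-ignore; whether this is worth doing depends on how often mixed-space consumers occur in practice.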
