[RFC] Unify memory space requirements into OpConversionRegistry #972

@Hzfengsy

Summary

Add per-input memory space declarations to OpConversionRegistry so that the framework automatically bridges memory space mismatches (via tile.load or tile.move), eliminating special-case pattern matchers like MatmulSlicePatternCollector and the hardcoded kSelfLoadingOps exclusion list.

Motivation

Today, memory space handling in convert_tensor_to_tile_ops is spread across three independent mechanisms that don't talk to each other:

| Mechanism | What it does | Location |
| --- | --- | --- |
| TensorArgsInConvertedOpsCollector | Inserts entry tile.load(Vec) ops for tensor params used by converted ops; hardcodes the kSelfLoadingOps exclusion list | convert_tensor_to_tile_ops_pass.cpp:80 |
| MatmulSlicePatternCollector | Pre-scans for tensor.slice → tensor.matmul and overrides the slice to tile.load(Mat) | convert_tensor_to_tile_ops_pass.cpp:229 |
| LoadOperandToMat() inside the matmul/matmul_acc converters | Loads TensorType inputs to Mat; passes TileType inputs through, ignoring whether they are in the right space | op_conversion_registry.cpp:84 |

Problems with the current approach:

  1. Doesn't scale: Every new op that needs a non-Vec input space requires a new special-case collector or converter hack.
  2. Fragile exclusion list: kSelfLoadingOps must be manually kept in sync with which ops manage their own loads.
  3. Incomplete space checking: LoadOperandToMat() passes through any TileType without checking its actual memory space — a tile in Vec space feeding matmul silently produces wrong code.
  4. Cross-op patterns are hardcoded: Only slice→matmul is handled. slice→future_mat_op would need yet another collector.

The root insight: each tensor op has memory space requirements for its inputs. If an input's memory space already matches the requirement, no load or move is needed; otherwise, the framework should bridge the gap. This knowledge belongs in the converter registration, not in ad-hoc pattern matchers.
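As a minimal sketch of that rule (the names below are illustrative stand-ins, not the actual pass code), the per-input decision reduces to:

```cpp
#include <optional>

// Hedged sketch of the bridging rule described above. MemorySpace is a
// stand-in enum here; the real pass works on TensorType/TileType IR nodes.
enum class MemorySpace { Vec, Mat };
enum class BridgeAction { kNone, kLoad, kMove };

// A TensorType input has no memory space yet (nullopt); a TileType input
// carries the space it currently lives in.
BridgeAction DecideBridge(std::optional<MemorySpace> input_space,
                          MemorySpace required) {
  if (!input_space) return BridgeAction::kLoad;              // tensor: tile.load
  if (*input_space != required) return BridgeAction::kMove;  // wrong space: tile.move
  return BridgeAction::kNone;                                // already correct
}
```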

Design

New Types in op_conversion_registry.h

/// Per-input memory space requirement for a converter.
struct InputSpaceReq {
  MemorySpace space;                       ///< Required memory space
  std::optional<std::string> trans_kwarg;  ///< Read transpose flag from this kwarg (if any)
};

/// Full conversion entry: converter function + input requirements.
struct ConversionEntry {
  ConversionFunc func;
  /// Per-input space requirements (key = arg index).
  /// Inputs not listed here have no space requirement (framework won't auto-load them).
  std::unordered_map<size_t, InputSpaceReq> input_reqs;
};

Updated OpConversionRegistry

class OpConversionRegistry {
 public:
  // --- New registration methods ---

  /// Register simple 1:1 conversion with per-input space requirements.
  void RegisterSimple(const std::string& from_op, const std::string& to_op,
                      std::unordered_map<size_t, InputSpaceReq> input_reqs = {});

  /// Register custom conversion with per-input space requirements.
  void RegisterCustom(const std::string& from_op, ConversionFunc func,
                      std::unordered_map<size_t, InputSpaceReq> input_reqs = {});

  /// Look up a conversion entry (function + requirements).
  [[nodiscard]] const ConversionEntry* Lookup(const std::string& op_name) const;

  /// Query just input requirements for an op (used by ConsumerSpaceCollector).
  [[nodiscard]] const std::unordered_map<size_t, InputSpaceReq>*
  GetInputReqs(const std::string& op_name) const;

 private:
  std::unordered_map<std::string, ConversionEntry> conversions_;
};

Registration Examples

// Elementwise: inputs need Vec (default for most ops)
RegisterSimple("tensor.add", "tile.add",
    {{0, {MemorySpace::Vec}}, {1, {MemorySpace::Vec}}});

// Matmul: inputs need Mat, transpose read from kwargs
RegisterCustom("tensor.matmul", matmul_converter,
    {{0, {MemorySpace::Mat, "a_trans"}},
     {1, {MemorySpace::Mat, "b_trans"}}});

RegisterCustom("tensor.matmul_acc", matmul_acc_converter,
    {{1, {MemorySpace::Mat, "a_trans"}},   // lhs at index 1
     {2, {MemorySpace::Mat, "b_trans"}}});  // rhs at index 2
    // acc at index 0: no space req (passed through from IterArg)

// Slice: no input space requirements (it IS the load mechanism)
RegisterCustom("tensor.slice", slice_converter);  // No input_reqs

// Scatter_update: Vec for index and src (when target is local tile)
RegisterCustom("tensor.scatter_update", scatter_converter,
    {{1, {MemorySpace::Vec}},   // index
     {2, {MemorySpace::Vec}}});  // src

Framework: Automatic Space Bridging in TensorToTileMutator

Before calling any converter, the mutator checks each input against declared requirements:

StmtPtr TensorToTileMutator::VisitStmt_(const AssignStmtPtr& op) {
  // ... existing expr visitation ...
  auto* entry = conv_registry_.Lookup(call->op_->name_);
  if (!entry) return HandlePassThroughAssign(op, new_value);

  // --- NEW: Auto-bridge memory space mismatches ---
  std::vector<StmtPtr> bridge_prologue;
  auto bridged_args = call->args_;  // copy

  for (const auto& [idx, req] : entry->input_reqs) {
    if (idx >= bridged_args.size()) continue;
    auto& arg = bridged_args[idx];
    bool transpose = req.trans_kwarg
        ? GetKwargOr<bool>(call->kwargs_, *req.trans_kwarg, false)
        : false;

    if (auto tensor_type = As<TensorType>(arg->GetType())) {
      // TensorType → tile.load(required space)
      arg = EmitLoad(arg, tensor_type, req.space, transpose,
                     bridge_prologue, call->span_);
    } else if (auto tile_type = As<TileType>(arg->GetType())) {
      if (tile_type->memory_space_ != req.space) {
        // TileType in wrong space → tile.move(required space)
        arg = EmitMove(arg, req.space, bridge_prologue, call->span_);
      }
      // else: already in the right space → pass through
    }
  }

  // Call converter with properly-spaced args
  auto conv_result = entry->func(bridged_args, call->kwargs_, call->span_);
  // ... existing prologue/result handling ...
}

Consumer-Driven Look-Ahead: ConsumerSpaceCollector

To avoid redundant load(Vec) + move(Mat) sequences (e.g., tensor.slice → tensor.matmul), a general pre-scan collects consumer space requirements, driven entirely by registered metadata:

/// Consumer requirement with the transpose flag already resolved from the
/// consumer call's kwargs (InputSpaceReq only names the kwarg to read).
struct ResolvedSpaceReq {
  MemorySpace space;
  bool transpose = false;
};

/// General pre-scan: for each variable, collect the memory space needed by its consumers.
/// Replaces the special-purpose MatmulSlicePatternCollector.
class ConsumerSpaceCollector : public IRVisitor {
 public:
  explicit ConsumerSpaceCollector(const OpConversionRegistry& registry)
      : registry_(registry) {}

  /// Get the consumer space requirement for a variable (if any consumer has one).
  [[nodiscard]] std::optional<ResolvedSpaceReq> GetConsumerReq(const Var* var) const {
    auto it = consumer_reqs_.find(var);
    return it != consumer_reqs_.end() ? std::optional{it->second} : std::nullopt;
  }

 protected:
  void VisitStmt_(const AssignStmtPtr& op) override {
    if (!op) return;
    auto call = As<Call>(op->value_);
    if (!call) { IRVisitor::VisitStmt_(op); return; }

    const auto* reqs = registry_.GetInputReqs(call->op_->name_);
    if (!reqs) { IRVisitor::VisitStmt_(op); return; }

    // For each input with a space requirement, record it against the source variable.
    for (const auto& [idx, req] : *reqs) {
      if (idx >= call->args_.size()) continue;
      if (auto var = As<Var>(call->args_[idx])) {
        bool transpose = req.trans_kwarg
            ? GetKwargOr<bool>(call->kwargs_, *req.trans_kwarg, false)
            : false;
        // If multiple consumers want different spaces, the first one wins
        // (the framework will move for subsequent consumers).
        consumer_reqs_.emplace(var.get(), ResolvedSpaceReq{req.space, transpose});
      }
    }
    IRVisitor::VisitStmt_(op);
  }

 private:
  const OpConversionRegistry& registry_;
  std::unordered_map<const Var*, ResolvedSpaceReq> consumer_reqs_;
};

Load-like ops (tensor.slice on TensorType, entry parameter loads) check this map to produce the right space directly:

// In tensor.slice converter (or the framework's entry-load logic):
if (auto consumer_req = consumer_space_collector.GetConsumerReq(result_var)) {
  // Consumer wants Mat → load directly to Mat
  target_memory = consumer_req->space;
  transpose = consumer_req->transpose;
} else {
  // Default: load to Vec
  target_memory = MemorySpace::Vec;
  transpose = false;
}

What Gets Removed

| Current Code | Status |
| --- | --- |
| MatmulSlicePatternCollector (70 lines) | Removed; replaced by the general ConsumerSpaceCollector |
| MatmulSliceInfo struct | Removed |
| HandleMatmulSlice() in TensorToTileMutator | Removed |
| LoadOperandToMat() helper | Removed; the framework auto-bridges |
| kSelfLoadingOps hardcoded set in TensorArgsInConvertedOpsCollector | Removed; ops that manage their own loads simply declare no input_reqs |
| Imperative LoadOperandToMat calls inside the matmul/matmul_acc converters | Simplified; converters only emit the compute op |

Simplified Matmul Converter (After)

Before (current):

RegisterCustom("tensor.matmul",
    [](const std::vector<ExprPtr>& args, ...) -> ConversionResult {
      bool a_trans = GetKwargOr<bool>(kwargs, "a_trans", false);
      bool b_trans = GetKwargOr<bool>(kwargs, "b_trans", false);
      std::vector<StmtPtr> prologue;
      auto lhs_mat = LoadOperandToMat(args[0], a_trans, "lhs_mat", prologue, span);
      auto rhs_mat = LoadOperandToMat(args[1], b_trans, "rhs_mat", prologue, span);
      auto matmul_call = OpRegistry::GetInstance().Create("tile.matmul", {lhs_mat, rhs_mat}, span);
      return ConversionResult{std::move(prologue), matmul_call};
    });

After (proposed):

// Framework handles loads via input_reqs — converter is just the compute op
RegisterSimple("tensor.matmul", "tile.matmul",
    {{0, {MemorySpace::Mat, "a_trans"}},
     {1, {MemorySpace::Mat, "b_trans"}}});

Similarly, tensor.scatter_update can drop its imperative load-index/load-src code and declare {{1, Vec}, {2, Vec}}.

Affected Files

  • include/pypto/ir/transforms/op_conversion_registry.h (modify): add InputSpaceReq and ConversionEntry, update the registry API
  • src/ir/transforms/op_conversion_registry.cpp (modify): update registrations with input_reqs, remove LoadOperandToMat, simplify the matmul/scatter_update converters
  • src/ir/transforms/convert_tensor_to_tile_ops_pass.cpp (modify): replace MatmulSlicePatternCollector with ConsumerSpaceCollector, add auto-bridging in TensorToTileMutator, simplify TensorArgsInConvertedOpsCollector

Testing Plan

  1. Existing tests must pass unchanged — the refactor should produce identical IR output for all existing test cases in tests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py
  2. Specific tests to verify:
    • test_matmul_conversion — matmul inputs auto-loaded to Mat
    • test_matmul_b_trans_conversion — transpose propagated through input_reqs
    • test_matmul_acc_conversion — acc passed through, lhs/rhs loaded to Mat
    • TestSliceMatmulConversion — slice produces Mat-space load (via ConsumerSpaceCollector), no extra move
    • test_scatter_update_* — index/src auto-loaded to Vec
  3. New tests to add:
    • Test that a tile in wrong space triggers tile.move (e.g., construct IR where a Vec tile feeds an op requiring Mat)
    • Test that a tile in the correct space passes through (no extra load/move emitted)

Alternatives Considered

Option A: No Look-Ahead (Pure Local Decisions)

Each converter only declares input requirements. The framework inserts tile.move when spaces mismatch. No pre-scan needed.

  • Pro: Simplest implementation, purely local decisions
  • Con: Produces load(Vec) + move(Mat) instead of load(Mat) for slice→matmul — extra data movement on hardware

Rejected because the extra move has real hardware cost, and the general look-ahead (ConsumerSpaceCollector) is straightforward.

Option B: Keep MatmulSlicePatternCollector, Only Add input_reqs

Add input_reqs to the registry for auto-bridging, but keep the existing MatmulSlicePatternCollector for the look-ahead optimization.

  • Pro: Smaller change, lower risk
  • Con: Doesn't fully realize the generalization — still has a special-case collector alongside the general system

Rejected because ConsumerSpaceCollector is a strict generalization with comparable complexity.

Open Questions

  1. Multiple consumers with different space requirements: When a variable is consumed by both a Vec op and a Mat op, ConsumerSpaceCollector currently takes the first one encountered. Should we prioritize "higher" spaces (Mat > Vec) to minimize moves? Or is this rare enough not to matter?
  2. Default input_reqs for RegisterSimple: Should simple 1:1 conversions default to {all inputs: Vec} or require explicit declaration? Defaulting to Vec would eliminate boilerplate for elementwise ops but might be surprising.
  3. Should TensorArgsInConvertedOpsCollector be fully replaced? With input_reqs metadata, the entry-load logic could also be driven by the registry (ops with no input_reqs don't need entry loads). But this is a bigger change — it might be better as a follow-up.
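For Open Question 1, the "prioritize higher spaces" alternative could look like the following hedged sketch, where Rank and MergeConsumerReqs are hypothetical names and the InputSpaceReq stand-in mirrors the struct defined earlier in this RFC:

```cpp
#include <optional>
#include <string>

// Illustrative stand-ins for the RFC's types (not the actual headers).
enum class MemorySpace { Vec, Mat };
struct InputSpaceReq {
  MemorySpace space;
  std::optional<std::string> trans_kwarg;
};

// Hypothetical "highest space wins" policy: Mat outranks Vec, so a variable
// consumed by both a Vec op and a Mat op would be loaded straight to Mat,
// and the framework would insert a move only for the Vec consumers.
int Rank(MemorySpace s) { return s == MemorySpace::Mat ? 1 : 0; }

InputSpaceReq MergeConsumerReqs(const InputSpaceReq& a, const InputSpaceReq& b) {
  return Rank(b.space) > Rank(a.space) ? b : a;
}
```

The collector would fold each newly seen requirement into the map with MergeConsumerReqs instead of emplace-and-ignore; whether this is worth doing depends on how often mixed-space consumers occur in practice.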
