Context
meld fuses multiple WebAssembly components into a single core module with multi-memory isolation. This gives loom whole-program visibility across what were previously component boundaries — the same opportunity as LLVM full LTO, but at the Wasm level where semantics are cleaner (no UB, structured control flow, typed).
Related: BA RFC #46 is converging on a multi-module output for lower-component. The single-module approach (meld + loom) can achieve zero-cost fusion that multi-module cannot, because loom can optimize across the former component boundaries.
Optimization opportunities
Tier 1: Adapter elimination
1.1 Adapter inlining (scalar)
Adapter functions copy data between memories, call the target, copy results back. For scalar types (i32, i64, f32, f64), the memory copies can be eliminated entirely — values pass on the stack:
Before: load mem[2] → store mem[1] → call → load mem[1] → store mem[2]
After: call (args/results on stack, zero memory copies)
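A minimal sketch of the before/after shapes, modeling the two linear memories as bytearrays (the `callee`, `adapter_call`, and `fused_call` names and the fixed offsets are illustrative, not meld's actual codegen):

```python
def callee(x: int) -> int:
    return x + 1

# Before: the adapter round-trips the argument and result through two
# linear memories (modeled here as bytearrays).
def adapter_call(caller_mem: bytearray, callee_mem: bytearray, arg_ptr: int) -> int:
    val = int.from_bytes(caller_mem[arg_ptr:arg_ptr + 4], "little")  # load mem[2]
    callee_mem[0:4] = val.to_bytes(4, "little")                      # store mem[1]
    res = callee(int.from_bytes(callee_mem[0:4], "little"))          # call
    callee_mem[4:8] = res.to_bytes(4, "little")                      # store mem[1]
    caller_mem[arg_ptr:arg_ptr + 4] = callee_mem[4:8]                # copy back to mem[2]
    return int.from_bytes(caller_mem[arg_ptr:arg_ptr + 4], "little")

# After: scalar args/results travel on the value stack; zero memory traffic.
def fused_call(val: int) -> int:
    return callee(val)
```

Both entry points compute the same result; the fused version just skips every load/store pair.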
1.2 Adapter inlining (compound)
For strings/lists, the adapter allocates in callee memory, copies, calls, copies back. After inlining, if the callee only reads the data (doesn't store a pointer), the copy can be replaced with cross-memory loads.
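A sketch of the compound case, assuming a callee that only reads its string argument (the `count_high_*` functions are hypothetical stand-ins for an inlined callee body):

```python
# Before: the adapter copies the bytes into callee memory first.
def count_high_before(caller_mem: bytearray, ptr: int, n: int, callee_mem: bytearray) -> int:
    callee_mem[0:n] = caller_mem[ptr:ptr + n]            # cross-memory copy
    return sum(1 for b in callee_mem[0:n] if b >= 0x80)

# After inlining: the pointer never escapes, so the copy becomes
# direct cross-memory loads from the caller's memory.
def count_high_after(caller_mem: bytearray, ptr: int, n: int) -> int:
    return sum(1 for b in caller_mem[ptr:ptr + n] if b >= 0x80)
```

The transformation is only sound when the callee neither mutates the buffer nor retains a pointer into it, which is exactly the escape condition stated above.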
1.3 Callback adapter folding
P3 async adapters drive a callback loop ([async-lift] → poll → [callback] → repeat until EXIT). If the callee always completes synchronously (returns EXIT immediately), loom can fold the loop away to a direct call.
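The folding can be sketched as follows, with `task_poll` standing in for the poll/callback step (names and the EXIT encoding are illustrative):

```python
EXIT = 0

def task_poll() -> int:
    # In this sketch the callee always completes on the first poll.
    return EXIT

# Before: the adapter drives the callback loop until EXIT.
def drive_loop() -> str:
    while True:
        if task_poll() == EXIT:
            return "done"

# After folding: the loop provably terminates on iteration one,
# so it collapses to a single direct call.
def drive_folded() -> str:
    task_poll()
    return "done"
```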
Tier 2: Whole-program analysis
2.1 Cross-component dead code elimination
Functions used only internally by one component (neither exported nor called cross-component) can be eliminated once they become unreachable after fusion. This extends to entire call trees.
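The analysis is standard worklist reachability from the fused module's exports; a minimal sketch (the call graph and function names are invented for illustration):

```python
def live_functions(call_graph: dict, roots: list) -> set:
    # Everything reachable from the exports survives; the rest is dropped.
    seen, work = set(), list(roots)
    while work:
        f = work.pop()
        if f not in seen:
            seen.add(f)
            work.extend(call_graph.get(f, []))
    return seen

graph = {"main": ["helper"], "helper": [], "dead": ["also_dead"], "also_dead": []}
live = live_functions(graph, ["main"])  # "dead" and its whole call tree go away
```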
2.2 Function deduplication
Components sharing the same library (wasi-libc, wit-bindgen runtime) produce duplicate functions. loom can merge identical function bodies, a potentially significant size saving.
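The core of the merge is mapping every function to the canonical owner of its body; callers are then rewritten to the canonical copy and the duplicates dropped. A sketch, keying on the raw body (a real pass would hash normalized body bytes):

```python
def dedup_functions(functions: dict) -> dict:
    # name -> canonical name: the first function seen with that body wins.
    canonical, remap = {}, {}
    for name, body in functions.items():
        remap[name] = canonical.setdefault(body, name)
    return remap
```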
2.3 Interprocedural constant propagation
If a cross-component call always passes the same constant (e.g., string encoding = UTF-8), propagate through the adapter into the callee. Can cascade into further simplifications.
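A sketch of the cascade, using the string-encoding example (the `lower_string` helper and encoding constants are illustrative, not the canonical ABI's actual representation):

```python
UTF8, UTF16 = 0, 1

def lower_string(s: str, encoding: int) -> bytes:
    if encoding == UTF8:
        return s.encode("utf-8")
    return s.encode("utf-16-le")

# Every call site passes UTF8, so the constant propagates and the
# branch (and the whole UTF-16 path) folds away:
def lower_string_utf8(s: str) -> bytes:
    return s.encode("utf-8")
```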
2.4 Whole-program type narrowing
If result<T, E> across a component boundary always succeeds, eliminate the error path entirely. Same for option<T> that's always some.
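A before/after sketch, modeling result<T, E> as a tagged pair (the representation and function names are illustrative):

```python
def parse_before(x: int):
    # This boundary provably never returns the error arm.
    return ("ok", x * 2)

def use_before(x: int) -> int:
    tag, val = parse_before(x)
    if tag == "err":              # dead error path
        raise RuntimeError(val)
    return val

# Narrowed: the tag and the error branch are gone; only the payload remains.
def use_after(x: int) -> int:
    return x * 2
```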
Tier 3: Memory optimization
3.1 Memory-to-register promotion
If a value is written to memory by one component and immediately read by another (via adapter), promote to register — eliminate the store+load pair entirely.
3.2 Multi-memory access optimization
Strength-reduce small fixed-size cross-memory copies to direct loads/stores or register passing.
3.3 Unused memory elimination
After dead code elimination, if a memory is never accessed, remove it. Reduces virtual address space usage.
3.4 Memory merging
If two components' memories are never accessed simultaneously (one only used during init), merge into one memory with disjoint address ranges.
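The layout side of the merge is a simple base assignment; every load/store in a former memory is then rebased by its offset. A sketch (assuming lifetimes have already been proven disjoint):

```python
def merge_memories(sizes: dict) -> dict:
    # Give each former memory a disjoint base in one merged address space.
    bases, offset = {}, 0
    for name, size in sizes.items():
        bases[name] = offset
        offset += size
    return bases
```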
Tier 4: Advanced LTO
4.1 Escape analysis for resources
If a resource handle never leaves its creating component after fusion, eliminate the handle table overhead (ht_new/ht_rep/ht_drop → direct pointer passing).
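A sketch of what is being eliminated: the handle table indirection versus passing the representation directly (the `HandleTable` class is a model of the emitted ht_* functions, not meld's actual code):

```python
class HandleTable:
    # Models the ht_new / ht_rep / ht_drop functions emitted for resources.
    def __init__(self):
        self.slots = []
    def ht_new(self, rep):
        self.slots.append(rep)
        return len(self.slots) - 1
    def ht_rep(self, handle):
        return self.slots[handle]
    def ht_drop(self, handle):
        self.slots[handle] = None

# If escape analysis proves the handle never leaves the creating component,
# the table disappears and the rep is passed directly:
def use_resource_direct(rep):
    return rep
```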
4.2 Speculative devirtualization
Indirect calls through tables (common in component model dispatch) that always target the same function after fusion → convert to direct call.
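A sketch of the rewrite (table contents and names invented for illustration):

```python
def double(x: int) -> int:
    return 2 * x

TABLE = [double]  # models a component-model dispatch table

# Before: call_indirect through the table.
def call_indirect(idx: int, arg: int) -> int:
    return TABLE[idx](arg)

# After fusion the index is provably 0, so the lookup folds to a direct call
# (a guard would be retained if the proof were only speculative):
def call_direct(arg: int) -> int:
    return double(arg)
```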
4.3 Cross-component loop fusion
If component A produces values in a loop and component B consumes them (via adapter), fuse the producer/consumer loops to eliminate intermediate buffering.
4.4 Partial evaluation
If one component's behavior is fully determined by initialization parameters (constants from the composition), evaluate at compile time and replace with the result.
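A sketch of the specialization, with the init parameter modeled as a closure argument (names illustrative):

```python
def instantiate(scale: int):
    # Component behavior fully determined by an init-time constant.
    def run(x: int) -> int:
        return x * scale
    return run

# With scale fixed to 3 in the composition, the component partially
# evaluates to a constant-folded body:
def run_specialized(x: int) -> int:
    return x * 3
```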
4.5 Global scheduling
Reorder instructions across the former component boundary for better CPU pipeline utilization — only possible with single-module whole-program view.
Why single-module is unique
None of these optimizations are possible with the multi-module approach (RFC #46's direction for wasmtime). Multi-module preserves module boundaries that prevent cross-module analysis. The meld + loom pipeline is:
Components → meld fuse → single module → loom optimize → optimized module
This achieves what RFC #46 cannot: component boundaries erased at the code level while preserved at the semantic level (multi-memory isolation). The optimizer sees one program, not N isolated modules.
Input format
meld's fused output is a standard core wasm module with:
- Multiple memories (one per original component)
- Adapter functions connecting cross-component calls
- Handle table functions (ht_new/ht_rep/ht_drop) for resource re-exporters
- P3 async callback-driving adapters importing waitable-set-poll from host
Benchmark plan
Compare fused module size and execution speed:
- Before loom optimization
- After loom optimization
- Against unfused multi-module (baseline)
- Across runtimes: wasmtime, V8 (d8), kiln interpreter