Context
meld fuses multiple WebAssembly components into a single core module with multi-memory isolation. This gives loom whole-program visibility across what were previously component boundaries — the same opportunity as LLVM full LTO, but at the Wasm level where semantics are cleaner (no UB, structured control flow, typed).
Related: BA RFC #46 is converging on a multi-module output for lower-component. The single-module approach (meld + loom) can achieve zero-cost fusion that multi-module cannot, because loom can optimize across the former component boundaries.
Optimization opportunities
Tier 1: Adapter elimination
1.1 Adapter inlining (scalar)
Adapter functions copy data between memories, call the target, copy results back. For scalar types (i32, i64, f32, f64), the memory copies can be eliminated entirely — values pass on the stack:
Before: load mem[2] → store mem[1] → call → load mem[1] → store mem[2]
After: call (args/results on stack, zero memory copies)
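A minimal sketch of the before/after shapes, modeling the two linear memories as bytearrays (the `callee`, `adapter_call`, and `fused_call` names and the fixed offsets are illustrative, not meld's actual codegen):

```python
def callee(x: int) -> int:
    return x + 1

# Before: the adapter round-trips the argument and result through two
# linear memories (modeled here as bytearrays).
def adapter_call(caller_mem: bytearray, callee_mem: bytearray, arg_ptr: int) -> int:
    val = int.from_bytes(caller_mem[arg_ptr:arg_ptr + 4], "little")  # load mem[2]
    callee_mem[0:4] = val.to_bytes(4, "little")                      # store mem[1]
    res = callee(int.from_bytes(callee_mem[0:4], "little"))          # call
    callee_mem[4:8] = res.to_bytes(4, "little")                      # store mem[1]
    caller_mem[arg_ptr:arg_ptr + 4] = callee_mem[4:8]                # copy back to mem[2]
    return int.from_bytes(caller_mem[arg_ptr:arg_ptr + 4], "little")

# After: scalar args/results travel on the value stack; zero memory traffic.
def fused_call(val: int) -> int:
    return callee(val)
```

Both entry points compute the same result; the fused version just skips every load/store pair.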
1.2 Adapter inlining (compound)
For strings/lists, the adapter allocates in callee memory, copies, calls, copies back. After inlining, if the callee only reads the data (doesn't store a pointer), the copy can be replaced with cross-memory loads.
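A sketch of the compound case, assuming a callee that only reads its string argument (the `count_high_*` functions are hypothetical stand-ins for an inlined callee body):

```python
# Before: the adapter copies the bytes into callee memory first.
def count_high_before(caller_mem: bytearray, ptr: int, n: int, callee_mem: bytearray) -> int:
    callee_mem[0:n] = caller_mem[ptr:ptr + n]            # cross-memory copy
    return sum(1 for b in callee_mem[0:n] if b >= 0x80)

# After inlining: the pointer never escapes, so the copy becomes
# direct cross-memory loads from the caller's memory.
def count_high_after(caller_mem: bytearray, ptr: int, n: int) -> int:
    return sum(1 for b in caller_mem[ptr:ptr + n] if b >= 0x80)
```

The transformation is only sound when the callee neither mutates the buffer nor retains a pointer into it, which is exactly the escape condition stated above.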
1.3 Callback adapter folding
P3 async adapters drive a callback loop ([async-lift] → poll → [callback] → repeat until EXIT). If the callee always completes synchronously (returns EXIT immediately), loom can fold the loop away to a direct call.
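The folding can be sketched as follows, with `task_poll` standing in for the poll/callback step (names and the EXIT encoding are illustrative):

```python
EXIT = 0

def task_poll() -> int:
    # In this sketch the callee always completes on the first poll.
    return EXIT

# Before: the adapter drives the callback loop until EXIT.
def drive_loop() -> str:
    while True:
        if task_poll() == EXIT:
            return "done"

# After folding: the loop provably terminates on iteration one,
# so it collapses to a single direct call.
def drive_folded() -> str:
    task_poll()
    return "done"
```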
Tier 2: Whole-program analysis
2.1 Cross-component dead code elimination
Functions used only internally by one component (neither exported nor called cross-component) can be eliminated once they become unreachable after fusion. This extends to entire call trees.
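The analysis is standard worklist reachability from the fused module's exports; a minimal sketch (the call graph and function names are invented for illustration):

```python
def live_functions(call_graph: dict, roots: list) -> set:
    # Everything reachable from the exports survives; the rest is dropped.
    seen, work = set(), list(roots)
    while work:
        f = work.pop()
        if f not in seen:
            seen.add(f)
            work.extend(call_graph.get(f, []))
    return seen

graph = {"main": ["helper"], "helper": [], "dead": ["also_dead"], "also_dead": []}
live = live_functions(graph, ["main"])  # "dead" and its whole call tree go away
```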
2.2 Function deduplication
Components sharing the same library (wasi-libc, wit-bindgen runtime) produce duplicate functions. loom can merge identical function bodies, a potentially significant size saving.
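The core of the merge is mapping every function to the canonical owner of its body; callers are then rewritten to the canonical copy and the duplicates dropped. A sketch, keying on the raw body (a real pass would hash normalized body bytes):

```python
def dedup_functions(functions: dict) -> dict:
    # name -> canonical name: the first function seen with that body wins.
    canonical, remap = {}, {}
    for name, body in functions.items():
        remap[name] = canonical.setdefault(body, name)
    return remap
```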
2.3 Interprocedural constant propagation
If a cross-component call always passes the same constant (e.g., string encoding = UTF-8), propagate through the adapter into the callee. Can cascade into further simplifications.
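A sketch of the cascade, using the string-encoding example (the `lower_string` helper and encoding constants are illustrative, not the canonical ABI's actual representation):

```python
UTF8, UTF16 = 0, 1

def lower_string(s: str, encoding: int) -> bytes:
    if encoding == UTF8:
        return s.encode("utf-8")
    return s.encode("utf-16-le")

# Every call site passes UTF8, so the constant propagates and the
# branch (and the whole UTF-16 path) folds away:
def lower_string_utf8(s: str) -> bytes:
    return s.encode("utf-8")
```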
2.4 Whole-program type narrowing
If result<T, E> across a component boundary always succeeds, eliminate the error path entirely. Same for option<T> that's always some.
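A before/after sketch, modeling result<T, E> as a tagged pair (the representation and function names are illustrative):

```python
def parse_before(x: int):
    # This boundary provably never returns the error arm.
    return ("ok", x * 2)

def use_before(x: int) -> int:
    tag, val = parse_before(x)
    if tag == "err":              # dead error path
        raise RuntimeError(val)
    return val

# Narrowed: the tag and the error branch are gone; only the payload remains.
def use_after(x: int) -> int:
    return x * 2
```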
Tier 3: Memory optimization
3.1 Memory-to-register promotion
If a value is written to memory by one component and immediately read by another (via adapter), promote to register — eliminate the store+load pair entirely.
3.2 Multi-memory access optimization
Strength-reduce small fixed-size cross-memory copies to direct loads/stores or register passing.
3.3 Unused memory elimination
After dead code elimination, if a memory is never accessed, remove it. Reduces virtual address space usage.
3.4 Memory merging
If two components' memories are never accessed simultaneously (one only used during init), merge into one memory with disjoint address ranges.
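The layout side of the merge is a simple base assignment; every load/store in a former memory is then rebased by its offset. A sketch (assuming lifetimes have already been proven disjoint):

```python
def merge_memories(sizes: dict) -> dict:
    # Give each former memory a disjoint base in one merged address space.
    bases, offset = {}, 0
    for name, size in sizes.items():
        bases[name] = offset
        offset += size
    return bases
```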
Tier 4: Advanced LTO
4.1 Escape analysis for resources
If a resource handle never leaves its creating component after fusion, eliminate the handle table overhead (ht_new/ht_rep/ht_drop → direct pointer passing).
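A sketch of what is being eliminated: the handle table indirection versus passing the representation directly (the `HandleTable` class is a model of the emitted ht_* functions, not meld's actual code):

```python
class HandleTable:
    # Models the ht_new / ht_rep / ht_drop functions emitted for resources.
    def __init__(self):
        self.slots = []
    def ht_new(self, rep):
        self.slots.append(rep)
        return len(self.slots) - 1
    def ht_rep(self, handle):
        return self.slots[handle]
    def ht_drop(self, handle):
        self.slots[handle] = None

# If escape analysis proves the handle never leaves the creating component,
# the table disappears and the rep is passed directly:
def use_resource_direct(rep):
    return rep
```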
4.2 Speculative devirtualization
Indirect calls through tables (common in component model dispatch) that always target the same function after fusion → convert to direct call.
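A sketch of the rewrite (table contents and names invented for illustration):

```python
def double(x: int) -> int:
    return 2 * x

TABLE = [double]  # models a component-model dispatch table

# Before: call_indirect through the table.
def call_indirect(idx: int, arg: int) -> int:
    return TABLE[idx](arg)

# After fusion the index is provably 0, so the lookup folds to a direct call
# (a guard would be retained if the proof were only speculative):
def call_direct(arg: int) -> int:
    return double(arg)
```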
4.3 Cross-component loop fusion
If component A produces values in a loop and component B consumes them (via adapter), fuse the producer/consumer loops to eliminate intermediate buffering.
4.4 Partial evaluation
If one component's behavior is fully determined by initialization parameters (constants from the composition), evaluate at compile time and replace with the result.
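A sketch of the specialization, with the init parameter modeled as a closure argument (names illustrative):

```python
def instantiate(scale: int):
    # Component behavior fully determined by an init-time constant.
    def run(x: int) -> int:
        return x * scale
    return run

# With scale fixed to 3 in the composition, the component partially
# evaluates to a constant-folded body:
def run_specialized(x: int) -> int:
    return x * 3
```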
4.5 Global scheduling
Reorder instructions across the former component boundary for better CPU pipeline utilization — only possible with single-module whole-program view.
Why single-module is unique
None of these optimizations are possible with the multi-module approach (RFC #46's direction for wasmtime). Multi-module preserves module boundaries that prevent cross-module analysis. The meld + loom pipeline is:
Components → meld fuse → single module → loom optimize → optimized module
This achieves what RFC #46 cannot: component boundaries erased at the code level while preserved at the semantic level (multi-memory isolation). The optimizer sees one program, not N isolated modules.
Input format
meld's fused output is a standard core wasm module with:
- Multiple memories (one per original component)
- Adapter functions connecting cross-component calls
- Handle table functions (ht_new/ht_rep/ht_drop) for resource re-exporters
- P3 async callback-driving adapters importing waitable-set-poll from host
Benchmark plan
Compare fused module size and execution speed:
- Before loom optimization
- After loom optimization
- Against unfused multi-module (baseline)
- Across runtimes: wasmtime, V8 (d8), kiln interpreter