fix(cuda): vpmm v4 - reusing VA & synced defrag #179
Conversation
5. **Cross-stream synchronization**: Use CUDA events to track when freed memory becomes safe to reuse across streams.
6. **Grow**: if still insufficient, **allocate more pages** and map them at the end.

1. **Reserve** a large VA chunk once (size configurable via `VPMM_VA_SIZE`). Additional chunks are reserved on demand.
2. **Track** three disjoint maps: `malloc_regions`, `free_regions` (with CUDA events/stream ids), and `unmapped_regions` (holes).
I find the term "hole" confusing: what matters more is the distinction between "host-side freed but still VA-reserved space" and totally VA-unmapped space. I think the free_regions vs unmapped_regions naming is good for clarifying that; "hole" makes it hard to tell whether you're talking about free_regions or unmapped_regions.
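For reference, a minimal sketch of the split being discussed, assuming BTreeMaps keyed by VA offset. Only the three map names come from the quoted docs; the struct layout, field names, and event handle type are illustrative, not the actual implementation:

```rust
use std::collections::BTreeMap;

/// Illustrative stand-in for a real CUDA event wrapper.
type CudaEventHandle = usize;

/// Freed by the host but still VA-mapped; reuse is gated on the CUDA event /
/// stream id recorded at free time.
struct FreeRegion {
    size: usize,
    stream_id: u64,
    event: CudaEventHandle,
}

/// Illustrative region bookkeeping: the three maps are pairwise disjoint and
/// together tile the reserved VA chunk.
struct RegionMaps {
    /// Live allocations handed out to callers (VA backed by physical pages).
    malloc_regions: BTreeMap<usize, usize>, // offset -> size
    /// Host-freed but still mapped; safe to reuse only after the event syncs.
    free_regions: BTreeMap<usize, FreeRegion>, // offset -> region
    /// Reserved VA with no physical pages mapped (the "holes" in the PR's
    /// wording); this is what defragmentation can remap into.
    unmapped_regions: BTreeMap<usize, usize>, // offset -> size
}
```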
let original_addr = *addr_holder2.lock().unwrap();
let new_addr = buf.as_raw_ptr() as usize;
if new_addr == original_addr {
    println!("Cross-thread: reused same address (event synced)");
if you only print, this test will not automatically catch anything: can you force exactly one scenario to occur and assert! it?
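A minimal sketch of the assert-based check the review asks for, reusing the names from the quoted test (`addr_holder2`, `buf`) and assuming the test first synchronizes on the CUDA event recorded at free time so reuse of the same address is deterministic rather than racy:

```rust
// After explicitly waiting on the event recorded when the original buffer was
// freed, the allocator is expected to reuse the same VA range, so assert it.
let original_addr = *addr_holder2.lock().unwrap();
let new_addr = buf.as_raw_ptr() as usize;
assert_eq!(
    new_addr, original_addr,
    "expected reuse of the freed address once the cross-thread event has synced"
);
```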
let mut handles = Vec::new();

for thread_idx in 0..8 {
    let handle = tokio::task::spawn_blocking(move || {
what is this testing? if it's not close to any hard memory limits, the behavior could be incorrect but the allocations still work. I expect more asserts of exact behavior? Right now even if something didn't go correctly, since there's no kernels accessing the memory, I doubt you'd catch much with this test
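A hedged sketch of the tighter test shape the review suggests: size the combined live allocations right at an assumed cap and assert exact properties of the result. `VpmmPool`, `alloc`, and `as_raw_ptr` are placeholders for the real API under test, and the cap value is invented for illustration:

```rust
use std::sync::Arc;

// Sketch only: `VpmmPool` stands in for the real allocator type.
async fn stress_near_cap(pool: Arc<VpmmPool>) {
    let cap_bytes: usize = 1 << 30; // assumed small test-time cap (e.g. via VPMM_VA_SIZE)
    let per_thread = cap_bytes / 8; // combined live allocations sit exactly at the cap
    let mut handles = Vec::new();
    for _thread_idx in 0..8 {
        let pool = Arc::clone(&pool);
        handles.push(tokio::task::spawn_blocking(move || {
            // Return the buffer so it stays alive and the peak really is 8 * per_thread.
            pool.alloc(per_thread).expect("allocation within the cap must succeed")
        }));
    }
    let mut bufs = Vec::new();
    for h in handles {
        bufs.push(h.await.unwrap());
    }
    // Exact-behavior check: live buffers must occupy pairwise-disjoint VA ranges.
    let mut addrs: Vec<usize> = bufs.iter().map(|b| b.as_raw_ptr() as usize).collect();
    addrs.sort_unstable();
    for w in addrs.windows(2) {
        assert!(w[1] - w[0] >= per_thread, "live allocations overlap in VA space");
    }
    // A ninth per_thread-sized allocation would exceed the cap; asserting what
    // happens then (failure, growth, or defrag) pins the behavior further.
}
```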
jonathanpwang left a comment
I believe there was an edge case which is fixed in c299a49: the to_defrag list doesn't include the free region that gets merged with dst, but the remaining size didn't account for this.
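A hedged sketch of the accounting as I read that comment; the function, its name, and the byte-size representation are illustrative, not the actual implementation:

```rust
// Illustrative only: when picking source regions to move into `dst`, the free
// space already adjacent to (merged with) `dst` also satisfies part of the
// request, so the remaining size must subtract it as well as the moved regions.
fn bytes_still_needed(requested: usize, merged_free_at_dst: usize, to_defrag: &[usize]) -> usize {
    let moved: usize = to_defrag.iter().sum();
    requested.saturating_sub(merged_free_at_dst + moved)
}
```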
Force-pushed from e38add1 to 0df93e0.
jonathanpwang left a comment
Approving to merge because I reviewed the code and documentation and it looks correct to me, but I made a follow-up ticket to harden the stress testing.
Updating to use openvm-org/stark-backend#179. Comparison bench: https://github.com/axiom-crypto/openvm-reth-benchmark/actions/runs/19803890244
Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com>
Current hypothesis: we're out of available virtual address space. Solution: reuse regions after defragmentation.
`unmapped_regions` is contiguous VA space that is already reserved but not currently used as a freed or malloced region, so we can use it for defragmentation. The current solution favors simplicity and guaranteed behavior at the price of synced defragmentation.

Optional: how to make it async again:

- use `cudaLaunchHost` so a host callback unmaps the region and adds it to `unmapped_regions` (see the sketch below)
- use `default_stream_wait_event` and a `cudaLaunchHost` callback that drops the event and unmaps the region

Related to INT-5557
Closes INT-5592
Closes INT-5531
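The `cudaLaunchHost` mentioned in the description presumably refers to the CUDA runtime's `cudaLaunchHostFunc`. A hedged Rust sketch of the first async option, where a host callback enqueued on the stream does the bookkeeping once prior work on the stream completes. Everything except the FFI signature (the `RetireRegion` payload and the Vec-based region list) is invented for illustration, and the actual unmapping is deliberately left outside the callback because CUDA host functions may not make CUDA API calls:

```rust
use std::ffi::c_void;
use std::sync::{Arc, Mutex};

// FFI declaration matching the CUDA runtime signature:
// cudaError_t cudaLaunchHostFunc(cudaStream_t stream, cudaHostFn_t fn, void* userData)
#[link(name = "cudart")]
extern "C" {
    fn cudaLaunchHostFunc(
        stream: *mut c_void,
        func: unsafe extern "C" fn(user_data: *mut c_void),
        user_data: *mut c_void,
    ) -> i32;
}

/// Illustrative callback payload: which VA range to retire and where to record it.
struct RetireRegion {
    offset: usize,
    size: usize,
    retired: Arc<Mutex<Vec<(usize, usize)>>>, // stand-in for `unmapped_regions`
}

unsafe extern "C" fn retire_region_cb(user_data: *mut c_void) {
    // Runs on a CUDA-internal thread once all prior work on the stream is done.
    // CUDA host functions must not call CUDA APIs, so the sketch only records
    // the range; the real cuMemUnmap would be performed later by the allocator.
    let job = Box::from_raw(user_data as *mut RetireRegion);
    job.retired.lock().unwrap().push((job.offset, job.size));
}

/// Enqueue the retirement so defrag stays async instead of syncing the device.
unsafe fn enqueue_retire(stream: *mut c_void, job: RetireRegion) {
    let user_data = Box::into_raw(Box::new(job)) as *mut c_void;
    let rc = cudaLaunchHostFunc(stream, retire_region_cb, user_data);
    assert_eq!(rc, 0, "cudaLaunchHostFunc failed");
}
```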