fix(cuda): vpmm v4 - reusing VA & synced defrag #179
Conversation
5. **Cross-stream synchronization**: Use CUDA events to track when freed memory becomes safe to reuse across streams.
6. **Grow**: if still insufficient, **allocate more pages** and map them at the end.

1. **Reserve** a large VA chunk once (size configurable via `VPMM_VA_SIZE`). Additional chunks are reserved on demand.
2. **Track** three disjoint maps: `malloc_regions`, `free_regions` (with CUDA events/stream ids), and `unmapped_regions` (holes).
I find the term "hole" confusing: what matters more is the distinction between "host-side freed but still VA-reserved space" and totally VA-unmapped space. I think the free_regions vs unmapped_regions naming is good for clarifying that; "hole" makes it hard to tell whether you're talking about free_regions or unmapped_regions.
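For reference, a minimal sketch of the split being discussed, assuming BTreeMaps keyed by VA offset. Only the three map names come from the quoted docs; the struct layout, field names, and event handle type are illustrative, not the actual implementation:

```rust
use std::collections::BTreeMap;

/// Illustrative stand-in for a real CUDA event wrapper.
type CudaEventHandle = usize;

/// Freed by the host but still VA-mapped; reuse is gated on the CUDA event /
/// stream id recorded at free time.
struct FreeRegion {
    size: usize,
    stream_id: u64,
    event: CudaEventHandle,
}

/// Illustrative region bookkeeping: the three maps are pairwise disjoint and
/// together tile the reserved VA chunk.
struct RegionMaps {
    /// Live allocations handed out to callers (VA backed by physical pages).
    malloc_regions: BTreeMap<usize, usize>, // offset -> size
    /// Host-freed but still mapped; safe to reuse only after the event syncs.
    free_regions: BTreeMap<usize, FreeRegion>, // offset -> region
    /// Reserved VA with no physical pages mapped (the "holes" in the PR's
    /// wording); this is what defragmentation can remap into.
    unmapped_regions: BTreeMap<usize, usize>, // offset -> size
}
```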
let original_addr = *addr_holder2.lock().unwrap();
let new_addr = buf.as_raw_ptr() as usize;
if new_addr == original_addr {
    println!("Cross-thread: reused same address (event synced)");
if you only print, this test will not automatically catch anything: can you force exactly one scenario to occur and assert! it?
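A minimal sketch of the assert-based check the review asks for, reusing the names from the quoted test (`addr_holder2`, `buf`) and assuming the test first synchronizes on the CUDA event recorded at free time so reuse of the same address is deterministic rather than racy:

```rust
// After explicitly waiting on the event recorded when the original buffer was
// freed, the allocator is expected to reuse the same VA range, so assert it.
let original_addr = *addr_holder2.lock().unwrap();
let new_addr = buf.as_raw_ptr() as usize;
assert_eq!(
    new_addr, original_addr,
    "expected reuse of the freed address once the cross-thread event has synced"
);
```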
let mut handles = Vec::new();

for thread_idx in 0..8 {
    let handle = tokio::task::spawn_blocking(move || {
what is this testing? if it's not close to any hard memory limits, the behavior could be incorrect but the allocations still work. I expect more asserts of exact behavior? Right now even if something didn't go correctly, since there's no kernels accessing the memory, I doubt you'd catch much with this test
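A hedged sketch of the tighter test shape the review suggests: size the combined live allocations right at an assumed cap and assert exact properties of the result. `VpmmPool`, `alloc`, and `as_raw_ptr` are placeholders for the real API under test, and the cap value is invented for illustration:

```rust
use std::sync::Arc;

// Sketch only: `VpmmPool` stands in for the real allocator type.
async fn stress_near_cap(pool: Arc<VpmmPool>) {
    let cap_bytes: usize = 1 << 30; // assumed small test-time cap (e.g. via VPMM_VA_SIZE)
    let per_thread = cap_bytes / 8; // combined live allocations sit exactly at the cap
    let mut handles = Vec::new();
    for _thread_idx in 0..8 {
        let pool = Arc::clone(&pool);
        handles.push(tokio::task::spawn_blocking(move || {
            // Return the buffer so it stays alive and the peak really is 8 * per_thread.
            pool.alloc(per_thread).expect("allocation within the cap must succeed")
        }));
    }
    let mut bufs = Vec::new();
    for h in handles {
        bufs.push(h.await.unwrap());
    }
    // Exact-behavior check: live buffers must occupy pairwise-disjoint VA ranges.
    let mut addrs: Vec<usize> = bufs.iter().map(|b| b.as_raw_ptr() as usize).collect();
    addrs.sort_unstable();
    for w in addrs.windows(2) {
        assert!(w[1] - w[0] >= per_thread, "live allocations overlap in VA space");
    }
    // A ninth per_thread-sized allocation would exceed the cap; asserting what
    // happens then (failure, growth, or defrag) pins the behavior further.
}
```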
jonathanpwang left a comment
I believe there was an edge case which is fixed in c299a49: the to_defrag list doesn't include the free region that gets merged with dst, but the remaining size didn't account for this.
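A hedged sketch of the accounting as I read that comment; the function, its name, and the byte-size representation are illustrative, not the actual implementation:

```rust
// Illustrative only: when picking source regions to move into `dst`, the free
// space already adjacent to (merged with) `dst` also satisfies part of the
// request, so the remaining size must subtract it as well as the moved regions.
fn bytes_still_needed(requested: usize, merged_free_at_dst: usize, to_defrag: &[usize]) -> usize {
    let moved: usize = to_defrag.iter().sum();
    requested.saturating_sub(merged_free_at_dst + moved)
}
```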
Force-pushed from e38add1 to 0df93e0.
jonathanpwang left a comment
Approving to merge because I reviewed the code and documentation and it looks correct to me, but I made a follow-up ticket to harden the stress testing.
Updating to use openvm-org/stark-backend#179. Comparison bench: https://github.com/axiom-crypto/openvm-reth-benchmark/actions/runs/19803890244
Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com>
Current hypothesis: we're out of available virtual address space. Solution: reuse regions after defragmentation.
`unmapped_regions` is contiguous VA space that is already reserved but not currently used as a freed or malloced region, so we can use it for defragmentation. The current solution favors simplicity and guaranteed behavior at the price of synced defragmentation.

Optional: how to make it async again:

- use `cudaLaunchHost` so a host callback unmaps the region and adds it to `unmapped_regions` (see the sketch below)
- use `default_stream_wait_event` and a `cudaLaunchHost` callback that drops the event and unmaps the region

Related to INT-5557
Closes INT-5592
Closes INT-5531
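The `cudaLaunchHost` mentioned in the description presumably refers to the CUDA runtime's `cudaLaunchHostFunc`. A hedged Rust sketch of the first async option, where a host callback enqueued on the stream does the bookkeeping once prior work on the stream completes. Everything except the FFI signature (the `RetireRegion` payload and the Vec-based region list) is invented for illustration, and the actual unmapping is deliberately left outside the callback because CUDA host functions may not make CUDA API calls:

```rust
use std::ffi::c_void;
use std::sync::{Arc, Mutex};

// FFI declaration matching the CUDA runtime signature:
// cudaError_t cudaLaunchHostFunc(cudaStream_t stream, cudaHostFn_t fn, void* userData)
#[link(name = "cudart")]
extern "C" {
    fn cudaLaunchHostFunc(
        stream: *mut c_void,
        func: unsafe extern "C" fn(user_data: *mut c_void),
        user_data: *mut c_void,
    ) -> i32;
}

/// Illustrative callback payload: which VA range to retire and where to record it.
struct RetireRegion {
    offset: usize,
    size: usize,
    retired: Arc<Mutex<Vec<(usize, usize)>>>, // stand-in for `unmapped_regions`
}

unsafe extern "C" fn retire_region_cb(user_data: *mut c_void) {
    // Runs on a CUDA-internal thread once all prior work on the stream is done.
    // CUDA host functions must not call CUDA APIs, so the sketch only records
    // the range; the real cuMemUnmap would be performed later by the allocator.
    let job = Box::from_raw(user_data as *mut RetireRegion);
    job.retired.lock().unwrap().push((job.offset, job.size));
}

/// Enqueue the retirement so defrag stays async instead of syncing the device.
unsafe fn enqueue_retire(stream: *mut c_void, job: RetireRegion) {
    let user_data = Box::into_raw(Box::new(job)) as *mut c_void;
    let rc = cudaLaunchHostFunc(stream, retire_region_cb, user_data);
    assert_eq!(rc, 0, "cudaLaunchHostFunc failed");
}
```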