Update load test to support concurrency #1944
Closed
Conversation
@dstaay-fb has exported this pull request. If you are a Meta employee, you can view the originating Diff in D87475053.
Force-pushed 525601f to 260cc2e
dstaay-fb added a commit to dstaay-fb/monarch that referenced this pull request on Nov 20, 2025:
Summary: Update script to support concurrency, with relevant benchmarks:

```
buck run @//mode/dev-nosan //monarch/python/tests:rdma_load_test -- --device cuda:0 cuda:1 --operation write --iterations 5 --size 500 --expandable-segments true --concurrency 4
```

Sample output:

```
============================================================
RDMA WRITE LOAD TEST RESULTS (CUDA:0)
============================================================
Total iterations completed: 20
Average data per operation: 587.5 MB
Total data transferred: 11750.0 MB

INDIVIDUAL OPERATION BANDWIDTH:
  Average bandwidth: 19339.31 Gbps
  Maximum bandwidth: 28238.82 Gbps
  Minimum bandwidth: 12535.37 Gbps

AGGREGATE BANDWIDTH (concurrency=4):
  Average aggregate bandwidth: 356.10 Gbps
  Maximum aggregate bandwidth: 459.63 Gbps
  Minimum aggregate bandwidth: 308.98 Gbps

TOTAL SUSTAINED THROUGHPUT:
  Total wall-clock time: 0.289 s
  Total data transferred: 11750.0 MB
  Sustained throughput: 341.34 Gbps
  (Accounts for 4x concurrent overlapping operations)
============================================================
```

Differential Revision: D87475053
casteryh approved these changes on Nov 20, 2025
casteryh (Contributor) left a comment:
Review automatically exported from Phabricator review in Meta.
dstaay-fb added a commit to dstaay-fb/monarch that referenced this pull request on Nov 21, 2025:
Summary: Update script to support concurrency, with relevant benchmarks:

```
buck run @//mode/dev-nosan //monarch/python/tests:rdma_load_test -- --device cuda:0 cuda:1 --operation write --iterations 5 --size 500 --expandable-segments true --concurrency 4
```

Sample output:

```
==================================================================
CONCURRENT BATCH TIMING (wall-clock for all concurrent ops):
  Average batch time: 48.681 ms
  Minimum batch time: 25.463 ms
  Maximum batch time: 230.379 ms
  Standard deviation: 20.382 ms
  Average data per batch: 1982.5 MB

AGGREGATE BANDWIDTH (concurrency=4):
  Average aggregate bandwidth: 341.62 Gbps
  Maximum aggregate bandwidth: 653.13 Gbps
  Minimum aggregate bandwidth: 72.19 Gbps

TOTAL SUSTAINED THROUGHPUT:
  Total wall-clock time: 5.094 s
  Total data transferred: 198250.0 MB
  Sustained throughput: 326.47 Gbps
  (Accounts for 4x concurrent overlapping operations)

============================================================
RDMA WRITE LOAD TEST RESULTS (CUDA:1)
============================================================
INDIVIDUAL OPERATION TIMING:
  Average time per operation: 29.031 ms
  Minimum time per operation: 6.103 ms
  Maximum time per operation: 191.391 ms
  Standard deviation: 19.436 ms
Total iterations completed: 400
Average data per operation: 495.6 MB
Total data transferred: 198250.0 MB

INDIVIDUAL OPERATION BANDWIDTH:
  Average bandwidth: 143.21 Gbps
  Maximum bandwidth: 681.26 Gbps
  Minimum bandwidth: 21.72 Gbps
```

Reviewed By: casteryh
Differential Revision: D87475053
Force-pushed 260cc2e to 225cd52
Force-pushed 225cd52 to ad4c9b5
Summary:
TL;DR:
- BEFORE: flow was controlled by requiring the Python caller to obtain QP ownership and hold it for the duration of the call (.read_from/.write_into).
- AFTER: we can now cheaply clone QPs, use atomics to generate wr_id values, and rely on ibverbs' internal locks (ibv_post_send is thread-safe). The remaining complexity comes from work completion events, which may be returned out of order and are delivered only once, so any WC must be stored in a separate cache.
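The out-of-order, deliver-once completion semantics can be illustrated with a small sketch. This is not the PR's implementation; WorkCompletion is a hypothetical stand-in for ibv_wc, and the simulated CQ drain replaces real ibv_poll_cq calls:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for ibv_wc: just the wr_id and a status code.
#[derive(Debug, PartialEq)]
struct WorkCompletion {
    wr_id: u64,
    status: i32,
}

/// Minimal completion cache: everything drained from the CQ is stashed
/// by wr_id, so a waiter polling for one specific wr_id never discards
/// a completion that belongs to another in-flight operation.
#[derive(Default)]
struct CompletionCache {
    pending: HashMap<u64, WorkCompletion>,
}

impl CompletionCache {
    /// Stash whatever the (simulated) CQ drain returned, then look up
    /// the completion we are actually waiting for.
    fn poll_for(
        &mut self,
        wr_id: u64,
        cq_drain: impl Iterator<Item = WorkCompletion>,
    ) -> Option<WorkCompletion> {
        for wc in cq_drain {
            self.pending.insert(wc.wr_id, wc);
        }
        self.pending.remove(&wr_id)
    }
}

fn main() {
    let mut cache = CompletionCache::default();
    // The CQ hands back wr_id 2 before wr_id 1.
    let batch = vec![
        WorkCompletion { wr_id: 2, status: 0 },
        WorkCompletion { wr_id: 1, status: 0 },
    ];
    // Waiter for wr_id 1 gets its completion; wr_id 2 stays cached.
    assert_eq!(cache.poll_for(1, batch.into_iter()).unwrap().wr_id, 1);
    // A later waiter finds wr_id 2 in the cache without touching the CQ.
    assert_eq!(cache.poll_for(2, std::iter::empty()).unwrap().wr_id, 2);
    println!("out-of-order completions resolved via cache");
}
```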
### Atomic Counters in rdmaxcel_qp_t for Lock-Free Operations
The rdmaxcel_qp_t wrapper uses atomic counters to enable concurrent, lock-free work request posting:
```
typedef struct rdmaxcel_qp {
struct ibv_qp* ibv_qp;
struct ibv_cq* send_cq;
struct ibv_cq* recv_cq;
// Atomic counters for lock-free concurrent access
_Atomic uint64_t send_wqe_idx; // Next send WQE slot
_Atomic uint64_t send_db_idx; // Last doorbell rung
_Atomic uint64_t recv_wqe_idx; // Next recv WQE slot
_Atomic uint64_t recv_db_idx; // Last recv doorbell
_Atomic uint64_t rts_timestamp; // Ready-to-send timestamp
// Completion caches for efficient polling
completion_cache_t* send_completion_cache;
completion_cache_t* recv_completion_cache;
} rdmaxcel_qp_t;
```
Key benefits:
- Multiple threads can post work requests concurrently using fetch_add on atomic indices
- No locks needed for the hot path (posting operations)
- Each thread gets a unique WQE slot atomically
- Completion polling uses cached results to avoid redundant CQ polls
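The slot-allocation pattern can be sketched in Rust (the real counters live in the C struct above). This is a simulation, not the rdmaxcel code: allocate_slots and SQ_DEPTH are illustrative names, and the ring depth is an assumed value:

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical send-queue depth; the real value comes from QP creation.
const SQ_DEPTH: u64 = 256;

/// Simulate `threads` posters each claiming `per_thread` send WQE slots
/// from a shared atomic index, as the lock-free hot path does.
fn allocate_slots(threads: usize, per_thread: usize) -> Vec<u64> {
    let send_wqe_idx = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let idx = Arc::clone(&send_wqe_idx);
            thread::spawn(move || {
                (0..per_thread)
                    .map(|_| {
                        // fetch_add hands each poster a unique, monotonically
                        // increasing index with no lock; the ring position
                        // would be index % SQ_DEPTH.
                        idx.fetch_add(1, Ordering::Relaxed)
                    })
                    .collect::<Vec<u64>>()
            })
        })
        .collect();
    handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
}

fn main() {
    let slots = allocate_slots(4, 100);
    let unique: HashSet<u64> = slots.iter().copied().collect();
    // Every claimed index is distinct, even under contention.
    assert_eq!(unique.len(), 400);
    println!("claimed {} unique WQE indices (ring depth {})", unique.len(), SQ_DEPTH);
}
```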
### Mutex-Protected Queue Pair Creation
While operations are lock-free, QP creation is serialized using Rust Arc<Mutex<HashSet>>:
```
pub struct RdmaManagerActor {
// Track QPs currently being created to prevent duplicate creation
pending_qp_creation: Arc<Mutex<HashSet<(String, ActorId, String)>>>,
// ...
}
```
Creation flow:
1. Thread checks whether the QP exists (lock-free read from the HashMap)
2. If not, it acquires the mutex and checks the pending_qp_creation set
3. If another thread is creating it, it waits without holding the lock
4. Otherwise, it inserts the key into the set, releases the lock, and creates the QP
5. After creation, it removes the key from the set
This prevents race conditions where multiple threads try to create the same QP simultaneously while keeping the common path (using existing QPs) lock-free.
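The claim-then-create step can be sketched as follows. This is a simplified illustration, not the actor's code: QpKey collapses the real (String, ActorId, String) key to two strings, and try_claim/release are hypothetical helper names:

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

// Simplified stand-in for the real (String, ActorId, String) key.
type QpKey = (String, String);

/// Try to claim creation of the QP identified by `key`. Exactly one
/// caller sees true; the rest should wait for the winner to finish.
fn try_claim(pending: &Mutex<HashSet<QpKey>>, key: &QpKey) -> bool {
    // HashSet::insert returns false if the key was already present,
    // i.e. another thread is already creating this QP.
    pending.lock().unwrap().insert(key.clone())
}

/// Remove the key once the QP has been created (or creation failed).
fn release(pending: &Mutex<HashSet<QpKey>>, key: &QpKey) {
    pending.lock().unwrap().remove(key);
}

fn main() {
    let pending = Arc::new(Mutex::new(HashSet::new()));
    let key: QpKey = ("node0".into(), "peer1".into());
    // Eight threads race to create the same QP; only one wins the claim.
    let winners: usize = (0..8)
        .map(|_| {
            let p = Arc::clone(&pending);
            let k = key.clone();
            thread::spawn(move || try_claim(&p, &k) as usize)
        })
        .collect::<Vec<_>>()
        .into_iter()
        .map(|h| h.join().unwrap())
        .sum();
    assert_eq!(winners, 1);
    release(&pending, &key);
    // After release, the key is claimable again.
    assert!(try_claim(&pending, &key));
    println!("exactly one creator per key");
}
```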
### Resource Lifecycle Management
Simplified cleanup via rdmaxcel_qp_destroy:
- Previously: Rust manually destroyed the ibv_qp and the CQs separately (error-prone with concurrent access)
- Now: a single C function destroys all resources atomically
- Changed register_segments(pd, rdmaxcel_qp_t*) to work with the wrapper instead of a raw ibv_qp
Reviewed By: casteryh
Differential Revision: D87021168
Force-pushed ad4c9b5 to 2a1f86e
This pull request has been merged in 93b653a.