Conversation

@dstaay-fb dstaay-fb commented Nov 19, 2025

Summary:
Update script to support concurrency, with relevant benchmarks:

```
buck run @//mode/dev-nosan //monarch/python/tests:rdma_load_test -- --device cuda:0 cuda:1 --operation write --iterations 5 --size 500 --expandable-segments true --concurrency 4
```

sample output

```
============================================================
RDMA WRITE LOAD TEST RESULTS (CUDA:0)
============================================================
Total iterations completed: 20
Average data per operation: 587.5 MB
Total data transferred: 11750.0 MB

...

AGGREGATE BANDWIDTH (concurrency=4):
  Average aggregate bandwidth: 356.10 Gbps
  Maximum aggregate bandwidth: 459.63 Gbps
  Minimum aggregate bandwidth: 308.98 Gbps

TOTAL SUSTAINED THROUGHPUT:
  Total wall-clock time: 0.289 s
  Total data transferred: 11750.0 MB
  Sustained throughput: 341.34 Gbps
  (Accounts for 4x concurrent overlapping operations)
============================================================
```
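As a quick sanity check on how the sustained-throughput figure is derived (assuming the reported MB are mebibytes, which is what makes the arithmetic line up), it is simply total data over total wall-clock time:

$$\frac{11750 \cdot 2^{20} \cdot 8\ \text{bit}}{0.289\ \text{s}} \approx 3.41 \times 10^{11}\ \text{bit/s} \approx 341\ \text{Gbps}$$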

Differential Revision: D87475053

@meta-cla meta-cla bot added the CLA Signed label Nov 19, 2025

meta-codesync bot commented Nov 19, 2025

@dstaay-fb has exported this pull request. If you are a Meta employee, you can view the originating Diff in D87475053.

dstaay-fb added a commit to dstaay-fb/monarch that referenced this pull request Nov 20, 2025
Summary:

Update script to support concurrency, with relevant benchmarks:


  buck run @//mode/dev-nosan //monarch/python/tests:rdma_load_test -- --device cuda:0 cuda:1 --operation write  --iterations 5 --size 500 --expandable-segments true --concurrency 4

sample output
```
============================================================
RDMA WRITE LOAD TEST RESULTS (CUDA:0)
============================================================
Total iterations completed: 20
Average data per operation: 587.5 MB
Total data transferred: 11750.0 MB

INDIVIDUAL OPERATION BANDWIDTH:
  Average bandwidth: 19339.31 Gbps
  Maximum bandwidth: 28238.82 Gbps
  Minimum bandwidth: 12535.37 Gbps

AGGREGATE BANDWIDTH (concurrency=4):
  Average aggregate bandwidth: 356.10 Gbps
  Maximum aggregate bandwidth: 459.63 Gbps
  Minimum aggregate bandwidth: 308.98 Gbps

TOTAL SUSTAINED THROUGHPUT:
  Total wall-clock time: 0.289 s
  Total data transferred: 11750.0 MB
  Sustained throughput: 341.34 Gbps
  (Accounts for 4x concurrent overlapping operations)
============================================================
```

Differential Revision: D87475053

@casteryh casteryh left a comment


Review automatically exported from Phabricator review in Meta.

dstaay-fb added a commit to dstaay-fb/monarch that referenced this pull request Nov 21, 2025
Summary:

Update script to support concurrency, with relevant benchmarks:


  buck run @//mode/dev-nosan //monarch/python/tests:rdma_load_test -- --device cuda:0 cuda:1 --operation write  --iterations 5 --size 500 --expandable-segments true --concurrency 4

sample output
```
 ==================================================================
CONCURRENT BATCH TIMING (wall-clock for all concurrent ops):
  Average batch time: 48.681 ms
  Minimum batch time: 25.463 ms
  Maximum batch time: 230.379 ms
  Standard deviation: 20.382 ms
  Average data per batch: 1982.5 MB

AGGREGATE BANDWIDTH (concurrency=4):
  Average aggregate bandwidth: 341.62 Gbps
  Maximum aggregate bandwidth: 653.13 Gbps
  Minimum aggregate bandwidth: 72.19 Gbps

TOTAL SUSTAINED THROUGHPUT:
  Total wall-clock time: 5.094 s
  Total data transferred: 198250.0 MB
  Sustained throughput: 326.47 Gbps
  (Accounts for 4x concurrent overlapping operations)

============================================================
RDMA WRITE LOAD TEST RESULTS (CUDA:1)
============================================================
INDIVIDUAL OPERATION TIMING:
  Average time per operation: 29.031 ms
  Minimum time per operation: 6.103 ms
  Maximum time per operation: 191.391 ms
  Standard deviation: 19.436 ms
Total iterations completed: 400
Average data per operation: 495.6 MB
Total data transferred: 198250.0 MB

INDIVIDUAL OPERATION BANDWIDTH:
  Average bandwidth: 143.21 Gbps
  Maximum bandwidth: 681.26 Gbps
  Minimum bandwidth: 21.72 Gbps

```

Reviewed By: casteryh

Differential Revision: D87475053
dstaay-fb added a commit to dstaay-fb/monarch that referenced this pull request Nov 21, 2025
Summary:

TL;DR:
- BEFORE: flow was controlled by requiring the Python caller to obtain QP ownership and hold it for the duration of the call (.read_from/.write_into).
- AFTER: QPs can now be cheaply cloned; wr_ids are generated with atomics and we rely on ibverbs' internal locking (ibv_post_send is thread-safe). The added complexity comes from work completion events, which may be returned out of order and are delivered only once, so each WC must be stored in a separate cache.
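To illustrate the completion-cache idea, here is a minimal Rust sketch (the names CompletionCache, deposit, and wait_for are illustrative; the real cache lives in C as completion_cache_t): whichever thread drains the CQ may pull completions that belong to other threads' work requests, so each drained WC is parked in a shared map keyed by wr_id until its owner claims it.

```rust
use std::collections::HashMap;
use std::sync::{Condvar, Mutex};

/// Stand-in for an ibverbs work completion (illustrative only).
#[derive(Clone)]
struct WorkCompletion {
    wr_id: u64,
    status: i32,
}

/// Completions may be drained from the CQ by any thread and arrive out of
/// order; each one is delivered exactly once, so it is parked here until the
/// thread that owns that wr_id asks for it.
struct CompletionCache {
    slots: Mutex<HashMap<u64, WorkCompletion>>,
    ready: Condvar,
}

impl CompletionCache {
    fn new() -> Self {
        Self { slots: Mutex::new(HashMap::new()), ready: Condvar::new() }
    }

    /// Called by whichever thread happened to poll the CQ.
    fn deposit(&self, wc: WorkCompletion) {
        self.slots.lock().unwrap().insert(wc.wr_id, wc);
        self.ready.notify_all();
    }

    /// Called by the thread that posted `wr_id`; blocks until its completion
    /// has been deposited by some poller (possibly itself).
    fn wait_for(&self, wr_id: u64) -> WorkCompletion {
        let mut slots = self.slots.lock().unwrap();
        loop {
            if let Some(wc) = slots.remove(&wr_id) {
                return wc;
            }
            slots = self.ready.wait(slots).unwrap();
        }
    }
}

fn main() {
    let cache = std::sync::Arc::new(CompletionCache::new());
    let c = cache.clone();
    let waiter = std::thread::spawn(move || c.wait_for(42));
    cache.deposit(WorkCompletion { wr_id: 42, status: 0 });
    assert_eq!(waiter.join().unwrap().status, 0);
}
```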

### Atomic Counters in rdmaxcel_qp_t for Lock-Free Operations
The rdmaxcel_qp_t wrapper uses atomic counters to enable concurrent, lock-free work request posting:

```
typedef struct rdmaxcel_qp {
    struct ibv_qp* ibv_qp;
    struct ibv_cq* send_cq;
    struct ibv_cq* recv_cq;

    // Atomic counters for lock-free concurrent access
    _Atomic uint64_t send_wqe_idx;    // Next send WQE slot
    _Atomic uint64_t send_db_idx;     // Last doorbell rung
    _Atomic uint64_t recv_wqe_idx;    // Next recv WQE slot
    _Atomic uint64_t recv_db_idx;     // Last recv doorbell
    _Atomic uint64_t rts_timestamp;   // Ready-to-send timestamp

    // Completion caches for efficient polling
    completion_cache_t* send_completion_cache;
    completion_cache_t* recv_completion_cache;
} rdmaxcel_qp_t;
```

Key Benefits:

- Multiple threads can post work requests concurrently using fetch_add on the atomic indices (sketched below)
- No locks are needed on the hot path (posting operations)
- Each thread gets a unique WQE slot atomically
- Completion polling uses cached results to avoid redundant CQ polls
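The hot path can be pictured with a small Rust sketch (the actual posting path is C inside rdmaxcel; the SendQueue type, sq_depth field, and claim_slot function below are illustrative assumptions): each poster does a single fetch_add to obtain a unique index that serves both as the wr_id and as the WQE ring slot.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Illustrative stand-in for the send side of rdmaxcel_qp_t.
struct SendQueue {
    send_wqe_idx: AtomicU64, // next send WQE slot to hand out
    sq_depth: u64,           // assumed number of slots in the send ring
}

impl SendQueue {
    /// Claim a unique slot with a single fetch_add; no lock is taken, and the
    /// monotonically increasing index doubles as the wr_id for the post.
    fn claim_slot(&self) -> (u64, u64) {
        let wr_id = self.send_wqe_idx.fetch_add(1, Ordering::Relaxed);
        (wr_id, wr_id % self.sq_depth) // wrap into the ring
    }
}

fn main() {
    let sq = Arc::new(SendQueue {
        send_wqe_idx: AtomicU64::new(0),
        sq_depth: 128,
    });

    // Concurrent posters all receive distinct wr_ids without contention.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let sq = Arc::clone(&sq);
            thread::spawn(move || sq.claim_slot())
        })
        .collect();
    for h in handles {
        let (wr_id, slot) = h.join().unwrap();
        println!("claimed wr_id={wr_id} slot={slot}");
    }
}
```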

### Mutex-Protected Queue Pair Creation
While operations are lock-free, QP creation is serialized using a Rust Arc<Mutex<HashSet>>:

```
pub struct RdmaManagerActor {
    // Track QPs currently being created to prevent duplicate creation
    pending_qp_creation: Arc<Mutex<HashSet<(String, ActorId, String)>>>,
    // ...
}
```
Creation Flow:

1. A thread checks whether the QP already exists (lock-free read from the HashMap)
2. If not, it acquires the mutex and checks the pending_qp_creation set
3. If another thread is already creating it, it waits without holding the lock
4. Otherwise, it inserts the key into the set, releases the lock, and creates the QP
5. After creation, it removes the key from the set

This prevents race conditions where multiple threads try to create the same QP simultaneously, while keeping the common path (using existing QPs) lock-free; a rough sketch of the flow follows.
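A minimal Rust sketch of that flow, under simplifying assumptions (a plain String key instead of (String, ActorId, String), a placeholder Qp type, and a mutex-guarded map where the real existence check is described as lock-free); it is not the actual RdmaManagerActor code.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

type QpKey = String; // simplified; the real key is (String, ActorId, String)

#[derive(Clone)]
struct Qp; // placeholder for the real queue pair handle

struct Manager {
    qps: Mutex<HashMap<QpKey, Qp>>, // existing QPs (simplified to a Mutex here)
    pending: Mutex<HashSet<QpKey>>, // keys currently being created
}

impl Manager {
    fn get_or_create(&self, key: &QpKey) -> Qp {
        loop {
            // Common path: the QP already exists, reuse it.
            if let Some(qp) = self.qps.lock().unwrap().get(key) {
                return qp.clone();
            }
            // Try to become the creator; insert() returns false if another
            // thread already registered this key as pending.
            let i_create = self.pending.lock().unwrap().insert(key.clone());
            if i_create {
                let qp = Qp; // expensive creation happens with no lock held
                self.qps.lock().unwrap().insert(key.clone(), qp.clone());
                self.pending.lock().unwrap().remove(key);
                return qp;
            }
            // Someone else is creating it; back off and re-check.
            thread::sleep(Duration::from_millis(1));
        }
    }
}

fn main() {
    let mgr = Arc::new(Manager {
        qps: Mutex::new(HashMap::new()),
        pending: Mutex::new(HashSet::new()),
    });
    // Several threads race for the same key; exactly one creates the QP.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let mgr = Arc::clone(&mgr);
            thread::spawn(move || mgr.get_or_create(&"peer-0".to_string()))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```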

### Resource Lifecycle Management
Simplified cleanup via rdmaxcel_qp_destroy:

- Previously: Rust manually destroyed the ibv_qp and CQs separately (error-prone with concurrent access)
- Now: a single C function destroys all resources atomically (see the sketch below)
- Changed register_segments(pd, rdmaxcel_qp_t*) to work with the wrapper instead of a raw ibv_qp
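For illustration, one way a Rust owner could tie cleanup to Drop by calling the single C destructor; the extern signature and the QpHandle type below are assumptions for the sketch, not the actual rdmaxcel bindings.

```rust
/// Opaque handle to the C-side wrapper (layout deliberately hidden).
#[allow(non_camel_case_types)]
#[repr(C)]
pub struct rdmaxcel_qp_t {
    _private: [u8; 0],
}

extern "C" {
    // Assumed signature: destroys the ibv_qp, both CQs, and the completion
    // caches in one call, as described above.
    fn rdmaxcel_qp_destroy(qp: *mut rdmaxcel_qp_t);
}

/// Owning Rust-side handle: dropping it tears everything down in one place
/// instead of destroying the QP and CQs piecemeal from Rust.
pub struct QpHandle {
    raw: *mut rdmaxcel_qp_t,
}

impl Drop for QpHandle {
    fn drop(&mut self) {
        if !self.raw.is_null() {
            unsafe { rdmaxcel_qp_destroy(self.raw) };
        }
    }
}
```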

Reviewed By: casteryh

Differential Revision: D87021168

meta-codesync bot commented Nov 21, 2025

This pull request has been merged in 93b653a.
