Conversation

@dstaay-fb dstaay-fb commented Nov 19, 2025

Summary:
Update script to support concurrency, with relevant benchmarks:

```
buck run @//mode/dev-nosan //monarch/python/tests:rdma_load_test -- --device cuda:0 cuda:1 --operation write --iterations 5 --size 500 --expandable-segments true --concurrency 4
```

sample output

```
============================================================
RDMA WRITE LOAD TEST RESULTS (CUDA:0)
============================================================
Total iterations completed: 20
Average data per operation: 587.5 MB
Total data transferred: 11750.0 MB

...

AGGREGATE BANDWIDTH (concurrency=4):
  Average aggregate bandwidth: 356.10 Gbps
  Maximum aggregate bandwidth: 459.63 Gbps
  Minimum aggregate bandwidth: 308.98 Gbps

TOTAL SUSTAINED THROUGHPUT:
  Total wall-clock time: 0.289 s
  Total data transferred: 11750.0 MB
  Sustained throughput: 341.34 Gbps
  (Accounts for 4x concurrent overlapping operations)
============================================================
```
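As a quick sanity check on how the sustained-throughput figure is derived (assuming the reported MB are mebibytes, which is what makes the arithmetic line up), it is simply total data over total wall-clock time:

$$\frac{11750 \cdot 2^{20} \cdot 8\ \text{bit}}{0.289\ \text{s}} \approx 3.41 \times 10^{11}\ \text{bit/s} \approx 341\ \text{Gbps}$$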

Differential Revision: D87475053

@meta-cla meta-cla bot added the CLA Signed label Nov 19, 2025

meta-codesync bot commented Nov 19, 2025

@dstaay-fb has exported this pull request. If you are a Meta employee, you can view the originating Diff in D87475053.

dstaay-fb added a commit to dstaay-fb/monarch that referenced this pull request Nov 20, 2025
Summary:

Update script to support concurrency, with relevant benchmarks:


  buck run @//mode/dev-nosan //monarch/python/tests:rdma_load_test -- --device cuda:0 cuda:1 --operation write  --iterations 5 --size 500 --expandable-segments true --concurrency 4

sample output
```
============================================================
RDMA WRITE LOAD TEST RESULTS (CUDA:0)
============================================================
Total iterations completed: 20
Average data per operation: 587.5 MB
Total data transferred: 11750.0 MB

INDIVIDUAL OPERATION BANDWIDTH:
  Average bandwidth: 19339.31 Gbps
  Maximum bandwidth: 28238.82 Gbps
  Minimum bandwidth: 12535.37 Gbps

AGGREGATE BANDWIDTH (concurrency=4):
  Average aggregate bandwidth: 356.10 Gbps
  Maximum aggregate bandwidth: 459.63 Gbps
  Minimum aggregate bandwidth: 308.98 Gbps

TOTAL SUSTAINED THROUGHPUT:
  Total wall-clock time: 0.289 s
  Total data transferred: 11750.0 MB
  Sustained throughput: 341.34 Gbps
  (Accounts for 4x concurrent overlapping operations)
============================================================
```

Differential Revision: D87475053

@casteryh casteryh left a comment


Review automatically exported from Phabricator review in Meta.

dstaay-fb added a commit to dstaay-fb/monarch that referenced this pull request Nov 21, 2025
Summary:

Update script to support concurrency, with relevant benchmarks:


  buck run @//mode/dev-nosan //monarch/python/tests:rdma_load_test -- --device cuda:0 cuda:1 --operation write  --iterations 5 --size 500 --expandable-segments true --concurrency 4

sample output
```
 ==================================================================
CONCURRENT BATCH TIMING (wall-clock for all concurrent ops):
  Average batch time: 48.681 ms
  Minimum batch time: 25.463 ms
  Maximum batch time: 230.379 ms
  Standard deviation: 20.382 ms
  Average data per batch: 1982.5 MB

AGGREGATE BANDWIDTH (concurrency=4):
  Average aggregate bandwidth: 341.62 Gbps
  Maximum aggregate bandwidth: 653.13 Gbps
  Minimum aggregate bandwidth: 72.19 Gbps

TOTAL SUSTAINED THROUGHPUT:
  Total wall-clock time: 5.094 s
  Total data transferred: 198250.0 MB
  Sustained throughput: 326.47 Gbps
  (Accounts for 4x concurrent overlapping operations)

============================================================
RDMA WRITE LOAD TEST RESULTS (CUDA:1)
============================================================
INDIVIDUAL OPERATION TIMING:
  Average time per operation: 29.031 ms
  Minimum time per operation: 6.103 ms
  Maximum time per operation: 191.391 ms
  Standard deviation: 19.436 ms
Total iterations completed: 400
Average data per operation: 495.6 MB
Total data transferred: 198250.0 MB

INDIVIDUAL OPERATION BANDWIDTH:
  Average bandwidth: 143.21 Gbps
  Maximum bandwidth: 681.26 Gbps
  Minimum bandwidth: 21.72 Gbps

```

Reviewed By: casteryh

Differential Revision: D87475053
dstaay-fb added a commit to dstaay-fb/monarch that referenced this pull request Nov 21, 2025
Summary:

TL;DR:
- BEFORE: flow was controlled by requiring the Python caller to obtain QP ownership and hold it for the duration of the call (.read_from/.write_into).
- AFTER: QPs can now be cheaply cloned; wr_ids are generated with atomics and we rely on ibverbs' internal locking (ibv_post_send is thread-safe). The added complexity comes from work completion events, which may be returned out of order and are delivered only once, so each WC must be stored in a separate cache.
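To illustrate the completion-cache idea, here is a minimal Rust sketch (the names CompletionCache, deposit, and wait_for are illustrative; the real cache lives in C as completion_cache_t): whichever thread drains the CQ may pull completions that belong to other threads' work requests, so each drained WC is parked in a shared map keyed by wr_id until its owner claims it.

```rust
use std::collections::HashMap;
use std::sync::{Condvar, Mutex};

/// Stand-in for an ibverbs work completion (illustrative only).
#[derive(Clone)]
struct WorkCompletion {
    wr_id: u64,
    status: i32,
}

/// Completions may be drained from the CQ by any thread and arrive out of
/// order; each one is delivered exactly once, so it is parked here until the
/// thread that owns that wr_id asks for it.
struct CompletionCache {
    slots: Mutex<HashMap<u64, WorkCompletion>>,
    ready: Condvar,
}

impl CompletionCache {
    fn new() -> Self {
        Self { slots: Mutex::new(HashMap::new()), ready: Condvar::new() }
    }

    /// Called by whichever thread happened to poll the CQ.
    fn deposit(&self, wc: WorkCompletion) {
        self.slots.lock().unwrap().insert(wc.wr_id, wc);
        self.ready.notify_all();
    }

    /// Called by the thread that posted `wr_id`; blocks until its completion
    /// has been deposited by some poller (possibly itself).
    fn wait_for(&self, wr_id: u64) -> WorkCompletion {
        let mut slots = self.slots.lock().unwrap();
        loop {
            if let Some(wc) = slots.remove(&wr_id) {
                return wc;
            }
            slots = self.ready.wait(slots).unwrap();
        }
    }
}

fn main() {
    let cache = std::sync::Arc::new(CompletionCache::new());
    let c = cache.clone();
    let waiter = std::thread::spawn(move || c.wait_for(42));
    cache.deposit(WorkCompletion { wr_id: 42, status: 0 });
    assert_eq!(waiter.join().unwrap().status, 0);
}
```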

### Atomic Counters in rdmaxcel_qp_t for Lock-Free Operations
The rdmaxcel_qp_t wrapper uses atomic counters to enable concurrent, lock-free work request posting:

```
typedef struct rdmaxcel_qp {
    struct ibv_qp* ibv_qp;
    struct ibv_cq* send_cq;
    struct ibv_cq* recv_cq;

    // Atomic counters for lock-free concurrent access
    _Atomic uint64_t send_wqe_idx;    // Next send WQE slot
    _Atomic uint64_t send_db_idx;     // Last doorbell rung
    _Atomic uint64_t recv_wqe_idx;    // Next recv WQE slot
    _Atomic uint64_t recv_db_idx;     // Last recv doorbell
    _Atomic uint64_t rts_timestamp;   // Ready-to-send timestamp

    // Completion caches for efficient polling
    completion_cache_t* send_completion_cache;
    completion_cache_t* recv_completion_cache;
} rdmaxcel_qp_t;
```

Key Benefits:

- Multiple threads can post work requests concurrently using fetch_add on the atomic indices (sketched below)
- No locks are needed on the hot path (posting operations)
- Each thread gets a unique WQE slot atomically
- Completion polling uses cached results to avoid redundant CQ polls
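The hot path can be pictured with a small Rust sketch (the actual posting path is C inside rdmaxcel; the SendQueue type, sq_depth field, and claim_slot function below are illustrative assumptions): each poster does a single fetch_add to obtain a unique index that serves both as the wr_id and as the WQE ring slot.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Illustrative stand-in for the send side of rdmaxcel_qp_t.
struct SendQueue {
    send_wqe_idx: AtomicU64, // next send WQE slot to hand out
    sq_depth: u64,           // assumed number of slots in the send ring
}

impl SendQueue {
    /// Claim a unique slot with a single fetch_add; no lock is taken, and the
    /// monotonically increasing index doubles as the wr_id for the post.
    fn claim_slot(&self) -> (u64, u64) {
        let wr_id = self.send_wqe_idx.fetch_add(1, Ordering::Relaxed);
        (wr_id, wr_id % self.sq_depth) // wrap into the ring
    }
}

fn main() {
    let sq = Arc::new(SendQueue {
        send_wqe_idx: AtomicU64::new(0),
        sq_depth: 128,
    });

    // Concurrent posters all receive distinct wr_ids without contention.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let sq = Arc::clone(&sq);
            thread::spawn(move || sq.claim_slot())
        })
        .collect();
    for h in handles {
        let (wr_id, slot) = h.join().unwrap();
        println!("claimed wr_id={wr_id} slot={slot}");
    }
}
```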

### Mutex-Protected Queue Pair Creation
While operations are lock-free, QP creation is serialized using a Rust Arc<Mutex<HashSet>>:

```
pub struct RdmaManagerActor {
    // Track QPs currently being created to prevent duplicate creation
    pending_qp_creation: Arc<Mutex<HashSet<(String, ActorId, String)>>>,
    // ...
}
```
Creation Flow:

1. A thread checks whether the QP already exists (lock-free read from the HashMap)
2. If not, it acquires the mutex and checks the pending_qp_creation set
3. If another thread is already creating it, it waits without holding the lock
4. Otherwise, it inserts the key into the set, releases the lock, and creates the QP
5. After creation, it removes the key from the set

This prevents race conditions where multiple threads try to create the same QP simultaneously, while keeping the common path (using existing QPs) lock-free; a rough sketch of the flow follows.
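A minimal Rust sketch of that flow, under simplifying assumptions (a plain String key instead of (String, ActorId, String), a placeholder Qp type, and a mutex-guarded map where the real existence check is described as lock-free); it is not the actual RdmaManagerActor code.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

type QpKey = String; // simplified; the real key is (String, ActorId, String)

#[derive(Clone)]
struct Qp; // placeholder for the real queue pair handle

struct Manager {
    qps: Mutex<HashMap<QpKey, Qp>>, // existing QPs (simplified to a Mutex here)
    pending: Mutex<HashSet<QpKey>>, // keys currently being created
}

impl Manager {
    fn get_or_create(&self, key: &QpKey) -> Qp {
        loop {
            // Common path: the QP already exists, reuse it.
            if let Some(qp) = self.qps.lock().unwrap().get(key) {
                return qp.clone();
            }
            // Try to become the creator; insert() returns false if another
            // thread already registered this key as pending.
            let i_create = self.pending.lock().unwrap().insert(key.clone());
            if i_create {
                let qp = Qp; // expensive creation happens with no lock held
                self.qps.lock().unwrap().insert(key.clone(), qp.clone());
                self.pending.lock().unwrap().remove(key);
                return qp;
            }
            // Someone else is creating it; back off and re-check.
            thread::sleep(Duration::from_millis(1));
        }
    }
}

fn main() {
    let mgr = Arc::new(Manager {
        qps: Mutex::new(HashMap::new()),
        pending: Mutex::new(HashSet::new()),
    });
    // Several threads race for the same key; exactly one creates the QP.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let mgr = Arc::clone(&mgr);
            thread::spawn(move || mgr.get_or_create(&"peer-0".to_string()))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```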

### Resource Lifecycle Management
Simplified cleanup via rdmaxcel_qp_destroy:

- Previously: Rust manually destroyed the ibv_qp and CQs separately (error-prone with concurrent access)
- Now: a single C function destroys all resources atomically (see the sketch below)
- Changed register_segments(pd, rdmaxcel_qp_t*) to work with the wrapper instead of a raw ibv_qp
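For illustration, one way a Rust owner could tie cleanup to Drop by calling the single C destructor; the extern signature and the QpHandle type below are assumptions for the sketch, not the actual rdmaxcel bindings.

```rust
/// Opaque handle to the C-side wrapper (layout deliberately hidden).
#[allow(non_camel_case_types)]
#[repr(C)]
pub struct rdmaxcel_qp_t {
    _private: [u8; 0],
}

extern "C" {
    // Assumed signature: destroys the ibv_qp, both CQs, and the completion
    // caches in one call, as described above.
    fn rdmaxcel_qp_destroy(qp: *mut rdmaxcel_qp_t);
}

/// Owning Rust-side handle: dropping it tears everything down in one place
/// instead of destroying the QP and CQs piecemeal from Rust.
pub struct QpHandle {
    raw: *mut rdmaxcel_qp_t,
}

impl Drop for QpHandle {
    fn drop(&mut self) {
        if !self.raw.is_null() {
            unsafe { rdmaxcel_qp_destroy(self.raw) };
        }
    }
}
```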

Reviewed By: casteryh

Differential Revision: D87021168

meta-codesync bot commented Nov 21, 2025

This pull request has been merged in 93b653a.
