Adds RDMA Support & generic transport interface #13
Conversation
this is awesome :)
        
          
torchstore/transport/buffers.py (outdated)
```python
# but setting this chunk size works around the issue until we can fix it
# N.B. from benchmarking, we know the ideal size is any size >256mb.
RDMDA_CHUNK_SIZE_MB = int(
    os.environ.get("TORCHSTORE_RDMDA_CHUNK_SIZE_MB", "1")
```
```diff
- os.environ.get("TORCHSTORE_RDMDA_CHUNK_SIZE_MB", "1")
+ os.environ.get("TORCHSTORE_RDMA_CHUNK_SIZE_MB", "1")
```
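For context, a minimal sketch of how a chunk-size knob like this might feed a chunked byte view; only the env var name comes from the diff, the helper itself is hypothetical:

```python
import os
import torch

# Env var name as suggested above; the 1 MB default is the HF-model
# workaround described in the PR summary.
RDMA_CHUNK_SIZE_MB = int(os.environ.get("TORCHSTORE_RDMA_CHUNK_SIZE_MB", "1"))

def chunked_byte_view(tensor: torch.Tensor):
    """Split a tensor's storage into fixed-size byte chunks (hypothetical helper)."""
    chunk_bytes = RDMA_CHUNK_SIZE_MB * 1024 * 1024
    # Reinterpret the (contiguous) storage as raw bytes, then split it.
    byte_view = tensor.contiguous().flatten().view(torch.uint8)
    return torch.split(byte_view, chunk_bytes)
```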
```python
# we should consider turning this into a "PendingTensor" class,
# and having these functions defined there instead.
# should also totally simplify the logic here
# TODO: Utility fucntions may make more sense in a
```
```diff
- # TODO: Utility fucntions may make more sense in a
+ # TODO: Utility functions may make more sense in a
```
I mostly had naming-related nits, looking forward to the next PR!
```python
message = Message.from_any(inplace_tensor)

return inplace_tensor
fetched_tensor = await pipe.get_from_storage_volume(key, message)
```
can this API be simplified to just a get(...)?
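One possible shape for that simplification, purely as a sketch: `Message.from_any` and `get_from_storage_volume` come from the diff above, while the wrapper itself is hypothetical.

```python
async def get(self, key: str, inplace_tensor=None):
    # Hypothetical convenience wrapper: hide Message construction and the
    # storage-volume call behind a single entry point.
    message = Message.from_any(inplace_tensor)
    return await self.pipe.get_from_storage_volume(key, message)
```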
This looks incredible! Left some noob questions.
```python
# recv
async def write_from(self, tensor):
    self.tensor = tensor
```
The contract of write_from is that the caller is responsible for maintaining the integrity of the passed-in tensor until it's read (for example, not modifying it unless intended), correct? Hence storing a reference to tensor?
The contract around monarch.tensor_engine.RDMABuffer is such, yes. There are also some other gotchas which force us to create copies locally, top of mind being no support for non-contiguous tensors.
Ah sorry, as it relates to the monarch buffer, that is also the current state, yes. In the future, when we provide mechanisms for asynchronous puts/gets, we'll have to think about this a little more.
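To make the non-contiguous gotcha mentioned above concrete, here is a sketch of the defensive copy it forces; the method body is illustrative, not the PR's actual code:

```python
async def write_from(self, tensor):
    # RDMA registration needs contiguous memory, so a non-contiguous
    # tensor (e.g. a transposed view) is copied locally first.
    if not tensor.is_contiguous():
        tensor = tensor.contiguous()  # allocates a local copy
    # Per the contract above: the caller keeps `tensor` alive and
    # unmodified until it has been read on the other side.
    self.tensor = tensor
```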
```python
# else: we are in the remote case (in a different process), and must read from
# the rdma buffer
try:
    for idx, chunk in enumerate(chunked_byte_view):
```
Should we gather instead of reading sequentially?
This is a good idea. I think we can follow up in a future diff, but I'll add a TODO.
The reason I didn't do this from the start is that I recalled running into some issues with rdma buffer when running async in the past.
Got it! That makes perfect sense. My initial thought was that if we can get concurrent RDMA buffer reads working, then maybe performance for smaller chunk sizes (say 1 MB) would be better. But I agree this can be added later, assuming it works.
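For reference, a sketch of the gathered variant under discussion, assuming each chunk read is independently awaitable; `read_into` is a stand-in for whatever the real per-chunk call is:

```python
import asyncio

async def read_all_chunks(buffers, chunked_byte_view):
    # Issue every chunk read at once instead of awaiting them one by one;
    # only worth it if concurrent RDMA buffer reads turn out to be safe.
    await asyncio.gather(
        *(buf.read_into(chunk) for buf, chunk in zip(buffers, chunked_byte_view))
    )
```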
| "tensor": tensor | ||
| } | ||
| 
               | 
          ||
| async def put(self, key: str, transport_buffer: torch.Tensor, message: Message): | 
```diff
- async def put(self, key: str, transport_buffer: torch.Tensor, message: Message):
+ async def put(self, key: str, transport_buffer: TransportBuffer, message: Message):
```
```python
self.kv[key] = tensor

async def get(self, key: str, transport_buffer: torch.Tensor, message: Message):
```
```diff
- async def get(self, key: str, transport_buffer: torch.Tensor, message: Message):
+ async def get(self, key: str, transport_buffer: TransportBuffer, message: Message):
```
Adds RDMA Support & generic transport interface.
Interface
Introduces "pipe", "message", & "transport buffer". This should be generic enough for now for us to also support shared memory.
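A rough sketch of what the three pieces might look like as abstract interfaces: write_from and get_from_storage_volume appear in the review above, while read_into and put_to_storage_volume are assumed counterparts, not confirmed by the PR.

```python
from abc import ABC, abstractmethod

class TransportBuffer(ABC):
    """Owns the bytes in flight; RDMA and shared-memory variants implement this."""

    @abstractmethod
    async def write_from(self, tensor): ...  # quoted in the review above

    @abstractmethod
    async def read_into(self, tensor): ...   # assumed counterpart

class Pipe(ABC):
    """Moves data described by a Message to/from a storage volume."""

    @abstractmethod
    async def get_from_storage_volume(self, key, message): ...  # quoted above

    @abstractmethod
    async def put_to_storage_volume(self, key, message): ...    # assumed counterpart
```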
Gotchas (read before using)
- Using monarch nightly from 8.2. On latest monarch, we need D81395338 for RDMA support. Disable via TORCHSTORE_RDMA_ENABLED to work around.
- On HF models, for reasons we're not yet sure of, certain tensors cause RDMA to fail unless chunk size = 1 MB. Still trying to understand what's going on here, but it causes us to lose the majority of our perf.
Benchmarks
- TestStore.test_large_tensors
- TestHFModel.test_basic (qwen 1 -> 1)
New Env Vars
HYPERACTOR_CODEC_MAX_FRAME_LENGTH, which controls the max frame length that otherwise prevents us from sending large messages via monarch RPC.
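For example, a launcher might pin both knobs up front; the values here are illustrative only, not recommendations from the PR:

```python
import os

# Raise the monarch RPC frame limit (1 GiB here, purely illustrative) and
# keep the 1 MB RDMA chunk workaround described in the gotchas above.
os.environ.setdefault("HYPERACTOR_CODEC_MAX_FRAME_LENGTH", str(1024**3))
os.environ.setdefault("TORCHSTORE_RDMA_CHUNK_SIZE_MB", "1")
```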