Adds RDMA Support & generic transport interface #13
Conversation
this is awesome :)
        
          
torchstore/transport/buffers.py (outdated)
```python
# but setting this chunk size works around the issue until we can fix it
# N.B. from benchmarking, we know the ideal size is any size >256mb.
RDMDA_CHUNK_SIZE_MB = int(
    os.environ.get("TORCHSTORE_RDMDA_CHUNK_SIZE_MB", "1")
```
```diff
- os.environ.get("TORCHSTORE_RDMDA_CHUNK_SIZE_MB", "1")
+ os.environ.get("TORCHSTORE_RDMA_CHUNK_SIZE_MB", "1")
```
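For context, a minimal sketch of how a chunk-size knob like this might feed a chunked byte view; only the env var name comes from the diff, the helper itself is hypothetical:

```python
import os
import torch

# Env var name as suggested above; the 1 MB default is the HF-model
# workaround described in the PR summary.
RDMA_CHUNK_SIZE_MB = int(os.environ.get("TORCHSTORE_RDMA_CHUNK_SIZE_MB", "1"))

def chunked_byte_view(tensor: torch.Tensor):
    """Split a tensor's storage into fixed-size byte chunks (hypothetical helper)."""
    chunk_bytes = RDMA_CHUNK_SIZE_MB * 1024 * 1024
    # Reinterpret the (contiguous) storage as raw bytes, then split it.
    byte_view = tensor.contiguous().flatten().view(torch.uint8)
    return torch.split(byte_view, chunk_bytes)
```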
```python
# we should consider turning this into a "PendingTensor" class,
# and having these functions defined there instead.
# should also totally simplify the logic here
# TODO: Utility fucntions may make more sense in a
```
```diff
- # TODO: Utility fucntions may make more sense in a
+ # TODO: Utility functions may make more sense in a
```
I mostly had naming-related nits, looking forward to the next PR!
```python
message = Message.from_any(inplace_tensor)

return inplace_tensor
fetched_tensor = await pipe.get_from_storage_volume(key, message)
```
can this API be simplified to just a get(...)?
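One possible shape for that simplification, purely as a sketch: `Message.from_any` and `get_from_storage_volume` come from the diff above, while the wrapper itself is hypothetical.

```python
async def get(self, key: str, inplace_tensor=None):
    # Hypothetical convenience wrapper: hide Message construction and the
    # storage-volume call behind a single entry point.
    message = Message.from_any(inplace_tensor)
    return await self.pipe.get_from_storage_volume(key, message)
```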
This looks incredible! Left some noob questions.
```python
# recv
async def write_from(self, tensor):
    self.tensor = tensor
```
The contract of write_from is that the caller is responsible for maintaining the integrity of the passed-in tensor until it's read (for example, not modifying it unless intended), correct? Hence storing a reference to tensor?
The contract around monarch.tensor_engine.RDMABuffer is such, yes. There are also some other gotchas which force us to create copies locally, top of mind being no support for non-contiguous tensors.
Ah sorry, as it relates to the monarch buffer, that is also the current state, yes. In the future, when we provide mechanisms for asynchronous puts/gets, we'll have to think about this a little more.
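To make the non-contiguous gotcha mentioned above concrete, here is a sketch of the defensive copy it forces; the method body is illustrative, not the PR's actual code:

```python
async def write_from(self, tensor):
    # RDMA registration needs contiguous memory, so a non-contiguous
    # tensor (e.g. a transposed view) is copied locally first.
    if not tensor.is_contiguous():
        tensor = tensor.contiguous()  # allocates a local copy
    # Per the contract above: the caller keeps `tensor` alive and
    # unmodified until it has been read on the other side.
    self.tensor = tensor
```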
```python
# else: we are in the remote case (in a different process), and must read from
# the rdma buffer
try:
    for idx, chunk in enumerate(chunked_byte_view):
```
Should we gather instead of reading sequentially?
This is a good idea. I think we can follow up in a future diff, but I'll add a TODO.
The reason I didn't do this from the start is that I recalled running into some issues with rdma buffer when running async in the past.
Got it! That makes perfect sense. My initial thought was that if we can get concurrent RDMA buffer reads working, then maybe performance for smaller chunk sizes (say 1 MB) would be better. But I agree this can be added later, assuming it works.
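For reference, a sketch of the gathered variant under discussion, assuming each chunk read is independently awaitable; `read_into` is a stand-in for whatever the real per-chunk call is:

```python
import asyncio

async def read_all_chunks(buffers, chunked_byte_view):
    # Issue every chunk read at once instead of awaiting them one by one;
    # only worth it if concurrent RDMA buffer reads turn out to be safe.
    await asyncio.gather(
        *(buf.read_into(chunk) for buf, chunk in zip(buffers, chunked_byte_view))
    )
```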
| "tensor": tensor | ||
| } | ||
| 
               | 
          ||
| async def put(self, key: str, transport_buffer: torch.Tensor, message: Message): | 
```diff
- async def put(self, key: str, transport_buffer: torch.Tensor, message: Message):
+ async def put(self, key: str, transport_buffer: TransportBuffer, message: Message):
```
```python
self.kv[key] = tensor

async def get(self, key: str, transport_buffer: torch.Tensor, message: Message):
```
```diff
- async def get(self, key: str, transport_buffer: torch.Tensor, message: Message):
+ async def get(self, key: str, transport_buffer: TransportBuffer, message: Message):
```
Adds RDMA Support & generic transport interface.
Interface
Introduces "pipe", "message", & "transport buffer". This should be generic enough for now for us to also support shared memory.
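A rough sketch of what the three pieces might look like as abstract interfaces: write_from and get_from_storage_volume appear in the review above, while read_into and put_to_storage_volume are assumed counterparts, not confirmed by the PR.

```python
from abc import ABC, abstractmethod

class TransportBuffer(ABC):
    """Owns the bytes in flight; RDMA and shared-memory variants implement this."""

    @abstractmethod
    async def write_from(self, tensor): ...  # quoted in the review above

    @abstractmethod
    async def read_into(self, tensor): ...   # assumed counterpart

class Pipe(ABC):
    """Moves data described by a Message to/from a storage volume."""

    @abstractmethod
    async def get_from_storage_volume(self, key, message): ...  # quoted above

    @abstractmethod
    async def put_to_storage_volume(self, key, message): ...    # assumed counterpart
```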
Gotchas (read before using)
- Using monarch nightly from 8.2. On latest monarch, we need D81395338 for RDMA support. Disable via TORCHSTORE_RDMA_ENABLED to work around.
- On HF models, for reasons we're not yet sure of, certain tensors cause RDMA to fail unless chunk size = 1 MB. Still trying to understand what's going on here, but it causes us to lose the majority of our perf.
Benchmarks
- TestStore.test_large_tensors
- TestHFModel.test_basic (qwen 1 -> 1)
New Env Vars
HYPERACTOR_CODEC_MAX_FRAME_LENGTH, which controls the max frame length that otherwise prevents us from sending large messages via monarch RPC.
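For example, a launcher might pin both knobs up front; the values here are illustrative only, not recommendations from the PR:

```python
import os

# Raise the monarch RPC frame limit (1 GiB here, purely illustrative) and
# keep the 1 MB RDMA chunk workaround described in the gotchas above.
os.environ.setdefault("HYPERACTOR_CODEC_MAX_FRAME_LENGTH", str(1024**3))
os.environ.setdefault("TORCHSTORE_RDMA_CHUNK_SIZE_MB", "1")
```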