[object_buffer_pool] reduce mutex lock scope in WriteChunk (ray-project#43434)

Signed-off-by: Saurabh Vishwas Joshi <sjoshi@pinterest.com>

## Why are these changes needed?

Object store network transfer performance is slow, and we observe a periodic burst followed by a gap in network usage. A burst of inbound traffic occurs at the beginning of each `ray.get(obj_refs)` call, then there is a wide gap of unused network capacity, followed by another burst at the next `ray.get(obj_refs)` call. This looks like a processing bottleneck on the pull side when payloads are received over the network.

We dug into the code in [object_manager](https://github.com/ray-project/ray/tree/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager) and found the following:

- Objects are transferred in chunks of size 5 MiB.
- When a PushRequest is received for a chunk, it is processed by [ObjectManager::HandlePush](https://github.com/ray-project/ray/blob/7ff3969159d3aeac00415ac26bf96a63f782db86/src/ray/object_manager/object_manager.cc#L562), which internally calls [ObjectManager::ReceiveObjectChunk](https://github.com/ray-project/ray/blob/7ff3969159d3aeac00415ac26bf96a63f782db86/src/ray/object_manager/object_manager.cc#L623). This results in a call to [ObjectBufferPool::WriteChunk](https://github.com/ray-project/ray/blob/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager/object_buffer_pool.h#L128).
- The WriteChunk function is [mutex guarded](https://github.com/ray-project/ray/blob/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager/object_buffer_pool.cc#L122) throughout its execution.
  - This includes the [std::memcpy](https://github.com/ray-project/ray/blob/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager/object_buffer_pool.cc#L139) call for a 5 MiB payload.
  - The [pool_mutex_](https://github.com/ray-project/ray/blob/91d5af69085897b02d29bc0d15a53849e56eb8e4/src/ray/object_manager/object_buffer_pool.h#L215) lock is shared by all object IDs being received over the network.

This makes us believe that even if chunks for different object IDs are received in parallel over the network, they are written sequentially, which would explain why we see a burst in network usage followed by a gap.

### Changes

**WriteChunk**

- Transition the chunk from `REFERENCED` to `SEALED` before releasing the lock.
- Increment/decrement `num_inflight_copies` before/after the copy.
- Perform an unguarded memcpy of the chunk into the buffer.
- Reacquire the mutex lock and perform the object_id-level `Seal` and `Release` decisions.

**AbortCreate**

- Wait until `num_inflight_copies == 0` before allowing the `Release` and `Abort` calls for the `object_id`.
- This check ensures that we do not release the underlying buffer while an unguarded copy is in flight.

### Tests

#### Before

- Sampled network at 1 second frequency: `speedometer.py -i 1 -m 64424509440 -n 1073741824 -rx eth0`

  <img width="590" alt="Screenshot 2024-02-13 at 10 26 33 AM" src="https://github.com/ray-project/ray/assets/8691593/7a5497dd-b87d-4bec-a51c-62d629c06c58">

- Sampled network at 100 millisecond frequency: `speedometer.py -i 0.1 -m 64424509440 -n 1073741824 -rx eth0`

  <img width="656" alt="Screenshot 2024-02-13 at 10 26 37 AM" src="https://github.com/ray-project/ray/assets/8691593/5ac229e6-4777-4820-b9e7-ebbd63cafcc9">

#### After

- Sampled network at 1 second frequency: `speedometer.py -i 1 -m 64424509440 -n 1073741824 -rx eth0`

  <img width="620" alt="Screenshot 2024-02-21 at 1 22 08 PM" src="https://github.com/ray-project/ray/assets/8691593/f15196d6-01f7-4d3e-b56c-ba09c58be455">
- Sampled network at 100 millisecond frequency: `speedometer.py -i 0.1 -m 64424509440 -n 1073741824 -rx eth0`

  <img width="1674" alt="Screenshot 2024-02-21 at 1 19 10 PM" src="https://github.com/ray-project/ray/assets/8691593/584f64b9-a39f-4441-b034-86234aff4283">

```
Finished benchmark for total_size_MiB: 102400, block_size_MiB: 1024, parallel_block: None
ray.wait(fetch_local=True) Gbps: 54.67178658153866
ray.wait(fetch_local=True) total time s: 15.711823463439941
ray.get() Gbps: 186301.24111362518
ray.get() total time s: 0.004610776901245117
```