Minimize awaiting in the primary qorb pool's tokio::select #118
Conversation
src/slot.rs
Outdated
| name = "Slots::claim", | ||
| fields(name = ?self.name), | ||
| )] | ||
| fn claim(&mut self, id: ClaimId) -> Result<claim::Handle<Conn>, Error> { |
This function has basically been moved from the SetWorker to this new struct, Slots
And, the whole chain of awaits has been unraveled because this used to be a message sent to the SetWorker actor, but is now done by just locking the Slots?
The SlotSet used to have an async fn claim method that used a oneshot to communicate with the SetWorker. The worker then called a synchronous claim function and returned the result.
This new structure avoids the message passing to the SetWorker by pulling out this Slots structure. Now there's no message passing -- as you said, we're just locking Slots and pulling the connection out of it.
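To make the shape change concrete, here's a rough before/after sketch with simplified stand-in types (these are not qorb's actual definitions):

```rust
use std::sync::{Arc, Mutex};
use tokio::sync::{mpsc, oneshot};

// Simplified stand-ins for qorb's real types, just to show the shape change.
struct ClaimId(u64);
struct Handle;
struct Error;
struct Slots;

impl Slots {
    fn claim(&mut self, _id: ClaimId) -> Result<Handle, Error> {
        // ... synchronously take a connected, unclaimed slot ...
        Ok(Handle)
    }
}

// Before: `claim` was async, sending a request to the worker task and
// awaiting its reply over a oneshot channel.
async fn claim_via_worker(
    tx: &mpsc::Sender<(ClaimId, oneshot::Sender<Result<Handle, Error>>)>,
    id: ClaimId,
) -> Result<Handle, Error> {
    let (reply_tx, reply_rx) = oneshot::channel();
    tx.send((id, reply_tx)).await.map_err(|_| Error)?;
    reply_rx.await.map_err(|_| Error)?
}

// After: the caller locks the shared Slots and claims directly -- no
// message passing, no await.
fn claim_directly(slots: &Arc<Mutex<Slots>>, id: ClaimId) -> Result<Handle, Error> {
    slots.lock().unwrap().claim(id)
}
```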
| name = "SetWorker::create_slot" | ||
| name = "Slots::take_connected_unclaimed_slot" | ||
| )] | ||
| fn take_connected_unclaimed_slot( |
This function has basically been moved from the SetWorker to this new struct, Slots
```rust
        self.conform_slot_count();
    }

    fn conform_slot_count(&mut self) {
```
This function has basically been moved from the SetWorker to this new struct, Slots
```rust
// Provides direct access to all underlying slots
//
// Shared by both a [`SetWorker`] and [`Set`]
struct Slots<Conn: Connection> {
```
This is the crux of this change: we move this data out of the SetWorker into a std::sync::Mutex-wrapped object, so callers can access it from the pool without using SetRequests and message-passing.
This means we can avoid await-ing from the pool.
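Roughly, the sharing looks like this -- a hypothetical, simplified layout (the field contents are placeholders) just to illustrate who holds the lock:

```rust
use std::sync::{Arc, Mutex};

// Simplified placeholder; the real Slots holds slot state, the wanted count, etc.
struct Slots {}

// Pool-facing handle: claims synchronously by locking the shared Slots.
struct Set {
    slots: Arc<Mutex<Slots>>,
}

// Background worker: resizes the set and recycles returned slots through the
// same shared Slots, instead of receiving SetRequest messages for claims.
struct SetWorker {
    slots: Arc<Mutex<Slots>>,
}

fn new_set() -> (Set, SetWorker) {
    let slots = Arc::new(Mutex::new(Slots {}));
    (Set { slots: Arc::clone(&slots) }, SetWorker { slots })
}
```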
```diff
             Some(Request::Claim { id, tx }) => {
-                self.claim_or_enqueue(id, tx).await
+                self.claim_or_enqueue(id, tx)
             }
```
This is "why" we're making this change -- so we don't need to await in the select arms, to reduce the risk of futurelock.
hawkw left a comment
Overall, I think making this synchronous is the right move. However, I did note that we may be able to get away with using a read-write lock around the Slots instead of a Mutex, so that multiple claimants can claim slots concurrently, and the entire thing only needs to be locked when changing the number of slots. I think that would make claiming slots less of a bottleneck. Does that make sense to you?
src/slot.rs
Outdated
| name = "Slots::take_connected_unclaimed_slot" | ||
| )] | ||
| fn take_connected_unclaimed_slot( | ||
| &mut self, |
Why does this borrow self mutably? It looks like each slot has its own mutex, and this function just iterates over them immutably and locks each slot. At a glance, I think that this should be able to take &self.
If we can make that change, I think that Slots::claim could be changed to take &self. Then, we could change the Mutex around the Slots to be an RwLock, allowing concurrent calls to claim, and making it so that only set_wanted_count needs a write lock. That way, we might be able to get away with locking the whole thing only when we are changing the number of slots, and not when we're claiming them.
I think this should work, but I may be overlooking a place where we mutate the whole Slots in the claim path...?
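A rough sketch of what that suggestion could look like -- take_connected_unclaimed_slot and set_wanted_count are names from the PR, but the field layout and return types here are simplified/hypothetical:

```rust
use std::sync::{Arc, Mutex, RwLock};

struct Slot {
    connected: bool,
    claimed: bool,
}

struct Slots {
    // Each slot has its own lock, so scanning them needs only `&Slots`.
    slots: Vec<Mutex<Slot>>,
    wanted_count: usize,
}

impl Slots {
    // Shared borrow: we only lock individual slots while scanning.
    fn take_connected_unclaimed_slot(&self) -> Option<usize> {
        self.slots.iter().position(|s| {
            let mut slot = s.lock().unwrap();
            if slot.connected && !slot.claimed {
                slot.claimed = true;
                true
            } else {
                false
            }
        })
    }
}

struct Set {
    slots: Arc<RwLock<Slots>>,
}

impl Set {
    fn claim(&self) -> Option<usize> {
        // Read lock: many claimants could do this concurrently.
        self.slots.read().unwrap().take_connected_unclaimed_slot()
    }

    fn set_wanted_count(&self, count: usize) {
        // Write lock: only resizing needs to exclude everyone else.
        self.slots.write().unwrap().wanted_count = count;
    }
}
```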
You're definitely right that we can change &mut self to &self -- and I'll make that change now. On the RwLock note -- I think it's technically true that we could make it an RwLock to allow multiple readers to claim concurrently.
However, the pool is definitely serializing these incoming claim requests, as it's checking each backend. I think this cost isn't terrible -- especially after this PR, it's basically doing quick synchronous work rather than blocking on any I/O -- but I'd want to do some benchmarking before moving to an RwLock, and have a compelling way to actually call claim concurrently. Since the pool is the only consumer, and it's not calling claim concurrently on any SlotSets, using an RwLock instead of a Mutex seems like it would just add some mild overhead without (yet) conferring an advantage.
Are you okay if I defer using the RwLock? I'm not fundamentally opposed to it; I just want to be cautious about introducing a change like that without a good higher-level justification, and without a more careful analysis of, e.g., whether writers could starve readers.
Oh, I misunderstood how this works; I thought that consumers were directly calling claim and that this might happen in parallel. If the only consumer of the Slots type is the pool itself, what is the mutex around it doing here?
The only external consumer in this case is the "pool", which is calling claim directly. It's true that this could happen in parallel given the API of slot.rs, but the structure of pool.rs means that it currently won't be called in parallel. That's what I meant by the comment above -- an RwLock has the potential to improve things, but we'd need to rewrite pool.rs to make that actually useful.
There is an internal consumer of "all slots" - the SetWorker also has an Arc<Mutex<Slots>>, so it can:
- Change the total number of slots
- Recycle slots that get dropped and returned to qorb (this basically starts the recycling process; the actual I/O is managed by the slot itself).
Additionally, there's a per-slot task, spawned in create_slot, which locks individual slots, and does the actual work of recycling/health checking.
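A hypothetical, simplified sketch of that per-slot task (names and fields here are placeholders): it briefly checks state under its own slot's lock and does the slow I/O outside of it, never touching the outer Slots lock.

```rust
use std::sync::{Arc, Mutex};
use std::time::Duration;

struct Slot {
    needs_recycle: bool,
}

fn spawn_slot_task(slot: Arc<Mutex<Slot>>) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        let mut tick = tokio::time::interval(Duration::from_secs(5));
        loop {
            tick.tick().await;
            // Briefly check state under the per-slot lock...
            let should_recycle = slot.lock().unwrap().needs_recycle;
            if should_recycle {
                // ...then do the slow work (reconnect, health check) without
                // holding it, and record the result afterwards.
                slot.lock().unwrap().needs_recycle = false;
            }
        }
    })
}
```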
Yeah, that makes sense, thanks for the explanation. I think this change makes sense to me as-is, and I agree that it's better to keep it minimal. I do wonder about the potential opportunities for additional refactoring, but I think that's best done separately.
I filed #120 to keep track of this
This PR does not fully eliminate "await from tokio::select", nor does it fully eliminate the use of borrowed futures in tokio::select. However, for the main Pool worker, it avoids await-ing for accessing claims and rebalancing.