
Conversation

@smklein (Collaborator) commented Oct 29, 2025:

This PR does not fully eliminate await-ing from within tokio::select arms, nor does it fully eliminate the use of borrowed futures in tokio::select. However, for the main Pool worker, it avoids await-ing when accessing claims and when rebalancing.

src/slot.rs Outdated
name = "Slots::claim",
fields(name = ?self.name),
)]
fn claim(&mut self, id: ClaimId) -> Result<claim::Handle<Conn>, Error> {
@smklein (Collaborator, Author):

This function has basically been moved from the SetWorker to this new struct, Slots

@hawkw (Member):

And, the whole chain of awaits has been unraveled because this used to be a message sent to the SetWorker actor, but is now done by just locking the Slots?

@smklein (Collaborator, Author):

The SlotSet used to have an async fn claim method that used a oneshot channel to communicate with the SetWorker. The worker then called a synchronous claim function and returned the result.

This new structure avoids that message passing by pulling the Slots structure out of the SetWorker. Now there's no message passing -- as you said, yeah, we're just locking Slots and pulling the connection out of it.
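For reference, here's roughly the shape of that change -- just a sketch with stand-in types (ClaimId, Handle, Error, the unclaimed Vec, and claim_via_worker are simplified placeholders, not the real qorb definitions):

```rust
use std::sync::{Arc, Mutex};
use tokio::sync::{mpsc, oneshot};

// Stand-ins for the real qorb types.
struct ClaimId(u64);
struct Handle;
#[derive(Debug)]
struct Error;

enum SetRequest {
    Claim { id: ClaimId, tx: oneshot::Sender<Result<Handle, Error>> },
}

struct Slots {
    unclaimed: Vec<Handle>,
}

impl Slots {
    // The synchronous claim that previously only the worker could call.
    fn claim(&mut self, _id: ClaimId) -> Result<Handle, Error> {
        self.unclaimed.pop().ok_or(Error)
    }
}

struct Set {
    worker_tx: mpsc::Sender<SetRequest>,
    slots: Arc<Mutex<Slots>>,
}

impl Set {
    // Old shape: an async round-trip through the worker over a oneshot.
    async fn claim_via_worker(&self, id: ClaimId) -> Result<Handle, Error> {
        let (tx, rx) = oneshot::channel();
        self.worker_tx
            .send(SetRequest::Claim { id, tx })
            .await
            .map_err(|_| Error)?;
        rx.await.map_err(|_| Error)?
    }

    // New shape: lock the shared Slots and do the work inline, no awaiting.
    fn claim(&self, id: ClaimId) -> Result<Handle, Error> {
        self.slots.lock().unwrap().claim(id)
    }
}
```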

name = "SetWorker::create_slot"
name = "Slots::take_connected_unclaimed_slot"
)]
fn take_connected_unclaimed_slot(
@smklein (Collaborator, Author):

This function has basically been moved from the SetWorker to this new struct, Slots

self.conform_slot_count();
}

fn conform_slot_count(&mut self) {
@smklein (Collaborator, Author):

This function has basically been moved from the SetWorker to this new struct, Slots

// Provides direct access to all underlying slots
//
// Shared by both a [`SetWorker`] and [`Set`]
struct Slots<Conn: Connection> {
@smklein (Collaborator, Author):

This is the crux of this change: we move this data out of the SetWorker, into a std::sync::Mutex-wrapped object, so callers can access it from the pool without using SetRequests and message-passing.

This means we can avoid await-ing from the pool.
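Ownership-wise, it ends up looking roughly like this (fields trimmed away; new_set is a made-up constructor just to show the sharing):

```rust
use std::sync::{Arc, Mutex};

struct Slots { /* the slot state now lives here, not inside the worker */ }

// The pool-facing handle and the background worker each hold a clone of the
// same Arc, so claims no longer need a SetRequest round-trip.
struct Set {
    slots: Arc<Mutex<Slots>>,
}

struct SetWorker {
    slots: Arc<Mutex<Slots>>,
}

// Hypothetical constructor, just to show the sharing.
fn new_set() -> (Set, SetWorker) {
    let slots = Arc::new(Mutex::new(Slots {}));
    (Set { slots: slots.clone() }, SetWorker { slots })
}
```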

Some(Request::Claim { id, tx }) => {
self.claim_or_enqueue(id, tx).await
self.claim_or_enqueue(id, tx)
}
@smklein (Collaborator, Author):

This is "why" we're making this change -- so we don't need to await in the select arms, to reduce the risk of futurelock.

@hawkw (Member) left a comment:

Overall, I think making this synchronous is the right move. However, I did note that we may be able to get away with using a read-write lock around the Slots instead of a Mutex, so that multiple claimants can claim slots concurrently, and the entire thing only needs to be locked when changing the number of slots. I think that would make claiming slots less of a bottleneck. Does that make sense to you?

src/slot.rs Outdated
name = "Slots::take_connected_unclaimed_slot"
)]
fn take_connected_unclaimed_slot(
&mut self,
@hawkw (Member):

Why does this borrow self mutably? It looks like each slot has its own mutex, and this function just iterates over them immutably and locks each slot. At a glance, I think that this should be able to take &self.

If we can make that change, I think that Slots::claim could be changed to take &self. Then, we could change the Mutex around the Slots to be an RwLock, allowing concurrent calls to claim, and making it so that only set_wanted_count needs a write lock. That way, we might be able to get away with locking the whole thing only when we are changing the number of slots, and not when we're claiming them.

I think this should work, but I may be overlooking a place where we mutate the whole Slots in the claim path...?
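Concretely, I'm picturing roughly this shape (a sketch only -- the per-slot Option, Handle, Error, and ClaimId are simplified stand-ins):

```rust
use std::sync::{Arc, Mutex, RwLock};

struct ClaimId(u64);
struct Handle;
#[derive(Debug)]
struct Error;

// Each slot keeps its own Mutex, so scanning the slots only needs &self.
struct Slot(Mutex<Option<Handle>>);

struct Slots {
    slots: Vec<Slot>,
}

impl Slots {
    // Read path: iterate immutably, briefly locking each slot.
    fn claim(&self, _id: ClaimId) -> Result<Handle, Error> {
        for slot in &self.slots {
            if let Some(handle) = slot.0.lock().unwrap().take() {
                return Ok(handle);
            }
        }
        Err(Error)
    }

    // Write path: changing the slot count mutates the Vec itself.
    fn set_wanted_count(&mut self, count: usize) {
        self.slots.resize_with(count, || Slot(Mutex::new(None)));
    }
}

struct Set {
    slots: Arc<RwLock<Slots>>,
}

impl Set {
    // Many claimants could hold the read lock at once...
    fn claim(&self, id: ClaimId) -> Result<Handle, Error> {
        self.slots.read().unwrap().claim(id)
    }

    // ...and only resizing takes the write lock.
    fn set_wanted_count(&self, count: usize) {
        self.slots.write().unwrap().set_wanted_count(count);
    }
}
```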

@smklein (Collaborator, Author):

You're definitely right that we can change &mut self to &self -- and I'll make that change now. On the RwLock note: I think it's technically true that we could make it an RwLock to allow multiple readers to claim concurrently.

However, the pool is definitely serializing these incoming claim requests, as it's checking each backend. I think this cost isn't terrible -- especially after this PR, it's basically doing quick synchronous work rather than blocking on any I/O -- but I'd kinda want to do some benchmarking before moving to an RwLock, and to have a compelling way to actually call claim concurrently. Since the pool is the only consumer, and it's not calling claim concurrently on any SlotSets, using an RwLock instead of a Mutex seems like it would just add some mild overhead without (yet) conferring an advantage.

You okay if I defer using the RwLock? I'm not fundamentally opposed to using it, I just want to be cautious with introducing a change like that without a good higher-level justification, and without a more careful analysis of e.g., "could writers starve readers".

@hawkw (Member):

Oh, I misunderstood how this works; I thought that consumers were directly calling claim and that this might happen in parallel. If the only consumer of the Slots type is the pool itself, what is the mutex around it doing here?

@smklein (Collaborator, Author):

The only external consumer in this case is the "pool", which calls claim directly. It's true that the API in slot.rs would allow this to happen in parallel, but the structure of pool.rs means that it currently won't be called in parallel. That's kinda what I meant by the comment above -- an RwLock has the potential to improve things, but we'd need to rewrite pool.rs to make that actually useful.

There is an internal consumer of "all slots" - the SetWorker also has an Arc<Mutex<Slots>>, so it can:

  • Change the total number of slots
  • Recycle slots that get dropped and returned to qorb (this basically starts the recycling process; the actual I/O is managed by the slot itself).

Additionally, there's a per-slot task, spawned in create_slot, which locks individual slots, and does the actual work of recycling/health checking.
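That last part looks roughly like this (a sketch; SlotInner, needs_recycle, and the 30-second interval are invented for illustration -- only create_slot is a real name here):

```rust
use std::sync::{Arc, Mutex};
use std::time::Duration;

// Per-slot state behind its own Mutex, so this task and the claim path can
// touch a single slot without holding the Slots-wide lock.
struct SlotInner {
    needs_recycle: bool,
}

// Rough stand-in for the task spawned from create_slot.
fn spawn_slot_task(slot: Arc<Mutex<SlotInner>>) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        let mut interval = tokio::time::interval(Duration::from_secs(30));
        loop {
            interval.tick().await;
            // Lock only this slot; no await happens while the guard is held.
            let mut inner = slot.lock().unwrap();
            if inner.needs_recycle {
                // The real recycling / health-check I/O would go here.
                inner.needs_recycle = false;
            }
        }
    })
}
```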

@hawkw (Member):

Yeah, that makes sense, thanks for the explanation. I think this change makes sense to me as-is, and I agree that it's better to keep it minimal. I do wonder about the potential opportunities for additional refactoring, but I think that's best done separately.

@smklein (Collaborator, Author):

I filed #120 to keep track of this

@smklein smklein merged commit e6b4628 into master Oct 31, 2025
4 checks passed
@smklein smklein deleted the minimize-await branch October 31, 2025 21:01