
RUST-556 POC of maxConnecting #259

Merged: 15 commits merged into mongodb:master on Nov 2, 2020

Conversation

@patrickfreed (Contributor)

RUST-556

This PR implements a POC of the maxConnecting requirement of the "Avoiding Connection Storms" project. In doing so, it also re-architects the connection pool to use a channel-based WaitQueue rather than a semaphore-based one. This was done to allow for easier and more effective maintenance of the wait queue now that we need to support arbitrary conditions for exiting the queue (e.g. maxConnecting).

I realize that a rewrite of this magnitude typically goes through a design phase, so I apologize for suddenly dropping it like this. I decided it was the preferable choice for this POC largely because a counting-semaphore wait queue was frustrating to extend for this case, and because this lockless, channel-based pattern is a much more common approach to accessing shared state in the async world than our previous lock-heavy one. Maintaining the old pool and adding support for maxConnecting would introduce even more locking while also producing a suboptimal solution (it is impossible, or at least difficult, to ensure fairness with multiple semaphores/locks in this case), so this seemed like a good time for a rewrite. Additionally, this is a proof-of-concept after all, so I figured it would be a good time to proof-of-concept what an idiomatic async pool would look like too. I think I should have opted for a pool like this back when I originally converted the old pool to async, but I was lacking experience with async patterns at the time. Ironically, the original sync pool with its queue of condvars was somewhat similar to this one.

Overview of new pool

I'll give a brief overview of the layout of the new pool here to make digesting the actual code a little easier. We have the same clonable ConnectionPool type as before with the same functionality; however, when we create a new pool, instead of creating a ConnectionPoolInner type that is arc'd and wrapping that, we start a worker task (ConnectionPoolWorker) that opens up a few different channels, and we store senders to those channels (ConnectionRequester, PoolManager) on the ConnectionPool.

The worker task is in charge of all changes and access to the connection pool state; to do anything with the pool, you need to send a request via one of the two senders mentioned above. To check out a connection, for example, a thread sends a ConnectionRequest via a ConnectionRequester and waits for the worker to respond. To perform any other pool interaction (e.g. clearing, checking in), a thread uses the PoolManager to send the appropriate request.

The ConnectionRequesters act like strong references to the pool in that they keep the worker alive until they are all dropped, whereas PoolManagers are like weak references and can still be in scope when the worker ceases execution. A ConnectionPool instance holds onto both a ConnectionRequester and a PoolManager, but a Connection, for example, only holds onto a PoolManager in order to check itself back in. Instead of having a separate background task for ensuring minPoolSize, the worker task just takes care of that.
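To make that layout a bit more concrete, here is a minimal, hedged sketch of the worker/requester/manager split. It is not the PR's actual code: tokio's mpsc and oneshot channels stand in for the driver's runtime-agnostic channels, and the message types and pool state are simplified placeholders.

use tokio::sync::{mpsc, oneshot};

struct Connection; // placeholder for the real pooled connection type

// A check-out request carries a oneshot sender that the worker uses to
// hand a connection back to the requesting task.
struct ConnectionRequest {
    reply: oneshot::Sender<Connection>,
}

// Management requests (check-in, clear, etc.) need no reply and do not
// keep the worker alive.
enum ManagementRequest {
    CheckIn(Connection),
    Clear,
}

// Strong handle: the worker keeps running as long as any of these exist.
#[derive(Clone)]
struct ConnectionRequester(mpsc::Sender<ConnectionRequest>);

// Weak handle: requests sent after the worker exits simply go nowhere.
#[derive(Clone)]
struct PoolManager(mpsc::UnboundedSender<ManagementRequest>);

// The worker task owns all pool state; nothing else touches it directly.
struct ConnectionPoolWorker {
    request_rx: mpsc::Receiver<ConnectionRequest>,
    manager_rx: mpsc::UnboundedReceiver<ManagementRequest>,
    available: Vec<Connection>,
}

impl ConnectionPoolWorker {
    async fn execute(mut self) {
        loop {
            tokio::select! {
                // recv() returns None once every ConnectionRequester has
                // been dropped, at which point the worker shuts down.
                request = self.request_rx.recv() => match request {
                    Some(ConnectionRequest { reply }) => {
                        // In the real pool a new connection would be
                        // established here when none are idle (subject to
                        // maxConnecting); the sketch just conjures one.
                        let conn = self.available.pop().unwrap_or(Connection);
                        let _ = reply.send(conn);
                    }
                    None => break,
                },
                // Check-ins and clears arrive from PoolManagers held by
                // ConnectionPool clones and checked-out Connections.
                Some(msg) = self.manager_rx.recv() => match msg {
                    ManagementRequest::CheckIn(conn) => self.available.push(conn),
                    ManagementRequest::Clear => self.available.clear(),
                },
            }
        }
    }
}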

In summary:

  • ConnectionPoolInner gone, replaced with a background worker task ConnectionPoolWorker that has exclusive access to pool state
  • ConnectionPoolWorker listens on a few channels for incoming requests, responding to them as necessary
  • Threads can check out connections via ConnectionRequesters. The ConnectionPoolWorker quits once all the requesters are dropped
  • Threads can modify the pool state via PoolManagers. These do not keep the worker running and may no-op if they outlive the worker.
  • Connections use PoolManagers to check themselves back in

Some notes

Given this architecture, it was trivial to introduce maxConnecting: we simply don't poll the ConnectionRequestReceiver when the pool is empty and already has too many pending connections. The actual changes required to implement that part are all in the last few commits. Additionally, we now have a single context that always has exclusive access to the pool state, making it much easier to reason about. Lastly, the drop implementations of the various connection types have been simplified with the removal of ConnectionPoolInner and DroppedConnectionState.
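As a hedged illustration of that gating (again, not the PR's actual code), the condition the worker checks before polling for new check-out requests boils down to something like the following; the constant and field names here are assumptions made for the sketch.

const MAX_CONNECTING: usize = 2; // assumed limit, for illustration only

struct WorkerState {
    available_connections: usize, // idle connections ready to be checked out
    pending_connections: usize,   // connections currently being established
}

impl WorkerState {
    // Whether the worker should poll its ConnectionRequestReceiver on this
    // iteration of the select loop: either an idle connection can be handed
    // out immediately, or there is room to start establishing a new one.
    fn can_service_check_out(&self) -> bool {
        self.available_connections > 0 || self.pending_connections < MAX_CONNECTING
    }
}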

@patrickfreed (Contributor, Author) left a comment:

The clippy failures seem unrelated to this PR. The other failures are either due to the max files issue or should be fixed in #258 I think.


const TEST_DESCRIPTIONS_TO_SKIP: &[&str] = &[
"must destroy checked in connection if pool has been closed",
"must throw error if checkOut is called on a closed pool",
];

const EVENT_DELAY: Duration = Duration::from_millis(500);
@patrickfreed (Contributor, Author) commented on this diff:

This is just a temporary solution until the CMAP spec test changes in #258 go in. I wanted to avoid duplicating them here and just opted for the easy solution. Even with this generous delay, we still sometimes don't see the events on Windows with async-std, which is a bit concerning. The failures should go away once the aforementioned changes are merged in though.

@patrickfreed marked this pull request as ready for review on October 10, 2020 01:07
@saghm (Contributor) left a comment:

I've reviewed the refactor of the pool and added comments for it. I'll come back and do a second review of the updates for maxConnecting later this week.

@kmahar (Contributor) commented on Oct 15, 2020:

I’m going to mainly focus my review on the design you describe and the big picture here, rather than the code. I trust that Isabel and Sam can confirm your code does what you say it does 🙂

I mainly just have questions about the overall architecture. I don't think I necessarily know enough about CMAP or what an idiomatic async pool would look like to have strong opinions, so it might be worthwhile having Matt weigh in on this since he is familiar with the past pool designs, and also with maxConnecting, but I will defer to your judgement on whether that is needed.

We have the same clonable ConnectionPool type

This isn’t that important but just curious, IIUC something being clonable means you can make a deep copy of it, right? Why do we need that capability for our connection pool?

So what I understand so far is that (I think) there is a 1:1 pool-worker relationship. Is there also a single manager and a single requester per pool? I thought so from the overview, but then this sentence confused me, because it sounds like there are multiple requesters which a single worker is paying attention to:

The ConnectionPoolWorker quits once all the requesters are dropped

Also, at what point is a requester dropped?

@saghm (Contributor) commented on Oct 16, 2020:

IIUC something being clonable means you can make a deep copy of it, right? Why do we need that capability for our connection pool?

Generally, yes, this is correct. However, Rust's mechanism for reference-counted types is to define types in the standard library that implement the same Clone trait as other clonable types but with semantics that share the underlying data. In this case, we utilize those types for all of the mutable state in the connection pool, so cloning effectively produces a shallow copy that shares that state. We use the same strategy for the Client, Database, and Collection types to allow users to make copies to send to different threads/tasks.
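As a minimal illustration of that pattern (ClientInner and its field are placeholders here, not the driver's real definitions):

use std::sync::Arc;

struct ClientInner {
    // In the real driver, connection pools, topology state, etc. live here.
    uri: String,
}

// Cloning a Client only clones the Arc, i.e. bumps a reference count;
// every clone points at the same ClientInner.
#[derive(Clone)]
struct Client {
    inner: Arc<ClientInner>,
}

fn main() {
    let client = Client {
        inner: Arc::new(ClientInner {
            uri: "mongodb://localhost:27017".to_string(),
        }),
    };
    let clone = client.clone();
    // Both handles share the same underlying data.
    assert!(Arc::ptr_eq(&client.inner, &clone.inner));
}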

@kmahar (Contributor) commented on Oct 16, 2020:

@saghm Ahh I see, that makes sense.

Tangential question: it makes sense to me that a client would be a reference type, since you want to be able to share all of its underlying resources between objects that reference it. But why make databases and collections reference types, too? Would doing so mean that copying them ends up deep copying everything they store references to (such as clients)?

I'm comparing this to Swift, where if you copy a value type (say, MongoDatabase) that has a property which is a reference type (say, MongoClient), the copy of the database will automatically refer to the same client as the original database did.

@saghm (Contributor) commented on Oct 16, 2020:

You raise a good point here; I don't think we technically need to make Database and Collection reference counted right now, since the only mutable data they have (the internals of the underlying client) is already reference counted, so it doesn't actually make any semantic difference from the user's perspective. This is all internal implementation right now, though, so the only real effect of making Database and Collection reference counted is that it causes the small amount of data specific to the Database and Collection types (their names, read preferences, and read/write concerns) to be allocated on the heap rather than the stack.

@kmahar (Contributor) commented on Oct 16, 2020:

Gotcha, thanks for the explanation!

@patrickfreed (Contributor, Author) left a comment:

So what I understand so far is that (I think) there is a 1:1 pool-worker relationship. Is there also a single manager and a single requester per pool? I thought so from the overview, but then this sentence confused me, because it sounds like there are multiple requesters which a single worker is paying attention to. Also, at what point is a requester dropped?

This is partially explained by the cloning discussion above. While there is a 1:1 relationship between requesters and ConnectionPools, there can be many requesters for a single worker, because each instance of ConnectionPool has one and can be cloned. A requester is dropped when the ConnectionPool instance that owns it is dropped. ConnectionPools also contain managers, so there can be many managers per worker, and since checked-out Connections also have managers, there isn't a 1:1 relationship between pools and managers for a given worker.

@saghm (Contributor) commented on Oct 21, 2020:

The lint failures are likely due to the new minor version of Rust that came out since you started this PR; Isabel merged a PR to fix them last week, so if you rebase with master, I think they should go away

async fn execute(mut self) {
    let mut maintenance_interval = RUNTIME.interval(Duration::from_millis(500));

    loop {
@patrickfreed (Contributor, Author) commented on this diff:

I moved the unfinished request cache to the receiver itself, which made this loop/select a little simpler.
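For readers skimming the diff, a hedged guess at the shape of that change (the actual types live in the PR; tokio's mpsc channel and the ConnectionRequest placeholder are assumptions):

use tokio::sync::mpsc;

struct ConnectionRequest; // placeholder

struct ConnectionRequestReceiver {
    rx: mpsc::Receiver<ConnectionRequest>,
    // A request that was received but could not be satisfied yet (e.g.
    // because maxConnecting was reached) is stashed here rather than being
    // tracked as a separate case in the worker's select loop.
    cached: Option<ConnectionRequest>,
}

impl ConnectionRequestReceiver {
    async fn recv(&mut self) -> Option<ConnectionRequest> {
        match self.cached.take() {
            Some(request) => Some(request),
            None => self.rx.recv().await,
        }
    }

    fn cache_unfinished(&mut self, request: ConnectionRequest) {
        self.cached = Some(request);
    }
}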

A reviewer (Contributor) replied:

Awesome, this does make it a lot easier to follow!

@patrickfreed (Contributor, Author) commented:

Just rebased, hopefully that fixes them.

@saghm (Contributor) left a comment:

All the questions I had about this have been answered, and the fully green build is super nice, so I think this is ready for my LGTM!

@kmahar (Contributor) commented on Nov 2, 2020:

@patrickfreed Thanks for the explanation, that makes sense. Design SGTM then.

@patrickfreed merged commit 89a8e9c into mongodb:master on Nov 2, 2020.