SPEC-1555 Consider connection pool health during server selection #1

Closed
wants to merge 18 commits into from

Conversation

patrickfreed
Owner

SPEC-1555

DRAFT

This PR updates the Server Selection algorithm to consider connection pool health, as per the design in DRIVERS-781. It also updates the CMAP spec to require liveness checking of pooled connections. I have yet to write spec tests for this, but I figured it was still worth having a look over the prose.

This PR is currently based on my maxConnecting PR. I'll open a new one against the mongodb repo once that PR gets merged.

@patrickfreed
Owner Author

Testing the server selection bits via spec tests would require a new format that would entail some amount of work to implement, and such tests would need to exclude the random selection of the two servers in order to be deterministic. Given the lack of complete coverage, the relative simplicity of the algorithm, and the added work it would take to implement any tests, I wonder whether we ought to consider not including tests for this feature in the specification and instead simply recommend that drivers write their own tests as they see fit.

For reference, I imagine an example spec test would look like the following:

{
    "server1": { "activeConnections": 1, "availableConnections": 1 },
    "server2": { "activeConnections": 2, "availableConnections": 0 },
    "maxPoolSize": 5,
    "selected": "server1"
}

And drivers would need to mock the pools or separate out the function that selects from within the latency window to accommodate this data.
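
For illustration only, a harness along these lines could consume such a case by mocking the pools and isolating the in-window selection function. All of the names here (MockPool, MockServer, select_in_window) are hypothetical, not part of any proposed format:

import json

# All names here are hypothetical; this just sketches how a driver might mock
# its pools and isolate the in-window selection function for this test data.
class MockPool:
    def __init__(self, active, available):
        self.active_connection_count = active
        self.available_connection_count = available

class MockServer:
    def __init__(self, name, active, available):
        self.name = name
        self.pool = MockPool(active, available)

def run_case(case_json, select_in_window):
    case = json.loads(case_json)
    max_pool_size = case.pop("maxPoolSize")
    expected = case.pop("selected")
    servers = [
        MockServer(name, fields["activeConnections"], fields["availableConnections"])
        for name, fields in case.items()
    ]
    # The function under test is injected so the harness stays independent of
    # how a given driver structures its server selection code.
    assert select_in_window(servers, max_pool_size).name == expected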


An "active Connection" is a `Connection`_ that is either pending or in use. The
server selection algorithm takes the number of active connections a particular
server has into account when deciding among suitable choices.


Would it be more accurate to say 'a particular pool' instead of 'a particular server'?

Owner Author


fixed


- **Stale:** The `Connection`_ 's generation does not match the generation of the parent pool
- **Idle:** The `Connection`_ is currently "available" and has been for longer than **maxIdleTimeMS**.
- **Errored:** The `Connection`_ has experienced an error that indicates it is no longer recommended for use. To determine if a `Connection`_ is in such a state, the pool MUST check if it is “live” by using poll(), select(), or similar functionality available in the language’s networking library. This check MUST NOT block. This can be achieved by passing a timeout of 0 to poll(), for example. Examples of how a `Connection`_ might enter such a state include, but are not limited to:


This section seems to mix errors that happen when the connection is being actively used with errors that happen while the connection is idle in the pool. On a first read, this makes it seem like we should call poll() even after a client-side error occurs while using the connection (e.g. a socket timeout). It might be clearer to distinguish these two cases. Should we add a new term here like Dead?:

...
-  **Idle:** The `Connection`_ is currently "available" and has been for longer than **maxIdleTimeMS**.
-  **Errored:** The `Connection`_ has experienced an error that indicates it is no longer recommended for use.  Examples include, but are not limited to: ...
-  **Dead:** The `Connection`_ is currently "available" (i.e. idle in the pool) and has experienced an error. To determine if a `Connection`_ is in such a state, the pool MUST check if it is “live” by using poll(), select(), or similar functionality available in the language’s networking ...
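
(For illustration only, a minimal sketch of the kind of non-blocking liveness check described above, using Python's select module with a timeout of 0; actual driver implementations will differ by language and networking library.)

import select
import socket

def is_connection_dead(sock: socket.socket) -> bool:
    # Passing a timeout of 0 makes select() return immediately, so this check
    # never blocks. Readability on a socket that should be idle usually means
    # the peer closed it (a peek would return b"") or an error is pending.
    readable, _, errored = select.select([sock], [], [sock], 0)
    if errored:
        return True
    if readable:
        try:
            data = sock.recv(1, socket.MSG_PEEK)
        except OSError:
            return True
        return len(data) == 0  # orderly shutdown by the server
    return False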

Owner Author


Yeah, I like that distinction; updated to that. One question now: do we need another reason for a connection to be closed, or should we still repurpose the "error" reason for closing? I think error still fits here, but having a separate "dead" close reason also seems applicable.


.. _Server Discovery and Monitoring: https://github.com/mongodb/specifications/tree/master/source/server-discovery-and-monitoring
.. _heartbeatFrequencyMS: https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#heartbeatfrequencyms
.. _Max Staleness: https://github.com/mongodb/specifications/tree/master/source/max-staleness
.. _idleWritePeriodMS: https://github.com/mongodb/specifications/blob/master/source/max-staleness/max-staleness.rst#idlewriteperiodms
.. _Driver Authentication: https://github.com/mongodb/specifications/blob/master/source/auth
.. _Connection Pool: https://github.com/mongodb/specifications/blob/master/source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst#connection-pool


Can you use relative links here? The benefit is that the links will work correctly for PRs, branches, and forks. Like this:

.. _Connection Pool: /source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst#connection-pool
.. _Connection Monitoring and Pooling: /source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst#connection-pool

Owner Author


done

<https://www.nginx.com/blog/nginx-power-of-two-choices-load-balancing-algorithm/>`_
load balancing algorithm. The steps are as follows:

1. Select two servers at random from the set of available servers in the latency


This section (and the pseudocode below) should mention the case that there is only 1 server in the latency window. It's fairly obvious (just return the one server) but we should still mention it.

Owner Author


done

1. Select two servers at random from the set of available servers in the latency
window

2. If ``abs(server1.pool.activeConnectionCount - server2.pool.activeConnectionCount) > 0.05*maxPoolSize``,


Should 0.05*maxPoolSize be updated to max(0.05*maxPoolSize, 1)?

Owner Author


fixed

else:
selected = server2

return selected


Can we define a shared method that implements "Selecting servers within the latency window" so that we don't need to duplicate the pseudocode here?

Owner Author


done

if diff > max(0.05 * max_pool_size, 1):
selected = server in (server1, server2) with minimum active_connection_count

if server1.pool.available_connection_count >= server2.pool.available_connection_count:


I think this if is supposed to be an elif:

if diff > max(0.05 * max_pool_size, 1):
    selected = server in (server1, server2) with minimum active_connection_count
elif server1.pool.available_connection_count >= server2.pool.available_connection_count:
    selected = server1
else:
    selected = server2

Owner Author


yep, fixed
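
For reference, here is a runnable sketch of the within-window selection with the amendments discussed in this thread (single-server short-circuit, the max(0.05 * maxPoolSize, 1) threshold, and the elif tie-break). The server/pool attribute names are illustrative stand-ins, not normative spec pseudocode; this could be plugged into the mock harness sketched earlier.

import random

def select_server_in_window(servers, max_pool_size):
    # servers: the suitable servers within the latency window; each is assumed
    # to expose pool.active_connection_count and pool.available_connection_count.
    if len(servers) == 1:
        return servers[0]
    server1, server2 = random.sample(servers, 2)
    diff = abs(server1.pool.active_connection_count
               - server2.pool.active_connection_count)
    if diff > max(0.05 * max_pool_size, 1):
        # Prefer the server with fewer active (pending or in-use) connections.
        return min(server1, server2, key=lambda s: s.pool.active_connection_count)
    elif server1.pool.available_connection_count >= server2.pool.available_connection_count:
        return server1
    else:
        return server2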

As an added benefit, the new algorithm gives preference to nodes that have
recently been discovered and are thus more likely to be alive (e.g. during a
rolling restart). The narrowing to two random choices first ensures new servers
aren't overly preferred, however, preventing a "thundering herd" situation.


Doesn't this algorithm suffer from the "thundering herd" problem as soon as the "servers within the latency window" goes down to 2 servers? In the pure random selection each node would get 50% of the work, whereas in the new algorithm the new server would (potentially) get 100% of the work.

We could also consider what happens when the "servers within the latency window" goes down to 3 servers. In pure random selection each node would get 33% of the work, whereas in the new algorithm the new server would (potentially) get 66% of the work (it appears in two of the three possible random pairs and would win each comparison while its pool is empty).

With 4 servers, pure random gives 25% whereas the new algorithm gives a 50% worst case.

Is this acceptable? Should we consider only using this power of two choices algorithm when the number of servers in the latency window is above some threshold (maybe 3)?

Owner Author


I think in cases where there are only a few available nodes, the "thundering herd" problem is less of a concern. For example, in the case where we go from 1->2 servers, it's not really any more of a "herd" than was already happening when we were routing every single operation to the lone node. We could have an issue in a situation where the lone node is isolated but the new node is shared among driver instances, but I'm not sure if this is a case that actually happens in practice. Also remember that targeting newer nodes is a benefit of this algorithm, so imo it's okay that in the 3-node and 4-node cases the newer server sees some higher load initially.


For example, in the case where we go from 1->2 servers,

What about the case where there are already 2 servers (A and B) that are each getting 50%? A transient network error on A would clear A's pool and according to the new scheme, server A would get 100% of the requests until abs(server1.pool.active_connection_count - server2.pool.active_connection_count) > max(0.05 * max_pool_size, 1) returns False.

Also remember that targeting newer nodes is a benefit of this algorithm,

I find that it's not very straightforward to determine why this statement is true. As a counter example, consider the scenario where the client is idle (all pools have activeConnectionCount==0) and a server is restarted, will the next request prefer the restarted node? I think the answer is no.

The unwritten caveat here is that we only prefer new/restarted servers when there is a consistently high request load (activeConnectionCount > max(0.05 * max_pool_size, 1)). Is that right?

Owner Author


On the flip side of that example, while A is marked as unknown after the transient network error, B would be accepting 100% of the requests, so ideally we'd want to prefer the new server as soon as possible to start shedding some of that load. In general, for systems with only a single layer of redundancy, each node needs to be able to handle the entire workload, because otherwise the cluster will just go down whenever any single node does. Another thing that helps here: if most operations are short-lived, we should reach an equilibrium state pretty quickly, because operations on B would be finishing and checking their connections back in while we were routing everything to A, so their activeConnectionCounts would be converging from both directions.

One concern I do have in that scenario though is that we would be routing all our ops to A which will then likely hit maxConnecting, even if we have pooled connections sitting around in B, increasing latency. We know based on benchmarking that maxConnecting doesn't necessarily reduce latency in single server scenarios, but with the introduction of this load balancing algorithm it might do so in multi-server ones. We could add consideration of pending connections separate from "in use" ones to the selection algorithm, but I worry it would be impossible to get an accurate value for this without incorporating connection checkout into server selection / holding exclusive locks for longer than we ought to, which would increase latency anyways. Perhaps this would be another good benchmarking case to see if there are any noticeable increases in latency.

Overall, I definitely agree that this is a scenario worth discussing, but I'm not necessarily sure if addressing it is worth adding any complexity / special casing to the algorithm, and I'm worried that we may miss out on some of the benefits of the algorithm by not applying it to n=2 or n=3 cases.

I find that it's not very straightforward to determine why this statement is true.

I think the idea that preferring new servers is good comes from 1) new servers likely have less load than old ones (this is kinda circular logic though) and 2) new servers are more likely to remain up than older ones (e.g. during a rolling upgrade, new servers probably already have been updated and won't be taken down again, older servers could be down but SDAM just hasn't noticed yet or will be taken down shortly).

The unwritten caveat here is that we only prefer new/restarted servers when there is a consistently high request load

Yep, this portion of server selection is aiming to be a client-side load balancing solution, so if there isn't consistent load, it has little use, but that's okay.


I agree in principle with the idea of routing operations to new servers. When I said "it's not very straightforward to determine why this statement is true" I meant that it's difficult to understand how (and to what degree) this new algorithm satisfies that goal. I hope the new test format you added will make it easier to reason about the behavior.

On the flip side of that example, while A is marked as unknown after the transient network error, B would be accepting 100% of the requests, so ideally we'd want to prefer the new server as soon as possible to start shedding some of that load.

But in this case we wouldn't be shedding some of the load. We would be shedding all the load. I think the scenario above is certainly a potential regression where request load can unnecessarily flip-flop between two servers. I wonder if this case should be mentioned in the spec. It might prove useful down the line if we see a similar issue in the field.

Owner Author


I don't think the new test format will necessarily help shed too much light on how the algorithm performs in practice since the tests are more of a point-in-time verification. A key benefit of the algorithm in practice is that routing operations to an individual node makes it less likely for that node to receive future operations until the routed operations complete, but that sort of thing isn't accounted for in the spec tests, which perform the same selection many times. I'm going to do some ad-hoc experimentation on real deployments with the Rust driver to see how the algorithm actually handles a few different scenarios (this one included), and I'll let you know the results of that.

But in this case we wouldn't be shedding some of the load. We would be shedding all the load. I think the scenario above is certainly a potential regression where request load can unnecessarily flip-flop between two servers.

This is true, but only for short periods of time due to the convergence of activeConnectionCounts I alluded to in my previous comment. In fact, both nodes should reach equilibrium points more quickly under this scheme than in pure random chance.

Sending too much load to an individual node (hotspotting) even for a short period of time is certainly a concern though, but I think it's mostly kept in check by maxConnecting. For instance, current drivers would route roughly 50% of the load to the new server once it comes back up, but they will do so via a burst of new connections. In the new scheme, drivers will temporarily route 100% of the load to the new node, but only 2 at a time until the pool starts to fill. So it might be that, even though drivers will be routing a higher % of operations to the node once it comes up, the new node will actually receive a more manageable workload with this algorithm + maxConnecting than it would with existing drivers. There is a risk that overall latency worsens, though. I'll have to see when I do my experiments.

This does make me realize that we might need to take the length of the WaitQueue into account when calculating the load or health of a server. In this scenario, for instance, a bunch of operations could be queued on the new node at startup due to maxConnecting, but activeConnectionCount would still be 2, leading to further operations getting routed to that node in the meantime. If we factored the WaitQueue size into the comparisons we do, we'd potentially start routing operations to the other node sooner. We also get the added benefit of evenly distributing queued work when all connection pools are full. It's hard for me to know how much of a problem this is, though. What do you think?
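
Purely as a sketch of that idea (the field names are illustrative, not from the spec), the comparison might fold in the wait queue like this:

def load(server):
    # Demand on the server: operations already holding a connection (pending
    # or in use) plus operations queued waiting for one.
    return server.pool.active_connection_count + server.pool.wait_queue_size

def pick_less_loaded(server1, server2, max_pool_size):
    if abs(load(server1) - load(server2)) > max(0.05 * max_pool_size, 1):
        return min(server1, server2, key=load)
    # Otherwise fall back to the availability tie-break from the main algorithm.
    if server1.pool.available_connection_count >= server2.pool.available_connection_count:
        return server1
    return server2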

Owner Author


So I performed a few experiments on a sharded cluster with 2 mongoses. In each experiment, I spawned 50 tasks that each ran find_one a hundred times, and I measured how long it took to finish, how many times each shard was selected, and the connections created in each pool to service this workload. Besides one scenario noted below, all pools began as empty. I performed this experiment with the following scenarios:

  1. 2 normal mongoses
  2. 2 mongoses, 1 of which (A) has a 5% chance of find taking an extra 250ms
  3. 2 mongoses, 1 of which (A) has a 25% chance of find taking an extra 250ms
  4. 2 mongoses with minPoolSize=100, 1 of which (A) has a 25% chance of find taking an extra 250ms
     • I waited until the pools both reached 100 connections before starting this scenario
  5. 2 mongoses, both have a 5% chance of find taking an extra 250ms
  6. 2 mongoses, both have a 25% chance of find taking an extra 250ms

Here are the results: https://docs.google.com/spreadsheets/d/1r9CZRjPEbk43g5BLQFXY0Ql4iTlzw2qqyt2dj583XRs/edit?usp=sharing

Some notable points:

  • Pure random appears slightly faster in "normal" scenarios (i.e. 1, 2, 5). This seems to be due to the fact that considering health spreads connection creation out more evenly, resulting in more connections being created and more latency. The differences were really narrow, though, and this benchmarking was certainly not performed in an ideal environment, so it's not clear if this difference is meaningful or not.
  • Considering health leads to the driver preferring better performing nodes.
  • Surprisingly, considering health performed worse (latency-wise) than pure random in scenario 3. This also seems to be due to considering health creating more connections in total, which seems to outweigh the latency benefits of routing away from the slow node. This hypothesis is potentially confirmed by scenario 4, which populates the pools ahead of time and has health performing significantly better than pure random. Given a much larger operation count, we'd likely see health perform better in scenario 3 as well.
  • Considering health leads to more evenly populated pools.
  • Considering health performed better than pure random in Scenario 6 (lots of ops on both mongoses take a long time). This seems to align with the intended goal of better handling of slow operations.

None of these workloads actually made the pools reach maxPoolSize and start queuing operations, though when I did try one that did (e.g. burst of 5000 tasks at once), the results started to look the same between the two methods. This seems to indicate that we do in fact need to consider the length of the WaitQueue when calculating load, since otherwise both pools will be considered under equal load when they're full and we'll just choose at random. I'll update the spec to include this.

Note that I omitted the scenario we've been discussing here. I still need to figure out a good way to emulate that and record proper results. I figured these preliminary ones were worth sharing though.

@ShaneHarvey

ShaneHarvey commented Oct 20, 2020

I wonder if we then ought to consider not including tests for this feature in the specification and instead simply recommend drivers write their own tests as they see fit.

I don't think we can get away with that. At the least we need prose tests to cover various cases. And at that point I think it's less effort and more maintainable to create a new spec test format.

Testing the server selection bits via spec tests would require a new format that would entail some amount of work to implement, and such tests would need to exclude testing the random selection of two servers part in order to be deterministic.

It might be possible to account for random selection by having the driver run server selection 1000's of times and then compare the percentage of times each server gets selected (with some leeway of course). This kind of testing would also give us a better understanding of how the algorithm performs in various scenarios (different # of servers, different pool sizes, different # of active vs available connections, etc...).
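
For example (hypothetical helper names, just to illustrate the shape of such a test), a driver-side frequency check might look like:

from collections import Counter

def check_frequencies(servers, max_pool_size, expected_frequencies,
                      select_in_window, iterations=10_000, leeway=0.05):
    # Run selection many times against fixed mock pools and compare the
    # observed selection rate of each server to the expected rate, within
    # some tolerance.
    counts = Counter()
    for _ in range(iterations):
        counts[select_in_window(servers, max_pool_size).name] += 1
    for name, expected in expected_frequencies.items():
        observed = counts[name] / iterations
        assert abs(observed - expected) <= leeway, (name, observed, expected)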

@patrickfreed
Owner Author

I sketched out a test format and added a sample case, let me know what you think.


@ShaneHarvey left a comment


Neat, I like this new testing approach!


- ``in_window``: array of servers in the latency window that the selected server
is to be chosen from. Each element will have all of the following fields:

- ``id``: a unique string identifier for this server


The convention in our tests is to identify servers by address, like this: address: a:27017. Can we do the same in these tests (in both in_window and expected_frequencies)?

Owner Author


done

Each YAML file for these tests has the following format:

- ``in_window``: array of servers in the latency window that the selected server
is to be chosen from. Each element will have all of the following fields:


I think it would help implementors to add a TopologyDescription field to the test setup similar to the existing Server Selection Logic Tests. This would help drivers that need to create Topology/ServerDescriptions in order to test selection logic.

Owner Author


Done. I added it as just another field in the file, separate from the in_window one, seeing as connection pool information is likely stored separately from the actual ServerDescription type within drivers. Let me know if this is what you had in mind.


Nice, I'll test this out later today!

@patrickfreed
Owner Author

FYI - I opened a PR against the actual specifications repo with the current set of changes here: mongodb#876
