SPEC-1555 Consider connection pool health during server selection #1

Closed
wants to merge 18 commits into from

Conversation

patrickfreed
Owner

SPEC-1555

DRAFT

This PR updates the Server Selection algorithm to consider connection pool health, as per the design in DRIVERS-781. It also updates the CMAP spec to require liveness checking of pooled connections. I have yet to write spec tests for this, but I figured it was still worth having a look over the prose.

This PR is currently based on my maxConnecting PR. I'll open a new one against the mongodb repo once that PR gets merged.

@patrickfreed
Owner Author

Testing the server selection bits via spec tests would require a new format that would entail some amount of work to implement, and such tests would need to exclude the random selection of the two servers in order to be deterministic. Given the lack of complete coverage, the relative simplicity of the algorithm, and the added work it would take to implement any tests, I wonder whether we ought to consider not including tests for this feature in the specification and instead simply recommend that drivers write their own tests as they see fit.

For reference, I imagine an example spec test would look like the following:

{
    "server1": { "activeConnections": 1, "availableConnections": 1 },
    "server2": { "activeConnections": 2, "availableConnections": 0 },
    "maxPoolSize": 5,
    "selected": "server1"
}

And drivers would need to mock the pools or separate out the function that selects from within the latency window to accommodate this data.
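
For illustration only, a harness along these lines could consume such a case by mocking the pools and isolating the in-window selection function. All of the names here (MockPool, MockServer, select_in_window) are hypothetical, not part of any proposed format:

import json

# All names here are hypothetical; this just sketches how a driver might mock
# its pools and isolate the in-window selection function for this test data.
class MockPool:
    def __init__(self, active, available):
        self.active_connection_count = active
        self.available_connection_count = available

class MockServer:
    def __init__(self, name, active, available):
        self.name = name
        self.pool = MockPool(active, available)

def run_case(case_json, select_in_window):
    case = json.loads(case_json)
    max_pool_size = case.pop("maxPoolSize")
    expected = case.pop("selected")
    servers = [
        MockServer(name, fields["activeConnections"], fields["availableConnections"])
        for name, fields in case.items()
    ]
    # The function under test is injected so the harness stays independent of
    # how a given driver structures its server selection code.
    assert select_in_window(servers, max_pool_size).name == expected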


An "active Connection" is a `Connection`_ that is either pending or in use. The
server selection algorithm takes the number of active connections a particular
server has into account when deciding among suitable choices.


Would it be more accurate to say 'a particular pool' instead of 'a particular server'?

Owner Author


fixed


- **Stale:** The `Connection`_ 's generation does not match the generation of the parent pool
- **Idle:** The `Connection`_ is currently "available" and has been for longer than **maxIdleTimeMS**.
- **Errored:** The `Connection`_ has experienced an error that indicates it is no longer recommended for use. To determine if a `Connection`_ is in such a state, the pool MUST check if it is “live” by using poll(), select(), or similar functionality available in the language’s networking library. This check MUST NOT block. This can be achieved by passing a timeout of 0 to poll(), for example. Examples of how a `Connection`_ might enter such a state include, but are not limited to:


This section seems to mix errors that happen when the connection is being actively used with errors that happen while the connection is idle in the pool. On a first read, this makes it seem like we should call poll() even after a client-side error occurs while using the connection (e.g. a socket timeout). It might be clearer to distinguish these two cases. Should we add a new term here like Dead?:

...
-  **Idle:** The `Connection`_ is currently "available" and has been for longer than **maxIdleTimeMS**.
-  **Errored:** The `Connection`_ has experienced an error that indicates it is no longer recommended for use.  Examples include, but are not limited to: ...
-  **Dead:** The `Connection`_ is currently "available" (i.e. idle in the pool) and has experienced an error. To determine if a `Connection`_ is in such a state, the pool MUST check if it is “live” by using poll(), select(), or similar functionality available in the language’s networking ...
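
(For illustration only, a minimal sketch of the kind of non-blocking liveness check described above, using Python's select module with a timeout of 0; actual driver implementations will differ by language and networking library.)

import select
import socket

def is_connection_dead(sock: socket.socket) -> bool:
    # Passing a timeout of 0 makes select() return immediately, so this check
    # never blocks. Readability on a socket that should be idle usually means
    # the peer closed it (a peek would return b"") or an error is pending.
    readable, _, errored = select.select([sock], [], [sock], 0)
    if errored:
        return True
    if readable:
        try:
            data = sock.recv(1, socket.MSG_PEEK)
        except OSError:
            return True
        return len(data) == 0  # orderly shutdown by the server
    return False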

Owner Author


Yeah, I like that distinction; updated to that. One question now: do we need another reason for a connection to be closed, or should we still repurpose the "error" reason for closing? I think error still fits here, but having a separate "dead" close reason also seems applicable.


.. _Server Discovery and Monitoring: https://github.com/mongodb/specifications/tree/master/source/server-discovery-and-monitoring
.. _heartbeatFrequencyMS: https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#heartbeatfrequencyms
.. _Max Staleness: https://github.com/mongodb/specifications/tree/master/source/max-staleness
.. _idleWritePeriodMS: https://github.com/mongodb/specifications/blob/master/source/max-staleness/max-staleness.rst#idlewriteperiodms
.. _Driver Authentication: https://github.com/mongodb/specifications/blob/master/source/auth
.. _Connection Pool: https://github.com/mongodb/specifications/blob/master/source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst#connection-pool


Can you use relative links here? The benefit is that the links will work correctly for PRs, branches, and forks. Like this:

.. _Connection Pool: /source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst#connection-pool
.. _Connection Monitoring and Pooling: /source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst#connection-pool

Owner Author


done

<https://www.nginx.com/blog/nginx-power-of-two-choices-load-balancing-algorithm/>`_
load balancing algorithm. The steps are as follows:

1. Select two servers at random from the set of available servers in the latency


This section (and the pseudocode below) should mention the case that there is only 1 server in the latency window. It's fairly obvious (just return the one server) but we should still mention it.

Owner Author


done

1. Select two servers at random from the set of available servers in the latency
window

2. If ``abs(server1.pool.activeConnectionCount - server2.pool.activeConnectionCount) > 0.05*maxPoolSize``,


Should 0.05*maxPoolSize be updated to max(0.05*maxPoolSize, 1)?

Owner Author


fixed

else:
selected = server2

return selected


Can we define a shared method that implements "Selecting servers within the latency window" so that we don't need to duplicate the pseudocode here?

Owner Author


done

if diff > max(0.05 * max_pool_size, 1):
selected = server in (server1, server2) with minimum active_connection_count

if server1.pool.available_connection_count >= server2.pool.available_connection_count:


I think this if is supposed to be an elif:

if diff > max(0.05 * max_pool_size, 1):
    selected = server in (server1, server2) with minimum active_connection_count
elif server1.pool.available_connection_count >= server2.pool.available_connection_count:
    selected = server1
else:
    selected = server2

Owner Author


yep, fixed
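
For reference, here is a runnable sketch of the within-window selection with the amendments discussed in this thread (single-server short-circuit, the max(0.05 * maxPoolSize, 1) threshold, and the elif tie-break). The server/pool attribute names are illustrative stand-ins, not normative spec pseudocode; this could be plugged into the mock harness sketched earlier.

import random

def select_server_in_window(servers, max_pool_size):
    # servers: the suitable servers within the latency window; each is assumed
    # to expose pool.active_connection_count and pool.available_connection_count.
    if len(servers) == 1:
        return servers[0]
    server1, server2 = random.sample(servers, 2)
    diff = abs(server1.pool.active_connection_count
               - server2.pool.active_connection_count)
    if diff > max(0.05 * max_pool_size, 1):
        # Prefer the server with fewer active (pending or in-use) connections.
        return min(server1, server2, key=lambda s: s.pool.active_connection_count)
    elif server1.pool.available_connection_count >= server2.pool.available_connection_count:
        return server1
    else:
        return server2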

As an added benefit, the new algorithm gives preference to nodes that have
recently been discovered and are thus more likely to be alive (e.g. during a
rolling restart). The narrowing to two random choices first ensures new servers
aren't overly preferred, however, preventing a "thundering herd" situation.


Doesn't this algorithm suffer from the "thundering herd" problem as soon as the "servers within the latency window" goes down to 2 servers? In the pure random selection each node would get 50% of the work, whereas in the new algorithm the new server would (potentially) get 100% of the work.

We could also consider what happens when the "servers within the latency window" goes down to 3 servers. In pure random selection each node would get 33% of the work, whereas in the new algorithm the new server would (potentially) get 66% of the work (it appears in two of the three possible random pairs and would win each comparison while its pool is empty).

With 4 servers, pure random gives 25% whereas the new algorithm gives a 50% worst case.

Is this acceptable? Should we consider only using this power of two choices algorithm when the number of servers in the latency window is above some threshold (maybe 3)?

Owner Author


I think in cases where there are only a few available nodes, the "thundering herd" problem is less of a concern. For example, in the case where we go from 1->2 servers, it's not really any more of a "herd" than was already happening when we were routing every single operation to the lone node. We could have an issue in a situation where the lone node is isolated but the new node is shared among driver instances, but I'm not sure if this is a case that actually happens in practice. Also remember that targeting newer nodes is a benefit of this algorithm, so imo it's okay that in the 3-node and 4-node cases the newer server sees some higher load initially.


For example, in the case where we go from 1->2 servers,

What about the case where there are already 2 servers (A and B) that are each getting 50%? A transient network error on A would clear A's pool and according to the new scheme, server A would get 100% of the requests until abs(server1.pool.active_connection_count - server2.pool.active_connection_count) > max(0.05 * max_pool_size, 1) returns False.

Also remember that targeting newer nodes is a benefit of this algorithm,

I find that it's not very straightforward to determine why this statement is true. As a counter example, consider the scenario where the client is idle (all pools have activeConnectionCount==0) and a server is restarted, will the next request prefer the restarted node? I think the answer is no.

The unwritten caveat here is that we only prefer new/restarted servers when there is a consistently high request load (activeConnectionCount > max(0.05 * max_pool_size, 1)). Is that right?

Owner Author


On the flip side of that example, while A is marked as unknown after the transient network error, B would be accepting 100% of the requests, so ideally we'd want to prefer the new server as soon as possible to start shedding some of that load. In general, for systems with only a single layer of redundancy, each node needs to be able to handle the entire workload, because otherwise the cluster will just go down whenever any single node does. Another thing that helps here: if most operations are short-lived, we should reach an equilibrium state pretty quickly, because operations on B would be finishing and checking their connections back in while we were routing everything to A, so their activeConnectionCounts would be converging from both directions.

One concern I do have in that scenario though is that we would be routing all our ops to A which will then likely hit maxConnecting, even if we have pooled connections sitting around in B, increasing latency. We know based on benchmarking that maxConnecting doesn't necessarily reduce latency in single server scenarios, but with the introduction of this load balancing algorithm it might do so in multi-server ones. We could add consideration of pending connections separate from "in use" ones to the selection algorithm, but I worry it would be impossible to get an accurate value for this without incorporating connection checkout into server selection / holding exclusive locks for longer than we ought to, which would increase latency anyways. Perhaps this would be another good benchmarking case to see if there are any noticeable increases in latency.

Overall, I definitely agree that this is a scenario worth discussing, but I'm not necessarily sure if addressing it is worth adding any complexity / special casing to the algorithm, and I'm worried that we may miss out on some of the benefits of the algorithm by not applying it to n=2 or n=3 cases.

I find that it's not very straightforward to determine why this statement is true.

I think the idea that preferring new servers is good comes from 1) new servers likely have less load than old ones (this is kinda circular logic though) and 2) new servers are more likely to remain up than older ones (e.g. during a rolling upgrade, new servers probably already have been updated and won't be taken down again, older servers could be down but SDAM just hasn't noticed yet or will be taken down shortly).

The unwritten caveat here is that we only prefer new/restarted servers when there is a consistently high request load

Yep, this portion of server selection is aiming to be a client-side load balancing solution, so if there isn't consistent load, it has little use, but that's okay.


I agree in principle with the idea of routing operations to new servers. When I said "it's not very straightforward to determine why this statement is true" I meant that it's difficult to understand how (and to what degree) this new algorithm satisfies that goal. I hope the new test format you added will make it easier to reason about the behavior.

On the flip side of that example, while A is marked as unknown after the transient network error, B would be accepting 100% of the requests, so ideally we'd want to prefer the new server as soon as possible to start shedding some of that load.

But in this case we wouldn't be shedding some of the load. We would be shedding all the load. I think the scenario above is certainly a potential regression where request load can unnecessarily flip-flop between two servers. I wonder if this case should be mentioned in the spec. It might prove useful down the line if we see a similar issue in the field.

Owner Author


I don't think the new test format will necessarily help shed too much light on how the algorithm performs in practice since the tests are more of a point-in-time verification. A key benefit of the algorithm in practice is that routing operations to an individual node makes it less likely for that node to receive future operations until the routed operations complete, but that sort of thing isn't accounted for in the spec tests, which perform the same selection many times. I'm going to do some ad-hoc experimentation on real deployments with the Rust driver to see how the algorithm actually handles a few different scenarios (this one included), and I'll let you know the results of that.

But in this case we wouldn't be shedding some of the load. We would be shedding all the load. I think the scenario above is certainly a potential regression where request load can unnecessarily flip-flop between two servers.

This is true, but only for short periods of time due to the convergence of activeConnectionCounts I alluded to in my previous comment. In fact, both nodes should reach equilibrium points more quickly under this scheme than in pure random chance.

Sending too much load to an individual node (hotspotting) even for a short period of time is certainly a concern though, but I think it's mostly kept in check by maxConnecting. For instance, current drivers would route roughly 50% of the load to the new server once it comes back up, but they will do so via a burst of new connections. In the new scheme, drivers will temporarily route 100% of the load to the new node, but only 2 at a time until the pool starts to fill. So it might be that, even though drivers will be routing a higher % of operations to the node once it comes up, the new node will actually receive a more manageable workload with this algorithm + maxConnecting than it would with existing drivers. There is a risk that overall latency worsens, though. I'll have to see when I do my experiments.

This does make me realize that we might need to take the length of the WaitQueue into account when calculating the load or health of a server. In this scenario, for instance, a bunch of operations could be queued on the new node at startup due to maxConnecting, but activeConnectionCount would still be 2, leading to further operations getting routed to that node in the meantime. If we factored the WaitQueue size into the comparisons we do, we'd potentially start routing operations to the other node sooner. We also get the added benefit of evenly distributing queued work when all connection pools are full. It's hard for me to know how much of a problem this is, though. What do you think?
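
Purely as a sketch of that idea (the field names are illustrative, not from the spec), the comparison might fold in the wait queue like this:

def load(server):
    # Demand on the server: operations already holding a connection (pending
    # or in use) plus operations queued waiting for one.
    return server.pool.active_connection_count + server.pool.wait_queue_size

def pick_less_loaded(server1, server2, max_pool_size):
    if abs(load(server1) - load(server2)) > max(0.05 * max_pool_size, 1):
        return min(server1, server2, key=load)
    # Otherwise fall back to the availability tie-break from the main algorithm.
    if server1.pool.available_connection_count >= server2.pool.available_connection_count:
        return server1
    return server2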

Owner Author


So I performed a few experiments on a sharded cluster with 2 mongoses. In each experiment, I spawned 50 tasks that each ran find_one a hundred times, and I measured how long it took to finish, how many times each shard was selected, and the connections created in each pool to service this workload. Besides one scenario noted below, all pools began as empty. I performed this experiment with the following scenarios:

  1. 2 normal mongoses
  2. 2 mongoses, 1 of which (A) has a 5% chance of find taking an extra 250ms
  3. 2 mongoses, 1 of which (A) has a 25% chance of find taking an extra 250ms
  4. 2 mongoses with minPoolSize=100, 1 of which (A) has a 25% chance of find taking an extra 250ms
     • I waited until the pools both reached 100 connections before starting this scenario
  5. 2 mongoses, both have a 5% chance of find taking an extra 250ms
  6. 2 mongoses, both have a 25% chance of find taking an extra 250ms

Here are the results: https://docs.google.com/spreadsheets/d/1r9CZRjPEbk43g5BLQFXY0Ql4iTlzw2qqyt2dj583XRs/edit?usp=sharing

Some notable points:

  • Pure random appears slightly faster in "normal" scenarios (i.e. 1, 2, 5). This seems to be due to the fact that considering health spreads connection creation out more evenly, resulting in more connections being created and more latency. The differences were really narrow, though, and this benchmarking was certainly not performed in an ideal environment, so it's not clear if this difference is meaningful or not.
  • Considering health leads to the driver preferring better performing nodes.
  • Surprisingly, considering health performed worse (latency-wise) than pure random in scenario 3. This also seems to be due to considering health creating more connections in total, which seems to outweigh the latency benefits of routing away from the slow node. This hypothesis is potentially confirmed by scenario 4, which populates the pools ahead of time and has health performing significantly better than pure random. Given a much larger operation count, we'd likely see health perform better in scenario 3 as well.
  • Considering health leads to more evenly populated pools.
  • Considering health performed better than pure random in Scenario 6 (lots of ops on both mongoses take a long time). This seems to align with the intended goal of better handling of slow operations.

None of these workloads actually made the pools reach maxPoolSize and start queuing operations, though when I did try one that did (e.g. burst of 5000 tasks at once), the results started to look the same between the two methods. This seems to indicate that we do in fact need to consider the length of the WaitQueue when calculating load, since otherwise both pools will be considered under equal load when they're full and we'll just choose at random. I'll update the spec to include this.

Note that I omitted the scenario we've been discussing here. I still need to figure out a good way to emulate that and record proper results. I figured these preliminary ones were worth sharing though.

@ShaneHarvey

ShaneHarvey commented Oct 20, 2020

I wonder if we then ought to consider not including tests for this feature in the specification and instead simply recommend drivers write their own tests as they see fit.

I don't think we can get away with that. At the least we need prose tests to cover various cases. And at that point I think it's less effort and more maintainable to create a new spec test format.

Testing the server selection bits via spec tests would require a new format that would entail some amount of work to implement, and such tests would need to exclude testing the random selection of two servers part in order to be deterministic.

It might be possible to account for random selection by having the driver run server selection 1000's of times and then compare the percentage of times each server gets selected (with some leeway of course). This kind of testing would also give us a better understanding of how the algorithm performs in various scenarios (different # of servers, different pool sizes, different # of active vs available connections, etc...).
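
For example (hypothetical helper names, just to illustrate the shape of such a test), a driver-side frequency check might look like:

from collections import Counter

def check_frequencies(servers, max_pool_size, expected_frequencies,
                      select_in_window, iterations=10_000, leeway=0.05):
    # Run selection many times against fixed mock pools and compare the
    # observed selection rate of each server to the expected rate, within
    # some tolerance.
    counts = Counter()
    for _ in range(iterations):
        counts[select_in_window(servers, max_pool_size).name] += 1
    for name, expected in expected_frequencies.items():
        observed = counts[name] / iterations
        assert abs(observed - expected) <= leeway, (name, observed, expected)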

@patrickfreed
Owner Author

I sketched out a test format and added a sample case, let me know what you think.


@ShaneHarvey left a comment


Neat, I like this new testing approach!


- ``in_window``: array of servers in the latency window that the selected server
is to be chosen from. Each element will have all of the following fields:

- ``id``: a unique string identifier for this server


The convention in our tests is to identify servers by address, like this: address: a:27017. Can we do the same in these tests (in both in_window and expected_frequencies)?

Owner Author


done

Each YAML file for these tests has the following format:

- ``in_window``: array of servers in the latency window that the selected server
is to be chosen from. Each element will have all of the following fields:


I think it would help implementors to add a TopologyDescription field to the test setup similar to the existing Server Selection Logic Tests. This would help drivers that need to create Topology/ServerDescriptions in order to test selection logic.

Owner Author


Done. I added it as just another field in the file, separate from the in_window one, seeing as connection pool information is likely stored separately from the actual ServerDescription type within drivers. Let me know if this is what you had in mind.


Nice, I'll test this out later today!

@patrickfreed
Owner Author

FYI - I opened a PR against the actual specifications repo with the current set of changes here: mongodb#876
