SPEC-1555 Consider connection pool health during server selection #876

Merged — 21 commits merged on Nov 17, 2020

Changes from 13 commits
31 changes: 26 additions & 5 deletions source/server-selection/server-selection-tests.rst
@@ -217,12 +217,13 @@ correctly passed to Mongos in the following scenarios:
- $readPreference is used


Random Selection Within Latency Window
======================================
Random Selection Within Latency Window (single-threaded drivers)
================================================================

The Server Selection spec mandates that drivers select a server at random from the
set of suitable servers that are within the latency window. Drivers implementing the
spec SHOULD test their implementations in a language-specific way to confirm randomness.
The Server Selection spec mandates that single-threaded drivers select
a server at random from the set of suitable servers that are within
the latency window. Drivers implementing the spec SHOULD test their
implementations in a language-specific way to confirm randomness.

For example, the following topology description, operation, and read preference will
return a set of three suitable servers within the latency window::
@@ -257,6 +258,26 @@ return a set of three suitable servers within the latency window::
Drivers SHOULD check that their implementation selects one of ``primary``, ``secondary_1``,
and ``secondary_2`` at random.

operationCount-based Selection Within Latency Window (multi-threaded or async drivers)
======================================================================================

The Server Selection spec mandates that multi-threaded or async
drivers select a server from within the latency window according to
their operationCounts. There are YAML tests verifying that drivers
implement this selection correctly. Multi-threaded or async drivers
implementing the spec MUST use them to test their implementations.

The tests each include some information about the servers within the
latency window. For each case, the driver passes this information into
whatever function it uses to select from within the window. Because
the selection algorithm relies on randomness, this process MUST be
repeated 2000 times. Once the 2000 selections are complete, the runner
tallies up the number of times each server was selected and compares
those counts to the expected results included in the test
case. Specifics of the test format and how to run the tests are
included in the tests README.
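
For illustration only (the helper name below is hypothetical and the tests
README remains authoritative), the tally step might look like this::

    from collections import Counter

    def tally_selections(select_in_window, in_window, n=2000):
        # Run the driver's within-window selection n times and count how
        # often each server is chosen; the counts are then compared to the
        # expected results in the test case.
        counts = Counter()
        for _ in range(n):
            counts[select_in_window(in_window).address] += 1
        return counts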


Application-Provided Server Selector
====================================

156 changes: 123 additions & 33 deletions source/server-selection/server-selection.rst
@@ -10,7 +10,7 @@ Server Selection
:Status: Accepted
:Type: Standards
:Last Modified: 2020-03-17
:Version: 1.11.0
:Version: 1.12.0

.. contents::

@@ -798,8 +798,24 @@ a pool; or (b) governed by a global or client-wide limit on number of
waiting threads, depending on how resource limits are implemented by a
driver.

For multi-threaded clients, the server selection algorithm is
as follows:
operationCount
``````````````

Multi-threaded or async drivers MUST keep track of the number of operations that
a given server is currently executing (the server's ``operationCount``). This
value MUST be incremented once a server is selected for an operation and MUST be
decremented once that operation has completed, regardless of its outcome. This
value SHOULD be stored on the ``Server`` type that also owns the connection pool
for the server, if there exists such a type in the driver's implementation, or
on the pool itself. Incrementing or decrementing a server's ``operationCount``
MUST NOT wake up any threads that are waiting for a topology update as part of
server selection. See `operationCount-based selection within the latency window
(multi-threaded or async)`_ for the rationale behind the way this value is used.
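
For illustration only (names are hypothetical, not mandated by the spec), a
driver might track this value with a counter owned by the ``Server`` type and
pair the increment with a guaranteed decrement::

    import threading

    class Server:
        """Sketch of a Server that owns a connection pool and an operationCount."""

        def __init__(self, address, pool):
            self.address = address
            self.pool = pool
            self._operation_count = 0
            self._lock = threading.Lock()  # an atomic integer works equally well

        def increment_operation_count(self):
            # Called when this server is selected for an operation. Note that
            # this does not signal threads blocked waiting for a topology update.
            with self._lock:
                self._operation_count += 1

        def decrement_operation_count(self):
            # Called when the operation completes, regardless of outcome.
            with self._lock:
                self._operation_count -= 1

        @property
        def operation_count(self):
            with self._lock:
                return self._operation_count

    def execute_operation(server, operation):
        # The decrement happens in a finally block so failed operations
        # still release their count.
        server.increment_operation_count()
        try:
            return operation.run(server.pool)
        finally:
            server.decrement_operation_count()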

Server Selection Algorithm
``````````````````````````

For multi-threaded clients, the server selection algorithm is as follows:

1. Record the server selection start time

@@ -810,17 +826,29 @@ as follows:
4. Filter the suitable servers by calling the optional, application-provided server
selector.

5. If there are any suitable servers, choose one at random from those
within the latency window and return it; otherwise, continue to the next step
5. If there are any suitable servers, filter them according to `Filtering
suitable servers based on the latency window`_ and continue to the next step;
otherwise, goto Step #9.

6. Choose two servers at random from the set of suitable servers in the latency
window. If there is only 1 server in the latency window, just select that
server and goto Step #8.

7. Of the two randomly chosen servers, select the one with the lower
``operationCount``. If both servers have the same ``operationCount``, select
arbitrarily between the two of them.

8. Increment the ``operationCount`` of the selected server and return it. Do not
go onto later steps.

6. Request an immediate topology check, then block the server selection
thread until the topology changes or until the server selection
timeout has elapsed
9. Request an immediate topology check, then block the server selection thread
until the topology changes or until the server selection timeout has elapsed

7. If more than ``serverSelectionTimeoutMS`` milliseconds have elapsed since
the selection start time, raise a `server selection error`_
10. If more than ``serverSelectionTimeoutMS`` milliseconds have elapsed since
the selection start time, raise a `server selection error`_

11. Goto Step #2

8. Goto Step #2

Single-threaded server selection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -867,9 +895,10 @@ as follows:
7. Filter the suitable servers by calling the optional, application-provided
server selector.

8. If there are any suitable servers, choose one at random from those
within the latency window and return it; otherwise, mark the topology
stale and continue to step #8
8. If there are any suitable servers, filter them according to `Filtering
suitable servers based on the latency window`_ and return one at random from
the filtered servers; otherwise, mark the topology stale and continue to step
#9.

9. If `serverSelectionTryOnce`_ is true and the last scan time is newer than
the selection start time, raise a `server selection error`_; otherwise,
@@ -958,8 +987,9 @@ If ``mode`` is 'secondary' or 'nearest':
#. From these, select one server within the latency window.

(See `algorithm for filtering by staleness`_, `algorithm for filtering by
tag_sets`_, and `selecting servers within the latency window`_ for details
on each step, and `why is maxStalenessSeconds applied before tag_sets?`_.)
tag_sets`_, and `filtering suitable servers based on the latency window`_ for
details on each step, and `why is maxStalenessSeconds applied before
tag_sets?`_.)

If ``mode`` is 'secondaryPreferred', attempt the selection algorithm with
``mode`` 'secondary' and the user's ``maxStalenessSeconds`` and ``tag_sets``. If
@@ -995,8 +1025,9 @@ server, but are passed through to mongos. See `Passing read preference to mongos

For write operations, all servers of type Mongos are suitable.

If more than one mongos is suitable, drivers MUST randomly select a suitable
server within the latency window.
If more than one mongos is suitable, drivers MUST select a suitable server
within the latency window (see `Filtering suitable servers based on the latency
window`_).

Round Trip Times and the Latency Window
---------------------------------------
@@ -1027,12 +1058,12 @@ following formula::
A weighting factor of 0.2 was chosen to put about 85% of the weight of the
average RTT on the 9 most recent observations.
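
As a sketch of the weighted-average computation described above (variable
names are illustrative)::

    ALPHA = 0.2  # weighting factor applied to the newest RTT sample

    def update_average_rtt(old_avg_rtt, rtt_sample):
        # The first observation becomes the average; afterwards the newest
        # sample gets weight 0.2 and the prior average gets weight 0.8, which
        # puts roughly 85% of the weight on the 9 most recent observations.
        if old_avg_rtt is None:
            return rtt_sample
        return ALPHA * rtt_sample + (1 - ALPHA) * old_avg_rtt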

Selecting servers within the latency window
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Filtering suitable servers based on the latency window
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Server selection results in a set of zero or more suitable servers. If more
than one server is suitable, a server MUST be selected randomly from among
those within the latency window.
than one server is suitable, a server MUST be selected from among those within
the latency window.

The ``localThresholdMS`` configuration parameter controls the size of the
latency window used to select a suitable server.
@@ -1041,12 +1072,16 @@ The shortest average round trip time (RTT) from among suitable servers anchors
one end of the latency window (``A``). The other end is determined by adding
``localThresholdMS`` (``B = A + localThresholdMS``).

A server MUST be selected randomly from among suitable servers that have an
average RTT (``RTT``) within the latency window (i.e. ``A ≤ RTT ≤ B``).
A server MUST be selected from among suitable servers that have an average RTT
(``RTT``) within the latency window (i.e. ``A ≤ RTT ≤ B``). In other words, the
suitable server with the shortest average RTT is **always** a possible choice.
Other servers could be chosen if their average RTTs are no more than
``localThresholdMS`` more than the shortest average RTT.

See either `Single-threaded server selection`_ or `Multi-threaded or
asynchronous server selection`_ for information on how to select a server from
among those within the latency window.

In other words, the suitable server with the shortest average RTT is **always**
a possible choice. Other servers could be chosen if their average RTTs are no
more than ``localThresholdMS`` more than the shortest average RTT.
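
A minimal sketch of this filtering step (helper and attribute names are
illustrative)::

    def filter_latency_window(suitable_servers, local_threshold_ms):
        # Keep only suitable servers whose average RTT lies within
        # [A, A + localThresholdMS], where A is the shortest average RTT.
        if not suitable_servers:
            return []
        shortest = min(s.average_rtt_ms for s in suitable_servers)
        upper_bound = shortest + local_threshold_ms
        return [s for s in suitable_servers if s.average_rtt_ms <= upper_bound]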

Checking an Idle Socket After socketCheckIntervalMS
---------------------------------------------------
@@ -1178,7 +1213,8 @@ Multi-threaded server selection implementation
The following example uses a single lock for clarity. Drivers are free to
implement whatever concurrency model best suits their design.

Pseudocode for `multi-threaded or asynchronous server selection`_::
The following is pseudocode for `multi-threaded or asynchronous server
selection`_::

def getServer(criteria):
client.lock.acquire()
@@ -1211,7 +1247,15 @@ Pseudocode for `multi-threaded or asynchronous server selection`_::

if servers is not empty:
in_window = servers within the latency window
selected = random entry from in_window
if len(in_window) == 1:
selected = in_window[0]
else:
server1, server2 = random two entries from in_window
if server1.operation_count <= server2.operation_count:
selected = server1
else:
selected = server2
selected.operation_count += 1
client.lock.release()
return selected

@@ -1232,7 +1276,7 @@ Pseudocode for `multi-threaded or asynchronous server selection`_::
Single-threaded server selection implementation
-----------------------------------------------

Pseudocode for `single-threaded server selection`_::
The following is pseudocode for `single-threaded server selection`_::

def getServer(criteria):
startTime = gettime()
@@ -1419,8 +1463,8 @@ However, given a choice between the two, ``localThreshold`` is a more general
term. For drivers, we add the ``MS`` suffix for clarity about units and
consistency with other configuration options.

Random selection within the latency window
------------------------------------------
Random selection within the latency window (single-threaded)
------------------------------------------------------------

When more than one server is judged to be suitable, the spec calls for random
selection to ensure a fair distribution of work among servers within the
@@ -1431,6 +1475,44 @@ servers to come and go. Making newly available servers either first or last
could lead to unbalanced work. Random selection has a better fairness
guarantee and keeps the design simpler.

operationCount-based selection within the latency window (multi-threaded or async)
----------------------------------------------------------------------------------

As operation execution slows down on a node (e.g. due to degraded server-side
performance or increased network latency), checked-out pooled connections to
that node will begin to remain checked out for longer periods of time. Assuming
at least constant incoming operation load, more connections will then need to be
opened against the node to service new operations that it gets selected for,
further straining it and slowing it down. This can lead to runaway connection
creation scenarios that can cripple a deployment ("connection storms"). As part
of DRIVERS-781, the random choice portion of multi-threaded server selection was
changed to more evenly spread out the workload among suitable servers in order
to prevent any single node from being overloaded. The new steps achieve this by
approximating an individual server's load via the number of concurrent
operations that node is processing (operationCount) and then routing operations
to servers with less load. This should reduce the number of new operations
routed towards nodes that are busier and thus increase the number routed towards
nodes that are servicing operations faster or are simply less busy. The previous
random selection mechanism did not take load into account and could assign work
to nodes that were under too much stress already.

As an added benefit, the new approach gives preference to nodes that have
recently been discovered and are thus more likely to be alive (e.g. during a
rolling restart). The narrowing to two random choices first ensures new servers
aren't overly preferred, however, preventing a "thundering herd" situation.
Additionally, the `maxConnecting`_ provisions included in the CMAP
specification prevent drivers from crippling new nodes with connection storms.

This approach is based on the `"Power of Two Random Choices with Least Connections" <https://web.archive.org/web/20191212194243/https://www.nginx.com/blog/nginx-power-of-two-choices-load-balancing-algorithm/>`_
load balancing algorithm.
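
As a rough, non-normative illustration of the effect (all numbers below are
arbitrary), a two-choice rule quickly shifts load away from a server that
drains its operations more slowly::

    import random

    def p2c_select(servers):
        # Power of two random choices over in-flight operation counts.
        if len(servers) == 1:
            return servers[0]
        a, b = random.sample(servers, 2)
        return a if a["in_flight"] <= b["in_flight"] else b

    servers = [
        {"name": "fast", "rate": 10, "in_flight": 0, "assigned": 0},
        {"name": "slow", "rate": 2, "in_flight": 0, "assigned": 0},
    ]

    for _ in range(10000):
        # Eight operations arrive per tick and are routed via the rule above.
        for _ in range(8):
            s = p2c_select(servers)
            s["in_flight"] += 1
            s["assigned"] += 1
        # Each server completes at most `rate` operations per tick, so the
        # slow server's backlog (its in-flight count) stays higher.
        for s in servers:
            s["in_flight"] = max(0, s["in_flight"] - s["rate"])

    for s in servers:
        print(s["name"], s["assigned"])  # the slow server receives far fewer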

An alternative approach to this would be to prefer selecting servers that
already have available connections. While that approach could help reduce
latency, it does not achieve the benefits of routing operations away from slow
servers or of preferring newly introduced servers. Additionally, that approach
could lead to the same node being selected repeatedly rather than spreading the
load out among all suitable servers.

The slaveOK wire protocol flag
------------------------------

@@ -1536,6 +1618,7 @@ considers the socket closed, without requiring a round-trip to the server.
However, this technique usually will not detect an uncleanly shutdown server or
a network outage.


Backwards Compatibility
=======================

@@ -1629,7 +1712,7 @@ Mongos HA has similar problems with pinning, in that one can wind up pinned
to a high-latency mongos even if a lower-latency mongos later becomes
available.

Random selection within the latency window avoids this problem and makes server
Selection within the latency window avoids this problem and makes server
selection exactly analogous to having multiple suitable servers from a replica
set. This is easier to explain and implement.

@@ -1705,17 +1788,21 @@ The user's intent in specifying two tag sets was to fall back to the second set
if needed, so we filter by maxStalenessSeconds first, then tag_sets, and select
Node 2.


References
==========

- `Server Discovery and Monitoring`_ specification
- `Driver Authentication`_ specification
- `Connection Monitoring and Pooling`_ specification

.. _Server Discovery and Monitoring: https://github.com/mongodb/specifications/tree/master/source/server-discovery-and-monitoring
.. _heartbeatFrequencyMS: https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#heartbeatfrequencyms
.. _Max Staleness: https://github.com/mongodb/specifications/tree/master/source/max-staleness
.. _idleWritePeriodMS: https://github.com/mongodb/specifications/blob/master/source/max-staleness/max-staleness.rst#idlewriteperiodms
.. _Driver Authentication: https://github.com/mongodb/specifications/blob/master/source/auth
.. _maxConnecting: /source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst#connection-pool
.. _Connection Monitoring and Pooling: /source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst

Changes
=======
@@ -1769,5 +1856,8 @@ selection rules.

2020-03-17: Specify read preferences with support for server hedged reads

2020-10-10: Consider server load when selecting servers within the latency
window.

.. [#] mongos 3.4 refuses to connect to mongods with maxWireVersion < 5,
so it does no additional wire version checks related to maxStalenessSeconds.
37 changes: 37 additions & 0 deletions source/server-selection/tests/README.rst
@@ -65,3 +65,40 @@ the TopologyDescription. Each YAML file contains a key for these stages of serve
Drivers implementing server selection MUST test that their implementation
correctly returns the set of servers in ``in_latency_window``. Drivers SHOULD also test
against ``suitable_servers`` if possible.

Selection Within Latency Window Tests
-------------------------------------
Contributor Author:
Now that the "algorithm" is much simpler, is it still worth it to have all drivers implement these unit tests?

Member:
I think it's still worthwhile. I very much prefer these tests to no tests. That said, do you have anything in mind for real end-to-end tests for this feature?

Contributor Author:
We're limited a bit by the topology. We could include some of those experiments I did as prose tests--for example, describe a test against a two mongos sharded topology where one of the mongoses has a failpoint that makes every operation take 500ms, then do a ton of concurrent stuff and then assert that the non-failpoint node got picked a lot more. While this does verify the operationCount behavior, I'm not sure whether this is preferable to having a bunch of unit spec tests, though.

Member:
Please open a jira ticket so we can further discuss adding that prose test. I'm in favor of it.

Contributor Author:
Should I just go ahead and add it now? I think it'll be most useful if drivers can have the test case when they're implementing this for the first time.

Member:
Yeah let's just add it now. I doubt the unified test format is expressive enough for a test like this so it'll need to be a prose test like you said. You can use the blockConnection option for failCommand like this:

db.adminCommand({
    configureFailPoint: "failCommand",
    mode: {times: 100000},  // or "alwaysOn"
    data: {
      failCommands: ["find"],
      blockConnection: true,
      blockTimeMS: 500,
      appName: "loadBalancingTest",
    },
});

Contributor Author:
Done. When you run it, can you let me know what you get for the % of operations routed to the slow node? Curious to see if there are any differences across drivers.

Member:
Implemented. I ran it a few times and see:

{('localhost', 27017): 11, ('localhost', 27018): 89}
{('localhost', 27017): 12, ('localhost', 27018): 88}
{('localhost', 27017): 13, ('localhost', 27018): 87}
{('localhost', 27017): 15, ('localhost', 27018): 85}
{('localhost', 27017): 17, ('localhost', 27018): 83}

Exciting to see this feature in action!

Contributor Author:
I get around 12-15% too, so it seems like our implementations are consistent, nice!


These tests verify that drivers select servers from within the latency
window correctly. These tests MUST only be implemented by
multi-threaded or async drivers.

Each YAML file for these tests has the following format:

- ``topology_description``: the state of a mocked cluster

- ``in_window``: array of servers in the latency window that the selected server
is to be chosen from. Each element will have all of the following fields:

- ``address``: a unique address identifying this server

- ``operation_count``: the ``operationCount`` for this server

- ``expected_frequencies``: a document whose keys are the server addresses from the
``in_window`` array and values are numbers in [0, 1] indicating the frequency
at which the server should have been selected.

For each file, pass the information from ``in_window`` to whatever function is
Member:
The test should also configure the read preference used for server selection. I suggest adding a read_preference field to be consistent with the existing tests.

Contributor Author:
I think we discussed this in the meeting, but since this test is verifying what happens after the set of available servers is determined and the topologies will always be sharded, a read preference shouldn't be necessary here.

Member:
Hmm maybe my test runner implementation is different. In mine the read preference does actually have an impact. Right now I'm arbitrarily using Nearest. My implementation is roughly:

topology = create_mock_topology_from(test['topology_description'])
mock_operation_counts(topology, test['in_window'])  # Mock operationCount for each server in 'in_window'
read_preference = Nearest()
counts = ...
for _ in range(ITERATIONS):
        server = topology.select_server(read_preference)
        counts[server.address] += 1
...

Contributor Author:
Ah I see, yeah that makes sense. In my implementation I skipped right to the in-window portion, so I didn't need a TopologyDescription or a read preference. I imagine most drivers will probably implement your way though.

Rather than including a read preference in each spec test, I just updated the runner's description to say use a default or "primary" read preference since it shouldn't matter for sharded topologies.

Member:
It still feels a little odd that we have no explicit tests for replica sets. Perhaps the prose test can cover it?

Contributor Author:
The server selection logic tests already ensure that drivers properly determine what is in the latency window across different topology types, and these tests ensure that a server is selected properly from what's within the latency window. Putting those together should provide us full coverage, so I don't think it's necessary to test all topologies in these tests.

Given that selecting from within the window is topology agnostic, the topology is provided in these tests purely as a convenience for implementing the test runners; it doesn't really have any bearing on the results of the tests.

Member:
On the other hand, it's not unreasonable for a driver to have two code paths for server selection, and adding one test for a replica set is trivial, so why not add a single replica set test? The test format supports it as long as we say to use Nearest read preference instead of Primary.

Contributor Author:
My main hesitation for adding it was that if we're treating the provided topology as more than just a convenience (i.e. as part of the tests' coverage), then drivers would be required to use it to implement the tests, even though it isn't necessary since the feature we're testing should ideally be topology agnostic. That said, if we think there could be separate implementations for selection based on topology, then it's probably worth having full coverage of that. Updated to include a few replset tests.

used to select a server from within the latency window 10000 times, counting how
many times each server is selected. Once 10000 selections have been made, verify
that each server was selected at a frequency within 0.02 of the frequency
contained in ``expected_frequencies`` for that server. If the expected frequency
for a given server is 1 or 0, then the observed frequency MUST be exactly equal
to the expected one.

Mocking may be required to implement these tests. A mocked topology description
is included in each file for drivers that require a full description to
implement these tests. If a ReadPreference needs to be specified as part of
running these tests, specify one with the default ("primary") mode. The provided
topologies will always be sharded so this should not have an effect on the
results of the test.
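
A sketch of such a runner (the within-window selection helper shown here is a
stand-in for whatever function the driver actually exposes)::

    import random

    ITERATIONS = 10000
    TOLERANCE = 0.02

    def select_in_window(in_window):
        # Stand-in for the driver's within-window selection, included only so
        # the sketch is self-contained.
        if len(in_window) == 1:
            return in_window[0]
        a, b = random.sample(in_window, 2)
        return a if a["operation_count"] <= b["operation_count"] else b

    def run_case(test):
        counts = {s["address"]: 0 for s in test["in_window"]}
        for _ in range(ITERATIONS):
            selected = select_in_window(test["in_window"])
            counts[selected["address"]] += 1
        for address, expected in test["expected_frequencies"].items():
            observed = counts[address] / ITERATIONS
            if expected in (0, 1):
                assert observed == expected
            else:
                assert abs(observed - expected) <= TOLERANCE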