SPEC-1555 Consider connection pool health during server selection #876

Merged — 21 commits merged on Nov 17, 2020

Changes from 13 commits
31 changes: 26 additions & 5 deletions source/server-selection/server-selection-tests.rst
@@ -217,12 +217,13 @@ correctly passed to Mongos in the following scenarios:
- $readPreference is used


Random Selection Within Latency Window
======================================
Random Selection Within Latency Window (single-threaded drivers)
================================================================

The Server Selection spec mandates that drivers select a server at random from the
set of suitable servers that are within the latency window. Drivers implementing the
spec SHOULD test their implementations in a language-specific way to confirm randomness.
The Server Selection spec mandates that single-threaded drivers select
a server at random from the set of suitable servers that are within
the latency window. Drivers implementing the spec SHOULD test their
implementations in a language-specific way to confirm randomness.

For example, the following topology description, operation, and read preference will
return a set of three suitable servers within the latency window::
@@ -257,6 +258,26 @@ return a set of three suitable servers within the latency window::
Drivers SHOULD check that their implementation selects one of ``primary``, ``secondary_1``,
and ``secondary_2`` at random.

operationCount-based Selection Within Latency Window (multi-threaded or async drivers)
======================================================================================

The Server Selection spec mandates that multi-threaded or async
drivers select a server from within the latency window according to
their operationCounts. There are YAML tests verifying that drivers
implement this selection correctly. Multi-threaded or async drivers
implementing the spec MUST use them to test their implementations.

The tests each include some information about the servers within the
latency window. For each case, the driver passes this information into
whatever function it uses to select from within the window. Because
the selection algorithm relies on randomness, this process MUST be
repeated 2000 times. Once the 2000 selections are complete, the runner
tallies up the number of times each server was selected and compares
those counts to the expected results included in the test
case. Specifics of the test format and how to run the tests are
included in the tests README.
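
For illustration only (the helper name below is hypothetical and the tests
README remains authoritative), the tally step might look like this::

    from collections import Counter

    def tally_selections(select_in_window, in_window, n=2000):
        # Run the driver's within-window selection n times and count how
        # often each server is chosen; the counts are then compared to the
        # expected results in the test case.
        counts = Counter()
        for _ in range(n):
            counts[select_in_window(in_window).address] += 1
        return counts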


Application-Provided Server Selector
====================================

156 changes: 123 additions & 33 deletions source/server-selection/server-selection.rst
@@ -10,7 +10,7 @@ Server Selection
:Status: Accepted
:Type: Standards
:Last Modified: 2020-03-17
:Version: 1.11.0
:Version: 1.12.0

.. contents::

@@ -798,8 +798,24 @@ a pool; or (b) governed by a global or client-wide limit on number of
waiting threads, depending on how resource limits are implemented by a
driver.

For multi-threaded clients, the server selection algorithm is
as follows:
operationCount
``````````````

Multi-threaded or async drivers MUST keep track of the number of operations that
a given server is currently executing (the server's ``operationCount``). This
value MUST be incremented once a server is selected for an operation and MUST be
decremented once that operation has completed, regardless of its outcome. This
value SHOULD be stored on the ``Server`` type that also owns the connection pool
for the server, if there exists such a type in the driver's implementation, or
on the pool itself. Incrementing or decrementing a server's ``operationCount``
MUST NOT wake up any threads that are waiting for a topology update as part of
server selection. See `operationCount-based selection within the latency window
(multi-threaded or async)`_ for the rationale behind the way this value is used.
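
For illustration only (names are hypothetical, not mandated by the spec), a
driver might track this value with a counter owned by the ``Server`` type and
pair the increment with a guaranteed decrement::

    import threading

    class Server:
        """Sketch of a Server that owns a connection pool and an operationCount."""

        def __init__(self, address, pool):
            self.address = address
            self.pool = pool
            self._operation_count = 0
            self._lock = threading.Lock()  # an atomic integer works equally well

        def increment_operation_count(self):
            # Called when this server is selected for an operation. Note that
            # this does not signal threads blocked waiting for a topology update.
            with self._lock:
                self._operation_count += 1

        def decrement_operation_count(self):
            # Called when the operation completes, regardless of outcome.
            with self._lock:
                self._operation_count -= 1

        @property
        def operation_count(self):
            with self._lock:
                return self._operation_count

    def execute_operation(server, operation):
        # The decrement happens in a finally block so failed operations
        # still release their count.
        server.increment_operation_count()
        try:
            return operation.run(server.pool)
        finally:
            server.decrement_operation_count()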

Server Selection Algorithm
``````````````````````````

For multi-threaded clients, the server selection algorithm is as follows:

1. Record the server selection start time

@@ -810,17 +826,29 @@ as follows:
4. Filter the suitable servers by calling the optional, application-provided server
selector.

5. If there are any suitable servers, choose one at random from those
within the latency window and return it; otherwise, continue to the next step
5. If there are any suitable servers, filter them according to `Filtering
suitable servers based on the latency window`_ and continue to the next step;
otherwise, goto Step #9.

6. Choose two servers at random from the set of suitable servers in the latency
window. If there is only 1 server in the latency window, just select that
server and goto Step #8.

7. Of the two randomly chosen servers, select the one with the lower
``operationCount``. If both servers have the same ``operationCount``, select
arbitrarily between the two of them.

8. Increment the ``operationCount`` of the selected server and return it. Do not
go onto later steps.

6. Request an immediate topology check, then block the server selection
thread until the topology changes or until the server selection
timeout has elapsed
9. Request an immediate topology check, then block the server selection thread
until the topology changes or until the server selection timeout has elapsed

7. If more than ``serverSelectionTimeoutMS`` milliseconds have elapsed since
the selection start time, raise a `server selection error`_
10. If more than ``serverSelectionTimeoutMS`` milliseconds have elapsed since
the selection start time, raise a `server selection error`_

11. Goto Step #2

8. Goto Step #2

Single-threaded server selection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -867,9 +895,10 @@ as follows:
7. Filter the suitable servers by calling the optional, application-provided
server selector.

8. If there are any suitable servers, choose one at random from those
within the latency window and return it; otherwise, mark the topology
stale and continue to step #8
8. If there are any suitable servers, filter them according to `Filtering
suitable servers based on the latency window`_ and return one at random from
the filtered servers; otherwise, mark the topology stale and continue to step
#9.

9. If `serverSelectionTryOnce`_ is true and the last scan time is newer than
the selection start time, raise a `server selection error`_; otherwise,
@@ -958,8 +987,9 @@ If ``mode`` is 'secondary' or 'nearest':
#. From these, select one server within the latency window.

(See `algorithm for filtering by staleness`_, `algorithm for filtering by
tag_sets`_, and `selecting servers within the latency window`_ for details
on each step, and `why is maxStalenessSeconds applied before tag_sets?`_.)
tag_sets`_, and `filtering suitable servers based on the latency window`_ for
details on each step, and `why is maxStalenessSeconds applied before
tag_sets?`_.)

If ``mode`` is 'secondaryPreferred', attempt the selection algorithm with
``mode`` 'secondary' and the user's ``maxStalenessSeconds`` and ``tag_sets``. If
@@ -995,8 +1025,9 @@ server, but are passed through to mongos. See `Passing read preference to mongos

For write operations, all servers of type Mongos are suitable.

If more than one mongos is suitable, drivers MUST randomly select a suitable
server within the latency window.
If more than one mongos is suitable, drivers MUST select a suitable server
within the latency window (see `Filtering suitable servers based on the latency
window`_).

Round Trip Times and the Latency Window
---------------------------------------
@@ -1027,12 +1058,12 @@ following formula::
A weighting factor of 0.2 was chosen to put about 85% of the weight of the
average RTT on the 9 most recent observations.
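
As a sketch of the weighted-average computation described above (variable
names are illustrative)::

    ALPHA = 0.2  # weighting factor applied to the newest RTT sample

    def update_average_rtt(old_avg_rtt, rtt_sample):
        # The first observation becomes the average; afterwards the newest
        # sample gets weight 0.2 and the prior average gets weight 0.8, which
        # puts roughly 85% of the weight on the 9 most recent observations.
        if old_avg_rtt is None:
            return rtt_sample
        return ALPHA * rtt_sample + (1 - ALPHA) * old_avg_rtt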

Selecting servers within the latency window
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Filtering suitable servers based on the latency window
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Server selection results in a set of zero or more suitable servers. If more
than one server is suitable, a server MUST be selected randomly from among
those within the latency window.
than one server is suitable, a server MUST be selected from among those within
the latency window.

The ``localThresholdMS`` configuration parameter controls the size of the
latency window used to select a suitable server.
@@ -1041,12 +1072,16 @@ The shortest average round trip time (RTT) from among suitable servers anchors
one end of the latency window (``A``). The other end is determined by adding
``localThresholdMS`` (``B = A + localThresholdMS``).

A server MUST be selected randomly from among suitable servers that have an
average RTT (``RTT``) within the latency window (i.e. ``A ≤ RTT ≤ B``).
A server MUST be selected from among suitable servers that have an average RTT
(``RTT``) within the latency window (i.e. ``A ≤ RTT ≤ B``). In other words, the
suitable server with the shortest average RTT is **always** a possible choice.
Other servers could be chosen if their average RTTs are no more than
``localThresholdMS`` more than the shortest average RTT.

See either `Single-threaded server selection`_ or `Multi-threaded or
asynchronous server selection`_ for information on how to select a server from
among those within the latency window.

In other words, the suitable server with the shortest average RTT is **always**
a possible choice. Other servers could be chosen if their average RTTs are no
more than ``localThresholdMS`` more than the shortest average RTT.
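
A minimal sketch of this filtering step (helper and attribute names are
illustrative)::

    def filter_latency_window(suitable_servers, local_threshold_ms):
        # Keep only suitable servers whose average RTT lies within
        # [A, A + localThresholdMS], where A is the shortest average RTT.
        if not suitable_servers:
            return []
        shortest = min(s.average_rtt_ms for s in suitable_servers)
        upper_bound = shortest + local_threshold_ms
        return [s for s in suitable_servers if s.average_rtt_ms <= upper_bound]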

Checking an Idle Socket After socketCheckIntervalMS
---------------------------------------------------
@@ -1178,7 +1213,8 @@ Multi-threaded server selection implementation
The following example uses a single lock for clarity. Drivers are free to
implement whatever concurrency model best suits their design.

Pseudocode for `multi-threaded or asynchronous server selection`_::
The following is pseudocode for `multi-threaded or asynchronous server
selection`_::

def getServer(criteria):
client.lock.acquire()
@@ -1211,7 +1247,15 @@ Pseudocode for `multi-threaded or asynchronous server selection`_::

if servers is not empty:
in_window = servers within the latency window
selected = random entry from in_window
if len(in_window) == 1:
selected = in_window[0]
else:
server1, server2 = random two entries from in_window
if server1.operation_count <= server2.operation_count:
selected = server1
else:
selected = server2
selected.operation_count += 1
client.lock.release()
return selected

@@ -1232,7 +1276,7 @@ Pseudocode for `multi-threaded or asynchronous server selection`_::
Single-threaded server selection implementation
-----------------------------------------------

Pseudocode for `single-threaded server selection`_::
The following is pseudocode for `single-threaded server selection`_::

def getServer(criteria):
startTime = gettime()
@@ -1419,8 +1463,8 @@ However, given a choice between the two, ``localThreshold`` is a more general
term. For drivers, we add the ``MS`` suffix for clarity about units and
consistency with other configuration options.

Random selection within the latency window
------------------------------------------
Random selection within the latency window (single-threaded)
------------------------------------------------------------

When more than one server is judged to be suitable, the spec calls for random
selection to ensure a fair distribution of work among servers within the
@@ -1431,6 +1475,44 @@ servers to come and go. Making newly available servers either first or last
could lead to unbalanced work. Random selection has a better fairness
guarantee and keeps the design simpler.

operationCount-based selection within the latency window (multi-threaded or async)
----------------------------------------------------------------------------------

As operation execution slows down on a node (e.g. due to degraded server-side
performance or increased network latency), checked-out pooled connections to
that node will begin to remain checked out for longer periods of time. Assuming
at least constant incoming operation load, more connections will then need to be
opened against the node to service new operations that it gets selected for,
further straining it and slowing it down. This can lead to runaway connection
creation scenarios that can cripple a deployment ("connection storms"). As part
of DRIVERS-781, the random choice portion of multi-threaded server selection was
changed to more evenly spread out the workload among suitable servers in order
to prevent any single node from being overloaded. The new steps achieve this by
approximating an individual server's load via the number of concurrent
operations that node is processing (operationCount) and then routing operations
to servers with less load. This should reduce the number of new operations
routed towards nodes that are busier and thus increase the number routed towards
nodes that are servicing operations faster or are simply less busy. The previous
random selection mechanism did not take load into account and could assign work
to nodes that were under too much stress already.

As an added benefit, the new approach gives preference to nodes that have
recently been discovered and are thus more likely to be alive (e.g. during a
rolling restart). The narrowing to two random choices first ensures new servers
aren't overly preferred, however, preventing a "thundering herd" situation.
Additionally, the `maxConnecting`_ provisions included in the CMAP
specification prevent drivers from crippling new nodes with connection storms.

This approach is based on the `"Power of Two Random Choices with Least Connections" <https://web.archive.org/web/20191212194243/https://www.nginx.com/blog/nginx-power-of-two-choices-load-balancing-algorithm/>`_
load balancing algorithm.
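
As a rough, non-normative illustration of the effect (all numbers below are
arbitrary), a two-choice rule quickly shifts load away from a server that
drains its operations more slowly::

    import random

    def p2c_select(servers):
        # Power of two random choices over in-flight operation counts.
        if len(servers) == 1:
            return servers[0]
        a, b = random.sample(servers, 2)
        return a if a["in_flight"] <= b["in_flight"] else b

    servers = [
        {"name": "fast", "rate": 10, "in_flight": 0, "assigned": 0},
        {"name": "slow", "rate": 2, "in_flight": 0, "assigned": 0},
    ]

    for _ in range(10000):
        # Eight operations arrive per tick and are routed via the rule above.
        for _ in range(8):
            s = p2c_select(servers)
            s["in_flight"] += 1
            s["assigned"] += 1
        # Each server completes at most `rate` operations per tick, so the
        # slow server's backlog (its in-flight count) stays higher.
        for s in servers:
            s["in_flight"] = max(0, s["in_flight"] - s["rate"])

    for s in servers:
        print(s["name"], s["assigned"])  # the slow server receives far fewer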

An alternative approach to this would be to prefer selecting servers that
already have available connections. While that approach could help reduce
latency, it does not achieve the benefits of routing operations away from slow
servers or of preferring newly introduced servers. Additionally, that approach
could lead to the same node being selected repeatedly rather than spreading the
load out among all suitable servers.

The slaveOK wire protocol flag
------------------------------

@@ -1536,6 +1618,7 @@ considers the socket closed, without requiring a round-trip to the server.
However, this technique usually will not detect an uncleanly shutdown server or
a network outage.


Backwards Compatibility
=======================

@@ -1629,7 +1712,7 @@ Mongos HA has similar problems with pinning, in that one can wind up pinned
to a high-latency mongos even if a lower-latency mongos later becomes
available.

Random selection within the latency window avoids this problem and makes server
Selection within the latency window avoids this problem and makes server
selection exactly analogous to having multiple suitable servers from a replica
set. This is easier to explain and implement.

@@ -1705,17 +1788,21 @@ The user's intent in specifying two tag sets was to fall back to the second set
if needed, so we filter by maxStalenessSeconds first, then tag_sets, and select
Node 2.


References
==========

- `Server Discovery and Monitoring`_ specification
- `Driver Authentication`_ specification
- `Connection Monitoring and Pooling`_ specification

.. _Server Discovery and Monitoring: https://github.com/mongodb/specifications/tree/master/source/server-discovery-and-monitoring
.. _heartbeatFrequencyMS: https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#heartbeatfrequencyms
.. _Max Staleness: https://github.com/mongodb/specifications/tree/master/source/max-staleness
.. _idleWritePeriodMS: https://github.com/mongodb/specifications/blob/master/source/max-staleness/max-staleness.rst#idlewriteperiodms
.. _Driver Authentication: https://github.com/mongodb/specifications/blob/master/source/auth
.. _maxConnecting: /source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst#connection-pool
.. _Connection Monitoring and Pooling: /source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst

Changes
=======
@@ -1769,5 +1856,8 @@ selection rules.

2020-03-17: Specify read preferences with support for server hedged reads

2020-10-10: Consider server load when selecting servers within the latency
window.

.. [#] mongos 3.4 refuses to connect to mongods with maxWireVersion < 5,
so it does no additional wire version checks related to maxStalenessSeconds.
37 changes: 37 additions & 0 deletions source/server-selection/tests/README.rst
@@ -65,3 +65,40 @@ the TopologyDescription. Each YAML file contains a key for these stages of serve
Drivers implementing server selection MUST test that their implementation
correctly returns the set of servers in ``in_latency_window``. Drivers SHOULD also test
against ``suitable_servers`` if possible.

Selection Within Latency Window Tests
-------------------------------------
Contributor Author:
Now that the "algorithm" is much simpler, is it still worth it to have all drivers implement these unit tests?

Member:
I think it's still worthwhile. I very much prefer these tests to no tests. That said, do you have anything in mind for real end-to-end tests for this feature?

Contributor Author:
We're limited a bit by the topology. We could include some of those experiments I did as prose tests--for example, describe a test against a two mongos sharded topology where one of the mongoses has a failpoint that makes every operation take 500ms, then do a ton of concurrent stuff and then assert that the non-failpoint node got picked a lot more. While this does verify the operationCount behavior, I'm not sure whether this is preferable to having a bunch of unit spec tests, though.

Member:
Please open a jira ticket so we can further discuss adding that prose test. I'm in favor of it.

Contributor Author:
Should I just go ahead and add it now? I think it'll be most useful if drivers can have the test case when they're implementing this for the first time.

Member:
Yeah let's just add it now. I doubt the unified test format is expressive enough for a test like this so it'll need to be a prose test like you said. You can use the blockConnection option for failCommand like this:

db.adminCommand({
    configureFailPoint: "failCommand",
    mode: {times: 100000},  // or "alwaysOn"
    data: {
      failCommands: ["find"],
      blockConnection: true,
      blockTimeMS: 500,
      appName: "loadBalancingTest",
    },
});

Contributor Author:
Done. When you run it, can you let me know what you get for the % of operations routed to the slow node? Curious to see if there are any differences across drivers.

Member:
Implemented. I ran it a few times and see:

{('localhost', 27017): 11, ('localhost', 27018): 89}
{('localhost', 27017): 12, ('localhost', 27018): 88}
{('localhost', 27017): 13, ('localhost', 27018): 87}
{('localhost', 27017): 15, ('localhost', 27018): 85}
{('localhost', 27017): 17, ('localhost', 27018): 83}

Exciting to see this feature in action!

Contributor Author:
I get around 12-15% too, so it seems like our implementations are consistent, nice!


These tests verify that drivers select servers from within the latency
window correctly. These tests MUST only be implemented by
multi-threaded or async drivers.

Each YAML file for these tests has the following format:

- ``topology_description``: the state of a mocked cluster

- ``in_window``: array of servers in the latency window that the selected server
is to be chosen from. Each element will have all of the following fields:

- ``address``: a unique address identifying this server

- ``operation_count``: the ``operationCount`` for this server

- ``expected_frequencies``: a document whose keys are the server addresses from the
``in_window`` array and values are numbers in [0, 1] indicating the frequency
at which the server should have been selected.

For each file, pass the information from ``in_window`` to whatever function is
Member:
The test should also configure the read preference used for server selection. I suggest adding a read_preference field to be consistent with the existing tests.

Contributor Author:
I think we discussed this in the meeting, but since this test is verifying what happens after the set of available servers is determined and the topologies will always be sharded, a read preference shouldn't be necessary here.

Member:
Hmm maybe my test runner implementation is different. In mine the read preference does actually have an impact. Right now I'm arbitrarily using Nearest. My implementation is roughly:

topology = create_mock_topology_from(test['topology_description'])
mock_operation_counts(topology, test['in_window'])  # Mock operationCount for each server in 'in_window'
read_preference = Nearest()
counts = ...
for _ in range(ITERATIONS):
        server = topology.select_server(read_preference)
        counts[server.address] += 1
...

Contributor Author:
Ah I see, yeah that makes sense. In my implementation I skipped right to the in-window portion, so I didn't need a TopologyDescription or a read preference. I imagine most drivers will probably implement your way though.

Rather than including a read preference in each spec test, I just updated the runner's description to say use a default or "primary" read preference since it shouldn't matter for sharded topologies.

Member:
It still feels a little odd that we have no explicit tests for replica sets. Perhaps the prose test can cover it?

Contributor Author:
The server selection logic tests already ensure that drivers properly determine what is in the latency window across different topology types, and these tests ensure that a server is selected properly from what's within the latency window. Putting those together should provide us full coverage, so I don't think it's necessary to test all topologies in these tests.

Given that selecting from within the window is topology agnostic, the topology is provided in these tests purely as a convenience for implementing the test runners; it doesn't really have any bearing on the results of the tests.

Member:
On the other hand, it's not unreasonable for a driver to have two code paths for server selection, and adding one test for a replica set is trivial, so why not add a single replica set test? The test format supports it as long as we say to use Nearest read preference instead of Primary.

Contributor Author:
My main hesitation for adding it was that if we're treating the provided topology as more than just a convenience (i.e. as part of the tests' coverage), then drivers would be required to use it to implement the tests, even though it isn't necessary since the feature we're testing should ideally be topology agnostic. That said, if we think there could be separate implementations for selection based on topology, then it's probably worth having full coverage of that. Updated to include a few replset tests.

used to select a server from within the latency window 10000 times, counting how
many times each server is selected. Once 10000 selections have been made, verify
that each server was selected at a frequency within 0.02 of the frequency
contained in ``expected_frequencies`` for that server. If the expected frequency
for a given server is 1 or 0, then the observed frequency MUST be exactly equal
to the expected one.

Mocking may be required to implement these tests. A mocked topology description
is included in each file for drivers that require a full description to
implement these tests. If a ReadPreference needs to be specified as part of
running these tests, specify one with the default ("primary") mode. The provided
topologies will always be sharded so this should not have an effect on the
results of the test.
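
A sketch of such a runner (the within-window selection helper shown here is a
stand-in for whatever function the driver actually exposes)::

    import random

    ITERATIONS = 10000
    TOLERANCE = 0.02

    def select_in_window(in_window):
        # Stand-in for the driver's within-window selection, included only so
        # the sketch is self-contained.
        if len(in_window) == 1:
            return in_window[0]
        a, b = random.sample(in_window, 2)
        return a if a["operation_count"] <= b["operation_count"] else b

    def run_case(test):
        counts = {s["address"]: 0 for s in test["in_window"]}
        for _ in range(ITERATIONS):
            selected = select_in_window(test["in_window"])
            counts[selected["address"]] += 1
        for address, expected in test["expected_frequencies"].items():
            observed = counts[address] / ITERATIONS
            if expected in (0, 1):
                assert observed == expected
            else:
                assert abs(observed - expected) <= TOLERANCE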