GODRIVER-2762 Use minimum RTT for CSOT #1507

prestonvasquez · 2023-12-21T23:02:09Z

Summary

This PR removes the rtt90 logic throughout the Go Driver and implements the "moving minimum" logic defined in mongodb/specifications@c06650d.

Background & Motivation

From the PR:

Drivers must use the minimum RTT for CSOT maxTimeMS calculation instead of 90th percentile. At least 2 RTT samples are required otherwise drivers must use 0 as RTT. Only keep at most the last 10 samples. These changes were made to avoid preemptively failing operations due to inaccurate or unstable RTT measurements.

prestonvasquez · 2023-12-21T23:03:39Z

x/mongo/driver/topology/server.go

@@ -663,7 +664,7 @@ func (s *Server) update() {
 			s.rttMonitor.connect()
 		}

-		if isStreamable(s) || connectionIsStreaming || transitionedFromNetworkError {
+		if isStreamable(s) && (serverSupportsStreaming || connectionIsStreaming) || transitionedFromNetworkError {


The logic for this was not implemented correctly; see the specification here: https://github.com/mongodb/specifications/blame/master/source/server-discovery-and-monitoring/server-monitoring.rst#L747

prestonvasquez · 2023-12-21T23:05:42Z

internal/integration/unified/testrunner_operation.go

@@ -186,6 +186,15 @@ func executeTestRunnerOperation(ctx context.Context, operation *operation, loopD
 				}
 			}
 		}
+		return nil
+	case "wait":


This is scheduled to be added in GODRIVER-2466, but this part is required for the command-execution.json UST.

matthewdale · 2023-12-29T05:40:23Z

x/mongo/driver/operation.go

@@ -1546,23 +1540,22 @@ func (op Operation) addClusterTime(dst []byte, desc description.SelectedServer)
 // if the ctx is a Timeout context. If the context is not a Timeout context, it uses the
 // operation's MaxTimeMS if set. If no MaxTimeMS is set on the operation, and context is
 // not a Timeout context, calculateMaxTimeMS returns 0.
-func (op Operation) calculateMaxTimeMS(ctx context.Context, rtt90 time.Duration, rttStats string) (uint64, error) {
+func (op Operation) calculateMaxTimeMS(ctx context.Context, rtt RTTMonitor, rttStats string) (uint64, error) {


Since Min() is the only method called on rtt, passing minRTT as a time.Duration makes testing this method significantly simpler because the test doesn't have to create a type that satisfies the RTTMonitor interface.

An additional simplification is to make calculateMaxTimeMS a function that accepts

opMaxTime *time.Duration

so that you don't have to create an Operation to test it.

matthewdale · 2023-12-29T18:31:41Z

x/mongo/driver/topology/rtt_monitor.go

 	r.samples[r.offset] = rtt
 	r.offset = (r.offset + 1) % len(r.samples)


As far as I can tell, the existing logic for calculating "min" and "p90" is only used in the Stats method. If that's the case, we should remove that logic and update the Stats method to return information based on the "min" implementation that is actually used.

matthewdale · 2023-12-29T18:34:14Z

x/mongo/driver/topology/rtt_monitor.go

-	return r.minRTT
+	return r.min()


Calculating the minimum RTT when Min is called can create a performance issue. We expect Min to be called once for every operation, which could be thousands of times per second. In contrast, we expect new samples to be added about once every 10 seconds. Because of that, it's better to calculate the updated minimum RTT when a new sample is added instead of when Min is called.

matthewdale · 2024-01-03T22:53:07Z

x/mongo/driver/topology/rtt_monitor.go

-		`average RTT: %v, minimum RTT: %v, 90th percentile RTT: %v, standard dev: %v`+"\n",
-		time.Duration(avg), r.minRTT, r.rtt90, time.Duration(stdDev))
+		`average RTT: %v, minimum RTT: %v, standard dev: %v`+"\n",
+		time.Duration(avg), r.minRTT, time.Duration(stdDev))


Can we calculate stdDev using the movingMin values instead of samples?

There is no mention in the specifications on propagating the standard deviation of RTT samples, so I think we can use the full set of samples or the movingMin window.

I'm assuming that we provide this information for users to determine if the stability of the connection could potentially be responsible for a timeout error. That is, how close are the RTTs to the mean? Basing this information on the movingMin, a 10-element window of all of the samples, will only tell us how close that window is to its own mean. Furthermore, there is no obvious way to calculate the overall standard deviation as an aggregation of these windows since the windows themselves are not disjoint. If the goal is to give the user insights on the variability or dispersion of the RTT values, then not using the full sample set would be a mistake.

The best thing we could do to give the most comprehensive story is to show both: (1) the standard deviation of all samples taken up to the error, (2) the standard deviation of the most recent window of samples.

The stddev was indeed added to give users (and us) an idea of the distribution of RTT measurements when the RTT logic short-circuits an operation. That was more important when we were using the 90th percentile RTT, which is more sensitive to widely distributed RTT values than the minimum. However, since we're using minimum and only keeping 10 samples, I think we should include those 10 samples in the error message instead of deriving stddev.

Keep in mind that samples isn't a complete set of RTT samples either, it's a moving window of 5 minutes of samples.

If we really wanted to quantify connection stability and remove the samples slice, we could create a type to hold the sum of variances calculated every minRTTSamplesForMovingMin-th call to appendMovingMin. We would need another field to keep track of calls to appendMovingMin. Then we could provide an average standard deviation, something like this:

    const maxRTTSamplesForMovingMin = 10

    type rttMonitor struct {
    	varSum                 float64
    	callsToAppendMovingMin int
    	movingMin              *list.List
    }

    func (rtt *rttMonitor) appendMovingMin(t time.Duration) {
    	defer func() { rtt.callsToAppendMovingMin++ }()

    	if rtt.movingMin == nil || t < 0 {
    		return
    	}

    	if rtt.movingMin.Len() == maxRTTSamplesForMovingMin {
    		rtt.movingMin.Remove(rtt.movingMin.Front())
    	}

    	rtt.movingMin.PushBack(t)

    	// Collect a sum of variances every 10 calls, ignoring the first call.
    	if rtt.callsToAppendMovingMin > 1 && rtt.callsToAppendMovingMin%10 == 0 {
    		// Piecewise update latest variance, varSum
    	}
    }

We could also just qualify using the last ten samples:

    return fmt.Sprintf(`Round-trip-time monitor statistics:`+"\n"+
    	`average RTT: %v, minimum RTT: %v, %v, standard dev of last 10 samples: %v`+"\n",
    	time.Duration(avg), r.minRTT, r.rtt90, time.Duration(stdDev))

I do think we should remove samples since it adds a lot of complexity for only giving us stddev. That also allows us to stop using the "github.com/montanaflynn/stats" package and remove it from our dependencies (after removing it from the benchmark package).

Your plan to calculate the average stddev sounds great! Should we use average or a moving average like EWMA?

MSD would be more sensitive to dispersion, which is good. This would just be the average of the standard deviations of the $S_{10}$ moving average defined here, which would result in an N×1 vector of standard deviations, like this. This would be the only change required IIUC:

    if rtt.callsToAppendMovingMin >= maxRTTSamplesForMovingMin {
    	// Update latest variance, varSum
    }

Here is a working example; you can roughly confirm the hypothesis by adjusting the rang variable, which should grow larger by a factor of itself: https://go.dev/play/p/s54Z4BQWPCR

I think this should be done as a follow-up ticket: GODRIVER-3095

matthewdale · 2024-01-03T22:53:58Z

x/mongo/driver/topology/rtt_monitor.go

@@ -47,10 +50,10 @@ type rttMonitor struct {
 	connMu        sync.Mutex
 	samples       []time.Duration


Now that we calculate minRTT using movingMin, can we remove samples?

If we want to give the user the standard deviation of the RTTs, then we need to keep the samples slice.

I don't think stddev is necessary anymore since we're using minimum RTT calculated from only 10 samples. See comment above.

matthewdale · 2024-01-10T01:19:32Z

x/mongo/driver/operation.go

-			if csot.IsTimeoutContext(ctx) && time.Now().Add(srvr.RTTMonitor().P90()).After(deadline) {
-				err = fmt.Errorf(
-					"remaining time %v until context deadline is less than 90th percentile RTT: %w\n%v",
-					time.Until(deadline),
-					ErrDeadlineWouldBeExceeded,
-					srvr.RTTMonitor().Stats())
-			} else if time.Now().Add(srvr.RTTMonitor().Min()).After(deadline) {
+			if time.Now().Add(srvr.RTTMonitor().Min()).After(deadline) {
 				err = context.DeadlineExceeded


We should use ErrDeadlineWouldBeExceeded here with the additional info of the remaining time and the RTT stats (probably just the full list of 10 RTT samples).

matthewdale · 2024-01-10T01:32:59Z

x/mongo/driver/operation.go

 			if maxTimeMS <= 0 {
 				return 0, fmt.Errorf(
 					"remaining time %v until context deadline is less than or equal to 90th percentile RTT: %w\n%v",
 					remainingTimeout,
 					ErrDeadlineWouldBeExceeded,
 					rttStats)
 			}
-			return uint64(maxTimeMS), nil
+			return uint64(maxTimeMS.Milliseconds()), nil


Calling Milliseconds on a duration value smaller than 1ms will truncate it to 0. We need to retain the "round up" behavior described above to prevent unintentionally sending maxTimeMS=0 (0 means "no timeout" to the server).

mongodb-drivers-pr-bot · 2024-01-11T21:55:40Z

API Change Report

./x/mongo/driver

incompatible changes

RTTMonitor.P90: removed

matthewdale

Looks good! 👍

matthewdale · 2024-01-12T00:56:59Z

internal/cmd/compilecheck/go.mod

@@ -12,7 +12,6 @@ require go.mongodb.org/mongo-driver v1.11.7
 require (
 	github.com/golang/snappy v0.0.1 // indirect
 	github.com/klauspost/compress v1.13.6 // indirect
-	github.com/montanaflynn/stats v0.0.0-20171201202039-1bf9dbcd8cbe // indirect


Can we actually remove this yet? Looks like it's still used in internal/benchmark/harness_results.go. It seems fine if the compile check compiles, I'm just curious.

I created GODRIVER-3096 to track the rest of the work to remove the dependency from the Go driver's go.mod.

That's a good question. This was done automatically; I'm assuming it's because internal/benchmark isn't needed for our use case of the dependency:

If needed, it adds require directives to your go.mod file for modules needed to build packages named on the command line. A require directive tracks the minimum version of a module that your module depends on: https://go.dev/doc/modules/managing-dependencies#adding_dependency

prestonvasquez added 2 commits December 21, 2023 15:54

GODRIVER-2762 Use minimum RTT for CSOT maxTimeMS calculation instead of 90th percentile

76fe99a

GODRIVER-2762 Clean up rtt monitor code

c96883a

prestonvasquez requested a review from a team as a code owner December 21, 2023 23:02

prestonvasquez requested review from blink1073 and removed request for a team December 21, 2023 23:02

GODRIVER-2762 Add command-execution.yml

88e9f0d

prestonvasquez requested a review from matthewdale December 21, 2023 23:07

prestonvasquez added the priority-2-medium Medium Priority PR for Review label Dec 21, 2023

GODRIVER-2762 Fix typo

85812f6

matthewdale reviewed Dec 29, 2023

prestonvasquez added 2 commits December 29, 2023 11:48

Merge branch 'master' into GODRIVER-2762

1a0517c

GODRIVER-2762 PR requests

49298ee

prestonvasquez requested a review from matthewdale December 29, 2023 20:13

matthewdale reviewed Jan 3, 2024

GODRIVER-2762 Remove reference to RTT90 in max time calc

04aeb6d

prestonvasquez requested a review from matthewdale January 5, 2024 19:10

matthewdale reviewed Jan 10, 2024

GODRIVER-2762 Fix rounding, cleanup deadline exceeded error

0ba0d69

prestonvasquez requested a review from matthewdale January 11, 2024 21:31

matthewdale previously approved these changes Jan 12, 2024

Merge branch 'master' into GODRIVER-2762

bccb529

prestonvasquez dismissed matthewdale’s stale review via bccb529 January 12, 2024 02:35

prestonvasquez requested a review from matthewdale January 12, 2024 02:35

prestonvasquez added 2 commits January 12, 2024 18:01

GODRIVER-2762 Use args in wait event

698e4e0

Merge branch 'GODRIVER-2762' of github.com:prestonvasquez/mongo-go-driver into GODRIVER-2762

8b10b54

matthewdale previously approved these changes Jan 13, 2024

prestonvasquez dismissed matthewdale’s stale review via ce0c7c2 January 13, 2024 02:58

prestonvasquez and others added 2 commits January 12, 2024 19:58

Update x/mongo/driver/topology/rtt_monitor.go

ce0c7c2

Co-authored-by: Matt Dale <9760375+matthewdale@users.noreply.github.com>

Merge branch 'master' into GODRIVER-2762

6095151

prestonvasquez requested a review from matthewdale January 16, 2024 15:56

blink1073 approved these changes Jan 16, 2024

prestonvasquez merged commit df800a9 into mongodb:master Jan 16, 2024
37 of 40 checks passed

prestonvasquez deleted the GODRIVER-2762 branch January 16, 2024 21:27
