GODRIVER-2762 Use minimum RTT for CSOT #1507
Conversation
@@ -663,7 +664,7 @@ func (s *Server) update() {
		s.rttMonitor.connect()
	}

-	if isStreamable(s) || connectionIsStreaming || transitionedFromNetworkError {
+	if isStreamable(s) && (serverSupportsStreaming || connectionIsStreaming) || transitionedFromNetworkError {
The logic for this was not implemented correctly; see the specification here: https://github.com/mongodb/specifications/blame/master/source/server-discovery-and-monitoring/server-monitoring.rst#L747
@@ -186,6 +186,15 @@ func executeTestRunnerOperation(ctx context.Context, operation *operation, loopD
			}
		}
	}
+		return nil
+	case "wait":
This is scheduled to be added in GODRIVER-2466, but this part is required for the command-execution.json UST.
x/mongo/driver/operation.go
Outdated
@@ -1546,23 +1540,22 @@ func (op Operation) addClusterTime(dst []byte, desc description.SelectedServer)
 // if the ctx is a Timeout context. If the context is not a Timeout context, it uses the
 // operation's MaxTimeMS if set. If no MaxTimeMS is set on the operation, and context is
 // not a Timeout context, calculateMaxTimeMS returns 0.
-func (op Operation) calculateMaxTimeMS(ctx context.Context, rtt90 time.Duration, rttStats string) (uint64, error) {
+func (op Operation) calculateMaxTimeMS(ctx context.Context, rtt RTTMonitor, rttStats string) (uint64, error) {
Since Min() is the only method called on rtt, passing minRTT as a time.Duration makes testing this method significantly simpler, because the test doesn't have to create a type that satisfies the RTTMonitor interface.

An additional simplification is to make calculateMaxTimeMS a function that accepts opMaxTime *time.Duration, so that you don't have to create an Operation to test it.
	r.samples[r.offset] = rtt
	r.offset = (r.offset + 1) % len(r.samples)
As far as I can tell, the existing logic for calculating "min" and "p90" is only used in the Stats method. If that's the case, we should remove that logic and update the Stats method to return information based on the "min" implementation that is actually used.
-	return r.minRTT
+	return r.min()
Calculating the minimum RTT when Min is called can create a performance issue. We expect Min to be called once for every operation, which could be thousands of times per second. In contrast, we expect new samples to be added about once every 10 seconds. Because of that, it's better to calculate the updated minimum RTT when a new sample is added instead of when Min is called.
-		`average RTT: %v, minimum RTT: %v, 90th percentile RTT: %v, standard dev: %v`+"\n",
-		time.Duration(avg), r.minRTT, r.rtt90, time.Duration(stdDev))
+		`average RTT: %v, minimum RTT: %v, standard dev: %v`+"\n",
+		time.Duration(avg), r.minRTT, time.Duration(stdDev))
Can we calculate stdDev using the movingMin values instead of samples?
There is no mention in the specifications of propagating the standard deviation of RTT samples, so I think we can use either the full set of samples or the movingMin window.

I'm assuming that we provide this information for users to determine whether the stability of the connection could potentially be responsible for a timeout error. That is, how close are the RTTs to the mean? Basing this information on movingMin, a 10-element window of all of the samples, will only tell us how close that window is to its own mean. Furthermore, there is no obvious way to calculate the overall standard deviation as an aggregation of these windows, since the windows themselves are not disjoint. If the goal is to give the user insight into the variability or dispersion of the RTT values, then not using the full sample set would be a mistake.

The most comprehensive story we could give would be to show both: (1) the standard deviation of all samples taken up to the error, and (2) the standard deviation of the most recent window of samples.
The stddev was indeed added to give users (and us) an idea of the distribution of RTT measurements when the RTT logic short-circuits an operation. That was more important when we were using the 90th percentile RTT, which is more sensitive to widely distributed RTT values than the minimum. However, since we're using minimum and only keeping 10 samples, I think we should include those 10 samples in the error message instead of deriving stddev.
Keep in mind that samples isn't a complete set of RTT samples either; it's a moving window of 5 minutes of samples.
If we really wanted to quantify connection stability and remove the samples slice, we could create a type to hold the sum of variances calculated on every maxRTTSamplesForMovingMin-th call to appendMovingMin. We would also need to keep track of the number of calls to appendMovingMin. Then we could provide an average standard deviation, something like this:
	const maxRTTSamplesForMovingMin = 10

	type rttMonitor struct {
		varSum                 float64
		callsToAppendMovingMin int
		movingMin              *list.List
	}

	func (rtt *rttMonitor) appendMovingMin(t time.Duration) {
		defer func() { rtt.callsToAppendMovingMin++ }()

		if rtt.movingMin == nil || t < 0 {
			return
		}

		if rtt.movingMin.Len() == maxRTTSamplesForMovingMin {
			rtt.movingMin.Remove(rtt.movingMin.Front())
		}

		rtt.movingMin.PushBack(t)

		// Collect a sum of variances every 10 calls, ignoring the first call.
		if rtt.callsToAppendMovingMin > 1 && rtt.callsToAppendMovingMin%10 == 0 {
			// Piecewise update the latest variance, varSum.
		}
	}
We could also just qualify using the last ten samples:
	return fmt.Sprintf(`Round-trip-time monitor statistics:`+"\n"+
		`average RTT: %v, minimum RTT: %v, standard dev of last 10 samples: %v`+"\n",
		time.Duration(avg), r.minRTT, time.Duration(stdDev))
I do think we should remove samples, since it adds a lot of complexity for only giving us stddev. That also allows us to stop using the "github.com/montanaflynn/stats" package and remove it from our dependencies (after removing it from the benchmark package).

Your plan to calculate the average stddev sounds great! Should we use an average or a moving average like EWMA?
MSD would be more sensitive to dispersion, which is good. This would just be the average of the standard deviations of the windows:

	if rtt.callsToAppendMovingMin >= maxRTTSamplesForMovingMin {
		// Update the latest variance, varSum.
	}

Here is a working example; you can roughly confirm the hypothesis by adjusting the rang variable, which should grow larger by a factor of itself: https://go.dev/play/p/s54Z4BQWPCR
I think this should be done as a follow-up ticket: GODRIVER-3095
@@ -47,10 +50,10 @@ type rttMonitor struct {
	connMu  sync.Mutex
	samples []time.Duration
Now that we calculate minRTT using movingMin, can we remove samples?
If we want to give the user the standard deviation of the RTTs, then we need to keep the samples slice.
I don't think stddev is necessary anymore since we're using minimum RTT calculated from only 10 samples. See comment above.
x/mongo/driver/operation.go
Outdated
-	if csot.IsTimeoutContext(ctx) && time.Now().Add(srvr.RTTMonitor().P90()).After(deadline) {
-		err = fmt.Errorf(
-			"remaining time %v until context deadline is less than 90th percentile RTT: %w\n%v",
-			time.Until(deadline),
-			ErrDeadlineWouldBeExceeded,
-			srvr.RTTMonitor().Stats())
-	} else if time.Now().Add(srvr.RTTMonitor().Min()).After(deadline) {
+	if time.Now().Add(srvr.RTTMonitor().Min()).After(deadline) {
 		err = context.DeadlineExceeded
We should use ErrDeadlineWouldBeExceeded here with the additional info of the remaining time and the RTT stats (probably just the full list of 10 RTT samples).
x/mongo/driver/operation.go
Outdated
 	if maxTimeMS <= 0 {
 		return 0, fmt.Errorf(
 			"remaining time %v until context deadline is less than or equal to 90th percentile RTT: %w\n%v",
 			remainingTimeout,
 			ErrDeadlineWouldBeExceeded,
 			rttStats)
 	}

-	return uint64(maxTimeMS), nil
+	return uint64(maxTimeMS.Milliseconds()), nil
Calling Milliseconds on a duration value smaller than 1ms will truncate it to 0. We need to retain the "round up" behavior described above to prevent unintentionally sending maxTimeMS=0 (0 means "no timeout" to the server).
API Change Report
./x/mongo/driver
incompatible changes:
RTTMonitor.P90: removed
Looks good! 👍
@@ -12,7 +12,6 @@ require go.mongodb.org/mongo-driver v1.11.7

 require (
 	github.com/golang/snappy v0.0.1 // indirect
 	github.com/klauspost/compress v1.13.6 // indirect
-	github.com/montanaflynn/stats v0.0.0-20171201202039-1bf9dbcd8cbe // indirect
Can we actually remove this yet? It looks like it's still used in internal/benchmark/harness_results.go. It seems fine if the compile check passes; I'm just curious.

I created GODRIVER-3096 to track the rest of the work to remove the dependency from the Go driver's go.mod.
That's a good question. This was done automatically; I'm assuming it's because internal/benchmark isn't needed for our use case of the dependency:

"If needed, it adds require directives to your go.mod file for modules needed to build packages named on the command line. A require directive tracks the minimum version of a module that your module depends on." https://go.dev/doc/modules/managing-dependencies#adding_dependency
Co-authored-by: Matt Dale <9760375+matthewdale@users.noreply.github.com>
GODRIVER-2762
Summary
This PR removes the rtt90 logic throughout the Go Driver and implements the "moving minimum" logic defined in mongodb/specifications@c06650d .
Background & Motivation
From the PR: