query scheduler test for multidimensional queueing effectiveness scenarios #7162
Conversation
pkg/scheduler/scheduler_test.go
Outdated
Is this change required?
not technically required of course, but there are only two initSomethingLoop funcs in this test suite, those two loops talk to each other, and previously the two funcs were 400 lines apart, so I found this helpful
pkg/scheduler/queue/queue_test.go
Outdated
```go
// production code uses a histogram for queue duration, but the counter is a more direct metric for test cases
queueDuration := promauto.With(promRegistry).NewCounterVec(prometheus.CounterOpts{
	Name: "cortex_query_scheduler_queue_duration_total_seconds",
	Help: "Total time spent by requests in queue before getting picked up by a querier.",
}, []string{"additional_queue_dimensions"})
```
Do we need a Prometheus metric at all here? I'm wondering if we could just have two atomic floats, one for normal requests and the other for slow requests, and accumulate the total duration in those
sure, but isn't the prometheus metric just a more convenient interface for multiple atomic floats? It also allows us to reuse the same code for recording the observations as we use in the prod scheduler code.
I don't have strong preference on Prometheus metric vs atomic floats but I'd rather see this named something different if you decide to keep the Prometheus metric. Creating a metric named like this just for a test is distracting to me. It makes me pause and double check that this metric is only for the test here.
fixed metric name and improved comment above it 👍
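For reference, the atomic-floats alternative mentioned in this thread could look roughly like the sketch below. This is only an illustration, not code from the PR: Go's sync/atomic has no float type, so the usual approach is a compare-and-swap loop over the float's bit pattern. All names here (atomicFloat64, the 0.01/0.05 durations) are hypothetical.

```go
package main

// Sketch (assumption, not PR code): accumulate total queue duration for
// "normal" and "slow" requests in two atomic floats instead of a
// Prometheus counter. sync/atomic lacks a float type, so we CAS over
// the float's bit pattern.

import (
	"fmt"
	"math"
	"sync"
	"sync/atomic"
)

type atomicFloat64 struct{ bits atomic.Uint64 }

// Add accumulates delta via a compare-and-swap loop on the bit pattern.
func (f *atomicFloat64) Add(delta float64) {
	for {
		old := f.bits.Load()
		upd := math.Float64bits(math.Float64frombits(old) + delta)
		if f.bits.CompareAndSwap(old, upd) {
			return
		}
	}
}

func (f *atomicFloat64) Load() float64 { return math.Float64frombits(f.bits.Load()) }

func main() {
	var normal, slow atomicFloat64
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if i%2 == 0 {
				normal.Add(0.01) // pretend queue duration in seconds
			} else {
				slow.Add(0.05)
			}
		}(i)
	}
	wg.Wait()
	fmt.Printf("normal=%.2f slow=%.2f\n", normal.Load(), slow.Load())
}
```

The tradeoff raised above still applies: this avoids a registry, but a labeled CounterVec lets the test record observations the same way the production scheduler code does.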
```go
expected := queueDurationTotals[false][normalQueueDimension] / 2
actual := queueDurationTotals[true][normalQueueDimension]
```
Given all we need here is the duration of normal requests and not the slow requests, could we simplify queueDurationTotals to only store the normal request durations?
we could, but even though we are not asserting on it at the end, it is still a metric that is helpful to look at when building and understanding the behavior of the test case
pkg/scheduler/queue/queue_test.go
Outdated
```go
startProducersChan := make(chan struct{})
producersErrGroup, _ := errgroup.WithContext(ctx)

runProducer := runQueueProducerIters(
	queue, maxQueriersPerTenant, totalRequests/2, numProducers, numTenants, startProducersChan, queueDimensionFunc,
)
for producerIdx := 0; producerIdx < numProducers; producerIdx++ {
	producerIdx := producerIdx
	producersErrGroup.Go(func() error {
		return runProducer(producerIdx)
	})
}
close(startProducersChan)
err := producersErrGroup.Wait()
if err != nil {
	require.NoError(t, err)
}
```
Given we want to fill the queue ahead of time, and don't need to do anything like waiting for some requests to be consumed, could we just directly add requests to the queue here with queueProduce? This would remove the need for the bookkeeping introduced by the channels and WaitGroup etc.
I'll check - I originally did this because I was adding >1000 requests, and the time it took to add them all to the queue with just a loop was long enough that it skewed the timing results at the end.
I am also not totally convinced that waitGroup-ing a bunch of goroutines is something we should consider a lot of overhead/bookkeeping - particularly because this mirrors the other patterns in the test cases, which we can't change to a synchronous loop.
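The shape of the pattern being debated can be sketched with the stdlib alone (the PR itself uses errgroup from golang.org/x/sync; this sketch substitutes sync.WaitGroup, and the queue is stubbed with a channel, so all names here are illustrative):

```go
package main

// Sketch (assumption, not PR code) of the gated-producer pattern:
// create N producer goroutines first, then release them all at once by
// closing a start channel, so the wall-clock cost of filling the queue
// is concentrated and doesn't skew timings measured afterward.

import (
	"fmt"
	"sync"
)

func main() {
	const numProducers = 4
	const itersPerProducer = 250
	queue := make(chan int, numProducers*itersPerProducer) // stand-in for the request queue

	start := make(chan struct{})
	var wg sync.WaitGroup
	for producerIdx := 0; producerIdx < numProducers; producerIdx++ {
		wg.Add(1)
		go func(idx int) {
			defer wg.Done()
			<-start // block until every producer is ready
			for i := 0; i < itersPerProducer; i++ {
				queue <- idx // stand-in for enqueueing a request
			}
		}(producerIdx)
	}
	close(start) // release all producers simultaneously
	wg.Wait()
	fmt.Println("enqueued:", len(queue))
}
```

A plain synchronous loop would be simpler, as the reviewer suggests, but serializes the enqueue cost; which matters only when the fill time is comparable to the durations the test later asserts on.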
pkg/scheduler/queue/queue_test.go
Outdated
```go
close(startProducersChan)
err := producersErrGroup.Wait()
if err != nil {
	require.NoError(t, err)
}
```
The pattern of checking for a non-nil error before doing require.NoError is only really needed in benchmarks due to the overhead of require.NoError. It's fine to just call the require method on the result of the wait group in tests.
fixed 👍
```go
// variance is also a function of the number of consumers and the consumer delay chosen.
// variance can be tighter if the test runs longer, but there is a tradeoff for testing and CI speed
delta := expected * 0.10
require.InDelta(t, expected, actual, delta)
```
A test that depends on timing makes me nervous... BUT I haven't been able to get it to flake locally in a few hundred runs 👍
there should be plenty of padding in the parameters here, which are consumer delay time, number of consumers, and allowed delta.
Changing only the number of consumers, for example, I don't see it start flaking until I push it up to 5 consumers. I've run 10,000 iterations with 4 consumers with no flakes. The code here only uses 1 consumer, which has much more room for error.
LGTM, thanks!
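For readers unfamiliar with the assertion under discussion, the tolerance check boils down to simple arithmetic; a minimal stdlib sketch (values are made up for illustration, and inDelta is a hypothetical stand-in for testify's require.InDelta):

```go
package main

// Sketch of the tolerance check: the assertion passes when the measured
// value ("actual") is within delta of "expected", where delta is 10% of
// expected, as in the test above.

import (
	"fmt"
	"math"
)

func inDelta(expected, actual, delta float64) bool {
	return math.Abs(expected-actual) <= delta
}

func main() {
	expected := 2.0          // e.g. baseline queue duration total / 2
	delta := expected * 0.10 // 10% tolerance

	fmt.Println(inDelta(expected, 1.95, delta)) // within tolerance
	fmt.Println(inDelta(expected, 1.70, delta)) // outside tolerance
}
```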
Creates a contrived test to emulate the load scenario that the query scheduler queue splitting was created to address.
The test's docstring describes the scenario in fairly extensive detail.
Minor refactor of some test util funcs that were originally written only for BenchmarkConcurrentQueueOperations but are now shared between the two tests.
Open to suggestions about what other scenarios we could represent.
Checklist
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
- about-versioning.md updated with experimental features.