query scheduler test for multidimensional queueing effectiveness scenarios #7162

Conversation

Member

@francoposa francoposa commented Jan 18, 2024

Creates a contrived test to emulate the load scenario that the query scheduler queue splitting was created to address.

The test's docstring is fairly extensive; see it for more details.

Also includes a minor refactor of some test util funcs that were originally written only for BenchmarkConcurrentQueueOperations but are now shared between the two tests.

Open to suggestions about what other scenarios we could represent.

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@francoposa francoposa requested a review from a team as a code owner January 18, 2024 17:30
Contributor

Is this change required?

Member Author

not technically required of course, but there are only two initSomethingLoop funcs in this test suite, those two loops talk to each other, and previously the two funcs were 400 lines apart, so I found this helpful

pkg/scheduler/queue/queue_test.go (outdated; resolved)
Comment on lines 127 to 131
```go
// production code uses a histogram for queue duration, but the counter is a more direct metric for test cases
queueDuration := promauto.With(promRegistry).NewCounterVec(prometheus.CounterOpts{
	Name: "cortex_query_scheduler_queue_duration_total_seconds",
	Help: "Total time spent by requests in queue before getting picked up by a querier.",
}, []string{"additional_queue_dimensions"})
```
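For illustration, a hedged sketch of how observations would presumably be recorded against this counter once a consumer dequeues a request; the variable names here (enqueueTime, additionalQueueDimension) are assumptions for the example, not the PR's exact code:

```go
// hypothetical observation site: accumulate each request's time spent in queue,
// labeled by its additional queue dimension (variable names assumed for illustration)
queueDuration.WithLabelValues(additionalQueueDimension).Add(time.Since(enqueueTime).Seconds())
```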
Contributor

Do we need a Prometheus metric at all here? I'm wondering if we could just have two atomic floats, one for normal requests and the other for slow requests, and accumulate the total duration in those
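For comparison, a minimal sketch of the atomic-floats alternative being suggested, assuming go.uber.org/atomic for the float type; the struct and method names are hypothetical, not code from this PR:

```go
import "go.uber.org/atomic"

// hypothetical accumulator for the two request classes, in place of the CounterVec
type queueDurationAccumulator struct {
	normalSeconds *atomic.Float64 // total time normal requests spent in queue
	slowSeconds   *atomic.Float64 // total time slow requests spent in queue
}

func newQueueDurationAccumulator() *queueDurationAccumulator {
	return &queueDurationAccumulator{
		normalSeconds: atomic.NewFloat64(0),
		slowSeconds:   atomic.NewFloat64(0),
	}
}

// observe adds a single request's queue duration to the appropriate total
func (a *queueDurationAccumulator) observe(slow bool, seconds float64) {
	if slow {
		a.slowSeconds.Add(seconds)
		return
	}
	a.normalSeconds.Add(seconds)
}
```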

Member Author

sure, but isn't the prometheus metric just a more convenient interface for multiple atomic floats? It also allows us to reuse the same code for recording the observations as we use in the prod scheduler code.

Contributor

I don't have strong preference on Prometheus metric vs atomic floats but I'd rather see this named something different if you decide to keep the Prometheus metric. Creating a metric named like this just for a test is distracting to me. It makes me pause and double check that this metric is only for the test here.

Member Author

fixed metric name and improved comment above it 👍

Comment on lines +214 to +215
```go
expected := queueDurationTotals[false][normalQueueDimension] / 2
actual := queueDurationTotals[true][normalQueueDimension]
```
Contributor

Given all we need here is the duration of normal requests and not the slow requests, could we simplify queueDurationTotals to only store the normal request durations?

Member Author

we could, but even though we are not asserting on it at the end, it is still a metric that is helpful to look at when building and understanding the behavior of the test case

Comment on lines 151 to 167
```go
startProducersChan := make(chan struct{})
producersErrGroup, _ := errgroup.WithContext(ctx)

runProducer := runQueueProducerIters(
	queue, maxQueriersPerTenant, totalRequests/2, numProducers, numTenants, startProducersChan, queueDimensionFunc,
)
for producerIdx := 0; producerIdx < numProducers; producerIdx++ {
	producerIdx := producerIdx
	producersErrGroup.Go(func() error {
		return runProducer(producerIdx)
	})
}
close(startProducersChan)
err := producersErrGroup.Wait()
if err != nil {
	require.NoError(t, err)
}
```
Contributor

Given we want to fill the queue ahead of time, and don't need to do anything like waiting for some requests to be consumed, could we just directly add requests to the queue here with queueProduce? This would remove the need for the bookkeeping introduced by the channels and WaitGroup etc.
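Roughly, the synchronous alternative would be a plain loop; queueProduce's actual signature isn't shown in this thread, so the call below is a hypothetical sketch:

```go
// hypothetical sketch of filling the queue synchronously before starting consumers;
// the queueProduce parameters here are assumed, not the repo's actual helper signature
for i := 0; i < totalRequests/2; i++ {
	err := queueProduce(queue, maxQueriersPerTenant, i%numTenants, queueDimensionFunc)
	require.NoError(t, err)
}
```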

Member Author

I'll check - I originally did this because I was adding >1000 requests, and the time it took to add them all to the queue with just a loop was long enough that it skewed the timing results at the end.

I am also not totally convinced that waitGroup-ing a bunch of goroutines is something we should consider a lot of overhead/bookkeeping - particularly because this mirrors the other patterns in the test cases, which we can't change to a synchronous loop.

```go
close(startProducersChan)
err := producersErrGroup.Wait()
if err != nil {
	require.NoError(t, err)
```
Contributor

The pattern of checking for a non-nil error before doing require.NoError is only really needed in benchmarks due to the overhead of require.NoError. It's fine to just call the require method on the result of the wait group in tests.
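i.e. the nil check and the assertion collapse into a single call; a minimal sketch reusing the variables from the quoted snippet:

```go
close(startProducersChan)
// require.NoError fails the test immediately if the errgroup returned an error
require.NoError(t, producersErrGroup.Wait())
```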

Member Author

fixed 👍

```go
// variance is also a function of the number of consumers and the consumer delay chosen.
// variance can be tighter if the test runs longer but there is a tradeoff for testing and CI speed
delta := expected * 0.10
require.InDelta(t, expected, actual, delta)
```
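(For context on the tolerance: require.InDelta asserts that expected and actual are within delta of each other, so delta := expected * 0.10 allows the measured total to deviate from the expectation by up to 10%.)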
Contributor

A test that depends on timing makes me nervous... BUT I haven't been able to get it to flake locally in a few hundred runs 👍

Member Author

there should be plenty of padding in the parameters here, which are the consumer delay time, the number of consumers, and the allowed delta.

Changing only the number of consumers, for example, I don't see it start flaking until I push it up to 5 consumers. I've run 10,000 iterations with 4 consumers with no flakes. The code here only uses 1 consumer, which has much more room for error.

Contributor

@56quarters 56quarters left a comment

LGTM, thanks!

@francoposa francoposa merged commit 295b208 into main Jan 24, 2024
28 checks passed
@francoposa francoposa deleted the francoposa/query-scheduler-benchmark-multidimensional-queuing-effectiveness branch January 24, 2024 17:27