query scheduler test for multidimensional queueing effectiveness scenarios #7162

Conversation

Member

@francoposa francoposa commented Jan 18, 2024

Creates a contrived test to emulate the load scenario that the query scheduler queue splitting was created to address.

The test's docstring is fairly extensive; see it for more details.

Also includes a minor refactor of some test util funcs that were originally written only for BenchmarkConcurrentQueueOperations but are now shared between the two tests.

Open to suggestions about what other scenarios we could represent.

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@francoposa francoposa requested a review from a team as a code owner January 18, 2024 17:30
Contributor

Is this change required?

Member Author

not technically required of course, but there are only two initSomethingLoop funcs in this test suite, those two loops talk to each other, and previously the two funcs were 400 lines apart, so I found this helpful

pkg/scheduler/queue/queue_test.go (outdated; resolved)
Comment on lines 127 to 131
```go
// production code uses a histogram for queue duration, but the counter is a more direct metric for test cases
queueDuration := promauto.With(promRegistry).NewCounterVec(prometheus.CounterOpts{
	Name: "cortex_query_scheduler_queue_duration_total_seconds",
	Help: "Total time spent by requests in queue before getting picked up by a querier.",
}, []string{"additional_queue_dimensions"})
```
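For illustration, a hedged sketch of how observations would presumably be recorded against this counter once a consumer dequeues a request; the variable names here (enqueueTime, additionalQueueDimension) are assumptions for the example, not the PR's exact code:

```go
// hypothetical observation site: accumulate each request's time spent in queue,
// labeled by its additional queue dimension (variable names assumed for illustration)
queueDuration.WithLabelValues(additionalQueueDimension).Add(time.Since(enqueueTime).Seconds())
```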
Contributor

Do we need a Prometheus metric at all here? I'm wondering if we could just have two atomic floats, one for normal requests and the other for slow requests, and accumulate the total duration in those
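For comparison, a minimal sketch of the atomic-floats alternative being suggested, assuming go.uber.org/atomic for the float type; the struct and method names are hypothetical, not code from this PR:

```go
import "go.uber.org/atomic"

// hypothetical accumulator for the two request classes, in place of the CounterVec
type queueDurationAccumulator struct {
	normalSeconds *atomic.Float64 // total time normal requests spent in queue
	slowSeconds   *atomic.Float64 // total time slow requests spent in queue
}

func newQueueDurationAccumulator() *queueDurationAccumulator {
	return &queueDurationAccumulator{
		normalSeconds: atomic.NewFloat64(0),
		slowSeconds:   atomic.NewFloat64(0),
	}
}

// observe adds a single request's queue duration to the appropriate total
func (a *queueDurationAccumulator) observe(slow bool, seconds float64) {
	if slow {
		a.slowSeconds.Add(seconds)
		return
	}
	a.normalSeconds.Add(seconds)
}
```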

Member Author

sure, but isn't the prometheus metric just a more convenient interface for multiple atomic floats? It also allows us to reuse the same code for recording the observations as we use in the prod scheduler code.

Contributor

I don't have strong preference on Prometheus metric vs atomic floats but I'd rather see this named something different if you decide to keep the Prometheus metric. Creating a metric named like this just for a test is distracting to me. It makes me pause and double check that this metric is only for the test here.

Member Author

fixed metric name and improved comment above it 👍

Comment on lines +214 to +215
```go
expected := queueDurationTotals[false][normalQueueDimension] / 2
actual := queueDurationTotals[true][normalQueueDimension]
```
Contributor

Given all we need here is the duration of normal requests and not the slow requests, could we simplify queueDurationTotals to only store the normal request durations?

Member Author

we could, but even though we are not asserting on it at the end, it is still a metric that is helpful to look at when building and understanding the behavior of the test case

Comment on lines 151 to 167
```go
startProducersChan := make(chan struct{})
producersErrGroup, _ := errgroup.WithContext(ctx)

runProducer := runQueueProducerIters(
	queue, maxQueriersPerTenant, totalRequests/2, numProducers, numTenants, startProducersChan, queueDimensionFunc,
)
for producerIdx := 0; producerIdx < numProducers; producerIdx++ {
	producerIdx := producerIdx
	producersErrGroup.Go(func() error {
		return runProducer(producerIdx)
	})
}
close(startProducersChan)
err := producersErrGroup.Wait()
if err != nil {
	require.NoError(t, err)
}
```
Contributor

Given we want to fill the queue ahead of time, and don't need to do anything like waiting for some requests to be consumed, could we just directly add requests to the queue here with queueProduce? This would remove the need for the bookkeeping introduced by the channels and WaitGroup etc.
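Roughly, the synchronous alternative would be a plain loop; queueProduce's actual signature isn't shown in this thread, so the call below is a hypothetical sketch:

```go
// hypothetical sketch of filling the queue synchronously before starting consumers;
// the queueProduce parameters here are assumed, not the repo's actual helper signature
for i := 0; i < totalRequests/2; i++ {
	err := queueProduce(queue, maxQueriersPerTenant, i%numTenants, queueDimensionFunc)
	require.NoError(t, err)
}
```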

Member Author

I'll check - I originally did this because I was adding >1000 requests, and the time it took to add them all to the queue with just a loop was long enough that it skewed the timing results at the end.

I am also not totally convinced that waitGroup-ing a bunch of goroutines is something we should consider a lot of overhead/bookkeeping - particularly because this mirrors the other patterns in the test cases, which we can't change to a synchronous loop.

```go
close(startProducersChan)
err := producersErrGroup.Wait()
if err != nil {
	require.NoError(t, err)
```
Contributor

The pattern of checking for a non-nil error before doing require.NoError is only really needed in benchmarks due to the overhead of require.NoError. It's fine to just call the require method on the result of the wait group in tests.
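i.e. the nil check and the assertion collapse into a single call; a minimal sketch reusing the variables from the quoted snippet:

```go
close(startProducersChan)
// require.NoError fails the test immediately if the errgroup returned an error
require.NoError(t, producersErrGroup.Wait())
```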

Member Author

fixed 👍

```go
// variance is also a function of the number of consumers and the consumer delay chosen.
// variance can be tighter if the test runs longer but there is a tradeoff for testing and CI speed
delta := expected * 0.10
require.InDelta(t, expected, actual, delta)
```
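(For context on the tolerance: require.InDelta asserts that expected and actual are within delta of each other, so delta := expected * 0.10 allows the measured total to deviate from the expectation by up to 10%.)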
Contributor

A test that depends on timing makes me nervous... BUT I haven't been able to get it to flake locally in a few hundred runs 👍

Member Author

there should be plenty of padding in the parameters here, which are the consumer delay time, the number of consumers, and the allowed delta.

Changing only the number of consumers, for example, I don't see it start flaking until I push it up to 5 consumers. I've run 10,000 iterations with 4 consumers with no flakes. The code here only uses 1 consumer, which has much more room for error.

Contributor

@56quarters 56quarters left a comment

LGTM, thanks!

@francoposa francoposa merged commit 295b208 into main Jan 24, 2024
28 checks passed
@francoposa francoposa deleted the francoposa/query-scheduler-benchmark-multidimensional-queuing-effectiveness branch January 24, 2024 17:27