
[TSDB] Use index sampling to determine shard factor #6396

Merged
@owen-d merged 17 commits into grafana:main on Jun 16, 2022

Conversation

@owen-d (Member) commented Jun 14, 2022

This adds some rudimentary request-based sharding logic that chooses a shard factor based on the number of bytes the query touches and the tenant's max parallelism.

ref #5428

@owen-d requested a review from a team as a code owner on June 14, 2022 at 21:42
@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
-               loki	-0.6%

@cyriltovena (Contributor) left a comment

LGTM

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-0.1%
-               loki	-0.6%

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-0.1%
+               loki	0%

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-0.2%
+               loki	0%

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-0.2%
+               loki	0.6%

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-0.2%
+               loki	0%

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
-        distributor	-0.3%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-0.2%
+               loki	0%

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-0.2%
+               loki	0%

Comment on lines 78 to 84
lookback := grps[i].Interval
if lookback == 0 {
	lookback = r.defaultLookback
}
diff := lookback + grps[i].Offset
adjustedFrom := r.from.Add(-diff)
adjustedThrough := r.through.Add(-diff)
Contributor:

Excuse my ignorance when it comes to the AST code and structs: here you're just aligning the query's time range to be a nice multiple of the interval, right? Wouldn't we want the time range for the last group to extend past r.through so it covers the full requested range?

owen-d (Member Author):

Good catch - I shouldn't be subtracting the defaultLookback from the through.
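
For illustration, a minimal sketch of the fix under discussion (the helper name and signature are hypothetical, mirroring the snippet above): the lookback should only widen the start of the range, while the offset shifts both ends.

import "time"

// adjustedRange is a hypothetical helper showing the corrected math:
// lookback extends only the start of the range backwards, so through
// is no longer pulled earlier than requested; offset shifts both ends.
func adjustedRange(from, through time.Time, interval, offset, defaultLookback time.Duration) (time.Time, time.Time) {
	lookback := interval
	if lookback == 0 {
		lookback = defaultLookback
	}
	return from.Add(-(lookback + offset)), through.Add(-offset)
}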

import (
	"context"
	"fmt"
	math "math"
Contributor:

nit: did you mean to include the name prefix here?

owen-d (Member Author):

Thanks (I did not)

Comment on lines +133 to +140
const (
	// Just some observed values to get us started on better query planning.
	p90BytesPerSecond = 300 << 20 // 300MB/s/core
	// At max, schedule a query for 10s of execution before
	// splitting it into more requests. This is a lot of guesswork.
	maxSeconds          = 10
	maxSchedulableBytes = maxSeconds * p90BytesPerSecond
)
Contributor:

These numbers are the throughput we're trying to target?

owen-d (Member Author):

Yeah, it's just a rough first pass based on some metrics from clusters we run.
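
For concreteness, a quick sanity check on the constants quoted above (not part of the PR, just arithmetic): the per-shard budget works out to roughly 3GB.

package main

import "fmt"

func main() {
	const (
		p90BytesPerSecond   = 300 << 20 // 300MiB/s/core = 314,572,800 B/s
		maxSeconds          = 10
		maxSchedulableBytes = maxSeconds * p90BytesPerSecond
	)
	fmt.Println(maxSchedulableBytes) // 3145728000 bytes, i.e. ~3GB scheduled per sub-query
}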

Comment on lines +142 to +153
func guessShardFactor(stats stats.Stats, maxParallelism int) int {
	expectedSeconds := float64(stats.Bytes / p90BytesPerSecond)
	if expectedSeconds <= float64(maxParallelism) {
		power := math.Ceil(math.Log2(expectedSeconds)) // round up to nearest power of 2
		// Ideally, parallelize down to 1s queries
		return int(math.Pow(2, power))
	}

	n := stats.Bytes / maxSchedulableBytes
	power := math.Ceil(math.Log2(float64(n)))
	return int(math.Pow(2, power))
}
Contributor:

And here you're just calculating the sharding factor to get some data on what we might want to configure that value to?

owen-d (Member Author):

Yes, we're trying to guess an optimal shard factor based on the amount of data we see in the index.
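
To make the behavior concrete, here is a self-contained sketch that exercises the logic above with made-up inputs. The constants are the ones quoted from the PR; stats.Stats is reduced to a plain byte count for brevity.

package main

import (
	"fmt"
	"math"
)

const (
	p90BytesPerSecond   = 300 << 20 // constants quoted from the PR
	maxSeconds          = 10
	maxSchedulableBytes = maxSeconds * p90BytesPerSecond
)

// guessShardFactor mirrors the PR's logic, taking a raw byte count
// instead of a stats.Stats struct.
func guessShardFactor(bytes uint64, maxParallelism int) int {
	// Integer division, as in the PR, so this truncates.
	expectedSeconds := float64(bytes / p90BytesPerSecond)
	if expectedSeconds <= float64(maxParallelism) {
		// Round up to the nearest power of two.
		power := math.Ceil(math.Log2(expectedSeconds))
		return int(math.Pow(2, power))
	}
	n := bytes / maxSchedulableBytes
	power := math.Ceil(math.Log2(float64(n)))
	return int(math.Pow(2, power))
}

func main() {
	// 1500MiB at ~300MiB/s is ~5s of work: rounded up to the next
	// power of two, the query fans out 8 ways.
	fmt.Println(guessShardFactor(1500<<20, 32)) // 8
	// 100GiB blows past the parallelism budget: ~34 sub-queries at
	// ~3GB each, rounded up to 64.
	fmt.Println(guessShardFactor(100<<30, 32)) // 64
}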

@jeschkies (Contributor) left a comment

This is pretty exciting! Do we have some data on the improvement in throughput or the reduction in shards?

// We may increase parallelism above the default,
// ensure we don't end up bottlenecking here.
if user, err := tenant.TenantID(ctx); err == nil {
	if x := h.limits.MaxQueryParallelism(user); x > 0 {
Contributor:

Where was this enforced before?

	defaultLookback time.Duration
}

// from, through, max concurrency to run
Contributor:

I'm not sure I understand this comment. What do you mean?

owen-d (Member Author):

I think I left this in from when I was hacking around 😆. Will remove.

Comment on lines +63 to +66
// We try to shard subtrees in the AST independently if possible, although
// nested binary expressions can make this difficult. In this case,
// we query the index stats for all matcher groups then sum the results.
grps := syntax.MatcherGroups(e)
Contributor:

I may be jumping the gun here, but what do you think about introducing a query plan generated from the AST in the future? The plan could already express whether expressions are shardable.

owen-d (Member Author):

I'm sure there will be a big query planning rewrite in Loki's future, but let's leave that for a separate task :)

Contributor:

Of course. I just wanted to verify that this is on our roadmap 🙂
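
As a rough sketch of the "query the index stats for all matcher groups, then sum" step described in the snippet above (types are assumed and reduced to a byte count for brevity):

// Stats is a stand-in for the index stats response; the real type
// carries more than a byte count.
type Stats struct {
	Bytes uint64
}

// sumGroupStats combines per-matcher-group estimates into a single
// total, which is what the shard planner consumes.
func sumGroupStats(groups []Stats) Stats {
	var total Stats
	for _, s := range groups {
		total.Bytes += s.Bytes
	}
	return total
}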

adjustedThrough := r.through.Add(-diff)

start := time.Now()
resp, err := r.handler.Do(r.ctx, &indexgatewaypb.IndexStatsRequest{
Contributor:

Would it make sense to cache the stats?

owen-d (Member Author):

Yeah, but I don't think we'll do it in this PR 😅

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
-        distributor	-0.3%
+            querier	0%
+ querier/queryrange	0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-0.2%
+               loki	0%

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-0.1%
+               loki	0.6%

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-0.1%
+               loki	0%

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
-        distributor	-0.3%
+            querier	0%
+ querier/queryrange	0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-0.1%
+               loki	0%

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
-              logql	-0.1%
+               loki	0%

@owen-d merged commit d6f50ca into grafana:main on Jun 16, 2022