Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set shuffle sharding ingester lookback automatically #2110

Merged
merged 3 commits into from
Jun 20, 2022

Conversation

56quarters
Copy link
Contributor

What this PR does

This change deprecates the
querier.shuffle-sharding-ingesters-lookback-period option, instead
setting the lookback period from the value of the
querier.query-ingesters-within option. Outside of our own integration
tests, there's no situation where it's useful to set the lookback period
to a different value than the query-ingesters-within option.

This also adds a new option querier.shuffle-sharding-ingesters-enabled
that can be used to enable or disable shuffle sharding on ingesters on
the read path instead of using the lookback period for this.

Which issue(s) this PR fixes or relates to

Fixes #1810

Signed-off-by: Nick Pillitteri nick.pillitteri@grafana.com

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

This change deprecates the
`querier.shuffle-sharding-ingesters-lookback-period` option, instead
setting the lookback period from the value of the
`querier.query-ingesters-within` option. Outside of our own integration
tests, there's no situation where it's useful to set the lookback period
to a different value than the `query-ingesters-within` option.

This also adds a new option `querier.shuffle-sharding-ingesters-enabled`
that can be used to enable or disable shuffle sharding on ingesters on
the read path instead of using the lookback period for this.

Fixes #1810

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@56quarters 56quarters added type/docs Improvements or additions to documentation ease-of-use/config-cleanup labels Jun 15, 2022
@56quarters 56quarters marked this pull request as ready for review June 15, 2022 15:56
@stevesg
Copy link
Contributor

stevesg commented Jun 16, 2022

Unfortunately this flag is used when disabling ingester read path shuffle sharding, which is necessary during the migration from single-zone ingesters to multi-zone ingesters.

Edit: I didn't see you added an option to disable this - then there should be nothing to worry about.

@pracucci pracucci self-requested a review June 16, 2022 08:14
Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! Logic changes LGTM. I left few comments here and there that I would like you to consider before merging. I also left a possible idea on the integration test.

CHANGELOG.md Outdated Show resolved Hide resolved
cfg.StoreGatewayClient.RegisterFlagsWithPrefix("querier.store-gateway-client", f)
f.BoolVar(&cfg.Iterators, "querier.iterators", false, "Use iterators to execute query, as opposed to fully materialising the series in memory.")
f.BoolVar(&cfg.BatchIterators, "querier.batch-iterators", true, "Use batch iterators to execute query, as opposed to fully materialising the series in memory. Takes precedent over the -querier.iterators flag.")
f.DurationVar(&cfg.QueryIngestersWithin, queryIngestersWithinFlag, 13*time.Hour, "Maximum lookback beyond which queries are not sent to ingester. 0 means all queries are sent to ingester.")
f.DurationVar(&cfg.MaxQueryIntoFuture, "querier.max-query-into-future", 10*time.Minute, "Maximum duration into the future you can query. 0 to disable.")
f.DurationVar(&cfg.QueryStoreAfter, queryStoreAfterFlag, 12*time.Hour, "The time after which a metric should be queried from storage and not just ingesters. 0 means all queries are sent to store. If this option is enabled, the time range of the query sent to the store-gateway will be manipulated to ensure the query end is not more recent than 'now - query-store-after'.")
f.DurationVar(&cfg.ShuffleShardingIngestersLookbackPeriod, shuffleShardingIngestersLookbackPeriodFlag, 13*time.Hour, "When this setting is > 0, queriers fetch in-memory series from the minimum set of required ingesters, selecting only ingesters which may have received series since 'now - lookback period'. The lookback period should be greater or equal than the configured -querier.query-store-after and -querier.query-ingesters-within. If this setting is 0, queriers always query all ingesters (ingesters shuffle sharding on read path is disabled).")
flagext.DeprecatedFlag(f, shuffleShardingIngestersLookbackPeriodFlag, fmt.Sprintf("Deprecated: this setting should always be the same as %s and will now behave as if it is", queryIngestersWithinFlag), logger)
f.BoolVar(&cfg.ShuffleShardingIngestersEnabled, "querier.shuffle-sharding-ingesters-enabled", true, fmt.Sprintf("Fetch in-memory series from the minimum set of required ingesters, selecting only ingesters which may have received series since %s. If this setting is false or %s is '0', queriers always query all ingesters (ingesters shuffle sharding on read path is disabled).", queryIngestersWithinFlag, queryIngestersWithinFlag))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend to prefix flags with - to make it more clear they're CLI flags.

Suggested change
f.BoolVar(&cfg.ShuffleShardingIngestersEnabled, "querier.shuffle-sharding-ingesters-enabled", true, fmt.Sprintf("Fetch in-memory series from the minimum set of required ingesters, selecting only ingesters which may have received series since %s. If this setting is false or %s is '0', queriers always query all ingesters (ingesters shuffle sharding on read path is disabled).", queryIngestersWithinFlag, queryIngestersWithinFlag))
f.BoolVar(&cfg.ShuffleShardingIngestersEnabled, "querier.shuffle-sharding-ingesters-enabled", true, fmt.Sprintf("Fetch in-memory series from the minimum set of required ingesters, selecting only ingesters which may have received series since -%s. If this setting is false or -%s is '0', queriers always query all ingesters (ingesters shuffle sharding on read path is disabled).", queryIngestersWithinFlag, queryIngestersWithinFlag))

pkg/querier/querier.go Outdated Show resolved Hide resolved
pkg/querier/querier.go Outdated Show resolved Hide resolved
operations/mimir/shuffle-sharding.libsonnet Outdated Show resolved Hide resolved
// path). This is because the way which ingesters to query is calculated depends on
// when they were registered compared to the "query ingesters within" time. They'll
// never be registered long enough as part of this test to be queried for the series.
// Instead, all ingesters are queried.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is not very good. We're not really testing shuffle sharding anymore here. What if we set a query-ingesters-within to a low value (e.g. 3s), then we start ingesters, we wait 3s, we push samples and query back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! I tried out using a small value (30s) and a time.Sleep() locally. I was able to get the tests passing once but they seemed a little flaky. I'll look into this more and see if I can get them working reliably.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure to set very very low ring heartbeat period, otherwise the registration timestamp is not updated frequently enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a lookback and delay of 5s. I'm hoping this is enough for these tests to work reliably (haven't seen any failures so far) 🤞

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@56quarters 56quarters force-pushed the 56quarters/remove-shuffle-sharding-lookback branch from 10dd18b to aaa8cbf Compare June 16, 2022 19:39
@@ -2022,16 +2022,15 @@ func TestDistributor_MetricsMetadata(t *testing.T) {
_, err := ds[0].Push(ctx, req)
require.NoError(t, err)

// Check how many ingesters are queried as part of the shuffle sharding subring.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to add unit tests explicitly for GetIngestersForMetadata and GetIngestersForQuery but this test is already doing a similar verification. Swapped it to use this method instead of counting mock calls because we don't have to test both expectedIngesters and expectedIngesters - 1

Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

integration/ingester_sharding_test.go Outdated Show resolved Hide resolved
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@56quarters 56quarters merged commit b13d2df into main Jun 20, 2022
@56quarters 56quarters deleted the 56quarters/remove-shuffle-sharding-lookback branch June 20, 2022 13:04
masonmei pushed a commit to udmire/mimir that referenced this pull request Jul 11, 2022
This change deprecates the
`querier.shuffle-sharding-ingesters-lookback-period` option, instead
setting the lookback period from the value of the
`querier.query-ingesters-within` option. Outside of our own integration
tests, there's no situation where it's useful to set the lookback period
to a different value than the `query-ingesters-within` option.

This also adds a new option `querier.shuffle-sharding-ingesters-enabled`
that can be used to enable or disable shuffle sharding on ingesters on
the read path instead of using the lookback period for this.

Fixes grafana#1810

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/docs Improvements or additions to documentation
Projects
None yet
3 participants