Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limits and errors for ephemeral storage #4004

Merged
merged 17 commits into from
Jan 20, 2023
Merged

Conversation

pstibrany
Copy link
Member

@pstibrany pstibrany commented Jan 18, 2023

What this PR does

This PR

  • introduces instance-limit for max number of ephemeral series in ingester (-ingester.instance-limits.max-ephemeral-series)
  • introduces per-user series limit for ephemeral storage (-ingester.max-ephemeral-series-per-user)
  • modifies "discard" reasons for discarded samples when ingesting ephemeral series (all reasons have "ephemeral-" prefix)
  • introduces new error codes into error catalog:
    • err-mimir-max-ephemeral-series-per-user
    • err-mimir-ingester-max-ephemeral-series
    • err-mimir-ephemeral-sample-timestamp-too-old
    • err-mimir-ephemeral-sample-out-of-order
    • err-mimir-ephemeral-sample-duplicate-timestamp
    • err-mimir-ephemeral-storage-not-enabled-for-user

This PR does not update changelog, nor does it add documentation for newly introduced errors in the error catalog. This will be handled by subsequent PRs.

(If needed, I'm happy to split the PR into smaller PRs)

Which issue(s) this PR fixes or relates to

Part of #3884

Checklist

  • Tests updated
  • [na] Documentation added
  • [na] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pstibrany pstibrany requested a review from a team as a code owner January 18, 2023 18:41
@pstibrany pstibrany changed the title Limits for Limits for ephemeral storage Jan 18, 2023
@pstibrany pstibrany mentioned this pull request Jan 18, 2023
@pstibrany pstibrany changed the title Limits for ephemeral storage Limits and errors for ephemeral storage Jan 18, 2023
@pstibrany pstibrany requested a review from a team as a code owner January 18, 2023 19:01
cmd/mimir/help-all.txt.tmpl Outdated Show resolved Hide resolved
@replay
Copy link
Contributor

replay commented Jan 19, 2023

This is not really related to this PR, more to how the ephemeral TSDB is implemented:

I realized that we currently always create a persistent TSDB, even if a user would want to run an ingester with ephemeral-only data (for example in the ephemeral matchers they could specify __name__=~".+"). I wonder if there is a valid use-case where a user might want to have no persistent TSDBs at all, if so we currently don't support that.

@replay
Copy link
Contributor

replay commented Jan 19, 2023

I'm thinking about whether it makes sense to add new metrics like for example cortex_ingester_queried_ephemeral_series_bucket instead of just adding a store label to the existing metrics, then we'd have something like cortex_ingester_queried_series_bucket{store="ephemeral"}.

I think both ways can be considered a breaking change.

The current solution with the new metrics would be a breaking change if a user currently has an alert on failures to ingest samples, using the metric cortex_ingester_queried_series_bucket. If they now start using the ephemeral storage and the ingestion into the ephemeral storage fails their alert won't trigger, so it's broken if the alert doesn't get updated.

OTH if we'd add the store label to the existing metrics like cortex_ingester_queried_series_bucket, then there is a risk that we break other existing queries of users.

I think it's safe to assume that the majority of users won't use the ephemeral store, so probably the safer choice is to just not touch the existing metrics. That way only users who want to start using the ephemeral store, so the minority of users, will have to experience a breaking change which requires them to potentially update their alerts/dashboards.

Copy link
Contributor

@replay replay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! I had only very few comments on this huge PR.

I'll approve to unblock, but before merging I think at least the question about 0 to disable the limit should be addressed.

(btw the linter is failing)

Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job, LGTM! I left few minor comments. My main comment which I would love to see addressed is the runbooks doc, cause I think could be easily done with a copy-paste (still better than "TBD" 😉 ).

pkg/ingester/user_tsdb.go Outdated Show resolved Hide resolved
@@ -1361,6 +1361,10 @@ How to **fix** it:

- See [`MimirIngesterReachingSeriesLimit`](#MimirIngesterReachingSeriesLimit) runbook.

### err-mimir-ingester-max-ephemeral-series

This error refers to experimental ephemeral storage in ingesters. Documentation TBD.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of "Documentation TBD", what if you copy-paste the content of the persistent counterpart runbook and just replace the config names?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done here: 083b57f

pkg/ingester/ingester.go Outdated Show resolved Hide resolved
updateFirstPartial(func() error { return makeLimitError(perUserSeriesLimit, i.limiter.FormatError(userID, cause)) })
updateFirstPartial(func() error {
if ephemeral {
return makeLimitError(ephemeralDiscardPrefix+perUserSeriesLimit, i.limiter.FormatError(userID, cause))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] What if we define a const ephemeralPerUserSeriesLimit instead of ephemeralDiscardPrefix+perUserSeriesLimit? I've the feeling may be easier in the future to grep for it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is what you had in mind: 8a6cda5

pkg/ingester/ingester.go Show resolved Hide resolved
pkg/ingester/ingester.go Outdated Show resolved Hide resolved
Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@pstibrany
Copy link
Member Author

Great job, LGTM! I left few minor comments. My main comment which I would love to see addressed is the runbooks doc, cause I think could be easily done with a copy-paste (still better than "TBD" 😉 ).

Based on your feedback I've updated runbooks, but they are not always just copy&pasted versions. I was hoping to move that work to different PR to avoid lengthy docs review loop here (my english language skills are generally worse than Go skills, although some people may disagree). But let's give this a shot.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
@pstibrany pstibrany merged commit 0e281ac into main Jan 20, 2023
@pstibrany pstibrany deleted the ephemeral-storage-limits branch January 20, 2023 09:35
@pstibrany pstibrany mentioned this pull request Jan 20, 2023
1 task
fayzal-g added a commit that referenced this pull request Jan 20, 2023
* Update test

* Add missing changelog entries for commits since Mimir 2.5 (#4006)

All other commits weren't user-facing or were helm-chart specific.

See #3979

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Add --concurrency support to 'mimirtool rules sync' command (#3996)

* Add --concurrency support to 'mimirtool rules sync' command

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Update pkg/mimirtool/commands/rules.go

Co-authored-by: Patrick Oyarzun <patrick.oyarzun@grafana.com>

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Co-authored-by: Patrick Oyarzun <patrick.oyarzun@grafana.com>

* store-gateway: ExpandedPostings shortcut: avoid LabelValues unless necessary (#3872)

* Return `Canceled` rather than `Aborted` when a `Series` request to a store-gateway is cancelled by the calling querier. (#4007)

* Return Canceled rather than Aborted when a Series request to a store-gateway is cancelled.

* Add changelog entry.

* Update mimir-prometheus, add support for align_evaluation_time_on_interval. (#4013)

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>

* Fix title of guide in link text; reword phrase. (#4008)

* Fix ExampleInitLogger to work in UTC (#4016)

The test didn't pass in my time zone (tm).

--- FAIL: ExampleInitLogger (0.00s)
got:
ts=1970-01-01T01:00:00+01:00 caller=log_test.go:31 level=info test=1
ts=1970-01-01T01:00:00+01:00 caller=log_test.go:33 level=info msg="test 3"
want:
ts=1970-01-01T00:00:00Z caller=log_test.go:31 level=info test=1
ts=1970-01-01T00:00:00Z caller=log_test.go:33 level=info msg="test 3"

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

* Create outline of Mimir 2.6 release notes (#4002)

Includes notable features and bugfixes based on the CHANGELOG. Helm changes
to be filled out later by product and engineering.

See #3979

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Fix post-merge comments from PR #4013. (#4014)

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>

* Update CODEOWNERS to include mimir-ruler-and-alertmanager-maintainers (#4019)

For those who only want notifications re the ruler or Alertmanager.

* Remove internal use of store.max-query-length (#4017)

Make deprecation of the option more obvious and attempt to remove
any use of store.max-query-length in our documentation, jsonnet, helm,
and integration tests.

See #2793
See #3825

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* [otlp] Update OTel Collector to latest release (#3852)

* [otlp] Update otel collector dependecy to latest
* Update code to deal with deprecated functions

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* [otlp] Docs: Highlight common issues with OTLP --> Prometheus (#3629)

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
Co-authored-by: Ursula Kallio <ursula.kallio@grafana.com>

* make it possible to inject memberlist kv codecs (#4018)

* make it possible to inject memberlist kv codecs

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* add comment

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* improve comment wording

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* Limits and errors for ephemeral storage (#4004)

* Add limits for ephemeral storage.
* Add new reason when ingestion of ephemeral metrics fails.
* Add tests for max ephemeral series limit.
* Introduce new discard reasons when ingesting ephemeral series.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>

* Reduce maintainership and step down as team member. (#4023)

* Reduce maintainership and step down as team member.

My future priorities will be on the alerting aspects of Mimir, so I think it is
right to reduce my maintainership accordingly and allow others to take my place.
Similarly, remove myself as a team member.

* Sort previous team members.

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
Co-authored-by: Nick Pillitteri <56quarters@users.noreply.github.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Co-authored-by: Patrick Oyarzun <patrick.oyarzun@grafana.com>
Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
Co-authored-by: Charles Korn <charleskorn@users.noreply.github.com>
Co-authored-by: Peter Štibraný <pstibrany@gmail.com>
Co-authored-by: Ursula Kallio <ursula.kallio@grafana.com>
Co-authored-by: Oleg Zaytsev <mail@olegzaytsev.com>
Co-authored-by: gotjosh <josue.abreu@gmail.com>
Co-authored-by: Goutham Veeramachaneni <gouthamve@gmail.com>
Co-authored-by: Mauro Stettler <mauro.stettler@gmail.com>
Co-authored-by: Steve Simpson <steve.simpson@grafana.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants