Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use BasicLifecycler for distributors and auto-forget #2154

Merged
merged 4 commits into from
Jun 24, 2022

Conversation

56quarters
Copy link
Contributor

@56quarters 56quarters commented Jun 21, 2022

What this PR does

Use the BasicLifecycler in distributors for managing their lifecycle so
that we can take advantage of the "auto-forget" delegates feature. This
prevents the ring from filling up with "unhealthy" distributors that are
never removed. This wasn't a bug but it was confusing for users and
operators.

Which issue(s) this PR fixes or relates to

Fixes #2138

Signed-off-by: Nick Pillitteri nick.pillitteri@grafana.com

Checklist

  • Tests updated
  • [na] Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@56quarters 56quarters force-pushed the 56quarters/distributor-lifecycle branch 2 times, most recently from e461b3e to 2a5c10c Compare June 21, 2022 14:13
@56quarters 56quarters marked this pull request as ready for review June 21, 2022 14:13
@56quarters 56quarters force-pushed the 56quarters/distributor-lifecycle branch 3 times, most recently from 100669f to a096530 Compare June 22, 2022 15:56
@56quarters 56quarters requested a review from colega June 22, 2022 16:00
@56quarters 56quarters marked this pull request as draft June 22, 2022 21:32
@56quarters 56quarters force-pushed the 56quarters/distributor-lifecycle branch from c05e273 to 83c065d Compare June 23, 2022 00:03
@colega
Copy link
Contributor

colega commented Jun 23, 2022

LGTM, left some nitpicks, waiting to be un-drafted.

@krajorama
Copy link
Contributor

Hi, let me know if there's anything useful here: #684 , I'd like to close that old PR of mine - I never had time to finish it, plus lifecycler had some feature that I didn't know how to emulate.

pkg/distributor/distributor.go Outdated Show resolved Hide resolved
pkg/distributor/distributor.go Outdated Show resolved Hide resolved
pkg/distributor/distributor.go Outdated Show resolved Hide resolved
pkg/distributor/instance_count.go Outdated Show resolved Hide resolved
pkg/distributor/distributor_ring.go Outdated Show resolved Hide resolved
@56quarters 56quarters marked this pull request as ready for review June 23, 2022 15:35
@56quarters
Copy link
Contributor Author

Hi, let me know if there's anything useful here: #684 , I'd like to close that old PR of mine - I never had time to finish it, plus lifecycler had some feature that I didn't know how to emulate.

Definitely some useful parts!

I see some similarity with needing a number of healthy ring members:

  • Your PR adds a healthy instances + zone count to the BasicLifecycler.
  • This PR uses a BasicLifecyclerDelegate implementation to count healthy instances.

I'm not sure what the "MinReadyDuration" feature would look like as a delegate.

Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Nice work.

pkg/distributor/instance_count_test.go Show resolved Hide resolved
Use the BasicLifecycler in distributors for managing their lifecycle so
that we can take advantage of the "auto-forget" delegates feature. This
prevents the ring from filling up with "unhealthy" distributors that are
never removed. This wasn't a bug but it was confusing for users and
operators.

Fixes #2138

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@56quarters 56quarters force-pushed the 56quarters/distributor-lifecycle branch from 315bc07 to ed3350f Compare June 24, 2022 15:15
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@56quarters 56quarters merged commit 7e4accd into main Jun 24, 2022
@56quarters 56quarters deleted the 56quarters/distributor-lifecycle branch June 24, 2022 17:08
rlex added a commit to rlex/mimir that referenced this pull request Jun 28, 2022
* main: (63 commits)
  Add new section on website for links to blog posts, podcasts and talks. (grafana#2216)
  Rename codified errors to errors catalog (grafana#2256)
  Helm: add a step to contributing doc (grafana#2257)
  Signal that 2.2 release is now in progress. (grafana#2254)
  Removed migration of alertmanager local state files from old hierarchy (Cortex 1.8 and earlier) (grafana#2253)
  operations/mimir: Change multi_zone_ingester_max_unavailable to 25 (grafana#2251)
  Helm: weekly release (grafana#2252)
  Jsonnet: Configure ingester max global metadata per user and per metric (grafana#2250)
  Helm: metamonitor naming (grafana#2236)
  Mimir documentation about out-of-order (grafana#2183)
  Vendor latest mimir-prometheus/main (grafana#2243)
  Set CODEOWNERS to primary technical writer (grafana#2242)
  Use BasicLifecycler for distributors and auto-forget (grafana#2154)
  Docs: Basic documentation for deploying the ruler using jsonnet. (grafana#2127)
  Fix post merge reviews on 2187 (grafana#2230)
  Add tests for user metadata in the ingester (grafana#2184)
  Change the error message template for per-tenant limits (grafana#2234)
  helm: meta-monitoring (grafana#2068)
  Article about migrating from Consul to memberlist. Added documentation for /memberlist endpoint. (grafana#2166)
  Update runbooks to mention possibility to investigate memberlist KV store in various alerts (grafana#2158)
  ...
masonmei pushed a commit to udmire/mimir that referenced this pull request Jul 11, 2022
Use the BasicLifecycler in distributors for managing their lifecycle so
that we can take advantage of the "auto-forget" delegates feature. This
prevents the ring from filling up with "unhealthy" distributors that are
never removed. This wasn't a bug but it was confusing for users and
operators.

Fixes grafana#2138

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

// ringAutoForgetUnhealthyPeriods is how many consecutive timeout periods an unhealthy instance
// in the ring will be automatically removed after.
ringAutoForgetUnhealthyPeriods = 10
Copy link
Collaborator

@pracucci pracucci Jul 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very high. I would be more aggressive with distributors, like 2 should be enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Migrate distributors hash ring to BasicLifecycler and add auto-forget feature
5 participants