Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perform initial owned series calculation before starting lifecycler #7087

Merged
merged 6 commits into from
Jan 16, 2024

Conversation

pr00se
Copy link
Contributor

@pr00se pr00se commented Jan 10, 2024

What this PR does

When testing the owned series service we noticed an increase in read errors during ingester startup, with the message "ingester is unavailable (current state: Starting)". This was because the owned series service was explicitly waiting for the ring state to become ACTIVE before moving to service state Running, which in turn delayed the overall ingester service moving to Running as well. Any requests that arrived between the ring state becoming ACTIVE and the ingester service becoming Running were rejected with the above message.

Ideally there should be no work done between those two state change events. Unfortunately, we still use the old Lifecycler in the ingester, so our control over the move to the ACTIVE state is very limited without a significant rework; once the lifecycler is started it will move to the ACTIVE state as soon as it can.

In order to minimize the delay between the lifecycler start and the ingester service switching to Running, this PR moves the initial owned series calculation to after WAL replay but before lifecycler start. This is possible because the owned series calculation isn't dependent on the state of the ingesters in the ring, just the token distribution.

Because we are no longer waiting for the ring lifecycler at all, it is possible for the ring to not contain the ingester, or even be completely empty, when we try to compute the owned series for the first time. I've tested the following ingester start-up scenarios:

1. Ingester has TSDB users and exists in the ring

  • this is the standard case of the ingester rebooting and re-joining the ring by reclaiming its existing entry
  • after WAL replay it will calculate the correct owned series count because the token ranges are unchanged

2. Ingester has TSDB users but doesn't exist in the ring

  • this shouldn't happen under normal operation, but is possible if the ingester was removed from the ring and is restarted with a volume that contains WALs for replay
  • after WAL replay the ingester will attempt to calculate the owned series, but not find itself in the ring and set its token ranges to nil, and the owned series count to 0
  • the owned series count will be incremented as writes come in, and correctly calculated on the next iteration of the owned series service loop, after the lifecycler has added the instance back to the ring

3. Ingester has no TSDB users and doesn't exist in the ring

  • this is the standard case of a brand new ingester being added to the ring
  • after WAL replay it will attempt to calculate the owned series, but the loop inside of updateAllTenants will short circuit because the length is 0
  • owned series will be correctly incremented as new writes come in, and the correct token ranges cached on a future owned series service loop iteration, after the lifecycler adds the instance to the ring

4. Ingester has TSDB users but the ring is empty

  • this can happen if the ring state is completely lost and needs to be rebuilt from scratch
  • the initial owned series calculation will be skipped entirely, with the owned series count for all TSDBs set to 0
  • the count will be incremented as new series come in, and correctly calculated on a subsequent owned series service loop iteration

5. Ingester has no users, and the ring is empty

  • this happens when creating the ingester ring for the very first time
  • the initial owned series calculation will be skipped entirely, and correctly calculated on subsequent service loop iterations

This PR also changes the two limit-related end-to-end tests to run both with and without the owned series service enabled. The changes to the tests are best viewed with whitespace changes ignored.

Checklist

  • Tests updated.
  • [n/a] Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • [n/a] about-versioning.md updated with experimental features.

@pr00se pr00se requested a review from a team as a code owner January 10, 2024 05:41
@pr00se pr00se marked this pull request as draft January 10, 2024 05:41
@pr00se pr00se force-pushed the start-owned-series-service-earlier branch 2 times, most recently from c902604 to 1346932 Compare January 10, 2024 08:18
oss := i.ownedSeriesService

// Fetch and cache current ring state
_, err := oss.checkRingForChanges()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not do this in starting method of owned series service?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue we're trying to prevent making worse is the ingester becoming ACTIVE in the ring before the ingester service is Running. This issue exists even without the owned series service, but the potentially lengthy calculation of owned series being done after the lifecycler is started (which starts the clock on the ingester becoming ACTIVE) but before the service switches to Running makes it more likely that we start accepting requests and rejecting them because the ingester service is still Starting.

Doing the owned series calculation before starting the lifecycler ensures that the calculation doesn't increase the delay between those two state changes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand what you're asying @pr00se but, from a design perspective, this logic should still be in oss.starting(). What you can simply do here, is to start oss and wait until running, while moving this code block back to oss.starting().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pracucci I think it's a bit more complicated than that. Because the services.Manager requires all subservices to be in New state, starting oss here would mean we can no longer use the manager to watch and handle its state. Since all we really needed was to run an iteration of the oss loop before starting it, I opted to pull the starting logic out rather than change the whole service management aspect, though I agree that it's not the cleanest design.

If you feel strongly I can move this back to starting and handle the state of the oss outside of the manager. I assume I can base the code off of how we handle the Lifecycler already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was done in #7583

@pr00se pr00se force-pushed the start-owned-series-service-earlier branch from 1346932 to e4061a1 Compare January 12, 2024 21:28
@pr00se pr00se requested a review from pracucci January 12, 2024 21:32
@pr00se pr00se marked this pull request as ready for review January 12, 2024 21:32
Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@pr00se pr00se merged commit 81af14a into main Jan 16, 2024
28 checks passed
@pr00se pr00se deleted the start-owned-series-service-earlier branch January 16, 2024 15:14
grafanabot pushed a commit that referenced this pull request Jan 16, 2024
…7087)

* Perform initial owned series calculation before starting lifecycler

* Enable owned series tracking in integration tests

* Enable limiting based on owned series in e2e tests

* Run ingester limits e2e tests with owned series enabled and disabled

* Clean up

* Update CHANGELOG

(cherry picked from commit 81af14a)
@grafanabot
Copy link
Contributor

The backport to r273 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-7087-to-r273 origin/r273
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x 81af14aa19e532913730acfb9c3bd367a33502ff
# Push it to GitHub
git push --set-upstream origin backport-7087-to-r273
git switch main
# Remove the local backport branch
git branch -D backport-7087-to-r273

Then, create a pull request where the base branch is r273 and the compare/head branch is backport-7087-to-r273.

pr00se added a commit that referenced this pull request Jan 16, 2024
…7087) (#7140)

* Perform initial owned series calculation before starting lifecycler

* Enable owned series tracking in integration tests

* Enable limiting based on owned series in e2e tests

* Run ingester limits e2e tests with owned series enabled and disabled

* Clean up

* Update CHANGELOG

(cherry picked from commit 81af14a)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants