-
Notifications
You must be signed in to change notification settings - Fork 460
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Perform initial owned series calculation before starting lifecycler #7087
Conversation
c902604
to
1346932
Compare
oss := i.ownedSeriesService | ||
|
||
// Fetch and cache current ring state | ||
_, err := oss.checkRingForChanges() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not do this in starting method of owned series service?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue we're trying to prevent making worse is the ingester becoming ACTIVE
in the ring before the ingester service is Running
. This issue exists even without the owned series service, but the potentially lengthy calculation of owned series being done after the lifecycler is started (which starts the clock on the ingester becoming ACTIVE) but before the service switches to Running
makes it more likely that we start accepting requests and rejecting them because the ingester service is still Starting
.
Doing the owned series calculation before starting the lifecycler ensures that the calculation doesn't increase the delay between those two state changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand what you're asying @pr00se but, from a design perspective, this logic should still be in oss.starting()
. What you can simply do here, is to start oss
and wait until running, while moving this code block back to oss.starting()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pracucci I think it's a bit more complicated than that. Because the services.Manager
requires all subservices to be in New
state, starting oss
here would mean we can no longer use the manager to watch and handle its state. Since all we really needed was to run an iteration of the oss
loop before starting it, I opted to pull the starting
logic out rather than change the whole service management aspect, though I agree that it's not the cleanest design.
If you feel strongly I can move this back to starting
and handle the state of the oss
outside of the manager. I assume I can base the code off of how we handle the Lifecycler
already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was done in #7583
1346932
to
e4061a1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
…7087) * Perform initial owned series calculation before starting lifecycler * Enable owned series tracking in integration tests * Enable limiting based on owned series in e2e tests * Run ingester limits e2e tests with owned series enabled and disabled * Clean up * Update CHANGELOG (cherry picked from commit 81af14a)
The backport to
To backport manually, run these commands in your terminal: # Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-7087-to-r273 origin/r273
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x 81af14aa19e532913730acfb9c3bd367a33502ff
# Push it to GitHub
git push --set-upstream origin backport-7087-to-r273
git switch main
# Remove the local backport branch
git branch -D backport-7087-to-r273 Then, create a pull request where the |
…7087) (#7140) * Perform initial owned series calculation before starting lifecycler * Enable owned series tracking in integration tests * Enable limiting based on owned series in e2e tests * Run ingester limits e2e tests with owned series enabled and disabled * Clean up * Update CHANGELOG (cherry picked from commit 81af14a)
What this PR does
When testing the owned series service we noticed an increase in read errors during ingester startup, with the message
"ingester is unavailable (current state: Starting)"
. This was because the owned series service was explicitly waiting for the ring state to becomeACTIVE
before moving to service stateRunning
, which in turn delayed the overall ingester service moving toRunning
as well. Any requests that arrived between the ring state becomingACTIVE
and the ingester service becomingRunning
were rejected with the above message.Ideally there should be no work done between those two state change events. Unfortunately, we still use the old
Lifecycler
in the ingester, so our control over the move to theACTIVE
state is very limited without a significant rework; once the lifecycler is started it will move to theACTIVE
state as soon as it can.In order to minimize the delay between the lifecycler start and the ingester service switching to
Running
, this PR moves the initial owned series calculation to after WAL replay but before lifecycler start. This is possible because the owned series calculation isn't dependent on the state of the ingesters in the ring, just the token distribution.Because we are no longer waiting for the ring lifecycler at all, it is possible for the ring to not contain the ingester, or even be completely empty, when we try to compute the owned series for the first time. I've tested the following ingester start-up scenarios:
1. Ingester has TSDB users and exists in the ring
2. Ingester has TSDB users but doesn't exist in the ring
nil
, and the owned series count to0
3. Ingester has no TSDB users and doesn't exist in the ring
updateAllTenants
will short circuit because the length is0
4. Ingester has TSDB users but the ring is empty
0
5. Ingester has no users, and the ring is empty
This PR also changes the two limit-related end-to-end tests to run both with and without the owned series service enabled. The changes to the tests are best viewed with whitespace changes ignored.
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]
.about-versioning.md
updated with experimental features.