Fix race condition in Query Scheduler ring with frontend/worker #4545

Merged
1 commit merged on Oct 26, 2021

Conversation

trevorwhitney
Collaborator

What this PR does / why we need it:

When the query frontend and/or query worker targets are enabled on the same instance as the query scheduler (which happens in both the All and Read targets), the frontend and worker require the scheduler's initialization sequence to run before their own so that the scheduler ring is populated. The frontend and worker rely on this ring when no frontend or scheduler address has been specified, which is also usually the case in single-binary (All target) and simple scalable (Read target) deployments.
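
Concretely, the change comes down to making the scheduler ring a dependency of the frontend and worker modules so that it boots first. The sketch below illustrates the idea with a minimal dependency-first (topological) boot order; the module names and helper here are illustrative, not the actual Loki module-manager code.

```go
package main

import "fmt"

// bootOrder appends target and its dependencies to out in dependency-first
// order (a depth-first topological sort).
func bootOrder(deps map[string][]string, target string, seen map[string]bool, out *[]string) {
	if seen[target] {
		return
	}
	seen[target] = true
	for _, d := range deps[target] {
		bootOrder(deps, d, seen, out)
	}
	*out = append(*out, target)
}

func main() {
	// Declaring the scheduler ring as a dependency of the frontend and
	// worker guarantees it is initialized (and populated) before either.
	deps := map[string][]string{
		"query-frontend":       {"query-scheduler-ring"},
		"query-worker":         {"query-scheduler-ring"},
		"query-scheduler-ring": {},
	}

	var order []string
	seen := map[string]bool{}
	for _, m := range []string{"query-frontend", "query-worker"} {
		bootOrder(deps, m, seen, &order)
	}
	fmt.Println(order) // [query-scheduler-ring query-frontend query-worker]
}
```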

@trevorwhitney requested a review from a team as a code owner on October 25, 2021 at 21:44
@slim-bean
Collaborator

I'm a little hesitant about this. We talked about it in Slack a little, @trevorwhitney, but I'd like to avoid this solution until we better understand what's causing the problem. It feels like a race between services starting in parallel, and if that's the case, this change may not address the issue.

Could you post the error you see when it fails? I'm trying to recreate it but haven't had any luck so far.

@trevorwhitney
Collaborator Author

trevorwhitney commented Oct 26, 2021

So, the apps did not crash, but queries failed. I have an nginx in front of the read path, and it would return 502s on most queries. When I looked at the nginx logs, they reported a downstream status of 499, which is a special code nginx uses when a client closes the connection before the response is sent.

What I think was happening is that the scheduler ring is nil when the worker gets initialized (which I confirmed with some printlns), so the worker falls back to a frontend address of localhost:9095. The query scheduler ring does spin up afterwards, so jobs get distributed across all workers, but the workers, if I understand correctly, only communicate back with the frontend on their own instance. So if a request was handled by a frontend on one instance and then by a worker on another, it would never complete and would time out. That was my guess as to what was happening. After making this change, queries worked normally.
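
To make that failure mode concrete, here's roughly the fallback the worker ends up taking when the ring isn't ready. This is a simplified, hypothetical sketch with stand-in types and names, not the actual Loki code:

```go
package main

import "fmt"

// Illustrative stand-ins, not the real Loki types.
type Ring struct{ addr string }

func (r *Ring) LookupFrontend() string { return r.addr }

type Config struct{ FrontendAddress string }

// resolveFrontendAddress mimics the fallback described above: with no
// explicit address configured and a nil ring, the worker ends up pointed
// at its own local frontend and never discovers the other instances.
func resolveFrontendAddress(cfg Config, schedulerRing *Ring) string {
	if cfg.FrontendAddress != "" {
		return cfg.FrontendAddress
	}
	if schedulerRing == nil {
		// Race: the scheduler ring has not been initialized yet, so
		// fall back to the local frontend.
		return "localhost:9095"
	}
	return schedulerRing.LookupFrontend()
}

func main() {
	fmt.Println(resolveFrontendAddress(Config{}, nil))                          // localhost:9095
	fmt.Println(resolveFrontendAddress(Config{}, &Ring{addr: "10.0.0.7:9095"})) // 10.0.0.7:9095
}
```

With the boot order fixed, the ring is non-nil by the time the worker resolves the frontend address, so responses are routed back through the frontend that actually owns the request.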

Member

@owen-d left a comment


This looks good to me.

@owen-d merged commit d79a8e0 into final-ssd-configs on Oct 26, 2021
@owen-d deleted the scheduler-ring-race branch on October 26, 2021 at 21:14
slim-bean pushed a commit that referenced this pull request Oct 28, 2021
…storage config (#4543)

* further config simplifications and better defaults

* default ruler api to on
* register defaults for storage configs in the common config (e.g.
  signature version for S3)
* enable ingester wal by default
* disable chunk retries by default
* allow common path prefix to include or omit trailing slash
* allow partial storage config to be defined for overriding things like
  bucket names per component

* fix tests

* add changelog and upgrade docs

* explain impact of wal change in upgrading.md

Co-authored-by: Owen Diehl <ow.diehl@gmail.com>

* rename add WithPrefix to RegisterFlags

* fix (frontend OR worker) and scheduler boot order when both are running (#4545)

Co-authored-by: Owen Diehl <ow.diehl@gmail.com>