Fix race condition in Query Scheduler ring with frontend/worker #4545

Merged
merged 1 commit into final-ssd-configs from scheduler-ring-race on Oct 26, 2021

Conversation

trevorwhitney (Collaborator)

What this PR does / why we need it:

When the query frontend and/or query worker targets are enabled on the same instance as the query scheduler (which happens in both the All and Read targets), the frontend and worker require the scheduler's initialization sequence to run before their own so that the scheduler ring is populated. The frontend and worker rely on this ring whenever no frontend or scheduler address has been specified, which is usually the case in single-binary (All target) or simple scalable (Read target) deployments.
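
This is not Loki's actual module code, but a minimal Go sketch of the idea, with hypothetical module names and init functions assumed for illustration: the frontend and worker declare the scheduler ring as a dependency, so its init runs before theirs regardless of the order in which the modules are registered.

```go
package main

import "fmt"

// module is a hypothetical stand-in for a target/module: the modules it
// depends on plus an init function.
type module struct {
	deps []string
	init func() error
}

// initInOrder initializes a module only after all of its dependencies have
// been initialized (a minimal dependency-ordered startup).
func initInOrder(mods map[string]module, done map[string]bool, name string) error {
	if done[name] {
		return nil
	}
	for _, dep := range mods[name].deps {
		if err := initInOrder(mods, done, dep); err != nil {
			return err
		}
	}
	done[name] = true
	return mods[name].init()
}

func main() {
	mods := map[string]module{
		"scheduler-ring": {init: func() error { fmt.Println("scheduler ring up"); return nil }},
		"query-frontend": {deps: []string{"scheduler-ring"}, init: func() error { fmt.Println("frontend up"); return nil }},
		"querier-worker": {deps: []string{"scheduler-ring"}, init: func() error { fmt.Println("worker up"); return nil }},
	}
	done := map[string]bool{}
	for name := range mods {
		if err := initInOrder(mods, done, name); err != nil {
			panic(err)
		}
	}
}
```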

@trevorwhitney trevorwhitney requested a review from a team as a code owner October 25, 2021 21:44
slim-bean (Collaborator)

I'm a little hesitant about this. We talked about it a bit in Slack, @trevorwhitney, but I'd like to avoid this solution until we better understand what's causing the problem; it feels like a race between services starting in parallel, and if that's the case this change may not address the issue.

Could you post the error you see when it fails? I'm trying to recreate it but haven't had any luck so far.

trevorwhitney (Collaborator, Author) commented Oct 26, 2021

So, the apps did not crash, but queries failed. I have an nginx instance in front of the read path, and it returned 502s on most queries. When I looked at the nginx logs, they reported a downstream code of 499, a special status nginx uses when the client closes the connection.

What I think was happening is that the scheduler ring is nil when the worker gets initialized (which I confirmed with some printlns), so the worker falls back to a frontend address of localhost:9095. The query scheduler ring does eventually spin up, so jobs get distributed across all workers, but each worker, if I understand correctly, only communicates back with the frontend on its own instance. So if a job is handled by a frontend on one instance and then by a worker on another, that request never completes and eventually times out. That was my guess as to what was happening; after making this change, queries worked normally.
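
As a hypothetical illustration of that failure mode (the function, types, and address handling below are assumptions for the sake of the example, not Loki's actual worker code), the point is that with the ring still nil at init time, every worker resolves to its own local frontend:

```go
package main

import "fmt"

// ring is a stand-in for the scheduler ring's view of frontend addresses.
type ring struct {
	addrs []string
}

// resolveFrontendAddr mimics the decision a worker makes at startup:
// use the explicitly configured address, otherwise ask the scheduler ring,
// otherwise fall back to the local default.
func resolveFrontendAddr(configured string, schedulerRing *ring) string {
	if configured != "" {
		return configured
	}
	if schedulerRing != nil && len(schedulerRing.addrs) > 0 {
		return schedulerRing.addrs[0]
	}
	// Ring not initialized yet: every worker ends up pointing at its own
	// local frontend, so work handed to a remote worker never reports back.
	return "localhost:9095"
}

func main() {
	// Before the fix: the ring hasn't been populated when the worker starts.
	fmt.Println(resolveFrontendAddr("", nil)) // localhost:9095

	// After the fix: the scheduler ring is initialized first.
	populated := &ring{addrs: []string{"read-1:9095", "read-2:9095"}}
	fmt.Println(resolveFrontendAddr("", populated)) // read-1:9095
}
```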

owen-d (Member) left a comment


This looks good to me.

@owen-d owen-d merged commit d79a8e0 into final-ssd-configs Oct 26, 2021
@owen-d owen-d deleted the scheduler-ring-race branch October 26, 2021 21:14
slim-bean pushed a commit that referenced this pull request Oct 28, 2021
…storage config (#4543)

* further config simplifications and better defaults

* default ruler api to on
* register defaults for storage configs in the common config (i.e.
  signature version for S3)
* enable ingester wal by default
* disable chunk retries by default
* allow common path prefix to include or omit trailing slash
* allow partial storage config to be defined for overriding things like
  bucket names per component

* fix tests

* add changelog and upgrade docs

* explain impact of wal change in upgrading.md

Co-authored-by: Owen Diehl <ow.diehl@gmail.com>

* rename add WithPrefix to RegisterFlags

* fix (frontend OR worker) and scheduler boot order when both are running (#4545)

Co-authored-by: Owen Diehl <ow.diehl@gmail.com>