Fix panic as a result of duplicate write request cleanup #7176

Merged 3 commits on Jan 19, 2024

@Logiraptor Logiraptor commented Jan 19, 2024

What this PR does

This PR fixes a possible panic introduced by #6970. In prePushRelabelMiddleware, when MetricRelabelingEnabled returns true, the middleware called the next middleware in the chain and then also called CleanUp in a defer. As a result, CleanUp was called twice, which has two consequences:

  1. Distributors can panic during marshaling because the WriteRequest has already been reset to zero length.
  2. Ingesters can ingest corrupt data because the Label values in the request may be unsorted. They then get stuck failing to compact blocks.
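
To make the double call concrete, here is a minimal sketch of the pattern, not the actual Mimir code. `Request`, `PushFunc`, `terminal`, `buggyMiddleware`, and `fixedMiddleware` are hypothetical names invented for illustration; only the `cleanupInDefer` flag name comes from this conversation. The sketch assumes the last handler in the chain is responsible for cleanup once the request has been handed to it.

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a hypothetical stand-in for the distributor's push request.
type Request struct {
	cleanups int // number of times CleanUp has run
}

// CleanUp only counts calls here. In the real code a second call would
// touch a WriteRequest that has already been reset.
func (r *Request) CleanUp() { r.cleanups++ }

// PushFunc is a hypothetical handler signature for this sketch.
type PushFunc func(r *Request) error

// terminal is the end of the chain: it owns the request and cleans it up.
func terminal(r *Request) error {
	r.CleanUp()
	return nil
}

// buggyMiddleware cleans up in a defer even after handing the request to next.
func buggyMiddleware(next PushFunc) PushFunc {
	return func(r *Request) error {
		defer r.CleanUp()
		return next(r)
	}
}

// fixedMiddleware cleans up in the defer only when it returns without calling next.
func fixedMiddleware(next PushFunc, reject bool) PushFunc {
	return func(r *Request) error {
		cleanupInDefer := true
		defer func() {
			if cleanupInDefer {
				r.CleanUp()
			}
		}()
		if reject {
			return errors.New("rejected before next")
		}
		cleanupInDefer = false // next now owns the cleanup
		return next(r)
	}
}

func main() {
	buggy, fixed, rejected := &Request{}, &Request{}, &Request{}
	_ = buggyMiddleware(terminal)(buggy)
	_ = fixedMiddleware(terminal, false)(fixed)
	_ = fixedMiddleware(terminal, true)(rejected)
	fmt.Println(buggy.cleanups, fixed.cleanups, rejected.cleanups) // prints "2 1 1"
}
```

In the fixed variant, exactly one cleanup happens on both paths: the middleware cleans up when it rejects the request, and the downstream handler cleans up when the request is forwarded.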

The existing middleware test tried to detect this condition via the cleanupCallCount field, but it is too simple to reproduce the behavior of the actual distributor middleware in a running Mimir cluster. To reproduce the issue in a test, we can insert a fake cleanup function after calling CleanUp, so that a second call is caught and fails the test.
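
The trap idea can be sketched roughly as follows. This is an illustration under assumptions, not the test from this PR: `Request`, its `cleanup` field, `trappingNext`, `deferringMiddleware`, `passThroughMiddleware`, and `extraCleanups` are all hypothetical names. The stand-in for the next handler calls CleanUp once and then swaps in a fake cleanup function, so any later call is recorded as a violation.

```go
package main

import "fmt"

// Request is a hypothetical stand-in for the push request; cleanup is the
// function that CleanUp invokes.
type Request struct {
	cleanup func()
}

// CleanUp runs the currently installed cleanup function, if any.
func (r *Request) CleanUp() {
	if r.cleanup != nil {
		r.cleanup()
	}
}

// PushFunc is a hypothetical handler signature for this sketch.
type PushFunc func(r *Request) error

// trappingNext plays the role of the next handler in a test: it performs the
// legitimate cleanup, then installs a fake cleanup that counts further calls.
func trappingNext(extraCalls *int) PushFunc {
	return func(r *Request) error {
		r.CleanUp()
		r.cleanup = func() { *extraCalls++ } // any call from here on is a bug
		return nil
	}
}

// deferringMiddleware reproduces the bug: it cleans up after calling next.
func deferringMiddleware(next PushFunc) PushFunc {
	return func(r *Request) error {
		defer r.CleanUp()
		return next(r)
	}
}

// passThroughMiddleware forwards the request and leaves cleanup to next.
func passThroughMiddleware(next PushFunc) PushFunc {
	return func(r *Request) error { return next(r) }
}

// extraCleanups runs one request through mw and reports how many CleanUp
// calls hit the trap.
func extraCleanups(mw func(PushFunc) PushFunc) int {
	extra := 0
	req := &Request{cleanup: func() {}}
	_ = mw(trappingNext(&extra))(req)
	return extra
}

func main() {
	fmt.Println(extraCleanups(deferringMiddleware), extraCleanups(passThroughMiddleware)) // prints "1 0"
}
```

A real test would fail as soon as the trap fires instead of counting; the counter is used here only to keep the sketch runnable outside a test harness.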

I've reproduced this issue in a local Mimir cluster deployed via Helm and verified that the proposed fix works.

Which issue(s) this PR fixes or relates to

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@Logiraptor Logiraptor requested a review from a team as a code owner January 19, 2024 19:43

@56quarters 56quarters left a comment

LGTM, confirmed that tests fail without cleanupInDefer = false in the relabeling and HA middlewares.

@56quarters

Should this have a CHANGELOG entry? It's a niche bug but seems worth mentioning.

@Logiraptor Logiraptor merged commit 4b1d049 into main Jan 19, 2024
28 checks passed
@Logiraptor Logiraptor deleted the logiraptor/fix-panic-relabel-middleware branch January 19, 2024 19:59
colega added a commit that referenced this pull request Apr 1, 2024
We follow a simple logic: if we don't call next(), we should call
pushReq.Cleanup(). However, there's nothing in the code ensuring that,
which leads to bugs and fixes like:

- #7176
- #7755

A simple way to prevent that is to couple the `cleanupInDefer = false`
to the `next()` call, and this does exactly that.

Note that `nextOrCleanup` is introduced just for convenience, and it's
inlined.

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
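
One way the coupling described in that commit message could look is sketched below. The signature of `nextOrCleanup` shown here is an assumption made for illustration, as are `Request`, `PushFunc`, `terminal`, and `middleware`; the real helper in Mimir may differ. The point is that clearing the flag and calling `next` happen in the same place, so a middleware cannot forward the request and still clean it up in its defer.

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a hypothetical stand-in for the push request.
type Request struct{ cleanups int }

// CleanUp counts calls so the sketch can show how often it ran.
func (r *Request) CleanUp() { r.cleanups++ }

// PushFunc is a hypothetical handler signature for this sketch.
type PushFunc func(r *Request) error

// nextOrCleanup (hypothetical signature) returns a wrapped next that disables
// the deferred cleanup when called, plus the function to defer.
func nextOrCleanup(next PushFunc, r *Request) (PushFunc, func()) {
	cleanupInDefer := true
	wrapped := func(req *Request) error {
		cleanupInDefer = false // ownership moves to next
		return next(req)
	}
	maybeCleanup := func() {
		if cleanupInDefer {
			r.CleanUp()
		}
	}
	return wrapped, maybeCleanup
}

// terminal is the end of the chain: it owns the request and cleans it up.
func terminal(r *Request) error {
	r.CleanUp()
	return nil
}

// middleware either rejects the request or forwards it; cleanup happens
// exactly once on both paths.
func middleware(next PushFunc, reject bool) PushFunc {
	return func(r *Request) error {
		// Shadowing next means the unwrapped handler cannot be called by mistake.
		next, maybeCleanup := nextOrCleanup(next, r)
		defer maybeCleanup()
		if reject {
			return errors.New("rejected before next")
		}
		return next(r)
	}
}

func main() {
	forwarded, rejected := &Request{}, &Request{}
	_ = middleware(terminal, false)(forwarded)
	_ = middleware(terminal, true)(rejected)
	fmt.Println(forwarded.cleanups, rejected.cleanups) // prints "1 1"
}
```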