Compactor stalls when failing to compact a blocks group due to corrupted source blocks #112

Open
grafanabot opened this issue Aug 10, 2021 · 2 comments

Comments

@grafanabot
Contributor

We saw the compactor stall (it stopped compacting any blocks) because of a corrupted source block. Every attempt to compact the affected blocks group kept failing with the following error:

level=error ts=2020-07-12T17:35:05.516823471Z caller=compactor.go:339 component=compactor msg="failed to compact user blocks" user=REDACTED err="compaction: group 0@6672437747845546250: block with not healthy index found /data/compact/0@6672437747845546250/REDACTED; Compaction level 1; Labels: map[__org_id__:REDACTED]: 1/1183085 series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"

The root cause of the out-of-order chunks should be investigated and fixed, but that is out of the scope of this issue. Ideally, the compactor should either skip the corrupted block and compact the other ones, or move on to compacting other non-overlapping blocks (e.g. other non-overlapping time ranges) if available. Today it just stalls, even if other work could be done.

Similarly to what we do with the deletion-mark.json, we could also consider marking a block as corrupted (e.g. corruption-mark.json) and automatically excluding marked blocks from compaction, while alerting on them. An operator can then investigate the block offline and, if a repair tool is available, the compactor will compact the block once it has been fixed and unmarked.

/cc @bwplotka @codesome @pstibrany

Submitted by: pracucci
Cortex Issue Number: 2866

@grafanabot
Contributor Author

With vertical compaction enabled, we can do that now, yes. Previously it was a no-go for that reason.

Submitted by: bwplotka

@grafanabot
Contributor Author

@bwplotka We should probably discuss it in Thanos, but what's your take on having a corruption-mark.json? I have mixed feelings, because it would add even more GET object API calls.

Submitted by: pracucci
