We experienced the compactor stalling (not compacting any blocks) because of a corrupted source block. The compactor was continuously failing while trying to compact this block's group:
level=error ts=2020-07-12T17:35:05.516823471Z caller=compactor.go:339 component=compactor msg="failed to compact user blocks" user=REDACTED err="compaction: group 0@6672437747845546250: block with not healthy index found /data/compact/0@6672437747845546250/REDACTED; Compaction level 1; Labels: map[__org_id__:REDACTED]: 1/1183085 series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"
Investigating and fixing the root cause of the out-of-order chunks should be done, but it's out of the scope of this issue. The compactor should ideally either skip the corrupted block and compact the other ones, or move on to compacting other non-overlapping blocks (eg. other non-overlapping time ranges) if available; otherwise the compactor just stalls even when other work could be done.
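For illustration only, here is a minimal Go sketch of the skip-and-continue behavior described above (all names, such as `Group` and `compactAllGroups`, are hypothetical stand-ins, not the actual Cortex/Thanos compactor code): a failure in one group is logged and skipped so the remaining, non-overlapping groups still make progress, while the first error is still surfaced to the caller.

```go
package main

import (
	"errors"
	"log"
)

// Group is a hypothetical stand-in for the compactor's per-tenant block group.
type Group struct {
	Key     string
	Corrupt bool // simulates a group containing a block with an unhealthy index
}

// compactGroup is a placeholder for the real per-group compaction call.
func compactGroup(g Group) error {
	if g.Corrupt {
		return errors.New("block with not healthy index found")
	}
	return nil
}

// compactAllGroups compacts each group independently: a failure in one
// group is logged and skipped, so other non-overlapping groups can still
// be compacted, and the first error is returned at the end.
func compactAllGroups(groups []Group) error {
	var firstErr error
	for _, g := range groups {
		if err := compactGroup(g); err != nil {
			log.Printf("failed to compact group %s: %v", g.Key, err)
			if firstErr == nil {
				firstErr = err
			}
		}
	}
	return firstErr
}

func main() {
	groups := []Group{{Key: "0@6672437747845546250", Corrupt: true}, {Key: "0@42"}}
	if err := compactAllGroups(groups); err != nil {
		log.Printf("compaction finished with errors: %v", err)
	}
}
```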
Similarly to what we do with the deletion-mark.json, we could also consider marking a block as corrupted (eg. corruption-mark.json) and automatically excluding blocks marked as corrupted from compaction, while alerting on it. An operator can investigate it offline and, if a repairing tool is available, the compactor will compact it once the block has been fixed and unmarked as corrupted.

/cc @bwplotka @codesome @pstibrany
@bwplotka Probably we should discuss it in Thanos, but what's your take on having a corruption-mark.json? I have mixed feelings, because it will add further GET object API calls.
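For illustration, a minimal sketch of what such a marker file could contain, assuming a format modeled on Thanos's deletion-mark.json (the corruption-mark.json file name and every field below are assumptions from this proposal, not an existing Thanos/Cortex format):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CorruptionMark is a hypothetical corruption-mark.json payload, modeled
// on Thanos's deletion-mark.json. Neither the file name nor these fields
// exist today; they are assumptions for discussion.
type CorruptionMark struct {
	ID             string `json:"id"`              // ULID of the corrupted block
	Version        int    `json:"version"`         // marker file format version
	Details        string `json:"details"`         // human-readable reason, e.g. the compaction error
	CorruptionTime int64  `json:"corruption_time"` // unix seconds when corruption was detected
}

func main() {
	m := CorruptionMark{
		ID:             "01EABCDEFGHJKMNPQRSTVWXYZ0", // sample ULID
		Version:        1,
		Details:        "1/1183085 series have out-of-order chunks",
		CorruptionTime: 1594575305,
	}
	out, _ := json.MarshalIndent(m, "", "  ")
	fmt.Println(string(out))
}
```

The compactor would then exclude from group planning any block for which this marker exists; checking for the marker per block is where the extra GET object API calls mentioned above would come from.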
Submitted by: pracucci
Cortex Issue Number: 2866