We experienced the compactor stalling (not compacting any blocks) because of a corrupted source block. The compactor was continuously failing while trying to compact this block's group:
level=error ts=2020-07-12T17:35:05.516823471Z caller=compactor.go:339 component=compactor msg="failed to compact user blocks" user=REDACTED err="compaction: group 0@6672437747845546250: block with not healthy index found /data/compact/0@6672437747845546250/REDACTED; Compaction level 1; Labels: map[__org_id__:REDACTED]: 1/1183085 series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"
Investigating and fixing the root cause of the out-of-order chunks should be done, but it's out of the scope of this issue. The compactor should ideally either skip the corrupted block and compact the other ones, or move on to compacting other non-overlapping blocks (eg. other non-overlapping time ranges) if available; otherwise the compactor just stalls even when other work could be done.
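For illustration only, here is a minimal Go sketch of the skip-and-continue behavior described above (all names, such as `Group` and `compactAllGroups`, are hypothetical stand-ins, not the actual Cortex/Thanos compactor code): a failure in one group is logged and skipped so the remaining, non-overlapping groups still make progress, while the first error is still surfaced to the caller.

```go
package main

import (
	"errors"
	"log"
)

// Group is a hypothetical stand-in for the compactor's per-tenant block group.
type Group struct {
	Key     string
	Corrupt bool // simulates a group containing a block with an unhealthy index
}

// compactGroup is a placeholder for the real per-group compaction call.
func compactGroup(g Group) error {
	if g.Corrupt {
		return errors.New("block with not healthy index found")
	}
	return nil
}

// compactAllGroups compacts each group independently: a failure in one
// group is logged and skipped, so other non-overlapping groups can still
// be compacted, and the first error is returned at the end.
func compactAllGroups(groups []Group) error {
	var firstErr error
	for _, g := range groups {
		if err := compactGroup(g); err != nil {
			log.Printf("failed to compact group %s: %v", g.Key, err)
			if firstErr == nil {
				firstErr = err
			}
		}
	}
	return firstErr
}

func main() {
	groups := []Group{{Key: "0@6672437747845546250", Corrupt: true}, {Key: "0@42"}}
	if err := compactAllGroups(groups); err != nil {
		log.Printf("compaction finished with errors: %v", err)
	}
}
```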
Similarly to what we do with the deletion-mark.json, we could also consider marking a block as corrupted (eg. corruption-mark.json) and automatically excluding blocks marked as corrupted from compaction, while alerting on it. An operator can investigate it offline and, if a repairing tool is available, the compactor will compact it once the block has been fixed and unmarked as corrupted.

/cc @bwplotka @codesome @pstibrany
@bwplotka Probably we should discuss it in Thanos, but what's your take on having a corruption-mark.json? I have mixed feelings, because it will add further GET object API calls.
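For illustration, a minimal sketch of what such a marker file could contain, assuming a format modeled on Thanos's deletion-mark.json (the corruption-mark.json file name and every field below are assumptions from this proposal, not an existing Thanos/Cortex format):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CorruptionMark is a hypothetical corruption-mark.json payload, modeled
// on Thanos's deletion-mark.json. Neither the file name nor these fields
// exist today; they are assumptions for discussion.
type CorruptionMark struct {
	ID             string `json:"id"`              // ULID of the corrupted block
	Version        int    `json:"version"`         // marker file format version
	Details        string `json:"details"`         // human-readable reason, e.g. the compaction error
	CorruptionTime int64  `json:"corruption_time"` // unix seconds when corruption was detected
}

func main() {
	m := CorruptionMark{
		ID:             "01EABCDEFGHJKMNPQRSTVWXYZ0", // sample ULID
		Version:        1,
		Details:        "1/1183085 series have out-of-order chunks",
		CorruptionTime: 1594575305,
	}
	out, _ := json.MarshalIndent(m, "", "  ")
	fmt.Println(string(out))
}
```

The compactor would then exclude from group planning any block for which this marker exists; checking for the marker per block is where the extra GET object API calls mentioned above would come from.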
Submitted by: pracucci
Cortex Issue Number: 2866