compactor: change default of partial-block-deletion-delay, remove MimirTenantHasPartialBlocks (#5026)

* compactor: change default of `compactor.partial-block-deletion-delay`, remove MimirTenantHasPartialBlocks

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

* remove alert

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

* Add CHANGELOG.md entry

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

* Update helm tests

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>

---------

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
dimitarvdimitrov committed May 17, 2023
1 parent fadd1a1 commit 33842d2
Showing 11 changed files with 7 additions and 71 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -15,6 +15,7 @@
* `-blocks-storage.bucket-store.chunk-pool-min-bucket-size-bytes`
* `-blocks-storage.bucket-store.chunk-pool-max-bucket-size-bytes`
* [CHANGE] Store-gateway: remove metrics `cortex_bucket_store_chunk_pool_requested_bytes_total` and `cortex_bucket_store_chunk_pool_returned_bytes_total`. #4996
+* [CHANGE] Compactor: change default of `-compactor.partial-block-deletion-delay` to `1d`. This will automatically clean up partial blocks that were a result of failed block upload or deletion. #5026
* [ENHANCEMENT] Add per-tenant limit `-validation.max-native-histogram-buckets` to be able to ignore native histogram samples that have too many buckets. #4765
* [ENHANCEMENT] Store-gateway: reduce memory usage in some LabelValues calls. #4789
* [ENHANCEMENT] Store-gateway: add a `stage` label to the metric `cortex_bucket_store_series_data_touched`. This label now applies to `data_type="chunks"` and `data_type="series"`. The `stage` label has 2 values: `processed` - the number of series that parsed - and `returned` - the number of series selected from the processed bytes to satisfy the query. #4797 #4830
@@ -56,6 +57,7 @@
### Mixin

* [CHANGE] Alerts: Remove `MimirQuerierHighRefetchRate`. #4980
+* [CHANGE] Alerts: Remove `MimirTenantHasPartialBlocks`. This is obsoleted by the changed default of `-compactor.partial-block-deletion-delay` to `1d`, which will auto remediate this alert. #5026
* [ENHANCEMENT] Alertmanager dashboard: display active aggregation groups #4772
* [ENHANCEMENT] Alerts: `MimirIngesterTSDBWALCorrupted` now only fires when there are more than one corrupted WALs in single-zone deployments and when there are more than two zones affected in multi-zone deployments. #4920
* [ENHANCEMENT] dashboards: fix holes in graph for lightly loaded clusters #4915
2 changes: 1 addition & 1 deletion cmd/mimir/config-descriptor.json
@@ -3222,7 +3222,7 @@
"required": false,
"desc": "If a partial block (unfinished block without meta.json file) hasn't been modified for this time, it will be marked for deletion. The minimum accepted value is 4h0m0s: a lower value will be ignored and the feature disabled. 0 to disable.",
"fieldValue": null,
-"fieldDefaultValue": 0,
+"fieldDefaultValue": 86400000000000,
"fieldFlag": "compactor.partial-block-deletion-delay",
"fieldType": "duration"
},
2 changes: 1 addition & 1 deletion cmd/mimir/help-all.txt.tmpl
@@ -900,7 +900,7 @@ Usage of ./cmd/mimir/mimir:
-compactor.meta-sync-concurrency int
Number of Go routines to use when syncing block meta files from the long term storage. (default 20)
-compactor.partial-block-deletion-delay duration
-If a partial block (unfinished block without meta.json file) hasn't been modified for this time, it will be marked for deletion. The minimum accepted value is 4h0m0s: a lower value will be ignored and the feature disabled. 0 to disable.
+If a partial block (unfinished block without meta.json file) hasn't been modified for this time, it will be marked for deletion. The minimum accepted value is 4h0m0s: a lower value will be ignored and the feature disabled. 0 to disable. (default 1d)
-compactor.ring.consul.acl-token string
ACL Token used to interact with Consul.
-compactor.ring.consul.cas-retry-delay duration
2 changes: 1 addition & 1 deletion cmd/mimir/help.txt.tmpl
@@ -300,7 +300,7 @@ Usage of ./cmd/mimir/mimir:
-compactor.data-dir string
Directory to temporarily store blocks during compaction. This directory is not required to be persisted between restarts. (default "./data-compactor/")
-compactor.partial-block-deletion-delay duration
-If a partial block (unfinished block without meta.json file) hasn't been modified for this time, it will be marked for deletion. The minimum accepted value is 4h0m0s: a lower value will be ignored and the feature disabled. 0 to disable.
+If a partial block (unfinished block without meta.json file) hasn't been modified for this time, it will be marked for deletion. The minimum accepted value is 4h0m0s: a lower value will be ignored and the feature disabled. 0 to disable. (default 1d)
-compactor.ring.consul.hostname string
Hostname and port of Consul. (default "localhost:8500")
-compactor.ring.etcd.endpoints string
23 changes: 0 additions & 23 deletions docs/sources/mimir/operators-guide/mimir-runbooks/_index.md
@@ -668,29 +668,6 @@ How to **investigate**:
- Ensure the compactor is successfully running
- Look for any error in the compactor logs

-### MimirTenantHasPartialBlocks
-
-This alert fires when Mimir finds partial blocks for a given tenant. A partial block is a block missing the `meta.json` and this may usually happen in two circumstances:
-
-1. A block upload has been interrupted and not cleaned up or retried
-2. A block deletion has been interrupted and `deletion-mark.json` has been deleted before `meta.json`
-
-How to **investigate**:
-
-1. Look for partial blocks in the logs. Example Loki query: `{cluster="<cluster>",namespace="<namespace>",container="compactor"} |= "skipped partial block"`
-1. Pick a block and note its ID (`block` field in log entry) and tenant ID (`org_id` in log entry)
-1. Find the bucket used by the Mimir cluster, such as checking the configured `blocks_storage_bucket_name` if you are using Jsonnet.
-1. Find out which Mimir component operated on the block last (e.g. uploaded by ingester/compactor, or deleted by compactor)
-1. Determine when the partial block was uploaded: `gsutil ls -l gs://${BUCKET}/${TENANT_ID}/${BLOCK_ID}`. Alternatively you can use `ulidtime` command from Mimir tools directory `ulidtime ${BLOCK_ID}` to find block creation time.
-1. Search in the logs around that time to find the log entry from when the compactor created the block ("compacted blocks" for log message)
-1. From the compactor log entry you found, pick the job ID from the `groupKey` field, f.ex. `0@9748515562602778029-merge--1645711200000-1645718400000`
-1. Then search the logs for the job ID and look for an entry with the message "compaction job failed", this will show that the compactor failed uploading the block
-1. If you found a failed compaction job, as outlined in the previous step, try searching for a corresponding log message (for the same job ID) "compaction job succeeded". This will mean that the compaction job was retried successfully. Note: this should produce a different block ID from the failed upload.
-1. Investigate if it was a partial upload or partial delete
-1. If it was a partial delete or an upload failed by a compactor you can safely mark the block for deletion, and compactor will delete the block. You can use `markblocks` command from Mimir tools directory: `markblocks -mark deletion -allow-partial -tenant <tenant> <blockID>` with correct backend (eg. GCS: `-backend gcs -gcs.bucket-name <bucket-name>`) configuration.
-1. If it was a failed upload by an ingester, but not later retried (ingesters are expected to retry uploads until succeed), further investigate
-1. Prevent the issue from reoccurring by enabling automatic partial block cleanup. This can be enabled with the `-compactor.partial-block-deletion-delay` flag. It takes a duration as an argument. If a partial block persists past the specified duration, the compactor will automatically delete it. One can monitor automatic cleanup of partial blocks via the `cortex_compactor_blocks_marked_for_deletion_total{reason="partial"}` counter.
-
### MimirQueriesIncorrect

_TODO: this runbook has not been written yet._
Original file line number Diff line number Diff line change
@@ -2855,7 +2855,7 @@ The `limits` block configures default and per-tenant limits imposed by component
# value is 4h0m0s: a lower value will be ignored and the feature disabled. 0 to
# disable.
# CLI flag: -compactor.partial-block-deletion-delay
-[compactor_partial_block_deletion_delay: <duration> | default = 0s]
+[compactor_partial_block_deletion_delay: <duration> | default = 1d]

# Enable block upload API for the tenant.
# CLI flag: -compactor.block-upload-enabled
Original file line number Diff line number Diff line change
@@ -793,16 +793,6 @@ spec:
min by(cluster, namespace, user) (time() - cortex_bucket_index_last_successful_update_timestamp_seconds) > 7200
labels:
severity: critical
-- alert: MimirTenantHasPartialBlocks
-annotations:
-message: Mimir tenant {{ $labels.user }} in {{ $labels.cluster }}/{{ $labels.namespace
-}} has {{ $value }} partial blocks.
-runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirtenanthaspartialblocks
-expr: |
-max by(cluster, namespace, user) (cortex_bucket_blocks_partials_count) > 0
-for: 6h
-labels:
-severity: warning
- name: mimir_compactor_alerts
rules:
- alert: MimirCompactorHasNotSuccessfullyCleanedUpBlocks
10 changes: 0 additions & 10 deletions operations/mimir-mixin-compiled-baremetal/alerts.yaml
@@ -773,16 +773,6 @@ groups:
min by(cluster, namespace, user) (time() - cortex_bucket_index_last_successful_update_timestamp_seconds) > 7200
labels:
severity: critical
-- alert: MimirTenantHasPartialBlocks
-annotations:
-message: Mimir tenant {{ $labels.user }} in {{ $labels.cluster }}/{{ $labels.namespace
-}} has {{ $value }} partial blocks.
-runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirtenanthaspartialblocks
-expr: |
-max by(cluster, namespace, user) (cortex_bucket_blocks_partials_count) > 0
-for: 6h
-labels:
-severity: warning
- name: mimir_compactor_alerts
rules:
- alert: MimirCompactorHasNotSuccessfullyCleanedUpBlocks
10 changes: 0 additions & 10 deletions operations/mimir-mixin-compiled/alerts.yaml
@@ -781,16 +781,6 @@ groups:
min by(cluster, namespace, user) (time() - cortex_bucket_index_last_successful_update_timestamp_seconds) > 7200
labels:
severity: critical
-- alert: MimirTenantHasPartialBlocks
-annotations:
-message: Mimir tenant {{ $labels.user }} in {{ $labels.cluster }}/{{ $labels.namespace
-}} has {{ $value }} partial blocks.
-runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirtenanthaspartialblocks
-expr: |
-max by(cluster, namespace, user) (cortex_bucket_blocks_partials_count) > 0
-for: 6h
-labels:
-severity: warning
- name: mimir_compactor_alerts
rules:
- alert: MimirCompactorHasNotSuccessfullyCleanedUpBlocks
14 changes: 0 additions & 14 deletions operations/mimir-mixin/alerts/blocks.libsonnet
@@ -242,20 +242,6 @@
message: '%(product)s bucket index for tenant {{ $labels.user }} in %(alert_aggregation_variables)s has not been updated since {{ $value | humanizeDuration }}.' % $._config,
},
},
-{
-// Alert if a we consistently find partial blocks for a given tenant over a relatively large time range.
-alert: $.alertName('TenantHasPartialBlocks'),
-'for': '6h',
-expr: |||
-max by(%(alert_aggregation_labels)s, user) (cortex_bucket_blocks_partials_count) > 0
-||| % $._config,
-labels: {
-severity: 'warning',
-},
-annotations: {
-message: '%(product)s tenant {{ $labels.user }} in %(alert_aggregation_variables)s has {{ $value }} partial blocks.' % $._config,
-},
-},
],
},
],
1 change: 1 addition & 0 deletions pkg/util/validation/limits.go
@@ -244,6 +244,7 @@ func (l *Limits) RegisterFlags(f *flag.FlagSet) {
f.IntVar(&l.CompactorSplitAndMergeShards, "compactor.split-and-merge-shards", 0, "The number of shards to use when splitting blocks. 0 to disable splitting.")
f.IntVar(&l.CompactorSplitGroups, "compactor.split-groups", 1, "Number of groups that blocks for splitting should be grouped into. Each group of blocks is then split separately. Number of output split shards is controlled by -compactor.split-and-merge-shards.")
f.IntVar(&l.CompactorTenantShardSize, "compactor.compactor-tenant-shard-size", 0, "Max number of compactors that can compact blocks for single tenant. 0 to disable the limit and use all compactors.")
+_ = l.CompactorPartialBlockDeletionDelay.Set("1d")
f.Var(&l.CompactorPartialBlockDeletionDelay, "compactor.partial-block-deletion-delay", fmt.Sprintf("If a partial block (unfinished block without %s file) hasn't been modified for this time, it will be marked for deletion. The minimum accepted value is %s: a lower value will be ignored and the feature disabled. 0 to disable.", block.MetaFilename, MinCompactorPartialBlockDeletionDelay.String()))
f.BoolVar(&l.CompactorBlockUploadEnabled, "compactor.block-upload-enabled", false, "Enable block upload API for the tenant.")
f.BoolVar(&l.CompactorBlockUploadValidationEnabled, "compactor.block-upload-validation-enabled", true, "Enable block upload validation for the tenant.")