Add out-of-order sample support (#2187)
* Add out-of-order sample support

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

Co-authored-by: Jesus Vazquez <jesus.vazquez@grafana.com>

* Fix review comments

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Fix tests

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Update test to check runtime change of OutOfOrderTimeWindow

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Fix race in the test

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Fix Peter's comments

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Fix CI

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

* Fix review comments

Signed-off-by: Ganesh Vernekar <ganeshvern@gmail.com>

Co-authored-by: Jesus Vazquez <jesus.vazquez@grafana.com>
codesome and jesusvazquez committed Jun 28, 2022
1 parent e11ac85 commit 110d996
Showing 12 changed files with 325 additions and 62 deletions.
14 changes: 13 additions & 1 deletion CHANGELOG.md
@@ -15,6 +15,15 @@
* [CHANGE] Config flag category overrides can be set dynamically at runtime. #1934
* [CHANGE] Ingester: deprecated `-ingester.ring.join-after`. Mimir now behaves as if this setting is always set to 0s. This configuration option will be removed in Mimir 2.4.0. #1965
* [CHANGE] Blocks uploaded by ingester no longer contain `__org_id__` label. Compactor now ignores this label and will compact blocks with and without this label together. `mimirconvert` tool will remove the label from blocks as "unknown" label. #1972
* [CHANGE] Querier: deprecated `-querier.shuffle-sharding-ingesters-lookback-period`, instead adding `-querier.shuffle-sharding-ingesters-enabled` to enable or disable shuffle sharding on the read path. The value of `-querier.query-ingesters-within` is now used internally for shuffle sharding lookback. #2110
* [CHANGE] Memberlist: `-memberlist.abort-if-join-fails` now defaults to false. Previously it defaulted to true. #2168
* [CHANGE] Ruler: `/api/v1/rules*` and `/prometheus/rules*` configuration endpoints are removed. Use `/prometheus/config/v1/rules*`. #2182
* [CHANGE] Ingester: `-ingester.exemplars-update-period` has been renamed to `-ingester.tsdb-config-update-period`. You can use it to update multiple, per-tenant TSDB configurations. #2187
* [FEATURE] Ingester: (Experimental) Add the ability to ingest out-of-order samples up to an allowed limit. If you enable this feature, it requires additional memory and disk space. This feature also enables a write-behind log, which might lead to longer ingester-start replays. When this feature is disabled, there is no overhead on memory, disk space, or startup times. #2187
* `-ingester.out-of-order-time-window`, as a duration string, sets how far back in time a sample can be. The default is `0s`, where `s` is seconds.
* `cortex_ingester_tsdb_out_of_order_samples_appended_total` metric tracks the total number of out-of-order samples ingested by the ingester.
* `cortex_discarded_samples_total` has a new label `reason="sample-too-old"`, when the `-ingester.out-of-order-time-window` flag is greater than zero. The label tracks the number of samples that were discarded for being too old; they were out of order, but beyond the time window allowed.
* [ENHANCEMENT] Distributor: Added limit to prevent tenants from sending excessive number of requests: #1843
* The following CLI flags (and their respective YAML config options) have been added:
* `-distributor.request-rate-limit`
@@ -28,9 +37,12 @@
* [ENHANCEMENT] Upgrade Docker base images to `alpine:3.16.0`. #2028
* [ENHANCEMENT] Store-gateway: Add experimental configuration option for the store-gateway to attempt to pre-populate the file system cache when memory-mapping index-header files. Enabled with `-blocks-storage.bucket-store.index-header.map-populate-enabled=true`. Note this flag only has an effect when running on Linux. #2019 #2054
* [ENHANCEMENT] Chunk Mapper: reduce memory usage of async chunk mapper. #2043
* [ENHANCEMENT] Ingesters: Added new configuration option that makes it possible for mimir ingesters to perform queries on overlapping blocks in the filesystem. Enabled with `-blocks-storage.tsdb.allow-overlapping-queries`. #2091
* [ENHANCEMENT] Ingester: reduce sleep time when reading WAL. #2098
* [ENHANCEMENT] Compactor: Add HTTP API for uploading TSDB blocks. #1694
* [ENHANCEMENT] Compactor: Run sanity check on blocks storage configuration at startup. #2143
* [ENHANCEMENT] Compactor: Add HTTP API for uploading TSDB blocks. Enabled with `-compactor.block-upload-enabled`. #1694 #2126
* [ENHANCEMENT] Ingester: Enable querying overlapping blocks by default. #2187
* [BUGFIX] Fix regexp parsing panic for regexp label matchers with start/end quantifiers. #1883
* [BUGFIX] Ingester: fixed deceiving error log "failed to update cached shipped blocks after shipper initialisation", occurring for each new tenant in the ingester. #1893
* [BUGFIX] Ring: fix bug where instances may appear unhealthy in the hash ring web UI even though they are not. #1933
50 changes: 36 additions & 14 deletions cmd/mimir/config-descriptor.json
@@ -2306,12 +2306,12 @@
},
{
"kind": "field",
"name": "exemplars_update_period",
"name": "tsdb_config_update_period",
"required": false,
"desc": "Period with which to update per-tenant max exemplar limit.",
"desc": "Period with which to update the per-tenant TSDB configuration.",
"fieldValue": null,
"fieldDefaultValue": 15000000000,
"fieldFlag": "ingester.exemplars-update-period",
"fieldFlag": "ingester.tsdb-config-update-period",
"fieldType": "duration",
"fieldCategory": "experimental"
},
@@ -2648,6 +2648,17 @@
"fieldType": "map of tracker name (string) to matcher (string)",
"fieldCategory": "advanced"
},
{
"kind": "field",
"name": "out_of_order_time_window",
"required": false,
"desc": "Non-zero value enables out-of-order support for most recent samples that are within the time window in relation to the following two conditions: (1) The newest sample for that time series, if it exists. For example, within [series.maxTime-timeWindow, series.maxTime]). (2) The TSDB's maximum time, if the series does not exist. For example, within [db.maxTime-timeWindow, db.maxTime]). The ingester will need more memory as a factor of _rate of out-of-order samples being ingested_ and _the number of series that are getting out-of-order samples_. You can configure it per tenant.",
"fieldValue": null,
"fieldDefaultValue": 0,
"fieldFlag": "ingester.out-of-order-time-window",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "max_fetched_chunks_per_query",
@@ -5370,17 +5381,6 @@
"fieldType": "boolean",
"fieldCategory": "advanced"
},
{
"kind": "field",
"name": "allow_overlapping_queries",
"required": false,
"desc": "Enable querying overlapping blocks. If there are going to be overlapping blocks in the ingesters this should be enabled.",
"fieldValue": null,
"fieldDefaultValue": false,
"fieldFlag": "blocks-storage.tsdb.allow-overlapping-queries",
"fieldType": "boolean",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "series_hash_cache_max_size_bytes",
@@ -5402,6 +5402,28 @@
"fieldFlag": "blocks-storage.tsdb.max-tsdb-opening-concurrency-on-startup",
"fieldType": "int",
"fieldCategory": "advanced"
},
{
"kind": "field",
"name": "out_of_order_cap_min",
"required": false,
"desc": "Minimum capacity for out-of-order chunks, in samples between 0 and 255.",
"fieldValue": null,
"fieldDefaultValue": 4,
"fieldFlag": "blocks-storage.tsdb.out-of-order-cap-min",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "out_of_order_cap_max",
"required": false,
"desc": "Maximum capacity for out of order chunks, in samples between 1 and 255.",
"fieldValue": null,
"fieldDefaultValue": 32,
"fieldFlag": "blocks-storage.tsdb.out-of-order-cap-max",
"fieldType": "int",
"fieldCategory": "experimental"
}
],
"fieldValue": null,
12 changes: 8 additions & 4 deletions cmd/mimir/help-all.txt.tmpl
@@ -477,8 +477,6 @@ Usage of ./cmd/mimir/mimir:
OpenStack Swift user ID.
-blocks-storage.swift.username string
OpenStack Swift username.
-blocks-storage.tsdb.allow-overlapping-queries
[experimental] Enable querying overlapping blocks. If there are going to be overlapping blocks in the ingesters this should be enabled.
-blocks-storage.tsdb.block-ranges-period value
TSDB blocks range period. (default 2h0m0s)
-blocks-storage.tsdb.close-idle-tsdb-timeout duration
@@ -507,6 +505,10 @@ Usage of ./cmd/mimir/mimir:
[experimental] True to enable snapshotting of in-memory TSDB data on disk when shutting down.
-blocks-storage.tsdb.new-chunk-disk-mapper
[experimental] Temporary flag to select whether to use the new (used in upstream Prometheus) or the old (legacy) chunk disk mapper.
-blocks-storage.tsdb.out-of-order-cap-max int
[experimental] Maximum capacity for out of order chunks, in samples between 1 and 255. (default 32)
-blocks-storage.tsdb.out-of-order-cap-min int
[experimental] Minimum capacity for out-of-order chunks, in samples between 0 and 255. (default 4)
-blocks-storage.tsdb.retention-period duration
TSDB blocks retention in the ingester before a block is removed, relative to the newest block written for the tenant. This should be larger than the -blocks-storage.tsdb.block-ranges-period, -querier.query-store-after and large enough to give store-gateways and queriers enough time to discover newly uploaded blocks. (default 24h0m0s)
-blocks-storage.tsdb.series-hash-cache-max-size-bytes uint
@@ -839,8 +841,6 @@ Usage of ./cmd/mimir/mimir:
Path to the key file for the client certificate. Also requires the client certificate to be configured.
-ingester.client.tls-server-name string
Override the expected name on the server certificate.
-ingester.exemplars-update-period duration
[experimental] Period with which to update per-tenant max exemplar limit. (default 15s)
-ingester.ignore-series-limit-for-metric-names string
Comma-separated list of metric names, for which the -ingester.max-global-series-per-metric limit will be ignored. Does not affect the -ingester.max-global-series-per-user limit.
-ingester.instance-limits.max-inflight-push-requests int
@@ -863,6 +863,8 @@ Usage of ./cmd/mimir/mimir:
The maximum number of active series per tenant, across the cluster before replication. 0 to disable. (default 150000)
-ingester.metadata-retain-period duration
Period at which metadata we have not seen will remain in memory before being deleted. (default 10m0s)
-ingester.out-of-order-time-window value
[experimental] Non-zero value enables out-of-order support for most recent samples that are within the time window in relation to the following two conditions: (1) The newest sample for that time series, if it exists. For example, within [series.maxTime-timeWindow, series.maxTime]). (2) The TSDB's maximum time, if the series does not exist. For example, within [db.maxTime-timeWindow, db.maxTime]). The ingester will need more memory as a factor of _rate of out-of-order samples being ingested_ and _the number of series that are getting out-of-order samples_. You can configure it per tenant.
-ingester.rate-update-period duration
Period with which to update the per-tenant ingestion rates. (default 15s)
-ingester.ring.consul.acl-token string
@@ -947,6 +949,8 @@ Usage of ./cmd/mimir/mimir:
True to enable the zone-awareness and replicate ingested samples across different availability zones. This option needs be set on ingesters, distributors, queriers and rulers when running in microservices mode.
-ingester.stream-chunks-when-using-blocks
Stream chunks from ingesters to queriers. (default true)
-ingester.tsdb-config-update-period duration
[experimental] Period with which to update the per-tenant TSDB configuration. (default 15s)
-log.format value
Output log messages in the given format. Valid formats: [logfmt, json] (default logfmt)
-log.level value
@@ -741,9 +741,9 @@ ring:
# prod: '{namespace=~"prod-.*"}'
[active_series_custom_trackers: <map of tracker name (string) to matcher (string)> | default = ]

# (experimental) Period with which to update per-tenant max exemplar limit.
# CLI flag: -ingester.exemplars-update-period
[exemplars_update_period: <duration> | default = 15s]
# (experimental) Period with which to update the per-tenant TSDB configuration.
# CLI flag: -ingester.tsdb-config-update-period
[tsdb_config_update_period: <duration> | default = 15s]

instance_limits:
# (advanced) Max ingestion rate (samples/sec) that ingester will accept. This
@@ -2721,6 +2721,18 @@ The `limits` block configures default and per-tenant limits imposed by component
# CLI flag: -ingester.active-series-custom-trackers
[active_series_custom_trackers: <map of tracker name (string) to matcher (string)> | default = ]

# (experimental) Non-zero value enables out-of-order support for most recent
# samples that are within the time window in relation to the following two
# conditions: (1) The newest sample for that time series, if it exists. For
# example, within [series.maxTime-timeWindow, series.maxTime]). (2) The TSDB's
# maximum time, if the series does not exist. For example, within
# [db.maxTime-timeWindow, db.maxTime]). The ingester will need more memory as a
# factor of _rate of out-of-order samples being ingested_ and _the number of
# series that are getting out-of-order samples_. You can configure it per
# tenant.
# CLI flag: -ingester.out-of-order-time-window
[out_of_order_time_window: <duration> | default = 0s]

# Maximum number of chunks that can be fetched in a single query from ingesters
# and long-term storage. This limit is enforced in the querier, ruler and
# store-gateway. 0 to disable.
@@ -3515,11 +3527,6 @@ tsdb:
# CLI flag: -blocks-storage.tsdb.isolation-enabled
[isolation_enabled: <boolean> | default = false]

# (experimental) Enable querying overlapping blocks. If there are going to be
# overlapping blocks in the ingesters this should be enabled.
# CLI flag: -blocks-storage.tsdb.allow-overlapping-queries
[allow_overlapping_queries: <boolean> | default = false]

# (advanced) Max size - in bytes - of the in-memory series hash cache. The
# cache is shared across all tenants and it's used only when query sharding is
# enabled.
@@ -3529,6 +3536,16 @@ tsdb:
# (advanced) limit the number of concurrently opening TSDB's on startup
# CLI flag: -blocks-storage.tsdb.max-tsdb-opening-concurrency-on-startup
[max_tsdb_opening_concurrency_on_startup: <int> | default = 10]

# (experimental) Minimum capacity for out-of-order chunks, in samples between
# 0 and 255.
# CLI flag: -blocks-storage.tsdb.out-of-order-cap-min
[out_of_order_cap_min: <int> | default = 4]

# (experimental) Maximum capacity for out of order chunks, in samples between
# 1 and 255.
# CLI flag: -blocks-storage.tsdb.out-of-order-cap-max
[out_of_order_cap_max: <int> | default = 32]
```
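
The two capacity settings above state their valid ranges only in prose. A minimal Go sketch of a matching range check follows. The `oooCaps` type and its `validate` method are hypothetical names for illustration, not Mimir's actual code, and the final `min <= max` check is an assumption that the text above does not state.

```go
package main

import (
	"errors"
	"fmt"
)

// oooCaps is a hypothetical holder for the two out-of-order chunk
// capacity settings described in the reference above.
type oooCaps struct {
	CapMin int // out_of_order_cap_min, documented range 0..255
	CapMax int // out_of_order_cap_max, documented range 1..255
}

// validate checks the documented ranges. The min <= max rule is an
// assumption made for this sketch.
func (c oooCaps) validate() error {
	if c.CapMin < 0 || c.CapMin > 255 {
		return errors.New("out_of_order_cap_min must be between 0 and 255")
	}
	if c.CapMax < 1 || c.CapMax > 255 {
		return errors.New("out_of_order_cap_max must be between 1 and 255")
	}
	if c.CapMin > c.CapMax {
		return errors.New("out_of_order_cap_min must not exceed out_of_order_cap_max")
	}
	return nil
}

func main() {
	// Documented defaults: min 4, max 32.
	fmt.Println(oooCaps{CapMin: 4, CapMax: 32}.validate())
	// A maximum above the documented upper bound is rejected.
	fmt.Println(oooCaps{CapMin: 4, CapMax: 300}.validate())
}
```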

### compactor
5 changes: 5 additions & 0 deletions docs/sources/operators-guide/mimir-runbooks/_index.md
@@ -1412,6 +1412,11 @@ Common **causes**:

> **Note**: You can learn more about out of order samples in Prometheus, in the blog post [Debugging out of order samples](https://www.robustperception.io/debugging-out-of-order-samples/).
### err-mimir-sample-too-old

This error is similar to `err-mimir-sample-out-of-order`. The main difference is that out-of-order support is enabled, but the sample is
older than the out-of-order time window, measured from the latest sample for that time series or, if the series does not exist yet, from the TSDB's maximum time.

### err-mimir-sample-duplicate-timestamp

This error occurs when the ingester rejects a sample because it is a duplicate of a previously received sample with the same timestamp but different value in the same time series.
