
2.12.0-rc.0

Pre-release
@duricanikolic duricanikolic released this 12 Mar 19:10
· 505 commits to main since this release
mimir-2.12.0-rc.0
d0ac52d

This release contains 525 PRs from 60 authors, including new contributors Benoit Schipper, Derek Cadzow, Edwin, Itay Kalfon, Ivan Farré Vicente, Jan O. Rundshagen, Jorge Turrado Ferrero, Lukas Monkevicius, Mickaël Canévet, Rafael Sathler, Rajakavitha Kodhandapani, Tim Kotowski, Vladimir Varankin, Zach, Zach Day, Zirko, blut, github-actions[bot], ncharaf, zhehao-grafana. Thank you!

Grafana Mimir version 2.12.0-rc.0 release notes

Grafana Labs is excited to announce version 2.12 of Grafana Mimir.

The highlights that follow include the top features, enhancements, and bug fixes in this release.
For the complete list of changes, refer to the CHANGELOG.

Features and enhancements

  • Added support to only count series that are considered active through the Cardinality API endpoint /api/v1/cardinality/label_names by passing the count_method parameter.
    If set to active it counts only series that are considered active according to the -ingester.active-series-metrics-idle-timeout flag setting rather than counting all in-memory series.

  • The "Store-gateway: bucket tenant blocks" admin page contains a new column "No Compact".
    If a block's no-compaction marker is set, the column shows the reason and the date the marker was added.

  • The estimated number of compaction jobs based on the current bucket-index is now computed by the compactor.
    The result is tracked by the new cortex_bucket_index_estimated_compaction_jobs metric.
    If this computation fails, the cortex_bucket_index_estimated_compaction_jobs_errors_total metric is updated instead.
    The estimated number of compaction jobs is also shown in Top tenants, Tenants, and Compactor dashboards.

  • Added mimir-distroless container image built upon a distroless image (gcr.io/distroless/static-debian12).
    This improvement minimizes attack surfaces and potential CVEs by trimming down the dependencies within the image.
    After comprehensive testing, the Mimir maintainers plan to shift from the current image to the distroless version.
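
As a minimal sketch of the count_method parameter described in the first item of this list, the following Python snippet builds a request URL for the label names cardinality endpoint. The base URL, the /prometheus path prefix, and the helper name are assumptions made for illustration and may differ in your deployment.

```python
from urllib.parse import urlencode, urljoin

# Assumed for illustration: a Mimir endpoint reachable at this address with
# the API served under the /prometheus prefix. Neither comes from the notes.
BASE_URL = "http://mimir.example.com/prometheus/"


def label_names_cardinality_url(count_method="active", base_url=BASE_URL):
    """Build the URL for /api/v1/cardinality/label_names with count_method."""
    query = urlencode({"count_method": count_method})
    return urljoin(base_url, "api/v1/cardinality/label_names") + "?" + query


print(label_names_cardinality_url())
```

With count_method=active, only series considered active according to -ingester.active-series-metrics-idle-timeout are counted.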

Additionally, the following previously experimental features are now considered stable:

  • The number of pre-allocated workers used to forward push requests to the ingesters, configurable via the -distributor.reusable-ingester-push-workers CLI flag on distributors.
    It now defaults to 2000.
    Note that this is a performance optimization, and not a limiting feature.
    If not enough workers are available, new goroutines will be spawned.

  • The number of gRPC server workers used to serve the requests, configurable via the -server.grpc.num-workers CLI flag.
    It now defaults to 100.
    Note that this is the number of pre-allocated long-lived workers, and not a limiting feature.
    If not enough workers are available, new goroutines will be spawned.

  • The maximum number of concurrent index header loads across all tenants, configurable via the -blocks-storage.bucket-store.index-header.lazy-loading-concurrency CLI flag on store-gateways.
    It defaults to 4.

  • The maximum time to wait for the query-frontend to become ready before rejecting requests, configurable via the -query-frontend.not-running-timeout CLI flag on query-frontends.
    It now defaults to 2s.

  • Spread-minimizing token-related CLI flags: -ingester.ring.token-generation-strategy, -ingester.ring.spread-minimizing-zones and -ingester.ring.spread-minimizing-join-ring-in-order.
    You can read more about this feature in our blog post.
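
The "pre-allocated workers, not a limit" behavior described for -distributor.reusable-ingester-push-workers and -server.grpc.num-workers can be illustrated with a small analogy. The sketch below uses Python threads rather than goroutines and is not Mimir's implementation; the class and attribute names are hypothetical. A task goes to an idle pre-allocated worker when one exists, otherwise a new thread is spawned, so work is never rejected or queued behind busy workers.

```python
import queue
import threading


class ReusableWorkerPool:
    """Pre-allocated workers with a spawn-on-demand fallback (illustrative)."""

    def __init__(self, num_workers):
        self._tasks = queue.Queue()
        # Counts workers that have announced they are ready for a task.
        self._idle = threading.Semaphore(0)
        self.pooled = 0   # tasks handed to a pre-allocated worker
        self.spawned = 0  # tasks run on a freshly spawned thread
        for _ in range(num_workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            self._idle.release()      # announce availability
            task = self._tasks.get()  # block until a task is handed over
            task()

    def submit(self, task):
        """Run task on an idle worker, or on a new thread if none is idle.

        The counters are not synchronized, so this sketch assumes that
        submit() is called from a single thread.
        """
        if self._idle.acquire(blocking=False):
            self.pooled += 1
            self._tasks.put(task)
            return None
        self.spawned += 1
        thread = threading.Thread(target=task)
        thread.start()
        return thread
```

The pool size therefore only affects how often a new thread has to be created, which mirrors the note that these settings are a performance optimization and not a limiting feature.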

Important changes

In Grafana Mimir 2.12 the following behavior has changed:

  • Store-gateway now persists a sparse version of the index-header to disk on construction and loads sparse index-headers from disk instead of the whole index-header.
    This improves the speed at which index headers are lazy-loaded from disk by up to 90%. The added disk usage is in the order of 1-2%.

  • Alertmanager deprecated the v1 API. All v1 API endpoints now respond with a JSON deprecation notice and a status code of 410.
    All endpoints have a v2 equivalent.
    The list of endpoints is:

    • <alertmanager-web.external-url>/api/v1/alerts
    • <alertmanager-web.external-url>/api/v1/receivers
    • <alertmanager-web.external-url>/api/v1/silence/{id}
    • <alertmanager-web.external-url>/api/v1/silences
    • <alertmanager-web.external-url>/api/v1/status
  • Exemplar's label traceID has been changed to trace_id to be consistent with the OpenTelemetry standard.

  • Errors returned by ingesters now contain only gRPC status codes.
    Previously they contained both gRPC and HTTP status codes.
    To guarantee backwards compatibility when migrating from a version prior to 2.11, it's necessary to first migrate to version 2.11, and then to version 2.12.
    Otherwise, it might happen that during the migration, some ingester errors with HTTP status code 4xx won't be recognized, and the corresponding request will be repeated.

  • Responses with gRPC status codes are now reported as status_code labels in the cortex_request_duration_seconds and cortex_ingester_client_request_duration_seconds metrics.

  • Responses with HTTP 4xx status codes are now treated as errors and used in the status_code label of the request duration metric.
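
For clients that still call the deprecated Alertmanager v1 endpoints listed above, a minimal sketch of the path rewrite might look as follows. The helper name is hypothetical, paths are taken relative to <alertmanager-web.external-url>, and only the URL is rewritten: request and response payloads can differ between v1 and v2, so callers still need to be checked against the v2 API.

```python
# The deprecated v1 endpoints listed in these notes, relative to /api/v1/.
# A trailing slash marks an endpoint that takes an ID, such as silence/{id}.
DEPRECATED_V1 = ("alerts", "receivers", "silence/", "silences", "status")


def to_v2_path(path):
    """Rewrite a deprecated Alertmanager /api/v1/ path to /api/v2/.

    `path` is relative to the Alertmanager external URL. Paths that are
    not one of the deprecated v1 endpoints are returned unchanged.
    """
    prefix = "/api/v1/"
    if not path.startswith(prefix):
        return path
    tail = path[len(prefix):]
    deprecated = any(
        tail == name or (name.endswith("/") and tail.startswith(name))
        for name in DEPRECATED_V1
    )
    return "/api/v2/" + tail if deprecated else path


print(to_v2_path("/api/v1/silences"))
```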

The default values of the following CLI flags have been changed:

  • -blocks-storage.tsdb.head-postings-for-matchers-cache-max-bytes from 10MB to 100MB.
  • -blocks-storage.tsdb.block-postings-for-matchers-cache-max-bytes from 10MB to 100MB.
  • -blocks-storage.bucket-store.tenant-sync-concurrency from 10 to 1.
  • -query-frontend.max-cache-freshness from 1m to 10m.
  • -distributor.write-requests-buffer-pooling-enabled from false to true.
  • -blocks-storage.bucket-store.block-sync-concurrency from 20 to 4.
  • -memberlist.stream-timeout from 10s to 2s.
  • -server.report-grpc-codes-in-instrumentation-label-enabled from false to true.

The following deprecated configuration options are removed in Grafana Mimir 2.12:

  • The YAML setting frontend.cache_unaligned_requests.

The following configuration options are deprecated and will be removed in Grafana Mimir 2.14:

  • The CLI flag -ingester.limit-inflight-requests-using-grpc-method-limiter.
    It now defaults to true.

  • The CLI flag -ingester.return-only-grpc-errors.
    It now defaults to true.
    To guarantee backwards compatibility when migrating from a version prior to 2.11, it's necessary to first migrate to version 2.11, and then to version 2.12.
    Otherwise, it might happen that during the migration, some ingester errors with HTTP status code 4xx won't be recognized, and the corresponding request will be repeated.

  • The CLI flag -ingester.client.report-grpc-codes-in-instrumentation-label-enabled.
    It now defaults to true.

  • The CLI flag -distributor.limit-inflight-requests-using-grpc-method-limiter.
    It now defaults to true.

  • The CLI flag -distributor.enable-otlp-metadata-storage.
    It now defaults to true.

  • The CLI flag -querier.max-query-into-future.
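
The upgrade-order requirement attached to -ingester.return-only-grpc-errors can be expressed as a small check. This is a sketch under the assumption that versions are plain "major.minor[.patch]" strings; the function names are hypothetical.

```python
INTERMEDIATE = (2, 11)
TARGET = (2, 12)


def parse_minor(version):
    """Turn '2.10.4' into (2, 10); the patch component is ignored."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def upgrade_path(current_version):
    """Return the releases to roll out, in order, to reach 2.12 safely."""
    current = parse_minor(current_version)
    if current >= TARGET:
        return []
    steps = []
    if current < INTERMEDIATE:
        # Versions prior to 2.11 must go through 2.11 first, otherwise some
        # ingester errors with HTTP 4xx codes may not be recognized.
        steps.append("2.11")
    steps.append("2.12")
    return steps


print(upgrade_path("2.10.4"))
```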

The following metrics are removed or deprecated:

  • cortex_bucket_store_blocks_loaded_by_duration has been removed.
  • cortex_distributor_sample_delay_seconds has been deprecated and will be removed in Mimir 2.14.

Experimental features

Grafana Mimir 2.12 includes new features that are considered experimental and disabled by default.
Use them with caution and report any issues you encounter:

  • The maximum number of tenant IDs allowed in a federated query can be configured via the -tenant-federation.max-tenants CLI flag on query-frontends.
    By default, it's 0, meaning that the limit is disabled.

  • Sharding of active series queries can be enabled via the -query-frontend.shard-active-series-queries CLI flag on query-frontends.

  • Timely head compaction can be enabled via the -blocks-storage.tsdb.timely-head-compaction-enabled CLI flag on ingesters.
    If enabled, the head compaction happens when the min block range can no longer be appended, without requiring 1.5x the chunk range worth of data in the head.

  • Streaming of responses from querier to query-frontend can be enabled via the -querier.response-streaming-enabled CLI flag on queriers.
    This is currently supported only for responses from the /api/v1/cardinality/active_series endpoint.

  • The maximum response size for active series queries, in bytes, can be set via the -querier.active-series-results-max-size-bytes CLI flag on queriers.

  • Metric relabeling on a per-tenant basis can be forcefully disabled via the -distributor.metric-relabeling-enabled CLI flag on rulers.
    Metric relabeling is enabled by default.

  • Query Queue Load Balancing by Query Component. Tenant query queues in the query-scheduler can now be split into subqueues by which query component is expected to be utilized to complete the query: ingesters, store-gateways, both, or uncategorized.
    Dequeuing queries for a given tenant will rotate through the query component subqueues via simple round-robin.
    If one of the query components (ingesters or store-gateways) experiences a slowdown, queries that use only the other query component can continue to be serviced.
    Enabling this feature is recommended.
    For the feature to take effect, set the following CLI flags to true:

    • -query-frontend.additional-query-queue-dimensions-enabled on the query-frontend.
    • -query-scheduler.additional-query-queue-dimensions-enabled on the query-scheduler.
  • Owned series tracking in ingesters can be enabled via the -ingester.track-ingester-owned-series CLI flag.
    When enabled, ingesters will track the number of in-memory series that still map to the ingester based on the ring state.
    These counts are more reactive to ring and shard changes than in-memory series, and can be used when enforcing tenant series limits by enabling the -ingester.use-ingester-owned-series-for-limits CLI flag.
    This feature requires zone-aware replication to be enabled, and the replication factor to be equal to the number of zones.
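
The round-robin dequeuing described for query queue load balancing by query component can be sketched as follows. This is an illustration of the idea only, not the query-scheduler's code; the class, the subqueue names, and the string queries are hypothetical.

```python
from collections import deque

# One subqueue per expected query component, as described above.
COMPONENTS = ("ingester", "store-gateway", "ingester-and-store-gateway", "uncategorized")


class TenantQueue:
    """A tenant's query queue split into per-component subqueues."""

    def __init__(self):
        self._subqueues = {name: deque() for name in COMPONENTS}
        self._next = 0  # index of the subqueue to try first on the next dequeue

    def enqueue(self, component, query):
        self._subqueues[component].append(query)

    def dequeue(self):
        """Return the next query, rotating through non-empty subqueues."""
        for offset in range(len(COMPONENTS)):
            index = (self._next + offset) % len(COMPONENTS)
            subqueue = self._subqueues[COMPONENTS[index]]
            if subqueue:
                self._next = (index + 1) % len(COMPONENTS)
                return subqueue.popleft()
        return None  # every subqueue is empty


q = TenantQueue()
for name in ("i1", "i2", "i3"):
    q.enqueue("ingester", name)
q.enqueue("store-gateway", "s1")
print([q.dequeue() for _ in range(4)])  # ['i1', 's1', 'i2', 'i3']
```

In this example the store-gateway query is served second even though three ingester queries were enqueued before it, which is how a backlog on one component avoids starving queries that only need the other.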

Bug fixes

  • Distributor: fixed an issue where -distributor.metric-relabeling-enabled could cause distributors to panic.
  • Distributor: fixed an issue where -distributor.metric-relabeling-enabled could cause distributors to write unsorted labels and corrupt blocks.
  • Ingester: errors encountered while iterating through chunks or samples in response to a query request aren't ignored anymore.
  • Compactor: out-of-order blocks aren't allowed to prevent timely compaction anymore.
  • Querier: requests to store-gateway when a query gets canceled aren't retried anymore.
  • Querier: status code 499 is now returned instead of 500 when a request to remote read endpoint gets canceled.
  • Querier: fixed an issue where -querier.max-fetched-series-per-query wasn't applied to the /series endpoint when the series were loaded from ingesters.
  • Querier: fixed an issue with the remote-read requests HTTP status code translations.
    Previously, remote-read had conflicting behaviours: when returning samples all internal errors were translated to HTTP 400, while when returning chunks all internal errors were translated to HTTP 500.
    With this fix, all validation errors will be translated into HTTP 400 errors, while all other errors will be translated into HTTP 500 errors.
  • Query-frontend: the cortex_query_frontend_queries_total metric incorrectly reported op="query" for any request which wasn't a range query.
    Now the op label value can be one of the following:
    • query: instant query
    • query_range: range query
    • cardinality: cardinality query
    • label_names_and_values: label names / values query
    • active_series: active series query
    • other: any other request
  • Ruler: fixed an issue where "failed to remotely evaluate query expression, will retry" messages were logged without context such as the trace ID and didn't appear in trace events.
  • Ruler: requests to remote querier when server's response exceeds its configured max payload size aren't retried anymore.
  • Ruler: fixed a regression that caused client errors to be tracked in cortex_ruler_write_requests_failed_total metric.
  • Ruler: fixed an issue where recording rule results were corrupted due to the use of a bad native histogram pointer.
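
The remote-read status code behavior after these querier fixes can be summarized in a few lines. The exception classes and the function below are hypothetical stand-ins used only to show the mapping stated above: canceled requests yield 499, validation errors yield 400, and all other errors yield 500.

```python
class ValidationError(Exception):
    """Stand-in for an error caused by an invalid request."""


class RequestCanceled(Exception):
    """Stand-in for a request canceled by the client."""


def remote_read_status(err):
    """Map a remote-read error to the HTTP status described in the fixes."""
    if isinstance(err, RequestCanceled):
        return 499
    if isinstance(err, ValidationError):
        return 400
    return 500


print(remote_read_status(ValidationError("bad matcher")))  # 400
```

The same mapping applies whether the response returns samples or chunks, which removes the earlier inconsistency between the two modes.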

Helm chart improvements

The Grafana Mimir and Grafana Enterprise Metrics Helm charts are released independently.
Refer to the Grafana Mimir Helm chart documentation.

Changelog

2.12.0-rc.0

Grafana Mimir

  • [CHANGE] Alertmanager: Deprecates the v1 API. All v1 API endpoints now respond with a JSON deprecation notice and a status code of 410. All endpoints have a v2 equivalent. The list of endpoints is: #7103
    • <alertmanager-web.external-url>/api/v1/alerts
    • <alertmanager-web.external-url>/api/v1/receivers
    • <alertmanager-web.external-url>/api/v1/silence/{id}
    • <alertmanager-web.external-url>/api/v1/silences
    • <alertmanager-web.external-url>/api/v1/status
  • [CHANGE] Ingester: Increase default value of -blocks-storage.tsdb.head-postings-for-matchers-cache-max-bytes and -blocks-storage.tsdb.block-postings-for-matchers-cache-max-bytes to 100 MiB (previous default value was 10 MiB). #6764
  • [CHANGE] Validate tenant IDs according to documented behavior even when tenant federation is not enabled. Note that this will cause some previously accepted tenant IDs to be rejected such as those longer than 150 bytes or containing | characters. #6959
  • [CHANGE] Ruler: don't use backoff retry on remote evaluation in case of 4xx errors. #7004
  • [CHANGE] Server: responses with HTTP 4xx status codes are now treated as errors and used in status_code label of request duration metric. #7045
  • [CHANGE] Memberlist: change default for -memberlist.stream-timeout from 10s to 2s. #7076
  • [CHANGE] Memcached: remove legacy thanos_cache_memcached_* and thanos_memcached_* prefixed metrics. Instead, Memcached and Redis cache clients now emit thanos_cache_* prefixed metrics with a backend label. #7076
  • [CHANGE] Ruler: the following metrics, exposed when the ruler is configured to discover Alertmanager instances via service discovery, have been renamed: #7057
    • prometheus_sd_failed_configs renamed to cortex_prometheus_sd_failed_configs
    • prometheus_sd_discovered_targets renamed to cortex_prometheus_sd_discovered_targets
    • prometheus_sd_received_updates_total renamed to cortex_prometheus_sd_received_updates_total
    • prometheus_sd_updates_delayed_total renamed to cortex_prometheus_sd_updates_delayed_total
    • prometheus_sd_updates_total renamed to cortex_prometheus_sd_updates_total
    • prometheus_sd_refresh_failures_total renamed to cortex_prometheus_sd_refresh_failures_total
    • prometheus_sd_refresh_duration_seconds renamed to cortex_prometheus_sd_refresh_duration_seconds
  • [CHANGE] Query-frontend: the default value for -query-frontend.not-running-timeout has been changed from 0 (disabled) to 2s. The configuration option has also been moved from "experimental" to "advanced". #7126
  • [CHANGE] Store-gateway: to reduce disk contention on HDDs the default value for -blocks-storage.bucket-store.tenant-sync-concurrency has been changed from 10 to 1 and the default value for -blocks-storage.bucket-store.block-sync-concurrency has been changed from 20 to 4. #7136
  • [CHANGE] Store-gateway: Remove deprecated CLI flags -blocks-storage.bucket-store.index-header-lazy-loading-enabled and -blocks-storage.bucket-store.index-header-lazy-loading-idle-timeout and their corresponding YAML settings. Instead, use -blocks-storage.bucket-store.index-header.lazy-loading-enabled and -blocks-storage.bucket-store.index-header.lazy-loading-idle-timeout. #7521
  • [CHANGE] Store-gateway: Mark experimental CLI flag -blocks-storage.bucket-store.index-header.lazy-loading-concurrency and its corresponding YAML settings as advanced. #7521
  • [CHANGE] Store-gateway: Remove experimental CLI flag -blocks-storage.bucket-store.index-header.sparse-persistence-enabled since this is now the default behavior. #7535
  • [CHANGE] All: set -server.report-grpc-codes-in-instrumentation-label-enabled to true by default, which enables reporting gRPC status codes as status_code labels in the cortex_request_duration_seconds metric. #7144
  • [CHANGE] Distributor: report gRPC status codes as status_code labels in the cortex_ingester_client_request_duration_seconds metric by default. #7144
  • [CHANGE] Distributor: CLI flag -ingester.client.report-grpc-codes-in-instrumentation-label-enabled has been deprecated, and its default value is set to true. #7144
  • [CHANGE] Ingester: CLI flag -ingester.return-only-grpc-errors has been deprecated, and its default value is set to true. To ensure backwards compatibility, during a migration from a version prior to 2.11.0 to 2.12 or later, -ingester.return-only-grpc-errors should be set to false. Once all the components are migrated, the flag can be removed. #7151
  • [CHANGE] Ingester: the following CLI flags have been moved from "experimental" to "advanced": #7169
    • -ingester.ring.token-generation-strategy
    • -ingester.ring.spread-minimizing-zones
    • -ingester.ring.spread-minimizing-join-ring-in-order
  • [CHANGE] Query-frontend: the default value of the CLI flag -query-frontend.max-cache-freshness (and its respective YAML configuration parameter) has been changed from 1m to 10m. #7161
  • [CHANGE] Distributor: default the optimization -distributor.write-requests-buffer-pooling-enabled to true. #7165
  • [CHANGE] Tracing: Move query information to span attributes instead of span logs. #7046
  • [CHANGE] Distributor: the default value of circuit breaker's CLI flag -ingester.client.circuit-breaker.cooldown-period has been changed from 1m to 10s. #7310
  • [CHANGE] Store-gateway: remove cortex_bucket_store_blocks_loaded_by_duration. cortex_bucket_store_series_blocks_queried is better suited for detecting when compactors are not able to keep up with the number of blocks to compact. #7309
  • [CHANGE] Ingester, Distributor: the support for rejecting push requests received via gRPC before reading them into memory, enabled via -ingester.limit-inflight-requests-using-grpc-method-limiter and -distributor.limit-inflight-requests-using-grpc-method-limiter, is now stable and enabled by default. The configuration options have been deprecated and will be removed in Mimir 2.14. #7360
  • [CHANGE] Distributor: Change -distributor.enable-otlp-metadata-storage flag's default to true, and deprecate it. The flag will be removed in Mimir 2.14. #7366
  • [CHANGE] Store-gateway: Use a shorter TTL for cached items related to temporary blocks. #7407 #7534
  • [CHANGE] Standardise exemplar label as "trace_id". #7475
  • [CHANGE] The configuration option -querier.max-query-into-future has been deprecated and will be removed in Mimir 2.14. #7496
  • [CHANGE] Distributor: the metric cortex_distributor_sample_delay_seconds has been deprecated and will be removed in Mimir 2.14. #7516
  • [CHANGE] Query-frontend: The deprecated YAML setting frontend.cache_unaligned_requests has been moved to limits.cache_unaligned_requests. #7519
  • [FEATURE] Introduce -server.log-source-ips-full option to log all IPs from Forwarded, X-Real-IP, X-Forwarded-For headers. #7250
  • [FEATURE] Introduce -tenant-federation.max-tenants option to limit the max number of tenants allowed for requests when federation is enabled. #6959
  • [FEATURE] Cardinality API: added a new count_method parameter which enables counting active label values. #7085
  • [FEATURE] Querier / query-frontend: added -querier.promql-experimental-functions-enabled CLI flag (and respective YAML config option) to enable experimental PromQL functions. The experimental functions introduced are: mad_over_time(), sort_by_label() and sort_by_label_desc(). #7057
  • [FEATURE] Alertmanager API: added -alertmanager.grafana-alertmanager-compatibility-enabled CLI flag (and respective YAML config option) to enable experimental API endpoints that support the migration of the Grafana Alertmanager. #7057
  • [FEATURE] Alertmanager: Added -alertmanager.utf8-strict-mode-enabled to control support for any UTF-8 character as part of Alertmanager configuration/API matchers and labels. Its default value is set to false. #6898
  • [FEATURE] Querier: added histogram_avg() function support to PromQL. #7293
  • [FEATURE] Ingester: added -blocks-storage.tsdb.timely-head-compaction flag, which enables more timely head compaction, and defaults to false. #7372
  • [FEATURE] Compactor: Added /compactor/tenants and /compactor/tenant/{tenant}/planned_jobs endpoints that provide functionality that was provided by tools/compaction-planner -- listing of planned compaction jobs based on tenants' bucket index. #7381
  • [FEATURE] Add experimental support for streaming response bodies from queriers to frontends via -querier.response-streaming-enabled. This is currently only supported for the /api/v1/cardinality/active_series endpoint. #7173
  • [FEATURE] Release: Added mimir distroless docker image. #7371
  • [FEATURE] Add support for the new grammar of {"metric_name", "l1"="val"} to promql and some of the exposition formats. #7475 #7541
  • [ENHANCEMENT] Distributor: Add a new metric cortex_distributor_otlp_requests_total to track the total number of OTLP requests. #7385
  • [ENHANCEMENT] Vault: add lifecycle manager for token used to authenticate to Vault. This ensures the client token is always valid. Includes a gauge (cortex_vault_token_lease_renewal_active) to check whether token renewal is active, and the counters cortex_vault_token_lease_renewal_success_total and cortex_vault_auth_success_total to see the total number of successful lease renewals / authentications. #7337
  • [ENHANCEMENT] Store-gateway: add no-compact details column on store-gateway tenants admin UI. #6848
  • [ENHANCEMENT] PromQL: ignore small errors for bucketQuantile #6766
  • [ENHANCEMENT] Distributor: improve efficiency of some errors #6785
  • [ENHANCEMENT] Ruler: exclude vector queries from being tracked in cortex_ruler_queries_zero_fetched_series_total. #6544
  • [ENHANCEMENT] Ruler: local storage backend now supports reading a rule group via /config/api/v1/rules/{namespace}/{groupName} configuration API endpoint. #6632
  • [ENHANCEMENT] Query-Frontend and Query-Scheduler: split tenant query request queues by query component with -query-frontend.additional-query-queue-dimensions-enabled and -query-scheduler.additional-query-queue-dimensions-enabled. #6772
  • [ENHANCEMENT] Distributor: support disabling metric relabel rules per-tenant via the flag -distributor.metric-relabeling-enabled or associated YAML. #6970
  • [ENHANCEMENT] Distributor: -distributor.remote-timeout is now accounted from the first ingester push request being sent. #6972
  • [ENHANCEMENT] Storage Provider: -<prefix>.s3.sts-endpoint sets a custom endpoint for AWS Security Token Service (AWS STS) in s3 storage provider. #6172
  • [ENHANCEMENT] Querier: add cortex_querier_queries_storage_type_total metric that indicates how many queries have executed for a source, ingesters or store-gateways. Add cortex_querier_query_storegateway_chunks_total metric to count the number of chunks fetched from a store-gateway. #7099 #7145
  • [ENHANCEMENT] Query-frontend: add experimental support for sharding active series queries via -query-frontend.shard-active-series-queries. #6784
  • [ENHANCEMENT] Distributor: set -distributor.reusable-ingester-push-workers=2000 by default and mark feature as advanced. #7128
  • [ENHANCEMENT] All: set -server.grpc.num-workers=100 by default and mark feature as advanced. #7131
  • [ENHANCEMENT] Distributor: invalid metric name error message gets cleaned up to not include non-ascii strings. #7146
  • [ENHANCEMENT] Store-gateway: add source, level, and out_of_order labels to the cortex_bucket_store_series_blocks_queried metric, which indicates the number of blocks that were queried from store-gateways by block metadata. #7112 #7262 #7267
  • [ENHANCEMENT] Compactor: After updating bucket-index, compactor now also computes estimated number of compaction jobs based on current bucket-index, and reports the result in cortex_bucket_index_estimated_compaction_jobs metric. If computation of jobs fails, cortex_bucket_index_estimated_compaction_jobs_errors_total is updated instead. #7299
  • [ENHANCEMENT] Mimir: Integrate profiling into tracing instrumentation. #7363
  • [ENHANCEMENT] Alertmanager: Adds metric cortex_alertmanager_notifications_suppressed_total that counts the total number of notifications suppressed for being silenced, inhibited, outside of active time intervals or within muted time intervals. #7384
  • [ENHANCEMENT] Query-scheduler: added more buckets to cortex_query_scheduler_queue_duration_seconds histogram metric, in order to better track queries staying in the queue for longer than 10s. #7470
  • [ENHANCEMENT] A type label is added to prometheus_tsdb_head_out_of_order_samples_appended_total metric. #7475
  • [ENHANCEMENT] Distributor: Optimize OTLP endpoint. #7475
  • [ENHANCEMENT] API: Use github.com/klauspost/compress for faster gzip and deflate compression of API responses. #7475
  • [ENHANCEMENT] Ingester: Limiting on owned series (-ingester.use-ingester-owned-series-for-limits) now prevents discards in cases where a tenant is sharded across all ingesters (or shuffle sharding is disabled) and the ingester count increases. #7411
  • [ENHANCEMENT] Block upload: include converted timestamps in the error message if block is from the future. #7538
  • [ENHANCEMENT] Query-frontend: Introduce -query-frontend.active-series-write-timeout to allow configuring the server-side write timeout for active series requests. #7553 #7569
  • [BUGFIX] Ingester: don't ignore errors encountered while iterating through chunks or samples in response to a query request. #6451
  • [BUGFIX] Fix issue where queries can fail or omit OOO samples if OOO head compaction occurs between creating a querier and reading chunks #6766
  • [BUGFIX] Fix issue where concatenatingChunkIterator can obscure errors #6766
  • [BUGFIX] Fix panic during tsdb Commit #6766
  • [BUGFIX] tsdb/head: wlog exemplars after samples #6766
  • [BUGFIX] Ruler: fix issue where "failed to remotely evaluate query expression, will retry" messages are logged without context such as the trace ID and do not appear in trace events. #6789
  • [BUGFIX] Ruler: do not retry requests to remote querier when server's response exceeds its configured max payload size. #7216
  • [BUGFIX] Querier: fix issue where spans in query request traces were not nested correctly. #6893
  • [BUGFIX] Fix issue where all incoming HTTP requests have duplicate trace spans. #6920
  • [BUGFIX] Querier: do not retry requests to store-gateway when a query gets canceled. #6934
  • [BUGFIX] Querier: return 499 status code instead of 500 when a request to remote read endpoint gets canceled. #6934
  • [BUGFIX] Querier: fix issue where -querier.max-fetched-series-per-query is not applied to /series endpoint if the series are loaded from ingesters. #7055
  • [BUGFIX] Distributor: fix issue where -distributor.metric-relabeling-enabled may cause distributors to panic #7176
  • [BUGFIX] Distributor: fix issue where -distributor.metric-relabeling-enabled may cause distributors to write unsorted labels and corrupt blocks #7326
  • [BUGFIX] Query-frontend: the cortex_query_frontend_queries_total metric incorrectly reported op="query" for any request which wasn't a range query. Now the op label value can be one of the following: #7207
    • query: instant query
    • query_range: range query
    • cardinality: cardinality query
    • label_names_and_values: label names / values query
    • active_series: active series query
    • other: any other request
  • [BUGFIX] Fix performance regression introduced in Mimir 2.11.0 when uploading blocks to AWS S3. #7240
  • [BUGFIX] Query-frontend: fix race condition when sharding active series is enabled (see above) and response is compressed with snappy. #7290
  • [BUGFIX] Query-frontend: "query stats" log unsuccessful replies from downstream as "failed". #7296
  • [BUGFIX] Packaging: remove reload from systemd file as mimir does not take into account SIGHUP. #7345
  • [BUGFIX] Compactor: do not allow out-of-order blocks to prevent timely compaction. #7342
  • [BUGFIX] Update google.golang.org/grpc to resolve occasional issues with gRPC server closing its side of connection before it was drained by the client. #7380
  • [BUGFIX] Query-frontend: abort response streaming for active_series requests when the request context is canceled. #7378
  • [BUGFIX] Compactor: improve compaction of sporadic blocks. #7329
  • [BUGFIX] Ruler: fix regression that caused client errors to be tracked in cortex_ruler_write_requests_failed_total metric. #7472
  • [BUGFIX] PromQL: fix range selectors with an @ modifier being wrongly scoped in range queries. #7475
  • [BUGFIX] Fix metadata API using wrong JSON field names. #7475
  • [BUGFIX] Ruler: fix native histogram recording rule result corruption. #7552
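
The ruler service-discovery metric renames in the [CHANGE] entry for #7057 can break dashboards and alerts that still use the old names. As a minimal sketch, the helper below adds the cortex_ prefix to those metric names inside a PromQL expression. The function name is hypothetical, and handling of the _sum, _count, and _bucket suffixes is an assumption about how the metrics are exposed.

```python
import re

# Old metric names from the rename list; each gains a "cortex_" prefix.
RENAMED = (
    "prometheus_sd_failed_configs",
    "prometheus_sd_discovered_targets",
    "prometheus_sd_received_updates_total",
    "prometheus_sd_updates_delayed_total",
    "prometheus_sd_updates_total",
    "prometheus_sd_refresh_failures_total",
    "prometheus_sd_refresh_duration_seconds",
)

# Match whole metric names only, so already-prefixed names and longer
# identifiers are left alone. The optional suffix group is an assumption.
_PATTERN = re.compile(
    r"(?<![A-Za-z0-9_:])(?:" + "|".join(RENAMED) + r")"
    r"(?:_sum|_count|_bucket)?(?![A-Za-z0-9_:])"
)


def rename_sd_metrics(expr):
    """Prefix the renamed service-discovery metrics in a PromQL expression."""
    return _PATTERN.sub(lambda match: "cortex_" + match.group(0), expr)


print(rename_sd_metrics("rate(prometheus_sd_updates_total[5m])"))
```

Because already-prefixed names don't match, running the helper twice on the same expression leaves it unchanged.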

Mixin

  • [CHANGE] The job label matchers for distributor and gateway have been extended to include any deployment matching distributor.* and cortex-gw.* respectively. This change also allows matching custom and multi-zone distributor and gateway deployments. #6817
  • [ENHANCEMENT] Dashboards: Add panels for alertmanager activity of a tenant #6826
  • [ENHANCEMENT] Dashboards: Add graphs to "Slow Queries" dashboard. #6880
  • [ENHANCEMENT] Dashboards: Update all deprecated "graph" panels to "timeseries" panels. #6864 #7413 #7457
  • [ENHANCEMENT] Dashboards: Make most columns in "Slow Queries" sortable. #7000
  • [ENHANCEMENT] Dashboards: Render graph panels at full resolution as opposed to at half resolution. #7027
  • [ENHANCEMENT] Dashboards: show query-scheduler queue length on "Reads" and "Remote Ruler Reads" dashboards. #7088
  • [ENHANCEMENT] Dashboards: Add estimated number of compaction jobs to "Compactor", "Tenants" and "Top tenants" dashboards. #7449 #7481
  • [ENHANCEMENT] Recording rules: add native histogram recording rules to cortex_request_duration_seconds. #7528
  • [ENHANCEMENT] Dashboards: Add total owned series, and per-ingester in-memory and owned series to "Tenants" dashboard. #7511
  • [BUGFIX] Dashboards: drop step parameter from targets as it is not supported. #7157
  • [BUGFIX] Recording rules: drop rules for metrics removed in 2.0: cortex_memcache_request_duration_seconds and cortex_cache_request_duration_seconds. #7514

Jsonnet

  • [CHANGE] Distributor: Increase JAEGER_REPORTER_MAX_QUEUE_SIZE from the default (100) to 1000, to avoid dropping tracing spans. #7259
  • [CHANGE] Querier: Increase JAEGER_REPORTER_MAX_QUEUE_SIZE from 1000 to 5000, to avoid dropping tracing spans. #6764
  • [CHANGE] rollout-operator: remove default CPU limit. #7066
  • [CHANGE] Store-gateway: Increase JAEGER_REPORTER_MAX_QUEUE_SIZE from the default (100) to 1000, to avoid dropping tracing spans. #7068
  • [CHANGE] Query-frontend, ingester, ruler, backend and write instances: Increase JAEGER_REPORTER_MAX_QUEUE_SIZE from the default (100), to avoid dropping tracing spans. #7086
  • [CHANGE] Ring: relaxed the hash ring heartbeat period and timeout for distributor, ingester, store-gateway and compactor: #6860
    • -distributor.ring.heartbeat-period set to 1m
    • -distributor.ring.heartbeat-timeout set to 4m
    • -ingester.ring.heartbeat-period set to 2m
    • -store-gateway.sharding-ring.heartbeat-period set to 1m
    • -store-gateway.sharding-ring.heartbeat-timeout set to 4m
    • -compactor.ring.heartbeat-period set to 1m
    • -compactor.ring.heartbeat-timeout set to 4m
  • [CHANGE] Ruler-querier: the topology spread constraint max skew is now configured through the configuration option ruler_querier_topology_spread_max_skew instead of querier_topology_spread_max_skew. #7204
  • [CHANGE] Distributor: -server.grpc.keepalive.max-connection-age lowered from 2m to 60s and configured -shutdown-delay=90s and termination grace period to 100 seconds in order to reduce the chances of failed gRPC write requests when distributors shut down gracefully. #7361
  • [FEATURE] Added support for the following root-level settings to configure the list of matchers to apply to node affinity: #6782 #6829
    • alertmanager_node_affinity_matchers
    • compactor_node_affinity_matchers
    • continuous_test_node_affinity_matchers
    • distributor_node_affinity_matchers
    • ingester_node_affinity_matchers
    • ingester_zone_a_node_affinity_matchers
    • ingester_zone_b_node_affinity_matchers
    • ingester_zone_c_node_affinity_matchers
    • mimir_backend_node_affinity_matchers
    • mimir_backend_zone_a_node_affinity_matchers
    • mimir_backend_zone_b_node_affinity_matchers
    • mimir_backend_zone_c_node_affinity_matchers
    • mimir_read_node_affinity_matchers
    • mimir_write_node_affinity_matchers
    • mimir_write_zone_a_node_affinity_matchers
    • mimir_write_zone_b_node_affinity_matchers
    • mimir_write_zone_c_node_affinity_matchers
    • overrides_exporter_node_affinity_matchers
    • querier_node_affinity_matchers
    • query_frontend_node_affinity_matchers
    • query_scheduler_node_affinity_matchers
    • rollout_operator_node_affinity_matchers
    • ruler_node_affinity_matchers
    • ruler_querier_node_affinity_matchers
    • ruler_query_frontend_node_affinity_matchers
    • ruler_query_scheduler_node_affinity_matchers
    • store_gateway_node_affinity_matchers
    • store_gateway_zone_a_node_affinity_matchers
    • store_gateway_zone_b_node_affinity_matchers
    • store_gateway_zone_c_node_affinity_matchers
  • [FEATURE] Ingester: Allow automated zone-by-zone downscaling, which can be enabled via the ingester_automated_downscale_enabled flag. It is disabled by default. #6850
  • [ENHANCEMENT] Alerts: Add MimirStoreGatewayTooManyFailedOperations warning alert that triggers when the Mimir store-gateway reports errors while interacting with the object storage. #6831
  • [ENHANCEMENT] Querier HPA: improved scaling metric and scaling policies, in order to scale up and down more gradually. #6971
  • [ENHANCEMENT] Rollout-operator: upgraded to v0.13.0. #7469
  • [ENHANCEMENT] Rollout-operator: add tracing configuration to rollout-operator container (when tracing is enabled and configured). #7469
  • [ENHANCEMENT] Query-frontend: configured -shutdown-delay, -server.grpc.keepalive.max-connection-age and termination grace period to reduce the likelihood of queries hitting terminated query-frontends. #7129
  • [ENHANCEMENT] Autoscaling: add support for KEDA's ignoreNullValues option for Prometheus scaler. #7471
  • [BUGFIX] Update memcached-exporter to 0.14.1 due to CVE-2023-39325. #6861
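
The timing values in the ring heartbeat and distributor shutdown entries above can be sanity-checked with a short Python sketch. It is an illustration only, not part of the Jsonnet library: the function names are hypothetical, and the ordering rule (connection age shorter than shutdown delay, shutdown delay shorter than the grace period) is an assumed reading of why these three values were chosen together.

```python
from datetime import timedelta

# (heartbeat period, heartbeat timeout) pairs copied from the entry for #6860.
RING_SETTINGS = {
    "distributor": (timedelta(minutes=1), timedelta(minutes=4)),
    "store-gateway": (timedelta(minutes=1), timedelta(minutes=4)),
    "compactor": (timedelta(minutes=1), timedelta(minutes=4)),
}


def heartbeats_per_timeout(period: timedelta, timeout: timedelta) -> int:
    """Return how many whole heartbeat periods fit in one timeout window."""
    return timeout // period


def shutdown_order_is_consistent(
    max_connection_age: timedelta,
    shutdown_delay: timedelta,
    grace_period: timedelta,
) -> bool:
    """Assumed rule: gRPC connections are recycled before the shutdown delay
    ends, and the delay ends before the termination grace period expires."""
    return max_connection_age < shutdown_delay < grace_period


if __name__ == "__main__":
    for component, (period, timeout) in RING_SETTINGS.items():
        ratio = heartbeats_per_timeout(period, timeout)
        print(f"{component}: {ratio} heartbeat periods per timeout")

    # Distributor values copied from the entry for #7361.
    consistent = shutdown_order_is_consistent(
        max_connection_age=timedelta(seconds=60),
        shutdown_delay=timedelta(seconds=90),
        grace_period=timedelta(seconds=100),
    )
    print(f"distributor shutdown ordering consistent: {consistent}")
```

If you override any of these values in your own deployment, a check like this may help confirm that the relationships between them still hold.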

Mimirtool

  • [FEATURE] Add command migrate-utf8 to migrate Alertmanager configurations for Alertmanager versions 0.27.0 and later. #7383
  • [ENHANCEMENT] Add template render command to render a template locally. #7325
  • [ENHANCEMENT] Add --extra-headers option to mimirtool rules command to add extra headers to requests for auth. #7141
  • [ENHANCEMENT] Analyze Prometheus: set tenant header. #6737
  • [ENHANCEMENT] Add argument --output-dir to mimirtool alertmanager get. The config and templates are written to this directory and can be loaded via mimirtool alertmanager load. #6760
  • [BUGFIX] Analyze rule-file: .metricsUsed field wasn't populated. #6953

Mimir Continuous Test

  • [ENHANCEMENT] Include comparison of all expected and actual values when any float sample does not match. #6756

Query-tee

  • [BUGFIX] Fix issue where Host HTTP header was not being correctly changed for the proxy targets. #7386
  • [ENHANCEMENT] Allow using the value of X-Scope-OrgID for basic auth username in the forwarded request if URL username is set as __REQUEST_HEADER_X_SCOPE_ORGID__. #7452
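
The X-Scope-OrgID substitution described above can be pictured with the following Python sketch. It is not query-tee's implementation (query-tee is written in Go); the function name, the example backend URL, and the behavior when the header is missing are assumptions made for illustration.

```python
import base64
from urllib.parse import urlsplit

# Placeholder username named in the changelog entry above.
PLACEHOLDER = "__REQUEST_HEADER_X_SCOPE_ORGID__"


def basic_auth_header(backend_url, request_headers):
    """Build a Basic auth header value for a forwarded request.

    Returns None when the backend URL carries no username. When the username
    is the placeholder, the tenant ID from X-Scope-OrgID is used instead.
    Assumption: if the header is absent, the URL username is kept unchanged.
    """
    parts = urlsplit(backend_url)
    username = parts.username
    if username is None:
        return None
    if username == PLACEHOLDER:
        username = request_headers.get("X-Scope-OrgID", username)
    password = parts.password or ""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Basic {token}"


if __name__ == "__main__":
    # Hypothetical backend URL and tenant ID.
    url = f"http://{PLACEHOLDER}:secret@backend.example:8080"
    print(basic_auth_header(url, {"X-Scope-OrgID": "tenant-1"}))
```

With a setup like this, a single backend URL may serve requests from many tenants, because the username is taken from each incoming request rather than fixed in the URL.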

Documentation

  • [CHANGE] No longer mark OTLP distributor endpoint as experimental. #7348
  • [ENHANCEMENT] Added runbook for KubePersistentVolumeFillingUp alert. #7297
  • [ENHANCEMENT] Add Grafana Cloud recommendations to OTLP documentation. #7375
  • [BUGFIX] Fixed typo on single zone->zone aware replication Helm page. #7327

Tools

  • [CHANGE] copyblocks: The flags for copyblocks have been changed to align more closely with other tools. #6607
  • [CHANGE] undelete-blocks: undelete-blocks-gcs has been removed and replaced with undelete-blocks, which supports recovering deleted blocks in versioned buckets from ABS, GCS, and S3-compatible object storage. #6607
  • [FEATURE] copyprefix: Add tool to copy objects between prefixes. Supports ABS, GCS, and S3-compatible object storage. #6607

All changes in this release: mimir-2.11.0...mimir-2.12.0-rc.0