mimir-2.3.0

@treid314 released this 20 Sep 15:21
b3f22b4

Grafana Mimir version 2.3 release notes

Grafana Labs is excited to announce version 2.3 of Grafana Mimir, the most scalable, most performant open source time series database in the world.

The highlights that follow include the top features, enhancements, and bugfixes in this release. For the complete list of changes, see the changelog.

Note: If you are upgrading from Grafana Mimir 2.2, review the list of important changes that follow.

This release contains 370 PRs from 39 authors. Thank you!

Features and enhancements

  • Ingest metrics in OpenTelemetry format:
    This release of Grafana Mimir introduces experimental support for ingesting metrics from the OpenTelemetry Collector's otlphttp exporter. This adds a second ingestion option for users of the OTel Collector; Mimir was already compatible with the prometheusremotewrite exporter. For more information, please see Configure OTel Collector.

  • Tenant federation for metadata queries:
    Users with tenant federation enabled could already issue instant queries, range queries, and exemplar queries to multiple tenants at once and receive a single aggregated result. With Grafana Mimir 2.3, we've added tenant federation support to the /api/v1/metadata endpoint as well.

  • Simpler object storage configuration:
    Users can now configure block, alertmanager, and ruler storage all at once with the common YAML config option key (or -common.storage.* CLI flags). By centralizing your object storage configuration in one place, this enhancement makes configuration faster and less error-prone. Users may still individually configure storage for each of these components if they desire. For more information, see Common Configurations.

  • .deb and .rpm packages for Mimir:
    Starting with version 2.3, we're publishing .deb and .rpm files for Grafana Mimir, which will make installing and running it on Debian or Red Hat-based Linux systems much easier. Thank you to community contributor wilfriedroset for your work to implement this!

  • Import historic data:
    Users can now backfill time series data from their existing Prometheus or Cortex installation into Mimir using mimirtool, making it possible to migrate to Grafana Mimir without losing your existing metrics data. This support is still considered experimental and does not yet work for data stored in Thanos. To learn more about this feature, please see mimirtool backfill and Configure TSDB block upload.

  • Increased instant query performance:
    Grafana Mimir now supports splitting instant queries by time. This allows it to better parallelize execution of instant queries and therefore return results faster. At present, splitting is only supported for a subset of instant queries, which means not all instant queries will see a speedup. This feature is currently experimental and is disabled by default. It can be enabled with the split_instant_queries_by_interval YAML config option in the limits section (or the CLI flag -query-frontend.split-instant-queries-by-interval).
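
To make the idea of splitting by time concrete, here is a minimal sketch of how a lookback window could be cut into pieces no longer than the configured interval. It is an illustration only, not Mimir's implementation: the function name split_window is hypothetical, and the sketch ignores how samples that fall exactly on a piece boundary are handled.

```python
from datetime import timedelta


def split_window(total: timedelta, interval: timedelta) -> list[tuple[timedelta, timedelta]]:
    """Split a lookback window of length `total` into (offset, length) pieces.

    Offsets count back from the query evaluation time, and each piece is at
    most `interval` long. Hypothetical helper for illustration only.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    pieces = []
    offset = timedelta(0)
    while offset < total:
        # The last piece may be shorter than the interval.
        length = min(interval, total - offset)
        pieces.append((offset, length))
        offset += length
    return pieces


if __name__ == "__main__":
    # A 3-day window split by a 1-day interval yields three 1-day pieces.
    for offset, length in split_window(timedelta(days=3), timedelta(days=1)):
        print(f"range={length} offset={offset}")
```

Each piece can then be evaluated independently and the partial results combined, which is where the parallelism comes from. Whether the partial results can be combined depends on the query, which is consistent with only a subset of instant queries being splittable.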

Helm chart improvements

The Mimir Helm chart is the best way to install Mimir on Kubernetes. As part of the Mimir 2.3 release, we’re also releasing version 3.1 of the Mimir Helm chart.

Notable enhancements follow. For the full list of changes, see the Helm chart changelog.

  • We've upgraded the MinIO subchart dependency from a deprecated chart to the supported one. This creates a breaking change in how the administrator password is set. However, as the built-in MinIO is not a recommended object store for production use cases, this change did not warrant a new major version of the Mimir Helm chart.
  • Query sharding is now enabled by default, which should give you better performance on high-cardinality metrics queries.
    • To compensate for the increased number of queries generated by query sharding, the query scheduler component is now enabled by default.
  • The backfill API endpoints for importing historic time series data are now exposed on the Nginx gateway.
  • Nginx now sets the value of the X-Scope-OrgID header equal to the value of Mimir's no_auth_tenant parameter by default. The previous release set the value of X-Scope-OrgID to anonymous by default, which complicated the process of migrating to Mimir.
  • Memberlist now uses DNS service-discovery by default, which decreases startup time for large Mimir clusters.

Important changes

In Grafana Mimir 2.3 we have removed the following previously deprecated configuration options:

  • The extend_writes parameter in the distributor YAML configuration and -distributor.extend-writes CLI flag have been removed.
  • The active_series_custom_trackers parameter has been removed from the YAML configuration. It had already been moved to the runtime configuration. See #1188 for details.
  • The blocks-storage.tsdb.isolation-enabled parameter in the YAML configuration and -blocks-storage.tsdb.isolation-enabled CLI flag have been removed.

With Grafana Mimir 2.3, we have also updated the default value for the CLI flag -distributor.ha-tracker.max-clusters to 100 to provide Denial-of-Service protection. Previously, -distributor.ha-tracker.max-clusters was unlimited by default, which could allow a tenant with HA Dedupe enabled to overload the HA tracker with __cluster__ label values and cause the HA Dedupe database to fail.
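
The effect of such a cap can be sketched as follows. This is a simplified illustration under stated assumptions, not Mimir's HA tracker: the class name ClusterLimiter is hypothetical, the limit is assumed to apply per tenant, and a value of 0 is treated as "unlimited" to mirror the previous default.

```python
class ClusterLimiter:
    """Tracks distinct cluster label values per tenant, up to a cap.

    Hypothetical sketch: per-tenant scoping and "0 means unlimited" are
    assumptions made for illustration.
    """

    def __init__(self, max_clusters: int = 100):
        self.max_clusters = max_clusters
        self._seen: dict[str, set[str]] = {}

    def admit(self, tenant: str, cluster: str) -> bool:
        clusters = self._seen.setdefault(tenant, set())
        if cluster in clusters:
            # Already tracked clusters are always accepted.
            return True
        if self.max_clusters and len(clusters) >= self.max_clusters:
            # A new cluster value would exceed the cap, so reject it.
            return False
        clusters.add(cluster)
        return True
```

With a finite cap, a tenant that sends an ever-growing set of cluster values stops adding tracker entries once the cap is reached, while clusters that are already tracked keep working.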

Also, as noted above, the administrator password for Helm chart deployments using the built-in MinIO is now set differently.

Bug fixes

  • PR 2447: Fix incorrect mapping of HTTP status code 429 to 500 when the request queue is full in the query-frontend. This corrects behavior in the query-frontend where a retryable 429 "Too Many Outstanding Requests" error from a querier was incorrectly returned as an unretryable 500 system error.
  • PR 2505: The Memberlist key-value (KV) store now tries to "fast-join" the cluster to avoid serving an empty KV store. This fix addresses the confusing "empty ring" error response and the error log message "ring doesn't exist in KV store yet" emitted by services when there are other members present in the ring when a service starts. Those using other key-value store options (e.g., consul, etcd) are not impacted by this bug.
  • PR 2289: The "List Prometheus rules" API endpoint of the Mimir Ruler component is no longer blocked while rules are being synced. This means users can now list rules while syncing larger rule sets.
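
The distinction between a retryable 429 and an unretryable 500 matters to clients that retry on their own. The following is a minimal client-side sketch, assuming a caller-supplied send function that returns an HTTP status code. The function name call_with_retry and its parameters are hypothetical and not part of Mimir or any Mimir client.

```python
import time
from typing import Callable


def call_with_retry(
    send: Callable[[], int],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` and retry with exponential backoff only on HTTP 429.

    Any other status, including 500, is returned immediately.
    Hypothetical helper for illustration only.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status = send()
        # Stop on anything other than 429, or when attempts are exhausted.
        if status != 429 or attempt == max_attempts - 1:
            return status
        sleep(base_delay * 2 ** attempt)
    return status
```

Before this fix, a client written this way would have given up immediately on a full request queue, because the query-frontend reported the condition as a 500.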

Changelog

2.3.0

Grafana Mimir

  • [CHANGE] Ingester: Added user label to ingester metric cortex_ingester_tsdb_out_of_order_samples_appended_total. On multitenant clusters this helps us find the rate of appended out-of-order samples for a specific tenant. #2493
  • [CHANGE] Compactor: delete source and output blocks from local disk when compaction fails, to reduce the likelihood that subsequent compactions fail because no space is left on disk. #2261
  • [CHANGE] Ruler: Remove unused CLI flags -ruler.search-pending-for and -ruler.flush-period (and their respective YAML config options). #2288
  • [CHANGE] Successful gRPC requests are no longer logged (only affects internal API calls). #2309
  • [CHANGE] Add new -*.consul.cas-retry-delay flags. They have a default value of 1s, while previously there was no delay between retries. #2309
  • [CHANGE] Store-gateway: Remove the experimental ability to run requests in a dedicated OS thread pool and associated CLI flag -store-gateway.thread-pool-size. #2423
  • [CHANGE] Memberlist: disabled TCP-based ping fallback, because Mimir already uses a custom transport based on TCP. #2456
  • [CHANGE] Change default value for -distributor.ha-tracker.max-clusters to 100 to provide a DoS protection. #2465
  • [CHANGE] Experimental block upload API exposed by compactor has changed: Previous /api/v1/upload/block/{block} endpoint for starting block upload is now /api/v1/upload/block/{block}/start, and previous endpoint /api/v1/upload/block/{block}?uploadComplete=true for finishing block upload is now /api/v1/upload/block/{block}/finish. New API endpoint has been added: /api/v1/upload/block/{block}/check. #2486 #2548
  • [CHANGE] Compactor: changed -compactor.max-compaction-time default from 0s (disabled) to 1h. When compacting blocks for a tenant, the compactor will move to compact blocks of another tenant or re-plan blocks to compact at least every 1h. #2514
  • [CHANGE] Distributor: removed previously deprecated extend_writes (see #1856) YAML key and -distributor.extend-writes CLI flag from the distributor config. #2551
  • [CHANGE] Ingester: removed previously deprecated active_series_custom_trackers (see #1188) YAML key from the ingester config. #2552
  • [CHANGE] The tenant ID __mimir_cluster is reserved by Mimir and not allowed to store metrics. #2643
  • [CHANGE] Purger: removed the purger component and moved its API endpoints /purger/delete_tenant and /purger/delete_tenant_status to the compactor at /compactor/delete_tenant and /compactor/delete_tenant_status. The new endpoints on the compactor are stable. #2644
  • [CHANGE] Memberlist: Change the leave timeout (-memberlist.leave-timeout) from 5s to 20s and the connection timeout (-memberlist.packet-dial-timeout) from 5s to 2s. This makes the leave timeout 10x the connection timeout, so that the leave can be communicated to at least 1 node even if the first 9 nodes contacted time out. #2669
  • [CHANGE] Alertmanager: return status code 412 Precondition Failed and log info message when alertmanager isn't configured for a tenant. #2635
  • [CHANGE] Distributor: if forwarding rules are used to forward samples, exemplars are now removed from the request. #2710, #2725
  • [CHANGE] Limits: change the default value of max_global_series_per_metric limit to 0 (disabled). Setting this limit by default does not provide much benefit because series are sharded by all labels. #2714
  • [CHANGE] Ingester: experimental -blocks-storage.tsdb.new-chunk-disk-mapper has been removed, new chunk disk mapper is now always used, and is no longer marked experimental. Default value of -blocks-storage.tsdb.head-chunks-write-queue-size has changed to 1000000, this enables async chunk queue by default, which leads to improved latency on the write path when new chunks are created in ingesters. #2762
  • [CHANGE] Ingester: removed deprecated -blocks-storage.tsdb.isolation-enabled option. TSDB-level isolation is now always disabled in Mimir. #2782
  • [CHANGE] Compactor: -compactor.partial-block-deletion-delay must either be set to 0 (to disable partial blocks deletion) or a value higher than 4h. #2787
  • [CHANGE] Query-frontend: CLI flag -query-frontend.align-querier-with-step has been deprecated. Please use -query-frontend.align-queries-with-step instead. #2840
  • [FEATURE] Compactor: Adds the ability to delete partial blocks after a configurable delay. This option can be configured per tenant. #2285
    • -compactor.partial-block-deletion-delay, as a duration string, allows you to set the delay since a partial block has been modified before marking it for deletion. A value of 0, the default, disables this feature.
    • The metric cortex_compactor_blocks_marked_for_deletion_total has a new value for the reason label reason="partial", when a block deletion marker is triggered by the partial block deletion delay.
  • [FEATURE] Querier: enabled support for queries with negative offsets, which are not cached in the query results cache. #2429
  • [FEATURE] EXPERIMENTAL: OpenTelemetry Metrics ingestion path on /otlp/v1/metrics. #695 #2436 #2461
  • [FEATURE] Querier: Added support for tenant federation to metric metadata endpoint. #2467
  • [FEATURE] Query-frontend: introduced experimental support to split instant queries by time. The instant query splitting can be enabled setting -query-frontend.split-instant-queries-by-interval. #2469 #2564 #2565 #2570 #2571 #2572 #2573 #2574 #2575 #2576 #2581 #2582 #2601 #2632 #2633 #2634 #2641 #2642 #2766
  • [FEATURE] Introduced experimental anonymous usage statistics tracking (disabled by default), to help Mimir maintainers make better decisions to support the open source community. The tracking system anonymously collects non-sensitive, non-personally identifiable information about the running Mimir cluster. #2643 #2662 #2685 #2732 #2733 #2735
  • [FEATURE] Introduced an experimental deployment mode called read-write, which runs a fully featured Mimir cluster with three components: write, read and backend. The read-write deployment mode is a trade-off between the monolithic mode (only one component, no isolation) and the microservices mode (many components, high isolation). #2754 #2838
  • [ENHANCEMENT] Distributor: Decreased distributor tests execution time. #2562
  • [ENHANCEMENT] Alertmanager: Allow the HTTP proxy_url configuration option in the receiver's configuration. #2317
  • [ENHANCEMENT] Ring: optimize shuffle-shard computation when lookback is used and all instances have a registered timestamp within the lookback window. In that case we can immediately return the original ring, because we would select all instances anyway. #2309
  • [ENHANCEMENT] Memberlist: added experimental memberlist cluster label support via -memberlist.cluster-label and -memberlist.cluster-label-verification-disabled CLI flags (and their respective YAML config options). #2354
  • [ENHANCEMENT] Object storage can now be configured for all components using the common YAML config option key (or -common.storage.* CLI flags). #2330 #2347
  • [ENHANCEMENT] Go: updated to go 1.18.4. #2400
  • [ENHANCEMENT] Store-gateway, listblocks: list of blocks now includes stats from meta.json file: number of series, samples and chunks. #2425
  • [ENHANCEMENT] Added more buckets to cortex_ingester_client_request_duration_seconds histogram metric, to correctly track requests taking longer than 1s (up until 16s). #2445
  • [ENHANCEMENT] Azure client: Improve memory usage for large object storage downloads. #2408
  • [ENHANCEMENT] Distributor: Add -distributor.instance-limits.max-inflight-push-requests-bytes. This limit protects the distributor against multiple large requests that together may cause an OOM, but are only a few, so do not trigger the max-inflight-push-requests limit. #2413
  • [ENHANCEMENT] Distributor: Drop exemplars in distributor for tenants where exemplars are disabled. #2504
  • [ENHANCEMENT] Runtime Config: Allow operator to specify multiple comma-separated yaml files in -runtime-config.file that will be merged in left to right order. #2583
  • [ENHANCEMENT] Query sharding: shard binary operations only if it doesn't lead to non-shardable vector selectors in one of the operands. #2696
  • [ENHANCEMENT] Add packaging for both debian based deb file and redhat based rpm file using FPM. #1803
  • [ENHANCEMENT] Distributor: Add cortex_distributor_query_ingester_chunks_deduped_total and cortex_distributor_query_ingester_chunks_total metrics for determining how effective ingester chunk deduplication at query time is. #2713
  • [ENHANCEMENT] Upgrade Docker base images to alpine:3.16.2. #2729
  • [ENHANCEMENT] Ruler: Add <prometheus-http-prefix>/api/v1/status/buildinfo endpoint. #2724
  • [ENHANCEMENT] Querier: Ensure all queries pulled from query-frontend or query-scheduler are immediately executed. The maximum workers concurrency in each querier is configured by -querier.max-concurrent. #2598
  • [ENHANCEMENT] Distributor: Add cortex_distributor_received_requests_total and cortex_distributor_requests_in_total metrics to provide visibility into appropriate per-tenant request limits. #2770
  • [ENHANCEMENT] Distributor: Add single forwarding remote-write endpoint for a tenant (forwarding_endpoint), instead of using per-rule endpoints. This takes precedence over per-rule endpoints. #2801
  • [ENHANCEMENT] Added err-mimir-distributor-max-write-message-size to the errors catalog. #2470
  • [ENHANCEMENT] Add sanity check at startup to ensure the configured filesystem directories don't overlap for different components. #2828
  • [BUGFIX] TSDB: Fixed a bug on the experimental out-of-order implementation that led to wrong query results. #2701
  • [BUGFIX] Compactor: log the actual error when compaction fails. #2261
  • [BUGFIX] Alertmanager: restore state from storage even when running a single replica. #2293
  • [BUGFIX] Ruler: do not block "List Prometheus rules" API endpoint while syncing rules. #2289
  • [BUGFIX] Ruler: return proper *status.Status error when running in remote operational mode. #2417
  • [BUGFIX] Alertmanager: ensure the configured -alertmanager.web.external-url is either a path starting with /, or a full URL including the scheme and hostname. #2381 #2542
  • [BUGFIX] Memberlist: fix problem with loss of some packets, typically ring updates when instances were removed from the ring during shutdown. #2418
  • [BUGFIX] Ingester: fix misfiring MimirIngesterHasUnshippedBlocks and stale cortex_ingester_oldest_unshipped_block_timestamp_seconds when some block uploads fail. #2435
  • [BUGFIX] Query-frontend: fix incorrect mapping of http status codes 429 to 500 when request queue is full. #2447
  • [BUGFIX] Memberlist: Fix problem with ring being empty right after startup. Memberlist KV store now tries to "fast-join" the cluster to avoid serving empty KV store. #2505
  • [BUGFIX] Compactor: Fix bug when using -compactor.partial-block-deletion-delay: compactor didn't correctly check for modification time of all block files. #2559
  • [BUGFIX] Query-frontend: fix wrong query sharding results for queries with boolean result like 1 < bool 0. #2558
  • [BUGFIX] Fixed error messages related to per-instance limits incorrectly reporting they can be set on a per-tenant basis. #2610
  • [BUGFIX] Perform HA-deduplication before forwarding samples according to forwarding rules in the distributor. #2603 #2709
  • [BUGFIX] Fix reporting of tracing spans from PromQL engine. #2707
  • [BUGFIX] Apply relabel and drop_label rules before forwarding rules in the distributor. #2703
  • [BUGFIX] Distributor: Register cortex_discarded_requests_total metric, which previously was not registered and therefore not exported. #2712
  • [BUGFIX] Ruler: fix not restoring alerts' state at startup. #2648
  • [BUGFIX] Ingester: Fix disk filling up after restarting ingesters with out-of-order support disabled while it was enabled before. #2799
  • [BUGFIX] Memberlist: retry joining memberlist cluster on startup when no nodes are resolved. #2837
  • [BUGFIX] Query-frontend: fix incorrect mapping of http status codes 413 to 500 when request is too large. #2819
  • [BUGFIX] Alertmanager: revert upstream alertmanager to v0.24.0 to fix panic when unmarshalling email headers. #2924 #2925
  • [BUGFIX] Fix sanity check done on configured filesystem directories when running Alertmanager in microservices mode. #2947
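
For the renamed block upload endpoints in the [CHANGE] entry above, the new paths can be derived from a block ID as in this small sketch. The helper name upload_paths is hypothetical. Only the path layout comes from the changelog entry, and the HTTP methods and request bodies are deliberately left out because they are not described here.

```python
from urllib.parse import quote

# Path prefix taken from the changelog entry for the experimental upload API.
UPLOAD_PREFIX = "/api/v1/upload/block"


def upload_paths(block_id: str) -> dict[str, str]:
    """Return the start, finish and check paths for one block ID.

    Hypothetical helper for illustration only.
    """
    block = quote(block_id, safe="")  # escape characters that are unsafe in a path segment
    return {step: f"{UPLOAD_PREFIX}/{block}/{step}" for step in ("start", "finish", "check")}
```

Scripts that still call /api/v1/upload/block/{block} directly, or append ?uploadComplete=true to finish an upload, need to switch to the start and finish paths.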

Mixin

  • [CHANGE] Dashboards: "Slow Queries" dashboard no longer works with versions older than Grafana 9.0. #2223
  • [CHANGE] Alerts: use RSS memory instead of working set memory in the MimirAllocatingTooMuchMemory alert for ingesters. #2480
  • [CHANGE] Dashboards: remove the "Cache - Latency (old)" panel from the "Mimir / Queries" dashboard. #2796
  • [FEATURE] Dashboards: added support for the experimental read-write deployment mode. #2780
  • [ENHANCEMENT] Dashboards: added missed rule evaluations to the "Evaluations per second" panel in the "Mimir / Ruler" dashboard. #2314
  • [ENHANCEMENT] Dashboards: add k8s resource requests to CPU and memory panels. #2346
  • [ENHANCEMENT] Dashboards: add RSS memory utilization panel for ingesters, store-gateways and compactors. #2479
  • [ENHANCEMENT] Dashboards: allow configuring the graph tooltip. #2647
  • [ENHANCEMENT] Alerts: MimirFrontendQueriesStuck and MimirSchedulerQueriesStuck alerts are more reliable now as they consider all the intermediate samples in the minute prior to the evaluation. #2630
  • [ENHANCEMENT] Alerts: added RolloutOperatorNotReconciling alert, firing if the optional rollout-operator is not successfully reconciling. #2700
  • [ENHANCEMENT] Dashboards: added support for query-tee in front of ruler-query-frontend in the "Remote ruler reads" dashboard. #2761
  • [ENHANCEMENT] Dashboards: Introduce support for baremetal deployment, setting deployment_type: 'baremetal' in the mixin _config. #2657
  • [ENHANCEMENT] Dashboards: use timeseries panel to show exemplars. #2800
  • [BUGFIX] Dashboards: fixed unit of latency panels in the "Mimir / Ruler" dashboard. #2312
  • [BUGFIX] Dashboards: fixed "Intervals per query" panel in the "Mimir / Queries" dashboard. #2308
  • [BUGFIX] Dashboards: Make the "Slow Queries" dashboard work with Grafana 9.0. #2223
  • [BUGFIX] Dashboards: add missing API routes to Ruler dashboard. #2412
  • [BUGFIX] Dashboards: stop setting 'interval' in dashboards; it should be set on your datasource. #2802

Jsonnet

  • [CHANGE] query-scheduler is enabled by default. We advise deploying the query-scheduler to improve the scalability of the query-frontend. #2431
  • [CHANGE] Replaced anti-affinity rules with pod topology spread constraints for distributor, query-frontend, querier and ruler. #2517
    • The following configuration options have been removed:
      • distributor_allow_multiple_replicas_on_same_node
      • query_frontend_allow_multiple_replicas_on_same_node
      • querier_allow_multiple_replicas_on_same_node
      • ruler_allow_multiple_replicas_on_same_node
    • The following configuration options have been added:
      • distributor_topology_spread_max_skew
      • query_frontend_topology_spread_max_skew
      • querier_topology_spread_max_skew
      • ruler_topology_spread_max_skew
  • [CHANGE] Change max_global_series_per_metric to 0 in all plans, and as a default value. #2669
  • [FEATURE] Memberlist: added support for experimental memberlist cluster label, through the jsonnet configuration options memberlist_cluster_label and memberlist_cluster_label_verification_disabled. #2349
  • [FEATURE] Added ruler-querier autoscaling support. It requires KEDA installed in the Kubernetes cluster. The ruler-querier autoscaler can be enabled and configured through the following options in the jsonnet config: #2545
    • autoscaling_ruler_querier_enabled: true to enable autoscaling.
    • autoscaling_ruler_querier_min_replicas: minimum number of ruler-querier replicas.
    • autoscaling_ruler_querier_max_replicas: maximum number of ruler-querier replicas.
    • autoscaling_prometheus_url: Prometheus base URL from which to scrape Mimir metrics (e.g. http://prometheus.default:9090/prometheus).
  • [ENHANCEMENT] Memberlist now uses DNS service-discovery by default. #2549
  • [ENHANCEMENT] Upgrade memcached image tag to memcached:1.6.16-alpine. #2740
  • [ENHANCEMENT] Added $._config.configmaps and $._config.runtime_config_files to make it easy to add new configmaps or runtime config files to all components. #2748

Mimirtool

  • [ENHANCEMENT] Added mimirtool backfill command to upload Prometheus blocks using API available in the compactor. #1822
  • [ENHANCEMENT] mimirtool bucket-validation: Verify existing objects can be overwritten by subsequent uploads. #2491
  • [ENHANCEMENT] mimirtool config convert: Now supports migrating to the current version of Mimir. #2629
  • [BUGFIX] mimirtool analyze: Fix dashboard JSON unmarshalling errors by using custom parsing. #2386
  • [BUGFIX] Version checking no longer prompts for updating when already on latest version. #2723

Mimir Continuous Test

  • [ENHANCEMENT] Added basic authentication and bearer token support for when Mimir is behind a gateway authenticating the calls. #2717

Query-tee

  • [CHANGE] Renamed CLI flag -server.service-port to -server.http-service-port. #2683
  • [CHANGE] Renamed metric cortex_querytee_request_duration_seconds to cortex_querytee_backend_request_duration_seconds. Metric cortex_querytee_request_duration_seconds is now reported without label backend. #2683
  • [ENHANCEMENT] Added HTTP over gRPC support to query-tee to allow testing gRPC requests to Mimir instances. #2683

Documentation

  • [ENHANCEMENT] Referenced mimirtool commands in the HTTP API documentation. #2516
  • [ENHANCEMENT] Improved DNS service discovery documentation. #2513

Tools

  • [ENHANCEMENT] markblocks now processes multiple blocks concurrently. #2677

Full Changelog: mimir-2.2.0...mimir-2.3.0