mimir-2.3.0-rc0

Pre-release
@treid314 treid314 released this 25 Aug 17:00
· 3584 commits to main since this release
3292627

This release contains 333 PRs from 39 authors. Thank you!

Grafana Mimir version 2.3 release notes

Grafana Labs is excited to announce version 2.3 of Grafana Mimir, the most scalable, most performant open source time series database in the world.

The highlights that follow include the top features, enhancements, and bugfixes in this release. If you are upgrading from Grafana Mimir 2.2, there is upgrade-related information as well.
For the complete list of changes, see the Changelog.

Features and enhancements

  • Ingest metrics in OpenTelemetry format:
    This release of Grafana Mimir introduces experimental support for ingesting metrics from the OpenTelemetry Collector's otlphttp exporter. This adds a second ingestion option for users of the OTel Collector; Mimir was already compatible with the prometheusremotewrite exporter. For more information, please see Configure OTel Collector.

  • Increased instant query performance:
    Grafana Mimir now supports splitting instant queries by time. This allows it to better parallelize execution of instant queries and therefore return results faster. At present, splitting is only supported for a subset of instant queries, which means not all instant queries will see a speedup. This feature is being released as experimental and is disabled by default. It can be enabled by setting -query-frontend.split-instant-queries-by-interval.

  • Tenant federation for metadata queries:
    Users with tenant federation enabled could previously issue instant queries, range queries, and exemplar queries to multiple tenants at once and receive a single aggregated result. With Grafana Mimir 2.3, we've added tenant federation support to the /api/v1/metadata endpoint as well.

  • Simpler object storage configuration:
    Users can now configure block, alertmanager, and ruler storage all at once with the common YAML config option key (or -common.storage.* CLI flags). By centralizing your object storage configuration in one place, this enhancement makes configuration faster and less error-prone. Users can still configure storage individually for each of these components if they desire. For more information, see the Common Configurations.

  • DEB and RPM packages for Mimir:
    Starting with version 2.3, we're publishing deb and rpm files for Grafana Mimir, which will make installing and running it on Debian- or Red Hat-based Linux systems much easier. Thank you to community contributor wilfriedroset for your work to implement this!

  • Import historic data to Grafana Mimir:
    Users can now backfill time series data from their existing Prometheus or Cortex installation into Mimir using mimirtool, making it possible to migrate to Grafana Mimir without losing your existing metrics data. This support is still considered experimental and does not yet work for data stored in Thanos. To learn more about this feature, please see mimirtool backfill and Configure TSDB block upload.

  • New Helm chart minor release:
    The Mimir Helm chart is the best way to install Mimir on Kubernetes. As part of the Mimir 2.3 release, we’re also releasing version 3.1 of the Mimir Helm chart. Notable enhancements follow. For the full list of changes, see the Helm chart changelog.

    • We've upgraded the MinIO subchart dependency from a deprecated chart to the supported one. This creates a breaking change in how the administrator password is set. However, as the built-in MinIO is not a recommended object store for production use cases, this change did not warrant a new major version of the Mimir Helm chart.
    • The backfill API endpoints for importing historic time series data are now exposed on the Nginx gateway.
    • Nginx now sets the value of the X-Scope-OrgID header equal to the value of Mimir's no_auth_tenant parameter by default. The previous release set the value of X-Scope-OrgID to anonymous by default, which complicated the process of migrating to Mimir.
    • Memberlist now uses DNS service-discovery by default, which should decrease startup time for large Mimir clusters.
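
To illustrate the idea behind splitting instant queries by time, here is a minimal Python sketch. It is not Mimir's implementation: the function name split_window and its parameters are hypothetical, and it only shows how the lookback window of an instant query could be partitioned into interval-sized pieces that might then be evaluated in parallel and combined.

```python
from datetime import datetime, timedelta, timezone


def split_window(end, window, interval):
    """Split the lookback window ending at `end` into consecutive pieces.

    Hypothetical helper for illustration only. Each piece is at most
    `interval` long; the last piece may be shorter.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    pieces = []
    cursor = end - window
    while cursor < end:
        nxt = min(cursor + interval, end)
        pieces.append((cursor, nxt))
        cursor = nxt
    return pieces


if __name__ == "__main__":
    end = datetime(2022, 8, 25, 17, 0, tzinfo=timezone.utc)
    for start, stop in split_window(end, timedelta(hours=3), timedelta(hours=1)):
        print(start.isoformat(), "->", stop.isoformat())
```

With a 3h window and a 1h interval the sketch yields three pieces. As noted above, Mimir only splits a subset of instant queries, so a real query may not be partitioned this way at all.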

Upgrade considerations

In Grafana Mimir 2.3 we have removed the following previously deprecated configuration options:

  • The extend_writes parameter in the distributor YAML configuration and -distributor.extend-writes CLI flag have been removed.
  • The active_series_custom_trackers parameter has been removed from the YAML configuration. It had already been moved to the runtime configuration. See #1188 for details.

With Grafana Mimir 2.3 we have also changed the default value of -distributor.ha-tracker.max-clusters to 100 to provide denial-of-service protection. Previously, -distributor.ha-tracker.max-clusters was unlimited by default, which could allow a tenant with HA Dedupe enabled to overload the HA tracker with __cluster__ label values and cause the HA Dedupe database to fail.
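
Before upgrading, it can help to scan the command line you pass to Mimir for flags that no longer exist. The following Python sketch is one way to do that. The helper name find_removed_flags is hypothetical, and the flag set only contains flags that these release notes and the changelog below name as removed; it is not an exhaustive list.

```python
# Flags named as removed in the Mimir 2.3 release notes and changelog.
REMOVED_FLAGS = {
    "-distributor.extend-writes",
    "-ruler.search-pending-for",
    "-ruler.flush-period",
    "-store-gateway.thread-pool-size",
}


def find_removed_flags(args):
    """Return the removed flags present in `args`, in order of appearance.

    Hypothetical helper for illustration only.
    """
    found = []
    for arg in args:
        name = arg.split("=", 1)[0]
        # Go's flag package accepts both -flag and --flag.
        if name.startswith("--"):
            name = name[1:]
        if name in REMOVED_FLAGS and name not in found:
            found.append(name)
    return found


if __name__ == "__main__":
    args = ["-target=all", "-distributor.extend-writes=true"]
    print(find_removed_flags(args))
```

Removed YAML keys such as extend_writes and active_series_custom_trackers need a separate check against your configuration file.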

Bug fixes

  • PR 2447: Fix incorrect mapping of HTTP status code 429 to 500 when the request queue is full in the query-frontend. Previously, a 429 "Too Many Outstanding Requests" error (a retriable error) from a querier was incorrectly returned by the query-frontend as a 500 system error (an unretriable error).
  • PR 2505: The Memberlist key-value (KV) store now tries to "fast-join" the cluster to avoid serving an empty KV store. This fix addresses the confusing "empty ring" error response and the error log message "ring doesn't exist in KV store yet" emitted by a service at startup even though other members are already present in the ring. Those using other key-value store options (e.g., consul, etcd) are not impacted by this bug.
  • PR 2289: The "List Prometheus rules" API endpoint of the Mimir Ruler component is no longer blocked while rules are being synced. This means users can now list rules while syncing larger rule sets.
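
The 429 versus 500 distinction in PR 2447 matters to clients that retry. The Python sketch below shows one possible client-side policy under stated assumptions: send is a hypothetical callable that performs a query and returns its HTTP status code, and the backoff values are arbitrary. It retries only on 429 and returns any other status immediately.

```python
import time


def query_with_retry(send, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call `send` until it returns a status other than 429.

    Hypothetical helper for illustration only. `send` takes no arguments
    and returns an HTTP status code. Gives up after `max_attempts` calls
    and returns the last status seen.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status = None
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status != 429 or attempt == max_attempts:
            break
        # Exponential backoff between attempts: base, 2*base, 4*base, ...
        sleep(base_delay * 2 ** (attempt - 1))
    return status
```

Before this fix, a policy like this would have given up on a full request queue, because the query-frontend reported it as a 500.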

Changelog since 2.2

2.3.0-rc.0

Grafana Mimir

  • [CHANGE] Ingester: Added user label to ingester metric cortex_ingester_tsdb_out_of_order_samples_appended_total. On multitenant clusters this helps us find the rate of appended out-of-order samples for a specific tenant. #2493
  • [CHANGE] Compactor: delete source and output blocks from local disk when compaction fails, to reduce the likelihood that subsequent compactions fail because there is no space left on disk. #2261
  • [CHANGE] Ruler: Remove unused CLI flags -ruler.search-pending-for and -ruler.flush-period (and their respective YAML config options). #2288
  • [CHANGE] Successful gRPC requests are no longer logged (only affects internal API calls). #2309
  • [CHANGE] Add new -*.consul.cas-retry-delay flags. They have a default value of 1s, while previously there was no delay between retries. #2309
  • [CHANGE] Store-gateway: Remove the experimental ability to run requests in a dedicated OS thread pool and associated CLI flag -store-gateway.thread-pool-size. #2423
  • [CHANGE] Memberlist: disabled TCP-based ping fallback, because Mimir already uses a custom transport based on TCP. #2456
  • [CHANGE] Change default value for -distributor.ha-tracker.max-clusters to 100 to provide DoS protection. #2465
  • [CHANGE] Experimental block upload API exposed by compactor has changed: Previous /api/v1/upload/block/{block} endpoint for starting block upload is now /api/v1/upload/block/{block}/start, and previous endpoint /api/v1/upload/block/{block}?uploadComplete=true for finishing block upload is now /api/v1/upload/block/{block}/finish. New API endpoint has been added: /api/v1/upload/block/{block}/check. #2486 #2548
  • [CHANGE] Compactor: changed -compactor.max-compaction-time default from 0s (disabled) to 1h. When compacting blocks for a tenant, the compactor will move to compact blocks of another tenant or re-plan blocks to compact at least every 1h. #2514
  • [CHANGE] Distributor: removed previously deprecated extend_writes (see #1856) YAML key and -distributor.extend-writes CLI flag from the distributor config. #2551
  • [CHANGE] Ingester: removed previously deprecated active_series_custom_trackers (see #1188) YAML key from the ingester config. #2552
  • [CHANGE] The tenant ID __mimir_cluster is reserved by Mimir and not allowed to store metrics. #2643
  • [CHANGE] Purger: removed the purger component and moved its API endpoints /purger/delete_tenant and /purger/delete_tenant_status to the compactor at /compactor/delete_tenant and /compactor/delete_tenant_status. The new endpoints on the compactor are stable. #2644
  • [CHANGE] Memberlist: Change the leave timeout duration (-memberlist.leave-timeout) from 5s to 20s and the connection timeout (-memberlist.packet-dial-timeout) from 5s to 2s. This makes the leave timeout 10x the connection timeout, so that we can communicate the leave to at least 1 node if the first 9 we try to contact time out. #2669
  • [CHANGE] Alertmanager: return status code 412 Precondition Failed and log info message when alertmanager isn't configured for a tenant. #2635
  • [CHANGE] Distributor: if forwarding rules are used to forward samples, exemplars are now removed from the request. #2710
  • [CHANGE] Limits: change the default value of max_global_series_per_metric limit to 0 (disabled). Setting this limit by default does not provide much benefit because series are sharded by all labels. #2714
  • [FEATURE] Compactor: Adds the ability to delete partial blocks after a configurable delay. This option can be configured per tenant. #2285
    • -compactor.partial-block-deletion-delay, as a duration string, allows you to set the delay since a partial block has been modified before marking it for deletion. A value of 0, the default, disables this feature.
    • The metric cortex_compactor_blocks_marked_for_deletion_total has a new value for the reason label reason="partial", when a block deletion marker is triggered by the partial block deletion delay.
  • [FEATURE] Querier: enabled support for queries with negative offsets, which are not cached in the query results cache. #2429
  • [FEATURE] EXPERIMENTAL: OpenTelemetry Metrics ingestion path on /otlp/v1/metrics. #695 #2436 #2461
  • [FEATURE] Querier: Added support for tenant federation to metric metadata endpoint. #2467
  • [FEATURE] Query-frontend: introduced experimental support to split instant queries by time. The instant query splitting can be enabled by setting -query-frontend.split-instant-queries-by-interval. #2469 #2564 #2565 #2570 #2571 #2572 #2573 #2574 #2575 #2576 #2581 #2582 #2601 #2632 #2633 #2634 #2641 #2642 #2766
  • [ENHANCEMENT] Distributor: Decreased distributor tests execution time. #2562
  • [ENHANCEMENT] Alertmanager: Allow the HTTP proxy_url configuration option in the receiver's configuration. #2317
  • [ENHANCEMENT] ring: optimize shuffle-shard computation when lookback is used, and all instances have a registered timestamp within the lookback window. In that case we can immediately return the original ring, because we would select all instances anyway. #2309
  • [ENHANCEMENT] Memberlist: added experimental memberlist cluster label support via -memberlist.cluster-label and -memberlist.cluster-label-verification-disabled CLI flags (and their respective YAML config options). #2354
  • [ENHANCEMENT] Object storage can now be configured for all components using the common YAML config option key (or -common.storage.* CLI flags). #2330 #2347
  • [ENHANCEMENT] Go: updated to go 1.18.4. #2400
  • [ENHANCEMENT] Store-gateway, listblocks: list of blocks now includes stats from meta.json file: number of series, samples and chunks. #2425
  • [ENHANCEMENT] Added more buckets to cortex_ingester_client_request_duration_seconds histogram metric, to correctly track requests taking longer than 1s (up until 16s). #2445
  • [ENHANCEMENT] Azure client: Improve memory usage for large object storage downloads. #2408
  • [ENHANCEMENT] Distributor: Add -distributor.instance-limits.max-inflight-push-requests-bytes. This limit protects the distributor against multiple large requests that together may cause an OOM, but are only a few, so do not trigger the max-inflight-push-requests limit. #2413
  • [ENHANCEMENT] Distributor: Drop exemplars in distributor for tenants where exemplars are disabled. #2504
  • [ENHANCEMENT] Runtime Config: Allow operator to specify multiple comma-separated yaml files in -runtime-config.file that will be merged in left to right order. #2583
  • [ENHANCEMENT] Query sharding: shard binary operations only if it doesn't lead to non-shardable vector selectors in one of the operands. #2696
  • [ENHANCEMENT] Add packaging for both Debian-based deb files and Red Hat-based rpm files using FPM. #1803
  • [BUGFIX] TSDB: Fixed a bug on the experimental out-of-order implementation that led to wrong query results. #2701
  • [BUGFIX] Compactor: log the actual error when compaction fails. #2261
  • [BUGFIX] Alertmanager: restore state from storage even when running a single replica. #2293
  • [BUGFIX] Ruler: do not block "List Prometheus rules" API endpoint while syncing rules. #2289
  • [BUGFIX] Ruler: return proper *status.Status error when running in remote operational mode. #2417
  • [BUGFIX] Alertmanager: ensure the configured -alertmanager.web.external-url is either a path starting with /, or a full URL including the scheme and hostname. #2381 #2542
  • [BUGFIX] Memberlist: fix problem with loss of some packets, typically ring updates when instances were removed from the ring during shutdown. #2418
  • [BUGFIX] Ingester: fix misfiring MimirIngesterHasUnshippedBlocks and stale cortex_ingester_oldest_unshipped_block_timestamp_seconds when some block uploads fail. #2435
  • [BUGFIX] Query-frontend: fix incorrect mapping of HTTP status code 429 to 500 when the request queue is full. #2447
  • [BUGFIX] Memberlist: Fix problem with ring being empty right after startup. Memberlist KV store now tries to "fast-join" the cluster to avoid serving empty KV store. #2505
  • [BUGFIX] Compactor: Fix bug when using -compactor.partial-block-deletion-delay: compactor didn't correctly check for modification time of all block files. #2559
  • [BUGFIX] Query-frontend: fix wrong query sharding results for queries with boolean result like 1 < bool 0. #2558
  • [BUGFIX] Fixed error messages related to per-instance limits incorrectly reporting they can be set on a per-tenant basis. #2610
  • [BUGFIX] Perform HA-deduplication before forwarding samples according to forwarding rules in the distributor. #2603 #2709
  • [BUGFIX] Fix reporting of tracing spans from PromQL engine. #2707
  • [BUGFIX] Apply relabel and drop_label rules before forwarding rules in the distributor. #2703
  • [BUGFIX] Distributor: Register cortex_discarded_requests_total metric, which previously was not registered and therefore not exported. #2712
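
Clients of the experimental block upload API need to move to the new endpoint paths listed in the changelog entry above. The Python sketch below rewrites an old-style path into its new equivalent. The function name migrate_block_upload_path is hypothetical, it handles paths only (any host or deployment-specific prefix is out of scope), and the block ID in the usage example is made up.

```python
from urllib.parse import parse_qs, urlsplit

PREFIX = "/api/v1/upload/block/"


def migrate_block_upload_path(old):
    """Map a pre-2.3 block upload path to the path used since 2.3.

    Hypothetical helper for illustration only.
    /api/v1/upload/block/{block}                     -> .../{block}/start
    /api/v1/upload/block/{block}?uploadComplete=true -> .../{block}/finish
    """
    parts = urlsplit(old)
    if not parts.path.startswith(PREFIX):
        raise ValueError("not a block upload path: " + old)
    block = parts.path[len(PREFIX):]
    if not block or "/" in block:
        raise ValueError("unexpected block segment in: " + old)
    query = parse_qs(parts.query)
    action = "finish" if query.get("uploadComplete") == ["true"] else "start"
    return PREFIX + block + "/" + action


if __name__ == "__main__":
    block = "01G8X9GA8R6N8F75FW1J18G83N"  # made-up block ID
    print(migrate_block_upload_path(PREFIX + block))
    print(migrate_block_upload_path(PREFIX + block + "?uploadComplete=true"))
```

The new /api/v1/upload/block/{block}/check endpoint has no old counterpart, so the sketch never produces it.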

Mixin

  • [CHANGE] Dashboards: "Slow Queries" dashboard no longer works with versions older than Grafana 9.0. #2223
  • [CHANGE] Alerts: use RSS memory instead of working set memory in the MimirAllocatingTooMuchMemory alert for ingesters. #2480
  • [ENHANCEMENT] Dashboards: added missed rule evaluations to the "Evaluations per second" panel in the "Mimir / Ruler" dashboard. #2314
  • [ENHANCEMENT] Dashboards: add k8s resource requests to CPU and memory panels. #2346
  • [ENHANCEMENT] Dashboards: add RSS memory utilization panel for ingesters, store-gateways and compactors. #2479
  • [ENHANCEMENT] Dashboards: allow configuring the graph tooltip. #2647
  • [ENHANCEMENT] Alerts: MimirFrontendQueriesStuck and MimirSchedulerQueriesStuck alerts are more reliable now as they consider all the intermediate samples in the minute prior to the evaluation. #2630
  • [ENHANCEMENT] Alerts: added RolloutOperatorNotReconciling alert, firing if the optional rollout-operator is not successfully reconciling. #2700
  • [BUGFIX] Dashboards: fixed unit of latency panels in the "Mimir / Ruler" dashboard. #2312
  • [BUGFIX] Dashboards: fixed "Intervals per query" panel in the "Mimir / Queries" dashboard. #2308
  • [BUGFIX] Dashboards: Make "Slow Queries" dashboard work with Grafana 9.0. #2223
  • [BUGFIX] Dashboards: add missing API routes to Ruler dashboard. #2412

Jsonnet

  • [CHANGE] query-scheduler is enabled by default. We advise deploying the query-scheduler to improve the scalability of the query-frontend. #2431
  • [CHANGE] Replaced anti-affinity rules with pod topology spread constraints for distributor, query-frontend, querier and ruler. #2517
    • The following configuration options have been removed:
      • distributor_allow_multiple_replicas_on_same_node
      • query_frontend_allow_multiple_replicas_on_same_node
      • querier_allow_multiple_replicas_on_same_node
      • ruler_allow_multiple_replicas_on_same_node
    • The following configuration options have been added:
      • distributor_topology_spread_max_skew
      • query_frontend_topology_spread_max_skew
      • querier_topology_spread_max_skew
      • ruler_topology_spread_max_skew
  • [CHANGE] Change max_global_series_per_metric to 0 in all plans, and as a default value. #2669
  • [FEATURE] Memberlist: added support for experimental memberlist cluster label, through the jsonnet configuration options memberlist_cluster_label and memberlist_cluster_label_verification_disabled. #2349
  • [FEATURE] Added ruler-querier autoscaling support. It requires KEDA installed in the Kubernetes cluster. Ruler-querier autoscaler can be enabled and configured through the following options in the jsonnet config: #2545
    • autoscaling_ruler_querier_enabled: true to enable autoscaling.
    • autoscaling_ruler_querier_min_replicas: minimum number of ruler-querier replicas.
    • autoscaling_ruler_querier_max_replicas: maximum number of ruler-querier replicas.
    • autoscaling_prometheus_url: Prometheus base URL from which to scrape Mimir metrics (e.g. http://prometheus.default:9090/prometheus).
  • [ENHANCEMENT] Memberlist now uses DNS service-discovery by default. #2549
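
If you keep your jsonnet overrides in a form you can inspect programmatically, a small audit can point out the anti-affinity options removed in #2517. The Python sketch below assumes the overrides have been loaded into a plain dict; the function name audit_jsonnet_options is hypothetical. It only reports option names, since the removed and added options take different kinds of values and cannot be converted mechanically.

```python
# Components whose anti-affinity options were replaced in #2517.
COMPONENTS = ["distributor", "query_frontend", "querier", "ruler"]


def audit_jsonnet_options(config):
    """Map each removed option found in `config` to its added counterpart.

    Hypothetical helper for illustration only. `config` is a dict of
    jsonnet config option names to values.
    """
    findings = {}
    for component in COMPONENTS:
        removed = component + "_allow_multiple_replicas_on_same_node"
        added = component + "_topology_spread_max_skew"
        if removed in config:
            findings[removed] = added
    return findings


if __name__ == "__main__":
    overrides = {"querier_allow_multiple_replicas_on_same_node": True}
    for removed, added in audit_jsonnet_options(overrides).items():
        print("remove", removed, "and consider setting", added)
```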

Mimirtool

  • [ENHANCEMENT] Added mimirtool backfill command to upload Prometheus blocks using API available in the compactor. #1822
  • [ENHANCEMENT] mimirtool bucket-validation: Verify existing objects can be overwritten by subsequent uploads. #2491
  • [ENHANCEMENT] mimirtool config convert: Now supports migrating to the current version of Mimir. #2629
  • [BUGFIX] mimirtool analyze: Fix dashboard JSON unmarshalling errors by using custom parsing. #2386

Mimir Continuous Test

Documentation

  • [ENHANCEMENT] Referenced mimirtool commands in the HTTP API documentation. #2516
  • [ENHANCEMENT] Improved DNS service discovery documentation. #2513

Tools

  • [ENHANCEMENT] markblocks now processes multiple blocks concurrently. #2677

New Contributors

Full Changelog: mimir-2.2.0...mimir-2.3.0-rc0