@bwplotka bwplotka released this Jun 6, 2019 · 19 commits to master since this release

TL;DR: Store LRU cache is no longer leaking, Upgraded Thanos UI to Prometheus 2.9, Fixed auto-downsampling, Moved to Go 1.12.5 and more.

This version also moves the tarballs from Golang 1.11 to 1.12.5, so the same warning applies if you use container_memory_usage_bytes from cAdvisor: use container_memory_working_set_bytes instead.

breaking: As announced a couple of times, this release also removes gossip along with all of its configuration flags (--cluster.*).

Fixed

  • #1142 fixed major leak on store LRU cache for index items (postings and series).
  • #1163 sidecar no longer blocks on custom Prometheus versions/builds. It only checks that flags return a non-404 response, then performs optional checks.
  • #1146 store/bucket: make getFor() work with interleaved resolutions.
  • #1157 querier correctly handles duplicated stores when some store changes external labels in place.

Added

  • #1094 Allow configuring the response header timeout for the S3 client.

Changed

  • #1118 breaking swift: Added support for cross-domain authentication by introducing userDomainID, userDomainName, projectDomainID, projectDomainName.
    The outdated terms tenantID, tenantName are deprecated and have been replaced by projectID, projectName.

  • #1066 Upgrade Thanos ui to Prometheus v2.9.1.

    Changes from the upstream:

    • query:
      • [ENHANCEMENT] Update moment.js and moment-timezone.js PR #4679
      • [ENHANCEMENT] Support to query elements by a specific time PR #4764
      • [ENHANCEMENT] Update to Bootstrap 4.1.3 PR #5192
      • [BUGFIX] Limit number of metrics in prometheus UI PR #5139
      • [BUGFIX] Web interface Quality of Life improvements PR #5201
    • rule:
      • [ENHANCEMENT] Improve rule views by wrapping lines PR #4702
      • [ENHANCEMENT] Show rule evaluation errors on rules page PR #4457
  • #1156 Moved CI and docker multistage to Golang 1.12.5 for latest mem alloc improvements.

  • #1103 Updated go-cos deps. (COS bucket client).

  • #1149 Updated google Golang API deps (GCS bucket client).

  • #1190 Updated minio deps (S3 bucket client). This fixes minio retries.

  • #1133 Use prometheus v2.9.2, common v0.4.0 & tsdb v0.8.0.

    Changes from the upstreams:

    • store gateway:
      • [ENHANCEMENT] Fast path for EmptyPostings cases in Merge, Intersect and Without.
    • store gateway & compactor:
      • [BUGFIX] Fix fd and vm_area leak on error path in chunks.NewDirReader.
      • [BUGFIX] Fix fd and vm_area leak on error path in index.NewFileReader.
    • query:
      • [BUGFIX] Make sure subquery range is taken into account for selection #5467
      • [ENHANCEMENT] Check for cancellation on every step of a range evaluation. #5131
      • [BUGFIX] Exponentiation operator to drop metric name in result of operation. #5329
      • [BUGFIX] Fix output sample values for scalar-to-vector comparison operations. #5454
    • rule:
      • [BUGFIX] Reload rules: copy state on both name and labels. #5368

Deprecated

  • #1008 breaking Removed Gossip implementation. All --cluster.* flags removed and Thanos will error out if any is provided.
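
Because Thanos now errors out when any gossip flag is provided, it can help to scan existing invocations for leftovers before upgrading. The following is a minimal sketch, assuming the arguments are available as a list of strings; `find_gossip_flags` and the example arguments are hypothetical and not part of Thanos.

```python
def find_gossip_flags(args):
    """Return every argument that belongs to the removed --cluster.* family."""
    # Matches both "--cluster.disable" and "--cluster.peers=host:port" forms.
    return [a for a in args if a.startswith("--cluster.")]


# Example arguments from a hypothetical deployment manifest.
args = [
    "query",
    "--cluster.disable",
    "--store=dns+thanos-store:10901",
    "--cluster.peers=thanos-peers:10900",
]
leftover = find_gossip_flags(args)
print(leftover)
```

Any argument this reports has to be dropped from the invocation before moving to this release.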

See full CHANGELOG here


@bwplotka bwplotka released this May 31, 2019 · 19 commits to master since this release

v0.5.0-rc.0

@bwplotka bwplotka released this May 4, 2019 · 74 commits to master since this release

⚠️ IMPORTANT ⚠️ This is the last release that supports gossip. From Thanos v0.5.0, gossip will be completely removed.

Major improvements:

  • This release also disables gossip mode by default for all components.
    See this for more details.
  • Store Gateway startup process is massively improved in both efficiency and memory consumption
  • Remote receiver component was added.
  • StoreUI now works beautifully 🌷
  • Timeout improvements for Querier
  • Control of concurrency and sample limits on Store Gateway gRPC API
  • Graceful handling and deletion of partial uploads made by Compactor.

Added

  • thanos.io website & automation 🎉
  • #1053 compactor: Compactor & store gateway now handle incomplete uploads gracefully. Added a hard limit on how long a block upload can take (30m).
  • #811 Remote write receiver component ❤️ ❤️ thanks to RedHat (@brancz) contribution.
  • #910 Query's stores UI page is now sorted by type and old DNS or File SD stores are removed after 5 minutes (configurable via the new --store.unhealthy-timeout=5m flag).
  • #905 Thanos support for Query API: /api/v1/labels. Notice that the API was added in Prometheus v2.6.
  • #798 Ability to limit the maximum number of concurrent requests to Series() calls in Thanos Store and the maximum number of samples we handle.
  • #1060 Allow specifying region attribute in S3 storage configuration

⚠️ WARNING ⚠️ #798 adds a new default limit to Thanos Store: --store.grpc.series-max-concurrency. Most likely you will want to make it the same as --query.max-concurrent on Thanos Query.

New options:

New Store flags:

* `--store.grpc.series-sample-limit` limits the number of samples that might be retrieved on a single Series() call. By default it is 0. Consider enabling it by setting it to more than 0 if you are running on limited resources.
* `--store.grpc.series-max-concurrency` limits the number of concurrent Series() calls in Thanos Store. By default it is 20. Consider making it lower or higher depending on the scale of your deployment.

New Store metrics:

* `thanos_bucket_store_queries_dropped_total` shows how many queries were dropped due to the samples limit;
* `thanos_bucket_store_queries_concurrent_max` is a constant metric which shows how many Series() calls can concurrently be executed by Thanos Store;
* `thanos_bucket_store_queries_in_flight` shows how many queries are currently "in flight" i.e. they are being executed;
* `thanos_bucket_store_gate_duration_seconds` shows how many seconds it took for queries to pass through the gate in both cases - when that fails and when it does not.

New Store tracing span:
* store_query_gate_ismyturn shows how long it took for a query to pass (or not) through the gate.
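
To illustrate how the two limits above interact, here is a minimal sketch of a gate that caps concurrent Series() calls and rejects calls that return too many samples. It is not the Thanos implementation: `SeriesGate`, `run` and `dropped_total` are hypothetical names, and the sample limit is checked after the fetch for simplicity, whereas a real store could enforce it while reading.

```python
import threading


class SeriesGate:
    """Illustrative gate: caps concurrent Series() calls and samples per call."""

    def __init__(self, max_concurrency=20, sample_limit=0):
        # 20 mirrors the default of --store.grpc.series-max-concurrency.
        self._slots = threading.BoundedSemaphore(max_concurrency)
        # 0 means "no limit", as with --store.grpc.series-sample-limit.
        self.sample_limit = sample_limit
        self._lock = threading.Lock()
        self.dropped_total = 0  # loosely analogous to the dropped-queries metric

    def run(self, fetch_samples):
        """Wait for a free slot, run the query, then enforce the sample limit."""
        with self._slots:  # blocks while max_concurrency calls are in flight
            samples = fetch_samples()
            if self.sample_limit and len(samples) > self.sample_limit:
                with self._lock:
                    self.dropped_total += 1
                raise RuntimeError("sample limit exceeded")
            return samples


gate = SeriesGate(max_concurrency=2, sample_limit=3)
print(gate.run(lambda: [1, 2, 3]))  # within the limit
try:
    gate.run(lambda: [1, 2, 3, 4])  # one sample too many
except RuntimeError as err:
    print(err, gate.dropped_total)
```

The time a caller spends blocked on the semaphore corresponds to what the gate duration metric and tracing span are described as measuring.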

  • #1016 Added option for another DNS resolver (miekg/dns client).
    Note that this is required to have SRV resolution working on Golang 1.11+ with KubeDNS below v1.14

    New Querier and Ruler flag: --store.sd-dns-resolver, which specifies the resolver to use: either golang or miekgdns.

  • #986 Saves startup & sync time in the store gateway, as it no longer needs to compute the index cache from the block index on its own for larger blocks.
    The store gateway can still do it, but it first checks the bucket for an already uploaded index cache.
    At the same time, the compactor precomputes the index cache file on every compaction.

    New Compactor flag: --index.generate-missing-cache-file was added to allow quicker addition of index cache files. If enabled, it precomputes missing files on compactor startup. Note that this takes time and is a one-off step per bucket.

  • #887 Compact: Added new --block-sync-concurrency flag, which allows you to configure number of goroutines to use when syncing block metadata from object storage.

  • #928 Query: Added --store.response-timeout flag. If a Store doesn't send any data in this specified duration then a Store will be ignored and partial data will be returned if it's enabled. 0 disables timeout.

  • #893 S3 storage backend has graduated to stable maturity level.

  • #936 Azure storage backend has graduated to stable maturity level.

  • #937 S3: added trace functionality. You can add trace.enable: true to enable the minio client's verbose logging.

  • #953 Compact: now has a hidden flag --debug.accept-malformed-index. Compaction index verification will ignore out of order label names.

  • #963 GCS: added possibility to inline ServiceAccount into GCS config.

  • #1010 Compact: added new flag --compact.concurrency. Number of goroutines to use when compacting groups.

  • #1028 Query: added --query.default-evaluation-interval, which sets default evaluation interval for sub queries.

  • #980 Ability to override Azure storage endpoint for other regions (China)

  • #1021 Query API series now supports POST method.

  • #939 Query API query_range now supports POST method.
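
As a sketch of what a POST call to query_range might look like, the snippet below builds (but does not send) a form-encoded request using the Prometheus-style `query`, `start`, `end` and `step` parameters. The Querier address and the `build_query_range_request` helper are assumptions for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical Querier address; replace with your own.
QUERIER_URL = "http://thanos-query.example:10902"


def build_query_range_request(query, start, end, step):
    """Build (but do not send) a form-encoded POST for /api/v1/query_range."""
    body = urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    ).encode("ascii")
    return Request(
        QUERIER_URL + "/api/v1/query_range",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


req = build_query_range_request("up", 1559347200, 1559350800, "15s")
print(req.get_method(), req.full_url)
```

Sending the parameters in the body rather than the URL is mainly useful for long queries that could otherwise run into URL length limits.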

Changed

  • #970 Deprecated partial_response_disabled proto field. Added partial_response_strategy instead. Both in gRPC and Query API.
    No PartialResponseStrategy field for RuleGroups by default means the abort strategy (old PartialResponse disabled), as this is the recommended option for rules and alerts.

    Metrics:

    • Added thanos_rule_evaluation_with_warnings_total to Ruler.
    • DNS thanos_ruler_query_apis* are now thanos_ruler_query_apis_* for consistency.
    • DNS thanos_querier_store_apis* are now thanos_querier_store_apis__* for consistency.
    • Query Gate thanos_bucket_store_series* are now thanos_bucket_store_series_* for consistency.
    • Most Thanos Ruler metrics related to the rule manager have a strategy label.

    Ruler tracing spans:

    • /rule_instant_query HTTP[client] is now /rule_instant_query_part_resp_abort HTTP[client] if the request is for the abort strategy.
  • #1009: Upgraded Prometheus (~v2.7.0-rc.0 to v2.8.1) and TSDB (v0.4.0 to v0.6.1) deps.

    Changes that affect Thanos:

    • query:
      • [ENHANCEMENT] In histogram_quantile merge buckets with equivalent le values. #5158.
      • [ENHANCEMENT] Show list of offending labels in the error message in many-to-many scenarios. #5189
      • [BUGFIX] Fix panic when aggregator param is not a literal. #5290
    • ruler:
      • [ENHANCEMENT] Reduce time that Alertmanagers are in flux when reloaded. #5126
      • [BUGFIX] prometheus_rule_group_last_evaluation_timestamp_seconds is now a unix timestamp. #5186
      • [BUGFIX] prometheus_rule_group_last_duration_seconds now reports seconds instead of nanoseconds. Fixes our issue #1027
      • [BUGFIX] Fix sorting of rule groups. #5260
    • store: [ENHANCEMENT] Fast path for EmptyPostings cases in Merge, Intersect and Without.
    • tooling: [FEATURE] New dump command to tsdb tool to dump all samples.
    • compactor: [ENHANCEMENT] When closing the db any running compaction will be cancelled so it doesn't block.
    • [CHANGE] Renamed flag --sync-delay to --consistency-delay #1053

    For the ruler, essentially the whole TSDB CHANGELOG between v0.4.0 and v0.6.1 applies: https://github.com/prometheus/tsdb/blob/master/CHANGELOG.md

    Note that this was added to TSDB and Prometheus: [FEATURE] Time-overlapping blocks are now allowed. #370
    However, due to the nature of Thanos compaction (distributed systems), this is disabled for the Thanos compactor for now for safety reasons.

  • #868 Go has been updated to 1.12.

  • #1055 Gossip flags are now disabled by default and deprecated.

  • #964 repair: Repair process now sorts the series and labels within block.

  • #1073 Store: index cache for requests. It now calculates the size properly (includes slice header), has anti-deadlock safeguard and reports more metrics.
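
The index cache change in #1073 can be pictured with a byte-bounded LRU that charges each item its payload plus a fixed header, and refuses items that can never fit. This is a minimal sketch under assumptions, not the Thanos code: `SizedLRU` is a hypothetical class, the 24-byte header is the size of a Go slice header on 64-bit platforms, and treating the oversized-item check as the "anti-deadlock safeguard" is only one plausible reading.

```python
from collections import OrderedDict

SLICE_HEADER_SIZE = 24  # assumption: Go slice header size on 64-bit platforms


class SizedLRU:
    """Illustrative byte-bounded LRU cache; not the Thanos implementation."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.cur_size = 0
        self._items = OrderedDict()  # least recently used entry first

    @staticmethod
    def _cost(value):
        # Charge the payload plus a fixed per-item overhead.
        return len(value) + SLICE_HEADER_SIZE

    def set(self, key, value):
        cost = self._cost(value)
        if cost > self.max_size:
            # Safeguard: an item that can never fit is rejected up front,
            # so the eviction loop below never runs on an empty cache.
            return False
        if key in self._items:
            self.cur_size -= self._cost(self._items.pop(key))
        while self.cur_size + cost > self.max_size:
            _, evicted = self._items.popitem(last=False)  # drop the oldest
            self.cur_size -= self._cost(evicted)
        self._items[key] = value
        self.cur_size += cost
        return True

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]


cache = SizedLRU(max_size=100)
cache.set("a", b"x" * 20)  # cost 44
cache.set("b", b"y" * 20)  # cost 44, total 88
cache.get("a")             # "a" becomes the most recently used entry
cache.set("c", b"z" * 20)  # evicts "b", the least recently used entry
print(cache.get("b"), cache.cur_size)
```

Without the per-item overhead in `_cost`, many small items would make the cache hold noticeably more memory than its configured maximum suggests.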

Fixed

  • #921 thanos_objstore_bucket_last_successful_upload_time now does not appear when no blocks have been uploaded so far.
  • #966 Bucket: verify no longer warns about overlapping blocks, that overlap 0s
  • #848 Compact: now correctly works with time series with duplicate labels.
  • #894 Thanos Rule: UI now correctly shows evaluation time.
  • #865 Query: now properly parses DNS SRV Service Discovery.
  • #889 Store: added safeguard against merging posting groups segfault
  • #941 Sidecar: added better handling of intermediate restarts.
  • #933 Query: Fixed 30 seconds lag of adding new store to query.
  • #962 Sidecar: Make config reloader file writes atomic.
  • #982 Query: now advertises Min & Max Time according to the nodes.
  • #1041 Ruler is now able to return long time range queries.
  • #904 Compact: Skip compaction for blocks with no samples.
  • #1070 Downsampling works again. Deferred closer errors are now properly captured.

See the full changelog here


@bwplotka bwplotka released this Apr 26, 2019 · 84 commits to master since this release

v0.4.0-rc.1

@bwplotka bwplotka released this Apr 18, 2019 · 97 commits to master since this release

v0.4.0-rc.0

@improbable-ludwik improbable-ludwik released this Mar 4, 2019 · 171 commits to master since this release

v0.3.2 - 2019.03.04

Added

  • #851 New read API endpoint for api/v1/rules and api/v1/alerts.
  • #873 Store: fix set index cache LRU.

Fixed

  • #833 Store Gateway matcher regression for intersecting with empty posting.
  • #867 Fixed race condition in sidecar between reloader and shipper.

@domgreen domgreen released this Feb 18, 2019 · 181 commits to master since this release

Fixed

  • #829 Store Gateway crashing due to slice bounds out of range.
  • #834 fixed matcher regression for <> !=.

@domgreen domgreen released this Feb 8, 2019 · 192 commits to master since this release

🎉🎉🎉

Added

  • Support for gzip compressed configuration files before envvar substitution for reloader package.
  • bucket inspect command for better insights on blocks in object storage.
  • Support for Tencent COS object storage.
  • Partial Response disable option for StoreAPI and QueryAPI.
  • Partial Response disable button on Thanos UI
  • We have initial docs for goDoc documentation!
  • Flags for Querier and Ruler UIs: --web.route-prefix, --web.external-prefix, --web.prefix-header. Details here
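
The partial response option listed above decides what happens when one store in a fan-out fails. The sketch below shows the idea under simplifying assumptions: `fan_out`, the store names and the error text are hypothetical, and the stores are queried sequentially rather than concurrently.

```python
def fan_out(stores, partial_response=True):
    """Query every store; return (series, warnings) or raise on failure."""
    series, warnings = [], []
    for name, fetch in stores.items():
        try:
            series.extend(fetch())
        except Exception as err:  # e.g. a store that is down or timing out
            if not partial_response:
                # Partial response disabled: one failing store fails the query.
                raise RuntimeError(f"store {name} failed: {err}") from err
            warnings.append(f"store {name} failed: {err}")
    return series, warnings


def broken_store():
    raise ConnectionError("unavailable")


stores = {
    "sidecar-0": lambda: ["up{instance='a'}"],
    "store-gw-0": broken_store,
}
series, warnings = fan_out(stores, partial_response=True)
print(series, warnings)
```

With partial response disabled, the same call raises instead of returning incomplete data, which is the safer choice wherever silently missing series would be misleading.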

Fixed

  • #649 - Fixed store label values api to add also external label values.
  • #396 - Fixed sidecar logic for proxying series that has more than 2^16 samples from Prometheus.
  • #732 - Fixed S3 authentication sequence. You can see new sequence enumerated here
  • #745 - Fixed race conditions and edge cases for Thanos Querier fanout logic.
  • #651 - Fixed index cache when asked buffer size is bigger than cache max size.

Changed

  • #529 Massive improvement for compactor. Downsampling memory consumption was reduced to only store labels and single chunks per each series.
  • Querier UI: Store page now shows the store APIs per component type.
  • Prometheus and TSDB deps are now up to date with ~2.7.0 Prometheus version. Lots of things have changed. See details here #704 Known changes that affect us:
    • prometheus/prometheus/discovery/file
      • [ENHANCEMENT] Discovery: Improve performance of previously slow updates of changes of targets. #4526
      • [BUGFIX] Wait for service discovery to stop before exiting #4508 ??
    • prometheus/prometheus/promql:
      • [ENHANCEMENT] Subqueries support. #4831
      • [BUGFIX] PromQL: Fix a goroutine leak in the lexer/parser. #4858
      • [BUGFIX] Change max/min over_time to handle NaNs properly. #438
      • [BUGFIX] Check label name for count_values PromQL function. #4585
      • [BUGFIX] Ensure that vectors and matrices do not contain identical label-sets. #4589
      • [ENHANCEMENT] Optimize PromQL aggregations #4248
      • [BUGFIX] Only add LookbackDelta to vector selectors #4399
      • [BUGFIX] Reduce floating point errors in stddev and related functions #4533
    • prometheus/prometheus/rules:
      • New metrics exposed! (prometheus evaluation!)
      • [ENHANCEMENT] Rules: Error out at load time for invalid templates, rather than at evaluation time. #4537
    • prometheus/tsdb/index: Index reader optimizations.
  • Thanos store gateway flag for sync concurrency (block-sync-concurrency with 20 default, so no change by default)
  • S3 provider:
    • Added put_user_metadata option to config.
    • Added insecure_skip_verify option to config.

Deprecated

  • Tests against Prometheus below v2.2.1. This does not mean lack of support for those versions, only that we no longer test compatibility with them. See #758 for details.

@bwplotka bwplotka released this Dec 27, 2018 · 237 commits to master since this release

Xmas patch to release 2 critical fixes (Azure, DNS SD) and awesome, new store UI page.

This also includes first mitigation for #335

Changelog also available here.

Added

  • Relabel drop for Thanos Ruler to enable replica label drop and alert deduplication on AM side.
  • Query: Stores UI page available at /stores.

Fixed

  • Thanos Rule Alertmanager DNS SD bug.
  • DNS SD bug when having SRV results with different ports.
  • Move handling of HA alertmanagers to be the same as Prometheus.
  • Azure iteration implementation flaw.

@bwplotka bwplotka released this Dec 10, 2018 · 254 commits to master since this release

Next Thanos release, adding support for a new discovery method, gRPC mTLS and two new object store providers (Swift and Azure).

Note the many necessary breaking changes in flags that relate to bucket configuration.

Changelog also available here.

Deprecated

  • breaking: Removed all bucket specific flags as we moved to config files:
    • --gcs-bucket=<bucket>
    • --s3.bucket=<bucket>
    • --s3.endpoint=<api-url>
    • --s3.access-key=<key>
    • --s3.insecure
    • --s3.signature-version2
    • --s3.encrypt-sse
    • --gcs-backup-bucket=<bucket>
    • --s3-backup-bucket=<bucket>
  • breaking: Removed support of those environment variables for bucket:
    • S3_BUCKET
    • S3_ENDPOINT
    • S3_ACCESS_KEY
    • S3_INSECURE
    • S3_SIGNATURE_VERSION2
  • breaking: Removed provider-specific bucket metrics, e.g. thanos_objstore_gcs_bucket_operations_total, in favor of generic bucket operation metrics.

Changed

  • breaking: Added thanos_ prefix to memberlist (gossip) metrics. Make sure to update your dashboards and rules.
  • S3 provider:
    • Set "X-Amz-Acl": "bucket-owner-full-control" metadata for s3 upload operation.

Added

  • Support for heterogeneous secure gRPC on StoreAPI.
  • Handling of scalar result in rule node evaluating rules.
  • Flag --objstore.config-file to reference the bucket configuration file in YAML format. Detailed information can be found in document storage.
  • File service discovery for StoreAPIs:
  • In thanos rule, static configuration of query nodes via --query
  • In thanos rule, file based discovery of query nodes using --query.file-sd-config.files
  • In thanos query, file based discovery of store nodes using --store.file-sd-config.files
  • /-/healthy endpoint to Querier.
  • DNS service discovery to static and file based configurations using the dns+ and dnssrv+ prefixes for the respective lookup. Details here
  • --cluster.disable flag to disable gossip functionality completely.
  • Hidden flag to configure max compaction level.
  • Azure Storage.
  • OpenStack Swift support.
  • Thanos Ruler thanos_rule_loaded_rules metric.
  • Option for JSON logger format.
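
The dns+ and dnssrv+ prefixes from the DNS service discovery entry above select the kind of lookup performed for an address. A minimal sketch of that dispatch follows, assuming dns+ maps to an A/AAAA lookup and dnssrv+ to an SRV lookup; `split_sd_address` and the example hostnames are hypothetical.

```python
def split_sd_address(addr):
    """Split a configured address into (lookup_kind, target) by its prefix."""
    if addr.startswith("dnssrv+"):
        return "srv", addr[len("dnssrv+"):]  # SRV lookup supplies the port
    if addr.startswith("dns+"):
        return "a", addr[len("dns+"):]       # A/AAAA lookup, port given inline
    return "static", addr                    # no prefix: use the address as-is


for addr in (
    "dns+thanos-store.example:10901",
    "dnssrv+_grpc._tcp.thanos-store.example",
    "10.0.0.5:10901",
):
    print(split_sd_address(addr))
```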

Fixed

  • Issue whereby the Proxy Store could end up in a deadlock if there were more than 9 stores being queried and all returned an error.
  • Ruler tracing causing panics.
  • GatherIndexStats panics on duplicated chunks check.
  • Clean up of old compact blocks on compact restart.
  • Sidecar too frequent Prometheus reload.
  • thanos_compactor_retries_total metric not being registered.