Remove max_samples_per_query limit. #397

Merged: pstibrany merged 2 commits into main from remove-max_samples_per_query on Sep 16, 2021

pstibrany
Member

What this PR does: This PR removes the max_samples_per_query limit because it is not used in most setups and is confusing in its current form.

Checklist

  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pstibrany pstibrany requested a review from a team as a code owner September 16, 2021 15:37
@pstibrany pstibrany merged commit abb0806 into main Sep 16, 2021
@pstibrany pstibrany deleted the remove-max_samples_per_query branch September 16, 2021 15:42
simonswine pushed a commit to grafana/mimir that referenced this pull request Oct 18, 2021
Brings in the following changes:
- Use default as a picker value for datasource variable (grafana/jsonnet-libs#204)
- allow table link in new tab (grafana/jsonnet-libs#238)
- allow setting a default datasource (grafana/jsonnet-libs#301)
- Add textPanel (grafana/jsonnet-libs#341)
- make status code label name overrideable in qpsPanel (grafana/jsonnet-libs#397)
- use $__rate_interval over $__interval (grafana/jsonnet-libs#401)
- Set shared tooltip to false by default (grafana/jsonnet-libs#458)
- Use custom 'all' value to avoid massive regexes in queries. (grafana/jsonnet-libs#469)

https://github.com/grafana/jsonnet-libs/commits/master/grafana-builder/
pracucci added a commit to grafana/mimir that referenced this pull request Oct 19, 2021
* Increased CortexAllocatingTooMuchMemory alert threshold

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Add alert for etcd memory limits close

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* the distributor now supports push via GRPC (grafana/cortex-jsonnet#266)

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* Fixed CortexQuerierHighRefetchRate alert

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fixed label matcher

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Sort legend descending in the CPU/memory panels

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Add slow queries dashboard

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added tenant ID field to the table

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Add recording rules to calculate Cortex scaling

- Update dashboard so it only shows under-provisioned services and why
- Add sizing rules based on limits.
- Add some docs to the dashboard.

Signed-off-by: Tom Wilkie <tom@grafana.com>

* Increased CortexRequestErrors alert severity

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fixed "Disk Writes" and "Disk Reads" panels

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Pre-compute aggregations to optimize scaling recording rules

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Removed 5m step from subquery

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Add function to customize compactor statefulset

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Use the job name in compactor alerts too

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fixed CortexCompactorRunFailed threshold

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added Cortex Rollout progress dashboard

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fix 'Unhealthy pods' in Cortex Rollout dashboard

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Simplify compactor alerts

We should simply alert on things not having run since X.

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Use the right metric

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Apply suggestions from code review

Co-authored-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Fix CortexCompactorHasNotSuccessfullyRunCompaction to avoid false positives

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Introduce ingester instance limits to configuration, and add alerts. (grafana/cortex-jsonnet#296)

* Introduce ingester instance limits to configuration, and add alerts.

* CHANGELOG.md

* Address (internal) review feedback.

* Improve CortexRulerFailedRingCheck alert

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added example Loki query to CortexTenantHasPartialBlocks playbook

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Default dashboards to Cortex blocks storage only

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Add missing memberlist components to alerts

This adds the admin-api, compactor and store-gateway components to the
memberlist alert.

Signed-off-by: Christian Simon <simon@swine.de>

* mixin: Add gateway to valid job names (for GEM)

* Only show namespaces from selected cluster. "All" works thanks to using regex matcher. (grafana/cortex-jsonnet#311)

* Only show namespaces from selected cluster. "All" works thanks to using regex matcher.

* CHANGELOG.md
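
To illustrate why a regex matcher lets "All" work, here is a minimal Python sketch. It assumes the dashboard variable expands to a regex such as `.*` when "All" is selected, and it models Prometheus' fully anchored `=~` matcher with `re.fullmatch`. The function name `namespaces_for` and the sample data are hypothetical.

```python
import re
from typing import Iterable, List, Tuple


def namespaces_for(cluster_selection: str,
                   pods: Iterable[Tuple[str, str]]) -> List[str]:
    """Return namespaces whose cluster matches the selection regex.

    `pods` is an iterable of (cluster, namespace) pairs. The selection is
    matched against the whole cluster name, like a PromQL `=~` matcher.
    """
    pattern = re.compile(cluster_selection)
    return sorted({ns for cluster, ns in pods if pattern.fullmatch(cluster)})


# Hypothetical sample data: (cluster, namespace) pairs.
PODS = [("prod", "cortex"), ("dev", "cortex-dev"), ("prod", "loki")]

print(namespaces_for("prod", PODS))  # a single selected cluster
print(namespaces_for(".*", PODS))    # "All" expressed as a regex
```

An equality matcher would need a special case for "All", whereas the regex form handles both selections with the same query.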

* Fixed CortexIngesterHasNotShippedBlocks alert false positive

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fixed mixin linter

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Add placeholders to make the linter pass

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* cortex-mixin: Use kube_pod_container_resource_{requests,limits} metrics

This updates the recording rules to make them compatible with kube-state-metrics v2.0.0
which introduces some breaking changes in some metric names.

With kube-state-metrics v2.0.0:
- `kube_pod_container_resource_requests_cpu_cores` becomes `kube_pod_container_resource_requests{resource="cpu"}`
- `kube_pod_container_resource_requests_memory_bytes` becomes `kube_pod_container_resource_requests{resource="memory"}`
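
As a rough illustration of the rename listed above, the following Python sketch rewrites the two v1 metric names into their v2 selector form. The helper name `to_ksm_v2` and the lookup table are hypothetical and cover only the two metrics named in this commit message.

```python
# Hypothetical lookup table: the two pre-v2.0.0 kube-state-metrics names
# listed above, mapped to the v2.0.0 metric name and `resource` label value.
KSM_V1_TO_V2 = {
    "kube_pod_container_resource_requests_cpu_cores":
        ("kube_pod_container_resource_requests", "cpu"),
    "kube_pod_container_resource_requests_memory_bytes":
        ("kube_pod_container_resource_requests", "memory"),
}


def to_ksm_v2(metric_name: str) -> str:
    """Return the v2 selector for a known v1 name; pass other names through."""
    if metric_name not in KSM_V1_TO_V2:
        return metric_name
    name, resource = KSM_V1_TO_V2[metric_name]
    return f'{name}{{resource="{resource}"}}'


print(to_ksm_v2("kube_pod_container_resource_requests_cpu_cores"))
```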

* cortex-mixin: Make the recording rules backwards compatible

* refactor: functions to reduce code duplication
- improve overrideability
- making more use of `per_instance_label` from _config
- added containerNetworkPanel functions for dashboards to use

* fix: lint

* refactor: config for job aggregation strings

- to make it easier to override, define "cluster_namespace_job"
  in $._config as `job_aggregation_prefix`.
- added some `job_aggregation_labels_*` as well

The resulting output does not change (unless config is overridden).

* lint

* Update cortex-mixin/dashboards/writes.libsonnet

simplify mapping by extending $._config

Co-authored-by: Marco Pracucci <marco@pracucci.com>

* fix: syntax

* refactor: added a group_config

defines group-related strings based off of array-based
parameters in _config.

deprecated _config.alert_aggregation_labels with a std.trace warning,
while maintaining (temporary?) backward compatibility.
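
The idea of deriving group strings from array-based parameters, while still honouring a deprecated key, can be sketched in Python as follows. This is not the mixin's jsonnet code: every key name except `alert_aggregation_labels` is hypothetical, and Python's `warnings` module stands in for jsonnet's `std.trace`.

```python
import warnings


def build_group_config(config: dict) -> dict:
    """Derive group-related strings from array-based parameters.

    `cluster_labels`, `group_by_cluster` and `group_prefix_clusters` are
    hypothetical names used only for this illustration.
    """
    labels = list(config.get("cluster_labels", ["cluster", "namespace"]))
    derived = {
        # Comma-separated form, e.g. for a PromQL `by (...)` clause.
        "group_by_cluster": ", ".join(labels),
        # Underscore-separated form, e.g. for a recording-rule name prefix.
        "group_prefix_clusters": "_".join(labels),
    }
    legacy = config.get("alert_aggregation_labels")
    if legacy is not None:
        # Backward compatibility: warn, then let the old string win.
        warnings.warn("alert_aggregation_labels is deprecated",
                      DeprecationWarning)
        derived["group_by_cluster"] = legacy
    return derived


print(build_group_config({"cluster_labels": ["cluster", "namespace"]}))
```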

* Lower CortexIngesterRestarts severity

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* feature: add some text boxes and descriptions

Focussing on the reads and writes dashboards,
added some info panels and hover-over descriptions
for some of the panels.
Some common code used by the compactor also
received additional text content.

New functions:
- addRows
- addRowsIf
...to add a list of rows to a dashboard.

The `thanosMemcachedCache` function has had some of its query text
sprawled out for easier reading and comparison with similar dashboard
queries.

* fix: text replacements, repair addRows

* Changing copy to add 'latency' as well.

* Cut down on text from initial PR. Tucked existing text from the compactor dashboard under tooltips, rather than making them text boxes.

* Getting rid of a few space/comma errors.

* Update cortex-mixin/dashboards/compactor.libsonnet

Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com>

* Update cortex-mixin/dashboards/compactor.libsonnet

Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com>

* Update cortex-mixin/dashboards/compactor.libsonnet

Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com>

* Update cortex-mixin/dashboards/compactor.libsonnet

Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com>

* Update cortex-mixin/dashboards/compactor.libsonnet

Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com>

* Update cortex-mixin/dashboards/compactor.libsonnet

Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com>

* fix: formatting - limit to 4 panels per row

* fmt

* fix: remove accidental line

* Update cortex-mixin/dashboards/dashboard-utils.libsonnet

Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com>

* Update cortex-mixin/dashboards/reads.libsonnet

Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com>

* Update cortex-mixin/dashboards/reads.libsonnet

Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com>

* Update cortex-mixin/dashboards/writes.libsonnet

Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com>

* Update cortex-mixin/dashboards/writes.libsonnet

Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com>

* Update cortex-mixin/dashboards/writes.libsonnet

Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com>

* Update cortex-mixin/dashboards/writes.libsonnet

Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com>

* Update cortex-mixin/dashboards/writes.libsonnet

Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com>

* Update cortex-mixin/dashboards/reads.libsonnet

Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com>

* fix: Requests per second

* fix: text

* Apply suggestions from code review as per @osg-grafana

Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com>

* fix: clarity

* Apply suggestions from code review as per @osg-grafana

Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com>

* Add a simple playbook for ingester series limit alert.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

* Add cortex-gw-internal to watched gateway metrics (grafana/cortex-jsonnet#328)

* Add cortex-gw-internal to watched gateway metrics

* Update CHANGELOG.md

Co-authored-by: Marco Pracucci <marco@pracucci.com>

* fix: query formatting to aid in merge

* fix: query formatting to aid in merge

* fix: consistent labelling

* fix: ensure panel titles are consistent

- Most existing "per second" panel titles in `main` are written "/ sec",
  corrected recent commits to match.

* Improved CortexIngesterReachingSeriesLimit playbook and added CortexIngesterReachingTenantsLimit playbook

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Better formatting for ingester_instance_limits+ example

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Clarify which alerts apply to chunks storage only

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Improve compactor alerts and playbooks

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Addressed review comments

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Update cortex-mixin/docs/playbooks.md

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Co-authored-by: Peter Štibraný <peter.stibrany@grafana.com>

* Fixed and improved runtime config alerts and playbooks

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* fix: resolve review feedback

* Update cortex-mixin/docs/playbooks.md

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Co-authored-by: Peter Štibraný <peter.stibrany@grafana.com>

* Update cortex-mixin/docs/playbooks.md

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Co-authored-by: Peter Štibraný <peter.stibrany@grafana.com>

* Mark CortexTableSyncFailure and CortexOldChunkInMemory alerts as chunks storage only

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fixed whitespace noise

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* refactor: resources dashboard container functions
added:
- containerDiskWritesPanel
- containerDiskReadsPanel
- containerDiskSpaceUtilization

* revert: matching spacing format of main

* lint: white noise

* Add playbook for CortexRequestErrors and config option to exclude specific routes

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Change min-step to 15s to show better detail.

$__rate_interval will be floored at 4x this quantity, so 15s lets us see
faster transients than the previous value of 1m.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
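
The effect of the min-step change can be worked through with a small Python sketch. It assumes Grafana computes `$__rate_interval` as `max($__interval + scrape interval, 4 * scrape interval)` and that the panel's min step is used as the scrape interval; the function name is hypothetical.

```python
def rate_interval_seconds(interval_s: float, min_step_s: float) -> float:
    """Approximate $__rate_interval for a panel, in seconds.

    Assumes the panel's min step plays the role of the scrape interval,
    so the result never drops below 4 * min_step_s.
    """
    return max(interval_s + min_step_s, 4 * min_step_s)


# With a small $__interval the 4x floor dominates in both cases.
print(rate_interval_seconds(5, 15))  # new min step of 15s
print(rate_interval_seconds(5, 60))  # previous min step of 1m
```

Under these assumptions the rate window floor drops from 240s to 60s, which is why shorter transients become visible.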

* Added playbook for CortexFrontendQueriesStuck and CortexSchedulerQueriesStuck

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Remove CortexQuerierCapacityFull alert

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added playbook for CortexProvisioningTooManyWrites

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added playbook for CortexAllocatingTooMuchMemory

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Address review feedback

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Replaced CortexCacheRequestErrors with CortexMemcachedRequestErrors

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Replace ruler alerts, and add playbooks.

* Addressed review comments

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fix white space.

* Better alert messages.

* Improve CortexIngesterReachingSeriesLimit playbook

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Add playbook for CortexProvisioningTooManyActiveSeries

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Improve messaging.

* Fixed formatting

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Improved alert messages with Cortex cluster

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Improved CortexRequestLatency playbook

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added 'Per route p99 latency' to ruler configuration API

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Addressed review comments

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added object storage metrics for Ruler and Alertmanager

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Add playbook entry for CortexGossipMembersMismatch.

* Clarify data loss related to 'not healthy index found' issue

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Review comments.

* Improve CortexIngesterReachingSeriesLimit playbook

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Increased CortexIngesterReachingSeriesLimit critical alert threshold from 80% to 85%

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Increase CortexIngesterReachingSeriesLimit warning `for` duration

As it turns out, during normal shuffle-sharding operation, the 70%
mark is often exceeded, but not by much. Rather than increasing the
threshold to 75%, this commit increases the `for` duration to 3h,
following the thought that we want this alert to fire if ingesters are
constantly above the threshold even after stale series are flushed
(which occurs every 2h, when the TSDB head is compacted). We flush
series with a timestamp between [-3h, -1h] after the last compaction,
so the worst case scenario is that it takes 3h to flush a stale
series.

Signed-off-by: beorn7 <beorn@grafana.com>
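
The worst-case figure in the message above can be checked with a little arithmetic. The Python sketch below assumes (this is not stated in the commit message) that the head is cut into blocks aligned to 2h boundaries and that a block is flushed 1h after its end; `hours_until_flushed` is a hypothetical name.

```python
BLOCK_RANGE_H = 2.0  # head compaction period quoted in the commit message
FLUSH_DELAY_H = 1.0  # assumption: a block is flushed 1h after its end


def hours_until_flushed(last_sample_h: float) -> float:
    """Hours between a series' last sample and its removal from the head."""
    # End of the 2h-aligned block that contains the last sample.
    block_end_h = (last_sample_h // BLOCK_RANGE_H + 1) * BLOCK_RANGE_H
    return block_end_h + FLUSH_DELAY_H - last_sample_h


# A series that goes stale right at a block start waits the longest.
worst = max(hours_until_flushed(m / 60) for m in range(120))
print(worst)
```

Under this model a stale series leaves the head between 1h and 3h after its last sample, which matches the [-3h, -1h] window and motivates a `for` duration of 3h.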

* Fix scaling dashboard to work on multi-zone ingesters

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Simplified cluster_namespace_deployment:actual_replicas:count recording rule

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added a comment to explain '.*?'

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fix rollout dashboard to work with multi-zone deployments

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fixed legends

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Extend Alertmanager dashboard with currently unused metrics.

Metrics for general operation:

- Added "Tenants" stat panel using:
  `cortex_alertmanager_tenants_discovered`

- Added "Tenant Configuration Sync" row using:
  `cortex_alertmanager_sync_configs_failed_total`
  `cortex_alertmanager_sync_configs_total`
  `cortex_alertmanager_ring_check_errors_total`

Metrics specific to sharding operation:

- Added "Sharding Initial State Sync" row using:
  `cortex_alertmanager_state_initial_sync_completed_total`
  `cortex_alertmanager_state_initial_sync_completed_total`
  `cortex_alertmanager_state_initial_sync_duration_seconds`

- Added "Sharding State Operations" row using:

  `cortex_alertmanager_state_fetch_replica_state_total`
  `cortex_alertmanager_state_fetch_replica_state_failed_total`
  `cortex_alertmanager_state_replication_total`
  `cortex_alertmanager_state_replication_failed_total`
  `cortex_alertmanager_partial_state_merges_total`
  `cortex_alertmanager_partial_state_merges_failed_total`
  `cortex_alertmanager_state_persist_total`
  `cortex_alertmanager_state_persist_failed_total`

* Review comments + fix latency panel.

* Review comments.

* Clarify the gsutil mv command for moving corrupted blocks

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>

* Modify log message to fit example command

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>

* Update grafana-builder from Mar 2019 to Feb 2021

Brings in the following changes:
- Use default as a picker value for datasource variable (grafana/jsonnet-libs#204)
- allow table link in new tab (grafana/jsonnet-libs#238)
- allow setting a default datasource (grafana/jsonnet-libs#301)
- Add textPanel (grafana/jsonnet-libs#341)
- make status code label name overrideable in qpsPanel (grafana/jsonnet-libs#397)
- use $__rate_interval over $__interval (grafana/jsonnet-libs#401)
- Set shared tooltip to false by default (grafana/jsonnet-libs#458)
- Use custom 'all' value to avoid massive regexes in queries. (grafana/jsonnet-libs#469)

https://github.com/grafana/jsonnet-libs/commits/master/grafana-builder/

* Match query-frontend/query-scheduler/querier custom deployments by default

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Create playbooks for sharded alertmanager

* Add new alerts for alertmanager sharding mode of operation.

* fix(rules): upstream recording rule switched to sum_irate

ref: kubernetes-monitoring/kubernetes-mixin#619

* Fix CortexIngesterReachingSeriesLimit playbook

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>

* feat: Allow configuration of ring members in gossip alerts

Signed-off-by: Jack Baldry <jack.baldry@grafana.com>

* fix: Add store-gateway and compactor ring_members

Also re-order names for readability.

Signed-off-by: Jack Baldry <jack.baldry@grafana.com>

* fix: Match all ingester workloads and avoid matching the cortex-gateway

Signed-off-by: Jack Baldry <jack.baldry@grafana.com>

* feat: Optionally allow use of array or string to configure ring members

Signed-off-by: Jack Baldry <jack.baldry@grafana.com>
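
Accepting either an array or a string for ring members could look like the Python sketch below. It is an illustration only: the function name `ring_members_regex` and the sample member names are hypothetical, and the real mixin does this in jsonnet.

```python
from typing import Sequence, Union


def ring_members_regex(ring_members: Union[str, Sequence[str]]) -> str:
    """Accept either a ready-made regex string or a list of member names."""
    if isinstance(ring_members, str):
        # Already a regex; use it unchanged.
        return ring_members
    # Join list entries into a regex alternation.
    return "|".join(ring_members)


print(ring_members_regex(["compactor", "ingester", "store-gateway"]))
print(ring_members_regex("compactor|ingester"))
```

Checking for a string first matters because a Python string is itself a sequence of characters.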

* address review feedback

Signed-off-by: Jack Baldry <jack.baldry@grafana.com>

* fix: Correct ingester and querier regexps

Signed-off-by: Jack Baldry <jack.baldry@grafana.com>

* Fixes to initial state sync panels on alertmanager dashboard.

1) Change minimal interval to 1m for sync duration and fetch state panels.

    This is in order to show infrequent events at smaller time windows.

2) Change syncs/sec panel to reflect absolute value of metric not rate.

    The initial sync only occurs once per-tenant so the counter value is
    essentially 0 or 1. Due to how per-tenant metrics are aggregated, the
    external facing metric really acts more like a gauge reflecting the number
    of tenants which achieved each outcome.

    Also, stack this panel as it becomes easier to visually see when the initial
    syncs have completed for all tenants (e.g. during a rollout).

* Add rate back to Alertmanager dashboard initial syncs panel.

The metric in fact does act like a counter due to soft deletion of the
per-user registry when the user is unconfigured (e.g. moved to another
instance or configuration deleted).

* Make the overrides metric name configurable.

We (Grafana Labs) are about to put in a new system to control and export
data about limits and we'll need to use a different name. This shouldn't
affect our OSS users.

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Improve Cortex / Queries dashboard

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Add recording rules for speeding up Alertmanager dashboard.

With large numbers of tenants the queries for some panels on this dashboard can
become quite slow as the metrics exposed are per-tenant.

* Fixes from testing.

* Move rules to their own group.

* Split `cortex_api` recording rule group into three groups.

This is a workaround for large clusters where this group can become slow to evaluate.

* Update gsutil installation playbook

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Use `$._config.job_names.gateway` in resources dashboards.

This fixes panels where `cortex-gw` was hardcoded.

* Fine tune CortexIngesterReachingSeriesLimit alert

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Add CortexRolloutStuck alert

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fixed playbook

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added CortexFailingToTalkToConsul alert

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fixed alert message

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Update alert to be generic to KV stores

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Add README

* Add mimir-mixin CI checks

* Update build image

* Move to operations folder

* Add missing zip to build-image

* Run prettifier on playbooks.md

* Update build-image

Co-authored-by: Marco Pracucci <marco@pracucci.com>
Co-authored-by: Goutham Veeramachaneni <gouthamve@gmail.com>
Co-authored-by: Mauro Stettler <mauro.stettler@gmail.com>
Co-authored-by: Tom Wilkie <tom@grafana.com>
Co-authored-by: Tom Wilkie <tomwilkie@users.noreply.github.com>
Co-authored-by: Goutham Veeramachaneni <gouthamve+github@gmail.com>
Co-authored-by: Peter Štibraný <peter.stibrany@grafana.com>
Co-authored-by: Alex Martin <alex@suitupalex.com>
Co-authored-by: Javier Palomo <javier.palomo@grafana.com>
Co-authored-by: Darren Janeczek <darren.janeczek@grafana.com>
Co-authored-by: Darren Janeczek <38694490+darrenjaneczek@users.noreply.github.com>
Co-authored-by: Jennifer Villa <jennifervilla@jennifers-mbp.lan>
Co-authored-by: Ursula Kallio <73951760+osg-grafana@users.noreply.github.com>
Co-authored-by: Callum Styan <callumstyan@gmail.com>
Co-authored-by: Johanna Ratliff <johanna.ratliff@grafana.com>
Co-authored-by: Bryan Boreham <bjboreham@gmail.com>
Co-authored-by: Steve Simpson <steve.simpson@grafana.com>
Co-authored-by: beorn7 <beorn@grafana.com>
Co-authored-by: Tyler Reid <tyler.reid@grafana.com>
Co-authored-by: George Robinson <george.robinson@grafana.com>
Co-authored-by: Duologic <jeroen@simplistic.be>
Co-authored-by: Arve Knudsen <arve.knudsen@gmail.com>
Co-authored-by: Jack Baldry <jack.baldry@grafana.com>
simonswine pushed a commit to grafana/mimir that referenced this pull request Dec 20, 2021
* Remove max_samples_per_query limit.

* Fixed CHANGELOG.md
pracucci added a commit to grafana/mimir that referenced this pull request Dec 20, 2021
* Added mega_user class

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fine-tune blocks storage config

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Disable tests by default to fix README instructions

Ref grafana/cortex-jsonnet#95

* Run store-gateway without CPU limits

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Use v1 API for Deployment and StatefulSet resources

* Version bump to v1.1.0

* Actually include the ruler

* Update config option name

* Added ruler_enabled and alertmanager_enabled flags. (grafana/cortex-jsonnet#116)

* Added publish not ready addresses

Signed-off-by: Joe Elliott <number101010@gmail.com>

* Removed -experimental.tsdb.store-gateway-enabled flag

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added a discovery svc and pointed the querier service at itself

Signed-off-by: Joe Elliott <number101010@gmail.com>

* lint

Signed-off-by: Joe Elliott <number101010@gmail.com>

* Added PodDisruptionBudget for store-gateway

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Allow to configure the blocks replication factor

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Switch store-gateway StatefulSets to Parallel Pod Management

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Ruler should use metadata cache as well, if configured. (grafana/cortex-jsonnet#128)

Ruler instantiates querier internally, so it can use metadata cache.

* Allow to customize ingester disk size and class

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Version bump to 1.2.0

* refactor: use jaeger-agent-mixin

lib got moved: grafana/jsonnet-libs#291

used jb-0.4.0 which updates the jsonnetfile.json format

* Switch blocks storage ingesters to Parallel pod management policy and 4d retention

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fixed comment

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Chunks blocks migration (grafana/cortex-jsonnet#148)

* Allow configuring querier with second store engine.

* Introduced newIngesterStatefulSet and newIngesterPdb functions.

* Rename parameters to be more clear.

* refactor(cortex): use first class citizens

for:
* requiredDuringSchedulingIgnoredDuringExecutionType
* portsType

These are available from: https://github.com/jsonnet-libs/k8s-alpha

* Update blocks storage CLI flags

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Do not apply blocks storage config to query-frontend, table-manager and purger

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Cleaned up blocks storage config

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Apply chunks-store config if primary or secondary store use chunks. (grafana/cortex-jsonnet#160)

* Enable table manager when using chunks storage as secondary storage engine for querier. (grafana/cortex-jsonnet#161)

* fix(ksonnet): backwards compatibility with ksonnet

* add overrides config to tsdb store-gateway

* Add jsonnet for ingester StatefulSet with WAL (grafana/cortex-jsonnet#72)

* Add jsonnet for ingester StatefulSet with WAL

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Add CHANGELOG entry

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Fix lint

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Fix review comments

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Change max query length to 32 days

To allow for comparison over months of 31d

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Fix ruler S3 config option (grafana/cortex-jsonnet#174)

* Removed -experimental.tsdb.store-gateway-enabled flag

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Use correct config variable for s3 ruler config

* restore dropped line

Co-authored-by: Marco Pracucci <marco@pracucci.com>

* Add support for local ruler_client_type (grafana/cortex-jsonnet#175)

* Support Alertmanager HA

With this, we can now support increasing the number of replicas for a
Cortex AM thus enabling HA.

Please note that Alerts themselves are not gossiped between
Alertmanagers. Each Ruler needs to send the alert to every Alertmanager
available thus the reason why a headless service gets created when the
number of replicas is more than 1.
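
The rule described above (a headless service only when there is more than one replica) can be sketched as a Python function that builds a Kubernetes Service manifest. The function name, the `name` selector label and the port are hypothetical; the only Kubernetes fact relied on is that `clusterIP: None` makes a Service headless, so DNS returns every pod address.

```python
def alertmanager_service(replicas: int, name: str = "alertmanager") -> dict:
    """Build a Service manifest, headless when running in HA mode."""
    spec = {
        "selector": {"name": name},            # hypothetical pod label
        "ports": [{"name": "http", "port": 80}],  # hypothetical port
    }
    if replicas > 1:
        # Headless: DNS resolves to all pod IPs, so each ruler can send
        # alerts to every Alertmanager replica.
        spec["clusterIP"] = "None"
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": spec,
    }


print(alertmanager_service(3)["spec"].get("clusterIP"))
```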

* Setup the gossip port

* s/isGossiping/isHa

* Bump to 3 replicas by default

* Bump the cortex image, the latest stable is 1.3

* Fix typo in Alertmanager configuration

* Alertmanager configuration tweaks

- Introduces the `fallback_config` option to allow an Alertmanager to
  have a fallback config.
- Gives the headless service a different name to allow seamless
  switching between 1 or multiple replicas. The cluster field in the
  service metadata is immutable, which made it impossible to create the
  new service unless you deleted the previous one.

* Remove different name for a headless service

Sadly, we can't have a different name for the headless service as the
statefulset is configured to match its name.

* Fix ruler s3 storage configuration

* Block storage support for s3

* Added Azure support to blocks storage

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fixed linter

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Removed the experimental prefix from blocks storage CLI flags

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Lower default ingestion limits and create a new overrides user

* Address review feedback

* Bump default series limit by 50%

* Add flusher job for blocks.

* Fixed Azure account name/key config

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Rename changed flags for 1.4 release.

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Make sure only a single ruler rolls out at a time

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Cut 1.4.0

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Add overrides exporter

The overrides exporter is part of grafana/cortex-tools and exposes runtime
overrides and related presets of Cortex as metrics.

Signed-off-by: Christian Simon <simon@swine.de>

* Refactor limits and overrides

Ensure we expose 'extra_small_user' and reference it setting the
"default" values.

This will raise the limits of the 'small_user' preset to the defaults
for `ingester.max-samples-per-query` and
`ingester.max-series-per-query`.

Signed-off-by: Christian Simon <simon@swine.de>

* Removed support for ingester.statefulset_replicas

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Switch compactor statefulset to Parallel pod management policy

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Cut 1.5.0 release

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Add ruler limits

Sets default presets for all the 'users' when it comes to ruler
limits.

* Add for the last user

* Enabled compactor sharding

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Rollback PR 213

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Re-introduce ruler limits

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* [fixup] ruler limits config key name

Ruler limits have a prefix of `ruler_` on the config key name. This
makes the key match and then uses them as the value for the flags.

* Removed postings-compression-enabled

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fine-tuned gRPC keepalive pings settings

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fixed gRPC settings

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Release 1.6.0

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Add option to configure unregister ingesters on shutdown

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fixed config

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Improved comment

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Updated doc

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Removed ifs

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Updated comment

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fixed syntax error

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Remove misleading comment (grafana/cortex-jsonnet#243)

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Add option to customise the configmap name

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Fix for real

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added bucket index flag, and enable bucket index by default. (grafana/cortex-jsonnet#254)

* Cleanup blocks storage config

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* feat: allow for Alertmanager to configure multiple storage backends

Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>

* Update cortex/config.libsonnet

Co-authored-by: gotjosh <josue@grafana.com>

* Update cortex/alertmanager.libsonnet

Co-authored-by: gotjosh <josue@grafana.com>

* Release 1.7.0. (grafana/cortex-jsonnet#260)

* Release 1.7.0.

* cortex: config: Fix error message for alertmanager_client_type.

* cortex: alertmanager: Remove space in dot notation.

* Up metadata connection limits

* Add flag to enable streaming of chunks. (grafana/cortex-jsonnet#276)

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Add recording rules to calculate Cortex scaling

- Update dashboard so it only shows under provisioned services and why
- Add sizing rules based on limits.
- Add some docs to the dashboard.

Signed-off-by: Tom Wilkie <tom@grafana.com>

* chore: update lib to use new API paths

Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>

* Create 1.8.0 release. (grafana/cortex-jsonnet#282)

* Create 1.8.0 release.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Update image tags.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Do not use deprecated Alertmanager cluster flags

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* fix: Update ksonnet-util vendor lock

The previous version `c19a92e586a6752f11745b47f309b13f02ef7147` is
incompatible with the library in its current form. For example in
`tsdb.libsonnet` L81, we use `pvc.new('ingester-pvc')` but at the
locked version, in `ksonnet-util/kausal.libsonnet` the `pvc.new`
function takes no arguments.

Signed-off-by: Jack Baldry <jack.baldry@grafana.com>

* Add function to customize compactor statefulset

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Add querier_service_ignored_labels (grafana/cortex-jsonnet#291)

Co-authored-by: Victor Tsang Hi <victor.tsang.hi@sap.com>

* Introduce ingester instance limits to configuration, and add alerts. (grafana/cortex-jsonnet#296)

* Introduce ingester instance limits to configuration, and add alerts.

* CHANGELOG.md

* Address (internal) review feedback.

* Add `query-scheduler.libsonnet` (grafana/cortex-jsonnet#295)

* Add query-scheduler.libsonnet.

* CHANGELOG.md

* Use flag to enable query-scheduler.

* Fix image.

* Replace use of querier.compress-http-responses removed in Cortex 1.9

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Enable index-header lazy loading in store-gateway

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Do not use deprecated/removed flag -limits.per-user-override-config

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Use new ruler storage config and enable API compression

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Changed alertmanager config to use the new storage config

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Cut release 1.9.0

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Mount overrides configmap to alertmanager too

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Upgrade memcached

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Increase default store-gateway memory request and limit

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fix

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Set -server.grpc-max-*-msg-size-bytes for ruler and ingester. (grafana/cortex-jsonnet#326)

* Fixed --alertmanager.cluster.peers

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Set empty alertmanager listen address with 1 replica

Alertmanager tries to start clustering unless the flag is explicitly set to an empty string:
https://github.com/prometheus/alertmanager#turn-off-high-availability

* Add option to disable anti-affinity in newIngesterStatefulSet()

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fix alertmanager config change introduced in grafana/cortex-jsonnet#344

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Create another tier with 300K active series

The other tiers have a 3x jump except when we go from 100K to 1Mil. I
think we should have a 3x jump for the first tier too.

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Improve config settings based on recent learnings

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added functions to create query-frontend and querier deployments

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added function to create query-scheduler deployment

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* chore: upgrade to latest etcd-operator

Brings: grafana/jsonnet-libs#480

* Alertmanager: Allow storage configuration to support Azure

The alertmanager configuration did not have support for Azure. Let's add it.

* remove new line

* Fix comment on medium_small_user config

It says it should be 100k + 50%, but that's what extra_small_user is.
Here we have 300k, which is 200k + 50%.

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

* Remove wrong comment

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

* Add overrides to compactor

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Split limits config into a variable we can reuse

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Review feedback

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Fix missing ruler limits

Damn, missed this in grafana/cortex-jsonnet#391

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Alertmanager: Add sharding configuration.

* Fix `compactor_blocks_retention_period` type in `extra_small_user` (grafana/cortex-jsonnet#395)

* Fix `compactor_blocks_retention_period` type in `extra_small_user`

The actual type of `compactor_blocks_retention_period` is `model.Duration`, which comes
from the Prometheus `common` package.

The problem is that `model.Duration` has a custom JSON unmarshaller that treats the incoming
value as a string.
https://github.com/prometheus/common/blob/main/model/time.go#L276

So setting it as an integer won't work when unmarshalling from JSON.

NOTE: This isn't an issue for YAML unmarshalling, which always treats the value as a string
(even if you set it as an integer).
https://github.com/prometheus/common/blob/main/model/time.go#L307

* update CHANGELOG

* Update rule limits to be in line with customer expectations

We built the initial rules on guesswork and now we're updating them
based on what the customers are asking for.

Further, the ruler can be horizontally scaled and we're happy letting
our users have more rules!

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

* Remove max_samples_per_query limit. (grafana/cortex-jsonnet#397)

* Remove max_samples_per_query limit.

* Fixed CHANGELOG.md

* Removed chunks storage query sharding config support

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Add queryEngineConfig

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* tsdb: Add multi concurrency and max idle connections store gateway params

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>

* Update cortex/tsdb.libsonnet

Co-authored-by: Marco Pracucci <marco@pracucci.com>

* Fix formatting

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>

* tsdb: Use literal numbers instead of variables

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>

* cortex: Make ruler object storage support generic

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>

* Remove ruler-storage.gcs.bucket-name for Azure

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>

* cortex: Define Azure ruler args

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>

* Parameterize

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>

* Further document ingester_stream_chunks_when_using_blocks parameter

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>

* Add options to disable anti-affinity

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Upstream some config improvements

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Increased max connections for memcached chunks and index-queries too

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Ruler: Pass `-ruler-storage.s3.endpoint` to ruler when using S3.

This argument is required; without it, the following error appears:

```
no s3 endpoint in config file
```

* Allow to create custom store-gateway StatefulSets via newStoreGatewayStatefulSet()

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fix newStoreGatewayStatefulSet() to use input container

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Add CI check for jsonnet manifests

* Remove additional git diff in check-mixin

* Imported cortex-jsonnet CHANGELOG entries from 1.9.0

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Improved CHANGELOG header

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Co-authored-by: Marco Pracucci <marco@pracucci.com>
Co-authored-by: Austin McKinley <54160+amckinley@users.noreply.github.com>
Co-authored-by: Tom Wilkie <tomwilkie@users.noreply.github.com>
Co-authored-by: Jacob Lisi <jacob.t.lisi@gmail.com>
Co-authored-by: Austin McKinley <austin.mckinley@robinhood.com>
Co-authored-by: Goutham Veeramachaneni <gouthamve@gmail.com>
Co-authored-by: Peter Štibraný <peter.stibrany@grafana.com>
Co-authored-by: Joe Elliott <number101010@gmail.com>
Co-authored-by: Joe Elliott <joe.elliott@grafana.com>
Co-authored-by: Duologic <jeroen@simplistic.be>
Co-authored-by: Jeroen Op 't Eynde <jeroen@grafana.com>
Co-authored-by: Sandeep Sukhani <sandeep.d.sukhani@gmail.com>
Co-authored-by: Ganesh Vernekar <15064823+codesome@users.noreply.github.com>
Co-authored-by: Stan Kwong <jpdstan@gmail.com>
Co-authored-by: gotjosh <josue@grafana.com>
Co-authored-by: forestsword <colsen@adobe.com>
Co-authored-by: Jacob Lisi <jlisi@grafana.com>
Co-authored-by: Alex Martin <alex@suitupalex.com>
Co-authored-by: Tom Wilkie <tom@grafana.com>
Co-authored-by: Jack Baldry <jack.baldry@grafana.com>
Co-authored-by: Victor Tsang Hi <victor.tsanghi@gmail.com>
Co-authored-by: Victor Tsang Hi <victor.tsang.hi@sap.com>
Co-authored-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Co-authored-by: Steve Simpson <steve.simpson@grafana.com>
Co-authored-by: Hamish <hamish.forbes@gmail.com>
Co-authored-by: Javier Palomo <javier.palomo@grafana.com>
Co-authored-by: gotjosh <josue.abreu@gmail.com>
Co-authored-by: Oleg Zaytsev <mail@olegzaytsev.com>
Co-authored-by: Kaviraj <kavirajkanagaraj@gmail.com>
Co-authored-by: Arve Knudsen <arve.knudsen@gmail.com>