Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loki: Change default limits to common values #4415

Merged
merged 5 commits into from
Oct 13, 2021

Conversation

DylanGuedes
Copy link
Contributor

@DylanGuedes DylanGuedes commented Oct 5, 2021

What this PR does / why we need it:

  • Change default limits for the following parameters: ingestion_rate_strategy, max_global_streams_per_user, max_query_length, max_query_parallelism, max_streams_per_user, reject_old_samples: true, reject_old_samples_max_age. These new limits protects users from overwhelming their cluster with ingestion load and are already used by most configs.

Which issue(s) this PR fixes:
No issue linked

Special notes for your reviewer:

  • I didn't remove redundant values from the helmcharts/jsonnet files because I'm not sure what is the impact of it. If you think it is a good idea just let me know.
  • I was thinking of creating a test with the new default values but that sounded too much. Again, if you think it is a good idea just let me know.

Checklist

  • Documentation added
  • Tests updated (N/A)

@CLAassistant
Copy link

CLAassistant commented Oct 5, 2021

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@KMiller-Grafana KMiller-Grafana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs portion of the PR LGTM.

Copy link
Member

@owen-d owen-d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I really like the direction this is moving (more sensible defaults). However, we'll need to add all of these changes to the upgrade guide:
https://github.com/grafana/loki/blob/main/docs/sources/upgrading/_index.md#master--unreleased
and add an entry in the changelog: https://github.com/grafana/loki/blob/main/CHANGELOG.md

@DylanGuedes
Copy link
Contributor Author

DylanGuedes commented Oct 6, 2021

Thanks! I really like the direction this is moving (more sensible defaults). However, we'll need to add all of these changes to the upgrade guide: https://github.com/grafana/loki/blob/main/docs/sources/upgrading/_index.md#master--unreleased and add an entry in the changelog: https://github.com/grafana/loki/blob/main/CHANGELOG.md

Awesome, I'll work on that! What about changing the helmcharts? Do you think it makes sense to also change it in this same PR?

edit: finished adding the new changes to the changelog and to the upgrading guide.

Copy link
Collaborator

@trevorwhitney trevorwhitney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks dylan, looks great! I too am excited with the direction things are moving re: sane defaults

docs/sources/upgrading/_index.md Outdated Show resolved Hide resolved
docs/sources/upgrading/_index.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
owen-d
owen-d previously requested changes Oct 7, 2021
Copy link
Member

@owen-d owen-d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a hunch and pulled/ran this locally against https://github.com/grafana/loki/blob/main/cmd/loki/loki-local-config.yaml

Unfortunately, making the new default distributor based rate limit "global" instead of "local" causes it to require the ring, which defaults to using consul. This causes some failures like,

level=warn ts=2021-10-07T20:06:51.282282554Z caller=module_service.go:94 msg="module failed with error" module=distributor err="distributor subservice failed: service distributor ring lifecycler failed: failed to join the ring distributor: failed to CAS collectors/distributor"
level=error ts=2021-10-07T20:06:51.282307015Z caller=loki.go:301 msg="module failed" module=distributor error="distributor subservice failed: service distributor ring lifecycler failed: failed to join the ring distributor: failed to CAS collectors/distributor"

I wonder if we can instead make the distributor ring default to inmemory.
@slim-bean @cyriltovena WDYT? Should we make all the rings default to in-memory instead of Consul? That could break users who are using the default localhost:8500 consul default, but that may be a preferable change we can document.

@trevorwhitney
Copy link
Collaborator

👍 to the idea of defaulting to in-memory, seems reasonable that any external dependencies you want to add to your system you would have to specify, and we try to pick defaults that don't require dependencies.

An alternative is we could dynamically make distributor rate limit default to global based on provided config, but I don't like that solution as much as it's much more confusing to the user what the limits are.

@DylanGuedes
Copy link
Contributor Author

DylanGuedes commented Oct 7, 2021

+1 to default to in-memory. Just one thing: I noticed that this is actually dictated by the DSKit repo (at this line), so the impact of changing it might be bigger than expected. Do you think all the other projects are fine with changing that too?

edit: and yeah, after changing the default to inmemory locally everything ran great

@trevorwhitney
Copy link
Collaborator

@DylanGuedes if we need to change defaults coming from upstream (dskit in this case), here's an example of how to do that: #4435

In that PR we change the values coming from upstream weaveworks/common.

@DylanGuedes
Copy link
Contributor Author

Note: I'm changing the default distributor's ring KV store to inmemory in a different PR (this one). Once it gets merged I think we can proceed with this one.

@slim-bean slim-bean dismissed owen-d’s stale review October 13, 2021 18:32

the requested changes were made in another PR

DylanGuedes and others added 4 commits October 13, 2021 15:33
- These new limits protect users from overwhelming their cluster with
  ingestion load
- The new limits are:
  * ingestion_rate_strategy: global <- currently defaults to local
  * max_global_streams_per_user: 5000 <- current default is no limit
  * max_query_length: 721h <- current default is no limit
  * max_query_parallelism: 32 <- current default is 14
  * max_streams_per_user: 0 <- current default is 10000
  * reject_old_samples: true <-- current default is false
  * reject_old_samples_max_age: 168h <-- current default is 336h
- They are not necessary anymore because of the new default values
Co-authored-by: Trevor Whitney <trevorjwhitney@gmail.com>
Copy link
Collaborator

@slim-bean slim-bean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants