Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn about CPU limits in teleport-cluster Helm chart #36251

Merged
merged 2 commits into from Jan 4, 2024

Conversation

hugoShaka
Copy link
Contributor

@hugoShaka hugoShaka commented Jan 3, 2024

Because people keep shooting themselves in the foot with CFS quotas and this causes S1s.

Technical explanation as to why cpu limits are not the best idea:

  • you want to set limits to avoid side workloads to harm important workloads, setting limits on the Teleport cluster means that you want to degrade Teleport to protect something else in your Kube cluster. In the vast majority of setups, Teleport is the main workload in the cluster and the most important one, you don't want to degrade it and loose access to everything when under pressure.
  • CFS quotas are misleading. If you set requests.cpu:1and limits.cpu: 1 this absolutely does not mean that Teleport will run on a single CPU, nor that its CPU will be reserved. On an 8 core node, this means teleport will run 13% of the time on all CPUs, and then not be scheduled during the remaining 87% of the observed period. The Static CPU management policy does the thing people expect: it statically allocates CPUs to each workload (plus you get a nice CPU affinity that can help a lot with single-threaded workloads)
  • Teleport is mainly doing IO; when throttling starts, latency will skyrocket, and the service will become unstable (you can safely expect everything to start timeouting: e.g. TLS handshakes)

Copy link

github-actions bot commented Jan 3, 2024

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

@hugoShaka hugoShaka added the no-changelog Indicates that a PR does not require a changelog entry label Jan 3, 2024
Copy link

github-actions bot commented Jan 3, 2024

🤖 Vercel preview here: https://docs-b1f0b3wxe-goteleport.vercel.app/docs/ver/preview

Copy link
Contributor

@webvictim webvictim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

[the Static CPU management policy](https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy),
a multithreaded workload with CPU limits will very likely not behave the way you expect when approaching its CPU limit.

Teleport will become unstable once throttling starts. We recommend not to set CPU limits.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a paragraph about the implications of such actions?
Since people don't seem to know how it works, it's probably good to give them an idea that CPU limits control the CPU time of the process and not the actual CPU cores reserved. This leads to huge latencies because Teleport will quickly consume its quota and will not be scheduled on any cores for long periods of time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a link to this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From prev experience, no one will read it.

examples/chart/teleport-cluster/values.yaml Outdated Show resolved Hide resolved
Copy link
Contributor

@codingllama codingllama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the follow up!

Copy link

github-actions bot commented Jan 4, 2024

🤖 Vercel preview here: https://docs-r51p5u345-goteleport.vercel.app/docs/ver/preview

@hugoShaka hugoShaka added this pull request to the merge queue Jan 4, 2024
Merged via the queue into master with commit 8ddcf24 Jan 4, 2024
38 checks passed
@hugoShaka hugoShaka deleted the hugo/warn-about-common-kubernetes-footguns branch January 4, 2024 17:45
@public-teleport-github-review-bot

@hugoShaka See the table below for backport results.

Branch Result
branch/v12 Failed
branch/v13 Create PR
branch/v14 Create PR

ibeckermayer pushed a commit that referenced this pull request Jan 17, 2024
* Warn about CPU limits

* fixup! Warn about CPU limits
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants