Warn about CPU limits in `teleport-cluster` Helm chart #36251

hugoShaka · 2024-01-03T21:41:50Z

Because people keep shooting themselves in the foot with CFS quotas and this causes S1s.

Technical explanation of why CPU limits are not the best idea:

- You set limits to keep side workloads from harming important workloads. Setting limits on the Teleport cluster means that you want to degrade Teleport to protect something else in your Kube cluster. In the vast majority of setups, Teleport is the main workload in the cluster and the most important one; you don't want to degrade it and lose access to everything when under pressure.
- CFS quotas are misleading. If you set `requests.cpu: 1` and `limits.cpu: 1`, this absolutely does not mean that Teleport will run on a single CPU, nor that its CPU will be reserved. On an 8-core node, this means Teleport will run about 13% of the time on all CPUs, and then not be scheduled during the remaining 87% of the observed period. The Static CPU management policy does the thing people expect: it statically allocates CPUs to each workload (plus you get a nice CPU affinity that can help a lot with single-threaded workloads).
- Teleport is mainly doing IO; when throttling starts, latency will skyrocket and the service will become unstable (you can safely expect everything to start timing out, e.g. TLS handshakes).
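
The arithmetic behind the 8-core example can be sketched in a few lines of Python. This is only an illustration: it assumes the default 100 ms CFS period and a worst case where Teleport keeps all 8 cores busy at once. The function names are made up for this sketch and are not part of Teleport or Kubernetes.

```python
# Assumption: default CFS period of 100 ms.
CFS_PERIOD_MS = 100.0


def cfs_quota_ms(cpu_limit: float, period_ms: float = CFS_PERIOD_MS) -> float:
    """CPU time (in ms) the container may consume per period, summed over all cores."""
    return cpu_limit * period_ms


def runnable_fraction(cpu_limit: float, busy_threads: int) -> float:
    """Fraction of each period the workload runs before it is throttled,
    assuming busy_threads threads run in parallel on separate cores."""
    return min(1.0, cpu_limit / busy_threads)


quota = cfs_quota_ms(1.0)               # limits.cpu: 1
fraction = runnable_fraction(1.0, 8)    # 8 busy threads on an 8-core node
running_ms = fraction * CFS_PERIOD_MS
throttled_ms = CFS_PERIOD_MS - running_ms

print(f"quota per period: {quota} ms of CPU time")
print(f"running: {running_ms} ms, throttled: {throttled_ms} ms per {CFS_PERIOD_MS} ms period")
```

With these assumptions the workload burns its whole quota in 12.5 ms of wall-clock time and is then kept off every core for the remaining 87.5 ms, which is where the "13% / 87%" figures come from.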

github-actions · 2024-01-03T21:42:26Z

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

github-actions · 2024-01-03T21:51:09Z

🤖 Vercel preview here: https://docs-b1f0b3wxe-goteleport.vercel.app/docs/ver/preview

webvictim

❤️

tigrato · 2024-01-04T10:30:02Z

docs/pages/reference/helm-reference/teleport-cluster.mdx

+[the Static CPU management policy](https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy),
+a multithreaded workload with CPU limits will very likely not behave the way you expect when approaching its CPU limit.
+
+Teleport will become unstable once throttling starts. We recommend not to set CPU limits.


Should we add a paragraph about the implications of such actions?
Since people don't seem to know how it works, it's probably good to give them an idea that CPU limits control the CPU time of the process and not the actual CPU cores reserved. This leads to huge latencies because Teleport will quickly consume its quota and will not be scheduled on any cores for long periods of time.

I added a link to this PR.

From prev experience, no one will read it.
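
To make the latency effect discussed in this thread concrete, here is a rough model of how long a burst of work takes with and without a CPU limit. It is a simplification under stated assumptions: a 100 ms CFS period, perfectly parallel work, and no other scheduling delays. `completion_ms` is a hypothetical helper, not a Teleport or Kubernetes API.

```python
import math


def completion_ms(work_cpu_ms, threads, cpu_limit=None, period_ms=100.0):
    """Wall-clock time (ms) to finish work_cpu_ms of CPU work spread over
    `threads` parallel threads, under an optional CFS CPU limit."""
    if work_cpu_ms <= 0:
        return 0.0
    # No limit, or a limit too high to ever throttle: plain parallel speedup.
    if cpu_limit is None or cpu_limit >= threads:
        return work_cpu_ms / threads
    quota_ms = cpu_limit * period_ms
    # Periods in which the quota is fully used before the final partial one.
    exhausted_periods = math.ceil(work_cpu_ms / quota_ms) - 1
    remaining_cpu_ms = work_cpu_ms - exhausted_periods * quota_ms
    return exhausted_periods * period_ms + remaining_cpu_ms / threads


unlimited = completion_ms(400, threads=8)                # no CPU limit
limited = completion_ms(400, threads=8, cpu_limit=1.0)   # limits.cpu: 1

print(f"no limit: {unlimited} ms, limits.cpu=1: {limited} ms")
```

In this model the same burst goes from 50 ms to 312.5 ms once the limit is set, because the process sits throttled for most of each period. Real numbers depend on the workload, but this is the kind of jump that can push operations such as TLS handshakes past their timeouts.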

examples/chart/teleport-cluster/values.yaml

codingllama

Thanks for the follow up!

github-actions · 2024-01-04T15:57:59Z

🤖 Vercel preview here: https://docs-r51p5u345-goteleport.vercel.app/docs/ver/preview

public-teleport-github-review-bot · 2024-01-04T17:47:35Z

@hugoShaka See the table below for backport results.

| Branch | Result |
| --- | --- |
| branch/v12 | Failed |
| branch/v13 | Create PR |
| branch/v14 | Create PR |

* Warn about CPU limits * fixup! Warn about CPU limits

Warn about CPU limits

eda9681

hugoShaka added documentation helm backport/branch/v12 backport/branch/v13 backport/branch/v14 labels Jan 3, 2024

hugoShaka requested review from marcoandredinis, webvictim, tigrato and codingllama January 3, 2024 21:41

hugoShaka temporarily deployed to vercel January 3, 2024 21:41 — with GitHub Actions Inactive

github-actions bot added the size/sm label Jan 3, 2024

github-actions bot requested review from ptgott, r0mant, xinding33 and zmb3 January 3, 2024 21:42

hugoShaka added the no-changelog Indicates that a PR does not require a changelog entry label Jan 3, 2024

webvictim approved these changes Jan 4, 2024

zmb3 approved these changes Jan 4, 2024

marcoandredinis approved these changes Jan 4, 2024

public-teleport-github-review-bot bot removed request for r0mant, tigrato, codingllama, xinding33 and ptgott January 4, 2024 07:54

tigrato approved these changes Jan 4, 2024

codingllama reviewed Jan 4, 2024

fixup! Warn about CPU limits

a2f3b6b

hugoShaka temporarily deployed to vercel January 4, 2024 15:52 — with GitHub Actions Inactive

tigrato approved these changes Jan 4, 2024

hugoShaka added this pull request to the merge queue Jan 4, 2024

Merged via the queue into master with commit 8ddcf24 Jan 4, 2024
38 checks passed

hugoShaka deleted the hugo/warn-about-common-kubernetes-footguns branch January 4, 2024 17:45

This was referenced Jan 4, 2024

[v14] Warn about CPU limits in teleport-cluster Helm chart #36289

Merged

[v13] Warn about CPU limits in teleport-cluster Helm chart #36290

Merged

ibeckermayer pushed a commit that referenced this pull request Jan 17, 2024

Warn about CPU limits in teleport-cluster Helm chart (#36251)

4e467ad

* Warn about CPU limits * fixup! Warn about CPU limits
