Alert when a single pod is using up the majority of CPU or memory of a node. #4538

Closed
vijay-veeranki opened this issue May 9, 2023 · 3 comments

Comments

@vijay-veeranki
Contributor

Related to:

#4491

Create an alert when a single pod is using up the majority of CPU or memory of a node.

The following query returns the per-container average number of CPUs used during the last 5 minutes:

rate(container_cpu_usage_seconds_total{container!~"POD|"}[5m])

The lookbehind window in square brackets (5m in the query above) can be changed to whatever window is needed. The Prometheus documentation on time durations lists the accepted values.
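
To make the "average number of CPUs" reading concrete, here is a rough Python sketch of what rate() does to a CPU-seconds counter over the lookbehind window. It is a simplification, not Prometheus's actual implementation: it handles counter resets but skips the extrapolation to the window boundaries that the real rate() performs. The function name and the sample values are made up for illustration.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Simplified model only: real PromQL rate() also extrapolates to the
    edges of the lookbehind window.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed to compute a rate
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. the container restarted),
        # so the new value is counted as the increase since the reset.
        increase += cur - prev if cur >= prev else cur
    return increase / elapsed


# Hypothetical CPU-seconds counter scraped every 60 s across a 5m window.
window = [(0, 100.0), (60, 130.0), (120, 160.0),
          (180, 190.0), (240, 220.0), (300, 250.0)]
print(simple_rate(window))  # 0.5
```

The container burned 150 CPU-seconds in 300 seconds of wall-clock time, so on average it kept half a CPU busy. A longer window smooths out short spikes, and a shorter one reacts faster but is noisier.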

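The container!~"POD|" filter matches every series whose container label is neither empty nor "POD". Series with an empty container label describe the cgroups hierarchy rather than a single container (see this answer for more details), and series with container="POD" belong to e.g. pause containers (see these docs).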

Since each pod can contain multiple containers, the following query returns the per-pod average number of CPUs used during the last 5 minutes:

sum(
  rate(container_cpu_usage_seconds_total{container!~"POD|"}[5m])
) by (namespace,pod)
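
For the alert itself, the per-pod figure still has to be compared with the capacity of the node the pod runs on. Below is a minimal Python sketch of that logic, not a proposed implementation. It assumes "majority" means more than 50% of the node's CPUs, and that a pod-to-node mapping and per-node CPU counts are available from somewhere. The function name, the input shapes, and the pod and node names are all hypothetical.

```python
from collections import defaultdict


def pods_over_threshold(container_rates, pod_node, node_cpus, threshold=0.5):
    """Return pods whose CPU usage exceeds `threshold` of their node's CPUs.

    container_rates: {(namespace, pod, container): average CPUs used}
    pod_node:        {(namespace, pod): node name}   (assumed to be available)
    node_cpus:       {node name: number of CPUs}     (assumed to be available)
    """
    # Equivalent of `sum(...) by (namespace,pod)`: add up the containers.
    per_pod = defaultdict(float)
    for (namespace, pod, _container), cpus in container_rates.items():
        per_pod[(namespace, pod)] += cpus

    offenders = []
    for key, cpus in per_pod.items():
        node = pod_node[key]
        share = cpus / node_cpus[node]
        if share > threshold:
            offenders.append((key, node, share))
    return offenders


# Hypothetical data: one pod with two containers dominating a 4-CPU node.
rates = {
    ("apps", "rshiny-0", "app"): 2.5,
    ("apps", "rshiny-0", "sidecar"): 0.25,
    ("apps", "web-1", "app"): 0.5,
}
placement = {("apps", "rshiny-0"): "node-a", ("apps", "web-1"): "node-a"}
capacity = {"node-a": 4.0}

print(pods_over_threshold(rates, placement, capacity))
# [(('apps', 'rshiny-0'), 'node-a', 0.6875)]
```

The same shape should carry over to memory by swapping the CPU rates for a memory usage gauge and the CPU counts for node memory. The threshold and the source of the node data are the parts that still need deciding.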

@sj-williams
Contributor

TODO:
Pair with AP team to get rshiny app stood up in test cluster to simulate memory leak issues

@SteveMarshall
Member

We're going to try to recreate the cause of this in #5251. Once that's done, we'll revisit this.
