Alert when a single pod is using up the majority of CPU or memory of a node. #4538

vijay-veeranki · 2023-05-09T16:08:51Z

Related to:

Create an alert when a single pod is using up the majority of CPU or memory of a node.

The following query returns the average number of CPU cores used by each container over the last 5 minutes:

rate(container_cpu_usage_seconds_total{container!~"POD|"}[5m])

The lookbehind window in square brackets (5m in the query above) can be changed as needed. See the Prometheus documentation on time durations for the accepted values.

The container!~"POD|" filter drops series with an empty container label, which are aggregates for the cgroups hierarchy (see this answer for more details), and series for pause containers, which carry container="POD" (see these docs).

Since each pod can contain multiple containers, the following query sums the per-container rates to return the average number of CPU cores used by each pod over the last 5 minutes:

sum(
  rate(container_cpu_usage_seconds_total{container!~"POD|"}[5m])
) by (namespace, pod)
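
To turn the per-pod query above into something an alert could evaluate, one option is to divide each pod's CPU usage by the allocatable CPU of the node it runs on. The sketch below is a starting point under several assumptions, none of which come from this issue: the cAdvisor series carry a `node` label (this depends on the scrape relabelling in use), kube-state-metrics is deployed and exposes `kube_node_status_allocatable`, and 0.5 is a placeholder threshold for "the majority" of a node.

```promql
# Per-pod CPU usage as a fraction of the node's allocatable CPU.
# Assumes a `node` label on the cAdvisor series; adjust if the cluster uses a different label.
sum by (node, namespace, pod) (
  rate(container_cpu_usage_seconds_total{container!~"POD|"}[5m])
)
/ on (node) group_left ()
# One allocatable-CPU value per node, taken from kube-state-metrics.
max by (node) (kube_node_status_allocatable{resource="cpu"})
# Keep only pods above the hypothetical 50% threshold.
> 0.5
```

A memory version would likely follow the same shape, with `container_memory_working_set_bytes` in place of the CPU rate and `resource="memory"` on the allocatable metric, but that has not been verified against this cluster.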

sj-williams · 2023-05-26T09:08:33Z

Findings doc:
https://docs.google.com/document/d/1qAxCYFzDQta00l4v3IZ1CUyjOECuWAXDh3oAqNF0UtA/edit#heading=h.sdps58w53s1e

sj-williams · 2023-06-06T07:55:04Z

TODO:
Pair with the AP team to get an RShiny app stood up in the test cluster to simulate memory leak issues.

SteveMarshall · 2024-01-31T11:35:08Z

We're going to try to recreate the cause of this in #5251. Once that's done, we'll revisit this.

vijay-veeranki mentioned this issue May 9, 2023

CPU-Critical work: Visibility of CPU/Memory usage #4491

Closed

AntonyBishop added the operations-driven-engineering label May 12, 2023

sj-williams self-assigned this May 16, 2023

AntonyBishop added the blocked label Sep 4, 2023

SteveMarshall added the Environments label Sep 26, 2023

AntonyBishop removed the blocked label Sep 26, 2023

sablumiah added the blocked label Mar 11, 2024

sj-williams mentioned this issue Apr 24, 2024

Recreate RShiny app scenario in test cluster to investigate CPU Critical #5251

Open

9 tasks

sj-williams closed this as completed Apr 24, 2024
