fix container_oom_events_total always returns 0. #3278

chengjoey · 2023-03-22T14:32:03Z

fix #3015
In a Kubernetes pod, if a container is OOM-killed, it is deleted and a new container is created. As a result, the container_oom_events_total metric is always 0. This PR refactors the OOM event collector and retains the OOM information of deleted containers for a period of time. It also adds the flag oom_event_retain_time to control how long an OOM event is kept; the default is 5 minutes.
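For readers who want to see what such a flag could look like, here is a minimal sketch using Go's standard `flag` package. Only the flag name `oom_event_retain_time` and its 5-minute default come from the description above; the function name `parseRetainTime` and the help text are hypothetical, and this is not the code from the PR.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"time"
)

// parseRetainTime is a hypothetical helper: it registers the retention flag
// on a private FlagSet (so the sketch does not touch flag.CommandLine) and
// returns the parsed duration.
func parseRetainTime(args []string) (time.Duration, error) {
	fs := flag.NewFlagSet("oom-sketch", flag.ContinueOnError)
	retain := fs.Duration("oom_event_retain_time", 5*time.Minute,
		"how long OOM events of deleted containers are kept (assumed help text)")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *retain, nil
}

func main() {
	retain, err := parseRetainTime(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println("retaining OOM events for", retain)
}
```

Running the sketch with `-oom_event_retain_time=10m` would override the default; any value accepted by `time.ParseDuration` works.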

k8s-ci-robot · 2023-03-22T14:32:13Z

Hi @chengjoey. Thanks for your PR.

I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

chengjoey · 2023-03-22T14:42:06Z

/assign @iwankgb @kragniz

Is it feasible to keep the metrics of OOM-killed containers instead of deleting them?
Please take a look.

chengjoey · 2023-03-22T14:42:56Z

/kind bug

iwankgb · 2023-03-24T13:07:10Z

What happens when PID 1 forks another process and the forked process gets OOM-killed?

chengjoey · 2023-03-27T02:51:04Z

> What happens when PID 1 forks another process and the forked process gets OOM-killed?

When a forked process is OOM-killed, the relevant log information can still be read from /dev/kmsg, so it should still be possible to associate the event with the corresponding container.
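To illustrate the idea, here is a minimal sketch of how a kernel OOM-kill line could be mapped back to a container even when the victim is a forked child rather than PID 1. The sample line is modelled on the `oom-kill:` summary that recent kernels print, and its exact format is an assumption (older kernels log differently). The names `parseOOMLine`, `oomVictim`, and `sampleLine` are hypothetical, and treating the last cgroup path element as the container ID is also an assumption; this is not cAdvisor's actual parser.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
	"strconv"
)

// sampleLine is an assumed example of a kernel OOM-kill summary line.
const sampleLine = "oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=abc123,mems_allowed=0,oom_memcg=/kubepods/pod1/abc123,task_memcg=/kubepods/pod1/abc123,task=worker,pid=4242,uid=0"

// oomKillRE extracts the victim's cgroup, process name, and PID.
var oomKillRE = regexp.MustCompile(`task_memcg=([^,]+),task=([^,]+),pid=(\d+)`)

// oomVictim describes the killed process and the container it belongs to.
type oomVictim struct {
	Cgroup      string // cgroup path of the killed task
	ContainerID string // last element of the cgroup path (assumed to identify the container)
	Process     string
	Pid         int
}

// parseOOMLine returns the victim described by line, or false if the line
// is not an OOM-kill summary in the assumed format.
func parseOOMLine(line string) (oomVictim, bool) {
	m := oomKillRE.FindStringSubmatch(line)
	if m == nil {
		return oomVictim{}, false
	}
	pid, err := strconv.Atoi(m[3])
	if err != nil {
		return oomVictim{}, false
	}
	return oomVictim{
		Cgroup:      m[1],
		ContainerID: path.Base(m[1]),
		Process:     m[2],
		Pid:         pid,
	}, true
}

func main() {
	if v, ok := parseOOMLine(sampleLine); ok {
		fmt.Printf("process %q (pid %d) was OOM-killed in container %s\n",
			v.Process, v.Pid, v.ContainerID)
	}
}
```

The point of the sketch is that the association goes through the task's cgroup, not through its PID, so it does not matter whether the killed process is PID 1 or one of its children.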

szuecs · 2023-05-10T08:25:18Z

@chengjoey what happens if a container is killed every second?
Do I understand correctly that we keep creating new container metrics or would this counter increase to >1?
If we create new container metrics, I would call this feature a memory leak: every container start creates a new set of metrics, and as far as I understand we would then store them forever. (I am not very familiar with the code.)

ishworgurung · 2023-08-04T04:54:20Z

> memory leak [ ... ]

@szuecs In what way would it be a memory leak?

szuecs · 2023-08-11T21:32:50Z

@ishworgurung maybe the wording is not correct, but it will increase memory usage over time. That memory is never GCed, and eventually cAdvisor itself gets OOM-killed, as far as I understand.
Increasing the counter is great, though. Keeping old metrics forever is likely an issue.
Maybe it's also not part of this PR; feel free to ignore this.

chengjoey · 2023-09-04T08:36:10Z

> @ishworgurung maybe the wording is not correct, but it will increase memory usage over time. That memory is never GCed, and eventually cAdvisor itself gets OOM-killed, as far as I understand. Increasing the counter is great, though. Keeping old metrics forever is likely an issue. Maybe it's also not part of this PR; feel free to ignore this.

Hi @szuecs @ishworgurung, I have modified this PR to store the OOM event metric information in a separate map, and added the flag oom_event_retain_time to configure the retention time. OOM metrics older than this retention time are still deleted, which prevents memory leaks.

@iwankgb could you please review this when you have time?
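As a rough illustration of the approach described above (a separate map plus time-based eviction), here is a minimal sketch. All type and function names (`oomEventStore`, `Record`, `Count`, `Prune`, `defaultRetain`, `demoStart`) are hypothetical, and keying the map by container name is an assumption; the actual PR code may differ.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// defaultRetain mirrors the 5-minute default mentioned for oom_event_retain_time.
const defaultRetain = 5 * time.Minute

// demoStart is an arbitrary fixed timestamp used only by the demo in main.
var demoStart = time.Date(2023, time.September, 4, 8, 0, 0, 0, time.UTC)

// oomRecord holds the OOM count for one container and when it was last updated.
type oomRecord struct {
	count    uint64
	lastSeen time.Time
}

// oomEventStore keeps OOM counts independently of the container's lifetime.
type oomEventStore struct {
	mu     sync.Mutex
	retain time.Duration
	events map[string]*oomRecord
}

func newOOMEventStore(retain time.Duration) *oomEventStore {
	return &oomEventStore{retain: retain, events: make(map[string]*oomRecord)}
}

// Record increments the counter for container and refreshes its timestamp.
func (s *oomEventStore) Record(container string, at time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	rec, ok := s.events[container]
	if !ok {
		rec = &oomRecord{}
		s.events[container] = rec
	}
	rec.count++
	rec.lastSeen = at
}

// Count returns the number of recorded OOM events, or 0 if none are retained.
func (s *oomEventStore) Count(container string) uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	if rec, ok := s.events[container]; ok {
		return rec.count
	}
	return 0
}

// Prune deletes records whose last event is older than the retention time
// and returns how many were removed, which keeps the map bounded.
func (s *oomEventStore) Prune(now time.Time) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	removed := 0
	for name, rec := range s.events {
		if now.Sub(rec.lastSeen) > s.retain {
			delete(s.events, name)
			removed++
		}
	}
	return removed
}

func main() {
	store := newOOMEventStore(defaultRetain)
	store.Record("web", demoStart)
	store.Record("web", demoStart.Add(time.Minute))
	fmt.Println("events:", store.Count("web"))
	fmt.Println("pruned:", store.Prune(demoStart.Add(10*time.Minute)))
}
```

In this sketch, repeated kills under the same key increase the counter rather than creating new entries, and `Prune` would need to be called periodically (for example from a ticker) for the memory bound to hold.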

dims · 2023-10-16T19:16:23Z

/ok-to-test

dims · 2023-10-16T21:08:28Z

@chengjoey please resolve merge conflicts:

In a Kubernetes pod, if a container is OOM-killed, it will be deleted and a new container will be created. Therefore, the `container_oom_events_total` metric will always be 0. Refactor the collector of oom events, and retain the deleted container oom information for a period of events Signed-off-by: joey <zchengjoey@gmail.com>

chengjoey · 2023-10-17T05:40:00Z

/test pull-cadvisor-e2e

chengjoey · 2023-10-17T06:16:43Z

> @chengjoey please resolve merge conflicts:

Thanks @dims, the PR has been rebased.

chengjoey · 2023-10-17T12:27:59Z

/test pull-cadvisor-e2e

chengjoey · 2023-10-18T01:15:04Z

/test pull-cadvisor-e2e

chengjoey · 2023-10-18T07:30:32Z

/test pull-cadvisor-e2e

k8s-ci-robot · 2023-10-18T07:37:49Z

@chengjoey: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cadvisor-e2e | `0b6dfeb` | link | true | `/test pull-cadvisor-e2e` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

nlamirault · 2023-12-04T08:30:29Z

Hi, any news on this bugfix?
We're waiting for an alert on OOMKilled event (kubernetes-monitoring/kubernetes-mixin#822)
Thanks.

taraspos · 2024-04-25T14:53:05Z

Hi!
Just wanted to bump this issue again. It would be great to get it fixed.

tsipo · 2024-04-29T20:17:19Z

Hi @pschichtel and others.
I was testing this issue on Kubernetes using a small image I built (see here). I noticed that when I use that tool in forked mode (this and this use case), I do get container_oom_events_total == 1, because the container lives long enough after the OOM kill for cAdvisor to be scraped. If I change the AFTER_FORK_INTERVAL env var to 0 so that the container exits immediately, I don't get container_oom_events_total == 1, because the container is de-registered from cAdvisor immediately.
FYI.

k8s-ci-robot added the needs-ok-to-test label Mar 22, 2023

chengjoey force-pushed the fix/container-oom-total branch from dcbab71 to 70b1b02 Compare March 22, 2023 15:05

chengjoey force-pushed the fix/container-oom-total branch from 70b1b02 to 02b6c33 Compare September 4, 2023 08:25

k8s-ci-robot added ok-to-test and removed needs-ok-to-test labels Oct 16, 2023

chengjoey force-pushed the fix/container-oom-total branch from 02b6c33 to 0b6dfeb Compare October 17, 2023 03:38

pschichtel mentioned this pull request Apr 29, 2024

container_oom_events_total always returns 0 #3015

Open
