fix container_oom_events_total always returns 0. #3278

Open
wants to merge 1 commit into base: master

@chengjoey chengjoey commented Mar 22, 2023

fix #3015
In a Kubernetes pod, when a container is OOM-killed it is deleted and a new container is created, so the container_oom_events_total metric is always 0. This PR refactors the OOM event collector to retain the OOM information of deleted containers for a period of time. It also adds the flag oom_event_retain_time, which sets how long an OOM event is kept; the default is 5 minutes.

@k8s-ci-robot
Collaborator

Hi @chengjoey. Thanks for your PR.

I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@chengjoey
Author

/assign @iwankgb @kragniz

Is it feasible to keep the metrics of OOM-killed containers instead of deleting them?
Please take a look.

@chengjoey
Author

/kind bug

@iwankgb
Collaborator

iwankgb commented Mar 24, 2023

What happens when PID 1 forks another process and the forked process gets OOM-killed?

@chengjoey
Author

> What happens when PID 1 forks another process and the forked process gets OOM-killed?

Even when the OOM-killed process is a forked child, the relevant log information can still be read from /dev/kmsg, so it should still be possible to associate the event with the corresponding container.
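
To illustrate the point about /dev/kmsg, here is a rough sketch of pulling the victim's cgroup out of a kernel OOM summary line. It assumes the `oom-kill:constraint=...,task_memcg=...,task=...,pid=...,uid=...` format printed by recent Linux kernels; the regular expression, the `oomVictim` type, and the sample line are illustrative and are not taken from cAdvisor's parser. Because the kernel reports the cgroup of the killed task, a forked child maps to the same container as PID 1.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// oomKillRe captures the victim's cgroup, command name, and PID.
// This pattern is an assumption for the sketch, not cAdvisor's own regex.
var oomKillRe = regexp.MustCompile(`task_memcg=([^,]+),task=(.+),pid=(\d+),uid=\d+`)

// oomVictim is a hypothetical holder for the parsed fields.
type oomVictim struct {
	Cgroup  string
	Process string
	Pid     int
}

// parseOOMKillLine returns the victim described by line, or false if the
// line is not an OOM kill summary.
func parseOOMKillLine(line string) (oomVictim, bool) {
	m := oomKillRe.FindStringSubmatch(line)
	if m == nil {
		return oomVictim{}, false
	}
	pid, err := strconv.Atoi(m[3])
	if err != nil {
		return oomVictim{}, false
	}
	return oomVictim{Cgroup: m[1], Process: m[2], Pid: pid}, true
}

func main() {
	// Made-up sample line; the cgroup path identifies the container even
	// though the killed task is a forked worker rather than PID 1.
	line := "oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=abc123,mems_allowed=0," +
		"oom_memcg=/kubepods/burstable/pod-example/abc123," +
		"task_memcg=/kubepods/burstable/pod-example/abc123,task=worker,pid=4242,uid=0"
	if v, ok := parseOOMKillLine(line); ok {
		fmt.Printf("pid %d (%s) was killed in cgroup %s\n", v.Pid, v.Process, v.Cgroup)
	}
}
```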

@szuecs

szuecs commented May 10, 2023

@chengjoey what happens if a container is killed every second?
Do I understand correctly that we keep creating new container metrics or would this counter increase to >1?
In case we would create new container metrics, then I would call this feature a memory leak, because every container start will create a new set of metrics and now we would store them forever as far as I understand. (I am not very familiar with the code)

@ishworgurung

> memory leak [ ... ]

@szuecs In what way would it be a memory leak?

@szuecs

szuecs commented Aug 11, 2023

@ishworgurung maybe the wording is not correct, but memory usage will grow over time and is never GCed, so eventually cAdvisor itself gets OOM-killed, as far as I understand.
Incrementing the counter is great, though. Keeping old metrics forever is likely an issue.
Maybe it's also not part of this PR; feel free to ignore this.

@chengjoey
Author

> @ishworgurung maybe the wording is not correct, but memory usage will grow over time and is never GCed, so eventually cAdvisor itself gets OOM-killed, as far as I understand. Incrementing the counter is great, though. Keeping old metrics forever is likely an issue. Maybe it's also not part of this PR; feel free to ignore this.

Hi @szuecs @ishworgurung, I have updated this PR: the OOM event metric information is now kept in a separate map, and the flag oom_event_retain_time configures the retention time. OOM metrics older than the retention time are deleted to prevent memory leaks.

@iwankgb could you please take a look and review when you have time?

@dims
Collaborator

dims commented Oct 16, 2023

/ok-to-test

@dims
Collaborator

dims commented Oct 16, 2023

@chengjoey please resolve merge conflicts:
image

In a Kubernetes pod, if a container is OOM-killed, it will be deleted and a new container will be created. Therefore, the `container_oom_events_total` metric will always be 0. Refactor the collector of OOM events, and retain the OOM information of deleted containers for a period of time.

Signed-off-by: joey <zchengjoey@gmail.com>
@chengjoey
Author

/test pull-cadvisor-e2e

@chengjoey
Author

> @chengjoey please resolve merge conflicts: image

Thanks @dims, the PR has been rebased.

@chengjoey
Author

/test pull-cadvisor-e2e

@chengjoey
Author

/test pull-cadvisor-e2e

@chengjoey
Author

/test pull-cadvisor-e2e

@k8s-ci-robot
Collaborator

@chengjoey: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cadvisor-e2e | 0b6dfeb | link | true | /test pull-cadvisor-e2e |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@nlamirault

Hi, any news on this bugfix?
We're waiting for an alert on OOMKilled event (kubernetes-monitoring/kubernetes-mixin#822)
Thanks.

@taraspos

taraspos commented Apr 25, 2024

Hi!
just wanted to bump this issue again. Would be great to get it fixed.

@tsipo

tsipo commented Apr 29, 2024

Hi @pschichtel and others.
I was testing this issue on Kubernetes using a small image I have built - see here. I have noticed that when I use that tool in a forked mode (this and this use-cases), I do get container_oom_events_total == 1, because the container lives long enough after the OOMKill for cAdvisor to be scraped. If I change the AFTER_FORK_INTERVAL env var to 0 so that the container exits immediately, I don't get container_oom_events_total == 1, because the container is de-registered from cAdvisor immediately.
FYI.

Development

Successfully merging this pull request may close these issues.

container_oom_events_total always returns 0
9 participants