Use the container whose limit is hit for system OOMs #88871

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from the fix_oom branch on Mar 18, 2020

@dashpole dashpole commented Mar 5, 2020

What type of PR is this?
/kind bug
/sig node
/priority important-soon

What this PR does / why we need it:
cAdvisor populates ContainerName and VictimContainerName by matching the regexp: Task in (.*) killed as a result of limit of (.*). A SystemOOM should mean VictimContainerName == "/", because we are looking for OOMs that are "killed as a result of limit of /". However, the kubelet incorrectly checks ContainerName instead. This PR changes the kubelet to check VictimContainerName.
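
To make the two capture groups concrete, here is a minimal sketch of the matching and of the corrected check. It assumes only the regexp quoted above; the `oomInfo` struct and the `parseOOMLine` and `isSystemOOM` helpers are hypothetical names for illustration, not the actual cAdvisor or kubelet code, and the sample kernel lines are made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// oomLine mirrors the pattern quoted in the PR description.
// The real cAdvisor parser may differ in detail.
var oomLine = regexp.MustCompile(`Task in (.*) killed as a result of limit of (.*)`)

// oomInfo is a hypothetical stand-in for the OOM event fields
// named in the PR description.
type oomInfo struct {
	ContainerName       string // first capture group: cgroup of the killed task
	VictimContainerName string // second capture group: cgroup whose limit was hit
}

// parseOOMLine extracts both capture groups from a kernel log line.
// It returns false when the line does not match the pattern.
func parseOOMLine(line string) (oomInfo, bool) {
	m := oomLine.FindStringSubmatch(line)
	if m == nil {
		return oomInfo{}, false
	}
	return oomInfo{ContainerName: m[1], VictimContainerName: m[2]}, true
}

// isSystemOOM applies the corrected check: the limit that was hit
// must be the root cgroup, regardless of where the killed task lived.
func isSystemOOM(info oomInfo) bool {
	return info.VictimContainerName == "/"
}

func main() {
	// Illustrative log lines, not captured from a real host.
	lines := []string{
		"Task in /kubepods/pod1/c1 killed as a result of limit of /",
		"Task in /kubepods/pod1/c1 killed as a result of limit of /kubepods/pod1",
	}
	for _, l := range lines {
		if info, ok := parseOOMLine(l); ok {
			fmt.Printf("task=%q limit=%q systemOOM=%v\n",
				info.ContainerName, info.VictimContainerName, isSystemOOM(info))
		}
	}
}
```

In the first sample line the killed task lives in a container but the limit is the root cgroup, which is the case a check on ContainerName would miss.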

Which issue(s) this PR fixes:
Fixes #88868

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fix detection of SystemOOMs in which the victim is a container.

/assign @sjenning @derekwaynecarr @dchen1107

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Mar 5, 2020
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. sig/node Categorizes an issue or PR as relevant to SIG Node. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 5, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dashpole

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet labels Mar 5, 2020
@dashpole
Contributor Author

dashpole commented Mar 5, 2020

/retest

@tedyu
Contributor

tedyu commented Mar 5, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 5, 2020
@dashpole
Contributor Author

dashpole commented Mar 5, 2020

/hold
i'll give others a day to make sure this looks correct to them.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 5, 2020
Contributor

@mattjmcnaughton mattjmcnaughton left a comment

/lgtm
I believe this lgtm.

Can you add the following text from the issue to either the commit message or the PR? "cAdvisor populates ContainerName and VictimContainerName from matching the regexp: Task in (.*) killed as a result of limit of (.*). A SystemOOM should mean VictimContainerName == "/", as we are looking for OOMs that are "killed as a result of limit of /". However, we incorrectly check ContainerName instead in the kubelet."

Also, if I'm reading the git history correctly, does this mean this feature never worked? In other words, would there ever have been a time when the container name is /? If this never worked, does that reflect a lack of e2e testing?

pkg/kubelet/oom/oom_watcher_linux_test.go (2 review threads, outdated and resolved)
@dashpole
Contributor Author

dashpole commented Mar 6, 2020

> Also, if I'm reading the git history correct, does this mean this feature never worked? In other words, would there ever have been a time where the container name is /? If this never worked, does that reflect a lack of e2e testing?

Yeah, it never worked. I thought about that as well in light of this issue. The main problem is that e2e testing would be quite flaky. SystemOOMs have been known to cause problems for the host VM, so a test that triggers a real SystemOOM would often fail.
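
One way to cover the check without triggering a real SystemOOM is to feed synthetic events to the classification logic. The sketch below is an assumption-laden illustration, not the kubelet's actual watcher or test: the `event` type and `countSystemOOMs` function are hypothetical, and the channel stands in for whatever stream delivers OOM events.

```go
package main

import "fmt"

// event is a hypothetical, simplified OOM event carrying the two
// fields discussed in this PR.
type event struct {
	ContainerName       string // cgroup of the killed task
	VictimContainerName string // cgroup whose limit was hit
}

// countSystemOOMs drains the channel and counts events whose limit
// is the root cgroup, which is the corrected SystemOOM condition.
func countSystemOOMs(events <-chan event) int {
	n := 0
	for e := range events {
		if e.VictimContainerName == "/" {
			n++
		}
	}
	return n
}

func main() {
	// Synthetic events: no real memory pressure is needed.
	events := make(chan event, 3)
	events <- event{ContainerName: "/", VictimContainerName: "/"}
	events <- event{ContainerName: "/kubepods/pod1/c1", VictimContainerName: "/"}
	events <- event{ContainerName: "/kubepods/pod1/c1", VictimContainerName: "/kubepods/pod1/c1"}
	close(events)

	fmt.Println("system OOMs:", countSystemOOMs(events))
}
```

The second synthetic event is the one that matters for this bug: the killed task is in a container, but the limit is the root cgroup, so it should still count as a SystemOOM.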

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 6, 2020
@dashpole
Contributor Author

dashpole commented Mar 6, 2020

/retest

@mattjmcnaughton
Contributor

> Also, if I'm reading the git history correct, does this mean this feature never worked? In other words, would there ever have been a time where the container name is /? If this never worked, does that reflect a lack of e2e testing?

> Yeah, it never worked. I thought about that as well in light of this issue. The main problem is that e2e testing would be quite flaky. SystemOOMs have been known to cause problems for the host VM, so a test that triggers a real SystemOOM would often fail.

> /retest

Yeah... that's a strong point that performing e2e tests with true system OOM events is a tricky game to play haha...

Updating the unit tests and clear comments/commit messages may be the best we can do for now.

@sjenning
Contributor

sjenning commented Mar 9, 2020

readding
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 9, 2020
@dashpole
Contributor Author

dashpole commented Mar 9, 2020

thanks for taking a look @sjenning
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 9, 2020
@KielChan

I have run into another OOM problem.
It is a Java app running on Tomcat. The system OOM killer refused Tomcat's request for more memory and killed the Tomcat process. The Java app container did not restart, and no OOM event was reported.
Is this the same problem?

@dashpole
Contributor Author

No, this is for SystemOOM events, which you (hopefully) did not hit.

@k8s-ci-robot k8s-ci-robot merged commit 0c8ac83 into kubernetes:master Mar 18, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Mar 18, 2020
@dashpole dashpole deleted the fix_oom branch June 15, 2020 16:37
Development

Successfully merging this pull request may close these issues.

SystemOOMs not reported for containers
8 participants