New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use the container whose limit is hit for system OOMs #88871
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dashpole The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
/lgtm |
/hold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
I believe this lgtm.
Can you add cAdvisor populates ContainerName and VictimContainerName from matching the regexp: Task in (.*) killed as a result of limit of (.*). A SystemOOM should mean VictimContainerName == "/", as we are looking for OOMs that are "killed as a results of limit of /". However, we incorrectly check ContainerName instead in the kubelet.
from the issue into either the commit message or the pr?
Also, if I'm reading the git history correct, does this mean this feature never worked? In other words, would there ever have been a time where the container name is /
? If this never worked, does that reflect a lack of e2e testing?
Yeah, it never worked. I thought about that as well in light of this issue. The main problem is that e2e testing would be quite flaky. SystemOOMs have been known to cause problems for the host VM, so a test that triggers a real SystemOOM would often fail. |
/retest |
1 similar comment
/retest |
/retest Yeah... that's a strong point that performing e2e tests with true system OOM events is a tricky game to play haha... Updating the unit tests and clear comments/commit messages may be the best we can do for now. |
readding |
thanks for taking a look @sjenning |
I have came up with another problem about oom. |
No, this is for SystemOOM events, which you (hopefully) did not hit |
What type of PR is this?
/kind bug
/sig node
/priority important-soon
What this PR does / why we need it:
cAdvisor populates ContainerName and VictimContainerName from matching the regexp: Task in (.) killed as a result of limit of (.). A SystemOOM should mean VictimContainerName == "/", as we are looking for OOMs that are "killed as a results of limit of /". However, we incorrectly check ContainerName instead in the kubelet.
Which issue(s) this PR fixes:
Fixes #88868
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
/assign @sjenning @derekwaynecarr @dchen1107