Fix kubelet panic in cgroup manager. #42927
Conversation
@calebamiles Mark 1.6 because this is a kubelet bug fix.
/lgtm
@k8s-bot gce etcd3 e2e test this
Thanks for fixing this so quickly.
@@ -449,12 +449,16 @@ func (m *cgroupManagerImpl) Pids(name CgroupName) []int {

	// WalkFunc which is called for each file and directory in the pod cgroup dir
	visitor := func(path string, info os.FileInfo, err error) error {
		if err != nil {
			glog.V(5).Infof("cgroup manager encountered error scanning cgroup path %q", path)
			return filepath.SkipDir
I wonder if it's safe to skip here...
What prompted you to add this logic?
Because of this #42920
@vishh would you say it is equally safe as skipping 10 lines later if there's an error reading cgroup data?
The problem is that if there is an error, info may be nil, and that will cause a kubelet panic.
Do you mean that we should still try getCgroupProcs?
this makes sense to me.
I suspect systemd is the one that could cause errors in this logic. We do not expect transient errors with cgroupfs. So maybe in the future, we could have a list of well known controllers this logic expects to always work and fail loudly if one of those controllers throws access errors.
@Random-Liu Shouldn't we also print out err before the return? Same for the following errs?
@k8s-bot gce etcd3 e2e test this
this also fixes #42927, thanks!
@derekwaynecarr I believe you were referring to #42875. Fixed the PR description too.
Force-pushed b0f3500 to e6341cc
@dchen1107 Updated the PR to print out the error and change the log level. @derekwaynecarr Will this be quite spammy?
/lgtm
@derekwaynecarr I asked @Random-Liu to make the above change because the original change did fix the kubelet panic, but I worry that it hides some real issues in the system and leaves tons of uncleaned pod cgroups behind.
Yeah. It might be useful to generate events on cgroup deletion failures for now.
[APPROVALNOTIFIER] This PR is APPROVED The following people have approved this PR: Random-Liu, dchen1107, yujuhong Needs approval from an approver in each of these OWNERS Files:
We suggest the following people:
@k8s-bot cvm gce e2e test this
Apply LGTM based on #42927 (comment)
I'd be willing to bet this fixes most or all of the hits here: https://storage.googleapis.com/k8s-gubernator/triage/index.html?text=the%20server%20could%20not%20find%20the%20requested%20resource%20%5C(get#04092e6e21f91a7731ee
Fixes #42920.
Fixes #42875.
Fixes #42927.
Fixes #43059.
Check the error in the walk function, so that we don't use info when there is an error.
@yujuhong @dchen1107 @derekwaynecarr @vishh /cc @kubernetes/sig-node-bugs