Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix kubelet panic in cgroup manager. #42927

Merged
merged 1 commit into from Mar 14, 2017

Conversation

Random-Liu
Copy link
Member

@Random-Liu Random-Liu commented Mar 10, 2017

Fixes #42920.
Fixes #42875
Fixes #42927
Fixes #43059

Check the error in walk function, so that we don't use info when there is an error.

@yujuhong @dchen1107 @derekwaynecarr @vishh /cc @kubernetes/sig-node-bugs

@Random-Liu Random-Liu added area/kubelet release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Mar 10, 2017
@Random-Liu Random-Liu added this to the v1.6 milestone Mar 10, 2017
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 10, 2017
@k8s-github-robot k8s-github-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 10, 2017
@Random-Liu
Copy link
Member Author

@calebamiles Mark 1.6 because this is a kubelet bug fix.

@k8s-reviewable
Copy link

This change is Reviewable

@ethernetdan ethernetdan added the kind/bug Categorizes issue or PR as related to a bug. label Mar 10, 2017
@yujuhong
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 11, 2017
@Random-Liu
Copy link
Member Author

@k8s-bot gce etcd3 e2e test this

@dchen1107
Copy link
Member

Thanks for fixing this so quickly.

@@ -449,12 +449,16 @@ func (m *cgroupManagerImpl) Pids(name CgroupName) []int {

// WalkFunc which is called for each file and directory in the pod cgroup dir
visitor := func(path string, info os.FileInfo, err error) error {
if err != nil {
glog.V(5).Infof("cgroup manager encountered error scanning cgroup path %q", path)
return filepath.SkipDir
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if its safe to skip here...
What prompted you to add this logic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of this #42920

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vishh would you say it is equally safe as skipping 10 lines later if there's an error reading cgroup data?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that if there is an error, info may be nil. And this will cause kubelet panic.

Do you mean that we should still try getCgroupProcs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes sense to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect systemd is the one that could cause errors in this logic. We do not expect transient errors with cgroupfs. So maybe in the future, we could have a list of well known controllers this logic expects to always work and fail loudly if one of those controllers throws access errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Random-Liu Shouldn't we also print out err before the return? Same for the following errs?

@Random-Liu
Copy link
Member Author

@k8s-bot gce etcd3 e2e test this

@@ -449,12 +449,16 @@ func (m *cgroupManagerImpl) Pids(name CgroupName) []int {

// WalkFunc which is called for each file and directory in the pod cgroup dir
visitor := func(path string, info os.FileInfo, err error) error {
if err != nil {
glog.V(5).Infof("cgroup manager encountered error scanning cgroup path %q", path)
return filepath.SkipDir
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes sense to me.

@derekwaynecarr
Copy link
Member

this also fixes #42927, thanks!

@dchen1107
Copy link
Member

@derekwaynecarr I believe you were referring to #42875

Fixed the pr description too.

@dchen1107 dchen1107 removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 13, 2017
@Random-Liu
Copy link
Member Author

@dchen1107 Updated the PR to print out the error and change log level to V(4), so that we could see the transient error in the test log.

@derekwaynecarr Will this be quite spammy?

@dchen1107
Copy link
Member

/lgtm

@dchen1107
Copy link
Member

@derekwaynecarr I asked @Random-Liu made above change because the original change did fix the kubelet panic, but I worry about it hides some real issues in the system, and leaves tons of uncleaned pod-cgroup behind.

@dchen1107 dchen1107 self-assigned this Mar 13, 2017
@vishh
Copy link
Contributor

vishh commented Mar 13, 2017 via email

@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

The following people have approved this PR: Random-Liu, dchen1107, yujuhong

Needs approval from an approver in each of these OWNERS Files:

We suggest the following people:
cc @vishh
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@Random-Liu
Copy link
Member Author

@k8s-bot cvm gce e2e test this

@Random-Liu Random-Liu added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 14, 2017
@Random-Liu
Copy link
Member Author

Apply LGTM based on #42927 (comment)

@ncdc
Copy link
Member

ncdc commented Mar 14, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet