Fix kubelet panic in cgroup manager. #42927
Conversation
@calebamiles Mark 1.6 because this is a kubelet bug fix.
/lgtm
@k8s-bot gce etcd3 e2e test this
Thanks for fixing this so quickly.
@@ -449,12 +449,16 @@ func (m *cgroupManagerImpl) Pids(name CgroupName) []int {

	// WalkFunc which is called for each file and directory in the pod cgroup dir
	visitor := func(path string, info os.FileInfo, err error) error {
		if err != nil {
			glog.V(5).Infof("cgroup manager encountered error scanning cgroup path %q", path)
			return filepath.SkipDir
I wonder if it's safe to skip here...
What prompted you to add this logic?
Because of this #42920
@vishh would you say it is equally safe as skipping 10 lines later if there's an error reading cgroup data?
The problem is that if there is an error, info may be nil, and that will cause a kubelet panic.
Do you mean that we should still try getCgroupProcs?
this makes sense to me.
I suspect systemd is the one that could cause errors in this logic. We do not expect transient errors with cgroupfs. So maybe in the future, we could have a list of well known controllers this logic expects to always work and fail loudly if one of those controllers throws access errors.
@Random-Liu Shouldn't we also print out err before the return? Same for the following errs?
@k8s-bot gce etcd3 e2e test this
this also fixes #42927, thanks!
@derekwaynecarr I believe you were referring to #42875. Fixed the PR description too.
Force-pushed b0f3500 to e6341cc
@dchen1107 Updated the PR to print out the error and change the log level. @derekwaynecarr Will this be quite spammy?
/lgtm
@derekwaynecarr I asked @Random-Liu to make the above change because the original change did fix the kubelet panic, but I worry that it hides some real issues in the system and leaves tons of uncleaned pod cgroups behind.
Yeah. It might be useful to generate events on cgroup deletion failures for now.
[APPROVALNOTIFIER] This PR is APPROVED The following people have approved this PR: Random-Liu, dchen1107, yujuhong Needs approval from an approver in each of these OWNERS Files:
We suggest the following people:
@k8s-bot cvm gce e2e test this
Apply LGTM based on #42927 (comment)
I'd be willing to bet this fixes most or all of the hits here: https://storage.googleapis.com/k8s-gubernator/triage/index.html?text=the%20server%20could%20not%20find%20the%20requested%20resource%20%5C(get#04092e6e21f91a7731ee
Fixes #42920.
Fixes #42875.
Fixes #42927.
Fixes #43059.
Check the error in the walk function, so that we don't use info when there is an error.
@yujuhong @dchen1107 @derekwaynecarr @vishh /cc @kubernetes/sig-node-bugs