
Fix a data race in kubelet image manager #88915

Merged

Conversation

@roycaihw
Member

roycaihw commented Mar 6, 2020

What type of PR is this?
/kind bug

What this PR does / why we need it:
Let the image manager return a copy of the image list. Today it sorts and returns the internal image list cache to the caller directly, which introduces a data race. Because the sort is in-place, we've observed the race causing static pod workers to panic during syncPod. However, given the way kubelet recovers from goroutine panics, kubelet kept running while the pod worker silently stopped working: all future work dispatched for that pod got ignored.

git log shows the race was introduced in 1.10. We should cherrypick this fix to all the supported minor releases.
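To make the bug class concrete, here is a standalone sketch (illustrative only, not the kubelet code) that `go run -race` should flag: get() sorts the cached slice in place and hands it out, so a caller still reading a previously returned slice races with another caller's sort.

```go
package main

import (
	"sort"
	"sync"
)

type cache struct {
	sync.Mutex
	images []int // stand-in for []container.Image
}

func (c *cache) set(images []int) {
	c.Lock()
	defer c.Unlock()
	c.images = images
}

// Buggy: sorts the internal slice in place and hands it out to the caller.
func (c *cache) get() []int {
	c.Lock()
	defer c.Unlock()
	sort.Ints(c.images)
	return c.images
}

// Fixed (this PR's initial approach): sort and return a copy instead.
func (c *cache) getCopy() []int {
	c.Lock()
	defer c.Unlock()
	cp := append([]int{}, c.images...)
	sort.Ints(cp)
	return cp
}

func main() {
	c := &cache{}
	c.set([]int{3, 1, 2})

	var wg sync.WaitGroup
	wg.Add(3)
	go func() { // writer: refreshes the cache periodically, like the image GC manager
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			c.set([]int{3, 1, 2})
		}
	}()
	for r := 0; r < 2; r++ { // readers: use the returned slice outside the lock,
		go func() { // like the node status setters do
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				list := c.get() // with getCopy() the race goes away
				for _, v := range list {
					_ = v // unsynchronized reads race with another reader's in-place sort
				}
			}
		}()
	}
	wg.Wait()
}
```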

Sample panic log from 1.13:

runtime.go:73] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 755 [running]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x..., 0x...)
	...k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:69 +0x...
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x..., 0x..., 0x...)
	...k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51 +0x...
panic(0x..., 0x...)
	/usr/local/go/src/runtime/panic.go:522 +0x...
k8s.io/kubernetes/pkg/kubelet/nodestatus.Images.func1(0x..., 0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/nodestatus/setters.go:413 +0x...
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).setNodeStatus(0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/kubelet_node_status.go:480 +0x...
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).initialNode(0x..., 0x..., 0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/kubelet_node_status.go:348 +0x...
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).GetNode(0x..., 0x..., 0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/kubelet_getters.go:220 +0x...
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).syncPod(0x..., 0x..., 0x..., 0x..., 0x..., 0x..., 0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/kubelet.go:1666 +0x...
k8s.io/kubernetes/pkg/kubelet.(*podWorkers).managePodLoop.func1(0x..., 0x..., 0x..., 0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/pod_workers.go:174 +0x...
k8s.io/kubernetes/pkg/kubelet.(*podWorkers).managePodLoop(0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/pod_workers.go:183 +0x...
k8s.io/kubernetes/pkg/kubelet.(*podWorkers).UpdatePod.func1(0x..., 0x...)
	...k8s.io/kubernetes/pkg/kubelet/pod_workers.go:221 +0x...
created by k8s.io/kubernetes/pkg/kubelet.(*podWorkers).UpdatePod
	...k8s.io/kubernetes/pkg/kubelet/pod_workers.go:219 +0x...

Which issue(s) this PR fixes:

Fixes #60443

Special notes for your reviewer:
I'm going to add a proposal to SIG Node about catching/mitigating this kind of programming error in the future. In this case the issue would have been mitigated if we had either 1) actually crashed the kubelet and let it restart, or 2) created a new goroutine when the old one panicked. Restarting the kubelet is preferred IMO because it guards against a wide range of programming errors, and it's usually safe to restart the kubelet as long as it doesn't happen too frequently.
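For illustration, a minimal sketch (not the kubelet's actual recovery machinery) of why a recovered panic makes a worker stop silently while the process lives on: the deferred recover() swallows the panic, the goroutine returns, and nothing restarts it or drains its work channel.

```go
package main

import (
	"fmt"
	"time"
)

func managePodLoop(work <-chan string) {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("observed a panic:", r) // logged, but the loop below is gone
		}
	}()
	for item := range work {
		if item == "bad" {
			panic("nil pointer dereference") // simulated syncPod failure
		}
		fmt.Println("synced", item)
	}
}

func main() {
	work := make(chan string, 10)
	go managePodLoop(work)

	work <- "pod-a" // handled
	work <- "bad"   // panics; recover() logs it and the worker goroutine exits
	work <- "pod-b" // silently ignored: nothing is reading the channel anymore

	time.Sleep(100 * time.Millisecond)
	fmt.Println("kubelet-like process still running; pod-b never synced")
}
```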

Does this PR introduce a user-facing change?:

Fixed a data race in kubelet image manager that can cause static pod workers to silently stop working.

/cc @Random-Liu @cheftako
/priority critical-urgent
/sig node

@roycaihw

Member Author

roycaihw commented Mar 6, 2020

I'm not sure what additional approval we need given that it's code freeze. I will prepare the cherry-picks once this is merged into master.

@@ -548,14 +550,19 @@ func TestValidateImageGCPolicy(t *testing.T) {
}
}

//

@Random-Liu

@roycaihw

roycaihw Mar 6, 2020

Author Member

Ah, I was going to add a test description but decided not to at some point. I've updated the test comments and the variable name to expect, to make things clearer.

// Make a copy of testList because cache.set takes ownership of the list
testListCopy := append([]container.Image{}, testList...)
// Expect cache.get to sort the image list
sort.Sort(sliceutils.ByImageSize(testListCopy))
cache.set(testList)

@Random-Liu

Random-Liu Mar 6, 2020

Member

We should also test that testList is not changed after get.

@roycaihw

roycaihw Mar 6, 2020

Author Member

It is changed. That's why the test didn't catch the bug before. Please see the comment I added for imageCache.set()

It brings another question: should imageCache.set() make a copy of the given list? E.g.

func (i *imageCache) set(images []container.Image) {
	i.Lock()
	defer i.Unlock()
	imagesCopy := append([]container.Image{}, images...)
	i.images = imagesCopy
}

This way the caller of imageCache.set() can safely reference images afterwards. I think it's a reasonable change but not as critical. Currently this goroutine is the only caller of imageCache.set() and it doesn't reference images afterwards, so the change doesn't matter in practice. Since we are going to cherry-pick this fix, I prefer to keep the change to the minimum needed. That's why I didn't change the imageCache.set() behavior here.

@roycaihw

roycaihw Mar 10, 2020

Author Member

I moved the sort logic outside of the image cache and tested that testList is not changed after get

@Random-Liu Random-Liu modified the milestones: v1.19, v1.18 Mar 6, 2020
@Random-Liu

Member

Random-Liu commented Mar 6, 2020

This is a critical bug fix! Mark 1.18.

@roycaihw roycaihw force-pushed the roycaihw:fix/image-manager-data-race branch from d03a564 to 64cda98 Mar 6, 2020
@roycaihw

Member Author

roycaihw commented Mar 6, 2020

/test pull-kubernetes-e2e-kind

@roycaihw

Member Author

roycaihw commented Mar 6, 2020

/test pull-kubernetes-integration

sort.Sort(sliceutils.ByImageSize(i.images))
return i.images
// Make a copy of image list to avoid data race
images := append([]container.Image{}, i.images...)

@liggitt

liggitt Mar 7, 2020

Member

How frequently is this read? Should we sort on set and return the list without sorting/allocating?

@roycaihw

roycaihw Mar 9, 2020

Author Member

That's a good point! It's invoked in different places, so it's hard to tell from the source alone. I searched for the two relevant log messages on a master node and found the read is invoked every few (<5) seconds (see the log below). Compared to the write, which happens every 30s, I agree it makes sense to sort on set.

Regarding whether we should skip allocating on return, I'm not sure it's safe to do. Currently we truncate the length of the returned list, which has no effect on the original list (https://play.golang.org/p/fc6JmArJtfW). But I still feel it's safer to copy on return, to prevent the original list from being mutated in the future.
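A quick standalone check of that point (roughly what the playground link demonstrates; not kubelet code): re-slicing the returned list to a shorter length leaves the original untouched, but in-place mutation is visible to everything sharing the backing array.

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	original := []int{5, 3, 4, 1, 2}

	returned := original                  // same backing array, as get() returns today
	truncated := returned[:2]             // a shorter view; original's length is unchanged
	fmt.Println(len(original), truncated) // 5 [5 3]

	sort.Ints(returned)   // in-place sort is visible through original
	fmt.Println(original) // [1 2 3 4 5]

	copied := append([]int{}, original...) // copy-on-return isolates the caller
	copied[0] = 99
	fmt.Println(original[0]) // still 1
}
```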

I0309 21:40:27.927125    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:35.971794    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:39.926950    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:43.582650    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:44.926879    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:45.926867    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:46.104704    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:52.927557    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:53.927034    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:56.239125    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:40:56.926903    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:06.404693    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:08.926829    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:11.926992    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:14.926864    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:16.541316    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:19.929088    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:20.926684    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:23.926947    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:23.927801    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:26.720531    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:28.926814    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:32.926843    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:32.927004    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:34.926880    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:36.884823    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:37.926909    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:38.926790    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:39.927096    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:43.582911    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:43.927033    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:47.013040    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:54.926907    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:55.927053    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:56.927574    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:41:57.161981    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:02.927006    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:04.926781    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:05.927033    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:07.335855    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:13.927023    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:17.502770    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:19.927029    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:21.926896    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:27.638850    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
I0309 21:42:32.926875    1438 kubelet_node_status.go:279] Setting node annotation to enable volume controller attach/detach 
@roycaihw roycaihw force-pushed the roycaihw:fix/image-manager-data-race branch from 07d924e to d5f1d52 Mar 10, 2020
// The image list needs to be sorted when it gets read and used in
// setNodeStatusImages. We sort the list on write instead of on read,
// because the image cache is more often read than written
sort.Sort(sliceutils.ByImageSize(images))

@roycaihw

roycaihw Mar 10, 2020

Author Member

I was almost going to add another copy before this line... Perhaps it's better to put the copy and the sort inside set(), to make the code more contained? @liggitt WDYT?

@liggitt

liggitt Mar 10, 2020

Member

moving copy and sort into set makes sense to me

@roycaihw roycaihw force-pushed the roycaihw:fix/image-manager-data-race branch from d5f1d52 to b4cdcd4 Mar 10, 2020
@roycaihw

Member Author

roycaihw commented Mar 10, 2020

@liggitt @Random-Liu Thanks for the review! I've updated the change. PTAL

Member

Random-Liu left a comment

/lgtm
/approve

func (i *imageCache) set(images []container.Image) {
i.Lock()
defer i.Unlock()
// make a copy to avoid potential data race

@Random-Liu

Random-Liu Mar 10, 2020

Member

I think we can't avoid the data race, because the image list is passed in, and the caller may concurrently do some operation on it.

However, I do agree that copying it here is better. :)

@roycaihw

roycaihw Mar 10, 2020

Author Member

Ah, you're right. I updated the comment to document that the caller should not reference the list again (similar to the comments for the client-go cache and FIFO queue):

 // set sorts the input list and updates image cache.
+// 'i' takes ownership of the list, you should not reference the list again
+// after calling this function.
 func (i *imageCache) set(images []container.Image) {
        i.Lock()
        defer i.Unlock()
-       // make a copy to avoid potential data race
+       // Make a copy of the input list. This doesn't completely avoid
+       // potential data race. The caller should not reference the list again
+       // after calling this function.
@roycaihw roycaihw force-pushed the roycaihw:fix/image-manager-data-race branch from b4cdcd4 to 9117a9f Mar 10, 2020
@k8s-ci-robot k8s-ci-robot removed the lgtm label Mar 10, 2020
// Make a copy of the input list. This doesn't completely avoid
// potential data race. The caller should not reference the list again
// after calling this function.

@liggitt

liggitt Mar 10, 2020

Member

I think this comment is unnecessarily alarming... the set() method is private, and the only place we call it does not hold onto the list and change it further. The godoc on the method clarifying ownership is probably sufficient.

@roycaihw

roycaihw Mar 10, 2020

Author Member

Thanks. Removed the comment.

@roycaihw roycaihw force-pushed the roycaihw:fix/image-manager-data-race branch 2 times, most recently from 235708e to 913122e Mar 10, 2020
@liggitt

Member

liggitt commented Mar 10, 2020

Traced all callers of the setter/getter, and none modify the slice itself. There is one place that modifies slice members of the image items in the list:

names := append(image.RepoDigests, image.RepoTags...)

That should change to accumulate names in a new []string var
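For illustration, a standalone sketch (the Image type here is a stand-in, not the kubelet's) of why that append is risky: appending to a slice taken from a shared struct can reuse its spare capacity and mutate data other readers see, which is why accumulating into a new slice is safer.

```go
package main

import "fmt"

type Image struct {
	RepoDigests []string
	RepoTags    []string
}

func main() {
	backing := make([]string, 1, 4) // spare capacity, e.g. a reused buffer
	backing[0] = "sha256:abc"
	img := Image{RepoDigests: backing, RepoTags: []string{"nginx:latest"}}

	// Risky: cap > len, so append writes into RepoDigests' backing array,
	// i.e. into memory shared with the cached image.
	names := append(img.RepoDigests, img.RepoTags...)
	fmt.Println(names, backing[:2]) // the tag now also shows up in backing

	// Suggested fix: accumulate names in a new []string that this caller owns.
	safe := make([]string, 0, len(img.RepoDigests)+len(img.RepoTags))
	safe = append(safe, img.RepoDigests...)
	safe = append(safe, img.RepoTags...)
	fmt.Println(safe)
}
```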

race and improve efficiency
@roycaihw roycaihw force-pushed the roycaihw:fix/image-manager-data-race branch from 913122e to 462b753 Mar 10, 2020
@k8s-ci-robot k8s-ci-robot added size/M and removed size/S labels Mar 10, 2020
@roycaihw

Member Author

roycaihw commented Mar 10, 2020

That should change to accumulate names in a new []string var

Fixed that line.

For the record, I talked with @liggitt offline: making copies of the image list could be expensive given that it's done every few seconds, and I'm not sure how big these lists are (it seems they can exceed 50 entries). Since these functions are internal, for efficiency we decided not to make copies, and instead documented these functions to make it clear that callers should not reference or modify the input/output in a way that introduces a race. I also removed the unit test because get() no longer provides a "copy" guarantee (and the test didn't ensure that anyway); we depend on the callers to behave well and avoid the race.
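A sketch of the cache's resulting shape as discussed above (simplified and reconstructed for illustration; the merged diff is authoritative): sort on write, no extra copies, and ownership documented instead.

```go
package imagecache

import (
	"sort"
	"sync"

	container "k8s.io/kubernetes/pkg/kubelet/container"
	"k8s.io/kubernetes/pkg/kubelet/util/sliceutils"
)

type imageCache struct {
	sync.Mutex
	images []container.Image
}

// set sorts the input list and updates the cache. The cache takes ownership
// of the list; the caller should not reference it again afterwards.
func (i *imageCache) set(images []container.Image) {
	i.Lock()
	defer i.Unlock()
	// Sort on write rather than on read: the cache is read every few seconds
	// but written only about every 30s.
	sort.Sort(sliceutils.ByImageSize(images))
	i.images = images
}

// get returns the cached, already-sorted list. Callers must treat it as
// read-only; mutating it (e.g. sorting it in place) would reintroduce the race.
func (i *imageCache) get() []container.Image {
	i.Lock()
	defer i.Unlock()
	return i.images
}
```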

@roycaihw

Member Author

roycaihw commented Mar 11, 2020

/retest

@liggitt

Member

liggitt commented Mar 11, 2020

/lgtm
/approve
/retest

@k8s-ci-robot

Contributor

k8s-ci-robot commented Mar 11, 2020

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, Random-Liu, roycaihw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@liggitt

Member

liggitt commented Mar 11, 2020

see also #88779 for issues with the kubelet not exiting on panic

@liggitt

Member

liggitt commented Mar 11, 2020

/retest

@roycaihw

Member Author

roycaihw commented Mar 11, 2020

/retest

@fejta-bot


fejta-bot commented Mar 11, 2020

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 562a420 into kubernetes:master Mar 11, 2020
16 of 17 checks passed
tide Not mergeable. Retesting: pull-kubernetes-verify
cla/linuxfoundation roycaihw authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-conformance-kind-ga-only-parallel Job succeeded.
pull-kubernetes-dependencies Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-kind Job succeeded.
pull-kubernetes-e2e-kind-ipv6 Job succeeded.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-node-e2e-containerd Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
@roycaihw roycaihw deleted the roycaihw:fix/image-manager-data-race branch Mar 11, 2020
k8s-ci-robot added a commit that referenced this pull request Mar 17, 2020
…15-upstream-release-1.15

Automated cherry pick of #88915: let image cache do sort on write instead of on read to avoid
k8s-ci-robot added a commit that referenced this pull request Mar 17, 2020
…15-upstream-release-1.16

Automated cherry pick of #88915: let image cache do sort on write instead of on read to avoid
k8s-ci-robot added a commit that referenced this pull request Mar 17, 2020
…15-upstream-release-1.17

Automated cherry pick of #88915: let image cache do sort on write instead of on read to avoid