
Fix for race condition caused by concurrent fsnotify (CREATE and REMOVE) in kubelet/plugin_watcher.go #71440

Merged
merged 1 commit into from Nov 28, 2018

Conversation

vladimirvivien
Member

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR fixes a race condition that can cause CSI annotations added to the Node API object to suddenly disappear after a driver-registrar pod has been deleted and recreated by a replica controller (see #71424 for details).

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #71424

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 27, 2018
@vladimirvivien
Member Author

/milestone v1.13
/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. labels Nov 27, 2018
@k8s-ci-robot
Contributor

@vladimirvivien: You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to set the milestone.

In response to this:

/milestone v1.13
/priority critical-urgent

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 27, 2018
@vladimirvivien
Member Author

vladimirvivien commented Nov 27, 2018

This code change fixes a race condition that can surface when fsnotify CREATE and DELETE events are handled concurrently. This can cause CSI driver operations that rely on CREATE and DELETE to be triggered out of order.

Investigation

The fix was tested primarily using the kubernetes/hack/local-up-cluster.sh script. With each script run, the following steps were performed:

Deploy:

kubectl create serviceaccount csi-node-sa
kubectl create serviceaccount csi-attacher
kubectl apply -f test/e2e/testing-manifests/storage-csi/hostpath/hostpath/

Clean:

sudo mount | grep kubelet
sudo umount /var/lib/kubelet/<path>
sudo rm -rf /var/lib/kubelet
rm /tmp/kube*

Prior to fix
Prior to the fix, I ran the code 10 times using the local-up-cluster script. The results were consistent with the issue identified in #71424, where the CSI node annotation would disappear, and not be updated, after a pod delete (and pod redeploy):

 1. local-up-cluster, <deploy>, annotation present, delete all pods, annotation missing, shutdown, <clean>
 2. local-up-cluster, <deploy>, annotation present, delete all pods, annotation missing, shutdown, <clean>
 3. local-up-cluster, <deploy>, annotation present, delete all pods, annotation missing, shutdown, <clean>
 4. local-up-cluster, <deploy>, annotation present, delete all pods, annotation present, shutdown, <clean>
 5. local-up-cluster, <deploy>, annotation present, delete all pods, annotation present then missing, shutdown, <clean>
 6. local-up-cluster, <deploy>, annotation present, delete all pods, annotation missing, shutdown, <clean>
 7. local-up-cluster, <deploy>, annotation present, delete all pods, annotation present then missing, shutdown, <clean>
 8. local-up-cluster, <deploy>, annotation present, delete all pods, annotation present then missing, shutdown, <clean>
 9. local-up-cluster, <deploy>, annotation present (after 10+ secs), delete all pods, annotation present, shutdown, <clean>
10. local-up-cluster, <deploy>, annotation present, delete all pods, annotation present then missing, shutdown, <clean>

Fix

The fix was to force the fsnotify CREATE and DELETE events to be handled serially (a minimal sketch of the serialized handling follows the run log below). After the code was fixed, the CSI node annotation consistently reappeared on the node after the loss (and subsequent redeployment) of a driver-registrar sidecar:

 1. local-up-cluster, <deploy>, annotation present (10x), delete all pods, annotation present (10x), shutdown, <clean>
 2. local-up-cluster, <deploy>, annotation present (10x), delete all pods, annotation present (10x), shutdown, <clean>
 3. local-up-cluster, <deploy>, annotation present (10x), delete all pods, annotation present (10x), shutdown, <clean>
 4. local-up-cluster, <deploy>, annotation present (10x), delete all pods, annotation present (10x), shutdown, <clean>
 5. local-up-cluster, <deploy>, annotation present (10x), delete all pods, annotation present (10x), shutdown, <clean>
 6. local-up-cluster, <deploy>, annotation present (10x), delete all pods, annotation present (10x), shutdown, <clean>
 7. local-up-cluster, <deploy>, annotation present (10x), delete all pods, annotation present (10x), shutdown, <clean>
 8. local-up-cluster, <deploy>, annotation present (10x), delete all pods, annotation present (10x), shutdown, <clean>
 9. local-up-cluster, <deploy>, annotation present (10x), delete all pods, annotation present (10x), shutdown, <clean>
10. local-up-cluster, <deploy>, annotation present (10x), delete all pods, annotation present (10x)
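
To illustrate the shape of the change, here is a minimal sketch, assuming the standard fsnotify library API; handleCreate, handleRemove, and the watched path are hypothetical stand-ins for the plugin watcher's real handlers, not the kubelet code itself:

// Minimal sketch (not the actual kubelet code): consume fsnotify events in a
// single loop and handle each one inline, so a CREATE for a socket can never
// be processed after the REMOVE that follows it.
package main

import (
    "log"

    "github.com/fsnotify/fsnotify"
)

// handleCreate and handleRemove are hypothetical stand-ins for the plugin
// watcher's real registration/deregistration handlers.
func handleCreate(path string) { log.Printf("register plugin socket %s", path) }
func handleRemove(path string) { log.Printf("deregister plugin socket %s", path) }

func watch(dir string) error {
    w, err := fsnotify.NewWatcher()
    if err != nil {
        return err
    }
    defer w.Close()
    if err := w.Add(dir); err != nil {
        return err
    }
    for {
        select {
        case event, ok := <-w.Events:
            if !ok {
                return nil
            }
            // Handle the event inline. The pre-fix code wrapped this body in
            // `go func() { ... }()`, which let CREATE and REMOVE handling for
            // the same socket interleave.
            switch {
            case event.Op&fsnotify.Create == fsnotify.Create:
                handleCreate(event.Name)
            case event.Op&fsnotify.Remove == fsnotify.Remove:
                handleRemove(event.Name)
            }
        case err, ok := <-w.Errors:
            if !ok {
                return nil
            }
            log.Printf("fsnotify error: %v", err)
        }
    }
}

func main() {
    // Hypothetical demo directory, not the kubelet plugin path.
    if err := watch("/tmp/plugins-demo"); err != nil {
        log.Fatal(err)
    }
}

The point of the sketch is only the control flow: events are consumed and handled one at a time from the watcher's channel, mirroring the go func() to func() change in this PR.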

@vladimirvivien
Member Author

/sig storage

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Nov 27, 2018
@vladimirvivien vladimirvivien changed the title Fixes race condition caused by concurrent fsnotify (CREATE and REMOVE) Fixe for race condition caused by concurrent fsnotify (CREATE and REMOVE) in kubelet/plugin_watcher.go Nov 27, 2018
@msau42
Member

msau42 commented Nov 27, 2018

IIUC, this means that the handlers for each module (i.e., CSI or device plugin) will run serially. We should double-check that the CSI handlers won't block for too long, and that this is an acceptable limitation for the device plugin.

@AishSundar
Contributor

/test pull-kubernetes-integration

@vladimirvivien
Member Author

@msau42 Yes, that is a good point, since both modules share the same process queue. This is a safe solution for the CSI handler: during Handle.RegisterPlugin, CSI's longest execution paths are calls to the API server and to the backing storage driver itself, both of which have timeouts.

A possible solution in the near future is to give each module type (CSI, device plugin) its own process queue.
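
For illustration only, a rough sketch of that per-type-queue idea; the names and types below are hypothetical and not a proposal for plugin_watcher's actual API:

// Hypothetical sketch (not a concrete proposal for plugin_watcher): give each
// plugin type its own serial queue, so a slow CSI registration cannot delay
// device-plugin events and vice versa.
package main

import (
    "fmt"
    "sync"
)

type pluginEvent struct {
    pluginType string // e.g. "CSIPlugin" or "DevicePlugin"
    socketPath string
}

type dispatcher struct {
    wg     sync.WaitGroup
    queues map[string]chan pluginEvent
}

func newDispatcher(types ...string) *dispatcher {
    d := &dispatcher{queues: make(map[string]chan pluginEvent)}
    for _, t := range types {
        ch := make(chan pluginEvent, 16)
        d.queues[t] = ch
        d.wg.Add(1)
        // One worker per plugin type: events for the same type are still
        // handled serially, but the types no longer share a single queue.
        go func(t string, ch chan pluginEvent) {
            defer d.wg.Done()
            for e := range ch {
                fmt.Printf("handling %s event for %s\n", t, e.socketPath)
            }
        }(t, ch)
    }
    return d
}

func (d *dispatcher) dispatch(e pluginEvent) {
    if ch, ok := d.queues[e.pluginType]; ok {
        ch <- e
    }
}

func (d *dispatcher) stop() {
    for _, ch := range d.queues {
        close(ch)
    }
    d.wg.Wait()
}

func main() {
    d := newDispatcher("CSIPlugin", "DevicePlugin")
    d.dispatch(pluginEvent{pluginType: "CSIPlugin", socketPath: "/var/lib/kubelet/plugins/example/csi.sock"})
    d.dispatch(pluginEvent{pluginType: "DevicePlugin", socketPath: "/var/lib/kubelet/device-plugins/example.sock"})
    d.stop()
}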

@AishSundar
Contributor

/milestone v1.13

@k8s-ci-robot k8s-ci-robot added this to the v1.13 milestone Nov 27, 2018
@AishSundar
Contributor

/test pull-kubernetes-e2e-kops-aws

@saad-ali
Member

Seems like a reasonable fix for now. Thanks @vladimirvivien.

/assign @vishh
/assign @liggitt

@krmayankk

Is there a way to add tests for these kinds of scenarios in e2e?

@bertinatto
Member

bertinatto commented Nov 27, 2018

I tested this with the EBS plugin and it solved the problem that I had. Thanks, @vladimirvivien.

@vladimirvivien
Member Author

@krmayankk this was a race issue (not functionality) that would have been a bit hard to detect.

@vladimirvivien vladimirvivien changed the title Fixe for race condition caused by concurrent fsnotify (CREATE and REMOVE) in kubelet/plugin_watcher.go Fix for race condition caused by concurrent fsnotify (CREATE and REMOVE) in kubelet/plugin_watcher.go Nov 27, 2018
@msau42
Member

msau42 commented Nov 27, 2018

#70439 adds an e2e test for this

@msau42
Member

msau42 commented Nov 27, 2018

Oops #70578

@@ -111,7 +111,7 @@ func (w *Watcher) Start() error {
 //TODO: Handle errors by taking corrective measures

 w.wg.Add(1)
-go func() {
+func() {
Member

the goroutine inside traversePluginDir also makes order of events non-deterministic, especially if changes are occurring at the same time the initial scan is being done.

Member

does the handleCreateEvent function verify the created path still exists?

Member Author

The goroutine in traversePluginDir is used to place items on an unbuffered channel, w.fsWatcher.Events. It must be present to avoid deadlocks.

Member Author

handleCreateEvent does not explicitly re-check the existence of the dir right before the driver is delegated to handle the registration.

With the changes in this PR, the fsnotify CREATE/DELETE operations should not occur out of sync. If a dir existed right before registration, it should not go away until a DELETE event comes right after it.

Contributor

the fsnotify CREATE/DELETE operations should not occur out of sync

That is the expectation from the underlying library. Just to be safe, it would be worth catching any potential issue around delete-after-create operations and logging it.

Member

With the changes in this PR, the fsnotify CREATE/DELETE operations should not occur out of sync. If a dir existed right before Registration, it should not go away until a DELETE event comes right after it.

nothing guarantees that, correct?

  1. traversePluginDir is called

  2. traversePluginDir adds a filesystem watch to a particular directory

    case mode.IsDir():
        if w.containsBlacklistedDir(path) {
            return filepath.SkipDir
        }
        if err := w.fsWatcher.Add(path); err != nil {
            return fmt.Errorf("failed to watch %s, err: %v", path, err)
        }

  3. traversePluginDir descends into the directory and adds synthetic Create events for the files found in the dir via a goroutine

    go func() {
        defer w.wg.Done()
        w.fsWatcher.Events <- fsnotify.Event{
            Name: path,
            Op:   fsnotify.Create,
        }
    }()

because there is an active watcher registered in step 2 that can immediately start delivering events, and the synthetic create events in step 3 are delivered via a goroutine, they can interleave with actual observed filesystem events in non-deterministic ways. For example, if a driver is being deleted while this runs:

  1. filepath.Walk lists dir, sees driver socket file
  2. driver socket file is deleted
  3. filesystem delete event is observed and queued
  4. filepath.Walk queues synthetic create event

Because the delete is handled first, and then the synthetic create event, could we end up with a registered driver that doesn't actually exist any more?

Contributor

To guarantee consistency, shouldn't you be enqueuing existing sockets first, prior to accepting fsnotify events from the kernel? Imagine a situation where a socket was identified by path traversal, but before traversePluginDir can enqueue a create event, the socket gets deleted and that delete event gets processed before the creation event.

Member

@liggitt Nov 27, 2018

To guarantee consistency, shouldn't you be enqueuing existing sockets first prior to accepting fsnotify events from the kernel?

That's what I expected as well: register watches on the dirs, process the contents and enqueue synthetic create events, then start processing the events from the registered watchers.

Imagine a situation where a socket was identified by path traversal, but before traversePluginDir can enqueue a create event, the socket get's deleted and that event gets processed first before the creation event?

yes, that's exactly the scenario described above
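
For illustration, a rough sketch of that ordering, assuming the fsnotify and filepath.Walk APIs and a flat directory of sockets; start and handle are hypothetical names, not the current plugin_watcher code:

// Rough sketch (not the current plugin_watcher code) of the ordering discussed
// above: register the watch and collect synthetic Create events for existing
// sockets before draining any live events from the watcher, so the initial
// scan cannot interleave with filesystem events. Assumes a flat directory.
package main

import (
    "log"
    "os"
    "path/filepath"

    "github.com/fsnotify/fsnotify"
)

func start(dir string, handle func(fsnotify.Event)) error {
    w, err := fsnotify.NewWatcher()
    if err != nil {
        return err
    }
    if err := w.Add(dir); err != nil {
        return err
    }

    // Step 1: walk the directory and collect synthetic Create events locally.
    var backlog []fsnotify.Event
    err = filepath.Walk(dir, func(path string, info os.FileInfo, walkErr error) error {
        if walkErr != nil {
            return walkErr
        }
        if !info.IsDir() {
            backlog = append(backlog, fsnotify.Event{Name: path, Op: fsnotify.Create})
        }
        return nil
    })
    if err != nil {
        return err
    }

    // Step 2: process the backlog of synthetic Create events first...
    for _, e := range backlog {
        handle(e)
    }

    // Step 3: ...then start draining live events from the registered watch, so
    // any delete observed during the scan is handled after the synthetic
    // create for the same socket, never before it.
    go func() {
        for e := range w.Events {
            handle(e)
        }
    }()
    return nil
}

func main() {
    if err := start("/tmp/plugins-demo", func(e fsnotify.Event) { log.Println(e.Op, e.Name) }); err != nil {
        log.Fatal(err)
    }
    select {} // block forever for the demo
}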

Member Author

@liggitt @vishh I think the code already has the happens-before/happens-after serial properties that you are alluding to. The code seems to have 1-to-1 parity between observed filesystem events and synthetic queued events. To explain, let's further unpack the scenario that Jordan presented earlier:

  1. filepath.Walk lists dir, sees driver socket file

So let's look at some scenarios.

Scenario 1

  1. filepath.Walk hits the dir, adds a watcher for it, continues
  2. filesystem creates the socket file (from the driver)
    a. But the socket file is immediately deleted from the filesystem
    b. According to fswatcher, if the file is removed before it is observed, the Walk will generate an error
  3. filepath.Walk receives an error because the watcher is missing, and returns

Scenario 2

  1. filepath.Walk hits the dir, adds a watcher for it, continues
  2. filesystem creates the socket file (from the driver)
  3. filepath.Walk receives the socket file info (prior to deletion)
    a. queues a synthetic create event
  4. Socket file is deleted from the filesystem
  5. filepath.Walk receives the deleted file info (after deletion)
    a. enqueues the observed delete event

Because there is an ordering between the creation and immediate deletion of the socket files, the observed events will have before/after relationships. Therefore, the synthetic events that are generated and placed on the internal event queue (fsWatcher.Events) should also inherit that ordering.

Member

the scenario described in #71440 (comment) is still racy

The synthetic create events traversePluginDir sends to the channel (for socket files encountered by filepath.Walk) are independent of (and can race with) create/delete events sent to the channel by the registered filesystem watchers.

That said, if a synthetic create event was processed after an actual observed delete event, handleCreateEvent does verify the created path still exists:

fi, err := os.Stat(event.Name)
if err != nil {
    return fmt.Errorf("stat file %s failed: %v", event.Name, err)
}

I still think the raciness should be fixed in a follow up because it makes the event flow hard to understand and relies on compensation in the event handler, but in the context of this PR, it is not unsafe.

@vishh
Contributor

vishh commented Nov 27, 2018

Handling serially should be fine for current and future plugin types. I don't expect plugins to have a lot of churn or volume (# of plugins).

@liggitt
Member

liggitt commented Nov 27, 2018

Handling serially should be fine for current and future plugin types. I don't expect plugins to have a lot of churn or volume (# of plugins).

Agree, I'm not concerned about making this bit serial. I am concerned that there is still a race between the synthetic create events from traversePluginDir and the immediately-observed filesystem events from the registered handlers.

@liggitt
Member

liggitt commented Nov 27, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 27, 2018
@saad-ali
Member

saad-ali commented Nov 27, 2018

Required known issue in release notes: If kubelet plugin registration for a driver fails, kubelet will not retry. The driver must delete and recreate the driver registration socket in order to force kubelet to attempt registration again. Restarting only the driver container may not be sufficient to trigger recreation of the socket, instead a pod restart may be required.

@vishh
Contributor

vishh commented Nov 27, 2018

@saad-ali you wrote:

instead a pod restart may be required.

What do you mean by pod restart? Shouldn't it be deleting the pod instead?

@vishh
Contributor

vishh commented Nov 27, 2018

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vishh, vladimirvivien

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 27, 2018
@AishSundar
Contributor

@marpaia @dstrebel for adding the Known issue to Release notes

I still think the raciness should be fixed in a follow up because it makes the event flow hard to understand and relies on compensation in the event handler, but in the context of this PR, it is not unsafe.

@vishh @liggitt did you want the remaining raciness during initialization fixed in a follow-up PR for 1.13, or are we OK mentioning it as a known issue and addressing it in 1.14? Speaking with @saad-ali, it looks like this is an edge case that was caught in code review rather than an actual repro during manual testing. That probably reduces the chances of a real user hitting it, but I would like to know your final evaluation of its severity.

@liggitt
Member

liggitt commented Nov 27, 2018

Known issue and post-1.13.0 follow up is fine

@liggitt
Member

liggitt commented Nov 28, 2018

The known issue is not actually a raciness issue; I will coordinate on the known issue text.

@vishh
Contributor

vishh commented Nov 28, 2018 via email

@AishSundar
Contributor

/test pull-kubernetes-integration

Successfully merging this pull request may close these issues.

CSI: kubelet removes NodeID annotation when a driver is restarted