Fix for race condition caused by concurrent fsnotify (CREATE and REMOVE) in kubelet/plugin_watcher.go #71440
Conversation
/milestone v1.13 |
@vladimirvivien: You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to set the milestone. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This change fixes a race condition that can surface when fsnotify CREATE and DELETE events are handled concurrently. This can cause CSI driver operations that rely on CREATE and DELETE ordering to be triggered out of sync.

Investigation

The fix was tested primarily using the kubernetes/hack/local-up-cluster.sh script. With each script run the following steps were done:

Deploy: kubectl create serviceaccount csi-node-sa
kubectl create serviceaccount csi-attacher
kubectl apply -f test/e2e/testing-manifests/storage-csi/hostpath/hostpath/

Clean: sudo mount | grep kubelet
sudo umount /var/lib/kubelet/<path>
sudo rm -rf /var/lib/kubelet
rm /tmp/kube*

Prior to fix

Fix

The fix was to force the handling of the fsnotify CREATE and DELETE events serially. After the code was fixed, the CSI node annotation consistently reappeared on the node after the loss (and subsequent redeployment) of a driver-registrar sidecar:
|
/sig storage |
IIUC, this means that the handler for each module (i.e. CSI or device plugin) will run the handlers serially. We should double-check that the CSI handlers won't block for too long and that this is an acceptable limitation for the device plugin. |
/test pull-kubernetes-integration |
@msau42 Yes, that is a good point, since both modules share the same process queue. This is a safe solution for the CSI handler: during Handle.RegisterPlugin, CSI's longest execution paths are calls to the API server and to the backing storage driver itself, both of which have timeouts. A possible future improvement is to give each module type (CSI, device plugin) its own process queue. |
/milestone v1.13 |
/test pull-kubernetes-e2e-kops-aws |
Seems like a reasonable fix for now. Thanks @vladimirvivien. |
Is there a way to add tests for these kinds of scenarios in e2e ? |
I tested this with the EBS plugin and it solved the problem that I had. Thanks, @vladimirvivien. |
@krmayankk this was a race issue (not functionality) that would have been a bit hard to detect. |
#70439 adds an e2e test for this |
Oops #70578 |
@@ -111,7 +111,7 @@ func (w *Watcher) Start() error {
	//TODO: Handle errors by taking corrective measures

	w.wg.Add(1)
-	go func() {
+	func() {
the goroutine inside traversePluginDir also makes order of events non-deterministic, especially if changes are occurring at the same time the initial scan is being done.
does the handleCreateEvent function verify the created path still exists?
The goroutine in traversePluginDir is used to place items on a non-buffered channel, w.fsWatcher.Events. It must be present to avoid deadlocks.
handleCreateEvent does not explicitly re-check the existence of the dir right before the driver is delegated to handle the registration:

if !fi.IsDir() {
With the changes in this PR, the fsnotify CREATE/DELETE operations should not occur out of sync. If a dir existed right before Registration, it should not go away until a DELETE event comes right after it.
the fsnotify CREATE/DELETE operations should not occur out of sync

That is the expectation from the underlying library. Just to be safe, it would be worth catching any potential issue around delete-after-create operations and logging it.
With the changes in this PR, the fsnotify CREATE/DELETE operations should not occur out of sync. If a dir existed right before Registration, it should not go away until a DELETE event comes right after it.
nothing guarantees that, correct?
- traversePluginDir is called
- traversePluginDir adds a filesystem watch to a particular directory

kubernetes/pkg/kubelet/util/pluginwatcher/plugin_watcher.go, lines 205 to 212 in fad2399:

case mode.IsDir():
	if w.containsBlacklistedDir(path) {
		return filepath.SkipDir
	}
	if err := w.fsWatcher.Add(path); err != nil {
		return fmt.Errorf("failed to watch %s, err: %v", path, err)
	}

- traversePluginDir descends into the directory and adds synthetic Create events for the files found in the dir via a goroutine

kubernetes/pkg/kubelet/util/pluginwatcher/plugin_watcher.go, lines 215 to 221 in fad2399:

go func() {
	defer w.wg.Done()
	w.fsWatcher.Events <- fsnotify.Event{
		Name: path,
		Op:   fsnotify.Create,
	}
}()
because there is an active watcher registered in step 2 that can immediately start delivering events, and the synthetic create events in step 3 are delivered via a goroutine, they can interleave with actual observed filesystem events in non-deterministic ways. For example, if a driver is being deleted while this runs:
- filepath.Walk lists dir, sees driver socket file
- driver socket file is deleted
- filesystem delete event is observed and queued
- filepath.Walk queues synthetic create event
because the delete is handled first, then the synthetic create event, could we end up with a registered driver that doesn't actually exist any more?
To guarantee consistency, shouldn't you be enqueuing existing sockets first, prior to accepting fsnotify events from the kernel? Imagine a situation where a socket was identified by path traversal, but before traversePluginDir can enqueue a create event, the socket gets deleted and that delete event gets processed before the creation event.
To guarantee consistency, shouldn't you be enqueuing existing sockets first prior to accepting fsnotify events from the kernel?
that's what I expected as well. registering watches on the dirs, processing the contents and enqueuing synthetic create events, then starting processing of the events from the registered watchers
Imagine a situation where a socket was identified by path traversal, but before traversePluginDir can enqueue a create event, the socket get's deleted and that event gets processed first before the creation event?
yes, that's exactly the scenario described above
@liggitt @vishh I think the code already has the HappensBefore and HappensAfter serial properties that you are alluding to. The code seems to have 1-to-1 parity between observed filesystem events and synthetic queued events. To explain, let's further unpack the scenario that Jordan presented earlier:

filepath.Walk lists dir, sees driver socket file

So let's look at some scenarios.

Scenario 1
- filepath.Walk hits dir, adds watcher for it, continues
- filesystem creates socket file (from driver)
  a. But the socket file is immediately deleted from the filesystem
  b. According to fswatcher, if the file is removed before it is observed, the Walk will generate an error
- filepath.Walk receives error because watcher is missing, returns

Scenario 2
- filepath.Walk hits dir, adds watcher for it, continues
- filesystem creates socket file (from driver)
- filepath.Walk receives socket file info (prior to deletion)
  a. queues synthetic create event
- socket file is deleted from filesystem
- filepath.Walk receives deleted file info (after deletion)
  a. enqueues the observed delete event

Because there is sequentiality between the creation and immediate deletion of the socket files, the observed events will have before/after relationships. Therefore, the synthetic events that are generated and placed on the internal event queue (fsWatcher.Events) should also inherit that sequentiality.
the scenario described in #71440 (comment) is still racy
The synthetic create events traversePluginDir sends to the channel (for socket files encountered by filepath.Walk) are independent of (and can race with) create/delete events sent to the channel by the registered filesystem watchers.
That said, if a synthetic create event was processed after an actual observed delete event, handleCreateEvent does verify the created path still exists:
kubernetes/pkg/kubelet/util/pluginwatcher/plugin_watcher.go, lines 240 to 243 in fad2399:

fi, err := os.Stat(event.Name)
if err != nil {
	return fmt.Errorf("stat file %s failed: %v", event.Name, err)
}
I still think the raciness should be fixed in a follow up because it makes the event flow hard to understand and relies on compensation in the event handler, but in the context of this PR, it is not unsafe.
Handling serially should be fine for current and future plugin types. I don't expect plugins to have a lot of churn or volume (# of plugins). |
agree, I'm not concerned by making this bit serial. I am concerned that there is still a race between the synthetic create events from traversePluginDir interleaved with immediately-observed filesystem events from the registered handlers |
/lgtm |
Required known issue in release notes: |
@saad-ali you wrote:
What do you mean by pod restart? Shouldn't it be |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vishh, vladimirvivien. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing |
@marpaia @dstrebel for adding the Known issue to Release notes
@vishh @liggitt did you want the pending raciness during initialization fixed in a follow-up PR for 1.13, or are we OK mentioning this as a known issue and proceeding to address it in 1.14? Speaking to @saad-ali, it looks like this is an edge case that was caught in code review rather than an actual repro during manual testing. This probably reduces the chances of a real user hitting it, but I would like to know your final evaluation of its severity. |
Known issue and post-1.13.0 follow up is fine |
The known issue is not actually a raciness issue, will coordinate on the known issue text |
As another AI, I feel we need to have some unit testing in place for the
plugin watcher component to simulate races. It should be possible to create
and delete temporary files locally to simulate real world scenarios.
…On Tue, Nov 27, 2018 at 4:23 PM k8s-ci-robot wrote:

@vladimirvivien: The following test failed, say /retest to rerun them all:
pull-kubernetes-integration at commit e86bdc7 — /test pull-kubernetes-integration

Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
|
/test pull-kubernetes-integration |
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes a race condition that can cause CSI annotations added to Node API object to suddenly disappear after a driver-registrar pod has been deleted and recreated by replica controller (see #71424 for detail).
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #71424
Does this PR introduce a user-facing change?: NONE