
Fix for race condition caused by concurrent fsnotify (CREATE and REMOVE) in kubelet/plugin_watcher.go #71440

Merged
merged 1 commit into from Nov 28, 2018

Conversation

vladimirvivien
Member

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR fixes a race condition that can cause CSI annotations added to the Node API object to suddenly disappear after a driver-registrar pod has been deleted and recreated by a replica controller (see #71424 for details).

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #71424

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 27, 2018
@vladimirvivien
Member Author

/milestone v1.13
/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. labels Nov 27, 2018
@k8s-ci-robot
Contributor

@vladimirvivien: You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to set the milestone.

In response to this:

/milestone v1.13
/priority critical-urgent

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 27, 2018
@vladimirvivien
Member Author

vladimirvivien commented Nov 27, 2018

This code change fixes a race condition that can surface when fsnotify CREATE and DELETE events are handled concurrently. This can cause CSI driver operations that rely on CREATE and DELETE to be triggered out of order.

Investigation

The fix was tested primarily using the kubernetes/hack/local-up-cluster.sh script. With each script run, the following steps were performed:

Deploy:

kubectl create serviceaccount csi-node-sa
kubectl create serviceaccount csi-attacher
kubectl apply -f test/e2e/testing-manifests/storage-csi/hostpath/hostpath/

Clean:

sudo mount | grep kubelet
sudo umount /var/lib/kubelet/<path>
sudo rm -rf /var/lib/kubelet
rm /tmp/kube*

Prior to fix
Prior to the fix, I ran the code 10 times using the local-up-cluster script. The results were consistent with the issue identified in #71424, where the CSI node annotation would disappear, and not be updated, after a pod delete (and pod redeploy):

 1. local-up-cluster, <deploy>, annotation present, delete all pods, annotation missing, shutdown, <clean>
 2. local-up-cluster, <deploy>, annotation present, delete all pods, annotation missing, shutdown, <clean>
 3. local-up-cluster, <deploy>, annotation present, delete all pods, annotation missing, shutdown, <clean>
 4. local-up-cluster, <deploy>, annotation present, delete all pods, annotation present, shutdown, <clean>
 5. local-up-cluster, <deploy>, annotation present, delete all pods, annotation present then missing, shutdown, <clean>
 6. local-up-cluster, <deploy>, annotation present, delete all pods, annotation missing, shutdown, <clean>
 7. local-up-cluster, <deploy>, annotation present, delete all pods, annotation present then missing, shutdown, <clean>
 8. local-up-cluster, <deploy>, annotation present, delete all pods, annotation present then missing, shutdown, <clean>
 9. local-up-cluster, <deploy>, annotation present (after 10+ secs), delete all pods, annotation present, shutdown, <clean>
10. local-up-cluster, <deploy>, annotation present, delete all pods, annotation present then missing, shutdown, <clean>

Fix

The fix was to force the fsnotify CREATE and DELETE events to be handled serially (a minimal sketch of the serialized handling follows the run log below). After the code was fixed, the CSI node annotation consistently reappeared on the node after the loss (and subsequent redeployment) of a driver-registrar sidecar:

 1. local-up-cluster, <deploy>, annotation present (10x), delete all pods, annotation present (10x), shutdown, <clean>
 2. local-up-cluster, <deploy>, annotation present (10x), delete all pods, annotation present (10x), shutdown, <clean>
 3. local-up-cluster, <deploy>, annotation present (10x), delete all pods, annotation present (10x), shutdown, <clean>
 4. local-up-cluster, <deploy>, annotation present (10x), delete all pods, annotation present (10x), shutdown, <clean>
 5. local-up-cluster, <deploy>, annotation present (10x), delete all pods, annotation present (10x), shutdown, <clean>
 6. local-up-cluster, <deploy>, annotation present (10x), delete all pods, annotation present (10x), shutdown, <clean>
 7. local-up-cluster, <deploy>, annotation present (10x), delete all pods, annotation present (10x), shutdown, <clean>
 8. local-up-cluster, <deploy>, annotation present (10x), delete all pods, annotation present (10x), shutdown, <clean>
 9. local-up-cluster, <deploy>, annotation present (10x), delete all pods, annotation present (10x), shutdown, <clean>
10. local-up-cluster, <deploy>, annotation present (10x), delete all pods, annotation present (10x)
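
To illustrate the shape of the change, here is a minimal sketch, assuming the standard fsnotify library API; handleCreate, handleRemove, and the watched path are hypothetical stand-ins for the plugin watcher's real handlers, not the kubelet code itself:

// Minimal sketch (not the actual kubelet code): consume fsnotify events in a
// single loop and handle each one inline, so a CREATE for a socket can never
// be processed after the REMOVE that follows it.
package main

import (
    "log"

    "github.com/fsnotify/fsnotify"
)

// handleCreate and handleRemove are hypothetical stand-ins for the plugin
// watcher's real registration/deregistration handlers.
func handleCreate(path string) { log.Printf("register plugin socket %s", path) }
func handleRemove(path string) { log.Printf("deregister plugin socket %s", path) }

func watch(dir string) error {
    w, err := fsnotify.NewWatcher()
    if err != nil {
        return err
    }
    defer w.Close()
    if err := w.Add(dir); err != nil {
        return err
    }
    for {
        select {
        case event, ok := <-w.Events:
            if !ok {
                return nil
            }
            // Handle the event inline. The pre-fix code wrapped this body in
            // `go func() { ... }()`, which let CREATE and REMOVE handling for
            // the same socket interleave.
            switch {
            case event.Op&fsnotify.Create == fsnotify.Create:
                handleCreate(event.Name)
            case event.Op&fsnotify.Remove == fsnotify.Remove:
                handleRemove(event.Name)
            }
        case err, ok := <-w.Errors:
            if !ok {
                return nil
            }
            log.Printf("fsnotify error: %v", err)
        }
    }
}

func main() {
    // Hypothetical demo directory, not the kubelet plugin path.
    if err := watch("/tmp/plugins-demo"); err != nil {
        log.Fatal(err)
    }
}

The point of the sketch is only the control flow: events are consumed and handled one at a time from the watcher's channel, mirroring the go func() to func() change in this PR.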

@vladimirvivien
Member Author

/sig storage

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Nov 27, 2018
@vladimirvivien vladimirvivien changed the title Fixes race condition caused by concurrent fsnotify (CREATE and REMOVE) Fixe for race condition caused by concurrent fsnotify (CREATE and REMOVE) in kubelet/plugin_watcher.go Nov 27, 2018
@msau42
Member

msau42 commented Nov 27, 2018

IIUC, this means that the handlers for each module (i.e., CSI or device plugin) will run serially. We should double-check that the CSI handlers won't block for too long, and that this is an acceptable limitation for the device plugin.

@AishSundar
Contributor

/test pull-kubernetes-integration

@vladimirvivien
Member Author

@msau42 Yes, that is a good point, since both modules share the same process queue. This is a safe solution for the CSI handler: during Handle.RegisterPlugin, CSI's longest execution paths are calls to the API server and to the backing storage driver itself, both of which have timeouts.

A possible solution in the near future is to give each module type (CSI, device plugin) its own process queue.
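
For illustration only, a rough sketch of that per-type-queue idea; the names and types below are hypothetical and not a proposal for plugin_watcher's actual API:

// Hypothetical sketch (not a concrete proposal for plugin_watcher): give each
// plugin type its own serial queue, so a slow CSI registration cannot delay
// device-plugin events and vice versa.
package main

import (
    "fmt"
    "sync"
)

type pluginEvent struct {
    pluginType string // e.g. "CSIPlugin" or "DevicePlugin"
    socketPath string
}

type dispatcher struct {
    wg     sync.WaitGroup
    queues map[string]chan pluginEvent
}

func newDispatcher(types ...string) *dispatcher {
    d := &dispatcher{queues: make(map[string]chan pluginEvent)}
    for _, t := range types {
        ch := make(chan pluginEvent, 16)
        d.queues[t] = ch
        d.wg.Add(1)
        // One worker per plugin type: events for the same type are still
        // handled serially, but the types no longer share a single queue.
        go func(t string, ch chan pluginEvent) {
            defer d.wg.Done()
            for e := range ch {
                fmt.Printf("handling %s event for %s\n", t, e.socketPath)
            }
        }(t, ch)
    }
    return d
}

func (d *dispatcher) dispatch(e pluginEvent) {
    if ch, ok := d.queues[e.pluginType]; ok {
        ch <- e
    }
}

func (d *dispatcher) stop() {
    for _, ch := range d.queues {
        close(ch)
    }
    d.wg.Wait()
}

func main() {
    d := newDispatcher("CSIPlugin", "DevicePlugin")
    d.dispatch(pluginEvent{pluginType: "CSIPlugin", socketPath: "/var/lib/kubelet/plugins/example/csi.sock"})
    d.dispatch(pluginEvent{pluginType: "DevicePlugin", socketPath: "/var/lib/kubelet/device-plugins/example.sock"})
    d.stop()
}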

@AishSundar
Contributor

/milestone v1.13

@k8s-ci-robot k8s-ci-robot added this to the v1.13 milestone Nov 27, 2018
@AishSundar
Contributor

/test pull-kubernetes-e2e-kops-aws

@saad-ali
Member

Seems like a reasonable fix for now. Thanks @vladimirvivien.

/assign @vishh
/assign @liggitt

@krmayankk

Is there a way to add tests for these kinds of scenarios in e2e?

@bertinatto
Member

bertinatto commented Nov 27, 2018

I tested this with the EBS plugin and it solved the problem that I had. Thanks, @vladimirvivien.

@vladimirvivien
Member Author

@krmayankk this was a race issue (not functionality) that would have been a bit hard to detect.

@vladimirvivien vladimirvivien changed the title Fixe for race condition caused by concurrent fsnotify (CREATE and REMOVE) in kubelet/plugin_watcher.go Fix for race condition caused by concurrent fsnotify (CREATE and REMOVE) in kubelet/plugin_watcher.go Nov 27, 2018
@msau42
Member

msau42 commented Nov 27, 2018

#70439 adds an e2e test for this

@msau42
Member

msau42 commented Nov 27, 2018

Oops #70578

@@ -111,7 +111,7 @@ func (w *Watcher) Start() error {
 //TODO: Handle errors by taking corrective measures

 w.wg.Add(1)
-go func() {
+func() {
Member

the goroutine inside traversePluginDir also makes order of events non-deterministic, especially if changes are occurring at the same time the initial scan is being done.

Member

does the handleCreateEvent function verify the created path still exists?

Member Author

The goroutine in traversePluginDir is used to place items on an unbuffered channel, w.fsWatcher.Events. It must be present to avoid deadlocks.

Member Author

handleCreateEvent does not explicitly re-check the existence of the dir right before the driver is delegated to handle the registration.

With the changes in this PR, the fsnotify CREATE/DELETE operations should not occur out of sync. If a dir existed right before registration, it should not go away until a DELETE event comes right after it.

Contributor

the fsnotify CREATE/DELETE operations should not occur out of sync

That is the expectation from the underlying library. Just to be safe, it would be worth catching any potential issue around delete-after-create operations and logging it.

Member

With the changes in this PR, the fsnotify CREATE/DELETE operations should not occur out of sync. If a dir existed right before Registration, it should not go away until a DELETE event comes right after it.

nothing guarantees that, correct?

  1. traversePluginDir is called

  2. traversePluginDir adds a filesystem watch to a particular directory

    case mode.IsDir():
        if w.containsBlacklistedDir(path) {
            return filepath.SkipDir
        }
        if err := w.fsWatcher.Add(path); err != nil {
            return fmt.Errorf("failed to watch %s, err: %v", path, err)
        }

  3. traversePluginDir descends into the directory and adds synthetic Create events for the files found in the dir via a goroutine

    go func() {
        defer w.wg.Done()
        w.fsWatcher.Events <- fsnotify.Event{
            Name: path,
            Op:   fsnotify.Create,
        }
    }()

because there is an active watcher registered in step 2 that can immediately start delivering events, and the synthetic create events in step 3 are delivered via a goroutine, they can interleave with actual observed filesystem events in non-deterministic ways. For example, if a driver is being deleted while this runs:

  1. filepath.Walk lists dir, sees driver socket file
  2. driver socket file is deleted
  3. filesystem delete event is observed and queued
  4. filepath.Walk queues synthetic create event

Because the delete is handled first, and then the synthetic create event, could we end up with a registered driver that doesn't actually exist any more?

Contributor

To guarantee consistency, shouldn't you be enqueuing existing sockets first, prior to accepting fsnotify events from the kernel? Imagine a situation where a socket was identified by path traversal, but before traversePluginDir can enqueue a create event, the socket gets deleted and that delete event gets processed before the creation event.

Member

@liggitt Nov 27, 2018

To guarantee consistency, shouldn't you be enqueuing existing sockets first prior to accepting fsnotify events from the kernel?

That's what I expected as well: register watches on the dirs, process the contents and enqueue synthetic create events, then start processing the events from the registered watchers.

Imagine a situation where a socket was identified by path traversal, but before traversePluginDir can enqueue a create event, the socket get's deleted and that event gets processed first before the creation event?

yes, that's exactly the scenario described above
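
For illustration, a rough sketch of that ordering, assuming the fsnotify and filepath.Walk APIs and a flat directory of sockets; start and handle are hypothetical names, not the current plugin_watcher code:

// Rough sketch (not the current plugin_watcher code) of the ordering discussed
// above: register the watch and collect synthetic Create events for existing
// sockets before draining any live events from the watcher, so the initial
// scan cannot interleave with filesystem events. Assumes a flat directory.
package main

import (
    "log"
    "os"
    "path/filepath"

    "github.com/fsnotify/fsnotify"
)

func start(dir string, handle func(fsnotify.Event)) error {
    w, err := fsnotify.NewWatcher()
    if err != nil {
        return err
    }
    if err := w.Add(dir); err != nil {
        return err
    }

    // Step 1: walk the directory and collect synthetic Create events locally.
    var backlog []fsnotify.Event
    err = filepath.Walk(dir, func(path string, info os.FileInfo, walkErr error) error {
        if walkErr != nil {
            return walkErr
        }
        if !info.IsDir() {
            backlog = append(backlog, fsnotify.Event{Name: path, Op: fsnotify.Create})
        }
        return nil
    })
    if err != nil {
        return err
    }

    // Step 2: process the backlog of synthetic Create events first...
    for _, e := range backlog {
        handle(e)
    }

    // Step 3: ...then start draining live events from the registered watch, so
    // any delete observed during the scan is handled after the synthetic
    // create for the same socket, never before it.
    go func() {
        for e := range w.Events {
            handle(e)
        }
    }()
    return nil
}

func main() {
    if err := start("/tmp/plugins-demo", func(e fsnotify.Event) { log.Println(e.Op, e.Name) }); err != nil {
        log.Fatal(err)
    }
    select {} // block forever for the demo
}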

Member Author

@liggitt @vishh I think the code already has the happens-before/happens-after serial properties that you are alluding to. The code seems to have 1-to-1 parity between observed filesystem events and synthetic queued events. To explain, let's further unpack the scenario that Jordan presented earlier:

  1. filepath.Walk lists dir, sees driver socket file

So let's look at some scenarios.

Scenario 1

  1. filepath.Walk hits the dir, adds a watcher for it, continues
  2. filesystem creates the socket file (from the driver)
    a. But the socket file is immediately deleted from the filesystem
    b. According to fswatcher, if the file is removed before it is observed, the Walk will generate an error
  3. filepath.Walk receives an error because the watcher is missing, and returns

Scenario 2

  1. filepath.Walk hits the dir, adds a watcher for it, continues
  2. filesystem creates the socket file (from the driver)
  3. filepath.Walk receives the socket file info (prior to deletion)
    a. queues a synthetic create event
  4. Socket file is deleted from the filesystem
  5. filepath.Walk receives the deleted file info (after deletion)
    a. enqueues the observed delete event

Because there is an ordering between the creation and immediate deletion of the socket files, the observed events will have before/after relationships. Therefore, the synthetic events that are generated and placed on the internal event queue (fsWatcher.Events) should also inherit that ordering.

Member

the scenario described in #71440 (comment) is still racy

The synthetic create events traversePluginDir sends to the channel (for socket files encountered by filepath.Walk) are independent of (and can race with) create/delete events sent to the channel by the registered filesystem watchers.

That said, if a synthetic create event was processed after an actual observed delete event, handleCreateEvent does verify the created path still exists:

fi, err := os.Stat(event.Name)
if err != nil {
    return fmt.Errorf("stat file %s failed: %v", event.Name, err)
}

I still think the raciness should be fixed in a follow up because it makes the event flow hard to understand and relies on compensation in the event handler, but in the context of this PR, it is not unsafe.

@vishh
Contributor

vishh commented Nov 27, 2018

Handling serially should be fine for current and future plugin types. I don't expect plugins to have a lot of churn or volume (# of plugins).

@liggitt
Member

liggitt commented Nov 27, 2018

Handling serially should be fine for current and future plugin types. I don't expect plugins to have a lot of churn or volume (# of plugins).

Agree, I'm not concerned about making this bit serial. I am concerned that there is still a race between the synthetic create events from traversePluginDir and the immediately-observed filesystem events from the registered handlers.

@liggitt
Member

liggitt commented Nov 27, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 27, 2018
@saad-ali
Member

saad-ali commented Nov 27, 2018

Required known issue in release notes: If kubelet plugin registration for a driver fails, kubelet will not retry. The driver must delete and recreate the driver registration socket in order to force kubelet to attempt registration again. Restarting only the driver container may not be sufficient to trigger recreation of the socket, instead a pod restart may be required.

@vishh
Contributor

vishh commented Nov 27, 2018

@saad-ali you wrote:

instead a pod restart may be required.

What do you mean by pod restart? Shouldn't it be deleting the pod instead?

@vishh
Contributor

vishh commented Nov 27, 2018

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vishh, vladimirvivien

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 27, 2018
@AishSundar
Contributor

@marpaia @dstrebel for adding the Known issue to Release notes

I still think the raciness should be fixed in a follow up because it makes the event flow hard to understand and relies on compensation in the event handler, but in the context of this PR, it is not unsafe.

@vishh @liggitt did you want the remaining raciness during initialization fixed in a follow-up PR for 1.13, or are we OK mentioning it as a known issue and addressing it in 1.14? Speaking with @saad-ali, it looks like this is an edge case that was caught in code review rather than an actual repro during manual testing. That probably reduces the chances of a real user hitting it, but I would like to know your final evaluation of its severity.

@liggitt
Member

liggitt commented Nov 27, 2018

Known issue and post-1.13.0 follow up is fine

@liggitt
Member

liggitt commented Nov 28, 2018

The known issue is not actually a raciness issue; I will coordinate on the known issue text.

@vishh
Contributor

vishh commented Nov 28, 2018 via email

@AishSundar
Contributor

/test pull-kubernetes-integration

Successfully merging this pull request may close these issues.

CSI: kubelet removes NodeID annotation when a driver is restarted