Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send events before adding watchers in traversePluginDir #75110

Merged
merged 2 commits into from
Apr 30, 2019

Conversation

bertinatto
Copy link
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR changes the function devicemanager.traversePluginDir in order to send events before adding watchers to directories.

Which issue(s) this PR fixes:

Fixes #75097

Does this PR introduce a user-facing change?:

NONE

/sig storage
/sig node
/assign @liggitt

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 7, 2019
@bertinatto bertinatto force-pushed the fix_race_watcher branch 5 times, most recently from 5bb7b99 to 4a6ee18 Compare March 11, 2019 15:53
@bertinatto
Copy link
Member Author

/retest

@bertinatto bertinatto marked this pull request as ready for review March 12, 2019 07:33
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. area/apiserver area/test and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 2, 2019
@bertinatto
Copy link
Member Author

bertinatto commented Apr 17, 2019

CC @kubernetes/sig-node-pr-reviews

@liggitt
Copy link
Member

liggitt commented Apr 17, 2019

/skip

@liggitt
Copy link
Member

liggitt commented Apr 17, 2019

all the required CI jobs are green on this PR, ready for node/storage review

@bertinatto
Copy link
Member Author

/remove-label do-not-merge/work-in-progress

@liggitt liggitt removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 17, 2019
Op: fsnotify.Create,
}
//TODO: Handle errors by taking corrective measures
if err := w.handleCreateEvent(event); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the implementation of handleCreateEvent, I think it's more direct to just do:

if !w.containsBlacklistedDir(path) {
  return w.traversePluginDir(path)
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I follow, do you mean inside handleCreateEvent? If so, then it wouldn't call handlePluginRegistration... or am I missing somehting?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean "handle create event" doesn't really seem like what this is doing, unless I'm not following the control flow. If you look at what handle create event is doing, the snippet I commented above is all that gets hit in that function, so I'm suggesting to just inline that piece here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, OK... I understand now...

This part of the code creates "synthetic create events" if we find a socket file. Then we "handle" this event right here (as opposed to sending it to a processing goroutine like it was before).

Since we're dealing with a socket file, we don't really want to traversePluginDir() it, but instead we want to register the plugin that's listening to this unix domain socket file (i.e., handlePluginRegistration()).

However, we also need to make sure that we ignore black-listed directories (like you pointed in the snippet above) and files prefixed with a "."; this would do it:

			if !w.containsBlacklistedDir(path) {
				if !strings.HasPrefix(path, ".") {
					// TODO: Handle errors by taking corrective measures
					if err := w.handlePluginRegistration(path); err != nil {
						klog.Errorf("error %v when handling create event for file: %s", err, path)
					}
				}
			}

IMO handleCreateEvent() looks like would be a better fit though (since we're almost re-implementing it in-line).

Does this make sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry, I completely misread what handleCreateEvent was doing.

select {
case <-c:
return
case <-time.After(2 * time.Second):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
case <-time.After(2 * time.Second):
case <-time.After(wait.ForeverTestTimeout):

You can't wait timeout before that in a unit test, as it will flake in the CI environment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. I took the chance and fixed the other time.After calls from this file.

defer close(c)
require.True(t, waitForEvent(t, exampleEventValidate, hdlr.EventChan(p.pluginName)))
require.False(t, waitForPluginRegistrationStatus(t, p.registrationStatus))
}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternatively, you can just do this serially at the end of the test and rely on the whole test timing out for the failure condition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're processing plugins serially now, newWatcherWithHandler below would block forever trying to write to the event/registration channels. This goroutine prevents that by receiving info from these channels beforehand.

@bertinatto
Copy link
Member Author

/skip

@bertinatto
Copy link
Member Author

/retest

Currently, the method `pluginwatcher.traversePluginDir` descends into
a directory adding filesystem watchers and creating synthetic `create`
events when it finds sockets files. However, a race condition might
happen when a recently-added watcher observes a `delete` event in a
socket file before `pluginwatcher.traversePlugindir` itself notices
this file.

This patch changes this behavior by registering watchers on
directories, enqueueing and processing `create` events from sockets
found, and only then processing the events from the registered watchers.
@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Apr 23, 2019

@bertinatto: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-godeps b7c93d2e234b1c8d995c992a9c8a0518c492a4ee link /test pull-kubernetes-godeps
pull-kubernetes-cross b7c93d2e234b1c8d995c992a9c8a0518c492a4ee link /test pull-kubernetes-cross
pull-kubernetes-e2e-gce-device-plugin-gpu f564557 link /test pull-kubernetes-e2e-gce-device-plugin-gpu

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@bertinatto
Copy link
Member Author

/skip

@bertinatto
Copy link
Member Author

/retest

@sjenning
Copy link
Contributor

/assign @derekwaynecarr

@tallclair
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 29, 2019
@tallclair
Copy link
Member

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bertinatto, tallclair

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 29, 2019
@k8s-ci-robot k8s-ci-robot merged commit 1608578 into kubernetes:master Apr 30, 2019
Op: fsnotify.Create,
}
//TODO: Handle errors by taking corrective measures
if err := w.handleCreateEvent(event); err != nil {
Copy link
Contributor

@tedyu tedyu Apr 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the err be returned, in the same way error is returned for mode.IsDir() case ?

I created PR #77244

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race condition in plugin registration
9 participants