Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix data race #5029

Merged
merged 6 commits into from
Jan 8, 2019
Merged

Fix data race #5029

merged 6 commits into from
Jan 8, 2019

Conversation

dawxy
Copy link
Contributor

@dawxy dawxy commented Nov 30, 2018

Fix #4357

@banks
Copy link
Member

banks commented Dec 3, 2018

Thanks @dawxy adding to our milestone to be looked at before next release. A quick glance seems simple but as with all concurrency things will need a bit of careful though to reason about whether it could ever be called in a way that will deadlock etc.

@banks banks added this to the 1.4.1 milestone Dec 3, 2018
@dawxy
Copy link
Contributor Author

dawxy commented Dec 4, 2018

@banks
I checked it again and found no deadlocks.

Copy link
Member

@mkeeler mkeeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks correct to me and seems necessary.

makeQueryOptionsWithContext is used in several of the watchFuncFactory functions which look to only ever be called from here:

consul/watch/plan.go

Lines 63 to 71 in 57b5a1d

for !p.shouldStop() {
// Invoke the handler
blockParamVal, result, err := p.Watcher(p)
// Check if we should terminate since the function
// could have blocked for a while
if p.shouldStop() {
break
}

We can't hold the stop lock outside of the factory func that uses the makeQueryOptionsWithContext because the watcher may block for too long and we dont want the other routines attempting to stop the watch to block for that long. The only solution is then for locking where this PR put it or in all the functions that call it (the watchFuncFactory functions).

I see one other problem here which is that we could be setting a cancel function after the watch has already been stopped. Ideally while holding the lock we would check if the plan is stopped and if so then return nil without setting the cancel func. However all the other factory funcs would need to be altered to abort early when nil is returned. When this doesn't happen its possible for a watch to linger for many minutes while waiting for new data even after it was supposed to be stopped.

I will think on whether the second abort early bit is worthwhile. @banks any thoughts?

@banks
Copy link
Member

banks commented Jan 7, 2019

I see one other problem here which is that we could be setting a cancel function after the watch has already been stopped.

Is that a problem? Calling cancel on an already-cancelled Context is safe.

@mkeeler
Copy link
Member

mkeeler commented Jan 7, 2019

@banks Its possible that comment was misleading. The actual problem is that we could have go routines sitting around for a long time after they should be stopped.

If the watch gets stopped before setting the cancel func but after the first stop check here:

consul/watch/plan.go

Lines 63 to 71 in 57b5a1d

for !p.shouldStop() {
// Invoke the handler
blockParamVal, result, err := p.Watcher(p)
// Check if we should terminate since the function
// could have blocked for a while
if p.shouldStop() {
break
}

Then the watch still gets executed but nothing can ever attempt to cancel it. It isn't a huge problem but we could for example detect while holding the lock that the watch is stopped and execute the new cancel func. That would prevent the watch from hanging around too long.

@banks
Copy link
Member

banks commented Jan 7, 2019

Gotcha. Agree we should avoid leaking the goroutines here. Great catch @mkeeler - exactly the kind of thing I meant about being careful with concurrency stuff!

@dawxy are you keen to implement that? It's trivial and we can do it but don't want to hijack your contribution without asking!

@dawxy
Copy link
Contributor Author

dawxy commented Jan 8, 2019

@mkeeler
I think it's still necessary to stop lock protection to read the modified cancelFunc in cancelFunc and makeQueryOptionsWithContext() when calling plan.Stop().

@dawxy
Copy link
Contributor Author

dawxy commented Jan 8, 2019

@mkeeler PTAL

Copy link
Member

@mkeeler mkeeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. This should fix the concurrency issues as well as ensure that we don't let the new watch hang around for a long time when it should be stopped.

@mkeeler mkeeler merged commit 238d430 into hashicorp:master Jan 8, 2019
@dawxy dawxy deleted the fix_data_race branch January 9, 2019 03:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants