Skip to content

[18.03] [manager/dispatcher] Replace call to isRunning() to isRunningLocked() in dispatcher Heartbeat()#2702

Merged
anshulpundir merged 1 commit intomoby:bump_v18.03from
thaJeztah:18.03-backport-dispatcher
Jul 26, 2018
Merged

[18.03] [manager/dispatcher] Replace call to isRunning() to isRunningLocked() in dispatcher Heartbeat()#2702
anshulpundir merged 1 commit intomoby:bump_v18.03from
thaJeztah:18.03-backport-dispatcher

Conversation

@thaJeztah
Copy link
Member

@thaJeztah thaJeztah commented Jul 10, 2018

backport of #2664 for 18.03

git checkout -b 18.03-backport-dispatcher upstream/bump_v18.03 
git cherry-pick -s -S -x caee4da2afedf487d98d4e66da3de5721ba415fa

cherry-pick was clean; no conflicts

We noticed repeated CI failures pointing to data races because of unlocked access to the dispatcher context in Heartbeat(). Golang also does not provide any guarantees around read-only operations on objects which are otherwise locked.

This will likely have a performance impact, which we will evaluate and some of the dispatcher code might need to be re-written accordingly. But correctness is first.

… in dispatcher Heartbeat()

Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
(cherry picked from commit caee4da)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@thaJeztah
Copy link
Member Author

thaJeztah commented Jul 10, 2018

Interesting; when backporting this to the 17.06 branch, I got a conflict that indicates this may not be needed; suspecting there has been a patch that was applied to the 17.06 branch, but never back ported to master;

diff --cc manager/dispatcher/dispatcher.go
index f201f5fc,407ced0b..00000000
--- a/manager/dispatcher/dispatcher.go
+++ b/manager/dispatcher/dispatcher.go
@@@ -1095,12 -1145,9 +1095,18 @@@ func (d *Dispatcher) Heartbeat(ctx cont
        d.rpcRW.RLock()
        defer d.rpcRW.RUnlock()
  
++<<<<<<< HEAD
 +      // Its OK to call isRunning() here instead of isRunningLocked()
 +      // because of the rpcRW readlock above.
 +      // TODO(anshul) other uses of isRunningLocked() can probably
 +      // also be removed.
 +      if !d.isRunning() {
 +              return nil, grpc.Errorf(codes.Aborted, "dispatcher is stopped")
++=======
+       // TODO(anshul) Explore if its possible to check context here without locking.
+       if _, err := d.isRunningLocked(); err != nil {
+               return nil, status.Errorf(codes.Aborted, "dispatcher is stopped")
++>>>>>>> caee4da2... [manager/dispatcher] Replace call to isRunning() to isRunningLocked() in dispatcher Heartbeat()
        }
  
        nodeInfo, err := ca.RemoteNode(ctx)

oh right, it's 17.06 that's missing that patch; not sure why it conflicts; ok looks to be #2519 (master), and cherry-picked into 17.06 through #2524

digging further

@thaJeztah
Copy link
Member Author

Also getting the same failures as in #2700 - may be an actual issue after all?

ok  	github.com/docker/swarmkit/manager	7.229s	coverage: 82.0% of statements
time="2018-07-10T12:38:59Z" level=error msg="task allocation failure" error="failed to retrieve network testID3 while allocating task testTaskID3" 
time="2018-07-10T12:38:59Z" level=error msg="Failed allocation for network testID5" error="failed while allocating driver state for network testID5: could not obtain vxlan id for pool 10.0.4.0/24: requested bit is already allocated" 
time="2018-07-10T12:38:59Z" level=error msg="task allocation failure" error="network testID5 attached to task testTaskID6 not allocated yet" 
time="2018-07-10T12:39:00Z" level=error msg="Failed allocation for service testServiceID4" error="requested bit is already allocated" 
time="2018-07-10T12:39:00Z" level=error msg="task allocation failure" error="service testServiceID4 to which task testTaskID7 belongs has pending allocations" 
--- FAIL: TestNodeAllocator (0.05s)
	allocator_test.go:1434: timed out before watchNode found expected node state
FAIL
coverage: 72.4% of statements
FAIL	github.com/docker/swarmkit/manager/allocator	3.775s

@thaJeztah
Copy link
Member Author

thaJeztah commented Jul 10, 2018

So for the 17.06 branch, #2519 (master) was;

And on the 18.03 branch;

@thaJeztah
Copy link
Member Author

So current status:

17.06:

	// Its OK to call isRunning() here instead of isRunningLocked()
	// because of the rpcRW readlock above.
	// TODO(anshul) other uses of isRunningLocked() can probably
	// also be removed.
	if !d.isRunning() {
		return nil, grpc.Errorf(codes.Aborted, "dispatcher is stopped")
	}

18.03:

	// Its OK to call isRunning() here instead of isRunningLocked()
	// because of the rpcRW readlock above.
	// TODO(anshul) other uses of isRunningLocked() can probably
	// also be removed.
	if !d.isRunning() {
		return nil, status.Errorf(codes.Aborted, "dispatcher is stopped")
	}

master (and 18.06):

	// TODO(anshul) Explore if its possible to check context here without locking.
	if _, err := d.isRunningLocked(); err != nil {
		return nil, status.Errorf(codes.Aborted, "dispatcher is stopped")
	}

@anshulpundir I can use some input here which of the two is the right one; if it's what's in 17.06/18.03, then there's something to revert/update on master

@thaJeztah thaJeztah changed the title [18.03] [manager/dispatcher] Replace call to isRunning() to isRunningLocked() in dispatcher Heartbeat() [input needed] [18.03] [manager/dispatcher] Replace call to isRunning() to isRunningLocked() in dispatcher Heartbeat() Jul 10, 2018
@andrewhsu andrewhsu requested a review from anshulpundir July 25, 2018 19:09
@cyli
Copy link
Contributor

cyli commented Jul 25, 2018

@thaJeztah I believe there was a change made originally to isRunning(), but that caused a bunch of data races, so that particular change got reverted (not an actual revert though) back on master. So the correct one is the call to isRunningLocked().

@codecov
Copy link

codecov bot commented Jul 26, 2018

Codecov Report

Merging #2702 into bump_v18.03 will decrease coverage by 0.35%.
The diff coverage is 0%.

@@               Coverage Diff               @@
##           bump_v18.03    #2702      +/-   ##
===============================================
- Coverage        61.82%   61.47%   -0.36%     
===============================================
  Files              134      134              
  Lines            21820    21820              
===============================================
- Hits             13491    13413      -78     
- Misses            6888     6970      +82     
+ Partials          1441     1437       -4

Copy link
Contributor

@cyli cyli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I think this can be merged, but leaving it just in case there is still some doubt about 17.06. Please merge at will @thaJeztah, though.

@anshulpundir anshulpundir merged commit 5a37194 into moby:bump_v18.03 Jul 26, 2018
@thaJeztah thaJeztah deleted the 18.03-backport-dispatcher branch July 26, 2018 20:25
@thaJeztah thaJeztah changed the title [input needed] [18.03] [manager/dispatcher] Replace call to isRunning() to isRunningLocked() in dispatcher Heartbeat() [18.03] [manager/dispatcher] Replace call to isRunning() to isRunningLocked() in dispatcher Heartbeat() Jul 30, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants