
Fix overlay vxlan races #2146

Merged · 2 commits merged into moby:master on Jul 11, 2018

Conversation

@ctelfer (Contributor) commented May 9, 2018

This PR takes the patch offered in #1800 and attempts to address the issues found in #1765. It removes a race condition documented in the issue, where "re-once"-ing the creation of an overlay sandbox can race with incoming join requests and cause a leak of, or collision on, the vxlan interface. The PR addresses this by removing the "re-once" pattern and replacing it with a traditional mutex-plus-boolean initializer pattern. This also allows removing some restore error handling, by having the sandbox join/leave perform proper cleanup on error. More information is available in the commit logs of the PR.
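The mutex-plus-boolean initializer pattern described here can be sketched as follows. This is an illustrative reduction, not libnetwork's actual code; the names (`sandbox`, `initSandbox`, `destroySandbox`, `refCnt`) are hypothetical. The key difference from `sync.Once` is that `leave()` can reset the flag, so the sandbox can be torn down when the last endpoint leaves and safely re-created on the next join.

```go
// Sketch of a mutex+boolean initializer pattern (illustrative names only).
package main

import (
	"fmt"
	"sync"
)

type sandbox struct {
	sync.Mutex
	initialized bool
	initErr     error
	refCnt      int
}

// join initializes the sandbox on first use and counts references.
// Unlike sync.Once, the initialized flag can later be cleared,
// allowing teardown and re-creation without racy pointer swaps.
func (s *sandbox) join() error {
	s.Lock()
	defer s.Unlock()
	if !s.initialized {
		s.initErr = s.initSandbox()
		if s.initErr != nil {
			return s.initErr
		}
		s.initialized = true
	}
	s.refCnt++
	return nil
}

// leave drops a reference and tears the sandbox down when the
// last joiner leaves, resetting the flag for the next join.
func (s *sandbox) leave() {
	s.Lock()
	defer s.Unlock()
	s.refCnt--
	if s.refCnt == 0 {
		s.destroySandbox()
		s.initialized = false
	}
}

func (s *sandbox) initSandbox() error { return nil } // placeholder
func (s *sandbox) destroySandbox()    {}             // placeholder

func main() {
	s := &sandbox{}
	s.join()
	s.join()
	s.leave()
	fmt.Println(s.initialized, s.refCnt) // one joiner remains
}
```

Because the flag and the reference count are read-modify-written under one lock, a leave racing with a join cannot leave the sandbox half-created or half-destroyed, which is the failure mode the "re-once" pattern allowed.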

@selansen (Collaborator) commented May 9, 2018

CI Failed

@codecov-io commented May 9, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@eb6b2a5).
The diff coverage is 0%.


@@            Coverage Diff            @@
##             master    #2146   +/-   ##
=========================================
  Coverage          ?   40.48%           
=========================================
  Files             ?      139           
  Lines             ?    22491           
  Branches          ?        0           
=========================================
  Hits              ?     9105           
  Misses            ?    12048           
  Partials          ?     1338
Impacted Files Coverage Δ
drivers/overlay/ov_network.go 2.78% <0%> (ø)
drivers/overlay/peerdb.go 9.88% <0%> (ø)
osl/interface_linux.go 60.15% <0%> (ø)
drivers/overlay/joinleave.go 0% <0%> (ø)
drivers/overlay/overlay.go 28.63% <0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update eb6b2a5...26212fe.

n.joinCnt++
}
func (n *network) joinSandbox(s *subnet, restore bool, incJoinCnt bool) error {
networkOnce.Do(networkOnceInit)


If 2 joinSandbox calls are happening, the first one will trigger networkOnceInit, while the second one continues forward and acquires the lock. Is it OK for the second one to go ahead before the first networkOnceInit has completed?
Looking at the code, I think this creates a potential race with n.initSandbox(restore), which, if restore is false, will actually try to delete the VNI.
Am I missing something?

ctelfer (author) replied:

I checked on this very early on... once.Do() holds a mutex while doing the initialization. If two goroutines execute the same once.Do() concurrently, the second one will block until the first one completes.

})
return s.initErr
subnetErr := s.initErr
if !s.sboxInit {


isn't this dead code?
if the check at line 304 was false, the flag will be set to true
if it was already true then we won't enter here anyway

ctelfer (author) replied:

This is s.sboxInit not n.sboxInit ... So no, this is different. :)


Oh...

// failure of vxlan device creation if the vni is assigned to some other
// network.
if deleteErr := deleteInterface(vxlanName); deleteErr != nil {
logrus.Warnf("could not delete vxlan interface, %s, error %v, after config error, %v\n", vxlanName, deleteErr, err)


you can remove the trailing \n

ctelfer (author) replied:

ah, yes, will do.

s.sboxInit = true
}
}
if subnetErr != nil {
selansen (Collaborator):

We declare subnetErr inside if !s.sboxInit { and then use it again outside the if scope. Can we declare it outside the if and give it function-level scope?

ctelfer (author) replied:

Good catch. That's definitely a bug. It is not supposed to be redeclared within the 'if'.

@ctelfer (Contributor, Author) commented May 10, 2018

Pushed an update with a much simplified locking scheme. There is now only a single lock for each network and all its subnets. I originally put in the second lock because of a deadlock (which I had hit during testing before the initial PR) described in the commit log. This new version breaks the deadlock by posting the notification to initialize the peerDB on a join using a fresh goroutine. This prevents the 'join' from deadlocking waiting on the channel of the peerDB goroutine while the peerdb goroutine is waiting for the 'join' to release the network lock.

This version also addresses the comments by @fcrisciani (newlines) and @selansen (redeclaration of subnetErr).

@@ -296,29 +294,39 @@ func (d *driver) RevokeExternalConnectivity(nid, eid string) error {
return nil
}

func (n *network) incEndpointCount() {
func (n *network) joinSandbox(s *subnet, restore bool, incJoinCount bool) error {
networkOnce.Do(networkOnceInit)


can you keep this comment:

	// If there is a race between two go routines here only one will win
	// the other will wait.

s.initErr = n.initSubnetSandbox(s, restore)
})
return s.initErr
subnetErr := s.initErr


we can simply avoid using this subnetErr, and just leverage the s.initErr


also for line 321

ctelfer (author) replied:

Actually, there is a very specific reason why it isn't. We want to return:

  • s.initErr iff it was previously set
  • otherwise we want to return the result of n.initSubnetSandbox(s, restore) regardless of whether that value gets stored in s.initErr. So, on line 321 we do not want to just return s.initErr. It may not have been set if there was a failure but we are not in a restore case.

These are the semantics from #1800 and I've preserved them. The logic basically implies that if an error is recoverable, return it so it is flagged appropriately, but do not make it persistent. But if the error is not recoverable (the restore case), make it persistent.


do we expect that the failure will be different on multiple retries?

@@ -470,7 +477,7 @@ func (n *network) generateVxlanName(s *subnet) string {
id = n.id[:5]
}

return "vx-" + fmt.Sprintf("%06x", n.vxlanID(s)) + "-" + id
return "vx-" + fmt.Sprintf("%06x", s.vni) + "-" + id


considering that we are touching this line, this will be more efficient:
fmt.Sprintf("vx-%06x-%v", s.vni, id)
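The two formatting styles being compared can be checked side by side; both produce the same interface name, and the single `Sprintf` call avoids the intermediate string concatenations. The example VNI and suffix below are illustrative.

```go
// Compare concatenating a Sprintf result vs. a single Sprintf call.
package main

import "fmt"

// nameConcat mirrors the original style: "vx-" + Sprintf + "-" + id.
func nameConcat(vni uint32, id string) string {
	return "vx-" + fmt.Sprintf("%06x", vni) + "-" + id
}

// nameSingle is the suggested single-call style.
func nameSingle(vni uint32, id string) string {
	return fmt.Sprintf("vx-%06x-%v", vni, id)
}

func main() {
	fmt.Println(nameConcat(4123, "nmh1s")) // vx-00101b-nmh1s
	fmt.Println(nameSingle(4123, "nmh1s")) // vx-00101b-nmh1s
}
```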

@@ -483,7 +490,7 @@ func (n *network) generateBridgeName(s *subnet) string {
}

func (n *network) getBridgeNamePrefix(s *subnet) string {
return "ov-" + fmt.Sprintf("%06x", n.vxlanID(s))
return "ov-" + fmt.Sprintf("%06x", s.vni)


same here just with a single Sprintf


n.setVxlanID(s, 0)
for _, vni := range vnis {
n.driver.vxlanIdm.Release(uint64(vni))
Collaborator:

why do we need an extra for loop instead of the previous logic, where it was all done in one place? What are we trying to achieve by doing this?

ctelfer (author) replied:

Fair question. The vxlanIdm.Release() operation can do locking and write to store while other overlay networks are coming and going as well. So, I was thinking that this prevents holding the network lock through all those other potentially blocking operations. (i.e. collect the work list using the lock and then go through the process of releasing them) But since this function only gets called while the driver is holding the driver lock, that may be a superfluous optimization.
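The pattern being described — snapshot the work list while holding the lock, then do the potentially blocking releases after unlocking — can be sketched like this. Names (`releaseVnis`, the `release` callback standing in for `vxlanIdm.Release`) are illustrative, not the PR's actual code.

```go
// Collect work under the lock, perform blocking work outside it.
package main

import (
	"fmt"
	"sync"
)

type network struct {
	mu   sync.Mutex
	vnis []uint32
}

// releaseVnis copies and clears the VNI list under the lock, then
// invokes the (potentially blocking) release callback without holding
// it, so other networks can make progress concurrently.
func (n *network) releaseVnis(release func(uint32)) []uint32 {
	n.mu.Lock()
	vnis := n.vnis
	n.vnis = nil
	n.mu.Unlock()

	for _, vni := range vnis {
		release(vni) // may lock other structures or write to a store
	}
	return vnis
}

func main() {
	n := &network{vnis: []uint32{4102, 4113}}
	released := n.releaseVnis(func(v uint32) {})
	fmt.Println(len(released), len(n.vnis))
}
```

As the reply notes, when the caller already serializes these calls under a coarser lock, the extra copy is a superfluous optimization, but it is also harmless.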

}

return vnis, nil
}

func (n *network) obtainVxlanID(s *subnet) error {
//return if the subnet already has a vxlan id assigned
if s.vni != 0 {
if n.vxlanID(s) != 0 {
Collaborator:

In other places you replaced n.vxlanID(s) with s.vni or n.vni. Here we are using n.vxlanID(s) instead of directly accessing s.vni. Trying to understand why this change is required?

ctelfer (author) replied:

Also fair. The revamped locking holds the network lock through the network.join() and network.leave() operations. As such, calling n.vxlanID(s) within those methods (or their sub-functions) would be a double lock. That's the reason for the changes above.

However, the changes above also remove the subnet locks and place subnet locking all within the domain of the network's lock. As such, network methods need to acquire the network lock before accessing any of the subnet fields. This driver calls obtainVxlanID() outside of the context of network.join() in several places. Hence accessing the subnet members requires doing so through the network lock.

The lack of locking in obtainVxlanID() was, in principle, a pre-existing race condition that simply hadn't been addressed.
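The double-lock hazard being explained comes from the fact that Go's `sync.Mutex` is not reentrant: a method that already holds the lock must read the field directly rather than call the locked accessor. A minimal sketch with illustrative names (`vxlanID`, `join`):

```go
// Locked accessor for external callers vs. direct field access
// inside methods that already hold the lock (illustrative names).
package main

import (
	"fmt"
	"sync"
)

type subnet struct{ vni uint32 }

type network struct {
	sync.Mutex
}

// vxlanID is safe for callers outside the network's methods:
// it acquires the network lock around the read.
func (n *network) vxlanID(s *subnet) uint32 {
	n.Lock()
	defer n.Unlock()
	return s.vni
}

// join already holds the lock, so it must access s.vni directly;
// calling n.vxlanID(s) here would block forever on the lock that
// this goroutine already holds (sync.Mutex is not reentrant).
func (n *network) join(s *subnet) uint32 {
	n.Lock()
	defer n.Unlock()
	return s.vni // NOT n.vxlanID(s)
}

func main() {
	s := &subnet{vni: 4123}
	n := &network{}
	fmt.Println(n.vxlanID(s), n.join(s))
}
```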

@@ -1059,7 +1067,7 @@ func (n *network) obtainVxlanID(s *subnet) error {
return fmt.Errorf("getting network %q from datastore failed %v", n.id, err)
}

if s.vni == 0 {
if n.vxlanID(s) == 0 {
Collaborator:

same as above

return s.vni
}

func (n *network) setVxlanID(s *subnet, vni uint32) {
n.Lock()
defer n.Unlock()


the defer is just slower; in this case, for a three-line method, I would keep it as before

@fcrisciani left a comment:

few minor comments, rest looks good

@selansen (Collaborator):

LGTM

@ctelfer (Contributor, Author) commented May 16, 2018

Addressed the code-comment, defer, and Sprintf comments above and have re-pushed.

@fcrisciani left a comment:

LGTM

@fcrisciani

@ctelfer are we good with merging this?

Santhosh Manohar and others added 2 commits July 10, 2018 10:33
Signed-off-by: Santhosh Manohar <santhosh@docker.com>
Signed-off-by: Chris Telfer <ctelfer@docker.com>
Instead of using "sync.Once" to determine whether to initialize a
network sandbox or subnet sandbox, we use a traditional mutex +
initialization boolean.  This is because the initialization state isn't
truly a once-and-done condition.  Rather, libnetwork destroys network
and subnet sandboxes when the last endpoint leaves them.  The use of
sync.Once in this kind of scenario therefore requires re-initializing
the Once, which is impossible.  So the approach that libnetwork
currently takes is to use a pointer to a Once and redirect that pointer
to a new Once on reset.  This leads to nasty race conditions.

In addition to refactoring the locking, this patch merges the functions
joinSandbox(), and joinSubnetSandbox(). This makes the code both cleaner
and it also holds the network and subnet locks through the series of
read-modify-writes avoiding further potential races.  This does reduce
the potential parallelism which could be applied should there be many
joins coming in on many different subnets in the same overlay network.
However, this should be an extremely minor performance hit for a very
obscure case.

One important pattern in this commit is that it is crucial to avoid
sending peerDB messages while holding a driver or network lock.  The
changes herein defer such (asynchronous) notifications until after
release of such locks.  This prevents deadlocks where the peerDB
blocks acquiring said locks while the network method blocks trying
to send to the peerDB's channel.

Signed-off-by: Chris Telfer <ctelfer@docker.com>
@ctelfer (Contributor, Author) commented Jul 10, 2018

Rebased to head of master, updated a few comments and re-tested a few times to make sure this looks good. Everything has been clean.

I'll admit that I'm still a bit nervous after #2143 / #2180. But from a rational standpoint, what this PR does is remove the unsafe re-onceing from the overlay code, make joins and leaves to the overlay sandboxes atomic, and merge in #1800 error handling code. It should be safe to merge.

@fcrisciani fcrisciani merged commit 206ed7f into moby:master Jul 11, 2018
@ctelfer ctelfer deleted the fix-overlay-vxlan-races branch July 11, 2018 18:52
@m4r10k commented Aug 30, 2018

@fcrisciani do you know which Docker version will include this merge (i.e., vendors this libnetwork hash)? We are running 18.03 and are regularly experiencing the issue that the vxlan interface already exists.

28686: vx-00101b-nmh1s: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4123 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
29077: vx-001017-sxegy: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4119 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
23209: vx-001006-tlzfj: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4102 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
29632: vx-001011-ox0iz: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4113 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
29658: vx-001028-jjw67: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4136 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
28399: vx-00103a-i7t6y: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4154 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
28911: vx-001020-ups39: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4128 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
