
Fix overlay vxlan races #2146

Merged · 2 commits merged into moby:master on Jul 11, 2018

Conversation

@ctelfer (Contributor) commented May 9, 2018

This PR takes the patch offered in #1800 and attempts to address the issues found in #1765. It removes a race condition documented in the issue, where "re-once"-ing the creation of an overlay sandbox can race with incoming join requests and cause a leak of, or collision on, the vxlan interface. The PR addresses this by removing the "re-once" pattern and replacing it with a traditional mutex-plus-boolean initializer pattern. This also allows removing some restore error handling, by having the sandbox join/leave perform proper cleanup on error. More information is available in the commit logs of the PR.
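The mutex-plus-boolean initializer pattern described here can be sketched as follows. This is an illustrative reduction, not libnetwork's actual code; the names (`sandbox`, `initSandbox`, `destroySandbox`, `refCnt`) are hypothetical. The key difference from `sync.Once` is that `leave()` can reset the flag, so the sandbox can be torn down when the last endpoint leaves and safely re-created on the next join.

```go
// Sketch of a mutex+boolean initializer pattern (illustrative names only).
package main

import (
	"fmt"
	"sync"
)

type sandbox struct {
	sync.Mutex
	initialized bool
	initErr     error
	refCnt      int
}

// join initializes the sandbox on first use and counts references.
// Unlike sync.Once, the initialized flag can later be cleared,
// allowing teardown and re-creation without racy pointer swaps.
func (s *sandbox) join() error {
	s.Lock()
	defer s.Unlock()
	if !s.initialized {
		s.initErr = s.initSandbox()
		if s.initErr != nil {
			return s.initErr
		}
		s.initialized = true
	}
	s.refCnt++
	return nil
}

// leave drops a reference and tears the sandbox down when the
// last joiner leaves, resetting the flag for the next join.
func (s *sandbox) leave() {
	s.Lock()
	defer s.Unlock()
	s.refCnt--
	if s.refCnt == 0 {
		s.destroySandbox()
		s.initialized = false
	}
}

func (s *sandbox) initSandbox() error { return nil } // placeholder
func (s *sandbox) destroySandbox()    {}             // placeholder

func main() {
	s := &sandbox{}
	s.join()
	s.join()
	s.leave()
	fmt.Println(s.initialized, s.refCnt) // one joiner remains
}
```

Because the flag and the reference count are read-modify-written under one lock, a leave racing with a join cannot leave the sandbox half-created or half-destroyed, which is the failure mode the "re-once" pattern allowed.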

@selansen (Collaborator) commented May 9, 2018

CI Failed

@codecov-io commented May 9, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@eb6b2a5).
The diff coverage is 0%.


@@            Coverage Diff            @@
##             master    #2146   +/-   ##
=========================================
  Coverage          ?   40.48%           
=========================================
  Files             ?      139           
  Lines             ?    22491           
  Branches          ?        0           
=========================================
  Hits              ?     9105           
  Misses            ?    12048           
  Partials          ?     1338
Impacted Files Coverage Δ
drivers/overlay/ov_network.go 2.78% <0%> (ø)
drivers/overlay/peerdb.go 9.88% <0%> (ø)
osl/interface_linux.go 60.15% <0%> (ø)
drivers/overlay/joinleave.go 0% <0%> (ø)
drivers/overlay/overlay.go 28.63% <0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update eb6b2a5...26212fe.

n.joinCnt++
}
func (n *network) joinSandbox(s *subnet, restore bool, incJoinCnt bool) error {
networkOnce.Do(networkOnceInit)


If 2 joinSandbox calls are happening, the first one will trigger networkOnceInit, while the second one continues forward and acquires the lock. Is it OK for the second one to go ahead before the first networkOnceInit has completed?
Looking at the code, I think this creates a potential race with n.initSandbox(restore), which, if restore is false, will actually try to delete the VNI.
Am I missing something?

ctelfer (author) replied:

I checked on this very early on... once.Do() holds a mutex while doing the initialization. If two goroutines execute the same once.Do() concurrently, the second one will block until the first one completes.

})
return s.initErr
subnetErr := s.initErr
if !s.sboxInit {


isn't this dead code?
if the check at line 304 was false, the flag will be set to true
if it was already true then we won't enter here anyway

ctelfer (author) replied:

This is s.sboxInit not n.sboxInit ... So no, this is different. :)


Oh...

// failure of vxlan device creation if the vni is assigned to some other
// network.
if deleteErr := deleteInterface(vxlanName); deleteErr != nil {
logrus.Warnf("could not delete vxlan interface, %s, error %v, after config error, %v\n", vxlanName, deleteErr, err)


you can remove the trailing \n

ctelfer (author) replied:

ah, yes, will do.

s.sboxInit = true
}
}
if subnetErr != nil {
selansen (Collaborator):

We declare subnetErr inside if !s.sboxInit { and then use it again outside the if scope. Can we declare it outside the if and give it function-level scope?

ctelfer (author) replied:

Good catch. That's definitely a bug. It is not supposed to be redeclared within the 'if'.

@ctelfer (Contributor, Author) commented May 10, 2018

Pushed an update with a much simplified locking scheme. There is now only a single lock for each network and all its subnets. I originally put in the second lock because of a deadlock (which I had hit during testing before the initial PR) described in the commit log. This new version breaks the deadlock by posting the notification to initialize the peerDB on a join using a fresh goroutine. This prevents the 'join' from deadlocking waiting on the channel of the peerDB goroutine while the peerdb goroutine is waiting for the 'join' to release the network lock.

This version also addresses the comments by @fcrisciani (newlines) and @selansen (redeclaration of subnetErr).

@@ -296,29 +294,39 @@ func (d *driver) RevokeExternalConnectivity(nid, eid string) error {
return nil
}

func (n *network) incEndpointCount() {
func (n *network) joinSandbox(s *subnet, restore bool, incJoinCount bool) error {
networkOnce.Do(networkOnceInit)


can you keep this comment:

	// If there is a race between two go routines here only one will win
	// the other will wait.

s.initErr = n.initSubnetSandbox(s, restore)
})
return s.initErr
subnetErr := s.initErr


we can simply avoid using this subnetErr, and just leverage the s.initErr


also for line 321

ctelfer (author) replied:

Actually, there is a very specific reason why it isn't. We want to return:

  • s.initErr iff it was previously set
  • otherwise we want to return the result of n.initSubnetSandbox(s, restore) regardless of whether that value gets stored in s.initErr. So, on line 321 we do not want to just return s.initErr. It may not have been set if there was a failure but we are not in a restore case.

These are the semantics from #1800 and I've preserved them. The logic basically implies that if an error is recoverable, return it so it is flagged appropriately, but do not make it persistent. But if the error is not recoverable (the restore case), make it persistent.


do we expect that the failure will be different on multiple retries?

@@ -470,7 +477,7 @@ func (n *network) generateVxlanName(s *subnet) string {
id = n.id[:5]
}

return "vx-" + fmt.Sprintf("%06x", n.vxlanID(s)) + "-" + id
return "vx-" + fmt.Sprintf("%06x", s.vni) + "-" + id


considering that we are touching this line, this will be more efficient:
fmt.Sprintf("vx-%06x-%v", s.vni, id)
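The two formatting styles being compared can be checked side by side; both produce the same interface name, and the single `Sprintf` call avoids the intermediate string concatenations. The example VNI and suffix below are illustrative.

```go
// Compare concatenating a Sprintf result vs. a single Sprintf call.
package main

import "fmt"

// nameConcat mirrors the original style: "vx-" + Sprintf + "-" + id.
func nameConcat(vni uint32, id string) string {
	return "vx-" + fmt.Sprintf("%06x", vni) + "-" + id
}

// nameSingle is the suggested single-call style.
func nameSingle(vni uint32, id string) string {
	return fmt.Sprintf("vx-%06x-%v", vni, id)
}

func main() {
	fmt.Println(nameConcat(4123, "nmh1s")) // vx-00101b-nmh1s
	fmt.Println(nameSingle(4123, "nmh1s")) // vx-00101b-nmh1s
}
```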

@@ -483,7 +490,7 @@ func (n *network) generateBridgeName(s *subnet) string {
}

func (n *network) getBridgeNamePrefix(s *subnet) string {
return "ov-" + fmt.Sprintf("%06x", n.vxlanID(s))
return "ov-" + fmt.Sprintf("%06x", s.vni)


same here just with a single Sprintf


n.setVxlanID(s, 0)
for _, vni := range vnis {
n.driver.vxlanIdm.Release(uint64(vni))
Collaborator:

why do we need an extra for loop instead of the previous logic, where it was all done in one place? What are we trying to achieve by doing this?

ctelfer (author) replied:

Fair question. The vxlanIdm.Release() operation can do locking and write to store while other overlay networks are coming and going as well. So, I was thinking that this prevents holding the network lock through all those other potentially blocking operations. (i.e. collect the work list using the lock and then go through the process of releasing them) But since this function only gets called while the driver is holding the driver lock, that may be a superfluous optimization.
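The pattern being described — snapshot the work list while holding the lock, then do the potentially blocking releases after unlocking — can be sketched like this. Names (`releaseVnis`, the `release` callback standing in for `vxlanIdm.Release`) are illustrative, not the PR's actual code.

```go
// Collect work under the lock, perform blocking work outside it.
package main

import (
	"fmt"
	"sync"
)

type network struct {
	mu   sync.Mutex
	vnis []uint32
}

// releaseVnis copies and clears the VNI list under the lock, then
// invokes the (potentially blocking) release callback without holding
// it, so other networks can make progress concurrently.
func (n *network) releaseVnis(release func(uint32)) []uint32 {
	n.mu.Lock()
	vnis := n.vnis
	n.vnis = nil
	n.mu.Unlock()

	for _, vni := range vnis {
		release(vni) // may lock other structures or write to a store
	}
	return vnis
}

func main() {
	n := &network{vnis: []uint32{4102, 4113}}
	released := n.releaseVnis(func(v uint32) {})
	fmt.Println(len(released), len(n.vnis))
}
```

As the reply notes, when the caller already serializes these calls under a coarser lock, the extra copy is a superfluous optimization, but it is also harmless.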

}

return vnis, nil
}

func (n *network) obtainVxlanID(s *subnet) error {
//return if the subnet already has a vxlan id assigned
if s.vni != 0 {
if n.vxlanID(s) != 0 {
Collaborator:

In other places you replaced n.vxlanID(s) with s.vni or n.vni. Here we are using n.vxlanID(s) instead of directly accessing s.vni. Trying to understand why this change is required?

ctelfer (author) replied:

Also fair. The revamped locking holds the network lock through the network.join() and network.leave() operations. As such, calling n.vxlanID(s) within those methods (or their sub-functions) would be a double lock. That's the reason for the changes above.

However, the changes above also remove the subnet locks and place subnet locking all within the domain of the network's lock. As such, network methods need to acquire the network lock before accessing any of the subnet fields. This driver calls obtainVxlanID() outside of the context of network.join() in several places. Hence accessing the subnet members requires doing so through the network lock.

The lack of locking in obtainVxlanID() was, in principle, a pre-existing race condition that simply hadn't been addressed.
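The double-lock hazard being explained comes from the fact that Go's `sync.Mutex` is not reentrant: a method that already holds the lock must read the field directly rather than call the locked accessor. A minimal sketch with illustrative names (`vxlanID`, `join`):

```go
// Locked accessor for external callers vs. direct field access
// inside methods that already hold the lock (illustrative names).
package main

import (
	"fmt"
	"sync"
)

type subnet struct{ vni uint32 }

type network struct {
	sync.Mutex
}

// vxlanID is safe for callers outside the network's methods:
// it acquires the network lock around the read.
func (n *network) vxlanID(s *subnet) uint32 {
	n.Lock()
	defer n.Unlock()
	return s.vni
}

// join already holds the lock, so it must access s.vni directly;
// calling n.vxlanID(s) here would block forever on the lock that
// this goroutine already holds (sync.Mutex is not reentrant).
func (n *network) join(s *subnet) uint32 {
	n.Lock()
	defer n.Unlock()
	return s.vni // NOT n.vxlanID(s)
}

func main() {
	s := &subnet{vni: 4123}
	n := &network{}
	fmt.Println(n.vxlanID(s), n.join(s))
}
```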

@@ -1059,7 +1067,7 @@ func (n *network) obtainVxlanID(s *subnet) error {
return fmt.Errorf("getting network %q from datastore failed %v", n.id, err)
}

if s.vni == 0 {
if n.vxlanID(s) == 0 {
Collaborator:

same as above

return s.vni
}

func (n *network) setVxlanID(s *subnet, vni uint32) {
n.Lock()
defer n.Unlock()


the defer is just slower; in this case, for a three-line method, I would keep it as before

@fcrisciani left a comment:

few minor comments, rest looks good

@selansen (Collaborator):

LGTM

@ctelfer (Contributor, Author) commented May 16, 2018

Addressed the code-comment, defer, and Sprintf comments above and have re-pushed.

@fcrisciani left a comment:

LGTM

@fcrisciani

@ctelfer are we good with merging this?

Santhosh Manohar and others added 2 commits July 10, 2018 10:33
Signed-off-by: Santhosh Manohar <santhosh@docker.com>
Signed-off-by: Chris Telfer <ctelfer@docker.com>
Instead of using "sync.Once" to determine whether to initialize a
network sandbox or subnet sandbox, we use a traditional mutex +
initialization boolean.  This is because the initialization state isn't
truly a once-and-done condition.  Rather, libnetwork destroys network
and subnet sandboxes when the last endpoint leaves them.  The use of
sync.Once in this kind of scenario therefore requires re-initializing
the Once, which is impossible.  So the approach that libnetwork
currently takes is to use a pointer to a Once and redirect that pointer
to a new Once on reset.  This leads to nasty race conditions.

In addition to refactoring the locking, this patch merges the functions
joinSandbox(), and joinSubnetSandbox(). This makes the code both cleaner
and it also holds the network and subnet locks through the series of
read-modify-writes avoiding further potential races.  This does reduce
the potential parallelism which could be applied should there be many
joins coming in on many different subnets in the same overlay network.
However, this should be an extremely minor performance hit for a very
obscure case.

One important pattern in this commit is that it is crucial to avoid
sending peerDB messages while holding a driver or network lock.  The
changes herein defer such (asynchronous) notifications until after
release of such locks.  This prevents deadlocks where the peerDB
blocks acquiring said locks while the network method blocks trying
to send to the peerDB's channel.

Signed-off-by: Chris Telfer <ctelfer@docker.com>
@ctelfer (Contributor, Author) commented Jul 10, 2018

Rebased to head of master, updated a few comments and re-tested a few times to make sure this looks good. Everything has been clean.

I'll admit that I'm still a bit nervous after #2143 / #2180. But from a rational standpoint, what this PR does is remove the unsafe re-onceing from the overlay code, make joins and leaves to the overlay sandboxes atomic, and merge in #1800 error handling code. It should be safe to merge.

@fcrisciani fcrisciani merged commit 206ed7f into moby:master Jul 11, 2018
@ctelfer ctelfer deleted the fix-overlay-vxlan-races branch July 11, 2018 18:52
@m4r10k commented Aug 30, 2018

@fcrisciani do you know which Docker version will include this merge (i.e., vendors this libnetwork hash)? We are running 18.03 and are regularly experiencing the issue that the vxlan interface already exists.

28686: vx-00101b-nmh1s: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4123 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
29077: vx-001017-sxegy: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4119 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
23209: vx-001006-tlzfj: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4102 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
29632: vx-001011-ox0iz: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4113 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
29658: vx-001028-jjw67: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4136 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
28399: vx-00103a-i7t6y: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4154 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
28911: vx-001020-ups39: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4128 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
