
Block task starting until node attachments are ready #37604

Merged 1 commit into moby:master on Aug 21, 2018

Conversation

@dperny (Contributor) commented Aug 7, 2018

- What I did
Blocks the execution of tasks during the Prepare phase until there exists an IP address for every overlay network in use by the task. This prevents a task from starting before the NetworkAttachment containing the IP address has been sent down to the node.

This is needed because moby/swarmkit#2725 only creates network attachments on an as-needed basis, meaning a node is not guaranteed to have received a network attachment by the time the task requiring it has arrived.

- How I did it
Added a long-polling loop that checks for the existence of an IP address for each overlay network in use by the task. Long-polling is not optimal, but a channel-triggered notification method would require a more substantial rearchitecting.

Also added locking to the network attachment store, to ensure that it is safe for concurrent access.
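
Not the PR's code verbatim, but a minimal sketch of the shape of that long-polling wait and the locked attachment store (the package name, the attachmentStore type, and the 100 ms poll interval are all illustrative assumptions):

        package container // hypothetical package, for illustration only

        import (
                "context"
                "sync"
                "time"
        )

        // attachmentStore is a stand-in for the daemon's network attachment
        // store; the mutex mirrors the locking added in this PR so the store
        // is safe for concurrent access.
        type attachmentStore struct {
                mu  sync.Mutex
                ips map[string]string // network ID -> load balancer IP
        }

        func (s *attachmentStore) GetIPForNetwork(id string) (string, bool) {
                s.mu.Lock()
                defer s.mu.Unlock()
                ip, ok := s.ips[id]
                return ip, ok
        }

        // waitNodeAttachments long-polls until every overlay network used by
        // the task has an IP in the store, or the context is cancelled.
        func waitNodeAttachments(ctx context.Context, store *attachmentStore, networkIDs []string) error {
                ticker := time.NewTicker(100 * time.Millisecond)
                defer ticker.Stop()
                for {
                        ready := true
                        for _, id := range networkIDs {
                                if _, ok := store.GetIPForNetwork(id); !ok {
                                        ready = false
                                        break
                                }
                        }
                        if ready {
                                return nil
                        }
                        select {
                        case <-ctx.Done():
                                return ctx.Err()
                        case <-ticker.C:
                        }
                }
        }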

- How to verify it
Still needs tests. Until moby/swarmkit#2725 is vendored, there is currently no case in which a network attachment won't be ready.

@ctelfer (Contributor) commented Aug 7, 2018

Bringing this in from moby/swarmkit#2725.

It looks like swarmkit's error library already provides a nice mechanism for reporting temporary errors and telling swarm to retry later: docker/swarmkit/agent/exec/errors.go:MakeTemporary. It also looks like the path between the swarmkit controller.Do() function and the docker engine's daemon/network.go:daemon.createNetwork() call is mostly clear. That is, the various calls in between just flag the error and return it without interpretation.

So, if I'm reading this right, it might be possible to address the race condition by changing the code of createNetwork() so it invokes MakeTemporary() on the error it would return when it can't find the network attachment. I.e., on line 350 of daemon/network.go:

        if agent && driver == "overlay" {
                nodeIP, exists := daemon.GetAttachmentStore().GetIPForNetwork(id)
                if !exists {
                        return nil, fmt.Errorf("Failed to find a load balancer IP to use for network: %v", id)
                }

becomes something like

        // Top of file
        import swarmErrors "github.com/docker/swarmkit/agent/exec"
...
        if agent && driver == "overlay" {
                nodeIP, exists := daemon.GetAttachmentStore().GetIPForNetwork(id)
                if !exists {
                        return nil, swarmErrors.MakeTemporary(fmt.Errorf("Failed to find a load balancer IP to use for network: %v", id))
                }

Does this seem plausible? I do note that the MakeTemporary() function isn't invoked anywhere in the moby codebase, meaning this is a somewhat untested code path. But from what I see in the swarmkit controller.Do() code:

        fatal := func(err error) (*api.TaskStatus, error) {
...
                // Tests for specific error types...
...
                status.Err = err.Error() // still reported on temporary
                if IsTemporary(err) {
                        return retry()
                }

                // only at this point do we consider the error fatal to the task.
                log.G(ctx).WithError(err).Error("fatal task error")
...
         }
...
        switch status.State {
        case api.TaskStatePreparing:
                if err := ctlr.Prepare(ctx); err != nil && err != ErrTaskPrepared {
                        return fatal(err)
                }

it should do what we want here.

@ctelfer (Contributor) commented Aug 7, 2018

BTW, one thing I am not clear on w/ the above suggestion is how rapidly it will cause the daemon to poll and retry creating the task while waiting for the node attachment. I'm naively hoping there is some kind of back-off mechanism built into the retry logic, etc.

if err := r.adapter.nodeAttachmentsReady(ctx); err != nil {
	return err
}

Contributor commented:

So, please pardon my ignorance. Three questions:

  1. Is the Prepare() function invoked in its own goroutine, so that blocking here won't unduly hold up other operations?
  2. Alternatively, will blocking this goroutine hold up some client (e.g. the swarm manager) for an excessive amount of time, because it is waiting for the task to be prepared?
  3. Is the context's timeout sufficiently long to allow node network attachments to propagate to the engine?

// TODO(dperny): unsure if this will work with other network
// drivers, but i also don't think other network drivers use the
// node attachment IP address.
if attachment.Network.DriverState.Name == "overlay" {
Contributor commented:

Is there any reason we can't do all this waiting in libnetwork (possibly even inside the overlay driver itself)?

Contributor replied:

Waiting in libnetwork would not be possible because libnetwork doesn't have any visibility into the attachment store. It only creates the networks with the addresses it is given. It does not know whether they are valid or not.

@ctelfer (Contributor) commented Aug 14, 2018

Based on further consideration and discussion with @dperny, I agree that the MakeTemporary solution is potentially easy for someone to trip over down the road (e.g. by changing error handling in the middle of the network allocation call chain in moby). Since this error is already hard to test for, understanding what went wrong in such a case would be quite difficult. Therefore, an explicit check for the race condition (i.e. what this PR provides) seems more sound.

@dperny (Contributor, Author) commented Aug 14, 2018

Thanks, Chris, put much better than I probably would have this morning.

@codecov bot commented Aug 15, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@9916827).
The diff coverage is 45.16%.

@@            Coverage Diff            @@
##             master   #37604   +/-   ##
=========================================
  Coverage          ?   36.14%           
=========================================
  Files             ?      610           
  Lines             ?    45084           
  Branches          ?        0           
=========================================
  Hits              ?    16294           
  Misses            ?    26544           
  Partials          ?     2246

@dperny (Contributor, Author) commented Aug 15, 2018

I've updated the PR by adding a timeout to the waitNodeAttachments call. The comment above it explains why.
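
A minimal sketch of what bounding the wait might look like (the constant name and its 30-second value here are illustrative assumptions, not the PR's actual figures):

        // Bound the attachment wait so a task can't sit in the Prepare phase
        // forever if the manager never sends the attachment down. The value
        // here is a placeholder.
        const nodeAttachmentReadyTimeout = 30 * time.Second

        waitCtx, cancel := context.WithTimeout(ctx, nodeAttachmentReadyTimeout)
        defer cancel()

        if err := r.adapter.waitNodeAttachments(waitCtx); err != nil {
                return err
        }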

@ctelfer (Contributor) commented Aug 16, 2018

Tried this out (with the swarmkit patch as well) on a simple cluster and found that if I created a service that needed one more IP address than was available, all tasks in the service still blocked seemingly indefinitely. They were stuck in the preparing state.

@ctelfer (Contributor) commented Aug 16, 2018

Clarification: this is to say that all tasks were preparing, not just a few that couldn't get IP addresses.

@ctelfer (Contributor) commented Aug 16, 2018

Ah, I figured out why the tasks are stuck in preparing. Maybe this was already obvious to others.

  • vendor/github.com/docker/swarmkit/agent/exec/controller.go's Do() function invokes Prepare(), which in turn invokes r.adapter.waitNodeAttachments(waitNodeAttachmentsContext). So far so good.
  • if waitNodeAttachments() times out due to the waitNodeAttachmentsContext expiring, then waitNodeAttachments() will return context.DeadlineExceeded.
  • this gets passed back to Do(), which then passes the error to fatal(), which decides what to do with the error.
  • fatal() specifically checks for context.DeadlineExceeded and, if it sees it, returns retry() as the appropriate action.
  • On retry() the state of the task will not change. It will stay in api.TaskStatePreparing.

So timing out the attachment wait will keep the thread from polling indefinitely on the task status, but it won't change the task state to reject the task. Was this how the timeout was intended?

@dperny (Contributor, Author) commented Aug 16, 2018

changed the error type in the case of timeouts.
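
Presumably something along these lines (a sketch, not the actual diff): since fatal() matches context.DeadlineExceeded and retries, translating the timeout into a plain error makes the task fail outright instead of retrying forever.

        if err := r.adapter.waitNodeAttachments(waitCtx); err != nil {
                // Returning context.DeadlineExceeded as-is would make fatal()
                // call retry() indefinitely; swap it for a plain error so the
                // task is rejected instead.
                if err == context.DeadlineExceeded {
                        return errors.New("timed out waiting for node attachments to become ready")
                }
                return err
        }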

@ctelfer (Contributor) commented Aug 17, 2018

I tested with the swarmkit patch and this one in 4 ways:

  • I tested whether the allocator would max out the tasks given a constrained IP address space and a service constraint restricting tasks to one node. The allocator used all available IP addresses and did not allocate per-node addresses for unused nodes.
  • I tested whether creating a service that over-extended the available IPs would eventually time out the tasks stuck in the preparing state and reject their allocations. This indeed occurred.
  • I ran a battery of end-to-end tests and they passed.
  • I ran a script that repeatedly scaled up and down a set of services in parallel waiting for them to converge between scaling stages. I then searched the logs for the error message indicating that a network couldn't get created due to unavailability of the node IP. This error never appeared.

It would probably be good to automate the first two and add them to moby's integration tests. But the logic seems to be doing exactly what it is supposed to.

@ctelfer (Contributor) left a comment

LGTM

@dperny (Contributor, Author) commented Aug 20, 2018

i've added a basic test for the new code.


// TODO(dperny): essentially, we're long-polling here. this is really
// sub-optimal, but a better solution based off signaling channels would
// require a more substantial rearchitecture.
Member commented:

Can you open an issue to detail this work?

Contributor (Author) replied:

What if I changed it to a NOTE or just dropped the qualifier altogether? I don't think the work will really ever be worth it unless we get bored or notice a performance bottleneck here.

Blocks the execution of tasks during the Prepare phase until there
exists an IP address for every overlay network in use by the task. This
prevents a task from starting before the NetworkAttachment containing
the IP address has been sent down to the node.

Includes a basic test for the correct use case.

Signed-off-by: Drew Erny <drew.erny@docker.com>
@dperny (Contributor, Author) commented Aug 20, 2018

We can't vendor in the latest master of swarmkit, which contains moby/swarmkit#2725, without this change.

@thaJeztah (Member) left a comment

SGTM

import (
	"testing"

	"context"
)
Member commented:

nit (no blocker): this should be in the same group as "testing" above

Contributor (Author) replied:

i usually like to separate the testing imports from the stdlib imports. this originally had way more imports and i ran goimports to get rid of them, leaving this.

sorry.

Member replied:

yeah, it's not a big deal; I just try to keep manual ordering / grouping to a minimum to keep out "personal taste" (which usually doesn't work well with many different contributors)

}

// now wait 200 ms for waitNodeAttachments to pick up the change
time.Sleep(200 * time.Millisecond)
Member commented:

not a big fan of sleeps, but guess we can look into alternatives in a follow up
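
One sleep-free alternative, sketched with a hypothetical doneCh that the test would create around its waitNodeAttachments call:

        // Instead of sleeping a fixed 200 ms, block on the wait goroutine's
        // result with an upper bound. doneCh is a hypothetical channel that
        // receives waitNodeAttachments' return value.
        select {
        case err := <-doneCh:
                if err != nil {
                        t.Fatal(err)
                }
        case <-time.After(time.Second):
                t.Fatal("waitNodeAttachments did not return within 1s")
        }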

@tiborvass (Contributor) left a comment

LGTM

@andrewhsu (Member) left a comment

LGTM

desirable to get this in before the swarmkit vendor updates

@andrewhsu (Member) commented:

We can ignore the z failures, which are due to intermittent z infrastructure issues we are experiencing at the moment:

curl: (28) Operation timed out after 300439 milliseconds with 0 out of 0 bytes received

@thaJeztah (Member) commented:

The experimental failure is also unrelated; it's a known flaky test.

@tiborvass merged commit 9d71a57 into moby:master on Aug 21, 2018