
Block task starting until node attachments are ready #37604

Merged 1 commit into moby:master on Aug 21, 2018

Conversation

@dperny (Contributor) commented Aug 7, 2018

- What I did
Blocks the execution of tasks during the Prepare phase until there exists an IP address for every overlay network in use by the task. This prevents a task from starting before the NetworkAttachment containing the IP address has been sent down to the node.

This is needed because moby/swarmkit#2725 only creates network attachments on an as-needed basis, meaning a node is not guaranteed to have received a network attachment by the time the task requiring it has arrived.

- How I did it
Added a long-polling loop that checks for the existence of an IP address for each overlay network in use by the task. Long-polling is not optimal, but a channel-triggered notification method would require a more substantial rearchitecting.

Also added locking to the network attachment store, to ensure that it is safe for concurrent access.
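
Not the PR's code verbatim, but a minimal sketch of the shape of that long-polling wait and the locked attachment store (the package name, the attachmentStore type, and the 100 ms poll interval are all illustrative assumptions):

        package container // hypothetical package, for illustration only

        import (
                "context"
                "sync"
                "time"
        )

        // attachmentStore is a stand-in for the daemon's network attachment
        // store; the mutex mirrors the locking added in this PR so the store
        // is safe for concurrent access.
        type attachmentStore struct {
                mu  sync.Mutex
                ips map[string]string // network ID -> load balancer IP
        }

        func (s *attachmentStore) GetIPForNetwork(id string) (string, bool) {
                s.mu.Lock()
                defer s.mu.Unlock()
                ip, ok := s.ips[id]
                return ip, ok
        }

        // waitNodeAttachments long-polls until every overlay network used by
        // the task has an IP in the store, or the context is cancelled.
        func waitNodeAttachments(ctx context.Context, store *attachmentStore, networkIDs []string) error {
                ticker := time.NewTicker(100 * time.Millisecond)
                defer ticker.Stop()
                for {
                        ready := true
                        for _, id := range networkIDs {
                                if _, ok := store.GetIPForNetwork(id); !ok {
                                        ready = false
                                        break
                                }
                        }
                        if ready {
                                return nil
                        }
                        select {
                        case <-ctx.Done():
                                return ctx.Err()
                        case <-ticker.C:
                        }
                }
        }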

- How to verify it
Still needs tests. Until moby/swarmkit#2725 is vendored, there is currently no case in which a network attachment won't be ready.

@ctelfer (Contributor) commented Aug 7, 2018

Bringing this in from moby/swarmkit#2725.

It looks like swarmkit's error library already provides a nice mechanism for reporting temporary errors and telling swarm to retry later: docker/swarmkit/agent/exec/errors.go:MakeTemporary. It also looks like the path between the swarmkit controller.Do() function and the docker engine's daemon/network.go:daemon.createNetwork() call is mostly clear. That is, the various calls in between just flag the error and return it without interpretation.

So, if I'm reading this right, it might be possible to address the race condition by changing the code of createNetwork() so it invokes MakeTemporary() on the error it would return when it can't find the network attachment. I.e., on line 350 of daemon/network.go:

        if agent && driver == "overlay" {
                nodeIP, exists := daemon.GetAttachmentStore().GetIPForNetwork(id)
                if !exists {
                        return nil, fmt.Errorf("Failed to find a load balancer IP to use for network: %v", id)
                }

becomes something like

        // Top of file
        import swarmErrors "github.com/docker/swarmkit/agent/exec"
...
        if agent && driver == "overlay" {
                nodeIP, exists := daemon.GetAttachmentStore().GetIPForNetwork(id)
                if !exists {
                        return nil, swarmErrors.MakeTemporary(fmt.Errorf("Failed to find a load balancer IP to use for network: %v", id))
                }

Does this seem plausible? I do note that the MakeTemporary() function isn't invoked anywhere in the moby codebase, meaning this is a somewhat untested code path. But from what I see in the swarmkit controller.Do() code:

        fatal := func(err error) (*api.TaskStatus, error) {
...
                // Tests for specific error types...
...
                status.Err = err.Error() // still reported on temporary
                if IsTemporary(err) {
                        return retry()
                }

                // only at this point do we consider the error fatal to the task.
                log.G(ctx).WithError(err).Error("fatal task error")
...
         }
...
        switch status.State {
        case api.TaskStatePreparing:
                if err := ctlr.Prepare(ctx); err != nil && err != ErrTaskPrepared {
                        return fatal(err)
                }

it should do what we want here.

@ctelfer (Contributor) commented Aug 7, 2018

BTW, one thing I am not clear on w/ the above suggestion is how rapidly it will cause the daemon to poll and retry creating the task while waiting for the node attachment. I'm naively hoping there is some kind of back-off mechanism built into the retry logic, etc.

if err := r.adapter.nodeAttachmentsReady(ctx); err != nil {
	return err
}

Contributor commented:

So, please pardon my ignorance. Three questions:

  1. Is the Prepare() function invoked in its own goroutine, so that blocking here won't unduly hold up other operations?
  2. Alternatively, will blocking this goroutine hold up some client (e.g. the swarm manager) for an excessive amount of time, because it is waiting for the task to be prepared?
  3. Is the context's timeout sufficiently long to allow node network attachments to propagate to the engine?

// TODO(dperny): unsure if this will work with other network
// drivers, but i also don't think other network drivers use the
// node attachment IP address.
if attachment.Network.DriverState.Name == "overlay" {
Contributor commented:

Is there any reason we can't do all this waiting in libnetwork (possibly even inside the overlay driver itself)?

Contributor replied:

Waiting in libnetwork would not be possible because libnetwork doesn't have any visibility into the attachment store. It only creates the networks with the addresses it is given. It does not know whether they are valid or not.

@ctelfer (Contributor) commented Aug 14, 2018

Based on further consideration and discussion with @dperny, I agree that the MakeTemporary solution is potentially easy for someone to trip over down the road (e.g. by changing error handling in the middle of the network allocation call chain in moby). Since this error is already hard to test for, understanding what went wrong in such a case would be quite difficult. Therefore, an explicit check for the race condition (i.e. what this PR provides) seems more sound.

@dperny (Contributor, Author) commented Aug 14, 2018

Thanks, Chris, put much better than I probably would have this morning.

@codecov bot commented Aug 15, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@9916827).
The diff coverage is 45.16%.

@@            Coverage Diff            @@
##             master   #37604   +/-   ##
=========================================
  Coverage          ?   36.14%           
=========================================
  Files             ?      610           
  Lines             ?    45084           
  Branches          ?        0           
=========================================
  Hits              ?    16294           
  Misses            ?    26544           
  Partials          ?     2246

@dperny (Contributor, Author) commented Aug 15, 2018

I've updated the PR by adding a timeout to the waitNodeAttachments call. The comment above it explains why.
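
A minimal sketch of what bounding the wait might look like (the constant name and its 30-second value here are illustrative assumptions, not the PR's actual figures):

        // Bound the attachment wait so a task can't sit in the Prepare phase
        // forever if the manager never sends the attachment down. The value
        // here is a placeholder.
        const nodeAttachmentReadyTimeout = 30 * time.Second

        waitCtx, cancel := context.WithTimeout(ctx, nodeAttachmentReadyTimeout)
        defer cancel()

        if err := r.adapter.waitNodeAttachments(waitCtx); err != nil {
                return err
        }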

@ctelfer (Contributor) commented Aug 16, 2018

Tried this out (with the swarmkit patch as well) on a simple cluster and found that if I created a service that needed one more IP address than was available, all tasks in the service still blocked seemingly indefinitely. They were stuck in the preparing state.

@ctelfer (Contributor) commented Aug 16, 2018

Clarification: this is to say that all tasks were preparing, not just a few that couldn't get IP addresses.

@ctelfer (Contributor) commented Aug 16, 2018

Ah, I figured out why the tasks are stuck in preparing. Maybe this was already obvious to others.

  • vendor/github.com/docker/swarmkit/agent/exec/controller.go's Do() function invokes Prepare(), which in turn invokes r.adapter.waitNodeAttachments(waitNodeAttachmentsContext). So far so good.
  • if waitNodeAttachments() times out due to the waitNodeAttachmentsContext expiring, then waitNodeAttachments() will return context.DeadlineExceeded.
  • this gets passed back to Do(), which then passes the error to fatal(), which decides what to do with the error.
  • fatal() specifically checks for context.DeadlineExceeded and, if it sees it, returns retry() as the appropriate action.
  • On retry() the state of the task will not change. It will stay in api.TaskStatePreparing.

So timing out the attachment wait will keep the thread from polling indefinitely on the task status, but it won't change the task state to reject the task. Was this how the timeout was intended?

@dperny (Contributor, Author) commented Aug 16, 2018

changed the error type in the case of timeouts.
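
Presumably something along these lines (a sketch, not the actual diff): since fatal() matches context.DeadlineExceeded and retries, translating the timeout into a plain error makes the task fail outright instead of retrying forever.

        if err := r.adapter.waitNodeAttachments(waitCtx); err != nil {
                // Returning context.DeadlineExceeded as-is would make fatal()
                // call retry() indefinitely; swap it for a plain error so the
                // task is rejected instead.
                if err == context.DeadlineExceeded {
                        return errors.New("timed out waiting for node attachments to become ready")
                }
                return err
        }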

@ctelfer (Contributor) commented Aug 17, 2018

I tested with the swarmkit patch and this one in 4 ways:

  • I tested whether the allocator would max out the tasks given a constrained IP address space and a service constraint restricting tasks to one node. The allocator used all available IP addresses and did not allocate per-node addresses for unused nodes.
  • I tested whether creating a service that over-extended the available IPs would eventually time out the tasks stuck in the preparing state and reject their allocations. This indeed occurred.
  • I ran a battery of end-to-end tests and they passed.
  • I ran a script that repeatedly scaled up and down a set of services in parallel waiting for them to converge between scaling stages. I then searched the logs for the error message indicating that a network couldn't get created due to unavailability of the node IP. This error never appeared.

It would probably be good to automate the first two and add them to moby's integration tests. But the logic seems to be doing exactly what it is supposed to.

@ctelfer (Contributor) left a comment

LGTM

@dperny (Contributor, Author) commented Aug 20, 2018

i've added a basic test for the new code.


// TODO(dperny): essentially, we're long-polling here. this is really
// sub-optimal, but a better solution based off signaling channels would
// require a more substantial rearchitecture.
Member commented:

Can you open an issue to detail this work?

Contributor (Author) replied:

What if I changed it to a NOTE or just dropped the qualifier altogether? I don't think the work will really ever be worth it unless we get bored or notice a performance bottleneck here.

Blocks the execution of tasks during the Prepare phase until there
exists an IP address for every overlay network in use by the task. This
prevents a task from starting before the NetworkAttachment containing
the IP address has been sent down to the node.

Includes a basic test for the correct use case.

Signed-off-by: Drew Erny <drew.erny@docker.com>
@dperny (Contributor, Author) commented Aug 20, 2018

We can't vendor in the latest master of swarmkit, which contains moby/swarmkit#2725, without this change.

@thaJeztah (Member) left a comment

SGTM

import (
	"testing"

	"context"
)
Member commented:

nit (no blocker): this should be in the same group as "testing" above

Contributor (Author) replied:

i usually like to separate the testing imports from the stdlib imports. this originally had way more imports and i ran goimports to get rid of them, leaving this.

sorry.

Member replied:

yeah, it's not a big deal; I just try to keep manual ordering / grouping to a minimum to keep out "personal taste" (which usually doesn't work well with many different contributors)

}

// now wait 200 ms for waitNodeAttachments to pick up the change
time.Sleep(200 * time.Millisecond)
Member commented:

not a big fan of sleeps, but guess we can look into alternatives in a follow up
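
One sleep-free alternative, sketched with a hypothetical doneCh that the test would create around its waitNodeAttachments call:

        // Instead of sleeping a fixed 200 ms, block on the wait goroutine's
        // result with an upper bound. doneCh is a hypothetical channel that
        // receives waitNodeAttachments' return value.
        select {
        case err := <-doneCh:
                if err != nil {
                        t.Fatal(err)
                }
        case <-time.After(time.Second):
                t.Fatal("waitNodeAttachments did not return within 1s")
        }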

@tiborvass (Contributor) left a comment

LGTM

@andrewhsu (Member) left a comment

LGTM

desirable to get this in before the swarmkit vendor updates

@andrewhsu (Member) commented:

We can ignore the z failures, which are due to intermittent z infrastructure issues we are experiencing at the moment:

curl: (28) Operation timed out after 300439 milliseconds with 0 out of 0 bytes received

@thaJeztah (Member) commented:

The experimental failure is also unrelated; it's a known flaky test.

@tiborvass merged commit 9d71a57 into moby:master on Aug 21, 2018