cluster: Proceed with startup if cluster component can't be created #31631

Merged
aaronlehmann merged 1 commit into moby:master from aaronlehmann:swarm-failure-on-startup on Mar 22, 2017

Conversation

@aaronlehmann
Contributor

aaronlehmann commented Mar 8, 2017

The current behavior is for dockerd to fail to start if the swarm component can't be started for some reason. This can be difficult to debug remotely because the daemon won't be running at all, so it's not possible to hit endpoints like /info to see what's going on. It's also very difficult to recover from the situation, since commands like "docker swarm leave" are unavailable.

Change the behavior to allow startup to proceed.

Note this is a change we'll have to communicate well. People may expect they can rely on successful daemon startup as an indicator that swarm mode is functioning (though frankly I expect that in most cases people are surprised when dockerd suddenly fails to start). I'm not sure where this change needs to be documented.

Fixes #29580

cc @tonistiigi @aluzzardi
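
Illustratively, the shape of the change is to treat a failure to create the cluster component as non-fatal during daemon startup. The sketch below uses hypothetical names (cluster, newCluster) and is a standalone Go example of that pattern, not the actual daemon code touched by this PR:

```go
package main

import (
	"errors"
	"log"
)

// cluster and newCluster are hypothetical stand-ins for the daemon's swarm
// component; they are not the identifiers actually changed by this PR.
type cluster struct{}

func newCluster() (*cluster, error) {
	// Simulate the failure mode from #29580: the local address can't be
	// determined, so the swarm component can't be created.
	return nil, errors.New("could not find a local IP address")
}

func main() {
	c, err := newCluster()
	if err != nil {
		// Old behavior: a fatal error here, so dockerd never came up.
		// New behavior: log the failure and continue starting the daemon,
		// keeping /info and "docker swarm leave" available for recovery.
		log.Printf("failed to create swarm component: %v", err)
	} else {
		log.Printf("swarm component created: %v", c)
	}
	log.Println("daemon startup continues")
}
```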

cluster: Proceed with startup if cluster component can't be created
The current behavior is for dockerd to fail to start if the swarm
component can't be started for some reason. This can be difficult to
debug remotely because the daemon won't be running at all, so it's not
possible to hit endpoints like /info to see what's going on. It's also
very difficult to recover from the situation, since commands like
"docker swarm leave" are unavailable.

Change the behavior to allow startup to proceed.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>

@aaronlehmann
Contributor

aaronlehmann commented Mar 8, 2017

Note that the root cause of #29580 was that the IP address of a cluster member had changed. When swarm mode is started, the daemon resolves the local IP to provide it to libnetwork. If this was the first cluster member (set up with swarm init), the local address is stored on disk. Otherwise, if this node joined the swarm with swarm join, we store the peer address that was passed to swarm join, and check what local address is used when routing to that address. Normally this works fine, but if we've moved to a different network, the peer address may no longer be routable. Thus, the local address can't be found, and swarm mode can't start up. Without this PR, this also prevents the daemon from starting up.
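
For illustration only, the general technique described here (asking the routing table which local address would be used to reach the stored peer address) can be sketched in Go as below; the peer address is a placeholder and this is not the actual libnetwork/daemon code:

```go
package main

import (
	"fmt"
	"net"
)

// localIPForPeer asks the kernel which local address would be used to route
// to the given peer address. Dialing UDP sends no packets; it only consults
// the routing table.
func localIPForPeer(peerAddr string) (net.IP, error) {
	conn, err := net.Dial("udp", peerAddr)
	if err != nil {
		// If the peer is no longer routable (e.g. the machine moved to a
		// different network), this fails -- the failure mode behind #29580.
		return nil, err
	}
	defer conn.Close()
	return conn.LocalAddr().(*net.UDPAddr).IP, nil
}

func main() {
	// 192.0.2.10 is a documentation address standing in for the peer that
	// was passed to "swarm join"; 2377 is the swarm management port.
	ip, err := localIPForPeer("192.0.2.10:2377")
	if err != nil {
		fmt.Println("could not determine local address:", err)
		return
	}
	fmt.Println("local address:", ip)
}
```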

I also considered submitting a PR to let swarm mode start up even in the case where the local address can't be determined. This would be helpful for laptop users who roam between networks. However, after thinking through it some more, I think blocking swarm startup is the right thing to do in this case. The problem only arises on a node that joined with swarm join, so it will not affect single-node clusters. If you are part of a multinode cluster and can no longer route to the address which you originally joined, preventing swarm from running seems like the right thing to do. With this PR, the daemon will still start, and you'll be able to use docker info to see that swarm mode is not running and understand why.

In the future, we will support a local version of swarm mode that doesn't expose any ports to the outside world until the user decides to. This way, features like secrets will be supported without needing to officially be part of a swarm cluster. In this mode, it will be important to support nodes without any network access at all. Ideally, the networking agent won't be started until the user decides to join multiple nodes together, so the question of resolving the local IP won't arise at all for single-node clusters. It might take some work to make this change, though.

cc @aboch

@ehazlett
Contributor

ehazlett commented Mar 8, 2017

Yes, in the event of an issue I think it would be better to allow swarm commands instead of trying to dig through the filesystem to remove whatever is needed. As you mentioned, we will definitely want to update docs to clearly point out the new behavior.

design LGTM

@dperny
Contributor

dperny commented Mar 10, 2017

I am not a maintainer, but this LGTM.

@tonistiigi
Member

tonistiigi commented Mar 18, 2017

LGTM

@cpuguy83
Contributor

cpuguy83 commented Mar 22, 2017

Generally OK, but what if we only allow it to start while in debug mode?

@aaronlehmann
Contributor

aaronlehmann commented Mar 22, 2017

I think that's still a poor experience. You'd have to mess with daemon flags to even get into a state where Docker is up and you can debug the problem. If we went that route, I think it would be better to provide a daemon flag that clears swarm state (but I think it's better still to let the daemon start as usual, and let the user leave the swarm with normal CLI commands, as in this PR).

@cpuguy83
Contributor

cpuguy83 commented Mar 22, 2017

Can docker info reflect that swarm-mode is not running, or degraded in some way?

@aaronlehmann
Contributor

aaronlehmann commented Mar 22, 2017

docker info will indicate that swarm mode hit an error condition, and show the error message.
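
For example, a client could surface the same state through the Engine API. The sketch below assumes the Docker Go client and the swarm.Info fields (LocalNodeState, Error); those API details are assumptions for illustration, not something specified in this PR:

```go
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types/swarm"
	"github.com/docker/docker/client"
)

func main() {
	// NewEnvClient configures the client from DOCKER_HOST etc.; the exact
	// constructor and field names are assumptions about the Go client.
	cli, err := client.NewEnvClient()
	if err != nil {
		panic(err)
	}
	info, err := cli.Info(context.Background())
	if err != nil {
		panic(err)
	}
	// With this change the daemon stays up, so the swarm failure is visible
	// here instead of dockerd refusing to start at all.
	if info.Swarm.LocalNodeState == swarm.LocalNodeStateError {
		fmt.Println("swarm component failed:", info.Swarm.Error)
	}
}
```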

@cpuguy83

LGTM

aaronlehmann merged commit 9b33edf into moby:master on Mar 22, 2017

4 of 6 checks passed

powerpc: Jenkins build is being scheduled
z: Jenkins build is being scheduled
dco-signed: All commits are signed
experimental: Jenkins build Docker-PRs-experimental 31434 has succeeded
janky: Jenkins build Docker-PRs 40052 has succeeded
windowsRS1: Jenkins build Docker-PRs-WoW-RS1 11123 has succeeded

GordonTheTurtle added this to the 17.05.0 milestone on Mar 22, 2017

aaronlehmann deleted the aaronlehmann:swarm-failure-on-startup branch on Mar 22, 2017
