
Command to leave WAN pool without leaving LAN pool #6548

Closed
freddygv opened this issue Sep 26, 2019 · 5 comments · Fixed by #11722
Labels
theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner type/enhancement Proposed improvement or new feature

Comments

@freddygv (Contributor) commented Sep 26, 2019

Feature Description

Currently the consul leave command will trigger a graceful leave and shutdown of the agent it is called on.

Consul should provide a command and HTTP API endpoint for servers to leave a WAN pool without:

  • Being removed from the raft config
  • Leaving the LAN pool
  • Shutting down

Use Case(s)

The main use case would be to split apart two WAN-joined datacenters.

The only way to do this without downtime currently is to:
1. Block cross-DC server communication.
2. Have a server in each DC call consul force-leave <node-name>.<dc> on all the servers in the other DC (once the servers in the other DC are marked as failed).

Update: the force-leave step above is currently broken, so there is no workaround at the moment.

@freddygv added the type/enhancement and theme/operator-usability labels on Sep 26, 2019
@slackpad (Contributor) commented

This is a duplicate of #3307, but this issue has more context, so it may be the better one to keep open. There was an earlier attempt to implement this in #3414, which documents some of the issues we ran into.

@banks (Member) commented May 14, 2020

Relatedly, I don't think consul force-leave <node-name>.<dc> actually works as currently documented. When I test it against a federated cluster, even with all nodes up and healthy, I get:

$ consul members -wan
Node            Address          Status  Type    Build     Protocol  DC   Segment
node-24507.dc1  127.0.0.1:24509  alive   server  1.7.0dev  2         dc1  <all>
node-24512.dc1  127.0.0.1:24514  alive   server  1.7.0dev  2         dc1  <all>
node-24532.dc2  127.0.0.1:24534  alive   server  1.7.0dev  2         dc2  <all>
node-24537.dc2  127.0.0.1:24539  alive   server  1.7.0dev  2         dc2  <all>
node-24542.dc2  127.0.0.1:24544  alive   server  1.7.0dev  2         dc2  <all>
node-24562.dc3  127.0.0.1:24564  alive   server  1.7.0dev  2         dc3  <all>
node-24567.dc3  127.0.0.1:24569  alive   server  1.7.0dev  2         dc3  <all>
node-24572.dc3  127.0.0.1:24574  alive   server  1.7.0dev  2         dc3  <all>
node-8500.dc1   127.0.0.1:8302   alive   server  1.7.0dev  2         dc1  <all>
$ consul force-leave -prune node-24562.dc3
Error force leaving: Unexpected response code: 500 (agent: No node found with name 'node-24562.dc3')

@freddygv (Contributor, Author) commented May 14, 2020

@banks a change was made that broke that workaround. It was intended to fix an issue where calling force-leave in the current DC without the DC suffix left the force-left node in the WAN pool.

The relevant code is here:
https://github.com/hashicorp/consul/blob/master/agent/consul/server.go#L1127

When you force-leave <node-name>.dc2 from dc1, we naively append dc1 for the WAN pool removal, so the call is made with <node-name>.dc2.dc1.
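That suffix handling can be sketched as follows. This is an illustrative model of the behavior described above, not Consul's actual code (the real logic lives in agent/consul/server.go, and the function names here are invented):

```go
package main

import (
	"fmt"
	"strings"
)

// naiveWANName mirrors the reported bug: the local datacenter is appended
// unconditionally, so a name that is already DC-qualified ("node-24562.dc3")
// becomes "node-24562.dc3.dc1" and matches no member in the WAN pool.
func naiveWANName(node, localDC string) string {
	return node + "." + localDC
}

// fixedWANName appends the local DC only when the node name does not
// already end in a known datacenter suffix.
func fixedWANName(node, localDC string, knownDCs []string) string {
	for _, dc := range knownDCs {
		if strings.HasSuffix(node, "."+dc) {
			return node // already fully qualified for the WAN pool
		}
	}
	return node + "." + localDC
}

func main() {
	dcs := []string{"dc1", "dc2", "dc3"}
	fmt.Println(naiveWANName("node-24562.dc3", "dc1"))      // node-24562.dc3.dc1 (wrong)
	fmt.Println(fixedWANName("node-24562.dc3", "dc1", dcs)) // node-24562.dc3
	fmt.Println(fixedWANName("node-8500", "dc1", dcs))      // node-8500.dc1
}
```

The mangled name explains the 500 error above: the agent looks up "node-24562.dc3.dc1" and finds no node with that name.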

I updated the issue to state there's no workaround currently available.

Edit:

Just noticed that you tried to force-leave a node that's alive. That's never been possible. Only failed nodes can be force-left because if they're still alive they will refute the messages about them being failed/leaving.
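That invariant can be modeled as a small state check. This is a sketch of the behavior described above, not Serf's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// MemberStatus models the gossip member states relevant here.
type MemberStatus int

const (
	StatusAlive MemberStatus = iota
	StatusFailed
	StatusLeft
)

// forceLeave models the rule: only members already marked failed can be
// force-left. An alive member will refute gossip claiming it has failed
// or left, so its status stays alive and the operation fails.
func forceLeave(status MemberStatus) (MemberStatus, error) {
	if status != StatusFailed {
		return status, errors.New("member is alive and will refute the leave")
	}
	return StatusLeft, nil
}

func main() {
	if _, err := forceLeave(StatusAlive); err != nil {
		fmt.Println("alive member:", err)
	}
	next, _ := forceLeave(StatusFailed)
	fmt.Println("failed member moved to left:", next == StatusLeft)
}
```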

@banks (Member) commented May 15, 2020

Another change made recently is the reason the workaround no longer works: aed5cb7

In an attempt to improve the error message on force-leave (a great idea), we missed the case where the target could be a WAN node, so we need to fix that.

@sriyer commented Oct 8, 2020

@banks @freddygv, it seems that at the moment there is no way to un-federate one cluster from another. Is the force-leave option going to be fixed soon?
