Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support of acceptance of datacenter as a param for force-leave from WAN pool #6609

Closed
wants to merge 4 commits into from

Conversation

dedifferentiator
Copy link

I hope I did everything the proper way, waiting for review.
Also need some help with writing tests for -datacenter arg, need to know how to create fake node in another datacenter in forceleave_test.go

Resolves #6582

Add support for datacenter as a param for force-leave from WAN pool
@hashicorp-cla
Copy link

hashicorp-cla commented Oct 9, 2019

CLA assistant check
All committers have signed the CLA.

t.Parallel()
c, s := makeClient(t)
defer s.Stop()

agent := c.Agent()

// Eject somebody
err := agent.ForceLeavePrune("foo")
q := &QueryOptions{Datacenter: "", Prune: true}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also maybe it's worth to add test for forceLeaveWithOpts with datacenter specified

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that would be useful.

Copy link
Contributor

@freddygv freddygv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @dedifferentiator! Sorry about the delay in reviewing this. I made comments in line with some recommendations.

Aside from those comments, we should also add some tests for removing nodes in a different datacenter. These should be added in the command and agent packages, and they would be modeled after: TestForceLeaveCommand and TestAgent_ForceLeave_ACLDeny.

Here is a setup that should be verified in both:

  • Set up two agents, with each one in a different datacenter (passing a datacenter config into the hcl param of NewTestAgent)
  • Join them via a WAN join
  • Shut down one of them
  • Run force-leave without prune, and verify that the status transitions to serf.StatusLeft
  • Run force-leave with prune, and verify that the node is removed from the WAN member list.

If these tests pass we can be more confident that the datacenter flag is working as intended.

Lastly, since the HTTP handler and CLI command are being updated, the website docs should also be updated to reflect the datacenter param.

These are the docs files to update:
website/source/api/agent.html.md
website/source/docs/commands/force-leave.html.markdown.erb

_, prune := req.URL.Query()["prune"]
//Check the value of the prune and datacenter query
datacenter := req.URL.Query().Get("datacenter")
prune, err := strconv.ParseBool(req.URL.Query().Get("prune"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting the value of prune should be left as it was before.

We intend for users to use the flag like this:
/agent/force-leave/:node?prune

If a user only passes prune and not something like prune=true, then ParseBool will return an error and prune would be set to false.

func (a *Agent) ForceLeavePrune(node string) error {
//ForceLeaveWithOpts is used to have an a failed agent removed
//from the list of members with specified additional arguments
func (a *Agent) ForceLeaveWithOpts(node string, q *QueryOptions) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some internal discussion we decided that we shouldn't piggy-backoff of QueryOptions here. QueryOptions should mostly be reserved for params that apply to multiple functions, rather than just for one (prune only applies to ForceLeave).

Rather than passing in QueryOptions, the signature could instead be:

func (a *Agent) ForceLeaveWithOpts(node string, datacenter string, prune bool) error

Then the params could be set like this:

r.params.Set("prune", "")
r.params.Set("dc", datacenter)

@@ -155,6 +155,10 @@ type QueryOptions struct {
// services. This currently affects prepared query execution.
Connect bool

// Prune is used in force leave command to remove agent completely
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be removed, as per the comment above

@@ -676,6 +680,9 @@ func (r *request) setQueryOptions(q *QueryOptions) {
if q.Connect {
r.params.Set("connect", "true")
}
if q.Prune {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also to be removed

@@ -85,5 +89,6 @@ Usage: consul force-leave [options] name
Consul will attempt to reconnect to those failed nodes for some period of
time before eventually reaping them.

-prune Remove agent completely from list of members
-prune Remove agent completely from list of members
-datacenter Specify datacenter name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
-datacenter Specify datacenter name
-datacenter Name of the datacenter to remove the node from

}

func (c *cmd) init() {
c.flags = flag.NewFlagSet("", flag.ContinueOnError)
c.flags.BoolVar(&c.prune, "prune", false,
"Remove agent completely from list of members")
c.flags.StringVar(&c.datacenter, "datacenter", "", "Specify datacenter name")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
c.flags.StringVar(&c.datacenter, "datacenter", "", "Specify datacenter name")
c.flags.StringVar(&c.datacenter, "datacenter", "", "Name of the datacenter to remove the node from")

@@ -52,8 +55,9 @@ func (c *cmd) Run(args []string) int {
return 1
}

if c.prune {
err = client.Agent().ForceLeavePrune(nodes[0])
if c.prune || c.datacenter != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be adjusted after removing prune from QueryOptions

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if c.prune || c.datacenter != "" {
    err = client.Agent().ForceLeaveWithOpts(nodes[0], c.datacenter, c.prune)
} else {
    err = client.Agent().ForceLeave(nodes[0])
}

Which way should it be adjusted? ForceLeaveWithOpts can be with unspecified prune and specified datacenter and vice versa, in other case it's just ForceLeave

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That block you just posted is what I had in mind.

t.Parallel()
c, s := makeClient(t)
defer s.Stop()

agent := c.Agent()

// Eject somebody
err := agent.ForceLeavePrune("foo")
q := &QueryOptions{Datacenter: "", Prune: true}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that would be useful.

@dedifferentiator
Copy link
Author

dedifferentiator commented Nov 5, 2019

@freddygv
Okay, so what do we have, I spend a couple of days trying to fix error in TestForceLeaveCommand_datacenter, but with no success, after running forceleave command with specified dc I got:

    log.go:172: TestForceLeaveCommand_datacenter - 2019/11/05 16:41:52.160920 [DEBUG] http: Request PUT /v1/agent/force-leave/Node-f1fb52c3-1d0f-9604-3c00-692921472212?dc=dc2 (42.959µs) from=127.0.0.1:41948
    log.go:172: TestForceLeaveCommand_datacenter - 2019/11/05 16:41:52.762668 [DEBUG] tlsutil: OutgoingRPCWrapper with version 1
    log.go:172: TestForceLeaveCommand_datacenter - 2019/11/05 16:41:58.330392 [DEBUG] memberlist: Failed ping: Node-f1fb52c3-1d0f-9604-3c00-692921472212.dc2 (timeout reached)
    retry.go:121: forceleave_test.go:93: got status "leaving" want "left"
        forceleave_test.go:93: got status "alive" want "left"

For now I decided not to push changes in master since they obviously are not correct, so here they are. Maybe I should push them here? Not sure about that.

But at the same time when forceleave -datacenter=dc doesn't work, TestForceLeaveCommand_prune_datacenter for forceleave -prune -datacenter=dcpasses with no error occured. Maybe I don't see something what doesn't allow forceleave work properly, so help is required

@freddygv
Copy link
Contributor

Hi @dedifferentiator,

After looking through your two PRs, the changes you made look good, and we really appreciate that you took the time to contribute.

Unfortunately, we have decided to not implement this feature. There are a few reasons for this:

  • There was a separate PR merged to mitigate a significant portion of the need for this feature, which was to avoid listing decommissioned datacenters in catalog endpoints.
  • There is additional complexity regarding functionality in Consul Enterprise that wasn't accounted for. This is because cross-DC force-leave would interact with Network Areas which also function across DCs.
  • Also an API release was tagged in November, which means that ForceLeavePrune is now in the customer-facing API. Making changes there would require adding a third function or making a breaking change.
  • The ACL permissions required for cross-DC requests like this one need more thinking, since ACL rules are typically restricted to a single datacenter.

@freddygv freddygv closed this Jan 13, 2020
@dedifferentiator
Copy link
Author

@freddygv Understand, thank you for allowing trying myself in open-source, you have been helping me a lot during this PR, despite it is unmerged, I liked working on it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Accept datacenter as a param for force-leave from WAN pool
4 participants