Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow service deregistration with node write permission #5217

Merged

Conversation

ShimmerGlass
Copy link
Contributor

@ShimmerGlass ShimmerGlass commented Jan 11, 2019

Fixes: #5209

With ACLs enabled if an agent is wiped and restarted without a leave
it can no longer deregister the services it had previously registered
because it no longer has the tokens the services were registered with.
To remedy that we allow service deregistration from tokens with node
write permission.

@ShimmerGlass ShimmerGlass force-pushed the fix-deregister-after-wipe branch 2 times, most recently from 5cc4ab6 to 9a4ae88 Compare January 11, 2019 13:09
@pierresouchay
Copy link
Contributor

It make sense because as long as the user has the ACL to leave the cluster (which effectively remove all services of agent), it should have the rights to remove services the agent is "owning". Having the right on the agent being the correct right to leave, it sounds like a reasonable patch to me.

@ShimmerGlass
Copy link
Contributor Author

We saw this problem again today : several service instances were registered in the catalog while the agent they were registered on before was trying to deregister them without success (since it had no token with permissions to do so) becode the node had been reinstalled.
The only solution we had was to manually deregister each instance via /catalog calls. These instances were green in Consul and revieving traffic.
One other argument toward this change is the fact that service checks can already be deregistered with only node permissions (see https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1233).

@pierresouchay
Copy link
Contributor

Could someone have a look to this PR?

@ShimmerGlass
Copy link
Contributor Author

Any update on this ?
This issue is still affecting us with ghost passing service instances.

@banks banks added the needs-discussion Topic needs discussion with the larger Consul maintainers before committing to for a release label May 14, 2019
@banks banks added this to the 1.5.1 milestone May 14, 2019
@banks banks requested a review from a team May 14, 2019 15:17
@mkeeler mkeeler modified the milestones: 1.5.1, 1.5.2 May 23, 2019
Copy link
Member

@hanshasselberg hanshasselberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work, I left some comments.

@@ -727,7 +727,7 @@ service "service" {
err = msgpackrpc.CallWithCodec(codec, "Catalog.Deregister",
&structs.DeregisterRequest{
Datacenter: "dc1",
Node: "node",
Node: "nope",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this file change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the previous behavior this call (write access on node but not on service) would fail. It now succeeds if the token has write access on the node, which is the case here, so I changed the test to target a node the token does not has write access to.

agent/consul/acl.go Outdated Show resolved Hide resolved
With ACLs enabled if an agent is wiped and restarted without a leave
it can no longer deregister the services it had previously registered
because it no longer has the tokens the services were registered with.
To remedy that we allow service deregistration from tokens with node
write permission.
Copy link
Member

@hanshasselberg hanshasselberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! Left a tiny comment, something we just noticed.

If you don't get to it, I will make the change so that we can include your PR in the release that we are hopefully cutting tomorrow.

agent/consul/acl.go Show resolved Hide resolved
Copy link
Member

@hanshasselberg hanshasselberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@hanshasselberg hanshasselberg merged commit 81f8092 into hashicorp:master Jun 27, 2019
@pierresouchay
Copy link
Contributor

@i0rek 🥇 Thank you so much, this will really help on large infrastructures! (That's actually one of our last pain points)

@hanshasselberg
Copy link
Member

We are closing the gap! 🙏

@ShimmerGlass ShimmerGlass deleted the fix-deregister-after-wipe branch July 1, 2019 08:13
@ShimmerGlass
Copy link
Contributor Author

@i0rek thx (:

@charnet1019
Copy link

how to call /catalog to delete instance?
Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-discussion Topic needs discussion with the larger Consul maintainers before committing to for a release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ACL] Ghost service instance after agent wipe
6 participants