Wrong ACLs lead to desynchronization of Consul agent with Cluster #3676

Closed
pierresouchay opened this issue Nov 9, 2017 · 3 comments · Fixed by #4771
Labels
theme/acls ACL and token generation type/bug Feature does not function as expected
Comments

@pierresouchay
Contributor

Hello

Consul Version: 0.9.3 (client & server)

We are using ACLs to register services and health checks in an on-demand cluster, so people may end up using wrong tokens for the services they want to register.

We found that when several services have wrong ACL tokens, the state in the local Consul agent no longer matches the state seen in the cluster.

While this seems legitimate for users with a wrong token, sometimes services with valid tokens cannot sync their state either.

We register lots of services using the localhost:8500/v1/agent/service/register endpoint, and the problem is especially visible for health check statuses that are not properly updated. It seems linked to having several services with wrong tokens (each of those services may include several health checks). Rejecting applications with wrong tokens seems to avoid this issue, but the situation is problematic since it desynchronizes the agent from the state reported by the cluster.
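For reference, here is a minimal sketch of the kind of registration we do; the service name, port, health check URL, and token below are made up for illustration:

```go
// Register a service with an HTTP health check against the local agent.
// With a wrong X-Consul-Token the agent may still accept the registration
// locally (acl_enforce_version_8 = false) while the anti-entropy sync to
// the servers fails later.
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Service definition following the agent API
	// (PUT /v1/agent/service/register).
	payload := []byte(`{
	  "ID": "mysuper-service-31519",
	  "Name": "mysuper-service",
	  "Port": 31519,
	  "Check": {"HTTP": "http://localhost:31519/health", "Interval": "10s"}
	}`)

	req, err := http.NewRequest(http.MethodPut,
		"http://localhost:8500/v1/agent/service/register",
		bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-Consul-Token", "not-a-valid-token") // hypothetical bad token
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("register status:", resp.Status)
}
```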

Example of messages as seen in the Consul Agent:

Nov 02 01:50:16 mesos-slave060-par consul[11906]: 2017/11/02 01:50:16 [ERR] consul: RPC failed to server 10.251.1.97:8300: rpc error: rpc error: Permission denied
Nov 02 01:50:16 mesos-slave060-par consul[11906]: 2017/11/02 01:50:16 [WARN] agent: Service 'mysuper-service-31519' registration blocked by ACLs

Here mysuper-service-31519 clearly has a wrong token, so its registration fails.
But after a while, having several services doing similar things completely breaks agent synchronization, so the state in the cluster is no longer updated.

A restart of the agent solves the issue most of the time (though we had cases where we had to clean up the checks directory to fix it). It is especially strange that the service registered with its (wrong) token is then properly updated, including its health checks, so I suspect that after a restart Consul loses the token context and does not apply the ACL anymore.

For now we are trying to create a clear reproduction of the bug, but it seems complicated (it requires several clients doing the same thing), and the problem only appears after a while.

Do you have an idea where it could come from?

@slackpad slackpad added this to the 1.0.2 milestone Nov 14, 2017
@slackpad slackpad modified the milestones: 1.0.2, 1.0.3 Dec 13, 2017
@slackpad slackpad added type/bug Feature does not function as expected theme/acls ACL and token generation labels Dec 20, 2017
@slackpad slackpad modified the milestones: 1.0.3, Next Jan 13, 2018
@wdauchy

wdauchy commented Aug 14, 2018

Is there any progress regarding this issue?

@pierresouchay
Contributor Author

#4405 would also help to diagnose those issues.

mkeeler pushed a commit that referenced this issue Jan 4, 2019
…= false (#4771)

Fixes: #3676

This fixes a bug where registering a service with a non-existent ACL token can prevent other services registered with a good token from being synced to the server when using `acl_enforce_version_8 = false`.

## Background

When `acl_enforce_version_8` is off, the agent does not check the ACL token's validity before storing the service in its state.
When syncing a service registered with a missing ACL token, we fall into the default error handling case (https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1255)
and stop the sync (https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1082)
without setting its Synced property to true, as is done in the permission-denied case.
This means that the sync will always stop at the faulty service(s).
The order in which the services are synced is random since we iterate over a map, so eventually
all services with good ACL tokens will be synced; this can however take some time and is influenced
by the cluster size: the bigger the cluster, the slower the sync, because retries are less frequent.
Having a service in this state also prevents any further sync of checks, as they are done after
the services.
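A simplified, self-contained sketch of the pre-fix behavior; the `service` type and `syncOne` below are hypothetical stand-ins, not the actual code from agent/local/state.go:

```go
package main

import "fmt"

// service is a hypothetical stand-in for the agent's local state entries.
type service struct {
	Synced bool
	Token  string
}

// syncOne pretends to push a single service to the servers and fails when
// the token does not exist on the server side.
func syncOne(id string, svc *service) error {
	if svc.Token == "missing" {
		return fmt.Errorf("failed to sync %s: rpc error: ACL not found", id)
	}
	svc.Synced = true
	return nil
}

// syncChanges mimics the pre-fix loop: any error other than permission
// denied aborts the run, so services behind the faulty one stay unsynced.
func syncChanges(services map[string]*service) error {
	for id, svc := range services { // map iteration order is random
		if svc.Synced {
			continue
		}
		if err := syncOne(id, svc); err != nil {
			return err // default error case: the whole sync run stops here
		}
	}
	return nil // checks would be synced after this point, so they never run
}

func main() {
	services := map[string]*service{
		"good-service-1":        {Token: "valid"},
		"mysuper-service-31519": {Token: "missing"},
		"good-service-2":        {Token: "valid"},
	}
	if err := syncChanges(services); err != nil {
		fmt.Println("sync aborted:", err)
	}
	for id, svc := range services {
		fmt.Printf("%s synced=%v\n", id, svc.Synced)
	}
}
```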

## Changes 

This change modifies the sync process to continue even if there is an error.
This fixes the issue described above and also makes the sync more error tolerant: the server may repeatedly refuse
a service (for example, the ACL token could have been deleted by the time the service is synced, or the servers
were upgraded to a newer version with stricter checks on the service definition...).
Now all services and checks that can be synced will be, and those that cannot will be reported as errors in
the logs instead of blocking the whole process.
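A matching sketch of the behavior after this change, using the same hypothetical stand-ins as the sketch above: errors are logged and remembered, but iteration continues.

```go
package main

import "fmt"

type service struct {
	Synced bool
	Token  string
}

func syncOne(id string, svc *service) error {
	if svc.Token == "missing" {
		return fmt.Errorf("rpc error: ACL not found")
	}
	svc.Synced = true
	return nil
}

// syncChanges mimics the post-fix loop: errors are logged and remembered,
// but iteration continues so every service with a valid token still syncs.
func syncChanges(services map[string]*service) error {
	var firstErr error
	for id, svc := range services {
		if svc.Synced {
			continue
		}
		if err := syncOne(id, svc); err != nil {
			fmt.Printf("[ERR] agent: failed to sync service %q: %v\n", id, err)
			if firstErr == nil {
				firstErr = err
			}
			continue // keep going instead of aborting the whole run
		}
	}
	return firstErr
}

func main() {
	services := map[string]*service{
		"good-service-1":        {Token: "valid"},
		"mysuper-service-31519": {Token: "missing"},
		"good-service-2":        {Token: "valid"},
	}
	if err := syncChanges(services); err != nil {
		fmt.Println("sync finished with errors:", err)
	}
	for id, svc := range services {
		fmt.Printf("%s synced=%v\n", id, svc.Synced)
	}
}
```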
@pierresouchay
Contributor Author

@mkeeler thank you
