
Service Registration/De-registration slows down due to existing services with expired token. #3363

Closed
rshsmi opened this issue Aug 5, 2017 · 6 comments
Labels
theme/acls ACL and token generation type/bug Feature does not function as expected waiting-reply Waiting on response from Original Poster or another individual in the thread
Milestone

Comments

@rshsmi

rshsmi commented Aug 5, 2017

consul version for both Client and Server

Client: Consul v0.8.5
Server: Consul v0.8.5

Operating system and Environment details

CentOS Linux release 7.3.1611 (Core)

Description of the Issue (and unexpected/desired result)

The token used to register services with clients expires after a while, depending on its TTL. Services without a valid token then generate the following errors:

[ERR] consul: RPC failed to server IP:8300: rpc error: rpc error: ACL not found
[ERR] agent: failed to sync changes: rpc error: rpc error: ACL not found

and service registration/de-registration takes a very long time.

Reproduction steps

  • Register a service with one or more clients (with or without health checks)
  • Expire or remove the token that was used for the service registration
  • Restart the Consul node the service was registered on
  • The errors above appear in the Consul log
  • Register another service
  • The new service does not appear in Consul (UI or API)
  • After a few hours, or even the next day, the services appear in Consul (UI or API)
  • De-register the service; the de-registration does not take effect until a few hours later, or even the next day
@slackpad
Contributor

slackpad commented Sep 2, 2017

Hi @ars05, for the "Register another service" step, do you use the same expired token or a valid one?

@rshsmi
Author

rshsmi commented Sep 4, 2017

Hi @slackpad - We do use a new valid token for service registration. The "another service" is being registered with a newly generated consul token.

@slackpad
Contributor

I've got a theory about this one since you are restarting the Consul node, and some of the token tracking is in-process and won't survive a restart. Seems possibly related to #3676.

@slackpad slackpad added this to the 1.0.2 milestone Nov 14, 2017
@pierresouchay
Contributor

Yes, we definitely have the same issue (#3676).

Neither the check updates nor the new services are applied. This is very problematic for us, since our service registration is on-demand.

In some cases, we even have to clean up everything (all services and checks) in the local agent to make it converge again.

@slackpad slackpad modified the milestones: 1.0.2, 1.0.3 Dec 13, 2017
@slackpad slackpad added type/bug Feature does not function as expected theme/acls ACL and token generation labels Jan 5, 2018
@slackpad slackpad modified the milestones: 1.0.3, Next Jan 13, 2018
@banks
Member

banks commented Nov 28, 2018

@ars05 I'm aware it's been a while and you may no longer be able to say, but can you confirm whether the issue described in #3676, and its resolution in #4771, covers the issue here? The tl;dr is that if you are using acl_enforce_version_8 = false, there is a bug that can cause this, and it is fixed there (not merged yet, but it will be soon).

If so, we can close this issue in favour of #3676; I just wasn't certain enough that we had confirmed the same diagnosis here to do that yet.

@mkeeler mkeeler added the waiting-reply Waiting on response from Original Poster or another individual in the thread label Jan 4, 2019
@mkeeler
Member

mkeeler commented Jan 4, 2019

The fix for #3676 was merged and should prevent a single check/service registration that fails with an "ACL not found" error from blocking other check/service registrations.

The problem was that "not found" errors (as opposed to "permission denied" errors) were being treated like other server errors, which aborted various sync loops and prevented the remaining service/check registrations. The fix treats "not found" errors the same way as "permission denied" errors, so they only affect the single check/service involved.

@ars05 It's been a while since @banks' question, so I am going to close this. If a similar situation crops up again, please open a new issue.

@mkeeler mkeeler closed this as completed Jan 4, 2019

5 participants