endpoints-controller: race condition between ACL tokens and pods in endpoint resource #599

Closed
adrien-f opened this issue Aug 11, 2021 · 2 comments · Fixed by #601
Labels
area/connect Related to Connect service mesh, e.g. injection type/bug Something isn't working

Comments

@adrien-f

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

When a large number of pods are deployed or restarted while the endpoints-controller is in use, the controller receives multiple events for the same endpoint as the pods come and go.

While the first event is being processed, the client.ACL().TokenList(nil) call returns tokens that were created during that processing (as the other pods are starting). Because the list in endpointPods is not up to date, the reconciliation process deletes the ACL tokens of those newly started pods, which causes the init container's current attempt to fail (it succeeds after a while by retrying the login).
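
To make the sequence concrete, here is a minimal, self-contained sketch of the race as I understand it. It does not use the real controller code or the Consul API: the `token` type and the `tokensToDelete` helper are hypothetical stand-ins, and the accessor IDs are made up. Only the pod names are taken from the logs in this issue.

```go
package main

import "fmt"

// token is a hypothetical stand-in for a Consul ACL token; only the pod name
// recorded at login time matters for this sketch.
type token struct {
	AccessorID string
	PodName    string
}

// tokensToDelete mimics a naive reconciliation pass: any token whose pod is
// missing from the endpointPods snapshot is treated as orphaned.
func tokensToDelete(tokens []token, endpointPods map[string]bool) []token {
	var orphaned []token
	for _, t := range tokens {
		if !endpointPods[t.PodName] {
			orphaned = append(orphaned, t)
		}
	}
	return orphaned
}

func main() {
	// Snapshot built from the Endpoints event being processed.
	endpointPods := map[string]bool{"site-spartacux-fr-b2c-58b4d579-wd4sw": true}

	// Token list fetched afterwards: it already contains a token for a pod
	// that logged in after the snapshot was taken.
	tokens := []token{
		{AccessorID: "a1", PodName: "site-spartacux-fr-b2c-58b4d579-wd4sw"},
		{AccessorID: "a2", PodName: "site-spartacux-fr-b2c-58b4d579-6q9s7"},
	}

	for _, t := range tokensToDelete(tokens, endpointPods) {
		fmt.Println("deleting ACL token for pod", t.PodName)
	}
}
```

In this sketch the token of the pod that just logged in is flagged for deletion even though the pod is alive, simply because the snapshot is older than the token list.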

// The map below is a dump of subset.Addresses
map[
  {10.195.109.89  0xc0009dcd70 &ObjectReference{Kind:Pod,Namespace:spartacux,Name:site-spartacux-fr-b2c-58b4d579-wd4sw,UID:53880add-5a21-437c-ba1a-969aa579ea0d,APIVersion:,ResourceVersion:143265599,FieldPath:,}}:passing
 {10.195.110.58  0xc0009dcd80 &ObjectReference{Kind:Pod,Namespace:spartacux,Name:site-spartacux-fr-b2c-58b4d579-pb6gh,UID:7745d69f-a9e1-4a06-8676-255ea061a072,APIVersion:,ResourceVersion:143265109,FieldPath:,}}:passing
 {10.195.74.93  0xc0009dcda0 &ObjectReference{Kind:Pod,Namespace:spartacux,Name:site-spartacux-fr-b2c-58b4d579-p5t72,UID:0a0ca60f-ea47-48d5-b675-b678827360b5,APIVersion:,ResourceVersion:143266873,FieldPath:,}}:critical]

// 2021-08-11 at 14:19:54 UTC
{"level":"info","ts":1628691594.6090574,"logger":"controller.endpoints","msg":"deleting ACL token for pod","name":"site-spartacux-fr-b2c-58b4d579-6q9s7"}
{"level":"info","ts":1628691595.269407,"logger":"controller.endpoints","msg":"deleting ACL token for pod","name":"site-spartacux-fr-b2c-58b4d579-b86nv"}
{"level":"info","ts":1628691595.2737951,"logger":"controller.endpoints","msg":"deleting ACL token for pod","name":"site-spartacux-fr-b2c-58b4d579-bg5wc"}
{"level":"info","ts":1628691595.2784,"logger":"controller.endpoints","msg":"deleting ACL token for pod","name":"site-spartacux-fr-b2c-58b4d579-j5rmt"}
{"level":"info","ts":1628691595.3030953,"logger":"controller.endpoints","msg":"retrieved","name":"site-spartacux-fr-b2c","ns":"spartacux"}
{"level":"info","ts":1628691595.303154,"logger":"controller.endpoints","msg":"adding target to endpointsPods list","target":"site-spartacux-fr-b2c-58b4d579-wd4sw"}

Reproduction Steps

  • Scale any deployment to 100 pods or more
  • Check the logs of the init container:
❯ kubectl logs site-spartacux-fr-b2c-58b4d579-6q9s7 -n spartacux -c consul-connect-init
Wed Aug 11 14:19:54 UTC 2021
{"@level":"info","@message":"Consul login complete","@timestamp":"2021-08-11T14:19:54.121527Z"}
{"@level":"error","@message":"Unable to get Agent services","@timestamp":"2021-08-11T14:19:54.122403Z","error":"Unexpected response code: 403 (ACL not found)"}
{"@level":"error","@message":"Unable to get Agent services","@timestamp":"2021-08-11T14:19:55.123105Z","error":"Unexpected response code: 403 (ACL not found)"}
{"@level":"error","@message":"Unable to get Agent services","@timestamp":"2021-08-11T14:19:56.123835Z","error":"Unexpected response code: 403 (ACL not found)"}
{"@level":"error","@message":"Unable to get Agent services","@timestamp":"2021-08-11T14:19:57.124566Z","error":"Unexpected response code: 403 (ACL not found)"}
{"@level":"error","@message":"Unable to get Agent services","@timestamp":"2021-08-11T14:19:58.125249Z","error":"Unexpected response code: 403 (ACL not found)"}
  • Compare the timestamp of the "deleting ACL token for pod" message in the controller logs with the "Consul login complete" message in the init container logs: the token is deleted just after being created.
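
The two logs use different timestamp formats (epoch seconds in the controller, RFC 3339 in the init container), so the comparison is easier with a small helper. This is only a convenience sketch; `delayMillis` is a name I made up, and the two values in `main` are copied from the log lines above for pod 6q9s7.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// delayMillis returns how many milliseconds after loginTime (RFC 3339) the
// controller log entry with epoch-seconds timestamp ts was written.
func delayMillis(loginTime string, ts float64) (int64, error) {
	login, err := time.Parse(time.RFC3339Nano, loginTime)
	if err != nil {
		return 0, err
	}
	// Split the float into whole seconds and a fractional part.
	sec, frac := math.Modf(ts)
	logged := time.Unix(int64(sec), int64(frac*1e9))
	return logged.Sub(login).Milliseconds(), nil
}

func main() {
	// "Consul login complete" vs. "deleting ACL token for pod".
	ms, err := delayMillis("2021-08-11T14:19:54.121527Z", 1628691594.6090574)
	if err != nil {
		panic(err)
	}
	fmt.Printf("token deleted about %d ms after login\n", ms)
}
```

For the values above the gap comes out at roughly half a second, which matches the init container seeing "ACL not found" right after a successful login.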

Logs

Expected behavior

  • The ACL token should not be deleted during a reconciliation loop if the pod is still present
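
One way this could be expressed, purely as an illustration and not necessarily what the maintainers will implement: skip any token that was created after the pod list was built, since that list cannot say anything about such a pod. The `aclToken` type, the `orphanedTokens` helper, and the pod names below are all hypothetical, and the sketch assumes the token list exposes a creation time.

```go
package main

import (
	"fmt"
	"time"
)

// aclToken is a hypothetical, simplified view of an ACL token.
type aclToken struct {
	PodName    string
	CreateTime time.Time
}

// orphanedTokens returns tokens whose pod is absent from endpointPods, but
// ignores tokens created at or after snapshotTime, because endpointPods
// cannot know about pods that logged in after it was built.
func orphanedTokens(tokens []aclToken, endpointPods map[string]bool, snapshotTime time.Time) []aclToken {
	var orphaned []aclToken
	for _, t := range tokens {
		if endpointPods[t.PodName] {
			continue // pod is still present
		}
		if !t.CreateTime.Before(snapshotTime) {
			continue // too new to judge from this snapshot
		}
		orphaned = append(orphaned, t)
	}
	return orphaned
}

func main() {
	snapshot := time.Date(2021, 8, 11, 14, 19, 54, 0, time.UTC)
	endpointPods := map[string]bool{"pod-a": true}
	tokens := []aclToken{
		{PodName: "pod-a", CreateTime: snapshot.Add(-time.Hour)},
		{PodName: "pod-gone", CreateTime: snapshot.Add(-time.Hour)},
		{PodName: "pod-new", CreateTime: snapshot.Add(100 * time.Millisecond)},
	}
	for _, t := range orphanedTokens(tokens, endpointPods, snapshot) {
		fmt.Println("would delete token for", t.PodName)
	}
}
```

A caveat with this approach: the token creation time and the snapshot time come from different clocks (Consul server vs. controller), so a real implementation would probably need some tolerance for clock skew, or a different guard altogether.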

Environment details

  • consul-k8s version: latest master

Additional Context

Thank you for your help, let me know if you need more information!

@adrien-f adrien-f added the type/bug Something isn't working label Aug 11, 2021
@ishustava
Contributor

Hey @adrien-f

Thanks so much for reporting this bug and giving such a detailed explanation. We'll work on a fix for it!

@adrien-f
Author

Thank you @ishustava ! This was quick 😄 I confirm the ACL tokens are not removed too soon now during a rollout or big deployment 🎉

lawliet89 pushed a commit to lawliet89/consul-k8s that referenced this issue Sep 13, 2021