Consul 1.0.0 Chart fails on EKS+k8s 1.11 sporadically #10729
Comments
Related to hashicorp/consul-helm#76
Let me explain the problem and then a possible workaround. In the consul 1.0.0 chart, there is a section that waits until all peers are up and the DNS service resolves the peers' addresses properly: charts/stable/consul/templates/consul.yaml, lines 201 to 212 in 22c8cfc.
After this section, the chart starts Consul with the list of peer IPs, assuming the peers are reachable at that point. However, there is an issue with the DNS service and network on EKS with Kubernetes 1.11: although DNS resolves the IP addresses once, resolution may fail repeatedly right after the wait section completes.

I ran some benchmark tests on other platforms, including EKS with Kubernetes 1.10. On those platforms, DNS becomes stable 2-3 seconds after all pods are up and running, whereas on EKS with Kubernetes 1.11 it takes 25-30 seconds to reach the same stability. I don't think this is only a DNS issue; there may be a problem in the CNI as well.

If Consul is started during this period, it cannot reach consensus and fails. After the first restart, it starts with a peer list plus an inconsistent raft.db. Consul refuses to handle this case (see Consul Doc).

A workaround
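As an illustration of waiting for DNS to become stable rather than merely resolvable once, here is a minimal Python sketch. It is not the workaround from this comment and not the chart's actual script; the function names, the peer hostnames, and the default values are all assumptions made for the example. The idea is to require several consecutive successful lookups of every peer, and to reset the counter on any failure, before letting Consul start.

```python
import socket
import time
from typing import Callable, Iterable, Optional


def resolve(host: str) -> Optional[str]:
    """Return an IPv4 address for host, or None if the lookup fails."""
    try:
        return socket.gethostbyname(host)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def wait_for_stable_dns(
    peers: Iterable[str],
    required_streak: int = 10,   # assumption: 10 checks x 3 s covers the ~30 s window
    max_attempts: int = 60,      # assumption: give up after this many rounds
    interval: float = 3.0,       # seconds between rounds
    resolver: Callable[[str], Optional[str]] = resolve,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once every peer resolved in `required_streak` consecutive rounds.

    A single failed lookup resets the streak, so a one-off successful
    resolution is not treated as a stable DNS service.
    """
    peers = list(peers)
    streak = 0
    for attempt in range(max_attempts):
        if all(resolver(peer) is not None for peer in peers):
            streak += 1
            if streak >= required_streak:
                return True
        else:
            streak = 0
        if attempt < max_attempts - 1:
            sleep(interval)
    return False


# Example call (hostnames are hypothetical and depend on the release name):
# peers = [f"consul-{i}.consul.default.svc.cluster.local" for i in range(3)]
# if not wait_for_stable_dns(peers):
#     raise SystemExit("DNS did not stabilise; not starting Consul")
```

In a real StatefulSet the same logic would have to be expressed in whatever shell the container image provides; the Python form is used here only to make the streak-and-reset behaviour explicit.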
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
This issue is being automatically closed due to inactivity.
Is this a request for help?:
Yes
Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT
Version of Helm and Kubernetes:
kubectl version --short
Client Version: v1.10.3
Server Version: v1.11.5-eks-6bad6d
helm version --short
Client: v2.11.0+g2e55dbe
Server: v2.11.0+g2e55dbe
Which chart:
consul 1.0.0
What happened:
Because DNS is unreliable during startup, Consul cannot complete initialization and reports an error.
After the first restart, it cannot recover its state and enters CrashLoopBackOff. The error is as follows:
See: https://support.hashicorp.com/hc/en-us/articles/115015603408-Consul-Errors-And-Warnings
What you expected to happen:
I cannot upgrade the Consul Helm chart in the product I am using. How can I fix this issue? Can it be fixed with a small change to the StatefulSet definition?
How to reproduce it (as minimally and precisely as possible):
Create an EKS cluster running Kubernetes 1.11 and install the consul 1.0.0 chart repeatedly. The failure does not happen every time, but it happens fairly often. Note that it occurs only during the first initialization; after a successful startup, it does not occur again.
Anything else we need to know: