This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

Consul 1.0.0 Chart fails on EKS+k8s 1.11 sporadically #10729

Closed
erkanerol opened this issue Jan 17, 2019 · 4 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@erkanerol

erkanerol commented Jan 17, 2019

Is this a request for help?:
Yes

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT

Version of Helm and Kubernetes:

kubectl version --short
Client Version: v1.10.3
Server Version: v1.11.5-eks-6bad6d

helm version --short
Client: v2.11.0+g2e55dbe
Server: v2.11.0+g2e55dbe

Which chart:

consul 1.0.0

What happened:
Since DNS does not resolve reliably during startup, Consul cannot complete its initialization and fails with errors:

  • Failed to join 10.240.168.87: Remote state is encrypted and encryption is not configured
  • Failed to resolve -retry-join: lookup -retry-join: no such host
    2019/01/17 14:55:31 [WARN] agent: Join failed: , retrying in 30s
    2019/01/17 14:55:36 [ERR] agent: Coordinate update error: No cluster leader
    2019/01/17 14:55:54 [ERR] agent: failed to sync remote state: No cluster leader
    2019/01/17 14:56:01 [INFO] agent: (LAN) joining: [10.240.168.87 -retry-join]
    2019/01/17 14:56:01 [WARN] memberlist: Failed to resolve -retry-join: lookup -retry-join: no such host
    2019/01/17 14:56:01 [INFO] agent: (LAN) joined: 0 Err: 2 error(s) occurred:

After the first restart, it cannot recover its state and enters CrashLoopBackOff. The error it gives is as follows:

==> WARNING: LAN keyring exists but -encrypt given, using keyring
==> WARNING: WAN keyring exists but -encrypt given, using keyring
==> WARNING: Expect Mode enabled, expecting 3 servers
==> Starting Consul agent...
==> Error starting agent: Failed to start Consul server: Failed to start Raft: recovery failed: refused to recover cluster with no initial state, this is probably an operator error

See: https://support.hashicorp.com/hc/en-us/articles/115015603408-Consul-Errors-And-Warnings

What you expected to happen:
I cannot upgrade the Consul Helm chart in the product I am using. How can I fix this issue? Can it be fixed with a small change in the StatefulSet definition?

How to reproduce it (as minimally and precisely as possible):

Get an EKS cluster with k8s 1.11 and install the consul 1.0.0 chart repeatedly (see the sketch below). The failure does not happen every time, but its frequency is not low. Note that it occurs only during the first initialization; after a successful startup, it does not happen again.
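
A hypothetical loop to exercise this; the chart location (stable/consul), the release names, and the release pod label are assumptions, and the syntax is Helm v2:

for run in $(seq 1 10); do
  helm install --name consul-repro-${run} stable/consul --version 1.0.0
  sleep 300   # give the cluster time to either form or start crash-looping
  kubectl get pods -l release=consul-repro-${run}   # failing runs show CrashLoopBackOff
  helm delete --purge consul-repro-${run}
done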

Anything else we need to know:

@erkanerol

Related to hashicorp/consul-helm#76

@erkanerol

erkanerol commented Jan 19, 2019

Let me explain the problem and then a possible workaround:

In the consul 1.0.0 chart, there is a section that waits until all peers are up and the DNS service resolves the peers' addresses properly:

# Wait until each peer's DNS name resolves at least once.
for i in $(seq 0 $((${INITIAL_CLUSTER_SIZE} - 1))); do
  while true; do
    echo "Waiting for ${STATEFULSET_NAME}-${i}.${STATEFULSET_NAME} to come up"
    ping -W 1 -c 1 ${STATEFULSET_NAME}-${i}.${STATEFULSET_NAME}.${STATEFULSET_NAMESPACE}.svc > /dev/null && break
    sleep 1s
  done
done

# Build the -retry-join arguments from the IPs the peers resolve to.
PEERS=""
for i in $(seq 0 $((${INITIAL_CLUSTER_SIZE} - 1))); do
  PEERS="${PEERS}${PEERS:+ } -retry-join $(ping -c 1 ${STATEFULSET_NAME}-${i}.${STATEFULSET_NAME}.${STATEFULSET_NAMESPACE}.svc | awk -F'[()]' '/PING/{print $2}')"
done

After this section, it starts Consul with the IP list of the peers, assuming the peers are reachable at that point.

However, there is an issue in the DNS service/network of EKS+k8s 1.11. Although the DNS service resolves the IP addresses once, it may keep failing again right after this section. I ran some benchmark tests on other platforms, including EKS+k8s 1.10: their DNS services become stable within 2-3 seconds after all pods are up and running, whereas EKS+k8s 1.11 reaches the same stability only after 25-30 seconds. I think it is not only a DNS issue; there may be a problem in the CNI as well. A rough probe is sketched below.
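
A rough way to observe this from inside a pod (a sketch, not the exact benchmark; it assumes ping is available and reuses the chart's variables):

TARGET="${STATEFULSET_NAME}-0.${STATEFULSET_NAME}.${STATEFULSET_NAMESPACE}.svc"
# Probe the peer's DNS name once per second and log intermittent failures.
for t in $(seq 1 60); do
  if ping -W 1 -c 1 "${TARGET}" > /dev/null 2>&1; then
    echo "t=${t}s resolve OK"
  else
    echo "t=${t}s resolve FAILED"
  fi
  sleep 1s
done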

If Consul is started during this window, it cannot reach consensus and fails. After the first restart, it comes back up with a peer list plus an inconsistent raft.db, a case Consul refuses to handle (see the Consul documentation).

A workaround
You may extend the for loop so that it requires the DNS service to resolve the peer addresses n consecutive times without any failure. That indicates DNS and the underlying network have reached stability and Consul can be started; after the first initialization, Consul tolerates network failures. I set n to 10 and it works for me. A sketch of this change follows.
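
A minimal sketch, assuming the same STATEFULSET_NAME, STATEFULSET_NAMESPACE and INITIAL_CLUSTER_SIZE variables as the chart; REQUIRED_SUCCESSES is a placeholder name for n:

REQUIRED_SUCCESSES=10
for i in $(seq 0 $((${INITIAL_CLUSTER_SIZE} - 1))); do
  PEER="${STATEFULSET_NAME}-${i}.${STATEFULSET_NAME}.${STATEFULSET_NAMESPACE}.svc"
  SUCCESSES=0
  # Require n consecutive successful resolutions; any failure resets the counter.
  while [ "${SUCCESSES}" -lt "${REQUIRED_SUCCESSES}" ]; do
    if ping -W 1 -c 1 "${PEER}" > /dev/null; then
      SUCCESSES=$((SUCCESSES + 1))
    else
      echo "Waiting for ${PEER} to resolve consistently"
      SUCCESSES=0
    fi
    sleep 1s
  done
done

The PEERS list can then be built exactly as in the original snippet.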

@stale

stale bot commented Feb 18, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 18, 2019
@stale

stale bot commented Mar 4, 2019

This issue is being automatically closed due to inactivity.

@stale stale bot closed this as completed Mar 4, 2019