This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

Consul 1.0.0 Chart fails on EKS+k8s 1.11 sporadically #10729

Closed
erkanerol opened this issue Jan 17, 2019 · 4 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@erkanerol

erkanerol commented Jan 17, 2019

Is this a request for help?:
Yes

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT

Version of Helm and Kubernetes:

kubectl version --short
Client Version: v1.10.3
Server Version: v1.11.5-eks-6bad6d

helm version --short
Client: v2.11.0+g2e55dbe
Server: v2.11.0+g2e55dbe

Which chart:

consul 1.0.0

What happened:
Since DNS does not resolve reliably during startup, Consul cannot complete its initialization and fails with errors:

  • Failed to join 10.240.168.87: Remote state is encrypted and encryption is not configured
  • Failed to resolve -retry-join: lookup -retry-join: no such host
    2019/01/17 14:55:31 [WARN] agent: Join failed: , retrying in 30s
    2019/01/17 14:55:36 [ERR] agent: Coordinate update error: No cluster leader
    2019/01/17 14:55:54 [ERR] agent: failed to sync remote state: No cluster leader
    2019/01/17 14:56:01 [INFO] agent: (LAN) joining: [10.240.168.87 -retry-join]
    2019/01/17 14:56:01 [WARN] memberlist: Failed to resolve -retry-join: lookup -retry-join: no such host
    2019/01/17 14:56:01 [INFO] agent: (LAN) joined: 0 Err: 2 error(s) occurred:

After the first restart, it cannot recover its state and enters CrashLoopBackOff. The error it gives is as follows:

==> WARNING: LAN keyring exists but -encrypt given, using keyring
==> WARNING: WAN keyring exists but -encrypt given, using keyring
==> WARNING: Expect Mode enabled, expecting 3 servers
==> Starting Consul agent...
==> Error starting agent: Failed to start Consul server: Failed to start Raft: recovery failed: refused to recover cluster with no initial state, this is probably an operator error

See: https://support.hashicorp.com/hc/en-us/articles/115015603408-Consul-Errors-And-Warnings

What you expected to happen:
I cannot upgrade the Consul Helm chart in the product I am using. How can I fix this issue? Can it be fixed with a small change in the StatefulSet definition?

How to reproduce it (as minimally and precisely as possible):

Get an EKS cluster with k8s 1.11 and install the consul 1.0.0 chart repeatedly (see the sketch below). The failure does not happen every time, but its frequency is not low. Note that it occurs only during the first initialization; after a successful startup, it does not happen again.
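
A hypothetical loop to exercise this; the chart location (stable/consul), the release names, and the release pod label are assumptions, and the syntax is Helm v2:

for run in $(seq 1 10); do
  helm install --name consul-repro-${run} stable/consul --version 1.0.0
  sleep 300   # give the cluster time to either form or start crash-looping
  kubectl get pods -l release=consul-repro-${run}   # failing runs show CrashLoopBackOff
  helm delete --purge consul-repro-${run}
done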

Anything else we need to know:

@erkanerol

Related to hashicorp/consul-helm#76

@erkanerol

erkanerol commented Jan 19, 2019

Let me explain the problem and then a possible workaround:

In the consul 1.0.0 chart, there is a section that waits until all peers are up and the DNS service resolves the peers' addresses properly:

# Wait until each peer's DNS name resolves at least once.
for i in $(seq 0 $((${INITIAL_CLUSTER_SIZE} - 1))); do
  while true; do
    echo "Waiting for ${STATEFULSET_NAME}-${i}.${STATEFULSET_NAME} to come up"
    ping -W 1 -c 1 ${STATEFULSET_NAME}-${i}.${STATEFULSET_NAME}.${STATEFULSET_NAMESPACE}.svc > /dev/null && break
    sleep 1s
  done
done

# Build the -retry-join arguments from the IPs the peers resolve to.
PEERS=""
for i in $(seq 0 $((${INITIAL_CLUSTER_SIZE} - 1))); do
  PEERS="${PEERS}${PEERS:+ } -retry-join $(ping -c 1 ${STATEFULSET_NAME}-${i}.${STATEFULSET_NAME}.${STATEFULSET_NAMESPACE}.svc | awk -F'[()]' '/PING/{print $2}')"
done

After this section, it starts Consul with the IP list of the peers, assuming the peers are reachable at that point.

However, there is an issue in the DNS service/network of EKS+k8s 1.11. Although the DNS service resolves the IP addresses once, it may keep failing again right after this section. I ran some benchmark tests on other platforms, including EKS+k8s 1.10: their DNS services become stable within 2-3 seconds after all pods are up and running, whereas EKS+k8s 1.11 reaches the same stability only after 25-30 seconds. I think it is not only a DNS issue; there may be a problem in the CNI as well. A rough probe is sketched below.
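
A rough way to observe this from inside a pod (a sketch, not the exact benchmark; it assumes ping is available and reuses the chart's variables):

TARGET="${STATEFULSET_NAME}-0.${STATEFULSET_NAME}.${STATEFULSET_NAMESPACE}.svc"
# Probe the peer's DNS name once per second and log intermittent failures.
for t in $(seq 1 60); do
  if ping -W 1 -c 1 "${TARGET}" > /dev/null 2>&1; then
    echo "t=${t}s resolve OK"
  else
    echo "t=${t}s resolve FAILED"
  fi
  sleep 1s
done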

If Consul is started during this window, it cannot reach consensus and fails. After the first restart, it comes back up with a peer list plus an inconsistent raft.db, a case Consul refuses to handle (see the Consul documentation).

A workaround
You may extend the for loop so that it requires the DNS service to resolve the peer addresses n consecutive times without any failure. That indicates DNS and the underlying network have reached stability and Consul can be started; after the first initialization, Consul tolerates network failures. I set n to 10 and it works for me. A sketch of this change follows.
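
A minimal sketch, assuming the same STATEFULSET_NAME, STATEFULSET_NAMESPACE and INITIAL_CLUSTER_SIZE variables as the chart; REQUIRED_SUCCESSES is a placeholder name for n:

REQUIRED_SUCCESSES=10
for i in $(seq 0 $((${INITIAL_CLUSTER_SIZE} - 1))); do
  PEER="${STATEFULSET_NAME}-${i}.${STATEFULSET_NAME}.${STATEFULSET_NAMESPACE}.svc"
  SUCCESSES=0
  # Require n consecutive successful resolutions; any failure resets the counter.
  while [ "${SUCCESSES}" -lt "${REQUIRED_SUCCESSES}" ]; do
    if ping -W 1 -c 1 "${PEER}" > /dev/null; then
      SUCCESSES=$((SUCCESSES + 1))
    else
      echo "Waiting for ${PEER} to resolve consistently"
      SUCCESSES=0
    fi
    sleep 1s
  done
done

The PEERS list can then be built exactly as in the original snippet.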

@stale

stale bot commented Feb 18, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 18, 2019
@stale

stale bot commented Mar 4, 2019

This issue is being automatically closed due to inactivity.

@stale stale bot closed this as completed Mar 4, 2019