
unable to scale self-hosted etcd #346

Closed · janwillies opened this issue Mar 2, 2017 · 6 comments

@janwillies (Contributor)

I'm unable to scale the etcd cluster that I brought up with bootkube.

Bootkube version: master from today, with @ericchiang's RBAC PR applied
Platform: Ubuntu 16.04.2 LTS

./bootkube-rbac render --asset-dir rbac --experimental-self-hosted-etcd --etcd-servers=http://10.3.0.15:2379 --api-servers=https://10.7.183.59:443

sudo hyperkube kubelet --kubeconfig=/etc/kubernetes/kubeconfig \
    --require-kubeconfig \
    --cni-conf-dir=/etc/kubernetes/cni/net.d \
    --network-plugin=cni \
    --lock-file=/var/run/lock/kubelet.lock \
    --exit-on-lock-contention \
    --pod-manifest-path=/etc/kubernetes/manifests \
    --allow-privileged \
    --node-labels=master=true \
    --minimum-container-ttl-duration=6m0s \
    --cluster_dns=10.3.0.10 \
    --cluster_domain=cluster.local \
    --hostname-override=10.7.183.59

sudo ./bootkube-rbac start --asset-dir=./rbac --experimental-self-hosted-etcd --etcd-server=http://127.0.0.1:12379

Then I joined a second master node and tried scaling the etcd cluster:

curl -H 'Content-Type: application/json' -X PUT --data @scale-etcd.json http://127.0.0.1:8080/apis/etcd.coreos.com/v1beta1/namespaces/kube-system/clusters/kube-etcd

{
  "apiVersion": "etcd.coreos.com/v1beta1",
  "kind": "Cluster",
  "metadata": {
    "name": "kube-etcd",
    "namespace": "kube-system"
  },
  "spec": {
    "size": 3
  }
}

The Kubernetes cluster becomes unavailable, and I see this repeating in the etcd container:

2017-03-02 00:15:08.575611 W | rafthttp: health check for peer ab426eb01c5042b6 could not connect: dial tcp: lookup kube-etcd-0001 on 8.8.8.8:53: no such host

It's trying to resolve the cluster-internal DNS name through the host's resolver, probably because the etcd pods run in the host network namespace.
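
A quick way to confirm this from the host (a sketch; the fully-qualified service name is an assumption about how the operator exposes members, and 10.3.0.10 is the cluster DNS address from the kubelet flags above):

# The host resolver (8.8.8.8 in this setup) knows nothing about the member name:
nslookup kube-etcd-0001

# Asking the cluster DNS directly should succeed if a kube-etcd-0001 service exists:
nslookup kube-etcd-0001.kube-system.svc.cluster.local 10.3.0.10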

cc @hongchaodeng and @xiang90

@xiang90 (Contributor) commented Mar 2, 2017

@janwillies How many nodes do you have in your Kubernetes cluster? Can you get the log from the etcd operator via kubectl logs?

@hongchaodeng (Contributor)

It's not patching: you are overwriting the self-hosted etcd spec.

@janwillies (Contributor, Author) commented Mar 2, 2017

I have only two nodes, but this shouldn't matter because it already fails when starting the second etcd node.

etcd-operator logs:

time="2017-03-02T01:16:41Z" level=info msg="Start reconciling" cluster-name=kube-etcd pkg=cluster
time="2017-03-02T01:16:41Z" level=info msg="Finish reconciling" cluster-name=kube-etcd pkg=cluster
time="2017-03-02T01:16:44Z" level=info msg="spec update: from: {1 3.1.0 false <nil> <nil> <nil> 0xc4203beaa0} to: {2 3.1.0 false <nil> <nil> <nil> <nil>}" cluster-name=kube-etcd pkg=cluster
time="2017-03-02T01:16:52Z" level=info msg="Start reconciling" cluster-name=kube-etcd pkg=cluster
time="2017-03-02T01:16:52Z" level=info msg="running members: kube-etcd-0000" cluster-name=kube-etcd pkg=cluster
time="2017-03-02T01:16:52Z" level=info msg="cluster membership: kube-etcd-0000" cluster-name=kube-etcd pkg=cluster
time="2017-03-02T01:16:59Z" level=error msg="fail to create member (kube-etcd-0001): pods \"kube-etcd-0001\" is forbidden: rpc error: code = 14 desc = etcdserver: request timed out" cluster-name=kube-etcd pkg=cluster
time="2017-03-02T01:16:59Z" level=info msg="Finish reconciling" cluster-name=kube-etcd pkg=cluster
time="2017-03-02T01:16:59Z" level=error msg="failed to reconcile: pods \"kube-etcd-0001\" is forbidden: rpc error: code = 14 desc = etcdserver: request timed out" cluster-name=kube-etcd pkg=cluster
E0302 01:17:01.314632       7 election.go:259] Failed to update lock: rpc error: code = 14 desc = etcdserver: request timed out
E0302 01:17:08.319631       7 election.go:259] Failed to update lock: rpc error: code = 14 desc = etcdserver: request timed out
time="2017-03-02T01:17:08Z" level=fatal msg="leader election lost"

etcd pod logs:

2017-03-02 01:16:53.256329 W | etcdserver: failed to reach the peerURL(http://kube-etcd-0001:2380) of member 6fbfd0e742d55482 (Get http://kube-etcd-0001:2380/version: dial tcp: lookup kube-etcd-0001 on 8.8.8.8:53: no such host)
2017-03-02 01:16:53.256360 W | etcdserver: cannot get the version of member 6fbfd0e742d55482 (Get http://kube-etcd-0001:2380/version: dial tcp: lookup kube-etcd-0001 on 8.8.8.8:53: no such host)
2017-03-02 01:16:54.451349 I | raft: ae7c18797a0baa96 is starting a new election at term 7
2017-03-02 01:16:54.451391 I | raft: ae7c18797a0baa96 became candidate at term 8

@hongchaodeng what do you mean by "not patching"? What else should I use to scale the etcd cluster?

@hongchaodeng (Contributor)

Hi @janwillies. Sure, let me explain further.

First of all, this replaces the entire cluster spec rather than patching it:

curl -H 'Content-Type: application/json' -X PUT --data @scale-etcd.json http://127.0.0.1:8080/apis/etcd.coreos.com/v1beta1/namespaces/kube-system/clusters/kube-etcd

{
  "apiVersion": "etcd.coreos.com/v1beta1",
  "kind": "Cluster",
  "metadata": {
    "name": "kube-etcd",
    "namespace": "kube-system"
  },
  "spec": {
    "size": 3
  }
}

What I would recommend is a client-side reconciliation loop (see the sketch after this list):

  1. Get the current cluster TPR via kubectl --kubeconfig=xxx get cluster.etcd kube-etcd -n kube-system -o json.
  2. Update spec.size to 3.
  3. Do the PUT update as you did above, sending the full object back.
  4. If it fails due to a version conflict, go back to step 1; otherwise you are done.
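
Here is a minimal sketch of that loop, assuming jq is available and the API server is reachable without auth on 127.0.0.1:8080, as in the earlier commands:

# Fetch the current object, bump spec.size, and PUT the full object back.
# The metadata.resourceVersion carried along in the fetched JSON is what lets
# the server reject the write if someone else updated the object in between.
until kubectl -n kube-system get cluster.etcd kube-etcd -o json \
        | jq '.spec.size = 3' > etcd.json \
      && curl --fail -H 'Content-Type: application/json' -X PUT --data @etcd.json \
        http://127.0.0.1:8080/apis/etcd.coreos.com/v1beta1/namespaces/kube-system/clusters/kube-etcd
do
  echo "update conflicted, retrying" >&2
  sleep 1
done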

Two more notes:

  • I think upstream has fixed kubectl apply for TPRs recently. You might want to try it (see the example below).
  • I don't think upstream supports patching TPR objects right now. That's why we need the above reconciliation loop on the client side.
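
If kubectl apply does work for TPRs in your version (an assumption worth verifying), the whole loop collapses into a single command against the manifest shown earlier:

kubectl -n kube-system apply -f scale-etcd.json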

@xiang90 (Contributor) commented Mar 2, 2017

> I have only two nodes

That is unrelated to your current failure; what hongchao suggested is the root cause. But please note that two self-hosted etcd members cannot run on the same physical node, so you need at least 3 nodes to scale up to 3.
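
A quick sanity check before scaling (plain kubectl, nothing operator-specific) is to make sure there are at least as many Ready nodes as the target size, given the one-member-per-node constraint above:

kubectl get nodes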

@janwillies (Contributor, Author)

cool, it's working now:

hyperkube kubectl --namespace=kube-system get cluster.etcd kube-etcd -o json > etcd.json && \
vim etcd.json && \
curl -H 'Content-Type: application/json' -X PUT --data @etcd.json http://127.0.0.1:8080/apis/etcd.coreos.com/v1beta1/namespaces/kube-system/clusters/kube-etcd
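
To confirm the scale-up took effect, one can watch the new member pods come up and re-read the spec (filtering by name, since the exact labels the operator puts on its pods are not shown in this thread; jq is assumed to be installed):

kubectl --namespace=kube-system get pods | grep kube-etcd
kubectl --namespace=kube-system get cluster.etcd kube-etcd -o json | jq .spec.size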

appreciate the help @xiang90 and @hongchaodeng!
