
unable to scale self-hosted etcd #346

Closed · janwillies opened this issue Mar 2, 2017 · 6 comments

@janwillies (Contributor)

I'm unable to scale the etcd cluster that I brought up with bootkube.

Bootkube version: master from today, with @ericchiang's RBAC PR applied
Platform: Ubuntu 16.04.2 LTS

./bootkube-rbac render --asset-dir rbac --experimental-self-hosted-etcd --etcd-servers=http://10.3.0.15:2379 --api-servers=https://10.7.183.59:443

sudo hyperkube kubelet --kubeconfig=/etc/kubernetes/kubeconfig \
    --require-kubeconfig \
    --cni-conf-dir=/etc/kubernetes/cni/net.d \
    --network-plugin=cni \
    --lock-file=/var/run/lock/kubelet.lock \
    --exit-on-lock-contention \
    --pod-manifest-path=/etc/kubernetes/manifests \
    --allow-privileged \
    --node-labels=master=true \
    --minimum-container-ttl-duration=6m0s \
    --cluster_dns=10.3.0.10 \
    --cluster_domain=cluster.local \
    --hostname-override=10.7.183.59

sudo ./bootkube-rbac start --asset-dir=./rbac --experimental-self-hosted-etcd --etcd-server=http://127.0.0.1:12379

Then I joined a second master node and tried scaling the etcd cluster:

curl -H 'Content-Type: application/json' -X PUT --data @scale-etcd.json http://127.0.0.1:8080/apis/etcd.coreos.com/v1beta1/namespaces/kube-system/clusters/kube-etcd

{
  "apiVersion": "etcd.coreos.com/v1beta1",
  "kind": "Cluster",
  "metadata": {
    "name": "kube-etcd",
    "namespace": "kube-system"
  },
  "spec": {
    "size": 3
  }
}

The Kubernetes cluster becomes unavailable, and I see this repeating in the etcd container:

2017-03-02 00:15:08.575611 W | rafthttp: health check for peer ab426eb01c5042b6 could not connect: dial tcp: lookup kube-etcd-0001 on 8.8.8.8:53: no such host

It's trying to resolve the cluster-internal DNS name through the host's resolver, probably because the etcd pods run in the host network namespace.
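
A quick way to confirm this from the host (a sketch; the fully-qualified service name is an assumption about how the operator exposes members, and 10.3.0.10 is the cluster DNS address from the kubelet flags above):

# The host resolver (8.8.8.8 in this setup) knows nothing about the member name:
nslookup kube-etcd-0001

# Asking the cluster DNS directly should succeed if a kube-etcd-0001 service exists:
nslookup kube-etcd-0001.kube-system.svc.cluster.local 10.3.0.10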

cc @hongchaodeng and @xiang90

@xiang90 (Contributor) commented Mar 2, 2017

@janwillies How many nodes do you have in your Kubernetes cluster? Can you get the log from the etcd operator via kubectl logs?

@hongchaodeng (Contributor)

It's not patching: you are overwriting the self-hosted etcd spec.

@janwillies (Contributor, Author) commented Mar 2, 2017

I have only two nodes, but this shouldn't matter because it already fails when starting the second etcd node.

etcd-operator logs:

time="2017-03-02T01:16:41Z" level=info msg="Start reconciling" cluster-name=kube-etcd pkg=cluster
time="2017-03-02T01:16:41Z" level=info msg="Finish reconciling" cluster-name=kube-etcd pkg=cluster
time="2017-03-02T01:16:44Z" level=info msg="spec update: from: {1 3.1.0 false <nil> <nil> <nil> 0xc4203beaa0} to: {2 3.1.0 false <nil> <nil> <nil> <nil>}" cluster-name=kube-etcd pkg=cluster
time="2017-03-02T01:16:52Z" level=info msg="Start reconciling" cluster-name=kube-etcd pkg=cluster
time="2017-03-02T01:16:52Z" level=info msg="running members: kube-etcd-0000" cluster-name=kube-etcd pkg=cluster
time="2017-03-02T01:16:52Z" level=info msg="cluster membership: kube-etcd-0000" cluster-name=kube-etcd pkg=cluster
time="2017-03-02T01:16:59Z" level=error msg="fail to create member (kube-etcd-0001): pods \"kube-etcd-0001\" is forbidden: rpc error: code = 14 desc = etcdserver: request timed out" cluster-name=kube-etcd pkg=cluster
time="2017-03-02T01:16:59Z" level=info msg="Finish reconciling" cluster-name=kube-etcd pkg=cluster
time="2017-03-02T01:16:59Z" level=error msg="failed to reconcile: pods \"kube-etcd-0001\" is forbidden: rpc error: code = 14 desc = etcdserver: request timed out" cluster-name=kube-etcd pkg=cluster
E0302 01:17:01.314632       7 election.go:259] Failed to update lock: rpc error: code = 14 desc = etcdserver: request timed out
E0302 01:17:08.319631       7 election.go:259] Failed to update lock: rpc error: code = 14 desc = etcdserver: request timed out
time="2017-03-02T01:17:08Z" level=fatal msg="leader election lost"

etcd pod logs:

2017-03-02 01:16:53.256329 W | etcdserver: failed to reach the peerURL(http://kube-etcd-0001:2380) of member 6fbfd0e742d55482 (Get http://kube-etcd-0001:2380/version: dial tcp: lookup kube-etcd-0001 on 8.8.8.8:53: no such host)
2017-03-02 01:16:53.256360 W | etcdserver: cannot get the version of member 6fbfd0e742d55482 (Get http://kube-etcd-0001:2380/version: dial tcp: lookup kube-etcd-0001 on 8.8.8.8:53: no such host)
2017-03-02 01:16:54.451349 I | raft: ae7c18797a0baa96 is starting a new election at term 7
2017-03-02 01:16:54.451391 I | raft: ae7c18797a0baa96 became candidate at term 8

@hongchaodeng what do you mean by "not patching"? What else should I use to scale the etcd cluster?

@hongchaodeng (Contributor)

Hi @janwillies. Sure, let me explain further.

First of all, this replaces the entire cluster spec rather than patching it:

curl -H 'Content-Type: application/json' -X PUT --data @scale-etcd.json http://127.0.0.1:8080/apis/etcd.coreos.com/v1beta1/namespaces/kube-system/clusters/kube-etcd

{
  "apiVersion": "etcd.coreos.com/v1beta1",
  "kind": "Cluster",
  "metadata": {
    "name": "kube-etcd",
    "namespace": "kube-system"
  },
  "spec": {
    "size": 3
  }
}

What I would recommend is a client-side reconciliation loop (see the sketch after this list):

  1. Get the current cluster TPR via kubectl --kubeconfig=xxx get cluster.etcd kube-etcd -n kube-system -o json.
  2. Update spec.size to 3.
  3. Do the PUT update as you did above, sending the full object back.
  4. If it fails due to a version conflict, go back to step 1; otherwise you are done.
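
Here is a minimal sketch of that loop, assuming jq is available and the API server is reachable without auth on 127.0.0.1:8080, as in the earlier commands:

# Fetch the current object, bump spec.size, and PUT the full object back.
# The metadata.resourceVersion carried along in the fetched JSON is what lets
# the server reject the write if someone else updated the object in between.
until kubectl -n kube-system get cluster.etcd kube-etcd -o json \
        | jq '.spec.size = 3' > etcd.json \
      && curl --fail -H 'Content-Type: application/json' -X PUT --data @etcd.json \
        http://127.0.0.1:8080/apis/etcd.coreos.com/v1beta1/namespaces/kube-system/clusters/kube-etcd
do
  echo "update conflicted, retrying" >&2
  sleep 1
done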

Two more notes:

  • I think upstream has fixed kubectl apply for TPRs recently. You might want to try it (see the example below).
  • I don't think upstream supports patching TPR objects right now. That's why we need the above reconciliation loop on the client side.
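
If kubectl apply does work for TPRs in your version (an assumption worth verifying), the whole loop collapses into a single command against the manifest shown earlier:

kubectl -n kube-system apply -f scale-etcd.json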

@xiang90 (Contributor) commented Mar 2, 2017

> I have only two nodes

That is unrelated to your current failure; what hongchao suggested is the root cause. But please note that two self-hosted etcd members cannot run on the same physical node, so you need at least 3 nodes to scale up to 3.
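
A quick sanity check before scaling (plain kubectl, nothing operator-specific) is to make sure there are at least as many Ready nodes as the target size, given the one-member-per-node constraint above:

kubectl get nodes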

@janwillies (Contributor, Author)

cool, it's working now:

hyperkube kubectl --namespace=kube-system get cluster.etcd kube-etcd -o json > etcd.json && \
vim etcd.json && \
curl -H 'Content-Type: application/json' -X PUT --data @etcd.json http://127.0.0.1:8080/apis/etcd.coreos.com/v1beta1/namespaces/kube-system/clusters/kube-etcd
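
To confirm the scale-up took effect, one can watch the new member pods come up and re-read the spec (filtering by name, since the exact labels the operator puts on its pods are not shown in this thread; jq is assumed to be installed):

kubectl --namespace=kube-system get pods | grep kube-etcd
kubectl --namespace=kube-system get cluster.etcd kube-etcd -o json | jq .spec.size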

appreciate the help @xiang90 and @hongchaodeng!
