This repository has been archived by the owner. It is now read-only.

Support deploying self-hosted etcd #31

Closed
philips opened this issue May 2, 2016 · 22 comments

Comments

Contributor

@philips philips commented May 2, 2016

From the README:

When you start bootkube, you must also give it the addresses of your etcd servers, and enough information for bootkube to create an ssh tunnel to the node that will become a member of the master control plane. Upon startup, bootkube will create a reverse proxy using an ssh connection, which will allow a bootstrap kubelet to contact the apiserver running as part of bootkube.

In the original prototype we had a built-in etcd. Why is that no longer part of this?

Contributor

@aaronlevy aaronlevy commented May 2, 2016

We need the data that ends up in etcd to persist with the cluster that is launched. If that data lives in the bootkube, then bootkube must continue to run for the lifecycle of the cluster.

Alternatively, we need a way to pivot the etcd data injected during the bootstrap process to the "long-lived" etcd cluster. The long-lived cluster would essentially be a self-hosted etcd cluster launched as k8s components just like the rest of the control plane.

What I'd probably like to see is something along the lines of:

  1. Bootkube runs etcd in-process
  2. k8s objects injected to the api-server end up in the local/in-process etcd
  3. One of those objects is an etcd pod definition, which is started as a self-hosted pod on a node.
  4. The self hosted etcd "joins" the existing bootkube etcd, making a cluster of 2 nodes.
  5. etcd replication copies all state to new joined etcd node
  6. bootkube dies after a self-hosted control-plane is started, removing itself from etcd cluster membership
  7. self-hosted etcd cluster is managed from that point forward as a k8s component.
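As a rough illustration of the join-then-remove idea in the list above, here is a toy in-memory model. Everything in it (Member, Cluster, pivot, the key names) is hypothetical and only mimics the ordering of the steps; real etcd replicates through Raft and none of this is the etcd or bootkube API.

```python
# Toy model of the join-then-remove pivot. All names are hypothetical;
# this is not the etcd or bootkube API.

class Member:
    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    def __init__(self, seed):
        self.members = {seed.name: seed}

    def put(self, key, value):
        # Real etcd commits writes through Raft; here every member
        # simply receives the write.
        for member in self.members.values():
            member.data[key] = value

    def add_member(self, new, source_name):
        # A joining member catches up from an existing member's state.
        new.data = dict(self.members[source_name].data)
        self.members[new.name] = new

    def remove_member(self, name):
        if len(self.members) == 1:
            raise RuntimeError("refusing to remove the last member")
        del self.members[name]


def pivot():
    boot = Member("bootkube-etcd")
    cluster = Cluster(boot)                             # step 1
    cluster.put("/registry/pods/etcd", "pod-def")       # steps 2-3
    hosted = Member("self-hosted-etcd")
    cluster.add_member(hosted, boot.name)               # steps 4-5
    cluster.put("/registry/pods/apiserver", "pod-def")  # later writes reach both
    cluster.remove_member(boot.name)                    # step 6
    return cluster                                      # step 7: only self-hosted remains
```

The point of the sketch is that no explicit copy step exists: the new member receives the state by joining, and the boot member leaves only after that.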

Another option might be trying to copy the etcd keys from the in-process/local node to the self-hosted node, but this can get a little messy because we would be trying to manually copy (and mirror) data of a live cluster.

Some concerns with this approach:

  • Managing etcd membership in K8s is not currently a very good story. It's either waiting on petsets or trying to handle this with lifecycle hooks, or relying on external mechanics for membership management.
  • Pretty unproven and a bit risky from a production perspective to try and run etcd for the cluster, also "in" the cluster. But I can see the value in this from a "get started easily" while we exercise this as a viable option.

Contributor

@aaronlevy aaronlevy commented May 2, 2016

@philips what do you think about changing this issue to be "support self-hosted etcd", and dropping it from the 0.1.0 milestone?

Contributor

@aaronlevy aaronlevy commented May 3, 2016

Adding notes from a side-discussion:

Another option that was mentioned is just copying keys from bootkube-etcd to cluster-etcd. This would require some coordination points in the bootkube process:

  1. bootkube-apiserver configured to use bootkube-etcd
  2. bootkube only injects objects for self-hosted etcd pods and waits for them to be started
  3. bootkube stops internal api-server (no more changes to local state)
  4. Copy all etcd keys from local to remote (self-hosted) cluster
  5. start bootkube-apiserver again but have it point to the self-hosted etcd
  6. create the rest of the self-hosted objects & finish bootkube run as normal
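The coordination points above could be sketched roughly as follows. The stores are plain dicts and the apiserver is a small state dict; migrate_keys and the endpoint names are made up for illustration and do not correspond to any bootkube function.

```python
# Hypothetical sketch of the copy-keys approach: stop writes, copy,
# verify, then repoint the apiserver. Plain dicts stand in for etcd.

def migrate_keys(local, remote, apiserver):
    # Step 3: stop the internal apiserver so local state cannot change.
    apiserver["running"] = False

    # Step 4: copy every key from the local store to the remote one.
    for key, value in local.items():
        remote[key] = value

    # Sanity check before cutting over: nothing may be missing or stale.
    missing = [k for k in local if remote.get(k) != local[k]]
    if missing:
        raise RuntimeError(f"copy incomplete: {missing}")

    # Step 5: start the apiserver again, now pointing at the new store.
    apiserver["etcd"] = "self-hosted-etcd"
    apiserver["running"] = True
    return apiserver
```

Stopping the apiserver before the copy is what keeps this from turning into mirroring a live cluster, which is the messy case mentioned earlier in the thread.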

@aaronlevy aaronlevy changed the title No built in etcd? Support deploying self-hosted etcd May 3, 2016
@aaronlevy aaronlevy removed this from the v0.1.0 milestone May 3, 2016

@stuart-warren stuart-warren commented Aug 16, 2016

How do you want the self-hosted apiserver to discover the location of self-hosted etcd?
I tried using an external loadbalancer listening on 2379 with a known address, but the apiserver throws a bunch of:

reflector.go:334] pkg/storage/cacher.go:163: watch of *api.LimitRange ended with: client: etcd cluster is unavailable or misconfigured

v1.3.5 talking to etcd v3.0.3

edit:
These were just harmless error messages in the log, caused by a 10-second client timeout in the haproxy config that kept breaking watches.

Contributor

@kalbasit kalbasit commented Aug 19, 2016

I've managed to get this done. Using a separate ETCD cluster where each k8s node (master/minion) is running an ETCD in proxy mode. I'm using Terraform to configure both. The etcd module is available here and the k8s module is available here.

P.S: The master is not volatile and cannot be scaled. If the master node reboots it will not start any of the components again, not sure why but bootkube thinks they are running and quits. Possibly due to having /registry in etcd.

P.P.S: I had a few issues doing that, mostly related to me adding --cloud-provider=aws to the kubelet, the controller and the api-server. Other issues were related to bootkube being started in a container without /etc/resolv.conf and /etc/ssl/certs/ca-certificates.crt. I'll file separate issues/PRs for those.

Contributor Author

@philips philips commented Sep 22, 2016

@xiang90 and @hongchaodeng can you put some thoughts together on this in relation to having an etcd controller?

I think there are essentially two paths:

  1. Copy the data from the bootkube etcd to the cluster etcd
  2. Add the bootkube etcd to the cluster etcd, then remove the bootkube etcd once everything is replicated

I think option 2 is better because it means we don't have to worry about cutting over and having split brain. But! How do we do 2 if the cluster only intends to have one etcd member (say in AWS, because you will have a single-machine cluster backed by EBS)?
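One way to reason about the single-member case is plain majority arithmetic. The helpers below are a generic sketch of Raft-style quorum sizes (quorum and failures_tolerated are illustrative names, not etcd functions) and are not taken from this thread.

```python
# Majority-quorum arithmetic for a cluster of n voting members.

def quorum(members):
    # A majority is more than half of the members.
    return members // 2 + 1


def failures_tolerated(members):
    # Members that can be lost while a majority is still reachable.
    return members - quorum(members)


# Pivot for a cluster that should end with one member:
#   1 member  (bootkube etcd only)       -> quorum 1, tolerates 0 failures
#   2 members (boot + self-hosted)       -> quorum 2, tolerates 0 failures
#   1 member  (self-hosted, boot removed)-> quorum 1, tolerates 0 failures
```

Under this arithmetic the temporary two-member cluster needs both members healthy until the bootkube member is removed, so the pivot window is no more fault tolerant than the single-member cluster it ends with.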

I think we should try and prototype this out ASAP as this is the last remaining component that hasn't been proven to be self-hostable.

Contributor

@xiang90 xiang90 commented Sep 22, 2016

@philips I have thought about this a little bit. And here is the workflow in my mind:

  • create a one member cluster in bootkube

... k8s is ready...

  • start etcd controller
  • etcd controller adds a member into the one member cluster created by bootkube
  • wait for the new member to sync with the seed one
  • remove the bootkube etcd member

Now the etcd controller fully controls the etcd cluster and can grow it to the desired size.
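The ordering in this workflow can be sketched as a small reconcile loop. This is only a guess at the shape of such a controller: next_action, run, and the member names are hypothetical, and the "wait for the new member to sync" step is reduced to a comment.

```python
# Hypothetical reconcile loop: add one self-hosted member, drop the
# bootkube seed, then grow to the desired size one member at a time.

def next_action(members, desired_size, seed):
    hosted = [m for m in members if m != seed]
    if seed in members:
        if not hosted:
            return ("add", f"etcd-{len(hosted)}")
        # A real controller would wait here until the new member has
        # caught up with the seed before removing it.
        return ("remove", seed)
    if len(hosted) < desired_size:
        return ("add", f"etcd-{len(hosted)}")
    return ("done", None)


def run(desired_size, seed="bootkube-etcd"):
    members = [seed]
    log = []
    while True:
        action, name = next_action(members, desired_size, seed)
        if action == "done":
            return members, log
        if action == "add":
            members.append(name)
        else:
            members.remove(name)
        log.append((action, name))
```

Changing membership one step at a time keeps each transition simple to reason about, which matters most while the seed member is still part of the cluster.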

Contributor

@ethernetdan ethernetdan commented Sep 24, 2016

Started some work on this - got a bootkube-hosted etcd cluster up, now working on migrating from the bootkube instance to the etcd-controller managed instance

@pires pires commented Oct 13, 2016

Add the bootkube etcd to the cluster etcd, then remove the bootkube etcd once everything is replicated

@philips what happens if the self-hosted etcd cluster (or the control plane behind it) dies? I believe this is why @aaronlevy mentioned it is:

(...) a bit risky from a production perspective to try and run etcd for the cluster, also "in" the cluster. But I can see the value in this from a "get started easily" while we exercise this as a viable option.

This is exactly the concern I shared in the design proposal.

Can this issue clarify if this concept is simply meant for non-production use-cases?

Contributor Author

@philips philips commented Oct 13, 2016

@pires if the self-hosted etcd cluster dies you need to recover using bootkube from a backup. This is really no different than if it died normally and you would have to redeploy the cluster from a backup and restart the API servers again.

@pires pires commented Oct 13, 2016

@philips can you point me to the backup strategy you guys are designing or already implementing?

Contributor

@xiang90 xiang90 commented Oct 13, 2016

@pires

I believe the backup @philips mentioned is actually the etcd backup. For the etcd-controller, we do a backup:

  1. every X minutes. X is defined by the user
  2. once we upgrade the cluster
  3. user can hit backup/now endpoint to force a backup when there is an expected important event like upgrading k8s master components.
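The three triggers above amount to a simple decision, sketched here. should_backup and its parameters are invented for illustration; the actual etcd-controller logic is not shown in this thread.

```python
# Hypothetical decision function for the three backup triggers.
# Times are in seconds; the interval is the user-defined X in minutes.

def should_backup(now, last_backup, interval_minutes,
                  upgrading=False, forced=False):
    if forced:
        # Trigger 3: the user asked for a backup right now.
        return "forced"
    if upgrading:
        # Trigger 2: take a backup around a cluster upgrade.
        return "upgrade"
    if now - last_backup >= interval_minutes * 60:
        # Trigger 1: the periodic interval has elapsed.
        return "interval"
    return None
```

For example, with a 10-minute interval, should_backup(600, 0, 10) returns "interval" while should_backup(599, 0, 10) returns None.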

@pires pires commented Oct 13, 2016

I understand the concept and it should work as you say, I'm just looking for more details on:

  • Where is each etcd member data stored?
  • Where is the backup data stored?
  • How is bootkube leveraging the stored data?

Don't get me wrong, I find this really cool and I'm trying to grasp it as much as possible as sig-cluster-lifecycle looks into HA.

Contributor

@xiang90 xiang90 commented Oct 13, 2016

Where is each etcd member data stored?

The data is stored on local storage. etcd has a built-in recovery mechanism. When you have a 3-member etcd cluster, you already have 3 local copies.

Where is the backup data stored

Backup is for extra safety. It helps with rollback + disaster recovery.
It is stored on a PV, like EBS, Ceph, GlusterFS, etc.

How is bootkube leveraging the stored data

If there is a disaster case or bad upgrade, we recover the cluster from the backup.

Contributor Author

@philips philips commented Nov 22, 2016

As an update on the etcd and self-hosted plan we have merged support behind an experimental flag in bootkube: https://github.com/kubernetes-incubator/bootkube/blob/master/cmd/bootkube/start.go#L37

This is self-hosted and self-healing etcd on top of Kubernetes.

@orbatschow orbatschow commented Dec 5, 2016

@philips
What about using the new etcd operator to run etcd fully managed on top of Kubernetes? I think this will simplify maintenance, updates ... a lot.

Contributor

@xiang90 xiang90 commented Dec 6, 2016

@gitoverflow That is the plan.

Contributor

@aaronlevy aaronlevy commented Feb 28, 2017

I am going to close this as initial self-hosted etcd support has been merged. There are follow up issues open for specific tasks:

(documentation): #240
(adding support to all hack/* examples): #337
(iptables checkpointing): #284

@aaronlevy aaronlevy closed this Feb 28, 2017
Contributor

@jamiehannaford jamiehannaford commented Mar 10, 2017

Although bootkube now supports a self-hosted etcd pod for bootstrapping, I can't find any documentation which explains:

  1. How a follow-up etcd controller syncs with the bootkube etcd pod
  2. Whether it's possible for an etcd-operator to manage the lifecycle of the cluster etcd controller itself (as opposed to a user-defined etcd cluster)

Contributor

@aaronlevy aaronlevy commented Mar 10, 2017

@jamiehannaford You're right - and we do need to catch up on Documentation. Some tracking issues:

#240
#311
#302

Regarding your questions:

  1. We need to add a "how it works" section to this repo - but the closest so far might be the youtube link in my comment here: #302 - it briefly goes into how the seed etcd pod is pivoted into the self-hosted etcd cluster (by the etcd-operator).

  2. Yes, the plan is for the etcd-operator to manage the cluster-etcd - so things like re-sizing, backups, updates, etc. (of your cluster etcd) could be managed by the etcd-operator.

Contributor

@jamiehannaford jamiehannaford commented Mar 13, 2017

@aaronlevy Thanks for the links. I'm still wrapping my head around the boot-up procedure. It seems the chronology for a self-hosted etcd cluster is:

  1. A static pod for etcd is created
  2. The temp control plane is created
  3. The self-hosted control plane components are created against the temp one
  4. When all the components in 3 are ready, the etcd-operator creates the new self-hosted etcd cluster, migrating all the data from 1

My question is, why does the self-hosted etcd need to wait for certain pods to exist before the data migration happens? I thought the data migration would happen first, then all the final control plane elements would be created.

I looked at the init args for kube-apiserver, and it has the eventual IPv4 of the real etcd (10.3.0.15). This means there's a gap of time between the api-server being created and the real etcd existing. Doesn't this create some kind of crash loop since the API server has nothing to connect to? Or is this gap negligible?

Contributor

@aaronlevy aaronlevy commented Mar 13, 2017

@jamiehannaford

why does the self-hosted etcd need to wait for certain pods to exist before the data migration happens?

It could likely work in this order as well - but there could be more coordination points (vs just "everything is running - so let the etcd-operator take over"). For example, we would need to make sure to deploy kube-proxy & etcd-operator, then do the etcd pivot, then create the rest of the cluster components. Where right now it's just "create all components that exist in the /manifest dir, wait for some of them, do etcd-pivot" - which initially is easier.

Are there any issues particular to the current order that you've found?

Doesn't this create some kind of crash loop since the API server has nothing to connect to?

Sort of. Really everything pivots around etcd / apiserver addressability. The "real" api-server doesn't immediately take over, because it is unable to bind on 8080/443 (bootkube apiserver is still listening on those ports). The rest of the components don't know if they're talking to bootkube-apiserver or "real" apiserver. It's just an address they expect to reach. So when we're ready to pivot to the self-hosted control-plane, it's just simply exiting the bootkube-apiserver so the ports free up.
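The port hand-off described here can be reproduced with two ordinary TCP sockets. This is a generic sketch, not bootkube code: try_bind and demo are illustrative names, and an OS-assigned loopback port stands in for 8080/443.

```python
# Two listeners cannot hold the same address and port at once; the second
# one succeeds only after the first exits. A loopback port chosen by the
# OS stands in for the apiserver ports.
import socket


def try_bind(port):
    """Return a listening socket on 127.0.0.1:port, or None if it is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen()
        return sock
    except OSError:
        sock.close()
        return None


def demo():
    boot = try_bind(0)                # "bootkube apiserver" takes a free port
    port = boot.getsockname()[1]
    blocked = try_bind(port) is None  # "real apiserver" cannot bind yet
    boot.close()                      # bootkube apiserver exits
    real = try_bind(port)             # the same bind now succeeds
    rebound = real is not None
    if real is not None:
        real.close()
    return blocked, rebound
```

In this model the "pivot" is nothing more than the first listener closing, which matches the description of the bootkube-apiserver exiting so the ports free up.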

You're right that there will be a moment where no api-server is bound to the ports - but it's actually fine in most cases for components to fail/retry - much of Kubernetes is designed this way (including its core components).
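The fail/retry behaviour mentioned here is commonly written as a loop with a growing delay. The helper below is a generic sketch with made-up names (retry, operation), not how any particular Kubernetes component implements it.

```python
# Generic retry loop with exponential backoff. The sleep function is
# injectable so the loop can be exercised without real waiting.
import time


def retry(operation, attempts=5, base_delay=0.5, sleep=time.sleep):
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise          # give up after the last attempt
            sleep(delay)
            delay *= 2         # back off before the next try
```

A component written this way rides out the short window in which no api-server is bound to the ports, as long as the window is shorter than the total retry budget.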

However, there currently is an issue where the bootkube-apiserver is still "active", but it expects to only be talking to the static/boot etcd node - however - that node may have already been removed as a member of the etcd cluster. This puts us in a state where the "active" bootkube apiserver can no longer reach the data-store and essentially becomes inactive.

See #372 for more info.

The above issue might be as simple as adding both boot-etcd address and the service IP for self-hosted etcd to the bootkube api-server, I just haven't had a chance to test that assumption.
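The untested idea above, giving the bootkube api-server more than one etcd address, boils down to endpoint failover. The sketch below only illustrates that idea: pick_endpoint is a made-up helper, the boot-etcd address is hypothetical, and 10.3.0.15 is the service IP quoted earlier in the thread with port 2379 assumed.

```python
# Hypothetical endpoint failover: use the first etcd address that answers.

def pick_endpoint(endpoints, is_reachable):
    for endpoint in endpoints:
        if is_reachable(endpoint):
            return endpoint
    return None


# Assumed addresses, for illustration only.
BOOT_ETCD = "http://127.0.0.1:12379"     # hypothetical boot-etcd address
SELF_HOSTED = "http://10.3.0.15:2379"    # service IP from the thread, port assumed

# Before the pivot both answer, so the boot member is used; once it has
# been removed from the cluster, the client falls through to the service IP.
```

Whether the real api-server behaves this way with two configured addresses is exactly the assumption that was left untested in the comment.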

9 participants