Support deploying self-hosted etcd #31

Closed
philips opened this Issue May 2, 2016 · 22 comments


philips commented May 2, 2016

From the README:

When you start bootkube, you must also give it the addresses of your etcd servers, and enough information for bootkube to create an ssh tunnel to the node that will become a member of the master control plane. Upon startup, bootkube will create a reverse proxy using an ssh connection, which will allow a bootstrap kubelet to contact the apiserver running as part of bootkube.

In the original prototype we had a built in etcd. Why is that no longer part of this?


aaronlevy commented May 2, 2016

We need the data that ends up in etcd to persist with the cluster that is launched. If that data lives only in bootkube, then bootkube must continue to run for the lifetime of the cluster.

Alternatively, we need a way to pivot the etcd data injected during the bootstrap process to the "long-lived" etcd cluster. The long-lived cluster would essentially be a self-hosted etcd cluster launched as k8s components just like the rest of the control plane.

What I'd probably like to see is something along the lines of:

  1. Bootkube runs etcd in-process.
  2. k8s objects injected into the api-server end up in the local/in-process etcd.
  3. One of those objects is an etcd pod definition, which is started as a self-hosted pod on a node.
  4. The self-hosted etcd "joins" the existing bootkube etcd, making a cluster of two nodes.
  5. etcd replication copies all state to the newly joined etcd node.
  6. bootkube exits after a self-hosted control plane is started, removing itself from etcd cluster membership.
  7. The self-hosted etcd cluster is managed from that point forward as a k8s component.
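The grow-then-shrink pivot in steps 4-7 can be sketched with a toy in-memory model (plain Python, no real etcd calls; class and member names are illustrative only):

```python
# Toy model of the membership pivot: grow the cluster by one member,
# replicate state, then drop the seed member. Illustrative only --
# real etcd does this via the member add/remove API plus raft replication.

class ToyEtcdCluster:
    def __init__(self, seed_name, seed_data):
        # each member holds its own copy of the keyspace
        self.members = {seed_name: dict(seed_data)}

    def add_member(self, name):
        # step 4: new member joins with an empty store
        self.members[name] = {}

    def replicate(self):
        # step 5: replication brings every member to the same state
        merged = {}
        for store in self.members.values():
            merged.update(store)
        for name in self.members:
            self.members[name] = dict(merged)

    def remove_member(self, name):
        # step 6: seed member leaves the cluster
        del self.members[name]

cluster = ToyEtcdCluster("bootkube-etcd", {"/registry/pods/etcd": "spec"})
cluster.add_member("self-hosted-etcd")
cluster.replicate()
cluster.remove_member("bootkube-etcd")
print(sorted(cluster.members))              # ['self-hosted-etcd']
print(cluster.members["self-hosted-etcd"])  # {'/registry/pods/etcd': 'spec'}
```

The point of the ordering is that the data is never copied out-of-band: the new member receives everything through normal replication before the seed is removed.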

Another option might be trying to copy the etcd keys from the in-process/local node to the self-hosted node, but this can get a little messy because we would be trying to manually copy (and mirror) data of a live cluster.

Some concerns with this approach:

  • Managing etcd membership in k8s is not currently a very good story. It's either waiting on petsets, trying to handle this with lifecycle hooks, or relying on external mechanics for membership management.
  • It's pretty unproven and a bit risky from a production perspective to try to run etcd for the cluster also "in" the cluster. But I can see the value in this from a "get started easily" standpoint while we exercise this as a viable option.

aaronlevy commented May 2, 2016

@philips what do you think about changing this issue to "support self-hosted etcd" and dropping it from the 0.1.0 milestone?


aaronlevy commented May 3, 2016

Adding notes from a side-discussion:

Another option that was mentioned is just copying keys from bootkube-etcd to cluster-etcd. This would require some coordination points in the bootkube process:

  1. bootkube-apiserver configured to use bootkube-etcd
  2. bootkube only injects objects for self-hosted etcd pods and waits for them to be started
  3. bootkube stops its internal api-server (no more changes to local state)
  4. copy all etcd keys from the local to the remote (self-hosted) cluster
  5. start bootkube-apiserver again, pointed at the self-hosted etcd
  6. create the rest of the self-hosted objects & finish the bootkube run as normal
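The coordination points above can be sketched as a toy in-memory model (plain Python; no real etcd or apiserver involved, the dicts stand in for the two data stores):

```python
# Toy sketch of the copy-keys flow: freeze writes, copy the local store
# to the self-hosted store, then re-point the apiserver. Real bootkube
# would use the etcd API or snapshots rather than dicts.

local_etcd = {"/registry/pods/kube-apiserver": "spec",
              "/registry/pods/etcd": "spec"}
self_hosted_etcd = {}

# step 1: apiserver uses bootkube-etcd as its backend
apiserver_backend = local_etcd

# step 3: stop the internal apiserver so local state stops changing
writes_frozen = True

# step 4: copy every key to the self-hosted cluster
if writes_frozen:
    self_hosted_etcd.update(local_etcd)

# step 5: restart the apiserver against the self-hosted etcd
apiserver_backend = self_hosted_etcd

print(apiserver_backend is self_hosted_etcd)           # True
print(sorted(self_hosted_etcd) == sorted(local_etcd))  # True
```

The freeze in step 3 is the part that makes this approach workable at all: copying the keyspace of a live, still-mutating cluster is the "messy" case mentioned earlier in the thread.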

@aaronlevy aaronlevy changed the title No built in etcd? Support deploying self-hosted etcd May 3, 2016

@aaronlevy aaronlevy removed this from the v0.1.0 milestone May 3, 2016


stuart-warren commented Aug 16, 2016

How do you want the self-hosted apiserver to discover the location of self-hosted etcd?
I tried using an external loadbalancer listening on 2379 with a known address, but the apiserver throws a bunch of:

reflector.go:334] pkg/storage/cacher.go:163: watch of *api.LimitRange ended with: client: etcd cluster is unavailable or misconfigured

v1.3.5 talking to etcd v3.0.3

edit:
These issues were just harmless error messages in the log file, caused by a 10-second client timeout in the haproxy config constantly breaking watches.


kalbasit commented Aug 19, 2016

I've managed to get this done, using a separate etcd cluster where each k8s node (master/minion) runs etcd in proxy mode. I'm using Terraform to configure both. The etcd module is available here and the k8s module is available here.

P.S.: The master does not survive reboots and cannot be scaled. If the master node reboots, it will not start any of the components again; not sure why, but bootkube thinks they are running and quits. Possibly due to having /registry in etcd.

P.P.S.: I had a few issues doing this, mostly related to me adding --cloud-provider=aws to the kubelet, the controller, and the api-server, plus issues related to bootkube starting in a container without /etc/resolv.conf and /etc/ssl/certs/ca-certificates.crt. I'll file separate issues/PRs for those.


philips commented Sep 22, 2016

@xiang90 and @hongchaodeng, can you put some thoughts together on this in relation to having an etcd controller?

I think there are essentially two paths:

  1. Copy the data from the bootkube etcd to the cluster etcd
  2. Add the bootkube etcd to the cluster etcd, then remove the bootkube etcd once everything is replicated

I think option 2 is better because it means we don't have to worry about cutting over and having split brain. But! How do we do 2 if the cluster only intends to have one etcd member (say on AWS, where you will have a single-machine cluster backed by EBS)?

I think we should try and prototype this out ASAP as this is the last remaining component that hasn't been proven to be self-hostable.


xiang90 commented Sep 22, 2016

@philips I have thought about this a little bit. And here is the workflow in my mind:

  • create a one-member cluster in bootkube

... k8s is ready ...

  • start the etcd controller
  • the etcd controller adds a member to the one-member cluster created by bootkube
  • wait for the new member to sync with the seed member
  • remove the bootkube etcd member

Now the etcd controller fully controls the etcd cluster and can grow it to the desired size.
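The "wait for the new member to sync" guard in this workflow can be sketched as a simple index comparison (plain Python; the indices are simulated, not real raft state, and the function name is illustrative):

```python
# Toy sketch of the controller's sync check: only remove the seed member
# once the new member's applied index has caught up with everything the
# cluster has committed. In real etcd this maps to raft commit/applied
# indices exposed by the member status API.

def safe_to_remove_seed(leader_commit_index, new_member_applied_index):
    # the new member must have replicated everything the cluster committed
    return new_member_applied_index >= leader_commit_index

members = {"seed": 100, "new": 40}   # member name -> applied index
commit_index = 100

# replication catches the new member up over time
members["new"] = 100

if safe_to_remove_seed(commit_index, members["new"]):
    del members["seed"]

print(sorted(members))   # ['new']
```

Removing the seed member before this condition holds would risk losing committed data, which is why the workflow makes the sync step explicit.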


ethernetdan commented Sep 24, 2016

Started some work on this - got a bootkube-hosted etcd cluster up; now working on migrating from the bootkube instance to the etcd-controller-managed instance.


pires commented Oct 13, 2016

Add the bootkube etcd to the cluster etcd, then remove the bootkube etcd once everything is replicated

@philips what happens if the self-hosted etcd cluster (or the control plane behind it) dies? I believe this is why @aaronlevy mentioned it is:

(...) a bit risky from a production perspective to try and run etcd for the cluster, also "in" the cluster. But I can see the value in this from a "get started easily" while we exercise this as a viable option.

This is exactly the concern I shared in the design proposal.

Can this issue clarify whether this concept is simply meant for non-production use cases?


philips commented Oct 13, 2016

@pires if the self-hosted etcd cluster dies, you need to recover using bootkube from a backup. This is really no different from a normal etcd deployment dying: you would have to restore the cluster from a backup and restart the API servers again.


pires commented Oct 13, 2016

@philips can you point me to the backup strategy you guys are designing or already implementing?


xiang90 commented Oct 13, 2016

@pires

I believe the backup @philips mentioned is actually the etcd backup. The etcd-controller takes a backup:

  1. every X minutes, where X is defined by the user
  2. whenever we upgrade the cluster
  3. when the user hits the backup/now endpoint to force a backup before an expected important event, like upgrading the k8s master components.
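The three triggers can be sketched as a tiny scheduler (plain Python; the class, method names, and interval are illustrative, not the etcd-controller's actual API):

```python
# Toy sketch of the three backup triggers: periodic, on-upgrade, and
# on-demand (a stand-in for something like a backup/now endpoint).

class BackupScheduler:
    def __init__(self, interval_seconds):
        self.interval = interval_seconds
        self.last_backup = 0.0
        self.backups = []

    def _take_backup(self, reason, now):
        self.backups.append(reason)
        self.last_backup = now

    def tick(self, now):
        # trigger 1: periodic backup every X seconds
        if now - self.last_backup >= self.interval:
            self._take_backup("periodic", now)

    def on_upgrade(self, now):
        # trigger 2: backup taken as part of a cluster upgrade
        self._take_backup("upgrade", now)

    def backup_now(self, now):
        # trigger 3: user-forced backup, regardless of the interval
        self._take_backup("forced", now)

s = BackupScheduler(interval_seconds=600)
s.tick(600)        # periodic backup fires
s.on_upgrade(700)  # upgrade backup fires regardless of interval
s.backup_now(710)  # forced backup fires immediately
print(s.backups)   # ['periodic', 'upgrade', 'forced']
```

Note that the upgrade and forced triggers deliberately bypass the interval check, since those are exactly the moments where a fresh backup matters most.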

pires commented Oct 13, 2016

I understand the concept and it should work as you say; I'm just looking for more details on:

  • Where is each etcd member data stored?
  • Where is the backup data stored?
  • How is bootkube leveraging the stored data?

Don't get me wrong - I find this really cool, and I'm trying to grasp it as much as possible as sig-cluster-lifecycle looks into HA.


xiang90 commented Oct 13, 2016

Where is each etcd member data stored?

The data is stored on local storage; etcd has a built-in recovery mechanism. When you have a 3-member etcd cluster, you already have three local copies.

Where is the backup data stored?

Backups are for extra safety; they help with rollback + disaster recovery.
They are stored on a PV, like EBS, Ceph, GlusterFS, etc.

How is bootkube leveraging the stored data?

If there is a disaster or a bad upgrade, we recover the cluster from the backup.


philips commented Nov 22, 2016

As an update on the etcd self-hosting plan: we have merged support behind an experimental flag in bootkube: https://github.com/kubernetes-incubator/bootkube/blob/master/cmd/bootkube/start.go#L37

This is self-hosted and self-healing etcd on top of Kubernetes.


orbatschow commented Dec 5, 2016

@philips
What about using the new etcd operator to run etcd fully managed on top of Kubernetes? I think this would simplify maintenance, updates, etc. a lot.


xiang90 commented Dec 6, 2016

@gitoverflow That is the plan.


aaronlevy commented Feb 28, 2017

I am going to close this, as initial self-hosted etcd support has been merged. There are follow-up issues open for specific tasks:

(documentation): #240
(adding support to all hack/* examples): #337
(iptables checkpointing): #284

@aaronlevy aaronlevy closed this Feb 28, 2017


jamiehannaford commented Mar 10, 2017

Although bootkube now supports a self-hosted etcd pod for bootstrapping, I can't find any documentation that explains:

  1. How a follow-up etcd controller syncs with the bootkube etcd pod
  2. Whether it's possible for an etcd-operator to manage the lifecycle of the cluster etcd controller itself (as opposed to a user-defined etcd cluster)

aaronlevy commented Mar 10, 2017

@jamiehannaford You're right - we do need to catch up on documentation. Some tracking issues:

#240
#311
#302

Regarding your questions:

  1. We need to add a "how it works" section to this repo - but the closest so far might be the YouTube link in my comment here: #302 - it briefly goes into how the seed etcd pod is pivoted into the self-hosted etcd cluster (by the etcd-operator).

  2. Yes, the plan is for the etcd-operator to manage the cluster-etcd - so things like re-sizing, backups, updates, etc. (of your cluster etcd) could be managed by the etcd-operator.


jamiehannaford commented Mar 13, 2017

@aaronlevy Thanks for the links. I'm still wrapping my head around the boot-up procedure. It seems the chronology for a self-hosted etcd cluster is:

  1. A static pod for etcd is created
  2. The temp control plane is created
  3. The self-hosted control plane components are created against the temp one
  4. When all the components in 3 are ready, the etcd-operator creates the new self-hosted etcd cluster, migrating all the data from 1

My question is, why does the self-hosted etcd need to wait for certain pods to exist before the data migration happens? I thought the data migration would happen first, then all the final control plane elements would be created.

I looked at the init args for kube-apiserver, and it has the eventual IPv4 of the real etcd (10.3.0.15). This means there's a gap of time between the api-server being created and the real etcd existing. Doesn't this create some kind of crash loop since the API server has nothing to connect to? Or is this gap negligible?


aaronlevy commented Mar 13, 2017

@jamiehannaford

why does the self-hosted etcd need to wait for certain pods to exist before the data migration happens?

It could likely work in this order as well - but there would be more coordination points (vs. just "everything is running, so let the etcd-operator take over"). For example, we would need to make sure to deploy kube-proxy & etcd-operator, then do the etcd pivot, then create the rest of the cluster components. Right now it's just "create all components that exist in the /manifest dir, wait for some of them, do the etcd pivot" - which is initially easier.

Are there any issues particular to the current order that you've found?

Doesn't this create some kind of crash loop since the API server has nothing to connect to?

Sort of. Really everything pivots around etcd / apiserver addressability. The "real" api-server doesn't immediately take over, because it is unable to bind on 8080/443 (the bootkube apiserver is still listening on those ports). The rest of the components don't know if they're talking to the bootkube-apiserver or the "real" apiserver - it's just an address they expect to reach. So when we're ready to pivot to the self-hosted control plane, we simply exit the bootkube-apiserver so the ports free up.

You're right that there will be a moment where no api-server is bound to the ports - but it's actually fine in most cases for components to fail/retry - much of Kubernetes is designed this way (including its core components).
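The fail/retry behavior described here can be sketched as a simple retry loop with backoff (plain Python; `connect` is a stand-in for any component dialing the apiserver address, not a real client):

```python
# Minimal retry-with-backoff sketch: a component keeps dialing a fixed
# apiserver address and tolerates the window where nothing is bound to
# the port (e.g. mid-pivot, between the two apiservers).

import time

def connect_with_retry(connect, max_attempts=10, base_delay=0.01):
    delay = base_delay
    for _ in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            # nothing listening yet -- back off and retry
            time.sleep(delay)
            delay *= 2
    raise ConnectionError("apiserver never became reachable")

# simulate an apiserver that only starts accepting on the third attempt
attempts = {"n": 0}
def fake_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("connection refused")
    return "connected"

print(connect_with_retry(fake_connect))   # connected
```

Because every consumer treats the apiserver as just an address to retry against, the brief unbound window during the port handoff is absorbed without any explicit coordination.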

However, there currently is an issue where the bootkube-apiserver is still "active" but expects to be talking only to the static/boot etcd node - and that node may have already been removed as a member of the etcd cluster. This puts us in a state where the "active" bootkube apiserver can no longer reach the data store and essentially becomes inactive.

See #372 for more info.

The above issue might be as simple as adding both the boot-etcd address and the service IP for self-hosted etcd to the bootkube api-server; I just haven't had a chance to test that assumption.
