Backups of etcd (AWS) #40027
Comments
Do you think you could point @foxylion to some docs @hongchaodeng or @xiang90? Thanks! cc: @kubernetes/sig-api-machinery-misc
@calebamiles I had a discussion with @justinsb and therefore created a pull request (kubernetes/kops#1511) describing an etcd backup/restore procedure for kops. I don't know if this also applies to a bare Kubernetes cluster (or whether there are any docs available for that).
Great @foxylion! We (CoreOS) are almost done collecting responses to our etcd user survey, which will hopefully address topics such as the operational lifecycle of an etcd cluster for Kubernetes.
@calebamiles We are working on the documentation here: https://docs.google.com/document/d/16ES7N51Xj8r1P5ITan3gPRvwMzoavvX9TqoN8IpEOU8/edit?usp=sharing. The AWS-specific doc should be based on this documentation.
Hi @xiang90, thanks for developing etcd 😄 A bit more context can be seen at the kube-aws repo.
@mumoshu Let me help answer that. Taking EBS snapshots is probably not going to work: there is identity and membership information that can't be restored by restoring only the EBS volume. It's recommended to take a snapshot via etcdctl or an API call and then save it to EBS/S3. See our design in the etcd operator: https://github.com/coreos/etcd-operator/blob/master/doc/design/disaster_recovery.md
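For anyone who wants something concrete, here is a minimal sketch of that approach: ask etcd itself for a consistent snapshot with etcdctl, then copy the file off the instance to S3. This is not the etcd operator's implementation; the endpoint, bucket name, and paths are placeholders, and it assumes etcdctl (v3 API) is on PATH, boto3 credentials are configured, and any TLS flags your cluster needs are added to the command.

```python
#!/usr/bin/env python3
"""Sketch: take an etcd v3 snapshot with etcdctl and copy it to S3."""
import datetime
import os
import subprocess

import boto3

ENDPOINT = "https://127.0.0.1:2379"   # assumption: local etcd member endpoint
BUCKET = "my-etcd-backups"            # hypothetical S3 bucket
SNAPSHOT = "/var/backups/etcd-%s.db" % datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")

# 1. Ask etcd for a consistent snapshot (not a filesystem/EBS-level copy).
subprocess.run(
    ["etcdctl", "snapshot", "save", SNAPSHOT, "--endpoints", ENDPOINT],
    env={**os.environ, "ETCDCTL_API": "3"},
    check=True,
)

# 2. Ship the snapshot off the instance so it survives a volume failure.
boto3.client("s3").upload_file(SNAPSHOT, BUCKET, os.path.basename(SNAPSHOT))
```

The important point, per the disaster-recovery design linked above, is that the backup comes from etcd's own snapshot mechanism rather than from a block-level copy of the volume.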
Thanks @hongchaodeng, your answer really helped me.
@xiang90 Thanks for answering my question in the Google doc! Restoring a single etcd member from an EBS snapshot (or even an etcd snapshot) doesn't work, as you might know. Doing so results in the restored etcd member being refused when it tries to rejoin the cluster, due to an inconsistency in commit indices: the index recorded in the cluster via consensus (is that the right word?) differs from the one in the EBS snapshot, which is a bit older than what is currently recorded in the cluster. The cluster effectively says "why are you requesting old logs you've already committed? I refuse to let you join!" while the restored member tries to catch up on the logs appended after the snapshot was taken.

EBS snapshots only work when you stop all the etcd nodes to freeze the etcd data (including member identities and commit indices) and then restore all the etcd nodes from those snapshots. An etcdv3 snapshot, on the other hand, seems to work without freezing the data like that: say you have a 3-node etcd cluster, choose just one of the snapshots, and then restore all the nodes from that snapshot via etcdctl snapshot restore.
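To make that "restore every member from the same etcdv3 snapshot" idea concrete, here is a rough sketch, not an official procedure: the member names, peer URLs, and paths below are hypothetical. Each host runs the restore for its own member name with the full --initial-cluster list, and etcd is then started against the data directory the restore produced.

```python
#!/usr/bin/env python3
"""Sketch: rebuild every member of a 3-node cluster from the same etcd v3 snapshot."""
import os
import subprocess

SNAPSHOT = "/var/backups/etcd-backup.db"   # the one snapshot chosen for all members
MEMBERS = {                                # hypothetical member names -> peer URLs
    "m1": "https://10.0.0.1:2380",
    "m2": "https://10.0.0.2:2380",
    "m3": "https://10.0.0.3:2380",
}
INITIAL_CLUSTER = ",".join(f"{name}={url}" for name, url in MEMBERS.items())

def restore(name: str) -> None:
    """Run on the host that will become member `name`."""
    subprocess.run(
        [
            "etcdctl", "snapshot", "restore", SNAPSHOT,
            "--name", name,
            "--initial-cluster", INITIAL_CLUSTER,
            "--initial-advertise-peer-urls", MEMBERS[name],
            "--data-dir", f"/var/lib/etcd-restored/{name}",
        ],
        env={**os.environ, "ETCDCTL_API": "3"},
        check=True,
    )

if __name__ == "__main__":
    # Example: on the first host you would call restore("m1"), on the second restore("m2"), etc.
    restore("m1")
```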
Hello. I am trying to restore a k8s cluster from an etcd data backup. I am able to restore the etcd cluster to its original state (it has all the k8s info). However, when I query the k8s cluster for rc, services, deployments, etc., they are all gone. The k8s cluster is not in the same state as before the restore.
@xiang90 Can you give me some direction?
Take a look at the doc at https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/. If you find any gap, kindly file a defect.
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Prevent issues from auto-closing with a /lifecycle frozen comment. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Hi! I am working with an HA Kubernetes cluster deployed via kops on AWS. I have snapshots of every EBS volume. I have tried to simulate a disaster in every AZ, for example by setting the desired capacity of the ASGs to 0 and detaching the old volumes. However, when I try to recover the cluster by attaching new volumes created from the snapshots (setting the tags of the new volumes correctly), I am not able to recover etcd quorum; I think this is for the reason @mumoshu described in his message. I have also tried using EBS volumes created from snapshots taken at the same moment, but it doesn't work in any case (the etcd-server-events pod goes into CrashLoop on every master). Is recovering the cluster by simply attaching new volumes created from snapshots the right approach, or not?
@falberto89 As I understood this, it will only work in a non-HA setup. When there is more than one etcd node, it is possible that starting up from the snapshots will fail due to inconsistencies. We never tried an HA setup, so I can only speak for the non-HA setup (where it worked when we tried it).
Hi,
we are going to use Kubernetes on AWS (set up with kops). One thing I wanted to address before going into production is backing up the data storage.
I've currently set up a non-HA Kubernetes master and I want to take regular backups of etcd so that I can restore the cluster in case of a volume failure at AWS.
This currently seems not to be possible. There should be an integrated backup/restore solution.
Alternatively (which would most likely concern kops), I should be able to restore an etcd volume snapshot and change the volume IDs inside Kubernetes.
Related issue at kops: kubernetes/kops#1506