
Backups of etcd (AWS) #40027

Closed
foxylion opened this issue Jan 17, 2017 · 18 comments
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
sig/api-machinery: Categorizes an issue or PR as relevant to SIG API Machinery.
sig/docs: Categorizes an issue or PR as relevant to SIG Docs.

Comments

@foxylion

foxylion commented Jan 17, 2017

Hi,
we are going to use Kubernetes on AWS (set up with kops). One thing I wanted to address before going into production is taking backups of the data storage.

I've currently set up a non-HA Kubernetes master and I want to take regular backups of etcd to be able to restore the cluster in case of a volume failure at AWS.

This currently doesn't seem to be possible. There should be an integrated backup/restore solution.

Alternatively (which would most likely concern kops), I should be able to restore an etcd volume snapshot and change the volume IDs inside of Kubernetes.

Related issue at kops: kubernetes/kops#1506

@calebamiles added the kind/feature, sig/api-machinery, and sig/docs labels Jan 17, 2017
@calebamiles
Contributor

Do you think you could point @foxylion to some docs @hongchaodeng or @xiang90? Thanks!

cc: @kubernetes/sig-api-machinery-misc

@foxylion
Author

foxylion commented Jan 17, 2017

@calebamiles I had a discussion with @justinsb and therefore created a pull request (kubernetes/kops#1511) covering the etcd backup/restore procedure for kops. I don't know whether this also applies to a bare Kubernetes setup (or whether there are any docs available).

@calebamiles
Contributor

Great @foxylion! We (CoreOS) are almost finished collecting responses to our etcd user survey, which will hopefully address topics around the operational lifecycle of an etcd cluster for Kubernetes.

@xiang90
Contributor

xiang90 commented Jan 17, 2017

@calebamiles We are working on the documentation here: https://docs.google.com/document/d/16ES7N51Xj8r1P5ITan3gPRvwMzoavvX9TqoN8IpEOU8/edit?usp=sharing.

The AWS-specific doc should be based on this documentation.

@mumoshu

mumoshu commented Mar 1, 2017

Hi @xiang90, thanks for developing etcd 😄
I'm the maintainer of kube-aws, currently working on the same topic as this issue.
Would you mind clarifying the intended way of taking a consistent backup of an etcd node by answering my question in the draft?

A bit more context can be seen at the kube-aws repo.

@hongchaodeng
Contributor

@mumoshu Let me help answer that.

Taking EBS snapshots probably isn't going to work. There is identity and membership information that can't be restored by restoring only the EBS volume.

It's recommended to take a snapshot via etcdctl or an API call, then save it to EBS/S3. See our design in the etcd operator: https://github.com/coreos/etcd-operator/blob/master/doc/design/disaster_recovery.md
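
For anyone following along, here's a minimal sketch of what that could look like with the etcd v3 CLI plus the AWS CLI. The endpoint, certificate paths, and bucket name below are placeholders I made up, not something kops or the etcd operator configures for you:

```bash
# Take a consistent snapshot from a single etcd member (etcd v3 API).
# Endpoint, certificate paths, and the S3 bucket are placeholders.
ETCDCTL_API=3 etcdctl snapshot save /tmp/etcd-backup.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt \
  --cert=/etc/etcd/client.crt \
  --key=/etc/etcd/client.key

# Sanity-check the snapshot before shipping it off the node.
ETCDCTL_API=3 etcdctl snapshot status /tmp/etcd-backup.db

# Upload to S3 (bucket name is hypothetical).
aws s3 cp /tmp/etcd-backup.db "s3://my-etcd-backups/etcd-$(date +%Y%m%d-%H%M%S).db"
```

Running something like this on a cron schedule gives you off-node backups that don't depend on the EBS volume surviving.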

@mumoshu

mumoshu commented Mar 24, 2017

Thanks @hongchaodeng, your answer really helped me.
I ended up implementing kubernetes-retired/kube-aws#417 thanks to your help.

@mumoshu

mumoshu commented Mar 24, 2017

@xiang90 Thanks for answering my question in the google doc!
It was good to know that EBS snapshots are consistent.
However, according to @hongchaodeng's advice and my own experience implementing kubernetes-retired/kube-aws#417, just restoring etcd nodes from EBS snapshots doesn't work. You may already know this and it may have been unclear only to me, but let me clarify.

Restoring a single etcd member from an EBS snapshot (or even an etcd snapshot) doesn't work, as you might know. Doing so results in the restored etcd member being refused from joining the etcd cluster due to an inconsistency in commit indices: the index recorded in the cluster via consensus (is that the right word?) differs from the one in the EBS snapshot, which is a bit older than what is currently recorded in the cluster. The cluster effectively says "why are you requesting old logs you've already committed? I refuse to let you join" while the restored etcd member tries to catch up on the logs appended after the snapshot was taken.

EBS snapshots only work when you stop all the etcd nodes to freeze the etcd data (including member identities and commit indices) and then restore all the etcd nodes from those snapshots.

On the other hand, an etcd v3 snapshot seems to work without freezing etcd data like that. Say you have a 3-node etcd cluster: take a snapshot from just one member and then restore all the nodes from that snapshot via etcdctl snapshot restore. It works (as far as I can see).
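
For later readers, a rough sketch of that restore flow, assuming a hypothetical 3-member cluster (the member names etcd-0/1/2, peer IPs, and data directory below are made up for illustration):

```bash
# Run one restore per member, each on its own host, all from the SAME snapshot.
# Member names, peer URLs, and the data dir are placeholders for illustration.
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-backup.db \
  --name etcd-0 \
  --initial-cluster etcd-0=https://10.0.0.10:2380,etcd-1=https://10.0.0.11:2380,etcd-2=https://10.0.0.12:2380 \
  --initial-cluster-token etcd-cluster-restored \
  --initial-advertise-peer-urls https://10.0.0.10:2380 \
  --data-dir /var/lib/etcd-restored

# Repeat on the other two hosts with --name etcd-1 / etcd-2 and their own
# peer URLs, point each etcd process at the new data dir, then start all members.
```

The restore step rewrites the membership information into the fresh data dir, which is exactly the part a raw EBS volume restore can't fix.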

@ReneSaenz
Contributor

Hello. I am trying to restore a k8s cluster from an etcd data backup. I am able to restore the etcd cluster to its original state (it has all the k8s info). However, when I check the k8s cluster's rc, services, deployments, etc., they are all gone. The k8s cluster is not like it was before the restore.
What am I missing? Can someone point me in the right direction?

@ReneSaenz
Contributor

@xiang90 Can you give me some direction?

@radhikapc

Take a look at the doc at https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/. If you find any gap, kindly file a defect.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 29, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jan 28, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label May 2, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jun 1, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@falberto89

falberto89 commented Oct 18, 2018

Hi! I am working with an HA Kubernetes cluster deployed via kops on AWS. I have snapshots of every EBS volume. I have tried to simulate a disaster in every AZ, for example by setting the desired field of the ASG to 0 and detaching the old volumes. However, when I try to recover the cluster by attaching new volumes created from the snapshots (with the tags of the new volumes set correctly), I am not able to recover etcd quorum; I think the reason is what @mumoshu described in his message. I have also tried using EBS volumes created from snapshots taken at the same moment, but it doesn't work in any case (the etcd-server-events pod goes into a crash loop on every master). Is recovering the cluster by simply attaching new volumes created from snapshots the right approach or not?

@foxylion
Author

foxylion commented Oct 18, 2018

@falberto89 As I understand it, this will only work in a non-HA setup. When there is more than one etcd node, it is possible that a startup from the snapshots will fail due to inconsistencies.

We never tried an HA setup, so I can only speak for the non-HA setup (where it worked when we tried).
