
Backups of etcd (AWS) #40027

Closed
foxylion opened this Issue Jan 17, 2017 · 18 comments

foxylion commented Jan 17, 2017

Hi,
we are going to use Kubernetes on AWS (set up with kops). One thing I want to address before going into production is backing up the data storage.

I've currently set up a non-HA Kubernetes master, and I want to take regular backups of etcd to be able to restore the cluster in case of a volume failure at AWS.

This currently doesn't seem possible. There should be an integrated backup/restore solution.

Alternatively (which would most likely concern kops), I should be able to restore an etcd volume snapshot and change the volume IDs inside Kubernetes.

Related issue at kops: kubernetes/kops#1506

calebamiles (Member) commented Jan 17, 2017

Do you think you could point @foxylion to some docs, @hongchaodeng or @xiang90? Thanks!

cc: @kubernetes/sig-api-machinery-misc

foxylion (Author) commented Jan 17, 2017

@calebamiles I had a discussion with @justinsb and therefore created a pull request (kubernetes/kops#1511) covering the etcd backup/restore procedure for kops. I don't know whether this also applies to a bare Kubernetes setup (or whether any docs are available for that).

calebamiles (Member) commented Jan 17, 2017

Great @foxylion! We (CoreOS) are almost finished collecting responses to our etcd user survey, which will hopefully address topics around the operational lifecycle of an etcd cluster for Kubernetes.

xiang90 (Contributor) commented Jan 17, 2017

@calebamiles We are working on the documentation here: https://docs.google.com/document/d/16ES7N51Xj8r1P5ITan3gPRvwMzoavvX9TqoN8IpEOU8/edit?usp=sharing.

The AWS-specific doc should be based on this documentation.

gianrubio referenced this issue Feb 28, 2017: etcd management #27 (closed, 0 of 4 tasks complete)
mumoshu commented Mar 1, 2017

Hi @xiang90, thanks for developing etcd 😄
I'm the maintainer of kube-aws currently working on the same topic as this issue.
Would you mind clarifying the intended way of taking a consistent backup of an etcd node by answering my question in the draft?

A bit more context can be seen at the kube-aws repo.

hongchaodeng (Member) commented Mar 2, 2017

@mumoshu Let me help answer that.

Taking EBS snapshots is probably not going to work. There is identity and membership information that can't be restored by restoring only the EBS volume.

The recommended approach is to take a snapshot via etcdctl or an API call, then save it to EBS/S3. See our design in the etcd operator: https://github.com/coreos/etcd-operator/blob/master/doc/design/disaster_recovery.md

mumoshu commented Mar 24, 2017

Thanks @hongchaodeng, your answer really helped me.
I ended up implementing kubernetes-incubator/kube-aws#417 thanks to your help.

mumoshu commented Mar 24, 2017

@xiang90 Thanks for answering my question in the google doc!
It was good to know that EBS snapshots are consistent.
However, according to @hongchaodeng's advice and my own experience implementing kubernetes-incubator/kube-aws#417, just restoring etcd nodes from EBS snapshots doesn't work. You may already know this and it was unclear only to me, but let me clarify.

Restoring a single etcd member from an EBS snapshot (or even an etcd snapshot) doesn't work, as you might know. The restored member is refused when it tries to rejoin the etcd cluster, due to an inconsistency in commit indices: the index recorded in the cluster via consensus is newer than the one in the snapshot (which is slightly older than what the cluster has since committed). While the restored member tries to catch up on the entries appended after the snapshot was taken, the cluster effectively says, "Why are you requesting old logs you've already committed? I refuse to let you rejoin."

EBS snapshots only work if you stop all the etcd nodes first, freezing the etcd data (including member identities and commit indices), and then restore all the nodes from those snapshots.

An etcdv3 snapshot, on the other hand, seems to work without freezing etcd like that. Say you have a three-node etcd cluster: choose just one of the snapshots and restore all the nodes from it via etcdctl snapshot restore. It works, as far as I can see.

ReneSaenz (Contributor) commented Jun 21, 2017

Hello. I am trying to restore a k8s cluster from an etcd data backup. I am able to restore the etcd cluster to its original state (it has all the k8s info). However, when I query the k8s cluster for rc, services, deployments, etc., they are all gone. The k8s cluster is not like it was before the restore.
What am I missing? Can someone point me in the right direction?

ReneSaenz (Contributor) commented Jun 22, 2017

@xiang90 Can you give me some direction?

radhikapc (Member) commented Jun 22, 2017

Take a look at the doc at https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/. If you find any gap, kindly file a defect.

robinpercy referenced this issue Oct 6, 2017: etcd backups #1896 (closed)
fejta-bot commented Dec 29, 2017

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with a /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

fejta-bot commented Jan 28, 2018

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

fejta-bot commented May 2, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot commented Jun 1, 2018

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

fejta-bot commented Jul 1, 2018

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

falberto89 commented Oct 18, 2018

Hi! I am working with an HA Kubernetes cluster deployed via kops on AWS. I have snapshots of every EBS volume. I have tried to simulate a disaster in every AZ, for example by setting the desired field of the ASG to 0 and detaching the old volumes. However, when I try to recover the cluster by attaching new volumes created from the snapshots (correctly setting the tags of the new volumes), I am not able to recover etcd quorum; I think this is for the reason @mumoshu gave in his message. I have tried using EBS volumes created from snapshots taken at the same moment, but it doesn't work in any case (the etcd-server-events pod goes into CrashLoop on every master). Is recovering the cluster by simply attaching new volumes created from snapshots the right approach, or not?

foxylion (Author) commented Oct 18, 2018

@falberto89 As I understand it, this will only work in a non-HA setup. When there is more than one etcd node, it is possible that startup from the snapshots will fail due to inconsistencies.

We never tried an HA setup, so I can only speak for the non-HA setup (where it worked when we tried).
