
Backups of etcd (AWS) #40027

Closed
foxylion opened this issue Jan 17, 2017 · 18 comments
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
sig/api-machinery: Categorizes an issue or PR as relevant to SIG API Machinery.
sig/docs: Categorizes an issue or PR as relevant to SIG Docs.

Comments

@foxylion

foxylion commented Jan 17, 2017

Hi,
we are going to use Kubernetes on AWS (set up with kops). One thing I wanted to address before going into production is taking backups of the data storage.

I've currently set up a non-HA Kubernetes master and I want to take regular backups of etcd to be able to restore the cluster in case of a volume failure at AWS.

This currently doesn't seem to be possible. There should be an integrated backup/restore solution.

Alternatively (which would most likely concern kops), I should be able to restore an etcd volume snapshot and change the volume IDs inside of Kubernetes.

Related issue at kops: kubernetes/kops#1506

@calebamiles added the kind/feature, sig/api-machinery, and sig/docs labels Jan 17, 2017
@calebamiles
Contributor

Do you think you could point @foxylion to some docs @hongchaodeng or @xiang90? Thanks!

cc: @kubernetes/sig-api-machinery-misc

@foxylion
Author

foxylion commented Jan 17, 2017

@calebamiles I had a discussion with @justinsb and therefore created a pull request (kubernetes/kops#1511) covering the etcd backup/restore procedure for kops. I don't know whether this also applies to a bare Kubernetes setup (or whether there are any docs available).

@calebamiles
Contributor

Great @foxylion! We (CoreOS) are almost finished collecting responses to our etcd user survey, which will hopefully address topics around the operational lifecycle of an etcd cluster for Kubernetes.

@xiang90
Contributor

xiang90 commented Jan 17, 2017

@calebamiles We are working on the documentation here: https://docs.google.com/document/d/16ES7N51Xj8r1P5ITan3gPRvwMzoavvX9TqoN8IpEOU8/edit?usp=sharing.

The AWS-specific doc should be based on this documentation.

@mumoshu

mumoshu commented Mar 1, 2017

Hi @xiang90, thanks for developing etcd 😄
I'm the maintainer of kube-aws, currently working on the same topic as this issue.
Would you mind clarifying the intended way of taking a consistent backup of an etcd node by answering my question in the draft?

A bit more context can be seen at the kube-aws repo.

@hongchaodeng
Contributor

@mumoshu Let me help answer that.

Taking EBS snapshots probably isn't going to work. There is identity and membership information that can't be restored by restoring only the EBS volume.

It's recommended to take a snapshot via etcdctl or an API call, then save it to EBS/S3. See our design in the etcd operator: https://github.com/coreos/etcd-operator/blob/master/doc/design/disaster_recovery.md
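
For anyone following along, here's a minimal sketch of what that could look like with the etcd v3 CLI plus the AWS CLI. The endpoint, certificate paths, and bucket name below are placeholders I made up, not something kops or the etcd operator configures for you:

```bash
# Take a consistent snapshot from a single etcd member (etcd v3 API).
# Endpoint, certificate paths, and the S3 bucket are placeholders.
ETCDCTL_API=3 etcdctl snapshot save /tmp/etcd-backup.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt \
  --cert=/etc/etcd/client.crt \
  --key=/etc/etcd/client.key

# Sanity-check the snapshot before shipping it off the node.
ETCDCTL_API=3 etcdctl snapshot status /tmp/etcd-backup.db

# Upload to S3 (bucket name is hypothetical).
aws s3 cp /tmp/etcd-backup.db "s3://my-etcd-backups/etcd-$(date +%Y%m%d-%H%M%S).db"
```

Running something like this on a cron schedule gives you off-node backups that don't depend on the EBS volume surviving.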

@mumoshu

mumoshu commented Mar 24, 2017

Thanks @hongchaodeng, your answer really helped me.
I ended up implementing kubernetes-retired/kube-aws#417 thanks to your help.

@mumoshu

mumoshu commented Mar 24, 2017

@xiang90 Thanks for answering my question in the google doc!
It was good to know that EBS snapshots are consistent.
However, according to @hongchaodeng's advice and my own experience implementing kubernetes-retired/kube-aws#417, just restoring etcd nodes from EBS snapshots doesn't work. You may already know this and it may have been unclear only to me, but let me clarify.

Restoring a single etcd member from an EBS snapshot (or even an etcd snapshot) doesn't work, as you might know. Doing so results in the restored etcd member being refused from joining the etcd cluster due to an inconsistency in commit indices: the index recorded in the cluster via consensus (is that the right word?) differs from the one in the EBS snapshot, which is a bit older than what is currently recorded in the cluster. The cluster effectively says "why are you requesting old logs you've already committed? I refuse to let you join" while the restored etcd member tries to catch up on the logs appended after the snapshot was taken.

EBS snapshots only work when you stop all the etcd nodes to freeze the etcd data (including member identities and commit indices) and then restore all the etcd nodes from those snapshots.

On the other hand, an etcd v3 snapshot seems to work without freezing etcd data like that. Say you have a 3-node etcd cluster: take a snapshot from just one member and then restore all the nodes from that snapshot via etcdctl snapshot restore. It works (as far as I can see).
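
For later readers, a rough sketch of that restore flow, assuming a hypothetical 3-member cluster (the member names etcd-0/1/2, peer IPs, and data directory below are made up for illustration):

```bash
# Run one restore per member, each on its own host, all from the SAME snapshot.
# Member names, peer URLs, and the data dir are placeholders for illustration.
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-backup.db \
  --name etcd-0 \
  --initial-cluster etcd-0=https://10.0.0.10:2380,etcd-1=https://10.0.0.11:2380,etcd-2=https://10.0.0.12:2380 \
  --initial-cluster-token etcd-cluster-restored \
  --initial-advertise-peer-urls https://10.0.0.10:2380 \
  --data-dir /var/lib/etcd-restored

# Repeat on the other two hosts with --name etcd-1 / etcd-2 and their own
# peer URLs, point each etcd process at the new data dir, then start all members.
```

The restore step rewrites the membership information into the fresh data dir, which is exactly the part a raw EBS volume restore can't fix.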

@ReneSaenz
Contributor

Hello. I am trying to restore a k8s cluster from an etcd data backup. I am able to restore the etcd cluster to its original state (it has all the k8s info). However, when I check the k8s cluster's rc, services, deployments, etc., they are all gone. The k8s cluster is not like it was before the restore.
What am I missing? Can someone point me in the right direction?

@ReneSaenz
Contributor

@xiang90 Can you give me some direction?

@radhikapc

Take a look at the doc at https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/. If you find any gap, kindly file a defect.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 29, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jan 28, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label May 2, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jun 1, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@falberto89

falberto89 commented Oct 18, 2018

Hi! I am working with an HA Kubernetes cluster deployed via kops on AWS. I have snapshots of every EBS volume. I have tried to simulate a disaster in every AZ, for example by setting the desired field of the ASG to 0 and detaching the old volumes. However, when I try to recover the cluster by attaching new volumes created from the snapshots (with the tags of the new volumes set correctly), I am not able to recover etcd quorum; I think the reason is what @mumoshu described in his message. I have also tried using EBS volumes created from snapshots taken at the same moment, but it doesn't work in any case (the etcd-server-events pod goes into a crash loop on every master). Is recovering the cluster by simply attaching new volumes created from snapshots the right approach or not?

@foxylion
Author

foxylion commented Oct 18, 2018

@falberto89 As I understand it, this will only work in a non-HA setup. When there is more than one etcd node, it is possible that a startup from the snapshots will fail due to inconsistencies.

We never tried an HA setup, so I can only speak for the non-HA setup (where it worked when we tried).
