Galal hussein etcd backup restore #2154

Merged
merged 57 commits into k3s-io:master on Aug 28, 2020

Conversation


@briandowns briandowns commented Aug 22, 2020

Proposed changes

Add etcd snapshot and restoration features. This includes adding five new flags:

  • "etcd-snapshot-interval"
  • "etcd-snapshot-dir"
  • "etcd-snapshot-restore-path"
  • "etcd-disable-snapshots"
  • "etcd-snapshot-retention"

Types of changes

  • New feature (non-breaking change which adds functionality)

Verification

Snapshots are enabled by default and the snapshot dir defaults to /server/db/snapshots; verify by checking that directory for saved snapshots.
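
For example, assuming the default k3s data dir of /var/lib/rancher/k3s (adjust for a custom --data-dir):

# list saved snapshots; path assumes the default data dir
ls -lh /var/lib/rancher/k3s/server/db/snapshots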

Restoration works the same way cluster-reset does: when the flag is specified, k3s moves the old data dir to /server/db/etcd-old, then attempts to restore the snapshot by creating a new data dir and starting etcd with a forced new cluster, so that it starts a one-member cluster. Upon completion of the restore, k3s will exit.

To verify the restoration, you should see the cluster start with only one etcd member and the data from the specified snapshot restored correctly.
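
One way to confirm the single-member cluster is to query etcd directly. A sketch, assuming the managed-etcd TLS material lives under the default data dir (the certificate paths here are assumptions):

# should list exactly one member after a restore
etcdctl member list \
  --endpoints https://127.0.0.1:2379 \
  --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key /var/lib/rancher/k3s/server/tls/etcd/client.key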

The PR also adds etcd snapshot retention and a flag to disable snapshots altogether.

Testing

1. Testing disabling snapshots

  • start k3s with --cluster-init and --etcd-disable-snapshots

You should see that no snapshots have been created in /server/db/snapshots.
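
A sketch of this check, assuming the default data dir:

k3s server --cluster-init --etcd-disable-snapshots
# in another shell, once the server is up; the directory should be absent or empty
ls /var/lib/rancher/k3s/server/db/snapshots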

2. Testing snapshot interval and retention

  • start k3s with --etcd-snapshot-interval set to 5s and --etcd-snapshot-retention set to 10

You should see snapshots created every 5 seconds in /server/db/snapshots, and no more than 10 snapshots in the directory.
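
For instance (default data dir assumed; a 5s interval is only sensible for testing):

k3s server --cluster-init --etcd-snapshot-interval 5s --etcd-snapshot-retention 10
# after a minute or so, the count should hold steady at the retention limit
ls /var/lib/rancher/k3s/server/db/snapshots | wc -l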

Make sure that if the --cluster-reset-restore-path flag is given, the --cluster-reset flag is also present:

./k3s server --datastore-endpoint=etcd --cluster-reset-restore-path=/home/bdowns/etcd_snapshot-etcd-snapshot-1598631420
FATA[2020-08-28T19:06:27.534957919Z] Invalid flag use. --cluster-reset required with --cluster-reset-restore-path

Verify that there's a failure when an invalid path is provided:

FATA[2020-08-28T19:07:46.056232602Z] starting kubernetes: preparing server: start cluster and https: etcd: snapshot path does not exist: /home/bdowns/etcd_snapshot-etcd-snapshot-1598631420

3. Testing snapshot restore

To test snapshot restore, use --etcd-snapshot-restore-path (later renamed --cluster-reset-restore-path; see Verification above) and point it at any snapshot file on the system.

The cluster should restore, etcd should have only one member, and you should see that the /server/db/etcd-old directory has been created.
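
A sketch using the final flag names from the review discussion (the data dir and snapshot filename are illustrative):

k3s server --cluster-reset \
  --cluster-reset-restore-path /var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-1598631420
# k3s exits when the restore completes; confirm the old data dir was preserved
ls -d /var/lib/rancher/k3s/server/db/etcd-old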

Linked Issues

rancher/rke2#45

galal-hussein and others added 9 commits August 13, 2020 01:25
@briandowns briandowns requested a review from a team as a code owner August 22, 2020 02:27
@briandowns briandowns self-assigned this Aug 22, 2020
@briandowns briandowns requested a review from a team August 22, 2020 22:50
@briandowns
Contributor Author

The subcommand was discussed in a conversation last week (or Monday); I think it was expressed as a desire, but as a step to take after this work landed.

@cjellick
Contributor

Makes sense. I'll wait to see what Darren says, only to make sure his opinion on the UX doesn't clash with what we have here.
If it doesn't, then we'll merge this and slot the subcommands work in for later. It's the one-off snapshotting that I'm mostly concerned about, which I don't recall talking about, but it's quite possible I just didn't catch it or it didn't sink in.

@cjellick
Contributor

Ok, synced with @ibuildthecloud and here are the changes. SPOILER: he wants to do less than what I proposed.

  1. He wants restoring to be part of the --cluster-reset UX. So --etcd-snapshot-restore-path should be --cluster-reset-restore-path, and it should do nothing on its own; it should just influence where --cluster-reset loads its data from.
  2. --etcd-snapshot-interval should be changed from an interval to a cron, i.e. --etcd-snapshot-schedule-cron. Default: every 12 hours.
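
What the cron form might look like (a sketch; the flag name and default expression follow the comment above but are assumptions here):

# hypothetical cron-based schedule: every 12 hours
k3s server --etcd-snapshot-schedule-cron "0 */12 * * *"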

Two things unrelated to this PR:

  1. S3 is out of 1.0. We will do S3 backups when we get a full Rancher integration in 2.6. We only need local backup for GA.
  2. A command for a one-off, on-demand backup is out of 1.0. Same deal: we will hit it as part of the full-fledged 2.6 integration.
    cc @davidnuzik

@davidnuzik
Contributor

To track the unrelated items we'd like to pick up later, I have created the following issues:
rancher/rke2#249 for S3 bucket
rancher/rke2#250 for on-demand backups
cc @cjellick

@briandowns
Contributor Author

> Ok, synced with @ibuildthecloud and here are the changes. SPOILER: he wants to do less than what I proposed.
>
>   1. He wants restoring to be part of the --cluster-reset UX. So --etcd-snapshot-restore-path should be --cluster-reset-restore-path, and it should do nothing on its own; it should just influence where --cluster-reset loads its data from.
>   2. --etcd-snapshot-interval should be changed from an interval to a cron, i.e. --etcd-snapshot-schedule-cron. Default: every 12 hours.
>
> Two things unrelated to this PR:
>
>   1. S3 is out of 1.0. We will do S3 backups when we get a full Rancher integration in 2.6. We only need local backup for GA.
>   2. A command for a one-off, on-demand backup is out of 1.0. Same deal: we will hit it as part of the full-fledged 2.6 integration.
>     cc @davidnuzik

For the cron, do we need to support cronspec?

@briandowns
Contributor Author

@cjellick Regarding "--etcd-snapshot-restore-path should be --cluster-reset-restore-path and it should do nothing on its own. It should just influence where --cluster-reset loads its data from": I'm assuming that if the flag is present, we do a restore from that snapshot on reset rather than creating a new cluster?

@briandowns briandowns merged commit 866dc94 into k3s-io:master Aug 28, 2020
@liyimeng
Contributor

So a snapshot is always needed to restore a cluster if it loses quorum?

@briandowns
Contributor Author

> So a snapshot is always needed to restore a cluster if it loses quorum?

If you issue --cluster-reset, it will do a reset; however, if you provide that flag along with --cluster-reset-restore-path, it will do the restore from the given snapshot.
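
In command form (data dir and snapshot path illustrative):

# reset only: come back up as a fresh single-member cluster from existing data
k3s server --cluster-reset

# reset plus restore: re-seed the etcd data dir from the given snapshot first
k3s server --cluster-reset \
  --cluster-reset-restore-path /var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-1598631420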

briandowns added a commit to briandowns/k3s that referenced this pull request Jan 14, 2021
* Add etcd snapshot and restore

Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>

* fix error logs

Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>

* goimports

Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>

* fix flag describtion

Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>

* Add disable snapshot and retention

Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>

* use creation time for snapshot retention

Signed-off-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>

* unexport method, update var name

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* adjust snapshot flags

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update var name, string concat

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* revert previous change, create constants

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* updates

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* type assertion error checking

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* pr remediation

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* pr remediation

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* pr remediation

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* pr remediation

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* pr remediation

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* updates

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* updates

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* simplify logic, remove unneeded function

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update flags

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update flags

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* add comment

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* exit on restore completion, update flag names, move retention check

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* exit on restore completion, update flag names, move retention check

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* exit on restore completion, update flag names, move retention check

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update disable snapshots flag and field names

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* move function

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update field names

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update var and field names

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update var and field names

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update defaultSnapshotIntervalMinutes to 12 like rke

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update directory perms

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update etc-snapshot-dir usage

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update interval to 12 hours

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* fix usage typo

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* add cron

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* add cron

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* add cron

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* wire in cron

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* wire in cron

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* wire in cron

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* wire in cron

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* wire in cron

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* wire in cron

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* wire in cron

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update deps target to work, add build/data target for creation, and generate

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* remove dead make targets

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* error handling, cluster reset functionality

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* error handling, cluster reset functionality

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* update

Signed-off-by: Brian Downs <brian.downs@gmail.com>

* remove intermediate dapper file

Signed-off-by: Brian Downs <brian.downs@gmail.com>

Co-authored-by: galal-hussein <hussein.galal.ahmed.11@gmail.com>