Restore multiple (master) servers from etcd snapshot #3174

Closed · StarpTech opened this issue Apr 11, 2021 · 6 comments

StarpTech commented Apr 11, 2021

Is your feature request related to a problem? Please describe.
Yes, the current documentation only describes how to restore a snapshot in a single-master setup.

Describe the solution you'd like
It should be possible to restore a snapshot and distribute it to all other servers, as RKE does: https://rancher.com/docs/rke/latest/en/etcd-snapshots/#how-restoring-from-a-snapshot-works

Describe alternatives you've considered
Documentation and automation of how to do it safely with the current implementation. My instructions were as follows (a consolidated sketch follows the list):

1. Stop the first master server:

   ```
   sudo systemctl stop k3s
   ```

2. Restore the first master server from a snapshot:

   ```
   ./k3s server \
     --cluster-reset \
     --cluster-reset-restore-path=<PATH-TO-SNAPSHOT>
   ```

3. Connect to each of the other servers and run:

   ```
   sudo systemctl stop k3s
   rm -rf /var/lib/rancher/k3s/data
   sudo systemctl start k3s
   ```

4. Cluster is healthy.
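
A rough end-to-end sketch of the above, assuming a systemd-managed install (service name k3s), SSH access to the peers, and the default data directory; the host names and snapshot path are placeholders, not something prescribed by k3s:

```bash
#!/usr/bin/env bash
# Sketch only -- host names, snapshot path, and the directory to wipe are assumptions.
set -euo pipefail

SNAPSHOT="<PATH-TO-SNAPSHOT>"      # path to the etcd snapshot on the first master
PEERS="server-2 server-3"          # hypothetical remaining master nodes

# 1. Restore the first master server (run locally on that node).
sudo systemctl stop k3s
sudo k3s server \
  --cluster-reset \
  --cluster-reset-restore-path="${SNAPSHOT}"
# When the reset reports it is done, restart k3s normally (without the reset flags).
sudo systemctl start k3s

# 2. Wipe the old etcd state on each remaining master so it rejoins cleanly
#    (the exact directory to remove is discussed further down in this issue).
for host in ${PEERS}; do
  ssh "${host}" 'sudo systemctl stop k3s &&
                 sudo rm -rf /var/lib/rancher/k3s/server/db &&
                 sudo systemctl start k3s'
done

# 3. Verify the cluster is healthy again.
kubectl get nodes
```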

Additional information

The cluster was installed with https://github.com/StarpTech/k-andy

StarpTech changed the title from "How to restore multiple (master) servers from etcd snapshot?" to "Restore multiple (master) servers from etcd snapshot" on Apr 11, 2021
brandond (Contributor) commented:

We don't have a central coordination tool like RKE, and no plans to create one. After restoring the snapshot to the first server, you should remove the database files on the other servers and rejoin them to the cluster.

StarpTech (Author) commented Apr 11, 2021

Hi @brandond, so the workaround is correct? What's the long-term strategy for handling restore scenarios in large clusters?

brandond (Contributor) commented Apr 12, 2021

Long term, automation of this sort will likely be handled by Rancher cluster operator orchestration.

StarpTech (Author) commented Apr 12, 2021

Could we document the restore procedure for the current implementation with multiple master nodes? I'm not sure if this is exactly the right approach.

brandond (Contributor) commented Apr 12, 2021

Follow the restore instructions from the docs. When the restore is complete you will see a message on the console (printed by this line in the code):

```go
logrus.Infof("Etcd is running, restart without --cluster-reset flag now. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes")
```

Follow those instructions: stop k3s on the other servers (if it is still running), back up and delete the referenced directory, then start k3s again to rejoin the cluster.
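
A minimal sketch of those steps on each peer etcd server, assuming a systemd-managed install (service name k3s) and the default data directory /var/lib/rancher/k3s:

```bash
# Run on every peer etcd server except the one the snapshot was restored on.
sudo systemctl stop k3s                        # stop k3s if it is still running
sudo mv /var/lib/rancher/k3s/server/db \
        /var/lib/rancher/k3s/server/db.bak     # back up and remove the old etcd data
sudo systemctl start k3s                       # start k3s again to rejoin the cluster
```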

StarpTech (Author) commented:

Thanks, I hadn't noticed that last line.
