
[QUESTION] How to recover from a catastrophic failure? #2714

Closed

alkmim opened this issue Jun 20, 2021 · 27 comments

alkmim commented Jun 20, 2021

Question
How should I recover the Longhorn system if all my nodes die and the only backup I have is /var/lib/longhorn from all the nodes?

Right now, whenever I simulate this kind of failure and re-install Longhorn, I can clearly see the replicas on the nodes, but Longhorn does not see them.

Environment:

  • Longhorn version: latest
  • Kubernetes version: latest
  • Node config
    • OS type and version: not decided yet
    • CPU per node: not decided yet
    • Memory per node: not decided yet
    • Disk type: not decided yet
    • Network bandwidth and latency between the nodes: not decided yet
  • Underlying Infrastructure: Baremetal

Additional context
Since this is a simulation, I can back up any other folder/config right before the failure.

alkmim added the kind/question label Jun 20, 2021
jenting added this to New in Community Issue Review via automation Jun 21, 2021
jenting (Contributor) commented Jun 21, 2021

You could refer to the doc on how to recover the data from the replica.

alkmim (Author) commented Jun 21, 2021

Thanks for the reply @jenting. I just tested the documentation approach. I am able to read the content of the replicas, but I still cannot see the volume in the Longhorn UI. Basically, it lets me access the data, but I cannot reuse it as part of the Longhorn system.

Is there a way I can add the volume back to the Longhorn system and see it in the UI (with its replicas)?

In a nutshell, I would like to be able to reuse the Longhorn system without having to migrate all my data to new volumes (that would take an awful amount of time).

alkmim (Author) commented Jun 21, 2021

Currently, I am looking at this: https://github.com/longhorn/longhorn-engine#running-a-controller-with-multiple-replicas

I think that any solution would include those steps. What I am trying to understand now is:

  1. How to integrate this with Kubernetes.
  2. Whether, as soon as I start the controller, everything will fit together and I will be able to see it in the UI.
  3. Any other manual setup required for the other components.

develmac commented:

While following the above documentation, I am able to start the Longhorn container in Docker and the block device gets created, but I cannot mount it, since it says:

mount: /mnt/pv: /dev/longhorn/pv-d853ac78 already mounted or mount point busy

I have checked and it does not seem to be a multipath issue.

alkmim (Author) commented Jun 21, 2021

Ok. So I started going through the code to see how Longhorn "detects" a volume. In a nutshell, my understanding is:

  • From the longhorn-ui, it seems that all the information comes from the longhorn-manager.
  • The longhorn-manager retrieves everything from a datastore, including the volume information
  • The datastore retrieves information from many places (Called Informers).

My investigation is leaning towards 3 possibilities:

1 - Start a clean Longhorn installation (following the documentation) and call some manager API to recreate each volume, reusing the existing replicas instead of creating new ones. Not sure if those APIs exist.
2 - This option is the same as the first one, but if the API does not exist, maybe create a new API to create a volume from existing replicas.
3 - During the manager initialization, somehow initialize the informers (maybe the replicaInformer?) in a way that they read which replicas are present in the /var/lib/longhorn/replicas folder.

Numbers 1 and 2 are what I am investigating now.

Since this is my first time going through this code, I would appreciate it if anyone familiar with the code could provide some guidance/help to speed up my investigation.

PS.:

  • I do not mind jumping into a call (zoom, google meet, etc)
  • If I can figure this out, I will be glad to provide a patch.

jenting moved this from New to In progress in Community Issue Review Jun 22, 2021
jenting added the require/doc label Jun 22, 2021
jenting (Contributor) commented Jun 22, 2021

I think we should add documentation on how to recover the local replica data by migrating/moving it to a newly created Longhorn volume. The workflow would be like this (a rough command-line sketch follows below):

  1. Create a new Longhorn volume (vol-A) in the detached state with replica count = 1
  2. Copy/sync the existing replica data into the newly created Longhorn volume's (vol-A) replica directory on the host under /var/lib/longhorn/replicas
  3. Attach vol-A and set replica count = 3
  4. Remove the orphaned replica on the host

@alkmim Thank you for working on this enhancement. I think the above steps should work without going through longhorn-ui -> longhorn-manager -> longhorn-engine. But if you would like to get more familiar with Longhorn, you could probably go through the Longhorn architecture doc first.
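
A rough sketch of that workflow from the command line, assuming the new volume is named vol-a (created detached with a single replica from the Longhorn UI or via a Volume CR) and that the old replica data still sits under /var/lib/longhorn/replicas; every directory name below is a placeholder you would need to check on the node:

```bash
# 2. On the node that hosts vol-a's single replica, copy the old replica data
#    into the new replica directory (Longhorn generates the directory names,
#    so list them first).
ls /var/lib/longhorn/replicas/
rsync -a /var/lib/longhorn/replicas/old-vol-r-xxxxxxxx/ \
         /var/lib/longhorn/replicas/vol-a-r-yyyyyyyy/

# 3. Attach vol-a (e.g. from the Longhorn UI) and raise the replica count to 3.
kubectl -n longhorn-system patch volumes.longhorn.io vol-a \
  --type merge -p '{"spec":{"numberOfReplicas":3}}'

# 4. Once vol-a reports healthy, remove the now-orphaned old replica directory.
rm -rf /var/lib/longhorn/replicas/old-vol-r-xxxxxxxx
```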

jenting (Contributor) commented Jun 22, 2021

Possibly we could implement this requirement along with #2461, because when performing the disk replica scanning, it's possible to know how to recover the volume from a replica or clean up an orphaned replica. But it is up to the admin to decide which replicas to recover and which replicas to clean up.

jenting self-assigned this Jun 22, 2021
jenting moved this from In progress to Pending user response in Community Issue Review Jun 22, 2021
alkmim (Author) commented Jun 22, 2021

Hello @jenting. Thanks for the response. I like your idea in #2714 (comment), but I am still concerned about the fact that on every full reboot of the cluster all the data would have to be transferred to new volumes, since this can lead to huge downtime for big datasets and also I/O overhead (network and disk).

Since the replicas are already there and healthy, and the longhorn-engine can start the controller referencing existing replicas (https://github.com/longhorn/longhorn-engine#running-a-controller-with-multiple-replicas), wouldn't it be better for the disk replica scanning to create volumes reusing the existing replicas (if they are healthy, of course) instead of creating new volumes and syncing the content?

PS.:

  • Thanks for the documentation. I couldn't digest all of it yet, but I have been using it as a reference together with the code.
  • I am still available for a quick sync call if it helps to clarify/discuss ideas. Almost any time is good for me (no need to worry about my timezone).
  • I am also willing to contribute to the code, since this is an important feature for my setup.

jenting (Contributor) commented Jun 22, 2021

Not sure what the reason is for a full reboot of the cluster, even the Kubernetes control plane nodes?
If the Longhorn CRs are still in the cluster etcd, then the Longhorn manager should be able to reuse the volumes and replicas.

Besides that, thanks for your willingness to contribute to the Longhorn code. You could refer to these as well.

But if you still have ideas or problems with the Longhorn code, we could discuss them on the Rancher Slack, or let us arrange a call to clarify/discuss the ideas (but we need to know your timezone first 😉).

alkmim (Author) commented Jun 22, 2021

I can see a couple of scenarios for a full reboot (including by design). In my case, I have a requirement that everything in the cluster should be rebuilt from scratch, using only the data disks. One reason that would impact most people would be a power outage.

I am not sure how etcd keeps its information. If the information is persisted across reboots in the filesystem, I can make sure it is always in sync on the data disks. It could be a good direction.

I will read the mentioned documents, but I would also be happy to discuss it on the Rancher Slack or to have a call. Can you invite me to the workspace? It would be good to avoid spending effort in the wrong direction, since I do not know Longhorn's long-term plans and I am also very new to the code.

Again, no need to worry about my timezone. I can talk in any timezone.

alkmim (Author) commented Jun 22, 2021

Another viable option for my use case, would be to follow this workflow:

  • Bootstrap kubernetes cluster
  • Bootstrap longhorn
  • Create all the custom resources (manually or with a script) required to tell Longhorn where the replicas are.

If this solution works, I would still need to know, for every volume/replica, which Custom Resources I need to create to be sure the manager will use them.

jenting (Contributor) commented Jun 22, 2021

Have you considered taking a backup to an external S3 provider before rebooting the cluster? Then even if you rebuild the cluster from scratch, you could restore the data from the external backup.

Or, if all nodes (control plane and worker) are rebuilt, you could choose to use an external etcd so that the Longhorn Custom Resources (CRs) are kept after the nodes reboot or are rebuilt from scratch. This would keep the Kubernetes cluster data persistent. Then, even if the Longhorn components (longhorn-manager, longhorn-instance-engine, longhorn-instance-replicas, ...) are gone, once the cluster is back with the external etcd, Longhorn could still access the original data using the existing Custom Resources (CRs) plus the replicas on the nodes.

alkmim (Author) commented Jun 22, 2021

Have you considered taking a backup to an external S3 provider before rebooting the cluster? Then even if you rebuild the cluster from scratch, you could restore the data from the external backup.

I have the data available to be restored. The problem with this approach is that it takes a long time to restore, even for small amounts of data.

Or, if all nodes (control plane and worker) are rebuilt, you could choose to use an external etcd so that the Longhorn Custom Resources (CRs) are kept after the nodes reboot or are rebuilt from scratch.

Having an external etcd is not an option, since the cluster is entirely self-contained. The reboot will cause etcd to be rebooted as well.

In summary, the full reboot will happen. I will still have all the data, and I can have specific folders on the nodes persisted across reboots (the etcd folders, maybe?). I cannot work around the reboot; everything will be rebooted. The scenario is the same as a power loss across the entire system.

alkmim (Author) commented Jun 22, 2021

Right now, I am investigating 2 approaches:

  • Have the etcd data containing the resource information persisted.
  • Have the volume/replica CR information stored in YAML files. This way I can just run kubectl apply -f on all the YAML files.

@jenting Please let me know how we can set up a call. I will be glad to provide more information if required.

alkmim (Author) commented Jun 22, 2021

I just went through all the resources created when I created a new volume. It seems that these would be the resources:

>>> Longhorn Engine (lhe)
NAMESPACE         NAME                 STATE     NODE   INSTANCEMANAGER   IMAGE   AGE
longhorn-system   my-test-e-6248c06e   stopped                                    6h16m
>>> Longhorn Replicas (lhr)
NAMESPACE         NAME                 STATE     NODE    DISK                                   INSTANCEMANAGER   IMAGE   AGE
longhorn-system   my-test-r-0e2e631d   stopped   node3   02887514-8367-4e30-81a0-16672ef0edfa                             6h16m
longhorn-system   my-test-r-54e1e8a7   stopped   node6   9225800d-0229-4edb-b3a5-19642c3f1918                             6h16m
longhorn-system   my-test-r-8f77c254   stopped   node4   b91cbb5e-1553-403e-abe0-f976f7b65beb                             6h16m
>>> Longhorn Volume (lhv)
NAMESPACE         NAME      STATE      ROBUSTNESS   SCHEDULED   SIZE         NODE   AGE
longhorn-system   my-test   detached   unknown      True        2147483648          6h16m

I will test dumping all of these resources to YAML and then recreating the volumes by running kubectl apply -f. I will update here if it works.
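
For reference, a sketch of how those CRs could be dumped before the simulated failure and re-applied afterwards (the volume, engine, and replica names are the ones from the listing above and will differ per cluster; ordering caveats and status fields that kubectl apply does not restore are discussed further below):

```bash
# Dump the Longhorn volume, engine, and replica CRs to YAML.
kubectl -n longhorn-system get volumes.longhorn.io my-test -o yaml > volume.yaml
kubectl -n longhorn-system get engines.longhorn.io my-test-e-6248c06e -o yaml > engine.yaml
kubectl -n longhorn-system get replicas.longhorn.io \
  my-test-r-0e2e631d my-test-r-54e1e8a7 my-test-r-8f77c254 -o yaml > replicas.yaml

# On the rebuilt cluster, recreate them.
kubectl apply -f replicas.yaml -f engine.yaml -f volume.yaml
```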

jenting (Contributor) commented Jun 22, 2021

Right now, I am investigating 2 approaches:

  • Have the etcd data containing the resource information persisted.
  • Have the volume/replica CR information stored in YAML files. This way I can just run kubectl apply -f on all the YAML files.

You could probably consider using Rancher cluster restore or Velero to back up and restore the etcd data and Longhorn CRs.
Or maybe you want this feature: #1455.
Let me know if the above approaches meet your requirement for now.

@jenting Please let me know how we can set up a call. I will be glad to provide more information if required.

cc @yasker @innobead

alkmim (Author) commented Jun 23, 2021

Hello @jenting, thanks for your reply. Rancher cluster restore and Velero are a good start, although I will have to figure out a way to restore only the Longhorn-related info from the etcd backup.

Feature #1455 sounds promising.

Meanwhile, I continued with the tests I mentioned above. Below are my findings

  • After going through the code, it seems that this approach might actually work.
  • I was able to recreate the LH Volume.
  • I was not able to recreate the LH Engine or the LH Replicas. I think the resources are being garbage collected as soon as I create them. kubectl returns "created", but as soon as I get the resource it is not there anymore. Below is the YAML I used to create the resources.

Yaml used: longhorn_volume.yaml

Some notes:

  • I double-checked the disk ID, replica location, etc.
  • I actually tried deploying these resources in many different ways. I even tried undeploying the Longhorn manager.
  • The test that made me believe garbage collection is the reason: I started a fresh Kubernetes installation and, during the Longhorn deployment, did not deploy any pods (removed them from the YAML definition). Even in this scenario, the CRs got deleted as soon as I created them.

@jenting Do you see anything that could cause the LH Replicas and LH Engines to be deleted as soon as I create them?

jenting moved this from Pending user response to In progress in Community Issue Review Jun 25, 2021
jenting (Contributor) commented Jun 25, 2021

cc @shuo-wu

shuo-wu (Contributor) commented Jun 25, 2021

  1. Manually recreating Longhorn CRs (volume, engine, replicas, node, etc.) is not a good idea. IIRC, the CR status won't be applied if you use the exported YAML files to do the manual recreation. As a result, some status fields in Longhorn volumes may not be right, which leads to the volume being unavailable. e.g., volume.Status.RestoreInitiated should be true if a volume is a restored volume.
  2. If you really want to recreate the CRs manually via YAML, you can try the following tips (a rough sketch follows below):
    1. Guarantee that Spec.DesireState of engines/replicas is stopped and that volume.spec.NodeID is empty.
    2. Create the replicas, then the engine, before recovering the volume. Do not do the recovery in the reverse order, since once there is a new volume, Longhorn will immediately check it and create an engine & replicas for it.
    3. Manually set volume.Status.RestoreInitiated to true for a restored volume. For a restoring volume, it's better to redo the restore.
    4. If you modified Longhorn settings, I would recommend recreating the setting CRs as well.
  3. Besides, I am not sure whether there are any other special cases like restored volumes. You may need to check by yourself.
  4. Actually, based on the requirements you mentioned above, the simplest way is always to do an etcd backup and restore via Rancher or some tool like Velero. Then you don't need to take care of these trivial issues.
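
To make the ordering and spec constraints in those tips concrete, here is a minimal sketch assuming volume.yaml, engine.yaml, and replicas.yaml were dumped from the original cluster (the field names desireState and nodeID match the Longhorn v1.1.x CRDs; the grep checks are only a quick sanity pass, not a complete validation):

```bash
# The engine/replica manifests should carry desireState: stopped,
# and the volume manifest should have an empty nodeID (detached).
grep -n 'desireState' replicas.yaml engine.yaml   # expect: stopped
grep -n 'nodeID' volume.yaml                      # expect: empty value

# Recreate replicas first, then the engine, and the volume last,
# so Longhorn does not spawn fresh engine/replica CRs for the new volume.
kubectl apply -f replicas.yaml
kubectl apply -f engine.yaml
kubectl apply -f volume.yaml
```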

alkmim (Author) commented Jun 25, 2021

I think I found a good procedure. Longhorn was able to recognize the disks correctly. One of the tricks I was missing was to properly set the metadata.labels on the replicas and the engine. Here is the overall approach (a command sketch for steps 3-5 follows the notes below):

  1. Stop the longhorn-manager daemonset (kubectl delete daemonset longhorn-manager -n longhorn-system)
  2. Apply volume.yaml
  3. Take note of the UUID of the volume (required for setting the ownership of the replicas and the engine)
  4. Apply replicas.yaml (the UUID of the volume needs to be included in the ownerReferences)
  5. Apply engine.yaml (the UUID of the volume needs to be included in the ownerReferences)
  6. Restart the longhorn-manager (kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.1.1/deploy/longhorn.yaml)

Note:

  • I am not using any special features. I am using a volume with no snapshots and 3 replicas. Frontend is iSCSI.
  • I still need to do more testing.
  • Attached are replicas.yaml, engine.yaml, and volume.yaml:
    engine.yaml
    replicas.yaml
    volume.yaml
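
A sketch of steps 3-5, assuming the volume is named my-test and the dumped replica/engine manifests use a placeholder for the owner UID (the volume and replica names are illustrative; the longhornvolume label key is taken from the dumped CRs of a freshly created volume):

```bash
# Step 3: read the UID of the newly applied volume CR.
VOL_UID=$(kubectl -n longhorn-system get volumes.longhorn.io my-test \
  -o jsonpath='{.metadata.uid}')

# Steps 4-5: each replica/engine manifest needs metadata along these lines
# before being applied (fragment of replicas.yaml):
#
#   metadata:
#     name: my-test-r-0e2e631d
#     namespace: longhorn-system
#     labels:
#       longhornvolume: my-test
#     ownerReferences:
#     - apiVersion: longhorn.io/v1beta1
#       kind: Volume
#       name: my-test
#       uid: VOLUME_UID_PLACEHOLDER
#
sed -i "s/VOLUME_UID_PLACEHOLDER/${VOL_UID}/" replicas.yaml engine.yaml
kubectl apply -f replicas.yaml
kubectl apply -f engine.yaml
```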

Thanks @shuo-wu and @jenting for your help! Please let me know your thoughts on this approach. I also hope this helps someone else facing the same requirements.

shuo-wu (Contributor) commented Jun 28, 2021

Stopping the longhorn-manager pods and modifying the UUID is a little bit tricky, but I think it will work. As I mentioned above, you may need to handle restored volumes (volume.Spec.FromBackup is not empty) separately.

alkmim (Author) commented Jun 28, 2021

Thanks for the feedback @shuo-wu. In my case, I am not using restored volumes (at least I think I am not). I took the empty value of volume.Spec.FromBackup from a newly created volume after I added a bit of data to it. Would I still need to change this value?

I have one more concern:

  • In my case, I have many machines and I am using a replica count of 3. My understanding is that if one of the machines hosting a replica shuts down, Longhorn will recreate the replica somewhere else. In that case, after a reboot, the volume will have 4 replicas (one old/deprecated). Questions:
    • Is there a way to choose the correct 3 replicas?
    • Can I add all 4 replicas and have Longhorn detect the old one and clean it up?

Thanks for all the feedback!

cclhsu added this to the Planning milestone Jun 29, 2021
cclhsu commented Jun 29, 2021

Is there a way to choose the correct 3 replicas?
Can I add all 4 replicas and have Longhorn detect the old one and clean it up?

After replication is done, all 4 replicas should be in a healthy state. Currently, Longhorn does not automatically scale down the replicas; the user can manually delete one of the healthy replicas. To add a replica to a volume, you can first increase the replica count for the volume and, after successful replication, decrease the replica count back to the original value (a sketch follows below).
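
A sketch of that increase-then-decrease flow with kubectl, assuming a volume named my-test (spec.numberOfReplicas is the field in the Longhorn Volume CRD; the same change can be made from the Longhorn UI):

```bash
# Temporarily raise the replica count so Longhorn rebuilds an extra replica.
kubectl -n longhorn-system patch volumes.longhorn.io my-test \
  --type merge -p '{"spec":{"numberOfReplicas":4}}'

# Once all replicas report healthy, drop back to the original count.
kubectl -n longhorn-system patch volumes.longhorn.io my-test \
  --type merge -p '{"spec":{"numberOfReplicas":3}}'
```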

shuo-wu (Contributor) commented Jun 29, 2021

If there is no restored volume, you can ignore the field.

You can delete any replicas you don't want to retain, except for the last healthy one. Longhorn will detect the out-of-sync replicas each time the volume is attached.
Most of the time, you don't need to worry about the replica count or manual replica cleanup, except for the one special case Clark mentioned above: scaling down the replica count for a volume. In that case, you need to remove the redundant replicas manually.
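
Removing a specific redundant replica can be done from the Longhorn UI or, roughly, by deleting its Replica CR (the replica name below is a placeholder taken from the earlier listing):

```bash
# Delete one redundant replica; the remaining healthy replicas keep serving the volume.
kubectl -n longhorn-system delete replicas.longhorn.io my-test-r-8f77c254
```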

c3y1huang moved this from In progress to Resolved/Scheduled in Community Issue Review Jun 30, 2021
alkmim (Author) commented Jul 2, 2021

Thanks for your replies @cclhsu and @shuo-wu.

I did some reboots (for testing) and it seems that recreation following the process described in my previous post works. Feel free to close this issue. If I find any issue related to this procedure, I will update it here.

That said, I think it would be a good feature for Longhorn to be able to detect volumes based on the replicas located on the nodes at boot time.

alkmim closed this as completed Jul 2, 2021
innobead removed this from the Planning milestone Aug 26, 2021
PatrickHuetter commented:

I also think it would be a good feature, even if it had to be triggered manually by calling an API or pressing a button in the frontend. We had a huge Longhorn/cluster outage a few days ago and I spent hours, nights, and days recovering all the volumes (more than 250). The backups worked, but one backup was defective and one was missing. I don't know why, but since I still have the data on the hard disks of the old nodes, I want to recover from there. Right now I'm reading this thread and figuring out how to get the data off the old servers' hard disks/filesystem and import it again into the new cluster with Longhorn installed.

innobead modified the milestone: Planning Dec 25, 2021
alkmim (Author) commented Dec 30, 2021

Right now I'm reading this thread and figuring out how to get the data off the old servers' hard disks/filesystem and import it again into the new cluster with Longhorn installed.

I am not sure if it is helpful, but the procedure I tested seems to be working so far (see here for reference: #2714 (comment)).

I know it is a manual procedure, but it seems to be working just fine.

I actually wrote some automated Python code for my environment, though. It was not very hard.
