[BUG] Fail to backup a Stopped/Off VM due to volume error state #5841
Comments
@albinsun do we cover this in our e2e automated test coverage, or just manual test coverage?
Some findings from the 2-node support bundle (SB). The failing vmbackup is
This vmbackup tried to back up two of the VM's disks:
The backup is still in progress; however, these two volumes are already in the detached state, in which snapshots cannot be taken.
It seems similar to #3813. I will try to reproduce it for further investigation, thanks.
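As an aside, here is a minimal sketch of how the detached-state observation above can be confirmed against a live cluster, assuming the Longhorn CRDs are served at longhorn.io/v1beta2 (as in recent Longhorn releases) and the volume lives in longhorn-system; the volume name and kubeconfig path are placeholders, not values from this support bundle:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig (placeholder; adjust as needed).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// volumes.longhorn.io is the Longhorn Volume CRD; "pvc-xxxx" is a placeholder name.
	gvr := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "volumes"}
	vol, err := dyn.Resource(gvr).Namespace("longhorn-system").Get(context.TODO(), "pvc-xxxx", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// A volume whose status.state is "detached" is the condition described above:
	// no engine is running for it, so a snapshot cannot be taken at that moment.
	state, _, _ := unstructured.NestedString(vol.Object, "status", "state")
	fmt.Println("volume state:", state)
}
```

The equivalent check when working offline is simply reading status.state of the Volume CR in the support bundle YAMLs.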
After some checking, our current VM backup test cases are mainly located in
Note that, in testing so far, this is harder to reproduce in a bare-metal environment.
@albinsun - I struggled to reproduce this on bare metal; none of my attempts succeeded.
Thank you for checking!
@albinsun - I'll run a few more checks on it too, as it was only two runs; perhaps I'll be able to reproduce it over a few more iterations 👍
We don't have any test cases that back up the VM twice.
cc @ChanYiLin
TL;DR: LH CSI CreateSnapshot() returns without error but sets ReadyToUse to false, which leaves the related volumesnapshot stuck as not ready to use.

Taking the 2-node SB as an example, here is the data structure for one problematic volumesnapshot:
Timeline while trying to take the snapshot:
Around
The LH CSI plugin has matching logs:
Around
This time LH successfully took the snapshot and backup
However, even though the LH snapshot and backup were successfully created, the related volumesnapshot kept its state as

During the reproduction steps on my side, I turned the external-snapshotter log level up to

A possible workaround:
cc @PhanLe1010
I'm not sure yet on the full context of this issue, but it is correct for Longhorn to return no error with ready_to_use set to false.

The cluster snapshotter components should periodically check back with the CSI plugin to confirm whether the snapshot has become ready_to_use.

https://github.com/container-storage-interface/spec/blob/master/spec.md#the-ready_to_use-parameter
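To illustrate what that part of the spec allows, here is a hedged sketch using the CSI Go bindings (the IDs are placeholders and this is not the actual Longhorn driver code): the plugin returns success while flagging the snapshot as not yet usable, and the snapshotter sidecar is then expected to keep calling CreateSnapshot until ready_to_use flips to true.

```go
package main

import (
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// buildPendingSnapshotResponse returns a "success, but not ready yet" response:
// no gRPC error, ReadyToUse set to false. Per the CSI spec, the caller (here the
// external-snapshotter sidecar) should retry CreateSnapshot until it sees true.
func buildPendingSnapshotResponse() *csi.CreateSnapshotResponse {
	return &csi.CreateSnapshotResponse{
		Snapshot: &csi.Snapshot{
			SnapshotId:     "bak-placeholder", // placeholder backup/snapshot ID
			SourceVolumeId: "pvc-placeholder", // placeholder source volume ID
			SizeBytes:      0,
			ReadyToUse:     false, // data is still being processed/uploaded
			// CreationTime omitted for brevity; a real driver would also set it.
		},
	}
}

func main() {
	resp := buildPendingSnapshotResponse()
	fmt.Printf("snapshot %s ready_to_use=%v\n",
		resp.GetSnapshot().GetSnapshotId(), resp.GetSnapshot().GetReadyToUse())
}
```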
I am investigating the permanent failure part, but I think the reason for the initial failure is somewhat clear. This is a Harvester code path:

harvester/pkg/api/volumesnapshot/handler.go Lines 142 to 168 in ccd7e1c
This is very confusing to Longhorn, because the only component that is currently allowed to set volume.Spec.NodeID is Longhorn's own attachment/detachment (AD) controller.

The volume attachment controller notices Harvester tried to attach the volume and decides to detach it.
But some processes are already starting.
Then the volume attachment controller decides to attach the volume.
There is a flurry of process deletions and creations related to the conflicting actions. The engine process crashes because the replica processes that were started are shut down before it can use them.
While this is happening, we try and fail to take a snapshot, kicking off the VM backup failures this ticket was opened with. I think for offline backups to work reliably, versions of Harvester that use Longhorn
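To make the conflicting-writer problem concrete, here is a hedged sketch of the pattern being described, i.e. writing spec.nodeID on a Longhorn Volume CR from outside Longhorn. This is illustrative only, not the Harvester code linked above; the node name, volume name, and namespace are placeholders.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	gvr := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "volumes"}

	// Out-of-band attach: write spec.nodeID directly on the Volume CR.
	// Per the discussion above, Longhorn's volume attachment controller
	// considers itself the owner of this field, so it reacts to the
	// unexpected value and detaches again, racing with the external writer.
	patch := []byte(`{"spec":{"nodeID":"node-1"}}`) // "node-1" is a placeholder
	_, err = dyn.Resource(gvr).Namespace("longhorn-system").Patch(
		context.TODO(), "pvc-placeholder", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("patched spec.nodeID directly (this is the problematic pattern)")
}
```

That tug-of-war over spec.nodeID is what produces the attach/detach flurry and the engine crash described in the timeline above.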
As for the permanent backup failure, Longhorn mostly seems to be performing correctly, given the circumstances. In the two-node support bundle, one backup is taken while the volume is attached and one while it is detached. While it is attached, we see multiple csi-snapshotter calls to create a snapshot:
On the csi-plugin side, we respond with no error each time. The last time, we respond with
While the volume is detached, we see something somewhat similar, except an error occurs. On the csi-snapshotter side, the first attempt fails. But it tries again.
On the csi-plugin side, we respond as follows:
So far, this behavior is correct. Now, the backup actually completes almost immediately after the backup CR is created.
In this case, it would have been nice if the csi-plugin had waited until the backup completed to return, so it could return ready_to_use as true.

The support bundle I'm analyzing doesn't continue long enough for me to check whether csi-snapshotter follows up. Does it do so in other reproductions? cc @bk201 @albinsun
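Going back to the "wait before returning" idea, here is a rough sketch of what that could look like. The helper name, the polling interval, and the "Completed"/"Error" state strings are assumptions for illustration, not Longhorn's actual backup-monitor code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForBackup polls a caller-supplied status check until the backup reaches a
// terminal state or ctx expires, and returns what ready_to_use should be.
// checkBackupState is a placeholder for however the plugin reads backup status.
func waitForBackup(ctx context.Context, checkBackupState func() (string, error)) (bool, error) {
	ticker := time.NewTicker(2 * time.Second) // assumed polling interval
	defer ticker.Stop()
	for {
		state, err := checkBackupState()
		if err != nil {
			return false, err
		}
		switch state {
		case "Completed": // assumed terminal success state
			return true, nil // safe to report ready_to_use = true
		case "Error": // assumed terminal failure state
			return false, fmt.Errorf("backup failed")
		}
		select {
		case <-ctx.Done():
			// Deadline hit: it is still valid to return success with
			// ready_to_use = false and let csi-snapshotter follow up later.
			return false, nil
		case <-ticker.C:
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	ready, err := waitForBackup(ctx, func() (string, error) { return "Completed", nil })
	fmt.Println("ready_to_use:", ready, "err:", err)
}
```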
I agree with @ejweber on the immediate reason the CSI snapshot failed. A small nit: Harvester seems to use this logic to attach the volume instead:

harvester/pkg/api/volume/handler.go Lines 275 to 304 in ccd7e1c
For context, this error in the CSI snapshotter

I strongly agree that Harvester should get rid of all the logic that sets Longhorn volume.Spec.NodeID directly; it will conflict with the Longhorn AD controller and can leave the volume faulted.
It may be related to kubernetes-csi/external-snapshotter#953.
I agree with this workaround. I can trigger this about 50% of the time using pure Longhorn now that we understand what can lead to it. This workaround has successfully gotten me unstuck.
I use this approach to reproduce it with Longhorn only (without Harvester):
I am wondering whether we should recommend that users not take backups of stopped VMs in Harvester 1.2.2 at all, because doing so could leave the volume faulted or trigger rebuilding (if not all replicas crashed at the time). WDYT @ejweber @WebberHuang1118
@FrankYang0529 Please help backport the PR that removes the attach/detach logic in Harvester for backups: #5853
Hi @ejweber @PhanLe1010, there are no further logs in csi-snapshotter for the problematic volumesnapshot. IIUC, this is because Harvester handles its own attaching and detaching, which brings chaos to the volume's behavior. But the LH snapshot and backup eventually completed, so why did the volumesnapshot stay in
It appears to be because of kubernetes-csi/external-snapshotter#953. The reporter there describes a race involving errors and
Unfortunately, when we return an error at some point during the process, it appears to prevent further csi-snapshotter reconciliation (at least some of the time). We tested moving to a version of external-snapshotter that doesn't have the issue in longhorn/longhorn#8618 (comment). You can see the results there.
Thanks @ejweber @PhanLe1010,
Describe the bug
Backup may fail on an Off/Stopped VM.
To Reproduce
Steps to reproduce the behavior:
Create a custom storage class named custom
Create 3 VMs with different volume setup
vm-1 (Only 1 rootdisk)
vm-2-default (1 rootdisk + 1 extra volume using default SC)
vm-2-custom (1 rootdisk + 1 extra volume using custom SC)
Take a backup of the 3 VMs
Stop the 3 VMs
Take a backup of the 3 VMs again
Backup may fail
Longhorn volume replicas fail
Expected behavior
Can back up a VM both when it is Running and when it is Off.
Support bundle
Environment
v1.2.2
Auto
Additional context
supportbundle_off-extra-disk_2nodes.zip