[BUG] Extra snapshot generated when clone from a detached volume #5986
When the reproduce steps are followed, the volume controller responsible for initiating the volume clone first creates a snapshot, then updates the volume's status.cloneStatus field(s) with information about the snapshot. On the first attempt to do this, the snapshot is created but the status update does not succeed. On the second attempt to do this, another snapshot is created and the status update succeeds.
The first attempt to update the volume status fails because the volume controller refuses to continue when it fails to update an engine first. Something must have changed since v1.4.x that either causes the volume controller to update an engine (when it previously wouldn't have) or causes the engine controller to update the engine status simultaneously (when it previously wouldn't have). This regression first appeared on May 14.
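To illustrate the race described above, here is a minimal, hypothetical sketch (not Longhorn code): a reconcile loop creates a snapshot with a freshly generated name and then records that name in the status. If the status update fails (e.g. due to a conflict with the engine controller), the retry has no memory of the first snapshot and creates a second one.

```python
import uuid

snapshots_created = []  # stands in for snapshots actually created on the engine

def reconcile(status, fail_status_update):
    """One pass of a simplified, non-idempotent clone reconcile."""
    if status.get("snapshot") is None:
        name = str(uuid.uuid4())        # non-deterministic name on every attempt
        snapshots_created.append(name)  # stands in for SnapshotCreate
        if fail_status_update:
            return                      # status update lost; the name is forgotten
        status["snapshot"] = name

status = {}
reconcile(status, fail_status_update=True)   # first attempt: update conflicts
reconcile(status, fail_status_update=False)  # retry succeeds
assert len(snapshots_created) == 2           # the extra snapshot from the race
```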
I was able to reproduce this issue in v1.4.2 as well, though it appears to happen only rarely there. It is the result of a race between the volume controller and the engine controller (as detailed above), and v1.5.0 changes seem to have made it more likely for the volume controller to lose it (though I'm not 100% sure why). I am submitting a PR that calls SnapshotCreate with a deterministic UUID as the name (instead of letting longhorn-engine generate one itself). I used a deterministic UUID so we can reuse the same name in a subsequent reconciliation loop if we failed to update the volume or engine with the name of the snapshot we already created.
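The deterministic-name idea can be sketched as follows. This is an illustrative Python sketch, not the actual PR code; the helper name and the inputs used to derive the name are assumptions. The key property is that `uuid5` always produces the same UUID for the same namespace and input, so a retried reconcile asks for the same snapshot name instead of minting a new one.

```python
import uuid

def deterministic_snapshot_name(volume_name: str, attempt_key: str) -> str:
    # uuid5 is deterministic: same namespace + same input => same UUID,
    # so every retry of the same clone operation reuses one snapshot name.
    return str(uuid.uuid5(uuid.NAMESPACE_OID, f"{volume_name}/{attempt_key}"))

# Hypothetical usage: two reconcile attempts derive an identical name.
name1 = deterministic_snapshot_name("source-vol", "clone")
name2 = deterministic_snapshot_name("source-vol", "clone")
assert name1 == name2
```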
Pre Ready-For-Testing Checklist
Environment / Test result: waiting to confirm the expected behavior with the engineer.
Test steps (refer to #5986 (comment)):
- Scenario 1: The source PVC is in detached state
- Scenario 2: The source PVC is in attached state
Hi @ejweber, I'm verifying this issue but would like to consult with you on one question, please.
cc @chriscchien
I only focused on the detached scenario (since that was how the bug was filed). Let me check. If we HAVE decreased the snapshot count legitimately, I think we will want to update the assertion in the attached automated test case.
In my 1.4.2 reproduce, I actually saw three snapshots created! All were created for the same reason the fix targets: a new snapshot is created after a failed reconcile (though this time the failures were due to replica churn).
After updating to master-head, the logs show that the fix is working.
This looks to be working as I would expect now, so no worries. I was curious why this fix didn't cause some test case to fail with a wrong snapshot count (like the failure that opened this bug), but it looks like we only
This fix is possibly the reason: #3692.
Hi @ejweber, I think I know why the e2e test passes:

```python
# Step-8
clone_volume = client.by_id_volume(clone_volume_name)
clone_volume.attach(hostId=lht_host_id)
wait_for_volume_attached(client, clone_volume_name)
clone_volume = wait_for_volume_endpoint(client, clone_volume_name)
# Step-9
check_volume_data(clone_volume, data)
# Step-10
wait_for_volume_healthy(client, clone_volume_name)
# Step-11
source_volume = client.by_id_volume(source_volume_name)
source_volume.attach(hostId=lht_host_id)
source_volume = wait_for_volume_attached(client, source_volume_name)
wait_for_snapshot_count(source_volume, 2)
```

It is because the source volume's snapshot list looks like this:

```
{
    "data": [
        {
            "checksum": "",
            "children": {
            },
            "created": "2023-06-07T07:30:25Z",
            "labels": {
            },
            "name": "volume-head",
            "parent": "2227e197-7a86-5e05-89fb-41b614b0b98a",
            "removed": False,
            "size": "0",
            "usercreated": False
        },
        {
            "checksum": "",
            "children": {
                "volume-head": True
            },
            "created": "2023-06-07T07:30:25Z",
            "labels": {
                "longhorn.io/for-cloning-volume": "pvc-b33eeb51-495d-42a0-8594-4a238372d0eb"
            },
            "name": "2227e197-7a86-5e05-89fb-41b614b0b98a",
            "parent": "",
            "removed": False,
            "size": "8192",
            "usercreated": True
        }
    ],
    "resourceType": "snapshot"
}
```
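Note that the listing above contains two entries, but only one is a real user-created snapshot; the other is the implicit `volume-head` entry. A small sketch (with data mirroring the listing above; the helper name is hypothetical) shows why `wait_for_snapshot_count(source_volume, 2)` matches:

```python
# Minimal copy of the relevant fields from the snapshot listing above.
snapshots = [
    {"name": "volume-head", "usercreated": False},
    {"name": "2227e197-7a86-5e05-89fb-41b614b0b98a", "usercreated": True},
]

def count_snapshots(items, include_volume_head=True):
    """Count snapshot entries, optionally excluding the implicit volume-head."""
    return sum(1 for s in items
               if include_volume_head or s["name"] != "volume-head")

assert count_snapshots(snapshots) == 2                            # what the test asserts
assert count_snapshots(snapshots, include_volume_head=False) == 1  # the one cloning snapshot
```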
Based on the test results, this issue is fixed. Thanks @ejweber for the explanation again. I will close this issue.
Describe the bug (🐛 if you encounter this issue)
From test case test_cloning_with_detached_source_volume
Cloning a volume from a detached volume creates one extra snapshot in the source volume, making the test case fail.
In v1.4.2
master-head
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Either do not generate the extra snapshot (as before), or update the test code to match the new behavior.
Log or Support bundle
N/A
Environment
Additional context
test_inc_restoration_with_multiple_rebuild_and_expansion