
[BUG] Backup restore succeeds when an instance manager engine crashes and data is NOT available in the restore volume #1336

Closed
sowmyav27 opened this issue May 13, 2020 · 5 comments
Labels: area/v1-data-engine v1 data engine (iSCSI tgt) · component/longhorn-manager Longhorn manager (control plane) · kind/bug

@sowmyav27

Describe the bug
Backup restore succeeds when an instance manager engine crashes and data is NOT available in the restore volume

To Reproduce

  • Have a volume v1 with 3 replicas.
  • Write data into the volume.
  • Take a backup.
  • Restore from the backup.
  • While the restore is in progress, delete the instance manager engine running the volume (a minimal sketch of this step follows the list).
  • The restore succeeds.
  • Data is NOT available in the volume.
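
For reference, a minimal sketch (not part of the original report) of the crash step using client-go, assuming a recent client-go and a kubeconfig at the default location. The pod name is illustrative (taken from the logs below) and must match the engine instance manager on the node where the restore volume is attached.

```go
// Hypothetical reproduction helper: delete the engine instance manager pod
// while the restore is still in progress. Not part of the original report.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// Pod name is illustrative; it will differ per cluster.
	if err := client.CoreV1().Pods("longhorn-system").Delete(
		context.TODO(), "instance-manager-e-4d6d703f", metav1.DeleteOptions{},
	); err != nil {
		log.Fatal(err)
	}
}
```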

Expected behavior
@yasker Should the restore fail in this case, and should the volume be shown as "Faulted"?

Log
longhorn-support-bundle_f7248c7c-cbac-44c3-b611-48b17218f97b_2020-05-13T18-25-27Z.zip

Note:
Backup is in S3

Environment:

  • Longhorn version: master - 05/13/2020
  • Kubernetes version: 1.17.5
  • Node OS type and version: RKE LINUX DO nodes
@yasker (Member) commented May 14, 2020

@shuo-wu We need to understand why it behaves this way before fixing it. The question here is: why did the restore succeed?

@shuo-wu (Contributor) commented May 15, 2020

There are some weird/wrong behaviors after deleting the engine manager:

  1. Stopping the errored engine takes a long time (about 1 minute).
[longhorn-manager-2dv6l] time="2020-05-15T02:53:23Z" level=warning msg="Starts to clean up then recreates pod for instance manager instance-manager-e-4d6d703f with state stopped"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:23Z" level=info msg="Created instance manager pod instance-manager-e-4d6d703f for instance manager instance-manager-e-4d6d703f"
......
[longhorn-manager-2dv6l] time="2020-05-15T02:53:35Z" level=debug msg="Prepare to delete instance vol1-e-734966aa"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:35Z" level=warning msg="Instance vol1-e-734966aa is state error, error message: Instance Manager error"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:35Z" level=warning msg="Instance vol1-e-734966aa crashed on Instance Manager instance-manager-e-4d6d703f at shuo-longhorn-cluster-1-node-1, try to get log"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:35Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"vol1-e-734966aa\", UID:\"a2b73e51-3819-4ee6-8c83-154ae24362c7\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"9899399\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops vol1-e-734966aa"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:35Z" level=error msg="failed to receive log for instance vol1-e-734966aa: rpc error: code = NotFound desc = cannot find process vol1-e-734966aa"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:35Z" level=warning msg="cannot get crash log for instance vol1-e-734966aa on Instance Manager instance-manager-e-4d6d703f at shuo-longhorn-cluster-1-node-1, error rpc error: code = NotFound desc = cannot find process vol1-e-734966aa"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:36Z" level=debug msg="Prepare to delete instance vol1-e-734966aa"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:36Z" level=warning msg="Instance vol1-e-734966aa is state error, error message: Instance Manager error"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:36Z" level=warning msg="Instance vol1-e-734966aa crashed on Instance Manager instance-manager-e-4d6d703f at shuo-longhorn-cluster-1-node-1, try to get log"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:36Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"vol1-e-734966aa\", UID:\"a2b73e51-3819-4ee6-8c83-154ae24362c7\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"9899402\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops vol1-e-734966aa"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:36Z" level=error msg="failed to receive log for instance vol1-e-734966aa: rpc error: code = NotFound desc = cannot find process vol1-e-734966aa"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:36Z" level=warning msg="cannot get crash log for instance vol1-e-734966aa on Instance Manager instance-manager-e-4d6d703f at shuo-longhorn-cluster-1-node-1, error rpc error: code = NotFound desc = cannot find process vol1-e-734966aa"
......

[longhorn-manager-2dv6l] time="2020-05-15T02:54:25Z" level=debug msg="Instance handler updated instance vol1-e-734966aa state, old state error, new state stopped"
  2. The volume continuously retries auto reattachment.
[longhorn-manager-2dv6l] time="2020-05-15T02:53:35Z" level=warning msg="Engine of volume vol1 dead unexpectedly, reattach the volume"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:35Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"vol1\", UID:\"1094bb75-5708-4105-8e74-f557c2961e80\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"9899223\", FieldPath:\"\"}): type: 'Warning' reason: 'DetachedUnexpectly' Engine of volume vol1 dead unexpectedly, reattach the volume"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:35Z" level=debug msg="Requeue longhorn-system/vol1-e-734966aa due to conflict: Operation cannot be fulfilled on engines.longhorn.io \"vol1-e-734966aa\": the object has been modified; please apply your changes to the latest version and try again"
......
[longhorn-manager-2dv6l] time="2020-05-15T02:53:36Z" level=warning msg="Engine of volume vol1 dead unexpectedly, reattach the volume"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:36Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"vol1\", UID:\"1094bb75-5708-4105-8e74-f557c2961e80\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"9899406\", FieldPath:\"\"}): type: 'Warning' reason: 'DetachedUnexpectly' Engine of volume vol1 dead unexpectedly, reattach the volume"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:44Z" level=debug msg="Prepare to delete instance vol1-e-734966aa"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:44Z" level=warning msg="Instance vol1-e-734966aa is state error, error message: Instance Manager error"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:44Z" level=warning msg="Instance vol1-e-734966aa crashed on Instance Manager instance-manager-e-4d6d703f at shuo-longhorn-cluster-1-node-1, try to get log"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:44Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"vol1-e-734966aa\", UID:\"a2b73e51-3819-4ee6-8c83-154ae24362c7\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"9899402\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops vol1-e-734966aa"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:44Z" level=error msg="failed to receive log for instance vol1-e-734966aa: rpc error: code = NotFound desc = cannot find process vol1-e-734966aa"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:44Z" level=warning msg="cannot get crash log for instance vol1-e-734966aa on Instance Manager instance-manager-e-4d6d703f at shuo-longhorn-cluster-1-node-1, error rpc error: code = NotFound desc = cannot find process vol1-e-734966aa"
......
  3. The volume will become Ready after the automatic reattachment and detachment.

Ideally, the engine should be stopped and the volume should become Faulted after the engine manager deletion, and this shouldn't take a long time.

@shuo-wu (Contributor) commented May 18, 2020

The explanations for the issue and the behaviors mentioned above:

  1. Since no process exists in the newly recreated engine manager, the engine manager needs to wait 60s to refresh the instance map. This doesn't fail anything, but it should be improved.

  2. The auto attachment logic interferes with the auto reattachment (engine dead recovery) workflow (a simplified sketch follows the logs below):

    1. The engine crashes unexpectedly and the auto reattachment logic is triggered.
    2. The auto reattachment logic starts by detaching the volume first (by cleaning up v.Status.CurrentNodeID) and waits for the volume to be detached.
    3. The auto attachment logic tries to directly attach the volume, without any checks or waits, as soon as the volume starts detaching (by setting v.Status.CurrentNodeID again). Conflict!
    4. Once the new engine manager updates the process map and the engine is updated to Stopped, the new engine is started immediately without shutting down all the old replicas (since the auto attachment logic directly attaches the volume).
    5. As a result, all replicas stay in the running state and never stop the restore during the engine restart. After the volume is reattached, those replicas report the restore as complete and the volume is marked Ready for workloads. But this is a false positive, since a replica restore that continues after the engine crash is invalid.
......
[longhorn-manager-mjcwx] time="2020-05-18T09:43:07Z" level=warning msg="Engine of volume vol3 dead unexpectedly, reattach the volume"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:07Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"vol3\", UID:\"943c235c-f953-4f05-adf6-8e7001f647f5\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"10947215\", FieldPath:\"\"}): type: 'Warning' reason: 'DetachedUnexpectly' Engine of volume vol3 dead unexpectedly, reattach the volume"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:31Z" level=debug msg="Instance handler updated instance vol3-e-14ba4459 state, old state error, new state stopped"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:31Z" level=info msg="----------------Volume vol3 current state is updated from detaching to attaching"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:31Z" level=debug msg="Prepare to create instance vol3-e-14ba4459"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:31Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"vol3-e-14ba4459\", UID:\"9ae8a872-b776-46ea-b44a-775d5f4c33ec\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"10947431\", FieldPath:\"\"}): type: 'Normal' reason: 'Start' Starts vol3-e-14ba4459"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:31Z" level=debug msg="Instance process vol3-e-14ba4459 had been created, need to wait for instance manager update"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:32Z" level=debug msg="Instance process vol3-e-14ba4459 had been created, need to wait for instance manager update"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:32Z" level=debug msg="Instance handler updated instance vol3-e-14ba4459 state, old state stopped, new state starting"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:33Z" level=debug msg="Instance vol3-e-14ba4459 starts running, IP 10.42.2.234"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:33Z" level=debug msg="Instance vol3-e-14ba4459 starts running, Port 10000"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:33Z" level=debug msg="Instance handler updated instance vol3-e-14ba4459 state, old state starting, new state running"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:33Z" level=debug msg="Start monitoring vol3-e-14ba4459"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:33Z" level=info msg="----------------Volume vol3 current state is updated from attaching to attached"
[longhorn-instance-manager] time="2020-05-18T09:41:45Z" level=info msg="Process Manager: prepare to create process vol3-r-3fe3d6c2"
[longhorn-instance-manager] time="2020-05-18T09:41:45Z" level=info msg="Process Manager: created process vol3-r-3fe3d6c2"
[vol3-r-3fe3d6c2] time="2020-05-18T09:41:45Z" level=info msg="Creating volume /host/var/lib/longhorn/replicas/vol3-92d81037, size 2147483648/512"
[vol3-r-3fe3d6c2] time="2020-05-18T09:41:45Z" level=info msg="Listening on gRPC Replica server 0.0.0.0:10000"
[vol3-r-3fe3d6c2] time="2020-05-18T09:41:45Z" level=info msg="Listening on data server 0.0.0.0:10001"
time="2020-05-18T09:41:45Z" level=info msg="Listening on sync agent server 0.0.0.0:10002"
[vol3-r-3fe3d6c2] time="2020-05-18T09:41:45Z" level=info msg="Listening on sync 0.0.0.0:10002"
[longhorn-instance-manager] time="2020-05-18T09:41:45Z" level=info msg="Process vol3-r-3fe3d6c2 has started at localhost:10000"
[vol3-r-3fe3d6c2] time="2020-05-18T09:41:47Z" level=info msg="New connection from: 10.42.2.221:47358"
[vol3-r-3fe3d6c2] time="2020-05-18T09:41:47Z" level=info msg="Opening volume /host/var/lib/longhorn/replicas/vol3-92d81037, size 2147483648/512"
[vol3-r-3fe3d6c2] time="2020-05-18T09:41:51Z" level=info msg="Successfully initiated restore for s3://backupbucket@us-east-1/backupstore?backup=backup-f0a33111b9614af7&volume=pvc-0af64609-b5e8-4ac0-9f36-94044bc68db5 to [volume-snap-00d4d3a5-32f3-47dc-92ab-64675e3c2601.img]"
[vol3-r-3fe3d6c2] time="2020-05-18T09:42:15Z" level=info msg="Closing volume"
[vol3-r-3fe3d6c2] time="2020-05-18T09:42:25Z" level=info msg="Restore completed successfully in Server"
time="2020-05-18T09:42:25Z" level=info msg="Reverting to snapshot volume-snap-00d4d3a5-32f3-47dc-92ab-64675e3c2601.img on 0.0.0.0:10000 at 2020-05-18T09:42:25Z"
[vol3-r-3fe3d6c2] time="2020-05-18T09:42:25Z" level=info msg="Reverting to snapshot volume-snap-00d4d3a5-32f3-47dc-92ab-64675e3c2601.img on 0.0.0.0:10000 successful"
[vol3-r-3fe3d6c2] time="2020-05-18T09:42:25Z" level=info msg="Done running restore s3://backupbucket@us-east-1/backupstore?backup=backup-f0a33111b9614af7&volume=pvc-0af64609-b5e8-4ac0-9f36-94044bc68db5 to volume-snap-00d4d3a5-32f3-47dc-92ab-64675e3c2601.img"
  3. The reason the above replica restore is invalid is that the replica server is closed once the old engine process goes down, so the Revert operation and the data writes for the restore no longer make sense. (NOT SURE which part leads to the data corruption.) This is what produces the false positive in statement 2!
[vol3-r-3fe3d6c2] time="2020-05-18T09:42:15Z" level=info msg="Closing volume"
[vol3-r-3fe3d6c2] time="2020-05-18T09:42:25Z" level=info msg="Restore completed successfully in Server"
time="2020-05-18T09:42:25Z" level=info msg="Reverting to snapshot volume-snap-00d4d3a5-32f3-47dc-92ab-64675e3c2601.img on 0.0.0.0:10000 at 2020-05-18T09:42:25Z"

Solution:

  • Fix the auto attachment bug so that it doesn't interfere with the reattachment logic (by adding a PendingNodeID check, as sketched below).
  • Make sure the volume retries the restore after the reattachment; the volume restore will then succeed.
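
A rough, hypothetical sketch of the intended guard (simplified; field names follow this discussion, not the actual longhorn-manager implementation):

```go
// Hypothetical, simplified sketch of the fix; not the actual longhorn-manager code.
type VolumeStatus struct {
	CurrentNodeID string
	PendingNodeID string
}

// The auto attachment path backs off while a crash-recovery reattachment is
// pending, so the new engine only starts after the volume has fully detached
// and the old replicas have been stopped.
func canAutoAttach(status VolumeStatus, desiredNode string) bool {
	if status.PendingNodeID != "" {
		return false // the reattachment workflow owns the volume for now
	}
	return status.CurrentNodeID == "" && desiredNode != ""
}
```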

@shuo-wu (Contributor) commented May 19, 2020

Pre-merged Checklist

  • Does the PR include the explanation for the fix or the feature?
  • Is the backend code merged?
  • Are the reproduce steps/test steps documented?
  • If labeled: area/ui Has the UI issue been filed or is it ready to be merged?
  • If labeled: require-automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case?
  • If labeled: require-automation-engine Has the engine integration test been merged?
  • If labeled: require-doc Has the necessary document PR been submitted or merged?

@sowmyav27 (Author) commented May 19, 2020

Validating on master - 05/19/2020 - 16:30 PDT

Validation: PASSED

Steps:

  • Setup a backupstore on S3
  • Create volume and attach to a workload
  • Write random data to the pod volume and get the md5sum.
  • Create a backup for the volume.
  • Restore a new volume from the backup.
  • Crash the engine during the restore: delete the instance manager engine on the node where the volume is attached.
  • Wait for the volume to start detaching.
  • Wait for the volume to be reattached.
  • Verify the volume is in the detached state after the restore completes.
  • Attach the restored volume to a new workload.
  • Verify the data is intact.
