
[BUG] Backup restore succeeds when an instance manager engine crashes and data is NOT available in the restore volume #1336

Closed
sowmyav27 opened this issue May 13, 2020 · 5 comments
Labels: area/v1-data-engine v1 data engine (iSCSI tgt) · component/longhorn-manager Longhorn manager (control plane) · kind/bug

@sowmyav27

Describe the bug
Backup restore succeeds when an instance manager engine crashes and data is NOT available in the restore volume

To Reproduce

  • Have a volume v1 with 3 replicas.
  • Write data into the volume.
  • Take a backup.
  • Restore from the backup.
  • While the restore is in progress, delete the instance manager engine running the volume (a minimal sketch of this step follows the list).
  • The restore succeeds.
  • Data is NOT available in the volume.
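
For reference, a minimal sketch (not part of the original report) of the crash step using client-go, assuming a recent client-go and a kubeconfig at the default location. The pod name is illustrative (taken from the logs below) and must match the engine instance manager on the node where the restore volume is attached.

```go
// Hypothetical reproduction helper: delete the engine instance manager pod
// while the restore is still in progress. Not part of the original report.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// Pod name is illustrative; it will differ per cluster.
	if err := client.CoreV1().Pods("longhorn-system").Delete(
		context.TODO(), "instance-manager-e-4d6d703f", metav1.DeleteOptions{},
	); err != nil {
		log.Fatal(err)
	}
}
```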

Expected behavior
@yasker Should the restore fail in this case, and should the volume be shown as "Faulted"?

Log
longhorn-support-bundle_f7248c7c-cbac-44c3-b611-48b17218f97b_2020-05-13T18-25-27Z.zip

Note:
Backup is in S3

Environment:

  • Longhorn version: master - 05/13/2020
  • Kubernetes version: 1.17.5
  • Node OS type and version: RKE LINUX DO nodes
@yasker (Member) commented May 14, 2020

@shuo-wu We need to understand why it behaves this way before fixing it. The question here is: why did the restore succeed?

@shuo-wu (Contributor) commented May 15, 2020

There are some weird/wrong behaviors after deleting the engine manager:

  1. Stopping the errored engine takes a long time (about 1 minute).
[longhorn-manager-2dv6l] time="2020-05-15T02:53:23Z" level=warning msg="Starts to clean up then recreates pod for instance manager instance-manager-e-4d6d703f with state stopped"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:23Z" level=info msg="Created instance manager pod instance-manager-e-4d6d703f for instance manager instance-manager-e-4d6d703f"
......
[longhorn-manager-2dv6l] time="2020-05-15T02:53:35Z" level=debug msg="Prepare to delete instance vol1-e-734966aa"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:35Z" level=warning msg="Instance vol1-e-734966aa is state error, error message: Instance Manager error"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:35Z" level=warning msg="Instance vol1-e-734966aa crashed on Instance Manager instance-manager-e-4d6d703f at shuo-longhorn-cluster-1-node-1, try to get log"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:35Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"vol1-e-734966aa\", UID:\"a2b73e51-3819-4ee6-8c83-154ae24362c7\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"9899399\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops vol1-e-734966aa"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:35Z" level=error msg="failed to receive log for instance vol1-e-734966aa: rpc error: code = NotFound desc = cannot find process vol1-e-734966aa"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:35Z" level=warning msg="cannot get crash log for instance vol1-e-734966aa on Instance Manager instance-manager-e-4d6d703f at shuo-longhorn-cluster-1-node-1, error rpc error: code = NotFound desc = cannot find process vol1-e-734966aa"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:36Z" level=debug msg="Prepare to delete instance vol1-e-734966aa"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:36Z" level=warning msg="Instance vol1-e-734966aa is state error, error message: Instance Manager error"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:36Z" level=warning msg="Instance vol1-e-734966aa crashed on Instance Manager instance-manager-e-4d6d703f at shuo-longhorn-cluster-1-node-1, try to get log"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:36Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"vol1-e-734966aa\", UID:\"a2b73e51-3819-4ee6-8c83-154ae24362c7\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"9899402\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops vol1-e-734966aa"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:36Z" level=error msg="failed to receive log for instance vol1-e-734966aa: rpc error: code = NotFound desc = cannot find process vol1-e-734966aa"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:36Z" level=warning msg="cannot get crash log for instance vol1-e-734966aa on Instance Manager instance-manager-e-4d6d703f at shuo-longhorn-cluster-1-node-1, error rpc error: code = NotFound desc = cannot find process vol1-e-734966aa"
......

[longhorn-manager-2dv6l] time="2020-05-15T02:54:25Z" level=debug msg="Instance handler updated instance vol1-e-734966aa state, old state error, new state stopped"
  2. The volume continuously retries auto reattachment.
[longhorn-manager-2dv6l] time="2020-05-15T02:53:35Z" level=warning msg="Engine of volume vol1 dead unexpectedly, reattach the volume"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:35Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"vol1\", UID:\"1094bb75-5708-4105-8e74-f557c2961e80\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"9899223\", FieldPath:\"\"}): type: 'Warning' reason: 'DetachedUnexpectly' Engine of volume vol1 dead unexpectedly, reattach the volume"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:35Z" level=debug msg="Requeue longhorn-system/vol1-e-734966aa due to conflict: Operation cannot be fulfilled on engines.longhorn.io \"vol1-e-734966aa\": the object has been modified; please apply your changes to the latest version and try again"
......
[longhorn-manager-2dv6l] time="2020-05-15T02:53:36Z" level=warning msg="Engine of volume vol1 dead unexpectedly, reattach the volume"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:36Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"vol1\", UID:\"1094bb75-5708-4105-8e74-f557c2961e80\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"9899406\", FieldPath:\"\"}): type: 'Warning' reason: 'DetachedUnexpectly' Engine of volume vol1 dead unexpectedly, reattach the volume"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:44Z" level=debug msg="Prepare to delete instance vol1-e-734966aa"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:44Z" level=warning msg="Instance vol1-e-734966aa is state error, error message: Instance Manager error"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:44Z" level=warning msg="Instance vol1-e-734966aa crashed on Instance Manager instance-manager-e-4d6d703f at shuo-longhorn-cluster-1-node-1, try to get log"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:44Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"vol1-e-734966aa\", UID:\"a2b73e51-3819-4ee6-8c83-154ae24362c7\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"9899402\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops vol1-e-734966aa"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:44Z" level=error msg="failed to receive log for instance vol1-e-734966aa: rpc error: code = NotFound desc = cannot find process vol1-e-734966aa"
[longhorn-manager-2dv6l] time="2020-05-15T02:53:44Z" level=warning msg="cannot get crash log for instance vol1-e-734966aa on Instance Manager instance-manager-e-4d6d703f at shuo-longhorn-cluster-1-node-1, error rpc error: code = NotFound desc = cannot find process vol1-e-734966aa"
......
  3. The volume will become Ready after the automatic reattachment and detachment.

Ideally, the engine should be stopped and the volume should become Faulted after the engine manager deletion, and this shouldn't take a long time.

@shuo-wu (Contributor) commented May 18, 2020

The explanations for the issue and the behaviors mentioned above:

  1. Since no process exists in the newly recreated engine manager, the engine manager needs to wait 60s to refresh the instance map. This doesn't fail anything, but it should be improved.

  2. The auto attachment logic interferes with the auto reattachment (engine dead recovery) workflow (a simplified sketch follows the logs below):

    1. The engine crashes unexpectedly and the auto reattachment logic is triggered.
    2. The auto reattachment logic starts by detaching the volume first (by cleaning up v.Status.CurrentNodeID) and waits for the volume to be detached.
    3. The auto attachment logic tries to directly attach the volume, without any checks or waits, as soon as the volume starts detaching (by setting v.Status.CurrentNodeID again). Conflict!
    4. Once the new engine manager updates the process map and the engine is updated to Stopped, the new engine is started immediately without shutting down all the old replicas (since the auto attachment logic directly attaches the volume).
    5. As a result, all replicas stay in the running state and never stop the restore during the engine restart. After the volume is reattached, those replicas report the restore as complete and the volume is marked Ready for workloads. But this is a false positive, since a replica restore that continues after the engine crash is invalid.
......
[longhorn-manager-mjcwx] time="2020-05-18T09:43:07Z" level=warning msg="Engine of volume vol3 dead unexpectedly, reattach the volume"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:07Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"vol3\", UID:\"943c235c-f953-4f05-adf6-8e7001f647f5\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"10947215\", FieldPath:\"\"}): type: 'Warning' reason: 'DetachedUnexpectly' Engine of volume vol3 dead unexpectedly, reattach the volume"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:31Z" level=debug msg="Instance handler updated instance vol3-e-14ba4459 state, old state error, new state stopped"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:31Z" level=info msg="----------------Volume vol3 current state is updated from detaching to attaching"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:31Z" level=debug msg="Prepare to create instance vol3-e-14ba4459"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:31Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"vol3-e-14ba4459\", UID:\"9ae8a872-b776-46ea-b44a-775d5f4c33ec\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"10947431\", FieldPath:\"\"}): type: 'Normal' reason: 'Start' Starts vol3-e-14ba4459"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:31Z" level=debug msg="Instance process vol3-e-14ba4459 had been created, need to wait for instance manager update"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:32Z" level=debug msg="Instance process vol3-e-14ba4459 had been created, need to wait for instance manager update"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:32Z" level=debug msg="Instance handler updated instance vol3-e-14ba4459 state, old state stopped, new state starting"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:33Z" level=debug msg="Instance vol3-e-14ba4459 starts running, IP 10.42.2.234"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:33Z" level=debug msg="Instance vol3-e-14ba4459 starts running, Port 10000"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:33Z" level=debug msg="Instance handler updated instance vol3-e-14ba4459 state, old state starting, new state running"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:33Z" level=debug msg="Start monitoring vol3-e-14ba4459"
[longhorn-manager-mjcwx] time="2020-05-18T09:43:33Z" level=info msg="----------------Volume vol3 current state is updated from attaching to attached"
[longhorn-instance-manager] time="2020-05-18T09:41:45Z" level=info msg="Process Manager: prepare to create process vol3-r-3fe3d6c2"
[longhorn-instance-manager] time="2020-05-18T09:41:45Z" level=info msg="Process Manager: created process vol3-r-3fe3d6c2"
[vol3-r-3fe3d6c2] time="2020-05-18T09:41:45Z" level=info msg="Creating volume /host/var/lib/longhorn/replicas/vol3-92d81037, size 2147483648/512"
[vol3-r-3fe3d6c2] time="2020-05-18T09:41:45Z" level=info msg="Listening on gRPC Replica server 0.0.0.0:10000"
[vol3-r-3fe3d6c2] time="2020-05-18T09:41:45Z" level=info msg="Listening on data server 0.0.0.0:10001"
time="2020-05-18T09:41:45Z" level=info msg="Listening on sync agent server 0.0.0.0:10002"
[vol3-r-3fe3d6c2] time="2020-05-18T09:41:45Z" level=info msg="Listening on sync 0.0.0.0:10002"
[longhorn-instance-manager] time="2020-05-18T09:41:45Z" level=info msg="Process vol3-r-3fe3d6c2 has started at localhost:10000"
[vol3-r-3fe3d6c2] time="2020-05-18T09:41:47Z" level=info msg="New connection from: 10.42.2.221:47358"
[vol3-r-3fe3d6c2] time="2020-05-18T09:41:47Z" level=info msg="Opening volume /host/var/lib/longhorn/replicas/vol3-92d81037, size 2147483648/512"
[vol3-r-3fe3d6c2] time="2020-05-18T09:41:51Z" level=info msg="Successfully initiated restore for s3://backupbucket@us-east-1/backupstore?backup=backup-f0a33111b9614af7&volume=pvc-0af64609-b5e8-4ac0-9f36-94044bc68db5 to [volume-snap-00d4d3a5-32f3-47dc-92ab-64675e3c2601.img]"
[vol3-r-3fe3d6c2] time="2020-05-18T09:42:15Z" level=info msg="Closing volume"
[vol3-r-3fe3d6c2] time="2020-05-18T09:42:25Z" level=info msg="Restore completed successfully in Server"
time="2020-05-18T09:42:25Z" level=info msg="Reverting to snapshot volume-snap-00d4d3a5-32f3-47dc-92ab-64675e3c2601.img on 0.0.0.0:10000 at 2020-05-18T09:42:25Z"
[vol3-r-3fe3d6c2] time="2020-05-18T09:42:25Z" level=info msg="Reverting to snapshot volume-snap-00d4d3a5-32f3-47dc-92ab-64675e3c2601.img on 0.0.0.0:10000 successful"
[vol3-r-3fe3d6c2] time="2020-05-18T09:42:25Z" level=info msg="Done running restore s3://backupbucket@us-east-1/backupstore?backup=backup-f0a33111b9614af7&volume=pvc-0af64609-b5e8-4ac0-9f36-94044bc68db5 to volume-snap-00d4d3a5-32f3-47dc-92ab-64675e3c2601.img"
  3. The reason the above replica restore is invalid is that the replica server is closed once the old engine process goes down, so the Revert operation and the data writes for the restore no longer make sense. (NOT SURE which part leads to the data corruption.) This is what produces the false positive in statement 2!
[vol3-r-3fe3d6c2] time="2020-05-18T09:42:15Z" level=info msg="Closing volume"
[vol3-r-3fe3d6c2] time="2020-05-18T09:42:25Z" level=info msg="Restore completed successfully in Server"
time="2020-05-18T09:42:25Z" level=info msg="Reverting to snapshot volume-snap-00d4d3a5-32f3-47dc-92ab-64675e3c2601.img on 0.0.0.0:10000 at 2020-05-18T09:42:25Z"

Solution:

  • Fix the auto attachment bug so that it doesn't interfere with the reattachment logic (by adding a PendingNodeID check, as sketched below).
  • Make sure the volume retries the restore after the reattachment; the volume restore will then succeed.
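
A rough, hypothetical sketch of the intended guard (simplified; field names follow this discussion, not the actual longhorn-manager implementation):

```go
// Hypothetical, simplified sketch of the fix; not the actual longhorn-manager code.
type VolumeStatus struct {
	CurrentNodeID string
	PendingNodeID string
}

// The auto attachment path backs off while a crash-recovery reattachment is
// pending, so the new engine only starts after the volume has fully detached
// and the old replicas have been stopped.
func canAutoAttach(status VolumeStatus, desiredNode string) bool {
	if status.PendingNodeID != "" {
		return false // the reattachment workflow owns the volume for now
	}
	return status.CurrentNodeID == "" && desiredNode != ""
}
```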

@shuo-wu (Contributor) commented May 19, 2020

Pre-merged Checklist

  • Does the PR include the explanation for the fix or the feature?
  • Is the backend code merged?
  • Are the reproduce steps/test steps documented?
  • If labeled: area/ui Has the UI issue been filed or is it ready to be merged?
  • If labeled: require-automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case?
  • If labeled: require-automation-engine Has the engine integration test been merged?
  • If labeled: require-doc Has the necessary document PR been submitted or merged?

@sowmyav27 (Author) commented May 19, 2020

Validating on master - 05/19/2020 - 16:30 PDT

Validation: PASSED

Steps:

  • Setup a backupstore on S3
  • Create volume and attach to a workload
  • Write random data to the pod volume and get the md5sum.
  • Create a backup for the volume.
  • Restore a new volume from the backup.
  • Crash the engine during the restore: delete the instance manager engine on the node where the volume is attached.
  • Wait for the volume to start detaching.
  • Wait for the volume to be reattached.
  • Verify the volume is in the detached state after the restore completes.
  • Attach the restored volume to a new workload.
  • Verify the data is intact.
