[BUG] Backing image is not deleted and recreated correctly #4256

Closed
chriscchien opened this issue Jul 20, 2022 · 7 comments
Labels
area/backing-image Backing image related backport/1.3.1 kind/bug kind/regression Regression which has worked before require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade)

@chriscchien
Contributor

chriscchien commented Jul 20, 2022

Describe the bug

From e2e

In a fresh-install environment, run test_exporting_backing_image_from_volume twice. The first run passes, but the second run always fails, and the longhorn-manager log shows msg="Can't find backing image, may have been deleted" backingImage=bi-test1 controller=longhorn-backing-image node=worker2.

The backing image then gets stuck in processing, as shown below:
Screenshot from 2022-07-20 16-25-10

The backing image names are hard-coded in the test. If a name is changed to a new one, the test case passes.

To Reproduce

Steps to reproduce the behavior:

  1. Fresh install Longhorn
  2. Install longhorn-test
  3. Run test_exporting_backing_image_from_volume twice
  4. The first run passes; the second run fails

Expected behavior

Both runs should produce the same result.

Log or Support bundle

2022-07-20T14:54:30.316356386+08:00 time="2022-07-20T06:54:30Z" level=warning msg="Cannot find backing image bi-test2 during invalid backing image cleanup, will skip it" backingImageManager=backing-image-manager-2a7c-3425 controller=longhorn-backing-image-manager diskUUID=34257b21-38f8-4188-aff7-112acf8e2e20 node=worker2 nodeID=worker2
2022-07-20T14:54:30.321643969+08:00 time="2022-07-20T06:54:30Z" level=debug msg="Can't find backing image, may have been deleted" backingImage=bi-test1 controller=longhorn-backing-image node=worker2
2022-07-20T14:54:30.321660494+08:00 time="2022-07-20T06:54:30Z" level=debug msg="Can't find backing image, may have been deleted" backingImage=bi-test2 controller=longhorn-backing-image node=worker2
2022-07-20T14:54:30.321662637+08:00 time="2022-07-20T06:54:30Z" level=debug msg="Can't find backing image, may have been deleted" backingImage=bi-test222 controller=longhorn-backing-image node=worker2
2022-07-20T14:54:35.329572476+08:00 time="2022-07-20T06:54:35Z" level=warning msg="Cannot find backing image bi-test2 during invalid backing image cleanup, will skip it" backingImageManager=backing-image-manager-2a7c-3425 controller=longhorn-backing-image-manager diskUUID=34257b21-38f8-4188-aff7-112acf8e2e20 node=worker2 nodeID=worker2

longhorn-support-bundle_8f0e407a-3c81-4daa-9f55-d4db8aa23fac_2022-07-20T06-55-14Z.zip
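
For anyone scanning a support bundle for the same symptom, here is a minimal sketch that pulls the affected backing image names out of log lines like the ones above. It assumes the lines keep this key=value layout with double-quoted values; parse_logfmt and missing_backing_images are illustrative helper names, not part of Longhorn.

```python
import shlex


def parse_logfmt(line):
    """Parse one key=value log line into a dict of fields."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens without '=', e.g. the leading container timestamp
            fields[key] = value
    return fields


def missing_backing_images(lines):
    """Return the sorted backingImage names reported as not found."""
    names = set()
    for line in lines:
        fields = parse_logfmt(line)
        name = fields.get("backingImage")
        if name and fields.get("msg", "").startswith("Can't find backing image"):
            names.add(name)
    return sorted(names)
```

Only the controller=longhorn-backing-image debug lines carry a backingImage field, so the warning lines from the backing-image-manager controller are ignored by this filter.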

Environment

  • Longhorn version: longhorn-manager e61fc4
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s
  • Node config
    • OS type and version: Ubuntu 20.04
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): VMware

Additional context

#4248

@chriscchien chriscchien added kind/bug severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade) kind/regression Regression which has worked before labels Jul 20, 2022
@innobead
Member

@cchien816 Is this a feature bug, or a test case bug, for example one where the teardown of the test case needs to be modified?

@chriscchien
Contributor Author

chriscchien commented Jul 20, 2022

> @cchien816 Is this a feature bug or a test case bug like that the teardown of the test case needs to modify?

I think it's a feature bug, because the error message msg="Can't find backing image, may have been deleted" backingImage=bi-test1 controller=longhorn-backing-image node=worker2 shows up in the longhorn-manager log and affects the test case.

Also, backing images created from a URL did not hit this issue. Thank you.

@innobead
Member

@shuo-wu Please check if this is a real issue.

@innobead innobead added this to the v1.4.0 milestone Jul 20, 2022
@chriscchien chriscchien changed the title from "[BUG] Backing image created from volume not clean up correctly" to "[BUG] Backing image created from volume will stuck at file process in e2e" Jul 20, 2022
@shuo-wu
Contributor

shuo-wu commented Jul 26, 2022

Root cause:
BackingImage deletion does not wait for the BackingImageManagers to finish deleting the file. Once the BackingImage CR is gone, the file and its record remain as leftovers in the BackingImageManagers. Later, if users or applications create a new BackingImage CR with the same name, the BackingImageManagers cannot handle the leftover correctly (that is, detect the mismatch and clean it up before launching the new BackingImage), so the new BackingImage gets stuck.
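
To make the sequence concrete, here is a toy model of the leftover record, not the actual longhorn-manager code. It assumes the mismatch is detected by comparing a per-CR UUID against the UUID stored in the manager's record; the class, the UUID strings, and the cleanup_mismatch flag are all hypothetical.

```python
class BackingImageManagerModel:
    """Toy stand-in for a BackingImageManager: file records keyed by image name."""

    def __init__(self, cleanup_mismatch):
        self.records = {}  # image name -> UUID of the CR that owns the file
        self.cleanup_mismatch = cleanup_mismatch

    def launch(self, name, uuid):
        """Try to prepare the file for BackingImage `name`; return its state."""
        owner = self.records.get(name)
        if owner is not None and owner != uuid:
            if not self.cleanup_mismatch:
                return "stuck"  # leftover from the deleted CR blocks the new one
            del self.records[name]  # drop the leftover before launching
        self.records[name] = uuid
        return "ready"


def recreate_with_same_name(cleanup_mismatch):
    """Create an image, lose the CR without cleanup, then reuse the name."""
    manager = BackingImageManagerModel(cleanup_mismatch)
    manager.launch("bi-test1", "uuid-old")
    # The CR is deleted here without waiting for the manager, so the record stays.
    return manager.launch("bi-test1", "uuid-new")
```

In this model, recreate_with_same_name(False) returns "stuck", which corresponds to the reported behavior, while recreate_with_same_name(True) returns "ready" because the stale record is removed first.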

A simpler reproducing step:

  1. Create a random backing image
  2. Create and attach a multi-replica volume using the backing image.
  3. Wait for the attachment to complete, then delete the volume as well as the backing image
  4. Repeat steps 1 to 3 until the backing image or the attachment gets stuck
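
The loop above could be scripted roughly as follows. This is only a sketch: every callable is a hypothetical hook to be wired to whatever client is in use (no real Longhorn API is named here), and it assumes the same backing image name is reused on each round, which is what the root cause requires.

```python
def reproduce(image_name, create_image, create_and_attach_volume,
              delete_volume, delete_image, is_stuck, max_rounds=20):
    """Run the create/attach/delete cycle; return the round that got stuck, or None."""
    for round_no in range(1, max_rounds + 1):
        create_image(image_name)                        # step 1
        volume = create_and_attach_volume(image_name)   # step 2
        if is_stuck(image_name, volume):                # stop as soon as it hangs
            return round_no
        delete_volume(volume)                           # step 3
        delete_image(image_name)
    return None
```

Passing the operations in as callables keeps the loop independent of any particular client library, and max_rounds bounds the run in case the issue does not reproduce.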

Workaround:

  1. Delete the BackingImageManager pods that hold the leftover records.
  2. Wait for Longhorn to restart the pods.
  3. Enter the new pods, then check for and remove the leftover files.
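
As a rough illustration of those three steps, the helper below only builds the kubectl command lines and does not run anything. The longhorn-system namespace, the assumption that the restarted pod keeps the same name, and the image_dir placeholder are all assumptions; the directory holding the files inside the pod is not given in this thread.

```python
def workaround_commands(pod, namespace="longhorn-system",
                        image_dir="<backing-image-dir>"):
    """Build (not run) kubectl commands for the three workaround steps."""
    return [
        # 1. Delete the BackingImageManager pod that holds the leftover record.
        ["kubectl", "-n", namespace, "delete", "pod", pod],
        # 2. Watch the namespace until Longhorn has restarted the pod.
        ["kubectl", "-n", namespace, "get", "pods", "--watch"],
        # 3. List the files in the new pod so leftovers can be found and removed.
        ["kubectl", "-n", namespace, "exec", pod, "--", "ls", "-la", image_dir],
    ]
```

For example, workaround_commands("backing-image-manager-2a7c-3425") uses the pod name from the log above; each list can be passed to subprocess.run once the placeholder directory has been replaced with the real path.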

@longhorn-io-github-bot

longhorn-io-github-bot commented Jul 26, 2022

Pre Ready-For-Testing Checklist

@innobead innobead added the require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated label Jul 27, 2022
@shuo-wu
Contributor

shuo-wu commented Jul 27, 2022

There is no need to backport this to v1.2.5, because the implementation has changed.

@chriscchien
Contributor Author

chriscchien commented Jul 27, 2022

Verified on longhorn-manager master-head 16cfd1
Result: Pass

Steps

  1. Followed the reproduce steps; the error did not occur
  2. Ran the e2e test test_exporting_backing_image_from_volume twice; the error did not occur either
