[BUG] System restore with backing image could fail due to backing image checksum mismatch #9041

Closed
yangchiu opened this issue Jul 19, 2024 · 8 comments
Assignees
Labels
area/system-backup-restore Longhorn system backup restore kind/bug priority/0 Must be implement or fixed in this release (managed by PO) reproduce/rare < 50% reproducible require/backport Require backport. Only used when the specific versions to backport have not been definied. require/test-case-revision Require updating the test case severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade)
Milestone

Comments

@yangchiu
Member

Describe the bug

Recently, test case test_system_backup_and_restore_volume_with_backingimage has been failing on both v1.7.x-head and master-head from time to time:

https://ci.longhorn.io/job/public/job/master/job/sles/job/amd64/job/longhorn-tests-sles-amd64/973/testReport/junit/tests/test_system_backup_restore/test_system_backup_and_restore_volume_with_backingimage_nfs_/

https://ci.longhorn.io/job/public/job/v1.7.x/job/v1.7.x-longhorn-tests-sles-amd64/4/testReport/junit/tests/test_system_backup_restore/test_system_backup_and_restore_volume_with_backingimage_s3_/

https://ci.longhorn.io/job/public/job/v1.7.x/job/v1.7.x-longhorn-upgrade-tests-sles-amd64/5/testReport/junit/tests/test_system_backup_restore/test_system_backup_and_restore_volume_with_backingimage_s3_/

The system restore can remain stuck in the Restoring state indefinitely:

# kubectl get systemrestores.longhorn.io -n longhorn-system
NAME                         STATE       AGE
test-system-restore-8kwef1   Restoring   8h
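
A stuck restore like the one above can be flagged from the command output itself. The following is a minimal Python sketch, not part of the Longhorn test suite: `age_to_seconds` and `stuck_restores` are hypothetical helper names, and the 600-second threshold is an assumption chosen for illustration.

```python
import re

# Seconds per unit for the short age strings printed by kubectl (e.g. "8h", "5m30s").
_UNIT_SECONDS = {"y": 365 * 86400, "d": 86400, "h": 3600, "m": 60, "s": 1}


def age_to_seconds(age):
    """Convert an age string such as '8h' or '2d3h' to seconds."""
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)([ydhms])", age))


def stuck_restores(table, max_age_seconds=600):
    """Return names of restores still in 'Restoring' after max_age_seconds.

    `table` is the text printed by `kubectl get systemrestores.longhorn.io`,
    including the NAME/STATE/AGE header row.
    """
    stuck = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        name, state, age = line.split()
        if state == "Restoring" and age_to_seconds(age) > max_age_seconds:
            stuck.append(name)
    return stuck


output = """\
NAME                         STATE       AGE
test-system-restore-8kwef1   Restoring   8h
"""
print(stuck_restores(output))  # ['test-system-restore-8kwef1']
```

A check like this could let a test fail fast with a clear message instead of waiting on a restore that will not complete.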

This is probably caused by a checksum mismatch on the restored backing image. The current SHA512 checksum is bd79ab9e6d45abf4f3f0adf552a868074dd235c4698ce7258d521160e0ad79ffe555b94e7d4007add6e1a25f4526885eb25c53ce38f7d344dd4925b9f2cb5d3b, but the expected SHA512 checksum is 304f3ed30ca6878e9056ee6f1b02b328239f0d0c2c1272840998212f9734b196371560b3b939037e4f4c2884ce457c2cbc9f0621f4f5d1ca983983c8cdf8cd9a:

system_restore

I also checked the backing image backup in the backup store, and its checksum is incorrect there as well:

# cd storage/backupbucket/backupstore/backupstore/backing-images/backing-images/bi-test                   
sh-4.4# cat backing-image.cfg 
{"Name":"bi-test","Size":"1161728","BlockCount":"1","Checksum":"bd79ab9e6d45abf4f3f0adf552a868074dd235c4698ce7258d521160e0ad79ffe555b94e7d4007add6e1a25f4526885eb25c53ce38f7d344dd4925b9f2cb5d3b","Labels":null,"CompressionMethod":"lz4","CreatedTime":"2024-07-18T02:26:17Z","CompleteTime":"2024-07-18T02:26:19Z","ProcessingBlocks":{"Blocks":{}},"Blocks":[{"Offset":0,"BlockChecksum":"bd79ab9e6d45abf4f3f0adf552a868074dd235c4698ce7258d521160e0ad79ff"}]}
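
The same comparison can be scripted against the config file. This is a minimal Python sketch that assumes only the `backing-image.cfg` fields visible in the output above; `check_backing_image_cfg` is a hypothetical helper, not a Longhorn API.

```python
import json


def check_backing_image_cfg(cfg_text, expected_checksum):
    """Return a list of problems found in a backing-image.cfg document."""
    cfg = json.loads(cfg_text)
    problems = []

    # The top-level Checksum should match the checksum of the image under test.
    if cfg["Checksum"] != expected_checksum:
        problems.append(
            "checksum mismatch for %s: stored %s..., expected %s..."
            % (cfg["Name"], cfg["Checksum"][:8], expected_checksum[:8])
        )

    # BlockCount is stored as a string in the config, so convert before comparing.
    if int(cfg["BlockCount"]) != len(cfg["Blocks"]):
        problems.append(
            "BlockCount %s does not match %d listed blocks"
            % (cfg["BlockCount"], len(cfg["Blocks"]))
        )
    return problems
```

Passing the contents of the file above together with the expected checksum starting with 304f3ed3 would report a single checksum mismatch for bi-test.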

To Reproduce

Run test case test_system_backup_and_restore_volume_with_backingimage repeatedly.

Expected behavior

Support bundle for troubleshooting

supportbundle_3802211e-5fd7-4e15-8a9b-e95b927fbf11_2024-07-19T01-42-43Z.zip

Environment

  • Longhorn version: v1.7.x-head or master-head
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.30.0+k3s1
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version: sles 15-sp6
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

@yangchiu yangchiu added kind/bug severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade) reproduce/rare < 50% reproducible priority/0 Must be implement or fixed in this release (managed by PO) area/system-backup-restore Longhorn system backup restore require/backport Require backport. Only used when the specific versions to backport have not been definied. labels Jul 19, 2024
@yangchiu yangchiu added this to the v1.7.0 milestone Jul 19, 2024
@derekbit
Member

cc @ChanYiLin

@ChanYiLin
Contributor

This is quite odd. The test image should be parrot.raw, and a correct backup config should look like the following.
In particular, "BlockCount" should be "6":

{"Name":"parrot","Size":"33554432","BlockCount":"6","Checksum":"304f3ed30ca6878e9056ee6f1b02b328239f0d0c2c1272840998212f9734b196371560b3b939037e4f4c2884ce457c2cbc9f0621f4f5d1ca983983c8cdf8cd9a","Labels":null,"CompressionMethod":"lz4","CreatedTime":"2024-07-19T04:12:51Z","CompleteTime":"2024-07-19T04:12:53Z","ProcessingBlocks":{"Blocks":{}},"Blocks":[{"Offset":0,"BlockChecksum":"03060d6b6c4c19737a263979140cb84f7f5f7e53e5333b93e4154f7b62364ed5"},{"Offset":8388608,"BlockChecksum":"c8684f4bb7725397b97a159aea819808ff305f8f3913b9ac59362d991f5705e4"},{"Offset":14680064,"BlockChecksum":"731859029215873fdac1c9f2f8bd25a334abf0f3a9e1b057cf2cacc2826d86b0"},{"Offset":16777216,"BlockChecksum":"46c343666e37a8c6f6a49840a4aecfe5fb29b72fc3dc3013ab351685084f01c3"},{"Offset":23068672,"BlockChecksum":"731859029215873fdac1c9f2f8bd25a334abf0f3a9e1b057cf2cacc2826d86b0"},{"Offset":25165824,"BlockChecksum":"e4f3a9580b7719cc0f4c77185183ffe4cdfd2917f09f1cc998d49d6b82b72d6c"}]}

It seems the test uses the wrong backing image in the backup store.
But why is there another backing image in the backup store? Do some tests not clean it up?
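
One way to find out which image a stored checksum belongs to is to hash the candidate image files locally and compare. A minimal Python sketch, assuming the stored value is the SHA512 of the whole image file as the error message suggests; `sha512_of_file` and `identify_image` are hypothetical helper names, and any file paths passed in are up to the caller.

```python
import hashlib


def sha512_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA512 hex digest of a file without loading it all into memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def identify_image(stored_checksum, candidate_paths):
    """Return the first candidate whose SHA512 equals stored_checksum, else None."""
    for path in candidate_paths:
        if sha512_of_file(path) == stored_checksum:
            return path
    return None
```

Running `identify_image` with the checksum from the backup store and local copies of the test images would show which file the stale backup was created from.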

@derekbit
Member

It seems the test uses the wrong backing image in the backup store. But why is there another backing image in the backup store? Do some tests not clean it up?

Can we make sure all other backing images are thoroughly cleaned up before executing the test?

@ChanYiLin
Contributor

According to my earlier discussion with @roger-ryao about this issue:

Jack: Did it ever fail when run as a single test?
Roger: No, I didn't observe it failing when running the single test or when running all test cases in test_system_backup_restore.py.
Jack: And in the full regression it only fails sometimes?
Roger: Yep. When it passes, the system restore completes within 50 seconds.

@ChanYiLin
Contributor

It seems the test uses the wrong backing image in the backup store. But why is there another backing image in the backup store? Do some tests not clean it up?

Can we make sure all other backing images are thoroughly cleaned up before executing the test?

Yes, we can clean up all the backup backing images before running the tests.
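
Such a pre-test cleanup could follow a delete-then-wait pattern. The sketch below is a generic Python outline under stated assumptions: `list_names` and `delete` are hypothetical callables supplied by the test harness (they are not Longhorn client methods), and the timeout values are placeholders.

```python
import time


def cleanup_all(list_names, delete, timeout=60, interval=1, sleep=time.sleep):
    """Delete every listed resource, then wait until the listing is empty.

    list_names: callable returning the names of the remaining resources.
    delete:     callable that requests deletion of one resource by name.
    Raises TimeoutError if resources remain after roughly `timeout` seconds.
    """
    for name in list_names():
        delete(name)

    waited = 0
    while list_names():  # deletion may be asynchronous, so poll until gone
        if waited >= timeout:
            raise TimeoutError("resources still present: %s" % list_names())
        sleep(interval)
        waited += interval


# Usage with an in-memory stand-in for the backup store.
backups = {"bi-test", "parrot"}
cleanup_all(lambda: sorted(backups), backups.discard)
print(backups)  # set()
```

Waiting for the listing to become empty matters because a test that starts while a stale entry still exists would hit the same name collision.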

@ChanYiLin
Contributor

Oh, I see.
bd79ab9e6d45abf4f3f0adf552a868074dd235c4698ce7258d521160e0ad79ffe555b94e7d4007add6e1a25f4526885eb25c53ce38f7d344dd4925b9f2cb5d3b is the checksum of the parrot.qcow2 image.
So maybe a previous test didn't clean up the backup backing image resource,
and since the name was always bi-test, the new image was considered already backed up.
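
The suspected sequence can be illustrated with a toy model. This Python sketch only models the hypothesis in this comment (a backup is skipped when an entry with the same name already exists); it is not Longhorn's implementation, and the function names and placeholder checksum strings are made up for illustration.

```python
def backup_backing_image(store, name, checksum):
    """Toy model: skip the backup when an entry with the same name already exists."""
    if name in store:
        return False  # treated as already backed up; the stale entry is kept
    store[name] = checksum
    return True


def restore_matches(store, name, expected_checksum):
    """Toy model: a restore succeeds only if the stored checksum is the expected one."""
    return store.get(name) == expected_checksum


store = {}

# An earlier test backs up a qcow2 image as "bi-test" and never cleans it up.
backup_backing_image(store, "bi-test", "checksum-of-parrot.qcow2")

# The failing test reuses the name "bi-test" for the raw image; the backup is skipped.
created = backup_backing_image(store, "bi-test", "checksum-of-parrot.raw")
print(created)  # False

# The restore then finds the qcow2 checksum where the raw checksum is expected.
print(restore_matches(store, "bi-test", "checksum-of-parrot.raw"))  # False
```

Under this model, either cleaning the store between tests or using a unique backing image name per test would avoid the mismatch.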

@longhorn-io-github-bot

longhorn-io-github-bot commented Jul 19, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:
    • The full regression test can pass

PRs:

@roger-ryao
