
[BUG][v1.5.4-rc1] Recurring job failed to create/delete backups after node reboot #7854

Closed
yangchiu opened this issue Feb 6, 2024 · 10 comments
Labels
area/recurring-job Longhorn recurring job related · area/volume-backup-restore Volume backup restore · kind/bug · priority/2 Nice to fix in this release (managed by PO) · reproduce/rare < 50% reproducible · severity/1 Function broken (a critical incident with very high impact, e.g. data corruption or failed upgrade)
Milestone
v1.5.4

Comments

yangchiu (Member) commented Feb 6, 2024

Describe the bug

  1. Create a volume and create a backup recurring job with cron * * * * * (create a backup every minute) and retain = 1 (keep only one backup). A manifest sketch is shown below, after the error message.
  2. Keep writing data to the volume.
  3. Reboot a node that hosts a replica. After the node and the volume are back in service, the recurring job can still behave abnormally, and lots of Error backups remain in the backup list:
    [screenshot: backup list showing backups stuck in the Error state]
    Even a long time after the reboot event, newly created backups can still randomly run into the Error state with an error message like:
Error:                      proxyServer=10.42.3.10:8501 destination=10.42.1.72:10000: 
failed to backup snapshot vol-yde7-971b6006-6f18-4d12-85dc-c337a72d5220 to backup-5b793b248f96439d: 
rpc error: code = Unknown desc = failed to create backup to s3://backupbucket@us-east-1/backupstore 
for volume vol-yde769: rpc error: code = Unknown 
desc = failed lock backupstore/volumes/e2/eb/vol-yde769/locks/lock-5242291888d04df6.lck type 1 acquisition
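
For reference, the recurring job in step 1 corresponds roughly to a RecurringJob manifest like the sketch below. The job name, namespace, and group are illustrative assumptions; only the cron schedule and the retain count come from the report.

apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: backup-every-minute    # illustrative name
  namespace: longhorn-system
spec:
  task: backup        # create a backup on each run
  cron: "* * * * *"   # run every minute, as in step 1
  retain: 1           # keep only the most recent backup
  concurrency: 1
  groups:
    - default         # assumed group used to attach the job to the test volume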

To Reproduce

Check out PR longhorn/longhorn-tests#1539 and run the test case Reboot Replica Node While Heavy Writing And Recurring Jobs Exist.

Expected behavior

Support bundle for troubleshooting

supportbundle_359c9225-c5aa-42d3-9af0-6a301e68c081_2024-02-06T05-54-16Z.zip

Environment

  • Longhorn version: v1.5.4-rc1
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.28.4+k3s1
    • Number of control plane nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: sles 15-sp5
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

@yangchiu yangchiu added kind/bug reproduce/often 80 - 50% reproducible priority/1 Highly recommended to fix in this release (managed by PO) area/volume-backup-restore Volume backup restore area/recurring-job Longhorn recurring job related labels Feb 6, 2024
@yangchiu yangchiu added this to the v1.5.4 milestone Feb 6, 2024
@innobead innobead added the severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade)) label Feb 6, 2024
innobead (Member) commented Feb 6, 2024

I assume this also happened in 1.6.x. @yangchiu you missed the severity label.

innobead (Member) commented Feb 6, 2024

@james-munson Please help with this. Thanks.

yangchiu (Member, Author) commented Feb 6, 2024

This could be difficult to reproduce. The test case can fail for reasons other than the one described in #7854 (comment).

It seems hard to predict when a backup creation will be triggered by a recurring job after a node reboot.

@james-munson Please don't try to reproduce it by running the test case, since the failures can vary. The test case needs to be refined first. Please look into the support bundle to see if there is any clue; if not, I suggest not digging into this and waiting until the PR is refined and merged. Sorry for the inconvenience.

james-munson (Contributor) commented Feb 6, 2024

The error comes from backupstore https://github.com/longhorn/backupstore/blob/9c90a8728ca2e1f95f10a4c6cf67cf7ac20c23e4/lock.go#L104-111

In logs/longhorn-system/instance-manager-a98079d3080d186b5dd5471b92655380/instance-manager.log

2024-02-06T05:53:24.019426177Z time="2024-02-06T05:53:24Z" level=info msg="Trying to acquire lock backupstore/volumes/e2/eb/vol-yde769/locks/lock-5242291888d04df6.lck" func="backupstore.(*FileLock).canAcquire" file="lock.go:65" lock="{ volume: vol-yde769, name: lock-5242291888d04df6, type: 1, acquired: false, serverTime: 2024-02-06 05:53:21 +0000 UTC }" pkg=backupstore
2024-02-06T05:53:24.019431667Z time="2024-02-06T05:53:24Z" level=info msg="backupstore volume vol-yde769 contains locks [{ volume: , name: lock-5242291888d04df6, type: 1, acquired: false, serverTime: 2024-02-06 05:53:21 +0000 UTC } { volume: , name: lock-e519ec1bc88a4080, type: 2, acquired: true, serverTime: 2024-02-06 05:53:17 +0000 UTC }]" func="backupstore.(*FileLock).canAcquire" file="lock.go:66" pkg=backupstore

So the BACKUP_LOCK (LockType 1) lock-5242291888d04df6 is rejected by the already-acquired and unexpired DELETION_LOCK (LockType 2) lock-e519ec1bc88a4080 from 7 seconds before. What's up with that one?
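
For readers less familiar with the backupstore locking, the rejection above boils down to the rule sketched in the simplified Go snippet below. This is not the actual lock.go implementation (which also handles lock expiry, priority, and refresh); the names and structure here are assumptions for illustration only.

// Simplified sketch of the mutual-exclusion rule that rejected the backup:
// a lock cannot be acquired while an unexpired lock of a different type is
// already held. Type values match the ones quoted in the error (1 = backup,
// 2 = deletion); everything else is illustrative.
package main

import "fmt"

type LockType int

const (
    BackupLock   LockType = 1 // taken while creating a backup
    DeletionLock LockType = 2 // taken while purging old backups
)

type fileLock struct {
    name     string
    lockType LockType
    acquired bool
}

// canAcquire returns false if any already-acquired lock of a different type
// exists, which is what rejects a BACKUP_LOCK during a retention purge.
func canAcquire(candidate fileLock, existing []fileLock) bool {
    for _, l := range existing {
        if l.acquired && l.lockType != candidate.lockType {
            return false
        }
    }
    return true
}

func main() {
    held := []fileLock{{name: "lock-e519ec1bc88a4080", lockType: DeletionLock, acquired: true}}
    backup := fileLock{name: "lock-5242291888d04df6", lockType: BackupLock}
    fmt.Println(canAcquire(backup, held)) // false: the deletion lock wins
}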

james-munson (Contributor) commented Feb 6, 2024

There are 16 such collisions between a DELETION_LOCK and a BACKUP_LOCK for the two volumes over the previous 90 minutes, each leading to a residual failed backup CR. So either the deletion is taking surprisingly long, or the recurrence schedule of once a minute is pushing too hard.

Is the node restart having an impact on this, or does it happen even without that? The last node restart was at 2024-02-06T04:34:17, and its rebuild finished shortly after:

grep healthyAt replicas.yaml 
      healthyAt: "2024-02-06T04:16:50Z"
      healthyAt: "2024-02-06T04:16:50Z"
      healthyAt: "2024-02-06T04:16:41Z"
      healthyAt: "2024-02-06T04:35:16Z"

and it's hard to see how it would still be having ripple effects an hour and a half later.

The times of the failures are well spread out:

grep -h "failed lock" logs/longhorn-system/instance-manager*/* | grep "Failed to create backup" | sort | awk '{ print $1 }'
2024-02-06T04:29:16.968897204Z
2024-02-06T04:38:26.856587527Z
2024-02-06T04:47:20.996068506Z
2024-02-06T04:49:37.315667981Z
2024-02-06T04:58:05.069025238Z
2024-02-06T05:07:29.771652195Z
2024-02-06T05:18:18.749408919Z
2024-02-06T05:18:22.223025461Z
2024-02-06T05:26:30.171523462Z
2024-02-06T05:28:18.806953111Z
2024-02-06T05:34:57.077060093Z
2024-02-06T05:36:23.822837034Z
2024-02-06T05:43:15.192112147Z
2024-02-06T05:45:02.002675977Z
2024-02-06T05:53:19.929305574Z
2024-02-06T05:53:24.058831215Z

and if backups were indeed created every minute over those 90 minutes, the vast majority of them succeeded.

yangchiu (Member, Author) commented Feb 7, 2024

I think it's fair that lock contention causes backup creation and deletion to fail. If that makes sense to you as well, feel free to close this issue! cc @innobead

james-munson (Contributor) commented

Ok. I think I'll compare it to the issues seen in #7744 first.

innobead (Member) commented Feb 7, 2024

Removing backport labels first, as this could be related to #7744.

@innobead innobead removed the priority/1 Highly recommended to fix in this release (managed by PO) label Feb 7, 2024
@innobead innobead added the priority/2 Nice to fix in this release (managed by PO) label Feb 7, 2024
james-munson (Contributor) commented Feb 8, 2024

This is not the same as #7744. The symptom of a lock collision is the same, but that one was caused both by a race in the test script itself and by another race in the cleanup of the backupstore when the setting itself was cleared.

This one is just a simple race between a new backup and the retention-policy purge triggered by the previous backup only a short time before. In no case was the existing DELETION_LOCK older than 15 seconds at the time of the collision.

I think this kind of collision was anticipated, looking at https://github.com/longhorn/backupstore/blob/b375644f1c5759cf0e8e91263ccf7dda09516004/lock.go#L104

Bottom line is, I think this is behaving as designed. Pushing a backup and a purge once a minute may be a little aggressive. I will close this issue, although maybe the test can be slightly modified to recur less often.
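
For example, a slightly relaxed schedule for the test could look like the fragment below (illustrative only; the actual refinement is whatever lands in longhorn/longhorn-tests#1539), leaving more headroom between the retention purge and the next backup:

spec:
  task: backup
  cron: "*/5 * * * *"   # every five minutes instead of every minute
  retain: 1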

yangchiu (Member, Author) commented, quoting @james-munson:

Bottom line is, I think this is behaving as designed. Pushing a backup and a purge once a minute may be a little aggressive. I will close this issue, although maybe the test can be slightly modified to recur less often.

The test case will be refined in longhorn/longhorn-tests#1539
