
[BUG][v1.5.4-rc1] Recurring job failed to create/delete backups after node reboot #7854

Closed
yangchiu opened this issue Feb 6, 2024 · 10 comments
Labels
area/recurring-job Longhorn recurring job related · area/volume-backup-restore Volume backup restore · kind/bug · priority/2 Nice to fix in this release (managed by PO) · reproduce/rare < 50% reproducible · severity/1 Function broken (a critical incident with very high impact, e.g. data corruption or failed upgrade)
Milestone
v1.5.4

Comments

yangchiu (Member) commented Feb 6, 2024

Describe the bug

  1. Create a volume and create a backup recurring job with cron * * * * * (create a backup every minute) and retain = 1 (keep only one backup). A manifest sketch is shown below, after the error message.
  2. Keep writing data to the volume.
  3. Reboot a node that hosts a replica. After the node and the volume are back in service, the recurring job can still behave abnormally, and lots of Error backups remain in the backup list:
    [screenshot: backup list showing backups stuck in the Error state]
    Even a long time after the reboot event, newly created backups can still randomly run into the Error state with an error message like:
Error:                      proxyServer=10.42.3.10:8501 destination=10.42.1.72:10000: 
failed to backup snapshot vol-yde7-971b6006-6f18-4d12-85dc-c337a72d5220 to backup-5b793b248f96439d: 
rpc error: code = Unknown desc = failed to create backup to s3://backupbucket@us-east-1/backupstore 
for volume vol-yde769: rpc error: code = Unknown 
desc = failed lock backupstore/volumes/e2/eb/vol-yde769/locks/lock-5242291888d04df6.lck type 1 acquisition
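
For reference, the recurring job in step 1 corresponds roughly to a RecurringJob manifest like the sketch below. The job name, namespace, and group are illustrative assumptions; only the cron schedule and the retain count come from the report.

apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: backup-every-minute    # illustrative name
  namespace: longhorn-system
spec:
  task: backup        # create a backup on each run
  cron: "* * * * *"   # run every minute, as in step 1
  retain: 1           # keep only the most recent backup
  concurrency: 1
  groups:
    - default         # assumed group used to attach the job to the test volume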

To Reproduce

Check out PR longhorn/longhorn-tests#1539 and run the test case Reboot Replica Node While Heavy Writing And Recurring Jobs Exist.

Expected behavior

Support bundle for troubleshooting

supportbundle_359c9225-c5aa-42d3-9af0-6a301e68c081_2024-02-06T05-54-16Z.zip

Environment

  • Longhorn version: v1.5.4-rc1
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.28.4+k3s1
    • Number of control plane nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: sles 15-sp5
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

@yangchiu yangchiu added kind/bug reproduce/often 80 - 50% reproducible priority/1 Highly recommended to fix in this release (managed by PO) area/volume-backup-restore Volume backup restore area/recurring-job Longhorn recurring job related labels Feb 6, 2024
@yangchiu yangchiu added this to the v1.5.4 milestone Feb 6, 2024
@innobead innobead added the severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade)) label Feb 6, 2024
innobead (Member) commented Feb 6, 2024

I assume this also happened in 1.6.x. @yangchiu you missed the severity label.

innobead (Member) commented Feb 6, 2024

@james-munson Please help with this. Thanks.

yangchiu (Member, Author) commented Feb 6, 2024

This could be difficult to reproduce. The test case can fail for reasons other than the one described in #7854 (comment).

It seems hard to predict when a backup creation will be triggered by a recurring job after a node reboot.

@james-munson Please don't try to reproduce it by running the test case, since the failures can vary. The test case needs to be refined first. Please look into the support bundle to see if there is any clue; if not, I suggest not digging into this and waiting until the PR is refined and merged. Sorry for the inconvenience.

james-munson (Contributor) commented Feb 6, 2024

The error comes from backupstore https://github.com/longhorn/backupstore/blob/9c90a8728ca2e1f95f10a4c6cf67cf7ac20c23e4/lock.go#L104-111

In logs/longhorn-system/instance-manager-a98079d3080d186b5dd5471b92655380/instance-manager.log

2024-02-06T05:53:24.019426177Z time="2024-02-06T05:53:24Z" level=info msg="Trying to acquire lock backupstore/volumes/e2/eb/vol-yde769/locks/lock-5242291888d04df6.lck" func="backupstore.(*FileLock).canAcquire" file="lock.go:65" lock="{ volume: vol-yde769, name: lock-5242291888d04df6, type: 1, acquired: false, serverTime: 2024-02-06 05:53:21 +0000 UTC }" pkg=backupstore
2024-02-06T05:53:24.019431667Z time="2024-02-06T05:53:24Z" level=info msg="backupstore volume vol-yde769 contains locks [{ volume: , name: lock-5242291888d04df6, type: 1, acquired: false, serverTime: 2024-02-06 05:53:21 +0000 UTC } { volume: , name: lock-e519ec1bc88a4080, type: 2, acquired: true, serverTime: 2024-02-06 05:53:17 +0000 UTC }]" func="backupstore.(*FileLock).canAcquire" file="lock.go:66" pkg=backupstore

So the BACKUP_LOCK (LockType 1) lock-5242291888d04df6 is rejected by the already-acquired and unexpired DELETION_LOCK (LockType 2) lock-e519ec1bc88a4080 from 7 seconds before. What's up with that one?
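
For readers less familiar with the backupstore locking, the rejection above boils down to the rule sketched in the simplified Go snippet below. This is not the actual lock.go implementation (which also handles lock expiry, priority, and refresh); the names and structure here are assumptions for illustration only.

// Simplified sketch of the mutual-exclusion rule that rejected the backup:
// a lock cannot be acquired while an unexpired lock of a different type is
// already held. Type values match the ones quoted in the error (1 = backup,
// 2 = deletion); everything else is illustrative.
package main

import "fmt"

type LockType int

const (
    BackupLock   LockType = 1 // taken while creating a backup
    DeletionLock LockType = 2 // taken while purging old backups
)

type fileLock struct {
    name     string
    lockType LockType
    acquired bool
}

// canAcquire returns false if any already-acquired lock of a different type
// exists, which is what rejects a BACKUP_LOCK during a retention purge.
func canAcquire(candidate fileLock, existing []fileLock) bool {
    for _, l := range existing {
        if l.acquired && l.lockType != candidate.lockType {
            return false
        }
    }
    return true
}

func main() {
    held := []fileLock{{name: "lock-e519ec1bc88a4080", lockType: DeletionLock, acquired: true}}
    backup := fileLock{name: "lock-5242291888d04df6", lockType: BackupLock}
    fmt.Println(canAcquire(backup, held)) // false: the deletion lock wins
}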

james-munson (Contributor) commented Feb 6, 2024

There are 16 such collisions between a DELETION_LOCK and a BACKUP_LOCK for the two volumes over the previous 90 minutes, each leading to a residual failed backup CR. So either the deletion is taking surprisingly long, or the recurrence schedule of once a minute is pushing too hard.

Is the node restart having an impact on this, or does it happen even without that? The last node restart was at 2024-02-06T04:34:17, and its rebuild finished shortly after:

grep healthyAt replicas.yaml 
      healthyAt: "2024-02-06T04:16:50Z"
      healthyAt: "2024-02-06T04:16:50Z"
      healthyAt: "2024-02-06T04:16:41Z"
      healthyAt: "2024-02-06T04:35:16Z"

and it's hard to see how it would still be having ripple effects an hour and a half later.

The times of the failures are well spread out:

grep -h "failed lock" logs/longhorn-system/instance-manager*/* | grep "Failed to create backup" | sort | awk '{ print $1 }'
2024-02-06T04:29:16.968897204Z
2024-02-06T04:38:26.856587527Z
2024-02-06T04:47:20.996068506Z
2024-02-06T04:49:37.315667981Z
2024-02-06T04:58:05.069025238Z
2024-02-06T05:07:29.771652195Z
2024-02-06T05:18:18.749408919Z
2024-02-06T05:18:22.223025461Z
2024-02-06T05:26:30.171523462Z
2024-02-06T05:28:18.806953111Z
2024-02-06T05:34:57.077060093Z
2024-02-06T05:36:23.822837034Z
2024-02-06T05:43:15.192112147Z
2024-02-06T05:45:02.002675977Z
2024-02-06T05:53:19.929305574Z
2024-02-06T05:53:24.058831215Z

and if backups were indeed created every minute over those 90 minutes, the vast majority of them succeeded.

yangchiu (Member, Author) commented Feb 7, 2024

I think it's fair that lock contention causes backup creation and deletion to fail. If that makes sense to you as well, feel free to close this issue! cc @innobead

james-munson (Contributor) commented

Ok. I think I'll compare it to the issues seen in #7744 first.

innobead (Member) commented Feb 7, 2024

Removing backport labels first, as this could be related to #7744.

@innobead innobead removed the priority/1 Highly recommended to fix in this release (managed by PO) label Feb 7, 2024
@innobead innobead added the priority/2 Nice to fix in this release (managed by PO) label Feb 7, 2024
james-munson (Contributor) commented Feb 8, 2024

This is not the same as #7744. The symptom of a lock collision is the same, but that one was caused both by a race in the test script itself and by another race in the cleanup of the backupstore when the setting itself was cleared.

This one is just a simple race between a new backup and the retention-policy purge triggered by the previous backup only a short time before. In no case was the existing DELETION_LOCK older than 15 seconds at the time of the collision.

I think this kind of collision was anticipated, looking at https://github.com/longhorn/backupstore/blob/b375644f1c5759cf0e8e91263ccf7dda09516004/lock.go#L104

Bottom line is, I think this is behaving as designed. Pushing a backup and a purge once a minute may be a little aggressive. I will close this issue, although maybe the test can be slightly modified to recur less often.
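
For example, a slightly relaxed schedule for the test could look like the fragment below (illustrative only; the actual refinement is whatever lands in longhorn/longhorn-tests#1539), leaving more headroom between the retention purge and the next backup:

spec:
  task: backup
  cron: "*/5 * * * *"   # every five minutes instead of every minute
  retain: 1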

yangchiu (Member, Author) commented, quoting @james-munson:

Bottom line is, I think this is behaving as designed. Pushing a backup and a purge once a minute may be a little aggressive. I will close this issue, although maybe the test can be slightly modified to recur less often.

The test case will be refined in longhorn/longhorn-tests#1539
