[BUG] Create backup failed: failed lock lock-*.lck type 1 acquisition #7744
This happens when we try to create a backup while another backup is being deleted. Wonder if this issue happened in the v1.5.x regression test? @yangchiu
Testing - Modify the test to use a generated volume name instead of a fixed name: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6054/console
Tested with the same test image, ran the test case for
Testing - Modify the test to use a generated volume name instead of a fixed name - 99/100 success
But it actually creates the backup successfully. There is no locking issue.
I think I found the root cause.
We shouldn't get the volume first and then wait for backup completion. The reason this issue is so rare might be that we first wait for 4 snapshots. cc @mantissahz
Testing - get the volume after wait_for_backup_completion: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6057/console
Yes, it is. @ChanYiLin

```python
complete_backup_1_count = 0
restore_snapshot_name = ""
# volume = client.by_id_volume(volume_name1)
volume = wait_for_backup_completion(client, volume_name1)
for b in volume.backupStatus:
    if back1 + "-" in b.snapshot:
        complete_backup_1_count += 1
        restore_snapshot_name = b.snapshot
assert complete_backup_1_count == 1
```
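For reference, the role that `wait_for_backup_completion` plays in this fix can be sketched as a polling helper. This is a minimal, hypothetical sketch, not the real test-library implementation; the `state` field and the `"Completed"` value are assumptions.

```python
import time

def wait_for_backup_completion(client, volume_name, retries=180, interval=1.0):
    # Re-fetch the volume on every iteration: a volume object grabbed before
    # the backup finishes carries a stale backupStatus list, which is exactly
    # the ordering bug discussed above.
    for _ in range(retries):
        volume = client.by_id_volume(volume_name)
        if volume.backupStatus and all(
                b.state == "Completed" for b in volume.backupStatus):
            return volume
        time.sleep(interval)
    raise AssertionError(f"backup on {volume_name} did not complete in time")
```

The point of the fix is the ordering: fetch the volume only after this wait returns, so `volume.backupStatus` reflects the finished backup.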
@mantissahz
I have run the test and confirmed that the root cause is that we get the volume before the backup is ready.
Conclusion: this issue comes down to 2 reasons.
For this test, it is also because we deleted the backup previously and the lock hadn't been released yet.
Since no test cases run in parallel, it could imply that the backup cleanup after a test case completes doesn't really wait for the backups to be deleted. But there actually is a wait mechanism in
In
For the pollution, I found the proof in the test support bundle supportbundle_e6e4d284-3981-4fc0-8f00-8051c70902dc_2024-01-22T05-36-02Z.zip you provided. At
In
And the type 2 lock belongs to
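The lock conflict described above can be modeled in a small sketch. This is only an illustration of the mutual exclusion, assuming (per the error message and this thread) that backup locks are type 1, deletion locks are type 2, and a lock can only be acquired while no lock of the other type is held on the same backup volume; the constants and helper names are hypothetical.

```python
import time

# Assumption from this thread: backup locks are type 1, deletion locks are
# type 2, and the two types are mutually exclusive on one backup volume.
BACKUP_LOCK, DELETION_LOCK = 1, 2

def can_acquire(lock_type, held_locks):
    # A new lock is compatible only with held locks of the same type.
    return all(held == lock_type for held in held_locks)

def acquire_with_retry(lock_type, held_locks, attempts=3, interval=0.0):
    for _ in range(attempts):
        if can_acquire(lock_type, held_locks):
            held_locks.append(lock_type)
            return True
        time.sleep(interval)
    # Mirrors the reported "failed lock lock-*.lck type 1 acquisition".
    return False
```

So a create-backup request (type 1) fails exactly when a not-yet-released deletion lock (type 2) is still present.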
I think it is because two test cases ran too close together: the cleanup hadn't finished yet, and we then unmounted the NFS and remounted it for the next test. It seems there is a chance the lock folder and volume folder are still there, somehow not yet deleted, maybe because the deletion got interrupted.
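A minimal sketch of the missing guard: before unmounting the NFS share between test cases, poll until the lock and volume folders are actually gone instead of assuming the deletion finished. The function name, paths, and timings here are hypothetical.

```python
import os
import time

def wait_for_dirs_removed(paths, retries=60, interval=1.0):
    """Block until every path in `paths` is gone, so the next test case does
    not inherit stale lock-*.lck files after the NFS share is remounted."""
    for _ in range(retries):
        if not any(os.path.exists(p) for p in paths):
            return True
        time.sleep(interval)
    return False
```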
I found the root causes of it.
That is why, in the log, there are two
No wait, the obj the second reconcile gets could still have the finalizer on it, because it gets it from the cache. We should skip the remove when the volumeDir is gone in
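The guard being proposed can be sketched roughly like this (a Python stand-in for the real Go controller; `BackupVolume` and its fields are hypothetical): when the object from the cache still carries the finalizer but the remote volume directory is already gone, skip the remote remove and just drop the finalizer.

```python
class BackupVolume:
    """Hypothetical stand-in for the controller's backup-volume object."""
    def __init__(self, volume_dir_exists, finalizers=None):
        self.volume_dir_exists = volume_dir_exists
        self.finalizers = finalizers if finalizers is not None else ["longhorn"]
        self.remote_deleted = False

def reconcile_deletion(bv):
    # A second reconcile may see a cached object that still has the
    # finalizer even though an earlier reconcile already deleted the
    # remote volume directory. In that case, skip the remote remove.
    if not bv.volume_dir_exists:
        bv.finalizers = []
        return
    bv.remote_deleted = True
    bv.volume_dir_exists = False
    bv.finalizers = []
```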
Pre Ready-For-Testing Checklist
PR:
Verified passed on master-head (longhorn-engine 3c1301) and v1.6.x-head (longhorn-engine 5a9b9d) by running the test case. Test results: Leaving issue closing to @chriscchien in case you're already in the middle of testing.
This reverts commit 1d5129cd1074445964733aa74257d0acd4fc7d9e in backupstore.
Just reverted it; let's clarify it later. cc @ChanYiLin
Note: We should check the folder instead of
Revisit this after the PR is merged.
Verified passed on master-head (longhorn-engine f241896) by running the test case. Test results:
Describe the bug
Run test case test_recurring_job_restored_from_backup_target[nfs]. It's possible to encounter a backup creation failure with the error message above:
https://ci.longhorn.io/job/public/job/v1.6.x/job/v1.6.x-longhorn-upgrade-tests-sles-amd64/8/testReport/junit/tests/test_recurring_job/test_recurring_job_restored_from_backup_target_nfs_/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6026/console
To Reproduce
Run test case test_recurring_job_restored_from_backup_target[nfs]
Expected behavior
Support bundle for troubleshooting
supportbundle_e6e4d284-3981-4fc0-8f00-8051c70902dc_2024-01-22T05-36-02Z.zip
Environment
Additional context