Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG][v1.6.1-rc3] test_backup_lock_creation_during_deletion failed with *.lck type 1 acquisition #8269

Closed
yangchiu opened this issue Mar 28, 2024 · 5 comments
Assignees
Labels
area/volume-backup-restore Volume backup restore kind/bug priority/1 Highly recommended to fix in this release (managed by PO) reproduce/rare < 50% reproducible severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact)
Milestone

Comments

@yangchiu
Copy link
Member

Describe the bug

Test case test_backup_lock_creation_during_deletion could fail with *.lck type 1 acquisition:

set_random_backupstore = None
client = <longhorn.Client object at 0x7fd221bef730>
core_api = <kubernetes.client.api.core_v1_api.CoreV1Api object at 0x7fd221927430>
volume_name = 'longhorn-testvol-nnpfbk'
csi_pv = {'apiVersion': 'v1', 'kind': 'PersistentVolume', 'metadata': {'name': 'longhorn-testvol-nnpfbk-std'}, 'spec': {'access...eplicaTimeout': '30'}, 'volumeHandle': 'longhorn-testvol-nnpfbk-std'}, 'persistentVolumeReclaimPolicy': 'Delete', ...}}
pvc = {'apiVersion': 'v1', 'kind': 'PersistentVolumeClaim', 'metadata': {'name': 'longhorn-testvol-nnpfbk-std-pvc'}, 'spec':...ources': {'requests': {'storage': '1073741824'}}, 'storageClassName': '', 'volumeName': 'longhorn-testvol-nnpfbk-std'}}
pod_make = <function pod_make.<locals>.make_pod at 0x7fd221ab80d0>

    def test_backup_lock_creation_during_deletion(set_random_backupstore, client, core_api, volume_name, csi_pv, pvc, pod_make):  # NOQA
        """
        Test backup locks
        Context:
        To test the locking mechanism that utilizes the backupstore,
        to prevent the following case of concurrent operations.
        - prevent backup creation during backup deletion
    
        steps:
        1. Create a volume, then create the corresponding PV, PVC and Pod.
        2. Wait for the pod running and the volume healthy.
        3. Write data (DATA_SIZE_IN_MB_2) to the pod volume and get the md5sum.
        4. Take a backup.
        5. Wait for the backup to be completed.
        6. Delete the backup.
        7. Create another backup of the same volume.
        8. Wait for the delete backup to be completed.
        8. Wait for the backup to be completed.
        9. Assert there is 1 backup in the backup store.
        """
        backupstore_cleanup(client)
        std_volume_name = volume_name + "-std"
    
        std_pod_name, _, _, std_md5sum1 = \
            prepare_pod_with_data_in_mb(
                client, core_api, csi_pv, pvc, pod_make, std_volume_name,
                data_size_in_mb=DATA_SIZE_IN_MB_1)
        std_volume = client.by_id_volume(std_volume_name)
        snap1 = create_snapshot(client, std_volume_name)
        std_volume.snapshotBackup(name=snap1.name)
        wait_for_backup_completion(client, std_volume_name, snap1.name)
        _, b1 = common.find_backup(client, std_volume_name, snap1.name)
    
        write_pod_volume_random_data(core_api, std_pod_name,
                                     "/data/test2", DATA_SIZE_IN_MB_1)
    
        backup_volume = client.by_id_backupVolume(std_volume_name)
        backup_volume.backupDelete(name=b1.name)
    
        snap2 = create_snapshot(client, std_volume_name)
        std_volume.snapshotBackup(name=snap2.name)
    
        wait_for_backup_delete(client, volume_name, b1.name)
>       wait_for_backup_completion(client, std_volume_name, snap2.name)

test_basic.py:3349: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

client = <longhorn.Client object at 0x7fd221bef730>
volume_name = 'longhorn-testvol-nnpfbk-std'
snapshot_name = '99fe2876-8c6e-4629-98be-b12bb44cbce1', retry_count = 300

    def wait_for_backup_completion(client, volume_name, snapshot_name=None,
                                   retry_count=RETRY_BACKUP_COUNTS):
        completed = False
        for _ in range(retry_count):
            v = client.by_id_volume(volume_name)
            for b in v.backupStatus:
                if snapshot_name is not None and b.snapshot != snapshot_name:
                    continue
                if b.state == "Completed":
                    assert b.progress == 100
                    assert b.error == ""
                    completed = True
                    break
            if completed:
                break
            time.sleep(RETRY_BACKUP_INTERVAL)
>       assert completed is True, f" Backup status = {b.state}," \
                                  f" Backup Progress = {b.progress}, Volume = {v}"
E       AssertionError:  Backup status = Error, Backup Progress = 0, Volume = {'accessMode': 'rwo', 'backingImage': '', 'backupCompressionMethod': 'lz4', 'backupStatus': [{'backupURL': '', 'error': 'proxyServer=10.42.4.22:8501 destination=10.42.4.22:10010: failed to backup snapshot 99fe2876-8c6e-4629-98be-b12bb44cbce1 to backup-1c2e805248384e04: rpc error: code = Internal desc = failed to create backup: failed to create backup to nfs://longhorn-test-nfs-svc.default:/opt/backupstore for volume longhorn-testvol-nnpfbk-std: rpc error: code = Unknown desc = failed lock backupstore/volumes/03/cd/longhorn-testvol-nnpfbk-std/locks/lock-4f2e7609e43e49fb.lck type 1 acquisition', 'progress': 0, 'replica': '', 'size': '', 'snapshot': '', 'state': 'Error'}], 'cloneStatus': {'snapshot': '', 'sourceVolume': '', 'state': ''}, 'conditions': {'Restore': {'lastProbeTime': '', 'lastTransitionTime': '2024-03-27T15:26:57Z', 'message': '', 'reason': '', 'status': 'False'}, 'Scheduled': {'lastProbeTime': '', 'lastTransitionTime': '2024-03-27T15:26:57Z', 'message': '', 'reason': '', 'status': 'True'}, 'TooManySnapshots': {'lastProbeTime': '', 'lastTransitionTime': '2024-03-27T15:26:57Z', 'message': '', 'reason': '', 'status': 'False'}, 'WaitForBackingImage': {'lastProbeTime': '', 'lastTransitionTime': '2024-03-27T15:26:57Z', 'message': '', 'reason': '', 'status': 'False'}}, 'controllers': [{'actualSize': '260808704', 'address': '10.42.4.22', 'currentImage': 'longhornio/longhorn-engine:v1.6.1-rc3', 'endpoint': '/dev/longhorn/longhorn-testvol-nnpfbk-std', 'hostId': 'ip-10-0-2-24', 'image': 'longhornio/longhorn-engine:v1.6.1-rc3', 'instanceManagerName': 'instance-manager-18a31ac82119492eba7439c86b19ca8a', 'isExpanding': False, 'lastExpansionError': '', 'lastExpansionFailedAt': '', 'lastRestoredBackup': '', 'name': 'longhorn-testvol-nnpfbk-std-e-0', 'requestedBackupRestore': '', 'running': True, 'size': '1073741824', 'unmapMarkSnapChainRemovedEnabled': False}], 'created': '2024-03-27 15:26:56 +0000 UTC', 'currentImage': 'longhornio/longhorn-engine:v1.6.1-rc3', 'dataEngine': 'v1', 'dataLocality': 'disabled', 'dataSource': '', 'disableFrontend': False, 'diskSelector': [], 'encrypted': False, 'fromBackup': '', 'frontend': 'blockdev', 'image': 'longhornio/longhorn-engine:v1.6.1-rc3', 'kubernetesStatus': {'lastPVCRefAt': '', 'lastPodRefAt': '', 'namespace': 'default', 'pvName': 'longhorn-testvol-nnpfbk-std', 'pvStatus': 'Bound', 'pvcName': 'longhorn-testvol-nnpfbk-std-pvc', 'workloadsStatus': [{'podName': 'longhorn-testvol-nnpfbk-std-pod', 'podStatus': 'Running', 'workloadName': '', 'workloadType': ''}]}, 'lastAttachedBy': '', 'lastBackup': '', 'lastBackupAt': '', 'migratable': False, 'name': 'longhorn-testvol-nnpfbk-std', 'nodeSelector': [], 'numberOfReplicas': 3, 'offlineReplicaRebuilding': 'disabled', 'offlineReplicaRebuildingRequired': False, 'purgeStatus': [{'error': '', 'isPurging': False, 'progress': 0, 'replica': 'longhorn-testvol-nnpfbk-std-r-4ccb5cd9', 'state': ''}, {'error': '', 'isPurging': False, 'progress': 0, 'replica': 'longhorn-testvol-nnpfbk-std-r-029af8cd', 'state': ''}, {'error': '', 'isPurging': False, 'progress': 0, 'replica': 'longhorn-testvol-nnpfbk-std-r-561f5575', 'state': ''}], 'ready': True, 'rebuildStatus': [], 'recurringJobSelector': None, 'replicaAutoBalance': 'ignored', 'replicaDiskSoftAntiAffinity': 'ignored', 'replicaSoftAntiAffinity': 'ignored', 'replicaZoneSoftAntiAffinity': 'ignored', 'replicas': [{'address': '10.42.3.23', 'currentImage': 'longhornio/longhorn-engine:v1.6.1-rc3', 'dataEngine': 'v1', 'dataPath': '/var/lib/longhorn/replicas/longhorn-testvol-nnpfbk-std-3da1212e', 'diskID': '0a02e077-460a-4a09-995f-2c1a1320489c', 'diskPath': '/var/lib/longhorn/', 'failedAt': '', 'hostId': 'ip-10-0-2-180', 'image': 'longhornio/longhorn-engine:v1.6.1-rc3', 'instanceManagerName': 'instance-manager-6327107b662e21950646bdb0158b4d0f', 'mode': 'RW', 'name': 'longhorn-testvol-nnpfbk-std-r-029af8cd', 'running': True}, {'address': '10.42.1.25', 'currentImage': 'longhornio/longhorn-engine:v1.6.1-rc3', 'dataEngine': 'v1', 'dataPath': '/var/lib/longhorn/replicas/longhorn-testvol-nnpfbk-std-133befd7', 'diskID': '5e0fe913-5ace-4884-8d19-ee75d1e8eabb', 'diskPath': '/var/lib/longhorn/', 'failedAt': '', 'hostId': 'ip-10-0-2-121', 'image': 'longhornio/longhorn-engine:v1.6.1-rc3', 'instanceManagerName': 'instance-manager-647c81664f47f7f8ecdf091dc5f0c151', 'mode': 'RW', 'name': 'longhorn-testvol-nnpfbk-std-r-4ccb5cd9', 'running': True}, {'address': '10.42.4.22', 'currentImage': 'longhornio/longhorn-engine:v1.6.1-rc3', 'dataEngine': 'v1', 'dataPath': '/var/lib/longhorn/replicas/longhorn-testvol-nnpfbk-std-fe3e58c8', 'diskID': '09dc1e02-6dcd-4422-bcf3-058b2bcd0dc8', 'diskPath': '/var/lib/longhorn/', 'failedAt': '', 'hostId': 'ip-10-0-2-24', 'image': 'longhornio/longhorn-engine:v1.6.1-rc3', 'instanceManagerName': 'instance-manager-18a31ac82119492eba7439c86b19ca8a', 'mode': 'RW', 'name': 'longhorn-testvol-nnpfbk-std-r-561f5575', 'running': True}], 'restoreInitiated': False, 'restoreRequired': False, 'restoreStatus': [{'backupURL': '', 'error': '', 'filename': '', 'isRestoring': False, 'lastRestored': '', 'progress': 0, 'replica': 'longhorn-testvol-nnpfbk-std-r-4ccb5cd9', 'state': ''}, {'backupURL': '', 'error': '', 'filename': '', 'isRestoring': False, 'lastRestored': '', 'progress': 0, 'replica': 'longhorn-testvol-nnpfbk-std-r-029af8cd', 'state': ''}, {'backupURL': '', 'error': '', 'filename': '', 'isRestoring': False, 'lastRestored': '', 'progress': 0, 'replica': 'longhorn-testvol-nnpfbk-std-r-561f5575', 'state': ''}], 'restoreVolumeRecurringJob': 'ignored', 'revisionCounterDisabled': False, 'robustness': 'healthy', 'shareEndpoint': '', 'shareState': '', 'size': '1073741824', 'snapshotDataIntegrity': 'ignored', 'snapshotMaxCount': 250, 'snapshotMaxSize': '0', 'staleReplicaTimeout': 0, 'standby': False, 'state': 'attached', 'unmapMarkSnapChainRemoved': 'ignored', 'volumeAttachment': {'attachments': {'csi-5a636ff3c2e70793f4b544a2b8e93a0413562c0d40c6282956bf94b3a4982d24': {'attachmentID': 'csi-5a636ff3c2e70793f4b544a2b8e93a0413562c0d40c6282956bf94b3a4982d24', 'attachmentType': 'csi-attacher', 'conditions': [{'lastProbeTime': '', 'lastTransitionTime': '2024-03-27T15:27:05Z', 'message': '', 'reason': '', 'status': 'True'}], 'nodeID': 'ip-10-0-2-24', 'parameters': {'disableFrontend': 'false', 'lastAttachedBy': ''}, 'satisfied': True}}, 'volume': 'longhorn-testvol-nnpfbk-std'}}

common.py:3154: AssertionError

https://ci.longhorn.io/job/public/job/v1.6.x/job/v1.6.x-longhorn-upgrade-tests-sles-amd64/81/testReport/junit/tests/test_basic/test_backup_lock_creation_during_deletion_nfs_/

It could be related to Delete the Backup and Create the Backup with very small time gap, this might happen.
It means to prevent the situation of concurrent deletion and creation.

To Reproduce

Run test case test_backup_lock_creation_during_deletion repeatedly.

Expected behavior

Support bundle for troubleshooting

Environment

  • Longhorn version: v1.6.1-rc3
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.28.4+k3s1
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version: sles 15-sp5
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

@yangchiu yangchiu added kind/bug reproduce/rare < 50% reproducible priority/1 Highly recommended to fix in this release (managed by PO) area/volume-backup-restore Volume backup restore severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) labels Mar 28, 2024
@yangchiu yangchiu added this to the v1.6.1 milestone Mar 28, 2024
@ChanYiLin
Copy link
Contributor

When there is a backup being deleted, there will be a deletion lock and the creation will fail at the moment
So the failure of the test is expected

I think the problem is that wait_for_backup_delete should wait until the backup is deleted and the lock is released.
Or the test case doesn't make sense.

Investigating...

@ChanYiLin
Copy link
Contributor

ChanYiLin commented Mar 28, 2024

longhorn/longhorn-tests#1837

Analysis

Background knowledege

  1. Longhorn needs to create snapshot first before creating Backup.
  2. Backup needs a Snapshot Name as Spec to trigger the backup process.
  3. Creating Backup during the previous Backup deletion will fail and the Backup.Status will be "Error" because of the deletion lock.
  4. Snapshot and Backup creation are async by creating the CR to trigger the process, then update the progress/status to the CRD status.

Analysis

  • This test has been changed and failed to follow the original test purpose: previous PR
    • There is also a bug in previous change that we should wait until previous Backup deletion to complete to create the next Backup.
    • Or it will fail to backup the second one and fail the test. (but it is expected, because of our lock mechanism)

Fix

Should change the code logic to:

  1. Create the first Snapshot1 and Backup1
  2. Create the second Snapshot2
  3. Delete the first Backup1
  4. Create the second Backup2
  5. Backup2 should be failed
  6. Wait for Backup1 to be deleted
  7. Create third Snapshot3 and Backup3
  8. Backup3 should be completed.

cc @innobead @derekbit @james-munson @mantissahz @roger-ryao @yangchiu

@longhorn-io-github-bot
Copy link

longhorn-io-github-bot commented Mar 28, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:
    • Run test: test_backup_lock_creation_during_deletion multiple times.
    • Test in v1.5.x, v1.6.x, master-head

No Longhorn code change
But Test changed: longhorn/longhorn-tests#1837

@innobead
Copy link
Member

@yangchiu reopen this if this is still an issue.

@roger-ryao
Copy link

Fix

Should change the code logic to:

  1. Create the first Snapshot1 and Backup1
  2. Create the second Snapshot2
  3. Delete the first Backup1
  4. Create the second Backup2
  5. Backup2 should be failed
  6. Wait for Backup1 to be deleted
  7. Create third Snapshot3 and Backup3
  8. Backup3 should be completed.

cc @innobead @derekbit @james-munson @mantissahz @roger-ryao @yangchiu

We observed a failure at step 5 on v1.5.5-rc2 regression. When Backup1 is deleted, Backup2 is successfully created. Currently, we don't have a good method to confirm whether Backup1 is in the process of being deleted. After discussing with @ChanYiLin , I'll leave the comment for the record.

cc @longhorn/qa

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/volume-backup-restore Volume backup restore kind/bug priority/1 Highly recommended to fix in this release (managed by PO) reproduce/rare < 50% reproducible severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact)
Projects
None yet
Development

No branches or pull requests

6 participants