
[BUG][v1.5.x] DR volume unable to be activated if the latest backup's been deleted #7997

Closed
yangchiu opened this issue Feb 22, 2024 · 4 comments
Assignees
Labels
area/volume-backup-restore Volume backup restore · area/volume-disaster-recovery Volume DR · kind/bug · priority/0 Must be fixed in this release (managed by PO) · reproduce/always 100% reproducible · severity/1 Function broken (a critical incident with very high impact, e.g. data corruption or failed upgrade)
Milestone
v1.5.4

Comments

@yangchiu
Member

yangchiu commented Feb 22, 2024

Describe the bug

Test case test_dr_volume_with_all_backup_blocks_deleted failed on v1.5.x-head:

https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6353/

set_random_backupstore = None
client = <longhorn.Client object at 0x7fc41ac46f70>
core_api = <kubernetes.client.api.core_v1_api.CoreV1Api object at 0x7fc41aa9c910>
volume_name = 'longhorn-testvol-zkd5jc'

    def test_dr_volume_with_all_backup_blocks_deleted(set_random_backupstore, client, core_api, volume_name):  # NOQA
        """
        Test that a DR volume can be activated after all backups are deleted.
    
        Context:
    
        We want to make sure that a DR volume can still be activated after all backups have been deleted.
    
        Steps:
    
        1.  Create a volume and attach to the current node.
        2.  Write 4 MB to the beginning of the volume (2 x 2MB backup blocks).
        3.  Create backup(0) of the volume.
        4.  Verify backup block count == 2.
        5.  Create DR volume from backup(0).
        6.  Verify DR volume last backup is backup(0).
        7.  Delete backup(0).
        8.  Verify backup block count == 0.
        9.  Verify DR volume last backup is empty.
        10. Activate and verify DR volume data is data(0).
        """
        backupstore_cleanup(client)
    
        host_id = get_self_host_id()
    
        vol = create_and_check_volume(client, volume_name, 2, SIZE)
        vol.attach(hostId=host_id)
        vol = common.wait_for_volume_healthy(client, volume_name)
    
        data0 = {'pos': 0, 'len': 2 * BACKUP_BLOCK_SIZE,
                 'content': common.generate_random_data(2 * BACKUP_BLOCK_SIZE)}
        _, backup0, _, data0 = create_backup(
            client, volume_name, data0)
    
        backup_blocks_count = backupstore_count_backup_block_files(client,
                                                                   core_api,
                                                                   volume_name)
        assert backup_blocks_count == 2
    
        dr_vol_name = "dr-" + volume_name
        client.create_volume(name=dr_vol_name, size=SIZE,
                             numberOfReplicas=2, fromBackup=backup0.url,
                             frontend="", standby=True)
        check_volume_last_backup(client, dr_vol_name, backup0.name)
        wait_for_backup_restore_completed(client, dr_vol_name, backup0.name)
    
        delete_backup(client, volume_name, backup0.name)
        assert backupstore_count_backup_block_files(client,
                                                    core_api,
                                                    volume_name) == 0
        check_volume_last_backup(client, dr_vol_name, "")
    
>       activate_standby_volume(client, dr_vol_name)

test_basic.py:1027: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
common.py:4065: in activate_standby_volume
    wait_for_volume_detached(client, volume_name)
common.py:1873: in wait_for_volume_detached
    return wait_for_volume_status(client, name,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

client = <longhorn.Client object at 0x7fc41ac46f70>
name = 'dr-longhorn-testvol-zkd5jc', key = 'state', value = 'detached'
retry_count = 150

    def wait_for_volume_status(client, name, key, value,
                               retry_count=RETRY_COUNTS):
        wait_for_volume_creation(client, name)
        for i in range(retry_count):
            volume = client.by_id_volume(name)
            if volume[key] == value:
                break
            time.sleep(RETRY_INTERVAL)
>       assert volume[key] == value, f" value={value}\n. \
                volume[key]={volume[key]}\n. volume={volume}"
E       AssertionError:  value=detached
E       .             volume[key]=attached
E       . volume={'accessMode': 'rwo', 'backendStoreDriver': 'v1', 'backingImage': '', 'backupCompressionMethod': 'lz4', 'backupStatus': [], 'cloneStatus': {'snapshot': '', 'sourceVolume': '', 'state': ''}, 'conditions': {'Restore': {'lastProbeTime': '', 'lastTransitionTime': '2024-02-22T00:28:53Z', 'message': '', 'reason': 'RestoreInProgress', 'status': 'True'}, 'Scheduled': {'lastProbeTime': '', 'lastTransitionTime': '2024-02-22T00:28:50Z', 'message': '', 'reason': '', 'status': 'True'}, 'TooManySnapshots': {'lastProbeTime': '', 'lastTransitionTime': '2024-02-22T00:28:50Z', 'message': '', 'reason': '', 'status': 'False'}, 'WaitForBackingImage': {'lastProbeTime': '', 'lastTransitionTime': '2024-02-22T00:28:50Z', 'message': '', 'reason': '', 'status': 'False'}}, 'controllers': [{'actualSize': '4194304', 'address': '10.42.1.10', 'currentImage': 'longhornio/longhorn-engine:v1.5.x-head', 'endpoint': '', 'engineImage': 'longhornio/longhorn-engine:v1.5.x-head', 'hostId': 'ip-10-0-2-11', 'instanceManagerName': 'instance-manager-d338d59ca657fd5aad047c08c4a93ff8', 'isExpanding': False, 'lastExpansionError': '', 'lastExpansionFailedAt': '', 'lastRestoredBackup': 'backup-e4228a2b25b74c65', 'name': 'dr-longhorn-testvol-zkd5jc-e-0', 'requestedBackupRestore': 'backup-e4228a2b25b74c65', 'running': True, 'size': '16777216', 'unmapMarkSnapChainRemovedEnabled': False}], 'created': '2024-02-22 00:28:50 +0000 UTC', 'currentImage': 'longhornio/longhorn-engine:v1.5.x-head', 'dataLocality': 'disabled', 'dataSource': '', 'disableFrontend': True, 'diskSelector': [], 'encrypted': False, 'engineImage': 'longhornio/longhorn-engine:v1.5.x-head', 'fromBackup': 's3://backupbucket@us-east-1/backupstore?backup=backup-e4228a2b25b74c65&volume=longhorn-testvol-zkd5jc', 'frontend': 'blockdev', 'kubernetesStatus': {'lastPVCRefAt': '', 'lastPodRefAt': '', 'namespace': '', 'pvName': '', 'pvStatus': '', 'pvcName': '', 'workloadsStatus': None}, 'lastAttachedBy': '', 'lastBackup': '', 'lastBackupAt': '', 'migratable': False, 'name': 'dr-longhorn-testvol-zkd5jc', 'nodeSelector': [], 'numberOfReplicas': 2, 'offlineReplicaRebuilding': 'disabled', 'offlineReplicaRebuildingRequired': False, 'purgeStatus': [{'error': '', 'isPurging': False, 'progress': 0, 'replica': 'dr-longhorn-testvol-zkd5jc-r-0af13e3c', 'state': ''}, {'error': '', 'isPurging': False, 'progress': 0, 'replica': 'dr-longhorn-testvol-zkd5jc-r-fe6dffa5', 'state': ''}], 'ready': False, 'rebuildStatus': [], 'recurringJobSelector': None, 'replicaAutoBalance': 'ignored', 'replicaSoftAntiAffinity': 'ignored', 'replicaZoneSoftAntiAffinity': 'ignored', 'replicas': [{'address': '10.42.1.10', 'backendStoreDriver': 'v1', 'currentImage': 'longhornio/longhorn-engine:v1.5.x-head', 'dataPath': '/var/lib/longhorn/replicas/dr-longhorn-testvol-zkd5jc-0c3e7d96', 'diskID': '8b064cfc-7e48-4cea-9e9a-a6cd86bd8f45', 'diskPath': '/var/lib/longhorn/', 'engineImage': 'longhornio/longhorn-engine:v1.5.x-head', 'failedAt': '', 'hostId': 'ip-10-0-2-11', 'instanceManagerName': 'instance-manager-d338d59ca657fd5aad047c08c4a93ff8', 'mode': 'RW', 'name': 'dr-longhorn-testvol-zkd5jc-r-0af13e3c', 'running': True}, {'address': '10.42.3.9', 'backendStoreDriver': 'v1', 'currentImage': 'longhornio/longhorn-engine:v1.5.x-head', 'dataPath': '/var/lib/longhorn/replicas/dr-longhorn-testvol-zkd5jc-5dddf1f5', 'diskID': 'c1da5a75-6f9e-48ac-a395-e16fbcfc218b', 'diskPath': '/var/lib/longhorn/', 'engineImage': 'longhornio/longhorn-engine:v1.5.x-head', 'failedAt': '', 'hostId': 'ip-10-0-2-27', 'instanceManagerName': 
'instance-manager-140f432c4b65ca9d48e227346bb38112', 'mode': 'RW', 'name': 'dr-longhorn-testvol-zkd5jc-r-fe6dffa5', 'running': True}], 'restoreInitiated': True, 'restoreRequired': True, 'restoreStatus': [{'backupURL': 's3://backupbucket@us-east-1/backupstore?backup=backup-e4228a2b25b74c65&volume=longhorn-testvol-zkd5jc', 'error': '', 'filename': 'volume-snap-2d6a297e-3d2a-40f1-82f8-b4be7c7bdd9a.img', 'isRestoring': False, 'lastRestored': 'backup-e4228a2b25b74c65', 'progress': 100, 'replica': 'dr-longhorn-testvol-zkd5jc-r-0af13e3c', 'state': 'complete'}, {'backupURL': 's3://backupbucket@us-east-1/backupstore?backup=backup-e4228a2b25b74c65&volume=longhorn-testvol-zkd5jc', 'error': '', 'filename': 'volume-snap-2d6a297e-3d2a-40f1-82f8-b4be7c7bdd9a.img', 'isRestoring': False, 'lastRestored': 'backup-e4228a2b25b74c65', 'progress': 100, 'replica': 'dr-longhorn-testvol-zkd5jc-r-fe6dffa5', 'state': 'complete'}], 'restoreVolumeRecurringJob': 'ignored', 'revisionCounterDisabled': False, 'robustness': 'healthy', 'shareEndpoint': '', 'shareState': '', 'size': '16777216', 'snapshotDataIntegrity': 'ignored', 'staleReplicaTimeout': 0, 'standby': False, 'state': 'attached', 'unmapMarkSnapChainRemoved': 'ignored', 'volumeAttachment': {'attachments': {'volume-restore-controller-dr-longhorn-testvol-zkd5jc': {'attachmentID': 'volume-restore-controller-dr-longhorn-testvol-zkd5jc', 'attachmentType': 'volume-restore-controller', 'conditions': [{'lastProbeTime': '', 'lastTransitionTime': '2024-02-22T00:28:53Z', 'message': '', 'reason': '', 'status': 'True'}], 'nodeID': 'ip-10-0-2-11', 'parameters': {'disableFrontend': 'true'}, 'satisfied': True}}, 'volume': 'dr-longhorn-testvol-zkd5jc'}}

common.py:1932: AssertionError

The volume remained stuck in Attached and Not Ready for workload indefinitely.
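A quick client-side check makes the mismatch visible (a diagnostic sketch; every field name below appears in the volume dump above):

v = client.by_id_volume('dr-longhorn-testvol-zkd5jc')
print(v['state'])       # 'attached' -- activate_standby_volume waits for 'detached'
print(v['ready'])       # False -> "Not Ready for workload"
print(v['lastBackup'])  # ''   -- cleared when backup(0) was deleted
print(v['controllers'][0]['lastRestoredBackup'])  # 'backup-e4228a2b25b74c65'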

And longhorn-manager kept logging this message over and over:

time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:29Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2

To Reproduce

Run test case test_dr_volume_with_all_backup_blocks_deleted
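Distilled to the essential calls (all taken verbatim from the test above; create_backup, delete_backup, and activate_standby_volume are helpers from the suite's common module):

# Create one backup of a healthy attached volume ...
_, backup0, _, data0 = create_backup(client, volume_name, data0)

# ... create a standby (DR) volume restoring from it ...
client.create_volume(name=dr_vol_name, size=SIZE, numberOfReplicas=2,
                     fromBackup=backup0.url, frontend="", standby=True)

# ... delete the only backup, which clears the DR volume's lastBackup ...
delete_backup(client, volume_name, backup0.name)

# ... then activate: on v1.5.x-head the volume never detaches.
activate_standby_volume(client, dr_vol_name)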

Expected behavior

Activating the DR volume should still succeed after its latest backup (or all of its backups) has been deleted.

Support bundle for troubleshooting

supportbundle_0fd28365-9049-4348-8608-680e94cc2f43_2024-02-22T00-59-33Z.zip

Environment

  • Longhorn version: v1.5.x-head
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version:
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

@yangchiu yangchiu added kind/bug severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade)) reproduce/always 100% reproducible priority/0 Must be fixed in this release (managed by PO) area/volume-backup-restore Volume backup restore labels Feb 22, 2024
@yangchiu yangchiu added this to the v1.5.4 milestone Feb 22, 2024
@yangchiu yangchiu changed the title [BUG][v1.5.x] DR volume unable to be activiated if the latest backup's been deleted [BUG][v1.5.x] DR volume unable to be activated if the latest backup's been deleted Feb 22, 2024
@innobead
Member

Is this a regression or a day 1 issue?

@shuo-wu
Contributor

shuo-wu commented Feb 22, 2024

It's a regression as well. Fixing it.

@longhorn-io-github-bot

longhorn-io-github-bot commented Feb 22, 2024

Pre Ready-For-Testing Checklist

@yangchiu
Member Author

Verified passed on v1.5.x-head (longhorn-manager 49619e2) by running test case test_dr_volume_with_backup_and_backup_volume_deleted.

Test result: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6405/
