
[BUG][v1.5.x] DR volume unable to be activated if the latest backup's been deleted #7997

Closed
yangchiu opened this issue Feb 22, 2024 · 4 comments
Assignees
Labels
area/volume-backup-restore Volume backup restore · area/volume-disaster-recovery Volume DR · kind/bug · priority/0 Must be fixed in this release (managed by PO) · reproduce/always 100% reproducible · severity/1 Function broken (a critical incident with very high impact, e.g. data corruption or failed upgrade)
Milestone
v1.5.4

Comments

@yangchiu
Member

yangchiu commented Feb 22, 2024

Describe the bug

Test case test_dr_volume_with_all_backup_blocks_deleted failed on v1.5.x-head:

https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6353/

set_random_backupstore = None
client = <longhorn.Client object at 0x7fc41ac46f70>
core_api = <kubernetes.client.api.core_v1_api.CoreV1Api object at 0x7fc41aa9c910>
volume_name = 'longhorn-testvol-zkd5jc'

    def test_dr_volume_with_all_backup_blocks_deleted(set_random_backupstore, client, core_api, volume_name):  # NOQA
        """
        Test that a DR volume can be activated after all backups are deleted.
    
        Context:
    
        We want to make sure that a DR volume can still be activated after all backups have been deleted.
    
        Steps:
    
        1.  Create a volume and attach to the current node.
        2.  Write 4 MB to the beginning of the volume (2 x 2MB backup blocks).
        3.  Create backup(0) of the volume.
        4.  Verify backup block count == 2.
        5.  Create DR volume from backup(0).
        6.  Verify DR volume last backup is backup(0).
        7.  Delete backup(0).
        8.  Verify backup block count == 0.
        9.  Verify DR volume last backup is empty.
        10. Activate and verify DR volume data is data(0).
        """
        backupstore_cleanup(client)
    
        host_id = get_self_host_id()
    
        vol = create_and_check_volume(client, volume_name, 2, SIZE)
        vol.attach(hostId=host_id)
        vol = common.wait_for_volume_healthy(client, volume_name)
    
        data0 = {'pos': 0, 'len': 2 * BACKUP_BLOCK_SIZE,
                 'content': common.generate_random_data(2 * BACKUP_BLOCK_SIZE)}
        _, backup0, _, data0 = create_backup(
            client, volume_name, data0)
    
        backup_blocks_count = backupstore_count_backup_block_files(client,
                                                                   core_api,
                                                                   volume_name)
        assert backup_blocks_count == 2
    
        dr_vol_name = "dr-" + volume_name
        client.create_volume(name=dr_vol_name, size=SIZE,
                             numberOfReplicas=2, fromBackup=backup0.url,
                             frontend="", standby=True)
        check_volume_last_backup(client, dr_vol_name, backup0.name)
        wait_for_backup_restore_completed(client, dr_vol_name, backup0.name)
    
        delete_backup(client, volume_name, backup0.name)
        assert backupstore_count_backup_block_files(client,
                                                    core_api,
                                                    volume_name) == 0
        check_volume_last_backup(client, dr_vol_name, "")
    
>       activate_standby_volume(client, dr_vol_name)

test_basic.py:1027: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
common.py:4065: in activate_standby_volume
    wait_for_volume_detached(client, volume_name)
common.py:1873: in wait_for_volume_detached
    return wait_for_volume_status(client, name,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

client = <longhorn.Client object at 0x7fc41ac46f70>
name = 'dr-longhorn-testvol-zkd5jc', key = 'state', value = 'detached'
retry_count = 150

    def wait_for_volume_status(client, name, key, value,
                               retry_count=RETRY_COUNTS):
        wait_for_volume_creation(client, name)
        for i in range(retry_count):
            volume = client.by_id_volume(name)
            if volume[key] == value:
                break
            time.sleep(RETRY_INTERVAL)
>       assert volume[key] == value, f" value={value}\n. \
                volume[key]={volume[key]}\n. volume={volume}"
E       AssertionError:  value=detached
E       .             volume[key]=attached
E       . volume={'accessMode': 'rwo', 'backendStoreDriver': 'v1', 'backingImage': '', 'backupCompressionMethod': 'lz4', 'backupStatus': [], 'cloneStatus': {'snapshot': '', 'sourceVolume': '', 'state': ''}, 'conditions': {'Restore': {'lastProbeTime': '', 'lastTransitionTime': '2024-02-22T00:28:53Z', 'message': '', 'reason': 'RestoreInProgress', 'status': 'True'}, 'Scheduled': {'lastProbeTime': '', 'lastTransitionTime': '2024-02-22T00:28:50Z', 'message': '', 'reason': '', 'status': 'True'}, 'TooManySnapshots': {'lastProbeTime': '', 'lastTransitionTime': '2024-02-22T00:28:50Z', 'message': '', 'reason': '', 'status': 'False'}, 'WaitForBackingImage': {'lastProbeTime': '', 'lastTransitionTime': '2024-02-22T00:28:50Z', 'message': '', 'reason': '', 'status': 'False'}}, 'controllers': [{'actualSize': '4194304', 'address': '10.42.1.10', 'currentImage': 'longhornio/longhorn-engine:v1.5.x-head', 'endpoint': '', 'engineImage': 'longhornio/longhorn-engine:v1.5.x-head', 'hostId': 'ip-10-0-2-11', 'instanceManagerName': 'instance-manager-d338d59ca657fd5aad047c08c4a93ff8', 'isExpanding': False, 'lastExpansionError': '', 'lastExpansionFailedAt': '', 'lastRestoredBackup': 'backup-e4228a2b25b74c65', 'name': 'dr-longhorn-testvol-zkd5jc-e-0', 'requestedBackupRestore': 'backup-e4228a2b25b74c65', 'running': True, 'size': '16777216', 'unmapMarkSnapChainRemovedEnabled': False}], 'created': '2024-02-22 00:28:50 +0000 UTC', 'currentImage': 'longhornio/longhorn-engine:v1.5.x-head', 'dataLocality': 'disabled', 'dataSource': '', 'disableFrontend': True, 'diskSelector': [], 'encrypted': False, 'engineImage': 'longhornio/longhorn-engine:v1.5.x-head', 'fromBackup': 's3://backupbucket@us-east-1/backupstore?backup=backup-e4228a2b25b74c65&volume=longhorn-testvol-zkd5jc', 'frontend': 'blockdev', 'kubernetesStatus': {'lastPVCRefAt': '', 'lastPodRefAt': '', 'namespace': '', 'pvName': '', 'pvStatus': '', 'pvcName': '', 'workloadsStatus': None}, 'lastAttachedBy': '', 'lastBackup': '', 'lastBackupAt': '', 'migratable': False, 'name': 'dr-longhorn-testvol-zkd5jc', 'nodeSelector': [], 'numberOfReplicas': 2, 'offlineReplicaRebuilding': 'disabled', 'offlineReplicaRebuildingRequired': False, 'purgeStatus': [{'error': '', 'isPurging': False, 'progress': 0, 'replica': 'dr-longhorn-testvol-zkd5jc-r-0af13e3c', 'state': ''}, {'error': '', 'isPurging': False, 'progress': 0, 'replica': 'dr-longhorn-testvol-zkd5jc-r-fe6dffa5', 'state': ''}], 'ready': False, 'rebuildStatus': [], 'recurringJobSelector': None, 'replicaAutoBalance': 'ignored', 'replicaSoftAntiAffinity': 'ignored', 'replicaZoneSoftAntiAffinity': 'ignored', 'replicas': [{'address': '10.42.1.10', 'backendStoreDriver': 'v1', 'currentImage': 'longhornio/longhorn-engine:v1.5.x-head', 'dataPath': '/var/lib/longhorn/replicas/dr-longhorn-testvol-zkd5jc-0c3e7d96', 'diskID': '8b064cfc-7e48-4cea-9e9a-a6cd86bd8f45', 'diskPath': '/var/lib/longhorn/', 'engineImage': 'longhornio/longhorn-engine:v1.5.x-head', 'failedAt': '', 'hostId': 'ip-10-0-2-11', 'instanceManagerName': 'instance-manager-d338d59ca657fd5aad047c08c4a93ff8', 'mode': 'RW', 'name': 'dr-longhorn-testvol-zkd5jc-r-0af13e3c', 'running': True}, {'address': '10.42.3.9', 'backendStoreDriver': 'v1', 'currentImage': 'longhornio/longhorn-engine:v1.5.x-head', 'dataPath': '/var/lib/longhorn/replicas/dr-longhorn-testvol-zkd5jc-5dddf1f5', 'diskID': 'c1da5a75-6f9e-48ac-a395-e16fbcfc218b', 'diskPath': '/var/lib/longhorn/', 'engineImage': 'longhornio/longhorn-engine:v1.5.x-head', 'failedAt': '', 'hostId': 'ip-10-0-2-27', 'instanceManagerName': 
'instance-manager-140f432c4b65ca9d48e227346bb38112', 'mode': 'RW', 'name': 'dr-longhorn-testvol-zkd5jc-r-fe6dffa5', 'running': True}], 'restoreInitiated': True, 'restoreRequired': True, 'restoreStatus': [{'backupURL': 's3://backupbucket@us-east-1/backupstore?backup=backup-e4228a2b25b74c65&volume=longhorn-testvol-zkd5jc', 'error': '', 'filename': 'volume-snap-2d6a297e-3d2a-40f1-82f8-b4be7c7bdd9a.img', 'isRestoring': False, 'lastRestored': 'backup-e4228a2b25b74c65', 'progress': 100, 'replica': 'dr-longhorn-testvol-zkd5jc-r-0af13e3c', 'state': 'complete'}, {'backupURL': 's3://backupbucket@us-east-1/backupstore?backup=backup-e4228a2b25b74c65&volume=longhorn-testvol-zkd5jc', 'error': '', 'filename': 'volume-snap-2d6a297e-3d2a-40f1-82f8-b4be7c7bdd9a.img', 'isRestoring': False, 'lastRestored': 'backup-e4228a2b25b74c65', 'progress': 100, 'replica': 'dr-longhorn-testvol-zkd5jc-r-fe6dffa5', 'state': 'complete'}], 'restoreVolumeRecurringJob': 'ignored', 'revisionCounterDisabled': False, 'robustness': 'healthy', 'shareEndpoint': '', 'shareState': '', 'size': '16777216', 'snapshotDataIntegrity': 'ignored', 'staleReplicaTimeout': 0, 'standby': False, 'state': 'attached', 'unmapMarkSnapChainRemoved': 'ignored', 'volumeAttachment': {'attachments': {'volume-restore-controller-dr-longhorn-testvol-zkd5jc': {'attachmentID': 'volume-restore-controller-dr-longhorn-testvol-zkd5jc', 'attachmentType': 'volume-restore-controller', 'conditions': [{'lastProbeTime': '', 'lastTransitionTime': '2024-02-22T00:28:53Z', 'message': '', 'reason': '', 'status': 'True'}], 'nodeID': 'ip-10-0-2-11', 'parameters': {'disableFrontend': 'true'}, 'satisfied': True}}, 'volume': 'dr-longhorn-testvol-zkd5jc'}}

common.py:1932: AssertionError

The volume remained stuck in Attached and Not Ready for workload indefinitely.
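A quick client-side check makes the mismatch visible (a diagnostic sketch; every field name below appears in the volume dump above):

v = client.by_id_volume('dr-longhorn-testvol-zkd5jc')
print(v['state'])       # 'attached' -- activate_standby_volume waits for 'detached'
print(v['ready'])       # False -> "Not Ready for workload"
print(v['lastBackup'])  # ''   -- cleared when backup(0) was deleted
print(v['controllers'][0]['lastRestoredBackup'])  # 'backup-e4228a2b25b74c65'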

And longhorn-manager kept logging this message over and over:

time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:28Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2
time="2024-02-22T00:45:29Z" level=info msg="Restore/DR volume needs to restore the latest backup , and the current restored backup is backup-5396071197464f82" func="controller.(*VolumeController).checkAndFinishVolumeRestore" file="volume_controller.go:3128" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-1-18 owner=ip-10-0-1-18 state=attached volume=test-2

To Reproduce

Run test case test_dr_volume_with_all_backup_blocks_deleted
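Distilled to the essential calls (all taken verbatim from the test above; create_backup, delete_backup, and activate_standby_volume are helpers from the suite's common module):

# Create one backup of a healthy attached volume ...
_, backup0, _, data0 = create_backup(client, volume_name, data0)

# ... create a standby (DR) volume restoring from it ...
client.create_volume(name=dr_vol_name, size=SIZE, numberOfReplicas=2,
                     fromBackup=backup0.url, frontend="", standby=True)

# ... delete the only backup, which clears the DR volume's lastBackup ...
delete_backup(client, volume_name, backup0.name)

# ... then activate: on v1.5.x-head the volume never detaches.
activate_standby_volume(client, dr_vol_name)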

Expected behavior

Activating the DR volume should still succeed after its latest backup (or all of its backups) has been deleted.

Support bundle for troubleshooting

supportbundle_0fd28365-9049-4348-8608-680e94cc2f43_2024-02-22T00-59-33Z.zip

Environment

  • Longhorn version: v1.5.x-head
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version:
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

@yangchiu yangchiu added kind/bug severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade)) reproduce/always 100% reproducible priority/0 Must be fixed in this release (managed by PO) area/volume-backup-restore Volume backup restore labels Feb 22, 2024
@yangchiu yangchiu added this to the v1.5.4 milestone Feb 22, 2024
@yangchiu yangchiu changed the title [BUG][v1.5.x] DR volume unable to be activiated if the latest backup's been deleted [BUG][v1.5.x] DR volume unable to be activated if the latest backup's been deleted Feb 22, 2024
@innobead
Member

Is this a regression or a day 1 issue?

@shuo-wu
Contributor

shuo-wu commented Feb 22, 2024

It's a regression as well. Fixing it.

@longhorn-io-github-bot

longhorn-io-github-bot commented Feb 22, 2024

Pre Ready-For-Testing Checklist

@yangchiu
Member Author

Verified passed on v1.5.x-head (longhorn-manager 49619e2) by running test case test_dr_volume_with_backup_and_backup_volume_deleted.

Test result: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6405/
