[BUG] test case test_engine_image_not_fully_deployed_perform_auto_upgrade_engine failed #7540

yangchiu · 2024-01-04T07:40:00Z

Describe the bug (🐛 if you encounter this issue)

Test case test_engine_image_not_fully_deployed_perform_auto_upgrade_engine failed on master-head:

client = <longhorn.Client object at 0x7fe7bce32a30>
core_api = <kubernetes.client.api.core_v1_api.CoreV1Api object at 0x7fe7bcc91e20>

    def test_engine_image_not_fully_deployed_perform_auto_upgrade_engine(client, core_api): # NOQA
        """
        Test auto upgrade engine feature when engine image DaemonSet is
        not fully deployed
    
        Prerequisite:
        Prepare system for the test by calling the method
        prepare_engine_not_fully_deployed_evnironment to have
        tainted node and not fully deployed engine.
    
        1. Create 2 volumes vol-1 and vol-2 with 2 replicas
        2. Deploy a new engine image, new-ei
        3. Upgrade vol-1 and vol-2 to the new-ei
        4. Attach vol-2 to current-node
        5. Set `Concurrent Automatic Engine Upgrade Per Node Limit` setting to 3
        6. In a 2-min retry, verify that Longhorn upgrades the engine image of
           vol-1 and vol-2.
        """
        prepare_engine_not_fully_deployed_environment(client, core_api)
    
        volume1 = create_and_check_volume(client, "vol-1", num_of_replicas=2,
                                          size=str(3 * Gi))
    
        volume2 = create_and_check_volume(client, "vol-2", num_of_replicas=2,
                                          size=str(3 * Gi))
    
        default_img = common.get_default_engine_image(client)
        # engine reference =
        # (1 volume + 1 engine + number of replicas) * volume count
        wait_for_engine_image_ref_count(client, default_img.name, 8)
    
        engine_upgrade_image, new_img = \
            prepare_upgrade_image_not_fully_deployed_environment(client)
    
        volume1.engineUpgrade(image=engine_upgrade_image)
        volume2.engineUpgrade(image=engine_upgrade_image)
        volume1 = wait_for_volume_current_image(client, volume1.name,
                                                engine_upgrade_image)
        volume2 = wait_for_volume_current_image(client, volume2.name,
                                                engine_upgrade_image)
    
        default_img = common.get_default_engine_image(client)
        wait_for_engine_image_ref_count(client, default_img.name, 0)
    
        volume2.attach(hostId=get_self_host_id())
>       volume2 = wait_for_volume_healthy(client, volume2.name)

test_ha.py:2922: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
common.py:1927: in wait_for_volume_healthy
    wait_for_volume_status(client, name,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

client = <longhorn.Client object at 0x7fe7bce32a30>, name = 'vol-2'
key = 'state', value = 'attached', retry_count = 150

    def wait_for_volume_status(client, name, key, value,
                               retry_count=RETRY_COUNTS):
        wait_for_volume_creation(client, name)
        for i in range(retry_count):
            volume = client.by_id_volume(name)
            if volume[key] == value:
                break
            time.sleep(RETRY_INTERVAL)
>       assert volume[key] == value, f" value={value}\n. \
                volume[key]={volume[key]}\n. volume={volume}"
E       AssertionError:  value=attached
E       .             volume[key]=detached
E       . volume={'accessMode': 'rwo', 'backingImage': '', 'backupCompressionMethod': 'lz4', 'backupStatus': [], 'cloneStatus': {'snapshot': '', 'sourceVolume': '', 'state': ''}, 'conditions': {'Restore': {'lastProbeTime': '', 'lastTransitionTime': '2024-01-04T06:09:59Z', 'message': '', 'reason': '', 'status': 'False'}, 'Scheduled': {'lastProbeTime': '', 'lastTransitionTime': '2024-01-04T06:09:58Z', 'message': '', 'reason': '', 'status': 'True'}, 'TooManySnapshots': {'lastProbeTime': '', 'lastTransitionTime': '2024-01-04T06:09:58Z', 'message': '', 'reason': '', 'status': 'False'}, 'WaitForBackingImage': {'lastProbeTime': '', 'lastTransitionTime': '2024-01-04T06:09:58Z', 'message': '', 'reason': '', 'status': 'False'}}, 'controllers': [{'actualSize': '0', 'address': '', 'currentImage': '', 'endpoint': '', 'hostId': '', 'image': 'longhornio/longhorn-test:upgrade-test.9-8.5-4.1-1', 'instanceManagerName': '', 'isExpanding': False, 'lastExpansionError': '', 'lastExpansionFailedAt': '', 'lastRestoredBackup': '', 'name': 'vol-2-e-0', 'requestedBackupRestore': '', 'running': False, 'size': '0', 'unmapMarkSnapChainRemovedEnabled': False}], 'created': '2024-01-04 06:09:58 +0000 UTC', 'currentImage': 'longhornio/longhorn-test:upgrade-test.9-8.5-4.1-1', 'dataEngine': 'v1', 'dataLocality': 'disabled', 'dataSource': '', 'disableFrontend': False, 'diskSelector': [], 'encrypted': False, 'fromBackup': '', 'frontend': 'blockdev', 'image': 'longhornio/longhorn-test:upgrade-test.9-8.5-4.1-1', 'kubernetesStatus': {'lastPVCRefAt': '', 'lastPodRefAt': '', 'namespace': '', 'pvName': '', 'pvStatus': '', 'pvcName': '', 'workloadsStatus': None}, 'lastAttachedBy': '', 'lastBackup': '', 'lastBackupAt': '', 'migratable': False, 'name': 'vol-2', 'nodeSelector': [], 'numberOfReplicas': 2, 'offlineReplicaRebuilding': 'disabled', 'offlineReplicaRebuildingRequired': False, 'purgeStatus': None, 'ready': False, 'rebuildStatus': [], 'recurringJobSelector': None, 'replicaAutoBalance': 'ignored', 
'replicaDiskSoftAntiAffinity': 'ignored', 'replicaSoftAntiAffinity': 'ignored', 'replicaZoneSoftAntiAffinity': 'ignored', 'replicas': [{'address': '', 'currentImage': '', 'dataEngine': 'v1', 'dataPath': '/var/lib/longhorn/replicas/vol-2-dbd14aae', 'diskID': 'e4daa950-8fdd-40f5-a683-3c1d2c550732', 'diskPath': '/var/lib/longhorn/', 'failedAt': '2024-01-04T06:10:13Z', 'hostId': 'ip-10-0-2-192', 'image': 'longhornio/longhorn-test:upgrade-test.9-8.5-4.1-1', 'instanceManagerName': '', 'mode': '', 'name': 'vol-2-r-e0d2f88a', 'running': False}, {'address': '', 'currentImage': '', 'dataEngine': 'v1', 'dataPath': '/var/lib/longhorn/replicas/vol-2-bad05058', 'diskID': 'aa8e9403-a5e3-43b2-9011-a1cd791f86af', 'diskPath': '/var/lib/longhorn/', 'failedAt': '2024-01-04T06:10:13Z', 'hostId': 'ip-10-0-2-135', 'image': 'longhornio/longhorn-test:upgrade-test.9-8.5-4.1-1', 'instanceManagerName': '', 'mode': '', 'name': 'vol-2-r-fd71b7ea', 'running': False}], 'restoreInitiated': False, 'restoreRequired': False, 'restoreStatus': [], 'restoreVolumeRecurringJob': 'ignored', 'revisionCounterDisabled': False, 'robustness': 'faulted', 'shareEndpoint': '', 'shareState': '', 'size': '3221225472', 'snapshotDataIntegrity': 'ignored', 'staleReplicaTimeout': 0, 'standby': False, 'state': 'detached', 'unmapMarkSnapChainRemoved': 'ignored', 'volumeAttachment': {'attachments': {'': {'attachmentID': '', 'attachmentType': 'longhorn-api', 'conditions': [{'lastProbeTime': '', 'lastTransitionTime': '2024-01-04T06:10:12Z', 'message': '', 'reason': '', 'status': 'False'}], 'nodeID': 'ip-10-0-2-135', 'parameters': {'disableFrontend': 'false', 'lastAttachedBy': ''}, 'satisfied': False}}, 'volume': 'vol-2'}}

common.py:1973: AssertionError

https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5835/
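
For reference, the expected count of 8 passed to `wait_for_engine_image_ref_count` in the test above follows from the formula in the test's own comment. A minimal sketch of that arithmetic (the function name is hypothetical and not part of the test suite):

```python
def expected_engine_image_ref_count(volume_count: int,
                                    replicas_per_volume: int) -> int:
    """Per the test comment, each volume contributes one reference for the
    volume itself, one for its engine, and one for each replica."""
    return (1 + 1 + replicas_per_volume) * volume_count


# vol-1 and vol-2, each created with num_of_replicas=2
ref_count = expected_engine_image_ref_count(volume_count=2,
                                            replicas_per_volume=2)
print(ref_count)  # 8
```

Once both volumes have been upgraded to the new image, the default image should have no remaining references, which is why the test then waits for a count of 0.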

The volume became faulted and was unable to recover.

Further investigation is needed to determine which operation causes the problem.
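
The fields that show the faulted state can be pulled out of the volume dump with a short helper. This is only a sketch: `summarize_volume` is a hypothetical name, and the sample dict is a trimmed copy of the dump in the assertion message, not a live API response.

```python
def summarize_volume(volume: dict) -> dict:
    """Collect the fields most relevant to an attach failure."""
    return {
        "state": volume.get("state"),
        "robustness": volume.get("robustness"),
        # Map replica name -> failedAt for every replica marked as failed.
        "failed_replicas": {
            r["name"]: r["failedAt"]
            for r in volume.get("replicas", [])
            if r.get("failedAt")
        },
    }


# Trimmed copy of the vol-2 dump from the assertion message.
vol2 = {
    "name": "vol-2",
    "state": "detached",
    "robustness": "faulted",
    "replicas": [
        {"name": "vol-2-r-e0d2f88a",
         "failedAt": "2024-01-04T06:10:13Z", "running": False},
        {"name": "vol-2-r-fd71b7ea",
         "failedAt": "2024-01-04T06:10:13Z", "running": False},
    ],
}

summary = summarize_volume(vol2)
print(summary)
```

In the dump, both replicas carry the same `failedAt` timestamp (06:10:13Z), one second after the attachment condition's `lastTransitionTime` (06:10:12Z). That timing suggests, but does not prove, that the replicas failed as soon as the attach was attempted.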

To Reproduce

Run test case test_engine_image_not_fully_deployed_perform_auto_upgrade_engine

Expected behavior

Support bundle for troubleshooting

supportbundle_11b3004a-6254-4515-8a11-22655c1234b0_2024-01-04T07-38-51Z.zip

Environment

Longhorn version: master-head
Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
- Number of management node in the cluster:
- Number of worker node in the cluster:
Node config
- OS type and version:
- Kernel version:
- CPU per node:
- Memory per node:
- Disk type(e.g. SSD/NVMe/HDD):
- Network bandwidth between the nodes:
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
Number of Longhorn volumes in the cluster:
Impacted Longhorn resources:
- Volume names:

Additional context


c3y1huang · 2024-01-05T06:23:09Z

During my investigation, I was unable to manually attach volumes using longhornio/longhorn-test:upgrade-test.9-8.5-4.1-1, even outside the test scenario.

I then built an upgrade test image locally.
Here are the relevant code changes:

diff --git a/manager/integration/tests/test_ha.py b/manager/integration/tests/test_ha.py
index 099fcb95e..205c86ea8 100644
--- a/manager/integration/tests/test_ha.py
+++ b/manager/integration/tests/test_ha.py
@@ -2547,6 +2547,7 @@ def prepare_upgrade_image_not_fully_deployed_environment(client): # NOQA
     engine_upgrade_image = common.get_upgrade_test_image(cli_v, cli_minv,
                                                          ctl_v, ctl_minv,
                                                          data_v, data_minv)
+    engine_upgrade_image = "c3y1huang/research:upgrade-test.9-8.5-4.1-1"
 
     new_img = client.create_engine_image(image=engine_upgrade_image)
     wait_for_deployed_engine_image_count(client, new_img.name, 2)
diff --git a/manager/test_containers/upgrade/generate_live_upgrade_image.sh b/manager/test_containers/upgrade/generate_live_upgrade_image.sh
index ac0e1c119..c6592a0ea 100755
--- a/manager/test_containers/upgrade/generate_live_upgrade_image.sh
+++ b/manager/test_containers/upgrade/generate_live_upgrade_image.sh
@@ -1,7 +1,7 @@
 #!/bin/bash
 
 # make sure IMAGE wasn't used by any releases
-IMAGE="longhornio/longhorn-engine:live-upgrade-5-3-1"
+IMAGE="c3y1huang/research:2055-lh-ei"
 
 version=`docker run $IMAGE longhorn version --client-only`
 echo Image version output: $version
@@ -13,7 +13,7 @@ ControllerAPIMinVersion=`echo $version|jq -r ".clientVersion.controllerAPIMinVer
 DataFormatVersion=`echo $version|jq -r ".clientVersion.dataFormatVersion"`
 DataFormatMinVersion=`echo $version|jq -r ".clientVersion.dataFormatMinVersion"`
 
-test_image="longhornio/longhorn-test:upgrade-test.${CLIAPIVersion}-${CLIAPIMinVersion}"\
+test_image="c3y1huang/research:upgrade-test.${CLIAPIVersion}-${CLIAPIMinVersion}"\
 ".${ControllerAPIVersion}-${ControllerAPIMinVersion}"\
 ".${DataFormatVersion}-${DataFormatMinVersion}"

After I reran the test with the locally built image, it passed.

> ./run.sh -xs -k test_engine_image_not_fully_deployed_perform_auto_upgrade_engine
================================================ test session starts =================================================
platform linux -- Python 3.9.17, pytest-5.3.1, py-1.11.0, pluggy-0.13.1 -- /usr/bin/python3.9
cachedir: .pytest_cache
rootdir: /integration, inifile: pytest.ini
plugins: order-1.0.1, repeat-0.9.1
collected 368 items / 367 deselected / 1 selected                                                                    

test_ha.py::test_engine_image_not_fully_deployed_perform_auto_upgrade_engine PASSED

=================================== 1 passed, 367 deselected in 104.59s (0:01:44) ====================================
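
The tag used throughout this thread, `upgrade-test.9-8.5-4.1-1`, encodes six version numbers in the order that `generate_live_upgrade_image.sh` concatenates them. A minimal sketch of decoding it, assuming every field is a non-negative integer (the `parse_upgrade_test_tag` helper and the `UpgradeTestVersions` type are hypothetical, for illustration only):

```python
import re
from typing import NamedTuple


class UpgradeTestVersions(NamedTuple):
    cli_api: int
    cli_api_min: int
    controller_api: int
    controller_api_min: int
    data_format: int
    data_format_min: int


# Layout taken from generate_live_upgrade_image.sh:
# upgrade-test.<cli>-<cli min>.<controller>-<controller min>.<data>-<data min>
_TAG_RE = re.compile(r"^upgrade-test\.(\d+)-(\d+)\.(\d+)-(\d+)\.(\d+)-(\d+)$")


def parse_upgrade_test_tag(image: str) -> UpgradeTestVersions:
    """Split 'repo/name:tag' and decode the six version fields in the tag."""
    _, sep, tag = image.rpartition(":")
    match = _TAG_RE.match(tag) if sep else None
    if match is None:
        raise ValueError(f"not an upgrade-test image: {image!r}")
    return UpgradeTestVersions(*(int(g) for g in match.groups()))


versions = parse_upgrade_test_tag(
    "longhornio/longhorn-test:upgrade-test.9-8.5-4.1-1")
print(versions)
```

The `longhornio/longhorn-test` and `c3y1huang/research` images in this thread share the same tag, so the tag alone cannot distinguish them.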

~~@chriscchien , @yangchiu could you help to check if the longhornio/longhorn-test:upgrade-test image is built on the correct version?~~

c3y1huang · 2024-01-05T06:40:39Z

This has been resolved in #7548 (comment).

Test result: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5869/

yangchiu added this to the v1.6.0 milestone Jan 4, 2024

innobead added the priority/0 label (Must be fixed in this release, managed by PO) and removed the priority/1 label (Highly recommended to fix in this release, managed by PO) Jan 4, 2024

innobead assigned c3y1huang Jan 4, 2024

roger-ryao mentioned this issue Jan 4, 2024 in: [BUG] volume engine failed to live upgrade #7548 (Closed)

c3y1huang closed this as completed Jan 5, 2024

c3y1huang added the invalid label Jan 5, 2024
