[BUG] test case test_engine_image_not_fully_deployed_perform_auto_upgrade_engine failed #7540

Closed
yangchiu opened this issue Jan 4, 2024 · 2 comments
Labels: invalid, kind/bug, priority/0 (Must be fixed in this release, managed by PO), reproduce/always (100% reproducible)
Milestone: v1.6.0

Comments

@yangchiu
Member

yangchiu commented Jan 4, 2024

Describe the bug (🐛 if you encounter this issue)

Test case test_engine_image_not_fully_deployed_perform_auto_upgrade_engine failed on master-head:

client = <longhorn.Client object at 0x7fe7bce32a30>
core_api = <kubernetes.client.api.core_v1_api.CoreV1Api object at 0x7fe7bcc91e20>

    def test_engine_image_not_fully_deployed_perform_auto_upgrade_engine(client, core_api): # NOQA
        """
        Test auto upgrade engine feature when engine image DaemonSet is
        not fully deployed
    
        Prerequisite:
        Prepare system for the test by calling the method
        prepare_engine_not_fully_deployed_evnironment to have
        tainted node and not fully deployed engine.
    
        1. Create 2 volumes vol-1 and vol-2 with 2 replicas
        2. Deploy a new engine image, new-ei
        3. Upgrade vol-1 and vol-2 to the new-ei
        4. Attach vol-2 to current-node
        5. Set `Concurrent Automatic Engine Upgrade Per Node Limit` setting to 3
        6. In a 2-min retry, verify that Longhorn upgrades the engine image of
           vol-1 and vol-2.
        """
        prepare_engine_not_fully_deployed_environment(client, core_api)
    
        volume1 = create_and_check_volume(client, "vol-1", num_of_replicas=2,
                                          size=str(3 * Gi))
    
        volume2 = create_and_check_volume(client, "vol-2", num_of_replicas=2,
                                          size=str(3 * Gi))
    
        default_img = common.get_default_engine_image(client)
        # engine reference =
        # (1 volume + 1 engine + number of replicas) * volume count
        wait_for_engine_image_ref_count(client, default_img.name, 8)
    
        engine_upgrade_image, new_img = \
            prepare_upgrade_image_not_fully_deployed_environment(client)
    
        volume1.engineUpgrade(image=engine_upgrade_image)
        volume2.engineUpgrade(image=engine_upgrade_image)
        volume1 = wait_for_volume_current_image(client, volume1.name,
                                                engine_upgrade_image)
        volume2 = wait_for_volume_current_image(client, volume2.name,
                                                engine_upgrade_image)
    
        default_img = common.get_default_engine_image(client)
        wait_for_engine_image_ref_count(client, default_img.name, 0)
    
        volume2.attach(hostId=get_self_host_id())
>       volume2 = wait_for_volume_healthy(client, volume2.name)

test_ha.py:2922: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
common.py:1927: in wait_for_volume_healthy
    wait_for_volume_status(client, name,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

client = <longhorn.Client object at 0x7fe7bce32a30>, name = 'vol-2'
key = 'state', value = 'attached', retry_count = 150

    def wait_for_volume_status(client, name, key, value,
                               retry_count=RETRY_COUNTS):
        wait_for_volume_creation(client, name)
        for i in range(retry_count):
            volume = client.by_id_volume(name)
            if volume[key] == value:
                break
            time.sleep(RETRY_INTERVAL)
>       assert volume[key] == value, f" value={value}\n. \
                volume[key]={volume[key]}\n. volume={volume}"
E       AssertionError:  value=attached
E       .             volume[key]=detached
E       . volume={'accessMode': 'rwo', 'backingImage': '', 'backupCompressionMethod': 'lz4', 'backupStatus': [], 'cloneStatus': {'snapshot': '', 'sourceVolume': '', 'state': ''}, 'conditions': {'Restore': {'lastProbeTime': '', 'lastTransitionTime': '2024-01-04T06:09:59Z', 'message': '', 'reason': '', 'status': 'False'}, 'Scheduled': {'lastProbeTime': '', 'lastTransitionTime': '2024-01-04T06:09:58Z', 'message': '', 'reason': '', 'status': 'True'}, 'TooManySnapshots': {'lastProbeTime': '', 'lastTransitionTime': '2024-01-04T06:09:58Z', 'message': '', 'reason': '', 'status': 'False'}, 'WaitForBackingImage': {'lastProbeTime': '', 'lastTransitionTime': '2024-01-04T06:09:58Z', 'message': '', 'reason': '', 'status': 'False'}}, 'controllers': [{'actualSize': '0', 'address': '', 'currentImage': '', 'endpoint': '', 'hostId': '', 'image': 'longhornio/longhorn-test:upgrade-test.9-8.5-4.1-1', 'instanceManagerName': '', 'isExpanding': False, 'lastExpansionError': '', 'lastExpansionFailedAt': '', 'lastRestoredBackup': '', 'name': 'vol-2-e-0', 'requestedBackupRestore': '', 'running': False, 'size': '0', 'unmapMarkSnapChainRemovedEnabled': False}], 'created': '2024-01-04 06:09:58 +0000 UTC', 'currentImage': 'longhornio/longhorn-test:upgrade-test.9-8.5-4.1-1', 'dataEngine': 'v1', 'dataLocality': 'disabled', 'dataSource': '', 'disableFrontend': False, 'diskSelector': [], 'encrypted': False, 'fromBackup': '', 'frontend': 'blockdev', 'image': 'longhornio/longhorn-test:upgrade-test.9-8.5-4.1-1', 'kubernetesStatus': {'lastPVCRefAt': '', 'lastPodRefAt': '', 'namespace': '', 'pvName': '', 'pvStatus': '', 'pvcName': '', 'workloadsStatus': None}, 'lastAttachedBy': '', 'lastBackup': '', 'lastBackupAt': '', 'migratable': False, 'name': 'vol-2', 'nodeSelector': [], 'numberOfReplicas': 2, 'offlineReplicaRebuilding': 'disabled', 'offlineReplicaRebuildingRequired': False, 'purgeStatus': None, 'ready': False, 'rebuildStatus': [], 'recurringJobSelector': None, 'replicaAutoBalance': 'ignored', 
'replicaDiskSoftAntiAffinity': 'ignored', 'replicaSoftAntiAffinity': 'ignored', 'replicaZoneSoftAntiAffinity': 'ignored', 'replicas': [{'address': '', 'currentImage': '', 'dataEngine': 'v1', 'dataPath': '/var/lib/longhorn/replicas/vol-2-dbd14aae', 'diskID': 'e4daa950-8fdd-40f5-a683-3c1d2c550732', 'diskPath': '/var/lib/longhorn/', 'failedAt': '2024-01-04T06:10:13Z', 'hostId': 'ip-10-0-2-192', 'image': 'longhornio/longhorn-test:upgrade-test.9-8.5-4.1-1', 'instanceManagerName': '', 'mode': '', 'name': 'vol-2-r-e0d2f88a', 'running': False}, {'address': '', 'currentImage': '', 'dataEngine': 'v1', 'dataPath': '/var/lib/longhorn/replicas/vol-2-bad05058', 'diskID': 'aa8e9403-a5e3-43b2-9011-a1cd791f86af', 'diskPath': '/var/lib/longhorn/', 'failedAt': '2024-01-04T06:10:13Z', 'hostId': 'ip-10-0-2-135', 'image': 'longhornio/longhorn-test:upgrade-test.9-8.5-4.1-1', 'instanceManagerName': '', 'mode': '', 'name': 'vol-2-r-fd71b7ea', 'running': False}], 'restoreInitiated': False, 'restoreRequired': False, 'restoreStatus': [], 'restoreVolumeRecurringJob': 'ignored', 'revisionCounterDisabled': False, 'robustness': 'faulted', 'shareEndpoint': '', 'shareState': '', 'size': '3221225472', 'snapshotDataIntegrity': 'ignored', 'staleReplicaTimeout': 0, 'standby': False, 'state': 'detached', 'unmapMarkSnapChainRemoved': 'ignored', 'volumeAttachment': {'attachments': {'': {'attachmentID': '', 'attachmentType': 'longhorn-api', 'conditions': [{'lastProbeTime': '', 'lastTransitionTime': '2024-01-04T06:10:12Z', 'message': '', 'reason': '', 'status': 'False'}], 'nodeID': 'ip-10-0-2-135', 'parameters': {'disableFrontend': 'false', 'lastAttachedBy': ''}, 'satisfied': False}}, 'volume': 'vol-2'}}

common.py:1973: AssertionError
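
The retry loop in `wait_for_volume_status` above can be sketched in isolation as follows. This is a minimal illustration, not the code in `common.py`: `FakeClient`, `wait_for_status`, and the retry/interval values are hypothetical stand-ins, and only `by_id_volume` mirrors a call shown in the traceback. It shows how the loop ends up reporting `detached`: the retries run out without the volume ever reaching the expected state.

```python
import time


class FakeClient:
    """Hypothetical stand-in for the Longhorn client; returns canned states."""

    def __init__(self, states):
        self._states = list(states)

    def by_id_volume(self, name):
        # Consume canned states in order; repeat the last one once exhausted.
        if len(self._states) > 1:
            state = self._states.pop(0)
        else:
            state = self._states[0]
        return {"name": name, "state": state}


def wait_for_status(client, name, key, value, retry_count=5, interval=0.0):
    """Poll until volume[key] == value; return (matched, last_seen_volume)."""
    volume = None
    for _ in range(retry_count):
        volume = client.by_id_volume(name)
        if volume[key] == value:
            return True, volume
        time.sleep(interval)
    return False, volume


# A volume that eventually attaches.
ok, vol = wait_for_status(
    FakeClient(["detached", "attaching", "attached"]),
    "vol-2", "state", "attached")

# A volume that never leaves "detached", as in the failure above.
stuck, last = wait_for_status(
    FakeClient(["detached"]), "vol-2", "state", "attached")
```

Returning the last observed volume alongside the result keeps the final state available for the failure message, which is what the assertion in the traceback prints.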

https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5835/

The volume became faulted and was unable to recover:
faulted

More investigation is needed to determine which operation causes the problem.
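
As a starting point for that investigation, the relevant fields can be pulled out of the volume dump in the assertion message. The sketch below is illustrative only: `summarize_volume` is a hypothetical helper, not part of the test suite, and `vol2` is a trimmed copy of the dictionary printed above. It reads only keys that appear in that output.

```python
def summarize_volume(volume):
    """Collect the fields that describe why a volume is not attached."""
    failed = [
        (r["name"], r["hostId"], r["failedAt"])
        for r in volume.get("replicas", [])
        if r.get("failedAt")
    ]
    return {
        "state": volume["state"],
        "robustness": volume["robustness"],
        "image": volume["image"],
        "failed_replicas": failed,
    }


# Trimmed copy of the vol-2 dump from the assertion message.
vol2 = {
    "name": "vol-2",
    "state": "detached",
    "robustness": "faulted",
    "image": "longhornio/longhorn-test:upgrade-test.9-8.5-4.1-1",
    "replicas": [
        {"name": "vol-2-r-e0d2f88a", "hostId": "ip-10-0-2-192",
         "failedAt": "2024-01-04T06:10:13Z", "running": False},
        {"name": "vol-2-r-fd71b7ea", "hostId": "ip-10-0-2-135",
         "failedAt": "2024-01-04T06:10:13Z", "running": False},
    ],
}

summary = summarize_volume(vol2)
for name, host, failed_at in summary["failed_replicas"]:
    print(f"{name} on {host} failed at {failed_at}")
```

Both replicas report the same `failedAt` of 06:10:13Z, one second after the `lastTransitionTime` of 06:10:12Z recorded on the attachment. That may suggest they failed during the attach attempt rather than earlier in the test.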

To Reproduce

Run test case test_engine_image_not_fully_deployed_perform_auto_upgrade_engine

Expected behavior

Support bundle for troubleshooting

supportbundle_11b3004a-6254-4515-8a11-22655c1234b0_2024-01-04T07-38-51Z.zip

Environment

  • Longhorn version: master-head
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of management node in the cluster:
    • Number of worker node in the cluster:
  • Node config
    • OS type and version:
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:
  • Impacted Longhorn resources:
    • Volume names:

Additional context

yangchiu added the kind/bug, reproduce/always, and priority/1 labels, and added then removed require/qa-review-coverage and require/backport, on Jan 4, 2024
yangchiu added this to the v1.6.0 milestone on Jan 4, 2024
innobead added the priority/0 label and removed priority/1 on Jan 4, 2024
@c3y1huang
Contributor

c3y1huang commented Jan 5, 2024

During my investigation, I was unable to manually attach volumes that use longhornio/longhorn-test:upgrade-test.9-8.5-4.1-1, even outside the test scenario.

I then built an upgrade test image locally.
Here are the relevant code changes:

diff --git a/manager/integration/tests/test_ha.py b/manager/integration/tests/test_ha.py
index 099fcb95e..205c86ea8 100644
--- a/manager/integration/tests/test_ha.py
+++ b/manager/integration/tests/test_ha.py
@@ -2547,6 +2547,7 @@ def prepare_upgrade_image_not_fully_deployed_environment(client): # NOQA
     engine_upgrade_image = common.get_upgrade_test_image(cli_v, cli_minv,
                                                          ctl_v, ctl_minv,
                                                          data_v, data_minv)
+    engine_upgrade_image = "c3y1huang/research:upgrade-test.9-8.5-4.1-1"
 
     new_img = client.create_engine_image(image=engine_upgrade_image)
     wait_for_deployed_engine_image_count(client, new_img.name, 2)
diff --git a/manager/test_containers/upgrade/generate_live_upgrade_image.sh b/manager/test_containers/upgrade/generate_live_upgrade_image.sh
index ac0e1c119..c6592a0ea 100755
--- a/manager/test_containers/upgrade/generate_live_upgrade_image.sh
+++ b/manager/test_containers/upgrade/generate_live_upgrade_image.sh
@@ -1,7 +1,7 @@
 #!/bin/bash
 
 # make sure IMAGE wasn't used by any releases
-IMAGE="longhornio/longhorn-engine:live-upgrade-5-3-1"
+IMAGE="c3y1huang/research:2055-lh-ei"
 
 version=`docker run $IMAGE longhorn version --client-only`
 echo Image version output: $version
@@ -13,7 +13,7 @@ ControllerAPIMinVersion=`echo $version|jq -r ".clientVersion.controllerAPIMinVer
 DataFormatVersion=`echo $version|jq -r ".clientVersion.dataFormatVersion"`
 DataFormatMinVersion=`echo $version|jq -r ".clientVersion.dataFormatMinVersion"`
 
-test_image="longhornio/longhorn-test:upgrade-test.${CLIAPIVersion}-${CLIAPIMinVersion}"\
+test_image="c3y1huang/research:upgrade-test.${CLIAPIVersion}-${CLIAPIMinVersion}"\
 ".${ControllerAPIVersion}-${ControllerAPIMinVersion}"\
 ".${DataFormatVersion}-${DataFormatMinVersion}"

After running the test again, it passed successfully.

> ./run.sh -xs -k test_engine_image_not_fully_deployed_perform_auto_upgrade_engine
================================================ test session starts =================================================
platform linux -- Python 3.9.17, pytest-5.3.1, py-1.11.0, pluggy-0.13.1 -- /usr/bin/python3.9
cachedir: .pytest_cache
rootdir: /integration, inifile: pytest.ini
plugins: order-1.0.1, repeat-0.9.1
collected 368 items / 367 deselected / 1 selected                                                                    

test_ha.py::test_engine_image_not_fully_deployed_perform_auto_upgrade_engine PASSED

=================================== 1 passed, 367 deselected in 104.59s (0:01:44) ====================================

@chriscchien, @yangchiu could you help check whether the longhornio/longhorn-test:upgrade-test image was built from the correct version?
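
One way to approach that check: the tag built by `generate_live_upgrade_image.sh` encodes three version pairs, so a tag such as `upgrade-test.9-8.5-4.1-1` can be decoded back into its parts. The following is a minimal sketch under that assumption. `parse_upgrade_tag` is a hypothetical helper, not part of the Longhorn test suite, and the field names are chosen to match the shell variables in the diff above.

```python
import re
from typing import Dict

# Matches "upgrade-test.<cli>-<cliMin>.<ctl>-<ctlMin>.<data>-<dataMin>".
TAG_PATTERN = re.compile(
    r"upgrade-test\.(\d+)-(\d+)\.(\d+)-(\d+)\.(\d+)-(\d+)$"
)

FIELDS = (
    "CLIAPIVersion", "CLIAPIMinVersion",
    "ControllerAPIVersion", "ControllerAPIMinVersion",
    "DataFormatVersion", "DataFormatMinVersion",
)


def parse_upgrade_tag(image: str) -> Dict[str, int]:
    """Decode the six version numbers from an upgrade-test image reference."""
    match = TAG_PATTERN.search(image)
    if match is None:
        raise ValueError(f"not an upgrade-test image: {image!r}")
    return dict(zip(FIELDS, map(int, match.groups())))


versions = parse_upgrade_tag(
    "longhornio/longhorn-test:upgrade-test.9-8.5-4.1-1")
print(versions)
```

The decoded values can then be compared with the output of `longhorn version --client-only` for the engine image the test image is supposed to be based on.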

@c3y1huang
Contributor

This has been resolved in #7548 (comment).

Test result: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5869/
