[BUG] Test case test_volume_reattach_after_engine_sigkill failed #6751

Closed

roger-ryao opened this issue Sep 22, 2023 · 2 comments

Assignee: c3y1huang
Labels: duplicated, flaky-test, kind/test (Request for adding test), reproduce/rare (< 50% reproducible)
Milestone: v1.6.0

roger-ryao commented Sep 22, 2023

Describe the bug (🐛 if you encounter this issue)

In Longhorn master-head, the test case test_volume_reattach_after_engine_sigkill failed because of an error in wait_pod_for_remount_request, and it appears to be the same issue that occurs after crash_engine_process_with_sigkill.

After a discussion with @ChanYiLin, he will examine the support bundle to determine whether the root cause of this failure is similar to that of #6699. If not, I will attempt to modify the test case to add log printing to the console output (see the sketch after the traceback below).

        vol_name, pod_name, md5sum = \
            common.prepare_statefulset_with_data_in_mb(
                client, core_api, statefulset, sts_name, storage_class)

        crash_engine_process_with_sigkill(client, core_api, vol_name)

>       wait_pod_for_remount_request(client, core_api, vol_name, pod_name, md5sum)

test_ha.py:1977: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test_ha.py:617: in wait_pod_for_remount_request
    common.wait_for_pod_phase(core_api, pod_name, pod_phase="Pending")
common.py:4912: in wait_for_pod_phase
    pod = core_api.read_namespaced_pod(name=pod_name,
/usr/local/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py:23483: in read_namespaced_pod
    return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs)  # noqa: E501
/usr/local/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py:23570: in read_namespaced_pod_with_http_info
    return self.api_client.call_api(
/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py:348: in call_api
    return self.__call_api(resource_path, method,
/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py:180: in __call_api
    response_data = self.request(
/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py:373: in request
    return self.rest_client.GET(url,
/usr/local/lib/python3.9/site-packages/kubernetes/client/rest.py:241: in GET
    return self.request("GET", url,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <kubernetes.client.rest.RESTClientObject object at 0x7fe2d40265b0>
method = 'GET'
url = 'https://10.43.0.1:443/api/v1/namespaces/default/pods/longhorn-teststs-1k448e-0'
query_params = []
headers = {'Accept': 'application/json', 'Content-Type': 'application/json', 'User-Agent': 'OpenAPI-Generator/25.3.0/python', 'a...d5kNQaIh6vN0Oe8e8k7Tc5Akm0PXqo_i-V7-Z24GmAN0owxHW3s6CLQFsGhc2UxoMJUs9CxlUGXfVdV3NngQZrjgHSbz4tRxgWm5PNf8hw6DhcIJeDEGA'}
body = None, post_params = {}, _preload_content = True, _request_timeout = None
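
For the log printing mentioned above, a minimal sketch of what could be added around the failing step (illustrative only; it reuses the client, core_api, and name variables already in scope in the test, and the print format is an assumption):

try:
    wait_pod_for_remount_request(client, core_api, vol_name, pod_name, md5sum)
except Exception:
    # Dump the volume state and the pods in the namespace before re-raising,
    # so the console output shows what existed at the moment of failure.
    volume = client.by_id_volume(vol_name)
    print("volume {}: state={}, robustness={}".format(
        vol_name, volume.state, volume.robustness))
    pods = core_api.list_namespaced_pod(namespace="default")
    print("pods in default namespace: {}".format(
        [(p.metadata.name, p.status.phase) for p in pods.items]))
    raise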

To Reproduce

Execute test case test_ha.py::test_volume_reattach_after_engine_sigkill
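
One way to run only this test through pytest's Python API, assuming the longhorn-tests integration environment is already set up (the path and options are illustrative; CI drives this differently):

import pytest

# Run only the affected test with verbose output; assumes the working
# directory is the longhorn-tests manager/integration/tests folder.
pytest.main(["-v", "test_ha.py::test_volume_reattach_after_engine_sigkill"])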

Expected behavior

Consistent test results across all branches.

Support bundle for troubleshooting

longhorn-tests-sles-amd64-664-bundle.zip

Environment

  • Longhorn version: master-head
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.28.1+k3s1
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: SLES 15-sp5
    • Kernel version:
    • CPU per node: 4 cores
    • Memory per node: 16GB
    • Disk type(e.g. SSD/NVMe/HDD): SSD
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster: 1
  • Impacted Longhorn resources:
    • Volume names:

Additional context

@roger-ryao added the kind/bug, reproduce/rare (< 50% reproducible), require/qa-review-coverage (Require QA to review coverage), and require/backport (Require backport; only used when the specific versions to backport have not been defined) labels on Sep 22, 2023
@roger-ryao added this to the v1.6.0 milestone on Sep 22, 2023
@ChanYiLin self-assigned this on Sep 22, 2023
@ChanYiLin added the priority/0 (Must be fixed in this release, managed by PO), area/resilience (System or volume resilience), and investigation-needed (Need to identify the case before estimating and starting the development) labels on Sep 22, 2023

ChanYiLin commented Sep 26, 2023

https://longhorn.io/docs/1.5.1/references/settings/#automatically-delete-workload-pod-when-the-volume-is-detached-unexpectedly
Based on this setting, when the volume is detached unexpectedly, Longhorn deletes the workload pod on the user's behalf. The volume then self-recovers, the pod is recreated, and the volume should be remounted again.
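
A minimal sketch of checking and enabling this setting through the Longhorn client used by the test suite (the setting ID string below is an assumption based on Longhorn's kebab-case setting names; by_id_setting and client.update follow the pattern used elsewhere in longhorn-tests):

# Assumed setting ID for "Automatically Delete Workload Pod When The Volume
# Is Detached Unexpectedly"; verify it against the deployed Longhorn version.
AUTO_DELETE_POD_SETTING = "auto-delete-pod-when-volume-detached-unexpectedly"

setting = client.by_id_setting(AUTO_DELETE_POD_SETTING)
if setting.value != "true":
    # Enable pod deletion on unexpected detach so the StatefulSet recreates
    # the pod and the volume gets remounted after the engine crash.
    client.update(setting, value="true")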

    def test_volume_reattach_after_engine_sigkill(client, core_api, storage_class, sts_name, statefulset):  # NOQA
        """
        [HA] Test if the volume can be reattached after using SIGKILL
        to crash the engine process
    
        1. Create StorageClass and StatefulSet.
        2. Write random data to the pod and get the md5sum.
        3. Crash the engine process by SIGKILL in the engine manager.
        4. Wait for volume to `faulted`, then `healthy`.
        5. Wait for K8s to terminate the pod and statefulset to bring pod to
           `Pending`, then `Running`.
        6. Check volume path exist in the pod.
        7. Check md5sum of the data in the pod.
        8. Check new data written to the volume is successful.
        """
        vol_name, pod_name, md5sum = \
            common.prepare_statefulset_with_data_in_mb(
                client, core_api, statefulset, sts_name, storage_class)
    
        crash_engine_process_with_sigkill(client, core_api, vol_name)
    
      > wait_pod_for_remount_request(client, core_api, vol_name, pod_name, md5sum)

So after killing the engine process, we wait for the remount request to confirm that everything is back to normal, then validate the data.

But in wait_pod_for_remount_request, "waiting for the remount request" actually means getting the pod and waiting for its phase to become "Pending":

test_ha.py:617: in wait_pod_for_remount_request
    common.wait_for_pod_phase(core_api, pod_name, pod_phase="Pending")
common.py:4912: in wait_for_pod_phase
    pod = core_api.read_namespaced_pod(name=pod_name,
/usr/local/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py:23483: in read_namespaced_pod
    return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs)  # noqa: E501

We try to get the pod by calling the Kubernetes API, and it is not found:

        if not 200 <= r.status <= 299:
>           raise ApiException(http_resp=r)
E           kubernetes.client.exceptions.ApiException: (404)
E           Reason: Not Found
E           HTTP response headers: HTTPHeaderDict({'Audit-Id': '4ebd7308-9e1d-4b65-9f7c-3fcaea91e0d2', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '6c1d5aad-1b9a-4888-873e-b689e110db86', 'X-Kubernetes-Pf-Prioritylevel-Uid': '28859ab2-5171-47ba-9d90-c7dcf9afc3ee', 'Date': 'Thu, 21 Sep 2023 17:17:41 GMT', 'Content-Length': '218'})

We actually retry for a while to get the pod:

def wait_for_pod_phase(core_api, pod_name, pod_phase, namespace="default"):
    is_phase = False
    for _ in range(RETRY_COUNTS):
        pod = core_api.read_namespaced_pod(name=pod_name,
                                           namespace=namespace)
        if pod.status.phase == pod_phase:
            is_phase = True
            break

        time.sleep(RETRY_INTERVAL_LONG)
    assert is_phase

That means the test case failed because the pod could not be found for a short period: after the workload pod is deleted, the StatefulSet should recreate it, but during that window read_namespaced_pod returns 404 and the helper raises instead of retrying.
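
One way to tolerate that window is to treat a 404 from the API as a retryable condition instead of letting it bubble up. A sketch of such a variant of wait_for_pod_phase (an illustration only, not necessarily the actual fix):

import time
from kubernetes.client.exceptions import ApiException

def wait_for_pod_phase(core_api, pod_name, pod_phase, namespace="default"):
    is_phase = False
    for _ in range(RETRY_COUNTS):
        try:
            pod = core_api.read_namespaced_pod(name=pod_name,
                                               namespace=namespace)
            if pod.status.phase == pod_phase:
                is_phase = True
                break
        except ApiException as e:
            # The old pod may be gone and the new one not yet created while
            # the StatefulSet recreates it, so retry on NotFound.
            if e.status != 404:
                raise
        time.sleep(RETRY_INTERVAL_LONG)
    assert is_phase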

@innobead assigned c3y1huang and unassigned ChanYiLin on Dec 13, 2023
@c3y1huang added the kind/test (Request for adding test) and flaky-test labels, and removed the investigation-needed, area/resilience, require/qa-review-coverage, priority/0, kind/bug, and require/backport labels on Jan 4, 2024
c3y1huang commented

Fixed in #7491.
