[BUG] Test case test_volume_reattach_after_engine_sigkill failed #6751

Closed

roger-ryao opened this issue Sep 22, 2023 · 2 comments

Assignee: c3y1huang
Labels: duplicated, flaky-test, kind/test (Request for adding test), reproduce/rare (< 50% reproducible)
Milestone: v1.6.0

roger-ryao commented Sep 22, 2023

Describe the bug (🐛 if you encounter this issue)

In Longhorn master-head, the test case test_volume_reattach_after_engine_sigkill failed because of an error in wait_pod_for_remount_request, and it appears to be the same issue that occurs after crash_engine_process_with_sigkill.

After a discussion with @ChanYiLin, he will examine the support bundle to determine whether the root cause of this failure is similar to that of #6699. If not, I will attempt to modify the test case to add log printing to the console output (see the sketch after the traceback below).

        vol_name, pod_name, md5sum = \
            common.prepare_statefulset_with_data_in_mb(
                client, core_api, statefulset, sts_name, storage_class)

        crash_engine_process_with_sigkill(client, core_api, vol_name)

>       wait_pod_for_remount_request(client, core_api, vol_name, pod_name, md5sum)

test_ha.py:1977: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test_ha.py:617: in wait_pod_for_remount_request
    common.wait_for_pod_phase(core_api, pod_name, pod_phase="Pending")
common.py:4912: in wait_for_pod_phase
    pod = core_api.read_namespaced_pod(name=pod_name,
/usr/local/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py:23483: in read_namespaced_pod
    return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs)  # noqa: E501
/usr/local/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py:23570: in read_namespaced_pod_with_http_info
    return self.api_client.call_api(
/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py:348: in call_api
    return self.__call_api(resource_path, method,
/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py:180: in __call_api
    response_data = self.request(
/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py:373: in request
    return self.rest_client.GET(url,
/usr/local/lib/python3.9/site-packages/kubernetes/client/rest.py:241: in GET
    return self.request("GET", url,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <kubernetes.client.rest.RESTClientObject object at 0x7fe2d40265b0>
method = 'GET'
url = 'https://10.43.0.1:443/api/v1/namespaces/default/pods/longhorn-teststs-1k448e-0'
query_params = []
headers = {'Accept': 'application/json', 'Content-Type': 'application/json', 'User-Agent': 'OpenAPI-Generator/25.3.0/python', 'a...d5kNQaIh6vN0Oe8e8k7Tc5Akm0PXqo_i-V7-Z24GmAN0owxHW3s6CLQFsGhc2UxoMJUs9CxlUGXfVdV3NngQZrjgHSbz4tRxgWm5PNf8hw6DhcIJeDEGA'}
body = None, post_params = {}, _preload_content = True, _request_timeout = None
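
For the log printing mentioned above, a minimal sketch of what could be added around the failing step (illustrative only; it reuses the client, core_api, and name variables already in scope in the test, and the print format is an assumption):

try:
    wait_pod_for_remount_request(client, core_api, vol_name, pod_name, md5sum)
except Exception:
    # Dump the volume state and the pods in the namespace before re-raising,
    # so the console output shows what existed at the moment of failure.
    volume = client.by_id_volume(vol_name)
    print("volume {}: state={}, robustness={}".format(
        vol_name, volume.state, volume.robustness))
    pods = core_api.list_namespaced_pod(namespace="default")
    print("pods in default namespace: {}".format(
        [(p.metadata.name, p.status.phase) for p in pods.items]))
    raise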

To Reproduce

Execute test case test_ha.py::test_volume_reattach_after_engine_sigkill
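
One way to run only this test through pytest's Python API, assuming the longhorn-tests integration environment is already set up (the path and options are illustrative; CI drives this differently):

import pytest

# Run only the affected test with verbose output; assumes the working
# directory is the longhorn-tests manager/integration/tests folder.
pytest.main(["-v", "test_ha.py::test_volume_reattach_after_engine_sigkill"])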

Expected behavior

Consistent test results across all branches.

Support bundle for troubleshooting

longhorn-tests-sles-amd64-664-bundle.zip

Environment

  • Longhorn version: master-head
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.28.1+k3s1
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: SLES 15-sp5
    • Kernel version:
    • CPU per node: 4 cores
    • Memory per node: 16GB
    • Disk type(e.g. SSD/NVMe/HDD): SSD
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster: 1
  • Impacted Longhorn resources:
    • Volume names:

Additional context

@roger-ryao added the kind/bug, reproduce/rare (< 50% reproducible), require/qa-review-coverage (Require QA to review coverage), and require/backport (Require backport; only used when the specific versions to backport have not been defined) labels on Sep 22, 2023
@roger-ryao added this to the v1.6.0 milestone on Sep 22, 2023
@ChanYiLin self-assigned this on Sep 22, 2023
@ChanYiLin added the priority/0 (Must be fixed in this release, managed by PO), area/resilience (System or volume resilience), and investigation-needed (Need to identify the case before estimating and starting the development) labels on Sep 22, 2023

ChanYiLin commented Sep 26, 2023

https://longhorn.io/docs/1.5.1/references/settings/#automatically-delete-workload-pod-when-the-volume-is-detached-unexpectedly
Based on this setting, when the volume is detached unexpectedly, Longhorn deletes the workload pod on the user's behalf. The volume then self-recovers, the pod is recreated, and the volume should be remounted again.
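
A minimal sketch of checking and enabling this setting through the Longhorn client used by the test suite (the setting ID string below is an assumption based on Longhorn's kebab-case setting names; by_id_setting and client.update follow the pattern used elsewhere in longhorn-tests):

# Assumed setting ID for "Automatically Delete Workload Pod When The Volume
# Is Detached Unexpectedly"; verify it against the deployed Longhorn version.
AUTO_DELETE_POD_SETTING = "auto-delete-pod-when-volume-detached-unexpectedly"

setting = client.by_id_setting(AUTO_DELETE_POD_SETTING)
if setting.value != "true":
    # Enable pod deletion on unexpected detach so the StatefulSet recreates
    # the pod and the volume gets remounted after the engine crash.
    client.update(setting, value="true")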

    def test_volume_reattach_after_engine_sigkill(client, core_api, storage_class, sts_name, statefulset):  # NOQA
        """
        [HA] Test if the volume can be reattached after using SIGKILL
        to crash the engine process
    
        1. Create StorageClass and StatefulSet.
        2. Write random data to the pod and get the md5sum.
        3. Crash the engine process by SIGKILL in the engine manager.
        4. Wait for volume to `faulted`, then `healthy`.
        5. Wait for K8s to terminate the pod and statefulset to bring pod to
           `Pending`, then `Running`.
        6. Check volume path exist in the pod.
        7. Check md5sum of the data in the pod.
        8. Check new data written to the volume is successful.
        """
        vol_name, pod_name, md5sum = \
            common.prepare_statefulset_with_data_in_mb(
                client, core_api, statefulset, sts_name, storage_class)
    
        crash_engine_process_with_sigkill(client, core_api, vol_name)
    
      > wait_pod_for_remount_request(client, core_api, vol_name, pod_name, md5sum)

So after killing the engine process, we wait for the remount request to confirm that everything is back to normal, then validate the data.

But in wait_pod_for_remount_request, "waiting for the remount request" actually means getting the pod and waiting for its phase to become "Pending":

test_ha.py:617: in wait_pod_for_remount_request
    common.wait_for_pod_phase(core_api, pod_name, pod_phase="Pending")
common.py:4912: in wait_for_pod_phase
    pod = core_api.read_namespaced_pod(name=pod_name,
/usr/local/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py:23483: in read_namespaced_pod
    return self.read_namespaced_pod_with_http_info(name, namespace, **kwargs)  # noqa: E501

We try to get the pod by calling the Kubernetes API, and it is not found:

        if not 200 <= r.status <= 299:
>           raise ApiException(http_resp=r)
E           kubernetes.client.exceptions.ApiException: (404)
E           Reason: Not Found
E           HTTP response headers: HTTPHeaderDict({'Audit-Id': '4ebd7308-9e1d-4b65-9f7c-3fcaea91e0d2', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '6c1d5aad-1b9a-4888-873e-b689e110db86', 'X-Kubernetes-Pf-Prioritylevel-Uid': '28859ab2-5171-47ba-9d90-c7dcf9afc3ee', 'Date': 'Thu, 21 Sep 2023 17:17:41 GMT', 'Content-Length': '218'})

We actually retry for a while to get the pod:

def wait_for_pod_phase(core_api, pod_name, pod_phase, namespace="default"):
    is_phase = False
    for _ in range(RETRY_COUNTS):
        pod = core_api.read_namespaced_pod(name=pod_name,
                                           namespace=namespace)
        if pod.status.phase == pod_phase:
            is_phase = True
            break

        time.sleep(RETRY_INTERVAL_LONG)
    assert is_phase

That means the test case failed because the pod could not be found for a short period: after the workload pod is deleted, the StatefulSet should recreate it, but during that window read_namespaced_pod returns 404 and the helper raises instead of retrying.
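
One way to tolerate that window is to treat a 404 from the API as a retryable condition instead of letting it bubble up. A sketch of such a variant of wait_for_pod_phase (an illustration only, not necessarily the actual fix):

import time
from kubernetes.client.exceptions import ApiException

def wait_for_pod_phase(core_api, pod_name, pod_phase, namespace="default"):
    is_phase = False
    for _ in range(RETRY_COUNTS):
        try:
            pod = core_api.read_namespaced_pod(name=pod_name,
                                               namespace=namespace)
            if pod.status.phase == pod_phase:
                is_phase = True
                break
        except ApiException as e:
            # The old pod may be gone and the new one not yet created while
            # the StatefulSet recreates it, so retry on NotFound.
            if e.status != 404:
                raise
        time.sleep(RETRY_INTERVAL_LONG)
    assert is_phase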

@innobead assigned c3y1huang and unassigned ChanYiLin on Dec 13, 2023
@c3y1huang added the kind/test (Request for adding test) and flaky-test labels, and removed the investigation-needed, area/resilience, require/qa-review-coverage, priority/0, kind/bug, and require/backport labels on Jan 4, 2024
c3y1huang commented

Fixed in #7491.
