Gluster streaming IO test is flaky #49529
/assign @jeffvance
The [Slow] test suite was run. The Glusterfs io stream test failed due to the ns not being deleted, which was due to 1 pod remaining. The glusterfs-server pod was deleted, but it appears that the glusterfs-client pod was not, even though the test believed it had been. Sometimes when the plugin io streaming tests pass you'll still see the client pod linger past the initial wait. My guess is that in this case the plugin's client pod did not terminate in 30s (why not the full 5m?) but did terminate in time for the ns to be deleted. And sometimes the client pod cannot be deleted in 30s (again, why not the full 5m?) and is still present when the ns is deleted. Note: if we fail to detect that the client pod was not actually deleted and then delete the server pod, there is no chance that the ns can be deleted: once the server pod is gone, the client pod will never terminate.
It looks like the client pod failed to terminate.
@msau42 30s may not always be enough time for the gluster client pod to terminate?
@msau42 also, why is it so common to see ~29 seconds for the client pods (nfs, gluster) to supposedly terminate? The wait should be for up to 5m, yet I never see wait times >= 30s... and we know that sometimes the gluster client pod is not actually terminated. Seems like there may be a bug in this area?
Automatic merge from submit-queue (batch tested with PRs 49619, 49598, 47267, 49597, 49638)

improve log for pod deletion poll loop

**What this PR does / why we need it**: It improves some logging related to waiting for a pod to reach a passed-in condition, specifically related to issue [49529](#49529), where better logging may help to debug the root cause.

**Release note**:
```release-note
NONE
```
@humblec can you take a look? My findings so far are that this test fails about 20% of the time. The gluster client pod's Status is set to Failed and therefore the pod won't be deleted. There is e2e code that waits for the pod to be deleted but escapes early if the pod is "Failed" and the passed-in reason matches the pod's Status.Reason. In the gluster pod test failures the pod's Status.Reason == "". This causes the test to incorrectly assume that the pod was deleted when it wasn't. The problem manifests when the namespace is deleted and that step fails. It seems like the Failed pod's Status.Reason should not be blank.
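For readers unfamiliar with the wait helper, here is a minimal Go sketch of the escape-hatch pattern described above. This is not the actual test/e2e/framework code; the function name, signature, and import paths are illustrative and assume a pre-context client-go API.

```go
package framework

import (
	"time"

	v1 "k8s.io/api/core/v1"
	apierrs "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	clientset "k8s.io/client-go/kubernetes"
)

// waitForPodDeleted polls until the pod object is gone, but escapes early
// when the pod is Failed and its Status.Reason matches the caller's reason.
func waitForPodDeleted(c clientset.Interface, ns, podName, reason string) error {
	return wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
		pod, err := c.CoreV1().Pods(ns).Get(podName, metav1.GetOptions{})
		if apierrs.IsNotFound(err) {
			return true, nil // pod object is truly gone
		}
		if err != nil {
			return false, err
		}
		// The bug described above: when the pod flips to Failed with
		// Status.Reason == "" and the caller also passed reason == "",
		// this branch matches and the wait ends (often at ~30s) even
		// though the pod object still exists.
		if pod.Status.Phase == v1.PodFailed && pod.Status.Reason == reason {
			return true, nil
		}
		return false, nil
	})
}
```

This would also explain why the observed waits cluster around ~29–30s: the poll exits as soon as the pod transitions to Failed, not when the object is actually removed.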
It looks like the gluster unmount failed.
@msau42 But in the log I looked at today (https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-slow/7688?log#log) I see the pod's Status=Running. Then the pod is deleted, the status remains Running for ~30s, and then changes to Failed. Even though the pod is Failed, its Reason remains "". So 1) I wonder why the pod is marked as Running if its mount failed, and 2) once the pod is Failed, why is Reason still blank?
It was the unmount that failed, not the mount.
@msau42 ah, ok thanks, I missed that. Still need to know why Pod.Status.Reason is blank...
Automatic merge from submit-queue (batch tested with PRs 49651, 49707, 49662, 47019, 49747)

improve detectability of deleted pods

**What this PR does / why we need it**: Adds a comment to `waitForPodTerminatedInNamespace` to better explain how it's implemented. ~~It improves pod deletion detection in the e2e framework as follows:~~ ~~1. the `waitForPodTerminatedInNamespace` func looks for pod.Status.Phase == _PodFailed_ or _PodSucceeded_, since both values imply that all containers have terminated.~~ ~~2. the `waitForPodTerminatedInNamespace` func also ignores the pod's Reason if the passed-in `reason` parm is "". Reason is not really relevant to whether the pod has been deleted, but if the caller passes a non-blank `reason` then it will be lower-cased, de-blanked, and compared to the pod's Reason (also lower-cased and de-blanked). The idea is to make Reason checking more flexible and to prevent a pod from being considered running, due to a Reason mismatch, when all of its containers have terminated.~~

Related to PR [49597](#49597) and issue [49529](#49529).

**Release note**:
```release-note
NONE
```
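Purely for illustration, a short Go sketch of the (now struck-through) Reason-normalization idea from that PR; the helper name is hypothetical, and `strings` would be the only import needed beyond the earlier sketch:

```go
// reasonsMatch compares a caller-supplied reason to pod.Status.Reason after
// lower-casing and removing blanks, as the struck-through proposal describes.
func reasonsMatch(callerReason, podReason string) bool {
	normalize := func(s string) string {
		return strings.ToLower(strings.Replace(s, " ", "", -1))
	}
	// An empty caller reason means Reason checking is skipped entirely.
	return callerReason == "" || normalize(callerReason) == normalize(podReason)
}
```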
Below are some records from a kubelet.log related to the glusterfs-client pod gluster-io-client_e2e-tests-volume-io-6h140 (d738c84c-778e-11e7-afad-42010a840002) and the "e2e-tests-volume-io-6h140" ns.
I think the issue is that the test is deleting the gluster endpoints before deleting the pod. |
Nm, the test does seem to be cleaning up in the correct order: client pod, then server pod and endpoints. So the issue seems to be that the pod goes to the Failed state during termination. Before, I thought it was because the unmount failed, but based on the timestamps in the logs, it fails before we start unmounting.
Just spoke to @dashpole and he said that it's normal for a Pod to go to the Failed state during the termination sequence. The volumes are unmounted between the Failed state and deletion of the Pod object. So I think the issue here is that the WaitForPodTerminated function needs to wait only for the Pod object to be deleted, and not return early if the Pod state is Failed.
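A minimal sketch of that fix, reusing the imports from the earlier sketch (again, the helper name and signature are assumptions, not the actual framework code):

```go
// waitForPodToDisappear treats the pod as terminated only when the API
// server no longer has the object, regardless of phase, since unmounting
// happens between the Failed phase and deletion of the Pod object.
func waitForPodToDisappear(c clientset.Interface, ns, podName string, interval, timeout time.Duration) error {
	return wait.PollImmediate(interval, timeout, func() (bool, error) {
		_, err := c.CoreV1().Pods(ns).Get(podName, metav1.GetOptions{})
		if apierrs.IsNotFound(err) {
			return true, nil // fully deleted; volumes have been unmounted
		}
		if err != nil {
			return false, err // unexpected API error aborts the poll
		}
		return false, nil // pod object still exists (possibly Failed); keep waiting
	})
}
```

With this shape there is no Phase or Reason branch at all, so a Failed pod with a blank Status.Reason can no longer end the wait early.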
@jeffvance do you have time to look at this? The test is still flaking |
@msau42 yes I can take a look. Do you know if there has been a PR started to address this issue? BTW, I will be gone most of tomorrow but will start tomorrow night.
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

wait for pod to be fully deleted

**What this PR does / why we need it**: Fix flaky glusterfs io-streaming tests.

**Which issue this PR fixes**: fixes #49529

**Special notes for your reviewer**:
1) max potential wait for complete pod deletion is ~~15m~~ 5m.
2) ~~removed [Flaky] from HostCleanup, _e2e/node/kubelet.go_, since pod deletion is reliable now.~~
3) ~~added tag [Slow] to HostCleanup due to long max wait for pod deletion.~~

After all CI tests run reliably we can consider removing the [Flaky] tag (2, above), or do that in a separate PR.

```release-note
NONE
```

cc @msau42
**Is this a BUG REPORT or FEATURE REQUEST?**:
/kind bug

**What happened**:
See testgrid. One failure.

/sig storage