
Gluster streaming IO test is flaky #49529

Closed
msau42 opened this issue Jul 25, 2017 · 16 comments · Fixed by #53537
Labels: kind/bug (Categorizes issue or PR as related to a bug), sig/storage (Categorizes an issue or PR as relevant to SIG Storage)

Comments

msau42 (Member) commented Jul 25, 2017

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
See testgrid
One failure

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:131
Jul 24 19:00:00.524: Couldn't delete ns: "e2e-tests-volume-io-h415g": namespace e2e-tests-volume-io-h415g was not deleted with limit: timed out waiting for the condition, pods remaining: 1 (&errors.errorString{s:"namespace e2e-tests-volume-io-h415g was not deleted with limit: timed out waiting for the condition, pods remaining: 1"})
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:283

/sig storage

k8s-ci-robot added the kind/bug and sig/storage labels Jul 25, 2017
msau42 (Member, Author) commented Jul 25, 2017

/assign @jeffvance

jeffvance (Contributor) commented Jul 25, 2017

The [Slow] test suite was run. The GlusterFS IO streaming test failed because the namespace was not deleted, which in turn was because 1 pod remained. The glusterfs-server pod was deleted; it appears that the glusterfs-client pod was not, even though framework.DeletePodWithWait returned no error.
In cases where this test succeeds (which is most of the time), DeletePodWithWait returns apierrs.IsNotFound, which shows up in the log as `INFO: Ignore "not found" error above. Pod "gluster-io-client" successfully deleted`.

Sometimes when the plugin IO streaming tests pass you'll see `INFO: Ignore "not found" error above. Pod "<pod-name-here>" successfully deleted` logged for the client pod. However, sometimes a client pod is deleted but that log entry is never emitted. E.g. in this log the nfs-client pod was deleted (the namespace could be deleted), yet the "ignore the not found error" message was not generated (scroll down to the line `I0724 20:27:20.364] Jul 24 20:27:00.559: INFO: Waiting for pod nfs-io-client in namespace 'e2e-tests-volume-io-pmdks' status to be 'terminated due to deadline exceeded'(found phase: "Running", readiness: true) (29.883170178s elapsed) I0724 20:27:20.365] [AfterEach] [sig-storage] NFS`).

My guess is that in this case the plugin's client pod did not terminate in 30s -- WHY NOT THE FULL 5m? -- but did terminate in time for the ns to be deleted. And sometimes the client pod cannot be deleted in 30s (again WHY NOT THE FULL 5m?) and is still present when the ns is deleted.

Note: if we fail to detect that the client pod was not actually deleted and then delete the server pod, the namespace can never be deleted: once the server pod is gone, the client pod will not delete.

msau42 (Member, Author) commented Jul 25, 2017

It looks like the client pod failed to terminate:

I0724 19:00:00.536] Jul 24 19:00:00.524: INFO: POD                NODE                             PHASE   GRACE  CONDITIONS
I0724 19:00:00.536] Jul 24 19:00:00.524: INFO: gluster-io-client  bootstrap-e2e-minion-group-7qjd  Failed  30s    [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2017-07-24 18:48:42 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2017-07-24 18:49:40 +0000 UTC ContainersNotReady containers with unready status: [gluster-io-client]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2017-07-24 18:48:39 +0000 UTC  }]

jeffvance (Contributor) commented:
@msau42 30s may not always be enough time for the gluster client pod to terminate?

jeffvance (Contributor) commented Jul 25, 2017

@msau42 also, why is it so common to see ~29 seconds for the client pods (nfs, gluster) to supposedly terminate? The wait should be for up to 5m, yet I never see wait times >= 30s... and yet we know that sometimes the gluster client pod is not terminated. Seems like maybe a bug in this area?

k8s-github-robot pushed a commit that referenced this issue Jul 28, 2017
Automatic merge from submit-queue (batch tested with PRs 49619, 49598, 47267, 49597, 49638)

improve log for pod deletion poll loop

**What this PR does / why we need it**:
It improves some logging related to waiting for a pod to reach a passed-in condition. Specifically, related to issue [49529](#49529) where better logging may help to debug the root cause.

**Release note**:
```release-note
NONE
```
jeffvance (Contributor) commented:
@humblec can you take a look? My findings so far: this test fails about 20% of the time. The gluster client pod's Status is set to Failed and therefore the pod won't be deleted. There is e2e code that waits for the pod to be deleted but escapes early if the pod's phase is "Failed" and the passed-in reason matches the pod's Status.Reason. In the gluster pod test failures the pod's Status.Reason == "", which causes the test to incorrectly assume that the pod was deleted when it wasn't. The problem manifests when the namespace is deleted and that step fails. It seems like a Failed pod's Status.Reason should not be blank.

cc @msau42 @erinboyd
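
For illustration, here is a minimal Go sketch of the kind of early-exit check described in the comment above. The function name, signature, and package are hypothetical; this is not the actual test/e2e/framework code, only the pattern being discussed: the wait stops as soon as the pod's phase is Failed and decides success on a Reason string comparison, even though the pod object (and its volumes) may still exist.

```go
// Hypothetical sketch of the early-exit condition described above; not the
// actual test/e2e/framework implementation.
package sketch

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// podTerminatedCondition returns a condition func of the sort used by
// wait-for-pod helpers. It stops waiting as soon as the pod's phase is
// Failed, deciding success purely on a Reason string comparison.
func podTerminatedCondition(expectedReason string) func(pod *v1.Pod) (bool, error) {
	return func(pod *v1.Pod) (bool, error) {
		if pod.Status.Phase == v1.PodFailed {
			// In the flaky runs the gluster client pod reaches Failed with
			// Status.Reason == "", so the wait concludes here rather than
			// waiting for the pod object to actually be deleted.
			if pod.Status.Reason == expectedReason {
				return true, nil
			}
			return true, fmt.Errorf("pod failed with reason %q, expected %q",
				pod.Status.Reason, expectedReason)
		}
		// Any other phase: keep polling.
		return false, nil
	}
}
```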

msau42 (Member, Author) commented Jul 31, 2017

It looks like gluster unmount failed:

kubelet.log

+0000 UTC (durationBeforeRetry 500ms). Error: UnmountVolume.TearDown failed for volume "gluster-io-volume" (UniqueName: "kubernetes.io/glusterfs/b501a29c-70a0-11e7-846a-42010a840002-gluster-io-volume") pod "b501a29c-70a0-11e7-846a-42010a840002" (UID: "b501a29c-70a0-11e7-846a-42010a840002") : Error checking if path exists: stat /var/lib/kubelet/pods/b501a29c-70a0-11e7-846a-42010a840002/volumes/kubernetes.io~glusterfs/gluster-io-volume: transport endpoint is not connected

jeffvance (Contributor) commented:
@msau42 But in the log I looked at today (https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-slow/7688?log#log) I see the pod's Status=Running. Then the pod is deleted and the status remains Running for ~30s then changes to Failed. Even though the pod is Failed its Reason remains "". So 1) I wonder why the pod is marked as Running if its mount failed, and 2) once the pod is Failed why is Reason still blank?

msau42 (Member, Author) commented Jul 31, 2017

It was the unmount that failed, not the mount.

jeffvance (Contributor) commented:
@msau42 ah, ok thanks, I missed that. Still need to know why Pod.Status.Reason is blank...

k8s-github-robot pushed a commit that referenced this issue Jul 31, 2017
Automatic merge from submit-queue (batch tested with PRs 49651, 49707, 49662, 47019, 49747)

improve detectability of deleted pods

**What this PR does / why we need it**:
Adds comment to `waitForPodTerminatedInNamespace` to better explain how it's implemented.
~~It improves pod deletion detection in the e2e framework as follows:~~
~~1.  the `waitForPodTerminatedInNamespace` func looks for pod.Status.Phase ==  _PodFailed_ or _PodSucceeded_ since both values imply that all containers have terminated.~~
~~2.  the `waitForPodTerminatedInNamespace` func also ignores the pod's Reason if the passed-in `reason` parm is "". Reason is not really relevant to the pod being deleted or not, but if the caller passes a non-blank `reason` then it will be lower-cased, de-blanked and compared to the pod's Reason (also lower-cased and de-blanked). The idea is to make Reason checking more flexible and to prevent a pod from being considered running when all of its containers have terminated just because of a Reason mis-match.~~

Related to PR [49597](#49597) and issue [49529](#49529).

**Release note**:
```release-note
NONE
```
jeffvance (Contributor) commented Aug 2, 2017

Below are some records from a kubelet.log related to the glusterfs-client pod, gluster-io-client_e2e-tests-volume-io-6h140(d738c84c-778e-11e7-afad-42010a840002) and the "e2e-tests-volume-io-6h140" ns:

...
I0802 14:28:09.138246    1463 summary.go:389] Missing default interface "eth0" for pod:e2e-tests-volume-io-6h140_gluster-server
...
I0802 14:28:29.286398    1463 volume_manager.go:369] All volumes are attached and mounted for pod "gluster-io-client_e2e-tests-volume-io-6h140(d738c84c-778e-11e7-afad-42010a840002)"
...
I0802 14:28:29.308652    1463 summary.go:389] Missing default interface "eth0" for pod:e2e-tests-volume-io-6h140_gluster-io-client
...
I0802 14:28:30.015964    1463 volume_manager.go:340] Waiting for volumes to attach and mount for pod "gluster-io-client_e2e-tests-volume-io-6h140(d738c84c-778e-11e7-afad-42010a840002)"
...
I0802 14:28:31.363343    1463 volume_manager.go:369] All volumes are attached and mounted for pod "gluster-io-client_e2e-tests-volume-io-6h140(d738c84c-778e-11e7-afad-42010a840002)"
...
I0802 14:28:57.610511    1463 kubelet.go:1776] SyncLoop (DELETE, "api"): "gluster-io-client_e2e-tests-volume-io-6h140(d738c84c-778e-11e7-afad-42010a840002)"
...
I0802 14:28:57.621069    1463 kubelet_pods.go:749] Pod "gluster-io-client_e2e-tests-volume-io-6h140(d738c84c-778e-11e7-afad-42010a840002)" is terminated, but some containers are still running
...
I0802 14:29:27.536230    1463 kubelet_pods.go:749] Pod "gluster-io-client_e2e-tests-volume-io-6h140(d738c84c-778e-11e7-afad-42010a840002)" is terminated, but some containers are still running
...
I0802 14:29:27.647827    1463 plugins.go:405] Calling network plugin kubenet to tear down pod "gluster-io-client_e2e-tests-volume-io-6h140"
...
I0802 14:29:28.690100    1463 status_manager.go:453] Status for pod "gluster-io-client_e2e-tests-volume-io-6h140(d738c84c-778e-11e7-afad-42010a840002)" updated successfully: (6, {Phase:Failed Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2017-08-02 14:28:28 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2017-08-02 14:29:28 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [gluster-io-client]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2017-08-02 14:28:23 +0000 UTC Reason: Message:}] Message: Reason: HostIP:10.132.0.5 PodIP: StartTime:2017-08-02 14:28:23 +0000 UTC InitContainerStatuses:[{Name:gluster-io-init State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:0,Signal:0,Reason:Completed,Message:,StartedAt:2017-08-02 14:28:26 +0000 UTC,FinishedAt:2017-08-02 14:28:28 +0000 UTC,ContainerID:docker://cbad4989561a21140eb52b87c058ad120fe66bc2ffded5601422481914ca852a,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:gcr.io/google_containers/busybox:1.24 ImageID:docker://sha256:0cb40641836c461bc97c793971d84d758371ed682042457523e4ae701efe7ec9 ContainerID:docker://cbad4989561a21140eb52b87c058ad120fe66bc2ffded5601422481914ca852a}] ContainerStatuses:[{Name:gluster-io-client State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:137,Signal:0,Reason:Error,Message:,StartedAt:2017-08-02 14:28:29 +0000 UTC,FinishedAt:2017-08-02 14:29:27 +0000 UTC,ContainerID:docker://3152bfee028fd450e14b28d4b9df396982336c602b25f2353b82b0ce4671ff50,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:gcr.io/google_containers/busybox:1.24 ImageID:docker://sha256:0cb40641836c461bc97c793971d84d758371ed682042457523e4ae701efe7ec9 ContainerID:docker://3152bfee028fd450e14b28d4b9df396982336c602b25f2353b82b0ce4671ff50}]
(note ^^ Phase=Failed but Reason="") ...
I0802 14:29:28.695779    1463 kubelet_pods.go:754] Pod "gluster-io-client_e2e-tests-volume-io-6h140(d738c84c-778e-11e7-afad-42010a840002)" is terminated, but some volumes have not been cleaned up
...
I0802 14:29:29.699108    1463 kubenet_linux.go:528] TearDownPod took 31.760896ms for e2e-tests-volume-io-6h140/gluster-io-client
...
I0802 14:29:29.803660    1463 manager.go:970] Destroyed container: "/kubepods/besteffort/podcaf8c26d-778e-11e7-afad-42010a840002/4c62ca47c6f38e332cbbd4051352c75c1b36b7c9cce19f3aa6974ca6a2c08e41" (aliases: [k8s_gluster-server_gluster-server_e2e-tests-volume-io-6h140_caf8c26d-778e-11e7-afad-42010a840002_0 4c62ca47c6f38e332cbbd4051352c75c1b36b7c9cce19f3aa6974ca6a2c08e41], namespace: "docker")
(Note ^^ now we're deleting the glusterfs-server pod which indicates the test thinks that the client pod was deleted but it wasn't)
...
I0802 14:29:29.914691    1463 reconciler.go:186] operationExecutor.UnmountVolume started for volume "gluster-io-volume" (UniqueName: "kubernetes.io/glusterfs/d738c84c-778e-11e7-afad-42010a840002-gluster-io-volume") pod "d738c84c-778e-11e7-afad-42010a840002" (UID: "d738c84c-778e-11e7-afad-42010a840002")
...
E0802 14:29:29.917485    1463 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/glusterfs/d738c84c-778e-11e7-afad-42010a840002-gluster-io-volume\" (\"d738c84c-778e-11e7-afad-42010a840002\")" failed. No retries permitted until 2017-08-02 14:29:30.417374889 +0000 UTC (durationBeforeRetry 500ms). Error: UnmountVolume.TearDown failed for volume "gluster-io-volume" (UniqueName: "kubernetes.io/glusterfs/d738c84c-778e-11e7-afad-42010a840002-gluster-io-volume") pod "d738c84c-778e-11e7-afad-42010a840002" (UID: "d738c84c-778e-11e7-afad-42010a840002") : Error checking if path exists: stat /var/lib/kubelet/pods/d738c84c-778e-11e7-afad-42010a840002/volumes/kubernetes.io~glusterfs/gluster-io-volume: transport endpoint is not connected
...
E0802 14:29:30.437424    1463 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/glusterfs/d738c84c-778e-11e7-afad-42010a840002-gluster-io-volume\" (\"d738c84c-778e-11e7-afad-42010a840002\")" failed. No retries permitted until 2017-08-02 14:29:31.437384221 +0000 UTC (durationBeforeRetry 1s). Error: UnmountVolume.TearDown failed for volume "gluster-io-volume" (UniqueName: "kubernetes.io/glusterfs/d738c84c-778e-11e7-afad-42010a840002-gluster-io-volume") pod "d738c84c-778e-11e7-afad-42010a840002" (UID: "d738c84c-778e-11e7-afad-42010a840002") : Error checking if path exists: stat /var/lib/kubelet/pods/d738c84c-778e-11e7-afad-42010a840002/volumes/kubernetes.io~glusterfs/gluster-io-volume: transport endpoint is not connected
...
E0802 14:29:37.489768    1463 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/glusterfs/d738c84c-778e-11e7-afad-42010a840002-gluster-io-volume\" (\"d738c84c-778e-11e7-afad-42010a840002\")" failed. No retries permitted until 2017-08-02 14:29:45.489720789 +0000 UTC (durationBeforeRetry 8s). Error: UnmountVolume.TearDown failed for volume "gluster-io-volume" (UniqueName: "kubernetes.io/glusterfs/d738c84c-778e-11e7-afad-42010a840002-gluster-io-volume") pod "d738c84c-778e-11e7-afad-42010a840002" (UID: "d738c84c-778e-11e7-afad-42010a840002") : Error checking if path exists: stat /var/lib/kubelet/pods/d738c84c-778e-11e7-afad-42010a840002/volumes/kubernetes.io~glusterfs/gluster-io-volume: transport endpoint is not connected
...
E0802 14:30:12.671242    1463 glusterfs.go:150] glusterfs: failed to get endpoints gluster-io-volume[an empty namespace may not be set when a resource name is provided]
...
E0802 14:31:37.674017    1463 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/glusterfs/d738c84c-778e-11e7-afad-42010a840002-gluster-io-volume\" (\"d738c84c-778e-11e7-afad-42010a840002\")" failed. No retries permitted until 2017-08-02 14:33:39.673981891 +0000 UTC (durationBeforeRetry 2m2s). Error: UnmountVolume.TearDown failed for volume "gluster-io-volume" (UniqueName: "kubernetes.io/glusterfs/d738c84c-778e-11e7-afad-42010a840002-gluster-io-volume") pod "d738c84c-778e-11e7-afad-42010a840002" (UID: "d738c84c-778e-11e7-afad-42010a840002") : Error checking if path exists: stat /var/lib/kubelet/pods/d738c84c-778e-11e7-afad-42010a840002/volumes/kubernetes.io~glusterfs/gluster-io-volume: transport endpoint is not connected
...
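
As an aside, the growing durationBeforeRetry values in the UnmountVolume errors above (500ms, 1s, 8s, ... 2m2s) reflect the volume manager retrying failed operations with exponential backoff. A rough, illustrative sketch of that policy follows; the initial delay, factor, and cap are assumptions inferred from the logged values, not the kubelet's actual constants.

```go
// Illustrative sketch of the retry backoff visible in the log above; the
// constants are assumptions, not the kubelet's exact values.
package sketch

import "time"

// nextRetryDelay doubles the wait after each failed UnmountVolume attempt,
// up to a maximum, which is why the retry gap grows from 500ms toward 2m2s.
func nextRetryDelay(current time.Duration) time.Duration {
	const (
		initialDelay = 500 * time.Millisecond
		maxDelay     = 2*time.Minute + 2*time.Second
	)
	if current < initialDelay {
		return initialDelay
	}
	next := current * 2
	if next > maxDelay {
		next = maxDelay
	}
	return next
}
```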

msau42 (Member, Author) commented Sep 6, 2017

I think the issue is that the test is deleting the gluster endpoints before deleting the pod.

msau42 (Member, Author) commented Sep 6, 2017

Nm, the test does seem to be cleaning up in the correct order: client pod, then server pod and endpoints.

So the issue seems to be that during termination the pod goes to the Failed state. Before, I thought it was because the unmount failed, but based on the log timestamps, the pod fails before we start unmounting.

msau42 (Member, Author) commented Sep 8, 2017

Just spoke to @dashpole and he said that it's normal for a Pod to go to Failed state during the termination sequence. In between Failed state and Pod object deletion is when we unmount the volumes. So I think the issue here is that the WaitForPodTerminated function needs to wait only for the Pod object to be deleted, and not return early if Pod state is Failed.
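
A minimal sketch of the approach being proposed here, written against current client-go and using a hypothetical helper name and timeouts: poll until a Get on the pod returns "not found" (i.e. the pod object itself is gone), rather than returning early on a Failed phase.

```go
// Hypothetical sketch of "wait for the Pod object to be deleted"; not the
// actual framework change, and the helper name and intervals are assumptions.
package sketch

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodGone polls until the pod object no longer exists. A Failed phase
// is ignored: only a "not found" from the API server counts as deleted, so
// volume teardown has a chance to finish before the namespace is removed.
func waitForPodGone(c kubernetes.Interface, ns, name string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		_, err := c.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
		switch {
		case err == nil:
			return false, nil // pod object still present (possibly Failed); keep waiting
		case apierrors.IsNotFound(err):
			return true, nil // fully deleted
		default:
			return false, fmt.Errorf("error getting pod %s/%s: %v", ns, name, err)
		}
	})
}
```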

msau42 (Member, Author) commented Oct 4, 2017

@jeffvance do you have time to look at this? The test is still flaking

jeffvance (Contributor) commented:
@msau42 yes, I can take a look. Do you know if a PR has been started to address this issue? BTW, I will be gone most of tomorrow but will start tomorrow night.

k8s-github-robot pushed a commit that referenced this issue Oct 10, 2017
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

wait for pod to be fully deleted

**What this PR does / why we need it**:
Fix flaky glusterfs io-streaming tests.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #49529 

**Special notes for your reviewer**:
1) max potential wait for complete pod deletion is ~~15m~~ 5m.
2) ~~removed [Flaky] from HostCleanup, _e2e/node/kubelet.go_ since pod deletion is reliable now.~~
3) ~~added tag [Slow] to HostCleanup due to long max wait for pod deletion.~~

After all CI tests run reliably we can consider removing the [Flaky] tag (2, above), or do that in a separate pr.

```release-note
NONE
```
cc @msau42