
Flaking Test: subpath failures in new-master-upgrade-cluster-new-parallel, other jobs #71383

Closed
jberkus opened this issue Nov 23, 2018 · 31 comments · Fixed by #71428
Labels
kind/flake Categorizes issue or PR as related to a flaky test. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/storage Categorizes an issue or PR as relevant to SIG Storage.
Comments

@jberkus

jberkus commented Nov 23, 2018

Which jobs are failing: gce-new-master-upgrade-cluster-new-parallel

Which test(s) are failing:

Varies, but all subpath failures, including:

  • [sig-storage] CSI Volumes [Driver: csi-hostpath-v0] [Testpattern: Dynamic PV (default fs)] subPath should fail if subpath with backstepping is outside the volume [Slow]
  • [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] subPath should fail if subpath directory is outside the volume [Slow]
  • [sig-storage] CSI Volumes [Driver: csi-hostpath-v0] [Testpattern: Dynamic PV (default fs)] subPath should fail if subpath file is outside the volume [Slow]
  • [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] subPath should fail if subpath file is outside the volume [Slow]

... and pretty much every other subpath test, but never all of them at once.

There are also a few other storage tests failing, such as:

  • [sig-storage] Volume expand [Slow] Verify if editing PVC allows resize
  • [sig-storage] Detaching volumes should not work when mount is in progress

Since when has it been failing: 11/22

Testgrid link: https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-new-master-upgrade-cluster-new-parallel&include-filter-by-regex=.*CSI.*&width=20

Reason for failure:

These flakes started around the time that #71314 merged, but they don't line up with the exact merge timestamp, so it's probably coincidental.

The subpath test failures seem to be mostly timeouts:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go:254
while waiting for failed event to occur
Expected error:
    <*errors.errorString | 0xc0000d1860>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go:601

... so possibly this is just GCE fail.

Anything else we need to know:

This test job has always been flaky, with around a 40% failure rate.

/kind flake
/sig storage
/priority important-soon

cc
@saad-ali @AishSundar @liggitt

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. sig/storage Categorizes an issue or PR as relevant to SIG Storage. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 23, 2018
@saad-ali
Member

@msau42 @mkimuram can you please help investigate this.

It looks to me like the subPath should...outside the volume tests are flaking for csi-hostpath-v0 and csi-hostpath. I wonder if the v0 tests are racing with the non-v0 tests for something in this test?

@saad-ali
Member

saad-ali commented Nov 26, 2018

The other subPath should...outside the volume tests, including those for the other CSI driver (pd.csi.storage.gke.io) and the non-CSI in-tree plugins, are solid green, which also makes me think this is an issue with the hostpath tests or driver.

@liggitt liggitt added this to the v1.13 milestone Nov 26, 2018
@liggitt
Member

liggitt commented Nov 26, 2018

adding to the milestone for triage since the flakes look new

@jberkus jberkus changed the title Flaking Test: subpath failures in Flaking Test: subpath failures in new-master-upgrade-cluster-new-parallel, other jobs Nov 26, 2018
@mkimuram
Contributor

@saad-ali @msau42

All four of these subpath tests call testPodFailSubpathError, which fails while looking for a specific failed event via WaitTimeoutForPodEvent. WaitTimeoutForPodEvent uses eventOccured to check whether the specific error event has happened, but eventOccured only checks the first event, so I suspect the tests may flake if another failed event happens before the expected one.

Also, since WaitTimeoutForPodEvent already waits for the event, we might be able to remove the WaitForPodRunningInNamespace call from testPodFailSubpathError.

I will create a PR to fix the above.
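
For illustration, a minimal sketch of the idea described above: scan all of the pod's events for the expected message rather than only the first one returned. This is not the actual e2e framework code; the helper name (waitForPodEventMatching) is hypothetical, and it assumes a client-go release whose typed clients take a context.

```go
package e2eutil

import (
	"context"
	"fmt"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodEventMatching polls the events attached to a pod until one of them
// contains msg, or the timeout expires. Checking every event, instead of only
// the first, avoids flaking when an unrelated warning (for example a
// FailedMount retry) is recorded before the expected subpath failure.
func waitForPodEventMatching(c kubernetes.Interface, ns, podName, msg string, timeout time.Duration) error {
	selector := fmt.Sprintf("involvedObject.name=%s,involvedObject.namespace=%s", podName, ns)
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		events, err := c.CoreV1().Events(ns).List(context.TODO(), metav1.ListOptions{FieldSelector: selector})
		if err != nil {
			return false, err
		}
		for _, e := range events.Items {
			// Scan all events, not just the first one.
			if strings.Contains(e.Message, msg) {
				return true, nil
			}
		}
		return false, nil
	})
}
```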

@AishSundar
Contributor

/priority critical-urgent
/remove-priority important-soon
/milestone v1.13

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 26, 2018
@msau42
Member

msau42 commented Nov 26, 2018

@mkimuram while we should update the test with your PR, I'm not sure if it's going to fix the failure. Looking at one failing test log, these are the events I see on the failing pod:

https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new-parallel/400

Nov 26 15:34:46.034: INFO: At 2018-11-26 15:27:34 +0000 UTC - event for pod-subpath-test-csi-hostpath-dynamicpv-mgnk: {attachdetach-controller } SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-c9e516f1-f18f-11e8-9a1b-42010a800002" 
Nov 26 15:34:46.034: INFO: At 2018-11-26 15:27:50 +0000 UTC - event for pod-subpath-test-csi-hostpath-dynamicpv-mgnk: {kubelet bootstrap-e2e-minion-group-lztn} FailedMount: MountVolume.NewMounter initialization failed for volume "pvc-c9e516f1-f18f-11e8-9a1b-42010a800002" : driver name csi-hostpath-e2e-tests-csi-volumes-92vsl not found in the list of registered CSI drivers

So it seems to be failing before even reaching the subpath logic.

@msau42
Member

msau42 commented Nov 26, 2018

Timeline of events:

Test pod started getting processed by kubelet:

I1126 15:27:34.537403    1446 config.go:414] Receiving a new pod "pod-subpath-test-csi-hostpath-dynamicpv-mgnk_e2e-tests-csi-volumes-92vsl(cc5409be-f18f-11e8-9a1b-42010a800002)"

It took a minute after the test pod was created for the CSI driver to finish registering and the mount to succeed, and then for the subpath event to trigger:

I1126 15:28:32.392198    1446 csi_plugin.go:116] kubernetes.io/csi: Trying to validate a new CSI Driver with name: csi-hostpath-e2e-tests-csi-volumes-92vsl endpoint: /var/lib/kubelet/plugins/csi-hostpath-e2e-tests-csi-volumes-92vsl/csi.sock versions: 1.0.0, foundInDeprecatedDir: false
I1126 15:28:32.392254    1446 csi_plugin.go:137] kubernetes.io/csi: Register new plugin with name: csi-hostpath-e2e-tests-csi-volumes-92vsl at endpoint: /var/lib/kubelet/plugins/csi-hostpath-e2e-tests-csi-volumes-92vsl/csi.sock

I1126 15:28:32.433217    1446 operation_generator.go:567] MountVolume.SetUp succeeded for volume "pvc-c9e516f1-f18f-11e8-9a1b-42010a800002" (UniqueName: "kubernetes.io/csi/csi-hostpath-e2e-tests-csi-volumes-92vsl^cbfcc214-f18f-11e8-9505-42010a800005") pod "pod-subpath-test-csi-hostpath-dynamicpv-mgnk" (UID: "cc5409be-f18f-11e8-9a1b-42010a800002")
I1126 15:28:32.465509    1446 volume_manager.go:383] All volumes are attached and mounted for pod "pod-subpath-test-csi-hostpath-dynamicpv-mgnk_e2e-tests-csi-volumes-92vsl(cc5409be-f18f-11e8-9a1b-42010a800002)"
I1126 15:28:34.082064    1446 server.go:459] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-csi-volumes-92vsl", Name:"pod-subpath-test-csi-hostpath-dynamicpv-mgnk", UID:"cc5409be-f18f-11e8-9a1b-42010a800002", APIVersion:"v1", ResourceVersion:"47363", FieldPath:"spec.containers{test-container-subpath-csi-hostpath-dynamicpv-mgnk}"}): type: 'Warning' reason: 'Failed' Error: failed to prepare subPath for volumeMount "test-volume" of container "test-container-subpath-csi-hostpath-dynamicpv-mgnk"

So #71428 should address the issue. I was initially confused because the test logs didn't show the subpath event for some reason.

@msau42
Member

msau42 commented Nov 26, 2018

I opened #71433 to investigate whether 1 minute for CSI plugin registration is expected.

@jberkus
Author

jberkus commented Nov 26, 2018

I've added a tracker issue for the complete list of storage flakes: #71434

This is so that we can track whether we're addressing most of the flakes with the subpath fixes, or if additional fixes are required.

@liggitt
Member

liggitt commented Nov 27, 2018

keeping open until flakes are verified cleaned up

@liggitt liggitt reopened this Nov 27, 2018
@AishSundar
Contributor

/reopen

Reopening this issue until we can verify the fix in CI

@AishSundar
Contributor

As of this morning we are seeing CSI "subpath should fail" flakes in the following job

@msau42 @saad-ali can you help us determine whether these cases are flaky tests that you are fixing, or whether there is an underlying CSI bug that needs addressing?

@saad-ali
Member

@AishSundar The root cause of this is a delay in an external sidecar container and the tests not waiting long enough to accommodate the delay. There are no changes required in kubernetes/kubernetes (outside of the tests) to handle this. These should not be 1.13 blocking. We'll continue to look into getting them non-flaky today. In the meantime, we can mark these tests as flaky to reduce the noise, if you'd like.

@msau42 @mkimuram looks like the same issue (#71433) is still being hit.

See link:

Nov 27 15:15:14.290: INFO: At 2018-11-27 15:13:07 +0000 UTC - event for pod-subpath-test-csi-hostpath-dynamicpv-ps95: {kubelet bootstrap-e2e-minion-group-zstx} FailedMount: MountVolume.NewMounter initialization failed for volume "pvc-e58454c7-f256-11e8-98b9-42010a840002" : driver name csi-hostpath-e2e-tests-csi-volumes-25s6v not found in the list of registered CSI drivers

And logs from driver-registrar (link):

==== start of log for container csi-hostpathplugin-d87xg/driver-registrar ====
I1127 15:12:55.163691       1 main.go:138] Version: v1.0.1-0-g78b5c12
I1127 15:12:55.163794       1 main.go:145] Attempting to open a gRPC connection with: "/csi/csi.sock"
I1127 15:12:55.163812       1 connection.go:68] Connecting to /csi/csi.sock
I1127 15:12:55.165359       1 connection.go:95] Still trying, connection is CONNECTING
I1127 15:12:55.165709       1 connection.go:95] Still trying, connection is TRANSIENT_FAILURE
I1127 15:12:56.166315       1 connection.go:95] Still trying, connection is CONNECTING
I1127 15:12:56.166544       1 connection.go:95] Still trying, connection is TRANSIENT_FAILURE
I1127 15:12:57.213386       1 connection.go:95] Still trying, connection is TRANSIENT_FAILURE
I1127 15:12:58.389699       1 connection.go:95] Still trying, connection is CONNECTING
I1127 15:13:55.165212       1 connection.go:88] Connection timed out
I1127 15:13:55.165274       1 main.go:153] Calling CSI driver to discover driver name.

We need to make the tests handle this condition more gracefully. The suggestion in #71433 (comment) is one possible solution.

@saad-ali
Member

Would be good to capture this as a known issue in release notes.

@AishSundar
Contributor

@marpaia to follow up on capturing this in the release notes

@saad-ali
Member

saad-ali commented Nov 27, 2018

@mkimuram, per #71474 (review), can you create a targeted fix increasing the timeout?

Looks like marking these tests as flaky will be difficult for 1.13, but if that fix doesn't resolve the flakes we will need to do that.

@msau42
Member

msau42 commented Nov 28, 2018

Short term fix to bump the test timeout: #71483

@AishSundar
Contributor

As per the discussion in the release burndown meeting, this is not a 1.13 blocker.

@jberkus please move this issue to 1.14 if there are no pending issues you are concerned about

@msau42
Member

msau42 commented Nov 28, 2018

Ok, looks like the test still flaked. Looking at https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-slow/22043, the mount succeeded after a minute and proceeded to subpath processing, which (correctly) failed. However, for some reason, the test did not see the subpath event and still failed after 4 minutes.

On kubelet, subpath processing failed after a minute and generated the event:

I1128 10:48:43.480646    1581 server.go:459] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-csi-volumes-2kcwz", Name:"pod-subpath-test-csi-hostpath-v0-dynamicpv-g7c4", UID:"09cd59c8-f2fb-11e8-83d2-42010a840002", APIVersion:"v1", ResourceVersion:"9277", FieldPath:"spec.containers{test-container-subpath-csi-hostpath-v0-dynamicpv-g7c4}"}): type: 'Warning' reason: 'Failed' Error: failed to create subPath directory for volumeMount "test-volume" of container "test-container-subpath-csi-hostpath-v0-dynamicpv-g7c4"

But the test doesn't see it for some reason, and still times out after 4 more minutes:

Nov 28 10:47:44.762: INFO: PersistentVolumeClaim pvc-r8nl9 found and phase=Bound (6.420787981s)
[It] should fail if non-existent subpath is outside the volume [Slow]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go:262
STEP: Creating pod pod-subpath-test-csi-hostpath-v0-dynamicpv-g7c4
STEP: Checking for subpath error event
Nov 28 10:52:45.399: INFO: Deleting pod "pod-subpath-test-csi-hostpath-v0-dynamicpv-g7c4" in namespace "e2e-tests-csi-volumes-2kcwz"

@saad-ali
Member

I saw the same on https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new-parallel/422

CSI volume mount succeeded at:

I1128 12:22:26.251674    1451 operation_generator.go:567] MountVolume.SetUp succeeded for volume "pvc-1e53cb5f-f308-11e8-b109-42010a800002" (UniqueName: "kubernetes.io/csi/csi-hostpath-v0-e2e-tests-csi-volumes-ghd78^2079fc5e-f308-11e8-89b9-426ec44f83b8") pod "pod-subpath-test-csi-hostpath-v0-dynamicpv-s8xl" (UID: "20c6ef48-f308-11e8-b109-42010a800002")

The kubelet correctly failed the pod because of the subpath error:

I1128 12:22:45.494961    1451 server.go:459] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-csi-volumes-ghd78", Name:"pod-subpath-test-csi-hostpath-v0-dynamicpv-s8xl", UID:"20c6ef48-f308-11e8-b109-42010a800002", APIVersion:"v1", ResourceVersion:"38952", FieldPath:"spec.containers{test-container-subpath-csi-hostpath-v0-dynamicpv-s8xl}"}): type: 'Warning' reason: 'Failed' Error: failed to prepare subPath for volumeMount "test-volume" of container "test-container-subpath-csi-hostpath-v0-dynamicpv-s8xl"

But the test still failed:

Nov 28 12:26:27.070: INFO: Deleting pod "pod-subpath-test-csi-hostpath-v0-dynamicpv-s8xl" in namespace "e2e-tests-csi-volumes-ghd78"

Because the test failed to find the pod event.

This looks like a test issue. Not a 1.13 blocker.

@saad-ali
Member

@mkimuram can you PTAL at this

@msau42
Member

msau42 commented Nov 28, 2018

Is there a way to check whether the event system is overloaded and unable to handle new events? The MountVolume retries while the driver is not registered are very frequent and could have overloaded the event server.

@saad-ali
Member

Checking for the existence of an event seems like the wrong way to test this. Would checking pod status alone be insufficient? Ultimately that is what we really care about.

@msau42
Member

msau42 commented Nov 28, 2018

Checking for the specific event guarantees that we are failing with the correct error. But perhaps events are not reliable enough to depend on. Checking if the Pod never came up running would probably reduce the flakes but could potentially mask false negatives. cc @justinsb for any thoughts.
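
A minimal sketch of the alternative being discussed here, under the assumption that these negative tests only need to verify the pod never starts: assert on pod status directly instead of depending on a specific event being recorded. The helper name (assertPodNeverRuns) is hypothetical and this is not the framework's actual code; as noted above, this approach reduces event-related flakes but can mask the exact failure reason.

```go
package e2eutil

import (
	"context"
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// assertPodNeverRuns polls the pod for the whole window and returns an error
// if the pod ever reaches Running or Succeeded. A poll timeout means the pod
// never ran, which is the expected (passing) outcome for a test where subpath
// setup is supposed to fail.
func assertPodNeverRuns(c kubernetes.Interface, ns, podName string, window time.Duration) error {
	err := wait.PollImmediate(5*time.Second, window, func() (bool, error) {
		pod, err := c.CoreV1().Pods(ns).Get(context.TODO(), podName, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		if pod.Status.Phase == v1.PodRunning || pod.Status.Phase == v1.PodSucceeded {
			return true, nil // pod came up: this is the failure case for the test
		}
		return false, nil
	})
	if err == nil {
		return fmt.Errorf("pod %s/%s unexpectedly started running", ns, podName)
	}
	if err == wait.ErrWaitTimeout {
		return nil // pod never ran within the window: expected
	}
	return err
}
```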

@msau42
Member

msau42 commented Nov 29, 2018

Tests are still flaking, and it seems like storage tests in the slow job in general have been flaking more since 11/21 or so: https://k8s-testgrid.appspot.com/sig-storage#gce-slow

My theory is that the driver registration delay is causing a flood of events, which causes tests that check for events to fail. I think this is revealing multiple issues in both the tests and the product, and we should work on fixing them all, although any one of the fixes should mitigate the test flakes.

@msau42
Member

msau42 commented Dec 18, 2018

#71570 and #71569 have been merged and will address the biggest issues. #71570 has already been backported to 1.13; the #71569 backport is still pending.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 18, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 17, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
