
Flaking Test: subpath failures in new-master-upgrade-cluster-new-parallel, other jobs #71383

Closed
jberkus opened this issue Nov 23, 2018 · 31 comments · Fixed by #71428
Labels
kind/flake Categorizes issue or PR as related to a flaky test. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/storage Categorizes an issue or PR as relevant to SIG Storage.
Comments

@jberkus

jberkus commented Nov 23, 2018

Which jobs are failing: gce-new-master-upgrade-cluster-new-parallel

Which test(s) are failing:

Varies, but all subpath failures, including:

  • [sig-storage] CSI Volumes [Driver: csi-hostpath-v0] [Testpattern: Dynamic PV (default fs)] subPath should fail if subpath with backstepping is outside the volume [Slow]
  • [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] subPath should fail if subpath directory is outside the volume [Slow]
  • [sig-storage] CSI Volumes [Driver: csi-hostpath-v0] [Testpattern: Dynamic PV (default fs)] subPath should fail if subpath file is outside the volume [Slow]
  • [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] subPath should fail if subpath file is outside the volume [Slow]

... and pretty much every other subpath test, but never all of them at once.

There are also a few other storage tests failing, such as:

  • [sig-storage] Volume expand [Slow] Verify if editing PVC allows resize
  • [sig-storage] Detaching volumes should not work when mount is in progress

Since when has it been failing: 11/22

Testgrid link: https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-new-master-upgrade-cluster-new-parallel&include-filter-by-regex=.*CSI.*&width=20

Reason for failure:

These flakes started around the time that #71314 merged, but they don't line up with the exact merge timestamp, so it's probably coincidental.

The subpath test failures seem to be mostly timeouts:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go:254
while waiting for failed event to occur
Expected error:
    <*errors.errorString | 0xc0000d1860>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go:601

... so possibly this is just GCE fail.

Anything else we need to know:

This test job has always been flaky, with around a 40% failure rate.

/kind flake
/sig storage
/priority important-soon

cc
@saad-ali @AishSundar @liggitt

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. sig/storage Categorizes an issue or PR as relevant to SIG Storage. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 23, 2018
@saad-ali
Member

@msau42 @mkimuram can you please help investigate this.

It looks to me like the subPath should...outside the volume tests are flaking for csi-hostpath-v0 and csi-hostpath. I wonder if the v0 tests are racing with the non-v0 tests for something in this test?

@saad-ali
Member

saad-ali commented Nov 26, 2018

The other subPath should...outside the volume tests, including those for the other CSI driver (pd.csi.storage.gke.io) and the non-CSI in-tree plugins, are solid green, which also makes me think this is an issue with the hostpath tests or driver.

@liggitt liggitt added this to the v1.13 milestone Nov 26, 2018
@liggitt
Member

liggitt commented Nov 26, 2018

adding to the milestone for triage since the flakes look new

@jberkus jberkus changed the title Flaking Test: subpath failures in Flaking Test: subpath failures in new-master-upgrade-cluster-new-parallel, other jobs Nov 26, 2018
@mkimuram
Contributor

@saad-ali @msau42

All four of these subpath tests call testPodFailSubpathError, which fails while looking for a specific failed event via WaitTimeoutForPodEvent. WaitTimeoutForPodEvent uses eventOccured to check whether the specific error event has happened, but eventOccured only checks the first event, so I suspect the tests may flake if another failed event happens before the expected one.

Also, since WaitTimeoutForPodEvent already waits for the event, we might be able to remove the WaitForPodRunningInNamespace call from testPodFailSubpathError.

I will create a PR to fix the above.
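
For illustration, a minimal sketch of the idea described above: scan all of the pod's events for the expected message rather than only the first one returned. This is not the actual e2e framework code; the helper name (waitForPodEventMatching) is hypothetical, and it assumes a client-go release whose typed clients take a context.

```go
package e2eutil

import (
	"context"
	"fmt"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodEventMatching polls the events attached to a pod until one of them
// contains msg, or the timeout expires. Checking every event, instead of only
// the first, avoids flaking when an unrelated warning (for example a
// FailedMount retry) is recorded before the expected subpath failure.
func waitForPodEventMatching(c kubernetes.Interface, ns, podName, msg string, timeout time.Duration) error {
	selector := fmt.Sprintf("involvedObject.name=%s,involvedObject.namespace=%s", podName, ns)
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		events, err := c.CoreV1().Events(ns).List(context.TODO(), metav1.ListOptions{FieldSelector: selector})
		if err != nil {
			return false, err
		}
		for _, e := range events.Items {
			// Scan all events, not just the first one.
			if strings.Contains(e.Message, msg) {
				return true, nil
			}
		}
		return false, nil
	})
}
```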

@AishSundar
Contributor

/priority critical-urgent
/remove-priority important-soon
/milestone v1.13

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 26, 2018
@msau42
Member

msau42 commented Nov 26, 2018

@mkimuram while we should update the test with your PR, I'm not sure if it's going to fix the failure. Looking at one failing test log, these are the events I see on the failing pod:

https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new-parallel/400

Nov 26 15:34:46.034: INFO: At 2018-11-26 15:27:34 +0000 UTC - event for pod-subpath-test-csi-hostpath-dynamicpv-mgnk: {attachdetach-controller } SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-c9e516f1-f18f-11e8-9a1b-42010a800002" 
Nov 26 15:34:46.034: INFO: At 2018-11-26 15:27:50 +0000 UTC - event for pod-subpath-test-csi-hostpath-dynamicpv-mgnk: {kubelet bootstrap-e2e-minion-group-lztn} FailedMount: MountVolume.NewMounter initialization failed for volume "pvc-c9e516f1-f18f-11e8-9a1b-42010a800002" : driver name csi-hostpath-e2e-tests-csi-volumes-92vsl not found in the list of registered CSI drivers

So it seems to be failing before even reaching the subpath logic.

@msau42
Member

msau42 commented Nov 26, 2018

Timeline of events:

Test pod started getting processed by kubelet:

I1126 15:27:34.537403    1446 config.go:414] Receiving a new pod "pod-subpath-test-csi-hostpath-dynamicpv-mgnk_e2e-tests-csi-volumes-92vsl(cc5409be-f18f-11e8-9a1b-42010a800002)"

It took a minute after the test pod was created for the CSI driver to finish registering and the mount to succeed, and then for the subpath event to trigger:

I1126 15:28:32.392198    1446 csi_plugin.go:116] kubernetes.io/csi: Trying to validate a new CSI Driver with name: csi-hostpath-e2e-tests-csi-volumes-92vsl endpoint: /var/lib/kubelet/plugins/csi-hostpath-e2e-tests-csi-volumes-92vsl/csi.sock versions: 1.0.0, foundInDeprecatedDir: false
I1126 15:28:32.392254    1446 csi_plugin.go:137] kubernetes.io/csi: Register new plugin with name: csi-hostpath-e2e-tests-csi-volumes-92vsl at endpoint: /var/lib/kubelet/plugins/csi-hostpath-e2e-tests-csi-volumes-92vsl/csi.sock

I1126 15:28:32.433217    1446 operation_generator.go:567] MountVolume.SetUp succeeded for volume "pvc-c9e516f1-f18f-11e8-9a1b-42010a800002" (UniqueName: "kubernetes.io/csi/csi-hostpath-e2e-tests-csi-volumes-92vsl^cbfcc214-f18f-11e8-9505-42010a800005") pod "pod-subpath-test-csi-hostpath-dynamicpv-mgnk" (UID: "cc5409be-f18f-11e8-9a1b-42010a800002")
I1126 15:28:32.465509    1446 volume_manager.go:383] All volumes are attached and mounted for pod "pod-subpath-test-csi-hostpath-dynamicpv-mgnk_e2e-tests-csi-volumes-92vsl(cc5409be-f18f-11e8-9a1b-42010a800002)"
I1126 15:28:34.082064    1446 server.go:459] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-csi-volumes-92vsl", Name:"pod-subpath-test-csi-hostpath-dynamicpv-mgnk", UID:"cc5409be-f18f-11e8-9a1b-42010a800002", APIVersion:"v1", ResourceVersion:"47363", FieldPath:"spec.containers{test-container-subpath-csi-hostpath-dynamicpv-mgnk}"}): type: 'Warning' reason: 'Failed' Error: failed to prepare subPath for volumeMount "test-volume" of container "test-container-subpath-csi-hostpath-dynamicpv-mgnk"

So #71428 should address the issue. I was initially confused because the test logs didn't show the subpath event for some reason.

@msau42
Member

msau42 commented Nov 26, 2018

I opened #71433 to investigate whether 1 minute for CSI plugin registration is expected.

@jberkus
Author

jberkus commented Nov 26, 2018

I've added a tracker issue for the complete list of storage flakes: #71434

This is so that we can track whether we're addressing most of the flakes with the subpath fixes, or if additional fixes are required.

@liggitt
Member

liggitt commented Nov 27, 2018

keeping open until flakes are verified cleaned up

@liggitt liggitt reopened this Nov 27, 2018
@AishSundar
Contributor

/reopen

Reopening this issue until we can verify the fix in CI

@AishSundar
Contributor

As of this morning we are seeing CSI "subpath should fail" flakes in the following job

@msau42 @saad-ali can you help us determine whether these cases are flaky tests that you are fixing, or whether there is an underlying CSI bug that needs addressing?

@saad-ali
Member

@AishSundar The root cause of this is a delay in an external sidecar container and the tests not waiting long enough to accommodate the delay. There are no changes required in kubernetes/kubernetes (outside of the tests) to handle this. These should not be 1.13 blocking. We'll continue to look into getting them non-flaky today. In the meantime, we can mark these tests as flaky to reduce the noise, if you'd like.

@msau42 @mkimuram looks like the same issue (#71433) is still being hit.

See link:

Nov 27 15:15:14.290: INFO: At 2018-11-27 15:13:07 +0000 UTC - event for pod-subpath-test-csi-hostpath-dynamicpv-ps95: {kubelet bootstrap-e2e-minion-group-zstx} FailedMount: MountVolume.NewMounter initialization failed for volume "pvc-e58454c7-f256-11e8-98b9-42010a840002" : driver name csi-hostpath-e2e-tests-csi-volumes-25s6v not found in the list of registered CSI drivers

And logs from driver-registrar (link):

==== start of log for container csi-hostpathplugin-d87xg/driver-registrar ====
I1127 15:12:55.163691       1 main.go:138] Version: v1.0.1-0-g78b5c12
I1127 15:12:55.163794       1 main.go:145] Attempting to open a gRPC connection with: "/csi/csi.sock"
I1127 15:12:55.163812       1 connection.go:68] Connecting to /csi/csi.sock
I1127 15:12:55.165359       1 connection.go:95] Still trying, connection is CONNECTING
I1127 15:12:55.165709       1 connection.go:95] Still trying, connection is TRANSIENT_FAILURE
I1127 15:12:56.166315       1 connection.go:95] Still trying, connection is CONNECTING
I1127 15:12:56.166544       1 connection.go:95] Still trying, connection is TRANSIENT_FAILURE
I1127 15:12:57.213386       1 connection.go:95] Still trying, connection is TRANSIENT_FAILURE
I1127 15:12:58.389699       1 connection.go:95] Still trying, connection is CONNECTING
I1127 15:13:55.165212       1 connection.go:88] Connection timed out
I1127 15:13:55.165274       1 main.go:153] Calling CSI driver to discover driver name.

We need to make the tests handle this condition more gracefully. The suggestion in #71433 (comment) is one possible solution.

@saad-ali
Member

Would be good to capture this as a known issue in release notes.

@AishSundar
Contributor

@marpaia to follow up on capturing this in the release notes

@saad-ali
Member

saad-ali commented Nov 27, 2018

@mkimuram, per #71474 (review), can you create a targeted fix increasing the timeout?

Looks like marking these tests as flaky will be difficult for 1.13, but if that fix doesn't resolve the flakes we will need to do that.

@msau42
Member

msau42 commented Nov 28, 2018

Short term fix to bump the test timeout: #71483

@AishSundar
Contributor

As per the discussion in the release burndown meeting, this is not a 1.13 blocker.

@jberkus please move this issue to 1.14 if there are no pending issues you are concerned about

@msau42
Member

msau42 commented Nov 28, 2018

Ok, looks like the test still flaked. Looking at https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-slow/22043, the mount succeeded after a minute and proceeded to subpath processing, which (correctly) failed. However, for some reason, the test did not see the subpath event and still failed after 4 minutes.

On kubelet, subpath processing failed after a minute and generated the event:

I1128 10:48:43.480646    1581 server.go:459] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-csi-volumes-2kcwz", Name:"pod-subpath-test-csi-hostpath-v0-dynamicpv-g7c4", UID:"09cd59c8-f2fb-11e8-83d2-42010a840002", APIVersion:"v1", ResourceVersion:"9277", FieldPath:"spec.containers{test-container-subpath-csi-hostpath-v0-dynamicpv-g7c4}"}): type: 'Warning' reason: 'Failed' Error: failed to create subPath directory for volumeMount "test-volume" of container "test-container-subpath-csi-hostpath-v0-dynamicpv-g7c4"

But the test doesn't see it for some reason, and still times out after 4 more minutes:

Nov 28 10:47:44.762: INFO: PersistentVolumeClaim pvc-r8nl9 found and phase=Bound (6.420787981s)
[It] should fail if non-existent subpath is outside the volume [Slow]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go:262
STEP: Creating pod pod-subpath-test-csi-hostpath-v0-dynamicpv-g7c4
STEP: Checking for subpath error event
Nov 28 10:52:45.399: INFO: Deleting pod "pod-subpath-test-csi-hostpath-v0-dynamicpv-g7c4" in namespace "e2e-tests-csi-volumes-2kcwz"

@saad-ali
Member

I saw the same on https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new-parallel/422

CSI volume mount succeeded at:

I1128 12:22:26.251674    1451 operation_generator.go:567] MountVolume.SetUp succeeded for volume "pvc-1e53cb5f-f308-11e8-b109-42010a800002" (UniqueName: "kubernetes.io/csi/csi-hostpath-v0-e2e-tests-csi-volumes-ghd78^2079fc5e-f308-11e8-89b9-426ec44f83b8") pod "pod-subpath-test-csi-hostpath-v0-dynamicpv-s8xl" (UID: "20c6ef48-f308-11e8-b109-42010a800002")

The kubelet correctly failed the pod because of the subpath error:

I1128 12:22:45.494961    1451 server.go:459] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-csi-volumes-ghd78", Name:"pod-subpath-test-csi-hostpath-v0-dynamicpv-s8xl", UID:"20c6ef48-f308-11e8-b109-42010a800002", APIVersion:"v1", ResourceVersion:"38952", FieldPath:"spec.containers{test-container-subpath-csi-hostpath-v0-dynamicpv-s8xl}"}): type: 'Warning' reason: 'Failed' Error: failed to prepare subPath for volumeMount "test-volume" of container "test-container-subpath-csi-hostpath-v0-dynamicpv-s8xl"

But the test still failed:

Nov 28 12:26:27.070: INFO: Deleting pod "pod-subpath-test-csi-hostpath-v0-dynamicpv-s8xl" in namespace "e2e-tests-csi-volumes-ghd78"

Because the test failed to find the pod event.

This looks like a test issue. Not a 1.13 blocker.

@saad-ali
Member

@mkimuram can you PTAL at this

@msau42
Member

msau42 commented Nov 28, 2018

Is there a way to check whether the event system is overloaded and unable to handle new events? The MountVolume retries while the driver is not registered are very frequent and could have overloaded the event server.

@saad-ali
Member

Checking for the existence of an event seems like the wrong way to test this. Would checking pod status alone be insufficient? Ultimately that is what we really care about.

@msau42
Member

msau42 commented Nov 28, 2018

Checking for the specific event guarantees that we are failing with the correct error. But perhaps events are not reliable enough to depend on. Checking if the Pod never came up running would probably reduce the flakes but could potentially mask false negatives. cc @justinsb for any thoughts.
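
A minimal sketch of the alternative being discussed here, under the assumption that these negative tests only need to verify the pod never starts: assert on pod status directly instead of depending on a specific event being recorded. The helper name (assertPodNeverRuns) is hypothetical and this is not the framework's actual code; as noted above, this approach reduces event-related flakes but can mask the exact failure reason.

```go
package e2eutil

import (
	"context"
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// assertPodNeverRuns polls the pod for the whole window and returns an error
// if the pod ever reaches Running or Succeeded. A poll timeout means the pod
// never ran, which is the expected (passing) outcome for a test where subpath
// setup is supposed to fail.
func assertPodNeverRuns(c kubernetes.Interface, ns, podName string, window time.Duration) error {
	err := wait.PollImmediate(5*time.Second, window, func() (bool, error) {
		pod, err := c.CoreV1().Pods(ns).Get(context.TODO(), podName, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		if pod.Status.Phase == v1.PodRunning || pod.Status.Phase == v1.PodSucceeded {
			return true, nil // pod came up: this is the failure case for the test
		}
		return false, nil
	})
	if err == nil {
		return fmt.Errorf("pod %s/%s unexpectedly started running", ns, podName)
	}
	if err == wait.ErrWaitTimeout {
		return nil // pod never ran within the window: expected
	}
	return err
}
```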

@msau42
Member

msau42 commented Nov 29, 2018

Tests are still flaking, and it seems like storage tests in the slow job in general have been flaking more since 11/21 or so: https://k8s-testgrid.appspot.com/sig-storage#gce-slow

My theory is that the driver registration delay is causing a flood of events, which causes tests that check for events to fail. I think this is revealing multiple issues in both the tests and the product, and we should work on fixing them all, although any one of the fixes should mitigate the test flakes.

@msau42
Member

msau42 commented Dec 18, 2018

#71570 and #71569 have been merged and will address the biggest issues. #71570 has already been backported to 1.13; the #71569 backport is still pending.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 18, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 17, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
