gce pd WaitForAttach never returned #63156

Open
msau42 opened this issue Apr 25, 2018 · 13 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@msau42
Member

msau42 commented Apr 25, 2018

Is this a BUG REPORT or FEATURE REQUEST?:
@kubernetes/sig-storage-bugs

What happened:
Encountered a strange test failure where WaitForAttach never returned.

Test logs: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/63045/pull-kubernetes-e2e-gce/32404/

The volume got successfully attached to the node:

I0425 00:03:50.491024       1 pv_controller.go:1432] volume "pvc-1f70b335-481c-11e8-a540-42010a280002" for claim "e2e-tests-subpath-bhnsz/pvc-c66qg" created
I0425 00:03:50.491047       1 pv_controller.go:1449] provisionClaimOperation [e2e-tests-subpath-bhnsz/pvc-c66qg]: trying to save volume pvc-1f70b335-481c-11e8-a540-42010a280002

I0425 00:03:53.196916       1 reconciler.go:285] attacherDetacher.AttachVolume started for volume "pvc-1f70b335-481c-11e8-a540-42010a280002" (UniqueName: "kubernetes.io/gce-pd/e2e-63045-674b9-dynami-pvc-1f70b335-481c-11e8-a540-42010a280002") from node "e2e-63045-674b9-minion-group-rcbf" 
I0425 00:03:58.170436       1 operation_generator.go:332] AttachVolume.Attach succeeded for volume "pvc-1f70b335-481c-11e8-a540-42010a280002" (UniqueName: "kubernetes.io/gce-pd/e2e-63045-674b9-dynami-pvc-1f70b335-481c-11e8-a540-42010a280002") from node "e2e-63045-674b9-minion-group-rcbf"
I0425 00:03:58.170499       1 actual_state_of_world.go:464] Report volume "kubernetes.io/gce-pd/e2e-63045-674b9-dynami-pvc-1f70b335-481c-11e8-a540-42010a280002" as attached to node "e2e-63045-674b9-minion-group-rcbf"
I0425 00:03:58.172518       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-subpath-bhnsz", Name:"pod-subpath-test-gcepdpvc-j2v7", UID:"1f71bc74-481c-11e8-a540-42010a280002", APIVersion:"v1", ResourceVersion:"6789", FieldPath:""}): type: 'Normal' reason: 'SuccessfulAttachVolume' AttachVolume.Attach succeeded for volume "pvc-1f70b335-481c-11e8-a540-42010a280002"

Kubelet logs show it detected the attached volume; however, WaitForAttach never returned:

I0425 00:04:01.880943    1163 operation_generator.go:1170] Controller attach succeeded for volume "pvc-1f70b335-481c-11e8-a540-42010a280002" (UniqueName: "kubernetes.io/gce-pd/e2e-63045-674b9-dynami-pvc-1f70b335-481c-11e8-a540-42010a280002") pod "pod-subpath-test-gcepdpvc-j2v7" (UID: "1f71bc74-481c-11e8-a540-42010a280002") device path: "/dev/disk/by-id/google-e2e-63045-674b9-dynami-pvc-1f70b335-481c-11e8-a540-42010a280002"
I0425 00:04:01.895746    1163 reconciler.go:237] Starting operationExecutor.MountVolume for volume "pvc-1f70b335-481c-11e8-a540-42010a280002" (UniqueName: "kubernetes.io/gce-pd/e2e-63045-674b9-dynami-pvc-1f70b335-481c-11e8-a540-42010a280002") pod "pod-subpath-test-gcepdpvc-j2v7" (UID: "1f71bc74-481c-11e8-a540-42010a280002")
I0425 00:04:01.895856    1163 reconciler.go:252] operationExecutor.MountVolume started for volume "pvc-1f70b335-481c-11e8-a540-42010a280002" (UniqueName: "kubernetes.io/gce-pd/e2e-63045-674b9-dynami-pvc-1f70b335-481c-11e8-a540-42010a280002") pod "pod-subpath-test-gcepdpvc-j2v7" (UID: "1f71bc74-481c-11e8-a540-42010a280002")

I0425 00:04:01.895978    1163 operation_generator.go:486] MountVolume.WaitForAttach entering for volume "pvc-1f70b335-481c-11e8-a540-42010a280002" (UniqueName: "kubernetes.io/gce-pd/e2e-63045-674b9-dynami-pvc-1f70b335-481c-11e8-a540-42010a280002") pod "pod-subpath-test-gcepdpvc-j2v7" (UID: "1f71bc74-481c-11e8-a540-42010a280002") DevicePath "/dev/disk/by-id/google-e2e-63045-674b9-dynami-pvc-1f70b335-481c-11e8-a540-42010a280002"


I0425 00:04:01.996191    1163 reconciler.go:237] Starting operationExecutor.MountVolume for volume "pvc-1f70b335-481c-11e8-a540-42010a280002" (UniqueName: "kubernetes.io/gce-pd/e2e-63045-674b9-dynami-pvc-1f70b335-481c-11e8-a540-42010a280002") pod "pod-subpath-test-gcepdpvc-j2v7" (UID: "1f71bc74-481c-11e8-a540-42010a280002")
I0425 00:04:02.096800    1163 reconciler.go:237] Starting operationExecutor.MountVolume for volume "pvc-1f70b335-481c-11e8-a540-42010a280002" (UniqueName: "kubernetes.io/gce-pd/e2e-63045-674b9-dynami-pvc-1f70b335-481c-11e8-a540-42010a280002") pod "pod-subpath-test-gcepdpvc-j2v7" (UID: "1f71bc74-481c-11e8-a540-42010a280002")
...
@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. kind/bug Categorizes issue or PR as related to a bug. labels Apr 25, 2018
@saad-ali
Member

saad-ali commented Apr 25, 2018

Weird. MountVolume should not start unless WaitForAttach returns without error.

But it does appear to be triggered multiple times:

I0425 00:04:01.996191    1163 reconciler.go:237] Starting operationExecutor.MountVolume for volume "pvc-1f70b335-481c-11e8-a540-42010a280002" (UniqueName: "kubernetes.io/gce-pd/e2e-63045-674b9-dynami-pvc-1f70b335-481c-11e8-a540-42010a280002") pod "pod-subpath-test-gcepdpvc-j2v7" (UID: "1f71bc74-481c-11e8-a540-42010a280002")

And there is no "Successfully found attached GCE PD" message indicating WaitForAttach completed successfully, nor any message indicating it errored. Curious indeed.

Log: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/63045/pull-kubernetes-e2e-gce/32404/artifacts/e2e-63045-674b9-minion-group-rcbf/kubelet.log

@saad-ali
Member

This flake only appears to happen on the "Subpath [Volume type: gcePDPVC] should support existing single file" test. It would be interesting to compare a good run with a bad run to determine whether this is a real issue or a red herring (possibly just a logging issue).

@msau42
Member Author

msau42 commented Apr 25, 2018

I don't think a logging issue would cause the test to fail due to Mount not happening.

I do see other gce pd tests flaking, but I haven't looked into the others to see if they're the same issue: https://k8s-testgrid.appspot.com/sig-storage#gce

@jingxu97
Contributor

jingxu97 commented May 9, 2018

I checked a couple of failed tests. Both of them have the "MountVolume.WaitForAttach entering for volume ..." log line, but after that there is no success or failure message. It seems like WaitForAttach was hanging. All of the following operationExecutor.MountVolume operations fail to start because the first operation still exists.
I checked the WaitForAttach function in the GCE PD attacher; there are some file system calls, including filepath.Glob(). Maybe this call is taking a very long time?
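
For context, a minimal sketch of what a Glob-based device poll can look like (this is not the actual GCE PD attacher code; the pattern, interval, and timeout below are made up for illustration). The structural risk it shows is that the timeout is only checked between iterations, so if filepath.Glob itself blocks, the loop never returns and the operation hangs exactly the way the logs above suggest:

package main

import (
	"fmt"
	"path/filepath"
	"time"
)

// waitForDevicePath polls for a /dev/disk/by-id symlink matching pattern.
// If filepath.Glob blocks (for example on a wedged udev-managed path), this
// loop never reaches its deadline check and the caller hangs indefinitely.
func waitForDevicePath(pattern string, interval, timeout time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for {
		matches, err := filepath.Glob(pattern) // blocking filesystem call
		if err == nil && len(matches) > 0 {
			return matches[0], nil
		}
		if time.Now().After(deadline) {
			return "", fmt.Errorf("timed out waiting for a device matching %q", pattern)
		}
		time.Sleep(interval)
	}
}

func main() {
	path, err := waitForDevicePath("/dev/disk/by-id/google-*pvc-1f70b335*", time.Second, 10*time.Minute)
	fmt.Println(path, err)
}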

@msau42
Member Author

msau42 commented May 9, 2018

Another suspicious operation I see that could block is invoking udevadm here.

I think we probably need to manually try to stress and reproduce this, so we have a chance to log in and see where it's stuck.
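
For illustration, one general way to keep an external udevadm invocation from blocking forever is to run it under a context with a deadline. This is only a sketch of that pattern, not what the plugin currently does; the 30-second budget and the exact udevadm arguments are assumptions:

package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// triggerUdev runs `udevadm trigger --action=change` with a hard deadline so a
// wedged udev daemon cannot block the caller indefinitely; the child process is
// killed once the context expires.
func triggerUdev(timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	out, err := exec.CommandContext(ctx, "udevadm", "trigger", "--action=change").CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return fmt.Errorf("udevadm trigger timed out after %v: %s", timeout, out)
	}
	if err != nil {
		return fmt.Errorf("udevadm trigger failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	if err := triggerUdev(30 * time.Second); err != nil {
		fmt.Println(err)
	}
}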

@saad-ali
Member

I ran the "[sig-storage] Subpath [Volume type: gcePDPVC] should support existing single file" test back to back in a loop for 18 hours without failure.

@msau42 suggests the flake may repro if we run multiple volume tests in parallel. Will try that when I have some time.

@verult
Contributor

verult commented May 21, 2018

Ran the tests with the following command:
for i in $(seq 1 500); do GINKGO_PARALLEL=y go run hack/e2e.go -- --test --test_args="--ginkgo.skip=\[Serial\]|\[Disruptive\] --ginkgo.focus=Pod\sDisks|PD --delete-namespace-on-failure=false"; done

A good number of iterations have failed tests, and all of the failures I've seen so far are from Subpath [Volume type: gcePDPVC].

@jingxu97
Contributor

@verult thanks for the test. Maybe we can try to add some logs to the WaitForAttach() function, since that's where we suspect the problem is.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 19, 2018
@msau42
Member Author

msau42 commented Aug 23, 2018

/remove-lifecycle stale

We should at least add some logs. Currently we cannot tell the difference between WaitForAttach hanging and the device path never appearing.
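
As an example of the kind of instrumentation that would disambiguate the two cases (a hypothetical sketch only, not the change in #68022): logging before and after each poll attempt means repeated "0 matches" lines indicate the device path never appeared, while a "checking" line with no follow-up indicates the filesystem call itself is stuck.

package main

import (
	"log"
	"path/filepath"
	"time"
)

// pollWithLogging is a hypothetical instrumentation sketch: it logs before and
// after every filepath.Glob call, so the kubelet log would show either repeated
// "0 matches" lines (device path never appeared) or a trailing "checking" line
// with no result line after it (the Glob call itself is blocked).
func pollWithLogging(pattern string, interval time.Duration, attempts int) (string, bool) {
	for i := 1; i <= attempts; i++ {
		log.Printf("WaitForAttach attempt %d: checking %q", i, pattern)
		matches, err := filepath.Glob(pattern)
		log.Printf("WaitForAttach attempt %d: %d matches, err=%v", i, len(matches), err)
		if err == nil && len(matches) > 0 {
			return matches[0], true
		}
		time.Sleep(interval)
	}
	return "", false
}

func main() {
	pollWithLogging("/dev/disk/by-id/google-*", time.Second, 5)
}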

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 23, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 22, 2018
@msau42
Member Author

msau42 commented Nov 22, 2018

Logs are being added here: #68022

We also know of an instance where format could hang if the disk got detached during the operation: #71145.

@msau42
Member Author

msau42 commented Nov 22, 2018

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 22, 2018