external-attacher does not see nodeID annotation for 2 min? #102

Closed · saad-ali opened this issue Nov 29, 2018 · 8 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@saad-ali (Member)

This is an analysis of the following test flake: https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-serial/4398

Kubelet sees new plugin and registers it:
I1128 12:33:16.974128    6672 csi_plugin.go:116] kubernetes.io/csi: Trying to validate a new CSI Driver with name: pd.csi.storage.gke.io endpoint: /var/lib/kubelet/plugins/pd.csi.storage.gke.io/csi.sock versions: 1.0.0, foundInDeprecatedDir: false
I1128 12:33:16.974342    6672 csi_plugin.go:137] kubernetes.io/csi: Register new plugin with name: pd.csi.storage.gke.io at endpoint: /var/lib/kubelet/plugins/pd.csi.storage.gke.io/csi.sock

attach-detach controller is waiting for attach:
E1128 12:34:31.327964       1 csi_attacher.go:227] kubernetes.io/csi: attachment for projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/disks/pvc-a2f50d29-f309-11e8-802b-42010a800002 failed: cannot find NodeID for driver "pd.csi.storage.gke.io" for node "bootstrap-e2e-minion-group-r7dh"
E1128 12:34:31.328073       1 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/disks/pvc-a2f50d29-f309-11e8-802b-42010a800002\"" failed. No retries permitted until 2018-11-28 12:36:33.328041594 +0000 UTC m=+4563.274204929 (durationBeforeRetry 2m2s). Error: "AttachVolume.Attach failed for volume \"pvc-a2f50d29-f309-11e8-802b-42010a800002\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/disks/pvc-a2f50d29-f309-11e8-802b-42010a800002\") from node \"bootstrap-e2e-minion-group-r7dh\" : cannot find NodeID for driver \"pd.csi.storage.gke.io\" for node \"bootstrap-e2e-minion-group-r7dh\""

external-attacher finally sees the annotation and attaches the volume:
I1128 12:35:35.359295       1 csi_handler.go:193] NodeID annotation added to "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:35:35.365952       1 csi_handler.go:203] VolumeAttachment "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea" updated with finalizer and/or NodeID annotation
I1128 12:35:35.366006       1 connection.go:235] GRPC call: /csi.v1.Controller/ControllerPublishVolume
I1128 12:35:35.366013       1 connection.go:236] GRPC request: volume_id:"projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/disks/pvc-a2f50d29-f309-11e8-802b-42010a800002" node_id:"projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/instances/bootstrap-e2e-minion-group-r7dh" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1543408336401-8081-pd.csi.storage.gke.io" > 
I1128 12:35:47.982680       1 connection.go:238] GRPC response: 
I1128 12:35:47.982733       1 connection.go:239] GRPC error: <nil>
I1128 12:35:47.982744       1 csi_handler.go:131] Attached "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:35:47.982757       1 util.go:33] Marking as attached "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:35:47.998078       1 util.go:43] Marked as attached "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:35:47.998099       1 csi_handler.go:137] Fully attached "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"

Due to exponential back-off, the attach-detach controller won't "retry" the attach until 12:36:33.328041594.
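
For reference, this back-off is the usual doubling delay with a cap. A minimal sketch of the behaviour behind "durationBeforeRetry 2m2s"; the 2m2s cap matches the log, while the 500 ms starting delay is an assumption rather than the exact constant of that Kubernetes version:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative constants: the 2m2s cap matches "durationBeforeRetry 2m2s"
	// in the log above; the 500ms starting delay is an assumption.
	const maxDelay = 2*time.Minute + 2*time.Second
	delay := 500 * time.Millisecond

	for attempt := 1; attempt <= 10; attempt++ {
		fmt.Printf("failure %d: no retry permitted for %v\n", attempt, delay)
		// Double the delay after each failure, up to the cap; once capped,
		// retries stay a full 2m2s apart, which is the window hit here.
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}
```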

The kubelet is also waiting for the attach-detach controller to ack that the volume is attached:

E1128 12:36:33.114836    6672 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/disks/pvc-a2f50d29-f309-11e8-802b-42010a800002\"" failed. No retries permitted until 2018-11-28 12:38:35.11478352 +0000 UTC m=+4788.780131927 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"pvc-a2f50d29-f309-11e8-802b-42010a800002\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/disks/pvc-a2f50d29-f309-11e8-802b-42010a800002\") pod \"pod-subpath-test-pd-csi-storage-gke-io-dynamicpv-ghvv\" (UID: \"a7db48f2-f309-11e8-802b-42010a800002\") "

attach-detach controller wakes up and sees the volume is attached:
I1128 12:36:33.332854       1 reconciler.go:289] attacherDetacher.AttachVolume started for volume "pvc-a2f50d29-f309-11e8-802b-42010a800002" (UniqueName: "kubernetes.io/csi/pd.csi.storage.gke.io^projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/disks/pvc-a2f50d29-f309-11e8-802b-42010a800002") from node "bootstrap-e2e-minion-group-r7dh" 
I1128 12:36:33.345070       1 csi_attacher.go:110] kubernetes.io/csi: attachment [csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea] for volume [projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/disks/pvc-a2f50d29-f309-11e8-802b-42010a800002] already exists (will not be recreated)
I1128 12:36:33.345100       1 csi_attacher.go:148] kubernetes.io/csi: probing for updates from CSI driver for [attachment.ID=csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea]
I1128 12:36:33.345113       1 csi_attacher.go:157] kubernetes.io/csi: probing VolumeAttachment [id=csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea]
I1128 12:36:33.348650       1 csi_attacher.go:119] kubernetes.io/csi: attacher.Attach finished OK with VolumeAttachment object [csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea]
I1128 12:36:33.348731       1 operation_generator.go:335] AttachVolume.Attach succeeded for volume "pvc-a2f50d29-f309-11e8-802b-42010a800002" (UniqueName: "kubernetes.io/csi/pd.csi.storage.gke.io^projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/disks/pvc-a2f50d29-f309-11e8-802b-42010a800002") from node "bootstrap-e2e-minion-group-r7dh" 

Due to exponential back-off on the kubelet, it will not check for the attached volume again until 2018-11-28 12:38:35.11478352.

Before the kubelet can check, the test times out and gives up:

Nov 28 12:37:23.159: INFO: Deleting pod "pod-subpath-test-pd-csi-storage-gke-io-dynamicpv-ghvv" in namespace "e2e-tests-csi-volumes-65prz"

So, for the most part, everything seems to be operating as expected.
The outstanding question is: why did the external-attacher not see the nodeID annotation until ~2 minutes after it was added to the node?
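
For context on that question, the "cannot find NodeID" error above comes from looking up the node ID that kubelet publishes for the driver on the Node object. The sketch below shows the general shape of that lookup, assuming the csi.volume.kubernetes.io/nodeid annotation (a JSON map of driver name to node ID) used by CSI at the time; the exact mechanism varies by Kubernetes version and this is not the actual controller code.

```go
package main

import (
	"encoding/json"
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// Assumed annotation key: kubelet publishes a JSON map of driver name -> node ID here.
const nodeIDAnnotation = "csi.volume.kubernetes.io/nodeid"

// nodeIDForDriver is an illustrative helper, not the attach-detach controller's code.
func nodeIDForDriver(node *v1.Node, driver string) (string, error) {
	raw, ok := node.Annotations[nodeIDAnnotation]
	if !ok {
		return "", fmt.Errorf("cannot find NodeID for driver %q for node %q", driver, node.Name)
	}
	ids := map[string]string{}
	if err := json.Unmarshal([]byte(raw), &ids); err != nil {
		return "", err
	}
	id, ok := ids[driver]
	if !ok {
		return "", fmt.Errorf("cannot find NodeID for driver %q for node %q", driver, node.Name)
	}
	return id, nil
}

func main() {
	node := &v1.Node{}
	node.Name = "bootstrap-e2e-minion-group-r7dh"
	node.Annotations = map[string]string{
		// Values taken from the log above.
		nodeIDAnnotation: `{"pd.csi.storage.gke.io":"projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/instances/bootstrap-e2e-minion-group-r7dh"}`,
	}
	fmt.Println(nodeIDForDriver(node, "pd.csi.storage.gke.io"))
}
```

Until the annotation shows up (it is written asynchronously after plugin registration), the lookup fails exactly as in the error above.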

@saad-ali (Member, Author)

CC @msau42 @jsafrane

@jsafrane (Contributor) commented Nov 29, 2018

Looking at the logs, the exponential backoff is weird. The controller synced at these times:

I1128 12:32:23.206454       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:23.233230       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:23.247519       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:23.257638       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:23.317422       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:23.420913       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:23.587099       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:23.913460       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:24.561902       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:24.572704       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:25.852872       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:25.860689       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:30.980237       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:31.009714       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:31.030087       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:51.490190       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:51.508139       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:35:35.348500       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:35:47.998133       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:37:28.546844       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"

Recalculated to intervals between consecutive calls, in tens of milliseconds:

2.7
1.4
1.0
6.0
10.3
16.6
32.6
64.8
1.1
128.0
0.8
512.0
2.9
2.0
2046.0
1.8
16384.0
1265.0
10054.9

It shows some exponential growth from 16 through 32, 64, and 128, but it misses 256, 1024, and a lot of numbers after 2048.
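
For reference, the deltas above can be reproduced from the log timestamps with a short script along these lines (a sketch; only a few of the timestamps are hard-coded here):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// "Started VA processing" timestamps from the external-attacher log
	// (only the first few and the last one shown here).
	stamps := []string{
		"12:32:23.206454",
		"12:32:23.233230",
		"12:32:23.247519",
		"12:32:23.257638",
		"12:37:28.546844",
	}

	const layout = "15:04:05.000000"
	for i := 1; i < len(stamps); i++ {
		prev, _ := time.Parse(layout, stamps[i-1])
		cur, _ := time.Parse(layout, stamps[i])
		// Print the gap in tens of milliseconds, matching the table above.
		fmt.Printf("%.1f\n", cur.Sub(prev).Seconds()*100)
	}
}
```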

Unfortunately, the informer does not log anything.

As a side note, these short 1-2 ms intervals in between are caused by the timestamp in VolumeAttachment.Status.Error: when the backoff timer fires, the controller retries the attach and writes a new timestamp to Status. This triggers a new "VolumeAttachment changed" event and the controller reconciles the object again. It tries to attach the volume once more and writes a new .Status; at this point the timestamp is the same as before (it's rounded to seconds), so it does not generate another "VolumeAttachment changed" event. IMO we could do something about it and not try attaching the volume so quickly; however, it most probably is not the root cause of this bug.

@jsafrane (Contributor)

This "VolumeAttachment changed" is related to the bug. New attach fails and VolumeAttachment is re-enqueued with new interval, overwriting the old one. So VolumeAttachment is enqueued with timeout 2*N instead of N, as we can see aboce. I thought that workqueue is smart enough not to enqueue an item if the item is still in the queue. I'll ignore "VolumeAttachment updated" events if the only difference is VolumeAttachment.Status.Error.

That does not explain why there is such a long gap between 2046.0 (≈20 s) and 16384.0 (≈164 s), though. And that is where these ~2.5 minutes are; the previous syncs take ~30 seconds together.

@jsafrane (Contributor)

I created #104; it does not retry attach/detach on "VolumeAttachment changed" events caused by an AttachError write.
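
The idea, sketched below, is to skip the re-sync when the only difference between the old and new VolumeAttachment is Status.AttachError (plus ResourceVersion, which changes on every write). This is an illustrative comparison under that assumption, not the actual code from #104:

```go
package main

import (
	"fmt"

	storagev1beta1 "k8s.io/api/storage/v1beta1"
	"k8s.io/apimachinery/pkg/api/equality"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// onlyAttachErrorChanged reports whether two VolumeAttachments differ solely in
// Status.AttachError and ResourceVersion. Such updates can be ignored instead of
// triggering another attach attempt.
func onlyAttachErrorChanged(oldVA, newVA *storagev1beta1.VolumeAttachment) bool {
	a := oldVA.DeepCopy()
	b := newVA.DeepCopy()
	a.Status.AttachError, b.Status.AttachError = nil, nil
	a.ResourceVersion, b.ResourceVersion = "", ""
	return equality.Semantic.DeepEqual(a, b)
}

func main() {
	oldVA := &storagev1beta1.VolumeAttachment{}
	newVA := oldVA.DeepCopy()
	newVA.Status.AttachError = &storagev1beta1.VolumeError{
		Time:    metav1.Now(),
		Message: `cannot find NodeID for driver "pd.csi.storage.gke.io"`,
	}
	fmt.Println(onlyAttachErrorChanged(oldVA, newVA)) // true -> do not re-queue
}
```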

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 27, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 27, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
