external-attacher does not see nodeID annotation for 2 min? #102

Closed · saad-ali opened this issue Nov 29, 2018 · 8 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@saad-ali (Member)

This is an analysis of the following test flake: https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-serial/4398

Kubelet sees new plugin and registers it:
I1128 12:33:16.974128    6672 csi_plugin.go:116] kubernetes.io/csi: Trying to validate a new CSI Driver with name: pd.csi.storage.gke.io endpoint: /var/lib/kubelet/plugins/pd.csi.storage.gke.io/csi.sock versions: 1.0.0, foundInDeprecatedDir: false
I1128 12:33:16.974342    6672 csi_plugin.go:137] kubernetes.io/csi: Register new plugin with name: pd.csi.storage.gke.io at endpoint: /var/lib/kubelet/plugins/pd.csi.storage.gke.io/csi.sock

attach-detach controller is waiting for attach:
E1128 12:34:31.327964       1 csi_attacher.go:227] kubernetes.io/csi: attachment for projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/disks/pvc-a2f50d29-f309-11e8-802b-42010a800002 failed: cannot find NodeID for driver "pd.csi.storage.gke.io" for node "bootstrap-e2e-minion-group-r7dh"
E1128 12:34:31.328073       1 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/disks/pvc-a2f50d29-f309-11e8-802b-42010a800002\"" failed. No retries permitted until 2018-11-28 12:36:33.328041594 +0000 UTC m=+4563.274204929 (durationBeforeRetry 2m2s). Error: "AttachVolume.Attach failed for volume \"pvc-a2f50d29-f309-11e8-802b-42010a800002\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/disks/pvc-a2f50d29-f309-11e8-802b-42010a800002\") from node \"bootstrap-e2e-minion-group-r7dh\" : cannot find NodeID for driver \"pd.csi.storage.gke.io\" for node \"bootstrap-e2e-minion-group-r7dh\""

external-attacher finally sees the annotation and attaches the volume:
I1128 12:35:35.359295       1 csi_handler.go:193] NodeID annotation added to "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:35:35.365952       1 csi_handler.go:203] VolumeAttachment "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea" updated with finalizer and/or NodeID annotation
I1128 12:35:35.366006       1 connection.go:235] GRPC call: /csi.v1.Controller/ControllerPublishVolume
I1128 12:35:35.366013       1 connection.go:236] GRPC request: volume_id:"projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/disks/pvc-a2f50d29-f309-11e8-802b-42010a800002" node_id:"projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/instances/bootstrap-e2e-minion-group-r7dh" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1543408336401-8081-pd.csi.storage.gke.io" > 
I1128 12:35:47.982680       1 connection.go:238] GRPC response: 
I1128 12:35:47.982733       1 connection.go:239] GRPC error: <nil>
I1128 12:35:47.982744       1 csi_handler.go:131] Attached "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:35:47.982757       1 util.go:33] Marking as attached "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:35:47.998078       1 util.go:43] Marked as attached "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:35:47.998099       1 csi_handler.go:137] Fully attached "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"

Due to exponential back-off, the attach-detach controller won't "retry" the attach until 12:36:33.328041594.
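
For reference, this back-off is the usual doubling delay with a cap. A minimal sketch of the behaviour behind "durationBeforeRetry 2m2s"; the 2m2s cap matches the log, while the 500 ms starting delay is an assumption rather than the exact constant of that Kubernetes version:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative constants: the 2m2s cap matches "durationBeforeRetry 2m2s"
	// in the log above; the 500ms starting delay is an assumption.
	const maxDelay = 2*time.Minute + 2*time.Second
	delay := 500 * time.Millisecond

	for attempt := 1; attempt <= 10; attempt++ {
		fmt.Printf("failure %d: no retry permitted for %v\n", attempt, delay)
		// Double the delay after each failure, up to the cap; once capped,
		// retries stay a full 2m2s apart, which is the window hit here.
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}
```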

The kubelet is also waiting for the attach-detach controller to ack that the volume is attached:

E1128 12:36:33.114836    6672 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/csi/pd.csi.storage.gke.io^projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/disks/pvc-a2f50d29-f309-11e8-802b-42010a800002\"" failed. No retries permitted until 2018-11-28 12:38:35.11478352 +0000 UTC m=+4788.780131927 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"pvc-a2f50d29-f309-11e8-802b-42010a800002\" (UniqueName: \"kubernetes.io/csi/pd.csi.storage.gke.io^projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/disks/pvc-a2f50d29-f309-11e8-802b-42010a800002\") pod \"pod-subpath-test-pd-csi-storage-gke-io-dynamicpv-ghvv\" (UID: \"a7db48f2-f309-11e8-802b-42010a800002\") "

attach-detach controller wakes up and sees the volume is attached:
I1128 12:36:33.332854       1 reconciler.go:289] attacherDetacher.AttachVolume started for volume "pvc-a2f50d29-f309-11e8-802b-42010a800002" (UniqueName: "kubernetes.io/csi/pd.csi.storage.gke.io^projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/disks/pvc-a2f50d29-f309-11e8-802b-42010a800002") from node "bootstrap-e2e-minion-group-r7dh" 
I1128 12:36:33.345070       1 csi_attacher.go:110] kubernetes.io/csi: attachment [csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea] for volume [projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/disks/pvc-a2f50d29-f309-11e8-802b-42010a800002] already exists (will not be recreated)
I1128 12:36:33.345100       1 csi_attacher.go:148] kubernetes.io/csi: probing for updates from CSI driver for [attachment.ID=csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea]
I1128 12:36:33.345113       1 csi_attacher.go:157] kubernetes.io/csi: probing VolumeAttachment [id=csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea]
I1128 12:36:33.348650       1 csi_attacher.go:119] kubernetes.io/csi: attacher.Attach finished OK with VolumeAttachment object [csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea]
I1128 12:36:33.348731       1 operation_generator.go:335] AttachVolume.Attach succeeded for volume "pvc-a2f50d29-f309-11e8-802b-42010a800002" (UniqueName: "kubernetes.io/csi/pd.csi.storage.gke.io^projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/disks/pvc-a2f50d29-f309-11e8-802b-42010a800002") from node "bootstrap-e2e-minion-group-r7dh" 

Due to exponential back-off on the kubelet, it will not check for the attached volume again until 2018-11-28 12:38:35.11478352.

Before the kubelet can check, the test times out and gives up:

Nov 28 12:37:23.159: INFO: Deleting pod "pod-subpath-test-pd-csi-storage-gke-io-dynamicpv-ghvv" in namespace "e2e-tests-csi-volumes-65prz"

So, for the most part, everything seems to be operating as expected.
The outstanding question is: why did the external-attacher not see the nodeID annotation until ~2 minutes after it was added to the node?
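
For context on that question, the "cannot find NodeID" error above comes from looking up the node ID that kubelet publishes for the driver on the Node object. The sketch below shows the general shape of that lookup, assuming the csi.volume.kubernetes.io/nodeid annotation (a JSON map of driver name to node ID) used by CSI at the time; the exact mechanism varies by Kubernetes version and this is not the actual controller code.

```go
package main

import (
	"encoding/json"
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// Assumed annotation key: kubelet publishes a JSON map of driver name -> node ID here.
const nodeIDAnnotation = "csi.volume.kubernetes.io/nodeid"

// nodeIDForDriver is an illustrative helper, not the attach-detach controller's code.
func nodeIDForDriver(node *v1.Node, driver string) (string, error) {
	raw, ok := node.Annotations[nodeIDAnnotation]
	if !ok {
		return "", fmt.Errorf("cannot find NodeID for driver %q for node %q", driver, node.Name)
	}
	ids := map[string]string{}
	if err := json.Unmarshal([]byte(raw), &ids); err != nil {
		return "", err
	}
	id, ok := ids[driver]
	if !ok {
		return "", fmt.Errorf("cannot find NodeID for driver %q for node %q", driver, node.Name)
	}
	return id, nil
}

func main() {
	node := &v1.Node{}
	node.Name = "bootstrap-e2e-minion-group-r7dh"
	node.Annotations = map[string]string{
		// Values taken from the log above.
		nodeIDAnnotation: `{"pd.csi.storage.gke.io":"projects/k8s-jkns-gci-gce-flaky/zones/us-central1-f/instances/bootstrap-e2e-minion-group-r7dh"}`,
	}
	fmt.Println(nodeIDForDriver(node, "pd.csi.storage.gke.io"))
}
```

Until the annotation shows up (it is written asynchronously after plugin registration), the lookup fails exactly as in the error above.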

@saad-ali (Member, Author)

CC @msau42 @jsafrane

@jsafrane (Contributor) commented Nov 29, 2018

Looking at the logs, the exponential backoff is weird. The controller synced at these times:

I1128 12:32:23.206454       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:23.233230       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:23.247519       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:23.257638       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:23.317422       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:23.420913       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:23.587099       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:23.913460       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:24.561902       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:24.572704       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:25.852872       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:25.860689       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:30.980237       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:31.009714       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:31.030087       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:51.490190       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:32:51.508139       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:35:35.348500       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:35:47.998133       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"
I1128 12:37:28.546844       1 controller.go:167] Started VA processing "csi-1fe974024220d00085419a3757b0ccee5ebadeff7672fc3fe0ed469132a41aea"

Recalculated to intervals between consecutive calls, in tens of milliseconds:

2.7
1.4
1.0
6.0
10.3
16.6
32.6
64.8
1.1
128.0
0.8
512.0
2.9
2.0
2046.0
1.8
16384.0
1265.0
10054.9

It shows some exponential growth from 16 through 32, 64, and 128, but it misses 256, 1024, and a lot of numbers after 2048.
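
For reference, the deltas above can be reproduced from the log timestamps with a short script along these lines (a sketch; only a few of the timestamps are hard-coded here):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// "Started VA processing" timestamps from the external-attacher log
	// (only the first few and the last one shown here).
	stamps := []string{
		"12:32:23.206454",
		"12:32:23.233230",
		"12:32:23.247519",
		"12:32:23.257638",
		"12:37:28.546844",
	}

	const layout = "15:04:05.000000"
	for i := 1; i < len(stamps); i++ {
		prev, _ := time.Parse(layout, stamps[i-1])
		cur, _ := time.Parse(layout, stamps[i])
		// Print the gap in tens of milliseconds, matching the table above.
		fmt.Printf("%.1f\n", cur.Sub(prev).Seconds()*100)
	}
}
```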

Unfortunately, the informer does not log anything.

As a side note, these short 1-2 ms intervals in between are caused by the timestamp in VolumeAttachment.Status.Error: when the backoff timer fires, the controller retries the attach and writes a new timestamp to Status. This triggers a new "VolumeAttachment changed" event and the controller reconciles the object again. It tries to attach the volume once more and writes a new .Status; at this point the timestamp is the same as before (it's rounded to seconds), so it does not generate another "VolumeAttachment changed" event. IMO we could do something about it and not try attaching the volume so quickly; however, it most probably is not the root cause of this bug.

@jsafrane (Contributor)

This "VolumeAttachment changed" is related to the bug. New attach fails and VolumeAttachment is re-enqueued with new interval, overwriting the old one. So VolumeAttachment is enqueued with timeout 2*N instead of N, as we can see aboce. I thought that workqueue is smart enough not to enqueue an item if the item is still in the queue. I'll ignore "VolumeAttachment updated" events if the only difference is VolumeAttachment.Status.Error.

That does not explain why there is such a long gap between 2046.0 (≈20 s) and 16384.0 (≈164 s), though. And that is where these ~2.5 minutes are; the previous syncs take ~30 seconds together.

@jsafrane (Contributor)

I created #104; it does not retry attach/detach on "VolumeAttachment changed" events caused by an AttachError write.
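
The idea, sketched below, is to skip the re-sync when the only difference between the old and new VolumeAttachment is Status.AttachError (plus ResourceVersion, which changes on every write). This is an illustrative comparison under that assumption, not the actual code from #104:

```go
package main

import (
	"fmt"

	storagev1beta1 "k8s.io/api/storage/v1beta1"
	"k8s.io/apimachinery/pkg/api/equality"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// onlyAttachErrorChanged reports whether two VolumeAttachments differ solely in
// Status.AttachError and ResourceVersion. Such updates can be ignored instead of
// triggering another attach attempt.
func onlyAttachErrorChanged(oldVA, newVA *storagev1beta1.VolumeAttachment) bool {
	a := oldVA.DeepCopy()
	b := newVA.DeepCopy()
	a.Status.AttachError, b.Status.AttachError = nil, nil
	a.ResourceVersion, b.ResourceVersion = "", ""
	return equality.Semantic.DeepEqual(a, b)
}

func main() {
	oldVA := &storagev1beta1.VolumeAttachment{}
	newVA := oldVA.DeepCopy()
	newVA.Status.AttachError = &storagev1beta1.VolumeError{
		Time:    metav1.Now(),
		Message: `cannot find NodeID for driver "pd.csi.storage.gke.io"`,
	}
	fmt.Println(onlyAttachErrorChanged(oldVA, newVA)) // true -> do not re-queue
}
```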

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 27, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 27, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
