Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: change default timeout value in csi plugin #79529

Merged
merged 1 commit into from Jul 3, 2019

Conversation

@andyzhangx
Copy link
Member

commented Jun 28, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR changes timeout value in csi plugin from 15s to 2min which fixes the timeout issue

attach/WaitForAttach timeout value is too small in CSI driver, it's all 15s which is really too small:

csiTimeout = 15 * time.Second

Compared to built-in driver timeout, it's really too small:
podAttachAndMountTimeout is 2 minutes, total waitForAttachTimeout is 10 minutes

Which issue(s) this PR fixes:

Fixes #79528

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

changes timeout value in csi plugin from 15s to 2min which fixes the timeout issue

/kind bug
/assign @msau42
/priority important-soon
/sig storage

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Jun 28, 2019

@@ -54,7 +54,7 @@ const (
// TODO (vladimirvivien) would be nice to name socket with a .sock extension
// for consistency.
csiAddrTemplate = "/var/lib/kubelet/plugins/%v/csi.sock"
csiTimeout = 15 * time.Second
csiTimeout = 2 * time.Minute

This comment has been minimized.

Copy link
@hoyho

hoyho Jun 28, 2019

Member

Not sure what is the best default value. If the csiTimeout can be configurable like other csi sidecar do that will be better.

This comment has been minimized.

Copy link
@andyzhangx

andyzhangx Jun 28, 2019

Author Member

@hoyho yes, but now I would like cherry-pick this fix to old release from 1.13 to 1.15, 15s is really too small, it makes the timeout a lot. And when user switch to use CSI driver, they won't be happy to see that timeout issue every time.

This comment has been minimized.

Copy link
@hoyho

hoyho Jun 28, 2019

Member

agree with you. For our CSI driver, 15s is not enough.

This comment has been minimized.

Copy link
@cwdsuzhou

cwdsuzhou Jul 3, 2019

Member

I also meet some issues caused by the default time.

LGTM

@davidz627

This comment has been minimized.

Copy link
Contributor

commented Jun 28, 2019

The same timeout is used for Attach/Mount/Stage/Block etc. Do we want to split it up into different timeouts for different operations that tend to take different amount of times? I also agree with the premise 15 seconds is far too short.

@msau42

This comment has been minimized.

Copy link
Member

commented Jun 29, 2019

Also keep in mind that the CSI driver should be idempotent and handle another call coming in on the same volume even if one is already in progress.

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Jun 29, 2019

The same timeout is used for Attach/Mount/Stage/Block etc. Do we want to split it up into different timeouts for different operations that tend to take different amount of times? I also agree with the premise 15 seconds is far too short.

I think currently let's change the default timeout value to 2min, for other operations, like Mount, 15s is still too short if there is a large disk to format, 2min is much safer. If we found we need different timeouts in the future, then let's split them up at that time.

@andyzhangx andyzhangx changed the title fix: change timeout value in csi plugin fix: change default timeout value in csi plugin Jun 29, 2019

@davidz627

This comment has been minimized.

Copy link
Contributor

commented Jul 2, 2019

/lgtm

@cwdsuzhou
Copy link
Member

left a comment

could you also cherrypick it to 1.12 1.11

@@ -54,7 +54,7 @@ const (
// TODO (vladimirvivien) would be nice to name socket with a .sock extension
// for consistency.
csiAddrTemplate = "/var/lib/kubelet/plugins/%v/csi.sock"
csiTimeout = 15 * time.Second
csiTimeout = 2 * time.Minute

This comment has been minimized.

Copy link
@cwdsuzhou

cwdsuzhou Jul 3, 2019

Member

I also meet some issues caused by the default time.

LGTM

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Jul 3, 2019

cc @jsafrane @saad-ali for approval, thanks.

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Jul 3, 2019

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Jul 3, 2019

could you also cherrypick it to 1.12 1.11

1.12, 1.11 is already code freezed, so the latest cherry pick version should be 1.13. @cwdsuzhou

@jsafrane

This comment has been minimized.

Copy link
Member

commented Jul 3, 2019

I must miss something. The linked issue mentions Attach/WaitForAttach timeouts. Taking Attach as example, each Attach() call waits for 15 seconds for a volume to get attached and then it returns an error to A/D controller. A/D controller waits for an exponential backoff and re-tries calling Attach(). Initial backoff delay is 0.5s and doubles with each error. So for the first two minutes we have:

15s waiting for attachment
0.5s sleep
15s waiting for attachment
1s sleep
...
15s waiting for attachment
16s sleep

(=121.5s total)
In these two minutes, 90 seconds the controller is actively waiting for the attachment and only 31.5s sleeping.

If the volume is attached in first 48.5 seconds, it takes only 2 second max for the controller to notice a volume is attached (that's the max sleep in 3 iterations).

The real difference is the last sleep. If a volume gets attached 90.5s after start, the controller waits for extra 16 seconds to notice it. Is it really the case we are trying to fix here? What storage backend takes 90s to attach a volume?

WaitForAttach: I may be mistaken here, I think kubelet calls WaitForAttach after A/D controller reports a volume as attached. A VolumeAttachment should already be in the right state and it should be only a matter of syncing API server watch(), which should take fraction of a second in average case.

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Jul 3, 2019

15s waiting for attachment

Hi @jsafrane we are talking about the 15s waiting for attachment value, for a few CSI drivers, the first 15s is too short, there would be timeout error, while it could succeed finally by retry. So make the default timeout value as 120s(identical to built-in driver) would be better.

@cwdsuzhou

This comment has been minimized.

Copy link
Member

commented Jul 3, 2019

@jsafrane @andyzhangx

I have sent a PR #75963 in the past to expose the maxBackoff time which may help decrease the time mount waitis if attach lasts for long time. However, it is hard to be merged.

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Jul 3, 2019

@jsafrane @andyzhangx

I have sent a PR #75963 in the past to expose the maxBackoff time which may help decrease the time mount waitis if attach lasts for long time. However, it is hard to be merged.

Actually timeout value in the built-in driver is 2min (per

// podAttachAndMountTimeout is the maximum amount of time the
// WaitForAttachAndMount call will wait for all volumes in the specified pod
// to be attached and mounted. Even though cloud operations can take several
// minutes to complete, we set the timeout to 2 minutes because kubelet
// will retry in the next sync iteration. This frees the associated
// goroutine of the pod to process newer updates if needed (e.g., a delete
// request to the pod).
// Value is slightly offset from 2 minutes to make timeouts due to this
// constant recognizable.
podAttachAndMountTimeout = 2*time.Minute + 3*time.Second
), while csi driver plugin makes it as only 15s, I think this is a regression. 2min is a suitable value for most drivers, although retry will succeed in the end.

@jsafrane

This comment has been minimized.

Copy link
Member

commented Jul 3, 2019

So the only problem here is a timeout error visible in kubectl describe pod?

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Jul 3, 2019

So the only problem here is a timeout error visible in kubectl describe pod?

Yes

@jsafrane

This comment has been minimized.

Copy link
Member

commented Jul 3, 2019

Oh, I expected there is some real issue :-)

/approve

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Jul 3, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andyzhangx

This comment has been minimized.

Copy link
Member Author

commented Jul 3, 2019

Oh, I expected there is some real issue :-)

/approve

Thanks, this will make some users nervous if first attach volume failed every time.

@k8s-ci-robot k8s-ci-robot merged commit a3be4b6 into kubernetes:master Jul 3, 2019

23 checks passed

cla/linuxfoundation andyzhangx authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
Details
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-csi-serial Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gce-iscsi Skipped.
pull-kubernetes-e2e-gce-iscsi-serial Skipped.
pull-kubernetes-e2e-gce-storage-slow Job succeeded.
Details
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-node-e2e-containerd Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
pull-publishing-bot-validate Skipped.
tide In merge pool.
Details

k8s-ci-robot added a commit that referenced this pull request Jul 4, 2019

Merge pull request #79759 from andyzhangx/automated-cherry-pick-of-#7…
…9529-upstream-release-1.15

Automated cherry pick of #79529: fix: change timeout value in csi plugin
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.