Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

livenessprobe should use CSI connection lib & only connect at startup #37

Merged
merged 3 commits into from
Feb 20, 2019

Conversation

andrewsykim
Copy link
Contributor

/assign @msau42 @jsafrane @vladimirvivien

Updating CSI sidecars to use the common connection library in csi-lib-utils.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 14, 2019
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 14, 2019
@andrewsykim
Copy link
Contributor Author

andrewsykim commented Feb 14, 2019

E2E is hanging because the new connection lib dials forever but the e2e tests are expecting a 500 at some point due to connection timeout from hostpath. A few options to address this:

  • Update the E2E tests to succeed if no response is given in X amount of seconds. This seems reasonable to me given we would assume a timeout with liveness probes from the kubelet.
  • Add a --probe-timeout flag for this sidecar to indicate a duration after which we return a 500. This seems redundant given the timeout would be specified on the pod's liveness probe.
  • Have connection lib error out at some point (configurable or with a high default)

@msau42
Copy link
Collaborator

msau42 commented Feb 14, 2019

This might be related? kubernetes-csi/csi-test#165

@andrewsykim
Copy link
Contributor Author

@msau42 @jsafrane adding some extra context on why the e2e tests are failing.

From Travis CI I'm seeing

Still connecting to unix:///tmp/e2e-csi-sanity.sock
  0     0    0     0    0     0      0      0 --:--:--  0:46:59 --:--:--     0W0214 05:05:09.587030    6298 connection.go:146] Still connecting to unix:///tmp/e2e-csi-sanity.sock
  0     0    0     0    0     0      0      0 --:--:--  0:47:09 --:--:--     0W0214 05:05:19.586904    6298 connection.go:146] Still connecting to unix:///tmp/e2e-csi-sanity.sock
  0     0    0     0    0     0      0      0 --:--:--  0:47:19 --:--:--     0W0214 05:05:29.586718    6298 connection.go:146] Still connecting to unix:///tmp/e2e-csi-sanity.sock
  0     0    0     0    0     0      0      0 --:--:--  0:47:29 --:--:--     0W0214 05:05:39.586728    6298 connection.go:146] Still connecting to unix:///tmp/e2e-csi-sanity.sock
  0     0    0     0    0     0      0      0 --:--:--  0:47:39 --:--:--     0W0214 05:05:49.586723    6298 connection.go:146] Still connecting to unix:///tmp/e2e-csi-sanity.sock
  0     0    0     0    0     0      0      0 --:--:--  0:47:49 --:--:--     0W0214 05:05:59.586720    6298 connection.go:146] Still connecting to unix:///tmp/e2e-csi-sanity.sock
  0     0    0     0    0     0      0      0 --:--:--  0:47:59 --:--:--     0W0214 05:06:09.586723    6298 connection.go:146] Still connecting to unix:///tmp/e2e-csi-sanity.sock
  0     0    0     0    0     0      0      0 --:--:--  0:48:09 --:--:--     0W0214 05:06:19.586791    6298 connection.go:146] Still connecting to unix:///tmp/e2e-csi-sanity.sock
  0     0    0     0    0     0      0      0 --:--:--  0:48:13 --:--:--     0

The job exceeded the maximum time limit for jobs, and has been terminated.

What I think is happening is that the failure test case hangs forever because gRPC dial from livenessprobe to the CSI driver (hostpath in this case) will retry forever (due to no timeout being specified in the connection lib and using the WithBlock dial option).

If there are no intentions to add timeout in the common connection lib, I think an acceptable solution here is to modify the failure test case to succeed if no response is provided within 10 seconds.

Alternatively, we can follow @msau42's suggestion of wrapping checkHealth under a timeout or we can have the common connection lib accept an optional timeout for cases like this one. My only concern with passing a timeout in the livenessprobe sidecar is that the user is now responsible for ensuring the probe timeout on the sidecar is in line with the timeout that may be specified on the liveness probe on the pod spec. If we dial forever, then the probe timeout is automatically the timeout set on the pod spec.

Thoughts?

@msau42
Copy link
Collaborator

msau42 commented Feb 15, 2019

I think the simplest solution would be to add an optional timeout to the connect and probe methods and have the liveness probe use a small value.

@andrewsykim
Copy link
Contributor Author

Sounds good to me! I'll get started on that, thanks!

@msau42 msau42 added this to In progress in Consistent sidecars Feb 16, 2019
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 20, 2019
Gopkg.toml Outdated
@@ -39,7 +39,7 @@

[[constraint]]
name = "github.com/kubernetes-csi/csi-lib-utils"
version = "0.1.0"
branch = "master"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change this back once a new version of csi-lib-utils is released

@andrewsykim andrewsykim changed the title livenessprobe should use CSI connection lib livenessprobe should use CSI connection lib & only connect at startup Feb 20, 2019
@andrewsykim
Copy link
Contributor Author

@msau42 @jsafrane made updates based on our previous discussions:

  • removed usage of pkg/connection and replace with csi-lib-utils
  • update livenessprobe to only connect at startup and only send probe requests for each /healthz request.

@msau42
Copy link
Collaborator

msau42 commented Feb 20, 2019

https://github.com/kubernetes-csi/csi-lib-utils/releases/tag/v0.3.1-rc1

@msau42
Copy link
Collaborator

msau42 commented Feb 20, 2019

/lgtm
Thanks for working through this!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 20, 2019
@msau42
Copy link
Collaborator

msau42 commented Feb 20, 2019

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, msau42

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 20, 2019
@k8s-ci-robot k8s-ci-robot merged commit ef27b95 into kubernetes-csi:master Feb 20, 2019
Consistent sidecars automation moved this from In progress to Done Feb 20, 2019
jsafrane pushed a commit to jsafrane/livenessprobe that referenced this pull request Jul 20, 2023
STOR-1020: Rebase `csi-livenessprobe` to v2.9.0 for OCP 4.13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

5 participants