Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Config for timeouts and delays in probes #11458

Merged

Conversation

jan-kantert
Copy link
Contributor

Proposed change to implement #11453

@jan-kantert jan-kantert requested a review from a team as a code owner October 5, 2023 12:53
Copy link
Member

@adleong adleong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jan-kantert!

The default probe timeouts are 1s, so that's the current behavior. Changing the defaults from 1s to 30s and 5s respectively feels like a big change. I'm not sure how we would determine if these defaults are better.

What do you think about leaving the defaults as the current value of 1s? At least this change will allow users to configure these timeouts explicitly without changing the default behavior.

@jan-kantert
Copy link
Contributor Author

Thanks @jan-kantert!

The default probe timeouts are 1s, so that's the current behavior. Changing the defaults from 1s to 30s and 5s respectively feels like a big change. I'm not sure how we would determine if these defaults are better.

What do you think about leaving the defaults as the current value of 1s? At least this change will allow users to configure these timeouts explicitly without changing the default behavior.

Sure we can do that. However, those small timeouts cause instability on clusters with high load. I personally would still at least increase the default for the livenessProbe (as Kubernetes recommends to have a longer timeout for your livenessProbe than your readinessProbe). This timeout also has been introduced more or less silently with kubernetes 1.20 (https://kubernetes.io/blog/2020/12/08/kubernetes-1-20-release-announcement/#exec-probe-timeout-handling).

Copy link
Member

@alpeb alpeb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jan-kantert 👍
Please complete the Proxy struct in ./pkg/charts/linkerd2/values.go with these new entries for the linkerd install CLI to be able to render them.
Also, please run ./bin/helm-docs for the comments added into the values.yaml file to be picked up in the chart's README.md file, and go test ./... -update to update the unit test golden files.

@adleong
Copy link
Member

adleong commented Nov 13, 2023

Hi @jan-kantert, just checking in to see if you have time to make the above changes?

@mateiidavid mateiidavid self-assigned this Nov 30, 2023
@mateiidavid
Copy link
Member

Hey @jan-kantert, just to give you an update: I pushed the requested changes to your branch. I did not change anything in the implementation (but I did modify the file to resolve a merge conflict) -- most of my work has been to fix the tests and get this reviewable.

Your effort is super appreciated! I think due to the nature of the issue itself (that you have described in #11453) we felt that it would be better to get this fix in soon so we can get it tested asap.

The contribution is solely yours :)

Comment on lines 162 to 163
initialDelaySeconds: {{.Values.proxy.livenessProbe.initialDelaySeconds | default 10}}
timeoutSeconds: {{.Values.proxy.livenessProbe.timeoutSeconds | default 1}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would get rid of the default statements, as we already have defaults in the values.yaml file, and to be consistent with the rest of the file (see startupProbe).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I did not want to touch the impl but I agree.

@@ -198,6 +198,14 @@ proxy:
# "all-unauthenticated", "cluster-authenticated", "cluster-unauthenticated", "deny"
# @default -- "all-unauthenticated"
defaultInboundPolicy: "all-unauthenticated"
# -- LivenessProbe timeout and delay configuration
livenessProbe:
initialDelaySeconds: 10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the default here used to be zero? Setting the new default to 10 is going to significantly delay the startup time for pods. How about leaving the default as it was? Pathological cases would still have an escape hatch with this new setting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was hardcoded in to 10, see the _proxy.tpl change. The default did not change. The only defaults we have changed are timeouts since we've added them in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😊

@mateiidavid
Copy link
Member

@jan-kantert we've noticed your DCO check wasn't properly signed. We can take this across the finish line if you'd like, but we still need to have all of the commits signed to pass the check :)

@kflynn
Copy link
Member

kflynn commented Dec 22, 2023

@jan-kantert Hey! We'd like to take this, if you can confirm that you agree with the whole DCO thing. A comment on this issue is fine. 🙂

@jan-kantert
Copy link
Contributor Author

I agree to the DCO! Sorry that I did not get to it before Christmas. Thanks for taking care!

jan-kantert and others added 7 commits January 5, 2024 10:38
Add timeout config for readinessProbe and livenessProbes
Add defaults for timeouts and delays
Changed to previous default
changed to previous default
Signed-off-by: Matei David <matei@buoyant.io>
Signed-off-by: Matei David <matei@buoyant.io>
@alpeb alpeb force-pushed the config_for_timeouts_and_delays_in_probes branch from 08d7a34 to c39b6db Compare January 5, 2024 16:06
Copy link
Member

@alpeb alpeb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've rebased and pushed some minor changes. Good to go IMO 👍

Copy link
Member

@adleong adleong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds a way to configure probe initialDelay and timeout. Do we also anticipate needing to configure probe period? Should we just add that as well while we're here? Worth a thought but not a blocker.

@mateiidavid mateiidavid merged commit af402a3 into linkerd:main Feb 8, 2024
36 checks passed
@mateiidavid mateiidavid mentioned this pull request Feb 8, 2024
mateiidavid added a commit that referenced this pull request Feb 9, 2024
This release addresses some issues in the destination service that could cause
it to behave unexpectedly when processing updates.

* Fixed a race condition in the destination service that could cause panics
  under very specific conditions ([#12022]; fixes [#12010])
* Changed how updates to a `Server` selector are handled in the destination
  service. When a `Server` that marks a port as opaque no longer selects a
  resource, the resource's opaqueness will reverted to default settings
  ([#12031]; fixes [#11995])
* Introduced Helm configuration values for liveness and readiness probe
  timeouts and delays ([#11458]; fixes [#11453]) (thanks @jan-kantert!)

[#12010]: #12010
[#12022]: #12022
[#11995]: #11995
[#12031]: #12031
[#11453]: #11453
[#11458]: #11458

Signed-off-by: Matei David <matei@buoyant.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants