Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apply default livenessProbe and readinessProbe to the user container #4014

Closed
chizhg opened this issue May 6, 2019 · 15 comments · Fixed by #4731

Comments

@chizhg
Copy link
Member

@chizhg chizhg commented May 6, 2019

In what area(s)?

/area test-and-release

Describe the feature

According to the second paragraph of Meta Requests in Knative Runtime Contract, we should apply default livenessProbe and readinessProbe to the user container when they are not specified, and have them the following settings:

  • tcpSocket set to the container's port
  • initialDelaySeconds set to 0
  • periodSeconds set to platform-specific value

@tcnghia @mattmoor @markusthoemmes

@markusthoemmes

This comment has been minimized.

Copy link
Contributor

@markusthoemmes markusthoemmes commented May 6, 2019

In theory we have this (we don't do liveness probes though) implemented via the queue-proxy, which does this exact probe.

@dgerd

This comment has been minimized.

Copy link
Contributor

@dgerd dgerd commented May 6, 2019

QueueProxy Probe on Admin Interface:

queueReadinessProbe = &corev1.Probe{

QueueProxyHandler Setup on Admin Interface:

func createAdminHandlers() *http.ServeMux {

Probe Handler:

queueReadinessProbe = &corev1.Probe{

@dgerd

This comment has been minimized.

Copy link
Contributor

@dgerd dgerd commented May 6, 2019

We do have something here. Some areas where this doesn't match well with the description:

  1. LivenessProbe is not configured
  2. "Platform-specific value" makes me think this is configuration; however, periodSeconds is hard-coded in the queueProxy
  3. "Default" makes me think that user specified probes override this behavior, but instead they are applied concurrently.

I think we could go either way on whether to update the contract, or to update the implementation for each of these things. I do however want to make them consistent.

@dgerd dgerd added this to To do in Conformance via automation May 6, 2019
dgerd added a commit to dgerd/serving that referenced this issue May 7, 2019
This change makes numerous cleanups to the runtime contract in an
attempt to improve the readability of the document and make the document
more useful for the intended auidence.

* Moves developer facing statements to a new `runtime-user-guide`.
Focuses `runtime-contract` on operator/platform-provider.
* Add links to Conformance tests that test Runtime Contract statements.
* Corrects, updates, or removes statements to more accurately represent
today's Knative runtime.
* Updates to informative or removes most untestable statements
* Copies in important OCI runtime requirements we previously referenced
* Removes reference to OCI specification that didn't bring new
requirements.

Ref: knative#2539, knative#2973, knative#4014, knative#4027
@dgerd

This comment has been minimized.

Copy link
Contributor

@dgerd dgerd commented May 8, 2019

After discussion in the API working group meeting, this is my understanding of where we landed. Please correct if I missed something or misunderstood.

On the contract/specification side:

  1. LivenessProbe should not be configured by default. This part of the statement should be removed from the runtime contract.

On the implementation side:

  1. The default TCP probe behavior should be disabled when a user specifies a custom readinessProbe.
  2. The default TCP readinessProbe should be added by the webhook and appear on the Revision spec.
  3. Custom readiness probes defined by the user should be translated to health checks that are done BY the queue-proxy against the user-container, and removed from the user container.
  4. We should understand if our queue-proxy http server startup time is causing #3308. If it is we should consider using an execProbe on the queue-proxy to perform the healthcheck against the user-container. If it is not, we should take a look at how fast we timeout on health checking the user container.
@mattmoor mattmoor added this to the Serving 0.7 milestone May 10, 2019
@joshrider

This comment has been minimized.

Copy link
Member

@joshrider joshrider commented May 13, 2019

/assign @joshrider

@joshrider

This comment has been minimized.

Copy link
Member

@joshrider joshrider commented Jun 7, 2019

/assign @shashwathi

@joshrider

This comment has been minimized.

Copy link
Member

@joshrider joshrider commented Jun 11, 2019

I've run into some questions about how to translate a RevisionSpec's probes over into something we'd want to execute from the queue-proxy against the user-container.

periodSeconds, successThreshold, and failureThreshold do not cleanly map over to the way we do things now.

At present, our hardcoded readinessProbe will make a GET request to the queue-proxy, which then fires TCP probes at the user-container at 50 ms intervals. This gives us the chance to pick up the user-container as soon as possible. This 50ms could be the "platform-specific" periodSeconds value referred to by the runtime-contract, but the sub-second value puts us into a bit of a sticky spot.

The periodSeconds value on the probe is an integer representing a number of seconds. This makes it difficult to: a) include an accurate description of the default probe in the Revision spec, and b) allow the user to configure their own sub-second periodSeconds values.

@dgerd pointed out that 0 is not a valid value for a probe, and may be an option to signify some special value. It could be that 0 is a stand-in for some smaller period of time (like 50ms), or it could indicate that we need to look elsewhere for the desired value. This could get us where we're going, but does not seem straightforward for a user. I'd be happy to get some feedback on ways to tackle this.

successThreshold and failureThreshold are both a little bit awkward, and may result in us reimplementing a bunch of logic from the kubelet's prober. I also have some questions about their meaningfulness given that, as the default probe is currently implemented, once the TCP probe from the queue-proxy against the user-container succeeds once, subsequent probes of the queue-proxy return successfully without ever re-checking the user-container.

Sidenote: there is a comment where we build the default probe in the queue-proxy that suggests we want to get the PreStop going as soon as possible. Given that the failureThreshold isn't set and that the default value for the failureThreshold is 3, would we benefit by setting it ourselves to a lower value?

@mattmoor

This comment has been minimized.

Copy link
Member

@mattmoor mattmoor commented Jun 12, 2019

Few thoughts:

  1. I think periodSeconds: 0 should mean do it as the system wants to.
  2. I don't think the system default needs to be reflected back into the yaml (and here in fact cannot be)
  3. If periodSeconds is specified we should respect it.
  4. I think that we should disallow failureThreshold if periodSeconds is unspecified, but I think successThreshold still makes sense.
@joshrider

This comment has been minimized.

Copy link
Member

@joshrider joshrider commented Jun 19, 2019

@mattmoor @dgerd

Clarifying question: if the user specifies an HTTP probe with periodSeconds: 0, do we want that to be invalid or do we want to translate that to the aggressive retry configuration?

@mattmoor mattmoor modified the milestones: Serving 0.7, Serving 0.8 Jun 19, 2019
@mattmoor

This comment has been minimized.

Copy link
Member

@mattmoor mattmoor commented Jun 19, 2019

aggressive retries.

@mattmoor

This comment has been minimized.

Copy link
Member

@mattmoor mattmoor commented Jun 25, 2019

The base change is in. I'd like to see this land as early in 0.8 as we can, so it can bake. Please LMK if you have questions or need reviews.

@joshrider

This comment has been minimized.

Copy link
Member

@joshrider joshrider commented Jun 26, 2019

Not to jinx anything, but the e2e tests just passed. Have some housekeeping and cleanup to do, but hope to have something up for feedback at the end of the week.

@mattmoor

This comment has been minimized.

Copy link
Member

@mattmoor mattmoor commented Jun 26, 2019

@joshrider That's awesome news, LMK when things are RFAL.

@mattmoor

This comment has been minimized.

Copy link
Member

@mattmoor mattmoor commented Jul 2, 2019

@joshrider did you jinx it? 😲

@joshrider

This comment has been minimized.

Copy link
Member

@joshrider joshrider commented Jul 2, 2019

Guess it's hard to argue that I didn't at this point...

Just put something up with a WIP tag. #4600
Lots to look at and feedback greatly appreciated. Will also need to be shepherded through CLA-land manually.

@mattmoor @dgerd

joshrider added a commit to joshrider/serving that referenced this issue Jul 8, 2019


Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>
knative-prow-robot added a commit that referenced this issue Jul 9, 2019
* prepare queue healthHandler for optional aggressive retries #4014

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>

* add documentation for IsHTTPProbeReady

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>

* Address comments

* add kubelet header to http probe, add status code to failed probe error

* use kubelet user-agent in http probe

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>
shashwathi added a commit to joshrider/serving that referenced this issue Jul 9, 2019
This PR builds the adapter which converts a user-defined probe to a probe that
will be executed by queue proxy against user container.

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>
knative-prow-robot added a commit that referenced this issue Jul 10, 2019
* Prep for queue health handler for aggressive retries #4014

This PR builds the adapter which converts a user-defined probe to a probe that
will be executed by queue proxy against user container.

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>

* Address golang linter comments

* fix comment

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>

* return probe encoding error

* factor out queue probe retry logic, clean up

Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>
Conformance automation moved this from To do to Done Jul 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.