
[WIP] Use user-defined readiness probes through queue-proxy #4600

Closed
wants to merge 32 commits

Conversation

@joshrider (Member) commented Jul 2, 2019

Fixes #4014

TODO

  • check defaults and validations => lots of questions about desired behaviour here. Make a list.
  • lots of general cleanup
  • make sure zap errors don't produce lengthy errors that destroy readability of event log
  • does activator still keep using probeUserContainer?

Proposed Changes

  • move user-defined readinessProbes from kubelet to the queue-proxy
  • show default probe in Revision spec if no probe provided
  • allow the possibility of either our existing, aggressive probe retry semantics (which should speed up cold start), or standard k8s probe semantics

Release Note


@knative-prow-robot left a comment

@joshrider: 3 warnings.

In response to this:

Fixes #4014

TODO

  • check defaults and validations => lots of questions about desired behaviour here. Make a list.
  • investigate flaky TestAutoScaleUpDownUp
  • lots of general cleanup
  • pkg/reconciler/revision/resources/deploy_test.go only remaining user of queueReadinessProbe
  • pkg/reconciler/revision/resources/queue_test.go probe stuff into table tests?
  • make sure zap errors don't produce lengthy errors that destroy readability of event log
  • does activator still keep using probeUserContainer?

Proposed Changes

  • move user-defined readinessProbes from kubelet to the queue-proxy
  • show default probe in Revision spec if no probe provided
  • allow the possibility of either our existing, aggressive probe retry semantics (which should speed up cold start), or standard k8s probe semantics

Release Note


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pkg/queue/readiness/probe.go (3 resolved review threads)
@googlebot commented Jul 2, 2019

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

ℹ️ Googlers: Go here for more info.

@joshrider (Author, Member) commented Jul 2, 2019

/assign @mattmoor @dgerd

@knative-prow-robot commented Jul 2, 2019

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: joshrider
To complete the pull request process, please assign mattmoor
You can assign the PR to them by writing /assign @mattmoor in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mux.HandleFunc(queue.RequestQueueDrainPath, healthState.DrainHandler())

return mux
}

func probeQueueHealthPath(port int, timeout time.Duration) error {
func probeQueueHealthPath(port int, timeout int) error {

@dgerd (Contributor) commented Jul 3, 2019

Since we are now dealing with an int, it can be useful to make clear that we are dealing with seconds.

Suggested change
func probeQueueHealthPath(port int, timeout int) error {
func probeQueueHealthPath(port int, timeoutSeconds int) error {
url := fmt.Sprintf(healthURLTemplate, port)

// use aggressive retries

@dgerd (Contributor) commented Jul 3, 2019

nit:

Suggested change
// use aggressive retries
// Use aggressive sub-second retries.
res, err := httpClient.Get(url)

if err != nil {
return errors.Wrap(err, "failed to probe")

@dgerd (Contributor) commented Jul 3, 2019

If the original error does not include the URL, adding it here could be valuable for debugging.

@joshrider (Author, Member) commented Jul 4, 2019

I'm not sure I understand this comment. Is the suggestion to include the URL in the error message? We build the URL on the fly, but (at present) it will always be http://127.0.0.1:8022/health.

Maybe something better than "failed to probe" is called for?

@dgerd (Contributor) commented Jul 8, 2019

Yep, all I meant was that failed to probe: http://127.0.0.1:8022/health provides a little more detail to someone looking at the error in the logs without having to trace through the code.

}

var lastErr error

// The 25 millisecond retry interval is an unscientific compromise between wanting to get
// started as early as possible while still wanting to give the container some breathing
// room to get up and running.
timeoutErr := wait.PollImmediate(25*time.Millisecond, timeout, func() (bool, error) {
timeoutErr := wait.PollImmediate(25*time.Millisecond, readiness.PollTimeout, func() (bool, error) {

@dgerd (Contributor) commented Jul 3, 2019

nit: Make the 25ms a constant rather than a magic value here

@@ -353,9 +395,16 @@ func main() {
StatChan: statChan,
}, time.Now())

coreProbe, err := parseProbe(*ucProbe)
if err != nil {
logger.Fatalf("Queue container failed to parse readiness probe %#v", err)

@dgerd (Contributor) commented Jul 3, 2019
Suggested change
logger.Fatalf("Queue container failed to parse readiness probe %#v", err)
logger.Fatalw("Queue container failed to parse readiness probe", zap.Error(err))
@@ -293,11 +325,21 @@ func probeQueueHealthPath(port int, timeout time.Duration) error {
return nil
}

// parseProbe takes a json serialised *corev1.Probe and returns a Probe or an error

@dgerd (Contributor) commented Jul 3, 2019

nit: comments for declarations should be sentences and end in periods.

Suggested change
// parseProbe takes a json serialised *corev1.Probe and returns a Probe or an error
// parseProbe takes a json serialised *corev1.Probe and returns a Probe or an error.
}

func (p *Probe) timeout() time.Duration {
if p.IsStandardProbe() {

@dgerd (Contributor) commented Jul 3, 2019

Is this supposed to be inverted?

@joshrider (Author, Member) commented Jul 4, 2019

I think it's correct, but defaultProbeTimeout is poorly named.

This represents the maximum time we wait for a TCP connection/HTTP response. For a user-supplied probe, we use the value given to us for our single attempt. For our aggressive retries, we time out much more quickly and just try again.

mux := http.NewServeMux()
mux.HandleFunc(requestQueueHealthPath, healthState.HealthHandler(probeUserContainer))

if p.IsStandardProbe() {

@dgerd (Contributor) commented Jul 3, 2019

There seems to be lots of commonality between HealthHandler and HealthHandlerKProbe. Why the switch here rather than just having a HealthHandler function that configures based on StandardProbe flag?

@joshrider (Author, Member) commented Jul 4, 2019

Made a change. LMK what you think.

}

errs := validateProbe(p)

@dgerd (Contributor) commented Jul 3, 2019

Looks like we have previously not validated the bounds of periodSeconds and let the K8s API server handle the validation. Since we are using the periodSeconds in our controllers now we probably should add the validation that this is a positive number.

@mattmoor (Member) commented Jul 5, 2019

I'd probably also split the changes to pkg/apis into its own change, since it's fairly independent.

@joshrider (Author, Member) commented Jul 8, 2019

Added a check to make sure periodSeconds is non-negative (allowing 0 to signal the aggressive probe) and checks to make sure timeoutSeconds and failureThreshold are non-negative in the case where we're not using the aggressive probe.

)

const (
defaultProbeTimeout = 100 * time.Millisecond

@dgerd (Contributor) commented Jul 3, 2019

This is defined here and in main.go. Is this supposed to stay in sync? If so, we should make this public and share it between the two places.

@joshrider (Author, Member) commented Jul 4, 2019

The probe timeouts in probe.go and main.go are used for different things and don't necessarily need to stay in sync.

defaultProbeTimeout in main.go is used in probeUserContainer, which serves as a signal to the activator that the pod is up and ready to accept traffic. If we want to keep using the pre-existing, hard-coded TCP connect as the activator's readiness test, it seems like the constants are two similar, but different, things.

@joshrider (Author, Member) commented Jul 4, 2019

Should the activator's test be the same as the one that's set off from kubelet?

@dgerd (Contributor) commented Jul 8, 2019

@vagababov @mattmoor Any thoughts here?

@joshrider joshrider force-pushed the joshrider:custom-probes branch from 4a5f26a to 2e3e5e5 Jul 4, 2019
@mattmoor (Member) left a comment

Thanks for the change @joshrider

Left a few comments throughout, but mainly this is a big change. I think it'll be clearer to tackle in pieces now that you seem to have things working. I made a couple suggestions on places this might split nicely.

@@ -134,8 +134,13 @@ func makePodSpec(rev *v1alpha1.Revision, loggingConfig *logging.Config, observab
userContainer.TerminationMessagePolicy = corev1.TerminationMessageFallbackToLogsOnError
}

if userContainer.ReadinessProbe != nil {
if userContainer.ReadinessProbe.HTTPGet != nil || userContainer.ReadinessProbe.TCPSocket != nil {
userContainer.ReadinessProbe = nil

@mattmoor (Member) commented Jul 5, 2019

This is worth a comment for those without the context of this code review.

}

// IsKProbe indicates whether the default Knative probe with aggressive retries should be used.
func (p *Probe) IsKProbe() bool {

@mattmoor (Member) commented Jul 5, 2019

Perhaps use a more descriptive name like IsAggressive or HasUnrestrictedInterval?

} else if p.TCPSocket != nil {
err = p.tcpProbe()
} else if p.Exec != nil {
// Assumes the true readinessProbe is being executed directly on user container. See (#4086).

@mattmoor (Member) commented Jul 5, 2019

I wonder if we should still TCP probe in this case, since we otherwise have no verification of the networking path from the queue to the user-container.

return true
} else {
// using Fprintf for a concise error message in the event log
fmt.Fprintf(os.Stderr, "unimplemented probe type.")

@mattmoor (Member) commented Jul 5, 2019

Is there any state we can add to this to surface what type we did get?

@joshrider (Author, Member) commented Jul 5, 2019

We can print the whole corev1.Probe here, but all this really means at this point is that all 3 of the extant probe types are nil. Is it worth printing the whole object to catch future cases where there may be an additional probe type?

@joshrider (Author, Member) commented Jul 5, 2019

Which leads me to the possibility that the error message may be less precise than it could be...

@vagababov (Contributor) commented Jul 5, 2019

Also fmt.Fprint, since no formatting.

// TCPProbe checks that a TCP socket to the address can be opened.
// Did not reuse k8s.io/kubernetes/pkg/probe/tcp to not create a dependency
// on klog.
func TCPProbe(addr string, socketTimeout time.Duration) error {
conn, err := net.DialTimeout("tcp", addr, socketTimeout)
func TCPProbe(config TCPProbeConfigOptions) error {

@mattmoor (Member) commented Jul 5, 2019

I would also consider splitting just the changes needed for this package into its own PR.

if container.ReadinessProbe == nil {
container.ReadinessProbe = &corev1.Probe{}
}
if container.ReadinessProbe.TCPSocket == nil && container.ReadinessProbe.HTTPGet == nil {

@mattmoor (Member) commented Jul 5, 2019

What if it is an exec probe?

@joshrider (Author, Member) commented Jul 5, 2019

good catch 👍



probeJSON, err := json.Marshal(rp)
if err != nil {
probeJSON = []byte{}

@mattmoor (Member) commented Jul 5, 2019

🤔 I don't like swallowing errors :(

logger.Fatalw("Queue container failed to parse readiness probe", zap.Error(err))
}

rp := readiness.NewProbe(coreProbe, logger.With(zap.String(logkey.Key, "readinessProbe")))

@mattmoor (Member) commented Jul 5, 2019

Consider exposing readiness.EncodeProbe() and readiness.DecodeProbe to [de]serializing the probe.

// The 25 millisecond retry interval is an unscientific compromise between wanting to get
// started as early as possible while still wanting to give the container some breathing
// room to get up and running.
kProbePollInterval = 25 * time.Millisecond

@vagababov (Contributor) commented Jul 5, 2019

The naming here seems inconsistent.

// Do not use the cached connection
DisableKeepAlives: true,
},
Timeout: time.Duration(timeoutSeconds) * time.Second,

@vagababov (Contributor) commented Jul 5, 2019

Do you really need this cast?

@joshrider (Author, Member) commented Jul 5, 2019

I think so?

Timeout: timeoutSeconds * time.Second,

cmd/queue/main.go:278:27: invalid operation: timeoutSeconds * time.Second (mismatched types int and time.Duration)
// parseProbe takes a json serialised *corev1.Probe and returns a Probe or an error.
func parseProbe(ucProbe string) (*corev1.Probe, error) {
p := &corev1.Probe{}
err := json.Unmarshal([]byte(ucProbe), p)

@vagababov (Contributor) commented Jul 5, 2019

if err := ...; err != nil {...}
is nicer.

joshrider and others added 9 commits May 31, 2019
- Refactor http probe into health package & add tests
- Refactor probe function for simplicity
- reduced the timings on the unit test from 39s-> 27s
- Add support of http probe scheme
- Add support for parsing json probe in queue
@joshrider joshrider force-pushed the joshrider:custom-probes branch from 2e582df to 4a959a1 Jul 8, 2019
@knative-metrics-robot commented Jul 8, 2019

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/serving/k8s_validation.go 98.9% 99.0% 0.1
pkg/apis/serving/v1beta1/revision_defaults.go 87.5% 89.5% 2.0
pkg/queue/health/probe.go 100.0% 90.9% -9.1
pkg/queue/readiness/probe.go Do not exist 100.0%
pkg/queue/readiness/probe_encoding.go Do not exist 87.5%
@joshrider joshrider closed this Jul 16, 2019
@joshrider joshrider deleted the joshrider:custom-probes branch Aug 6, 2019