
Status reflecting the state of Envoy programming #2021

Open

mattmoor opened this issue Dec 10, 2019 · 11 comments
Labels
  • area/status: Issues or PRs related to exposing status for APIs.
  • blocked: Blocked waiting on a dependency.
  • lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@mattmoor
Contributor

tl;dr: I'd like to know when the changes I have posted to spec: have been distributed to all of the Envoys.

Background

In Knative, one of the early problems we had with Istio (and still have) is that "Ready" doesn't actually mean "Ready": there is a programming delay. What we ultimately had to build (and may need to generalize for use here) was a system that probes each gateway pod to determine whether the programming had landed.

Recommendation

There are two key elements of any good status: block:

  1. observedGeneration: the value of metadata.generation that the status block reflects (note that metadata.generation is bumped by Kubernetes on any spec update).
  2. Enough fields to determine the state of the system, for the given generation.

Folks observing resources would then check:

if metadata.generation != status.observedGeneration {
  // The controller needs to catch up, so NOT READY.
} else if !status.IsReady() {
  // The fields in status indicate that it isn't ready at the latest generation, so NOT READY.
} else {
  // The controller has both caught up and indicated that things are ready, so READY.
}

Now, the question is what's needed for IsReady(). At a minimum, the controller would need to know how many of the Envoys have received that version of the configuration, and either report that information directly in status or summarize it. At the end of the day, it should be possible to write an IsReady() that checks, with reasonable confidence, that {envoys programmed @ status.obsGen} == {number of envoys}.
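
For concreteness, here's a minimal sketch of what that could look like, assuming hypothetical status fields (envoysProgrammed / envoysTotal) that Contour would have to populate; none of these exist today:

// Hypothetical status shape; the field names are illustrative only.
type ProxyStatus struct {
	ObservedGeneration int64 `json:"observedGeneration"`
	EnvoysProgrammed   int   `json:"envoysProgrammed"` // Envoys that have ACKed config for ObservedGeneration
	EnvoysTotal        int   `json:"envoysTotal"`      // Envoys currently known to Contour
}

// IsReady reports whether every known Envoy has acknowledged the
// configuration derived from the observed generation.
func (s ProxyStatus) IsReady() bool {
	return s.EnvoysTotal > 0 && s.EnvoysProgrammed == s.EnvoysTotal
}

The hard part, as the rest of this thread discusses, is getting Contour to populate those two counts truthfully.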

@youngnick added the area/status label on Dec 10, 2019.
@davecheney
Contributor

@mattmoor Contour is a lot better than Istio on this front -- we moved away from polling nearly two years ago. The expected update delay is in the range [100ms, 600ms).

@mattmoor
Contributor Author

@davecheney I'm talking about the fact that we have to probe the gateway Envoys to determine readiness because Istio doesn't surface anything (at all) in status. I was not referring to the mechanism by which the Envoys receive programming.

@davecheney
Contributor

davecheney commented Dec 10, 2019

@mattmoor how could we change Contour so you didn't have to probe? I.e., if the configuration is right, it should be programmed in Envoy fast enough that a human cannot observe a delay; if the configuration is not right, then waiting or probing won't fix it.

@mattmoor
Contributor Author

@davecheney it's not human observation I'm worried about. We want to be sure that things are ready when we say they are ready, and our e2e tests check things immediately once they observe that we've reported readiness. At small scale, Istio is also quite fast, but when the system is put under load (e.g. tons of services) the programming times climb.

FWIW, if I have our e2e tests retry 404s, then things get pretty far; otherwise we see a fair number of failures (though it obviously varies run to run).

@davecheney
Contributor

There are a few problems I see with implementing this request.

  1. We don't track the ACK/NACK status of the configuration we send to Envoy back to the object that caused it.
  2. Even if we did, HTTPProxy configurations are spread over many xDS tables, so there would need to be some notion of ordering to know when Envoy has a complete set of configuration. There is also ADS, which we don't use at the moment and which is believed to help here, but I'm not sure it's as simple as "just use ADS". Figuring out the signal "Envoy has all the configuration it needs" is complicated.
  3. Timeouts. If we're tracking the "up to date" status of each Envoy via the connections it makes to Contour, what happens if one Envoy goes away? Each discovery request/response cycle would need a timeout after which we give up waiting for Envoy to ACK the response, and how would we report that back to the status field? (A rough sketch of this bookkeeping follows the list.)
    3a. Tracking this gets harder when multiple Contours are deployed with leader election. Currently all Contours serve read requests, but only one writes status back to k8s.
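
To make points 1 and 3 concrete, here is a rough sketch of what per-Envoy ACK bookkeeping with a timeout might look like. The types and names are illustrative only; nothing like this exists in Contour today:

import (
	"sync"
	"time"
)

// ackTracker records, per connected Envoy, whether the current xDS version
// has been acknowledged and how long we are willing to wait for that ACK.
type ackTracker struct {
	mu        sync.Mutex
	deadlines map[string]time.Time // node ID -> deadline for ACKing the current version
	acked     map[string]bool      // node ID -> has ACKed the current version
}

// expectAck is called when a new version is pushed to a node.
func (t *ackTracker) expectAck(nodeID string, timeout time.Duration) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.deadlines[nodeID] = time.Now().Add(timeout)
	t.acked[nodeID] = false
}

// markAcked is called when the node ACKs the current version.
func (t *ackTracker) markAcked(nodeID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.acked[nodeID] = true
}

// converged reports whether every tracked Envoy has either ACKed or been
// given up on because its deadline passed (e.g. the pod went away).
func (t *ackTracker) converged(now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	for node, deadline := range t.deadlines {
		if !t.acked[node] && now.Before(deadline) {
			return false
		}
	}
	return true
}

Even this toy version surfaces the awkward decisions: what the timeout should be, and whether "converged" should count an Envoy we gave up waiting for.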

I'm not saying no to this request, but I want to explain my desire to solve your problem in a different way.

@jpeach
Contributor

jpeach commented Jan 14, 2020

at a minimum the controller would need to know how many of the Envoys have received that version of the configuration

Contour can know how many Envoys have received an update (perhaps we could persist this in Envoy with tags or something?), but I'm not sure we can know how many Envoys should have received the update without over-specifying the Envoy deployment model.

@bgagnon
Contributor

bgagnon commented Jan 15, 2020

There are metrics on the Envoy side for xDS versions:

  • envoy_http_rds_version
  • envoy_cluster_manager_cds_version
  • envoy_listener_manager_lds_version

Assuming Contour has some level of control over the versions of the xDS objects, I think it is possible to observe the convergence of Contour and Envoy on a per-pod basis.
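
As an illustration, that convergence could be observed from the outside by scraping each Envoy admin endpoint and comparing the version gauges across pods. This is only a sketch: the pod addresses and admin port are placeholders, and how the admin interface is reached (sidecar, port-forward, headless service) depends entirely on the deployment.

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// xDS version gauges exposed on the Envoy admin /stats endpoint.
var versionStats = []string{
	"cluster_manager.cds.version",
	"listener_manager.lds.version",
}

// scrapeVersions reads the version gauges from one Envoy admin endpoint.
func scrapeVersions(adminAddr string) (map[string]string, error) {
	resp, err := http.Get("http://" + adminAddr + "/stats")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	out := map[string]string{}
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		for _, stat := range versionStats {
			if strings.HasPrefix(line, stat+":") {
				out[stat] = strings.TrimSpace(strings.TrimPrefix(line, stat+":"))
			}
		}
	}
	return out, scanner.Err()
}

// converged reports whether every Envoy reports identical version gauges.
func converged(addrs []string) (bool, error) {
	var first map[string]string
	for _, addr := range addrs {
		v, err := scrapeVersions(addr)
		if err != nil {
			return false, err
		}
		if first == nil {
			first = v
			continue
		}
		for _, stat := range versionStats {
			if v[stat] != first[stat] {
				return false, nil
			}
		}
	}
	return true, nil
}

func main() {
	// Placeholder pod addresses; discovering them is deployment-specific.
	ok, err := converged([]string{"10.0.0.1:9001", "10.0.0.2:9001"})
	if err != nil {
		panic(err)
	}
	fmt.Println("converged:", ok)
}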

Also, I think the ACK mechanism that is part of the xDS protocol (see #1176) could be used to signal Envoy state updates back to Contour.

As for setting a status value on the Contour pod, I'm not sure what it means if we need to consider N Envoys -- should it reflect the fastest or slowest update?

BTW, we've built an extensive detection system to alert on Envoy desynchronization as part of the investigation for #1523. It has proven valuable beyond the resolution of this issue and could prove useful to others.

@stevesloka
Member

@mattmoor is this feature meant to solve e2e testing, and wouldn't be used in production?

@mattmoor
Contributor Author

I would say that it’s useful for any automated consumption of Contour where you’d ideally get an informer notification when the programming is ready.

We have generalized the probing logic we wrote to solve this for Istio so that we can do something similar for establishing Contour’s readiness as well, so this particular issue isn’t blocking.
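
For what it's worth, here's the kind of consumption I mean, sketched with a dynamic informer that watches HTTPProxy and treats an object as programmed once a (hypothetical) status.observedGeneration catches up with metadata.generation. That field doesn't exist today; the point is only that this becomes a plain informer callback instead of active probing:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// HTTPProxy resource; adjust the group/version to whatever CRD is installed.
	gvr := schema.GroupVersionResource{Group: "projectcontour.io", Version: "v1", Resource: "httpproxies"}

	factory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(client, 30*time.Second, "", nil)
	informer := factory.ForResource(gvr).Informer()

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			u := newObj.(*unstructured.Unstructured)
			// status.observedGeneration is the hypothetical field this issue asks for.
			observed, found, _ := unstructured.NestedInt64(u.Object, "status", "observedGeneration")
			if found && observed == u.GetGeneration() {
				fmt.Printf("%s/%s programmed at generation %d\n", u.GetNamespace(), u.GetName(), observed)
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // a real consumer would feed this signal into its own readiness logic
}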

@mattmoor
Contributor Author

FYI, I believe Istio is adding status now, though I haven't been tracking it too closely.


github-actions bot commented May 3, 2024

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the Issue is closed

You can:

  • Mark this Issue as fresh by commenting
  • Close this Issue
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

The github-actions bot added the lifecycle/stale label on May 3, 2024.