
Status reflecting the state of Envoy programming #2021

Open

mattmoor opened this issue Dec 10, 2019 · 11 comments
Labels
  • area/status: Issues or PRs related to exposing status for APIs.
  • blocked: Blocked waiting on a dependency.
  • lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@mattmoor
Contributor

tl;dr: I'd like to know when the changes I have posted to spec: have been distributed to all of the Envoys.

Background

In Knative, one of the early problems we had with Istio (and still have) is that "Ready" doesn't actually mean "Ready": there is a programming delay. What we ultimately had to build (and may need to generalize for use here) was a system that probes each gateway pod to determine whether the programming had landed.

Recommendation

There are two key elements of any good status: block:

  1. observedGeneration: the value of metadata.generation that the status block reflects (note that metadata.generation is bumped by Kubernetes on any spec update).
  2. Enough fields to determine the state of the system, for the given generation.

Folks observing resources would then check:

if metadata.generation != status.observedGeneration {
  // The controller needs to catch up, so NOT READY.
} else if !status.IsReady() {
  // The fields in status indicate that it isn't ready at the latest generation, so NOT READY.
} else {
  // The controller has both caught up and indicated that things are ready, so READY.
}

Now, the question is what's needed for IsReady(). At a minimum, the controller would need to know how many of the Envoys have received that version of the configuration, and either report that information directly in status or summarize it. At the end of the day, it should be possible to write an IsReady() that checks, with reasonable confidence, that {envoys programmed @ status.obsGen} == {number of envoys}.
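
For concreteness, here's a minimal sketch of what that could look like, assuming hypothetical status fields (envoysProgrammed / envoysTotal) that Contour would have to populate; none of these exist today:

// Hypothetical status shape; the field names are illustrative only.
type ProxyStatus struct {
	ObservedGeneration int64 `json:"observedGeneration"`
	EnvoysProgrammed   int   `json:"envoysProgrammed"` // Envoys that have ACKed config for ObservedGeneration
	EnvoysTotal        int   `json:"envoysTotal"`      // Envoys currently known to Contour
}

// IsReady reports whether every known Envoy has acknowledged the
// configuration derived from the observed generation.
func (s ProxyStatus) IsReady() bool {
	return s.EnvoysTotal > 0 && s.EnvoysProgrammed == s.EnvoysTotal
}

The hard part, as the rest of this thread discusses, is getting Contour to populate those two counts truthfully.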

@youngnick added the area/status label on Dec 10, 2019.
@davecheney
Contributor

@mattmoor Contour is a lot better than Istio on this front -- we moved away from polling nearly two years ago. The expected update delay is in the range [100ms, 600ms).

@mattmoor
Contributor Author

@davecheney I'm talking about the fact that we have to probe the gateway Envoys to determine readiness because Istio doesn't surface anything (at all) in status. I was not referring to the mechanism by which the Envoys receive programming.

@davecheney
Contributor

davecheney commented Dec 10, 2019

@mattmoor how could we change Contour so you didn't have to probe? I.e., if the configuration is right, it should be programmed in Envoy fast enough that a human cannot observe a delay; if the configuration is not right, then waiting or probing won't fix it.

@mattmoor
Contributor Author

@davecheney it's not human observation I'm worried about. We want to be sure that things are ready when we say they are ready, and our e2e tests check things immediately once they observe that we've reported readiness. At small scale, Istio is also quite fast, but when the system is put under load (e.g. tons of services) the programming times climb.

FWIW, if I have our e2e tests retry 404s, then things get pretty far; otherwise we see a fair number of failures (though it obviously varies run to run).

@davecheney
Contributor

There are a few problems I see with implementing this request.

  1. We don't track the ACK/NACK status of the configuration we send to Envoy back to the object that caused it.
  2. Even if we did, HTTPProxy configurations are spread over many xDS tables, so there would need to be some notion of ordering to know when Envoy has a complete set of configuration. There is also ADS, which we don't use at the moment and which is believed to help here, but I'm not sure it's as simple as "just use ADS". Figuring out the signal "Envoy has all the configuration it needs" is complicated.
  3. Timeouts. If we're tracking the "up to date" status of each Envoy via the connections it makes to Contour, what happens if one Envoy goes away? Each discovery request/response cycle would need a timeout after which we give up waiting for Envoy to ACK the response, and how would we report that back to the status field? (A rough sketch of this bookkeeping follows the list.)
    3a. Tracking this gets harder when multiple Contours are deployed with leader election. Currently all Contours serve read requests, but only one writes status back to k8s.
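
To make points 1 and 3 concrete, here is a rough sketch of what per-Envoy ACK bookkeeping with a timeout might look like. The types and names are illustrative only; nothing like this exists in Contour today:

import (
	"sync"
	"time"
)

// ackTracker records, per connected Envoy, whether the current xDS version
// has been acknowledged and how long we are willing to wait for that ACK.
type ackTracker struct {
	mu        sync.Mutex
	deadlines map[string]time.Time // node ID -> deadline for ACKing the current version
	acked     map[string]bool      // node ID -> has ACKed the current version
}

// expectAck is called when a new version is pushed to a node.
func (t *ackTracker) expectAck(nodeID string, timeout time.Duration) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.deadlines[nodeID] = time.Now().Add(timeout)
	t.acked[nodeID] = false
}

// markAcked is called when the node ACKs the current version.
func (t *ackTracker) markAcked(nodeID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.acked[nodeID] = true
}

// converged reports whether every tracked Envoy has either ACKed or been
// given up on because its deadline passed (e.g. the pod went away).
func (t *ackTracker) converged(now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	for node, deadline := range t.deadlines {
		if !t.acked[node] && now.Before(deadline) {
			return false
		}
	}
	return true
}

Even this toy version surfaces the awkward decisions: what the timeout should be, and whether "converged" should count an Envoy we gave up waiting for.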

I'm not saying no to this request, but I want to explain my desire to solve your problem in a different way.

@jpeach
Contributor

jpeach commented Jan 14, 2020

at a minimum the controller would need to know how many of the Envoys have received that version of the configuration

Contour can know how many Envoys have received an update (perhaps we could persist this in Envoy with tags or something?), but I'm not sure we can know how many Envoys should have received the update without over-specifying the Envoy deployment model.

@bgagnon
Contributor

bgagnon commented Jan 15, 2020

There are metrics on the Envoy side for xDS versions:

  • envoy_http_rds_version
  • envoy_cluster_manager_cds_version
  • envoy_listener_manager_lds_version

Assuming Contour has some level of control over the versions of the xDS objects, I think it is possible to observe the convergence of Contour and Envoy on a per-pod basis.
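
As an illustration, that convergence could be observed from the outside by scraping each Envoy admin endpoint and comparing the version gauges across pods. This is only a sketch: the pod addresses and admin port are placeholders, and how the admin interface is reached (sidecar, port-forward, headless service) depends entirely on the deployment.

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// xDS version gauges exposed on the Envoy admin /stats endpoint.
var versionStats = []string{
	"cluster_manager.cds.version",
	"listener_manager.lds.version",
}

// scrapeVersions reads the version gauges from one Envoy admin endpoint.
func scrapeVersions(adminAddr string) (map[string]string, error) {
	resp, err := http.Get("http://" + adminAddr + "/stats")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	out := map[string]string{}
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		for _, stat := range versionStats {
			if strings.HasPrefix(line, stat+":") {
				out[stat] = strings.TrimSpace(strings.TrimPrefix(line, stat+":"))
			}
		}
	}
	return out, scanner.Err()
}

// converged reports whether every Envoy reports identical version gauges.
func converged(addrs []string) (bool, error) {
	var first map[string]string
	for _, addr := range addrs {
		v, err := scrapeVersions(addr)
		if err != nil {
			return false, err
		}
		if first == nil {
			first = v
			continue
		}
		for _, stat := range versionStats {
			if v[stat] != first[stat] {
				return false, nil
			}
		}
	}
	return true, nil
}

func main() {
	// Placeholder pod addresses; discovering them is deployment-specific.
	ok, err := converged([]string{"10.0.0.1:9001", "10.0.0.2:9001"})
	if err != nil {
		panic(err)
	}
	fmt.Println("converged:", ok)
}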

Also, I think the ACK mechanism that is part of the xDS protocol (see #1176) could be used to signal Envoy state updates back to Contour.

As for setting a status value on the Contour pod, I'm not sure what it means if we need to consider N Envoys -- should it reflect the fastest or slowest update?

BTW, we've built an extensive detection system to alert on Envoy desynchronization as part of the investigation for #1523. It has proven valuable beyond the resolution of this issue and could prove useful to others.

@stevesloka
Member

@mattmoor is this feature meant to solve e2e testing, and wouldn't be used in production?

@mattmoor
Contributor Author

I would say that it’s useful for any automated consumption of Contour where you’d ideally get an informer notification when the programming is ready.

We have generalized the probing logic we wrote to solve this for Istio so that we can do something similar for establishing Contour’s readiness as well, so this particular issue isn’t blocking.
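
For what it's worth, here's the kind of consumption I mean, sketched with a dynamic informer that watches HTTPProxy and treats an object as programmed once a (hypothetical) status.observedGeneration catches up with metadata.generation. That field doesn't exist today; the point is only that this becomes a plain informer callback instead of active probing:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// HTTPProxy resource; adjust the group/version to whatever CRD is installed.
	gvr := schema.GroupVersionResource{Group: "projectcontour.io", Version: "v1", Resource: "httpproxies"}

	factory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(client, 30*time.Second, "", nil)
	informer := factory.ForResource(gvr).Informer()

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			u := newObj.(*unstructured.Unstructured)
			// status.observedGeneration is the hypothetical field this issue asks for.
			observed, found, _ := unstructured.NestedInt64(u.Object, "status", "observedGeneration")
			if found && observed == u.GetGeneration() {
				fmt.Printf("%s/%s programmed at generation %d\n", u.GetNamespace(), u.GetName(), observed)
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // a real consumer would feed this signal into its own readiness logic
}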

@mattmoor
Contributor Author

FYI, I believe Istio is adding status now, though I haven't been tracking it too closely.


github-actions bot commented May 3, 2024

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the Issue is closed

You can:

  • Mark this Issue as fresh by commenting
  • Close this Issue
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

The github-actions bot added the lifecycle/stale label on May 3, 2024.