Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indicate a new Route "Readiness" condition #1156

Closed
carlisia opened this issue May 10, 2022 · 12 comments
Closed

Indicate a new Route "Readiness" condition #1156

carlisia opened this issue May 10, 2022 · 12 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Milestone

Comments

@carlisia
Copy link
Contributor

carlisia commented May 10, 2022

Goals

  • indicate if all configuration needed to perform Route related operations has been fully propagated through the entire data plane.

  • otherwise, indicate if the configuration had a point of failure, or if it is still converging towards being "ready"

  • indicate the semantics for what a correct implementation of what it means for Routes to be "ready" for configuration updates as well

Constraint

Some implementations will never be able to provide this status. For this reason, this should be considered as "extended support".

Description

For implementations that are able to provide this status, doing so according to a uniform spec would greatly unburden Gateway API consumers from:

  • having to figure out how to implement their own readiness check

  • implementing readiness checks that don't conform uniformly to what it means to be "ready"

  • not having means to easily decide when it should continue to retry connecting or halt trying for not having a way to identify if failures and time-outs originate from an application level or from the Gateway controller being not yet "ready"

Suggestions

Name: RouteConditionType = "Ready"

Possible RouteConditionReason options:

  • True

    • "Ready"
  • False

    • "Unsupported" (indicates it is not implemented intentionally)

    • "InProgress" or "Pending" (still attempting to converge into a "readiness" state

    • "Failed" (some step in the "readiness" check failed)

    • "TBD" (see below; if it wasn't the readiness check that failed, but another condition that was a pre-requisite for a "Ready:True"

Considerations

  • Once a Route is in a "Ready:False", Reason: Failed state, what should trigger a retry? Arbitrarily looping over the readiness check logic, maybe at user configurable intervals? Or should that only be triggered once there is a new change in the configuration?

  • In face of other condition types for Routes, existing as well as others yet to be added, ex "Accepted", "Programmed", etc, how should the condition reason for the "Ready:False" be indicated? Examples:

    • one of the Route conditions becomes "False" => The RouteConditionReasoncould be the same as the condition for that RouteConditionType

    • multiple RouteConditionType become "False" => The RouteConditionReasoncould be the same as the condition for the RouteConditionType that is closer to the "Ready" type IF there is a natural order/sequence of operations. Otherwise, if multiple reasons exist


Alternative consideration

  • Readiness could be expressed as a non-condition status field.
@carlisia carlisia added the kind/feature Categorizes issue or PR as related to a new feature. label May 10, 2022
@carlisia carlisia changed the title Indicate a "Readiness" state Indicate a new Route "Readiness" state May 10, 2022
@carlisia carlisia changed the title Indicate a new Route "Readiness" state Indicate a new Route "Readiness" condition May 10, 2022
@evankanderson
Copy link
Contributor

Conditions are quad-state, generally simplified to tri-state (imma gonna make Tim cry here):

  • True: The condition has been checked and is currently fulfilled
  • False: The condition has been checked, and cannot be met
  • Unknown: The controller has observed this resource, but could not (yet) determine the status of this condition, possibly because the resource is still changing.
  • "" (Condition not reported or empty string): The controller has not observed this resource, or has not filled in this condition. Generally treated the same as Unknown, but may provide a hint that the controller cannot provide this condition.

I'd suggest that the "Unsupported" reason should explicitly map to Unknown state, rather than False.

@carlisia
Copy link
Contributor Author

Adding related discussions here for convenience:

@robscott
Copy link
Member

I'd suggest that the "Unsupported" reason should explicitly map to Unknown state, rather than False.

I think there's some nuance here. I think there will be a set of features that implementations are aware of but can't or don't support. In those cases, I think Ready == False with a Unsupported reason is appropriate. I think Unknown could be used to indicate if an implementation really didn't know if they supported a feature, but that seems unlikely.

@robscott
Copy link
Member

As far as naming, there's been some earlier discussion that a "Ready" condition may be too precise for an eventually consistent world. It may be more realistic to use a term like "Programmed" to indicate that the dataplane has been programmed but that it may take some period of time for those changes to propagate.

@evankanderson
Copy link
Contributor

I'd suggest that the "Unsupported" reason should explicitly map to Unknown state, rather than False.

I think there's some nuance here. I think there will be a set of features that implementations are aware of but can't or don't support. In those cases, I think Ready == False with a Unsupported reason is appropriate. I think Unknown could be used to indicate if an implementation really didn't know if they supported a feature, but that seems unlikely.

I think Ready == False suggests that the controller knows that the resource isn't Ready, rather than that it's unable to measure whether or not the resource is Ready. I worry about downstream software taking that as a signal that the system is broken (and possibly blocking further reconciliation), rather than empty/unknown meaning "figure it out yourself, I can't help".

As a programmatic user of the Gateway API, I'm hoping that the Conditions provide a summary of logic that's either complicated to extract from the status or not otherwise exposed in the status, both for the happy case and the error case. There's too much Kubernetes debugging today that ends up requiring the user read the controller logs, which absolutely doesn't work for users with limited cluster privileges or for other programs attempting to consume the resources as an API.

@robscott
Copy link
Member

I worry about downstream software taking that as a signal that the system is broken (and possibly blocking further reconciliation), rather than empty/unknown meaning "figure it out yourself, I can't help".

That makes sense. I'd actually been assuming something else - a Gateway would not accept a Route that included configuration that was recognized but unsupported. As I'm thinking about that more, it seems like this specific reason may actually make more sense if tied to a different condition such as "Accepted".

As a programmatic user of the Gateway API, I'm hoping that the Conditions provide a summary of logic that's either complicated to extract from the status or not otherwise exposed in the status, both for the happy case and the error case. There's too much Kubernetes debugging today that ends up requiring the user read the controller logs, which absolutely doesn't work for users with limited cluster privileges or for other programs attempting to consume the resources as an API.

💯 completely agree

@mikemorris
Copy link
Contributor

mikemorris commented May 24, 2022

a Gateway would not accept a Route that included configuration that was recognized but unsupported. As I'm thinking about that more, it seems like this specific reason may actually make more sense if tied to a different condition such as "Accepted".

I explored a specific case of this a bit in #1146, and was leaning towards RouteConditionType Accepted with status False and adding an appropriate RouteConditionReason.

I'd agree with @evankanderson's suggestion that an implementation unable to provide RouteConditionType Ready could either not set it, or set it with status Unknown.

Once a Route is in a "Ready:False", Reason: Failed state, what should trigger a retry? Arbitrarily looping over the readiness check logic, maybe at user configurable intervals? Or should that only be triggered once there is a new change in the configuration?

I'd think the latter suggestion of re-validating only when the route config (and possibly related config like any Gateway listener to which the route is attached) changes could make sense for initial implementation, with specific interval/timeout being an internal implementation detail rather than user-facing (maybe consider exposing this later if sufficient end-user need is demonstrated).

👍 to RouteConditionReason Ready for status True, RouteConditionReasonPending and Failed for status False, and to speccing this at Extended conformance level.

multiple RouteConditionType become "False" => The RouteConditionReasoncould be the same as the condition for the RouteConditionType that is closer to the "Ready" type IF there is a natural order/sequence of operations. Otherwise, if multiple reasons exist

It's a bit under-specified how multiple errors for a single condition should be handled currently (is there any convention in other projects for setting the same condition type multiple times with different reasons? this might deserve a separate issue). I think though that I'd lean more towards a self-contained reason, something like Blocked or NotChecked instead of trying to bubble up reason(s) from other conditions - thoughts?

Thanks for starting to scope this out @carlisia, definitely feels on the right track!

@youngnick
Copy link
Contributor

That makes sense. I'd actually been assuming something else - a Gateway would not accept a Route that included configuration that was recognized but unsupported. As I'm thinking about that more, it seems like this specific reason may actually make more sense if tied to a different condition such as "Accepted".

I think the most likely case for an implementation setting "Ready" to "Unknown" is that it's sent off the config, but can't guarantee that the Ready criteria are met. Then it can set "Ready" to "Unknown" to signal that things watching the Route should perform their own checking of exactly when it's Ready (if that matters to them like it does to Knative).

I'd think the latter suggestion of re-validating only when the route config (and possibly related config like any Gateway listener to which the route is attached) changes could make sense for initial implementation, with specific interval/timeout being an internal implementation detail rather than user-facing (maybe consider exposing this later if sufficient end-user need is demonstrated).

I feel like there should definitely be a check when the spec changes, but Ready is a Condition that's kind of expected to fluctuate without those changes (since it can be affected by having no endpoints etc).

It's a bit under-specified how multiple errors for a single condition should be handled currently (is there any convention in other projects for setting the same condition type multiple times with different reasons? this might deserve a separate issue).

In the past, I've kind of assumed that either the error behavior should short-circuit and only have one error, or they would be concatenated somehow. I think that if this becomes common, that's a sign that we should break those Conditions apart into more specific ones.

@robscott
Copy link
Member

We had a similar discussion about a potential "Reconciled" condition previously, adding a reference here in case any of that context is helpful in this discussion: #642

@youngnick
Copy link
Contributor

Doing this as part of #1364.

/assign

@youngnick
Copy link
Contributor

I think that we can count this work done under #1454, closing this one out.

/close

@k8s-ci-robot
Copy link
Contributor

@youngnick: Closing this issue.

In response to this:

I think that we can count this work done under #1454, closing this one out.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

No branches or pull requests

6 participants