
Splitting Experimental CRDs into separate API Group and Names #2912

Closed

Conversation

@robscott (Member)

What type of PR is this?
/kind feature

What this PR does / why we need it:
This PR is a follow-up to #2844. As I've been considering how we'll handle the graduation of GRPCRoute, it's become clear to me that our current experimental and standard channel separation is flawed. This is an attempt to fix that.

Essentially the problem is that once someone chooses to install an experimental version of a CRD, they have no safe path back to standard channel. GRPCRoute did not cause this problem, but it did highlight it. In practice, we'll need to include "v1alpha2" in our standard channel version of GRPCRoute simply to ensure that it can actually be installed in clusters that previously had GRPCRoute.

This PR proposes a big change. It moves all experimental channel CRDs to a separate API group gateway.networking.x-k8s.io and gives all resources an X prefix to denote their experimental status. This has the result of completely separating the resources. Practically that means that experimental and standard channel Gateways can coexist in the same cluster, but that the only possible migration path between channels involves recreating resources.

This would admittedly be annoying for controller authors, but I'm hoping only moderately. This approach relies on type aliases to minimize the friction. Here's what I'd expect most controllers to do:

  1. Watch standard channel by default, provide an option to watch experimental channel resources
  2. When watching experimental channel resources, funnel the results of both informers into shared logic (essentially everything from informer event handlers on down would be shared, would just need separate informers)
  3. Develop an updated naming scheme for generated resources to ensure that resources generated for experimental GW API resources do not collide with resources generated for standard channel. (This new approach would mean you could have standard and experimental channel Gateways of the same name for example).

Take a look at hack/sample-client in this PR for an overly simple example of using experimental and standard channel types together.
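
For illustration only (this is not code from the PR), here is a minimal sketch of step 2 above using client-go's dynamic informers, with both channels funneling into one shared event handler. The experimental group name and the x-prefixed resource below are assumptions taken from this proposal, and versions and error handling are simplified:

```go
// Illustrative sketch only (not part of this PR): one shared event handler
// fed by dynamic informers for both channels. The experimental group and the
// x-prefixed resource name are assumptions taken from this proposal.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	standard := schema.GroupVersionResource{Group: "gateway.networking.k8s.io", Version: "v1", Resource: "httproutes"}
	// Hypothetical experimental-channel group/resource from this proposal.
	experimental := schema.GroupVersionResource{Group: "gateway.networking.x-k8s.io", Version: "v1alpha1", Resource: "xhttproutes"}

	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 10*time.Minute)

	// Shared logic: everything from the event handlers down is identical.
	handler := cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { fmt.Println("reconcile", obj) },
		UpdateFunc: func(_, obj interface{}) { fmt.Println("reconcile", obj) },
		DeleteFunc: func(obj interface{}) { fmt.Println("cleanup", obj) },
	}

	factory.ForResource(standard).Informer().AddEventHandler(handler)
	// In a real controller this would sit behind an opt-in flag.
	factory.ForResource(experimental).Informer().AddEventHandler(handler)

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop
}
```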

All of this may sound like a huge pain, so why bother? I think this approach comes with some pretty important benefits:

  1. Very clearly signals that "experimental" resources are experimental and by extension not meant to be trusted in production
  2. Allows experimental and standard channel resources to coexist in the same cluster, allowing experimentation in one namespace and production ready standard channel usage in another
  3. Avoids the possibility of someone getting stuck on experimental channel. (Before this PR, it was impossible to safely migrate from experimental channel to standard channel, so moving to experimental channel was a one way operation).

This PR is still very much a WIP, opening it early to get some feedback on the direction.

Does this PR introduce a user-facing change?:

Experimental channel CRDs have been moved to a separate API group and now have `x` as a prefix for kind and resource names.

@k8s-ci-robot (Contributor)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 30, 2024
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: robscott

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 30, 2024
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 30, 2024
@dprotaso (Contributor)

I'm not really a fan of a new experimental API group. It breaks client applications when something is promoted.

@howardjohn (Contributor) left a comment

Also not a fan of this. The solution to the "alpha in standard channel" problem seems clear to me - just remove alpha from the standard CRDs. The downsides of that approach are far more palatable and only impact extremely niche cases, while this causes widespread pain.

I'm also not convinced the sample controller is real enough to be meaningful. I've done a real controller using a similar approach (to solve half of the problems here) and it was an astronomical pain. I cannot imagine adding support for experimental CRDs with this approach.

@keithmattix (Contributor) commented Mar 31, 2024

Not to pile on, but I'm also a -1 on this; separate groups make the resources in the channel separate resources for all intents and purposes. This feels like a big burden on controllers for not a lot of gain IMO. Why not include v1alpha2 as a served version while stripping out any alpha fields?

@robscott (Member, Author) commented Apr 1, 2024

It breaks client applications when something is promoted.

What would this break? Presumably most controllers would still be supporting both experimental and standard channel even with this change.

The solution to the "alpha in standard" channel seems clear to me - just remove alpha from the standard CRDs. The downsides of that approach are far more palatable and only impact extremely niche cases, while this causes widespread pain.

This feels like a big burden on controllers for not a lot of gain IMO. Why not include v1alpha2 as a server version while stripping out any alpha fields?

I completely agree that this solution would be entirely overkill if we were just trying to solve for how to graduate GRPCRoute to standard channel. Although that's definitely where this thought process started, I think there are much more compelling reasons in favor of an approach like what I've proposed here.

Specifically I think experimental channel as it stands today is a trap that more and more people are going to fall into unless we make some kind of change. Here are the problems with the model today:

  1. There's no way to safely transition from experimental to standard. Let's say someone wanted to try out a new feature in experimental channel temporarily - now they're stuck on experimental channel in that cluster ~forever. The only truly safe path is to uninstall the experimental CRD and then install the standard one, but that's very disruptive.
  2. There's no way to just try out an experimental CRD in some portion of your cluster. You either try it for everything in your cluster or nothing. Per point 1, if you do try an experimental CRD, it's essentially a one-way transition as it stands today, there's no safe path back to stability.
  3. As we're seeing with GRPCRoute, standard channel CRDs have to adopt some characteristics of experimental channel CRDs, like providing alpha API versions, if we want it to be possible to install them in clusters that currently have experimental CRDs. This is the specific problem we're running into with GRPCRoute, but I think it's the least critical by far.
  4. Some providers like GKE have an option to install standard channel CRDs as part of cluster management. Based on discussions at KubeCon, it seems likely that this will start to extend to other cluster provisioners. It is ~impossible to install experimental channel CRDs in clusters with corresponding standard channel CRDs managed by the cluster provisioner.
  5. Although our versioning model expressly allows for breaking changes in experimental channel, it's likely that they would be very disruptive because some users have installed experimental channel CRDs without realizing it (same names, fields, etc.) and could end up with things inexplicably breaking on them the next time they upgrade CRDs.

In my opinion, this leaves us with a couple options:

  1. We can document all these problems and limitations of experimental channel. Unfortunately I think this would not be enough to keep people from getting burned on at least one of the problems described above. It would also likely significantly limit the usage of experimental channel CRDs. The success of Gateway API is entirely built on the idea that we can get feedback early via experimental channel, but that all goes away if no one uses it because it's so unsafe (see above).
  2. We can move forward with an approach like I've proposed here, introducing stronger separation between experimental and standard channel CRDs and largely resolving all of the problems I've described above. Admittedly this does come with some additional work for controller authors, but I'm hopeful that would be limited to setting up an additional set of informers and all the code below that can stay the same.

@dprotaso (Contributor) commented Apr 1, 2024

What would this break? Presumably most controllers would still be supporting both experimental and standard channel even with this change.

Clients that are authoring gateway resources (eg. Knative) that have typed clients would break. We wouldn't be able to work with both the standard channel and the experimental channel easily.

@robscott (Member, Author) commented Apr 1, 2024

Clients that are authoring gateway resources (eg. Knative) that have typed clients would break. We wouldn't be able to work with both the standard channel and the experimental channel easily.

Wouldn't you already have this issue? If a cluster only has standard channel CRDs installed and you try to install config that has experimental fields, won't that break? It seems like you'd already need to be aware of the channel of CRDs that is present when you're deciding what to configure. Or if everything fits in standard channel CRDs, just use those because they'll be far more stable and widely available.

@howardjohn (Contributor)

There's no way to safely transition from experimental to standard. Let's say someone wanted to try out a new feature in experimental channel temporarily - now they're stuck on experimental channel in that cluster ~forever. The only truly safe path is to uninstall the experimental CRD and then install the standard one, but that's very disruptive.

  • Install experimental new version with "v1+v1alpha1"
  • Storage version migrate everything to v1 (k8s will block you from skipping this step, so it's not as error-prone as it seems)
  • Install standard (removes v1alpha1)

Seems safe to me?

There's no way to just try out an experimental CRD in some portion of your cluster.

I don't think this is a desired state. Nor common in other projects, including Kubernetes core.

As a controller implementation, I would certainly not allow this; if the experimental code is enabled in our central controller it impacts the entire cluster, not just some namespaces that are using the experimental ones. There is a shared fate in a shared controller.

Some providers like GKE are have an option to install standard channel CRDs as part of cluster management. Based on discussions at KubeCon, it seems likely that this will start to extend to other cluster provisioners. It is ~impossible to install experimental channel CRDs in clusters with corresponding standard channel CRDs managed by the cluster provisioner.

The same exists for "Alpha" API features in most Kubernetes providers. I don't see why we need new solutions here.

@dprotaso (Contributor) commented Apr 1, 2024

Wouldn't you already have this issue? If a cluster only has standard channel CRDs installed and you try to install config that has experimental fields, won't that break? It seems like you'd already need to be aware of the channel of CRDs that is present when you're deciding what to configure. Or if everything fits in standard channel CRDs, just use those because they'll be far more stable and widely available.

It's more about the go types and client code. If I start using an experimental feature/CRD and then it's promoted to standard channel that's a breaking change for me to support.

@robscott (Member, Author) commented Apr 1, 2024

  • Install experimental new version with "v1+v1alpha1"
  • Storage version migrate everything to v1 (k8s will block you from skipping this step, so it's not as error-prone as it seems)
  • Install standard (removes v1alpha1)

Seems safe to me?

This is true in the case of GRPCRoute, but not likely to be true in many other cases. For example, HTTPRoute will often have several different experimental fields, and only some of them will graduate to standard in a given release. Some may also have breaking changes along the way.

I don't think this is a desired state. Nor common in other projects, including Kubernetes core.

Disagree. Kubernetes upstream APIs have long had the problem that no one tests them while in alpha. Gateway API + CRDs were intended to be a way to get a shorter feedback loop on API design. Repeating the problematic patterns of upstream Kubernetes APIs is not desirable here IMO.

As a controller implementation, I would certainly not allow this; if the experimental code is enabled in our central controller it impacts the entire cluster, not just some namespaces that are using the experimental ones. There is a shared fate in a shared controller.

+1 completely agree, each controller should decide if it's going to support experimental resources or not. What we've found with Gateway API is that it's very common to have multiple implementations of the API running in the same cluster, and some may offer production readiness, while others may be more experimental in nature.

The same exists for "Alpha" API features in most Kubernetes providers. I don't see why we need new solutions here.

This has resulted in near-zero feedback for any Kubernetes alpha APIs which is very painful (coming from someone who's had to deal with this cycle multiple times). In Gateway API we have an opportunity to have a demonstrably better feedback loop, which I believe should lead to a demonstrably better API. If no one uses or implements experimental channel because it's either too unsafe or just impossible to access on any of the managed Kubernetes providers, we've just unnecessarily recreated the same problems that upstream Kubernetes APIs have.

@robscott (Member, Author) commented Apr 1, 2024

It's more about the go types and client code. If I start using an experimental feature/CRD and then it's promoted to standard channel that's a breaking change for me to support.

This proposal continues to use the same Go types for both experimental and standard channel (just with type aliasing like we're already doing). The only thing you'd need to change is the API group you're pointing to, which I think should be relatively straightforward and also not that common of a transition. I'm assuming Knative already needs some kind of flag for whether or not to attempt to use experimental fields/CRDs, this seems like it would be a natural extension of that?
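
As a purely illustrative sketch of what that type sharing could look like (the package path, group name, and X-prefixed kind below are assumptions based on this proposal, not a published Gateway API package), an experimental-group type could reuse the standard-channel spec and status types wholesale:

```go
// Hypothetical experimental-channel package sketch. It reuses the
// standard-channel spec/status types so controller logic written against
// sigs.k8s.io/gateway-api/apis/v1 keeps working; only the group and the
// X-prefixed kind differ. Deepcopy and scheme registration are omitted.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	gatewayv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// GroupName is the experimental API group proposed in this PR.
const GroupName = "gateway.networking.x-k8s.io"

// XHTTPRoute is the experimental-channel counterpart of HTTPRoute. Its spec
// and status are shared with the standard type.
type XHTTPRoute struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   gatewayv1.HTTPRouteSpec   `json:"spec,omitempty"`
	Status gatewayv1.HTTPRouteStatus `json:"status,omitempty"`
}
```

With that shape, everything that operates on Spec and Status can be shared; only client/informer construction and scheme registration would be channel-specific.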

@howardjohn (Contributor)

This proposal continues to use the same Go types for both experimental and standard channel (just with type aliasing like we're already doing). The only thing you'd need to change is the API group you're pointing to, which I think should be relatively straightforward and also not that common of a transition. I'm assuming Knative already needs some kind of flag for whether or not to attempt to use experimental fields/CRDs, this seems like it would be a natural extension of that?

IMO it's only simple because the example you showed only uses a simple List. Once you pull in real machinery like informers, controller-runtime, custom abstractions, etc., it becomes far more complex.

This is speaking from experience when we implemented a "multi version" read support in Istio for Gateway API transition from alpha -> beta.

@robscott (Member, Author) commented Apr 1, 2024

IMO it's only simple because the example you showed only uses a simple List. Once you pull in real machinery like informers, controller-runtime, custom abstractions, etc., it becomes far more complex.

That's fair, I'm curious if there are any shims or reference code that we could provide that would help here. My guess here is that the vast majority of controllers would need the following:

  1. Event handlers for informers from both channels
  2. Interface that could update status of resources from either channel

Is there anything else I'm missing here?

@howardjohn (Contributor)

Here is an example of us handling it: https://github.com/istio/istio/pull/41238/files. You'll note we had to duplicate some of our controllers entirely. This was only acceptable because it was short lived and caused by our own mistake in Istio rather than in the upstream API forcing it upon us.

If our concern is that we will not get people trying out experimental features, I don't get how this helps. It requires both a user AND a controller to opt into supporting it, and both are painful. I don't expect every controller to have tons of code to handle this or expect Helm charts to update to have `if gateway.experimental.enabled { ... }`

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 2, 2024
@k8s-ci-robot (Contributor)

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kflynn (Contributor) commented Apr 3, 2024

I managed to originally write the following comment on #2919. 🤦‍♂️ I'm repeating it here, along with @robscott's responses. Sorry for the confusion! (@robscott, please sing out if you think I'm misrepresenting you here.)


@robscott, ultimately I think we're falling a bit into the how-before-what trap here -- I kind of feel like we're wrangling about how to do things without having a clear sense of exactly what we need to support. Could we back up a moment and lay out some use cases here?

from @robscott:
Agreed, in my opinion, we're trying to accomplish the following goals:

  1. Optimize for stability in standard channel. This means avoiding exposure of alpha API versions in standard channel to avoid future painful deprecations or long term support of alpha.
  2. Do what we can to enable users that have tried an experimental channel CRD to have an upgrade path to standard channel. ...[T]his is actually rather difficult with our current model...

A few that come immediately to my mind:

  1. The experimental channel has GRPCRoute v1alpha2 which should be promoted to v1. Ana has been installing the experimental channel in her clusters, since she's developing against GRPCRoute; she would rather use the standard channel instead, though. What should she do?

from @robscott:
A. Upgrade to experimental v1.1 CRDs that have both v1alpha2 and v1 API versions
B. Upgrade to controller that supports v1
C. Upgrade to standard channel CRDs

  2. Same situation as (1), but Ian has been managing the CRDs for Ana. How does the migration happen smoothly?

from @robscott:
Same as above, just need to make sure the above order of operations happen, doesn't matter how much time passes in between each step though.

  3. Ana has been using the experimental channel's support for the TeapotResponse stanza in HTTPRoute v1alpha7. TeapotResponse is being promoted to standard channel, and Ana wants to stop installing the experimental channel CRDs. How does TeapotResponse get moved to standard exactly? What does Ana need to do? (Assume that we currently have HTTPRoute v1 as standard.)

from @robscott:
This gets to the root of the problem... There's probably no safe path from experimental to standard in this scenario. HTTPRoute in particular usually has several experimental things at once, and not all of them will graduate at the same time. If you were to upgrade from experimental to standard channel, you'd almost certainly end up losing fields and data from experimental channel with unpredictable outcomes.

  4. Same situation as (3), but again, Ian is managing the CRDs instead of Ana having control over this.

from @robscott:
Still awful unless we split CRDs... The current state is that there's no safe path from experimental to standard. GRPCRoute is the exception because the entire resource is graduating and it has been unchanged for many releases.

  5. Shortly before TeapotResponse was proposed, the Bertrand Gateway controller wrote GEP-BERTRAND proposing RussellsTeapot with the semantics that eventually became TeapotResponse. GEP-BERTRAND was accepted into the experimental channel, but now that TeapotResponse is the accepted standard, the Bertrand folks need to cope with the fact that they have users of RussellsTeapot that need to be migrated to TeapotResponse. What should they do, exactly? Assume that they're catching this while TeapotResponse is still experimental, and that, again, we currently have HTTPRoute v1 as standard.

from @robscott:
I think unfortunately you're stuck supporting both APIs for a long time, depending on the support guarantees of your implementation.

  6. Same situation as (5), but suppose that the Bertrand folks wait until TeapotResponse has been promoted to standard.

from @robscott:
I think the same largely applies.

(1) is, I think, what we've been discussing with GRPCRoute.
(3) is, I think, a hypothetical that @robscott proposed, with concrete names and versions so we can talk about concrete solutions.
(5) is a thing that happened to me recently with a Linkerd-specific CRD, and is likely about to happen to Envoy Gateway with TLS validation. (Thankfully, the Linkerd situation happened before it was released to the world, though after I started working with it for demos.)

What other situations come to mind?

@candita (Contributor) commented Apr 3, 2024

The only truly safe path is to uninstall the experimental CRD and then install the standard one, but that's very disruptive.

The introduction of yet another version is also disruptive. I'm still not clear on how adding another version solves the problem of disruption for GRPC or other cases.

I don't know the historical reasoning behind why the CRDs are not vendored, like other APIs we consume. To me that would be an alternate solution, and we haven't talked about it.

@costinm commented Apr 12, 2024

I will pile on to what appears to be the broad consensus of the comments: I completely agree with everyone that another 'experimental' CRD is harmful - however, for the same reasons, the current experimental CRD model is even more harmful and broken.

Despite the comments - actions show broad consensus by all implementations on defining Gateway APIs as vendor extensions, under each vendor's namespace. And each vendor does have 'beta' or 'public preview' or 'GA' labeling for each API they define.

It seems there is also agreement in this thread that this project ( gateway-api ) should not define some other space for new APIs. I completely agree - it will lead to confusion and attempts to define APIs in a void, without an implementation.

The only thing missing is a space (or spaces) where different vendors can collaborate on a common API (after they have their own implementation) and build interoperability tests - similar to IETF - before that API can be proposed for merger into this repository and become part of the core.

Of course that depends on sets of vendors or other orgs doing this - it could be a WASM-oriented repo driving a common WASM API, or a telemetry repo defining telemetry APIs. In the absence of such collaboration or orgs, it is also possible to continue the current process of collecting 'prior art' - in the form of existing vendor extensions - and define the common API based on that. Which is the process we already used for HTTPRoute itself (VirtualService, Ingress and several other vendor-specific APIs were considered).

TL;DR: it seems we all agree - in words and actions - with what I consider the spirit of the proposal, which is to have new CRDs and features implemented in separate API Group and using different Names.

@costinm commented Apr 12, 2024

I would note that for each API defined by a vendor or independent organization, the status (GA or private preview or whatever the vendors use for their feature stability definition - it doesn't have to be a version) is associated with a specific feature. If a vendor defines an OTel API as v1 and marks it stable, it is certainly not an 'experimental' API, just a single-vendor API. Users can safely use the API - as well as similar stable APIs from other vendors - with the deprecation policy and guarantees of each vendor.

The only point where this WG is involved is when a common API needs to be defined based on (stable, proven) vendor implementations of a feature, and conformance tests need to be defined and agreed on.

The process is very similar to the IETF model.

@mikemorris (Contributor) commented Apr 26, 2024

I'm hopeful that with Storage Version Migrator moving in-tree in Kubernetes 1.30 with KEP-4192, we may have a tool to help with this workflow, but (from my experience testing the out-of-tree impl) the behavior is too global/automatic by default (and therefore scary!)

I hope we may be able to provide a "safe" upgrade path with a bit of custom tooling using preflight checks in gwctl, similar to the approach @howardjohn suggested in #2912 (comment):

  1. Install the new Experimental channel version with v1alpha stored and v1alpha,v1 served versions.
  2. Check if any CRDs, fields (or enum values? what else?) in use are missing from the Standard channel (a rough sketch of the version-level part of this check follows this list). Attempting to overwrite CRDs missing in-use stored versions will block the "missing CRDs entirely" bit, but we can still maybe handle this UX a bit nicer and earlier. We can't just compare against newer served versions because that wouldn't cover post-v1 changes like adding fields to HTTPRoute in the Experimental channel (I think the example @kflynn gave with a v1alpha7 HTTPRoute is not how we intend to make post-v1 backwards-compatible additions? Please LMK if I'm mistaken though.) This check may be difficult (and I don't want to manually maintain the logic), but I'm curious if we could do this with sufficient investment in some code generation. This is somewhat similar to the approach proposed in KEP-2558: Publish versioning information in OpenAPI, except that we have the benefit of already having parsable channel flag comments.
  3. Warn user with sufficient detail.
  4. If no warnings found (or y/N override passed?), find Gateway API group CRs with a newer available served version (for initial promotion from v1alpha to v1 use cases), create SVM migration and watch for completion.
  5. Report successful migration, provide instructions to move to Standard channel.
  6. Install Standard channel CRDs, removing v1alpha versions.
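
To make the version-level half of step 2 concrete, here is a rough, hypothetical sketch (the harder field/enum-level diffing discussed above is deliberately not attempted, and the standardVersions map is an illustrative stand-in for whatever the real channel definitions would generate):

```go
// Rough, hypothetical preflight sketch: flag Gateway API CRDs whose stored
// versions would not be carried by the Standard channel. The standardVersions
// map is an illustrative stand-in, not generated from real channel data.
package main

import (
	"context"
	"fmt"

	apiextensionsclientset "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := apiextensionsclientset.NewForConfigOrDie(cfg)

	// Hypothetical view of the versions the Standard channel would install.
	standardVersions := map[string][]string{
		"grpcroutes.gateway.networking.k8s.io": {"v1"},
		"httproutes.gateway.networking.k8s.io": {"v1"},
	}

	crds, err := client.ApiextensionsV1().CustomResourceDefinitions().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, crd := range crds.Items {
		if crd.Spec.Group != "gateway.networking.k8s.io" {
			continue
		}
		allowed, known := standardVersions[crd.Name]
		if !known {
			fmt.Printf("WARNING: %s is not part of the Standard channel\n", crd.Name)
			continue
		}
		for _, stored := range crd.Status.StoredVersions {
			if !contains(allowed, stored) {
				fmt.Printf("WARNING: %s has stored version %s that the Standard channel would drop\n", crd.Name, stored)
			}
		}
	}
}

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}
```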

Notably, this may not handle breaking changes between e.g. v1alpha1 and v1alpha2 in the Experimental channel if we take the same approach as we are with BackendTLSPolicy, but I think that's okay?

There's no way to just try out an experimental CRD in some portion of your cluster. You either try it for everything in your cluster or nothing.

I feel like this is more of a nice-to-have than a requirement - cloud providers have enabled such a proliferation of clusters that spinning up a new cluster with Experimental channel CRDs and redirecting some traffic to it doesn't seem too unreasonable. I view this primarily as an at-scale use case where a platform team would be managing shared app dev team access to clusters and could take on this story, not a must-have workflow for small self-serve teams. This is a pattern that wouldn't be bad to nudge users toward for other changes too, like migrating to a newer Kubernetes version instead of upgrading in-place.

As we're seeing with GRPCRoute, standard channel CRDs have to adopt some characteristics of experimental channel CRDs, like providing alpha API versions, if we want it to be possible to install them in clusters that currently have experimental CRDs.

For an initial v1alpha -> v1 promotion, I think simply serving v1 versions in the Experimental channel and providing some well-lit path for upgrading stored versions might be sufficient instead? (I'm not quite clear on how including v1alpha CRDs in Standard but not serving or storing them, as GRPCRoute may do, works with "automatic" translation as described in https://gateway-api.sigs.k8s.io/guides/crd-management/#api-version-removal, and if that process changes with SVM in-tree.) Post-v1 though, I don't think even serving alpha versions would safely allow migrating from Experimental CRDs testing a new field back to Standard...

It is ~impossible to install experimental channel CRDs in clusters with corresponding standard channel CRDs managed by the cluster provisioner.

On AKS we're evaluating an approach to allow users to "opt out" of Gateway API management by the cluster provisioner - we'll install Standard channel CRDs by default when needed, but we want to let users "offboard" if they need functionality only available in the Experimental channel. Providing a safe path back to a managed Standard channel is a challenge currently though. Additionally, I would like to make it easier to install Experimental channel CRDs more granularly, such as "Standard channel for everything except Experimental channel HTTPRoute".

Although our versioning model expressly allows for breaking changes in experimental channel, it's likely that they would be very disruptive because some have installed experimental channel CRDs without realizing it (same name, fields, etc) and could end up with things inexplicably breaking on them the next time they upgrade CRDs.

I think the way we're choosing to handle this with BackendTLSPolicy is reasonable - some pain is okay if it's not a surprise, and it's not possible to accidentally break in-use resources.

@mikemorris (Contributor) commented Apr 26, 2024

Despite the comments - actions show broad consensus by all implementations on defining Gateway APIs as vendor extensions, under each vendor's namespace.

We have seen some of this historically, but from conversations I've had with maintainers this seems to largely be a pattern which Gateway API implementations hope to move away from to avoid end-user confusion, particularly for incremental changes to existing CRDs. For the well-defined extension points in Gateway API (filters, policies), this is a viable path though.

The only point where this WG is involved is when a common API needs to be defined based on (stable, proven) vendor implementations of a feature, and conformance tests need to be defined and agreed on.

I think this is precisely the stage we're trying to better define here. I do expect we'll still see some new CRDs emerge from existing vendor-specific implementations (authorization policy as a prominent example), but we're largely trying to focus on the "mid-tier" with Gateway API - the path for moving from experimental shared APIs for common functionality (after being proven in vendor-specific implementations) to a standard.

@costinm commented Apr 26, 2024

Despite the comments - actions show broad consensus by all implementations on defining Gateway APIs as vendor extensions, under each vendor's namespace.

We have seen some of this historically, but from conversations I've had with maintainers this seems to largely be a pattern which Gateway API implementations hope to move away from to avoid end-user confusion, particularly for incremental changes to existing CRDs. For the well-defined extension points in Gateway API (filters, policies), this is a viable path though.

I'm sure all implementations would like their specific features to be added to the existing CRDs directly - instead of having to do the harder work of going through the stages of experiment, multiple implementations, and proof that it works.

We hope and want a lot of things - some are feasible, others are not, and what is nice for specific implementations is certainly not so nice for the users who have to deal with divergence between implementations and can't rely on any portable interface.

The criteria of having 2-3 implementations for a core API fails to take into account the reality of long-term supported releases, typical upgrade cycles - and multiple implementations used in the same cluster.

In any case - this proposal is orthogonal to this - if consensus exists to add a field to a core API directly as stable, with no experiment or proof - it should be added with whatever process is defined.

If a feature does not have consensus on moving directly to stable - it will still need a mechanism for experimentation and for implementations to prove the viability, users to provide feedback, etc.

As a user, I would prefer APIs that have been proven and vetted over APIs that are directly pushed to stable - even if that makes things a bit harder for implementations.

The only point where this WG is involved is when a common API needs to be defined based on (stable, proven) vendor implementations of a feature, and conformance tests need to be defined and agreed on.

I think this is precisely the stage we're trying to better define here. I do expect we'll still see some new CRDs emerge from existing vendor-specific implementations (authorization policy as a prominent example), but we're largely trying to focus on the "mid-tier" with Gateway API - the path for moving from experimental shared APIs for common functionality (after being proven in vendor-specific implementations) to a standard.

That's very simple - if I understand the proposal correctly, it means the experimental shared API would live in a different API group - like "authorization.experimental.k8s.io" - get the 3-4 implementations needed and evolve without concerns about backwards compat or stability until everyone is happy - and then copy it to the v1 API group.

Implementations can support the experimental api group for N releases - in parallel with v1.

Same model used for example for H3 - with different drafts using other names, and the final RFC using h3.

@mikemorris (Contributor) commented Apr 29, 2024

if consensus exists to add a field to a core API directly as stable, with no experiment or proof - it should be added with whatever process is defined.

I don't believe anyone is suggesting this.

If a feature does not have consensus on moving directly to stable - it will still need a mechanism for experimentation and for implementations to prove the viability, users to provide feedback, etc.

The contention of most maintainers in this thread is that the existing Experimental channel model (as defined at https://gateway-api.sigs.k8s.io/concepts/versioning/#release-channels) is a better way to handle this both for implementations, and, importantly, for end-user experience.

@costinm commented Apr 29, 2024 via email

@robscott (Member, Author)

The contention of most maintainers in this thread is that the existing Experimental channel model (as defined at https://gateway-api.sigs.k8s.io/concepts/versioning/#release-channels) is a better way to handle this both for implementations, and, importantly, for end-user experience.

My concern is that that's because we haven't introduced many breaking changes into experimental channel yet. That's leading people to believe that experimental channel is more stable than it's intended to be.

In Istio it has been almost impossible to fix anything between alpha and v1.

+1, this is one of my biggest concerns. Although I'm not very familiar with Istio versioning, I'm very familiar with the problems we've faced in Kubernetes re: changing beta APIs. Whenever an API version is broadly accessible (beta in upstream Kubernetes, experimental in Gateway API), it becomes very difficult to make any breaking changes.

If we're not very careful here, we're going to end up with the same result all over again where it becomes impossible to change APIs, even if they're technically labeled as alpha. My theory is that having a stronger separation via separate API groups and names will initially be somewhat painful but will lead to a much more sustainable API long term. (Imagine the pressure on API reviewers if approving an alpha API meant that everything had to be ~perfect the first time because we could never change anything after that initial release.)

@robscott (Member, Author)

I think #2955 (comment) is reasonable - some pain is okay if it's not a surprise, and it's not possible to accidentally break in-use resources.

Agreed, I think this is the best case scenario. Importantly it only works when you're changing an entire resource. If you're changing an experimental field in a stable API like HTTPRoute you simply don't have that option available. The only option I can think of is "painful surprise" unless we separate the release channels like I'm proposing here.

@robscott (Member, Author)

On AKS we're evaluating an approach to allow users to "opt-out" of Gateway API management by the cluster provisioner - we'll install Standard channel CRDs by default when needed, but we want to let users "offboard" if they need functionality only available in the Experimental channel. Providing a safe path back to a managed Standard channel is a challenge currently though.

Yep, I think it's reasonable to offer a path to offboard CRD management, GKE also has this, but it's very difficult to offer a safe upgrade path back to managed stable CRDs. This proposal is an attempt to change that.

  1. Check if any CRDs, fields (or enum values? what else?) in use are missing from the Standard channel. Attempting to overwrite CRDs missing in-use stored versions will block the "missing CRDs entirely bit", but we can still maybe handle this UX a bit nicer and earlier. We can't just compare against newer served versions because that wouldn't cover post-v1 changes like adding fields to HTTPRoute in the Experimental channel (I think the example @kflynn gave with a v1alpha7 HTTPRoute is not how we intend to make post-v1 backwards-compatible additions? Please LMK if I'm mistaken though.) This check may be difficult (and I don't want to manually maintain the logic), but I'm curious if we could do this with sufficient investment in some code generation. This is somewhat similar to the approach proposed in KEP-2558: Publish versioning information in OpenAPI except that we have the benefit of already having parsable channel flag comments.

Unfortunately I think it would be very difficult to maintain a tool like this. We'd need to have a tool that maintained the changes between every possible combination of CRDs and detected if any were set to a non-zero value. Even if we could detect this reliably, my working theory is that stable production usage of APIs should be entirely disconnected from experimental usage and they should be able to coexist within the same cluster. This approach would mean experimental usage in the dev namespace would prevent a prod upgrade from getting a newly graduated feature that is clearly needed.

Agree that we wouldn't end up with a v1alpha7 on a resource that's already made it to standard channel. Once it gets to that point the only changes allowed are backwards compatible and therefore no more version revs.

  • Warn user with sufficient detail.
  • If no warnings found (or y/N override passed?), find Gateway API group CRs with a newer available served version (for initial promotion from v1alpha to v1 use cases), create SVM migration and watch for completion.
  • Report successful migration, provide instructions to move to Standard channel.

This doesn't really solve the problem for providers that are trying to provide a fully managed experience - ideally upgrades are safe and automatic. Our goal should be for a user to be able to start an upgrade and know that it will be safely executed - that's easy to accomplish if the only APIs installed by the provider are guaranteed to be stable and backwards compatible, but it falls apart if you introduce experimental APIs with the same name and group to the equation.

@costinm commented Apr 30, 2024 via email

@youngnick (Contributor)

As always, I think that @kflynn's use cases are very useful for understanding the problems here.

Before I get started discussing that though, I think that it's important to review how the channels work on a per-object basis as well as on a per-field basis.

Versioning

We have two channels in each release bundle, experimental and standard.

Experimental includes:

  • Resources that are at an alpha level (GRPCRoute meets this before v1.1).
  • Fields that are not considered standard yet in already GA'd resources. So the example about the TeapotResponse stanza in HTTPRoute v1alpha7 is not really correct - there will never be a v1alpha7 of HTTPRoute. If major changes were required, we could conceivably start a v2 using v2alpha1, but that would then mean that version starts the whole experimental process over again. I cannot currently think of any way that we would need to do that for any of our graduated resources.

Standard includes:

  • v1 resources
  • Standard fields on those resources

That's it.

This problem arises because we have a rule that we don't include any alpha things in the Standard channel.

Problems

This means that for a graduation like GRPCRoute, there's no safe, easy migration path between the v1.1 Standard resources and the v1.0 Experimental resources, because the v1.1 Standard resources don't include any definitions for the v1alpha2 resources that, if you've been using the v1.0 Experimental resources, you are already using.

Technically this is fine, as @costinm mentions, because no one should be using GRPCRoute in any production scenario, and recreating all of your GRPCRoute resources from scratch means you need to check the resources as you reapply them.

This is a terrible experience for the most active members of our community though, who have been doing what we need and actually testing this functionality. In order to ensure that any GRPCRoute config in the cluster before upgrading to v1.1 is present, users will need to:

  • pull down all GRPCRoute objects from all namespaces, and save them as YAML
  • change the apiVersion field from gateway.networking.k8s.io/v1alpha2 to gateway.networking.k8s.io/v1
  • Install v1.1 standard
  • Reapply all the YAMLs they pulled down

This is an annoying, manual, error-prone process that Kubernetes has mechanisms designed to avoid, particularly in the case where objects can be safely round-tripped between versions, since there are no incompatible changes. (We maintainers work very hard to ensure this is the case!)

Rob's proposal

In this PR, @robscott makes the case that we should make this split more apparent to both users and implementation maintainers, by splitting the experimental code out into separate objects. This locks in the above process and makes it required for every experimental -> standard resource transition. Pros and cons of this approach as I see it:

Pro:

  • The split between experimental and standard is very clear.
  • The path for moving config between experimental and standard is also very clear. There is none. Users must manually make the changes in each and every object.
  • You can conceivably have different versions of the objects installed in the cluster. This allows people to experiment with new fields and objects in the same cluster as an implementation that only uses Standard objects. This also means you could conceivably be testing v1.1 experimental in the same cluster as running v1.0 Standard.

Con:

  • Having to make manual changes for these things is bad UX. The config is not actually different, the functional thing that we are saying here is "we can't guarantee that this transition is safe, so you have to have a human do it".
  • Having different versions of types with the same name (aside from their API Group) installed in the cluster is a recipe for disaster. This will mean that every interaction with a cluster with both Experimental and Standard resources installed will require, for example, kubectl get httproutes.gateway.networking.x-k8s.io or kubectl get httproutes.gateway.networking.k8s.io to disambiguate between the two. I think that if you're using the short name, which one you'll get is at best poorly defined.

Solving the migration problem for fields

Because of the way that Kubernetes handles unknown fields in persisted objects, changing from experimental channel to standard channel is not guaranteed to produce reliable behavior because the following can happen:

  • Experimental channel includes TeapotResponse filter as an experimental field for HTTPRoute. (HTTPRoute is already v1)
  • Ana uses the TeapotResponse filter in HTTPRoute objects, and this config is persisted to etcd.
  • Someone (Ian, Chihiro, or Ana) installs a Standard channel version of the Gateway definitions that does not include TeapotResponse.
  • GETs, LISTs, or any other read of this object will not include the TeapotResponse config. But it is still present in etcd, until something modifies the object, at which time the values will be pruned.
  • So, if anything at all touches the object, then the TeapotResponse config will be pruned and gone, as you would expect.

However, if nothing touches the object, and the TeapotResponse config moves to standard in a later version, then the config will be read out by the apiserver on reads, reappearing as if by magic.

Is this situation likely? No. Is it that bad? Probably not, but we can't guarantee it, which is critical. In practice, I think that it's very unlikely that objects would persist for that long without being modified at all, and if we performed the incantations to invoke the storage version migrator, then this issue would never arise, because the storage version migrator's whole job is to do a no-op write to the object to prevent exactly this sort of issue.

The other thing that could conceivably happen here is that we have a field with the same name, but a different behavior. In practice, again, we don't allow this as an API change, to prevent exactly this sort of thing. New behavior == new name.

In summary, solving this problem for fields with a high level of confidence involves ensuring that the storage version migrator or similar operation is running for anyone who is unsure.

Solving this problem for whole objects

For whole objects, the story is a little easier, I think.

To me, it seems that the simpler way to handle this problem is to relax the rule about not allowing any alpha resources in the Standard channel, if and only if the alpha resource being allowed is identical to the GA resource that's also included in Standard. This would mean that the user's upgrade process goes like this, using GRPCRoute as the example:

  • have working config of GRPCRoute at v1alpha2 using Gateway API v1.0 Experimental channel
  • install Gateway API v1.1 Standard, which will include both v1alpha2 and v1 GRPCRoute objects, for a defined number of versions. v1 is the storage version though.
  • Run the storage version migrator on all GRPCRoute objects.

You've now migrated your config, and can safely upgrade to the later release that removes GRPCRoute v1alpha2 from the storage versions. (When the storage version is v1, new objects will be saved as v1, and once the v1alpha2 is removed, attempted CRUD operations on v1alpha2 versions will fail).
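
As an illustrative aside, the "run the storage version migrator" step amounts to the no-op write described earlier in this comment. A minimal hand-rolled sketch for GRPCRoute might look like the following (this is not the real migrator; a real tool would also need retries, conflict handling, and pagination):

```go
// Hand-rolled sketch of the "no-op write" the storage version migrator
// performs: list every GRPCRoute and re-write it unchanged so the API server
// persists it at the v1 storage version.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	gvr := schema.GroupVersionResource{
		Group:    "gateway.networking.k8s.io",
		Version:  "v1", // read and re-write at the new storage version
		Resource: "grpcroutes",
	}

	ctx := context.Background()
	list, err := client.Resource(gvr).List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for i := range list.Items {
		obj := &list.Items[i]
		// Content is unchanged; the update simply rewrites the stored object
		// under the current (v1) storage version.
		if _, err := client.Resource(gvr).Namespace(obj.GetNamespace()).Update(ctx, obj, metav1.UpdateOptions{}); err != nil {
			panic(err)
		}
	}
}
```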

This approach implies the following graduation process for graduating whole resources from Experimental to Standard:

  • The community decides that an Experimental resource is complete and marks it for graduation in the next release. At this time, the resource is frozen and no further changes will be accepted for it until after that release.
  • In the next release, the v1 resources are introduced, and the YAMLs are updated to include the v1 definitions, with the frozen experimental version available as an alternate storage version and definition. As part of this change, we also declare when the alpha versions of the object will be removed from the Standard install. This is only provided as a user convenience. The actual Go types are also changed at this point so that the alpha versions are type aliases to the v1 versions.
  • After the deprecation period ends, the alpha versions are removed from everywhere.

This is the same process we used for graduating the currently-GA resources from v1beta1 to v1, it just skips the beta part.

This allows the safe migration of the newly promoted resource only from Experimental to Standard, but the following things can still happen in an Experimental -> Standard migration.

  • Any Experimental fields in use in a GA object will be lost once the storage version migrator is run.
  • Experimental objects will actually stay in the cluster until the CRD definition is manually removed (since we can't remove objects as part of a kubectl install or similar operation).

Regardless of what we end up doing, I think that we need to prioritize documentation about how to move between versions and channels. This should be basically the same for most transitions, since things are either going to work for sure, or need a human to check.

This current proposal is effectively a way to enforce having to have a human check the experimental -> standard transition. I think we can do better than that.

@mikemorris (Contributor) commented May 2, 2024

What is not fine is pretending it is ok for a user to take an experimental or alpha CRD and use it in any production environment ('to allow users to provide feedback') and expect we'll be able to make any structural or major changes and fix things afterwards, or play games with allowing some experimental APIs in production.

FWIW I agree with this @costinm, and why I think a goal of allowing experimental and standard APIs for the same resource to coexist in the same cluster is a dangerous goal, because it encourages this behavior. Isolating experimental CRDs in an "edge release" cluster and splitting some traffic towards it from an external load balancer layer to test new behavior would be a safer approach from a platform engineering team perspective.

  • have working config of GRPCRoute at v1alpha2 using Gateway API v1.0 Experimental channel
  • install Gateway API v1.1 Standard, which will include both v1alpha2 and v1 GRPCRoute objects, for a defined number of versions. v1 is the storage version though.

@youngnick does this work if v1alpha2 is included in the CRD but not served (and maybe marked as deprecated: true)? What would be the implications of that? Referring to the migration process described in https://static.sched.com/hosted_files/kcsna2022/75/KubeCon%20Detroit_%20Building%20a%20k8s%20API%20with%20CRDs.pdf#page=17

In the next release, the v1 resources are introduced, and the YAMLs are updated to include the v1 definitions, with the frozen experimental version available as an alternate storage version and definition

@youngnick v1 definitions added to YAML for which channel, both? Frozen experimental version still set as storage version? In Experimental channel only, or Standard channel too?

@costinm commented May 2, 2024

What is not fine is pretending it is ok for a user to take an experimental or alpha CRD and use it in any production environment ('to allow users to provide feedback') and expect we'll be able to make any structural or major changes and fix things afterwards, or play games with allowing some experimental APIs in production.

FWIW I agree with this @costinm, and why I think a goal of allowing experimental and standard APIs for the same resource to coexist in the same cluster is a dangerous goal, because it encourages this behavior. Isolating experimental CRDs in an "edge release" cluster and splitting some traffic towards it from an external load balancer layer to test new behavior would be a safer approach from a platform engineering team perspective.

That's not a viable approach for anyone using state - and has quite a complex multi-cluster story.

The fundamental property of 'experimental' is that it is not supported, may break, may have security or scale issues. Should never be used in a prod environment.

Having independent or vendor CRDs - with a short support window, what some vendors call 'private preview' - also has some dangers, but it's a much higher bar, and coexisting with the adopted final API - or with other versions - is far safer.

This also allows bigger changes ( based on feedback ) and multiple experiments at the same time, to better compare and evaluate.

I think the H3 / quic process in IETF is the best analogy.

@costinm commented May 2, 2024

@youngnick - I think mixing "field changes" (and in general - updates to existing v1 CRD) with "new experimental CRDs" is a mistake - we should never 'experiment' in the v1 API. The proposal (AFAIK) is about the experimental CRDs only, and it is not optimizing for people who experiment/test - but for responsible production users.

For the 2 cons you mentioned - "Having to make manual changes for these things is bad UX. The config is not actually different, ..." - the entire point of experiments is to have the ability to make significant changes, based on user feedback, implementations validating that it works with their infra, or other experimental proposals that may approach things in a different way. And users are not expected to make manual changes at all when adopting the new API in any prod environment, where it matters - because (1) they are not supposed to use the experiments in prod and (2) by using a different apiGroup they can continue to use the alternative APIs as long as the implementations support them.

Having different versions of types with the same name (aside from their API Group) installed in the cluster is a recipe for disaster.

Completely agree. The 'experiments' - as well as vendor or 'cross vendor' production-supported APIs that are out of scope or not adopted yet by the 'core API' - but are supported and usable in production - should use different API groups and ideally different resource names.

VirtualService in Istio is an example of a resource that will continue to be supported for a very long time - so are our authz and other policies. Other vendors have similar 'supported' APIs that may follow the Gateway model - but will never become 'core'. We also have many experimental features - like the early implementation of persistent sessions - which we'll continue to support in parallel with the 'core v1' API. This is an example of APIs that are usable in production, where users can continue to use them while gradually migrating to any 'core v1' (or not).

It is not a huge burden to create a resource in a new ApiGroup - and iterate over it as needed, including by having multiple distinct CRDs at the same time, in the same or different ApiGroups. Vendors can indicate support for one or more of the CRDs - and if a vendor supports that version it can be used in production, just like any vendor CRD.

In the end - as feedback and evolution converge and one of the CRDs is ready for 'core v1' - with multiple vendors already implementing and supporting it - we can add a copy to the core repo, and vendors may decide how long to support the other ApiGroup.

The core of this proposal is that the status of a CRD or API is "supported" or "not supported" - this is not about 'experimental' or 'alpha', and not about "migrating users from alpha to v1". Users have a clear understanding that a CRD can be used in production if the implementation they use provides a "production supported implementation" with a clear support window - and they will not be impacted in any way by core gateway v1 adopting the same, or a drastically different, CRD as part of the upstream standard.

@mikemorris
Contributor

It is not a huge burden to create a resource in a new ApiGroup and iterate over it as needed, including by having multiple distinct CRDs at the same time, in the same or different ApiGroups. Vendors can indicate support for one or more of the CRDs, and if a vendor supports that version it can be used in production, just like any vendor CRD.

I think it's worth being explicit that the model you're proposing @costinm (which would be more akin to the CSS vendor prefix conventions) is different from this proposal, which is a single, shared, experimental apiGroup housed under the Gateway API project, not several vendor-specific apiGroups and resource names (like IstioHTTPRoute or something).

In some respects, we already have vendor-specific apiGroups and resources for common concepts (like Linkerd's ServiceProfile and Istio's VirtualService), so we're at a different stage of collaboration - I think introducing an additional vendor-specific intermediate step between existing vendor resources and Gateway API adoption would be confusing to users and a significant barrier to the widespread adoption and success of the Gateway API project.

@costinm

costinm commented May 2, 2024

I think Rob is proposing one shared API group because vendor and independent cross-vendor API groups are already possible and don't need a proposal.

It could be something like ecosystem.gateway.... - and once we use it for one successful collaboration, we can decide to create more or stick with one.

@costinm

costinm commented May 2, 2024 via email

@youngnick
Contributor

To be clear about what this is proposing and solving:

This proposes splitting the existing Experimental channel into its own API group and object names (gateway.networking.x-k8s.io and XGatewayClass, XGateway, XHTTPRoute, and so on). This is to ensure that moving from Experimental to Standard channel is always an operation that requires manual intervention; it is explicitly about making this manual intervention required, because we can't guarantee safety on the Experimental -> Standard transition.
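
For concreteness, a resource under the proposed experimental group would look roughly like the sketch below. Only the group and kind names come from this proposal; the version string and the spec contents are illustrative assumptions.

```yaml
# Hypothetical Experimental channel Gateway under this proposal.
# Group (gateway.networking.x-k8s.io) and kind (XGateway) are taken from the
# proposal above; the version and spec details are illustrative only.
apiVersion: gateway.networking.x-k8s.io/v1alpha1
kind: XGateway
metadata:
  name: example-gateway
  namespace: experiments
spec:
  gatewayClassName: example-class
  listeners:
  - name: http
    protocol: HTTP
    port: 80
```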

This proposal is not about making any changes to the way the Experimental channel works. @costinm, I appreciate your concerns, but "Experimental" here does not mean "wildly experiment in any way we see fit". "Experimental" here is closer to beta than it is in other places, because we try very hard not to make breaking changes, and if we do, we indicate it with a version bump (as BackendTLSPolicy already did by moving to v1alpha3 on a breaking change).

The existing Experimental channel is, however, also the way that we approximate feature gates in core Kubernetes. When we add a new field to a stable object, we add it to the copy of the stable object delivered via the Experimental channel, so that users can opt in to testing the new behavior, as they could if we had a feature flag for it. Feature flags are not possible with CRDs because of the way they work, so this is the compromise we've worked out.

For stable objects, we strictly follow the upstream API Conventions and API Changes, which lay out safe ways to add fields to stable objects, as done in upstream when feature gates are available.

To be clear, we will not be changing anything about how the Experimental channel works at this point in the life of the project. The time for those debates was some years ago, and now there is too much built on top of these assumptions to change it.

I think it seems clear that our versioning documentation could probably use an update to make some of this clearer though.

@costinm

costinm commented May 6, 2024 via email

@robscott
Member Author

robscott commented May 7, 2024

This is a very valuable discussion that should continue, but it may be helpful to schedule a meeting so we can iterate a bit more quickly on these ideas. If you'd be interested in joining that discussion, please comment on this Slack thread: https://kubernetes.slack.com/archives/CR0H13KGA/p1715098072035809

@arkodg
Contributor

arkodg commented May 21, 2024

I'm a -1 to this change because it's very disruptive for the end user (entire teams - platform and app); there's a major rewrite cost that is paid at least twice:

  • rewriting current experimental config
  • rewriting again when moving to standard
  • more rewriting when trying out experimental features

Attempting to share the user-first perspective: the release-channel-overlap diagram from https://gateway-api.sigs.k8s.io/concepts/versioning/#release-channels highlights that the Experimental Channel is a superset of the Standard Channel.

End User/Team Journey

  1. Moving from experimental->standard
  • The team (cluster/platform admin) MUST be aware that they are losing out on functionality that some platform and app teams were relying on. This communication from the cluster/platform teams to the platform/app teams about the version/channel change can happen using internal team communication tooling, or it can be automated/mitigated using in-cluster tooling (e.g. the status field, a webhook that rejects configs if certain fields are set) or a CLI (gwctl highlighting which fields are being ignored).
    Config changes MAY be needed (e.g. GRPCRoute from v1a2 to v1)
  2. Moving from standard->experimental
  • The cluster/platform teams should make the platform/app teams aware of this change, so they can try out newer fields. In-cluster tooling (like GatewayClass status) or a CLI (gwctl) can highlight the experimental features.
    No config changes needed here

Cluster + Implementation Capability - It's possible that clusters or implementations will support only standard or only experimental CRDs and will call out that they do not support both release channels, which means they don't need to support these transitions. Some implementations and cluster providers may want to support this transition based on their user base.

Usability - By focusing on and investing in tooling to improve the experience of transitioning from one channel to the other, we can eliminate the need to force users to make config changes; forced config changes would make users less inclined to want these channel transitions (as a community we don't want this user behavior).

@howardjohn
Contributor

I am -1 on this as well. This will be extremely detrimental to user experience.

Users will need to know whether to use XType or Type in all usage. This is not just the person trying to use an experimental field, either. There are references (Policy, Gateway, Route) that will all need to be aligned and updated. Imagine I have a GW with 1000 routes and I want to use an experimental GW field; I need to somehow go update ALL the routes (likely owned by different personas in different namespaces) to use the XGateway.
This is pervasive across all usage. For instance, every Helm chart will now either need to not support experimental (which harms our goal of "get more feedback") or have a knob to tune for every single usage. This also applies to things like ownerRefs, etc. -- everywhere needs to be updated.
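
For illustration, the kind of per-route change being described here would look roughly like the sketch below. The gateway.networking.x-k8s.io group and XGateway kind are taken from the proposal; the names are made up, and whether the route itself would also have to move to the experimental group is exactly the ambiguity being raised.

```yaml
# Before: a route attached to a Standard channel Gateway (group/kind default to
# gateway.networking.k8s.io / Gateway when omitted from parentRefs).
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-route
  namespace: app-team-1
spec:
  parentRefs:
  - name: my-gateway
  rules:
  - backendRefs:
    - name: my-service
      port: 8080
---
# After: the same route rewritten to attach to the proposed experimental type.
# The explicit group and kind below are the only change, but it has to be made
# in every one of those 1000 routes, across namespaces and owners.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-route
  namespace: app-team-1
spec:
  parentRefs:
  - group: gateway.networking.x-k8s.io
    kind: XGateway
    name: my-gateway
  rules:
  - backendRefs:
    - name: my-service
      port: 8080
```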

Additionally, we now have a situation where we have 2 disjoint sources for every type. I cannot even see all gateways in the system with kubectl, as I now have to do kubectl get gateway && kubectl get xgateway. To Kubernetes these are totally unrelated, so we lose guarantees around uniqueness, etc. This then needs to be reconciled in controllers and in users' mental models. I cannot fathom a user understanding the behavior if they happen to make a Gateway and an XGateway with the same name, especially if their controller may not even support it. This also requires 2 watches on the API server, which, again, are disjoint and can cause strange issues and behavior (not to mention performance overhead).

One of the goals of this is to increase experimental adoption. I think this will harm it. Given the above UX, experimental is so painful to use that I don't see much user adoption. It is also extremely painful for an implementation. Coupled with less usage, this will lead to fewer implementations supporting experimental; I would likely push for dropping experimental support in Istio if this goes forward (speaking as an individual, though, not for the whole project).

Today, all implementations actually support experimental resources implicitly; they just may not support experimental fields within those APIs. By that I mean everyone can read HTTPRoute, but they may just ignore timeouts or some other experimental field. Now every implementation will need to explicitly opt in to have support, which seems like a high barrier.
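
To make that concrete, a manifest like the minimal sketch below can be read by any implementation today; one that only supports the Standard schema simply ignores the experimental part. The timeouts field is used here as an example of a field that was Experimental-only at the time of this discussion, and the route and service names are illustrative.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: checkout
spec:
  parentRefs:
  - name: my-gateway
  rules:
  - backendRefs:
    - name: checkout-svc
      port: 8080
    # Experimental-era field: an implementation that only supports the Standard
    # schema can still process the rest of the route and simply not act on this.
    timeouts:
      request: 10s
```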

Another thing to consider is that this proposal puts a high burden on everyone (users + implementations, standard + experimental usage) forever. The current issues we face only impact users of experimental. Making experimental more ergonomic may seem like a pressing issue now, but 1-2 years from now it will likely be a very niche thing, as more and more functionality moves to core. However, if we do this now we will be stuck with the cruft of experimental forever.


What I propose instead is that we have some simple tooling to help migrate from Experimental to Standard.

IIUC, the current risk here is that I apply Standard CRDs and some fields are silently ignored. gwctl can have a new option to pre-check this, which would basically be the equivalent of kubectl get gateway-api -oyaml | validate-against-openapi --schema=standard-crd.yaml (run the new CRD validation against each existing resource). This tooling is trivial to build; there are basically already existing tools to do this and libraries we can use to embed in gwctl. This would be a recommended upgrade path for users.
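
A rough sketch of that pre-check idea with plain kubectl, assuming a scratch cluster or context that has only the Standard channel CRDs installed; the context names and resource list are illustrative, and a real gwctl subcommand would wrap something equivalent.

```shell
# Export Gateway API resources from the cluster currently running Experimental CRDs.
kubectl --context prod get gateways,httproutes,grpcroutes,referencegrants -A -o yaml \
  > gateway-api-resources.yaml

# Replay them against a scratch cluster that only has Standard channel CRDs installed,
# without persisting anything. With strict field validation, fields that exist only in
# the Experimental CRDs are reported instead of being silently dropped.
kubectl --context standard-scratch apply --dry-run=server --validate=strict \
  -f gateway-api-resources.yaml
```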

There is also another risk of Kubernetes platforms providing managed GW CRDs, where the user cannot control them. If users of these platforms are not aligned with the stability vs flexibility constraints the platform provides, they should work with the platform to provide customization options. To pick on a concrete one like GKE, I don't think it's unreasonable to have an option for experimental if customers demand such a thing; GKE literally has the carve-out in its API (--gateway-api=standard, which is future-proofed to allow --gateway-api=experimental), presumably to add this if there is sufficient customer demand.

@costinm

costinm commented May 21, 2024

Mostly agree with the last 2 comments.

But I also consider this thread as no longer an issue for me - if a user or vendor enables anything experimental in prod, they will deal with the consequences; they are free to shoot themselves in the foot and hopefully will learn from mistakes.

For any new feature - IMO vendor APIs are the best approach; they can be used in production as soon as the vendor decides they are stable, and they can expose the full capabilities of the implementation.

What I realized is that at the end of the day it doesn't really matter what experimental does; it only matters that it is not allowed in a prod environment, and that's a choice each vendor or implementation can make. The rest - vendor APIs - is settled and broadly used.

With Istio and many other vendors providing pretty stable extensions (most Istio APIs are now attachable), users don't have to deal with experimental at all - they can use stable and more capable versions until the feature becomes v1, and keep using them as long as they need.

So it's really about each implementation and its users making the right choice about production. An implementation shipping a feature as a stable v1 vendor API first - with all capabilities - seems a better idea than playing experimental games, and it allows a better way to identify commonality and adoption.

I still think a place to track and link all vendor extensions and identify common patterns and features would be nice - but a wiki or blog page can do this without any formalities.

@robscott
Member Author

Usability - By focusing on and investing in tooling to improve the experience of transitioning from one channel to the other, we can eliminate the need to force users to make config changes; forced config changes would make users less inclined to want these channel transitions (as a community we don't want this user behavior).

gwctl can have a new option to pre-check this, which would basically be the equivalent of kubectl get gateway-api -oyaml | validate-against-openapi --schema=standard-crd.yaml (run the new CRD validation against each existing resource). This tooling is trivial to build; there are basically already existing tools to do this and libraries we can use to embed in gwctl. This would be a recommended upgrade path for users.

I think both @howardjohn and @arkodg have the same goal here, and I certainly agree that something like this would help. I do have concerns about this approach though:

  • Despite our best efforts, people will still use other less safe installation methods.
  • This kind of CLI tool is more complex than it sounds on the surface. What if there's only an issue upgrading 1/7 CRDs? What if a problematic CR is applied in the middle of the upgrade process? How can gwctl possibly know about the versions of the API supported by implementations running in the cluster?

There is also another risk of Kubernetes platforms providing managed GW CRDs, where the user cannot control them. If users of these platforms are not aligned with the stability vs flexibility constraints the platform provides, they should work with the platform to provide customization options. To pick on a concrete one like GKE, I don't think it's unreasonable to have an option for experimental if customers demand such a thing; GKE literally has the carve-out in its API (--gateway-api=standard, which is future-proofed to allow --gateway-api=experimental), presumably to add this if there is sufficient customer demand.

Yep, we certainly hoped that there would be a safe way to install and manage experimental CRDs on behalf of customers, but I still haven't found one. Many of the problems have been covered above, but essentially it's ~impossible to safely and automatically upgrade CRDs that may have breaking changes, and it's ~impossible to safely support a transition from experimental to standard channel, which would undoubtedly be frustrating for anyone relying on this. You could of course offer an interactive gwctl-like experience like you described above, but that seems to defeat the purpose of automatic CRD management.

Imagine I have a GW with 1000 routes and I want to use an experimental GW field; I need to somehow go update ALL the routes (likely owned by different personas in different namespaces) to use the XGateway.

I'd argue that is an anti-pattern we very much do not want to support. Can you imagine a breaking change in the next version of experimental Gateways, leaving a user stuck with 1000 routes attached to an old/outdated experimental Gateway without a safe upgrade path? I'd much rather the user deploy an experimental Gateway in isolation with only the relevant routes, ideally with a separate implementation, to ensure the experimental resource does not impact any production workloads.

One of the goals of this is to increase experimental adoption. I think this will harm it. Given the above UX, experimental is so painful to use that I don't see much user adoption. It is also extremely painful for an implementation. Coupled with less usage, this will lead to fewer implementations supporting experimental; I would likely push for dropping experimental support in Istio if this goes forward (speaking as an individual, though, not for the whole project).

That's certainly an outcome I'd like to avoid. Although it could harm experimental adoption for some users, it may also increase adoption on managed platforms where it would otherwise be inaccessible. I'm also not convinced that the UX is inherently bad; it just makes it easier for users to clearly flag and understand when/where they are using experimental APIs, which could reduce accidental usage of the experimental channel but hopefully would not have a huge impact on intentional usage. I'm hopeful that we can reach a solution that is not extremely painful for implementations, but if we can't, I agree that we should not proceed with this.

Another thing to consider is that this proposal puts a high burden on everyone (users + implementations, standard + experimental usage) forever. The current issues we face only impact users of experimental. Making experimental more ergonomic may seem like a pressing issue now, but 1-2 years from now it will likely be a very niche thing, as more and more functionality moves to core. However, if we do this now we will be stuck with the cruft of experimental forever.

I'm not sure how this proposal impacts current users of standard channel - it intentionally avoids any modifications there. It would certainly be disruptive for experimental channel users, but that's also part of the contract that comes with using experimental resources. My biggest fear with our current trajectory is that people are going to get burnt using Gateway API in production by accidentally transitioning between release channels. If an API is seen as unsafe/dangerous, that will really limit adoption going forward. I agree that 1-2 years from now the experimental channel will be less important, but I think that's all the more reason to isolate it from standard channel, to avoid experimental usage breaking standard/production usage.

Based on our meeting earlier today, it seems like migrating to a GH discussion would be easier for people to follow, so I've created #3106, and will close this PR out. Feel free to respond here if you want to respond to anything specific in this thread, but in general, I'd encourage migrating to the discussion for follow ups wherever it's practical.

@robscott robscott closed this May 21, 2024