Splitting Experimental CRDs into separate API Group and Names #2912
Conversation
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: robscott. The full list of commands accepted by this bot can be found here; the pull request process is described here.
I'm not really a fan of a new experimental API group. It breaks client applications when something is promoted. |
Also not a fan of this. The solution to the "alpha in standard channel" problem seems clear to me - just remove alpha from the standard CRDs. The downsides of that approach are far more palatable and only impact extremely niche cases, while this causes widespread pain.
I'm also not convinced the sample controller is real enough to be meaningful. I've done a real controller using a similar approach (to solve half of the problems here) and it was an astronomical pain. I cannot imagine adding support for experimental CRDs with this approach.
Not to pile on, but I'm also a -1 on this; separate groups make the resources in the channel separate resources for all intents and purposes. This feels like a big burden on controllers for not a lot of gain IMO. Why not include v1alpha2 as a served version while stripping out any alpha fields?
What would this break? Presumably most controllers would still be supporting both experimental and standard channel even with this change.
I completely agree that this solution would be entirely overkill if we were just trying to solve for how to graduate GRPCRoute to standard channel. Although that's definitely where this thought process started, I think there are much more compelling reasons in favor of an approach like what I've proposed here. Specifically I think experimental channel as it stands today is a trap that more and more people are going to fall into unless we make some kind of change. Here are the problems with the model today:
In my opinion, this leaves us with a couple options:
Clients that are authoring Gateway resources (e.g. Knative) that have typed clients would break. We wouldn't be able to work with both the standard channel and the experimental channel easily.
Wouldn't you already have this issue? If a cluster only has standard channel CRDs installed and you try to install config that has experimental fields, won't that break? It seems like you'd already need to be aware of the channel of CRDs that is present when you're deciding what to configure. Or if everything fits in standard channel CRDs, just use those because they'll be far more stable and widely available. |
Seems safe to me?
I don't think this is a desired state. Nor common in other projects, including Kubernetes core. As a controller implementation, I would certainly not allow this; if the experimental code is enabled in our central controller it impacts the entire cluster, not just some namespaces that are using the experimental ones. There is a shared fate in a shared controller.
The same exists for "Alpha" API features in most Kubernetes providers. I don't see why we need new solutions here. |
It's more about the go types and client code. If I start using an experimental feature/CRD and then it's promoted to standard channel that's a breaking change for me to support. |
This is true in the case of GRPCRoute, but not likely to be true in many other cases. For example, HTTPRoute will often have several different experimental fields, and only some of them will graduate to standard in a given release. Some may also have breaking changes along the way.
Disagree. Kubernetes upstream APIs have long had the problem that no one tests them while in alpha. Gateway API + CRDs were intended to be a way to get a shorter feedback loop on API design. Repeating the problematic patterns of upstream Kubernetes APIs is not desirable here IMO.
+1 completely agree, each controller should decide if it's going to support experimental resources or not. What we've found with Gateway API is that it's very common to have multiple implementations of the API running in the same cluster, and some may offer production readiness, while others may be more experimental in nature.
This has resulted in near-zero feedback for any Kubernetes alpha APIs which is very painful (coming from someone who's had to deal with this cycle multiple times). In Gateway API we have an opportunity to have a demonstrably better feedback loop, which I believe should lead to a demonstrably better API. If no one uses or implements experimental channel because it's either too unsafe or just impossible to access on any of the managed Kubernetes providers, we've just unnecessarily recreated the same problems that upstream Kubernetes APIs have. |
This proposal continues to use the same Go types for both experimental and standard channel (just with type aliasing like we're already doing). The only thing you'd need to change is the API group you're pointing to, which I think should be relatively straightforward and also not that common of a transition. I'm assuming Knative already needs some kind of flag for whether or not to attempt to use experimental fields/CRDs; this seems like it would be a natural extension of that?
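A minimal sketch of what that "flip the API group" change could look like for a typed client. This is an illustration only: the experimental group name below is an assumption (only `gateway.networking.k8s.io` exists today), and a real client would use the apimachinery `GroupVersionResource` type rather than the standalone copy defined here.

```go
package main

import "fmt"

// Group names. The experimental group is a hypothetical placeholder,
// not a name decided by the proposal.
const (
	standardGroup     = "gateway.networking.k8s.io"
	experimentalGroup = "gateway.networking.x-k8s.io" // assumed name
)

// GroupVersionResource mirrors the shape of the apimachinery type of the
// same name, redefined here so this sketch runs standalone.
type GroupVersionResource struct {
	Group, Version, Resource string
}

// HTTPRouteGVR returns the resource identifier for HTTPRoute. Under the
// proposal, the aliased Go types stay identical; the only per-client
// decision is which group to target, driven by a single flag.
func HTTPRouteGVR(useExperimental bool) GroupVersionResource {
	group := standardGroup
	if useExperimental {
		group = experimentalGroup
	}
	return GroupVersionResource{Group: group, Version: "v1", Resource: "httproutes"}
}

func main() {
	fmt.Println(HTTPRouteGVR(false))
	fmt.Println(HTTPRouteGVR(true))
}
```

The design point being debated is exactly whether this flag flip stays this small once informers and controller-runtime machinery are involved.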
IMO it's only simple because the example you showed only uses a simple List. Once you pull in real machinery like informers, controller-runtime, custom abstractions, etc., it becomes far more complex. This is speaking from experience, when we implemented "multi version" read support in Istio for the Gateway API transition from alpha -> beta.
That's fair, I'm curious if there are any shims or reference code that we could provide that would help here. My guess here is that the vast majority of controllers would need the following:
Is there anything else I'm missing here? |
Here is an example of us handling it: https://github.com/istio/istio/pull/41238/files. You'll note we had to duplicate some of our controllers entirely. This was only acceptable because it was short-lived and caused by our own mistake in Istio rather than by the upstream API forcing it upon us. If our concern is that we will not get people trying out experimental features, I don't get how this helps. It requires both a user AND a controller to opt in to supporting it, and both are painful. I don't expect every controller to have tons of code to handle this or expect Helm charts to update to have
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
I managed to originally write the following comment on #2919. 🤦♂️ I'm repeating it here, along with @robscott's responses. Sorry for the confusion! (@robscott, please sing out if you think I'm misrepresenting you here.) @robscott, ultimately I think we're falling a bit into the how-before-what trap here -- I kind of feel like we're wrangling about how to do things without having a clear sense of exactly what we need to support. Could we back up a moment and lay out some use cases here?
A few that come immediately to my mind:
(1) is, I think, what we've been discussing with GRPCRoute. What other situations come to mind? |
The introduction of yet another version is also disruptive. I'm still not clear on how adding another version solves the problem of disruption for GRPC or other cases. I don't know the historical reasoning behind why the CRDs are not vendored, like other APIs we consume. To me that would be an alternate solution, and we haven't talked about it. |
I will pile on to what appears to be the broad consensus of all comments: I completely agree with everyone that another 'experimental' CRD is harmful - however, for the same reasons, the current experimental CRD model is even more harmful and broken.

Despite the comments - actions show broad consensus by all implementations on defining Gateway APIs as vendor extensions, under each vendor's namespace. And each vendor does have 'beta' or 'public preview' or 'GA' labeling for each API they define.

It seems there is also agreement in this thread that this project (gateway-api) should not define some other space for new APIs. I completely agree - it will lead to confusion and attempts to define APIs in a void, without an implementation.

The only thing missing is a space (or spaces) where different vendors can collaborate on a common API (after they have their own implementation) and build interoperability tests - similar to the IETF - before that API can be proposed for merger into this repository and become part of the core. Of course that depends on sets of vendors or other orgs doing this.

TL;DR: it seems we all agree - in words and actions - with what I consider the spirit of the proposal, which is to have new CRDs and features implemented in a separate API Group and using different Names.
I would note that for each API defined by a vendor or independent organization, the status (GA or private preview or whatever the vendor uses for their feature stability definition - it doesn't have to be a version) is associated with a specific feature. If a vendor defines an OTel API as v1 and marks it stable - it is certainly not an 'experimental' API, just a single-vendor API. Users can safely use the API - along with similar stable APIs from other vendors - with the deprecation policy and guarantees of each vendor. The only point where this WG is involved is when a common API needs to be defined based on (stable, proven) vendor implementations of a feature, and conformance tests need to be defined and agreed on. The process is very similar to the IETF model.
I'm hopeful that with Storage Version Migrator moving in-tree in Kubernetes 1.30 with KEP-4192, we may have a tool to help with this workflow, but (from my experience testing the out-of-tree impl) the behavior is too global/automatic by default (and therefore scary!). I hope we may be able to provide a "safe" upgrade path with a bit of custom tooling using preflight checks in
Notably, this may not handle breaking changes between e.g. v1alpha1 and v1alpha2 in the Experimental channel if we take the same approach as we are with BackendTLSPolicy, but I think that's okay?
I feel like this is more of a nice-to-have than a requirement - cloud providers have enabled such a proliferation of clusters that spinning up a new cluster with Experimental channel CRDs and redirecting some traffic to it doesn't seem too unreasonable. I view this primarily as an at-scale use case where a platform team would be managing shared app dev team access to clusters and could take on this story, not a must-have workflow for small self-serve teams. This is a pattern that wouldn't be bad to nudge users toward for other changes too, like migrating to a newer Kubernetes version instead of upgrading in-place.
For an initial v1alpha -> v1 promotion, I think simply serving v1 versions in the Experimental channel and providing some well-lit path for upgrading stored versions might be sufficient instead? (I'm not quite clear on how including v1alpha CRDs in Standard but not serving or storing them, as GRPCRoute may do, works with "automatic" translation as described in https://gateway-api.sigs.k8s.io/guides/crd-management/#api-version-removal, and if that process changes with SVM in-tree.) Post-v1, I don't think even serving alpha versions would safely allow migrating from Experimental CRDs testing a new field back to Standard though...
On AKS we're evaluating an approach to allow users to "opt out" of Gateway API management by the cluster provisioner - we'll install Standard channel CRDs by default when needed, but we want to let users "offboard" if they need functionality only available in the Experimental channel. Providing a safe path back to a managed Standard channel is a challenge currently though. Additionally, I would like to make it easier to install Experimental channel CRDs more granularly, such as "Standard channel for everything except Experimental channel HTTPRoute".
I think the way we're choosing to handle this with BackendTLSPolicy is reasonable - some pain is okay if it's not a surprise, and it's not possible to accidentally break in-use resources.
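The core safety rule behind the stored-version concern above can be sketched as a standalone check (function and variable names here are mine, not from any proposed tool): overwriting a CRD is only safe if every version recorded in the existing CRD's `status.storedVersions` is still defined in the replacement, because otherwise objects already persisted at a dropped version become unreadable.

```go
package main

import "fmt"

// safeToOverwrite reports whether replacing a CRD definition is safe with
// respect to stored data: every version listed in the existing CRD's
// status.storedVersions must still be defined in the replacement CRD.
// Versions that were only served (never used for storage) may be dropped.
func safeToOverwrite(storedVersions, replacementVersions []string) bool {
	defined := make(map[string]bool, len(replacementVersions))
	for _, v := range replacementVersions {
		defined[v] = true
	}
	for _, v := range storedVersions {
		if !defined[v] {
			return false
		}
	}
	return true
}

func main() {
	// A GRPCRoute-style promotion: objects were stored as v1alpha2, but the
	// Standard-channel replacement defines only v1 -> unsafe to overwrite.
	fmt.Println(safeToOverwrite([]string{"v1alpha2"}, []string{"v1"}))

	// After migrating storage to v1 (e.g. via Storage Version Migrator) and
	// pruning storedVersions, the overwrite becomes safe.
	fmt.Println(safeToOverwrite([]string{"v1"}, []string{"v1"}))
}
```

This is the check a preflight tool would run before letting an install overwrite Experimental-channel CRDs with Standard-channel ones.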
We have seen some of this historically, but from conversations I've had with maintainers this seems to largely be a pattern which Gateway API implementations hope to move away from to avoid end-user confusion, particularly for incremental changes to existing CRDs. For the well-defined extension points in Gateway API (filters, policies), this is a viable path though.
I think this is precisely the stage we're trying to better define here. I do expect we'll still see some new CRDs emerge from existing vendor-specific implementations (authorization policy as a prominent example), but we're largely trying to focus on the "mid-tier" with Gateway API - the path for moving from experimental shared APIs for common functionality (after being proven in vendor-specific implementations) to a standard. |
I'm sure all implementations would like their specific features to be added to the existing CRDs directly - instead of

We hope and want for a lot of things - some are feasible, others are not, and what is nice for specific implementations is certainly not so nice for the users who have to deal with divergence between implementations and can't rely

The criteria of having 2-3 implementations for a core API fail to take into account the reality of long-term supported

In any case - this proposal is orthogonal to this - if consensus exists to add a field to a core API directly as stable, with no experiment or proof - it should be added with whatever process is defined.

If a feature does not have consensus on moving directly to stable - it will still need a mechanism for experimentation and for implementations to prove the viability, users to provide feedback, etc.

As a user, I would prefer APIs that have been proven and vetted over APIs that are directly pushed to
That's very simple - if I understand the proposal correctly, it means the experimental shared API would live

Implementations can support the experimental API group for N releases - in parallel with v1. Same model used, for example, for H3 - with different drafts using other names, and the final RFC using h3.
I don't believe anyone is suggesting this.
The contention of most maintainers in this thread is that the existing Experimental channel model (as defined at https://gateway-api.sigs.k8s.io/concepts/versioning/#release-channels) is a better way to handle this both for implementations, and, importantly, for end-user experience. |
> if consensus exists to add a field to a core API directly as stable, with no experiment or proof - it should be added with whatever process is defined.

> If a feature does not have consensus on moving directly to stable - it will still need a mechanism for experimentation and for implementations to prove the viability, users to provide feedback, etc.

> The contention of most maintainers in this thread is that the existing Experimental channel model is a better way to handle this both for implementations, and, importantly, for end-user experience.

I have not seen any comment suggesting that either users or implementations are happy with the current experimental model or know a good way to handle any significant changes between experimental and v1. In Istio it has been almost impossible to fix anything between alpha and v1.
My concern is that that's because we haven't introduced many breaking changes into experimental channel yet. That's leading people to believe that experimental channel is more stable than it's intended to be.
+1, this is one of my biggest concerns. Although I'm not very familiar with Istio versioning, I'm very familiar with the problems we've faced in Kubernetes re: changing beta APIs. Whenever an API version is broadly accessible (beta in upstream Kubernetes, experimental in Gateway API), it becomes very difficult to make any breaking changes. If we're not very careful here, we're going to end up with the same result all over again where it becomes impossible to change APIs, even if they're technically labeled as alpha. My theory is that having a stronger separation via separate API groups and names will initially be somewhat painful but will lead to a much more sustainable API long term. (Imagine the pressure on API reviewers if approving an alpha API meant that everything had to be ~perfect the first time because we could never change anything after that initial release.) |
Agreed, I think this is the best case scenario. Importantly it only works when you're changing an entire resource. If you're changing an experimental field in a stable API like HTTPRoute you simply don't have that option available. The only option I can think of is "painful surprise" unless we separate the release channels like I'm proposing here. |
Yep, I think it's reasonable to offer a path to offboard CRD management, GKE also has this, but it's very difficult to offer a safe upgrade path back to managed stable CRDs. This proposal is an attempt to change that.
Unfortunately I think it would be very difficult to maintain a tool like this. We'd need to have a tool that maintained the changes between every possible combination of CRDs and detect if any were set to a non-zero value. Even if we could detect this reliably, my working theory is that stable production usage of APIs should be entirely disconnected from experimental usage and they should be able to coexist within the same cluster. This approach would mean experimental usage in the dev namespace would prevent a prod upgrade from getting a newly graduated feature that is clearly needed. Agree that we wouldn't end up with a v1alpha7 on a resource that's already made it to standard channel. Once it gets to that point the only changes allowed are backwards compatible and therefore no more version revs.
This doesn't really solve the problem for providers that are trying to provide a fully managed experience - ideally upgrades are safe and automatic. Our goal should be for a user to be able to start an upgrade and know that it will be safely executed - that's easy to accomplish if the only APIs installed by the provider are guaranteed to be stable and backwards compatible, but it falls apart if you introduce experimental APIs with the same name and group to the equation. |
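To make the difficulty concrete, here is a toy sketch of the kind of check such a tool would need. Everything here is hypothetical: the field path is invented, not a real Gateway API field, and the hand-curated map per kind is exactly the per-release maintenance burden described above - it would have to track every experimental field across every combination of CRD versions.

```go
package main

import "fmt"

// experimentalFields is a hand-curated map from resource kind to the field
// paths that exist only in the Experimental channel. Keeping this accurate
// across releases is the maintenance burden the comment above describes.
// The path below is an invented placeholder for illustration.
var experimentalFields = map[string][]string{
	"HTTPRoute": {"spec.rules.someExperimentalField"}, // hypothetical
}

// usesExperimentalFields reports which known-experimental paths are present
// in a flattened set of field paths observed on a stored object. A real tool
// would also need to detect non-zero values, enum values, and so on.
func usesExperimentalFields(kind string, setPaths map[string]bool) []string {
	var hits []string
	for _, p := range experimentalFields[kind] {
		if setPaths[p] {
			hits = append(hits, p)
		}
	}
	return hits
}

func main() {
	obj := map[string]bool{
		"spec.rules.someExperimentalField": true, // experimental usage
		"spec.hostnames":                   true, // standard field
	}
	fmt.Println(usesExperimentalFields("HTTPRoute", obj))
}
```

Even this toy version depends on a list that someone must keep in sync with every release, which is the argument against maintaining such a tool.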
I wrote a longer rant doc in the context of Istio - but IMO the concept of 'semantic versioning' for APIs is very harmful and has created major problems.

For protocols like HTTP/1.1, HTTP/2, HTTP/3 - or IPv4 and IPv6 - it works great, because they have mechanisms to be used at the same time and are all long-term stable. That's not the case with APIs or CRDs.

"Alpha" or "experimental" are just a way to justify launching APIs faster and skipping the hard work and due diligence (scale, security, usability, consistency, etc) - and putting the burden on the user to deal with any problems that are found in the API - or, as is the case in Istio, getting stuck with whatever was barely reviewed as experimental because making changes is too painful and users are already relying on the API.

It is fine to launch a throw-away API or CRD with a short support window - like the drafts that led to HTTP/3 in the protocol world - as long as it is clear the API will be dropped and replaced. It is fine to launch a vendor API - with long-term support - even if a 'least common denominator' API will also be supported later.

What is not fine is pretending it is ok for a user to take an experimental or alpha CRD and use it in any production environment ('to allow users to provide feedback') and expect we'll be able to make any structural or major changes and fix things afterwards, or play games with allowing some experimental APIs in production. If you need proof - look at Istio APIs and pseudo-APIs (env variables, etc) - and how many real changes we had between 'alpha1' and 'v1'.
> 2. Check if any CRDs, fields (or enum values? what else?) in use are missing from the Standard channel. Attempting to overwrite CRDs missing in-use stored versions will block the "missing CRDs entirely bit", but we can still maybe handle this UX a bit nicer and earlier. We can't just compare against newer served versions because that wouldn't cover post-v1 changes like adding fields to HTTPRoute in the Experimental channel (I think the example @kflynn gave with a v1alpha7 HTTPRoute is not how we intend to make post-v1 backwards-compatible additions? Please LMK if I'm mistaken though.) This check may be difficult (and I don't want to manually maintain the logic), but I'm curious if we could do this with sufficient investment in some code generation. This is somewhat similar to the approach proposed in KEP-2558: Publish versioning information in OpenAPI (https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/2558-publish-version-openapi), except that we have the benefit of already having parsable channel flag comments.
> 3. Warn user with sufficient detail.
> 4. If no warnings found (or y/N override passed?), find Gateway API group CRs with a newer available served version (for initial promotion from v1alpha to v1 use cases), create SVM migration and watch for completion. Report successful migration, provide instructions to move to Standard channel.
As always, I think that @kflynn's use cases are very useful for understanding the problems here. Before I get started discussing that though, I think that it's important to review how the channels work on a per-object basis as well as on a per-field basis.

Versioning

We have two channels in each release bundle, experimental and standard. Experimental includes
Standard includes:
That's it. This problem arises because we have a rule that we don't include any alpha things in the Standard channel.

Problems

This means that for a graduation like GRPCRoute, there's no safe, easy migration path between the v1.1 Standard resources and the v1.0 Experimental resources, because the v1.1 Standard resources don't include any definitions for the v1alpha2 resources that, if you've been using the v1.0 Experimental resources, you are already using. Technically this is fine, as @costinm mentions, because no one should be using GRPCRoute in any production scenario, and recreating all of your GRPCRoute resources from scratch means you need to check the resources as you reapply them. This is a terrible experience for the most active members of our community though, who have been doing what we need and actually testing this functionality. In order to ensure that any GRPCRoute config in the cluster before upgrading to v1.1 is present, users will need to:
This is an annoying, manual, error-prone process that Kubernetes has mechanisms designed to avoid, particularly in the case where objects can be safely round-tripped between versions, since there are no incompatible changes. (We maintainers work very hard to ensure this is the case!)

Rob's proposal

In this PR, @robscott makes the case that we should make this split more apparent to both users and implementation maintainers, by splitting the experimental code out into separate objects. This locks in the above process and makes it required for every experimental -> standard resource transition. Pros and cons of this approach as I see it: Pro:
Con:
Solving the migration problem for fields

Because of the way that Kubernetes handles unknown fields in persisted objects, changing from experimental channel to standard channel is not guaranteed to produce reliable behavior, because the following can happen:
However, if nothing touches the object, and the

Is this situation likely? No. Is it that bad? Probably not, but we can't guarantee it, which is critical. In practice, I think that it's very unlikely that objects would persist for that long without being modified at all, and if we performed the incantations to invoke the storage version migrator, then this issue will never arise, because the storage version migrator's whole job is to do a no-op write to the object to prevent exactly this sort of issue. The other thing that could conceivably happen here is that we have a field with the same name, but a different behavior. In practice, again, we don't allow this as an API change, to prevent exactly this sort of thing. New behavior == new name. In summary, solving this problem for fields with a high level of confidence involves ensuring that the storage version migrator or a similar operation is running for anyone who is unsure.

Solving this problem for whole objects

For whole objects, the story is a little easier, I think. To me, it seems that the simpler way to handle this problem is to relax the rule about not allowing any alpha resources in the Standard channel, if and only if the alpha resource being allowed is identical to the GA resource that's also included in Standard. This would mean that the user's upgrade process goes like this, using GRPCRoute as the example:
You've now migrated your config, and can safely upgrade to the later release that removes GRPCRoute

This approach implies the following graduation process for graduating whole resources from Experimental to Standard:
This is the same process we used for graduating the currently-GA resources from

This allows the safe migration of the newly promoted resource only from Experimental to Standard, but the following things can still happen in an Experimental -> Standard migration.
Regardless of what we end up doing, I think that we need to prioritize documentation about how to move between versions and channels. This should be basically the same for most transitions, since things are either going to work for sure, or need a human to check. The current proposal is effectively a way to force a human check of the experimental -> standard transition. I think we can do better than that.
FWIW I agree with this @costinm, and it's why I think allowing experimental and standard APIs for the same resource to coexist in the same cluster is a dangerous goal - it encourages this behavior. Isolating experimental CRDs in an "edge release" cluster and splitting some traffic towards it from an external load balancer layer to test new behavior would be a safer approach from a platform engineering team perspective.
@youngnick does this work if v1alpha2 is included in the CRD but not served (and maybe marked as
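The CRD shape being asked about could look roughly like this. It is a sketch only: the schemas are elided, and whether v1alpha2 would also carry a deprecation marker is exactly the open question in the comment above.

```yaml
# Sketch of a Standard-channel GRPCRoute CRD during a transition release,
# assuming the "no alpha in Standard" rule is relaxed as proposed:
# v1alpha2 stays defined (so already-stored objects remain decodable)
# but is not served; v1 is served and becomes the storage version.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: grpcroutes.gateway.networking.k8s.io
spec:
  group: gateway.networking.k8s.io
  names:
    kind: GRPCRoute
    plural: grpcroutes
  scope: Namespaced
  versions:
    - name: v1alpha2
      served: false   # kept only so existing stored objects stay readable
      storage: false
      schema: {}      # identical to v1 under this scheme; elided here
    - name: v1
      served: true
      storage: true
      schema: {}      # elided
```

Note that `status.storedVersions` would still list v1alpha2 until a storage migration (e.g. via SVM) rewrites existing objects at v1.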
@youngnick v1 definitions added to YAML for which channel, both? Frozen experimental version still set as storage version? In Experimental channel only, or Standard channel too? |
That's not a viable approach for anyone using state - and has quite a complex multi-cluster story. The fundamental property of 'experimental' is that it is not supported, may break, may have security or scale issues. Should never be used in a prod environment. Having independent or vendor CRDs - with a short support window, as some vendors call 'private preview' also has some dangers - but it's a much higher bar and coexisting with the adopted final API - or with other versions - is far safer. This also allows bigger changes ( based on feedback ) and multiple experiments at the same time, to better compare and evaluate. I think the H3 / quic process in IETF is the best analogy. |
@youngnick - I think mixing "field changes" (and in general, updates to the existing v1 CRD) with "new experimental CRDs" is a mistake - we should never 'experiment' in the v1 API. The proposal (AFAIK) is about the experimental CRDs only, and it is not optimizing for people who experiment/test but for responsible production users.

For the 2 cons you mentioned - "Having to make manual changes for these things is bad UX. The config is not actually different, ..." - the entire point of experiments is to have the ability to make significant changes based on user feedback, implementations validating that it works with their infra, and other experimental proposals that may approach things in a different way. And users are not expected to make manual changes at all when adopting the new API in any prod environment, where it matters - because (1) they are not supposed to use the experiments in prod and (2) by using a different apiGroup they can continue to use the alternative APIs as long as the implementations support them.
Completely agree. The 'experiments' - as well as vendor or 'cross vendor' production-supported APIs that are out of scope or not yet adopted by the 'core API' but are supported and usable in production - should use different API groups and ideally different resource names.

VirtualService in Istio is an example of a resource that will continue to be supported for a very long time - so are our authz and other policies. Other vendors have similar 'supported' APIs that may follow the Gateway model but will never become 'core'. We also have many experimental features - like the early implementation of persistent sessions - which we'll continue to support in parallel with the 'core v1' API. This is an example of APIs that are usable in production, where the user can continue to use them while gradually migrating to any 'core v1' (or not).

It is not a huge burden to create a resource in a new ApiGroup - and iterate over it as needed, including by having multiple distinct CRDs at the same time, in the same or different ApiGroups. Vendors can indicate support for one or more of the CRDs - and if a vendor supports that version it can be used in production, just like any vendor CRD. In the end - as feedback and evolution converge and one of the CRDs is ready for 'core v1' - with multiple vendors

The core of this proposal is that the status of a CRD or API is "supported" or "not supported" - this is not about 'experimental' or 'alpha', and not about "migrating users from alpha to v1". Users have a clear understanding
I think it's worth being explicit that the model you're proposing @costinm (which would be more akin to the CSS vendor-prefix conventions) is different from this proposal, which is a single, shared, experimental apiGroup housed under the Gateway API project, not several vendor-specific apiGroups and resource names (like IstioHTTPRoute or something). In some aspects, we already have vendor-specific apiGroups and resources for common concepts (like Linkerd's ServiceProfile and Istio's VirtualService), so we're at a different stage of collaboration - I think introducing an additional vendor-specific intermediate step between existing vendor resources and Gateway API adoption would be confusing to users and a significant barrier for the widespread adoption and success of the Gateway API project. |
I think Rob is proposing one shared API group because vendor and independent cross-vendor API groups are already possible and don't need a proposal. Could be something like ecosystem.gateway.... - and once we use it for one successful collaboration we can decide to create more, or stick with one. |
It is not 'another vendor' apiGroup - but a shared, cross-vendor apiGroup that will allow CRDs to evolve, get adopted by multiple vendors, and get support and testing - before getting into the core API group. While the proposal is a replacement for the current 'experimental' and alpha, I think the intent is (or should be) to provide a place where TLSRoute or other CRDs can be developed and adopted - without using the 'alpha' or 'experimental' concept, but with independent maturity (just like the vendor extensions we support today - but with multiple vendors implementing each). |
To be clear about what this is proposing and solving: This proposes splitting the existing Experimental channel into its own API group and named objects (gateway.networking.x-k8s.io and XGatewayClass, XGateway, XHTTPRoute, and so on). This is to ensure that moving from experimental to standard channel is *always* an operation that requires manual intervention; it is explicitly about making this manual intervention *required*, because we *can't* guarantee safety on the Experimental -> Standard transition.

This proposal is not about making any changes to the way the Experimental channel works. @costinm, I appreciate your concerns, but "Experimental" here does not mean "wildly experiment in any way we see fit". "Experimental" here is closer to beta than it is in other places, because we try very hard to not make breaking changes, and if we do, we indicate it with a version bump (as BackendTLSPolicy already did with moving to v1alpha3 on a breaking change).

The existing Experimental channel is however also the way that we approximate feature gates in core Kubernetes. When we add a new field in a stable object, we add it to the copy of the stable object delivered via the experimental channel, so that users can opt in to testing the new behavior, as they could if we had a feature flag for it. Feature flags are not possible with CRDs because of the way they work, so this is the compromise we've worked out. For stable objects, we strictly follow the upstream API Conventions and API Changes, which lay out safe ways to add fields to stable objects, as done in upstream when feature gates are available.

To be clear, we will not be changing anything about how the Experimental channel works at this point in the life of the project. The time for those debates was some years ago, and now there is too much built on top of these assumptions to change it. I think it seems clear that our versioning documentation could probably use an update to make some of this clearer though. |
On Mon, May 6, 2024 at 1:05 AM Nick Young wrote:
> To be clear about what this is proposing and solving: This proposes splitting the existing Experimental channel into its own API group and named objects (gateway.networking.x-k8s.io and XGatewayClass, XGateway, XHTTPRoute, and so on). This is to ensure that moving from experimental to standard channel is *always* an operation that requires manual intervention; it is explicitly about making this manual intervention *required*, because we *can't* guarantee safety on the Experimental -> Standard transition.
Understood. This solves the largest problem - but it doesn't say "and we'll never add any other API group or class names in the future". Best to move incrementally.

The last part - "intervention required" - is correct, but there is a very important piece of context: "at some point" or "gradually". An implementation may allow a user to continue to use XHTTPRoute in parallel with the updated HTTPRoute for a few releases, so the user can safely and gradually move. That's not possible with the current mechanism.
> This proposal is *not* about making any changes to the way the Experimental channel works. @costinm, I appreciate your concerns, but "Experimental" here does not mean "wildly experiment in any way we see fit". "Experimental" here is closer to beta than it is in other places, because we try *very* hard to not make breaking changes, and if we do, we indicate it with a version bump (as BackendTLSPolicy already did with moving to v1alpha3 on breaking change).
And that's one of the main problems: the 'version bump' is extremely painful, and it forces us to "try very hard to not make breaking changes". It is called experimental, not "beta", for a reason. We should have the ability to make breaking changes and not have the 'experiment' be almost identical to v1.
> The existing Experimental channel is however *also* the way that we approximate feature gates in core Kubernetes. When we add a new field in a *stable* object, we add it to the copy of the *stable* object delivered via the *experimental* channel, so that users *can opt in* to testing the new behavior, as they could if we had a feature flag for it. Feature flags are *not possible* with CRDs because of the way they work, so this is the compromise we've worked out.
And with this proposal we don't have to make any compromise and are closer to feature gates in core k8s (in particular if we also do the second step in my suggested improvement to this proposal). Users/admins can install or not install the gateway.networking.x-k8s.io API group. With my (extended) proposal, users would install the authz.gateways.networking.x-k8s.io CRD as a feature gate for just the authz.
> For *stable* objects, we strictly follow the upstream [API Conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md) and [API Changes](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md), which lay out *safe* ways to add fields to *stable* objects, *as done in upstream when feature gates are available*.
With this proposal we can do the same, with the added benefit that we can test the 'safe ways to add fields' and get feedback from users and implementations - since they will happen in XHttpRoute first, and that can be used concurrently with HttpRoute.
> To be clear, we will *not* be changing anything about how the Experimental channel works at this point in the life of the project. The time for those debates was some years ago, and now there is too much built on top of these assumptions to change it.
I think this proposal is a fundamental change in how 'experimental' works - it allows it to be closer to feature gates and improves the life of users and our ability to make better APIs. I'm not sure what is 'built on top' of the current mechanism - but this proposal also has the property that it can be tried alongside: we just need to create the new apiGroup, use it for one API, see how it goes - and not get stuck with some model we decided years ago. That's pretty much the same problem - making a decision and not having any ability to change it because "too much is built on top".

I did make a similar proposal at the beginning of this WG, when we were discussing the problems with the Istio and Ingress APIs. It's hard to make decisions without data and experimentation - I think we discovered that what was decided years ago in Gateway - and even longer ago in Istio - is not perfect and needs changes.
> I think it seems clear that our versioning documentation could probably use an update to make some of this clearer though.
What it should make clear is the downsides and risks of the current approach - we tend to forget that every decision has tradeoffs, and there is too little on the risks of using the current experimental (or the "never, ever allow use of experimental in production" recommendation - which I hope we generally agree on).
|
This is a very helpful discussion that should continue, but it may be helpful to schedule a meeting so we can iterate a bit more quickly on these ideas. If you'd be interested in joining that discussion, please comment on this Slack thread: https://kubernetes.slack.com/archives/CR0H13KGA/p1715098072035809 |
I'm a -1 on this change because it's very disruptive for the end user (entire teams - platform and app); there's at least a two-time major cost here.

Attempting to share the user-first perspective taken from https://gateway-api.sigs.k8s.io/concepts/versioning/#release-channels highlights that the Experimental Channel is a superset of the Stable Channel.

End User/Team Journey:
- Cluster + Implementation Capability - It's possible that clusters or implementations will either support only one of the channels.
- Usability - By focusing and investing in tooling to improve the experience of transitioning from one channel to the other, we can eliminate the need for forcing the user to make config changes, which will make them less inclined to want these channel transitions (as a community we don't want this user behavior). |
I am -1 on this as well. This will be extremely detrimental to user experience. Users will need to know whether to use XType or Type in all usage. This is not just the person trying to use an experimental field, either. There are references (Policy, Gateway, Route) that all will need to be aligned and updated. Imagine I have a GW with 1000 routes and I want to use an experimental GW field; I need to somehow go update ALL the routes (likely owned by different personas in different namespaces) to use the XGateway.

Additionally, we now have a situation where we have 2 disjoint sources for every type. I cannot even see all gateways in the system with kubectl, as I now have to do two separate queries, one per type.

One of the goals of this is to increase experimental adoption. I think this will harm it. Given the above UX, experimental is so painful to use I don't see much user adoption. It is also extremely painful for an implementation. Coupled with less usage, this will lead to fewer implementations supporting experimental; I would likely push for dropping experimental support in Istio if this goes forward (speaking as an individual, though, not for the whole project). Today, all implementations actually support experimental resources implicitly, they just may not support experimental fields within those APIs. In that I mean everyone can read HTTPRoute, but they may just ignore the experimental fields.

Another thing to consider is that this proposal puts a high burden on everyone (users+implementation, standard+experimental usage) forever. The current issues we face only impact users of experimental. Making experimental more ergonomic may seem like a pressing issue now, but 1-2 years from now it will likely be a very niche thing, as more and more functionality moves to core. However, if we do this now we will be stuck with the cruft of experimental forever.

What I propose instead is that we have some simple tooling to help migrate from Experimental to Standard. |
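The "2 disjoint sources for every type" burden described above can be sketched in Go. The types here are simplified stand-ins, not the real generated Gateway API client types; the point is that with a split API group, "list all gateways" becomes two API calls whose results every client and controller must merge itself.

```go
package main

import "fmt"

// Stand-ins for the generated types; in a real controller these would be
// gatewayv1.Gateway and the proposed experimental XGateway, which live in
// different API groups and are distinct resources to the API server.
type Gateway struct{ Name string }
type XGateway struct{ Name string }

// allGatewayNames merges the results of two separate LIST calls, which is
// what any tool that wants a complete view would have to do.
func allGatewayNames(std []Gateway, exp []XGateway) []string {
	names := make([]string, 0, len(std)+len(exp))
	for _, g := range std {
		names = append(names, g.Name)
	}
	for _, g := range exp {
		names = append(names, g.Name+" (experimental)")
	}
	return names
}

func main() {
	std := []Gateway{{Name: "prod-gw"}}
	exp := []XGateway{{Name: "canary-gw"}}
	fmt.Println(allGatewayNames(std, exp))
	// → [prod-gw canary-gw (experimental)]
}
```

With a single API group, one LIST (and one informer/watch) would cover both; with the split, every consumer carries this merge logic.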
IIUC, the current risk here is: I apply Standard CRDs and some fields are silently ignored. There is also another risk of Kubernetes platforms providing managed GW CRDs, where the user cannot control them. If users of these platforms are not aligned with the stability-vs-flexibility constraints the platform provides, they should work with the platform to provide customization options. To pick on a concrete one like GKE, I don't think it's unreasonable to have an option for experimental if customers demand such a thing; GKE literally has the carve-out in their api ( |
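A sketch of the silent-pruning risk mentioned above (the manifest is illustrative; `timeouts` is used here as an example of a field that started in the Experimental channel):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-route
spec:
  parentRefs:
    - name: example-gateway
  rules:
    - backendRefs:
        - name: example-svc
          port: 80
      # If the cluster only has the Standard-channel CRD installed and its
      # schema does not include this field, the API server's structural-schema
      # pruning drops it on apply with no error - the route is accepted but
      # behaves differently than the author intended.
      timeouts:
        backendRequest: 5s
```

This is why "apply Standard CRDs over Experimental ones" is not a safe downgrade: nothing fails loudly; config is just lost.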
Mostly agree with the last 2 comments. But I also consider this thread as no longer an issue for me - if a user or vendor enables any experimental in prod, they will deal with the consequences; they are free to shoot themselves in the foot and hopefully will learn from mistakes.

For any new feature - IMO vendor APIs are the best approach; they can be used in production as soon as the vendor decides they are stable, and can expose the full capabilities of the implementation. What I realized is that at the end of the day it doesn't really matter what experimental does; it only matters to not allow it in a prod environment, and that's a choice each vendor or implementation can make. The rest - vendor APIs - is settled and broadly used. With Istio and many other vendors providing pretty stable extensions (most Istio APIs are now attachable), users don't have to deal with experimental at all; they can use stable and more capable versions until the feature becomes v1, and keep using them as long as they need.

So it's really about each implementation and users making the right choice about production. An implementation shipping a feature as a stable v1 vendor API first - with all capabilities - seems a better idea than playing experimental games, and allows a better way to identify commonality and adoption. I still think a place to track and link all vendor extensions and identify common patterns and features would be nice - but a wiki or blog page can do this without any formalities. |
I think both @howardjohn and @arkodg have the same goal here, and I certainly agree that something like this would help. I do have concerns about this approach though:
Yep, we certainly hoped that there would be a safe way to install and manage experimental CRDs on behalf of customers, but I still haven't found one. Many of the problems have been covered above, but essentially it's ~impossible to safely and automatically upgrade CRDs that may have breaking changes, and it's ~impossible to safely support a transition from experimental to standard channel, which would undoubtedly be frustrating for anyone relying on this. You could of course offer an interactive gwctl-like experience like you described above, but that seems to defeat the purpose of automatic CRD management.
I'd argue that is an anti-pattern we very much do not want to support. Can you imagine a breaking change in the next version of experimental Gateways? Now a user is stuck with 1000 routes attached to an old/outdated experimental Gateway without a safe upgrade path. I'd much rather the user deploy an experimental Gateway in isolation with only the relevant routes, ideally with a separate implementation, to ensure the experimental resource does not impact any production workloads.
That's certainly an outcome I'd like to avoid. Although this could harm experimental adoption for some users, it may also increase adoption on managed platforms where it would otherwise be inaccessible. I'm also not convinced that the UX is inherently bad; it just makes it easier for users to clearly flag and understand when/where they are using experimental APIs. That could reduce accidental usage of experimental channel, but hopefully would not have a huge impact on intentional usage. I'm hopeful that we can reach a solution that is not extremely painful for implementations, but if we can't, I agree that we should not proceed with this.
I'm not sure how this proposal impacts current users of standard channel - this intentionally avoids any modifications there. This would certainly be disruptive for experimental channel users, but that's also part of the contract included with using experimental resources. My biggest fear with our current trajectory is that people are going to get burnt using Gateway API in production and accidentally transitioning between release channels. If an API is seen as unsafe/dangerous, that will really limit adoption going forward. I agree that 1-2 years from now, experimental channel will be less important, but I think that's all the more reason to isolate it from standard channel to avoid experimental usage breaking standard/production usage. Based on our meeting earlier today, it seems like migrating to a GH discussion would be easier for people to follow, so I've created #3106, and will close this PR out. Feel free to respond here if you want to respond to anything specific in this thread, but in general, I'd encourage migrating to the discussion for follow ups wherever it's practical. |
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR is a follow up to #2844. As I've been considering how we'll handle the graduation of GRPCRoute, it's become clear to me that our current experimental and standard channel separation is flawed. This is an attempt to fix that.
Essentially the problem is that once someone chooses to install an experimental version of a CRD, they have no safe path to go back to standard channel. GRPCRoute did not cause this problem, but it did highlight it. Essentially, we'll need to include "v1alpha2" in our standard channel version of GRPCRoute simply to ensure that it can actually be installed in clusters that previously had GRPCRoute.
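The GRPCRoute constraint described above follows from how CRD upgrades work: a replacement CRD must list every version recorded in the existing CRD's `status.storedVersions`, or the apply is rejected. An abridged sketch (real schemas omitted) of what the Standard-channel CRD would therefore have to carry:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: grpcroutes.gateway.networking.k8s.io
spec:
  group: gateway.networking.k8s.io
  names:
    kind: GRPCRoute
    plural: grpcroutes
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema: {type: object} # real schema omitted
    - name: v1alpha2
      served: true
      storage: false
      schema:
        openAPIV3Schema: {type: object} # kept only so clusters that stored
        # objects at v1alpha2 via the Experimental channel can upgrade
```

So the Standard channel inherits an alpha version purely because Experimental users may already have stored objects at it - the exact coupling this PR tries to break.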
This PR proposes a big change. It moves all experimental channel CRDs to a separate API group (`gateway.networking.x-k8s.io`) and gives all resources an `X` prefix to denote their experimental status. This has the result of completely separating the resources. Practically that means that experimental and standard channel Gateways can coexist in the same cluster, but that the only possible migration path between channels involves recreating resources.

This would admittedly be annoying for controller authors, but I'm hoping only moderately. This approach relies on type aliases to minimize the friction. Here's what I'd expect most controllers to do:
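A minimal sketch of the type-alias idea, using simplified stand-in types rather than the real generated API types: if the experimental kind is a Go type alias of the standard one, a single reconcile function can handle both without any conversion code.

```go
package main

import "fmt"

// Stand-ins for the generated API types (illustrative, not the real
// gatewayv1 package).
type HTTPRouteSpec struct{ Hostnames []string }
type HTTPRoute struct {
	Name string
	Spec HTTPRouteSpec
}

// The experimental kind as a type alias: XHTTPRoute *is* HTTPRoute to the
// Go type system, so shared controller code needs no conversion.
type XHTTPRoute = HTTPRoute

func reconcile(r *HTTPRoute) string {
	return fmt.Sprintf("reconciled %s (%d hostnames)", r.Name, len(r.Spec.Hostnames))
}

func main() {
	std := &HTTPRoute{Name: "prod-route", Spec: HTTPRouteSpec{Hostnames: []string{"example.com"}}}
	exp := &XHTTPRoute{Name: "x-route", Spec: HTTPRouteSpec{Hostnames: []string{"test.example.com"}}}

	fmt.Println(reconcile(std))
	// Because of the alias, *XHTTPRoute is the same type as *HTTPRoute.
	fmt.Println(reconcile(exp))
}
```

The controller would still register two watches (one per API group/kind), but all business logic downstream of the watch operates on a single type.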
Take a look at `hack/sample-client` in this PR for an overly simple example of using experimental and standard channel types together.

All of this may sound like a huge pain, so why bother? I think this approach comes with some pretty important benefits:
This PR is still very much a WIP, opening it early to get some feedback on the direction.
Does this PR introduce a user-facing change?: