API Versioning compatibility in the future #202

Closed
frodopwns opened this issue Feb 16, 2021 · 7 comments
Labels
question Further information is requested

Comments

@frodopwns
Contributor

Describe the feature

Backward compatibility. In the future, the API for the custom resources this project uses may change. Many maintainers of operators choose not to implement semver or versioning at all. This can lead to painful releases.

Luckily, kubebuilder has a system in place for managing this: https://book.kubebuilder.io/multiversion-tutorial/conversion-concepts.html

What would the new user story look like?

A cluster maintainer runs a cluster on version 1.0 of Capsule. That maintainer's customers would like the new functionality in version 2.0 of Capsule.

When the maintainer deploys Capsule 2.0, the existing CRDs get patched to add the new CRD version rather than being replaced. Any existing instances should still be accessible, and the new version of the controllers should handle conversion between 1.0 CRs and 2.0 CRs seamlessly.

What would the new interaction with Capsule look like? For example:

  1. What are the prerequisites for this?
    Each new version of a CRD would require a new folder in ./api with ConvertTo and ConvertFrom methods that handle conversion to the stored version of the CRD (a rough sketch follows below).

Kustomize templates would need changing to add the conversion webhooks.
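
To make the first point concrete, here is a minimal sketch of the kubebuilder Convertible pattern, assuming a hypothetical v1alpha1 spoke converting to a v1beta1 hub; the Tenant type, the import path of the hub package, and the copied fields are illustrative, not Capsule's actual schema:

package v1alpha1

import (
    "sigs.k8s.io/controller-runtime/pkg/conversion"

    v1beta1 "github.com/clastix/capsule/api/v1beta1" // assumed hub/stored version
)

// ConvertTo converts this (spoke) Tenant to the hub version.
func (src *Tenant) ConvertTo(dstRaw conversion.Hub) error {
    dst := dstRaw.(*v1beta1.Tenant)
    dst.ObjectMeta = src.ObjectMeta
    dst.Spec.Owner = src.Spec.Owner // copy the fields both versions share
    return nil
}

// ConvertFrom converts from the hub version back to this (spoke) version.
func (dst *Tenant) ConvertFrom(srcRaw conversion.Hub) error {
    src := srcRaw.(*v1beta1.Tenant)
    dst.ObjectMeta = src.ObjectMeta
    dst.Spec.Owner = src.Spec.Owner
    // hub-only fields would have to be dropped or stashed in annotations here
    return nil
}

The hub version itself only needs the marker method func (*Tenant) Hub() {} so that controller-runtime knows which version everything converts through.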

Is there a plan for this already in place?

@frodopwns frodopwns added the blocked-needs-validation Issue need triage and validation label Feb 16, 2021
@prometherion prometherion added question Further information is requested and removed blocked-needs-validation Issue need triage and validation labels Feb 17, 2021
@prometherion
Member

Thanks for opening this issue/question, really appreciated, especially considering it's such a hot topic!

At the current state, Capsule is adding features upon request or making minor fixes that don't affect the base schema, e.g. the allowed external service IPs to mitigate CVE-2020-8554. That's pretty easy since we're only adding nullable fields, so no side effects are expected.

If I understood the documentation correctly, once a conversion webhook is set up, the Kubernetes API Server should be able to return the converted object even when it is requested with the old version, as happened with ingresses.networking.k8s.io/v1beta1 and ingresses.extensions/v1beta1. I'm not sure this also works for the Shared Informer, although I'd say it should, since the underlying implementation is just a watch operation against the API server; I haven't had the time to dig into this, also considering it's not our priority in the current phase.

Anyway, if this holds true, we should only need to tweak the current Tenant controller codebase a little so that it interacts directly with the new version, dropping functionalities introduced in newer versions when dealing with old ones.

I'm aware that we're almost certainly going to face changes in the near future, as soon as the CRD API version moves from v1alpha1 to v1beta1 or beyond. This is a topic I had to focus on while working for other companies; at that time there wasn't much information or code base to address the issue, so I ended up with the following idea.

What I wanted to do is provide a domain interface, totally abstracted from the CRD version: each underlying version would implement the domain functions, such as retrieving the allowed hostnames, IPs, the NetworkPolicies to apply, etc. That said, I took a look at the relevant section of the kubebuilder CRD versioning docs and it seems that all we need to do is implement the Convertible interface.
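
Just to illustrate that idea (the names here are hypothetical, not actual Capsule code), the version-agnostic domain interface could look something like:

package domain

import networkingv1 "k8s.io/api/networking/v1"

// Tenant is a version-agnostic view of the CRD: each API version
// (v1alpha1, v1beta1, ...) would provide its own implementation.
type Tenant interface {
    AllowedHostnames() []string
    AllowedExternalServiceIPs() []string
    NetworkPolicies() []networkingv1.NetworkPolicySpec
}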

I haven't tested yet whether we'd still need two controllers, one per CRD version, but what popped into my mind while typing this reply is the interaction with the webhooks, since they're versioned:

https://github.com/clastix/capsule/blob/e3b927f1122cc6fd48030da807c2b094f5c5819c/pkg/webhook/tenant/validating.go#L32

I know I wrote a lot of stuff, so here's a tl;dr: right now we don't have anything in place; once the problem is actually there, we'll go for the smoothest and neatest implementation without decreasing the DX and reliability of the Operator (i.e. without causing downtime or deletion of resources upon CRD version conversion).

@frodopwns
Contributor Author

You shouldn't need additional controllers per version. The last time I used this, installing a new CRD version caused Kubernetes to use the conversion hook to upgrade all existing instances. Of course, in that case, I was updating the stored version to the new version. In theory, you could leave the oldest version as the stored version, which would mean new CR instances get converted downward before operations happen.

As long as the controller operates under the assumption that it is working on the hub version, you should only need one controller. The Kubernetes API converts your instance to that stored (hub) version before returning it to you.
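
A minimal sketch of that single-controller wiring, assuming a hypothetical v1beta1 hub type for Capsule's Tenant (import path and type name are illustrative):

import (
    ctrl "sigs.k8s.io/controller-runtime"

    v1beta1 "github.com/clastix/capsule/api/v1beta1" // assumed hub/stored version
)

// One reconciler, registered only against the hub version; the API server
// converts objects created as older versions before the controller sees them.
func (r *TenantReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&v1beta1.Tenant{}).
        Complete(r)
}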

Thanks for taking the time to respond! I know the conversion stuff doesn't matter too much early on for an operator's development but once in production, it becomes essential.

@prometherion
Member

Thanks for sharing all your insights @frodopwns, really appreciated, also because the topic, as I said, is hot and tough for most people, and getting feedback is definitely gold!

You shouldn't need additional controllers per version

This is great news!

I know the conversion stuff doesn't matter too much early on for an operator's development but once in production, it becomes essential

Right now we have some companies running Capsule in production and, honestly, this issue has been raised several times; I couldn't agree more on managing the lifecycle of CRDs as well as possible and automating it.


Regarding webhooks instead, did you have the chance to test whether the same logic (v1 >>> hub >>> v2) applies to them?

@frodopwns
Contributor Author

I haven't worked with webhooks much but I suspect you would need to implement a new hook for each version.

The hooks are attached to the versioned API struct itself:

package v1

import (
    ctrl "sigs.k8s.io/controller-runtime"
    logf "sigs.k8s.io/controller-runtime/pkg/log"
)

// log is for logging in this package (as in the kubebuilder CronJob tutorial).
var cronjoblog = logf.Log.WithName("cronjob-resource")

func (r *CronJob) SetupWebhookWithManager(mgr ctrl.Manager) error {
    return ctrl.NewWebhookManagedBy(mgr).
        For(r).
        Complete()
}

// Default implements webhook.Defaulter so a webhook will be registered for the type
func (r *CronJob) Default() {
    cronjoblog.Info("default", "name", r.Name)

    if r.Spec.ConcurrencyPolicy == "" {
        r.Spec.ConcurrencyPolicy = AllowConcurrent
    }
    if r.Spec.Suspend == nil {
        r.Spec.Suspend = new(bool)
    }
    if r.Spec.SuccessfulJobsHistoryLimit == nil {
        r.Spec.SuccessfulJobsHistoryLimit = new(int32)
        *r.Spec.SuccessfulJobsHistoryLimit = 3
    }
    if r.Spec.FailedJobsHistoryLimit == nil {
        r.Spec.FailedJobsHistoryLimit = new(int32)
        *r.Spec.FailedJobsHistoryLimit = 1
    }
}

And I believe the webhook gets called before the conversion to the stored version happens.

@frodopwns
Contributor Author

I have an example multi-version repo here: https://github.com/Azure/azure-service-operator
Could come in handy for reference.

One interesting thing about the hub and spokes: which version you choose as your hub can have ramifications.

I chose to make my newest version always be the hub version. I did this because when you set an old version as the hub and add new fields to your models, Kubernetes needs to store them in the old version. This is done by adding the new fields into labels when converting to the hub version.

Another side-effect is automatic updates to old CRs. When you implement conversion hooks and set your stored CRD version to the new version, Kubernetes converts the old CRs and resaves them. Could be dangerous...

The downside to updating the stored (hub) version is all the code that needs to be updated when you release a version.

@prometherion
Member

I have an example multi-version repo here: https://github.com/Azure/azure-service-operator
Could come in handy for reference.

Again, thanks a lot! <3 This is absolutely gold!

I chose to make my newest version always be the hub version

I couldn't agree more.

The downside to updating the stored (hub) version is all the code that needs to be updated when you release a version.

It's a trade-off we can easily accept when upgrading.

@prometherion
Member

Closing since #317 is landing on master, thanks a lot for the useful input @frodopwns!
