conversion: Return typed error on failure to convert for CRs #115650

Open
wants to merge 3 commits into
base: master

Conversation

MadhavJivrajani
Contributor

@MadhavJivrajani MadhavJivrajani commented Feb 9, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

Return a typed error that also passes as a "Missing Version Error"
so that server side apply updates are not blocked from happening on
custom resources that have a version that is no longer available.

This commit also adds an integration test to verify the same.
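
A minimal sketch of why a typed error matters here, using only the standard library. The names `notRegisteredErr`, `isMissingVersion`, `convert`, and `pruneManagers` are hypothetical stand-ins, not the actual apimachinery or field-manager code. The assumption is that the caller can recognize a "missing version" error by type and skip the stale managedFields entry, whereas an opaque error aborts the whole update.

```go
package main

import (
	"errors"
	"fmt"
)

// notRegisteredErr is a hypothetical typed error for a version that is no longer available.
type notRegisteredErr struct {
	groupVersion string
}

func (e *notRegisteredErr) Error() string {
	return fmt.Sprintf("no version %q has been registered", e.groupVersion)
}

// isMissingVersion reports whether err (or anything it wraps) is a notRegisteredErr.
func isMissingVersion(err error) bool {
	var target *notRegisteredErr
	return errors.As(err, &target)
}

// convert stands in for CR conversion: it fails for versions that are not served.
// With typed=false it returns an opaque error, mimicking the pre-fix behavior.
func convert(version string, served map[string]bool, typed bool) error {
	if served[version] {
		return nil
	}
	if typed {
		return &notRegisteredErr{groupVersion: version}
	}
	return errors.New("request to convert CR to an invalid group/version: " + version)
}

// pruneManagers keeps entries whose version converts, drops entries whose
// version is missing, and aborts on any other error.
func pruneManagers(versions []string, served map[string]bool, typed bool) ([]string, error) {
	var kept []string
	for _, v := range versions {
		err := convert(v, served, typed)
		switch {
		case err == nil:
			kept = append(kept, v)
		case isMissingVersion(err):
			continue // stale entry: skip it instead of failing the request
		default:
			return nil, err
		}
	}
	return kept, nil
}

func main() {
	served := map[string]bool{"v1beta1": true}
	entries := []string{"v1beta2", "v1beta1"}

	kept, err := pruneManagers(entries, served, true)
	fmt.Println("typed error:", kept, err)

	kept, err = pruneManagers(entries, served, false)
	fmt.Println("opaque error:", kept, err)
}
```

With the typed error the stale entry is skipped and processing continues; with the opaque error the loop returns early, which mirrors the blocked update described above.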

Which issue(s) this PR fixes:

Fixes #111937

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 9, 2023
@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 9, 2023
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 9, 2023
Member

@apelisse apelisse left a comment

I'm not super familiar with this code but that seems reasonable. Can you add a test to reproduce the bug that this is attempting to fix in ssa? Thank you!

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 22, 2023
@MadhavJivrajani MadhavJivrajani marked this pull request as ready for review February 22, 2023 15:04
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Feb 22, 2023
@MadhavJivrajani
Contributor Author

@apelisse I've added a reproducer test for the bug now, could you PTAL? Thanks!

@MadhavJivrajani
Contributor Author

/test pull-kubernetes-conformance-kind-ipv6-parallel
(container not ready)

@MadhavJivrajani
Contributor Author

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 22, 2023
@MadhavJivrajani MadhavJivrajani force-pushed the ssa-conversion-return-typed-error branch from 8689845 to 2c1cadb Compare July 4, 2023 13:14
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 4, 2023
@MadhavJivrajani
Contributor Author

@liggitt @apelisse I've addressed most comments (except #115650 (comment)), could you PTAL?
Apologies, this slipped through the cracks 🙏🏼

// Migrate existing CR to new storage version by applying an
// empty merge patch.
_, err = dynamicClient.Resource(gvr).
	Patch(context.TODO(), name, types.MergePatchType, []byte("{}"), metav1.PatchOptions{})
Member

Maybe send an apply in the new version that changes other fields, so that you have two entries at that point, one in the old version and one in the new version.


apiVersion = noxuDefinition.Spec.Group + "/" + noxuDefinition.Spec.Versions[0].Name
yamlBody = []byte(fmt.Sprintf(`
apiVersion: %s
Member

Make sure you send new fields in that apply so that you can see from the list of managed fields exactly what happened (i.e. the old entry was removed even though it should still have owned some fields, the second entry that was inserted still exists, and the last one also exists). You should have two entries at the end.
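
One way to check this expectation is to summarize the managedFields entries by manager, operation, and apiVersion. The sketch below does that with the standard library only. The `summarize` helper and the trimmed-down JSON are hypothetical illustrations, not part of the PR's test.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// entry holds the subset of a managedFields entry needed for the check.
type entry struct {
	Manager    string `json:"manager"`
	Operation  string `json:"operation"`
	APIVersion string `json:"apiVersion"`
}

// object models only metadata.managedFields of a custom resource.
type object struct {
	Metadata struct {
		ManagedFields []entry `json:"managedFields"`
	} `json:"metadata"`
}

// summarize returns one "manager/operation@apiVersion" string per entry.
func summarize(raw []byte) ([]string, error) {
	var obj object
	if err := json.Unmarshal(raw, &obj); err != nil {
		return nil, err
	}
	out := make([]string, 0, len(obj.Metadata.ManagedFields))
	for _, e := range obj.Metadata.ManagedFields {
		out = append(out, fmt.Sprintf("%s/%s@%s", e.Manager, e.Operation, e.APIVersion))
	}
	return out, nil
}

func main() {
	// Illustrative object with two entries, both in the new version.
	raw := []byte(`{"metadata":{"managedFields":[
		{"manager":"apply_test","operation":"Apply","apiVersion":"mygroup.example.com/v1beta1"},
		{"manager":"apply.test","operation":"Update","apiVersion":"mygroup.example.com/v1beta1"}
	]}}`)
	got, err := summarize(raw)
	if err != nil {
		panic(err)
	}
	for _, s := range got {
		fmt.Println(s)
	}
}
```

A test could compare the returned slice against the expected entries after each step rather than reading the full object dump.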

@dprotaso
Contributor

dprotaso commented Jan 9, 2024

@MadhavJivrajani bump

@MadhavJivrajani
Contributor Author

Thanks for the ping, this had dropped off my radar. Planning to get back to this.

This commit adds the ability to inject a custom string instead of the
default scheme name. It also consequently edits what is printed when
a custom string is passed.

This commit also adds tests to exercise the different code paths taken
for notRegisteredErr.

This is needed when we want to return typed errors and scheme names are
not known beforehand, such as during conversion of custom resources.

Signed-off-by: Madhav Jivrajani <madhav.jiv@gmail.com>
Return a typed error that also passes as a "Missing Version Error"
so that server side apply updates are not blocked from happening on
custom resources that have a version that is no longer available.

This commit also adds an integration test to verify the same.

Signed-off-by: Madhav Jivrajani <madhav.jiv@gmail.com>
Signed-off-by: Madhav Jivrajani <madhav.jiv@gmail.com>
@MadhavJivrajani MadhavJivrajani force-pushed the ssa-conversion-return-typed-error branch from 2c1cadb to fa6dcc7 Compare January 10, 2024 09:38
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: MadhavJivrajani
Once this PR has been reviewed and has the lgtm label, please assign jpbetz for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@MadhavJivrajani
Contributor Author

@apelisse 👋🏼
I think I understand what you were asking from the test, but I'm still not 100% sure.
I've pushed a commit with some log lines that tell what is happening at each stage, could you please take a look and correct me if I'm misunderstanding something here?

Copy pasting the output here for easy access:

    apply_crd_test.go:866: Created CR: {"apiVersion":"mygroup.example.com/v1beta2","kind":"WishIHadChosenNoxu","metadata":{"creationTimestamp":"2024-01-10T09:37:35Z","generation":1,"managedFields":[{"apiVersion":"mygroup.example.com/v1beta2","fieldsType":"FieldsV1","fieldsV1":{"f:spec":{"f:cronSpec":{},"f:ports":{"k:{\"containerPort\":80,\"protocol\":\"TCP\"}":{".":{},"f:containerPort":{},"f:name":{},"f:protocol":{}}},"f:replicas":{}}},"manager":"apply_test","operation":"Apply","time":"2024-01-10T09:37:35Z"}],"name":"mytest","resourceVersion":"90","uid":"e306af17-e9e1-4efb-b480-49b79da8fd9f"},"spec":{"cronSpec":"* * * * */5","ports":[{"containerPort":80,"name":"x","protocol":"TCP"}],"replicas":1}}
        
    apply_crd_test.go:882: 
        
        Updated CRD, migrated away from old storage version: [v1beta2 v1beta1]
    apply_crd_test.go:898: 
        
        Migrated CR to new storage version after patch (should have 2 managedFields entries in different versions): &{map[apiVersion:mygroup.example.com/v1beta1 kind:WishIHadChosenNoxu metadata:map[creationTimestamp:2024-01-10T09:37:35Z generation:2 managedFields:[map[apiVersion:mygroup.example.com/v1beta2 fieldsType:FieldsV1 fieldsV1:map[f:spec:map[f:cronSpec:map[] f:ports:map[k:{"containerPort":80,"protocol":"TCP"}:map[.:map[] f:containerPort:map[] f:name:map[] f:protocol:map[]]]]] manager:apply_test operation:Apply time:2024-01-10T09:37:35Z] map[apiVersion:mygroup.example.com/v1beta1 fieldsType:FieldsV1 fieldsV1:map[f:spec:map[f:replicas:map[]]] manager:apply.test operation:Update time:2024-01-10T09:37:35Z]] name:mytest resourceVersion:92 uid:e306af17-e9e1-4efb-b480-49b79da8fd9f] spec:map[cronSpec:* * * * */5 ports:[map[containerPort:80 name:x protocol:TCP]] replicas:2]]}
    apply_crd_test.go:911: 
        
        Updated CRD, remove old storage version from status: [v1beta1]
    apply_crd_test.go:954: 
        
        applying object again but with changed replicas (should have 2 converted managedFields entries): {"apiVersion":"mygroup.example.com/v1beta1","kind":"WishIHadChosenNoxu","metadata":{"creationTimestamp":"2024-01-10T09:37:35Z","generation":2,"managedFields":[{"apiVersion":"mygroup.example.com/v1beta1","fieldsType":"FieldsV1","fieldsV1":{"f:spec":{"f:cronSpec":{},"f:ports":{"k:{\"containerPort\":80,\"protocol\":\"TCP\"}":{".":{},"f:containerPort":{},"f:name":{},"f:protocol":{}}},"f:replicas":{}}},"manager":"apply_test","operation":"Apply","time":"2024-01-10T09:37:35Z"},{"apiVersion":"mygroup.example.com/v1beta1","fieldsType":"FieldsV1","fieldsV1":{"f:spec":{"f:replicas":{}}},"manager":"apply.test","operation":"Update","time":"2024-01-10T09:37:35Z"}],"name":"mytest","resourceVersion":"96","uid":"e306af17-e9e1-4efb-b480-49b79da8fd9f"},"spec":{"cronSpec":"* * * * */5","ports":[{"containerPort":80,"name":"x","protocol":"TCP"}],"replicas":2}}

@MadhavJivrajani
Contributor Author

MadhavJivrajani commented Jan 10, 2024

If I run the added test against master that does not have the new typed error logic, I get a failure consistent with the original bug report:

    apply_crd_test.go:866: Created CR: {"apiVersion":"mygroup.example.com/v1beta2","kind":"WishIHadChosenNoxu","metadata":{"creationTimestamp":"2024-01-10T09:54:49Z","generation":1,"managedFields":[{"apiVersion":"mygroup.example.com/v1beta2","fieldsType":"FieldsV1","fieldsV1":{"f:spec":{"f:cronSpec":{},"f:ports":{"k:{\"containerPort\":80,\"protocol\":\"TCP\"}":{".":{},"f:containerPort":{},"f:name":{},"f:protocol":{}}},"f:replicas":{}}},"manager":"apply_test","operation":"Apply","time":"2024-01-10T09:54:49Z"}],"name":"mytest","resourceVersion":"89","uid":"fca7007c-de0d-4048-b714-c777b13103c4"},"spec":{"cronSpec":"* * * * */5","ports":[{"containerPort":80,"name":"x","protocol":"TCP"}],"replicas":1}}
        
    apply_crd_test.go:882: 
        
        Updated CRD, migrated away from old storage version: [v1beta2 v1beta1]
    apply_crd_test.go:898: 
        
        Migrated CR to new storage version after patch: &{map[apiVersion:mygroup.example.com/v1beta1 kind:WishIHadChosenNoxu metadata:map[creationTimestamp:2024-01-10T09:54:49Z generation:2 managedFields:[map[apiVersion:mygroup.example.com/v1beta2 fieldsType:FieldsV1 fieldsV1:map[f:spec:map[f:cronSpec:map[] f:ports:map[k:{"containerPort":80,"protocol":"TCP"}:map[.:map[] f:containerPort:map[] f:name:map[] f:protocol:map[]]]]] manager:apply_test operation:Apply time:2024-01-10T09:54:49Z] map[apiVersion:mygroup.example.com/v1beta1 fieldsType:FieldsV1 fieldsV1:map[f:spec:map[f:replicas:map[]]] manager:apply.test operation:Update time:2024-01-10T09:54:49Z]] name:mytest resourceVersion:91 uid:fca7007c-de0d-4048-b714-c777b13103c4] spec:map[cronSpec:* * * * */5 ports:[map[containerPort:80 name:x protocol:TCP]] replicas:2]]}
    apply_crd_test.go:911: 
        
        Updated CRD, remove old storage version from status: [v1beta1]
E0110 15:24:49.608597   46532 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"request to convert CR to an invalid group/version: mygroup.example.com/v1beta2"}: request to convert CR to an invalid group/version: mygroup.example.com/v1beta2
    apply_crd_test.go:951: failed to create custom resource with apply: an error on the server ("unknown") has prevented the request from succeeding:
        (binary-encoded v1 Status body; readable parts: Failure, "request to convert CR to an invalid group/version: mygroup.example.com/v1beta2")

@MadhavJivrajani
Contributor Author

/retest

@MadhavJivrajani
Contributor Author

/retest

@dprotaso
Contributor

Can we cherry pick this change back to supported releases? This issue has affected K8s since 1.22

if len(s) == 0 {
	return ""
}
return " in scheme " + strconv.Quote(string(s))
Contributor

nit: Padding the context with a space prefix here feels clunky. Could we do all the formatting in one place?

Contributor Author

@jpbetz the reason for this is in how the notRegisteredErr string is formatted here: https://github.com/MadhavJivrajani/kubernetes/blob/fa6dcc734bf6d6faa720714628b8709c27736f19/staging/src/k8s.io/apimachinery/pkg/runtime/error.go#L73

There are two main cases to take care of: when we use a genericContext and when we use a schemeContext. If the genericContext is used (meaning we already know what the scheme is, such as in the case of built-in types), things would be okay for the most part. However, if we use a schemeContext (for when we don't know the scheme beforehand, such as in CRs), we need to add an " in scheme " toward the end of the formatted notRegisteredErr error string. So, to handle both cases without overcomplicating the formatting logic, a space is padded here.
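
A minimal sketch of the padding approach, built from the two return statements quoted in this thread. The `errContext` interface, the `suffix` method name, the `message` helper, and the message wording are assumptions made for illustration; the actual formatting lives in the linked error.go.

```go
package main

import (
	"fmt"
	"strconv"
)

// errContext is a hypothetical interface; suffix returns text appended to the message.
type errContext interface {
	suffix() string
}

// schemeContext names a scheme and renders as ` in scheme "name"`.
type schemeContext string

func (s schemeContext) suffix() string {
	if len(s) == 0 {
		return ""
	}
	return " in scheme " + strconv.Quote(string(s))
}

// genericContext carries free-form text and renders with only a leading space.
type genericContext string

func (s genericContext) suffix() string {
	if len(s) == 0 {
		return ""
	}
	return " " + string(s)
}

// message builds an illustrative error string; the wording is an assumption.
// Because each context pads itself, a single format verb covers every case,
// including the empty context.
func message(groupVersion string, ctx errContext) string {
	return fmt.Sprintf("no version %q has been registered%s", groupVersion, ctx.suffix())
}

func main() {
	gv := "mygroup.example.com/v1beta2"
	fmt.Println(message(gv, schemeContext("myscheme")))
	fmt.Println(message(gv, genericContext("for this custom resource")))
	fmt.Println(message(gv, genericContext("")))
}
```

The trade-off is the one raised in review: the leading space lives inside each context type rather than in the format string, which keeps the format string simple but spreads the formatting across several places.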

if len(s) == 0 {
	return ""
}
return " " + string(s)
Contributor

Same here

return &notRegisteredErr{context: schemeContext(schemeName), gvk: gvk, target: target}
}

func NewNotRegisteredGVKErrForTargetWithContext(context string, gvk schema.GroupVersionKind, target GroupVersioner) error {
Contributor

@jpbetz jpbetz Jan 23, 2024

I was surprised to see context use type string here. It makes it easy to bypass the use of genericContext (..and the formatting that genericContext adds, which makes me think we should handle the formatting differently..)

Contributor Author

context is probably a misnomer, but happy to change what it is called.

> It makes it easy to bypass the use of genericContext (..and the formatting that genericContext adds, which makes me think we should handle the formatting differently..)

Not sure I understand the bypassing part here, could you please elaborate?

@fabriziopandini
Member

This issue is biting us pretty hard in Cluster API, is there any chance to backport this to older releases of Kubernetes (as much as possible?)

@dims
Member

dims commented Feb 14, 2024

/assign @liggitt @apelisse

@liggitt
Member

liggitt commented Feb 15, 2024

> This issue is biting us pretty hard in Cluster API, is there any chance to backport this to older releases of Kubernetes (as much as possible?)

It's really hard to reason about the implications of the change, so doing that safely on master is the first step. It's too early to comment about the possibility of backports, but my inclination would be not to backport, since this isn't a regression (and the implications are super hard to reason about).

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 15, 2024
@dprotaso
Contributor

> but my inclination would be not to backport

Yeah sounds good @liggitt

@MadhavJivrajani Is there any other action here for this PR?

Labels
area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Successfully merging this pull request may close these issues.

managedField entry for a removed API version prevent further resource updates
10 participants