apiextensions apiserver: update storage version for custom resources #96403

Closed
wants to merge 10 commits

Conversation

@roycaihw (Member) commented Nov 10, 2020

What type of PR is this?
/kind feature

What this PR does / why we need it:
Implements the "CRDs" section of the KEP.

Does this PR introduce a user-facing change?:

Add a feature in the API server to update the storage version for custom resources. To use this feature, the internal.apiserver.k8s.io/v1alpha1 API must be enabled and both the StorageVersionAPI and APIServerIdentity feature gates must be turned on.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190802-dynamic-coordinated-storage-version.md

/sig api-machinery
/assign @sttts @caesarxuchao

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Nov 10, 2020
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 10, 2020
@k8s-ci-robot k8s-ci-robot added area/apiserver area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Nov 10, 2020
return nil
}
}
sv.OwnerReferences = append(sv.OwnerReferences, ref)
Member Author (roycaihw):

Append an owner reference so that stale StorageVersions get garbage collected when the CRD is deleted. This does not guarantee StorageVersions will be updated when the CRD gets updated.
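
For readers skimming the thread, here is a minimal sketch of what appending such an owner reference might look like; the function name and wiring are illustrative, not the PR's actual code:

```go
package sketch

import (
	apiserverinternalv1alpha1 "k8s.io/api/apiserverinternal/v1alpha1"
	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// appendCRDOwnerRef marks the CRD as an owner of its StorageVersion object, so
// that deleting the CRD lets the garbage collector remove the stale record.
// Illustrative sketch only.
func appendCRDOwnerRef(sv *apiserverinternalv1alpha1.StorageVersion, crd *apiextensionsv1.CustomResourceDefinition) {
	ref := metav1.OwnerReference{
		APIVersion: apiextensionsv1.SchemeGroupVersion.String(),
		Kind:       "CustomResourceDefinition",
		Name:       crd.Name,
		UID:        crd.UID,
	}
	// Skip the append if the reference is already present.
	for _, r := range sv.OwnerReferences {
		if r.UID == ref.UID {
			return
		}
	}
	sv.OwnerReferences = append(sv.OwnerReferences, ref)
}
```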

// This means stale records can still exist in the cluster if an apiserver
// hasn't got any request. All apiservers have to at least serve one
// request (can be a GET) for a CR's storage version record to be
// up-to-date, and to trigger storage migration.
Member Author (roycaihw):

An alternative to the owner-reference approach is to let crdHandler garbage collect the StorageVersion whenever we tear down old storage. That approach is more responsive, but it's more expensive (apiextensions-apiserver replicas conflict with each other in an HA cluster) and only helps with a few rare corner cases.

@roycaihw (Member Author), Nov 10, 2020:

@caesarxuchao Specifically, in an HA cluster, after a CRD gets an encoding version update:

  1. if all three servers have served requests with the new storage, storage migration will be triggered under both approaches.
  2. if only one or two of the three servers have served requests with the new storage, storage migration won't happen under the owner-reference approach, but it will under the alternative.
  3. if none of the three servers has served requests with the new storage, storage migration won't happen under the owner-reference approach; under the alternative it may happen, by chance:
    a. if the storage migrator observes the old StorageVersion object shrinking to one or two server records, it will trigger the migration
    b. if the watch events get compressed (e.g. informer reconnection) and the storage migrator only sees a DELETE event, it won't trigger the migration

Case 3 (updating the schema without even a read after the update) should be extremely rare. I don't expect case 2 to happen, assuming a sufficient number (>3) of requests have been made to the servers and the load balancer works reasonably.

That said, we can let the storage migrator send read requests periodically to eliminate case 3 and increase the chance of case 1. The alternative can also be a future improvement if one region being starved is a concern.

@sttts (Contributor), Nov 11, 2020:

Having this logic in the handler chain feels misplaced. Why is it there? Just register event handlers for CRDs, i.e. put it into pkg/controllers/storageversion.

@roycaihw (Member Author), Nov 11, 2020:

Not sure what pkg/controllers/storageversion is.

Let me think about using the event handlers to create/update storage versions. I don't want to use the event handlers to garbage collect storage versions, for the reason I mentioned in https://github.com/kubernetes/kubernetes/pull/96403/files#r520386107.

Member Author (roycaihw):

I agree that using the event handlers to create/update storage versions works better, and it solves https://github.com/kubernetes/kubernetes/pull/96403/files#r520403155. Will update the PR.

utilfeature.DefaultFeatureGate.Enabled(genericfeatures.APIServerIdentity) {
kubeclientset, err := kubernetes.NewForConfig(s.GenericAPIServer.LoopbackClientConfig)
if err != nil {
return nil, fmt.Errorf("failed to create kubernetes clientset: %v", err)
Member:

Looks like the bots picked me from /test.

Should "kubernetes" be uppercase here, or can it be dropped, similar to line 173:

return nil, fmt.Errorf("failed to create clientset: %v", err)

Member Author (roycaihw):

Thanks for reviewing! Here kubernetes is the package name: k8s.io/client-go/kubernetes.NewForConfig. I kept it to distinguish this from L173, which uses the apiextensions clientset: k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset.NewForConfig. I changed the error to "failed to create clientset for storage versions" to make it less confusing.

@@ -0,0 +1,2 @@
approvers:
- roycaihw
Member:

Potentially the SIG label should be here:
https://github.com/kubernetes/kubernetes/blob/master/test/e2e/apimachinery/OWNERS#L25-L26

Not sure if more approvers should be added here too. Ideally, yes.

Member Author (roycaihw):

Added the label and @caesarxuchao to approvers. Sadly, I couldn't find a sig-api-machinery-approvers alias.

// Send a request to make the server create handler and update storage version
_, err = dynamicClient.Resource(gvr).Namespace("default").List(context.TODO(), metav1.ListOptions{})
if err != nil {
t.Fatalf("unexpected error when listing foos: %v", err)
Member:

Replace "foos" with gvr.Resource, or just "resources"? IIRC, the error here already includes what failed to be listed.

}

var storageVersion apiserverinternalv1alpha1.ServerStorageVersion
if err := wait.PollImmediate(100*time.Millisecond, 10*time.Second, func() (bool, error) {
Member:

Do you see a case where 10 seconds might not be enough, i.e. could this introduce a flake?

Member Author (roycaihw):

Good point. Actually we don't need the wait; I was doing something else and forgot about it. Removed.

t.Fatalf("failed to get storage version for custom resources: %v", err)
}
if !strings.HasPrefix(storageVersion.APIServerID, "kube-apiserver-") {
t.Fatalf("apiserver ID doesn't contain kube-apiserver- prefix, has: %v", apiserverID)
Member:

I think apiserverID is "" at this point.

Member Author (roycaihw):

It's populated by the apiserver-identity feature:

id = "kube-apiserver-" + uuid.New().String()

@fedebongio (Contributor):

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 10, 2020
@@ -508,6 +520,9 @@ func (r *crdHandler) updateCustomResourceDefinition(oldObj, newObj interface{})

klog.V(4).Infof("Updating customresourcedefinition %s", newCRD.Name)
r.removeStorage_locked(newCRD.UID)
if err := r.updateStorageVersionFor(newCRD); err != nil {
Contributor:

I'm a little worried about having this inside the lock, as it calls out via a client (and potentially blocks). Same above.

Member Author (roycaihw):

Discussed offline. Decoupled the update logic from the lock.

@liggitt (Member) commented Feb 3, 2021

I was hoping for a change more like this, where we intercept every write operation:

diff --git a/staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/customresource_handler.go b/staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/customresource_handler.go
index b317e30ebd2..263b4524154 100644
--- a/staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/customresource_handler.go
+++ b/staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/customresource_handler.go
@@ -402,16 +402,31 @@ func (r *crdHandler) serveResource(w http.ResponseWriter, req *http.Request, req
 			responsewriters.ErrorNegotiated(err, Codecs, schema.GroupVersion{Group: requestInfo.APIGroup, Version: requestInfo.APIVersion}, w, req)
 			return nil
 		}
+		if err := crdInfo.WaitForConsistentStorageVersion(req.Context()); err != nil {
+			return err
+		}
 		return handlers.CreateResource(storage, requestScope, r.admission)
 	case "update":
+		if err := crdInfo.WaitForConsistentStorageVersion(req.Context()); err != nil {
+			return err
+		}
 		return handlers.UpdateResource(storage, requestScope, r.admission)
 	case "patch":
+		if err := crdInfo.WaitForConsistentStorageVersion(req.Context()); err != nil {
+			return err
+		}
 		return handlers.PatchResource(storage, requestScope, r.admission, supportedTypes)
 	case "delete":
 		allowsOptions := true
+		if err := crdInfo.WaitForConsistentStorageVersion(req.Context()); err != nil {
+			return err
+		}
 		return handlers.DeleteResource(storage, allowsOptions, requestScope, r.admission)
 	case "deletecollection":
 		checkBody := true
+		if err := crdInfo.WaitForConsistentStorageVersion(req.Context()); err != nil {
+			return err
+		}
 		return handlers.DeleteCollection(storage, checkBody, requestScope, r.admission)
 	default:
 		responsewriters.ErrorNegotiated(
@@ -430,8 +445,14 @@ func (r *crdHandler) serveStatus(w http.ResponseWriter, req *http.Request, reque
 	case "get":
 		return handlers.GetResource(storage, requestScope)
 	case "update":
+		if err := crdInfo.WaitForConsistentStorageVersion(req.Context()); err != nil {
+			return err
+		}
 		return handlers.UpdateResource(storage, requestScope, r.admission)
 	case "patch":
+		if err := crdInfo.WaitForConsistentStorageVersion(req.Context()); err != nil {
+			return err
+		}
 		return handlers.PatchResource(storage, requestScope, r.admission, supportedTypes)
 	default:
 		responsewriters.ErrorNegotiated(
@@ -450,8 +471,14 @@ func (r *crdHandler) serveScale(w http.ResponseWriter, req *http.Request, reques
 	case "get":
 		return handlers.GetResource(storage, requestScope)
 	case "update":
+		if err := crdInfo.WaitForConsistentStorageVersion(req.Context()); err != nil {
+			return err
+		}
 		return handlers.UpdateResource(storage, requestScope, r.admission)
 	case "patch":
+		if err := crdInfo.WaitForConsistentStorageVersion(req.Context()); err != nil {
+			return err
+		}
 		return handlers.PatchResource(storage, requestScope, r.admission, supportedTypes)
 	default:
 		responsewriters.ErrorNegotiated(

That confines the intercepting logic to a single method and leaves the rest of the custom resource handling logic alone.

In that single method, we can (sketched below):

  • no-op if the feature is disabled or if there's only a single storageVersion (e.g. nothing to converge on)
  • return fast once servers converge
  • wait, honoring request timeout (or optionally a failsafe)
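
To make those three bullets concrete, here is a rough, self-contained sketch of the shape such a method could take. The type and its fields are invented for illustration; this is not the PR's implementation:

```go
package sketch

import (
	"context"
	"fmt"
)

// storageVersionGate is a hypothetical helper holding the signal that this
// server's StorageVersion record for a CRD is up to date.
type storageVersionGate struct {
	enabled bool          // whether the StorageVersionAPI feature is on
	updated chan struct{} // closed once the storage version update has landed
}

// WaitForConsistentStorageVersion blocks a write until the storage version
// update has completed, honoring the request's context deadline.
func (g *storageVersionGate) WaitForConsistentStorageVersion(ctx context.Context) error {
	// No-op if the feature is disabled: there is nothing to converge on.
	if !g.enabled {
		return nil
	}
	select {
	case <-g.updated:
		// Fast path: the update already completed.
		return nil
	case <-ctx.Done():
		// Honor the request timeout rather than blocking indefinitely.
		return fmt.Errorf("timed out waiting for storage version update: %w", ctx.Err())
	}
}
```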

@deads2k (Contributor) commented Feb 3, 2021

As I recall the KEP, our goal for CRs is that we never store a CR with an encoding version that doesn't match what is recorded in the API. This is slightly more complicated than for the built-in resources, since the encoding version can change without a restart. However, I think we could simplify slightly with a flow like:

  1. create storage for CR as normal
  2. read-only requests are allowed immediately
  3. first mutating request for a CR arrives
  4. create or update the storage version. If this fails, fail the request
  5. accept mutating requests

I don't know exactly where the code change would be offhand, but I don't think we need asynchronous writing in this case. The first few mutating requests after an encoding version change would have higher-than-normal latency, but there aren't that many of them.

As I recall, there's a controller-y aspect to how the storage version gets written for the built-in resources. I think a similar thing could be built using a cache that checks on each request whether the recorded version matches, and does the synchronous write when it doesn't.
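
As an illustration of that flow (not the PR's code; the type, its fields, and writeFn are hypothetical), a cached synchronous write on the first mutating request could look roughly like this:

```go
package sketch

import (
	"context"
	"sync"
)

// lazyStorageVersionWriter records the CR encoding version on the first
// mutating request after a change, and is a cheap cache hit afterwards.
type lazyStorageVersionWriter struct {
	mu      sync.Mutex
	written string // encoding version already recorded, if any
	// writeFn would be a client call that creates/updates the StorageVersion object.
	writeFn func(ctx context.Context, version string) error
}

// ensure is called at the start of every mutating request. It is a no-op once
// the current encoding version has been recorded; otherwise it performs the
// write synchronously and fails the request if the write fails.
func (w *lazyStorageVersionWriter) ensure(ctx context.Context, current string) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.written == current {
		return nil // cache hit: the recorded version already matches
	}
	if err := w.writeFn(ctx, current); err != nil {
		return err // the first mutating request after the change pays the latency
	}
	w.written = current
	return nil
}
```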

@roycaihw (Member Author) commented Feb 11, 2021

@deads2k There could be some race conditions. Imagine:

  1. ...
  2. updating the storage version to v1.
  3. CRD got updated to v2
  4. apiserver saw the watch event, removed the v1 storage from the storage map, and put it into graceful teardown
  5. (the v1 storage could still update the storage version to v1, and finish serving the in-flight mutating requests)
  6. a new mutating request came in
  7. apiserver created the v2 storage, updating the storage version to v2
  • The v2 update might happen before the v1 update. In that case the storage version ends up being v1.
  • Or, even if the storage version updates happened in the right order (first v1, then v2), some v1 in-flight mutating requests could still get served after the v2 storage version update. Furthermore, if a storage migration happened after the v2 update and before those v1 in-flight requests completed, v1 data would get persisted in etcd.

The asynchronous channel makes sure the updates happen in the right order, and that a later update begins only after the earlier one's graceful teardown finishes.
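
Purely as an illustration of that ordering guarantee (the types and function are invented, not the PR's implementation), a per-CRD channel consumed by a single worker could look like this:

```go
package sketch

import "context"

// storageVersionUpdate is a hypothetical work item: record a new encoding
// version, but only after the previous storage's graceful teardown finishes.
type storageVersionUpdate struct {
	version          string
	teardownFinished <-chan struct{} // closed when the old storage has drained
	update           func(ctx context.Context, version string) error
}

// processUpdates drains a per-CRD channel in order, so a v2 update can never
// race ahead of a still-draining v1 storage.
func processUpdates(ctx context.Context, updates <-chan storageVersionUpdate) {
	for u := range updates {
		select {
		case <-u.teardownFinished:
			// The previous storage finished serving its in-flight requests.
		case <-ctx.Done():
			return
		}
		// A real implementation would surface errors to the pending
		// requests; they are ignored here to keep the sketch short.
		_ = u.update(ctx, u.version)
	}
}
```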

@roycaihw (Member Author) commented Feb 11, 2021

I was hoping for a change more like this, where we intercept every write operation:

@liggitt Thanks. That's doable. I will update the PR

When a storage version update is aborted because the CRD was deleted, we
don't need to reject in-flight CR requests. We can safely allow these
requests to proceed, because the CRD finalizer guarantees no write
requests can succeed after a CRD is deleted: CREATE requests get
405 Method Not Allowed, and other requests get 404 Not Found.
In theory, "CRD deletion and the CRD finalizer tearing down CRs" could
happen in between (1) crdHandler seeing the CRD with terminating=false
and (2) crdHandler serving the CR create request.

This commit makes crdHandler read the latest CRD from the cache (shared
with the CRD finalizer) right before serving the create request, to keep
the gap as narrow as possible so that the chance of hitting the race
stays low.
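
A hypothetical sketch of that last-moment check (the function names are invented; only the lister and condition types come from the upstream API packages):

```go
package sketch

import (
	"fmt"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	listers "k8s.io/apiextensions-apiserver/pkg/client/listers/apiextensions/v1"
)

// isTerminating reports whether the CRD carries the Terminating condition.
func isTerminating(crd *apiextensionsv1.CustomResourceDefinition) bool {
	for _, cond := range crd.Status.Conditions {
		if cond.Type == apiextensionsv1.Terminating && cond.Status == apiextensionsv1.ConditionTrue {
			return true
		}
	}
	return false
}

// refuseCreateIfTerminating re-reads the CRD from the shared informer cache
// right before a create is served, narrowing the window in which a create
// could slip past a CRD that has already started terminating.
func refuseCreateIfTerminating(lister listers.CustomResourceDefinitionLister, crdName string) error {
	crd, err := lister.Get(crdName)
	if err != nil {
		// Includes NotFound: the CRD is already gone, so the request
		// would fail downstream anyway.
		return err
	}
	if isTerminating(crd) {
		return fmt.Errorf("custom resource definition %s is terminating", crdName)
	}
	return nil
}
```
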
@roycaihw (Member Author):

I checked the CRD deletion workflow: we don't need to reject CR writes if a CRD is deleted; see commit message 8661f73 for more detail. One concern raised was #99181; I added a mitigation in 6ddc1eb.

@k8s-ci-robot (Contributor) commented Feb 18, 2021

@roycaihw: The following test failed, say /retest to rerun all failed tests:

Test name                    Commit    Details   Rerun command
pull-kubernetes-bazel-test   100aa3b   link      /test pull-kubernetes-bazel-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@liggitt (Member) commented Feb 18, 2021

the CRD finalizer guarantees no write requests can succeed after a CRD is deleted

are you sure? updates have to be allowed in order to remove finalizers on CR instances blocking cleanup, right?

by "after a CRD is deleted", do you mean "actually deleted" or "deletionTimestamp persisted on the CRD"?

@roycaihw (Member Author):

updates have to be allowed in order to remove finalizers on CR instances blocking cleanup, right?

The cleanup goes directly to the storage, so it won't be blocked by the http handler.

by "after a CRD is deleted"

I mean actually deleted. This is when crdHandler gets a DELETE watch event and starts tearing down the storage. The CRD finalizer controller has already done its job, so we can unblock write requests knowing they won't succeed.

@roycaihw (Member Author):

updates have to be allowed in order to remove finalizers on CR instances blocking cleanup, right?

The cleanup goes directly to the storage, so it won't be blocked by the http handler.

I may be wrong. Let me double check.

But even if CR cleanup needs to go through the http handler, we won't deadlock. The order would be:

  1. storage version update succeeds
  2. CR cleanup finishes
  3. CRD is actually deleted
  4. crdHandler sees the DELETE event, aborts storage version update (no-op since the update already succeeded), unblocks write requests (no-op)

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 6, 2021
@k8s-ci-robot (Contributor):

@roycaihw: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fejta-bot:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 5, 2021
@k8s-triage-robot:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 4, 2021
@k8s-triage-robot:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor):

@k8s-triage-robot: Closed this PR.

In response to the /close command above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
area/apiserver area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: New KEP