Hitting 34s timeouts with server-side apply on large custom resource objects #102749
Comments
/wg api-expression
/sig api-machinery

Can you provide an example CR that triggers this behavior?
Slack thread (with linked CRs): https://kubernetes.slack.com/archives/C0123CNN8F3/p1623878934051800?thread_ts=1623689723.044900&cid=C0123CNN8F3
Does this also reproduce in 1.21 / HEAD, or is it limited to 1.19/1.20?
This was also reproducible on 1.21, but I'm unsure which patch release since that was internally reported and I did not personally test it. I have only personally tested the following:
/triage accepted
I suspect this is due to ReconcileFieldSetWithSchema being run on all updates. For types with no schema (or that make heavy use of x-kubernetes-preserve-unknown-fields: true), ReconcileFieldSetWithSchema needs to be skipped. It already tries to bail out early for deduced schemas (https://github.com/kubernetes-sigs/structured-merge-diff/blob/ea1021dbc0f242313159d5dd4801ff29304712fe/typed/reconcile_schema.go#L130), but I'm not convinced that's working right, and I don't think it covers this case.

/assign
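For illustration, a minimal CRD schema of the kind described above (names such as `widgets.example.com` are hypothetical, not the reporter's actual CRD). The `status` subtree accepts arbitrary content via `x-kubernetes-preserve-unknown-fields: true`, so server-side apply must reconcile an effectively schemaless object on every update:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: Widget
    plural: widgets
    singular: widget
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              x-kubernetes-preserve-unknown-fields: true
            status:
              # Unvalidated, arbitrarily deep content; a large object
              # here makes per-field reconciliation expensive.
              type: object
              x-kubernetes-preserve-unknown-fields: true
```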
Thank you so much @jpbetz! Looks like we were on the right track. Is it possible for this change to be backported to k8s 1.19, 1.20, and 1.21? (I think 1.22 GA would have this fix by default, right?)
I'm in favor of backporting this as far as we possibly can. Once the fix is merged and the PR to bump the structured-merge-diff version is open against github.com/kubernetes/kubernetes, I'll open the cherry-pick requests.
That sounds great. Please reach out if you would like help or reviews.
@nickgerace there is a mitigation for this issue that works on v1.20+ (but not 1.19): use x-kubernetes-map-type: atomic, e.g.:
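A minimal sketch of what such a stanza can look like in the CRD's `openAPIV3Schema` (the `status` field here is illustrative):

```yaml
status:
  type: object
  x-kubernetes-preserve-unknown-fields: true
  # atomic: server-side apply treats the whole object as a single
  # value, skipping per-field ownership reconciliation inside it.
  x-kubernetes-map-type: atomic
```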
Thank you @jpbetz. Is there a plan for a fix on the k8s 1.19 series (1.19.10+) in the future?
Bump SMD to v4.1.2 to pick up #102749 fix
Hi, sorry to comment on a closed issue, but according to the Kubernetes changelog the fix for this is included in v1.21.3, yet we are still hitting this issue on a fresh 1.21.3 cluster: we still get 34s timeouts when creating big custom resources. By using the workaround described here we can create big custom objects successfully. Could anyone confirm it was fixed in v1.21.3, please?
@javiku it was:

```
[nickgerace at rancherbook in ~/github.com/kubernetes/kubernetes] (0) (master)
% git tag --contains 44d4c4fe69f9fd2ee7bade2d15c8bab6be3ec98e
v1.21.3
v1.21.4
v1.21.4-rc.0
v1.21.5-rc.0
```
Bump SMD to v4.1.2 to pick up kubernetes#102749 fix (…!1053, kubernetes#103320)
What happened:
Performing server-side apply on a CR of at least 700KB in size results in a 34-second timeout with Kubernetes 1.19.10+ and Kubernetes 1.20+.
What you expected to happen:
Like the creation and deletion events, I expected the update event to take a minimal amount of time. Understandably, our particular CR in question has a large `status` field (which is being trimmed down to a reasonable size anyway). However, it is interesting that we have never faced a timeout here in Kubernetes v1.19.9 and below.

How to reproduce it (as minimally and precisely as possible):
1. Create a CR of at least 700KB, either client-side (`kubectl create`) or server-side
2. Update the CR, either client-side (`kubectl edit`) or server-side; the update times out (see the sketch after this list)
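As a sketch of the server-side path, assuming a hypothetical Widget CRD like the one in the earlier comment (and no status subresource, so the whole object, status included, is written in one apply):

```yaml
# big-widget.yaml -- pad status until the serialized object is ~700KB.
apiVersion: example.com/v1
kind: Widget
metadata:
  name: big-widget
status:
  # Bulk this out (e.g. many repeated entries) to reach ~700KB.
  payload: "..."
# Reproduce with server-side apply:
#   kubectl apply --server-side -f big-widget.yaml   # create: fast
# then modify the file and re-run the same command   # update: ~34s timeout
```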
Anything else we need to know?:
I'm currently testing server-side apply behavior on CRs with and without their `openAPIV3Schema` populated. It may be that this issue only affects CRDs that preserve unknown fields for unpopulated `openAPIV3Schema` fields, and/or only CRDs with large `status` fields.

This is a comment with details of our investigation for `rancher/rancher`. It may be relevant for further context, but I do not want to inundate maintainers with a huge comment: rancher/rancher#32419 (comment)

I've also narrowed down some potential suspects, but have not yet been able to test them:
CRD in question:
Environment:
- Kubernetes version (use `kubectl version`): 1.19.10, 1.20.6 (server)
- OS (e.g. `cat /etc/os-release`): Ubuntu 20.04 LTS
- Kernel (e.g. `uname -a`): 5.4

Thank you in advance for any and all help! I'd be happy to provide more detail, and I hope to find the code that caused this as well. This is an ongoing investigation, but I thought I'd file the issue now since I believe we have enough reproduction scenarios to warrant it.
EDIT: I believe `wg-api-expression` was the best to assign based on this: https://github.com/kubernetes-sigs/structured-merge-diff#community-discussion-contribution-and-support