
Hitting 34s timeouts with server-side apply on large custom resource objects #102749

Closed
nickgerace opened this issue Jun 9, 2021 · 16 comments · Fixed by #103318
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on. wg/api-expression Categorizes an issue or PR as relevant to WG API Expression.
Milestone

Comments

@nickgerace

nickgerace commented Jun 9, 2021

What happened:

Performing server-side apply on a CR of at least 700KB in size results in a 34-second timeout with Kubernetes 1.19.10+ and Kubernetes 1.20+.

What you expected to happen:

Like the creation and deletion events, I expected the update event to take a minimal amount of time. Admittedly, the CR in question has a large status field (which is already being trimmed down to a reasonable size). However, it is interesting that we never hit this timeout on Kubernetes v1.19.9 and below.

How to reproduce it (as minimally and precisely as possible):

  1. Use a Kubernetes 1.19.10 or 1.20.6 cluster
  2. Create a custom resource object of 700KB or larger, client-side (e.g. kubectl create) or server-side
  3. Update the object, client-side (e.g. kubectl edit) or server-side
  4. Server-side apply will time out after 34s
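To make step 2 concrete, here is a minimal sketch that generates a sufficiently large object. It assumes a hypothetical Catalog CR for the CRD shown later in this issue (the field names inside spec/status are invented for illustration); it emits JSON, which kubectl accepts in place of YAML.

```python
import json

def build_large_catalog(name="repro-catalog", target_bytes=700_000):
    """Build a Catalog custom resource padded past target_bytes when
    serialized. Field names under spec/status are hypothetical."""
    obj = {
        "apiVersion": "management.cattle.io/v3",
        "kind": "Catalog",
        "metadata": {"name": name},
        "spec": {"url": "https://example.invalid/charts"},
        "status": {"conditions": []},
    }
    # Pad the status until the serialized object exceeds the target
    # size (~700KB is where the timeout was observed).
    i = 0
    while len(json.dumps(obj)) < target_bytes:
        obj["status"]["conditions"].append(
            {"type": f"Filler{i}", "status": "True", "message": "x" * 1024}
        )
        i += 1
    return obj

if __name__ == "__main__":
    with open("large-catalog.json", "w") as f:
        json.dump(build_large_catalog(), f)
```

The resulting file can then be created with `kubectl create -f large-catalog.json` and updated with `kubectl apply --server-side -f large-catalog.json` to observe the timeout.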

Anything else we need to know?:

I'm currently testing server-side apply behavior on CRs with and without openAPIV3Schema populated. It may be that this issue only affects CRDs that preserve unknown fields in lieu of a populated openAPIV3Schema, and/or only CRDs with large status fields:

schema:
  openAPIV3Schema:
    properties:
      spec:
        x-kubernetes-preserve-unknown-fields: true
      status:
        x-kubernetes-preserve-unknown-fields: true

This comment contains details of our investigation for rancher/rancher. It may be relevant for further context; I've linked it rather than pasting it here to avoid inundating maintainers with a huge comment: rancher/rancher#32419 (comment)

I've also narrowed down some potential suspects, but have not yet been able to test them:

CRD in question:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: catalogs.management.cattle.io
spec:
  conversion:
    strategy: None
  group: management.cattle.io
  names:
    kind: Catalog
    listKind: CatalogList
    plural: catalogs
    singular: catalog
  preserveUnknownFields: true
  scope: Cluster
  versions:
  - name: v3
    served: true
    storage: true

Environment:

  • Kubernetes version (use kubectl version): 1.19.10, 1.20.6 (server)
  • Cloud provider or hardware configuration: Digital Ocean Droplet (2 CPU / 8GB RAM)
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04 LTS
  • Kernel (e.g. uname -a): 5.4
  • Install tools: RKE v1.2.8
  • Network plugin and version (if this is a network-related bug): n/a
  • Others: n/a

Thank you in advance for any and all help! I'd be happy to provide more detail, and I hope to track down the code change that caused this as well. The investigation is ongoing, but I thought I'd file the issue since I believe we have enough reproduction scenarios to warrant it.

EDIT: I believe wg-api-expression was the best to assign based on this: https://github.com/kubernetes-sigs/structured-merge-diff#community-discussion-contribution-and-support

@nickgerace nickgerace added the kind/bug Categorizes issue or PR as related to a bug. label Jun 9, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 9, 2021
@nickgerace
Author

/wg api-expression

@k8s-ci-robot k8s-ci-robot added wg/api-expression Categorizes an issue or PR as relevant to WG API Expression. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 9, 2021
@lavalamp
Member

/sig api-machinery

Can you provide an example CR that triggers this behavior?

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jun 24, 2021
@liggitt liggitt added kind/regression Categorizes issue or PR as related to a regression from a prior release. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jun 24, 2021
@liggitt
Member

liggitt commented Jun 24, 2021

@liggitt liggitt added this to the v1.22 milestone Jun 24, 2021
@liggitt
Member

liggitt commented Jun 24, 2021

does this also reproduce in 1.21 / HEAD, or is it limited to 1.19/1.20?

@nickgerace
Author

nickgerace commented Jun 24, 2021

This was also reproducible on 1.21, but I'm unsure which patch release since that was internally reported and I did not personally test it. I have only personally tested the following:

  • 1.19.0 (not found)
  • 1.19.9 (not found)
  • 1.19.10 (found issue)
  • 1.20.6 (found issue)

@caesarxuchao
Member

/triage accepted
/cc @leilajal

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 24, 2021
@jpbetz
Contributor

jpbetz commented Jun 24, 2021

I suspect this is due to ReconcileFieldSetWithSchema being run on all updates. For types with no schema (or that make heavy use of x-kubernetes-preserve-unknown-fields: true), ReconcileFieldSetWithSchema needs to be skipped. It already tries to bail out early for deduced schemas (https://github.com/kubernetes-sigs/structured-merge-diff/blob/ea1021dbc0f242313159d5dd4801ff29304712fe/typed/reconcile_schema.go#L130), but I'm not convinced that's working correctly, and I don't think it covers this case.

/assign

@jpbetz
Contributor

jpbetz commented Jun 25, 2021

@nickgerace
Author

Thank you so much @jpbetz! Looks like we were on the right track.

Is it possible for this change to be backported to k8s 1.19, 1.20, and 1.21? (I think 1.22 GA would have this fix by default, right?)

@jpbetz
Contributor

jpbetz commented Jun 25, 2021

I'm in favor of backporting this as far as we possibly can. Once the fix is merged and the PR to bump the structured-merge-diff version is open against github.com/kubernetes/kubernetes, I'll open the cherry-pick requests.

@nickgerace
Author

That sounds great. Please reach out if you would like help or reviews.

@jpbetz
Contributor

jpbetz commented Jun 26, 2021

@nickgerace there is a mitigation for this issue that works on v1.20+ (but not 1.19): Use x-kubernetes-map-type: atomic, e.g.:

schema:
  openAPIV3Schema:
    properties:
      spec:
        x-kubernetes-preserve-unknown-fields: true
        x-kubernetes-map-type: atomic
      status:
        x-kubernetes-preserve-unknown-fields: true
        x-kubernetes-map-type: atomic
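For an existing CRD with several versions, one way to apply this mitigation programmatically is sketched below (a sketch only, assuming the CRD manifest is available as a Python dict parsed from JSON or YAML; it mirrors the apiextensions.k8s.io/v1 structure shown above). Marking the map as atomic means server-side apply treats the whole subtree as a single unit rather than tracking ownership per field.

```python
def add_atomic_map_type(crd: dict) -> dict:
    """For every version's schema, mark the spec and status properties
    as atomic maps wherever unknown fields are preserved."""
    for version in crd.get("spec", {}).get("versions", []):
        props = (version.get("schema", {})
                        .get("openAPIV3Schema", {})
                        .get("properties", {}))
        for field in ("spec", "status"):
            node = props.get(field)
            if node and node.get("x-kubernetes-preserve-unknown-fields"):
                node["x-kubernetes-map-type"] = "atomic"
    return crd

# Example: a pared-down CRD using the schema from this issue.
crd = {
    "apiVersion": "apiextensions.k8s.io/v1",
    "kind": "CustomResourceDefinition",
    "spec": {"versions": [{
        "name": "v3",
        "schema": {"openAPIV3Schema": {"properties": {
            "spec": {"x-kubernetes-preserve-unknown-fields": True},
            "status": {"x-kubernetes-preserve-unknown-fields": True},
        }}},
    }]},
}
patched = add_atomic_map_type(crd)
```

The patched manifest can then be re-applied with kubectl to update the CRD in place.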
        

@nickgerace
Author

nickgerace commented Jun 29, 2021

Thank you @jpbetz. Is there a plan for k8s 1.19.10+ in the future?

jpbetz added a commit to jpbetz/kubernetes that referenced this issue Jun 30, 2021
@jpbetz
Contributor

jpbetz commented Jun 30, 2021

I've opened PRs to fix and backport this:

main branch: #103318
1.21: #103319
1.20: #103320
1.19: #103321

jpbetz added a commit to jpbetz/kubernetes that referenced this issue Jun 30, 2021
jpbetz added a commit to jpbetz/kubernetes that referenced this issue Jun 30, 2021
k8s-ci-robot added a commit that referenced this issue Jun 30, 2021
Bump SMD to v4.1.2 to pick up #102749 fix
k8s-ci-robot added a commit that referenced this issue Jul 9, 2021
Manual cherry pick of #103318: Bump SMD to v4.1.2 to pick up #102749 fix
k8s-ci-robot added a commit that referenced this issue Jul 9, 2021
Manual cherry pick of #103318: Bump SMD to v4.1.2 to pick up #102749 fix
k8s-ci-robot added a commit that referenced this issue Jul 9, 2021
Manual cherry pick of #103318: Bump SMD to v4.1.2 to pick up #102749 fix
@javiku

javiku commented Jul 19, 2021

Hi, sorry to comment on a closed issue, but according to the Kubernetes changelog the fix for this is included in v1.21.3, and we are still hitting this issue on a fresh 1.21.3 cluster: 34s timeouts when creating big custom resources.

By using the workaround described here we can create big custom objects successfully.

Could anyone confirm it was fixed in v1.21.3, please?

@nickgerace
Author

@javiku it was:

[nickgerace at rancherbook in ~/github.com/kubernetes/kubernetes] (0) (master)
% git tag --contains 44d4c4fe69f9fd2ee7bade2d15c8bab6be3ec98e
v1.21.3
v1.21.4
v1.21.4-rc.0
v1.21.5-rc.0
