
Hitting 34s timeouts with server-side apply on large custom resource objects #102749

Closed
nickgerace opened this issue Jun 9, 2021 · 16 comments · Fixed by #103318
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on. wg/api-expression Categorizes an issue or PR as relevant to WG API Expression.
Milestone

Comments

@nickgerace

nickgerace commented Jun 9, 2021

What happened:

Performing server-side apply on a CR of at least 700KB in size results in a 34-second timeout with Kubernetes 1.19.10+ and Kubernetes 1.20+.

What you expected to happen:

Like the creation and deletion events, I expected the update event to take a minimal amount of time. Admittedly, the CR in question has a large status field (which is already being trimmed down to a reasonable size). However, it is interesting that we never hit this timeout on Kubernetes v1.19.9 and below.

How to reproduce it (as minimally and precisely as possible):

  1. Use a Kubernetes 1.19.10 or 1.20.6 cluster
  2. Create a custom resource object of 700KB or larger, client-side (e.g. kubectl create) or server-side
  3. Update the object, client-side (e.g. kubectl edit) or server-side
  4. Server-side apply will time out after 34s
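To make step 2 concrete, here is a minimal sketch that generates a sufficiently large object. It assumes a hypothetical Catalog CR for the CRD shown later in this issue (the field names inside spec/status are invented for illustration); it emits JSON, which kubectl accepts in place of YAML.

```python
import json

def build_large_catalog(name="repro-catalog", target_bytes=700_000):
    """Build a Catalog custom resource padded past target_bytes when
    serialized. Field names under spec/status are hypothetical."""
    obj = {
        "apiVersion": "management.cattle.io/v3",
        "kind": "Catalog",
        "metadata": {"name": name},
        "spec": {"url": "https://example.invalid/charts"},
        "status": {"conditions": []},
    }
    # Pad the status until the serialized object exceeds the target
    # size (~700KB is where the timeout was observed).
    i = 0
    while len(json.dumps(obj)) < target_bytes:
        obj["status"]["conditions"].append(
            {"type": f"Filler{i}", "status": "True", "message": "x" * 1024}
        )
        i += 1
    return obj

if __name__ == "__main__":
    with open("large-catalog.json", "w") as f:
        json.dump(build_large_catalog(), f)
```

The resulting file can then be created with `kubectl create -f large-catalog.json` and updated with `kubectl apply --server-side -f large-catalog.json` to observe the timeout.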

Anything else we need to know?:

I'm currently testing server-side apply behavior on CRs with and without openAPIV3Schema populated. It may be that this issue only affects CRDs that preserve unknown fields in lieu of a populated openAPIV3Schema, and/or only CRDs with large status fields:

schema:
  openAPIV3Schema:
    properties:
      spec:
        x-kubernetes-preserve-unknown-fields: true
      status:
        x-kubernetes-preserve-unknown-fields: true

This comment contains details of our investigation for rancher/rancher. It may be relevant for further context; I've linked it rather than pasting it here to avoid inundating maintainers with a huge comment: rancher/rancher#32419 (comment)

I've also narrowed down some potential suspects, but have not yet been able to test them:

CRD in question:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: catalogs.management.cattle.io
spec:
  conversion:
    strategy: None
  group: management.cattle.io
  names:
    kind: Catalog
    listKind: CatalogList
    plural: catalogs
    singular: catalog
  preserveUnknownFields: true
  scope: Cluster
  versions:
  - name: v3
    served: true
    storage: true

Environment:

  • Kubernetes version (use kubectl version): 1.19.10, 1.20.6 (server)
  • Cloud provider or hardware configuration: Digital Ocean Droplet (2 CPU / 8GB RAM)
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04 LTS
  • Kernel (e.g. uname -a): 5.4
  • Install tools: RKE v1.2.8
  • Network plugin and version (if this is a network-related bug): n/a
  • Others: n/a

Thank you in advance for any and all help! I'd be happy to provide more detail, and I hope to track down the code change that caused this as well. The investigation is ongoing, but I thought I'd file the issue since I believe we have enough reproduction scenarios to warrant it.

EDIT: I believe wg-api-expression was the best to assign based on this: https://github.com/kubernetes-sigs/structured-merge-diff#community-discussion-contribution-and-support

@nickgerace nickgerace added the kind/bug Categorizes issue or PR as related to a bug. label Jun 9, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 9, 2021
@nickgerace
Author

/wg api-expression

@k8s-ci-robot k8s-ci-robot added wg/api-expression Categorizes an issue or PR as relevant to WG API Expression. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 9, 2021
@lavalamp
Member

/sig api-machinery

Can you provide an example CR that triggers this behavior?

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jun 24, 2021
@liggitt liggitt added kind/regression Categorizes issue or PR as related to a regression from a prior release. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jun 24, 2021
@liggitt
Member

liggitt commented Jun 24, 2021

@liggitt liggitt added this to the v1.22 milestone Jun 24, 2021
@liggitt
Member

liggitt commented Jun 24, 2021

does this also reproduce in 1.21 / HEAD, or is it limited to 1.19/1.20?

@nickgerace
Author

nickgerace commented Jun 24, 2021

This was also reproducible on 1.21, but I'm unsure which patch release since that was internally reported and I did not personally test it. I have only personally tested the following:

  • 1.19.0 (not found)
  • 1.19.9 (not found)
  • 1.19.10 (found issue)
  • 1.20.6 (found issue)

@caesarxuchao
Member

/triage accepted
/cc @leilajal

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 24, 2021
@jpbetz
Contributor

jpbetz commented Jun 24, 2021

I suspect this is due to ReconcileFieldSetWithSchema being run on all updates. For types with no schema (or that make heavy use of x-kubernetes-preserve-unknown-fields: true), ReconcileFieldSetWithSchema needs to be skipped. It already tries to bail out early for deduced schemas (https://github.com/kubernetes-sigs/structured-merge-diff/blob/ea1021dbc0f242313159d5dd4801ff29304712fe/typed/reconcile_schema.go#L130), but I'm not convinced that's working correctly, and I don't think it covers this case.

/assign

@jpbetz
Contributor

jpbetz commented Jun 25, 2021

@nickgerace
Author

Thank you so much @jpbetz! Looks like we were on the right track.

Is it possible for this change to be backported to k8s 1.19, 1.20, and 1.21? (I think 1.22 GA would have this fix by default, right?)

@jpbetz
Contributor

jpbetz commented Jun 25, 2021

I'm in favor of backporting this as far as we possibly can. Once the fix is merged and the PR to bump the structured-merge-diff version is open against github.com/kubernetes/kubernetes, I'll open the cherry-pick requests.

@nickgerace
Author

That sounds great. Please reach out if you would like help or reviews.

@jpbetz
Contributor

jpbetz commented Jun 26, 2021

@nickgerace there is a mitigation for this issue that works on v1.20+ (but not 1.19): Use x-kubernetes-map-type: atomic, e.g.:

schema:
  openAPIV3Schema:
    properties:
      spec:
        x-kubernetes-preserve-unknown-fields: true
        x-kubernetes-map-type: atomic
      status:
        x-kubernetes-preserve-unknown-fields: true
        x-kubernetes-map-type: atomic
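For an existing CRD with several versions, one way to apply this mitigation programmatically is sketched below (a sketch only, assuming the CRD manifest is available as a Python dict parsed from JSON or YAML; it mirrors the apiextensions.k8s.io/v1 structure shown above). Marking the map as atomic means server-side apply treats the whole subtree as a single unit rather than tracking ownership per field.

```python
def add_atomic_map_type(crd: dict) -> dict:
    """For every version's schema, mark the spec and status properties
    as atomic maps wherever unknown fields are preserved."""
    for version in crd.get("spec", {}).get("versions", []):
        props = (version.get("schema", {})
                        .get("openAPIV3Schema", {})
                        .get("properties", {}))
        for field in ("spec", "status"):
            node = props.get(field)
            if node and node.get("x-kubernetes-preserve-unknown-fields"):
                node["x-kubernetes-map-type"] = "atomic"
    return crd

# Example: a pared-down CRD using the schema from this issue.
crd = {
    "apiVersion": "apiextensions.k8s.io/v1",
    "kind": "CustomResourceDefinition",
    "spec": {"versions": [{
        "name": "v3",
        "schema": {"openAPIV3Schema": {"properties": {
            "spec": {"x-kubernetes-preserve-unknown-fields": True},
            "status": {"x-kubernetes-preserve-unknown-fields": True},
        }}},
    }]},
}
patched = add_atomic_map_type(crd)
```

The patched manifest can then be re-applied with kubectl to update the CRD in place.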
        

@nickgerace
Author

nickgerace commented Jun 29, 2021

Thank you @jpbetz. Is there a plan for k8s 1.19.10+ in the future?

jpbetz added a commit to jpbetz/kubernetes that referenced this issue Jun 30, 2021
@jpbetz
Contributor

jpbetz commented Jun 30, 2021

I've opened PRs to fix and backport this:

main branch: #103318
1.21: #103319
1.20: #103320
1.19: #103321

jpbetz added a commit to jpbetz/kubernetes that referenced this issue Jun 30, 2021
jpbetz added a commit to jpbetz/kubernetes that referenced this issue Jun 30, 2021
k8s-ci-robot added a commit that referenced this issue Jun 30, 2021
Bump SMD to v4.1.2 to pick up #102749 fix
k8s-ci-robot added a commit that referenced this issue Jul 9, 2021
Manual cherry pick of #103318: Bump SMD to v4.1.2 to pick up #102749 fix
k8s-ci-robot added a commit that referenced this issue Jul 9, 2021
Manual cherry pick of #103318: Bump SMD to v4.1.2 to pick up #102749 fix
k8s-ci-robot added a commit that referenced this issue Jul 9, 2021
Manual cherry pick of #103318: Bump SMD to v4.1.2 to pick up #102749 fix
@javiku

javiku commented Jul 19, 2021

Hi, sorry to comment on a closed issue, but according to the Kubernetes changelog the fix for this is included in v1.21.3, and we are still hitting this issue on a fresh 1.21.3 cluster: 34s timeouts when creating big custom resources.

By using the workaround described here we can create big custom objects successfully.

Could anyone confirm it was fixed in v1.21.3, please?

@nickgerace
Author

@javiku it was:

[nickgerace at rancherbook in ~/github.com/kubernetes/kubernetes] (0) (master)
% git tag --contains 44d4c4fe69f9fd2ee7bade2d15c8bab6be3ec98e
v1.21.3
v1.21.4
v1.21.4-rc.0
v1.21.5-rc.0
