
Watch request for CRs costs about 10-15x more memory in k8s-apiserver than in-tree resource watches #124680

Open
nagygergo opened this issue May 2, 2024 · 8 comments
Labels
kind/bug · sig/api-machinery · triage/accepted

Comments

@nagygergo

nagygergo commented May 2, 2024

What happened?

I was running some load testing related to Flux. When creating 10,000 kustomization custom resources (about 1 KiB each), the k8s-apiserver consumes about 1 GiB of memory. When checking with 100,000 and 300,000 resources, memory usage scales linearly.
When doing the same thing with 1 KiB configmaps, creating 10,000 resources, the k8s-apiserver consumes about 100 MiB of memory.
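For reference, a quick way to check the apiserver pod's memory usage (a sketch; assumes metrics-server is installed, which kind does not ship by default, and the kubeadm-style component=kube-apiserver label on the static pod):

    kubectl top pod -n kube-system -l component=kube-apiserver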
Memory pprof for 10k kustomizations:
[pprof heap profile screenshot]

Memory pprof for 10k configmaps:
[pprof heap profile screenshot]
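
For anyone reproducing the profiles, one way to grab a heap profile from the apiserver (assuming profiling is enabled on the kube-apiserver, which is the default, and that go tool pprof is available locally):

    kubectl get --raw /debug/pprof/heap > apiserver-heap.pprof
    go tool pprof -top apiserver-heap.pprof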

The memory usage stays at this level for as long as the resources exist. After looking into what might cause this, it seems that kube-controller-manager sets up a watch for the kustomizations/configmaps resources:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"a4187168-a301-4fd1-907a-09da8fc3b587","stage":"RequestReceived","requestURI":"/apis/kustomize.toolkit.fluxcd.io/v1/kustomizations?allowWatchBookmarks=true\u0026resourceVersion=727\u0026timeout=5m37s\u0026timeoutSeconds=337\u0026watch=true","verb":"watch","user":{"username":"system:kube-controller-manager","groups":["system:authenticated"]},"sourceIPs":["172.18.0.2"],"userAgent":"kube-controller-manager/v1.29.2 (linux/amd64) kubernetes/4b8e819/metadata-informers","objectRef":{"resource":"kustomizations","apiGroup":"kustomize.toolkit.fluxcd.io","apiVersion":"v1"},"requestReceivedTimestamp":"2024-04-29T14:33:32.979279Z","stageTimestamp":"2024-04-29T14:33:32.979279Z"}

This is needed because the garbage collector that runs in kube-controller-manager needs to walk the owner reference graph, and it wants to do that from a cache:

if err := gc.resyncMonitors(logger, newResources); err != nil {

What did you expect to happen?

I would have expected similar memory usage for in-tree and custom resources.

Also, the current garbage collector design seems to force the k8s-apiserver to cache the full contents of etcd. Is that the intended behavior?

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a cluster
    kind create cluster

  2. Add the kustomize CRD
    curl -L https://raw.githubusercontent.com/fluxcd/kustomize-controller/main/config/crd/bases/kustomize.toolkit.fluxcd.io_kustomizations.yaml | kubectl apply -f -

  3. Create 10k of the following (a generation sketch follows the manifest)

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: podinfo
spec:
  interval: 10m
  targetNamespace: default
  sourceRef:
    kind: GitRepository
    name: podinfo
  path: "./kustomize"
  prune: true
  timeout: 1m
  patches:
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: not-used
      spec:
        template:
          metadata:
            annotations:
              cluster-autoscaler.kubernetes.io/safe-to-evict: "true"        
    target:
      kind: Deployment
      labelSelector: "app.kubernetes.io/part-of=my-app"
  - patch: |
      - op: add
        path: /spec/template/spec/securityContext
        value:
          runAsUser: 10000
          fsGroup: 1337
      - op: add
        path: /spec/template/spec/containers/0/securityContext
        value:
          readOnlyRootFilesystem: true
          allowPrivilegeEscalation: false
          runAsNonRoot: true
          capabilities:
            drop:
              - ALL        
    target:
      kind: Deployment
      name: podinfo
      namespace: apps

Anything else we need to know?

Kubernetes version

$ kubectl version
# paste output here

Cloud provider

kind

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@nagygergo added the kind/bug label May 2, 2024
@k8s-ci-robot added the needs-sig and needs-triage labels May 2, 2024
@nagygergo
Author

/sig apimachinery

@k8s-ci-robot
Contributor

@nagygergo: The label(s) sig/apimachinery cannot be applied, because the repository doesn't have them.

In response to this:

/sig apimachinery

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nagygergo
Author

/sig api-machinery

@k8s-ci-robot added the sig/api-machinery label and removed the needs-sig label May 2, 2024
@cici37
Contributor

cici37 commented May 30, 2024

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label May 30, 2024
@cici37
Contributor

cici37 commented May 30, 2024

/cc @benluddy @jpbetz @wojtek-t

@jpbetz
Contributor

jpbetz commented May 30, 2024

@benluddy any idea how much of this could be optimized away by CBOR?

@benluddy
Contributor

I'd expect the steady-state memory to depend on the size of the Go objects in the watch cache. Regardless of the object storage encoding, the decoded Unstructured objects will be larger than in-tree counterparts, mainly because of all the maps with duplicate copies of the string keys. Maybe there is an opportunity to do efficient string interning using the information in the CR schemas?

@wojtek-t
Member

wojtek-t commented Jun 3, 2024

Regardless of the object storage encoding, the decoded Unstructured objects will be larger than in-tree counterparts, mainly because of all the maps with duplicate copies of the string keys.

+1 - I don't think CBOR will change much here. The memory usage is large because of the inefficient in-memory representation of CRs.
We've been talking in the past about storing "partially serialized" objects in memory (don't store the Unstructured objects themselves, but rather a serialized object, potentially with deserialized ObjectMeta for efficient filtering). But things like "field selectors for CRDs" would make it harder (and that would be a large effort on its own).

So for now, CRDs are by design less efficient than built-in resources; with CBOR we're attacking the CPU part first, though.
