
Watch request for CRs costs about 10-15x more memory in k8s-apiserver than in-tree resource watches #124680

Open
nagygergo opened this issue May 2, 2024 · 8 comments
Labels
kind/bug · sig/api-machinery · triage/accepted

Comments

@nagygergo

nagygergo commented May 2, 2024

What happened?

I was running some load testing related to Flux. When creating 10,000 kustomization custom resources (about 1 KiB each), the k8s-apiserver consumes about 1 GiB of memory. When checking with 100,000 and 300,000 resources, memory usage scales linearly.
When doing the same thing with 1 KiB configmaps, creating 10,000 resources, the k8s-apiserver consumes about 100 MiB of memory.
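For reference, a quick way to check the apiserver pod's memory usage (a sketch; assumes metrics-server is installed, which kind does not ship by default, and the kubeadm-style component=kube-apiserver label on the static pod):

    kubectl top pod -n kube-system -l component=kube-apiserver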
Memory pprof for 10k kustomizations:
[pprof heap profile screenshot]

Memory pprof for 10k configmaps:
[pprof heap profile screenshot]
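
For anyone reproducing the profiles, one way to grab a heap profile from the apiserver (assuming profiling is enabled on the kube-apiserver, which is the default, and that go tool pprof is available locally):

    kubectl get --raw /debug/pprof/heap > apiserver-heap.pprof
    go tool pprof -top apiserver-heap.pprof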

The memory usage stays at this level for as long as the resources exist. After looking into what might cause this, it seems that kube-controller-manager sets up a watch for the kustomizations/configmaps resources:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"a4187168-a301-4fd1-907a-09da8fc3b587","stage":"RequestReceived","requestURI":"/apis/kustomize.toolkit.fluxcd.io/v1/kustomizations?allowWatchBookmarks=true\u0026resourceVersion=727\u0026timeout=5m37s\u0026timeoutSeconds=337\u0026watch=true","verb":"watch","user":{"username":"system:kube-controller-manager","groups":["system:authenticated"]},"sourceIPs":["172.18.0.2"],"userAgent":"kube-controller-manager/v1.29.2 (linux/amd64) kubernetes/4b8e819/metadata-informers","objectRef":{"resource":"kustomizations","apiGroup":"kustomize.toolkit.fluxcd.io","apiVersion":"v1"},"requestReceivedTimestamp":"2024-04-29T14:33:32.979279Z","stageTimestamp":"2024-04-29T14:33:32.979279Z"}

This is needed because the garbage collector that runs in kube-controller-manager needs to walk the owner reference graph, and it wants to do that from a cache:

if err := gc.resyncMonitors(logger, newResources); err != nil {

What did you expect to happen?

I would have expected similar memory usage for in-tree and custom resources.

Also, the current garbage collector design seems to force the k8s-apiserver to cache the full contents of etcd. Is that the intended behavior?

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a cluster
    kind create cluster

  2. Add the kustomize CRD
    curl -L https://raw.githubusercontent.com/fluxcd/kustomize-controller/main/config/crd/bases/kustomize.toolkit.fluxcd.io_kustomizations.yaml | kubectl apply -f -

  3. Create 10k of the following (a generation sketch follows the manifest)

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: podinfo
spec:
  interval: 10m
  targetNamespace: default
  sourceRef:
    kind: GitRepository
    name: podinfo
  path: "./kustomize"
  prune: true
  timeout: 1m
  patches:
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: not-used
      spec:
        template:
          metadata:
            annotations:
              cluster-autoscaler.kubernetes.io/safe-to-evict: "true"        
    target:
      kind: Deployment
      labelSelector: "app.kubernetes.io/part-of=my-app"
  - patch: |
      - op: add
        path: /spec/template/spec/securityContext
        value:
          runAsUser: 10000
          fsGroup: 1337
      - op: add
        path: /spec/template/spec/containers/0/securityContext
        value:
          readOnlyRootFilesystem: true
          allowPrivilegeEscalation: false
          runAsNonRoot: true
          capabilities:
            drop:
              - ALL        
    target:
      kind: Deployment
      name: podinfo
      namespace: apps

Anything else we need to know?

Kubernetes version

$ kubectl version
# paste output here

Cloud provider

kind

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@nagygergo added the kind/bug label May 2, 2024
@k8s-ci-robot added the needs-sig and needs-triage labels May 2, 2024
@nagygergo
Author

/sig apimachinery

@k8s-ci-robot
Contributor

@nagygergo: The label(s) sig/apimachinery cannot be applied, because the repository doesn't have them.

In response to this:

/sig apimachinery

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nagygergo
Author

/sig api-machinery

@k8s-ci-robot added the sig/api-machinery label and removed the needs-sig label May 2, 2024
@cici37
Contributor

cici37 commented May 30, 2024

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label May 30, 2024
@cici37
Contributor

cici37 commented May 30, 2024

/cc @benluddy @jpbetz @wojtek-t

@jpbetz
Contributor

jpbetz commented May 30, 2024

@benluddy any idea how much of this could be optimized away by CBOR?

@benluddy
Contributor

I'd expect the steady-state memory to depend on the size of the Go objects in the watch cache. Regardless of the object storage encoding, the decoded Unstructured objects will be larger than in-tree counterparts, mainly because of all the maps with duplicate copies of the string keys. Maybe there is an opportunity to do efficient string interning using the information in the CR schemas?

@wojtek-t
Member

wojtek-t commented Jun 3, 2024

Regardless of the object storage encoding, the decoded Unstructured objects will be larger than in-tree counterparts, mainly because of all the maps with duplicate copies of the string keys.

+1 - I don't think CBOR will change much here. The memory usage is large because of the inefficient in-memory representation of CRs.
We've been talking in the past about storing "partially serialized" objects in memory (don't store the Unstructured objects themselves, but rather a serialized object, potentially with deserialized ObjectMeta for efficient filtering). But things like "field selectors for CRDs" would make it harder (and that would be a large effort on its own).

So for now, CRDs are by design less efficient than built-in resources; with CBOR we're attacking the CPU part first, though.
