
[Bug] kyverno may not be able to handle 3.5k policy resources with 2 rules each, total of 7k rules #10458

@aborrero


Kyverno Version

1.10.6

Description

We are working on migrating away from PodSecurityPolicy (PSP) so we can upgrade our kubernetes cluster from 1.24 to 1.25.

For that task we selected kyverno, and deployed namespaced kyverno Policy resources that we consider equivalent to what our PSPs were doing.

Our kubernetes cluster is a platform-as-a-service: we create a namespace for each user to deploy their own workloads, and therefore we work with namespaced Policy resources. We deploy kubernetes ourselves (via kubeadm) on virtual machines in our own datacenter.

We deploy kyverno via helm, using the 3.0.9 chart, which deploys kyverno v1.10.7.

We initially tried to deploy 3.5k policy resources, each with 2 rules (mutate, validate), for a total of 7k rules.
The different kyverno components had problems with the default CPU/memory resource requests and limits: the pods were constantly getting OOM-killed, etc. We decided to lift the resource limits entirely, and that change made kyverno apparently stable.
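
For reference, lifting the limits was done with a helm values override along these lines (a minimal sketch; the component keys and request values are illustrative and should be checked against the chart's own values.yaml for 3.0.9):

# illustrative values override for the kyverno 3.x helm chart; verify the exact
# keys against the chart's values.yaml before using
admissionController:
  container:
    resources:
      limits: {}        # remove the CPU/memory limits entirely
      requests:
        cpu: 500m       # illustrative request values
        memory: 512Mi
# similar overrides for backgroundController, reportsController and
# cleanupController, which expose their own resources blocks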

Then I detected that some of the kyverno policies had not been created correctly because their names were too long (the resource name was templated with the user name). So I decided to rename every policy and every rule to the same fixed string.

When running the rename script (delete the old policy resource, create a new policy resource which is the same but with a different name), the cluster experienced an extremely high load, to the point of becoming unusable.
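
The rename script was essentially a loop like this (simplified sketch; render-policy stands in for whatever templates the per-tool Policy manifest, and the namespace filter is illustrative):

#!/bin/bash
# For every tool namespace: delete the old (long-named) Policy resources and
# re-create the equivalent policy under the new, fixed name.
for ns in $(kubectl get namespaces -o name | grep '^namespace/tool-' | cut -d/ -f2); do
    kubectl delete policies.kyverno.io --all -n "$ns"
    render-policy "$ns" | kubectl apply -n "$ns" -f -   # render-policy is a placeholder
done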

We needed to perform a number of rescue operations, including completely disabling kyverno as documented here: https://kyverno.io/docs/troubleshooting/#api-server-is-blocked
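
For context, if I recall that guide correctly, "completely disabling kyverno" boils down to removing its webhook configurations so the API server stops calling it; something along these lines, with object and label names possibly differing per installation:

# Remove kyverno's webhook configurations so the API server stops sending
# admission requests to it (label selector as per the troubleshooting guide).
kubectl delete validatingwebhookconfigurations,mutatingwebhookconfigurations \
    -l webhook.kyverno.io/managed-by=kyverno
# Optionally scale the admission controller to zero while investigating
# (deployment name depends on the helm release name).
kubectl -n kyverno scale deployment kyverno-admission-controller --replicas=0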

In our system, a kyverno policy resource looks like this:

apiVersion: kyverno.io/v1
kind: Policy
metadata:
  annotations:
    kyverno.io/kubernetes-version: "1.24"
    kyverno.io/kyverno-version: 1.10.7
    policies.kyverno.io/category: Toolforge pod policy
    policies.kyverno.io/description: Toolforge tool account pod security mutation
      and validation
    policies.kyverno.io/subject: Pod
    policies.kyverno.io/title: Toolforge arturo-test-tool pod policy
    toolforge.org/kyverno_pod_policy_version: "2"
  creationTimestamp: "2024-06-12T16:45:35Z"
  generation: 1
  name: toolforge-kyverno-pod-policy
  namespace: tool-arturo-test-tool
  resourceVersion: "1997638015"
  uid: 4c10361a-65b2-4141-a5b0-011dc3d6d36d
spec:
  background: true
  rules:
  - match:
      all:
      - resources:
          kinds:
          - Pod
    mutate:
      patchStrategicMerge:
        spec:
          containers:
          - (name): '*'
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop:
                - ALL
              privileged: false
              runAsNonRoot: true
          hostIPC: false
          hostNetwork: false
          hostPID: false
          securityContext:
            fsGroup: 54005
            runAsGroup: 54005
            runAsUser: 54005
            seccompProfile:
              type: RuntimeDefault
    name: toolforge-mutate-pod-policy
  - match:
      any:
      - resources:
          kinds:
          - Pod
    name: toolforge-validate-pod-policy
    validate:
      message: pod security configuration must be correct
      pattern:
        spec:
=(ephemeralContainers):
          - =(securityContext):
              =(allowPrivilegeEscalation): false
              =(capabilities):
                drop:
                - ALL
              =(privileged): false
              =(runAsGroup): 54005
              =(runAsNonRoot): true
              =(runAsUser): 54005
              =(seccompProfile):
                type: RuntimeDefault
X(seLinuxOptions): null
          =(hostIPC): false
          =(hostNetwork): false
          =(hostPID): false
          =(initContainers):
          - =(securityContext):
              =(allowPrivilegeEscalation): false
              =(capabilities):
                drop:
                - ALL
              =(privileged): false
              =(runAsGroup): 54005
              =(runAsNonRoot): true
              =(runAsUser): 54005
              =(seccompProfile):
                type: RuntimeDefault
X(seLinuxOptions): null
          =(workingDir): /data/project/arturo-test-tool
          containers:
          - =(securityContext):
              =(allowPrivilegeEscalation): false
              =(capabilities):
                drop:
                - ALL
              =(privileged): false
              =(runAsGroup): 54005
              =(runAsNonRoot): true
              =(runAsUser): 54005
              =(seccompProfile):
                type: RuntimeDefault
X(seLinuxOptions): null
          securityContext:
            =(runAsNonRoot): true
            =(seccompProfile):
              type: RuntimeDefault
            =(supplementalGroups): 1-65535
            fsGroup: 54005
            runAsGroup: 54005
            runAsUser: 54005
  validationFailureAction: Audit

I cannot tell exactly what happened. I've seen other reports about Kyverno's UpdateRequests flooding the k8s API server, which may be the case here, because the k8s control plane nodes experienced an extremely high load during this incident.
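
If UpdateRequests were indeed flooding the API server, I guess something like this would have shown the backlog (I did not capture this during the incident, so this is just what I would check next time):

# UpdateRequests are intermediate resources kyverno creates for background
# processing; a large or ever-growing count would point at the flooding scenario.
kubectl get updaterequests -A --no-headers | wc -l
kubectl get updaterequests -n kyverno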

Also, after reading the documentation about resource limits (https://kyverno.io/docs/installation/scaling/), I have doubts about whether kyverno was ever tested in an environment with 7k rules.

So my questions are:

  • is this failure scenario already known? is it exactly the same as other similar, already-open issues such as #8668?
  • are we doing something obviously wrong?
  • is kyverno able to handle 3.5k policy resources with 2 rules each, totaling 7k rules?
  • do you have any special recommendations / hints about how to accommodate our PSP-replacement scenario? Maybe split the policies into 2 separate things, or write a single huge cluster-level policy that achieves the same (see the sketch after this list)?
  • do you think the issue is somewhere else? like, the api-server needs to be scaled up, or something.
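
To make the next-to-last question more concrete: what I have in mind is something like the following single ClusterPolicy (only a sketch, not something we have tried; the namespace label and the uid annotation are hypothetical, since the per-tool uid/gid would need to come from somewhere, e.g. a namespace annotation read through an apiCall context):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: toolforge-pod-policy
spec:
  background: true
  validationFailureAction: Audit
  rules:
  - name: toolforge-mutate-pod-policy
    match:
      all:
      - resources:
          kinds:
          - Pod
          namespaceSelector:
            matchLabels:
              toolforge.org/kind: tool   # hypothetical label on tool namespaces
    context:
    # hypothetical: fetch the namespace object so the per-tool uid/gid can be
    # read from an annotation instead of being hardcoded in a per-tool policy
    - name: toolns
      apiCall:
        urlPath: "/api/v1/namespaces/{{ request.namespace }}"
    mutate:
      patchStrategicMerge:
        spec:
          securityContext:
            # note: these fields expect integers, so the annotation value may
            # need a to_number() conversion
            runAsUser: '{{ toolns.metadata.annotations."toolforge.org/uid" }}'
            runAsGroup: '{{ toolns.metadata.annotations."toolforge.org/uid" }}'
            fsGroup: '{{ toolns.metadata.annotations."toolforge.org/uid" }}'
            seccompProfile:
              type: RuntimeDefault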

Slack discussion

No response

Troubleshooting

  • I have read and followed the documentation AND the troubleshooting guide.
  • I have searched other issues in this repository and mine is not recorded.
