Kyverno Version
1.10.6
Description
We are working on migrating away from PodSecurityPolicy (PSP) so that we can upgrade from kubernetes 1.24 to 1.25.
For that task we selected kyverno, and deployed namespaced kyverno Policy resources that we considered equivalent to what the PSPs were doing.
Our kubernetes cluster is a platform-as-a-service: we create a namespace for each user to deploy their own workloads, which is why we work with namespaced Policy resources. We run kubernetes on bare metal (via kubeadm) on virtual machines in our own datacenter.
We deploy kyverno via helm, using the 3.0.9 chart, which deploys kyverno v1.10.7.
We initially tried to deploy 3.5k Policy resources, each with 2 rules (one mutate, one validate), for a total of 7k rules.
The different kyverno components struggled with the default CPU/MEM resource requests and limits: the pods were constantly getting OOM-killed. We decided to lift the resource limits entirely, and with that change kyverno appeared stable.
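For reference, the change amounted to something like the following Helm values override (a sketch; the key paths reflect our reading of the kyverno 3.0.9 chart and should be verified against its values.yaml):

```yaml
# Sketch: keep the default requests but drop the limits so the pods
# stop getting OOM-killed. Key paths are assumptions based on the
# kyverno 3.0.9 chart layout and may need adjusting.
admissionController:
  container:
    resources:
      limits: null
backgroundController:
  resources:
    limits: null
reportsController:
  resources:
    limits: null
cleanupController:
  resources:
    limits: null
```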
Then I noticed that some of the kyverno policies had not been created correctly because their names were too long (the resource name was templated with the user name). So I decided to rename each policy and each rule to the same fixed string.
While the rename script was running (delete the old Policy resource, then create an identical one under the new name), the cluster experienced extremely high load, to the point of becoming unusable.
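The core of the rename logic was roughly the following (a minimal sketch; the helper name and the example manifest are illustrative, and in practice this ran as kubectl delete / kubectl create per namespace, so every rename went through the admission webhooks twice):

```python
import copy

# Fields set by the API server that must be dropped before
# re-creating the object under a new name.
SERVER_SET_FIELDS = ("creationTimestamp", "resourceVersion", "uid", "generation")

def renamed_policy(policy: dict, new_name: str) -> dict:
    """Return a copy of a Policy manifest suitable for re-creation under new_name."""
    new = copy.deepcopy(policy)
    new["metadata"]["name"] = new_name
    for field in SERVER_SET_FIELDS:
        new["metadata"].pop(field, None)
    return new

# Hypothetical example of an over-long templated name being replaced
# with a fixed one.
old = {
    "apiVersion": "kyverno.io/v1",
    "kind": "Policy",
    "metadata": {
        "name": "toolforge-kyverno-pod-policy-arturo-test-tool",
        "namespace": "tool-arturo-test-tool",
        "resourceVersion": "1997638015",
        "uid": "4c10361a-65b2-4141-a5b0-011dc3d6d36d",
    },
    "spec": {"background": True, "rules": []},
}
new = renamed_policy(old, "toolforge-kyverno-pod-policy")
# The old object is then deleted and the new one created.
```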
We needed to perform a number of rescue operations, including completely disabling kyverno as documented here: https://kyverno.io/docs/troubleshooting/#api-server-is-blocked
In our system, a kyverno policy resource looks like this:
```yaml
apiVersion: kyverno.io/v1
kind: Policy
metadata:
  annotations:
    kyverno.io/kubernetes-version: "1.24"
    kyverno.io/kyverno-version: 1.10.7
    policies.kyverno.io/category: Toolforge pod policy
    policies.kyverno.io/description: Toolforge tool account pod security mutation
      and validation
    policies.kyverno.io/subject: Pod
    policies.kyverno.io/title: Toolforge arturo-test-tool pod policy
    toolforge.org/kyverno_pod_policy_version: "2"
  creationTimestamp: "2024-06-12T16:45:35Z"
  generation: 1
  name: toolforge-kyverno-pod-policy
  namespace: tool-arturo-test-tool
  resourceVersion: "1997638015"
  uid: 4c10361a-65b2-4141-a5b0-011dc3d6d36d
spec:
  background: true
  rules:
  - match:
      all:
      - resources:
          kinds:
          - Pod
    mutate:
      patchStrategicMerge:
        spec:
          containers:
          - (name): '*'
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop:
                - ALL
              privileged: false
              runAsNonRoot: true
          hostIPC: false
          hostNetwork: false
          hostPID: false
          securityContext:
            fsGroup: 54005
            runAsGroup: 54005
            runAsUser: 54005
            seccompProfile:
              type: RuntimeDefault
    name: toolforge-mutate-pod-policy
  - match:
      any:
      - resources:
          kinds:
          - Pod
    name: toolforge-validate-pod-policy
    validate:
      message: pod security configuration must be correct
      pattern:
        spec:
          =(ephemerealContainers):
          - =(securityContext):
              =(allowPrivilegeEscalation): false
              =(capabilities):
                drop:
                - ALL
              =(privileged): false
              =(runAsGroup): 54005
              =(runAsNonRoot): true
              =(runAsUser): 54005
              =(seccompProfile):
                type: RuntimeDefault
              x(seLinuxOptions): null
          =(hostIPC): false
          =(hostNetwork): false
          =(hostPID): false
          =(initContainers):
          - =(securityContext):
              =(allowPrivilegeEscalation): false
              =(capabilities):
                drop:
                - ALL
              =(privileged): false
              =(runAsGroup): 54005
              =(runAsNonRoot): true
              =(runAsUser): 54005
              =(seccompProfile):
                type: RuntimeDefault
              x(seLinuxOptions): null
            =(workingDir): /data/project/arturo-test-tool
          containers:
          - =(securityContext):
              =(allowPrivilegeEscalation): false
              =(capabilities):
                drop:
                - ALL
              =(privileged): false
              =(runAsGroup): 54005
              =(runAsNonRoot): true
              =(runAsUser): 54005
              =(seccompProfile):
                type: RuntimeDefault
              x(seLinuxOptions): null
          securityContext:
            =(runAsNonRoot): true
            =(seccompProfile):
              type: RuntimeDefault
            =(supplementalGroups): 1-65535
            fsGroup: 54005
            runAsGroup: 54005
            runAsUser: 54005
  validationFailureAction: Audit
```
I cannot tell exactly what happened. I've seen other reports about kyverno's UpdateRequests flooding the k8s API server, which may be what happened here, because the k8s control plane nodes experienced extremely high load during this incident.
Also, after reading the scaling documentation at https://kyverno.io/docs/installation/scaling/ I have doubts whether kyverno was ever tested in an environment with 7k rules.
So my questions are:
- is this failure scenario already known? is it exactly the same as other similar, already-open issues like #8668?
- are we doing something obviously wrong?
- is kyverno able to handle 3.5k Policy resources with 2 rules each, totaling 7k rules?
- do you have any special recommendations / hints on how to accommodate our PSP-replacement scenario? Maybe split the policies into 2 separate resources, or use a single huge cluster-level policy to achieve the same?
- do you think the issue is somewhere else? e.g. does the api-server need to be scaled up?
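For context on the cluster-level option, the single-policy variant we have in mind would look roughly like this, matching on a namespace label instead of templating one Policy per tool (the label name is a hypothetical example, and whether the per-tool UID/GID values can be derived dynamically is exactly what we'd need to validate):

```yaml
# Sketch of a single cluster-wide replacement policy. The
# toolforge.org/kind label is an assumed convention, not something
# that exists in our namespaces today.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: toolforge-pod-policy
spec:
  background: true
  validationFailureAction: Audit
  rules:
  - name: toolforge-mutate-pod-policy
    match:
      all:
      - resources:
          kinds:
          - Pod
          namespaceSelector:
            matchLabels:
              toolforge.org/kind: tool
    mutate:
      patchStrategicMerge:
        spec:
          securityContext:
            # The per-tool values (runAsUser/runAsGroup/fsGroup, currently
            # hard-coded per namespaced Policy) would need to come from
            # namespace metadata via a context variable; whether that fits
            # our case is part of the question.
            runAsNonRoot: true
            seccompProfile:
              type: RuntimeDefault
```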
Slack discussion
No response
Troubleshooting