Description
What happened?
When implementing Project Quotas, we're running into a problem where rolling restarts of machine deployments fail due to quota validation. The issue occurs because during a rolling update, the machine controller temporarily creates new machines before deleting old ones, causing a brief period where the total resource usage exceeds the quota limits.
Error Message:
failed to sync Machineset replicas: admission webhook "machines.cluster.k8c.io" denied the request: requested CPU "6" would exceed current quota (quota/used "20"/"18")
Root Cause:
- Machine controller uses a rolling update strategy with maxSurge to maintain availability
- New machines are created before old machines are deleted
- Quota validation happens at admission webhook level before the old machines are removed
- This creates a temporary overage that fails quota validation
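The arithmetic behind the error message can be sketched in a few lines of Go. This is only an illustration: `exceedsQuota` is a hypothetical helper, not the actual webhook code, and it assumes the check is a simple "used + requested > quota" comparison on CPU.

```go
package main

import "fmt"

// exceedsQuota reports whether admitting a machine that requests
// `requested` CPUs would push usage above the quota.
// Hypothetical helper for illustration; not the webhook's real code.
func exceedsQuota(quota, used, requested int64) bool {
	return used+requested > quota
}

func main() {
	// Values from the error message: quota 20, used 18, requested 6.
	fmt.Println(exceedsQuota(20, 18, 6)) // true, because 18+6 = 24 > 20
}
```

The old machine's CPUs are still counted in `used` at admission time, so the surge machine is rejected even though usage would drop back once the old machine is deleted.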
Expected behavior
The machine controller should be able to perform rolling updates without failing quota validation when:
- The replica count of the MachineDeployment is not being increased (i.e., it's a rolling restart/update, not a scale-up)
- The temporary overage is within the maxSurge limits of the rolling update strategy
- The total resource usage will return to normal levels once the rolling update completes
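The first two conditions could be expressed as a small predicate. The sketch below is an assumption about how such a check might look; `surgeWithinLimits` and its parameters are hypothetical names, and the third condition (usage returning to normal afterwards) cannot be verified at admission time, so it is not modeled here.

```go
package main

import "fmt"

// surgeWithinLimits reports whether creating one more machine is consistent
// with a rolling update rather than a scale-up.
// Hypothetical helper; parameter names are illustrative assumptions.
func surgeWithinLimits(previousReplicas, desiredReplicas, maxSurge, existingMachines int32) bool {
	// Condition 1: the replica count must not be increasing.
	if desiredReplicas > previousReplicas {
		return false
	}
	// Condition 2: after creating the new machine, the total must stay
	// within desiredReplicas + maxSurge.
	return existingMachines+1 <= desiredReplicas+maxSurge
}

func main() {
	fmt.Println(surgeWithinLimits(3, 3, 1, 3)) // true: rolling update, 4th machine is the surge
	fmt.Println(surgeWithinLimits(3, 4, 1, 3)) // false: this is a scale-up
	fmt.Println(surgeWithinLimits(3, 3, 1, 4)) // false: surge budget already used
}
```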
Proposed Solution:
- Add a quota bypass mechanism for machines created during rolling updates
- Use an annotation kubermatic.io/bypass-quota-validation=true on machines created during rolling updates
- Modify the admission webhook to detect quota-related errors and bypass validation when the annotation is present
- Ensure the bypass only applies to quota-related errors, not other validation failures
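A minimal sketch of the proposed bypass, using only the Go standard library. Everything except the annotation key is an assumption made for illustration: `ErrQuotaExceeded`, `errInvalidSpec`, and `admit` are hypothetical names, and the real webhook may represent quota errors differently.

```go
package main

import (
	"errors"
	"fmt"
)

// Annotation key taken from the proposal above.
const bypassAnnotation = "kubermatic.io/bypass-quota-validation"

// Hypothetical sentinel errors standing in for the webhook's real error types.
var (
	ErrQuotaExceeded = errors.New("quota exceeded")
	errInvalidSpec   = errors.New("invalid provider spec")
)

// admit returns nil when the machine may be admitted. A quota error is
// ignored only if the machine carries the bypass annotation; every other
// validation error is returned unchanged.
func admit(annotations map[string]string, validationErr error) error {
	if validationErr == nil {
		return nil
	}
	if errors.Is(validationErr, ErrQuotaExceeded) && annotations[bypassAnnotation] == "true" {
		return nil
	}
	return validationErr
}

func main() {
	surge := map[string]string{bypassAnnotation: "true"}
	quotaErr := fmt.Errorf("requested CPU %q would exceed current quota: %w", "6", ErrQuotaExceeded)

	fmt.Println(admit(surge, quotaErr) == nil)       // true: quota error bypassed
	fmt.Println(admit(nil, quotaErr) == nil)         // false: no annotation, quota enforced
	fmt.Println(admit(surge, errInvalidSpec) == nil) // false: non-quota errors still fail
}
```

Wrapping the quota error with `%w` lets `errors.Is` distinguish it from other validation failures, which keeps the bypass limited to quota-related errors as the last bullet requires.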
How to reproduce the issue?
- Set up a MachineDeployment with quota limits:

      apiVersion: "cluster.k8s.io/v1alpha1"
      kind: MachineDeployment
      metadata:
        name: test-deployment
      spec:
        replicas: 3
        strategy:
          type: RollingUpdate
          rollingUpdate:
            maxSurge: 1
            maxUnavailable: 1
        template:
          spec:
            providerSpec:
              value:
                cloudProvider: "azure"
                cloudProviderSpec:
                  vmSize: "Standard_D2s_v3" # 2 vCPUs per machine

- Set up resource quotas:

      apiVersion: v1
      kind: ResourceQuota
      metadata:
        name: compute-quota
      spec:
        hard:
          requests.cpu: "6" # Total of 3 machines * 2 CPUs = 6 CPUs

- Trigger a rolling update by changing the machine template (e.g., update the OS image or kubelet version)
- Observe the failure when the machine controller tries to create the 4th machine (3 existing + 1 new for rolling update)
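For the numbers in this reproduction, the peak CPU demand during the rollout can be worked out as follows. `peakCPU` is a hypothetical helper, and the sketch assumes the controller surges up to replicas + maxSurge machines before removing old ones.

```go
package main

import "fmt"

// peakCPU returns the CPU demand at the point where all surge machines
// exist alongside the full set of replicas.
// Hypothetical helper for illustration only.
func peakCPU(replicas, maxSurge, cpuPerMachine int64) int64 {
	return (replicas + maxSurge) * cpuPerMachine
}

func main() {
	const quota = 6 // requests.cpu from the ResourceQuota above

	// 3 replicas, maxSurge 1, 2 vCPUs per Standard_D2s_v3 machine.
	peak := peakCPU(3, 1, 2)
	fmt.Printf("peak=%d quota=%d exceeds=%t\n", peak, quota, peak > quota)
	// prints "peak=8 quota=6 exceeds=true"
}
```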
Additional details
Current Flow:
MachineDeployment Update → MachineSet Scaling → Machine Creation → Quota Check → FAIL
Proposed Flow:
MachineDeployment Update → MachineSet Scaling → Machine Creation (with bypass annotation) → Quota Check → BYPASS → SUCCESS