Feature Request: Quota Bypass for Rolling Updates in Machine Deployments #1968

@ronissac88

What happened?

With Project Quotas enabled, rolling restarts of MachineDeployments fail quota validation. During a rolling update, the machine controller creates new machines before deleting old ones, so total resource usage briefly exceeds the quota limits.

Error Message:

failed to sync Machineset replicas: admission webhook "machines.cluster.k8c.io" denied the request: requested CPU "6" would exceed current quota (quota/used "20"/"18")

Root Cause:

  • Machine controller uses a rolling update strategy with maxSurge to maintain availability
  • New machines are created before old machines are deleted
  • Quota validation happens at the admission webhook level, before the old machines are removed
  • This creates a temporary overage that fails quota validation

Expected behavior

The machine controller should be able to perform rolling updates without failing quota validation when:

  1. The replica count of the MachineDeployment is not being increased (i.e., it's a rolling restart/update, not a scale-up)
  2. The temporary overage is within the maxSurge limits of the rolling update strategy
  3. The total resource usage will return to normal levels once the rolling update completes
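
The conditions above can be sketched as a single predicate. This is an illustration only, written in Python rather than the machine controller's own code. `RolloutState` and `surge_is_exempt` are hypothetical names, and maxSurge is assumed to be already resolved to an absolute machine count.

```python
from dataclasses import dataclass


@dataclass
class RolloutState:
    """Hypothetical snapshot of a MachineDeployment during a rollout."""
    old_replicas: int      # spec.replicas before the update
    new_replicas: int      # spec.replicas after the update
    max_surge: int         # rollingUpdate.maxSurge as an absolute count
    current_machines: int  # machines that exist right now (old + new)


def surge_is_exempt(state: RolloutState) -> bool:
    """Return True if creating one more machine could skip the quota check."""
    # Condition 1: a rolling restart/update, not a scale-up.
    if state.new_replicas > state.old_replicas:
        return False
    # Condition 2: the extra machine stays within replicas + maxSurge.
    return state.current_machines + 1 <= state.new_replicas + state.max_surge
```

Condition 3 is not checked separately in this sketch: if the replica count does not grow and the surge is bounded, usage should fall back to its previous level once the old machines are deleted.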

Proposed Solution:

  • Add a quota bypass mechanism for machines created during rolling updates
  • Set the annotation kubermatic.io/bypass-quota-validation=true on machines created during rolling updates
  • Modify the admission webhook to detect quota-related errors and bypass validation when the annotation is present
  • Ensure the bypass only applies to quota-related errors, not other validation failures
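
A minimal sketch of the proposed webhook behaviour, in Python for illustration only. The annotation name comes from the proposal above. Everything else (`QuotaExceeded`, `InvalidMachine`, `validate_machine`, `admit`, and the dict-shaped machine) is hypothetical and does not reflect the actual webhook code.

```python
BYPASS_ANNOTATION = "kubermatic.io/bypass-quota-validation"  # name from the proposal


class QuotaExceeded(Exception):
    """Hypothetical error type for quota violations."""


class InvalidMachine(Exception):
    """Hypothetical error type for any other validation failure."""


def validate_machine(machine: dict, used_cpu: int, quota_cpu: int) -> None:
    """Stand-in for the webhook's existing checks."""
    cpu = machine.get("cpu")
    if not isinstance(cpu, int) or cpu <= 0:
        raise InvalidMachine("machine must request a positive CPU count")
    if used_cpu + cpu > quota_cpu:
        raise QuotaExceeded(
            f'requested CPU "{cpu}" would exceed current quota '
            f'(quota/used "{quota_cpu}"/"{used_cpu}")'
        )


def admit(machine: dict, used_cpu: int, quota_cpu: int) -> bool:
    """Admit the machine, bypassing only quota errors when annotated."""
    annotations = machine.get("annotations", {})
    try:
        validate_machine(machine, used_cpu, quota_cpu)
    except QuotaExceeded:
        if annotations.get(BYPASS_ANNOTATION) == "true":
            return True  # quota error bypassed for a rolling-update machine
        raise
    return True
```

Only `QuotaExceeded` is caught, so an annotated machine that fails any other check is still rejected, which matches the last bullet of the proposal.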

How to reproduce the issue?

  1. Set up a MachineDeployment in a project subject to quota limits:

    apiVersion: "cluster.k8s.io/v1alpha1"
    kind: MachineDeployment
    metadata:
      name: test-deployment
    spec:
      replicas: 3
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1
          maxUnavailable: 1
      template:
        spec:
          providerSpec:
            value:
              cloudProvider: "azure"
              cloudProviderSpec:
                vmSize: "Standard_D2s_v3"  # 2 vCPUs per machine

  2. Set up resource quotas:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: compute-quota
    spec:
      hard:
        requests.cpu: "6"  # Total of 3 machines * 2 CPUs = 6 CPUs

  3. Trigger a rolling update by changing the machine template (e.g., update the OS image or kubelet version)

  4. Observe the failure when the machine controller tries to create the 4th machine (3 existing + 1 new for rolling update)
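
The arithmetic behind step 4 can be checked with a short sketch, using the values from the manifests above (3 replicas, maxSurge 1, 2 vCPUs per machine, a quota of 6 CPUs). `peak_cpu` is a hypothetical helper written for illustration, not part of any controller.

```python
def peak_cpu(replicas: int, max_surge: int, cpu_per_machine: int) -> int:
    """CPU requested at the peak of a rolling update (old machines + surge)."""
    return (replicas + max_surge) * cpu_per_machine


replicas, max_surge, cpu_per_machine, quota = 3, 1, 2, 6

steady = replicas * cpu_per_machine                   # usage before the update
peak = peak_cpu(replicas, max_surge, cpu_per_machine) # usage with the 4th machine

print(steady, peak, peak - quota)  # prints: 6 8 2
```

The deployment sits exactly at the quota in steady state, so the single surge machine pushes it 2 CPUs over and the admission webhook rejects it.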

Additional details

Current Flow:

MachineDeployment Update → MachineSet Scaling → Machine Creation → Quota Check → FAIL

Proposed Flow:

MachineDeployment Update → MachineSet Scaling → Machine Creation (with bypass annotation) → Quota Check → BYPASS → SUCCESS

Labels: customer-request, kind/feature, sig/cluster-management