Describe the bug
When autoscaling.enable: true is configured in the Helm chart, the NGF controller updates the deployment and modifies the spec.replicas field, conflicting with the HPA. This causes the deployment to scale up and down within the same second, resulting in constant pod churn and preventing the HPA from scaling up or down consistently.
To Reproduce
- Deploy NGF with autoscaling enabled using these Helm values:

  nginx:
    autoscaling:
      enable: true
      metrics:
        - external:
            metric:
              name: <some-external-metric-providing-connection-count-across-all-replicas>
            target:
              type: Value
              value: 20000
          type: External
      minReplicas: 1
      maxReplicas: 10
- Wait for the HPA to trigger a scale-down event (the watch commands after this list can help catch it)
- Observe scaling events:
kubectl get events -n ngf --sort-by='.lastTimestamp' -o custom-columns='when:lastTimestamp,msg:message,reason:reason,obj:involvedObject.name,cmp:source.component' | grep -E "SuccessfulRescale|ScalingReplicaSet"
- Check which manager last updated the deployment's replicas:
kubectl get deployment nginx-public-gateway-nginx -n ngf --show-managed-fields -o json | \
jq '.metadata.managedFields[] | select(.fieldsV1."f:spec"."f:replicas") | {manager: .manager, operation: .operation, time: .time}'
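While waiting for the scale-down in step 2, it can help to watch the HPA and the deployment side by side so the flip-flop is visible as it happens (a minimal sketch using the resource names from the commands above):

> kubectl get hpa -n ngf -w

> kubectl get deployment nginx-public-gateway-nginx -n ngf -w -o jsonpath='{.spec.replicas}{"\n"}'

When the bug occurs, the deployment's spec.replicas briefly drops to the HPA's new desired count and then snaps back within the same second.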
Expected behavior
When autoscaling.enable: true is set, the NGF controller should:
- Create the HPA resource
- Not change the spec.replicas field after the HPA is created
- Allow the HPA to be the sole controller managing the replica count
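As a rough illustration of the last two points (this is not NGF's actual applied object; the selector, labels, and image below are placeholders), the data-plane Deployment the control plane writes could simply omit spec.replicas whenever autoscaling.enable is true, so the controller never claims ownership of that field and the HPA is left to manage it:

  # Sketch only: Deployment as NGF might apply it when autoscaling is enabled
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: nginx-public-gateway-nginx
  spec:
    # replicas intentionally omitted; the HPA owns this field
    selector:
      matchLabels:
        app.kubernetes.io/instance: nginx-public-gateway
    template:
      metadata:
        labels:
          app.kubernetes.io/instance: nginx-public-gateway
      spec:
        containers:
          - name: nginx
            image: nginx

This mirrors the usual convention in Helm charts and operators: when an HPA is enabled, leave replicas unset (or carry forward the live value) instead of writing a fixed number on every reconcile.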
Your environment
- Version of NGINX Gateway Fabric: 2.1.2 (commit: 877c415, date: 2025-09-25T19:31:07Z)
- Kubernetes Version: v1.32.6
- Platform: Azure Kubernetes Service (AKS)
- Exposure method: Service type LoadBalancer
- Helm Chart Version: nginx-gateway-fabric-2.1.2
Observed behavior
Events show the deployment scaling up and down within the same second:
> kubectl get events -n immy-routing --sort-by='.lastTimestamp' -o custom-columns='when:lastTimestamp,msg:message,reason:reason,obj:involvedObject.name,cmp:source.component' | grep -E "SuccessfulRescale|ScalingReplicaSet"
2025-10-02T18:17:53Z New size: 10; reason: external metric datadogmetric@immy-routing:nginx-connection-count-connections(nil) above target SuccessfulRescale nginx-public-gateway-nginx horizontal-pod-autoscaler
2025-10-02T18:19:38Z Scaled down replica set nginx-public-gateway-nginx-57b699c549 from 10 to 8 ScalingReplicaSet nginx-public-gateway-nginx deployment-controller
2025-10-02T18:19:38Z Scaled up replica set nginx-public-gateway-nginx-57b699c549 from 8 to 10 ScalingReplicaSet nginx-public-gateway-nginx deployment-controller
2025-10-02T18:21:38Z Scaled down replica set nginx-public-gateway-nginx-57b699c549 from 10 to 9 ScalingReplicaSet nginx-public-gateway-nginx deployment-controller
2025-10-02T18:21:38Z Scaled up replica set nginx-public-gateway-nginx-57b699c549 from 9 to 10 ScalingReplicaSet nginx-public-gateway-nginx deployment-controller
2025-10-02T18:25:23Z Scaled up replica set ngf-nginx-gateway-fabric-74db69c968 from 0 to 1 ScalingReplicaSet ngf-nginx-gateway-fabric deployment-controller
2025-10-02T18:25:26Z Scaled down replica set ngf-nginx-gateway-fabric-7b99997d79 from 1 to 0 ScalingReplicaSet ngf-nginx-gateway-fabric deployment-controller
2025-10-02T18:25:39Z New size: 9; reason: All metrics below target SuccessfulRescale nginx-public-gateway-nginx horizontal-pod-autoscaler
2025-10-02T18:51:42Z Scaled down replica set nginx-public-gateway-nginx-57b699c549 from 9 to 8 ScalingReplicaSet nginx-public-gateway-nginx deployment-controller
2025-10-02T18:51:42Z New size: 8; reason: All metrics below target SuccessfulRescale nginx-public-gateway-nginx horizontal-pod-autoscaler
Checking the managed fields confirms that the NGF controller (the "gateway" field manager) modified replicas in the same second as the HPA:
> kubectl get deployment nginx-public-gateway-nginx -n immy-routing --show-managed-fields -o json | \
jq '.metadata.managedFields[] | select(.fieldsV1."f:spec"."f:replicas") | {manager: .manager, operation: .operation, time: .time}'
{
"manager": "gateway",
"operation": "Update",
"time": "2025-10-02T18:51:42Z"
}
And the replica count set by the HPA has been overwritten back to the old value:
> kubectl get deployment nginx-public-gateway-nginx -n immy-routing -o json | jq '.spec.replicas'
9
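For comparison, the HPA's own view of the desired count can be read from its status (the HPA name is assumed to match the deployment; adjust if it differs):

> kubectl get hpa nginx-public-gateway-nginx -n immy-routing -o custom-columns='NAME:.metadata.name,DESIRED:.status.desiredReplicas,CURRENT:.status.currentReplicas'

While the bug is active, DESIRED should show the value the HPA last computed (8 in the events above) while the deployment's spec.replicas has already been reset to 9.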
Additional context
Suspected root cause: the NGF controller updates the deployment, including the spec.replicas field, even when the HPA is enabled, resulting in a race condition:
- HPA decides to scale down (e.g., 10 → 8 replicas)
- HPA updates the deployment's .spec.replicas to 8
- Deployment terminates the excess pods
- NGF controller reconciles and resets .spec.replicas back to the old value (e.g., 10)
- Deployment spins the pods up again
Impact on production:
- Pods are terminated and recreated every ~2 minutes (matching the HPA scale-down period)
- Thousands of websocket connections dropped on each restart
- Connection storms after scale-downs cause metric spikes
- HPA unable to effectively manage scaling due to constant interference