
[Bug]: Ingress Controller is not able to restart correctly after OOM #8529

@glieske

Description

Version

5.2.1

What Kubernetes platforms are you running on?

EKS Amazon

Steps to reproduce

  1. Deploy an ingress controller with low memory requests and limits (kept low just for the test, so the OOM is easier to trigger), e.g.
     requests: 200Mi, limits: 512Mi. A deployment sketch is shown after the log excerpt below.
  2. Generate some traffic and wait for the OOM event; Kubernetes then restarts the container.
  3. Observe the nginx logs:
  3. Observe nginx logs:

I20251112 09:42:27.813418 1 flags.go:287] Starting with flags: ["-nginx-plus=false" "-nginx-reload-timeout=60000" "-enable-app-protect=false" "-enable-app-protect-dos=false" "-nginx-configmaps=nginx/nginx-lb-nginx-ingress" "-ingress-class=nginx" "-health-status=true" "-health-status-uri=/nginx-health" "-nginx-debug=false" "-log-level=info" "-log-format=glog" "-nginx-status=true" "-nginx-status-port=8080" "-nginx-status-allow-cidrs=127.0.0.1" "-report-ingress-status" "-enable-leader-election=true" "-leader-election-lock-name=nginx-lb-nginx-ingress-leader-election" "-enable-prometheus-metrics=true" "-prometheus-metrics-listen-port=9113" "-prometheus-tls-secret=" "-enable-service-insight=false" "-service-insight-listen-port=9114" "-service-insight-tls-secret=" "-enable-custom-resources=true" "-enable-snippets=true" "-disable-ipv6=true" "-enable-tls-passthrough=false" "-enable-cert-manager=false" "-enable-oidc=false" "-enable-external-dns=true" "-default-http-listener-port=80" "-default-https-listener-port=443" "-ready-status=true" "-ready-status-port=8081" "-enable-latency-metrics=true" "-ssl-dynamic-reload=true" "-enable-telemetry-reporting=false" "-weight-changes-dynamic-reload=false"]
[[[ some more lines here ]]]
2025/11/12 09:42:27 [notice] 21#21: js vm init njs: 0000556A38EE7780
2025/11/12 09:42:27 [emerg] 21#21: bind() to unix:/var/lib/nginx/nginx-status.sock failed (98: Address already in use)
2025/11/12 09:42:27 [emerg] 21#21: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
2025/11/12 09:42:27 [emerg] 21#21: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address already in use)
2025/11/12 09:42:27 [emerg] 21#21: bind() to unix:/var/lib/nginx/nginx-418-server.sock failed (98: Address already in use)
2025/11/12 09:42:27 [notice] 21#21: try again to bind() after 500ms
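
For step 1, a reproducer deployment could look roughly like this (a sketch, not the exact setup from this report; it assumes the nginx-stable Helm repo and the chart's controller.resources values):

    helm repo add nginx-stable https://helm.nginx.com/stable
    helm upgrade --install nginx-lb nginx-stable/nginx-ingress \
      --namespace nginx \
      --set controller.resources.requests.memory=200Mi \
      --set controller.resources.limits.memory=512Mi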

nginx can't bind to these unix sockets again until the pod is deleted and recreated; after that, it works fine.
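
As a hypothetical way to confirm the stale sockets and apply the current workaround (the container name is an assumption based on default chart values, not taken from this report):

    # Check whether the old unix sockets are still present after the container restart
    kubectl -n nginx exec <pod-name> -c nginx-ingress -- ls -l /var/lib/nginx/

    # Workaround: delete the pod so it is recreated and starts with clean socket paths
    kubectl -n nginx delete pod <pod-name>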

This issue definitely didn't start with this version. We've been seeing it for some time. I might be mistaken, but I believe it began after upgrading the Helm chart to version 2.0+.

Worth mentioning: we are using Linkerd, so a native sidecar running Linkerd's proxy is present in the pod the whole time.

Labels: bug (An issue reporting a potential bug), needs triage (An issue that needs to be triaged)
