
Ingress controller keeps increasing memory when a new backend reload action is triggered #8362

Closed
pdefreitas opened this issue Mar 21, 2022 · 9 comments
Labels
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-priority
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@pdefreitas

pdefreitas commented Mar 21, 2022

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): 1.1.2
Kubernetes version (use kubectl version): 1.21.9, 1.22.6

Environment:

  • Cloud provider or hardware configuration: Azure Kubernetes Service (AKS)

  • OS (e.g. from /etc/os-release): Ubuntu 18.04.6 LTS (Bionic Beaver)

  • Kernel (e.g. uname -a): Linux 5.4.0-1070-azure #73~18.04.1-Ubuntu SMP Wed Feb 9 15:36:45 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools: Azure managed

  • Basic cluster related info: Versions mentioned above + cluster autoscaler.

  • How was the ingress-nginx-controller installed:

    • If helm was used then please show output of helm ls -A | grep -i ingress
helm ls -A | grep -i ingress
nginx-ingress-z    x    26    2022-03-07 00:00:00.000000000 +0000 UTC    deployed    ingress-nginx-4.0.18    1.1.2
nginx-ingress-y    y    7     2022-03-17 00:00:00.000000000 +0000 UTC    deployed    ingress-nginx-4.0.18    1.1.2
nginx-ingress-x    x    26    2022-03-07 00:00:00.000000000 +0000 UTC    deployed    ingress-nginx-4.0.18    1.1.2
  • If helm was used then please show output of helm -n <ingresscontrollernamespace> get values <helmreleasename>

nginx-ingress-x

USER-SUPPLIED VALUES:
controller:
  admissionWebhooks:
    timeoutSeconds: 30
  config:
    enable-modsecurity: true
    hsts: true
    proxy-body-size: 50m
    ssl-protocols: TLSv1.2 TLSv1.3
    ssl-session-cache: false
  electionID: nginx-custom-x
  ingressClass: nginx-custom-x
  ingressClassByName: true
  ingressClassResource:
    controllerValue: k8s.io/nginx-custom-x
    name: nginx-custom-x
  metrics:
    enabled: true
    service:
      annotations:
        prometheus.io/port: "10254"
        prometheus.io/scrape: "true"
  podAnnotations:
    prometheus.io/port: "10254"
    prometheus.io/scrape: "true"
  publishService:
    enabled: true
  rbac:
    create: true
  resources:
    limits:
      memory: 1200Mi
    requests:
      cpu: 100m
      memory: 1000Mi
  scope:
    enabled: true
  service:
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-resource-group: xxx
    externalTrafficPolicy: Local
    loadBalancerIP: x.x.x.x
  startupProbe:
    failureThreshold: 5
    httpGet:
      path: /healthz
      port: 10254
      scheme: HTTP
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 2

nginx-ingress-z

USER-SUPPLIED VALUES:
controller:
  admissionWebhooks:
    timeoutSeconds: 30
  config:
    enable-modsecurity: true
    enable-real-ip: "true"
    hsts: true
    proxy-body-size: 50m
    ssl-protocols: TLSv1.2 TLSv1.3
    ssl-session-cache: false
    use-proxy-protocol: "true"
  electionID: nginx-custom-z
  ingressClass: nginx-custom-z
  ingressClassByName: true
  ingressClassResource:
    controllerValue: k8s.io/nginx-custom-z
    name: nginx-custom-z
  metrics:
    enabled: true
    service:
      annotations:
        prometheus.io/port: "10254"
        prometheus.io/scrape: "true"
  podAnnotations:
    prometheus.io/port: "10254"
    prometheus.io/scrape: "true"
  publishService:
    enabled: true
  rbac:
    create: true
  resources:
    limits:
      memory: 800Mi
    requests:
      cpu: 100m
      memory: 500Mi
  scope:
    enabled: true
  service:
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-internal: true
      service.beta.kubernetes.io/azure-load-balancer-resource-group: xxx
    loadBalancerIP: x.x.x.x
  startupProbe:
    failureThreshold: 5
    httpGet:
      path: /healthz
      port: 10254
      scheme: HTTP
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 2
  • If you have more than one instance of the ingress-nginx-controller installed in the same cluster, please provide details for all the instances

    • Ingress controller nginx-ingress-y in namespace y is not leaking memory.
  • Current State of the controller:

    • All ingress controllers work properly until they get killed. nginx-ingress-y, which runs alone in its own namespace, has no issues and has a configuration similar to nginx-ingress-x. nginx-ingress-z eventually runs out of memory (less frequently, because it has fewer ingress rules). nginx-ingress-x is the most problematic.
  • Current state of ingress object, if applicable:

What happened:

Ingress controllers nginx-ingress-z and nginx-ingress-x are leaking memory over time. We noticed that memory usage increases whenever backend reload operations happen.

What you expected to happen:

I would expect memory to remain roughly constant between backend reloads (i.e., memory being released after each reload). Issues #8166, #8336 and #8357 exhibit similar behavior in a similar setup.
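For reference, a minimal way to watch the growth between reloads; the pod label selector and namespace are assumptions based on a standard Helm install with metrics enabled:

# Sample controller memory usage every 30s (label selector and namespace are illustrative)
watch -n 30 "kubectl top pod -n x -l app.kubernetes.io/instance=nginx-ingress-x"

# With the Prometheus metrics enabled above, the same trend can be correlated against reloads using
# nginx_ingress_controller_nginx_process_resident_memory_bytes (RSS) and
# nginx_ingress_controller_success (reload counter).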

How to reproduce it:

  • Install two ingress controllers in the same namespace with the user-supplied values from above.
  • Add multiple ingress rules to each ingress controller.
    • nginx-ingress-x has ~10 ingress resources with ModSecurity + OWASP ModSecurity Core Rule Set.
    • nginx-ingress-z has ~7 ingress resources.
  • Force the backend to reload; memory increases on each reload, eventually causing an OOM kill (see the sketch after this list).
  • Pods then get stuck in CrashLoopBackOff due to Fix for buggy ingress sync with retries #8325 and Fix buggy retry logic in syncIngress() #7086. You end up having to scale the deployment down to zero and back up again to launch a new pod.
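A rough reproduction sketch, assuming the Helm releases above; the ingress name, namespace, label selector, and deployment name are hypothetical:

# Repeatedly change an ingress annotation to trigger backend reloads, then watch memory climb.
for i in $(seq 1 50); do
  kubectl -n x annotate ingress demo-app \
    nginx.ingress.kubernetes.io/proxy-body-size="$((40 + i % 2))m" --overwrite
  sleep 30
  kubectl top pod -n x -l app.kubernetes.io/instance=nginx-ingress-x
done

# Once a pod is OOM-killed and stuck crash-looping, the only workaround we found was
# scaling the deployment down to zero and back up (deployment name is hypothetical):
kubectl -n x scale deployment nginx-ingress-x-controller --replicas=0
kubectl -n x scale deployment nginx-ingress-x-controller --replicas=1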

Anything else we need to know:

Ingress rules on nginx-ingress-x have ModSecurity + OWASP ModSecurity Core Rule Set annotations. nginx-ingress-z handles internal traffic (virtual network level) and uses the proxy protocol. This setup worked fine, with no memory growth, prior to 0.48.x. We had to upgrade to 1.x.x due to a Kubernetes upgrade and security patches. The same issue happens with Prometheus metrics disabled (we enabled them only for troubleshooting).
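For illustration, the ModSecurity-related annotations on the nginx-ingress-x ingresses look roughly like this; the ingress and namespace names are hypothetical, the annotations themselves are the standard ingress-nginx ones:

# Enable ModSecurity with the OWASP Core Rule Set on an ingress (names are hypothetical)
kubectl -n x annotate ingress demo-app \
  nginx.ingress.kubernetes.io/enable-modsecurity="true" \
  nginx.ingress.kubernetes.io/enable-owasp-core-rules="true" --overwrite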

@pdefreitas pdefreitas added the kind/bug Categorizes issue or PR as related to a bug. label Mar 21, 2022
@k8s-ci-robot
Contributor

@pdefreitas: This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Mar 21, 2022
@longwuyuan
Contributor

/remove-kind bug
/kind feature

Install each instance of the ingress-nginx controller in its own namespace. This is documented.
The issues you have listed are not the same problem when compared across all relevant aspects.
When higher-priority issues are resolved, the developers will get time to work on namespace-related functionality. For now, install each instance of the ingress-nginx controller in its own dedicated namespace and do not install another instance of the controller in the same namespace.
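For illustration, a dedicated-namespace install could look roughly like this; release names, namespaces, and values files are illustrative:

# One release per dedicated namespace (names and values files are illustrative)
helm upgrade --install nginx-ingress-x ingress-nginx/ingress-nginx \
  -n ingress-x --create-namespace -f values-x.yaml
helm upgrade --install nginx-ingress-z ingress-nginx/ingress-nginx \
  -n ingress-z --create-namespace -f values-z.yaml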

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Mar 21, 2022
@pdefreitas
Author

@longwuyuan thanks for the prompt reply, but there are multiple problems to address:

@longwuyuan
Contributor

From my limited visibility, I can state that:

  • multiple distinct problems are likely being experienced by one user, but not by a large set of real users in production
  • "memory allocated and then not released" is a very precise short description of a problem, but no user has provided a step-by-step procedure that someone else can copy/paste and reproduce. Some of the generic descriptions of memory usage spiralling out of control are invalid (for example, an infinite for loop in bash creating ingress objects at the speed of a multi-core server-class CPU)
  • there is a shortage of developers, so a completed triage should result in a usable definition of the problem and a reproducible sequence of steps that anyone can use to recreate it on a kind/minikube cluster. If the triage results in a relatively clear action item, developers can set priority accordingly. It seems unfair to have anyone repeat tasks for gathering the data needed to reproduce a problem.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 19, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 19, 2022
@Jojoooo1

Jojoooo1 commented Aug 8, 2022

Having exactly the same issue with a very similar config.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
