
Ingress controller keeps increasing memory when a new backend reload action is triggered #8362

Closed
pdefreitas opened this issue Mar 21, 2022 · 9 comments
Labels
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-priority
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@pdefreitas

pdefreitas commented Mar 21, 2022

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): 1.1.2
Kubernetes version (use kubectl version): 1.21.9, 1.22.6

Environment:

  • Cloud provider or hardware configuration: Azure Kubernetes Service (AKS)

  • OS (e.g. from /etc/os-release): Ubuntu 18.04.6 LTS (Bionic Beaver)

  • Kernel (e.g. uname -a): Linux 5.4.0-1070-azure #73~18.04.1-Ubuntu SMP Wed Feb 9 15:36:45 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools: Azure managed

  • Basic cluster related info: Versions mentioned above + cluster autoscaler.

  • How was the ingress-nginx-controller installed:

    • If helm was used then please show output of helm ls -A | grep -i ingress
helm ls -A | grep -i ingress
nginx-ingress-z    x    26    2022-03-07 00:00:00.000000000 +0000 UTC    deployed    ingress-nginx-4.0.18    1.1.2
nginx-ingress-y    y    7     2022-03-17 00:00:00.000000000 +0000 UTC    deployed    ingress-nginx-4.0.18    1.1.2
nginx-ingress-x    x    26    2022-03-07 00:00:00.000000000 +0000 UTC    deployed    ingress-nginx-4.0.18    1.1.2
  • If helm was used then please show output of helm -n <ingresscontrollernamespace> get values <helmreleasename>

nginx-ingress-x

USER-SUPPLIED VALUES:
controller:
  admissionWebhooks:
    timeoutSeconds: 30
  config:
    enable-modsecurity: true
    hsts: true
    proxy-body-size: 50m
    ssl-protocols: TLSv1.2 TLSv1.3
    ssl-session-cache: false
  electionID: nginx-custom-x
  ingressClass: nginx-custom-x
  ingressClassByName: true
  ingressClassResource:
    controllerValue: k8s.io/nginx-custom-x
    name: nginx-custom-x
  metrics:
    enabled: true
    service:
      annotations:
        prometheus.io/port: "10254"
        prometheus.io/scrape: "true"
  podAnnotations:
    prometheus.io/port: "10254"
    prometheus.io/scrape: "true"
  publishService:
    enabled: true
  rbac:
    create: true
  resources:
    limits:
      memory: 1200Mi
    requests:
      cpu: 100m
      memory: 1000Mi
  scope:
    enabled: true
  service:
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-resource-group: xxx
    externalTrafficPolicy: Local
    loadBalancerIP: x.x.x.x
  startupProbe:
    failureThreshold: 5
    httpGet:
      path: /healthz
      port: 10254
      scheme: HTTP
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 2

nginx-ingress-z

USER-SUPPLIED VALUES:
controller:
  admissionWebhooks:
    timeoutSeconds: 30
  config:
    enable-modsecurity: true
    enable-real-ip: "true"
    hsts: true
    proxy-body-size: 50m
    ssl-protocols: TLSv1.2 TLSv1.3
    ssl-session-cache: false
    use-proxy-protocol: "true"
  electionID: nginx-custom-z
  ingressClass: nginx-custom-z
  ingressClassByName: true
  ingressClassResource:
    controllerValue: k8s.io/nginx-custom-z
    name: nginx-custom-z
  metrics:
    enabled: true
    service:
      annotations:
        prometheus.io/port: "10254"
        prometheus.io/scrape: "true"
  podAnnotations:
    prometheus.io/port: "10254"
    prometheus.io/scrape: "true"
  publishService:
    enabled: true
  rbac:
    create: true
  resources:
    limits:
      memory: 800Mi
    requests:
      cpu: 100m
      memory: 500Mi
  scope:
    enabled: true
  service:
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-internal: true
      service.beta.kubernetes.io/azure-load-balancer-resource-group: xxx
    loadBalancerIP: x.x.x.x
  startupProbe:
    failureThreshold: 5
    httpGet:
      path: /healthz
      port: 10254
      scheme: HTTP
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 2
  • If you have more than one instance of the ingress-nginx-controller installed in the same cluster, please provide details for all the instances

    • Ingress controller nginx-ingress-y in namespace y is not leaking memory.
  • Current State of the controller:

    • All ingress controllers work properly until they get killed. nginx-ingress-y, which runs alone in its own namespace, has no issues and has a configuration similar to nginx-ingress-x. nginx-ingress-z eventually runs out of memory (less frequently, because it has fewer ingress rules). nginx-ingress-x is the most problematic.
  • Current state of ingress object, if applicable:

What happened:

Ingress controllers nginx-ingress-z and nginx-ingress-x are leaking memory over time. We noticed that memory usage increases whenever backend reload operations happen.

What you expected to happen:

I would expect memory to remain roughly constant between backend reloads (i.e., memory being released after each reload). Issues #8166, #8336 and #8357 exhibit similar behavior in a similar setup.
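For reference, a minimal way to watch the growth between reloads; the pod label selector and namespace are assumptions based on a standard Helm install with metrics enabled:

# Sample controller memory usage every 30s (label selector and namespace are illustrative)
watch -n 30 "kubectl top pod -n x -l app.kubernetes.io/instance=nginx-ingress-x"

# With the Prometheus metrics enabled above, the same trend can be correlated against reloads using
# nginx_ingress_controller_nginx_process_resident_memory_bytes (RSS) and
# nginx_ingress_controller_success (reload counter).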

How to reproduce it:

  • Install two ingress controllers in the same namespace with the user-supplied values from above.
  • Add multiple ingress rules to each ingress controller.
    • nginx-ingress-x has ~10 ingress resources with ModSecurity + OWASP ModSecurity Core Rule Set.
    • nginx-ingress-z has ~7 ingress resources.
  • Force the backend to reload; memory increases on each reload, eventually causing an OOM kill (see the sketch after this list).
  • Pods then get stuck in CrashLoopBackOff due to Fix for buggy ingress sync with retries #8325 and Fix buggy retry logic in syncIngress() #7086. You end up having to scale the deployment down to zero and back up again to launch a new pod.
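A rough reproduction sketch, assuming the Helm releases above; the ingress name, namespace, label selector, and deployment name are hypothetical:

# Repeatedly change an ingress annotation to trigger backend reloads, then watch memory climb.
for i in $(seq 1 50); do
  kubectl -n x annotate ingress demo-app \
    nginx.ingress.kubernetes.io/proxy-body-size="$((40 + i % 2))m" --overwrite
  sleep 30
  kubectl top pod -n x -l app.kubernetes.io/instance=nginx-ingress-x
done

# Once a pod is OOM-killed and stuck crash-looping, the only workaround we found was
# scaling the deployment down to zero and back up (deployment name is hypothetical):
kubectl -n x scale deployment nginx-ingress-x-controller --replicas=0
kubectl -n x scale deployment nginx-ingress-x-controller --replicas=1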

Anything else we need to know:

Ingress rules on nginx-ingress-x have ModSecurity + OWASP ModSecurity Core Rule Set annotations. nginx-ingress-z handles internal traffic (virtual network level) and uses the proxy protocol. This setup worked fine, with no memory growth, prior to 0.48.x. We had to upgrade to 1.x.x due to a Kubernetes upgrade and security patches. The same issue happens with Prometheus metrics disabled (we enabled them only for troubleshooting).
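For illustration, the ModSecurity-related annotations on the nginx-ingress-x ingresses look roughly like this; the ingress and namespace names are hypothetical, the annotations themselves are the standard ingress-nginx ones:

# Enable ModSecurity with the OWASP Core Rule Set on an ingress (names are hypothetical)
kubectl -n x annotate ingress demo-app \
  nginx.ingress.kubernetes.io/enable-modsecurity="true" \
  nginx.ingress.kubernetes.io/enable-owasp-core-rules="true" --overwrite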

@pdefreitas pdefreitas added the kind/bug Categorizes issue or PR as related to a bug. label Mar 21, 2022
@k8s-ci-robot
Contributor

@pdefreitas: This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Mar 21, 2022
@longwuyuan
Contributor

/remove-kind bug
/kind feature

Install each instance of the ingress-nginx controller in its own namespace. This is documented.
The issues you have listed are not the same problem when compared across all relevant aspects.
When higher-priority issues are resolved, the developers will get time to work on namespace-related functionality. For now, install each instance of the ingress-nginx controller in its own dedicated namespace and do not install another instance of the controller in the same namespace.
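For illustration, a dedicated-namespace install could look roughly like this; release names, namespaces, and values files are illustrative:

# One release per dedicated namespace (names and values files are illustrative)
helm upgrade --install nginx-ingress-x ingress-nginx/ingress-nginx \
  -n ingress-x --create-namespace -f values-x.yaml
helm upgrade --install nginx-ingress-z ingress-nginx/ingress-nginx \
  -n ingress-z --create-namespace -f values-z.yaml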

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Mar 21, 2022
@pdefreitas
Author

@longwuyuan thanks for the prompt reply, but there are multiple problems to address:

@longwuyuan
Contributor

From my limited visibility, I can state that:

  • multiple distinct problems are likely being experienced by one user, but not by a large set of real users in production
  • "memory allocated and then not released" is a very precise short description of a problem, but no user has provided a step-by-step procedure that someone else can copy/paste and reproduce. Some of the generic descriptions of memory usage spiralling out of control are invalid (for example, an infinite for loop in bash creating ingress objects at the speed of a multi-core server-class CPU)
  • there is a shortage of developers, so a completed triage should result in a usable definition of the problem and a reproducible sequence of steps that anyone can use to recreate it on a kind/minikube cluster. If the triage results in a relatively clear action item, developers can set priority accordingly. It seems unfair to have anyone repeat tasks for gathering the data needed to reproduce a problem.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 19, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 19, 2022
@Jojoooo1

Jojoooo1 commented Aug 8, 2022

Having exactly the same issue with a very similar config.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
