nginx pods crash with OOM | Prometheus metric collector memory leak #10141

Open
akhilesh-godi opened this issue Jun 27, 2023 · 7 comments
Labels: kind/bug, priority/important-longterm, triage/accepted

@akhilesh-godi

What happened:
We run the nginx ingress controller on AWS EKS, under very high load (~250M requests per minute).

When under stress, the metrics handler responses are delayed. A simple curl shows that it takes about 13s for the endpoint to respond.
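For reference, that response time can be measured with curl's built-in timer (the in-pod address and port 10254 below are assumptions based on the default controller metrics setup; adjust for your environment):

```
curl -s -o /dev/null -w 'time_total: %{time_total}s\n' http://127.0.0.1:10254/metrics
```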

We notice that after an inflection point, the overall memory of the process starts to increase and keeps increasing until it hits OOM and crashes.

This is consistently reproducible under load.

The heap profile clearly reflects the same:
[heap profile screenshots]

Memory leak (48 GB is the max memory per pod):
[memory usage graph]

Open FDs:
[open file descriptors graph]

Throughput (unreliable because the metrics endpoint is latent):
[throughput graph]

I strongly second the issue reported in #9738. However, the mitigation of excluding metrics (#9770) is not feasible for us: we have already excluded everything we can, and any further reduction would leave us blind to what is happening inside the controller.

What you expected to happen:
The metrics endpoint shouldn't be latent. There should be configurable timeouts. The goroutine/memory leak should be fixed.
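To illustrate the kind of server-side bound being asked for, here is a minimal sketch using net/http's TimeoutHandler around a standard Prometheus handler. This is not the controller's current code, and a wrapper like this only caps the latency a scraper sees; the inner handler still runs to completion:

```go
// Sketch only: bound how long a scrape can hang, independent of the controller.
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	mux := http.NewServeMux()
	mux.Handle("/metrics", http.TimeoutHandler(
		promhttp.Handler(),          // standard Prometheus scrape handler
		10*time.Second,              // respond with 503 instead of hanging the scraper
		"metrics handler timed out", // body sent with the 503
	))
	http.ListenAndServe(":10254", mux)
}
```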

@akhilesh-godi added the kind/bug label Jun 27, 2023
@k8s-ci-robot added the needs-triage and needs-priority labels Jun 27, 2023
@longwuyuan
Contributor

/triage accepted
/priority important-longterm

Yes, there are open issues on performance, but not much progress is happening for lack of actionable info. We do have a k6 test option in CI, but that just runs k6 against a vanilla kind-cluster workload.

If the small, difficult-to-find details are available as actionable info, I think some progress is possible. For one thing, replicating the environment, not so much for load but more so for config, is the challenge.

@k8s-ci-robot added the triage/accepted and priority/important-longterm labels and removed the needs-triage and needs-priority labels Jun 28, 2023
@akhilesh-godi
Author

Ah! I see.

I'm not sure of the exact root cause yet. However, I have a feeling that the circumstances under which this happens might not be necessary to reproduce the leak.

I'm considering placing an artificial sleep in the handler on the metrics path, and having the client close the connection before the server responds to see if that helps. I haven't gotten my hands dirty with the code yet. I will report back in case I have any findings.
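A minimal sketch of that experiment, independent of the controller itself (the port, path, and timings are placeholders; the point is only to check whether handler goroutines survive clients that hang up early):

```go
// Hypothetical repro sketch, not controller code: a slow /metrics-style handler
// plus clients whose timeout makes them disconnect before the server answers.
package main

import (
	"fmt"
	"net/http"
	"runtime"
	"time"
)

func main() {
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(15 * time.Second) // artificial sleep standing in for a latent scrape
		fmt.Fprintln(w, "# stub metrics")
	})
	go http.ListenAndServe("127.0.0.1:10254", nil)
	time.Sleep(time.Second) // give the listener a moment to come up

	client := &http.Client{Timeout: 2 * time.Second} // client gives up first
	for i := 0; i < 50; i++ {
		go func() {
			if resp, err := client.Get("http://127.0.0.1:10254/metrics"); err == nil {
				resp.Body.Close()
			}
		}()
	}

	// If handlers (or whatever they block on) leak, this count stays elevated
	// long after every client has gone away.
	time.Sleep(30 * time.Second)
	fmt.Println("goroutines still alive:", runtime.NumGoroutine())
}
```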

@akhilesh-godi
Author

akhilesh-godi commented Jul 6, 2023

Sorry, I hadn't been able to spend time on this again until today.
I took a goroutine dump after having noticed the goroutine leak.
Here are further findings:

```
ip-10-108-95-131:/etc/nginx$ cat goroutine_4.out | grep 'semacquire' | wc -l
13678
ip-10-108-95-131:/etc/nginx$ cat goroutine_4.out | grep 'goroutine ' | wc -l
13780
```

A significant number of these are attributed to a lock that is probably not being released in time. About 13678 of the 13780 goroutines are waiting on the lock:

```
goroutine 64208 [semacquire]:
sync.runtime_SemacquireMutex(0x193c180?, 0x0?, 0x1b80c81?)
        runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc0010b0010)
        sync/mutex.go:171 +0x165
sync.(*Mutex).Lock(...)
        sync/mutex.go:90
github.com/prometheus/client_golang/prometheus.(*summary).Observe(0xc0010b0000, 0x3f789374bc6a7efa)
        github.com/prometheus/client_golang@v1.14.0/prometheus/summary.go:285 +0x6e
k8s.io/ingress-nginx/internal/ingress/metric/collectors.(*SocketCollector).handleMessage(0xc000135780, {0xc62f4f8000, 0x1729d6, 0x1ac000})
        k8s.io/ingress-nginx/internal/ingress/metric/collectors/socket.go:336 +0x10d9
k8s.io/ingress-nginx/internal/ingress/metric/collectors.handleMessages({0x7f9a986b8d28?, 0xc0fe3cac40}, 0xc2aea626b0)
        k8s.io/ingress-nginx/internal/ingress/metric/collectors/socket.go:529 +0xb7
created by k8s.io/ingress-nginx/internal/ingress/metric/collectors.(*SocketCollector).Start
        k8s.io/ingress-nginx/internal/ingress/metric/collectors/socket.go:402 +0xed
```

This should help with the repro and with the root cause. I'll keep this thread updated.
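A minimal sketch of that contention pattern, outside the controller entirely (the metric name, spawn rate, and timings below are made up; the structure loosely mirrors the dump: per-message goroutines calling Observe on one summary with quantile objectives while scrapes gather the same registry):

```go
// Hypothetical sketch, not controller code: goroutines observing into a single
// summary while a "scrape" loop gathers the registry. Both paths go through the
// summary's internal mutex, so if new observers arrive faster than the mutex
// frees up, the live-goroutine count can climb the way the dump shows.
package main

import (
	"fmt"
	"runtime"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()
	sum := prometheus.NewSummary(prometheus.SummaryOpts{
		Name:       "demo_upstream_latency_seconds",
		Help:       "stand-in for the deprecated upstream latency summary",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	})
	reg.MustRegister(sum)

	// Spawn roughly one goroutine per "message", loosely mirroring how the
	// socket collector fans work out to handleMessage.
	go func() {
		for {
			go func() { sum.Observe(0.123) }()
			time.Sleep(5 * time.Microsecond)
		}
	}()

	// Simulate scrapes, which also serialize on the summary while it flushes.
	go func() {
		for {
			reg.Gather()
		}
	}()

	for {
		time.Sleep(time.Second)
		fmt.Println("live goroutines:", runtime.NumGoroutine())
	}
}
```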

@akhilesh-godi
Author

I've confirmed that removing the code for the summary metric ingress_upstream_latency_seconds, along with the references to the upstreamLatency variable, fixes the leak. Since this metric is already deprecated, it should be removed soon to fix this.

However, it is very odd that this is not reproducible at low throughput. I'll put some thought into why that might be.
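Until the deprecated summary is removed upstream, one possible stopgap, assuming a controller build that includes the socket-metric exclusion flag referenced in #9770, is to exclude just this metric (the flag and the exact exported metric name below are assumptions to verify against your version):

```
# assumed flag and metric name; verify against your ingress-nginx version
--exclude-socket-metrics=nginx_ingress_controller_ingress_upstream_latency_seconds
```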

@akhilesh-godi
Author

akhilesh-godi commented Jul 6, 2023

@odinsy, @domcyrus, @longwuyuan for 👀

@ranediwa

ranediwa commented Mar 5, 2024

Is this change coming anytime soon? Will it work if we just switch off collection of this metric?

@exherb

exherb commented Apr 11, 2024

any update on this issue?
