nginx pods crash with OOM | Prometheus metric collector memory leak #10141

Open
akhilesh-godi opened this issue Jun 27, 2023 · 7 comments
Labels: kind/bug, priority/important-longterm, triage/accepted

@akhilesh-godi

What happened:
We run the nginx ingress controller on AWS EKS, under very high load (~250M requests per minute).

When under stress, the metrics handler responses are delayed. A simple curl shows that it takes about 13s for the endpoint to respond.
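For reference, that response time can be measured with curl's built-in timer (the in-pod address and port 10254 below are assumptions based on the default controller metrics setup; adjust for your environment):

```
curl -s -o /dev/null -w 'time_total: %{time_total}s\n' http://127.0.0.1:10254/metrics
```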

We notice that after an inflection point, the overall memory of the process starts to increase and keeps increasing until it hits OOM and crashes.

This is consistently reproducible under load.

The heap profile clearly reflects the same:
[heap profile screenshots]

Memory leak (48 GB is the max memory per pod):
[memory usage graph]

Open FDs:
[open file descriptors graph]

Throughput (unreliable because the metrics endpoint is latent):
[throughput graph]

I strongly second the issue reported in #9738. However, the mitigation of excluding metrics (#9770) is not feasible for us: we have already excluded everything we can, and any further reduction would leave us blind to what is happening inside the controller.

What you expected to happen:
The metrics endpoint shouldn't be latent. There should be configurable timeouts. The goroutine/memory leak should be fixed.
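To illustrate the kind of server-side bound being asked for, here is a minimal sketch using net/http's TimeoutHandler around a standard Prometheus handler. This is not the controller's current code, and a wrapper like this only caps the latency a scraper sees; the inner handler still runs to completion:

```go
// Sketch only: bound how long a scrape can hang, independent of the controller.
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	mux := http.NewServeMux()
	mux.Handle("/metrics", http.TimeoutHandler(
		promhttp.Handler(),          // standard Prometheus scrape handler
		10*time.Second,              // respond with 503 instead of hanging the scraper
		"metrics handler timed out", // body sent with the 503
	))
	http.ListenAndServe(":10254", mux)
}
```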

@akhilesh-godi added the kind/bug label Jun 27, 2023
@k8s-ci-robot added the needs-triage and needs-priority labels Jun 27, 2023
@longwuyuan
Contributor

/triage accepted
/priority important-longterm

Yes, there are open issues on performance, but not much progress is happening for lack of actionable info. We do have a k6 test option in CI, but that just runs k6 against a vanilla kind-cluster workload.

If the small, difficult-to-find details are available as actionable info, I think some progress is possible. For one thing, replicating the environment, not so much for load but more so for config, is the challenge.

@k8s-ci-robot added the triage/accepted and priority/important-longterm labels and removed the needs-triage and needs-priority labels Jun 28, 2023
@akhilesh-godi
Author

Ah! I see.

I'm not sure of the exact root cause yet. However, I have a feeling that the circumstances under which this happens might not be necessary to reproduce the leak.

I'm considering placing an artificial sleep in the handler on the metrics path, and having the client close the connection before the server responds to see if that helps. I haven't gotten my hands dirty with the code yet. I will report back in case I have any findings.
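A minimal sketch of that experiment, independent of the controller itself (the port, path, and timings are placeholders; the point is only to check whether handler goroutines survive clients that hang up early):

```go
// Hypothetical repro sketch, not controller code: a slow /metrics-style handler
// plus clients whose timeout makes them disconnect before the server answers.
package main

import (
	"fmt"
	"net/http"
	"runtime"
	"time"
)

func main() {
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(15 * time.Second) // artificial sleep standing in for a latent scrape
		fmt.Fprintln(w, "# stub metrics")
	})
	go http.ListenAndServe("127.0.0.1:10254", nil)
	time.Sleep(time.Second) // give the listener a moment to come up

	client := &http.Client{Timeout: 2 * time.Second} // client gives up first
	for i := 0; i < 50; i++ {
		go func() {
			if resp, err := client.Get("http://127.0.0.1:10254/metrics"); err == nil {
				resp.Body.Close()
			}
		}()
	}

	// If handlers (or whatever they block on) leak, this count stays elevated
	// long after every client has gone away.
	time.Sleep(30 * time.Second)
	fmt.Println("goroutines still alive:", runtime.NumGoroutine())
}
```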

@akhilesh-godi
Author

akhilesh-godi commented Jul 6, 2023

Sorry, I hadn't been able to spend time on this again until today.
I took a goroutine dump after having noticed the goroutine leak.
Here are further findings:

```
ip-10-108-95-131:/etc/nginx$ cat goroutine_4.out | grep 'semacquire' | wc -l
13678
ip-10-108-95-131:/etc/nginx$ cat goroutine_4.out | grep 'goroutine ' | wc -l
13780
```

A significant number of these are attributed to a lock that is probably not being released in time. About 13678 of the 13780 goroutines are waiting on the lock:

```
goroutine 64208 [semacquire]:
sync.runtime_SemacquireMutex(0x193c180?, 0x0?, 0x1b80c81?)
        runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc0010b0010)
        sync/mutex.go:171 +0x165
sync.(*Mutex).Lock(...)
        sync/mutex.go:90
github.com/prometheus/client_golang/prometheus.(*summary).Observe(0xc0010b0000, 0x3f789374bc6a7efa)
        github.com/prometheus/client_golang@v1.14.0/prometheus/summary.go:285 +0x6e
k8s.io/ingress-nginx/internal/ingress/metric/collectors.(*SocketCollector).handleMessage(0xc000135780, {0xc62f4f8000, 0x1729d6, 0x1ac000})
        k8s.io/ingress-nginx/internal/ingress/metric/collectors/socket.go:336 +0x10d9
k8s.io/ingress-nginx/internal/ingress/metric/collectors.handleMessages({0x7f9a986b8d28?, 0xc0fe3cac40}, 0xc2aea626b0)
        k8s.io/ingress-nginx/internal/ingress/metric/collectors/socket.go:529 +0xb7
created by k8s.io/ingress-nginx/internal/ingress/metric/collectors.(*SocketCollector).Start
        k8s.io/ingress-nginx/internal/ingress/metric/collectors/socket.go:402 +0xed
```

This should help with the repro and with the root cause. I'll keep this thread updated.
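A minimal sketch of that contention pattern, outside the controller entirely (the metric name, spawn rate, and timings below are made up; the structure loosely mirrors the dump: per-message goroutines calling Observe on one summary with quantile objectives while scrapes gather the same registry):

```go
// Hypothetical sketch, not controller code: goroutines observing into a single
// summary while a "scrape" loop gathers the registry. Both paths go through the
// summary's internal mutex, so if new observers arrive faster than the mutex
// frees up, the live-goroutine count can climb the way the dump shows.
package main

import (
	"fmt"
	"runtime"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()
	sum := prometheus.NewSummary(prometheus.SummaryOpts{
		Name:       "demo_upstream_latency_seconds",
		Help:       "stand-in for the deprecated upstream latency summary",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	})
	reg.MustRegister(sum)

	// Spawn roughly one goroutine per "message", loosely mirroring how the
	// socket collector fans work out to handleMessage.
	go func() {
		for {
			go func() { sum.Observe(0.123) }()
			time.Sleep(5 * time.Microsecond)
		}
	}()

	// Simulate scrapes, which also serialize on the summary while it flushes.
	go func() {
		for {
			reg.Gather()
		}
	}()

	for {
		time.Sleep(time.Second)
		fmt.Println("live goroutines:", runtime.NumGoroutine())
	}
}
```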

@akhilesh-godi
Author

I've confirmed that removing the code for the summary metric ingress_upstream_latency_seconds, along with the references to the upstreamLatency variable, fixes the leak. Since this metric is already deprecated, it should be removed soon to fix this.

However, it is very odd that this is not reproducible at low throughput. I'll put some thought into why that might be.
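Until the deprecated summary is removed upstream, one possible stopgap, assuming a controller build that includes the socket-metric exclusion flag referenced in #9770, is to exclude just this metric (the flag and the exact exported metric name below are assumptions to verify against your version):

```
# assumed flag and metric name; verify against your ingress-nginx version
--exclude-socket-metrics=nginx_ingress_controller_ingress_upstream_latency_seconds
```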

@akhilesh-godi
Author

akhilesh-godi commented Jul 6, 2023

@odinsy, @domcyrus, @longwuyuan for 👀

@ranediwa

ranediwa commented Mar 5, 2024

Is this change coming anytime soon? Will it work if we just switch off collection of this metric?

@exherb

exherb commented Apr 11, 2024

any update on this issue?
