nginx pods crash with OOM | Prometheus metric collector memory leak #10141
/triage accepted Yes, there are open issues on performance, but not much progress is happening for lack of actionable info. We do have a k6 test option in CI, but that just runs k6 against a vanilla kind-cluster workload. If the tiny, difficult-to-find details are made available as actionable info, I think some progress is possible. For one thing, replicating the environment, not so much for load but more so for config, is the challenge.
Ah! I see. I'm not sure of the exact root cause yet. However, I have a feeling that the exact circumstances under which this happens may not be necessary to reproduce the leak. I'm considering placing an artificial sleep in the handler on the metrics path and having the client close the connection before the server responds, to see if that reproduces it. I haven't gotten my hands dirty with the code yet; I will report back if I have any findings.
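If it helps anyone attempting the same repro, here is a minimal, self-contained sketch of that idea in Go. Everything in it (port, sleep duration, client timeout, loop count) is made up for illustration and is not the controller's actual code:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hypothetical stand-in for the controller's /metrics handler: an
	// artificial sleep simulates slow metric collection under load.
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(5 * time.Second)
		fmt.Fprintln(w, "# metrics would go here")
	})
	go http.ListenAndServe(":8080", nil)
	time.Sleep(100 * time.Millisecond) // give the server a moment to start

	// Clients that give up before the server responds, mimicking a
	// Prometheus scrape timeout shorter than the handler latency.
	for i := 0; i < 100; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
		req, _ := http.NewRequestWithContext(ctx, http.MethodGet, "http://localhost:8080/metrics", nil)
		if resp, err := http.DefaultClient.Do(req); err == nil {
			resp.Body.Close()
		}
		cancel()
	}

	// After the loop, a goroutine dump should show whether the abandoned
	// handlers pile up (e.g. blocked on a shared lock) or get released.
}
```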
Sorry, I hadn't been able to spend time on this again until today.
A significant fraction of these are attributed to a lock that was probably not released effectively: about 13,678 out of 13,780 goroutines are waiting on the lock.
This should help with the repro and with the root cause. I'll keep this thread updated.
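For anyone who wants to check the same thing in their own environment, the grouped goroutine dump that surfaces this pattern can be produced with the standard runtime/pprof package. A minimal sketch, assuming you can run it in-process (the controller exposes the equivalent data over its pprof HTTP endpoint when profiling is enabled):

```go
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// Dump all goroutine stacks, grouped by identical stack trace.
	// Goroutines blocked on a mutex show up under
	// sync.runtime_SemacquireMutex, which is how thousands of waiters
	// on a single lock stand out in the profile.
	pprof.Lookup("goroutine").WriteTo(os.Stdout, 1)
}
```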
I've confirmed that removing the code corresponding to the summary metric makes the leak go away. However, it is very odd that this is not replicable at low throughput. I'll put some thought into why this might be so.
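For context on why the summary metric specifically might matter: in client_golang, Summary.Observe is guarded by a mutex (quantile estimation requires it), while Histogram.Observe is a largely atomic operation, so histograms tolerate write contention much better. A hedged sketch of the difference; the metric names below are illustrative, not the controller's actual metrics:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

var (
	// Summary: each Observe acquires an internal mutex to feed the
	// quantile streams, which contends under very high request rates.
	reqLatencySummary = prometheus.NewSummary(prometheus.SummaryOpts{
		Name:       "request_duration_seconds_summary", // illustrative name
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	})

	// Histogram: Observe is a per-bucket counter increment implemented
	// with atomics, so it scales far better under contention.
	reqLatencyHistogram = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "request_duration_seconds", // illustrative name
		Buckets: prometheus.DefBuckets,
	})
)

func observe(seconds float64) {
	reqLatencySummary.Observe(seconds)   // mutex-guarded hot path
	reqLatencyHistogram.Observe(seconds) // atomic hot path
}
```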
@odinsy, @domcyrus, @longwuyuan for 👀
Is this change coming anytime soon? Would it work if we just switch off collection of this metric?
Any update on this issue?
What happened:
We run the nginx ingress controller in AWS EKS, under very high load (~250M requests per minute).
Under stress, responses from the metrics handler are delayed: a simple curl shows the endpoint taking about 13s to respond.
We notice that past an inflection point, the overall memory of the process starts to increase and keeps climbing until it hits OOM and the pod crashes.
This is consistently reproducible under load.
The heap profile clearly reflects the same (screenshots omitted):
- Memory leak (48GB is the max memory per pod)
- Open FDs
- Throughput (unreliable because the metrics endpoint is latent)
I strongly second the issue reported in #9738. However, the mitigation of excluding metrics is not feasible for us: we have already excluded everything we can, and the workaround provided in #9770 is likewise not an option. Any further reduction would mean we run blind on what is happening inside the controller.
What you expected to happen:
The metrics endpoint shouldn't be this latent, there should be configurable timeouts on it, and the goroutine/memory leak should be fixed.
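On the configurable-timeout point, a minimal sketch of what that could look like, wrapping promhttp.Handler in the standard library's http.TimeoutHandler so a slow scrape is cut off instead of accumulating blocked goroutines. The wiring and values are illustrative, not how ingress-nginx is actually structured:

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	mux := http.NewServeMux()
	// TimeoutHandler replies 503 if the wrapped handler hasn't finished
	// within the deadline, so stuck scrapes are bounded rather than leaking.
	mux.Handle("/metrics", http.TimeoutHandler(
		promhttp.Handler(), 10*time.Second, "metrics collection timed out"))

	srv := &http.Server{
		Addr:         ":10254", // ingress-nginx's default metrics port
		Handler:      mux,
		ReadTimeout:  5 * time.Second,
		WriteTimeout: 15 * time.Second, // must exceed the TimeoutHandler deadline
	}
	srv.ListenAndServe()
}
```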