prometheus receiver: Memory Leak #31591
Comments
Hello @henrikrexed, is there a specific reason you believe this to be caused by the Prometheus receiver instead of any of the other components included in your configuration?
Pinging code owners for receiver/prometheus: @Aneurysm9 @dashpole. See Adding Labels via Comments if you do not have permissions to add labels yourself.
I have done tests with only logs, then logs and traces, and then added the Prometheus receiver. By the way, when using the target allocator, the memory consumption is much more stable.
Would you mind providing a simpler reproduction case? Ideally with just the prometheus receiver + otlp exporter, and a single scrape target.
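For illustration, a minimal reproduction config of the kind being asked for might look like the following sketch; the scrape target and backend endpoint are placeholders, not values from this thread:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: single-target
          scrape_interval: 15s
          static_configs:
            - targets: ["localhost:8888"]  # placeholder: e.g. the collector's own metrics

exporters:
  otlp:
    endpoint: backend:4317  # placeholder backend
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```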
Let me build a collector pipeline with only metrics (but I want to run it for 24 h to confirm the behavior).
Hi, so I have created a StatefulSet collector with 2 replicas that only collects Prometheus metrics and enriches the data.
@braydonk, who was looking into memory usage recently.
Hey @henrikrexed, I'm wondering if you might be able to gather some profiles from these collectors if possible. My initial guess for this steady increase is that the shape of the metrics is causing the scrape cache to grow continuously, but I can't easily verify this without more info, and I hope profiles will help. These are the steps to get the info:

```yaml
extensions:
  pprof:
    # ...

service:
  extensions: [pprof]  # along with whatever other extensions you have
```

This starts a pprof endpoint, by default on localhost:1777.
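Once the endpoint is up, a heap profile can be captured with, for example (assuming the default localhost:1777 endpoint):

```sh
curl -o heap_$(date +%s).pprof http://localhost:1777/debug/pprof/heap
```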
If you're running the test over a long period of time, collecting a profile hourly with a scheduled job of some kind would be good. With these profiles I can look at the differences over time to see what part of the heap is growing and verify there aren't any leaking goroutines.
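One way to automate that hourly collection is a Kubernetes CronJob along these lines. This is only a sketch: the service name, PVC, and reachability are assumptions, and note that the pprof extension binds to localhost by default, so its endpoint would need to be set to something like 0.0.0.0:1777 to be reachable from another pod:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: collector-heap-profile   # hypothetical name
spec:
  schedule: "0 * * * *"          # hourly
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: profiler
              image: curlimages/curl:latest
              command: ["/bin/sh", "-c"]
              # fetch a timestamped heap profile from the collector's pprof endpoint
              args:
                - curl -o /profiles/heap_$(date +%s).pprof http://otel-collector:1777/debug/pprof/heap
              volumeMounts:
                - name: profiles
                  mountPath: /profiles
          volumes:
            - name: profiles
              persistentVolumeClaim:
                claimName: profiles-pvc   # hypothetical PVC to persist profiles across runs
```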
As promised during KubeCon, I have run tests with a scrape config. Both tests had pprof enabled, and I have more than 12 pprof files. How can I share those?
I think it should be possible to attach the profile files directly in a GitHub comment. If not, perhaps you could email or Slack them to me, since I'm not sure other folks will care as much to see the profiles themselves.
Thanks @henrikrexed for sending the profiles over. I took a look last night. I can see where the memory is growing, but I don't understand why. The growing region of memory is the cache of metric identities in the cumulativetodelta processor (processor/cumulativetodeltaprocessor/internal/tracking/tracker.go, lines 95 to 117 at c07d1e6).
The metric cache keys are a hash of the metric identity (processor/cumulativetodeltaprocessor/internal/tracking/identity.go, lines 16 to 26 at c07d1e6).
If any values of the identity are constantly changing, new cache entries will keep being created and the cache will grow without bound; for example, a label whose value changes on every scrape would mint a new identity each interval. This is where I get lost though: I tried various things to reproduce this setup short of spinning up an actual k8s cluster with the identically configured apps, but I never got a scenario where the cumulativetodelta cache constantly grows in any of my setups. So at this point it's unclear to me whether this is a legitimate problem with how the receivers/processors are interacting in this scenario, or whether this is some really hard-to-spot configuration footgun. The biggest hunch I have is that the […]
Adding a recommendation here: I would recommend using the […]
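One mitigation worth trying for unbounded identity churn (an assumption on my part, not necessarily what was being recommended above) is the cumulativetodelta processor's max_staleness setting, which evicts tracked identities that have not been seen for the given duration and so bounds the cache even when identities churn:

```yaml
processors:
  cumulativetodelta:
    max_staleness: 10m  # drop tracked metric identities not seen for 10 minutes
```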
Component(s)
No response
What happened?
Description
After running several benchmarks to compare Fluent Bit and the OpenTelemetry Collector, I discovered a memory leak in the collector when using the Prometheus receiver (with a scrape config using kubernetes_sd_config).
Steps to Reproduce
Here is the repo containing all the assets for my tests:
https://github.com/isItObservable/fluentbit-vs-collector
Expected Result
The memory usage of the collector should remain stable under the same constant load.
Actual Result
When running a 24 h test with constant load, the collector consumes 10 GB of RAM and then crashes.
[benchmarks chart]
Collector version
v0.90.0
Environment information
Environment
GKE cluster with --machine-type=e2-standard-4 --num-nodes=2
OpenTelemetry Collector configuration
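For illustration only, a minimal configuration matching the setup described in this thread (Prometheus receiver with kubernetes_sd_configs, Kubernetes enrichment, the cumulativetodelta processor, and OTLP export) might look like the sketch below; all endpoints, relabel rules, and names are placeholders, not the reporter's actual config:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # placeholder rule: keep only pods annotated for scraping
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: "true"

processors:
  k8sattributes: {}        # enrich metrics with Kubernetes metadata
  cumulativetodelta: {}    # the processor whose identity cache was seen growing

exporters:
  otlp:
    endpoint: backend:4317  # placeholder backend
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [k8sattributes, cumulativetodelta]
      exporters: [otlp]
```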
Log output
No response
Additional context
No response