prometheus memory leak #4372

Closed
Fadih opened this Issue Jul 12, 2018 · 18 comments

Fadih commented Jul 12, 2018

I get a memory leak after running my Prometheus for about 5 hours.
As I noticed, deleting all the files under /wal/* resolves the issue for a while.

Environment: production

  • System information: Linux 4.9.81-35.56.amzn1.x86_64 x86_64

  • Prometheus version: 2.3.0

  • Alertmanager version: 0.14.0

  • Prometheus configuration file:

global:
  scrape_interval: 30s # By default, scrape targets every 15 seconds.

rule_files:
  - /etc/prometheus/itrs-alerts.yml
  - /etc/prometheus/itms-alerts.yml
  - /etc/prometheus/common-alerts.yml
  - /etc/prometheus/recording-rules.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'ecs-cluster'
    metrics_path: /ecs-exporter-metrics
    scheme: http
    static_configs:
      - targets: ['$EXPORTERS_DNS']

  - job_name: 'prometheus-ec2-instance-devtools'
    metrics_path: /metrics
    scheme: http
    ec2_sd_configs:
      - region: $REGION
        access_key: $AWS_ACCESS_KEY_ID
        secret_key: $AWS_SECRET_ACCESS_KEY
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        regex: devtools-prometheus-monitoring
        action: keep

  - job_name: 'ecs-instances-tagger-devtools'
    metrics_path: /metrics
    scheme: http
    ec2_sd_configs:
      - region: $REGION
        access_key: $AWS_ACCESS_KEY_ID
        secret_key: $AWS_SECRET_ACCESS_KEY
        port: 1234
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        regex: devtools-prometheus-monitoring
        action: keep

  - job_name: 'instrumentedTest-server-node'
    metrics_path: /metrics
    scheme: http
    ec2_sd_configs:
      - region: $REGION
        access_key: $AWS_ACCESS_KEY_ID
        secret_key: $AWS_SECRET_ACCESS_KEY
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        regex: espresso-server
        action: keep
      - source_labels: [__meta_ec2_tag_project]
        regex: .*
        action: keep
      - source_labels: [__meta_ec2_tag_project]
        target_label: project
      - source_labels: [__meta_ec2_tag_Name]
        target_label: name
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance_id

maurorappa commented Jul 12, 2018

A bit generic, eh?
You know you can also plot Prometheus's own Go memory stats using 'localhost:9090/metrics'?
Scrape it and plot 'go_memstats_alloc_bytes', for example.
You will also need pprof data to claim it's a memory leak; you can periodically collect it using
'localhost:9090/debug/pprof/heap?debug=1'

Leaving this to the more expert people now :)
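
For reference, a minimal sketch of that periodic collection (assuming a default Prometheus listening on localhost:9090; the output directory and 10-minute interval are arbitrary choices, not from this thread):

# Collect a heap profile every 10 minutes so growth can be compared over time.
mkdir -p heap_profiles
while true; do
  curl -s 'http://localhost:9090/debug/pprof/heap?debug=1' \
    > "heap_profiles/heap_$(date +%Y%m%d_%H%M%S).txt"
  sleep 600
done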

krasi-georgiev commented Jul 12, 2018

@maurorappa thanks!

@Fadih yeah, we need more info here. Explaining why you think there is a memory leak would be a good start.

The 2.3.2 release #4370 will include a debug command for promtool, so you can use it to attach the debug info.

tonobo commented Jul 16, 2018

@maurorappa We're seeing the same behavior. We've discovered the query that triggers it.

count(abc_connections{instance=~"^host-.*"}) / count(connections{instance=~"^host-.*"}) * 100

The requested memory stats are below:

$ curl localhost:9090/metrics -s | grep go_memstats_alloc_bytes
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 7.0115873912e+10
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 3.11637139248e+11

heap_debug.txt

krasi-georgiev commented Jul 16, 2018

When you say a memory leak, does that mean the memory usage starts growing to the point where Prometheus gets OOM killed?
If that is the case, can you all please run the promtool debug tool to gather all the info we need to troubleshoot this?

Also, gathering some snapshots from this Grafana dashboard would be useful:
https://grafana.com/dashboards/6725
This is for Prometheus version 2.3.

krasi-georgiev commented Jul 16, 2018

The promtool debug command is included in the latest 2.3.2 release.
https://github.com/prometheus/prometheus/releases
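
For completeness, usage looks roughly like this (assuming a promtool build that includes the debug subcommand and a Prometheus reachable on localhost:9090):

# Collect metrics, pprof profiles and runtime info from a running Prometheus
promtool debug all http://localhost:9090/

# Or collect only the pprof profiles, or only the metrics
promtool debug pprof http://localhost:9090/
promtool debug metrics http://localhost:9090/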

tonobo commented Jul 16, 2018

I'm not sure if it's a memory problem. The memory grows rapidly, but Prometheus is completely out of CPU. Please see the attached memory graph. This seems to be quite normal memory usage, but there are a few gaps where Prometheus wasn't able to scrape its own metrics.

[attached screenshots of the resource usage graphs]

krasi-georgiev commented Jul 16, 2018

@tonobo I am thinking chances are your issue is not related to the original report, so could you please open a new one?

Please be as specific as possible with steps and config to replicate, and attach the file produced by the promtool debug tool and the Grafana dashboard at the time of the issue. It is important to run the debug tool during the unexpectedly high CPU/memory usage so that the profile is captured with the info we need.

Everyone here wants to help, but we can't determine whether the problem is with your config, the local setup, or an actual bug if we don't have enough information.

tonobo commented Jul 17, 2018

I'll open a new issue.

I'm unable to find the debug option; the promtool from the latest release doesn't support it.

krasi-georgiev commented Jul 18, 2018

Yes, sorry, because of the moratorium this didn't go into the last release. I just merged it, so you can build promtool from source or use the binary attached here that I have just built for Linux.
promtool.zip

tonobo commented Jul 19, 2018

Great! Thank you. Please see the attached debug output.
debug.tar.gz

krasi-georgiev commented Jul 23, 2018

Thanks.
I am working on other fronts right now, so if anyone else has time to have a look it would be appreciated.

simonpasquier commented Jul 23, 2018

@tonobo I suspect that you've got too many time series for Prometheus to handle. Your report says that prometheus_tsdb_head_chunks is 6.5685078e+07 (more than 60,000,000).

@Fadih if you still have the issue, please attach the output of the promtool debug tool from #4372 (comment).
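
A quick way to check this on your own server (assuming the default localhost:9090 address; adjust to your setup):

# Number of chunks and series currently held in the TSDB head (i.e. in memory)
curl -s http://localhost:9090/metrics | grep -E '^prometheus_tsdb_head_(chunks|series) '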

tonobo commented Jul 23, 2018

@simonpasquier This sounds legit. In my opinion, though, this worked quite a bit better before upgrading Prometheus?

simonpasquier commented Jul 24, 2018

@tonobo you're hitting the limits of Prometheus, so maybe your server was ingesting slightly fewer time series before, or your query load has increased. But in any case, this doesn't look like a memory leak.
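
As a rough sketch of how to compare ingestion load over time (again assuming the default localhost:9090 address), the rate of appended samples can be queried over the HTTP API:

# Samples ingested per second, averaged over the last 5 minutes
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(prometheus_tsdb_head_samples_appended_total[5m])'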

tonobo commented Jul 24, 2018

OK, thank you for the clarification.

Fadih commented Jul 24, 2018

Hi,

I found what caused the memory leak.
I have a Java process that is integrated with Prometheus; this process was sending a lot of data to Prometheus every 30 seconds, each response around 2 MB. I have about 15 Java processes like this that started running about 2 months ago.
I just restarted my Java processes, and the received data is now only about 5 KB in size.
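
For anyone hitting something similar, a quick way to spot oversized scrape responses is to check the payload of a target's metrics endpoint directly (the host and port below are placeholders, not from this thread):

# Size in bytes of one scrape response from a single target
curl -s http://my-java-app:8080/metrics | wc -c

# Number of exposed metric lines (excluding HELP/TYPE comments)
curl -s http://my-java-app:8080/metrics | grep -vc '^#'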

simonpasquier commented Jul 24, 2018

@Fadih thanks for the follow-up. I'm closing the issue then.

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
