Leaking fds until rlimits reached #4498

Closed
crosbymichael opened this Issue Aug 14, 2018 · 15 comments

crosbymichael commented Aug 14, 2018

Bug Report

What did you do?

Ran Prometheus with Consul service discovery.

What did you expect to see?

Prometheus not becoming unresponsive and eating fds.

What did you see instead? Under which circumstances?

[screenshot]

Environment

  • System information:

    Linux 4.15.0-30-generic x86_64

  • Prometheus version:

    2.3.2

  • Prometheus configuration file:

global:
  scrape_interval:     30s # Scrape every 30 seconds. The default is every 1 minute.
  evaluation_interval: 30s # Evaluate rules every 30 seconds. The default is every 1 minute.
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'overwritten-default'
    consul_sd_configs:
    - server:   'yy:8500'
      services: []
      tag: 'metrics'

    relabel_configs:
    - source_labels: ['__meta_consul_service']
      regex:         '(.*)'
      target_label:  'job'
      replacement:   '$1'
    - source_labels: ['__meta_consul_address']
      regex:         '(.*)'
      target_label:  '__meta_consul_service_address'
      replacement:   '$1'
      action: 'replace'
    - source_labels: ['__meta_consul_service_id']
      regex:         '(.*)'
      target_label:  'instance'
      replacement:   '$1'
    - source_labels: ['__meta_consul_service']
      regex: 'containerd'
      target_label: '__metrics_path__'
      replacement: '/v1/metrics'
  • Logs:
Aug 14 05:30:59 aegis-03 boss[2388]: level=error ts=2018-08-14T09:30:59.703921368Z caller=compact.go:432 component=tsdb msg="removed tmp folder after failed compaction" err="open /var/lib/prometheus/01CMVWCBQQS6RTQMR7PVRSD1BR.tmp: too many open files"
Aug 14 05:30:59 aegis-03 boss[2388]: level=error ts=2018-08-14T09:30:59.703982733Z caller=db.go:272 component=tsdb msg="compaction failed" err="persist head block: open chunk writer: open /var/lib/prometheus/01CMVWCBQQS6RTQMR7PVRSD1BR.tmp/chunks: too many open files"
Aug 14 05:31:06 aegis-03 boss[2388]: level=error ts=2018-08-14T09:31:06.309900608Z caller=consul.go:460 component="discovery manager scrape" discovery=consul msg="Error refreshing service" service=metrics tag=metrics err="Get http://j:8500/v1/catalog/service/metrics?stale=&tag=metrics&wait=30000ms: dial tcp: lookup f on 192.168.1.42:53: dial udp 192.168.1.42:53: socket: too many open files"
Aug 14 05:31:36 aegis-03 boss[2388]: level=error ts=2018-08-14T09:31:36.313416194Z caller=consul.go:460 component="discovery manager scrape" discovery=consul msg="Error refreshing service" service=metrics tag=metrics err="Get http://aegis-03.node.aegis:8500/v1/catalog/service/metrics?stale=&tag=metrics&wait=30000ms: dial tcp: lookup v on 192.168.1.42:53: dial udp 192.168.1.42:53: socket: too many open files"

sipian commented Aug 17, 2018

@crosbymichael
Let me try to replicate this locally. Then I'll look at what's going wrong here.


crosbymichael commented Aug 18, 2018

@sipian thanks!


sipian commented Aug 19, 2018

@crosbymichael
I didn't get this error when I set up a consul cluster (a small one though :)).
Can you give me an idea about the number of targets/services that the prometheus-server was scraping?
Also, it would be helpful if you could give some details of the Consul setup.


crosbymichael commented Aug 19, 2018

Ya. The consul setup is pretty standard.

[screenshot]

Prom Args:

        "--storage.tsdb.retention=15d",
        "--storage.tsdb.min-block-duration=30m",
        "--storage.tsdb.max-block-duration=1h"

I had a bug that was SIGHUP'ing Prometheus every 5 minutes to make it reload the configuration. After I fixed that, the fd count is still growing, but at a slower rate, so the fast consumption is probably related to SIGHUP and the config reload.

[screenshot]

You can see that after the 19th, when I pushed the fix, it is still growing, but not as fast as before.
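
For reference, here's a rough Linux-only sketch for watching the count between reloads (fdcount is just a throwaway helper name, not anything from Prometheus itself); the process_open_fds metric Prometheus exports about itself should show the same number:

    package main

    import (
        "fmt"
        "os"
    )

    // Count the open file descriptors of a process by listing /proc/<pid>/fd.
    // Run it against the Prometheus pid after each SIGHUP to see whether the
    // count keeps climbing.
    func main() {
        if len(os.Args) != 2 {
            fmt.Fprintln(os.Stderr, "usage: fdcount <pid>")
            os.Exit(1)
        }
        entries, err := os.ReadDir("/proc/" + os.Args[1] + "/fd")
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        fmt.Printf("pid %s has %d open fds\n", os.Args[1], len(entries))
    }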


sipian commented Aug 19, 2018

Yeah, it looks like repeated SIGHUP is causing this.
@gouthamve @brian-brazil Can you confirm that this is the expected behaviour?


crosbymichael commented Aug 19, 2018

Well I can tell you that it isn't expected ;)

Probably just a small bug somewhere where the reload isn't closing old conns or clients before creating new ones.


sipian commented Aug 19, 2018

@brian-brazil
After doing some digging in consul.go, I found that new http.Transport & http.Client structs are created every time NewDiscovery is called (i.e. on every config reload).

And after reloading the config, lsof -p $(pgrep prometheus) shows that 2 new TCP connections have been established and none of the previous ones to localhost:8500 have been closed.

I think this might be the culprit.

According to the net/http godoc:

Clients and Transports are safe for concurrent use by multiple goroutines and for efficiency should only be created once and re-used.

golang/go#4049 might be of interest.
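
To make the failure mode concrete, here is a rough sketch of the pattern (not the actual consul.go code; the discovery/newDiscovery/stop names are made up, and a local Consul agent on :8500 is assumed). Every reload constructs a fresh Transport, and the idle keep-alive connections held by the previous one are never released, so the fd count grows with each SIGHUP unless the old instance is explicitly stopped or the client is reused:

    package main

    import (
        "net/http"
        "time"
    )

    // discovery is a hypothetical stand-in for a Consul SD instance.
    type discovery struct {
        client    *http.Client
        transport *http.Transport
    }

    // newDiscovery mirrors the leaky pattern: every config reload constructs a
    // brand-new Transport, while the previous one keeps its idle keep-alive
    // connections (and their fds) open.
    func newDiscovery() *discovery {
        t := &http.Transport{IdleConnTimeout: 90 * time.Second}
        return &discovery{
            transport: t,
            client:    &http.Client{Transport: t, Timeout: 30 * time.Second},
        }
    }

    // stop releases the idle connections held by this discovery's transport.
    // Calling it on the old instance when the config is reloaded keeps the fd
    // count from growing with every SIGHUP.
    func (d *discovery) stop() {
        d.transport.CloseIdleConnections()
    }

    func main() {
        // Simulate repeated config reloads. Without the stop() call, each
        // iteration strands the previous transport's idle connections.
        var cur *discovery
        for i := 0; i < 3; i++ {
            if cur != nil {
                cur.stop()
            }
            cur = newDiscovery()
            if resp, err := cur.client.Get("http://localhost:8500/v1/agent/self"); err == nil {
                resp.Body.Close()
            }
        }
    }

Reusing one Client/Transport across reloads, as the net/http docs quoted above recommend, avoids the problem without needing an explicit stop.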


gouthamve commented Aug 20, 2018

Does this fix it: 1cd29f7? @sipian @simonpasquier


sipian commented Aug 20, 2018

Yeah, the above commit fixes it on my consul setup.
@crosbymichael Can you test this on your setup?

I didn't notice this before, but a similar issue was reported earlier in #4425 and closed by #4443.


sipian commented Aug 20, 2018

@gouthamve
I noticed that http.Client & http.Transport are used in a similar way in the triton & marathon service discovery.
Although I haven't tested them, I suspect they might have the same problem.

What are your views on this?
If such a behaviour is expected on them too, I will open a PR to fix it.


gouthamve commented Aug 20, 2018

Try to replicate it before patching it? In general it is a bad idea to fix code that you haven't run. You could test it, but sadly I don't think any SD mechanisms have tests :(


simonpasquier commented Aug 20, 2018

Yes, it seems identical to #4425. Hopefully @crosbymichael will confirm that it fixes the problem.


sipian commented Aug 20, 2018

@gouthamve
That's why we have prometheus/prombench#133 :)


crosbymichael commented Aug 20, 2018

Thanks, I'll build and deploy master and let you know if that commit fixes it.


gouthamve commented Oct 4, 2018

I'm closing this as it is most probably fixed in #4443. Please let me know if it's otherwise.

gouthamve closed this Oct 4, 2018

lock bot locked and limited conversation to collaborators Apr 2, 2019
