Consul SD Config connection leak #4425

Closed
ashepelev opened this Issue Jul 26, 2018 · 13 comments

@ashepelev

ashepelev commented Jul 26, 2018

Proposal

Use case. Why is this important?
Consul SD Target Discovery

Bug Report

What did you do?
Using consul_sd_config
What did you expect to see?
Targets discovered via Consul SD, working stably without needing restarts
What did you see instead? Under which circumstances?
The host is running out of FDs. The FDs are used by established connections to Consul.
Environment
Ubuntu 16.04

  • System information:
    Linux 4.4.0-83-generic x86_64

  • Prometheus version:
    prometheus, version 2.3.2 (branch: HEAD, revision: 71af5e2)
    build user: root@5258e0bd9cc1
    build date: 20180712-14:02:52
    go version: go1.10.3

  • Alertmanager version:
    alertmanager, version 0.15.1 (branch: HEAD, revision: 8397de1830f154535a31150f9262da0072d8725d)
    build user: root@efde7f9485ae
    build date: 20180712-18:25:27
    go version: go1.10.3

  • Prometheus configuration file:

    - job_name: consul-services
      consul_sd_configs:
        - server: consul-server
          token: token-token-token
          datacenter: dc
      relabel_configs:
        - source_labels: [__meta_consul_tags]
          regex: ^.*prometheus_exporter.*$
          action: keep
  • Alertmanager configuration file:

  • Logs:
    Logs don't provide relevant information.

FD Usage:
lsof -u prometheus | grep consul | wc -l
4518

process_open_fds on the host:
[graph: process_open_fds]

On July 12 we upgraded from the 2.0.0 to the 2.3.2 release.
The drops in the graph are service restarts.

There is an already closed related issue: https://github.com/prometheus/prometheus/issues/3096

@brian-brazil

Member

brian-brazil commented Jul 26, 2018

Do the net_conntrack_dialer metrics also indicate it's consul?

@iksaif
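
For anyone checking the same thing, a rough sketch of such a query against Prometheus' self-exported metrics (the exact dialer_name value used for the Consul SD client is an assumption here and may differ between versions):

net_conntrack_dialer_conn_established_total{dialer_name=~".*consul.*"}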

@ashepelev

Author

ashepelev commented Jul 27, 2018

@brian-brazil
Thank you for your answer and for pointing to the right metric to investigate. We've run this query on our Prometheus infrastructure.
[graph: net_conntrack_dialer metrics for both Prometheus nodes]
As we can see, only one node shows this runaway connection usage, and unfortunately we don't know why. Our infrastructure consists of two Prometheus nodes, each with a similar configuration, collecting similar targets and deployed from the same configuration role. The only difference is that the second one is a backup: we disable alerting on it so we don't get duplicate alerts. Alerting is re-enabled if we detect that fewer than two Prometheus servers are healthy. Something like this:

{{ if lt (len ( service "exporter_prometheus_server|passing" )) 2 }}
alerting:
  alertmanagers:
    - timeout: 1s
      static_configs:
        - targets: {{ key "prometheus/alertmanager_targets" }}
      {{ end }}
@iksaif

Contributor

iksaif commented Jul 27, 2018

FYI: I won't have a stable internet connection for the next two weeks, but I'll look at this issue as soon as I'm back.

Looking at your configuration, you use relabeling to filter services. If you have a lot of services, this likely makes Prometheus watch all of them and then drop the ones that don't have the tag. Check https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Cconsul_sd_config%3E; you can use the tag option to watch only the services with a given tag.
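
For example, with the configuration from this issue that would look roughly like this (a sketch; the tag value prometheus_exporter is taken from the relabel regex above and may need adjusting):

- job_name: consul-services
  consul_sd_configs:
    - server: consul-server
      token: token-token-token
      datacenter: dc
      tag: prometheus_exporter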

@ashepelev

Author

ashepelev commented Jul 30, 2018

I think I've gotten a bit closer to this problem.
Generally it's not consul_sd_config that keeps the connections established; it's the Prometheus server itself that doesn't free the socket FDs.

lsof -i -a -p 10708 |wc -l
1914

netstat -antp | grep 10708 |wc -l
1918

ss -anp | grep prometheus | wc -l
1909

On the other node this value stays below 900, and the configurations are the same.
After inspecting FD timestamps, I found FDs created about 30 minutes ago that still sit in 'ESTABLISHED'. I'm still gathering additional information about this.
The most interesting thing is that it happens only on one node, and it started after the Prometheus service update, even though both nodes were updated.

Update:
Most of them are tied to node_exporters:

ss -anp | grep prometheus | grep 9100 | wc -l
1066

This surprised me, as I have only 263 node targets.

Update 2:
It's not only node_exporter; many of the exporters have duplicated socket connections to them.
On top of that, the Prometheus server holds 100+ connections to the Consul server to retrieve services, even after I switched the configuration to filter services with the tag option.
This counts the number of targets that have 2+ socket FDs open:

ss -anp | grep prometheus | awk '{print $6}' | sort | uniq -c | sort -n |grep -E '^ +[2-9]+ .*' | wc -l
633
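
A similar count restricted to the Consul connections themselves (a sketch; 8500 is Consul's default HTTP port and may differ in your setup):

ss -anp | grep prometheus | grep :8500 | wc -l
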
@surmehta

surmehta commented Jul 30, 2018

We are using Prometheus version 2.3.1 and facing the same issue. We also changed from relabeling to the tag option for filtering services and are still seeing connection leaks.

- job_name: consul-nodes
  sample_limit: 100000
  consul_sd_configs:
    - server: 'consul.service.vci:8500'
      tag: 'prometheus_enabled'
  relabel_configs:
    - source_labels: [__meta_consul_service]
      target_label: job
    - action: labelmap
      regex: __meta_consul_metadata_prometheus_(.*)

Here is the graph of the net_conntrack_dialer_conn_established_total metric:

[graph: net_conntrack_dialer_conn_established_total]

There seems to be a correlation between the connection leaks and configuration reloads (http://localhost:9090/-/reload). With each reload the connection count increases, and the total connection count never goes down.
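
One way to visualize that correlation is to plot the reload count next to the open file-descriptor count (a sketch using Prometheus' self-exported metrics; restrict process_open_fds to your Prometheus self-scrape job):

# successful configuration reloads over the last hour
changes(prometheus_config_last_reload_success_timestamp_seconds[1h])

# open file descriptors of the Prometheus process
process_open_fds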

@brian-brazil

Member

brian-brazil commented Jul 30, 2018

net_conntrack_dialer_conn_established_total is a counter, so it will always go up; if it's increasing faster than net_conntrack_dialer_conn_closed_total, you have a problem.
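
A rough way to estimate the connections still open per dialer from those two counters (a sketch; both metrics come from Prometheus' own /metrics endpoint):

sum by (dialer_name) (net_conntrack_dialer_conn_established_total) - sum by (dialer_name) (net_conntrack_dialer_conn_closed_total)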

@surmehta

surmehta commented Jul 30, 2018

The net_conntrack_dialer_conn_closed_total{dialer_name="consul_sd",instance="localhost:9090",job="scrape-targets"} metric is zero.

[graph: net_conntrack_dialer_conn_closed_total, flat at zero]

@brian-brazil

Member

brian-brazil commented Jul 30, 2018

That sounds like a leak then, but not in Consul.

This is something we encountered before but never fully resolved. Are you using HTTPS?

@surmehta

surmehta commented Jul 30, 2018

No, we are using HTTP.

@ashepelev

Author

ashepelev commented Jul 31, 2018

@brian-brazil
Hello, I've got some more interesting graphs based on the net_conntrack metrics.
This is the net_conntrack detail (established minus closed) on the healthy node:
[graph: established minus closed connections per job, healthy node]
You can notice that only the consul-services job fails to release connections, even though most of our targets (90-95%) are configured through Consul. Let's switch to the second node, the one we had problems with:
[graph: established minus closed connections per job, problem node]
Here we can see that the difference is much bigger, and again it's the consul-services job that doesn't close its connections.

@surmehta
That's a good observation that you also constantly reload the server configuration. Do you use consul-template for dynamic config changes and reloads? We use it too, and I see these config-reload events in the logs.

Another observation: our second node, the one we actually have problems with, keeps flipping between active/failed statuses in the Consul Serf monitor. So it at least triggers constant server reloads whenever the consul-agent sees the node come back active. I'm not sure how this affects connection pool utilization. Maybe we'll have a solution for this soon.

@surmehta

surmehta commented Jul 31, 2018

@ashepelev We are using the client-go library to watch for changes in specific Kubernetes resources and call the reload API.
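
For context, the reload itself is just an HTTP POST against the /-/reload endpoint mentioned above (which requires Prometheus to run with --web.enable-lifecycle). A minimal sketch of that call in Go; the address is an assumption and the client-go watch wiring is omitted:

package main

import (
	"fmt"
	"log"
	"net/http"
)

// reloadPrometheus asks Prometheus to reload its configuration via /-/reload.
// http://localhost:9090 is an assumed address; adjust to your setup.
func reloadPrometheus() error {
	resp, err := http.Post("http://localhost:9090/-/reload", "text/plain", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("reload failed: %s", resp.Status)
	}
	return nil
}

func main() {
	if err := reloadPrometheus(); err != nil {
		log.Fatal(err)
	}
}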

@simonpasquier

Member

simonpasquier commented Jul 31, 2018

I confirm that Consul SD is leaking connections on reloads. Working on a fix.
