Slow connection leak (v2 beta 0) #3039

Closed
aecolley opened this Issue Aug 8, 2017 · 7 comments


aecolley commented Aug 8, 2017

What did you do?
Left prometheus running with 700 monitor targets (scrape intervals variously 1m or 2m; scrape timeout 20s) for a week.

What did you expect to see?
process_open_fds (of prometheus itself) plateauing at about 1000.

What did you see instead? Under which circumstances?
process_open_fds increased from 1325 to 12531 over 5 days. At the end of that time, prometheus had multiple established TCP connections open to some monitor targets. Specifically, 485 targets had 1 connection; 36 targets had 30-40 open connections each; and 3 targets had 392 open connections each. These are counts taken using ss on the prometheus host.

I haven't yet established a cause, but there's an odd correlation with the value of the Content-Type header in the response from the monitor targets. All of the 1-connection targets say "Content-Type: text/plain; version=0.0.4; charset=UTF-8"; all of the 30-connection targets say "Content-Type: application/octet-stream"; and all of the 392-connection targets say "Content-Type: text/plain; version=0.0.4".

The three 392-connection targets are two pushgateway servers and one node_exporter server.

Environment

  • System information:

    uname -srm: Linux 3.10.0-514.21.2.el7.x86_64 x86_64

  • Prometheus version:

    prometheus, version 2.0.0-beta.0 (branch: tripadvisor/releng-2.0.0, revision: tr-prometheus-2.0.0-0.3.el7)
    build user: mockbuild
    build date: 20170726-18:54:10
    go version: go1.8.3

Yes, this is a private build forked from the 2.0.0-beta.0 release tag. I'm trying to reproduce it using the official release binary in a dev environment. I'm filing this bug report early in case others see the same thing.

  • Prometheus configuration file:
global:
  scrape_interval: 1m
  scrape_timeout: 20s
  evaluation_interval: 1m
[...]
scrape_configs:
- job_name: target_with_1_conn_each
  scrape_interval: 2m
  scrape_timeout: 20s
  metrics_path: /metrics
  scheme: http
  file_sd_configs:
  - files:
    - /path/to/generated_target_list.json
    refresh_interval: 1h
  relabel_configs:
  - source_labels: [job]
    separator: ;
    regex: collectd_exporter
    replacement: $1
    action: drop
- job_name: target_with_30_conns_each
  scrape_interval: 1m
  scrape_timeout: 20s
  metrics_path: /metrics
  scheme: http
  file_sd_configs:
  - files:
    - /path/to/targetlist.yml
    refresh_interval: 1h
[...]
- job_name: node
  scrape_interval: 1m
  scrape_timeout: 20s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - 127.0.0.1:9100
    labels:
      instance: myhostname:9100
- job_name: pushgateway
  scrape_interval: 1m
  scrape_timeout: 20s
  metrics_path: /metrics
  scheme: http
  file_sd_configs:
  - files:
    - /path/to/pushgateway.yml
    refresh_interval: 1h
[...]
  • Logs:

Nothing in the logs seems to correlate.

fabxc commented Aug 8, 2017

Thanks for the detailed report – that's really awesome!

Indeed we made some changes in that area: always scraping in text format and keeping connections alive.
The issue is probably somewhere in the former, with the symptom caused by the latter.

From your description it seems like all targets are scraped without any issues?

fabxc commented Aug 8, 2017

Also, what kind of application is the one returning "Content-Type: application/octet-stream"? (Which client library does it use, etc.?)

We are possibly just not cleaning something up when we reload internals and start new scrape loops.
Speaking against that is that node_exporter has a static configuration, which never causes anything to reload. Given that, it would also be particularly odd for it to have exactly the same number of open connections.

In case you still have the server running, could you run changes(up[7d]) and check whether there might be some correlation there?

aecolley commented Aug 8, 2017

Yes, there's no problem scraping all the targets. There are no scrape failures, and the server never enters any kind of throttling (at least, nothing that's worth a log entry or an alert by the existing metamonitoring rules I have).

The application returning application/octet-stream is a Java application using io.prometheus.simpleclient_hotspot version 0.0.16. However, it doesn't use a standard exporter; it has a custom exporter that calls TextFormat.write004 directly, without setting the text-format Content-Type. I'll get that fixed.

An odd thing is that the graphs of process_open_fds for the two pushgateways and the node_exporter are almost exactly the same shape and size. Anyway, changes(up[7d]) is zero for all three up series.

I noticed that prometheus's go_goroutines curve almost exactly matches its process_open_fds curve (goroutines are consistently just over 2x fds), even after it hit 13660 goroutines and then dropped sharply to 1805 over the next 9 hours (yesterday); I suppose that was garbage collection, because it coincided with a surge in go_memstats_heap_released_bytes.

The node_exporter and pushgateways report a steady rate of /metrics scrapes, so I'm pretty sure the extra connections aren't being used for duplicate scrapes.

Most annoyingly, my test prometheus servers, each with a single node_exporter target, are not showing any increase in fd use, not even the one running the local build of prometheus. But the two servers carrying production load both show the same symptom.

fabxc commented Aug 9, 2017

From all that info, it's probably not really related to the returned Content-Type. FWIW, we aren't even looking at it and just attempt to read the text format. Since that works, the problem must be elsewhere.

I initially thought my testing setup also showed what you described. But it ultimately was just the slowly increasing number of open files in the storage.

SIGHUPing it, however, sharply increased open FDs by exactly the number of targets, and goroutines spiked in sync with that. Something is not being terminated properly.
While it's obviously a different pattern, the root cause might, as guessed, be some internal reloading, even though I'm still unsure why it would only affect some of your jobs.
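
For illustration, here is a minimal Go sketch of that suspected pattern, using plain net/http rather than the actual Prometheus scrape code: a client that is abandoned on reload keeps its pooled keep-alive connection, and the goroutines servicing it, alive.

package main

import (
    "fmt"
    "io"
    "io/ioutil"
    "net/http"
    "net/http/httptest"
    "runtime"
)

func main() {
    // Stand-in for a scrape target.
    target := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintln(w, "up 1")
    }))
    defer target.Close()

    before := runtime.NumGoroutine()

    // Simulate repeated reloads: each iteration creates a fresh client,
    // scrapes once, and then abandons the client without closing its
    // idle connections.
    for i := 0; i < 10; i++ {
        client := &http.Client{Transport: &http.Transport{}}
        resp, err := client.Get(target.URL + "/metrics")
        if err != nil {
            panic(err)
        }
        io.Copy(ioutil.Discard, resp.Body)
        resp.Body.Close() // the connection goes back to the transport's idle pool
        // The transport is dropped here without CloseIdleConnections(),
        // so the pooled connection and the goroutines behind it live on.
    }

    // The goroutine (and fd) count grows with every abandoned client.
    fmt.Printf("goroutines before: %d, after: %d\n", before, runtime.NumGoroutine())
}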

fabxc commented Aug 9, 2017

Okay, it seems that abandoning the client does not terminate the goroutines of its idle keep-alive connections. We could close them explicitly, but just setting a maximum idle timeout is the most transparent solution that should work.
Most likely this also solves your issue.

With our current settings the connections are just kept open forever, which doesn't quite explain why you suddenly saw them being killed after more than a week. But that might just have been the OS forcefully closing them.
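
As a sketch of what that could look like at the net/http level (standard-library fields only, not the actual patch; the helper names newScrapeClient and abandonClient are made up for illustration):

package main

import (
    "net/http"
    "time"
)

// newScrapeClient (hypothetical) builds a client whose idle keep-alive
// connections are closed automatically once unused for the given duration
// (http.Transport.IdleConnTimeout, available since Go 1.7).
func newScrapeClient(maxIdle time.Duration) *http.Client {
    return &http.Client{
        Transport: &http.Transport{
            IdleConnTimeout: maxIdle, // the "max timeout" option
        },
    }
}

// abandonClient (hypothetical) is the explicit alternative: close the idle
// connections before dropping the client, e.g. when a scrape loop is torn
// down on a reload.
func abandonClient(c *http.Client) {
    if t, ok := c.Transport.(*http.Transport); ok {
        t.CloseIdleConnections()
    }
}

func main() {
    client := newScrapeClient(5 * time.Minute)
    // ... scrape with client ...
    abandonClient(client) // or simply let IdleConnTimeout expire the connections
}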

aecolley commented Aug 23, 2017

This appears to be fixed in v2.0.0-beta.2.

fabxc closed this Sep 14, 2017
