Slow connection leak (v2 beta 0) #3039
Comments
Thanks for the detailed report – that's really awesome! Indeed we made some changes in that area: always scraping in the text format and keeping connections alive. From your description it seems like all targets are scraped without any issues?
fabxc added this to the v2.x milestone on Aug 8, 2017
fabxc added the component/scraping, dev-2.0, kind/bug, priority/P1 labels on Aug 8, 2017
Also, what kind of application is the one returning application/octet-stream? We are possibly just not cleaning something up when we are reloading internals and starting new scrape loops. In case you still have the server running, could you run [...]?
Yes, there's no problem scraping all the targets. There are no scrape failures, and the server never enters any kind of throttling (at least, nothing that's worth a log entry or an alert by the existing metamonitoring rules I have).

The application returning the application/octet-stream Content-Type is [...]. An odd thing is that graphs of [...]. I noticed that prometheus's [...]. The node_exporter and pushgateways report a steady rate of [...].

Most annoyingly, my test prometheus servers, with a single node_exporter target, are not showing any increase in fd use. Not even the one with the local build of prometheus. But the ones carrying production load both show the same symptom.
From all that info, it's probably not really related to the returned Content-Type. FWIW, we aren't even looking at it and just attempt to read the text format. Since that works, the problem must be elsewhere.

I initially thought my testing setup also showed what you described, but that turned out to be just the slowly increasing number of open files in the storage. SIGHUPing it, however, sharply increased open FDs by exactly the number of targets, and goroutines spiked in sync with that. Something is not being terminated properly.
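For context, here is a minimal Go sketch of the suspected failure mode. It is illustrative only, not Prometheus's actual scrape code; the target URL, loop structure, and helper names are made up. The point it demonstrates: if each reload builds a fresh http.Client and the old one is simply dropped, every idle keep-alive connection in the old Transport keeps its file descriptor and its read/write goroutines alive until the peer or the OS closes the socket.

```go
// Illustrative reproduction of the leak pattern, not Prometheus's real code.
// Abandoning an http.Client without closing its idle connections leaves one
// open FD plus the Transport's read/write goroutines per keep-alive connection.
package main

import (
	"fmt"
	"io"
	"net/http"
	"runtime"
	"time"
)

func scrapeOnce(c *http.Client, url string) error {
	resp, err := c.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Draining the body hands the connection back to the idle pool,
	// where it stays open because keep-alives are on by default.
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

func main() {
	targets := []string{"http://localhost:9100/metrics"} // hypothetical target

	for reload := 0; reload < 5; reload++ {
		// A fresh client per "reload"; the old one is dropped without
		// CloseIdleConnections, so its idle connections never go away.
		client := &http.Client{Transport: &http.Transport{}}
		for _, t := range targets {
			_ = scrapeOnce(client, t)
		}
		fmt.Println("reload", reload, "goroutines:", runtime.NumGoroutine())
		time.Sleep(time.Second)
	}
}
```

Run against any keep-alive-capable target, the goroutine count climbs with each "reload", which matches the FD and goroutine jumps described above.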
Okay, it seems that abandoning the client does not terminate the goroutines of its idle keep-alive connections. We could close them explicitly, but just setting a maximum idle timeout is the most transparent solution that should work. With our current settings they are kept open forever, which doesn't quite explain why you suddenly saw them being killed after all, after more than a week. But that might just have been the OS forcefully closing the connections.
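A sketch of the timeout-based fix described above follows. It is not the exact patch that went into Prometheus; the function names, the 5-minute value, and the MaxIdleConnsPerHost setting are illustrative assumptions.

```go
// Sketch of the fix direction described above; names and values are illustrative.
package scrape

import (
	"net/http"
	"time"
)

// newScrapeClient builds an HTTP client whose idle keep-alive connections are
// reaped automatically, so an abandoned connection pool drains itself instead
// of holding FDs and goroutines forever.
func newScrapeClient(scrapeTimeout time.Duration) *http.Client {
	return &http.Client{
		Timeout: scrapeTimeout, // per-scrape deadline
		Transport: &http.Transport{
			IdleConnTimeout:     5 * time.Minute, // illustrative maximum idle time
			MaxIdleConnsPerHost: 1,               // one cached connection per target
		},
	}
}

// retireScrapeClient is a hypothetical helper: closing idle connections when a
// scrape loop is stopped or reloaded releases their FDs and goroutines right
// away instead of waiting for IdleConnTimeout to expire.
func retireScrapeClient(c *http.Client) {
	if t, ok := c.Transport.(*http.Transport); ok {
		t.CloseIdleConnections()
	}
}
```

The idle timeout is "transparent" in the sense used above: it needs no coordination with scrape-loop teardown, because even a client nobody remembers to close stops pinning connections once the timeout expires.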
This appears to be fixed in v2.0.0-beta.2.
fabxc closed this on Sep 14, 2017
lock bot commented on Mar 23, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
aecolley commented on Aug 8, 2017 (original issue report)
What did you do?
Left prometheus running with 700 monitor targets (scrape intervals variously 1m or 2m; scrape timeout 20s) for a week.
What did you expect to see?
process_open_fds (of prometheus itself) plateauing at about 1000.
What did you see instead? Under which circumstances?
process_open_fds increased from 1325 to 12531 over 5 days. At the end of that time, prometheus had multiple established TCP connections open to some monitor targets. Specifically, 485 targets had 1 connection; 36 targets had 30-40 open connections each; and 3 targets had 392 open connections each. These counts were taken using ss on the prometheus host.

I haven't yet established a cause, but there's an odd correlation with the value of the Content-Type header in the response from the monitor targets. All of the 1-connection targets say "Content-Type: text/plain; version=0.0.4; charset=UTF-8"; all of the 30-connection targets say "Content-Type: application/octet-stream"; and all of the 392-connection targets say "Content-Type: text/plain; version=0.0.4". The three 392-connection targets are two pushgateway servers and one node_exporter server.

Environment
System information:
uname -srm: Linux 3.10.0-514.21.2.el7.x86_64 x86_64
Prometheus version:
prometheus, version 2.0.0-beta.0 (branch: tripadvisor/releng-2.0.0, revision: tr-prometheus-2.0.0-0.3.el7)
build user: mockbuild
build date: 20170726-18:54:10
go version: go1.8.3
Yes, this is a private build forked from the 2.0.0-beta.0 release tag. I'm trying to reproduce it using the official release binary in a dev environment. I'm filing this bug report early in case others see the same thing. So far, nothing else seems to correlate.