Prometheus stops scraping some targets #1587
Comments
brian-brazil added the bug label Apr 25, 2016
That sounds like it could be a deadlock in the target management code. Which service discovery methods are you using?
I am using
The target manager code was essentially 95% rewritten. If 0.17.0 shows the same behavior, the bug must be in the small intersection.
I probably need to clarify that by 0.17 I meant pre-0.18 build of
fabxc added this to the v1.0.0 milestone Apr 25, 2016
fabxc added kind/bug and removed bug labels Apr 28, 2016
@beorn7 for confirmation, that was the issue fixed via upgrade of the
fabxc added the priority/P0 label May 24, 2016
That's pretty likely.
0.19.2 has been running everywhere at SoundCloud without any issues.
beorn7 closed this Jun 8, 2016
lock bot commented Mar 24, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
hudashot commented Apr 25, 2016

Every once in a while my Prometheus server stops scraping some targets.
/status page still lists those, but their "Last Scrape" time is N hours ago.

What I noticed is:
- /status, but some say "context deadline exceeded";
- node_exporters, some are elasticsearch_exporters, some are custom apps that expose metrics using text exposition format;

Here's a goroutine dump. 1857 minutes ago is when some targets stopped getting scraped. It looks as if Prometheus stopped reading from a number of established connections (which I can still see open using lsof).

I assume this might have been caused by intermittent loss of network connectivity. I have not read the code too closely, but I would have expected scrape_timeout to kick in (default value is 10 seconds, right?) and abort those scrapes instead of hanging like this.
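
To illustrate the behavior the reporter expected from scrape_timeout, here is a minimal Go sketch of a scrape bounded by a context deadline. It is not Prometheus's actual scrape code. The names scrape, timedOut, hangingTarget, and demoTimeout are hypothetical, and the 200 ms value is arbitrary. The fake target accepts the connection and never answers, roughly like the established-but-silent connections described above, and the request fails with "context deadline exceeded" instead of hanging.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

// demoTimeout stands in for scrape_timeout; the value is an arbitrary assumption.
const demoTimeout = 200 * time.Millisecond

// scrape fetches url and drains the body, giving up once timeout has elapsed.
// The deadline covers connecting, waiting for headers, and reading the body.
func scrape(url string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

// timedOut reports whether err was caused by the context deadline.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

// hangingTarget starts a local listener that accepts connections, swallows
// the request, and never replies. It returns the URL and a stop function.
func hangingTarget() (url string, stop func()) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go func(c net.Conn) {
				defer c.Close()
				io.Copy(io.Discard, c) // read until the client gives up
			}(c)
		}
	}()
	return "http://" + ln.Addr().String() + "/metrics", func() { ln.Close() }
}

func main() {
	url, stop := hangingTarget()
	defer stop()

	err := scrape(url, demoTimeout)
	fmt.Println("scrape error:", err)
	fmt.Println("timed out:", timedOut(err))
}
```

The point of the sketch is that the deadline is attached to the request itself, so a stalled read is cancelled along with everything else. A scrape that stays blocked for hours, as in the goroutine dump, suggests the deadline was not reaching the blocked read in the affected build.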