Prometheus stops scraping some targets #1587

Closed
hudashot opened this issue Apr 25, 2016 · 9 comments

hudashot commented Apr 25, 2016

Every once in a while my Prometheus server stops scraping some targets. The /status page still lists them, but their "Last Scrape" time is N hours ago.

What I noticed is:

  • usually a number of targets stop getting scraped around the same time;
  • most targets in this state don't have any errors on /status, but some say "context deadline exceeded";
  • it's not limited to a specific target type: some are node_exporters, some are elasticsearch_exporters, some are custom apps that expose metrics using text exposition format;
  • I am now using 0.18.0 binaries from the Releases page, but I've seen the same behavior with 0.17 that I compiled myself.

Here's a goroutine dump. 1857 minutes ago is when some targets stopped getting scraped. It looks as if Prometheus stopped reading from a number of established connections (which I can still see open using lsof).

I assume this might have been caused by intermittent loss of network connectivity. I have not read the code too closely, but I would have expected scrape_timeout to kick in (default value is 10 seconds, right?) and abort those scrapes instead of hanging like this.
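To make the expectation above concrete, here is a minimal Go sketch of a scrape whose deadline covers the whole request, including reading the body, so a target that stops sending data cannot hold the scrape open indefinitely. This is not Prometheus's scrape code: `scrape`, `scrapeStalledTarget`, and `demoTimeout` are hypothetical names, and the stalled target is simulated with a local test server.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// demoTimeout is a hypothetical, deliberately short stand-in for scrape_timeout.
const demoTimeout = 200 * time.Millisecond

// scrape fetches url and gives up once timeout has elapsed. The deadline
// is attached to the request context, so it covers connecting, waiting
// for response headers, and reading the body.
func scrape(url string, timeout time.Duration) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, fmt.Errorf("build request for %s: %w", url, err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, fmt.Errorf("scrape %s: %w", url, err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("read body from %s: %w", url, err)
	}
	return body, nil
}

// scrapeStalledTarget starts a local server that sends response headers
// and then goes silent, and returns the error from scraping it.
func scrapeStalledTarget(timeout time.Duration) error {
	release := make(chan struct{})
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		if f, ok := w.(http.Flusher); ok {
			f.Flush() // headers go out, the body never follows
		}
		select {
		case <-release:
		case <-r.Context().Done():
		}
	}))
	defer srv.Close()    // runs second: waits for the handler to return
	defer close(release) // runs first: unblocks the handler

	_, err := scrape(srv.URL, timeout)
	return err
}

func main() {
	// The scrape should fail after roughly demoTimeout instead of hanging.
	fmt.Println("scrape error:", scrapeStalledTarget(demoTimeout))
}
```

The point of the sketch is that the deadline lives on the request context rather than only on the connection attempt, so a connection that is established but silent still gets aborted.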

brian-brazil added the bug label Apr 25, 2016

brian-brazil (Member) commented Apr 25, 2016

That sounds like it could be a deadlock in the target management code. Which service discovery methods are you using?

hudashot (Author) commented Apr 25, 2016

I am using file_sd_configs with a list of targets generated by another system as a JSON file.
Here's my config.

fabxc (Member) commented Apr 25, 2016

Target manager code was essentially 95% rewritten. If 0.17.0 shows the same behavior it must be in the small intersection.

hudashot (Author) commented Apr 25, 2016

I should clarify that by 0.17 I meant a pre-0.18 build of master.
It might already have contained the new code in question.

fabxc added this to the v1.0.0 milestone Apr 25, 2016

fabxc added kind/bug and removed bug labels Apr 28, 2016

fabxc (Member) commented May 24, 2016

@beorn7, for confirmation: was this the issue fixed by upgrading the ctxhttp package?

beorn7 (Member) commented May 24, 2016

That's pretty likely.

fabxc (Member) commented Jun 8, 2016

@beorn7, @hudashot did you encounter this again or can we close this?

beorn7 (Member) commented Jun 8, 2016

0.19.2 has been running everywhere at SoundCloud without any issues.
Since there were no further complaints from the community, I'll close this.
Please re-open if mistaken.

beorn7 closed this Jun 8, 2016

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
