Debug logs for failed scrapes #2820

Closed
discordianfish opened this Issue Jun 8, 2017 · 7 comments

discordianfish (Member) commented Jun 8, 2017

Hi,

I need to debug intermittent scrape failures and realized there is no way to tell the various reasons for a scrape failure apart after the fact. I can make an educated guess as to whether a scrape timed out or failed for some other reason by looking at scrape_duration_seconds, but I can't distinguish DNS issues, refused connections, reset connections, EOFs, and so on.

Therefore I propose we log scrape failures, at info-level severity. I consider a scrape failure important enough that one shouldn't have to enable debug logging to see it, especially given that info is already quite chatty with the maintenance sweep messages.

Alternatively, and this would work just as well for me, we could come up with a time series or additional labels for up, but I think people would prefer logging.

brian-brazil (Member) commented Jun 8, 2017

Logging scrape failures is, roughly speaking, request logging; it doesn't belong in application logs and could get extremely spammy (potentially hundreds of lines per second). Debug would be the appropriate log level.

discordianfish (Member, Author) commented Jun 8, 2017

For context: on my cluster the sweep messages are pretty spammy (every 10-15 seconds), which led me to suggest the info log level.
While I'd still prefer not to have to enable debug logging to get scrape failures, I'd be okay with that.

juliusv (Member) commented Jul 8, 2017

I agree that debug would be the appropriate level here, as scrape failures aren't a concern of the health of the Prometheus server itself and can indeed get very noisy on larger Prometheus servers.

You can't just put additional error labels on up, as that'd break any query that wants to look at the up time series of a given target over time (because it would become multiple series).
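
To make that concrete, here is a minimal sketch using client_golang; the error label, its values, and the registry setup are illustrative only, not an actual proposal:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Hypothetical: up carrying an extra "error" label.
	up := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "up", Help: "Target health."},
		[]string{"instance", "error"},
	)
	reg := prometheus.NewRegistry()
	reg.MustRegister(up)

	// A healthy scrape and a timed-out scrape of the same target now
	// carry different label sets, i.e. they are different series.
	up.WithLabelValues("host:9100", "").Set(1)
	up.WithLabelValues("host:9100", "timeout").Set(0)

	mfs, _ := reg.Gather()
	// Prints 2: one target, but two up series, which is what breaks any
	// query that assumes one up series per target over time.
	fmt.Println(len(mfs[0].Metric))
}
```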

gouthamve (Member) commented Jul 22, 2017

For those looking to take this up, the relevant code path is here: https://github.com/prometheus/prometheus/blob/master/retrieval/scrape.go#L303. You could check each place where an error is produced and log accordingly.
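
A rough illustration of that suggestion (a standalone sketch, not the actual scrape.go code; scrapeOnce and its signature are invented for the example): wherever the scrape produces an error, log the target together with the error so DNS failures, refused connections, and timeouts can be told apart later:

```go
package retrieval

import (
	"context"
	"net/http"

	"github.com/prometheus/common/log" // logging package used on master at the time
)

// scrapeOnce is a hypothetical helper showing the pattern: every error
// site in the scrape path logs the target and the underlying error
// before returning it.
func scrapeOnce(ctx context.Context, client *http.Client, url string) error {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		log.Debugf("creating scrape request for %q failed: %v", url, err)
		return err
	}
	resp, err := client.Do(req.WithContext(ctx))
	if err != nil {
		// The error text distinguishes DNS errors, connection refused,
		// connection resets, EOFs, timeouts, etc.
		log.Debugf("scrape of %q failed: %v", url, err)
		return err
	}
	defer resp.Body.Close()
	return nil
}
```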

brian-brazil (Member) commented Jul 22, 2017

This change should probably be made in the dev-2.0 branch, as this code path has changed considerably.

https://github.com/prometheus/prometheus/blob/dev-2.0/retrieval/scrape.go#L388

cstyan (Contributor) commented Aug 22, 2017

AFAICT we don't have access to the logger from the scrape function, so debug logging of scrape errors could happen in this block: https://github.com/prometheus/prometheus/blob/dev-2.0/retrieval/scrape.go#L638-L642

Am I missing anything?
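
For what it's worth, a minimal sketch of what a log line in that block could look like, assuming the scrape loop's go-kit logger is reachable there (the field name sl.l and the variable scrapeErr are assumptions, not the actual dev-2.0 identifiers):

```go
// Requires github.com/go-kit/kit/log/level; sl.l and scrapeErr are
// assumed to be in scope in the surrounding scrape loop.
if scrapeErr != nil {
	level.Debug(sl.l).Log("msg", "Scrape failed", "err", scrapeErr)
}
```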

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019
