
Scraping HTTP keep-alive #2498

Closed
iobestar opened this Issue Mar 15, 2017 · 20 comments

iobestar commented Mar 15, 2017

Why does scraping use the 'Connection: close' header?
Would it be OK to add a config flag for this option (default: keep-alive disabled)?
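
For context, a minimal sketch (not the actual Prometheus scrape code; the target URL is only an example) of how a Go HTTP client ends up sending "Connection: close":

```go
package main

import (
	"io"
	"log"
	"net/http"
)

func main() {
	client := &http.Client{
		Transport: &http.Transport{
			// With this set, the transport adds "Connection: close" to every
			// request and opens a fresh TCP (and TLS) connection per scrape.
			DisableKeepAlives: true,
		},
	}

	resp, err := client.Get("http://localhost:9100/metrics") // example target URL
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
}
```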

fabxc (Member) commented Mar 15, 2017

Actually, I changed that for the dev-2.0 branch. I'd like to try making it the default there and only allow disabling it once a use case appears. The number of required file descriptors will go up, but it is still bounded.
Especially when scraping with TLS, this saves unnecessary round trips.

beorn7 (Member) commented Mar 15, 2017

An easy solution could be to make it depend on the scrape interval. With a 1m scrape interval, keep-alive is pretty wasteful. With a 1s scrape interval, the opposite is true.
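
For illustration, a sketch of that heuristic (the function and package names are hypothetical, not Prometheus code):

```go
package scrape // hypothetical package name

import "time"

// disableKeepAlives sketches the proposed heuristic: turn keep-alive off when
// the scrape interval exceeds the connection idle timeout, because the
// connection would be closed before the next scrape could reuse it anyway.
func disableKeepAlives(scrapeInterval, idleTimeout time.Duration) bool {
	return scrapeInterval > idleTimeout
}
```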

fabxc (Member) commented Mar 15, 2017

I don't think that's generally applicable. First of all – what is wasteful in that context? Open FDs are not exactly expensive, certainly not at the cardinality we are thinking of. So I'm not sure it's worth distinguishing to begin with, given that it will additionally catch people off guard if they reduce their scrape interval and suddenly have to adjust their ulimit.

In general, it's a question of connects/second. And 10 targets at a 1s interval have fewer of those than 10k targets at a 1m scrape interval. On top of that, the total impact of a connect is amortized over the number of series per scrape, which is also not known.
So there are several variables playing into it, and in my benchmarks keep-alive in general has a notable positive impact on the CPU load added by the ingestion path.
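
(Worked out with the numbers above, purely as an illustration: 10 targets at a 1s interval produce roughly 10 new connections per second, while 10,000 targets at a 1m interval produce roughly 10000/60 ≈ 167 per second.)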

brian-brazil (Member) commented Mar 15, 2017

I thought we had keep-alive already; it must have regressed at some point.

iobestar (Author) commented Mar 15, 2017

I don't see how the scrape interval affects keep-alive.
The problem is the connection closing initiated by DisableKeepAlives: true. Adding an option for enabling/disabling keep-alive is a reasonable solution.
The default value can be true, but only from version 2.x.x (in my opinion), because this is a breaking change.
Adding this option with a default value of false in the 1.5.x versions should not be a problem :)
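
A hypothetical sketch of what such an option could plumb through to (the names are invented here; this is not an existing Prometheus flag):

```go
package scrape // hypothetical package name

import "net/http"

// newScrapeTransport shows how a single boolean option could toggle the
// behavior discussed here: keepAlive=false reproduces today's
// "Connection: close" scrapes, keepAlive=true lets connections be reused.
func newScrapeTransport(keepAlive bool) *http.Transport {
	return &http.Transport{
		DisableKeepAlives: !keepAlive,
	}
}
```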

beorn7 (Member) commented Mar 16, 2017

One of the selling points of Prometheus and its pull model is that you can easily cope with a very large number of targets if you only scrape at a sufficiently low frequency.

The wasteful part is if the keep-alive timeout is slightly smaller than the scrape interval. The default idle timeout in Go is 30s, for example. The default scrape interval in Prometheus is 1m. With this combination of timings, every TCP connection is kept open for 30s for no benefit at all.

If you want to avoid the effect of surprising increases in open FDs after reducing the scrape interval, you also have to manage the idle timeout on connections to be larger than the scrape interval.

On the other hand, if you have to think about tweaking the ulimit, it proves we are talking about cardinalities that matter. Assuming the 30s default idle timeout is tailored to be a good trade-off between the cost of an open connection and re-establishing one, we can simply activate keep-alive for scrape intervals of up to 30s and switch it off otherwise.
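
A small illustration of that timing mismatch (plain Go, not Prometheus code; the 30s and 1m values are the ones from the comment):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	scrapeInterval := time.Minute
	transport := &http.Transport{
		// Pooled connections are dropped after this much idle time.
		IdleConnTimeout: 30 * time.Second,
	}
	// Each connection then sits idle for 30s after a scrape and is closed
	// anyway, 30s before the next scrape could have reused it.
	fmt.Println(transport.IdleConnTimeout < scrapeInterval) // true: keep-alive buys nothing here
}
```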

iobestar (Author) commented Mar 16, 2017

I agree with @beorn7. If we can configure the scrape interval, then it would be reasonable to configure other HTTP client parameters (connection idle timeout, keep-alive, connection timeout) as well.

The default configuration can be as you propose: keep-alive enabled implicitly if the scrape interval is under 30s. For TLS targets, should keep-alive be enabled by default regardless of the scrape interval?

fabxc (Member) commented Mar 16, 2017

So increase the keep-alive timeout? The default value isn't relevant to the discussion, and then the described problem basically doesn't exist.

> you also have to manage the idle timeout on connections to be larger than the scrape interval.

Why? Just make it 2x the scrape interval and that's all there is to it. There's no reason to set any other value.

I did some manual benchmarking and found connecting to have a significant CPU impact. And that holds true, by the formula I mentioned, regardless of your scrape interval. (Okay, not regardless, but not sufficiently defined by it.)

> Assuming the 30s default idle timeout is tailored to be a good trade-off between the cost of an open connection and re-establishing one, we can simply activate keep-alive for scrape intervals of up to 30s and switch it off otherwise.

Trade-off between what? It's a value tuned to the average web server client connection, which may or may not be reused. In Prometheus we know almost any connection we start will be reused hundreds of times – it's just a different use case and the default has no place here.

What's the benefit of switching it off for intervals larger than 30s? What changes when setting it to 31s?

The only feasible option here is making it a boolean flag and letting the user decide – globally for everything. Either you want it or you don't. There is no magic boundary within our typical interval range of 1s to 2m where it becomes feasible or unfeasible, because the interval alone is not a sufficient parameter for the function estimating the cost/benefit.
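
A sketch of the "2x the scrape interval" idea from above (the function name is invented for illustration; this is not the actual Prometheus code):

```go
package scrape // hypothetical package name

import (
	"net/http"
	"time"
)

// transportFor returns a transport whose idle timeout comfortably outlives
// the scrape interval, so the pooled connection is always still alive when
// the next scrape comes around.
func transportFor(scrapeInterval time.Duration) *http.Transport {
	return &http.Transport{
		IdleConnTimeout: 2 * scrapeInterval,
	}
}
```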

agaoglu (Contributor) commented Mar 17, 2017

It might be a better option to make it overridable per job/target definition. You know, some servers don't like keeping connections open and are too shy to simply close them on their end.
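
One possible shape for that, purely as a sketch (the keep_alive field is hypothetical and the struct is heavily simplified; it is not the real Prometheus config structure):

```go
package config // hypothetical, heavily simplified

import "time"

// ScrapeConfig sketch with a per-job keep-alive override: nil means "use the
// global default", while an explicit true/false forces the behavior for this
// job's targets.
type ScrapeConfig struct {
	JobName        string        `yaml:"job_name"`
	ScrapeInterval time.Duration `yaml:"scrape_interval"`
	KeepAlive      *bool         `yaml:"keep_alive,omitempty"` // hypothetical option
}
```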

iobestar (Author) commented Mar 17, 2017

Keep-alive is useless if the scrape interval is greater than the connection idle timeout.

I agree with @agaoglu on making keep-alive overridable per job/target, but if keep-alive is enabled, then the connection idle timeout must be set to (at least) the scrape interval.

beorn7 (Member) commented Mar 20, 2017

> So increase the keep-alive timeout? The default value isn't relevant to the discussion, and then the described problem basically doesn't exist.

I only brought in the default value as a starting point. You are right that it is meant for a "generic" connection where we don't have any information about whether it will be re-used at all, which is different from the typical Prometheus scrape, where we know exactly if and when we will re-use it.

So yes, we can of course change the timeout in Prometheus to something like 2x the scrape interval. But there are a number of reasons why there might still be a problem:

  • Even if CPU cycles are usually more costly than open FDs, a Prometheus server slowly scraping 10k targets might very well have plenty of CPU cycles to spare but might be limited to fewer than 10k open FDs.
  • On the side of the monitored target, we usually don't provide an HTTP server owned by the Prometheus client library but piggyback on an existing server implementation. So we have to deal with a wide range of implementations, which all handle keep-alive differently. I'd expect most of them not to accept arbitrarily high keep-alive timeouts from HTTP clients (or if they do, I'd wonder if it was done with the consequences in mind). Also, I'd be very sensitive about second-guessing the resource bottleneck of the monitored target. Prometheus has depleted the FD allowance of monitored targets before… (see the server-side sketch below)
  • Finally, in between Prometheus and the target there might be proxies, connection-tracking firewalls, NAT, …, all of which interact with the keep-alive behavior in various ways: HTTP proxies with a different idea about the max idle timeout, connection tracking dropping idle TCP connections at its own discretion.

I don't think it's trivially true that 2x the scrape interval as the keep-alive timeout is a sane default. I guess there needs to be some sanity cutoff beyond which you don't do keep-alive anymore (which doesn't have to be the 30s I brought in as the default from the Go HTTP package). Given the many different possible scenarios, making it configurable seems unavoidable.
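
Regarding the second point above: the target's server, not Prometheus, often decides how long idle connections survive. A minimal plain-Go illustration (the address and 15s value are assumed examples, not taken from any exporter):

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr: ":9100", // example exporter-style address
		// The server closes keep-alive connections that stay idle longer than
		// this, regardless of the timeout the scraping client would prefer.
		IdleTimeout: 15 * time.Second,
	}
	// srv.SetKeepAlivesEnabled(false) would refuse keep-alive entirely.
	log.Fatal(srv.ListenAndServe())
}
```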

darkk commented Aug 13, 2017

> I don't think it's trivially true that 2x the scrape interval as the keep-alive timeout is a sane default.

I think that the only sane default is "the default major browsers use" exactly because of proxies, NATs and others.

AlekSi (Contributor) commented Sep 25, 2017

Are you interested in adding options to enable keep-alive connections (with a default value of false) to 1.8? I can work on it.

brian-brazil (Member) commented Sep 25, 2017

That change was already inadvertently made for 1.8; it's being reverted in #3173.

AlekSi (Contributor) commented Sep 25, 2017

#3173 is about remote storage as opposed to scraping targets, isn't it?

brian-brazil (Member) commented Sep 25, 2017

The scope of the PR was much broader than the title indicated, thus the revert, as this question remains unsettled.

bboreham (Contributor) commented Nov 24, 2017

I believe the code in 2.0 has keep-alive on for all connections.

However, see also #3510.

AlekSi (Contributor) commented Dec 19, 2017

Was it fixed in 2.0? Can this issue be closed?

brian-brazil (Member) commented Dec 19, 2017

Yes, it's changed in 2.0.
