Scraping HTTP keep-alive #2498
Comments
Actually, I changed that for the dev-2.0 branch. I'd like to try making it the default there and only allow disabling it once a use case appears. The number of required file descriptors will go up, but it is still bounded.
An easy solution could be to make it depend on the scrape interval. With a 1m scrape interval, keep-alive is pretty wasteful. With a 1s scrape interval, the opposite is true.
I don't think that's generally applicable. First of all, what is wasteful in that context? Open FDs are not exactly expensive, certainly not at the cardinality we are thinking of. So I'm not sure it's worth distinguishing to begin with, given that it will additionally catch people off guard if they reduce their scrape interval and suddenly have to adjust their ulimit. In general, it's a question of connects per second, and 10 targets at a 1s interval produce fewer of those than 10k targets at a 1m scrape interval. On top of that, the total impact of a connect is amortized over the number of series per scrape, which is also not known.
I thought we had keep-alive already; it must have regressed at some point.
I don't see how scrape timeout affects keep-alive?
One of the selling points of Prometheus and its pull model is that you can easily cope with a very large number of targets if you only scrape at a sufficiently low frequency. The wasteful part is when the keep-alive timeout is slightly smaller than the scrape interval. The default idle timeout in Go is 30s, for example, while the default scrape interval in Prometheus is 1m. With this combination of timings, every TCP connection is kept open for 30s for no benefit at all.

If you want to avoid the effect of surprising increases in open FDs after reducing the scrape interval, you also have to manage the idle timeout on connections to be larger than the scrape interval. On the other hand, if you have to think about tweaking the ulimit, it proves we are talking about cardinalities that matter.

Assuming the 30s default idle timeout is tailored to be a good trade-off between the cost of an open connection and the cost of re-establishing one, we can simply activate keep-alive for scrape intervals of up to 30s and switch it off otherwise.
I agree with @beorn7. If we can configure the scrape interval, then it would be reasonable to configure other HTTP client parameters (connection idle timeout, keep-alive, connection timeout) as well. The default configuration can be as you propose: keep-alive enabled implicitly if the scrape interval is under 30s. For TLS targets, should keep-alive be enabled by default regardless of scrape interval?
So increase the keep-alive timeout? The default value isn't relevant to the discussion, and the described problem basically doesn't exist.

Why? Just make it 2x the scrape interval and that's all there is to it. There's no reason to set any other value. I did some manual benchmarking and found connecting to have a significant CPU impact, and that holds true by the formula I mentioned regardless of your scrape interval. (Okay, not regardless, but not sufficiently determined by it.)

Trade-off between what? It's a value tuned for the average web server client connection, which may or may not be reused. In Prometheus we know that almost any connection we start will be reused hundreds of times; it's just a different use case, and the default has no place here. What's the benefit of switching it off for intervals larger than 30s? What changes when setting it to 31s? The only feasible option here is making it a boolean flag and letting the user decide, globally for everything. Either you want it or you don't. There is no magic boundary within our typical interval range of 1s to 2m where it becomes feasible or unfeasible, because the interval alone is not a sufficient parameter for the function estimating the cost/benefit.
It might be a better option to make it overridable per job/target definition. You know, some servers don't like keeping connections open but are too shy to simply close them on their end.
Keep-alive is useless if the scrape interval is greater than the connection idle timeout. I agree with @agaoglu about making keep-alive overridable per job/target, but if keep-alive is enabled, then the connection idle timeout must be set to at least the scrape interval.
I only brought in the default value as a starting point. You are right that it is meant for a "generic" connection, where we don't know whether we will re-use it at all; the typical Prometheus scrape is different, because we know exactly if and when we will re-use the connection. So yes, we can of course change the timeout in Prometheus to something like 2x the scrape interval. But there are a number of reasons why there might still be a problem:

I don't think it's trivially true that 2x the scrape interval is a sane default for the keep-alive timeout. I guess there needs to be some sanity threshold beyond which you don't do keep-alive anymore (which doesn't have to be the 30s I brought in as the default from the Go HTTP package). Given the many different possible scenarios, making it configurable seems unavoidable.
beorn7 referenced this issue Apr 2, 2017: Too many open files (established connections to same nodes) #1873 (closed)
brian-brazil added the priority/Pmaybe and kind/enhancement labels Jul 14, 2017
darkk commented Aug 13, 2017

I think that the only sane default is "the default major browsers use", exactly because of proxies, NATs and others.
This was referenced Sep 11, 2017
Are you interested in adding options to enable keep-alive connections (with default value
That change was already inadvertently made for 1.8; it's being reverted in #3173.
#3173 is about remote storage as opposed to scraping targets, isn't it?
The scope of the PR was much broader than the title indicated, hence the revert, as this question remains unsettled.
I believe the code in 2.0 has keep-alive on for all connections. However, see also #3510.
Was it fixed in 2.0? Can this issue be closed?
Yes, it's changed in 2.0.
brian-brazil closed this Dec 19, 2017
lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
iobestar commented Mar 15, 2017

Why does scraping use the 'Connection: close' header?
Would it be OK to add a config flag for this option (default: keep-alive disabled)?
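For context on the question: in Go, disabling keep-alive on the client transport is what produces the `Connection: close` header the target sees. A generic Go illustration (not the Prometheus scrape code; the helper name is made up), using a throwaway test server to observe the header:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// connectionHeaderSeen spins up a throwaway server and reports which
// Connection header it receives from a client with the given keep-alive
// setting. The helper name is hypothetical, for illustration only.
func connectionHeaderSeen(keepAlive bool) string {
	var connHdr string
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		connHdr = r.Header.Get("Connection") // what the scraped target sees
	}))
	defer srv.Close()

	client := &http.Client{Transport: &http.Transport{DisableKeepAlives: !keepAlive}}
	resp, err := client.Get(srv.URL)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
	return connHdr
}

func main() {
	// With keep-alive off, Go's transport adds "Connection: close" itself.
	fmt.Println(connectionHeaderSeen(false))
}
```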