Configurable limit to scrape concurrency #4408
Comments
hawkw referenced this issue on Jul 23, 2018: Controller must work in clusters with >100 pods #1322 (Closed)
I'm not sure this would actually help you: if your proxy already can't handle the load, then moving the load around a bit isn't going to help, as scrapes are already splayed over time to avoid spikes. In addition, this would lead to scrapes not happening at regular intervals, which could cause artifacts. Failing as it currently does is probably the best behaviour in this scenario. I'd suggest either having fewer targets or removing the concurrency limit in your proxy. Allowing Prometheus to scrape the targets directly might also be beneficial, and removes a failure mode.
hawkw referenced this issue on Jul 23, 2018: Increase outbound router capacity for Prometheus pod's proxy #1358 (Merged)
briansmith commented Jul 23, 2018
Let's not fixate on the specific issue of a proxy imposing some limit on Prometheus's scraper. It sounds like Prometheus is probably doing something close to what we're asking for anyway.
Regarding "scrapes are already splayed over time to avoid spikes": that sounds like what we ultimately would want anyway; however, we're not seeing the splaying done in the way we expected. Could you please point us to where we can understand how this splaying is done? Are there any tuning parameters to control it?
It's based on a hash of the target labels, and with 100+ targets it should be reasonably uniform. It is not configurable.
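For a concrete picture of that hash-based splaying, here is a minimal, self-contained sketch in Go. It illustrates the general technique rather than the actual Prometheus implementation: a stable offset within the scrape interval is derived from a hash of the target's labels, so many targets sharing the same interval start their scrapes at different points rather than all at once.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"time"
)

// offsetFor derives a stable, pseudo-random start offset within the scrape
// interval from a hash of the target's labels, so targets that share an
// interval do not all fire at the same instant.
func offsetFor(labels map[string]string, interval time.Duration) time.Duration {
	// Hash the labels in a stable order so each target always gets the same offset.
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := fnv.New64a()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(labels[k]))
	}
	return time.Duration(h.Sum64() % uint64(interval))
}

func main() {
	interval := 10 * time.Second
	targets := []map[string]string{
		{"job": "linkerd-proxy", "instance": "10.0.0.1:4191"},
		{"job": "linkerd-proxy", "instance": "10.0.0.2:4191"},
		{"job": "linkerd-proxy", "instance": "10.0.0.3:4191"},
	}
	for _, t := range targets {
		fmt.Printf("%s scrapes start %v into each %v interval\n",
			t["instance"], offsetFor(t, interval), interval)
	}
}
```

With 100+ targets, hash-derived offsets like these spread scrapes roughly uniformly across the interval, which is the behaviour described in the comment above.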
hawkw commented Jul 23, 2018
Proposal
When Prometheus is configured with more than one scrape target, it sends requests to all those targets concurrently. In some cases, a very high number of concurrent requests can cause issues.
I propose adding a configuration option that limits the number of concurrent requests when scraping. When Prometheus issues a request to a scrape target, it should track the number of requests currently in flight and check whether the configured concurrency limit, if any, has been reached. If it has, Prometheus should wait for some in-flight requests to complete before issuing more.
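To make the proposal concrete, here is a minimal sketch in Go (not Prometheus code; the limit value and target URLs are hypothetical) of how such a limit could be enforced with a counting semaphore built from a buffered channel: a scrape acquires a slot before issuing its request and releases it when the request completes, so no more than the configured number of scrapes are ever in flight.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// scrapeAll scrapes every target URL, but never with more than `limit`
// requests in flight at once. A buffered channel acts as a counting
// semaphore: each scrape must acquire a slot before starting and releases
// it when the request completes.
func scrapeAll(targets []string, limit int, timeout time.Duration) {
	sem := make(chan struct{}, limit)
	client := &http.Client{Timeout: timeout}

	var wg sync.WaitGroup
	for _, url := range targets {
		wg.Add(1)
		go func(url string) {
			defer wg.Done()

			sem <- struct{}{}        // block until an in-flight slot is free
			defer func() { <-sem }() // release the slot when done

			resp, err := client.Get(url)
			if err != nil {
				fmt.Printf("scrape %s failed: %v\n", url, err)
				return
			}
			resp.Body.Close()
			fmt.Printf("scraped %s: %s\n", url, resp.Status)
		}(url)
	}
	wg.Wait()
}

func main() {
	targets := []string{
		"http://10.0.0.1:4191/metrics",
		"http://10.0.0.2:4191/metrics",
		"http://10.0.0.3:4191/metrics",
	}
	// Hypothetical scrape_concurrency_limit of 2: at most two scrapes in flight.
	scrapeAll(targets, 2, 5*time.Second)
}
```

If the limit is unset or at least as large as the number of targets, behaviour is unchanged; only when targets outnumber the limit do scrapes queue for a free slot.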
Use case. Why is this important?
Linkerd 2 deploys a Prometheus instance behind a proxy. This proxy places a limit on the number of routes (in this case, HTTP authorities) which can concurrently have requests in flight to them. When the limit is reached, the proxy will return errors. When the Prometheus instance attempts to scrape more targets than the proxy's route limit concurrently, the route cache bound is reached and subsequent scrape requests result in errors. Thus, the scrape is incomplete. If scrape concurrency could be limited, then routes whose requests have finished would be considered "inactive" by the proxy and could be evicted from the route cache, allowing new requests to succeed.
See linkerd/linkerd2#1234 and linkerd/linkerd2#1322 for more information on this specific use-case.