
Bound Concurrency of Pod Scraping (Probably with a Work Pool) #8377

Open
julz opened this issue Jun 19, 2020 · 29 comments
Labels
area/autoscale
kind/feature: Well-understood/specified features, ready for coding.
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
triage/accepted: Issues which should be fixed (post-triage)
Milestone: Icebox

Comments

@julz (Member) commented Jun 19, 2020

Describe the feature

We've discussed this a few times in Slack, but it seems worth having an issue for it and I didn't spot one, so here one is :).

As of right now we spawn pod scrapes in parallel for ~every revision, with no real bound on the number we'll attempt at once across revisions. This means we can potentially have an extremely large number of active outgoing scrapes, causing more contention than needed, exceeding keep-alive connection pool sizes and buffer pools, etc. In our larger environments, we see large numbers of connections waiting on DNS and GC when revisions*pods gets large.

We should probably introduce some sort of work pool for the scrapers.
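
A minimal sketch of the kind of work pool being proposed, assuming a fixed number of workers pulling scrape tasks from a channel so the global number of in-flight scrapes is bounded regardless of how many revisions exist. Names like scrapeTask and maxConcurrentScrapes are illustrative, not the actual autoscaler types:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// scrapeTask identifies one pod scrape for one revision (hypothetical type).
type scrapeTask struct {
	revision string
	podIP    string
}

func main() {
	// Global bound on in-flight scrapes, instead of one goroutine per pod per revision.
	const maxConcurrentScrapes = 64

	tasks := make(chan scrapeTask)
	var done int64
	var wg sync.WaitGroup

	// Fixed worker pool: at most maxConcurrentScrapes scrapes run at any moment.
	for i := 0; i < maxConcurrentScrapes; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range tasks {
				time.Sleep(10 * time.Millisecond) // stand-in for the real HTTP stat scrape
				atomic.AddInt64(&done, 1)
			}
		}()
	}

	// Producers (e.g. one per revision tick) enqueue work rather than spawning goroutines.
	for r := 0; r < 500; r++ {
		for p := 0; p < 16; p++ { // at most 16 pods scraped per revision
			tasks <- scrapeTask{revision: fmt.Sprintf("rev-%d", r), podIP: fmt.Sprintf("pod-%d", p)}
		}
	}
	close(tasks)
	wg.Wait()
	fmt.Println("completed scrapes:", done)
}
```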

@julz added the kind/feature label on Jun 19, 2020
@markusthoemmes (Contributor)

In the same vein, we should likely switch the probes to a workqueue-based solution (like the net-XXX probers) to be able to more easily control the lifecycle of the respective probes.
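
For reference, a rough sketch (not the actual prober code) of the workqueue-driven pattern used by controllers and the net-* probers, using k8s.io/client-go/util/workqueue: a bounded set of workers pulls keys from a rate-limited queue, and failed items can be re-enqueued with backoff.

```go
package main

import (
	"fmt"
	"sync"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Rate-limited queue: items re-added repeatedly get exponential backoff.
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	const workers = 8 // concurrency is bounded by the worker count, not by item count
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				key, shutdown := q.Get()
				if shutdown {
					return
				}
				// Stand-in for probing/scraping the revision identified by key.
				fmt.Println("processing", key)
				q.Forget(key) // success: reset backoff; on failure, q.AddRateLimited(key) retries
				q.Done(key)
			}
		}()
	}

	q.Add("default/my-revision")
	q.ShutDown() // already-enqueued items are still drained before workers exit
	wg.Wait()
}
```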

@vagababov (Contributor) commented Jun 19, 2020

For color: as pod count goes towards infinity, revisions*pods is exactly 16*revisions, since we scrape at most 16 pods per revision. In other words it's bounded in the number of pods, but not in the number of revisions; e.g. 500 revisions at the cap means up to 8,000 concurrent scrapes.

@vagababov (Contributor)

In the same vein, we should likely switch the probes to a workqueue-based solution (like the net-XXX probers) to be able to more easily control the lifecycle of the respective probes.

Here I am a bit on the fence, given that we probably have requests queueing up in the activator that we want to start forwarding as fast as possible. And until the pod is probed it receives no requests... so 🤷

@markusthoemmes (Contributor)

Not sure we're talking about the same thing, @vagababov? What does the activator have to do with pod probing in the autoscaler? 🤔

@vagababov (Contributor)

We only probe during scale to 0. If we consider 500 revisions all going to 0 at the same time, that's unfortunate, but it's much more of a corner case than just 500 revisions that we need to scrape :)

@vagababov (Contributor)

But we do a lot of probing in activator :)

@markusthoemmes (Contributor)

Okay, we clarified offline: in my comment about switching to workqueue-based stuff I meant metric scraping, not probing. Sorry for the confusion.

@skonto (Contributor) commented Jul 9, 2020

@julz @vagababov @markusthoemmes I could help here if possible.

@markusthoemmes (Contributor)

Yep, just know that this might be a somewhat bigger refactor of the current scraper code.

I think we first need to agree that we actually want work-pooling here, as it comes with the risk of setting the concurrency too low.

@julz (Member, Author) commented Jul 10, 2020

So how do we want to proceed on this? Would a feature proposal doc to hash out the details and any foreseeable edge cases be the best next step?

@markusthoemmes (Contributor)

I believe so @julz

@julz (Member, Author) commented Jul 10, 2020

Cool - @skonto, happy to work together on that if you like?

@skonto (Contributor) commented Jul 10, 2020

@julz sure let's do it :) I will start a doc and we can iterate on it. I will ping you.

@vagababov (Contributor)

Note that this has to be a configurable feature, since a user might wish to throw CPU at the AS to ensure enough resources to support more revisions.
Finally, I'd like us to think about this within the scope of the shared autoscaler that @yanweiguo is working on.
As in: will creating more autoscalers solve this problem?

@skonto (Contributor) commented Jul 13, 2020

@vagababov, as you probably noticed we (cc @julz @markusthoemmes) had a long discussion on Slack (autoscaling channel). Scaling out is absolutely one of the options we discussed, but within the context of one instance there should still be room for improvement. I will try to summarize here; what can be improved needs verification in any case.

The total max cost is max_scraped_pods_per_revision*number_of_revisions. The first factor is currently 16.

We could either lower the 16 or limit the two factors at any given time, e.g. control how much scraping we do or how many revisions are served.

Using a workqueue for limiting scraping has some other benefits besides limiting concurrency, for example, as @markusthoemmes mentioned:

That'd enable us to easily enable/disable scraping for a specific revision, which in turn enables us to dynamically disable scraping if the activator is put into the routing path.

But there are also challenges, as mentioned in #8610. Note though that we already use a workqueue for probing.
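
To make the enable/disable point concrete, a hypothetical sketch (not the current scraper code) of a delaying-workqueue loop in which each revision key re-enqueues its own next scrape tick; disabling scraping for a revision is then just a matter of not re-adding it. runScrapeLoop, shouldScrape and scrape are illustrative names:

```go
package scraping

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// runScrapeLoop is a hypothetical worker loop: each revision key re-enqueues its
// next tick with AddAfter, so scraping for a revision can be disabled (e.g. when
// the activator is on the routing path) simply by having shouldScrape return false.
func runScrapeLoop(q workqueue.DelayingInterface, shouldScrape func(rev string) bool, scrape func(rev string)) {
	for {
		item, shutdown := q.Get()
		if shutdown {
			return
		}
		rev := item.(string)
		if shouldScrape(rev) {
			scrape(rev)
			q.AddAfter(rev, time.Second) // schedule this revision's next scrape tick
		}
		q.Done(item)
	}
}
```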

In addition, the current max of 16 pods scraped per revision could be lowered if we had a smarter algorithm than relying on the normal distribution. Maybe a time-series algorithm could do better here with fewer pods used in scraping.

The key question, as discussed, is whether we should limit the number of concurrently served revisions or how much scraping we do per revision at any time. For the latter, a related issue is this one. In that issue the max keep-alive was increased, but that could lead to running out of file descriptors if we keep increasing it.

IMHO, from a UX and operations perspective we could make both the number of revisions served and the workqueue concurrency (or other options, if we go down that path) configurable.
In analogy to how HTTP servers limit requests, we can in any case set a global maximum per AS instance with a good default.
We could also think of more advanced concepts like QoS and priorities per group of revisions (not sure that concept exists at the moment).
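
On the keep-alive/file-descriptor trade-off mentioned above: in Go, the shared transport's idle-connection pool is what bounds reusable connections, so raising it trades fewer dials and DNS lookups for more open sockets. A sketch with illustrative numbers, not Knative's actual settings:

```go
package scraping

import (
	"net/http"
	"time"
)

// newScrapeClient returns an HTTP client with an explicitly sized keep-alive pool.
// Larger pools avoid re-dialing (and re-resolving DNS) under load, but every idle
// connection holds a file descriptor open.
func newScrapeClient() *http.Client {
	return &http.Client{
		Timeout: 3 * time.Second,
		Transport: &http.Transport{
			MaxIdleConns:        1000,             // total idle keep-alive connections
			MaxIdleConnsPerHost: 16,               // mirrors the per-revision scrape cap
			IdleConnTimeout:     90 * time.Second, // release descriptors held by idle conns
		},
	}
}
```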

@skonto (Contributor) commented Jul 20, 2020

@vagababov @julz should we move on with a workqueue approach, have a first version and take it from there?

@markusthoemmes (Contributor)

We might first want to think about the guarding measures we want to take. That would make accepting/denying this a no-brainer IMO.

How do we want to measure success/failure, or rather better/worse, once we do this?

@skonto (Contributor) commented Jul 20, 2020

@markusthoemmes from my POV, the definition of success would be to add configuration capabilities while keeping the same performance as in the unbounded case and avoiding corner cases like the one described in #8610.

@markusthoemmes (Contributor)

Hm, but wouldn't we want to show that implementing this generates an advantage over the existing way of doing things (which doesn't require hand-tuning either)?

We might want to produce a micro-benchmark that gives us an idea of the scale needed to reap benefits, and the scale at which things get pushed so far that they start to degrade.

@skonto (Contributor) commented Jul 20, 2020

The way I see this is that the workqueue imposes a rate-limiting strategy in order to optimize revision scraping throughput and latency. From an operations perspective, the current approach implicitly requires hand-tuning if you have to increase your connection pool size in large envs, and btw there is no control with bursty traffic.
Setting a configurable limit brings its own value, in my view, if it protects you from failures, keeps the system stable and helps with making progress on pending work, e.g. revision processing. For the latter we had the question of what matters first: revision progress or the scraping process in general.
So yes, we also need to show how the workqueue will help us without increasing connection pools in large envs, because if at the end of the day the unbounded approach makes progress and just queues up work, then who cares. However, I suspect there are already issues, e.g. GC, as stated in the description.

@vagababov (Contributor)

I would also want to see how StatefulSet AS scale-out works, since I don't want us to complicate the system if there's no need to.

@skonto (Contributor) commented Jul 21, 2020

@vagababov reading these benchmarks from the ants lib shows that unbounded goroutines (independently of scraping) are most likely not the best choice beyond a certain concurrency, e.g. 100K tasks in that benchmark. So I think we need a better approach anyway, because although scaling out would solve the problem, I suspect an optimized instance could lead to fewer replicas being needed.

@vagababov (Contributor)

100K goroutines is probably well above what we support per task in any case...

@skonto (Contributor) commented Jul 21, 2020

@julz describes an issue that occurs in large envs, so we need to clarify how many pods are being scraped when the issues occur. Btw, in theory, if we follow the guidelines for building large clusters, we could have thousands of pods, and the limit is 150K. However, the benchmarks I mentioned depend on the processing being done, which in that benchmark is a dummy function; I don't think the threshold is the same with a realistic workload. In the latter case the cost per scraping request could be many more milliseconds and could use more resources because of HTTP connections in exceeded keep-alive pools, etc.
I think a benchmark could shed more light on this, as also suggested by @markusthoemmes (assuming it's realistic enough).

@markusthoemmes (Contributor)

Yeah, let's write a Go benchmark (I don't think an E2E one is necessary) and see if we can find a break-even point between just spawning new goroutines and pooling them for our workloads.
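
A possible starting point for that benchmark (in a _test.go file) could look like the sketch below; simulateScrape is a placeholder, and the break-even point will depend on what the real scrape actually costs:

```go
package scraping

import (
	"sync"
	"testing"
	"time"
)

// simulateScrape stands in for a real pod scrape; the break-even point between
// the two strategies depends heavily on what this actually costs.
func simulateScrape() { time.Sleep(time.Millisecond) }

// BenchmarkUnboundedGoroutines models today's behaviour: one goroutine per scrape.
func BenchmarkUnboundedGoroutines(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var wg sync.WaitGroup
		for j := 0; j < 8000; j++ { // e.g. 500 revisions * 16 pods
			wg.Add(1)
			go func() {
				defer wg.Done()
				simulateScrape()
			}()
		}
		wg.Wait()
	}
}

// BenchmarkWorkerPool pushes the same work through a fixed pool of workers.
func BenchmarkWorkerPool(b *testing.B) {
	const workers = 64
	for i := 0; i < b.N; i++ {
		tasks := make(chan struct{})
		var wg sync.WaitGroup
		for w := 0; w < workers; w++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				for range tasks {
					simulateScrape()
				}
			}()
		}
		for j := 0; j < 8000; j++ {
			tasks <- struct{}{}
		}
		close(tasks)
		wg.Wait()
	}
}
```

Run with `go test -bench=.`. With a sleep-only stand-in the unbounded version will usually win, so the real scrape work (HTTP, DNS, allocations) is what has to be plugged in to see the crossover.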

@github-actions (bot)

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

The github-actions bot added the lifecycle/stale label on Oct 21, 2020
@nader-ziada (Member)

/reopen

/lifecycle frozen

The knative-prow bot reopened this on May 17, 2022
@knative-prow (bot) commented May 17, 2022

@nader-ziada: Reopened this issue.

In response to this:

/reopen

/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

The knative-prow bot added the lifecycle/frozen label and removed the lifecycle/stale label on May 17, 2022
@dprotaso (Member) commented Aug 2, 2022

/triage accepted
/milestone Icebox

The knative-prow bot added the triage/accepted label on Aug 2, 2022
@dprotaso added this to the Icebox milestone on Aug 2, 2022