
Bound Concurrency of Pod Scraping (Probably with a Work Pool) #8377

Open
julz opened this issue Jun 19, 2020 · 29 comments
Labels
area/autoscale
kind/feature: Well-understood/specified features, ready for coding.
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
triage/accepted: Issues which should be fixed (post-triage)
Milestone: Icebox

Comments

@julz (Member) commented Jun 19, 2020

Describe the feature

We've discussed this a few times in Slack, but it seems worth having an issue for it and I didn't spot one, so here one is :).

As of right now we spawn pod scrapes in parallel for ~every revision, with no real bound on the number we'll attempt at once across revisions. This means we can potentially have an extremely large number of active outgoing scrapes, causing more contention than needed, exceeding keep-alive connection pool sizes and buffer pools, etc. In our larger environments, we see large numbers of connections waiting on DNS and GC when revisions*pods gets large.

We should probably introduce some sort of work pool for the scrapers.
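
A minimal sketch of the kind of work pool being proposed, assuming a fixed number of workers pulling scrape tasks from a channel so the global number of in-flight scrapes is bounded regardless of how many revisions exist. Names like scrapeTask and maxConcurrentScrapes are illustrative, not the actual autoscaler types:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// scrapeTask identifies one pod scrape for one revision (hypothetical type).
type scrapeTask struct {
	revision string
	podIP    string
}

func main() {
	// Global bound on in-flight scrapes, instead of one goroutine per pod per revision.
	const maxConcurrentScrapes = 64

	tasks := make(chan scrapeTask)
	var done int64
	var wg sync.WaitGroup

	// Fixed worker pool: at most maxConcurrentScrapes scrapes run at any moment.
	for i := 0; i < maxConcurrentScrapes; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range tasks {
				time.Sleep(10 * time.Millisecond) // stand-in for the real HTTP stat scrape
				atomic.AddInt64(&done, 1)
			}
		}()
	}

	// Producers (e.g. one per revision tick) enqueue work rather than spawning goroutines.
	for r := 0; r < 500; r++ {
		for p := 0; p < 16; p++ { // at most 16 pods scraped per revision
			tasks <- scrapeTask{revision: fmt.Sprintf("rev-%d", r), podIP: fmt.Sprintf("pod-%d", p)}
		}
	}
	close(tasks)
	wg.Wait()
	fmt.Println("completed scrapes:", done)
}
```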

@julz added the kind/feature label on Jun 19, 2020
@markusthoemmes (Contributor)

In the same vein, we should likely switch the probes to a workqueue-based solution (like the net-XXX probers) to be able to more easily control the lifecycle of the respective probes.
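
For reference, a rough sketch (not the actual prober code) of the workqueue-driven pattern used by controllers and the net-* probers, using k8s.io/client-go/util/workqueue: a bounded set of workers pulls keys from a rate-limited queue, and failed items can be re-enqueued with backoff.

```go
package main

import (
	"fmt"
	"sync"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Rate-limited queue: items re-added repeatedly get exponential backoff.
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	const workers = 8 // concurrency is bounded by the worker count, not by item count
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				key, shutdown := q.Get()
				if shutdown {
					return
				}
				// Stand-in for probing/scraping the revision identified by key.
				fmt.Println("processing", key)
				q.Forget(key) // success: reset backoff; on failure, q.AddRateLimited(key) retries
				q.Done(key)
			}
		}()
	}

	q.Add("default/my-revision")
	q.ShutDown() // already-enqueued items are still drained before workers exit
	wg.Wait()
}
```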

@vagababov (Contributor) commented Jun 19, 2020

For color: as pod count goes towards infinity, revisions*pods is exactly 16*revisions, since we scrape at most 16 pods per revision. In other words it's bounded in the number of pods, but not in the number of revisions; e.g. 500 revisions at the cap means up to 8,000 concurrent scrapes.

@vagababov (Contributor)

In the same vein, we should likely switch the probes to a workqueue-based solution (like the net-XXX probers) to be able to more easily control the lifecycle of the respective probes.

Here I am a bit on the fence, given that we probably have requests queueing up in the activator that we want to start forwarding as fast as possible. And until the pod is probed it receives no requests... so 🤷

@markusthoemmes (Contributor)

Not sure we're talking about the same thing, @vagababov? What does the activator have to do with pod probing in the autoscaler? 🤔

@vagababov (Contributor)

We only probe during scale to 0. If we consider 500 revisions all going to 0 at the same time, that's unfortunate, but it's much more of a corner case than just 500 revisions that we need to scrape :)

@vagababov (Contributor)

But we do a lot of probing in activator :)

@markusthoemmes (Contributor)

Okay, we clarified offline: in my comment about switching to workqueue-based stuff I meant metric scraping, not probing. Sorry for the confusion.

@skonto (Contributor) commented Jul 9, 2020

@julz @vagababov @markusthoemmes I could help here if possible.

@markusthoemmes (Contributor)

Yep, just know that this might be a somewhat bigger refactor of the current scraper code.

I think we first need to agree that we actually want work-pooling here, as it comes with the risk of setting the concurrency too low.

@julz (Member, Author) commented Jul 10, 2020

So how do we want to proceed on this? Would a feature proposal doc to hash out the details and any foreseeable edge cases be the best next step?

@markusthoemmes (Contributor)

I believe so @julz

@julz (Member, Author) commented Jul 10, 2020

Cool - @skonto, happy to work together on that if you like?

@skonto (Contributor) commented Jul 10, 2020

@julz sure let's do it :) I will start a doc and we can iterate on it. I will ping you.

@vagababov (Contributor)

Note that this has to be a configurable feature, since a user might wish to throw CPU at the AS to ensure enough resources to support more revisions.
Finally, I'd like us to think about this within the scope of the shared autoscaler that @yanweiguo is working on.
As in: will creating more autoscalers solve this problem?

@skonto (Contributor) commented Jul 13, 2020

@vagababov, as you probably noticed we (cc @julz @markusthoemmes) had a long discussion on Slack (autoscaling channel). Scaling out is absolutely one of the options we discussed, but within the context of one instance there should still be room for improvement. I will try to summarize here; what can be improved needs verification in any case.

The total max cost is max_scraped_pods_per_revision*number_of_revisions. The first factor is currently 16.

We could either lower the 16 or limit the two factors at any given time, e.g. control how much scraping we do or how many revisions are served.

Using a workqueue for limiting scraping has some other benefits besides limiting concurrency, for example, as @markusthoemmes mentioned:

That'd enable us to easily enable/disable scraping for a specific revision, which in turn enables us to dynamically disable scraping if the activator is put into the routing path.

But there are also challenges, as mentioned in #8610. Note though that we already use a workqueue for probing.
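
To make the enable/disable point concrete, a hypothetical sketch (not the current scraper code) of a delaying-workqueue loop in which each revision key re-enqueues its own next scrape tick; disabling scraping for a revision is then just a matter of not re-adding it. runScrapeLoop, shouldScrape and scrape are illustrative names:

```go
package scraping

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// runScrapeLoop is a hypothetical worker loop: each revision key re-enqueues its
// next tick with AddAfter, so scraping for a revision can be disabled (e.g. when
// the activator is on the routing path) simply by having shouldScrape return false.
func runScrapeLoop(q workqueue.DelayingInterface, shouldScrape func(rev string) bool, scrape func(rev string)) {
	for {
		item, shutdown := q.Get()
		if shutdown {
			return
		}
		rev := item.(string)
		if shouldScrape(rev) {
			scrape(rev)
			q.AddAfter(rev, time.Second) // schedule this revision's next scrape tick
		}
		q.Done(item)
	}
}
```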

In addition, the current max of 16 pods scraped per revision could be lowered if we had a smarter algorithm than relying on the normal distribution. Maybe a time-series algorithm could do better here with fewer pods used in scraping.

The key question, as discussed, is whether we should limit the number of concurrently served revisions or how much scraping we do per revision at any time. For the latter, a related issue is this one. In that issue the max keep-alive was increased, but that could lead to running out of file descriptors if we keep increasing it.

IMHO, from a UX and operations perspective we could make both the number of revisions served and the workqueue concurrency (or other options, if we go down that path) configurable.
In analogy to how HTTP servers limit requests, we can in any case set a global maximum per AS instance with a good default.
We could also think of more advanced concepts like QoS and priorities per group of revisions (not sure that concept exists at the moment).
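
On the keep-alive/file-descriptor trade-off mentioned above: in Go, the shared transport's idle-connection pool is what bounds reusable connections, so raising it trades fewer dials and DNS lookups for more open sockets. A sketch with illustrative numbers, not Knative's actual settings:

```go
package scraping

import (
	"net/http"
	"time"
)

// newScrapeClient returns an HTTP client with an explicitly sized keep-alive pool.
// Larger pools avoid re-dialing (and re-resolving DNS) under load, but every idle
// connection holds a file descriptor open.
func newScrapeClient() *http.Client {
	return &http.Client{
		Timeout: 3 * time.Second,
		Transport: &http.Transport{
			MaxIdleConns:        1000,             // total idle keep-alive connections
			MaxIdleConnsPerHost: 16,               // mirrors the per-revision scrape cap
			IdleConnTimeout:     90 * time.Second, // release descriptors held by idle conns
		},
	}
}
```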

@skonto (Contributor) commented Jul 20, 2020

@vagababov @julz should we move on with a workqueue approach, have a first version and take it from there?

@markusthoemmes (Contributor)

We might first want to think about the guarding measures we want to take. That would make accepting/denying this a no-brainer IMO.

How do we want to measure success/failure, or rather better/worse, once we do this?

@skonto (Contributor) commented Jul 20, 2020

@markusthoemmes from my POV, the definition of success would be to add configuration capabilities while keeping the same performance as in the unbounded case and avoiding corner cases like the one described in #8610.

@markusthoemmes (Contributor)

Hm, but wouldn't we want to show that implementing this generates an advantage over the existing way of doing things (which doesn't require hand-tuning either)?

We might want to produce a micro-benchmark that gives us an idea of the scale needed to reap benefits, and the scale at which things get pushed so far that they start to degrade.

@skonto (Contributor) commented Jul 20, 2020

The way I see this is that the workqueue imposes a rate-limiting strategy in order to optimize revision scraping throughput and latency. From an operations perspective, the current approach implicitly requires hand-tuning if you have to increase your connection pool size in large envs, and btw there is no control with bursty traffic.
Setting a configurable limit brings its own value, in my view, if it protects you from failures, keeps the system stable and helps with making progress on pending work, e.g. revision processing. For the latter we had the question of what matters first: revision progress or the scraping process in general.
So yes, we also need to show how the workqueue will help us without increasing connection pools in large envs, because if at the end of the day the unbounded approach makes progress and just queues up work, then who cares. However, I suspect there are already issues, e.g. GC, as stated in the description.

@vagababov (Contributor)

I would also want to see how StatefulSet AS scale-out works, since I don't want us to complicate the system if there's no need to.

@skonto (Contributor) commented Jul 21, 2020

@vagababov reading these benchmarks from the ants lib shows that unbounded goroutines (independently of scraping) are most likely not the best choice beyond a certain concurrency, e.g. 100K tasks in that benchmark. So I think we need a better approach anyway, because although scaling out would solve the problem, I suspect an optimized instance could lead to fewer replicas being needed.

@vagababov (Contributor)

100K goroutines is probably well above what we support per task in any case...

@skonto (Contributor) commented Jul 21, 2020

@julz describes an issue that occurs in large envs, so we need to clarify how many pods are being scraped when the issues occur. Btw, in theory, if we follow the guidelines for building large clusters, we could have thousands of pods, and the limit is 150K. However, the benchmarks I mentioned depend on the processing being done, which in that benchmark is a dummy function; I don't think the threshold is the same with a realistic workload. In the latter case the cost per scraping request could be many more milliseconds and could use more resources because of HTTP connections in exceeded keep-alive pools, etc.
I think a benchmark could shed more light on this, as also suggested by @markusthoemmes (assuming it's realistic enough).

@markusthoemmes (Contributor)

Yeah, let's write a Go benchmark (I don't think an E2E one is necessary) and see if we can find a break-even point between just spawning new goroutines and pooling them for our workloads.
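
A possible starting point for that benchmark (in a _test.go file) could look like the sketch below; simulateScrape is a placeholder, and the break-even point will depend on what the real scrape actually costs:

```go
package scraping

import (
	"sync"
	"testing"
	"time"
)

// simulateScrape stands in for a real pod scrape; the break-even point between
// the two strategies depends heavily on what this actually costs.
func simulateScrape() { time.Sleep(time.Millisecond) }

// BenchmarkUnboundedGoroutines models today's behaviour: one goroutine per scrape.
func BenchmarkUnboundedGoroutines(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var wg sync.WaitGroup
		for j := 0; j < 8000; j++ { // e.g. 500 revisions * 16 pods
			wg.Add(1)
			go func() {
				defer wg.Done()
				simulateScrape()
			}()
		}
		wg.Wait()
	}
}

// BenchmarkWorkerPool pushes the same work through a fixed pool of workers.
func BenchmarkWorkerPool(b *testing.B) {
	const workers = 64
	for i := 0; i < b.N; i++ {
		tasks := make(chan struct{})
		var wg sync.WaitGroup
		for w := 0; w < workers; w++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				for range tasks {
					simulateScrape()
				}
			}()
		}
		for j := 0; j < 8000; j++ {
			tasks <- struct{}{}
		}
		close(tasks)
		wg.Wait()
	}
}
```

Run with `go test -bench=.`. With a sleep-only stand-in the unbounded version will usually win, so the real scrape work (HTTP, DNS, allocations) is what has to be plugged in to see the crossover.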

@github-actions (bot)

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

The github-actions bot added the lifecycle/stale label on Oct 21, 2020
@nader-ziada (Member)

/reopen

/lifecycle frozen

The knative-prow bot reopened this on May 17, 2022
@knative-prow (bot) commented May 17, 2022

@nader-ziada: Reopened this issue.

In response to this:

/reopen

/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

The knative-prow bot added the lifecycle/frozen label and removed the lifecycle/stale label on May 17, 2022
@dprotaso (Member) commented Aug 2, 2022

/triage accepted
/milestone Icebox

The knative-prow bot added the triage/accepted label on Aug 2, 2022
@dprotaso added this to the Icebox milestone on Aug 2, 2022