
Add a limit to the number of in-flight requests that a server processes. #6207

Merged
1 commit merged into kubernetes:master on Apr 2, 2015

Conversation

@brendandburns (Contributor) commented Mar 31, 2015

@smarterclayton @alex-mohr @fabioy @quinton-hoole

Closes #5866

@smarterclayton (Contributor) commented Mar 31, 2015

What happens when all the inflight requests are consumed by watches? Starvation?

@brendandburns (Contributor, Author) commented Mar 31, 2015

Yes. I suppose we might want to have a separate counter for watches?

@brendandburns (Contributor, Author) commented Mar 31, 2015

Or we could just ignore watches for now. We managed to DoS an apiserver out of file descriptors by spamming it. That's bad :P, hence this PR.

@smarterclayton (Contributor) commented Mar 31, 2015

We are already talking about pushing some operations deeper into the apiserver, so if this stays high level for now, it will be easier to move it down once those other steps complete.

@@ -66,6 +66,22 @@ func ReadOnly(handler http.Handler) http.Handler {
})
}

// MaxInFlight limits the number of in flight requests to the number of
@ghost commented Mar 31, 2015

Typo?

@brendandburns (Contributor, Author) commented Mar 31, 2015

fixed.
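
For context when reading the diff snippets in this thread, here is a minimal sketch of a channel-based in-flight limiter of the kind this PR adds. The buffered channel acts as a counting semaphore; requests that cannot acquire a slot are rejected immediately. The exact signature and the error text are assumptions for illustration, not the PR's literal code.

package handlers

import "net/http"

// MaxInFlight allows at most cap(sem) requests to be in flight at once; the
// buffered channel acts as a counting semaphore and excess requests are
// rejected rather than queued. Sketch only: the PR's actual code differs in detail.
func MaxInFlight(sem chan bool, handler http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- true:
			defer func() { <-sem }()
			handler.ServeHTTP(w, r)
		default:
			http.Error(w, "too many requests in flight", http.StatusTooManyRequests)
		}
	})
}

Wiring it up as MaxInFlight(make(chan bool, 20), mux) gives the 20-request default discussed further down in this thread.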

@ghost commented Mar 31, 2015

How do we monitor the number of inflight requests over time? I'm guessing that the current answer is "we don't", in which case please add an issue requesting this, at least.

@vmarmol assigned ghost Mar 31, 2015
@vmarmol (Contributor) commented Mar 31, 2015

Assigning to @quinton-hoole, feel free to reassign.

@brendandburns (Contributor, Author) commented Mar 31, 2015

Comments addressed, and #6227 filed for the issue Quinton asked for.

@ghost commented Mar 31, 2015

Getting back to smarterclayton@'s comment above: with the low default of 20 in-flight requests that you've set in this PR, surely watch requests will pretty much always use up all of those, and we'll just serve errors for everything else? Am I missing something?

@smarterclayton (Contributor) commented Mar 31, 2015

Also, clients can open watches pretty easily, so suddenly our watch count could be 50-100.


@ghost commented Mar 31, 2015

One option would be to set the default limit closer to the default per-process file handle limit of 1024. A better solution, as you mentioned, would be to split the limits for watches and other stuff into separate counters.
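
A rough sketch of what that split might look like (the names and the 20/200 limits are made-up for illustration; this is not code from the PR):

package handlers

import "net/http"

// splitMaxInFlight keeps separate in-flight budgets for long-running requests
// (watches, proxies) and ordinary requests, so one class cannot starve the other.
// Illustrative only; limits and names are assumptions.
func splitMaxInFlight(isLongRunning func(*http.Request) bool, handler http.Handler) http.Handler {
	shortSem := make(chan bool, 20)  // ordinary requests
	longSem := make(chan bool, 200)  // watches, proxies, and other long-running requests
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		sem := shortSem
		if isLongRunning(r) {
			sem = longSem
		}
		select {
		case sem <- true:
			defer func() { <-sem }()
			handler.ServeHTTP(w, r)
		default:
			http.Error(w, "too many requests in flight", http.StatusTooManyRequests)
		}
	})
}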

@alex-mohr (Contributor) commented Mar 31, 2015

+1 to separate buckets for long-running requests vs. short ones. May also be worth having separate buckets for different operation types that have different dependencies -- e.g. reading from etcd vs. writing?

Is there a separate issue to call Go's equivalent of setrlimit(RLIMIT_NOFILE) and set it to something higher? File descriptors are relatively cheap and 1024 is tiny. See golang/go#5949 for sample code?
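
For reference, a minimal sketch of raising the file-descriptor limit from Go using the syscall package; the 4096 target is an arbitrary example, and whether the apiserver should actually do this is exactly the open question here.

package main

import (
	"fmt"
	"syscall"
)

// raiseNoFileLimit bumps the soft RLIMIT_NOFILE up to the requested target,
// capped at the hard limit. Illustrative only.
func raiseNoFileLimit(target uint64) error {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return err
	}
	if target > lim.Max {
		target = lim.Max
	}
	lim.Cur = target
	return syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim)
}

func main() {
	if err := raiseNoFileLimit(4096); err != nil {
		fmt.Println("failed to raise RLIMIT_NOFILE:", err)
	}
}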

@brendandburns (Contributor, Author) commented Mar 31, 2015

OK, added a regexp so watch and proxy requests aren't counted, and expanded the tests to cover that functionality.
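
In other words, the limiter is skipped for paths that look like long-running requests. A small illustrative check follows; the pattern here is a stand-in, not the exact regexp added in this PR (which appears in the diff below).

package handlers

import "regexp"

// longRunningRE is an illustrative pattern for long-running request paths;
// the PR's actual regexp differs.
var longRunningRE = regexp.MustCompile(`(/watch/|/proxy/)`)

// countsAgainstLimit reports whether a request path should consume an
// in-flight slot; watch and proxy requests are exempt.
func countsAgainstLimit(path string) bool {
	return !longRunningRE.MatchString(path)
}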

@brendandburns (Contributor, Author) commented Mar 31, 2015

Please take another look.

block.Add(1)
sem := make(chan bool, Iterations)

re := regexp.MustCompile("[.*\\/watch][^\\/proxy.*]")
@smarterclayton (Contributor) commented Mar 31, 2015

In v1beta3 we are not going to be using /watch - we'll be using a query param on the list response. I don't know where that leaves us.

@brendandburns (Contributor, Author) commented Apr 1, 2015

To fix the implementation so the test passes when that goes live?

@smarterclayton (Contributor) commented Apr 1, 2015

It's there now, it's just that no one is using it yet. It doesn't have to block this, but it's not clear to me how we'd tell that.

@brendandburns (Contributor, Author) commented Apr 1, 2015

Fixed gofmt issues (and installed the git hooks on my new machine... ;)

@ghost commented Apr 2, 2015

Travis is still whining about a build failure. I don't know enough about Travis yet to know whether or not we can trust its judgment. Other than that, I think this can be merged.

@zmerlynn (Member) commented Apr 2, 2015

Travis kicked.

@ghost commented Apr 2, 2015

Travis still complaining :-(

@ghost commented Apr 2, 2015

PS: zmerlynn@, when you say "Travis kicked", what exactly do you kick?

@smarterclayton (Contributor) commented Apr 2, 2015

There's a restart-build button in the Travis UI if you have commit rights on the underlying repo.


@brendandburns (Contributor, Author) commented Apr 2, 2015

Travis is now happy. I can haz mergez?

@brendandburns (Contributor, Author) commented Apr 2, 2015

(and you need to log in to Travis via GitHub)

ghost pushed a commit that referenced this issue Apr 2, 2015
Add a limit to the number of in-flight requests that a server processes.
@ghost merged commit 4a2000c into kubernetes:master Apr 2, 2015
4 checks passed
@timothysc (Member) commented Apr 6, 2015

Can folks please label this as perf for tracking?

@roberthbailey (Member) commented Apr 6, 2015

@timothysc done.

@timothysc (Member) commented Apr 6, 2015

@wojtek-t have you assessed the impact of this and #6203 yet?

@wojtek-t (Member) commented Apr 6, 2015

Sorry - we have a holiday in Poland today - will take a look tomorrow.

@fgrzadkowski (Contributor) commented Apr 7, 2015

A side effect of this PR is that the density test is creating pods much slower than before (~6 pods per second). For a 50-node cluster and an rc with 1500 replicas (30 pods per node), that is 300 seconds just to create the rc.

@fgrzadkowski (Contributor) commented Apr 7, 2015

Actually, my previous comment should be in #6203.

@timothysc (Member) commented Apr 7, 2015

@fgrzadkowski thanks, I figured this would be the case; we're currently working through issues with running load-balanced apiservers as a result.

@abhgupta, @smarterclayton, @rrati ^ FYI.

@jayunit100 (Member) commented Apr 7, 2015

Related?

  • I'm seeing very slow proxy returns in networking.go in my cluster.
  • However, I don't see this when I run locally (i.e. hack/local-up-cluster.sh), so these 6-second request latencies only happen in a real cluster.
INFO: Proxy status call returned in 6.015694382s
INFO: About to make a proxy status call
INFO: Proxy status call returned in 6.013796121s 
...

@timothysc (Member) commented Apr 7, 2015

The cluster response curves from this and #6203 are bad; imho this is not the correct route to take, or it should at least default to off until good defaults can be found. Time to fill a pool has regressed sharply.

[image]

~2.3 pods per second, which is a fairly drastic decrease from about a week ago.

@wojtek-t (Member) commented Apr 8, 2015

@timothysc - I agree. I observed a very similar scheduling rate, and indeed it's much worse than before this change.
