
load balancing: support creating additional connections when MAX_CONCURRENT_STREAM is reached #21386

Open
gkousik opened this issue Dec 5, 2019 · 17 comments


gkousik commented Dec 5, 2019

What version of gRPC and what language are you using?

1.24.3

What operating system (Linux, Windows,...) and version?

Linux - Ubuntu 14.04 (GCE instance)

What runtime / compiler are you using (e.g. python version or version of gcc)

Go - 1.13

What did you do?

If possible, provide a recipe for reproducing the error. Try being specific and include code snippets if helpful.

  1. Connect to the remote build execution endpoint (remotebuildexecution.googleapis.com:443) using github.com/bazelbuild/remote-apis-sdks (https://github.com/bazelbuild/remote-apis-sdks/blob/master/go/pkg/client/client.go#L185).
  2. Make ~500 concurrent requests to the Execute() streaming API with a "sleep 45" command over a single connection (as created by the Dial() code linked in step 1); a minimal sketch follows. API proto: https://github.com/bazelbuild/remote-apis/blob/master/build/bazel/remote/execution/v2/remote_execution.proto#L106
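
For illustration, a minimal, self-contained sketch of the pattern (not our actual client: the method path comes from the proto linked above, the TLS config is an assumption, and request/response handling is elided):

```go
package main

import (
	"context"
	"crypto/tls"
	"log"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func main() {
	// A single connection; all 500 RPCs below share it, so the server's
	// MAX_CONCURRENT_STREAMS setting (100 in our case) caps real parallelism.
	conn, err := grpc.Dial("remotebuildexecution.googleapis.com:443",
		grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{})))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	// Execute is a server-streaming method.
	desc := &grpc.StreamDesc{ServerStreams: true}
	var wg sync.WaitGroup
	for i := 0; i < 500; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// NewStream blocks once 100 streams are already in flight.
			stream, err := conn.NewStream(context.Background(), desc,
				"/build.bazel.remote.execution.v2.Execution/Execute")
			if err != nil {
				log.Printf("NewStream: %v", err)
				return
			}
			_ = stream // the real client sends an ExecuteRequest and drains responses
		}()
	}
	wg.Wait()
}
```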

What did you expect to see?

All 500 RPCs execute in parallel.

What did you see instead?

Only 100 RPCs run concurrently; the remaining 400 block until the first 100 finish running.

Anything else we should know about your project / environment?

  1. We previously reported a similar issue for the Java client in Use multiple connections to avoid the server's SETTINGS_MAX_CONCURRENT_STREAMS limit #11704. The workaround suggested there does not work for us now, since DNS resolution returns IPv6 addresses, which are not reachable from GCE instances (where our client runs): we get an "Immediate connect fail for 2a00:1450:400c:c06::93: Network is unreachable" error when we try to connect to them. (A Go sketch of that workaround follows this list.)
  2. The grpc-go client has a similar issue reported in its repository: Control MAX_CONCURRENT_STREAMS server-side and account for it on client-side grpc-go#2412. It also has a comment about the work this feature would involve in the Go client.
  3. Ideally we would like to avoid having to manage multiple connections (a connection pool) in our client.
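
For reference, the workaround from #11704 corresponds in grpc-go to enabling the round_robin policy, so the client dials every address the resolver returns. A rough sketch (assuming grpc-go's WithDefaultServiceConfig option; error handling kept minimal):

```go
package client

import (
	"crypto/tls"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

// dialRoundRobin sketches the #11704-style workaround: with round_robin the
// client opens one connection per resolved address, so total stream capacity
// scales with the number of addresses DNS returns. This is exactly what
// breaks for us when DNS returns only IPv6 addresses unreachable from GCE.
func dialRoundRobin(target string) *grpc.ClientConn {
	conn, err := grpc.Dial(target,
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
		grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{})))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	return conn
}
```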

This issue also blocks the Go client we are writing for our API from achieving high parallelism, which makes it less useful to our customers.


gkousik commented Dec 11, 2019

FYI @ola-rozenfeld @buchgr - In relation to #11704, I suppose that if you run Bazel on GCE (where it doesn't connect to IPv6 addresses), you will also see remote execution parallelism constrained to around 200 concurrent requests.


dfawley commented Dec 11, 2019

This would have to be a feature provided by all languages, not just Go.

cc @ejona @markdroth

dfawley changed the title from "Grpc-go client - support creating additional connections when MAX_CONCURRENT_STREAM is reached" to "load balancing: support creating additional connections when MAX_CONCURRENT_STREAM is reached" on Dec 11, 2019

ejona86 commented Dec 11, 2019

I expect #7957 is really the appropriate tracking issue for this. Generally when you want multiple connections there is a proxy involved.

We've considered this in the past, but there are some complexities. In particular, when MAX_CONCURRENT_STREAMS is used to signal that a backend is overloaded (as C core can do), it is unclear what actions are appropriate. Creating more connections in that case is clearly a bad idea. Sending traffic to other backends when using round_robin may be fine, but there's also risk there.

For a case going through a proxy, using multiple connections is good and proper, but the client isn't aware of the server's topology. Even then, there should clearly be a limit: it is very normal for clients to have bursty workloads that create thousands or tens of thousands of RPCs all at once.


ejona86 commented Dec 11, 2019

Oh, and to be clear, the solution to date from my perspective is "create multiple Channels." Although I also understand that is much easier to do in Java than in many other languages, since you can hide the behavior behind the Channel interface and use stubs like normal (instead of needing to round-robin across multiple stubs).
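
In Go the nearest analog (a rough sketch, not an official API: recent grpc-go versions let generated stubs accept any grpc.ClientConnInterface, and dialing and pool sizing are left to the caller) might look like:

```go
package pool

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc"
)

// connPool spreads RPCs across several underlying ClientConns. Because it
// implements grpc.ClientConnInterface, it can be handed directly to generated
// stub constructors, hiding the fan-out much like a custom Java Channel would.
type connPool struct {
	conns []*grpc.ClientConn
	next  uint64
}

// pick returns the next connection in round-robin order.
func (p *connPool) pick() *grpc.ClientConn {
	n := atomic.AddUint64(&p.next, 1)
	return p.conns[n%uint64(len(p.conns))]
}

func (p *connPool) Invoke(ctx context.Context, method string, args, reply interface{}, opts ...grpc.CallOption) error {
	return p.pick().Invoke(ctx, method, args, reply, opts...)
}

func (p *connPool) NewStream(ctx context.Context, desc *grpc.StreamDesc, method string, opts ...grpc.CallOption) (grpc.ClientStream, error) {
	return p.pick().NewStream(ctx, desc, method, opts...)
}
```

A stub built over the pool (e.g. NewExecutionClient(&connPool{conns: conns}), names hypothetical) then round-robins transparently; note this picks a connection per RPC rather than per available stream, so it only statistically stays under the limit.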


stale bot commented May 6, 2020

This issue/PR has been automatically marked as stale because it has not had any update (including commits, comments, labels, milestones, etc.) for 30 days. It will be closed automatically if no further update occurs in 7 days. Thank you for your contributions!


ajmath commented Oct 21, 2021

Oh, and to be clear, the solution to date from my perspective is "create multiple Channels." Although I also understand that is much easier to do in Java than in many other languages, since you can hide the behavior behind the Channel interface and use stubs like normal (instead of needing to round-robin across multiple stubs).

@ejona86 Do you know of any public examples of someone doing this with the Java API? I'd also love to hear about examples in the other languages if you know of any.

Edit:
Found this example in bazel.


ejona86 commented Oct 21, 2021

@ajmath, there's also https://github.com/googleapis/gax-java/blob/main/gax-grpc/src/main/java/com/google/api/gax/grpc/ChannelPool.java (ignore the "refreshing" piece of it). Note that if you choose to implement just Channel instead of ManagedChannel, it becomes even easier.

@patrickfreed

@dfawley mentioned there was an effort to create a cross-language design for this functionality; are there any updates on that?


dfawley commented Nov 1, 2022

@patrickfreed We do have a design proposal that was mostly agreed upon internally, but the effort to implement it was deprioritized since our primary use case no longer required it. Long-term it is still something we'd like to do. @ejona86 @markdroth what do you think about the priority of this vs. our other projects in the next few months? Should I write up a gRFC for the design even if we don't have resources assigned to implement it?


ejona86 commented Nov 3, 2022

gRFC without implementation resources doesn't seem all that helpful. Maybe just externalize the internal doc (copy the contents to your personal account and share it)? That lets people determine whether the solution would help them, gives them an idea of the work involved, and serves as a guide for any would-be contributors who want to step up.

@markdroth

Didn't we say that the internal design we had been evaluating wasn't going to work in the xDS case? If so, I don't think we'd actually want to pursue that anyway.

@patrickfreed

gRFC without implementation resources doesn't seem all that helpful. Maybe just externalize the internal doc (copy the contents to your personal account and share it)? That lets people determine whether the solution would help them, gives them an idea of the work involved, and serves as a guide for any would-be contributors who want to step up.

Either one would be super helpful and much appreciated.

Didn't we say that the internal design we had been evaluating wasn't going to work in the xDS case? If so, I don't think we'd actually want to pursue that anyway.

I think this feature is still very useful for the L4 load balancer case (#7957).

@markdroth

I think this feature is still very useful for the L4 load balancer case (#7957).

I do understand that, but we don't want to have to support two mechanisms for the same thing. Ultimately, we will need a mechanism that solves this problem for the xDS case, so if we introduce this mechanism now for the L4 case and then need to introduce a separate mechanism later for the xDS case, then we're stuck supporting two mechanisms for the same thing. I would prefer not to introduce a mechanism for this until we know that it will solve all the use cases that we care about.

@patrickfreed

I think it would still be valuable if the internal doc were released publicly, since it's possible the community could help brainstorm ideas for a solution that accommodates the xDS case too.

@bogdan-patraucean

Any updates on this?

@vincentyl

We're seeing the same issue in Python as well.


bgdnvk commented Apr 8, 2024

Bump for visibility: I'm really interested in this and found this issue from the docs page.

From the gRPC performance notes:

Side note: The gRPC team has plans to add a feature to fix these performance issues (see grpc/grpc#21386 for more info), so any solution involving creating multiple channels is a temporary workaround that should eventually not be needed.
