DNS storm with round robin load balancing in grpc-js #2023

Closed
iscopetta opened this issue Jan 13, 2022 · 15 comments

@iscopetta

Problem description

We are having some problems with the load balancing in grpc-js.

We are seeing an uneven distribution of calls to our service pods, which sometimes ends up overloading some of them while keeping others at very low load.

We think this might be because of the default "pick first" load balancing strategy, so we tried enabling round robin, but this caused a couple of issues:

  • The distribution, albeit more consistent, was still not uniform
  • The pods where the client was running started using 3 times the CPU and created a flurry of requests to our DNS

Any ideas how we could address this uneven distribution issue, and what could be wrong with load balancing?


Reproduction steps

Our (singleton) clients get instantiated with the DNS address of the service.
The DNS returns the IP of all the available pods for the given service.
We enable round robin load balancing by providing this configuration to the client:


'grpc.service_config': JSON.stringify({ loadBalancingConfig: [{ round_robin: {} }], })

There was no other change to the clients besides the lb config.
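
For reference, here is a minimal sketch of how that option is wired into the client constructor (the proto file, package/service names, and target below are placeholders, not our actual code):

const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// Load the service definition (placeholder proto).
const packageDefinition = protoLoader.loadSync('my_service.proto');
const proto = grpc.loadPackageDefinition(packageDefinition);

// Singleton client pointed at the service's DNS name; the only change is the LB config.
const client = new proto.mypackage.MyService(
  'dns:///my-service.my-namespace.svc.cluster.local:50051',
  grpc.credentials.createInsecure(),
  { 'grpc.service_config': JSON.stringify({ loadBalancingConfig: [{ round_robin: {} }] }) }
);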

Environment

  • OS name, version and architecture: Debian GNU/Linux 10 (buster) x86
  • Node version: 14.17.6
  • Node installation method: yarn
  • Package name and version: @grpc/grpc-js 1.4.5

Additional context

When we tried to deploy the mentioned config change, this is the behavior we saw (the baseline is for ~100 pods, while the spike is for just 4 canary pods where a single client configuration was changed):

CPU:
[screenshot: CPU usage spike on the canary pods]

DNS requests:
[screenshot: spike in DNS request rate]

@murgatroid99
Member

I think I see what is happening here: the clients are failing to connect to some of the addresses returned by the DNS. Those connection failures trigger DNS re-resolution attempts, which do not back off in this situation. The lack of a backoff here is a bug that I will fix. The connection failures would also explain the uneven request distribution.

You can get logs with more information about what is happening here by setting the environment variables GRPC_TRACE=channel,round_robin,subchannel,dns_resolver and GRPC_VERBOSITY=DEBUG.

@iscopetta
Author

iscopetta commented Jan 13, 2022

Will try to gather some extra logging like you suggested and see if we can add some information here, thank you!

What I am wondering is why we would have connection failures in the first place.
We saw that sometimes with the default config: a single pod gets saturated until it stops responding, and we see a big spike of ERROR 14 calls, which ends up triggering our circuit breakers.

But with a proper round robin we should be hitting them all more evenly, which we also did not see happen.

For example, when we did a different trial enabling round robin, this is what we saw on the server side (calls per client):
[screenshot: per-client call rates seen on the server side]

The load was a lot more consistent, but still with a big disparity of ~20 between them (enough that for services with lower qps we saw some pods constantly at ~20 qps while others were at 0).

@murgatroid99
Member

I don't know anything about your network setup, so I have no information about why you would have connection failures.

It's not clear to me what that graph shows, but it's hard to evaluate the behavior of the round robin algorithm here because the connection failures will cause unevenness: each client will send all of its traffic only to the servers it is connected to.

@murgatroid99
Member

I have published grpc-js 1.5.1 with some throttling on DNS requests. Can you try that out and see what impact it has?

@MichaelSitter

Thanks for the quick response! I will have some cycles early next week to test.

@MichaelSitter

MichaelSitter commented Jan 19, 2022

I managed to do a little data collection today with 1.5.1. I can't easily share any raw log data, but what I am seeing is a ton of logs from subchannels entering the transient failure state:
D 2022-01-19T00:56:10.560Z | subchannel | (43986) 172.30.59.92:50051 READY -> TRANSIENT_FAILURE

Makes sense to see that with the extra DNS lookups happening. Something else I'm seeing a lot of logs for is this pick result line here:

D 2022-01-19T01:05:46.270Z | channel | (11) dns:///<host> Pick result: QUEUE subchannel: undefined status: undefined undefined

Is this just broken logging or a smoking gun?

Edit: ah, I think that might be a red herring. Those are for clients which aren't configured to use the round robin picker.

@murgatroid99
Member

The first log indicates that a connection is dropped. In most cases, I would expect to see a nearby log with some error details, but it's not guaranteed. The second log is more of an internal debugging log. It just indicates that there is a pending request but it is not starting yet because a connection has not yet been established. It's not broken, just kind of lazy. The undefineds have other values when QUEUE is something else.

A large number of the READY -> TRANSIENT_FAILURE logs indicates that connections are getting dropped and reestablished frequently. Because of differences in how pick first and round robin manage connections, I would expect round robin to trigger more DNS requests as a result. This is probably the primary problem here: that connections are getting created and destroyed very frequently for some reason.

Also, have you checked the DNS requests graph after upgrading to version 1.5.1? It should have a lower peak.

@MichaelSitter

Yeah, can confirm lower peak. So thank you for that fix! I'm getting a little more input from our compute team, and it sounds like in our case part of the issue might be the upstream services. Apparently there is a... pattern? ...of servers sending a GOAWAY to prevent sticky connections to pods which are no longer part of an active deployment. What happens to a subchannel in that case? I'm guessing it gets recreated and the host needs to be re-resolved.

@murgatroid99
Member

I don't think the GOAWAYs are the issue. Those should show up in the logs as READY -> IDLE transitions in the subchannel state. But yes, when that happens, the client does name resolution again and connects again.

If you search your logs for a single subchannel ID ("43986" in that first log line, for example), you can see the full lifecycle for a subchannel. If you look at a few random subchannels that have a READY -> TRANSIENT_FAILURE transition, you can see how long they were connected by looking for the timestamps of the CONNECTING -> READY and READY -> TRANSIENT_FAILURE transitions. Does the connection duration match what you would expect for getting rid of sticky connections?

@MichaelSitter

MichaelSitter commented Jan 19, 2022

So I filtered my logs for a few subchannels; in some cases I seem to be missing the subchannel initialization, but they all seem to behave similarly. I'm a little confused why I'm seeing different IPs for the same subchannel ID; maybe that's a side effect of the service using cluster mode?

D 2022-01-19T01:07:36.579Z | subchannel | (132658) 172.30.145.36:50051 READY -> TRANSIENT_FAILURE
D 2022-01-19T01:07:36.602Z | subchannel | (132658) 172.30.145.36:50051 TRANSIENT_FAILURE -> IDLE
D 2022-01-19T01:07:43.347Z | subchannel | (132658) 172.30.115.248:50051 Subchannel constructed with options {
D 2022-01-19T01:07:44.761Z | subchannel | (132658) 172.30.115.248:50051 CONNECTING -> READY
D 2022-01-19T01:07:59.161Z | subchannel | (132658) 172.30.183.244:50051 TRANSIENT_FAILURE -> IDLE
D 2022-01-19T01:07:59.258Z | subchannel | (132658) 172.30.8.222:50051 READY -> TRANSIENT_FAILURE
D 2022-01-19T01:07:59.445Z | subchannel | (132658) 172.30.8.222:50051 TRANSIENT_FAILURE -> IDLE

The other notable observation I made was that if I filter for a single IP address, I'm seeing a lot of hits for READY -> TRANSIENT_FAILURE. In the attached sample, 104 of 671 log lines show that state transition. Is there any debug logging that would help me understand why those events are happening?

grpc-js IP logs - grpc-logs.csv

Edit: another finding. I can see a very similar number of events if I filter for either of the following:

  • connection closed by GOAWAY with code 0
  • READY -> TRANSIENT_FAILURE

Second edit: I'm realizing one thing that is complicating this investigation: I'm getting debug logs for clients which are still using pick-first. Currently we've only enabled RRLB for a single client integration, so I think my earlier findings are probably a little flawed. 🤦

@MichaelSitter

MichaelSitter commented Jan 25, 2022

I had a chance to run another round of testing and have a clearer picture of what is happening. Here are the trace logs for a single subchannel of a client configured to use round robin LB:

D 2022-01-25T04:25:03.356Z | subchannel | (1577) 172.30.140.27:50051 connection closed by GOAWAY with code 0
D 2022-01-25T04:25:03.356Z | subchannel | (1577) 172.30.140.27:50051 READY -> IDLE
D 2022-01-25T04:25:03.356Z | subchannel | (1577) 172.30.140.27:50051 IDLE -> CONNECTING
D 2022-01-25T04:25:03.356Z | subchannel | (1577) 172.30.140.27:50051 creating HTTP/2 session
D 2022-01-25T04:25:03.357Z | subchannel | (1577) 172.30.140.27:50051 CONNECTING -> READY
D 2022-01-25T04:25:05.357Z | subchannel | (1577) 172.30.140.27:50051 connection closed by GOAWAY with code 0
D 2022-01-25T04:25:05.358Z | subchannel | (1577) 172.30.140.27:50051 READY -> IDLE
D 2022-01-25T04:25:05.358Z | subchannel | (1577) 172.30.140.27:50051 IDLE -> CONNECTING
D 2022-01-25T04:25:05.358Z | subchannel | (1577) 172.30.140.27:50051 creating HTTP/2 session
D 2022-01-25T04:25:05.359Z | subchannel | (1577) 172.30.140.27:50051 CONNECTING -> READY
D 2022-01-25T04:25:07.359Z | subchannel | (1577) 172.30.140.27:50051 connection closed by GOAWAY with code 0
D 2022-01-25T04:25:07.359Z | subchannel | (1577) 172.30.140.27:50051 READY -> IDLE
D 2022-01-25T04:25:07.360Z | subchannel | (1577) 172.30.140.27:50051 IDLE -> CONNECTING
D 2022-01-25T04:25:07.360Z | subchannel | (1577) 172.30.140.27:50051 creating HTTP/2 session
D 2022-01-25T04:25:07.362Z | subchannel | (1577) 172.30.140.27:50051 CONNECTING -> READY
D 2022-01-25T04:25:09.363Z | subchannel | (1577) 172.30.140.27:50051 connection closed by GOAWAY with code 0

So basically every 2ish seconds this subchannel gets a GOAWAY and has to reconnect. And here is what the dns_resolver traces show:

D 2022-01-25T04:37:50.031Z | dns_resolver | Resolution update requested for target dns:///*****
D 2022-01-25T04:37:50.032Z | dns_resolver | Resolution update requested for target dns:///*****
D 2022-01-25T04:37:50.033Z | dns_resolver | Resolution update requested for target dns:///*****
D 2022-01-25T04:37:50.034Z | dns_resolver | Resolution update requested for target dns:///*****
D 2022-01-25T04:37:50.037Z | dns_resolver | Resolution update requested for target dns:///*****
D 2022-01-25T04:37:50.047Z | dns_resolver | Resolved addresses for target dns:///*****: [...]
D 2022-01-25T04:37:50.047Z | dns_resolver | Resolved addresses for target dns:///*****: [...]
D 2022-01-25T04:37:50.047Z | dns_resolver | Resolved addresses for target dns:///*****: [...]

So it looks like the clients are basically running DNS resolutions non-stop. I said originally that the connection backoff patch helped reduce the peaks, but that doesn't seem to be the case. Even a small number of clients which aren't handling traffic are generating a huge number of DNS lookups. For this particular deployment I was testing with, we have 2 pods, each running 4 worker processes, and 128 upstream hosts. According to our internal DNS metrics this was generating ~1.1k DNS lookups/sec.

Does a subchannel going from IDLE -> CONNECTING cause a DNS re-resolution to happen? Any insight into why the pick-first strategy doesn't cause this same behavior?

Edit: OK, I found the line. The RRLB state transition listener runs requestReresolution whenever the new state is IDLE:

this.channelControlHelper.requestReresolution();

This confirms what you originally said: subchannels keep getting GOAWAYs, which causes them to go IDLE, which triggers the DNS re-resolution. Is the idea here that when a subchannel goes idle, the host list needs to be updated? Outside of changing the upstream behavior, what can I do here? Is it possible to add caching in front of the library's DNS resolution?
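
A rough paraphrase of that code path, to make the chain explicit (this is a simplified sketch, not the actual grpc-js source; only requestReresolution is taken verbatim from it):

// Inside the round_robin balancer's per-subchannel connectivity state listener:
if (newState === ConnectivityState.IDLE) {
  // The GOAWAY moved this subchannel READY -> IDLE, so ask the channel to run
  // name resolution again in case the set of backends changed...
  this.channelControlHelper.requestReresolution();
  // ...and start reconnecting, which is the IDLE -> CONNECTING seen in the logs above.
  subchannel.startConnecting();
}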

@murgatroid99
Member

I am sorry for the delayed response. This issue dropped off my radar for a little while. Yes, the idea is that when the backend sends a GOAWAY, that is a signal that the set of available backends may have changed, and the client should do name resolution again to check that.

One thing I notice is that the last log looks like multiple separate channel objects. Each channel has its own separate DNS resolver object, which will do separate DNS requests. There is an option that you may be able to use to mitigate this: gRPC Client class constructors accept a channelOverride option, which should be a Channel object. If you construct a single Channel object and use it for all clients that are currently constructed with identical parameters, that should reduce DNS traffic.
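
For example, something along these lines (the client class, target, and credentials here are placeholders):

const grpc = require('@grpc/grpc-js');

const target = 'dns:///my-service:50051';
const credentials = grpc.credentials.createInsecure();
const options = {
  'grpc.service_config': JSON.stringify({ loadBalancingConfig: [{ round_robin: {} }] }),
};

// One shared Channel means one DNS resolver doing lookups on behalf of every client below.
const sharedChannel = new grpc.Channel(target, credentials, options);

const clientA = new MyServiceClient(target, credentials, { channelOverride: sharedChannel });
const clientB = new MyServiceClient(target, credentials, { channelOverride: sharedChannel });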

@murgatroid99
Member

I just published grpc-js version 1.6.1 with support for the grpc.dns_min_time_between_resolutions_ms option, which controls the minimum delay between successful DNS requests. The default is 30,000 (30 seconds), so even if you don't configure it, this should help with your DNS traffic.
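
For example (the value shown is just an illustration; the default is 30000 ms):

const options = {
  'grpc.service_config': JSON.stringify({ loadBalancingConfig: [{ round_robin: {} }] }),
  // Do not re-run successful DNS resolution more often than once per minute.
  'grpc.dns_min_time_between_resolutions_ms': 60000,
};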

@ruimarinho

@murgatroid99 does this need to be enabled in any way? I'm not able to observe the re-resolution timer working with the default config values. I'm using the round robin load balancing config option.

@murgatroid99
Member

It should just work by default. What exactly are you seeing that indicates that it is not working? Note that this setting only limits the frequency of successful DNS requests. Failing DNS requests are limited by a separate exponential backoff timer.
