
Ruby client deadline greatly exceeded with unavailable hosts #14685

Closed

jcalvert opened this issue Mar 13, 2018 · 7 comments

Comments

@jcalvert

jcalvert commented Mar 13, 2018

What version of gRPC and what language are you using?

grpc 1.10/ ruby

What operating system (Linux, Windows, …) and version?

Linux Debian 3.16.39-1+deb8u1~bpo70+1 (2017-02-24)

What runtime / compiler are you using (e.g. python version or version of gcc)

Ruby 2.4/gcc version 4.7.2

What did you do?

In an application, we noticed that requests to unavailable hosts would sometimes far exceed the specified deadline, taking roughly 127 seconds instead of 1. This matters for predictable performance when gRPC services are unavailable. The condition was relatively repeatable, so I attached gdb to the process during one of these extended "hangs" and, after dumping the stacks, found that it was always blocked in gpr_cv_wait, similar to the issue described here.

What did you expect to see?

That the gRPC client either fails with Unavailable when a host/port is not bound, or at least always respects a 200ms deadline.

What did you see instead?

Wanting to rule out our application code, I looked at the Ruby examples and modified the simple client there, as in this gist. Strangely enough, the same 'hang' happened every 100th iteration, and every 200th iteration if the deadline was cut in half. It also does not seem to happen if the stub is created fresh on each call, and everything fails almost immediately with the more expected GRPC::Unavailable if localhost is used instead of a remote host.
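For reference, here is a minimal sketch of the kind of loop described above (this is not the actual gist; the hostname, request type, and iteration count are placeholders, and the deadline handling follows the standard generated-stub API):

```ruby
# Hypothetical repro sketch modeled on the Ruby greeter example; the host,
# request type, and iteration count are placeholders, not the actual gist.
require 'grpc'
require 'helloworld_services_pb'

# The stub is created once and reused across calls, as in the repro above.
stub = Helloworld::Greeter::Stub.new('some-unreachable-host:50051',
                                     :this_channel_is_insecure)

1_000.times do |i|
  started = Time.now
  begin
    # Deadline is an absolute Time, here roughly one second in the future.
    stub.say_hello(Helloworld::HelloRequest.new(name: "iteration #{i}"),
                   deadline: Time.now + 1)
  rescue GRPC::BadStatus => e
    elapsed = Time.now - started
    # A "hang" shows up as an elapsed time far exceeding the 1s deadline.
    puts "#{i}: #{e.class} after #{elapsed.round(2)}s"
  end
end
```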

I tried setting GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG to get more information; that output is attached in an additional gist.

Anything else we should know about your project / environment?

I tried running this same script on my El Capitan MacBook Pro and was unable to reproduce the behavior, so I am wondering if this is an issue with the settings under which the extension is compiled when the gem is installed.

@gaffer-93

gaffer-93 commented Oct 26, 2018

Hey guys, I might be able to shed some light on this issue! In late June I opened issue #15889, which, similarly to this issue, involved RPC calls not respecting their set deadlines and blocking for long periods of time. All the information about the bug and its eventual solution can be found in that now-closed issue. The result of all this was proposal A18. Initially, it was great to see my original reproduction case become non-reproducible with the nightly builds after PR #16419 was merged.

We have, however, since uncovered another edge case in which RPC call deadlines are not respected, which I believe is the exact same issue the OP is experiencing here. What we're currently seeing is that while connectivity is broken (the iptables rule from the original repro case on #15889 is active), RPC calls time out as expected; but if you wait for the first re-connection attempt (approximately initial_reconnect_backoff), the next RPC call blocks in a similar fashion to what the OP reports. Raising initial_reconnect_backoff directly changes how long it takes for a blocking call to occur, so it is pretty clear to us that this blocking (which does not respect our timeout) happens when an RPC is issued while the channel is attempting to re-connect; see the sketch below.
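To make that timing relationship concrete, here is a hedged sketch (host and values are placeholders) of varying the backoff from the Ruby client via the `grpc.initial_reconnect_backoff_ms` channel arg:

```ruby
# Sketch only: raise the initial reconnect backoff and observe that the point
# at which a call blocks past its deadline moves with it.
# Host and values are placeholders.
require 'grpc'
require 'helloworld_services_pb'

stub = Helloworld::Greeter::Stub.new(
  'some-unreachable-host:50051',
  :this_channel_is_insecure,
  channel_args: {
    # First reconnect attempt after ~5s instead of the default ~1s.
    'grpc.initial_reconnect_backoff_ms' => 5_000
  }
)

begin
  stub.say_hello(Helloworld::HelloRequest.new(name: 'probe'),
                 deadline: Time.now + 1)
rescue GRPC::BadStatus => e
  puts "#{e.class}: #{e.message}"
end
```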

What we've discovered is that, in our case, this blocking consistently lasts exactly 127s, at which point a DEADLINE_EXCEEDED exception is finally raised. We found this quite suspicious and rather quickly suspected that the tcp_syn_retries limit on our sockets might influence this timeout. For context, the default for our version of Linux is 6, which according to the kernel docs results in a maximum connection timeout of 127s.

tcp_syn_retries - INTEGER
    Number of times initial SYNs for an active TCP connection attempt
    will be retransmitted. Should not be higher than 127. Default value
    is 6, which corresponds to 63seconds till the last retransmission
    with the current initial RTO of 1second. With this the final timeout
    for an active TCP connection attempt will happen after 127seconds.
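The 127s figure follows directly from the exponential backoff described above: with an initial RTO of 1s, retransmissions go out after 1s, 2s, 4s, and so on, so the total connect timeout for n retries is 2^(n+1) - 1 seconds. A quick sanity check:

```ruby
# Expected connect timeout for a given tcp_syn_retries value, assuming the
# initial RTO of 1 second and per-retry doubling described in the kernel docs.
def syn_timeout(retries)
  (0..retries).sum { |k| 2**k } # equivalently 2**(retries + 1) - 1
end

(1..6).each { |n| puts "tcp_syn_retries=#{n} -> #{syn_timeout(n)}s" }
# tcp_syn_retries=1 -> 3s ... tcp_syn_retries=6 -> 127s
```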

So we re-tested our repro case with varying values for tcp_syn_retries from 1 all the way up to 6 by doing something like:
echo 1 > /proc/sys/net/ipv4/tcp_syn_retries

Lo and behold, the effective deadline on RPC calls issued while gRPC was attempting a reconnect correlated directly with the maximum expected connection timeout determined by the current value of tcp_syn_retries:

| tcp_syn_retries | re-connect timeout |
|-----------------|--------------------|
| 1               | 3s                 |
| 2               | 7s                 |
| 3               | 15s                |
| 4               | 31s                |
| 5               | 63s                |
| 6               | 127s               |

All of these timings were consistent between multiple iterations of the test.

Our solution was to set tcp_syn_retries to 2 globally on the hosts running the gRPC-based service in question, but we would much prefer to be able to set this as a channel option within gRPC. A solution similar to proposal A18 would be ideal, because that way you are not forced to set such an important kernel setting system-wide.
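For illustration, the A18 mechanism referenced above is already exposed as ordinary channel args in the Ruby client; the SYN-retry knob being requested here would presumably look similar. The keepalive values below are placeholders, and this sketch does not itself bound connection-attempt time:

```ruby
# A18-style per-channel knobs as they exist today: these keepalive args map
# to TCP_USER_TIMEOUT on established connections. There does not appear to be
# a comparable channel arg for tcp_syn_retries, which is what is being asked
# for above.
require 'grpc'
require 'helloworld_services_pb'

stub = Helloworld::Greeter::Stub.new(
  'service-host:50051',
  :this_channel_is_insecure,
  channel_args: {
    'grpc.keepalive_time_ms'    => 10_000, # probe after 10s of inactivity
    'grpc.keepalive_timeout_ms' => 2_000   # drop the connection if unacked for 2s
  }
)
```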

@yashykt
Member

yashykt commented Nov 1, 2018

@AspirinSJL Do you know off the top of your head why the deadline on the RPC might be getting delayed?

@AspirinSJL
Member

For the subchannel part, I can't think of anything that can delay the RPC deadline.

At first glance, I thought #17010 might be related.

AspirinSJL self-assigned this Nov 2, 2018
@stale

stale bot commented Sep 4, 2019

This issue/PR has been automatically marked as stale because it has not had any update (including commits, comments, labels, milestones, etc) for 180 days. It will be closed automatically if no further update occurs in 1 day. Thank you for your contributions!

@AspirinSJL
Member

@yashykt Do you think this issue can be solved by something similar to https://github.com/grpc/proposal/blob/master/A18-tcp-user-timeout.md as analyzed and suggested in #14685 (comment)?

@yashykt
Member

yashykt commented Oct 4, 2019

Before providing any such option, I would like more information as to why gRPC is not canceling the RPC immediately. I believe we have timers on a connect call and we should not be waiting that long.

@stale

stale bot commented Apr 2, 2020

This issue/PR has been automatically marked as stale because it has not had any update (including commits, comments, labels, milestones, etc) for 180 days. It will be closed automatically if no further update occurs in 1 day. Thank you for your contributions!

stale bot closed this as completed Apr 9, 2020