
Ruby client deadline greatly exceeded with unavailable hosts #14685

Closed

jcalvert opened this issue Mar 13, 2018 · 7 comments

Comments

@jcalvert

jcalvert commented Mar 13, 2018

What version of gRPC and what language are you using?

grpc 1.10/ ruby

What operating system (Linux, Windows, …) and version?

Linux Debian 3.16.39-1+deb8u1~bpo70+1 (2017-02-24)

What runtime / compiler are you using (e.g. python version or version of gcc)

Ruby 2.4/gcc version 4.7.2

What did you do?

In an application, we noticed that requests to unavailable hosts would sometimes far exceed the specified deadline, taking roughly 127 seconds instead of 1. This matters for predictable performance when gRPC services are unavailable. The condition was relatively repeatable, so I attached gdb to the process during one of these extended "hangs" and, after dumping the stacks, found that it was always blocked in gpr_cv_wait, similar to the issue described here.

What did you expect to see?

That the gRPC client either fails with Unavailable when a host/port is not bound, or at least always respects a 200ms deadline.

What did you see instead?

Wanting to rule out our application code, I looked at the Ruby examples and modified the simple client there, as in this gist. Strangely enough, the same 'hang' happened every 100th iteration, and every 200th iteration if the deadline was cut in half. It also does not seem to happen if the stub is created fresh on each call, and everything fails almost immediately with the more expected GRPC::Unavailable if localhost is used instead of a remote host.
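For reference, here is a minimal sketch of the kind of loop described above (this is not the actual gist; the hostname, request type, and iteration count are placeholders, and the deadline handling follows the standard generated-stub API):

```ruby
# Hypothetical repro sketch modeled on the Ruby greeter example; the host,
# request type, and iteration count are placeholders, not the actual gist.
require 'grpc'
require 'helloworld_services_pb'

# The stub is created once and reused across calls, as in the repro above.
stub = Helloworld::Greeter::Stub.new('some-unreachable-host:50051',
                                     :this_channel_is_insecure)

1_000.times do |i|
  started = Time.now
  begin
    # Deadline is an absolute Time, here roughly one second in the future.
    stub.say_hello(Helloworld::HelloRequest.new(name: "iteration #{i}"),
                   deadline: Time.now + 1)
  rescue GRPC::BadStatus => e
    elapsed = Time.now - started
    # A "hang" shows up as an elapsed time far exceeding the 1s deadline.
    puts "#{i}: #{e.class} after #{elapsed.round(2)}s"
  end
end
```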

I tried setting GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG to get more information; that output is attached in an additional gist.

Anything else we should know about your project / environment?

I tried running this same script on my El Capitan MacBook Pro and was unable to reproduce the behavior, so I am wondering if this is an issue with the settings under which the extension is compiled when the gem is installed.

@gaffer-93

gaffer-93 commented Oct 26, 2018

Hey guys, I might be able to shed some light on this issue! In late June I opened issue #15889, which, similarly to this issue, involved RPC calls not respecting their set deadlines and blocking for long periods of time. All the information about the bug and its eventual solution can be found in that now-closed issue. The result of all this was proposal A18. Initially, it was great to see my original reproduction case become non-reproducible with the nightly builds after PR #16419 was merged.

We have, however, since uncovered another edge case in which RPC call deadlines are not respected, which I believe is the exact same issue the OP is experiencing here. What we're currently seeing is that while connectivity is broken (the iptables rule from the original repro case on #15889 is active), RPC calls time out as expected; but if you wait for the first re-connection attempt (approximately initial_reconnect_backoff), the next RPC call blocks in a similar fashion to what the OP reports. Raising initial_reconnect_backoff directly changes how long it takes for a blocking call to occur, so it is pretty clear to us that this blocking (which does not respect our timeout) happens when an RPC is issued while the channel is attempting to re-connect; see the sketch below.
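To make that timing relationship concrete, here is a hedged sketch (host and values are placeholders) of varying the backoff from the Ruby client via the `grpc.initial_reconnect_backoff_ms` channel arg:

```ruby
# Sketch only: raise the initial reconnect backoff and observe that the point
# at which a call blocks past its deadline moves with it.
# Host and values are placeholders.
require 'grpc'
require 'helloworld_services_pb'

stub = Helloworld::Greeter::Stub.new(
  'some-unreachable-host:50051',
  :this_channel_is_insecure,
  channel_args: {
    # First reconnect attempt after ~5s instead of the default ~1s.
    'grpc.initial_reconnect_backoff_ms' => 5_000
  }
)

begin
  stub.say_hello(Helloworld::HelloRequest.new(name: 'probe'),
                 deadline: Time.now + 1)
rescue GRPC::BadStatus => e
  puts "#{e.class}: #{e.message}"
end
```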

What we've discovered is that, in our case, this blocking consistently lasts exactly 127s, at which point a DEADLINE_EXCEEDED exception is finally raised. We found this quite suspicious and rather quickly suspected that the tcp_syn_retries limit on our sockets might influence this timeout. For context, the default for our version of Linux is 6, which according to the kernel docs results in a maximum connection timeout of 127s.

tcp_syn_retries - INTEGER
    Number of times initial SYNs for an active TCP connection attempt
    will be retransmitted. Should not be higher than 127. Default value
    is 6, which corresponds to 63seconds till the last retransmission
    with the current initial RTO of 1second. With this the final timeout
    for an active TCP connection attempt will happen after 127seconds.
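The 127s figure follows directly from the exponential backoff described above: with an initial RTO of 1s, retransmissions go out after 1s, 2s, 4s, and so on, so the total connect timeout for n retries is 2^(n+1) - 1 seconds. A quick sanity check:

```ruby
# Expected connect timeout for a given tcp_syn_retries value, assuming the
# initial RTO of 1 second and per-retry doubling described in the kernel docs.
def syn_timeout(retries)
  (0..retries).sum { |k| 2**k } # equivalently 2**(retries + 1) - 1
end

(1..6).each { |n| puts "tcp_syn_retries=#{n} -> #{syn_timeout(n)}s" }
# tcp_syn_retries=1 -> 3s ... tcp_syn_retries=6 -> 127s
```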

So we re-tested our repro case with varying values for tcp_syn_retries from 1 all the way up to 6 by doing something like:
echo 1 > /proc/sys/net/ipv4/tcp_syn_retries

Lo and behold, the effective deadline on RPC calls issued while gRPC was attempting a reconnect correlated directly with the maximum expected connection timeout determined by the current value of tcp_syn_retries:

| tcp_syn_retries | re-connect timeout |
|-----------------|--------------------|
| 1               | 3s                 |
| 2               | 7s                 |
| 3               | 15s                |
| 4               | 31s                |
| 5               | 63s                |
| 6               | 127s               |

All of these timings were consistent between multiple iterations of the test.

Our solution was to set tcp_syn_retries to 2 globally on the hosts running the gRPC-based service in question, but we would much prefer to be able to set this as a channel option within gRPC. A solution similar to proposal A18 would be ideal, because that way you are not forced to set such an important kernel setting system-wide.
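For illustration, the A18 mechanism referenced above is already exposed as ordinary channel args in the Ruby client; the SYN-retry knob being requested here would presumably look similar. The keepalive values below are placeholders, and this sketch does not itself bound connection-attempt time:

```ruby
# A18-style per-channel knobs as they exist today: these keepalive args map
# to TCP_USER_TIMEOUT on established connections. There does not appear to be
# a comparable channel arg for tcp_syn_retries, which is what is being asked
# for above.
require 'grpc'
require 'helloworld_services_pb'

stub = Helloworld::Greeter::Stub.new(
  'service-host:50051',
  :this_channel_is_insecure,
  channel_args: {
    'grpc.keepalive_time_ms'    => 10_000, # probe after 10s of inactivity
    'grpc.keepalive_timeout_ms' => 2_000   # drop the connection if unacked for 2s
  }
)
```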

@yashykt
Member

yashykt commented Nov 1, 2018

@AspirinSJL Do you know off the top of your head why the deadline on the RPC might be getting delayed?

@AspirinSJL
Member

For the subchannel part, I can't think of anything that can delay the RPC deadline.

At first glance, I thought #17010 might be related.

AspirinSJL self-assigned this Nov 2, 2018
@stale

stale bot commented Sep 4, 2019

This issue/PR has been automatically marked as stale because it has not had any update (including commits, comments, labels, milestones, etc) for 180 days. It will be closed automatically if no further update occurs in 1 day. Thank you for your contributions!

@AspirinSJL
Member

@yashykt Do you think this issue can be solved by something similar to https://github.com/grpc/proposal/blob/master/A18-tcp-user-timeout.md as analyzed and suggested in #14685 (comment)?

@yashykt
Member

yashykt commented Oct 4, 2019

Before providing any such option, I would like more information as to why gRPC is not canceling the RPC immediately. I believe we have timers on a connect call and we should not be waiting that long.

@stale

stale bot commented Apr 2, 2020

This issue/PR has been automatically marked as stale because it has not had any update (including commits, comments, labels, milestones, etc) for 180 days. It will be closed automatically if no further update occurs in 1 day. Thank you for your contributions!

stale bot closed this as completed Apr 9, 2020