gRPC channels blocking indefinitely and not respecting deadlines on network disconnect #15889

Closed
gaffer-93 opened this Issue Jun 27, 2018 · 34 comments

@gaffer-93

gaffer-93 commented Jun 27, 2018

What version of gRPC and what language are you using?

gRPC==1.12.1
python

What operating system (Linux, Windows, …) and version?

Linux Ubuntu "14.04.5 LTS, Trusty Tahr"

What runtime / compiler are you using (e.g. Python version or version of gcc)?

gcc==4:4.8.2-1ubuntu6
python==2.7

What did you do?

I consistently reproduced a bug that causes gRPC channels to block indefinitely when sending to a destination that continuously times out.

I slightly modified the greeter_client.py from the Python helloworld example: I added an infinite while loop, a timeout of 0.3s to the SayHello() gRPC call, and exception handling to catch any DEADLINE_EXCEEDED errors.

Modified greeter_client.py
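
The linked script isn't reproduced here, but a minimal sketch of the modifications described above (infinite loop, 0.3s timeout, catching DEADLINE_EXCEEDED), assuming the stock helloworld example stubs and a placeholder server address, looks roughly like this:

# Sketch only: assumes the standard helloworld_pb2/helloworld_pb2_grpc modules
# from the gRPC Python examples and a placeholder target; not the exact gist.
import logging
import time

import grpc
import helloworld_pb2
import helloworld_pb2_grpc


def run(target='server-host:6106', msg='a'):
    channel = grpc.insecure_channel(target)
    stub = helloworld_pb2_grpc.GreeterStub(channel)
    while True:
        start = time.time()
        try:
            # The 0.3s deadline is the modification under discussion.
            stub.SayHello(helloworld_pb2.HelloRequest(name=msg), timeout=0.3)
            logging.error('Response received')
        except grpc.RpcError as e:
            elapsed_ms = (time.time() - start) * 1000
            logging.error('GRPC TIMEOUT: elapsed %sms (%s)', elapsed_ms, e.code())


if __name__ == '__main__':
    run()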

Running this script against an unmodified greeter_server.py (on a different host, listening on port 6106), I was able to observe the two communicating as expected.

To reproduce the bug in question, I added an iptables rule on the server side to drop all traffic arriving at the port in use:
/sbin/iptables -I INPUT 1 -s xx.xxx.xxx.xxx -j DROP

What did you expect to see?

Adding the iptables rule results in lots of DEADLINE_EXCEEDED (timeout) exceptions as expected.

Traceback (most recent call last):
  File "greeter_client.py", line 36, in run
    response = stub.SayHello(helloworld_pb2.HelloRequest(name=msg), timeout=0.3)
  File "/root/.venvs/grpc-test/local/lib/python2.7/site-packages/grpc/_channel.py", line 500, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/root/.venvs/grpc-test/local/lib/python2.7/site-packages/grpc/_channel.py", line 434, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
_Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded)>

I expected to see the greeter_client.py continue iterating over the while loop, timing out in exactly 300ms on every call to response = stub.SayHello(helloworld_pb2.HelloRequest(name=msg), timeout=0.3)

What did you see instead?

If you leave the client running with everything in this state, it eventually blocks indefinitely on the call response = stub.SayHello(helloworld_pb2.HelloRequest(name=msg), timeout=0.3) and never times out. This is an example of the bug we've been trying to figure out these past few weeks. After some digging we found that the blocking call is here: https://github.com/grpc/grpc/blob/master/src/python/grpcio/grpc/_channel.py#L508

Once the iptables rule is removed from the server side, the client un-blocks and continues to send.

Now what's interesting is that if you increase the size of the msg variable here, to say 'a' * 1000, the client breaks (starts to block indefinitely) much faster.

On inspection we discovered that when we add the iptables rule to drop the traffic, the send-q for the socket in use by the gRPC client continues to grow as the client keeps attempting to send and timing out. The send-q grows because gRPC keeps buffering the request for sending but never receives an ACK, so the data sits in the queue, piling up on every attempt, until the send-q reaches its maximum threshold, at which point the client exhibits the indefinite blocking behaviour. We believe the kernel refuses to add any more data to the send-q, causing gRPC to block/spin deep within the C code.

For every reproduction of this, when sending msg='a', once the send-q hits 33425 the client blocks.

root@client-server ~ # netstat -natup | grep 6106
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0  33425 xx.xxx.xxx.xxx:45048    xxx.xxx.xx.xxx:6106     ESTABLISHED 19834/python

For larger messages the threshold rises slightly; for example, when sending msg='a'*1000 the send-q reaches 46278 every time before blocking:

root@client-server ~ # netstat -natup | grep 6106
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0  46278 xx.xxx.xxx.xxx:36000    xxx.xxx.xx.xxx:6106     ESTABLISHED 27166/python

Anything else we should know about your project / environment?

No

@mehrdada


Contributor

mehrdada commented Jun 28, 2018

@gaffer-93 Can you try using an example gRPC client from another Core-based language such as C++, C#, or Ruby to see if you can reproduce it in a non-Python client?

@FranGM


FranGM commented Jun 28, 2018

Hi,

I work with @gaffer-93 and did some work on trying to reproduce this issue with a ruby client today.

Here's my modified Ruby client: https://gist.github.com/FranGM/b22704e11d39e5f634ba5098131cd6d5 (essentially the same modifications as in the Python client: adding a 0.3s timeout and looping while catching the DeadlineExceeded exceptions).

The behaviour under Linux is the same:

"Greetings"
"Greetings"
"Greetings"
"Greetings"
"Greetings"
"Greetings"

... (Here I broke connectivity between the two hosts)

"Timed out after 0.301071361 seconds"
#<GRPC::DeadlineExceeded: 4:Deadline Exceeded>
"Greetings"
"Timed out after 0.301656398 seconds"
#<GRPC::DeadlineExceeded: 4:Deadline Exceeded>
"Greetings"
"Timed out after 0.301254051 seconds"
#<GRPC::DeadlineExceeded: 4:Deadline Exceeded>
"Greetings"

...    (Here I restored connectivity between the two hosts)

"Timed out after 52.628246245 seconds"
#<GRPC::DeadlineExceeded: 4:Deadline Exceeded>

Now here's something else I found interesting. While modifying the Ruby client I managed to mostly reproduce the issue from my laptop (running macOS 10.13.5). I say mostly because when running the client from my laptop it still doesn't respect the timeout, but it eventually gives up before connectivity is restored:

"Greetings"
"Greetings"
"Greetings"
"Greetings"
"Greetings"
"Greetings"
"Greetings"
"Greetings"
"Greetings"
"Greetings"

... (Connectivity is stopped here and never restored)

"Timed out after 0.305188 seconds"
#<GRPC::DeadlineExceeded: 4:Deadline Exceeded>
"Greetings"
"Timed out after 0.301285 seconds"
#<GRPC::DeadlineExceeded: 4:Deadline Exceeded>
"Greetings"
"Timed out after 0.306501 seconds"
#<GRPC::DeadlineExceeded: 4:Deadline Exceeded>
"Greetings"
"Timed out after 0.30456 seconds"
#<GRPC::DeadlineExceeded: 4:Deadline Exceeded>
"Greetings"
"Timed out after 0.302066 seconds"
#<GRPC::DeadlineExceeded: 4:Deadline Exceeded>
"Greetings"
"Timed out after 24.359898 seconds"         <--- note the 24 seconds to time out
#<GRPC::DeadlineExceeded: 4:Deadline Exceeded>
"Greetings"
"Timed out after 0.301519 seconds"
#<GRPC::DeadlineExceeded: 4:Deadline Exceeded>
"Greetings"
"Timed out after 0.305626 seconds"
#<GRPC::DeadlineExceeded: 4:Deadline Exceeded>

After the initial long timeout, the other timeouts happen every 0.3 seconds as expected. From my limited testing the initial timeout takes between 20 and 30 seconds.

This seems to point at a difference between macOS's and Linux's TCP stacks 🤔

For completeness' sake, the Linux Ruby client was running on the following kernel:

Linux my-hostname 4.13.0-38-generic #43~16.04.1-Ubuntu SMP Wed Mar 14 17:48:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
@FranGM


FranGM commented Jun 29, 2018

I've also straced the client process (https://gist.github.com/FranGM/98083cc0c617fb18be1b3357f84eb436). I haven't gone too deep into it, but there's an interesting change of behaviour around the sendmsg calls that matches @gaffer-93's observations.

When everything is fine a sendmsg call looks like this:

sendmsg(6, {msg_name(0)=NULL, msg_iov(8)=[{"\0\0L\1\4\0\0\0\25\307\306\0176\34", 14}, {"myhost.domain.com:6106", 28}, {"\0175\34", 3}, {"/helloworld.Greeter/SayHello", 28}, {"\303\302\301\300\277\17/\004300m\0\0\4\10\0\0\0\0\25\0\0", 23}, {"\0\5\0\3\360\0\1\0\0\0\25\0\0\0\3\353", 16}, {"\n\350\7aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 1003}, {"\0\0\4\10\0\0\0\0\0\0\0\3\370", 13}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 1128

When connectivity is broken, sendmsg still looks exactly the same; the kernel happily keeps taking our messages and adding them to the send buffer, even though nothing is actually going out on the wire:

sendmsg(6, {msg_name(0)=NULL, msg_iov(7)=[{"\0\0L\1\4\0\0\2y\307\306\0176\34", 14}, {"myhost.domain.com:6106", 28}, {"\0175\34", 3}, {"/helloworld.Greeter/SayHello", 28}, {"\303\302\301\300\277\17/\004300m\0\0\4\10\0\0\0\2y\0\0", 23}, {"\0\5\0\3\360\0\1\0\0\2y\0\0\0\3\353", 16}, {"\n\350\7aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 1003}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 1115

This is fine though, as we never get a response in time and time out accordingly.

Once the send buffer is full, the behaviour changes, as sendmsg stops succeeding:

sendmsg(6, {msg_name(0)=NULL, msg_iov(7)=[{"\0\0L\1\4\0\0\2\227\307\306\0176\34", 14}, {"myhost.domain.com:6106", 28}, {"\0175\34", 3}, {"/helloworld.Greeter/SayHello", 28}, {"\303\302\301\300\277\17/\004300m\0\0\4\10\0\0\0\2\227\0\0", 23}, {"\0\5\0\3\360\0\1\0\0\2\227\0\0\0\3\353", 16}, {"\n\350\7aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 1003}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xee8354, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0xee8350, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0xee8328, FUTEX_WAKE_PRIVATE, 1)  = 1
sendmsg(6, {msg_name(0)=NULL, msg_iov(7)=[{"\0\0L\1\4\0\0\2\227\307\306\0176\34", 14}, {"myhost.domain.com:6106", 28}, {"\0175\34", 3}, {"/helloworld.Greeter/SayHello", 28}, {"\303\302\301\300\277\17/\004300m\0\0\4\10\0\0\0\2\227\0\0", 23}, {"\0\5\0\3\360\0\1\0\0\2\227\0\0\0\3\353", 16}, {"\n\350\7aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 1003}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = -1 EAGAIN (Resource temporarily unavailable)

And from that point on we just hang indefinitely:

clock_gettime(CLOCK_REALTIME, {1530273195, 68155052}) = 0
clock_gettime(CLOCK_REALTIME, {1530273195, 68259698}) = 0
clock_gettime(CLOCK_MONOTONIC, {3095142, 542521765}) = 0
futex(0x7ffd03cf79c4, FUTEX_WAIT_BITSET_PRIVATE, 1, {3095142, 743082514}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x124b960, FUTEX_WAKE_PRIVATE, 1) = 0
clock_gettime(CLOCK_MONOTONIC, {3095142, 743961734}) = 0
clock_gettime(CLOCK_REALTIME, {1530273195, 270160930}) = 0
clock_gettime(CLOCK_REALTIME, {1530273195, 270430411}) = 0
clock_gettime(CLOCK_MONOTONIC, {3095142, 744933593}) = 0
futex(0x7ffd03cf79c4, FUTEX_WAIT_BITSET_PRIVATE, 1, {3095142, 945082514}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x124b960, FUTEX_WAKE_PRIVATE, 1) = 0

...

So it seems like we're not correctly handling the case where the send buffer is full and the kernel returns EAGAIN?
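
For anyone who wants to see this kernel behaviour in isolation, here is a small self-contained Python sketch (not gRPC code, and using a local socket pair rather than a TCP connection) showing that once a non-blocking socket's send buffer is full, send() fails with EAGAIN:

import errno
import socket

a, b = socket.socketpair()   # 'b' never reads, so a's send buffer fills up
a.setblocking(False)

sent = 0
while True:
    try:
        sent += a.send(b'a' * 1024)
    except socket.error as e:
        # EAGAIN/EWOULDBLOCK: the kernel refuses to accept more data right now.
        assert e.errno in (errno.EAGAIN, errno.EWOULDBLOCK)
        print('send buffer full after %d bytes, got EAGAIN' % sent)
        break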

@FranGM


FranGM commented Jun 29, 2018

Here's also a full trace when running the client with GRPC_VERBOSITY=debug GRPC_TRACE=all

https://gist.github.com/FranGM/f7289720ea58c587e1652f8926df44fb

(you can easily navigate to the part where it breaks by searching for write: delayed)

I0629 12:28:40.500742349   28698 tcp_posix.cc:649]           WRITE 0x1b1ee70 (peer=ipv4:1.2.3.4:6106): 0a e8 07 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 
'...aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
I0629 12:28:40.500790503   28698 tcp_posix.cc:671]           write: delayed
I0629 12:28:40.500803920   28698 tcp_posix.cc:223]           TCP:0x1b1ee70 notify_on_write
I0629 12:28:40.500815931   28698 tcp_posix.cc:181]           BACKUP_POLLER: cover cnt 0->2
I0629 12:28:40.500828878   28698 tcp_posix.cc:189]           BACKUP_POLLER:0x1acbf90 create
I0629 12:28:40.500842391   28698 executor.cc:228]            EXECUTOR: try to schedule 0x1acbf98 (long) to thread 0
I0629 12:28:40.500870575   28698 tcp_posix.cc:204]           BACKUP_POLLER:0x1acbf90 add 0x1b1ee70
E0629 12:28:40.500884225   28698 lockfree_event.cc:94]       LockfreeEvent::NotifyOn: 0x7f90d00024e0 curr=0x2 closure=0x1b1f038
I0629 12:28:40.500897445   28698 tcp_posix.cc:234]           TCP:0x1b1ee70 got_write: "No Error"
I0629 12:28:40.500909538   28698 tcp_posix.cc:170]           BACKUP_POLLER:0x1acbf90 uncover cnt 2->1
I0629 12:28:40.500925349   28698 tcp_posix.cc:621]           write: delayed
I0629 12:28:40.500936820   28698 tcp_posix.cc:223]           TCP:0x1b1ee70 notify_on_write
I0629 12:28:40.500948222   28698 tcp_posix.cc:181]           BACKUP_POLLER: cover cnt 1->3
I0629 12:28:40.500938608   28699 executor.cc:175]            EXECUTOR[0]: execute
I0629 12:28:40.501003384   28699 executor.cc:72]             EXECUTOR: run 0x1acbf98
I0629 12:28:40.500959634   28698 tcp_posix.cc:204]           BACKUP_POLLER:0x1acbf90 add 0x1b1ee70
I0629 12:28:40.501036837   28699 tcp_posix.cc:132]           BACKUP_POLLER:0x1acbf90 run
I0629 12:28:40.501066829   28698 tcp_posix.cc:170]           BACKUP_POLLER:0x1acbf90 uncover cnt 3->2
I0629 12:28:40.501094231   28699 ev_epoll1_linux.cc:681]     PS:0x1acbfc0 BEGIN_STARTS:0x7f90d4faa970
E0629 12:28:40.501118475   28698 lockfree_event.cc:94]       LockfreeEvent::NotifyOn: 0x7f90d00024e0 curr=(nil) closure=0x1b1f038
I0629 12:28:40.501144809   28699 ev_epoll1_linux.cc:700]     PS:0x1acbfc0 BEGIN_REORG:0x7f90d4faa970 kick_state=UNKICKED is_reassigning=1
I0629 12:28:40.501187039   28699 ev_epoll1_linux.cc:769]     PS:0x1acbfc0 BEGIN_DONE:0x7f90d4faa970 kick_state=DESIGNATED_POLLER shutdown=0 kicked_without_poller: 0
I0629 12:28:40.501214838   28698 completion_queue.cc:853]    grpc_completion_queue_next(cq=0x1ad25e0, deadline=gpr_timespec { tv_sec: 1530275320, tv_nsec: 701206171, clock_type: 1 }, reserved=(nil))

And in case it's of any use, here's a trace where connectivity was broken and then restored after our client was hanging: https://gist.github.com/FranGM/317e06682b82623666a05b73153a4924

And an strace of the disconnect/reconnect case, as I realised it was missing from my previous comment: https://gist.github.com/FranGM/0017f4f10c844f0f82c9531dc53bdd28

@FranGM

This comment has been minimized.

FranGM commented Jul 2, 2018

I did a bit more digging into what exactly happens on the kernel side (it's not immediately helpful for addressing the gRPC bug itself, but it might provide some context for other people experiencing a similar issue).

The client doesn't actually hang indefinitely as originally reported, but it takes roughly 15-20 minutes to time out.

When connectivity is broken between server and client, the client "succeeds" every time it sends data (as long as there's room in the send queue), so it times out normally when it doesn't get a reply within the deadline. Once the send queue is full, sendmsg actually fails (it gets an EAGAIN) and, despite never receiving a response, the client stays in this state until either the kernel succeeds in sending the data (so there's room in the send queue again) or the connection gets killed (forcing gRPC to create a new connection, so sends start "succeeding" again).

The number of retries the kernel is willing to do before giving up on a connection is governed by net.ipv4.tcp_retries2 (which seems to default to 15). The timeout between retries is calculated dynamically and is based on the RTT of the connection, but we're looking at around 15-20 minutes overall for all the retries. Once all the retries have been exhausted, the kernel will close the connection.
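
For reference, a quick way to inspect that sysctl from Python (the procfs path is the standard Linux location; the value is usually 15):

# Read the kernel retry limit mentioned above; on most distributions this prints 15.
with open('/proc/sys/net/ipv4/tcp_retries2') as f:
    print('net.ipv4.tcp_retries2 = ' + f.read().strip())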

@srini100


Contributor

srini100 commented Jul 6, 2018

@FranGM, thanks for digging into this. Your findings are very helpful. This is a bug in the Core and applies to all wrapped languages. I tried a C++ client and saw the same behavior.

@sreecha


Contributor

sreecha commented Jul 6, 2018

Thanks @FranGM for the investigation. It is strange that despite having a 300 millisecond timeout on each call, the call is not timing out in some cases (i.e. when sendmsg returns EAGAIN).

@ncteisen or @hcaseyal: can one of you take a look at this bug? Looks like @srini100 has a repro with a C++ client.

@sreecha sreecha assigned hcaseyal and unassigned sreecha Jul 6, 2018

@srini100


Contributor

srini100 commented Jul 8, 2018

@hcaseyal, repro with C++ client is here.

@kpayson64


Contributor

kpayson64 commented Jul 11, 2018

Thanks for the detailed analysis!

I was able to reproduce the issue, and verified it is fixed by #15983.

@hcaseyal hcaseyal removed their assignment Jul 11, 2018

@vadimi


vadimi commented Jul 19, 2018

I'm seeing something similar in one of my apps. Any chance we can expect a fix in grpc 1.14?

@emauton


emauton commented Jul 24, 2018

I see #15983 is merged - thanks @kpayson64!

When can we expect a 1.14 release?

@kpayson64


Contributor

kpayson64 commented Jul 24, 2018

1.14 release candidates should be published shortly (this week).

@kpayson64


Contributor

kpayson64 commented Jul 25, 2018

1.14 pre-release has been published:
https://pypi.org/project/grpcio/#history

@kpayson64 kpayson64 closed this Jul 25, 2018

@gaffer-93


gaffer-93 commented Jul 30, 2018

We have pulled in the new 1.14.0rc1 package and re-deployed one of the services that we were experiencing this bug with. Unfortunately we are still seeing the same behaviour as with 1.13.0.

To be sure, I tried to reproduce the issue again with my original modified greeter_client.py

(grpc) root@client ~ # pip freeze
enum34==1.1.6
futures==3.2.0
grpcio==1.14.0rc1
protobuf==3.6.0
six==1.11.0
(grpc) root@server ~ # pip freeze
enum34==1.1.6
futures==3.2.0
grpcio==1.14.0rc1
protobuf==3.6.0
six==1.11.0
ERROR:root:Resonse Recieved
ERROR:root:Resonse Recieved
ERROR:root:Resonse Recieved
ERROR:root:Resonse Recieved

*BREAK CONNECTIVITY WITH IPTABLES*

ERROR:root:GRPC TIMEOUT: elapsed 301.83506012ms
Traceback (most recent call last):
  File "greeter_client.py", line 36, in run
    response = stub.SayHello(helloworld_pb2.HelloRequest(name=msg), timeout=0.3)
  File "/root/.venvs/grpc/local/lib/python2.7/site-packages/grpc/_channel.py", line 514, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/root/.venvs/grpc/local/lib/python2.7/site-packages/grpc/_channel.py", line 448, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
_Rendezvous: <_Rendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1532962723.729290705","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1095,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>
ERROR:root:GRPC TIMEOUT: elapsed 302.653074265ms
Traceback (most recent call last):
  File "greeter_client.py", line 36, in run
    response = stub.SayHello(helloworld_pb2.HelloRequest(name=msg), timeout=0.3)
  File "/root/.venvs/grpc/local/lib/python2.7/site-packages/grpc/_channel.py", line 514, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/root/.venvs/grpc/local/lib/python2.7/site-packages/grpc/_channel.py", line 448, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
_Rendezvous: <_Rendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1532962724.033290171","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1095,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>
ERROR:root:GRPC TIMEOUT: elapsed 302.798986435ms
Traceback (most recent call last):
  File "greeter_client.py", line 36, in run
    response = stub.SayHello(helloworld_pb2.HelloRequest(name=msg), timeout=0.3)
  File "/root/.venvs/grpc/local/lib/python2.7/site-packages/grpc/_channel.py", line 514, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/root/.venvs/grpc/local/lib/python2.7/site-packages/grpc/_channel.py", line 448, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
_Rendezvous: <_Rendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1532962724.337292221","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1095,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>

*HANGS UNTIL IPTABLES RULE IS REMOVED*

ERROR:root:GRPC TIMEOUT: elapsed 25122.6000786ms
Traceback (most recent call last):
  File "greeter_client.py", line 36, in run
    response = stub.SayHello(helloworld_pb2.HelloRequest(name=msg), timeout=0.3)
  File "/root/.venvs/grpc/local/lib/python2.7/site-packages/grpc/_channel.py", line 514, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/root/.venvs/grpc/local/lib/python2.7/site-packages/grpc/_channel.py", line 448, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
_Rendezvous: <_Rendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1532962724.641294148","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1095,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>
@emauton


emauton commented Aug 3, 2018

Hi, thanks @srini100. Can you confirm that grpc.max_reconnect_backoff_ms and friends should allow us to change these parameters?

Testing so far (and staring at code) suggests that while it's parsed out as configuration, it doesn't actually do anything. Is #10684 relevant to retries in this case? If so, is there a new way to specify retry policy for a channel via options, etc.?

We may be missing something, or there may be a documentation gap here (happy to write some up once we have a resolution, if so!).

@dgquintas


dgquintas commented Aug 3, 2018

grpc.max_reconnect_backoff_ms et al. should certainly have an effect on the connection's (aka subchannel) retry behavior. In particular:

  1. Parsing of the channel args happens here as part of the subchannel creation. The following line inits c->backoff using the captured options.
  2. Throughout subchannel.cc, c->backoff is used to inform retry attempt times. In particular, consider the body of continue_connect_locked. The most interesting bit is the invocation of c->backoff->NextAttemptTime(), whose body is here.

I tried to make the backoff implementation and its documentation consistent as part of #8854 and #13121, but it's a surprisingly tricky business and there may still be sharp edges somewhere. Please let me know if you find any.
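
For reference, in Python these arguments are passed as channel options when the channel is created. A minimal sketch (the target address and the values here are only illustrative):

import grpc

# Backoff-related channel args discussed above; values are illustrative only.
options = [
    ('grpc.initial_reconnect_backoff_ms', 1000),
    ('grpc.min_reconnect_backoff_ms', 1000),
    ('grpc.max_reconnect_backoff_ms', 5000),
]
channel = grpc.insecure_channel('server-host:6106', options=options)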

@srini100


Contributor

srini100 commented Aug 7, 2018

@gaffer-93, the fix for the deadline issue (#16179) needs to be reverted from the 1.14.0 release; a patch reverting it will be coming out soon. The fix is causing some tests to fail and it is not safe to use this release. Meanwhile, to get around this issue you can enable keepalive, which will detect connection breakage. See https://github.com/grpc/grpc/blob/master/doc/keepalive.md. We are actively looking into a solution for the original issue.
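
For reference, a sketch of enabling client-side keepalive from Python as suggested here; the channel arg names are the standard keepalive options, and the values are only illustrative:

import grpc

# Keepalive settings so that a dead peer is detected via missing ping acks,
# rather than waiting on kernel TCP retransmission timeouts.
options = [
    ('grpc.keepalive_time_ms', 10000),           # send a keepalive ping every 10s
    ('grpc.keepalive_timeout_ms', 3000),         # consider the link dead after 3s without an ack
    ('grpc.keepalive_permit_without_calls', 1),  # also ping when there are no active RPCs
]
channel = grpc.insecure_channel('server-host:6106', options=options)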

@srini100 srini100 assigned yashykt and unassigned dgquintas Aug 7, 2018

@alexvy86


Contributor

alexvy86 commented Aug 7, 2018

@srini100 are there notes on why it is dangerous to use 1.14.0? I recently updated it in an app precisely because of this thread. TBH I'm not 100% sure the problem I was seeing was exactly the same, but I also saw an instance of the client hanging for ~15 minutes before throwing an error (referenced here by the OP), so I thought I'd upgrade and see if it came up again.

@srini100


Contributor

srini100 commented Aug 7, 2018

@alexvy86, this fix unfortunately introduced a timing-related issue in some corner cases: when a stream is cancelled (due to the deadline being exceeded or any other valid reason) and an error is returned to the application, the application-provided buffer may still be in use by the transport.

@yashykt


Member

yashykt commented Aug 9, 2018

First, to briefly summarize the bug that causes gRPC channels to block indefinitely, which can be reproduced by adding rules to drop packets to certain IP addresses:

The transport keeps trying to send packets, but they are always dropped due to the mentioned rule, so the TCP packets are never acknowledged. This likely causes the network buffers in the kernel to fill up, at which point the socket is no longer writable and all future writes are blocked. Inside gRPC Core, there are two places where user-provided slices can get stuck:

1) HTTP2 layer - a buffer that coalesces data from all streams (RPCs) to be written to a single transport.
2) TCP layer - a buffer received from the HTTP2 layer when the TCP layer is ready for more writes. Some slices from this buffer might already have been written to the kernel buffers, but there may still be slices that have not yet been written.

Now, when a deadline on an RPC expires, we start a cancel op down the filter stack. When the cancel op reaches the HTTP2 layer, we check whether we can readily run closures that would finally result in the RPC getting cancelled. The fix in #16179 ignored the slices that could have been stuck in the two buffers mentioned above, which leads to the ASAN bug we started seeing in our tests.

To explain the ASAN issue in slightly more detail:
gRPC Core does not make a copy of the data that the user provides to send over the transport. For example, the status string provided to gRPC Core is not copied; a slice is created using the string as the backing store. This slice is then handed to the HTTP2 and TCP layers. If we return to the application without somehow removing the slices from those two buffers, we have a possible heap-use-after-free bug if the application frees the string immediately.

We are still looking into potential ways to solve this without causing a severe performance degradation.

@actgardner


actgardner commented Aug 9, 2018

Given the current state of this issue, what's the safest version of gRPC to run? It sounds like both 1.14.0 and 1.14.1 have bugs; should we stick with 1.13.0?

@srini100


Contributor

srini100 commented Aug 9, 2018

1.14.1 is safe to use. The behavior with respect to this issue is the same in 1.13.0 and 1.14.1. The attempt to fix this issue in 1.14.0 introduced an unsafe bug, which is fixed in 1.14.1 by rolling back that fix.

If you are running into this issue, you can use keepalive (as mentioned earlier) until we have a better way to resolve this.

@gaffer-93


gaffer-93 commented Aug 20, 2018

Just checking in on the progress of this issue. Do you know when we can expect a fix for this?

We have implemented a semi-workaround using keepalive as you suggested. Our client service averages over 200 long-lived channels open at any one time, and we send ~100kb batched objects down each channel about every 400ms. As you can imagine, with that many unique destinations, we were running into this bug quite regularly. Our semi-workaround was to enable keepalive and set the following options:

GRPC_ARG_KEEPALIVE_TIME_MS: 100
GRPC_ARG_KEEPALIVE_TIMEOUT_MS: 350

This allowed us to preemptively check the connectivity state of a channel before sending data by calling channel._channel.check_connectivity_state(True); if the state transitions from 2 -> 1 or 0, we don't execute the gRPC call to send data, avoiding the bug and any stalling (a sketch of this gate is shown below). Obviously this isn't a guaranteed fix for the issue: in the case of a bad network incident, where many destination hosts go down, the keepalive may not close the transports quickly enough and gRPC calls can still stall.
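
A sketch of that gate, assuming a keepalive-enabled channel; note that check_connectivity_state is a private method of the underlying Cython channel, so this is a workaround rather than supported API (the connectivity enum values are 0=IDLE, 1=CONNECTING, 2=READY, 3=TRANSIENT_FAILURE, 4=SHUTDOWN):

def safe_say_hello(channel, stub, request):
    # Private API, as described above: ask the channel for its current
    # connectivity state (True asks it to try to connect if idle).
    state = channel._channel.check_connectivity_state(True)
    if state != 2:  # not READY: skip the call rather than risk a stalled write
        return None
    return stub.SayHello(request, timeout=0.3)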

I was wondering if there's something we are missing in your original suggestion to use keepalive as a workaround? For our edge case, is there anything you could suggest we do to improve our situation?

Thanks

@yashykt


Member

yashykt commented Aug 20, 2018

We are planning on introducing TCP_USER_TIMEOUT to avoid issues where TCP packets are not ACKed in a reasonable time. That should solve all the edge conditions we have seen with regard to channels getting blocked indefinitely (and with keepalive too).
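
For context, this is what the option does at the plain-socket level (not gRPC code): with TCP_USER_TIMEOUT set, the kernel aborts the connection when transmitted data remains unacknowledged for the given time, instead of retrying for the ~15-20 minutes governed by tcp_retries2.

import socket

# TCP_USER_TIMEOUT is Linux-only; older Pythons don't expose the constant,
# so fall back to its numeric value (18) from <linux/tcp.h>.
TCP_USER_TIMEOUT = getattr(socket, 'TCP_USER_TIMEOUT', 18)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, 5000)  # milliseconds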

@srini100


Contributor

srini100 commented Aug 24, 2018

@gaffer-93, discussion on adding TCP_USER_TIMEOUT is here if you would like to chime in with your use case.

@yashykt


Member

yashykt commented Sep 10, 2018

The proposal is merged and the pull request is out for review: #16419

@yashykt


Member

yashykt commented Sep 13, 2018

#16419 was merged.

@FranGM


FranGM commented Sep 14, 2018

Thank you! Is this expected to be a part of 1.15 or will it need to wait for 1.16 to be released?

@srini100


Contributor

srini100 commented Sep 14, 2018

This will be available in 1.16, which will be out in about 6 weeks (since 1.15 just came out). If you want to try it now, you can use the nightly build packages from https://packages.grpc.io/.

@FranGM


FranGM commented Sep 14, 2018

Cheers, that gives us an idea of the timeline. We'll see if we can give the nightlies a go in the meantime :)

@yashykt


Member

yashykt commented Sep 18, 2018

Marking this closed
