-
Notifications
You must be signed in to change notification settings - Fork 10.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gRPC with TLS fails to complete handshake on an OS that does not honor O_NONBLOCK #12711
Comments
FWIW, I encountered the same issue on App Engine and created a reproduction case: I attempted a fix here, but it didn't respect invariants of internal code and was abandoned. |
We've noticed this issue. If connect() returns err=0, the endpoint will not be added to the connector's pollset_set, so that no completion queue will be driving the handshake process, causing DEADLINE_EXCEEDED. Please try patching #12219, we will merge it soon. |
Also, this should not cause DEADLINE_EXCEEDED after b2170fe. Could you please confirm the version of gRPC? I could not find the commit d0eebf3bbcadee56755bc91fda96fb7a1a27f951. |
You are right, sorry. That commit hash was from our own fork of gRPC. (We only forked for the sake adding a BUILD.gn file). The commit at which we forked is bf2d87c. |
Please note that on Fuchsia there is no e-poll. We use poll. |
Since there are no comments on this issue since the merger of #12219, I'm marking it to close. |
cobalt.out.zip
Please answer these questions before submitting your issue.
Should this be an issue in the gRPC issue tracker?
Yes, this is a bug.
What version of gRPC and what language are you using?
Close to the current HEAD
commit d0eebf3bbcadee56755bc91fda96fb7a1a27f951
What operating system (Linux, Windows, …) and version?
Fuchsia (see https://en.wikipedia.org/wiki/Google_Fuchsia and https://fuchsia.googlesource.com/)
What runtime / compiler are you using (e.g. python version or version of gcc)
C++/clang
What did you do?
We are wrting a C++ application that uses gRPC. Things work well in all of the following configurations:
But things do not work using:
The external behavior we see is that the request times out: gRPC returns DEADLINE_EXCEEDED.
We have done a fair bit of debugging and below I provide the details:
We believe we have isolated the problem to the block of code around line 292 in the function
tcp_client_connect_impl() in the file //src/core/lib/iomgr/tcp_client_posix.c.
https://github.com/grpc/grpc/blob/master/src/core/lib/iomgr/tcp_client_posix.c#L292
On line 284 connect() is invoked
and the block of code at line 292 is responsible for handling the case that
connect succeeds immediately. The thing is, elsewhere in gRPC code the flag O_NONBLOCK has
been set on the file descriptor. I believe this means that when the connect() is executed on Linux,
the block of code on line 292 is never executed because Linux always returns err < 0 and
errno = EINPROGRESS so that the block of code on around line 306 is always executed:
and the block around line 292 (i.e. the success path) is never executed. We believe this acccounts
for that fact that there is a bug in that block of code--it never gets tested on Linux.
But it does get exercised when this code is run on Fuchsia. This is due to what may be considered a bug in Fuchsia: Fuchsia has does not yet honor the O_NONBLOCK flag. This means that
when connect() is invoked on Fuchsia, Fuchsia performs a blocking connect and returns
err = 0 so that the block of code referenced above does get executed. When TLS is being used this
fails. So we recognize that Fuchsia is not behaving correctly--it should honor the O_NONBLOCK flag.
Nevertheless it seems as if the code path in gRPC to handle the case that connect() returns err=0
is broken and should be fixed.
More details about what happens:
I am including a file called cobalt.out.zip that contains a tcpdump pcap file. You can see that the TCP handshake proceeds and the TLS handshake starts and the server sends the certificate and the client ACKs (in packet number 9 at time 0.082340 ) and then there is nothing for 20 seconds and then there is a FIN, ACK. (Then this repeats multiple times as our application, called Cobalt, does retries.)
gRPC sets up BoringSSL with memory-backed read and write BIOs in create_tsi_ssl_handshaker in src/core/tsi/ssl_transport_security.c.
That same function then calls SSL_do_handshake. As BoringSSL progresses through the state machine, it writes the ClientHello into the write buffer and flushes it.
gRPC correctly puts these bytes on the wire.
BoringSSL continues in its state machine until it tries to read the ServerHello. This message is being sent by the server, as can be seen in the pcaps, but it isn't being placed in the read buffer.
BoringSSL errors out with SSL_ERROR_WANT_READ, which create_tsi_ssl_handshaker was expecting. It sets its result to TSI_HANDSHAKE_IN_PROGRESS.
But, for reasons unknown, gRPC never calls tcp_read from tcp_posix.c.
The text was updated successfully, but these errors were encountered: