gRPC with TLS fails to complete handshake on an OS that does not honor O_NONBLOCK #12711

Closed
MitchRudominer opened this issue Sep 25, 2017 · 6 comments


@MitchRudominer

cobalt.out.zip
Please answer these questions before submitting your issue.

Should this be an issue in the gRPC issue tracker?

Yes, this is a bug.

What version of gRPC and what language are you using?

Close to the current HEAD
commit d0eebf3bbcadee56755bc91fda96fb7a1a27f951

What operating system (Linux, Windows, …) and version?

Fuchsia (see https://en.wikipedia.org/wiki/Google_Fuchsia and https://fuchsia.googlesource.com/)

What runtime / compiler are you using (e.g. python version or version of gcc)

C++/clang

What did you do?

We are writing a C++ application that uses gRPC. Things work well in all of the following configurations:

  • TLS, client running on Linux
  • No TLS, client running on Linux
  • No TLS, client running on Fuchsia

But things do not work using:

  • TLS, client running on Fuchsia

The external behavior we see is that the request times out: gRPC returns DEADLINE_EXCEEDED.

We have done a fair bit of debugging; the details follow.

We believe we have isolated the problem to the block of code around line 292 in the function
tcp_client_connect_impl() in the file //src/core/lib/iomgr/tcp_client_posix.c.
https://github.com/grpc/grpc/blob/master/src/core/lib/iomgr/tcp_client_posix.c#L292

  if (err >= 0) {
    *ep = grpc_tcp_client_create_from_fd(exec_ctx, fdobj, channel_args, addr_str);
    GRPC_CLOSURE_SCHED(exec_ctx, closure, GRPC_ERROR_NONE);
    goto done;
  }

On line 284, connect() is invoked:

  err = connect(fd, (const struct sockaddr *)addr->addr, (socklen_t)addr->len);

and the block of code at line 292 is responsible for handling the case where connect() succeeds immediately. The thing is, elsewhere in the gRPC code the O_NONBLOCK flag has been set on the file descriptor. I believe this means that when connect() is executed on Linux, the block at line 292 is never reached: Linux always returns err < 0 with errno = EINPROGRESS, so the block of code around line 306 is executed instead:

 grpc_pollset_set_add_fd(exec_ctx, interested_parties, fdobj);

  ac = (async_connect *)gpr_malloc(sizeof(async_connect));
  ac->closure = closure;
  ac->ep = ep;
  ac->fd = fdobj;
  ac->interested_parties = interested_parties;
  ac->addr_str = addr_str;
  addr_str = NULL;
  gpr_mu_init(&ac->mu);
  ac->refs = 2;
.......

and the block around line 292 (i.e. the success path) is never executed. We believe this accounts for the fact that there is a bug in that block of code: it never gets tested on Linux.
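
For context, here is a minimal standalone sketch of the POSIX non-blocking connect() behavior described above. This is my own illustration, not gRPC code; the start_connect name is made up.

  /* Illustrative sketch of non-blocking connect() behavior on Linux; not gRPC code. */
  #include <errno.h>
  #include <fcntl.h>
  #include <sys/socket.h>

  static int start_connect(int fd, const struct sockaddr *addr, socklen_t len) {
    /* Put the socket into non-blocking mode, as gRPC does elsewhere. */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    int err = connect(fd, addr, len);
    if (err == 0) {
      /* Immediate success: for a remote peer on Linux this branch is
         essentially never taken, which is why the success path around
         line 292 goes untested there. */
      return 0;
    }
    if (errno == EINPROGRESS) {
      /* The usual Linux outcome for a non-blocking socket: the connect
         continues in the background and the caller must wait for the fd
         to become writable (gRPC does this via its pollset machinery). */
      return 1;
    }
    return -1; /* genuine connection failure */
  }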

But it does get exercised when this code is run on Fuchsia. This is due to what may be considered a bug in Fuchsia: Fuchsia does not yet honor the O_NONBLOCK flag. This means that when connect() is invoked on Fuchsia, Fuchsia performs a blocking connect and returns err = 0, so the block of code referenced above does get executed. When TLS is being used, this fails. We recognize that Fuchsia is not behaving correctly and should honor the O_NONBLOCK flag. Nevertheless, the code path in gRPC that handles connect() returning err = 0 appears to be broken and should be fixed.

More details about what happens:
I am including a file called cobalt.out.zip that contains a tcpdump pcap file. You can see that the TCP handshake proceeds, the TLS handshake starts, the server sends its certificate, and the client ACKs it (packet number 9 at time 0.082340). Then there is nothing for 20 seconds, followed by a FIN, ACK. (This repeats multiple times as our application, called Cobalt, retries.)

gRPC sets up BoringSSL with memory-backed read and write BIOs in create_tsi_ssl_handshaker in src/core/tsi/ssl_transport_security.c. That same function then calls SSL_do_handshake. As BoringSSL progresses through its state machine, it writes the ClientHello into the write buffer and flushes it. gRPC correctly puts these bytes on the wire.

BoringSSL continues through its state machine until it tries to read the ServerHello. That message is being sent by the server, as the pcap shows, but it is never placed in the read buffer. BoringSSL errors out with SSL_ERROR_WANT_READ, which create_tsi_ssl_handshaker was expecting, and the handshaker sets its result to TSI_HANDSHAKE_IN_PROGRESS. But, for reasons unknown, gRPC never calls tcp_read from tcp_posix.c.
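
To make that flow concrete, here is a rough sketch of the memory-BIO handshake pattern being described, written against the standard BoringSSL/OpenSSL API. This is my own illustration, not the actual create_tsi_ssl_handshaker code, and start_client_handshake is a made-up name; error cleanup is omitted for brevity.

  #include <openssl/bio.h>
  #include <openssl/ssl.h>

  /* Illustration only: roughly how a memory-BIO-backed client handshake
     is driven; not the actual tsi code. */
  static int start_client_handshake(SSL_CTX *ctx) {
    SSL *ssl = SSL_new(ctx);
    BIO *rbio = BIO_new(BIO_s_mem()); /* bytes from the peer get fed in here */
    BIO *wbio = BIO_new(BIO_s_mem()); /* outgoing bytes get drained from here */
    SSL_set_bio(ssl, rbio, wbio);     /* ssl takes ownership of both BIOs */
    SSL_set_connect_state(ssl);

    int ret = SSL_do_handshake(ssl);
    if (ret == 1) return 0; /* handshake complete */
    if (SSL_get_error(ssl, ret) == SSL_ERROR_WANT_READ) {
      /* Expected state: the ClientHello now sits in wbio and must be
         written to the socket; the handshake cannot advance until the
         ServerHello is written into rbio.  In the failure described in
         this issue, nothing ever reads the socket, so the handshake
         stays stuck in this state until the deadline expires. */
      return 1; /* in progress */
    }
    return -1; /* unexpected handshake error */
  }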

@kpayson64
Contributor

FWIW, I encountered the same issue on App Engine and created a reproduction case:
https://github.com/grpc/grpc/pull/11574/files

I attempted a fix in #11571, but it didn't respect invariants of internal code and was abandoned.

@y-zeng
Contributor

y-zeng commented Sep 25, 2017

We've noticed this issue. If connect() returns err = 0, the endpoint is not added to the connector's pollset_set, so no completion queue drives the handshake process, which causes DEADLINE_EXCEEDED.

Please try patching in #12219; we will merge it soon.
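
For readers following along, the change being described would presumably look something like the following in the err >= 0 branch quoted earlier. This is only a rough illustration of the idea, not the actual contents of #12219.

  /* Rough illustration of the idea described above, NOT the actual #12219
     change: register the fd with the interested pollset_set on the
     immediate-success path too, so that a completion queue drives the
     subsequent (e.g. TLS) handshake. */
  if (err >= 0) {
    grpc_pollset_set_add_fd(exec_ctx, interested_parties, fdobj);
    *ep = grpc_tcp_client_create_from_fd(exec_ctx, fdobj, channel_args, addr_str);
    GRPC_CLOSURE_SCHED(exec_ctx, closure, GRPC_ERROR_NONE);
    goto done;
  }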

@y-zeng
Contributor

y-zeng commented Sep 25, 2017

Also, this should not cause DEADLINE_EXCEEDED after b2170fe.

Could you please confirm the version of gRPC? I could not find the commit d0eebf3bbcadee56755bc91fda96fb7a1a27f951.

@MitchRudominer
Author

MitchRudominer commented Sep 25, 2017

You are right, sorry. That commit hash was from our own fork of gRPC. (We only forked for the sake of adding a BUILD.gn file.) The commit at which we forked is bf2d87c.

@MitchRudominer
Author

Please note that on Fuchsia there is no epoll. We use poll.

@vjpai
Member

vjpai commented Dec 11, 2017

Since there have been no comments on this issue since #12219 was merged, I'm marking it to be closed.

@ctiller ctiller closed this as completed Dec 11, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Sep 30, 2018