New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RPC deadline not honored when resolver does not return #12897

Closed
NanXiao opened this Issue Oct 9, 2017 · 12 comments

Comments

Projects
None yet
4 participants
@NanXiao

NanXiao commented Oct 9, 2017

What version of gRPC and what language are you using?

The master branch, C++.

What operating system (Linux, Windows, …) and version?

ArchLinux.

What runtime / compiler are you using (e.g. python version or version of gcc)

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/7.1.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --disable-multilib --disable-werror --enable-checking=release --enable-default-pie --enable-default-ssp
Thread model: posix
gcc version 7.1.1 20170630 (GCC)

What did you do?

I modify greeter_client.cc:

        GreeterClient greeter(grpc::CreateChannel(
          "localhost:50051", grpc::InsecureChannelCredentials()));

to use a fake host name, such as "badhost":

        GreeterClient greeter(grpc::CreateChannel(
          "badhost:50051", grpc::InsecureChannelCredentials()));

What did you expect to see?

The client's SayHello() can return.

What did you see instead?

The client blocks the SayHello() for ever.

@NanXiao NanXiao changed the title from The Client will block forever when the host address faked. to The Client will block forever when the host address is faked. Oct 9, 2017

@apolcyn

This comment has been minimized.

Show comment
Hide comment
@apolcyn

apolcyn Oct 10, 2017

Contributor

This looks similar to #12272.

cc @y-zeng - I believe the issue is that failed DNS resolution (with the native resolver) doesn't follow wait_for_ready semantics. (i.e., if an RPC is made with the default setting of fail_fast to truewait_for_ready=true, and the hostname fails to resolve, then the resolver will repeatedly retry name resolution rather than setting connectivity state to transient failure and ending the call).

Contributor

apolcyn commented Oct 10, 2017

This looks similar to #12272.

cc @y-zeng - I believe the issue is that failed DNS resolution (with the native resolver) doesn't follow wait_for_ready semantics. (i.e., if an RPC is made with the default setting of fail_fast to truewait_for_ready=true, and the hostname fails to resolve, then the resolver will repeatedly retry name resolution rather than setting connectivity state to transient failure and ending the call).

@NanXiao

This comment has been minimized.

Show comment
Hide comment
@NanXiao

NanXiao Oct 11, 2017

@apolcyn

Thanks for your comment!

The Client't main thread is blocking in grpc::CompletionQueue::Pluck:

#0  0x00007ffff6619076 in epoll_pwait () from /usr/lib/libc.so.6
#1  0x00007ffff7145b78 in ?? () from /usr/lib/libgrpc.so.4
#2  0x00007ffff7144b24 in ?? () from /usr/lib/libgrpc.so.4
#3  0x00007ffff714c25f in grpc_pollset_work () from /usr/lib/libgrpc.so.4
#4  0x00007ffff717482c in ?? () from /usr/lib/libgrpc.so.4
#5  0x00007ffff7172f23 in grpc_completion_queue_pluck () from /usr/lib/libgrpc.so.4
#6  0x00007ffff74f746e in grpc::CoreCodegen::grpc_completion_queue_pluck(grpc_completion_queue*, void*, gpr_timespec, void*) () from /usr/lib/libgrpc++.so.1
#7  0x000055555556b36d in grpc::CompletionQueue::Pluck (this=0x7fffffffe4a0, tag=0x7fffffffe550)
    at /usr/include/grpc++/impl/codegen/completion_queue.h:224
#8  0x000055555556c869 in grpc::BlockingUnaryCall<helloworld::HelloRequest, helloworld::HelloReply> (
    channel=0x5555557975d0, method=..., context=0x7fffffffe760, request=..., result=0x7fffffffe6d0)
    at /usr/include/grpc++/impl/codegen/client_unary_call.h:59
#9  0x0000555555569254 in helloworld::Greeter::Stub::SayHello (this=0x555555790e10, context=0x7fffffffe760,
    request=..., response=0x7fffffffe6d0) at helloworld.grpc.pb.cc:32
#10 0x000055555557251a in GreeterClient::SayHello (this=0x7fffffffe928, user="world") at greeter_client.cc:58
#11 0x000055555557221b in main (argc=1, argv=0x7fffffffea98) at greeter_client.cc:82

While another thread is keeping on retrying getaddrinfo():

#0  0x00007ffff6604604 in getaddrinfo () from /usr/lib/libc.so.6
#1  0x00007ffff714ee1e in ?? () from /usr/lib/libgrpc.so.4
#2  0x00007ffff714f38c in ?? () from /usr/lib/libgrpc.so.4
#3  0x00007ffff714d020 in ?? () from /usr/lib/libgrpc.so.4
#4  0x00007ffff714cf12 in ?? () from /usr/lib/libgrpc.so.4
#5  0x00007ffff71fff57 in ?? () from /usr/lib/libgrpc.so.4
#6  0x00007ffff7755049 in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007ffff6618f0f in clone () from /usr/lib/libc.so.6

BTW, how can I build libgrpc.so with debug information? (I have posted it in group, but unfortunately no response.) Thanks in advance!

NanXiao commented Oct 11, 2017

@apolcyn

Thanks for your comment!

The Client't main thread is blocking in grpc::CompletionQueue::Pluck:

#0  0x00007ffff6619076 in epoll_pwait () from /usr/lib/libc.so.6
#1  0x00007ffff7145b78 in ?? () from /usr/lib/libgrpc.so.4
#2  0x00007ffff7144b24 in ?? () from /usr/lib/libgrpc.so.4
#3  0x00007ffff714c25f in grpc_pollset_work () from /usr/lib/libgrpc.so.4
#4  0x00007ffff717482c in ?? () from /usr/lib/libgrpc.so.4
#5  0x00007ffff7172f23 in grpc_completion_queue_pluck () from /usr/lib/libgrpc.so.4
#6  0x00007ffff74f746e in grpc::CoreCodegen::grpc_completion_queue_pluck(grpc_completion_queue*, void*, gpr_timespec, void*) () from /usr/lib/libgrpc++.so.1
#7  0x000055555556b36d in grpc::CompletionQueue::Pluck (this=0x7fffffffe4a0, tag=0x7fffffffe550)
    at /usr/include/grpc++/impl/codegen/completion_queue.h:224
#8  0x000055555556c869 in grpc::BlockingUnaryCall<helloworld::HelloRequest, helloworld::HelloReply> (
    channel=0x5555557975d0, method=..., context=0x7fffffffe760, request=..., result=0x7fffffffe6d0)
    at /usr/include/grpc++/impl/codegen/client_unary_call.h:59
#9  0x0000555555569254 in helloworld::Greeter::Stub::SayHello (this=0x555555790e10, context=0x7fffffffe760,
    request=..., response=0x7fffffffe6d0) at helloworld.grpc.pb.cc:32
#10 0x000055555557251a in GreeterClient::SayHello (this=0x7fffffffe928, user="world") at greeter_client.cc:58
#11 0x000055555557221b in main (argc=1, argv=0x7fffffffea98) at greeter_client.cc:82

While another thread is keeping on retrying getaddrinfo():

#0  0x00007ffff6604604 in getaddrinfo () from /usr/lib/libc.so.6
#1  0x00007ffff714ee1e in ?? () from /usr/lib/libgrpc.so.4
#2  0x00007ffff714f38c in ?? () from /usr/lib/libgrpc.so.4
#3  0x00007ffff714d020 in ?? () from /usr/lib/libgrpc.so.4
#4  0x00007ffff714cf12 in ?? () from /usr/lib/libgrpc.so.4
#5  0x00007ffff71fff57 in ?? () from /usr/lib/libgrpc.so.4
#6  0x00007ffff7755049 in start_thread () from /usr/lib/libpthread.so.0
#7  0x00007ffff6618f0f in clone () from /usr/lib/libc.so.6

BTW, how can I build libgrpc.so with debug information? (I have posted it in group, but unfortunately no response.) Thanks in advance!

@murgatroid99

This comment has been minimized.

Show comment
Hide comment
@murgatroid99

murgatroid99 Oct 11, 2017

Member

I don't think we would really consider this a bug and that stack trace looks like I would expect in the situation you described. The library isn't "stuck" or anything; it is simply still trying to make your request. If you set a deadline, then after that deadline it will stop trying, and the SayHello call will return.

Member

murgatroid99 commented Oct 11, 2017

I don't think we would really consider this a bug and that stack trace looks like I would expect in the situation you described. The library isn't "stuck" or anything; it is simply still trying to make your request. If you set a deadline, then after that deadline it will stop trying, and the SayHello call will return.

@apolcyn

This comment has been minimized.

Show comment
Hide comment
@apolcyn

apolcyn Oct 11, 2017

Contributor

@murgatroid99 I've heard the other argument - I'm wondering what wait_for_ready should mean for name resolution.

Contributor

apolcyn commented Oct 11, 2017

@murgatroid99 I've heard the other argument - I'm wondering what wait_for_ready should mean for name resolution.

@NanXiao

This comment has been minimized.

Show comment
Hide comment
@NanXiao

NanXiao Oct 11, 2017

@murgatroid99 I find set_deadline() has a problem: for example, if I set 100 seconds, even it is communicating when the 100 seconds are due, the connection is still cut off. So I think what I need is when the connection is in "IDLE" for 100 seconds, then the connection should be over.

BTW, can gRPC do some retry count limitaion? Such as if it has tried 10 times and can't get the correct DNS result, report to caller function it is failed?

Thanks!

NanXiao commented Oct 11, 2017

@murgatroid99 I find set_deadline() has a problem: for example, if I set 100 seconds, even it is communicating when the 100 seconds are due, the connection is still cut off. So I think what I need is when the connection is in "IDLE" for 100 seconds, then the connection should be over.

BTW, can gRPC do some retry count limitaion? Such as if it has tried 10 times and can't get the correct DNS result, report to caller function it is failed?

Thanks!

@NanXiao

This comment has been minimized.

Show comment
Hide comment
@NanXiao

NanXiao Oct 11, 2017

@apolcyn @murgatroid99 After debugging, I find it is because in executor_thread:

if (ts->shutdown) {
    if (GRPC_TRACER_ON(executor_trace)) {
      gpr_log(GPR_DEBUG, "EXECUTOR[%d]: shutdown",
              (int)(ts - g_thread_state));
    }
    gpr_mu_unlock(&ts->mu);
    break;
  }

The ts->shutdown is always false, so retry getaddrinfo() again and again.

Thanks!

NanXiao commented Oct 11, 2017

@apolcyn @murgatroid99 After debugging, I find it is because in executor_thread:

if (ts->shutdown) {
    if (GRPC_TRACER_ON(executor_trace)) {
      gpr_log(GPR_DEBUG, "EXECUTOR[%d]: shutdown",
              (int)(ts - g_thread_state));
    }
    gpr_mu_unlock(&ts->mu);
    break;
  }

The ts->shutdown is always false, so retry getaddrinfo() again and again.

Thanks!

@markdroth

This comment has been minimized.

Show comment
Hide comment
@markdroth

markdroth Oct 16, 2017

Contributor

I suspect that this is related to #9646 and/or #12867.

Contributor

markdroth commented Oct 16, 2017

I suspect that this is related to #9646 and/or #12867.

@markdroth

This comment has been minimized.

Show comment
Hide comment
@markdroth

markdroth Oct 16, 2017

Contributor

Actually, I take back what I said -- I don't think those issues are relevant.

I don't think wait_for_ready is relevant in this case; that affects behavior when an initial connection attempt fails, but in this case we are not even getting to try a connection, because the resolver is not returning any results.

It is actually intentional behavior that the resolver does not return any results for an unknown address. This is because even if an address is unknown at the moment the channel is created, the name could potentially spring into existence later, at which point the channel would start connecting and everything would start working again.

That having been said, we should honor the RPC deadline even if the resolver doesn't return anything, and I suspect that we're not doing that right now. So I think that's the thing that needs to be fixed here.

Contributor

markdroth commented Oct 16, 2017

Actually, I take back what I said -- I don't think those issues are relevant.

I don't think wait_for_ready is relevant in this case; that affects behavior when an initial connection attempt fails, but in this case we are not even getting to try a connection, because the resolver is not returning any results.

It is actually intentional behavior that the resolver does not return any results for an unknown address. This is because even if an address is unknown at the moment the channel is created, the name could potentially spring into existence later, at which point the channel would start connecting and everything would start working again.

That having been said, we should honor the RPC deadline even if the resolver doesn't return anything, and I suspect that we're not doing that right now. So I think that's the thing that needs to be fixed here.

@markdroth markdroth changed the title from The Client will block forever when the host address is faked. to RPC deadline not honored when resolver does not return Oct 16, 2017

@NanXiao

This comment has been minimized.

Show comment
Hide comment
@NanXiao

NanXiao Oct 23, 2017

@apolcyn @murgatroid99 @markdroth Is there any update on this issue? Thanks!

NanXiao commented Oct 23, 2017

@apolcyn @murgatroid99 @markdroth Is there any update on this issue? Thanks!

@markdroth markdroth self-assigned this Oct 23, 2017

@markdroth

This comment has been minimized.

Show comment
Hide comment
@markdroth

markdroth Oct 23, 2017

Contributor

I'm the right person to fix this, but it may be a couple of weeks before I have time to look at it. I'll update the bug when I do.

Contributor

markdroth commented Oct 23, 2017

I'm the right person to fix this, but it may be a couple of weeks before I have time to look at it. I'll update the bug when I do.

@NanXiao

This comment has been minimized.

Show comment
Hide comment
@NanXiao

NanXiao Oct 23, 2017

@markdroth Got it, thanks!

NanXiao commented Oct 23, 2017

@markdroth Got it, thanks!

@markdroth

This comment has been minimized.

Show comment
Hide comment
@markdroth

markdroth May 3, 2018

Contributor

I can't reproduce this. I changed the greeter client to change localhost to badhost and added the following lines:

   context.set_wait_for_ready(true);
   context.set_deadline(
       std::chrono::system_clock::now() + std::chrono::seconds(10));

When I ran it, the RPC did indeed return DEADLINE_EXCEEDED after 10 seconds.

Please re-open if this is still happening for you.

Contributor

markdroth commented May 3, 2018

I can't reproduce this. I changed the greeter client to change localhost to badhost and added the following lines:

   context.set_wait_for_ready(true);
   context.set_deadline(
       std::chrono::system_clock::now() + std::chrono::seconds(10));

When I ran it, the RPC did indeed return DEADLINE_EXCEEDED after 10 seconds.

Please re-open if this is still happening for you.

@markdroth markdroth closed this May 3, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment