
Idle channels sometimes fail RPCs with UNAVAILABLE status #11043

Closed
nathanielmanistaatgoogle opened this issue May 9, 2017 · 22 comments

Comments

@nathanielmanistaatgoogle (Member) commented May 9, 2017

When a gRPC channel goes idle, the first request started after it went idle will sometimes fail with status UNAVAILABLE. This doesn't seem to be in line with the documented cases for UNAVAILABLE status, and since UNAVAILABLE is not a retriable status, implementing correct application-level semantics is more difficult.

Why does the Core say UNAVAILABLE when it doesn't have recent-in-time link-layer evidence to believe that the remote host is UNAVAILABLE? Or is there something I'm misunderstanding about channels going idle?
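
For reference, a minimal sketch (not part of the original report) of how this failure surfaces to application code in the Python client, using the generated hello-world stubs from the grpc repo examples; the target address and the handling shown are illustrative assumptions:

# Illustrative only: how the failure surfaces to application code, using the
# generated hello-world stubs from the grpc repo examples. Target address and
# handling are assumptions.
import grpc
import helloworld_pb2
import helloworld_pb2_grpc

channel = grpc.insecure_channel('localhost:50051')
stub = helloworld_pb2_grpc.GreeterStub(channel)

try:
    response = stub.SayHello(helloworld_pb2.HelloRequest(name='world'))
except grpc.RpcError as err:
    if err.code() == grpc.StatusCode.UNAVAILABLE:
        # The failure described above: the application alone has to decide
        # whether re-issuing the RPC is safe, since UNAVAILABLE by itself
        # does not say whether the request ever reached the server.
        print('RPC after idle failed:', err.details())
    else:
        raise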

@marshallbrekka commented May 10, 2017

Hope this is helpful: I was debugging this issue (or one like it) for the Ruby and Python clients and captured trace logs of the error occurring.

Test Method

The test is pretty simple:

  1. Issue first RPC
  2. Restart server
  3. Issue second RPC

I did this manually by having the client sleep 15 seconds between RPCs and restarting the server between calls (see the sketch below).

The tests were run from the ruby:2.3.1 and python:3.5.2 docker images.

The test used the helloworld example from the grpc repo, with modified ruby and python clients, run with the environment vars

  • GRPC_TRACE=connectivity_state,channel,http,tcp
  • GRPC_VERBOSITY=debug
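
A rough reconstruction of such a modified Python client for the steps above (the exact test code is not shown in this thread, so the names and target here are assumptions):

# Rough reconstruction of the reproduction client; run with
# GRPC_TRACE=connectivity_state,channel,http,tcp GRPC_VERBOSITY=debug.
import time

import grpc
import helloworld_pb2
import helloworld_pb2_grpc

channel = grpc.insecure_channel('localhost:50051')
stub = helloworld_pb2_grpc.GreeterStub(channel)

# 1. Issue first RPC against the running server.
print(stub.SayHello(helloworld_pb2.HelloRequest(name='first')))

# 2. Restart the server by hand while the client sleeps.
time.sleep(15)

# 3. Issue second RPC; on the affected versions this sometimes fails
#    with StatusCode.UNAVAILABLE.
print(stub.SayHello(helloworld_pb2.HelloRequest(name='second')))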

The problem exists for the Ruby gem version 1.2.5 and the Python package version 1.3.0; however, building the Ruby gem from source at 1.3.1 does not exhibit the issue (building from source at 1.2.5 still does).

Building Python from source at 1.3.1 still failed, which led me to believe that this PR potentially had the side effect of fixing the issue for Ruby.

Logs

The logs have been annotated with markers of significant events, all prefixed by ### NOTE.

@nathanielmanistaatgoogle (Member, Author) commented May 10, 2017

When you say "manually restarting the server", what do you mean exactly? Invoking the server's stop method and properly shutting it down? Sending a signal to the server process? If a signal, which signal?

Thank you for the legwork!

@marshallbrekka commented May 10, 2017

Roughly this:

$ ruby greeter_client.rb &
$ ruby greeter_server.rb
# watch for log from server receiving request
<ctrl-c>
$ ruby greeter_server.rb

I was able to reproduce the issue using both Ctrl-C and kill -9 <pid> against the server.

@marshallbrekka commented May 10, 2017

FWIW, I could also reproduce the issue by running a client pod on Kubernetes that looped forever, issuing requests every 15 seconds, against a deployment of 2 server pods, and telling Kubernetes to kill one pod at a time so that there was always at least one healthy pod available.

@marshallbrekka commented May 10, 2017

Another data point: I couldn't reproduce the issue for python when building from source on master, but it looks like a lot of the networking internals in core have changed between v1.3.x and master.

@sulliwane commented May 23, 2017

Same error here:

Client:

  • grpcio@1.3.0
  • python 3.6

Server:

  • grpc@1.3.0
  • node 6.9

Client error message:

  File "/home/project/venv/lib/python3.6/site-packages/grpc/_channel.py", line 455, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, Endpoint read failed)>
@marshallbrekka commented May 23, 2017

Verified the issue is no longer present (for my test case) on Ruby grpc-1.3.4 and Python grpcio-1.3.5.

@vetyy commented May 26, 2017

Hello, I can still reproduce this problem using Python 3.5.2 (default, Nov 17 2016, 17:05:23) [GCC 5.4.0 20160609], grpcio==1.3.5, Ubuntu 16.04.2 LTS.

If I edit kpayson64's ReconnectTest and put a for loop around the test, let's say for _ in range(2000), it will randomly fail, sometimes after 10 iterations, sometimes after 500+.

What I am doing in my code is: I create a grpc.insecure_channel once in setUpClass, then in each setUp / tearDown I create a server, start it, stop it, and delete it (roughly the pattern sketched below). Running 20+ unit tests per class across 30 classes, this almost always fails my pipeline with this exception.
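
A minimal sketch of that pattern, assuming the hello-world service from the grpc examples (the class names, port, and servicer here are illustrative, not the actual test suite):

# Hypothetical sketch of the pattern described above: one long-lived channel
# per test class, a fresh server started and stopped around every test.
import unittest
from concurrent import futures

import grpc
import helloworld_pb2
import helloworld_pb2_grpc


class _Greeter(helloworld_pb2_grpc.GreeterServicer):
    def SayHello(self, request, context):
        return helloworld_pb2.HelloReply(message='Hello, %s!' % request.name)


class ReconnectStyleTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.channel = grpc.insecure_channel('localhost:50051')
        cls.stub = helloworld_pb2_grpc.GreeterStub(cls.channel)

    def setUp(self):
        self.server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
        helloworld_pb2_grpc.add_GreeterServicer_to_server(_Greeter(), self.server)
        self.server.add_insecure_port('localhost:50051')
        self.server.start()

    def tearDown(self):
        # Stopping and deleting the server while the channel stays alive is
        # what intermittently produced UNAVAILABLE on a later test's RPC.
        self.server.stop(None)
        del self.server

    def test_rpc_after_server_restart(self):
        self.stub.SayHello(helloworld_pb2.HelloRequest(name='test'))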

@kpayson64 (Contributor) commented Jun 8, 2017

Fixed by #11154

@fho (Contributor) commented Nov 17, 2017

Could it be that the problem still exists in the PHP client?
We are getting Endpoint read failed errors on the client on long-lived and (often) idle connections.

@wenerme commented Nov 21, 2017

PHP with the newest client still has this problem.

@fho (Contributor) commented Nov 29, 2017

@wenbozhu The ticket is labeled with lang/all wrapped languages.
The issue still exists in the PHP client, but the ticket was already closed.
Should it be reopened?

@apolcyn (Contributor) commented Nov 29, 2017

@ZhouyihaiDing are you able to investigate this issue as it affects PHP?

@ZhouyihaiDing (Contributor) commented Dec 6, 2017

Sorry for the delay. Starting to look into this in PHP.

@MitchDart commented Jan 29, 2018

I am having this issue in C# .NET Core. If a channel idles for long enough, I get this exception.

@jtattermusch (Contributor) commented Feb 7, 2018

@MitchDart can you provide more details about the issue you are seeing? (the usual info: platform, gRPC version, and any other details that would help with reproducing, etc.)

@jtattermusch (Contributor) commented Feb 7, 2018

@MitchDart I think the problem was resolved by #11021 (available in v1.3.5+). Are you seeing evidence that contradicts that?

@MitchDart commented Feb 7, 2018

@jtattermusch It has been a rough few days, but I believe this is because of my current setup. I am using Grpc 1.8.3 in C#, with Kubernetes pods hosting all my clients and servers in a microservice architecture. I believe the problem is that each Kubernetes service has multiple pods, meaning the DNS name remains the same but the IP addresses change continuously, so these idle channels simply "lose" their connection. I decided instead to construct a new channel for each "scope" of work and destroy the channel afterwards using ShutdownAsync. The problem is that ShutdownAsync takes around 12 seconds to finish!? This seemed to solve the "Unavailable" problem, but I now get Status(StatusCode=Unknown, Detail="Stream removed") instead.

@jtattermusch (Contributor) commented Feb 7, 2018

ShutdownAsync taking that long can be an indication of some objects (calls, channels, ...) not being cleaned up properly.
Have you tried GRPC_VERBOSITY=debug and GRPC_TRACE=.... from https://github.com/grpc/grpc/blob/master/doc/environment_variables.md?

@MitchDart commented Feb 7, 2018

Thanks @jtattermusch, I will check that out and let you know. Just two quick questions that maybe you can help me with.

  1. Unary async calls are just awaited and stream calls are disposed with "using". Is there anything else I need to do to clean up the calls naturally? I can't seem to find anything in the samples.

  2. Do you think setting up channels on a per-scope basis is better than long-lived channels whose IP address could change while idle?

Thanks for your help, it's much appreciated.

@jtattermusch (Contributor) commented Feb 7, 2018

  1. The native objects held by client-side call objects will be cleaned up when status is received from the server and you have no pending operations on the call. What Dispose() does is cancel the call if it hasn't finished yet, which results in the status from the server being delivered soon afterward (and the call being cleaned up).
    With streaming calls you should always finish reading the response stream (MoveNext()) until the call finishes successfully or you get an RpcException with a status; abandoning the response stream can lead to the call not being cleaned up properly.

  2. That really depends on your application. In general, short-lived channels have more overhead (creating a TCP connection), but you of course cannot have streaming calls survive things like a change of the backend's IP (as you seem to be describing). For long-lived streaming calls, you should always have a way to terminate the call gracefully and start a new one (that's necessary for things like load balancing to work properly; a rough Python analogue is sketched below).
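
For readers following this thread from Python rather than C#, a rough analogue of the same advice about not abandoning a response stream (the stub and its ListItems method are hypothetical; grpc Python exposes the equivalent behaviour through the response iterator and cancel() rather than Dispose()/MoveNext()):

# Rough Python analogue of the advice above. The stub and its server-streaming
# ListItems method are hypothetical stand-ins for a real service.
import grpc

def consume_stream(stub, request):
    call = stub.ListItems(request)  # server-streaming call: an iterator that is also a grpc.Call
    try:
        for item in call:
            print(item)  # stand-in for real per-item processing
    except grpc.RpcError as err:
        # The stream ends with a status; errors surface here as RpcError.
        print('stream terminated with status:', err.code())
    finally:
        # If iteration stopped early for any reason, cancel the call so it is
        # cleaned up instead of being abandoned mid-stream. Cancelling an
        # already-finished call has no effect.
        call.cancel()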

@congdufs commented Apr 16, 2018

@nathanielmanistaatgoogle @marshallbrekka When a gRPC channel goes idle, the first request started after it went idle will fail with Error Domain=io.grpc Code=2 "Stream removed" UserInfo={NSLocalizedDescription=Stream removed, io.grpc.TrailersKey={}}, do you have any idea? Thank you very much.

lock bot locked as resolved and limited conversation to collaborators Sep 29, 2018
