Likely invalid handling of the goaway frames #2176

Closed
pfreixes opened this issue Jul 28, 2022 · 8 comments · Fixed by #2308



pfreixes commented Jul 28, 2022

Problem description

We have sporadically suffered issues with streams that hang forever. We believe the issue comes from the way GOAWAY frames are currently handled by grpc-node, which appears to differ from other implementations such as the Golang one [1].

In our scenario the streams stay open indefinitely and we expect them to recover. We use heartbeats as a defensive mechanism against TCP connections that are no longer usable, for whatever reason, and were not closed explicitly by the server.

From our understanding, the current code cannot guarantee that heartbeats remain active in some scenarios, specifically during a GOAWAY event [2], which immediately stops [3] the handlers that would otherwise close the session proactively when no heartbeat responses are seen for some time.

This is, in our opinion, a problem: during that window of time, TCP connections that are no longer usable become unprotected, because the heartbeats will not eventually kick in to close the connection on the client side.
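
For reference, this is roughly how we enable the heartbeats on the client side (a minimal sketch with @grpc/grpc-js; the target address and the timing values are placeholders):

import * as grpc from '@grpc/grpc-js';

// Channel options enabling client-side heartbeats (keepalive pings).
const channelOptions: grpc.ChannelOptions = {
  'grpc.keepalive_time_ms': 10_000,          // ping after 10s of inactivity
  'grpc.keepalive_timeout_ms': 5_000,        // drop the connection if the ping is not acked within 5s
  'grpc.keepalive_permit_without_calls': 1,  // keep pinging while streams are idle
};

// The options are passed to the (generated or generic) client constructor;
// 'localhost:8082' is a placeholder target.
const client = new grpc.Client(
  'localhost:8082',
  grpc.credentials.createInsecure(),
  channelOptions
);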

Further implication of this

There is another implication of this strategy that goes beyond the bug itself. When a GOAWAY frame is consumed, the subchannel connection is marked IDLE, and this state is also propagated through getConnectivityState, reporting to the caller (the user of the API) that the status is IDLE. From our understanding [4], IDLE means that there are no outstanding RPCs, which is not true in the GOAWAY case, where there is still at least one outstanding RPC.
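
For illustration, this is roughly how that can be observed from the caller side (a sketch; it assumes a client created as in the snippet above):

import * as grpc from '@grpc/grpc-js';

// Right after the GOAWAY the channel reports IDLE even though a stream is
// still running on it.
function logState(client: grpc.Client): void {
  const state = client.getChannel().getConnectivityState(false); // false: do not trigger a reconnect
  if (state === grpc.connectivityState.IDLE) {
    // Per the connectivity semantics doc we expected IDLE to imply
    // "no outstanding RPCs", which does not hold here.
    console.log('channel reports IDLE while a stream is still open');
  }
}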

[1] https://github.com/grpc/grpc-go/blob/master/internal/transport/http2_client.go#L1199
[2] https://github.com/grpc/grpc-node/blob/master/packages/grpc-js/src/subchannel.ts#L525
[3] https://github.com/grpc/grpc-node/blob/master/packages/grpc-js/src/subchannel.ts#L718
[4] https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md

Reproduction steps

We reproduced this by placing Nginx in front of a simple gRPC server and using the keepalive_time and grpc_read_timeout directives to produce the following behaviour:

  • When the keepalive_time (let's say configured to 120s) is reached, the next RPC receives a GOAWAY frame
  • After the RPC that received the GOAWAY finishes, Nginx proactively closes the TCP connection
  • We also set grpc_read_timeout to something like 60s to proactively stop streams that have had no traffic at all

With Nginx configured and a gRPC stream server behind it, start a gRPC stream client and run the same stream repeatedly, retrying each time it is closed (which happens because of grpc_read_timeout). Eventually, when keepalive_time kicks in, the heartbeats stop working; if at that point you add a rule (iptables, pf, etc.) that drops any response from the server, the client will not notice and the stream will hang forever.

Environment

  • OS name, version and architecture: mac
  • Node version: v16.16.0
  • Node installation method: nvm
  • Package name and version: 1.6.8
@pfreixes
Author

@lidizheng 😉

@lidizheng
Contributor

@pfreixes Hi, long time no see. This does sound strange. The repro steps are nice. The lowest-friction way to let others help with debugging would be to provide a minimal repro example.

CC @murgatroid99

@pfreixes
Author

Reproducing requires triggering the GOAWAY code path [1]. I did that by putting Nginx in front of a gRPC stream server (we use Golang, but any implementation would serve).

Here is the Nginx configuration I used, where keepalive_time does the trick of triggering the GOAWAY frame:

worker_processes  1;
error_log /dev/stdout debug;
master_process off;
daemon off;

events {
    worker_connections  1024;
}


http {
    include       mime.types;
    default_type  application/octet-stream;
    access_log /dev/stdout;
    sendfile        on;

    upstream backend {
        server 127.0.0.1:8081;
    }

    server {
        listen       8082 http2;
        server_name  localhost;
        keepalive_time 120s;

        location / {
            grpc_pass grpc://backend;
            grpc_socket_keepalive on;
            grpc_read_timeout 60s;
        }
    }
    include servers/*;
}

Once Nginx and the gRPC backend server are running, it is a matter of hitting the gRPC port exposed by Nginx with a gRPC client that uses the grpc-js package (latest version is fine) with heartbeats enabled. The client needs to start a stream and, every time it gets closed (because of grpc_read_timeout), simply retry it.
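
For completeness, the client loop looks roughly like this (a minimal sketch; the proto file, package, service, and method names are hypothetical placeholders, and the keepalive and retry values are arbitrary):

import * as grpc from '@grpc/grpc-js';
import * as protoLoader from '@grpc/proto-loader';

// Hypothetical proto: any long-lived server-streaming RPC behind the Nginx port works.
const packageDefinition = protoLoader.loadSync('echo.proto');
const proto = grpc.loadPackageDefinition(packageDefinition) as any;

const client = new proto.echo.EchoService(
  'localhost:8082',                      // the port exposed by Nginx
  grpc.credentials.createInsecure(),
  {
    // Heartbeats enabled, as described above.
    'grpc.keepalive_time_ms': 10_000,
    'grpc.keepalive_timeout_ms': 5_000,
    'grpc.keepalive_permit_without_calls': 1,
  }
);

function runStream(): void {
  const call = client.serverStream({});  // hypothetical server-streaming method
  call.on('data', (msg: unknown) => console.log('received', msg));
  call.on('error', () => { /* swallowed; the 'status' event below drives the retry */ });
  call.on('status', () => {
    // The stream was closed (e.g. by grpc_read_timeout): just start it again.
    setTimeout(runStream, 1_000);
  });
}

runStream();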

You will notice that between the GOAWAY event and the next RPC that triggers a new TCP connection, the heartbeats basically disappear.

[1] https://github.com/grpc/grpc-node/blob/master/packages/grpc-js/src/subchannel.ts#L525

@murgatroid99
Member

I have discussed this with my team. The client is correct to transition to IDLE when receiving a GOAWAY, whether or not there are any ongoing requests. But it is not correct to stop the keepalive pings after it goes IDLE if the connection is still in use. Unfortunately, fixing that will require a bit of a rework of the low-level connection handling systems, so there will likely not be a fix very soon.

@pfreixes
Author

The client is correct to transition to IDLE when receiving a GOAWAY

I'm not totally sure about this one; the semantics of IDLE state specifically that there are no ongoing RPCs, from here [1]:

IDLE: This is the state where the channel is not even trying to create a connection because of a lack of new or pending RPCs. New RPCs MAY be created in this state. Any attempt to start an RPC on the channel will push the channel out of this state to connecting. When there has been no RPC activity on a channel for a specified IDLE_TIMEOUT, i.e., no new or pending (active) RPCs for this period, channels that are READY or CONNECTING switch to IDLE. Additionally, channels that receive a GOAWAY when there are no active or pending RPCs should also switch to IDLE to avoid connection overload at servers that are attempting to shed connections. We will use a default IDLE_TIMEOUT of 300 seconds (5 minutes).

It specifically says "Additionally, channels that receive a GOAWAY when there are no active or pending RPCs should also switch to IDLE", and in this case there are active RPCs.

[1] https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md

@pfreixes
Author

Unfortunately, fixing that will require a bit of a rework of the low-level connection handling systems, so there will likely not be a fix very soon.

Yep, I realised that fixing this would require some sort of refactoring. For now, we will use a watchdog in our application to cancel RPCs that might be in that state.
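
For reference, the watchdog we have in mind is roughly this (a sketch; the timeout value is arbitrary):

import type { ClientReadableStream } from '@grpc/grpc-js';

const WATCHDOG_TIMEOUT_MS = 90_000;

// Cancel a server-streaming call if it goes silent for too long, so the
// application can retry it instead of hanging forever.
function armWatchdog<T>(call: ClientReadableStream<T>): void {
  let timer = setTimeout(() => call.cancel(), WATCHDOG_TIMEOUT_MS);

  // Every message from the server resets the timer.
  call.on('data', () => {
    clearTimeout(timer);
    timer = setTimeout(() => call.cancel(), WATCHDOG_TIMEOUT_MS);
  });

  // Stop the watchdog once the call finishes for any reason.
  call.on('status', () => clearTimeout(timer));
}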

@murgatroid99
Member

The Connectivity Semantics doc is not authoritative. It is out of date and is only used as a rough guideline for implementation. All current implementations switch to IDLE when receiving a GOAWAY, whether or not there are active RPCs.

@murgatroid99
Member

I have published a change that should fix this. Please try it out in version 1.8.3.
