Likely invalid handling of the goaway frames #2176

Closed
pfreixes opened this issue Jul 28, 2022 · 8 comments · Fixed by #2308



pfreixes commented Jul 28, 2022

Problem description

We have sporadically suffered issues with streams that hang forever. We believe the issue comes from the way GOAWAY frames are currently handled by grpc-node, which appears to differ from other implementations such as the Golang one [1].

In our scenario the streams stay open indefinitely and we expect them to recover. We use heartbeats as a defensive mechanism against TCP connections that are no longer usable, for whatever reason, and were not closed explicitly by the server.

From our understanding, the current code cannot guarantee that heartbeats remain active in some scenarios, specifically during a GOAWAY event [2], which immediately stops [3] the handlers that would otherwise close the session proactively when no heartbeat responses are seen for some time.

This is, in our opinion, a problem: during that window of time, TCP connections that are no longer usable become unprotected, because the heartbeats will not eventually kick in to close the connection on the client side.
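
For reference, this is roughly how we enable the heartbeats on the client side (a minimal sketch with @grpc/grpc-js; the target address and the timing values are placeholders):

import * as grpc from '@grpc/grpc-js';

// Channel options enabling client-side heartbeats (keepalive pings).
const channelOptions: grpc.ChannelOptions = {
  'grpc.keepalive_time_ms': 10_000,          // ping after 10s of inactivity
  'grpc.keepalive_timeout_ms': 5_000,        // drop the connection if the ping is not acked within 5s
  'grpc.keepalive_permit_without_calls': 1,  // keep pinging while streams are idle
};

// The options are passed to the (generated or generic) client constructor;
// 'localhost:8082' is a placeholder target.
const client = new grpc.Client(
  'localhost:8082',
  grpc.credentials.createInsecure(),
  channelOptions
);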

Further implication of this

There is another implication of this strategy that goes beyond the bug itself. When a GOAWAY frame is consumed, the subchannel connection is marked IDLE, and this state is also propagated through getConnectivityState, reporting to the caller (the user of the API) that the status is IDLE. From our understanding [4], IDLE means that there are no outstanding RPCs, which is not true in the GOAWAY case, where there is still at least one outstanding RPC.
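
For illustration, this is roughly how that can be observed from the caller side (a sketch; it assumes a client created as in the snippet above):

import * as grpc from '@grpc/grpc-js';

// Right after the GOAWAY the channel reports IDLE even though a stream is
// still running on it.
function logState(client: grpc.Client): void {
  const state = client.getChannel().getConnectivityState(false); // false: do not trigger a reconnect
  if (state === grpc.connectivityState.IDLE) {
    // Per the connectivity semantics doc we expected IDLE to imply
    // "no outstanding RPCs", which does not hold here.
    console.log('channel reports IDLE while a stream is still open');
  }
}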

[1] https://github.com/grpc/grpc-go/blob/master/internal/transport/http2_client.go#L1199
[2] https://github.com/grpc/grpc-node/blob/master/packages/grpc-js/src/subchannel.ts#L525
[3] https://github.com/grpc/grpc-node/blob/master/packages/grpc-js/src/subchannel.ts#L718
[4] https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md

Reproduction steps

We reproduced this by placing Nginx in front of a simple gRPC server and using the keepalive_time and grpc_read_timeout directives to produce the following behaviour:

  • When the keepalive_time (let's say configured to 120s) is reached, the next RPC receives a GOAWAY frame
  • After the RPC that received the GOAWAY finishes, Nginx proactively closes the TCP connection
  • We also set grpc_read_timeout to something like 60s to proactively stop streams that have had no traffic at all

With Nginx configured and a gRPC stream server behind it, start a gRPC stream client and run the same stream repeatedly, retrying each time it is closed (which happens because of grpc_read_timeout). Eventually, when keepalive_time kicks in, the heartbeats stop working; if at that point you add a rule (iptables, pf, etc.) that drops any response from the server, the client will not notice and the stream will hang forever.

Environment

  • OS name, version and architecture: mac
  • Node version: v16.16.0
  • Node installation method: nvm
  • Package name and version: 1.6.8
@pfreixes
Author

@lidizheng 😉

@lidizheng
Contributor

@pfreixes Hi, long time no see. This does sound strange. The repro steps are nice. The lowest-friction way to let others help with debugging would be to provide a minimal repro example.

CC @murgatroid99

@pfreixes
Author

Reproducing requires triggering the GOAWAY code path [1]. I did that by putting Nginx in front of a gRPC stream server (we use Golang, but any implementation would serve).

Here is the Nginx configuration I used, where keepalive_time does the trick of triggering the GOAWAY frame:

worker_processes  1;
error_log /dev/stdout debug;
master_process off;
daemon off;

events {
    worker_connections  1024;
}


http {
    include       mime.types;
    default_type  application/octet-stream;
    access_log /dev/stdout;
    sendfile        on;

    upstream backend {
        server 127.0.0.1:8081;
    }

    server {
        listen       8082 http2;
        server_name  localhost;
        keepalive_time 120s;

        location / {
            grpc_pass grpc://backend;
            grpc_socket_keepalive on;
            grpc_read_timeout 60s;
        }
    }
    include servers/*;
}

Once Nginx and the gRPC backend server are running, it is a matter of hitting the gRPC port exposed by Nginx with a gRPC client that uses the grpc-js package (latest version is fine) with heartbeats enabled. The client needs to start a stream and, every time it gets closed (because of grpc_read_timeout), simply retry it.
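
For completeness, the client loop looks roughly like this (a minimal sketch; the proto file, package, service, and method names are hypothetical placeholders, and the keepalive and retry values are arbitrary):

import * as grpc from '@grpc/grpc-js';
import * as protoLoader from '@grpc/proto-loader';

// Hypothetical proto: any long-lived server-streaming RPC behind the Nginx port works.
const packageDefinition = protoLoader.loadSync('echo.proto');
const proto = grpc.loadPackageDefinition(packageDefinition) as any;

const client = new proto.echo.EchoService(
  'localhost:8082',                      // the port exposed by Nginx
  grpc.credentials.createInsecure(),
  {
    // Heartbeats enabled, as described above.
    'grpc.keepalive_time_ms': 10_000,
    'grpc.keepalive_timeout_ms': 5_000,
    'grpc.keepalive_permit_without_calls': 1,
  }
);

function runStream(): void {
  const call = client.serverStream({});  // hypothetical server-streaming method
  call.on('data', (msg: unknown) => console.log('received', msg));
  call.on('error', () => { /* swallowed; the 'status' event below drives the retry */ });
  call.on('status', () => {
    // The stream was closed (e.g. by grpc_read_timeout): just start it again.
    setTimeout(runStream, 1_000);
  });
}

runStream();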

You will notice that between the GOAWAY event and the next RPC that triggers a new TCP connection, the heartbeats basically disappear.

[1] https://github.com/grpc/grpc-node/blob/master/packages/grpc-js/src/subchannel.ts#L525

@murgatroid99
Member

I have discussed this with my team. The client is correct to transition to IDLE when receiving a GOAWAY, whether or not there are any ongoing requests. But it is not correct to stop the keepalive pings after it goes IDLE if the connection is still in use. Unfortunately, fixing that will require a bit of a rework of the low-level connection handling systems, so there will likely not be a fix very soon.

@pfreixes
Author

The client is correct to transition to IDLE when receiving a GOAWAY

I'm not totally sure about this one; the semantics of IDLE state specifically that there are no ongoing RPCs, from here [1]:

IDLE: This is the state where the channel is not even trying to create a connection because of a lack of new or pending RPCs. New RPCs MAY be created in this state. Any attempt to start an RPC on the channel will push the channel out of this state to connecting. When there has been no RPC activity on a channel for a specified IDLE_TIMEOUT, i.e., no new or pending (active) RPCs for this period, channels that are READY or CONNECTING switch to IDLE. Additionally, channels that receive a GOAWAY when there are no active or pending RPCs should also switch to IDLE to avoid connection overload at servers that are attempting to shed connections. We will use a default IDLE_TIMEOUT of 300 seconds (5 minutes).

It specifically says "Additionally, channels that receive a GOAWAY when there are no active or pending RPCs should also switch to IDLE", and in this case there are active RPCs.

[1] https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md

@pfreixes
Author

Unfortunately, fixing that will require a bit of a rework of the low-level connection handling systems, so there will likely not be a fix very soon.

Yep, I realised that fixing this would require some sort of refactoring. For now, we will use a watchdog in our application to cancel RPCs that might be in that state.
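
For reference, the watchdog we have in mind is roughly this (a sketch; the timeout value is arbitrary):

import type { ClientReadableStream } from '@grpc/grpc-js';

const WATCHDOG_TIMEOUT_MS = 90_000;

// Cancel a server-streaming call if it goes silent for too long, so the
// application can retry it instead of hanging forever.
function armWatchdog<T>(call: ClientReadableStream<T>): void {
  let timer = setTimeout(() => call.cancel(), WATCHDOG_TIMEOUT_MS);

  // Every message from the server resets the timer.
  call.on('data', () => {
    clearTimeout(timer);
    timer = setTimeout(() => call.cancel(), WATCHDOG_TIMEOUT_MS);
  });

  // Stop the watchdog once the call finishes for any reason.
  call.on('status', () => clearTimeout(timer));
}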

@murgatroid99
Member

The Connectivity Semantics doc is not authoritative. It is out of date and is only used as a rough guideline for implementation. All current implementations switch to IDLE when receiving a GOAWAY, whether or not there are active RPCs.

@murgatroid99
Member

I have published a change that should fix this. Please try it out in version 1.8.3.
