Bidi calls never close, leaking memory client and server #4202

Closed
alexnixon opened this issue Mar 9, 2018 · 9 comments

Using v1.8.0.

The following client/server leaves the bidi RPC call open, holding onto resources and ultimately going OOM if enough calls are made:

// proto
message Request {}
message Response {}

service TestService {
    rpc Ping(stream Request) returns (stream Response);
}

/////////////////////////////
// server impl
private static class ServerServiceImpl extends TestServiceGrpc.TestServiceImplBase {
    @Override
    public StreamObserver<TestRpc.Request> ping(StreamObserver<TestRpc.Response> responseObserver) {
        return new StreamObserver<TestRpc.Request>() {
            @Override
            public void onNext(TestRpc.Request request) {
                responseObserver.onNext(TestRpc.Response.newBuilder().build());
                responseObserver.onCompleted();
            }

            @Override public void onError(Throwable throwable) {}
            @Override public void onCompleted() {}
        };
    }
}
    
/////////////////////////////
// client impl
private void oneCall(Channel chan) {
    ClientCall<TestRpc.Request, TestRpc.Response> call =
            chan.newCall(TestServiceGrpc.getPingMethod(), CallOptions.DEFAULT);
    call.start(new ClientCall.Listener<TestRpc.Response>() {}, new Metadata());
    call.sendMessage(TestRpc.Request.newBuilder().build());
    call.request(1);
}
for (int i = 0; i < 1000; ++i) {
    oneCall(channel);
}

// If I attach a debugger to the client here, I can see 1000 instances of DefaultStream and 1000 instances of a bunch of other grpc/netty bits, even after > 5 minutes and repeated attempts at GC.
Thread.sleep(9999999);

Replacing the client's ClientCall.Listener with one that calls .halfClose() upon completion works around the issue.
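
For reference, a minimal sketch of that kind of workaround, assuming the client simply half-closes once its single request has been sent (the method name below is made up):

// Hypothetical variant of oneCall() above: explicitly half-closing after the single
// request lets the stream reach CLOSED so both client and server can release it.
private void oneCallWithHalfClose(Channel chan) {
    ClientCall<TestRpc.Request, TestRpc.Response> call =
            chan.newCall(TestServiceGrpc.getPingMethod(), CallOptions.DEFAULT);
    call.start(new ClientCall.Listener<TestRpc.Response>() {}, new Metadata());
    call.sendMessage(TestRpc.Request.newBuilder().build());
    call.halfClose(); // signal "no more requests" so the call can complete fully
    call.request(1);
}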

alexnixon (Author) commented Mar 9, 2018

And the client also has ~550 live threads with the name grpc-default-executor-n. The single channel was created with:

        Channel channel = NettyChannelBuilder
                .forAddress("127.0.0.1", 12345)
                .usePlaintext(true)
                .nameResolverFactory(new DnsNameResolverProvider())
                .build();

If I replace NettyChannelBuilder with OkHttpChannelBuilder, I end up with ~1k threads instead. Setting the executor with .executor(Executors.newSingleThreadExecutor()) fixes the thread problem, but the client and server still both go OOM if I increase the 1000 calls to something higher.
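
For reference, a sketch of that variant (the single-threaded executor is the one mentioned above; it bounds the thread count but does not affect the retained streams):

// Same channel as above, but with an application-supplied executor instead of gRPC's
// default cached thread pool. Requires java.util.concurrent.{ExecutorService, Executors}.
ExecutorService appExecutor = Executors.newSingleThreadExecutor();
Channel channel = NettyChannelBuilder
        .forAddress("127.0.0.1", 12345)
        .usePlaintext(true)
        .nameResolverFactory(new DnsNameResolverProvider())
        .executor(appExecutor)
        .build();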

ejona86 added the bug label Mar 9, 2018
ejona86 added this to the Next milestone Mar 9, 2018
ejona86 (Member) commented Mar 9, 2018

This is pure speculation, but I wouldn't be surprised if Netty is retaining the memory because the stream is HALF_CLOSED.

http://httpwg.org/specs/rfc7540.html#rfc.section.8.1 talks about this some:

An HTTP response is complete after the server sends — or the client receives — a frame with the END_STREAM flag set (including any CONTINUATION frames needed to complete a header block). A server can send a complete response prior to the client sending an entire request if the response does not depend on any portion of the request that has not been sent and received. When this is true, a server MAY request that the client abort transmission of a request without error by sending a RST_STREAM with an error code of NO_ERROR after sending a complete response (i.e., a frame with the END_STREAM flag). Clients MUST NOT discard responses as a result of receiving such a RST_STREAM, though clients can always discard responses at their discretion for other reasons.

The client should probably be sending a RST_STREAM to the server to transition it to CLOSED. We'd need to look into Netty's behavior here.

alexnixon (Author) commented:

I still see the memory retained in the client when using OkHttp, though, which points to this being something in the common code?

Sorry for going over this again, but I do still wonder if we're misunderstanding something here. This is my train of thought:

  1. In bidi gRPC, both streams are independent
  2. You call onCompleted when you're done sending
  3. You expect to continue receiving messages after calling it
  4. Which means the connection and stream must remain open and in memory
  5. So they can only be cleaned up when both sides have indicated they've finished sending, by ending with a success or error

If that's all correct, then this isn't a memory leak; it's expected behavior. If it isn't, do you know where that logic breaks down?
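
To make that train of thought concrete, here is roughly what the pattern looks like with the generated async stub (names follow the proto at the top of this issue; this is an illustrative sketch, not the repro code):

// Illustrative only: with the async stub, the client's onCompleted() half-closes its
// sending side; responses can still arrive until the server completes or errors, and
// only then is the call fully closed.
TestServiceGrpc.TestServiceStub stub = TestServiceGrpc.newStub(channel);
StreamObserver<TestRpc.Request> requestObserver = stub.ping(new StreamObserver<TestRpc.Response>() {
    @Override public void onNext(TestRpc.Response response) {
        // May still be invoked after the client's onCompleted() below.
    }
    @Override public void onError(Throwable t) {}
    @Override public void onCompleted() {
        // The server has finished; only now can the stream's resources be released.
    }
});
requestObserver.onNext(TestRpc.Request.newBuilder().build());
requestObserver.onCompleted(); // done sending, but the call stays open for responses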

ejona86 (Member) commented Mar 9, 2018

In HTTP (and gRPC) the bi-di stream is done once the server finishes responding. We should simply throw away anything the client tries to send past that point (which is core to our design).

I would fully believe okhttp has the same bug. In fact, I'd expect okhttp to have it before netty.

ericgribkoff (Contributor) commented:

@alexnixon It looks like there is a problem in Netty clients, but I'm not seeing the same in OkHttp (which I'm more familiar with). Netty clients appear to hold onto the stream's resources in the io.grpc.netty.NettyClientHandler#streamKey mapping unless the client calls half-close, even keeping the streams marked as active, but I am not seeing stream resources retained in the same way by the OkHttp transport.

Are the server and client running in separate processes? At some point I'd imagine you'd get an OOM on the client even if no resources were retained by the stream after completion, simply because you can create streams faster than they can complete (especially if you're using a single-threaded executor, as mentioned in comment #2).

Also, I believe the large number of threads is expected when using the default executor, a cached thread pool. The loop you provided creates 1000 calls on the channel, and each of those can create a thread. At some point we'd expect some of those calls to finish and their threads to be reused for new calls, but it's not clear to me that the large number of threads is necessarily related to stream resources being retained after the server half-closes.

alexnixon (Author) commented Mar 13, 2018

@ericgribkoff I think you're right about OkHttp; that seems to be a different issue. The thousands of threads live concurrently and hold some resources, but they do eventually die after a minute or so, at which point the memory is freed.

The client/server are running in separate processes, though I'm not sure it's just a creation/completion race given the Netty memory leak persists after 5 minutes. Also, when I initially saw this issue in production it was with a service that slowly went OOM over a period of days.

ericgribkoff (Contributor) commented:

@alexnixon Yes, there is a genuine leak in Netty clients (I haven't looked too much at the server side yet). One difference in behavior is that the OkHttp client does send a RST_STREAM to the server when the server half-closes, whereas Netty clients do not. I hope to have a fix for this soon.

alexnixon (Author) commented Mar 14, 2018

Fantastic, thanks very much. I'll also note for now that maxConcurrentCallsPerConnection on the NettyServerBuilder can be used to prevent these badly behaved clients from killing the server.
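
A rough sketch of that mitigation, with a made-up port and limit:

// Hypothetical server setup: capping concurrent calls per connection bounds how many
// never-closed calls a single client can pin on this server. start() throws IOException.
Server server = NettyServerBuilder
        .forPort(12345)
        .addService(new ServerServiceImpl())
        .maxConcurrentCallsPerConnection(100) // tune for the expected workload
        .build()
        .start();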

ericgribkoff (Contributor) commented:

#4222 contains a fix for this problem: it changes the Netty client to send a reset stream when it receives an end-of-stream from the server without having already half-closed. There are more details in the comments on #4222, but sending the reset stream frees the stream resources on the client, and receipt of the reset stream frees the stream resources on the server.

Since we can't assume all clients will be updated to send the reset stream, I'll need to send out another PR to let the server release stream resources even without this behavior change in the client. But the client updates in #4222 alone are enough to solve this problem, at least as far as I've been able to reproduce and test it.
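
For anyone curious about the mechanism (this is an illustration against Netty's HTTP/2 API, not the actual change in #4222): when the client sees the server's END_STREAM while its own sending side is still open, resetting the stream with NO_ERROR moves it to CLOSED on both peers, per RFC 7540 section 8.1.

import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.http2.Http2ConnectionEncoder;
import io.netty.handler.codec.http2.Http2Error;
import io.netty.handler.codec.http2.Http2Stream;

// Sketch of the idea only; the real fix lives inside gRPC's Netty client handler.
final class ResetAfterServerEndOfStream {
    private final Http2ConnectionEncoder encoder;

    ResetAfterServerEndOfStream(Http2ConnectionEncoder encoder) {
        this.encoder = encoder;
    }

    /** Call when the server's END_STREAM arrives and the local side has not half-closed. */
    void onServerEndOfStream(ChannelHandlerContext ctx, Http2Stream stream) {
        if (stream.state() == Http2Stream.State.HALF_CLOSED_REMOTE) {
            // RST_STREAM(NO_ERROR) tells the server nothing more is coming, so the stream
            // moves to CLOSED on both sides instead of staying half-closed indefinitely.
            encoder.writeRstStream(ctx, stream.id(), Http2Error.NO_ERROR.code(), ctx.newPromise());
            ctx.flush();
        }
    }
}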

ejona86 modified the milestones: Next, 1.12 Feb 25, 2019
lock bot locked as resolved and limited conversation to collaborators May 26, 2019