Bidi calls never close, leaking memory client and server #4202

Closed
alexnixon opened this issue Mar 9, 2018 · 9 comments

Using v1.8.0.

The following client/server leaves the bidi RPC call open, holding onto resources and ultimately going OOM if enough calls are made:

// proto
message Request {}
message Response {}

service TestService {
    rpc Ping(stream Request) returns (stream Response);
}

/////////////////////////////
// server impl
private static class ServerServiceImpl extends TestServiceGrpc.TestServiceImplBase {
    @Override
    public StreamObserver<TestRpc.Request> ping(StreamObserver<TestRpc.Response> responseObserver) {
        return new StreamObserver<TestRpc.Request>() {
            @Override
            public void onNext(TestRpc.Request request) {
                responseObserver.onNext(TestRpc.Response.newBuilder().build());
                responseObserver.onCompleted();
            }

            @Override public void onError(Throwable throwable) {}
            @Override public void onCompleted() {}
        };
    }
}
    
/////////////////////////////
// client impl
private void oneCall(Channel chan) {
    ClientCall<TestRpc.Request, TestRpc.Response> call =
            chan.newCall(TestServiceGrpc.getPingMethod(), CallOptions.DEFAULT);
    call.start(new ClientCall.Listener<TestRpc.Response>() {}, new Metadata());
    call.sendMessage(TestRpc.Request.newBuilder().build());
    call.request(1);
}
for (int i = 0; i < 1000; ++i) {
    oneCall(channel);
}

// If I attach a debugger to the client here, I can see 1000 instances of DefaultStream and 1000 instances of a bunch of other grpc/netty bits, even after > 5 minutes and repeated attempts at GC.
Thread.sleep(9999999);

Replacing the client's ClientCall.Listener with one that calls .halfClose() upon completion works around the issue.
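
For reference, a minimal sketch of that kind of workaround, assuming the client simply half-closes once its single request has been sent (the method name below is made up):

// Hypothetical variant of oneCall() above: explicitly half-closing after the single
// request lets the stream reach CLOSED so both client and server can release it.
private void oneCallWithHalfClose(Channel chan) {
    ClientCall<TestRpc.Request, TestRpc.Response> call =
            chan.newCall(TestServiceGrpc.getPingMethod(), CallOptions.DEFAULT);
    call.start(new ClientCall.Listener<TestRpc.Response>() {}, new Metadata());
    call.sendMessage(TestRpc.Request.newBuilder().build());
    call.halfClose(); // signal "no more requests" so the call can complete fully
    call.request(1);
}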

alexnixon (Author) commented Mar 9, 2018

And the client also has ~550 live threads with the name grpc-default-executor-n. The single channel was created with:

        Channel channel = NettyChannelBuilder
                .forAddress("127.0.0.1", 12345)
                .usePlaintext(true)
                .nameResolverFactory(new DnsNameResolverProvider())
                .build();

If I replace NettyChannelBuilder with OkHttpChannelBuilder, I end up with ~1k threads instead. Setting the executor with .executor(Executors.newSingleThreadExecutor()) fixes the thread problem, but the client and server still both go OOM if I increase the 1000 calls to something higher.
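
For reference, a sketch of that variant (the single-threaded executor is the one mentioned above; it bounds the thread count but does not affect the retained streams):

// Same channel as above, but with an application-supplied executor instead of gRPC's
// default cached thread pool. Requires java.util.concurrent.{ExecutorService, Executors}.
ExecutorService appExecutor = Executors.newSingleThreadExecutor();
Channel channel = NettyChannelBuilder
        .forAddress("127.0.0.1", 12345)
        .usePlaintext(true)
        .nameResolverFactory(new DnsNameResolverProvider())
        .executor(appExecutor)
        .build();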

ejona86 added the bug label Mar 9, 2018
ejona86 added this to the Next milestone Mar 9, 2018
ejona86 (Member) commented Mar 9, 2018

This is pure speculation, but I wouldn't be surprised if Netty is retaining the memory because the stream is HALF_CLOSED.

http://httpwg.org/specs/rfc7540.html#rfc.section.8.1 talks about this some:

An HTTP response is complete after the server sends — or the client receives — a frame with the END_STREAM flag set (including any CONTINUATION frames needed to complete a header block). A server can send a complete response prior to the client sending an entire request if the response does not depend on any portion of the request that has not been sent and received. When this is true, a server MAY request that the client abort transmission of a request without error by sending a RST_STREAM with an error code of NO_ERROR after sending a complete response (i.e., a frame with the END_STREAM flag). Clients MUST NOT discard responses as a result of receiving such a RST_STREAM, though clients can always discard responses at their discretion for other reasons.

The client should probably be sending a RST_STREAM to the server to transition it to CLOSED. We'd need to look into Netty's behavior here.

alexnixon (Author) commented:

I still see the memory retained in the client when using OkHttp, though, which points to this being something in the common code?

Sorry for going over this again, but I do still wonder if we're misunderstanding something here. This is my train of thought:

  1. In bidi gRPC, both streams are independent
  2. You call onCompleted when you're done sending
  3. You expect to continue receiving messages after calling it
  4. Which means the connection and stream must remain open and in memory
  5. So they can only be cleaned up when both sides have indicated they've finished sending, by ending with a success or error

If that's all correct, then this isn't a memory leak; it's expected behavior. If it isn't, do you know where that logic breaks down?
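
To make that train of thought concrete, here is roughly what the pattern looks like with the generated async stub (names follow the proto at the top of this issue; this is an illustrative sketch, not the repro code):

// Illustrative only: with the async stub, the client's onCompleted() half-closes its
// sending side; responses can still arrive until the server completes or errors, and
// only then is the call fully closed.
TestServiceGrpc.TestServiceStub stub = TestServiceGrpc.newStub(channel);
StreamObserver<TestRpc.Request> requestObserver = stub.ping(new StreamObserver<TestRpc.Response>() {
    @Override public void onNext(TestRpc.Response response) {
        // May still be invoked after the client's onCompleted() below.
    }
    @Override public void onError(Throwable t) {}
    @Override public void onCompleted() {
        // The server has finished; only now can the stream's resources be released.
    }
});
requestObserver.onNext(TestRpc.Request.newBuilder().build());
requestObserver.onCompleted(); // done sending, but the call stays open for responses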

ejona86 (Member) commented Mar 9, 2018

In HTTP (and gRPC) the bi-di stream is done once the server finishes responding. We should simply throw away anything the client tries to send past that point (which is core to our design).

I would fully believe okhttp has the same bug. In fact, I'd expect okhttp to have it before netty.

ericgribkoff (Contributor) commented:

@alexnixon It looks like there is a problem in Netty clients, but I'm not seeing the same in OkHttp (which I'm more familiar with). Netty clients appear to hold onto the stream's resources in the io.grpc.netty.NettyClientHandler#streamKey mapping unless the client calls half-close, even keeping the streams marked as active, but I am not seeing stream resources retained in the same way by the OkHttp transport.

Are the server and client running in separate processes? At some point I'd imagine you'd get an OOM on the client even if no resources were retained by the stream after completion, simply because you can create streams faster than they can complete (especially if you're using a single-threaded executor, as mentioned in comment #2).

Also, I believe the large number of threads is expected when using the default executor, a cached thread pool. The loop you provided creates 1000 calls on the channel, and each of those can create a thread. At some point we'd expect some of those calls to finish and their threads to be reused for new calls, but it's not clear to me that the large number of threads is necessarily related to stream resources being retained after the server half-closes.

alexnixon (Author) commented Mar 13, 2018

@ericgribkoff I think you're right about OkHttp; that seems to be a different issue. The thousands of threads live concurrently and hold some resources, but they do eventually die after a minute or so, at which point the memory is freed.

The client/server are running in separate processes, though I'm not sure it's just a creation/completion race given the Netty memory leak persists after 5 minutes. Also, when I initially saw this issue in production it was with a service that slowly went OOM over a period of days.

ericgribkoff (Contributor) commented:

@alexnixon Yes, there is a genuine leak in Netty clients (I haven't looked too much at the server side yet). One difference in behavior is that the OkHttp client does send a RST_STREAM to the server when the server half-closes, whereas Netty clients do not. I hope to have a fix for this soon.

alexnixon (Author) commented Mar 14, 2018

Fantastic, thanks very much. I'll also note for now that maxConcurrentCallsPerConnection on the NettyServerBuilder can be used to prevent these badly behaved clients from killing the server.
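
A rough sketch of that mitigation, with a made-up port and limit:

// Hypothetical server setup: capping concurrent calls per connection bounds how many
// never-closed calls a single client can pin on this server. start() throws IOException.
Server server = NettyServerBuilder
        .forPort(12345)
        .addService(new ServerServiceImpl())
        .maxConcurrentCallsPerConnection(100) // tune for the expected workload
        .build()
        .start();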

ericgribkoff (Contributor) commented:

#4222 contains a fix for this problem: it changes the Netty client to send a reset stream when it receives an end-of-stream from the server without having already half-closed. There are more details in the comments on #4222, but sending the reset stream frees the stream resources on the client, and receipt of the reset stream frees the stream resources on the server.

Since we can't assume all clients will be updated to send the reset stream, I'll need to send out another PR to let the server release stream resources even without this behavior change in the client. But the client updates in #4222 alone are enough to solve this problem, at least as far as I've been able to reproduce and test it.
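
For anyone curious about the mechanism (this is an illustration against Netty's HTTP/2 API, not the actual change in #4222): when the client sees the server's END_STREAM while its own sending side is still open, resetting the stream with NO_ERROR moves it to CLOSED on both peers, per RFC 7540 section 8.1.

import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.http2.Http2ConnectionEncoder;
import io.netty.handler.codec.http2.Http2Error;
import io.netty.handler.codec.http2.Http2Stream;

// Sketch of the idea only; the real fix lives inside gRPC's Netty client handler.
final class ResetAfterServerEndOfStream {
    private final Http2ConnectionEncoder encoder;

    ResetAfterServerEndOfStream(Http2ConnectionEncoder encoder) {
        this.encoder = encoder;
    }

    /** Call when the server's END_STREAM arrives and the local side has not half-closed. */
    void onServerEndOfStream(ChannelHandlerContext ctx, Http2Stream stream) {
        if (stream.state() == Http2Stream.State.HALF_CLOSED_REMOTE) {
            // RST_STREAM(NO_ERROR) tells the server nothing more is coming, so the stream
            // moves to CLOSED on both sides instead of staying half-closed indefinitely.
            encoder.writeRstStream(ctx, stream.id(), Http2Error.NO_ERROR.code(), ctx.newPromise());
            ctx.flush();
        }
    }
}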

ejona86 modified the milestones: Next, 1.12 Feb 25, 2019
lock bot locked as resolved and limited conversation to collaborators May 26, 2019