Half-closed without a request #8866

Closed
tommyulfsparre opened this issue Jan 25, 2022 · 16 comments · Fixed by #9362

@tommyulfsparre

tommyulfsparre commented Jan 25, 2022

gRPC version
1.40.0

Environment
A client and server communicating over localhost.

I have a simple client and server where the server does nothing besides respond to an incoming RPC:

  @Override
  public void concat(PingRequest request, StreamObserver<PingResponse> responseObserver) {
    responseObserver.onNext(PingResponse.newBuilder().build());
    responseObserver.onCompleted();
  }

The client sends 100 RPCs with a 100ms deadline in parallel over a single channel that is in a READY state, and shuts down after all RPCs have been sent.
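
A minimal sketch of that client setup, assuming the protoc-generated PingServiceGrpc, PingRequest, and ping identifiers (inferred from the logged :path; the actual reproduction may differ):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.concurrent.TimeUnit;

public final class ReproClient {
  public static void main(String[] args) throws InterruptedException {
    // Single channel to the local server from the logs.
    // PingServiceGrpc / PingRequest are assumed generated classes on the classpath.
    ManagedChannel channel =
        ManagedChannelBuilder.forAddress("127.0.0.1", 5993).usePlaintext().build();
    PingServiceGrpc.PingServiceFutureStub stub = PingServiceGrpc.newFutureStub(channel);

    for (int i = 0; i < 100; i++) {
      // Each RPC carries its own 100ms deadline; if the server is slow to
      // request() the message (e.g. during JVM warm-up), the client cancels
      // the stream, which shows up as RST_STREAM errorCode=8 (CANCEL).
      stub.withDeadlineAfter(100, TimeUnit.MILLISECONDS)
          .ping(PingRequest.newBuilder().build());
    }

    channel.shutdown();
    channel.awaitTermination(5, TimeUnit.SECONDS);
  }
}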

Server side:
13:49:11.197  INFO [ault-executor-5] com.example.Main             : trace-seq: 97 - ServerCall.Listener.close with status code: INTERNAL, desc: Half-closed without a request

Client side:
13:49:10.881 [grpc-nio-worker-ELG-1-2] DEBUG io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler - [id: 0x3141a0f1, L:/127.0.0.1:50244 - R:/127.0.0.1:5993] OUTBOUND HEADERS: streamId=145 headers=GrpcHttp2OutboundHeaders[:authority: 127.0.0.1:5993, :path: /example.test.v1.PingService/Ping, :method: POST, :scheme: http, content-type: application/grpc, te: trailers, user-agent: grpc-java-netty/1.40.0, trace-seq: 97, grpc-accept-encoding: gzip, grpc-trace-bin: AAB33fg6kgpaWIFFOW9S6A41Af/EVN8t8akZAgA, grpc-timeout: 27946551n] streamDependency=0 weight=16 exclusive=false padding=0 endStream=false

13:49:10.979 [grpc-nio-worker-ELG-1-2] DEBUG io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler - [id: 0x3141a0f1, L:/127.0.0.1:50244 - R:/127.0.0.1:5993] OUTBOUND DATA: streamId=145 padding=0 endStream=true length=16 bytes=000000000b0a04666f6f6f1203626172

13:49:11.004 [grpc-nio-worker-ELG-1-2] DEBUG io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler - [id: 0x3141a0f1, L:/127.0.0.1:50244 - R:/127.0.0.1:5993] OUTBOUND RST_STREAM: streamId=145 errorCode=8

(trace-seq: 97 is metadata that gets sent along with each RPC)

Right after the JVM has started we get a bunch of RPCs with status INTERNAL, desc: Half-closed without a request on the server side, while the client reports the expected DEADLINE_EXCEEDED. So this happens before any JIT compilation or lazy class loading has warmed things up.

Re-running the client (keeping the server running) does not produce those errors, and we can see that only onCancel gets called when the client cancels an RPC, instead of the close(Status.INTERNAL) -> onHalfClose -> onCancel sequence we saw before.

I guess gRPC can't do much about the slow JVM startup, but is the INTERNAL, desc: Half-closed without a request status expected in those cases?

@ejona86
Member

ejona86 commented Jan 28, 2022

Strange.

"Half-closed without a request" should mean that the client didn't send (all of) a request. But those logs show the request DATA frame and the 000000000b shows an uncompressed message (first byte) of size 11 (next 4 bytes). 5 byte header + 11 byte message is 16 bytes, which agrees with the data frame. Nothing looks strange here; I'd have expected it to work.

Am I right in assuming that 1.40.0 is the version for both the client and server? I don't recognize this as a recent bug. And since this is over localhost, there's almost certainly not a proxy involved.

Do you have any interceptors that could be causing this problem? If an interceptor is wrapping the ServerCall.Listener and not immediately forwarding onMessage(), then it'd be suspect. This sort of bug would be easy in an interceptor that does some work asynchronously. If an interceptor delays calling onMessage(), then it needs to also delay onHalfClose().
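
To make that failure mode concrete, here is a hypothetical sketch (interceptor and executor names are invented) of the pattern being warned about: onMessage() is handed off to another thread while onHalfClose() is forwarded inline, so the handler can observe half-close before the delayed request arrives.

import io.grpc.ForwardingServerCallListener;
import io.grpc.Metadata;
import io.grpc.ServerCall;
import io.grpc.ServerCallHandler;
import io.grpc.ServerInterceptor;
import java.util.concurrent.Executor;

final class AsyncWorkInterceptor implements ServerInterceptor {
  private final Executor workExecutor;

  AsyncWorkInterceptor(Executor workExecutor) {
    this.workExecutor = workExecutor;
  }

  @Override
  public <ReqT, RespT> ServerCall.Listener<ReqT> interceptCall(
      ServerCall<ReqT, RespT> call, Metadata headers, ServerCallHandler<ReqT, RespT> next) {
    return new ForwardingServerCallListener.SimpleForwardingServerCallListener<ReqT>(
        next.startCall(call, headers)) {
      @Override
      public void onMessage(ReqT message) {
        // BUG: the request message is delivered asynchronously...
        workExecutor.execute(() -> delegate().onMessage(message));
      }
      // ...but onHalfClose() is inherited and forwarded synchronously, so it can
      // reach the ServerCallHandler before the delayed onMessage(), producing
      // "Half-closed without a request". A correct version would also defer
      // onHalfClose() onto the same executor to preserve ordering.
    };
  }
}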

@tommyulfsparre
Author

tommyulfsparre commented Jan 28, 2022

Am I right in assuming that 1.40.0 is the version for both the client and server? I don't recognize this as a recent bug. And since this is over localhost, there's almost certainly not a proxy involved.

Correct, both client and server are using 1.40.0. I will try and see if I can reproduce this with a newer version.

Do you have any interceptors that could be causing this problem? If an interceptor is wrapping the ServerCall.Listener and not immediately forwarding onMessage(), then it'd be suspect. This sort of bug would be easy in an interceptor that does some work asynchronously. If an interceptor delays calling onMessage(), then it needs to also delay onHalfClose().

I added a logging interceptor just for debugging purposes:

  private ServerInterceptor logging() {
    var seq = new AtomicInteger(0);

    return new ServerInterceptor() {
      @Override
      public <ReqT, RespT> Listener<ReqT> interceptCall(
          ServerCall<ReqT, RespT> call, Metadata headers, ServerCallHandler<ReqT, RespT> next) {
        var remoteSeq = headers.get(TRACE_SEQ_KEY);
        var seqNr = seq.incrementAndGet();

        return new ForwardingServerCallListener.SimpleForwardingServerCallListener<ReqT>(
            next.startCall(
                new ForwardingServerCall.SimpleForwardingServerCall<ReqT, RespT>(call) {
                  @Override
                  public void sendMessage(final RespT message) {
                    super.sendMessage(message);
                    logger.info("{} - sendMessage", seq);
                  }

                  @Override
                  public void close(final Status status, final Metadata trailers) {
                    super.close(status, trailers);
                    logger.info(
                        "{} trace-seq: {} - closed with status code: {}, desc: {}",
                        seqNr,
                        remoteSeq,
                        status.getCode(),
                        status.getDescription());
                  }

                  @Override
                  public void sendHeaders(final Metadata headers) {
                    super.sendHeaders(headers);
                    logger.info("{} - sendHeader", seqNr);
                  }
                },
                headers)) {

          @Override
          public void onHalfClose() {
            super.onHalfClose();
            logger.info("{} - onHalfClose", seqNr);
          }

          @Override
          public void onMessage(final ReqT message) {
            super.onMessage(message);
            logger.info("{} - onMessage", seqNr);
          }

          @Override
          public void onCancel() {
            super.onCancel();
            logger.info("{} - onCancel", seqNr);
          }

          @Override
          public void onComplete() {
            super.onComplete();
            logger.info("{} - onComplete", seqNr);
          }
        };
      }
    };
  }

Could this cause the observed behaviour?

@ejona86
Member

ejona86 commented Jan 31, 2022

I expect this issue will be present in newer versions too, but confirming is still worthwhile.

That interceptor doesn't look to be a problem. I will note that the logs may be in a misleading order because within super.onHalfClose() there may be a call to call.close(). So it may then appear as if the close() happens before the onHalfClose() when in truth the onHalfClose() triggered the close(). But that's minor.
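
One way to avoid that confusion, sketched below with the same grpc-java APIs as the interceptor above (logger and class names are illustrative), is to log before delegating, so an onHalfClose log line can never appear after a close() that it itself triggered:

import io.grpc.ForwardingServerCallListener;
import io.grpc.Metadata;
import io.grpc.ServerCall;
import io.grpc.ServerCallHandler;
import io.grpc.ServerInterceptor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

final class LogBeforeDelegateInterceptor implements ServerInterceptor {
  private static final Logger logger =
      LoggerFactory.getLogger(LogBeforeDelegateInterceptor.class);

  @Override
  public <ReqT, RespT> ServerCall.Listener<ReqT> interceptCall(
      ServerCall<ReqT, RespT> call, Metadata headers, ServerCallHandler<ReqT, RespT> next) {
    return new ForwardingServerCallListener.SimpleForwardingServerCallListener<ReqT>(
        next.startCall(call, headers)) {
      @Override
      public void onHalfClose() {
        // Log first: the delegated call below may synchronously invoke close()
        // on the wrapped ServerCall, and that close() would otherwise be logged
        // ahead of this onHalfClose entry.
        logger.info("onHalfClose");
        super.onHalfClose();
      }
    };
  }
}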

Is that the only interceptor in use?

@tommyulfsparre
Author

tommyulfsparre commented Jan 31, 2022

Is that the only interceptor in use?

Here is a minimal reproducible (at least on my machine) example: https://github.com/tommyulfsparre/grpc-half-close

@dapengzhang0
Member

Here is a minimal reproducible (at least on my machine) example: https://github.com/tommyulfsparre/grpc-half-close

I can reproduce the "Half-closed without a request" error with the given reproducer.

@dapengzhang0
Member

"Half-closed without a request" should mean that the client didn't send (all of) a request. But those logs show the request DATA frame and the 000000000b shows an uncompressed message (first byte) of size 11 (next 4 bytes). 5 byte header + 11 byte message is 16 bytes, which agrees with the data frame. Nothing looks strange here; I'd have expected it to work.

@ejona86

It seems that when RST_STREAM is received,

listener.halfClosed()
https://github.com/grpc/grpc-java/blob/v1.44.0/core/src/main/java/io/grpc/internal/AbstractServerStream.java#L231

can race with
listener.onMessage()
https://github.com/grpc/grpc-java/blob/v1.44.0/core/src/main/java/io/grpc/internal/ServerImpl.java#L834
https://github.com/grpc/grpc-java/blob/v1.44.0/core/src/main/java/io/grpc/internal/ServerCallImpl.java#L318

@irock

irock commented Feb 3, 2022

Tangential question to whether this is a race condition or not: Is INTERNAL the most accurate code to describe this error condition? Would DATA_LOSS or CANCELLED be more accurate status codes?

As far as I understand, it is possible for this condition to occur under normal circumstances if the client closes a stream prematurely. Is this a server error or a client error? The current implementation is categorizing it as a server error. Do we want to change this?

@ejona86
Member

ejona86 commented Feb 3, 2022

The current implementation is categorizing it as a server error.

@irock, INTERNAL is not a "server error." It's a "things are fundamentally broken." The INTERNAL usage on that line is appropriate, because the client believes the RPC is client-streaming instead of unary, so the client and server appear not to agree on the schema.

That line being executed at all is a bug, though. But the "wrong status" part of it doesn't matter too much because the client already cancelled and so won't see it.

@ejona86 ejona86 added bug and removed question labels Feb 3, 2022
@ejona86
Member

ejona86 commented Feb 3, 2022

There's no compression in the reproduction. So I think this might be it:

  1. Transport receives full request
  2. In the successful cases the application request()s a message, so delivery to the application begins. In the failure case, this is too slow and doesn't happen.
  3. Transport receives RST_STREAM, which closes the deframer with an option to throw away any messages.
  4. The deframer cleans up, but it hadn't yet started processing a message, so nextFrame == null and therefore hasPartialMessage == false:
    boolean hasPartialMessage = nextFrame != null && nextFrame.readableBytes() > 0;
  5. That makes AbstractServerStream think this is a graceful end, so it inserts halfClosed(), even though it didn't deliver the message.

Could the fix be as easy as avoiding the entire endOfStream case if immediateCloseRequested == true? Looks like it. I haven't run the reproduction yet to check.

@tommyulfsparre, the reproduction looks to have been very helpful here.

@ejona86
Member

ejona86 commented Feb 3, 2022

Maybe we need something more; I'd have to look at complete() more.

@irock

irock commented Feb 24, 2022

Hi @ejona86! Do you have any news on a potential fix for this issue?

@YifeiZhuang
Contributor

On the server side, the transport first notifies that the stream was created, then the server call is started and request()s messages on the stream. Meanwhile, the stream receives halfClose() and notifies the call. However, it seems that no matter how long request() is delayed, notifying halfClose() on the call is always scheduled after onMessage(), which guarantees the request is delivered before half-close. @ejona86 do you know why this order happens?

@YifeiZhuang
Contributor

Never mind, that situation was the successful case.

When the deadline is small enough (in my local test, request delay = 100ms, deadline = 400ms), we see, 100% of the time, that due to the deadline and the slow request() the server call's halfClose() is notified before onMessage(), so INTERNAL "Half-closed without a request" is always triggered. However, the client always sees the DEADLINE_EXCEEDED status code instead of INTERNAL. Does this mean it might be a client-side issue?

@ejona86
Member

ejona86 commented Jul 1, 2022

Cancellation drains the MessageDeframer, but our code handling the deframer closure is noticing we've received halfClose on-the-wire and reporting that accidentally. The problem here isn't the ordering between onMessage() and halfClose(). The problem is that halfClose() shouldn't be reported at all. The cancellation code threw away the request message, so it won't ever be delivered.

The call is already cancelled when this INTERNAL error occurs, so the server doesn't send the status to the client. The client processes the deadline locally to see the DEADLINE_EXCEEDED.

@YifeiZhuang
Contributor

Oh, I looked at the original issue again: even in the undesired situation the client never receives the INTERNAL error, only the server sees it. The client always sees DEADLINE_EXCEEDED. Then it looks like the reproduction is OK.

Cancellation drains the MessageDeframer, but our code handling the deframer closure is noticing we've received halfClose on-the-wire and reporting that accidentally. The problem here isn't the ordering between onMessage() and halfClose(). The problem is that halfClose() shouldn't be reported at all. The cancellation code threw away the request message, so it won't ever be delivered.

The received halfClose from the client was caused by client cancellation, right? Then it makes sense: the deadline was long enough for the server to receive the request but not process it, and small enough for the client to cancel before everything completes, which causes the INTERNAL error.

@ejona86
Member

ejona86 commented Jul 6, 2022

The received halfClose from the client was caused by client cancellation, right?

No, I expect that was part of the original request. Something like:

Client sends request headers, message, halfClose
... delay
Client sends cancellation

Commonly that delay is large enough that the server would have request()ed before receiving the cancellation and so we don't normally see the issue.

But the rest of what you said sounds fine.
