
gRPC retry behaves weirdly when timeout is set along with retry policy #10336

Closed
ranavivek04 opened this issue Jun 30, 2023 · 13 comments · Fixed by #10855

ranavivek04 commented Jun 30, 2023

Using the retry config below:

{
  "methodConfig": [
    {
      "name": [
        {
          "service": "helloworld.Greeter",
          "method": "SayHello"
        }
      ],
      "timeout": "2.000s",
      "retryPolicy": {
        "maxAttempts": 5,
        "initialBackoff": "0.5s",
        "maxBackoff": "30s",
        "backoffMultiplier": 2,
        "retryableStatusCodes": [
          "UNAVAILABLE"
        ]
      }
    }
  ]
}

Expected: the gRPC client should stop retrying once the "timeout": "2.000s" deadline expires. Actual: the client keeps running indefinitely, i.e. it is effectively stuck in the gRPC call to the server.
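
For reference, a minimal sketch of how such a method config is typically applied to a grpc-java channel (the class name, target handling, and config-file name here are assumptions, loosely following the retrying example):

import com.google.gson.Gson;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class RetryConfigChannel {
  // Hypothetical helper: builds a channel that uses the methodConfig above
  // (2s timeout plus retryPolicy) for helloworld.Greeter/SayHello.
  static ManagedChannel build(String target) {
    Map<String, ?> serviceConfig = new Gson().fromJson(
        new InputStreamReader(
            RetryConfigChannel.class.getResourceAsStream("retrying_service_config.json"),
            StandardCharsets.UTF_8),
        Map.class);
    return ManagedChannelBuilder.forTarget(target)
        .defaultServiceConfig(serviceConfig) // methodConfig with timeout + retryPolicy
        .enableRetry()                       // ensure retries are enabled on the channel
        .usePlaintext()
        .build();
  }
}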

ejona86 commented Jun 30, 2023

What version of gRPC are you using? Since you mention retries as a cause, does that mean the timeout works if you remove the retryPolicy?

ejona86 commented Jun 30, 2023

I see you saw #10314. Are you using Netty or OkHttp?

@ranavivek04 (Author)

> What version of gRPC are you using? Since you mention retries as a cause, does that mean the timeout works if you remove the retryPolicy?

<grpc.version>1.53.0</grpc.version>
Retry works fine when I either remove the timeout or set it to a value greater than the combined time of the retry attempts.
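
For reference, with this retryPolicy the total backoff across the 5 attempts is at most 0.5 + 1 + 2 + 4 = 7.5 s (gRPC randomizes each delay, so the actual waits are usually shorter). That is presumably why a timeout comfortably above that value behaves as expected, while the 2 s timeout expires in the middle of the retry sequence.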

@ranavivek04 (Author)

> I see you saw #10314. Are you using Netty or OkHttp?

Using Netty.

ranavivek04 commented Jun 30, 2023

This can easily be reproduced with the config below in the io.grpc.examples.retrying.RetryingHelloWorldClient example.

{
  "methodConfig": [
    {
      "name": [
        {
          "service": "helloworld.Greeter",
          "method": "SayHello"
        }
      ],
      "timeout": "2.000s",
      "retryPolicy": {
        "maxAttempts": 5,
        "initialBackoff": "0.5s",
        "maxBackoff": "30s",
        "backoffMultiplier": 2,
        "retryableStatusCodes": [
          "UNAVAILABLE"
        ]
      }
    }
  ]
}

Just change the method below in io.grpc.examples.retrying.RetryingHelloWorldServer to return Status.UNAVAILABLE for all requests, as shown:

    @Override
    public void sayHello(HelloRequest request, StreamObserver<HelloReply> responseObserver) {
      int count = retryCounter.incrementAndGet();
      // Always return UNAVAILABLE so every attempt is retried:
      // if (random.nextFloat() < UNAVAILABLE_PERCENTAGE) {
      logger.info("Returning stubbed UNAVAILABLE error. count: " + count);
      responseObserver.onError(Status.UNAVAILABLE
          .withDescription("Greeter temporarily unavailable...").asRuntimeException());
      // } else {
      //   logger.info("Returning successful Hello response, count: " + count);
      //   HelloReply reply = HelloReply.newBuilder().setMessage("Hello " + request.getName()).build();
      //   responseObserver.onNext(reply);
      //   responseObserver.onCompleted();
      // }
    }
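
With that change and the config above, the blocking call in the example client should fail (typically with DEADLINE_EXCEEDED) after at most about 2 seconds; on the affected versions it hangs instead. A minimal sketch of such a call (the wrapper class and channel setup are assumptions):

import io.grpc.ManagedChannel;
import io.grpc.StatusRuntimeException;
import io.grpc.examples.helloworld.GreeterGrpc;
import io.grpc.examples.helloworld.HelloReply;
import io.grpc.examples.helloworld.HelloRequest;

public class ReproClient {
  // Expected: StatusRuntimeException with DEADLINE_EXCEEDED after about 2s.
  // Observed on affected versions: the blocking call never returns.
  static void callOnce(ManagedChannel channel) {
    GreeterGrpc.GreeterBlockingStub stub = GreeterGrpc.newBlockingStub(channel);
    try {
      HelloReply reply = stub.sayHello(HelloRequest.newBuilder().setName("world").build());
      System.out.println("Unexpected success: " + reply.getMessage());
    } catch (StatusRuntimeException e) {
      System.out.println("RPC failed: " + e.getStatus());
    }
  }
}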

ejona86 commented Jun 30, 2023

I've reproduced this on master. It has been broken since at least 1.52.0; I didn't test further back. And yes, removing the retryPolicy fixes the timeout.

diff --git a/examples/src/main/java/io/grpc/examples/retrying/RetryingHelloWorldServer.java b/examples/src/main/java/io/grpc/examples/retrying/RetryingHelloWorldServer.java
index 165cc72ff..a22980d76 100644
--- a/examples/src/main/java/io/grpc/examples/retrying/RetryingHelloWorldServer.java
+++ b/examples/src/main/java/io/grpc/examples/retrying/RetryingHelloWorldServer.java
@@ -97,16 +97,16 @@ public class RetryingHelloWorldServer {
     @Override
     public void sayHello(HelloRequest request, StreamObserver<HelloReply> responseObserver) {
       int count = retryCounter.incrementAndGet();
-      if (random.nextFloat() < UNAVAILABLE_PERCENTAGE) {
+      //if (random.nextFloat() < UNAVAILABLE_PERCENTAGE) {
         logger.info("Returning stubbed UNAVAILABLE error. count: " + count);
         responseObserver.onError(Status.UNAVAILABLE
             .withDescription("Greeter temporarily unavailable...").asRuntimeException());
-      } else {
-        logger.info("Returning successful Hello response, count: " + count);
-        HelloReply reply = HelloReply.newBuilder().setMessage("Hello " + request.getName()).build();
-        responseObserver.onNext(reply);
-        responseObserver.onCompleted();
-      }
+      //} else {
+      //  logger.info("Returning successful Hello response, count: " + count);
+      //  HelloReply reply = HelloReply.newBuilder().setMessage("Hello " + request.getName()).build();
+      //  responseObserver.onNext(reply);
+      //  responseObserver.onCompleted();
+      //}
     }
   }
 }
diff --git a/examples/src/main/resources/io/grpc/examples/retrying/retrying_service_config.json b/examples/src/main/resources/io/grpc/examples/retrying/retrying_service_config.json
index ff115a9ad..9d202d77b 100644
--- a/examples/src/main/resources/io/grpc/examples/retrying/retrying_service_config.json
+++ b/examples/src/main/resources/io/grpc/examples/retrying/retrying_service_config.json
@@ -8,6 +8,7 @@
         }
       ],
 
+      "timeout": "2.000s",
       "retryPolicy": {
         "maxAttempts": 5,
         "initialBackoff": "0.5s",

burk commented Sep 29, 2023

As commented in the closed issue, we had the same problem, and I'm pretty sure it was introduced between versions 1.51.3 and 1.52.0.

ejona86 commented Jan 19, 2024

@turchenkoalex did some detailed investigation in #10838, which I think is related to this issue. @larry-safran, please look at the text in #10838; I didn't read it carefully, but what it describes is very likely related, so I'm hopeful the root cause has been found.

larry-safran added a commit to larry-safran/grpc-java that referenced this issue Jan 27, 2024
@larry-safran (Contributor)

The solution proposed in #10838 fixed this. The root cause hypothesis was valid.

larry-safran added a commit that referenced this issue Feb 5, 2024
* Fix retries that timeout hanging forever. (#10855)

Fixes #10336

* Fix flaky retry tests (#10887)

* Reorder tracing and actually closing listener to eliminate test flakiness
* Use real value rather than mock for flaky test
ejona86 modified the milestones: Next, 1.63 (Feb 9, 2024)
31KM commented Mar 27, 2024

Do you have an ETA for when this fix will be released in 1.59.x and 1.60.x?

@larry-safran (Contributor)

There was a defect in this fix that was corrected in #11026. Is there a reason why you need 1.59 or 1.60?

31KM commented Mar 28, 2024

Thanks for your response and the pointer to the follow-up fix!
We are stuck on older versions because of the (kinda insane) update policy of one of our customers and also some version conflicts (grpc -> okhttp -> okio -> kotlin). We are in the automotive industry for what it's worth.

#11026 has not been backported to 1.59.x and 1.60.x AFAICT. Do you have plans to do this and perhaps release these versions?

ejona86 commented Mar 28, 2024

Per the support policy, we wouldn't ordinarily do releases for 1.59 and 1.60. Backports are cheap and easy, so we sometimes backport to branches we don't know we'll ever release, so that if a release does happen for some reason the fix is included. I'm not quite sure why larry-safran backported to that many branches, though.

To be clear, you need both 1.59 and 1.60? Just one or the other wouldn't be enough?

Is there anything we can do to help with the okio version conflicts? It might be good to open a new issue for that to discuss whether there are options. Our Bazel build, for example, is still using 2.10.0.
