Creating a retry policy using the service config causes `assertion failed: prior > 0` #25003

Open

Emil-Hansen opened this issue Dec 15, 2020 · 4 comments

Emil-Hansen commented Dec 15, 2020

What version of gRPC and what language are you using?

grpcio 1.34.0, Python (also tried grpcio 1.33.2 and 1.32.0)

What operating system (Linux, Windows,...) and version?

Tested on Ubuntu 20.04 (Docker), Debian Buster (Docker), and macOS 11.1; the results were the same in all cases.

What runtime / compiler are you using (e.g., Python version or version of gcc)?

Python 3.8

What did you do?

Created a server that returns RESOURCE_EXHAUSTED to the client when heavily loaded, and had the client use a service config to retry.

Steps to reproduce:

  1. Installed grpcio and grpcio-tools with version 1.34.0
  2. Ran the server from the file attached
  3. Ran the client from the file attached

Server:

from concurrent import futures
import logging

import grpc

# Generate the proto/stub modules at runtime (requires grpcio-tools).
protos, services = grpc.protos_and_services("test.proto")


class Handler(services.HandlerServicer):

    def Call(self, request, context):
        # Simulate a heavily loaded server: always reject the call.
        context.abort(code=grpc.StatusCode.RESOURCE_EXHAUSTED,
                      details="Server is busy, come back soon")


def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    services.add_HandlerServicer_to_server(Handler(), server)
    server.add_insecure_port('[::]:50051')
    server.start()
    server.wait_for_termination()


if __name__ == '__main__':
    logging.basicConfig()
    serve()

Client:

import json

import grpc

protos, services = grpc.protos_and_services("test.proto")

SERVICE_CONFIG = json.dumps(
    {
        "methodConfig": [
            {
                "name": [
                    {"service": "test.Handler"},
                ],
                "retryPolicy": {
                    "maxAttempts": 5,
                    "initialBackoff": "0.3s",
                    "maxBackoff": "10s",
                    "backoffMultiplier": 3,
                    "retryableStatusCodes": ["UNAVAILABLE", "RESOURCE_EXHAUSTED"],
                },
            }
        ]
    }
)
options = [
    ("grpc.service_config", SERVICE_CONFIG),
]

channel = grpc.insecure_channel('localhost:50051', options=options)
stub = services.HandlerStub(channel)


# Drain the response stream; the RESOURCE_EXHAUSTED error surfaces here.
for response in stub.Call(protos.Request()):
    pass

Proto file:

syntax = "proto3";

package test;


service Handler {
  rpc Call  (Request) returns (stream Response) {}
}

message Request {
}

message Response {}

requirements.txt

grpcio==1.34
grpcio-tools==1.34

What did you expect to see?

Expected the client to retry 4 times and then raise a RESOURCE_EXHAUSTED error.

What did you see instead?

An error in an underlying library caused the call to crash and bring down the client process the call was run from, without giving Python a chance to handle the error.

With GRPC_VERBOSITY=debug, the following stack trace was generated:

D1215 18:44:02.599314007      24 ev_posix.cc:173]            Using polling engine: epollex
D1215 18:44:02.599417707      24 lb_policy_registry.cc:42]   registering LB policy factory for "grpclb"
D1215 18:44:02.599426554      24 lb_policy_registry.cc:42]   registering LB policy factory for "priority_experimental"
D1215 18:44:02.599434638      24 lb_policy_registry.cc:42]   registering LB policy factory for "weighted_target_experimental"
D1215 18:44:02.599438372      24 lb_policy_registry.cc:42]   registering LB policy factory for "pick_first"
D1215 18:44:02.599441453      24 lb_policy_registry.cc:42]   registering LB policy factory for "round_robin"
D1215 18:44:02.599445377      24 dns_resolver_ares.cc:507]   Using ares dns resolver
D1215 18:44:02.599460957      24 lb_policy_registry.cc:42]   registering LB policy factory for "cds_experimental"
D1215 18:44:02.599468822      24 lb_policy_registry.cc:42]   registering LB policy factory for "eds_experimental"
D1215 18:44:02.599503917      24 lb_policy_registry.cc:42]   registering LB policy factory for "eds_drop_experimental"
D1215 18:44:02.599507699      24 lb_policy_registry.cc:42]   registering LB policy factory for "xds_cluster_manager_experimental"
I1215 18:44:02.610086424      24 socket_utils_common_posix.cc:427] Disabling AF_INET6 sockets because ::1 is not available.
I1215 18:44:02.611445714      28 subchannel.cc:1126]         New connected subchannel at 0x7f2364002000 for subchannel 0x55812330b950
E1215 18:44:02.612751183      28 sync.cc:105]                assertion failed: prior > 0

Anything else we should know about your project / environment?

@alexgenco

I am also seeing this error using the Ruby client. It seems like the only option is to avoid retry policies for streaming endpoints, because if you receive a retryable error like this before any messages have been received, your entire application process will be aborted.

Probably worth noting that if you receive an error AFTER the client has received any streaming messages, it correctly cancels the stream and throws the GRPC status error without retrying.
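
If retry policies are avoided for streaming endpoints, retries have to happen at the application level instead. A minimal sketch of what that could look like for the Python client above, assuming the stub and protos objects from the reproduction (the helper name, attempt count, and backoff values are illustrative); it only retries while nothing has been received yet, matching the behaviour described above:

import time

import grpc

def call_with_client_side_retries(stub, request, max_attempts=5, backoff=0.3):
    # Retry the server-streaming Call at the application level, but only as
    # long as no messages have been received yet; retrying after partial
    # results would duplicate messages.
    retryable = (grpc.StatusCode.UNAVAILABLE, grpc.StatusCode.RESOURCE_EXHAUSTED)
    for attempt in range(max_attempts):
        received_any = False
        try:
            for response in stub.Call(request):
                received_any = True
                yield response
            return
        except grpc.RpcError as rpc_error:
            out_of_attempts = attempt == max_attempts - 1
            if received_any or out_of_attempts or rpc_error.code() not in retryable:
                raise
            time.sleep(backoff * (3 ** attempt))

for response in call_with_client_side_retries(stub, protos.Request()):
    pass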


spacether commented Apr 26, 2021

When setting the channel options, I think that one also needs to add:

('grpc.enable_retries', True),
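
For context, a sketch of how that could fit into the client from the report (the service config is the same as above; whether "grpc.enable_retries" is needed may depend on the grpcio version):

import json

import grpc

# Same retry policy as in the original client, plus the channel argument
# mentioned above to explicitly enable retries.
SERVICE_CONFIG = json.dumps({
    "methodConfig": [{
        "name": [{"service": "test.Handler"}],
        "retryPolicy": {
            "maxAttempts": 5,
            "initialBackoff": "0.3s",
            "maxBackoff": "10s",
            "backoffMultiplier": 3,
            "retryableStatusCodes": ["UNAVAILABLE", "RESOURCE_EXHAUSTED"],
        },
    }]
})

options = [
    ("grpc.enable_retries", 1),
    ("grpc.service_config", SERVICE_CONFIG),
]
channel = grpc.insecure_channel("localhost:50051", options=options)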


nanonyme commented Jun 21, 2021

It's essentially this assert which is being hit, right? https://github.com/grpc/grpc/blob/master/src/core/lib/gpr/sync.cc#L105. What behaviour is that assert guarding against? Shouldn't that be handled through return values instead in release builds?

@nanonyme

@yashykt are you sure the Python label is correct? If the failure is also happening in other client bindings and is caused by an assert in core, isn't this more likely a core issue?

lidizheng removed their assignment Jun 3, 2022