
Unexpected StatusCode.UNAVAILABLE: Connection reset by peer #19514

Closed
yutkin opened this issue Jul 1, 2019 · 12 comments

Comments

@yutkin

yutkin commented Jul 1, 2019

What version of gRPC and what language are you using?

Python 3.6
grpcio==1.21.1

What operating system (Linux, Windows,...) and version?

Ubuntu 18 in Docker

What runtime / compiler are you using (e.g. python version or version of gcc)

Running in Kubernetes with Envoy as a proxy

What did you do?

Invoke a function via gRPC

What did you expect to see?

Normal invocation and computed result

What did you see instead?

Sometimes (in roughly 1-5% of calls) I get:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/backoffice/worker.py", line 138, in handle
    text
  File "/usr/local/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.6/site-packages/backoffice/worker.py", line 261, in task_handler
    data.update(self.get_faq_search_article(input_text))
  File "/usr/local/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.6/site-packages/backoffice/worker.py", line 225, in get_faq_search_article
    response = self.faq_search_stub.GetFAQArticleID(query)
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 565, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Connection reset by peer"
    debug_error_string = "{"created":"@1561963566.707249650","description":"Error received from peer ipv4:172.23.218.229:8100","file":"src/core/lib/surface/call.cc","file_line":1046,"grpc_message":"Connection reset by peer","grpc_status":14}"
>

status_code => 'StatusCode.UNAVAILABLE'
error_details => 'Connection reset by peer'
@yutkin
Author

yutkin commented Jul 1, 2019

A self-made retry mechanism with exponential backoff helped mitigate this issue.

@gnossen
Contributor

gnossen commented Jul 1, 2019

How can we reproduce this problem? Do you have a minimal reproduction case?
Failing that, can you please turn up logging by setting the following environment variables and supply us with the output?

GRPC_TRACE=all
GRPC_VERBOSITY=DEBUG
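
A minimal way to do that on the Python side (a sketch; the variables can equally be exported in the Dockerfile or Kubernetes pod spec, and they need to be in place before the gRPC core initializes):

import os

# Sketch: configure gRPC core tracing before the library is imported.
os.environ["GRPC_TRACE"] = "all"
os.environ["GRPC_VERBOSITY"] = "DEBUG"

import grpc  # noqa: E402  (imported only after the environment is configured)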

Does this problem occur in earlier versions of gRPC?

@geoffjukes

@yutkin Would you be willing to share your solution?

@yutkin
Author

yutkin commented Sep 16, 2019

@geoffjukes You have to define an interceptor for RPC errors. It retries failed calls using an exponential-backoff sleeping policy:

import abc
import logging
import time
from random import randint
from typing import Optional, Tuple

import grpc

logger = logging.getLogger(__name__)

class SleepingPolicy(abc.ABC):
    @abc.abstractmethod
    def sleep(self, try_i: int):
        """
        Sleep before the next retry attempt.
        :param try_i: the number of the retry (starting from zero)
        """
        assert try_i >= 0

class ExponentialBackoff(SleepingPolicy):
    def __init__(self, *, init_backoff_ms: int, max_backoff_ms: int, multiplier: int):
        self.init_backoff = randint(0, init_backoff_ms)
        self.max_backoff = max_backoff_ms
        self.multiplier = multiplier

    def sleep(self, try_i: int):
        sleep_range = min(
            self.init_backoff * self.multiplier ** try_i, self.max_backoff
        )
        sleep_ms = randint(0, sleep_range)
        logger.debug(f"Sleeping for {sleep_ms}")
        time.sleep(sleep_ms / 1000)

class RetryOnRpcErrorClientInterceptor(
    grpc.UnaryUnaryClientInterceptor, grpc.StreamUnaryClientInterceptor
):
    def __init__(
        self,
        *,
        max_attempts: int,
        sleeping_policy: SleepingPolicy,
        status_for_retry: Optional[Tuple[grpc.StatusCode]] = None,
    ):
        self.max_attempts = max_attempts
        self.sleeping_policy = sleeping_policy
        self.status_for_retry = status_for_retry

    def _intercept_call(self, continuation, client_call_details, request_or_iterator):

        for try_i in range(self.max_attempts):
            response = continuation(client_call_details, request_or_iterator)

            if isinstance(response, grpc.RpcError):

                # Return if it was last attempt
                if try_i == (self.max_attempts - 1):
                    return response

                # If status code is not in retryable status codes
                if (
                    self.status_for_retry
                    and response.code() not in self.status_for_retry
                ):
                    return response

                self.sleeping_policy.sleep(try_i)
            else:
                return response

    def intercept_unary_unary(self, continuation, client_call_details, request):
        return self._intercept_call(continuation, client_call_details, request)

    def intercept_stream_unary(
        self, continuation, client_call_details, request_iterator
    ):
        return self._intercept_call(continuation, client_call_details, request_iterator)

Usage example:

interceptors = (
  RetryOnRpcErrorClientInterceptor(
    max_attempts=4,
    sleeping_policy=ExponentialBackoff(init_backoff_ms=100, max_backoff_ms=1600, multiplier=2),
    status_for_retry=(grpc.StatusCode.UNAVAILABLE,),
  ),
)
stub = YourStub(
    grpc.intercept_channel(grpc.insecure_channel("service-hostname:8100"), *interceptors)
)

@geoffjukes

Thank you so much for sharing @yutkin, this is incredibly helpful.

@gnossen
Contributor

gnossen commented Sep 16, 2019

@yutkin Thanks for the detailed interceptor! Retries are not currently handled automatically within the gRPC library, either in Python or in C++, so doing retries like this within your application is definitely the way to go. If you'd like to contribute this interceptor, we have a GitHub project called grpc-ecosystem for great community-authored add-ons like this. (guide for contributing)

@AspirinSJL AspirinSJL assigned lidizheng and gnossen and unassigned lidizheng Oct 14, 2019
@iamliamc

iamliamc commented Nov 13, 2019

What I don't understand is why this would happen in the first place. Is this a normal possibility in the Python gRPC library, or does it say something about the service that is responding:

status_code => 'StatusCode.UNAVAILABLE'
error_details => 'Connection reset by peer'

It's also not clear from https://grpc.github.io/grpc/python/_modules/grpc.html#intercept_channel whether there is support for secure channels --- I found out that yes, it does indeed work.

Finally, does anyone know what "This is an EXPERIMENTAL API." means as far as using this channel interceptor in production?

@gnossen
Contributor

gnossen commented Nov 13, 2019

@iamliamc It says something about the connection between the client and the server.

Yes, interceptors do indeed work with secure_channels.
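
A minimal sketch of the secure-channel variant, reusing the `interceptors` tuple and the hypothetical "service-hostname:8100" target from the usage example earlier in the thread:

import grpc

# Default TLS credentials; pass your own root certificates if needed.
credentials = grpc.ssl_channel_credentials()
channel = grpc.intercept_channel(
    grpc.secure_channel("service-hostname:8100", credentials),
    *interceptors,
)
stub = YourStub(channel)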

Interceptors are currently considered an experimental API, which means that we technically reserve the right to change the API without making it a major version bump. In practice, however, so many things have come to depend upon the current form of the interceptor API that it's highly unlikely we'll do this. In fact, we probably need to make it a priority to graduate the API from experimental status in the not-too-distant future.

pavius pushed a commit to v3io/frames that referenced this issue Feb 7, 2020
HTTP:
- Use a persistent HTTP session, opened upon object initialization rather than on each request
- The default `persist_connection=False` adds a header making sure the connection is closed after each request. `persist_connection=True` leaves the connection open, which is more performant for frequent requests

GRPC:
- Use a persistent channel opened on object initialization, similar to HTTP, but here there is no `persist_connection=False` -- in gRPC the channel is always persistent (there is no way to force-close it or disable keepalive).
- Upgraded grpc to 1.26.0, which supports interceptors, and added interceptors with exponential retries upon UNAVAILABLE response errors on the channel. See: grpc/grpc#19514
@stale

stale bot commented May 6, 2020

This issue/PR has been automatically marked as stale because it has not had any update (including commits, comments, labels, milestones, etc.) for 30 days. It will be closed automatically if no further update occurs in 7 days. Thank you for your contributions!

@gnossen gnossen closed this as completed May 12, 2020
kriben added a commit to OPM/ResInsight that referenced this issue Dec 18, 2020
Grpc connection is sometimes reset on flaky networks, and this is not
handled by the python grpc library. Solved by intercepting UNAVAILABLE
responses and retrying the command.

Adapted from this issue in python grpc repo:
grpc/grpc#19514
kriben added a commit to OPM/ResInsight that referenced this issue Dec 21, 2020
Grpc connection is sometimes reset on flaky networks, and this is not
handled by the python grpc library. Solved by intercepting UNAVAILABLE
responses and retrying the command.

Adapted from this issue in python grpc repo:
grpc/grpc#19514
@caryan

caryan commented Apr 22, 2021

For anyone trying this approach with the aio flavour: there is an if-elif chain that prevents you from using the multiple-inheritance style above, so you have to make a separate interceptor class for each type of RPC call. E.g.

class RetryOnRpcErrorClientInterceptor:
    def __init__(
        self,
        *,
        max_attempts: int,
        sleeping_policy: SleepingPolicy,
        status_for_retry: Optional[Tuple[grpc.StatusCode]] = None,
    ):
        self.max_attempts = max_attempts
        self.sleeping_policy = sleeping_policy
        self.status_for_retry = status_for_retry

    def _intercept_call(self, continuation, client_call_details, request_or_iterator):

        for try_i in range(self.max_attempts):
            response = continuation(client_call_details, request_or_iterator)

            if isinstance(response, grpc.RpcError):

                # Return if it was last attempt
                if try_i == (self.max_attempts - 1):
                    return response

                # If status code is not in retryable status codes
                if (
                    self.status_for_retry
                    and response.code() not in self.status_for_retry
                ):
                    return response

                self.sleeping_policy.sleep(try_i)
            else:
                return response

class UnaryUnaryRetryOnRpcErrorClientInterceptor(
    RetryOnRpcErrorClientInterceptor,
    grpc.aio.UnaryUnaryClientInterceptor
):
    def intercept_unary_unary(self, continuation, client_call_details, request):
        return self._intercept_call(continuation, client_call_details, request)

class StreamUnaryRetryOnRpcErrorClientInterceptor(
    RetryOnRpcErrorClientInterceptor,
    grpc.aio.StreamUnaryClientInterceptor
):
    def intercept_stream_unary(self, continuation, client_call_details, request_iterator):
        return self._intercept_call(continuation, client_call_details, request_iterator)
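
A hedged wiring sketch for these aio variants, assuming the SleepingPolicy/ExponentialBackoff helpers from earlier in the thread and the same hypothetical target address; grpc.aio channels accept the interceptors directly:

channel = grpc.aio.insecure_channel(
    "service-hostname:8100",  # hypothetical target, as in the earlier example
    interceptors=[
        UnaryUnaryRetryOnRpcErrorClientInterceptor(
            max_attempts=4,
            sleeping_policy=ExponentialBackoff(
                init_backoff_ms=100, max_backoff_ms=1600, multiplier=2
            ),
            status_for_retry=(grpc.StatusCode.UNAVAILABLE,),
        ),
    ],
)
stub = YourStub(channel)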

@vanakema

Weirdly, we only experience this on versions > 1.25. We are also building from source now, since we need to use OpenSSL instead of BoringSSL (our certs are somehow incompatible with BoringSSL but work fine with OpenSSL; we're working on replacing them with BoringSSL-compatible certs), but I can't see how this issue would be related to OpenSSL.

Has anyone else experienced this in the latest versions of gRPC (1.46.3)?
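
For what it's worth, in gRPC releases of roughly that vintage (1.40 and later), client-side retries can also be configured through the channel's service config instead of an interceptor. A sketch, assuming a recent grpcio; "yourpackage.YourService" and the target address are hypothetical placeholders:

import json

import grpc

# Retry UNAVAILABLE with exponential backoff, handled by the gRPC core.
service_config = json.dumps({
    "methodConfig": [{
        "name": [{"service": "yourpackage.YourService"}],
        "retryPolicy": {
            "maxAttempts": 4,
            "initialBackoff": "0.1s",
            "maxBackoff": "1.6s",
            "backoffMultiplier": 2,
            "retryableStatusCodes": ["UNAVAILABLE"],
        },
    }]
})

channel = grpc.insecure_channel(
    "service-hostname:8100",
    options=[("grpc.service_config", service_config)],
)

Depending on the release, the ("grpc.enable_retries", 1) channel option may also need to be set explicitly.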

@ndvbd

ndvbd commented Mar 10, 2023

Is the randomness in the sleep needed in this grpc case?
