Pubsub: Pull Subscriber unable to re-connect after a while #7910

Closed
daaain opened this issue May 9, 2019 · 11 comments · Fixed by #8193
Labels
api: pubsub (Issues related to the Pub/Sub API) · priority: p1 (Important issue which blocks shipping the next release; will be fixed prior to next release) · triaged for GA · type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns)

Comments

@daaain

daaain commented May 9, 2019

Environment details

  1. API: PubSub (maybe gRPC)
  2. OS type and version: Debian GNU/Linux 9 (stretch) container (based on python:2.7.15-slim-stretch) running in 1.12.6-gke.10
  3. Python version: 2.7.15
  4. google-cloud-pubsub version: 0.40.0

Steps to reproduce

  1. Have several PubSub subscribers run for a while (from hours to days) in a Kubernetes container, reconnecting every 10 minutes as normal
  2. One of the subscribers suddenly starts seeing its reconnection rate go up as much as CPU resources allow (retrying every few milliseconds in our case), repeatedly getting 503 errors
  3. At this point the subscriber stops pulling messages and the process needs to be restarted

The issue did also happen in a different service with only one subscriber in the container, but having several subscribers where all the others keep working rules out many other factors that could prevent re-connection (e.g. DNS resolution failure, no network, etc.).

Code example

A completely standard pull subscription using SubscriberClient + create_subscription + subscribe. I can paste the actual code if required, though.
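
For reference, a rough sketch of that setup (project, topic and subscription names are placeholders, and the exact create_subscription signature may differ between library versions):

from google.cloud import pubsub_v1

# Placeholder names, not the real ones.
PROJECT_ID = "my-project"
TOPIC_PATH = "projects/{}/topics/my-topic".format(PROJECT_ID)
SUBSCRIPTION_PATH = "projects/{}/subscriptions/my-subscription".format(PROJECT_ID)

subscriber = pubsub_v1.SubscriberClient()
subscriber.create_subscription(SUBSCRIPTION_PATH, TOPIC_PATH)

def my_callback(message):
    # Process the message, then acknowledge it.
    message.ack()

# Starts the streaming pull in background threads and returns a future.
future = subscriber.subscribe(SUBSCRIPTION_PATH, callback=my_callback)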

StackDriver log snippet

...
I  2019-05-09 10:51:06.131 [INFO] Re-established stream
I  2019-05-09 10:51:06.134 [DEBUG] Call to retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7f2fde071110>> caused 503 Connect Failed.
I  2019-05-09 10:51:06.135 [INFO] Observed recoverable stream error 503 Connect Failed 
I  2019-05-09 10:51:06.135 [DEBUG] Re-opening stream from retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7f2fde071110>>.
I  2019-05-09 10:51:06.137 [DEBUG] Stream was already re-established.
I  2019-05-09 10:51:06.138 [INFO] Re-established stream
I  2019-05-09 10:51:06.141 [INFO] Observed recoverable stream error 503 Connect Failed
I  2019-05-09 10:51:06.142 [DEBUG] Re-opening stream from gRPC callback.
I  2019-05-09 10:51:06.143 [INFO] Re-established stream 
I  2019-05-09 10:51:06.145 [DEBUG] The current p99 value is 10 seconds.
I  2019-05-09 10:51:06.146 [DEBUG] Snoozing lease management for 7.560536 seconds.
I  2019-05-09 10:51:06.147 [INFO] Observed recoverable stream error 503 Connect Failed
I  2019-05-09 10:51:06.147 [INFO] Re-established stream
...

First of all, I'd of course be interested in helping to get to the bottom of the issue and getting it resolved.

But in the meantime it would be great to have a workaround for detecting a lost connection on a subscriber. I went through the public API documentation and couldn't find a way to get at the underlying (gRPC?) client; a clean(ish) way to periodically check the connection would make it possible to restart once the issue happens.
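
For illustration, the kind of periodic check I have in mind is just a watchdog that tracks when the callback last fired and bails out once the stream looks wedged. This is only a sketch with placeholder names; it assumes the future returned by subscribe() supports cancel() and that the subscription normally receives regular traffic:

import threading
import time

from google.cloud import pubsub_v1

SUBSCRIPTION_PATH = "projects/my-project/subscriptions/my-subscription"  # placeholder
STALE_AFTER_SECONDS = 600

last_message_time = [time.time()]  # updated by the callback, read by the watchdog

def my_callback(message):
    last_message_time[0] = time.time()
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
future = subscriber.subscribe(SUBSCRIPTION_PATH, callback=my_callback)

def watchdog():
    while True:
        time.sleep(60)
        if time.time() - last_message_time[0] > STALE_AFTER_SECONDS:
            # Assume the stream is stuck; cancel it and let the process
            # supervisor (Kubernetes in our case) restart the container.
            future.cancel()
            break

threading.Thread(target=watchdog).start()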

Thanks a lot in advance 🙏

@tseaver added the "api: pubsub" and "type: question" labels on May 9, 2019
@daaain
Author

daaain commented May 13, 2019

I came across #7709 which might be the root cause. When the new release comes out I'll post an update, but I will also try to dig deeper in the meantime.

@plamut
Contributor

plamut commented May 15, 2019

@daaain Thank you for reporting this and doing the initial research!

The linked issue could indeed be the root cause, and the recently merged fix for it makes sure that a clean shutdown of the background threads is triggered if the underlying gRPC channel remains in an error state for too long.

The log records posted here are produced by these very threads, thus shutting them down should get rid of this problem.

Looking forward to more info after the next release!

(Alternatively, if the nature of your application allows for it, i.e. it is not mission-critical, you could also experiment with the current development version of google-cloud-pubsub.)

@plamut
Contributor

plamut commented May 15, 2019

Update:
Whoa, I might have just been (un)lucky enough to actually observe a very similar error locally. While working on another issue and running the subscriber client's streaming pull (subscriber.subscribe()), my WiFi connection apparently got unstable, and I saw the following in the log output:

...
DEBUG    [2019-05-15 10:06:01,440] Thread-ConsumeBidirectionalStream [_recoverable] Call to retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7fb5eb91d630>> caused 503 DNS resolution failed.
INFO     [2019-05-15 10:06:01,440] Thread-ConsumeBidirectionalStream [_should_recover] Observed recoverable stream error 503 DNS resolution failed
DEBUG    [2019-05-15 10:06:01,440] Thread-ConsumeBidirectionalStream [_recoverable] Re-opening stream from retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7fb5eb91d630>>.
INFO     [2019-05-15 10:06:01,441] Thread-ConsumeBidirectionalStream [_reopen] Re-established stream
INFO     [2019-05-15 10:06:01,441] Thread-1                          [_should_recover] Observed recoverable stream error 503 DNS resolution failed
DEBUG    [2019-05-15 10:06:01,441] Thread-1                          [_on_call_done] Re-opening stream from gRPC callback.
DEBUG    [2019-05-15 10:06:01,441] Thread-1                          [_reopen] Stream was already re-established.
INFO     [2019-05-15 10:06:01,441] Thread-1                          [_should_recover] Observed recoverable stream error 503 DNS resolution failed
DEBUG    [2019-05-15 10:06:01,441] Thread-1                          [_on_call_done] Re-opening stream from gRPC callback.
INFO     [2019-05-15 10:06:01,441] Thread-1                          [_reopen] Re-established stream
DEBUG    [2019-05-15 10:06:01,441] Thread-ConsumeBidirectionalStream [_recoverable] Call to retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7fb5eb91d630>> caused 503 DNS resolution failed.
INFO     [2019-05-15 10:06:01,442] Thread-ConsumeBidirectionalStream [_should_recover] Observed recoverable stream error 503 DNS resolution failed
DEBUG    [2019-05-15 10:06:01,442] Thread-ConsumeBidirectionalStream [_recoverable] Re-opening stream from retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7fb5eb91d630>>.
INFO     [2019-05-15 10:06:01,442] Thread-ConsumeBidirectionalStream [_reopen] Re-established stream
INFO     [2019-05-15 10:06:01,442] Thread-1                          [_should_recover] Observed recoverable stream error 503 DNS resolution failed
DEBUG    [2019-05-15 10:06:01,442] Thread-1                          [_on_call_done] Re-opening stream from gRPC callback.
DEBUG    [2019-05-15 10:06:01,442] Thread-1                          [_reopen] Stream was already re-established.
INFO     [2019-05-15 10:06:01,442] Thread-1                          [_should_recover] Observed recoverable stream error 503 DNS resolution failed
DEBUG    [2019-05-15 10:06:01,442] Thread-1                          [_on_call_done] Re-opening stream from gRPC callback.
INFO     [2019-05-15 10:06:01,443] Thread-1                          [_reopen] Re-established stream
DEBUG    [2019-05-15 10:06:01,443] Thread-ConsumeBidirectionalStream [_recoverable] Call to retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7fb5eb91d630>> caused 503 DNS resolution failed.
INFO     [2019-05-15 10:06:01,443] Thread-ConsumeBidirectionalStream [_should_recover] Observed recoverable stream error 503 DNS resolution failed
DEBUG    [2019-05-15 10:06:01,443] Thread-ConsumeBidirectionalStream [_recoverable] Re-opening stream from retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7fb5eb91d630>>.
INFO     [2019-05-15 10:06:01,444] Thread-ConsumeBidirectionalStream [_reopen] Re-established stream
INFO     [2019-05-15 10:06:01,444] Thread-1                          [_should_recover] Observed recoverable stream error 503 DNS resolution failed
DEBUG    [2019-05-15 10:06:01,444] Thread-1                          [_on_call_done] Re-opening stream from gRPC callback.
DEBUG    [2019-05-15 10:06:01,444] Thread-1                          [_reopen] Stream was already re-established.
INFO     [2019-05-15 10:06:01,444] Thread-1                          [_should_recover] Observed recoverable stream error 503 DNS resolution failed
DEBUG    [2019-05-15 10:06:01,444] Thread-1                          [_on_call_done] Re-opening stream from gRPC callback.
INFO     [2019-05-15 10:06:01,445] Thread-1                          [_reopen] Re-established stream
...

It appears that the streaming_pull() method is being repeatedly called in rapid succession despite the Retry settings that should result in exponentially increasing delays between successive calls.

The issue is reproducible on the current latest master branch, which already includes the aforementioned fix for #7709, thus my previous comment unfortunately does not apply. To reproduce it locally, simply disable your internet connection and then use the subscriber client for a streaming pull:

import time

# subscribe() runs the streaming pull in background threads and returns a
# future; the main thread only needs to stay alive.
future = subscriber.subscribe(SUBSCRIPTION_PATH, callback=my_callback)

while True:
    try:
        time.sleep(60)
    except KeyboardInterrupt:
        future.cancel()
        break

This busy "re-open the stream" loop consuming a lot of CPU is definitely not expected behavior.

@plamut added the "type: bug" and "priority: p1" labels and removed the "type: question" label on May 15, 2019
@sduskis
Contributor

sduskis commented May 17, 2019

I think that the "definition of done" for this issue will be to definitively prove that this is a gRPC-level issue. If so, we can open a gRPC issue and work with the Pub/Sub team to see if this is truly GA blocking.

@tseaver changed the title from "PubSub: Pull Subscriber unable to re-connect after a while" to "Pubsub: Pull Subscriber unable to re-connect after a while" on May 20, 2019
@Dan4London

Dan4London commented May 21, 2019

Hi,

I've been experiencing something similar, but with publishing. When the connection to the cloud is disrupted (error 503), gRPC keeps trying in the background to re-establish connectivity with "exponentially increasing delays" as described above. Despite using an async request (see the code sample below), this essentially hijacks the normal execution of the program until the connection is re-established.

This error can easily be replicated by manually removing/adding routes in the routing table to simulate disconnects from and reconnects to the cloud. I've managed to replicate the issue with sample code from the Pub/Sub documentation, as follows:

import datetime
from functools import partial

# PUBLISHER, TOPIC_PATH and TOPIC_NAME are created elsewhere with
# pubsub_v1.PublisherClient() and its topic_path() helper.

def callback(message_future, testOne, testTwo):
    # When timeout is unspecified, the exception method waits indefinitely.
    if message_future.exception(timeout=3):
        print('Publishing message on {} threw an Exception {}.'.format(
            TOPIC_NAME, message_future.exception()))
    else:
        print("The number is: {} and the refOne: {} plus refTwo {}".format(
            message_future.result(), testOne, testTwo))


for n in range(1, 6):
    data = u'Message number {}'.format(n)
    # Data must be a bytestring
    data = data.encode('utf-8')
    # When you publish a message, the client returns a Future.
    message_future = PUBLISHER.publish(TOPIC_PATH, data=data)
    message_future.add_done_callback(
        partial(callback, testOne=123, testTwo=str(datetime.datetime.now())))

print('Published message IDs:')

I am using the latest google-cloud-pubsub library available to date (0.41). Any suggestions/advice?

Thanks!

Dan

@plamut
Contributor

plamut commented May 22, 2019

Thanks for reporting this, @Dan4London. Could it be that this is the same as #8036, which reports a similar problem with the publisher blocking?

@plamut
Contributor

plamut commented May 22, 2019

I investigated the issue with the subscriber reconnect attempts skyrocketing. It seems that it has to do with the api_core.bidi.ResumableBidiRpc helper.

If there are network problems, the following can happen:

(if the network error gets resolved in the meantime, the behavior returns to normal)


One might wonder why the Retry wrapper that wraps the streaming_pull method does not kick in. The reason probably lies somewhere around the place where the streaming_pull actually gets invoked.

That call does not result in an exception, but instead the grpcio library returns a _Rendezvous object that seems to represent the result of a call. That object gets wrapped into StreamingResponseIterator, and the exception would only occur when next() is called on it.

Since the latter does not happen in that part of the code, the call returns normally (i.e. without an exception), thus the Retry wrapper around it does not kick in (and exponential backoff does not happen).

Instead, the aforementioned _on_call_done() callback is invoked, causing an unwanted retry loop at a different level.
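
To illustrate the mechanism with a toy example (a simplified sketch, not the actual library code): a retry wrapper can only react to exceptions raised while the wrapped call itself is executing, not to errors that only surface later from the object the call returned.

import time

def retry_with_backoff(func, max_attempts=5):
    # Toy retry wrapper: it only retries if calling func() itself raises.
    delay = 1.0
    for _ in range(max_attempts):
        try:
            return func()
        except Exception:
            time.sleep(delay)
            delay *= 2
    raise RuntimeError("gave up")

def open_stream():
    # Like the wrapped streaming pull call, this does not raise here; it
    # merely returns an iterator-like object ...
    def responses():
        # ... and the connection error only surfaces when a consumer thread
        # later calls next() on it.
        raise RuntimeError("503 Connect Failed")
        yield  # never reached; only makes this a generator
    return responses()

stream = retry_with_backoff(open_stream)  # "succeeds", so no backoff happens
next(stream)                              # the error pops up here instead

In the real code, the role of open_stream() is played by the wrapped streaming_pull call, and the role of next() by the consumer thread reading from the StreamingResponseIterator.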

@sduskis
Contributor

sduskis commented May 22, 2019

It sounds like there needs to be a backoff of some sort. This looks very much like this nodejs-pubsub issue.

@busunkim96, this sounds like something we should discuss with you.
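
To make the intent concrete, one possible shape of such a backoff is a throttle that enforces a growing minimum delay between successive re-open attempts (a sketch only, not the eventual fix):

import time

class ReopenThrottle(object):
    # Toy throttle: each wait() call sleeps long enough that re-open
    # attempts are at least `delay` seconds apart, with the delay growing
    # exponentially up to a cap.

    def __init__(self, initial=1.0, maximum=60.0, multiplier=2.0):
        self._initial = initial
        self._maximum = maximum
        self._multiplier = multiplier
        self._delay = initial
        self._last_attempt = 0.0

    def wait(self):
        remaining = self._delay - (time.time() - self._last_attempt)
        if remaining > 0:
            time.sleep(remaining)
        self._last_attempt = time.time()
        self._delay = min(self._delay * self._multiplier, self._maximum)

    def reset(self):
        # Call this once a re-opened stream has stayed healthy for a while.
        self._delay = self._initial

In such a scheme, every place that re-opens the stream (the retryable call and the gRPC callback alike) would call wait() first, so even a flood of 503s could not turn into a busy loop.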

@sduskis
Contributor

sduskis commented May 22, 2019

@plamut that makes sense. @busunkim96 should be involved, or at least aware, with any api_core related changes. @crwilcox mentioned offline that ResumableBidiRpc is used by Firestore as well, so I'm hereby pinging @BenWhitehead, since this will also likely affect Firestore.

@Dan4London

Thanks for your feedback, @plamut. #8036 describes precisely the problem I am experiencing. You can replicate the behaviour by re-executing the for loop before the previous gRPC publish() request has timed out.

Do you have any suggestions when a fix for this would be available?

Thanks!

@plamut
Contributor

plamut commented May 23, 2019

@Dan4London Unfortunately not (yet...), because it seems that the bug will have to be fixed in one of the Pub/Sub client's dependencies, and that will have to be coordinated with other teams that could be affected by the change. The bug is high on the priority list, though.

Edit: Oh, you were probably asking about the publisher issue?
Depends on the priority assigned to it, @sduskis could probably tell more.

Edit 2: If nothing else comes across, I can probably have a look at it tomorrow.

Edit 3: This issue has been prioritized over the publisher issue; I will look at the latter after this.

@yoshi-automation added and removed the "🚨 This issue needs some love." label on Jun 4, 2019
@plamut added and removed the "api: pubsub" label on Jun 5, 2019
@yoshi-automation added and removed the "🚨 This issue needs some love." label on Jun 11, 2019