Pubsub: Pull Subscriber unable to re-connect after a while #7910

Closed
daaain opened this issue May 9, 2019 · 11 comments · Fixed by #8193
Labels
api: pubsub (Issues related to the Pub/Sub API) · priority: p1 (Important issue which blocks shipping the next release; will be fixed prior to next release) · triaged for GA · type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns)

Comments

@daaain

daaain commented May 9, 2019

Environment details

  1. API: PubSub (maybe gRPC)
  2. OS type and version: Debian GNU/Linux 9 (stretch) container (based on python:2.7.15-slim-stretch) running in 1.12.6-gke.10
  3. Python version: 2.7.15
  4. google-cloud-pubsub version: 0.40.0

Steps to reproduce

  1. Have several PubSub subscribers run for a while (from hours to days) in a Kubernetes container, reconnecting every 10 minutes as normal
  2. One of the subscribers suddenly starts seeing its reconnection rate go up as much as CPU resources allow (retrying every few milliseconds in our case), repeatedly getting 503 errors
  3. At this point the subscriber stops pulling messages and the process needs to be restarted

The issue did also happen in a different service with only one subscriber in the container, but having several subscribers where all the others keep working rules out many other factors that could prevent re-connection (e.g. DNS resolution failure, no network, etc.).

Code example

A completely standard pull subscription using SubscriberClient + create_subscription + subscribe. I can paste the actual code if required, though.
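
For reference, a rough sketch of that setup (project, topic and subscription names are placeholders, and the exact create_subscription signature may differ between library versions):

from google.cloud import pubsub_v1

# Placeholder names, not the real ones.
PROJECT_ID = "my-project"
TOPIC_PATH = "projects/{}/topics/my-topic".format(PROJECT_ID)
SUBSCRIPTION_PATH = "projects/{}/subscriptions/my-subscription".format(PROJECT_ID)

subscriber = pubsub_v1.SubscriberClient()
subscriber.create_subscription(SUBSCRIPTION_PATH, TOPIC_PATH)

def my_callback(message):
    # Process the message, then acknowledge it.
    message.ack()

# Starts the streaming pull in background threads and returns a future.
future = subscriber.subscribe(SUBSCRIPTION_PATH, callback=my_callback)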

StackDriver log snippet

...
I  2019-05-09 10:51:06.131 [INFO] Re-established stream
I  2019-05-09 10:51:06.134 [DEBUG] Call to retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7f2fde071110>> caused 503 Connect Failed.
I  2019-05-09 10:51:06.135 [INFO] Observed recoverable stream error 503 Connect Failed 
I  2019-05-09 10:51:06.135 [DEBUG] Re-opening stream from retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7f2fde071110>>.
I  2019-05-09 10:51:06.137 [DEBUG] Stream was already re-established.
I  2019-05-09 10:51:06.138 [INFO] Re-established stream
I  2019-05-09 10:51:06.141 [INFO] Observed recoverable stream error 503 Connect Failed
I  2019-05-09 10:51:06.142 [DEBUG] Re-opening stream from gRPC callback.
I  2019-05-09 10:51:06.143 [INFO] Re-established stream 
I  2019-05-09 10:51:06.145 [DEBUG] The current p99 value is 10 seconds.
I  2019-05-09 10:51:06.146 [DEBUG] Snoozing lease management for 7.560536 seconds.
I  2019-05-09 10:51:06.147 [INFO] Observed recoverable stream error 503 Connect Failed
I  2019-05-09 10:51:06.147 [INFO] Re-established stream
...

First of all, I'd of course be interested in helping to get to the bottom of the issue and getting it resolved.

But in the meantime it would be great to have a workaround for detecting a lost connection on a subscriber. I went through the public API documentation and couldn't find a way to get at the underlying (gRPC?) client; a clean(ish) way to periodically check the connection would make it possible to restart once the issue happens.
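
For illustration, the kind of periodic check I have in mind is just a watchdog that tracks when the callback last fired and bails out once the stream looks wedged. This is only a sketch with placeholder names; it assumes the future returned by subscribe() supports cancel() and that the subscription normally receives regular traffic:

import threading
import time

from google.cloud import pubsub_v1

SUBSCRIPTION_PATH = "projects/my-project/subscriptions/my-subscription"  # placeholder
STALE_AFTER_SECONDS = 600

last_message_time = [time.time()]  # updated by the callback, read by the watchdog

def my_callback(message):
    last_message_time[0] = time.time()
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
future = subscriber.subscribe(SUBSCRIPTION_PATH, callback=my_callback)

def watchdog():
    while True:
        time.sleep(60)
        if time.time() - last_message_time[0] > STALE_AFTER_SECONDS:
            # Assume the stream is stuck; cancel it and let the process
            # supervisor (Kubernetes in our case) restart the container.
            future.cancel()
            break

threading.Thread(target=watchdog).start()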

Thanks a lot in advance 🙏

@tseaver added the "api: pubsub" and "type: question" labels on May 9, 2019
@daaain
Author

daaain commented May 13, 2019

I came across #7709 which might be the root cause. When the new release comes out I'll post an update, but I will also try to dig deeper in the meantime.

@plamut
Contributor

plamut commented May 15, 2019

@daaain Thank you for reporting this and doing the initial research!

The linked issue could indeed be the root cause, and the recently merged fix for it makes sure that a clean shutdown of the background threads is triggered if the underlying gRPC channel remains in an error state for too long.

The log records posted here are produced by these very threads, thus shutting them down should get rid of this problem.

Looking forward to more info after the next release!

(Alternatively, if the nature of your application allows for it, i.e. it is not mission-critical, you could also experiment with the current development version of google-cloud-pubsub.)

@plamut
Contributor

plamut commented May 15, 2019

Update:
Whoa, I might have just been (un)lucky enough to actually observe a very similar error locally. While working on another issue and running the subscriber client's streaming pull (subscriber.subscribe()), my WiFi connection apparently got unstable, and I saw the following in the log output:

...
DEBUG    [2019-05-15 10:06:01,440] Thread-ConsumeBidirectionalStream [_recoverable] Call to retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7fb5eb91d630>> caused 503 DNS resolution failed.
INFO     [2019-05-15 10:06:01,440] Thread-ConsumeBidirectionalStream [_should_recover] Observed recoverable stream error 503 DNS resolution failed
DEBUG    [2019-05-15 10:06:01,440] Thread-ConsumeBidirectionalStream [_recoverable] Re-opening stream from retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7fb5eb91d630>>.
INFO     [2019-05-15 10:06:01,441] Thread-ConsumeBidirectionalStream [_reopen] Re-established stream
INFO     [2019-05-15 10:06:01,441] Thread-1                          [_should_recover] Observed recoverable stream error 503 DNS resolution failed
DEBUG    [2019-05-15 10:06:01,441] Thread-1                          [_on_call_done] Re-opening stream from gRPC callback.
DEBUG    [2019-05-15 10:06:01,441] Thread-1                          [_reopen] Stream was already re-established.
INFO     [2019-05-15 10:06:01,441] Thread-1                          [_should_recover] Observed recoverable stream error 503 DNS resolution failed
DEBUG    [2019-05-15 10:06:01,441] Thread-1                          [_on_call_done] Re-opening stream from gRPC callback.
INFO     [2019-05-15 10:06:01,441] Thread-1                          [_reopen] Re-established stream
DEBUG    [2019-05-15 10:06:01,441] Thread-ConsumeBidirectionalStream [_recoverable] Call to retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7fb5eb91d630>> caused 503 DNS resolution failed.
INFO     [2019-05-15 10:06:01,442] Thread-ConsumeBidirectionalStream [_should_recover] Observed recoverable stream error 503 DNS resolution failed
DEBUG    [2019-05-15 10:06:01,442] Thread-ConsumeBidirectionalStream [_recoverable] Re-opening stream from retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7fb5eb91d630>>.
INFO     [2019-05-15 10:06:01,442] Thread-ConsumeBidirectionalStream [_reopen] Re-established stream
INFO     [2019-05-15 10:06:01,442] Thread-1                          [_should_recover] Observed recoverable stream error 503 DNS resolution failed
DEBUG    [2019-05-15 10:06:01,442] Thread-1                          [_on_call_done] Re-opening stream from gRPC callback.
DEBUG    [2019-05-15 10:06:01,442] Thread-1                          [_reopen] Stream was already re-established.
INFO     [2019-05-15 10:06:01,442] Thread-1                          [_should_recover] Observed recoverable stream error 503 DNS resolution failed
DEBUG    [2019-05-15 10:06:01,442] Thread-1                          [_on_call_done] Re-opening stream from gRPC callback.
INFO     [2019-05-15 10:06:01,443] Thread-1                          [_reopen] Re-established stream
DEBUG    [2019-05-15 10:06:01,443] Thread-ConsumeBidirectionalStream [_recoverable] Call to retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7fb5eb91d630>> caused 503 DNS resolution failed.
INFO     [2019-05-15 10:06:01,443] Thread-ConsumeBidirectionalStream [_should_recover] Observed recoverable stream error 503 DNS resolution failed
DEBUG    [2019-05-15 10:06:01,443] Thread-ConsumeBidirectionalStream [_recoverable] Re-opening stream from retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7fb5eb91d630>>.
INFO     [2019-05-15 10:06:01,444] Thread-ConsumeBidirectionalStream [_reopen] Re-established stream
INFO     [2019-05-15 10:06:01,444] Thread-1                          [_should_recover] Observed recoverable stream error 503 DNS resolution failed
DEBUG    [2019-05-15 10:06:01,444] Thread-1                          [_on_call_done] Re-opening stream from gRPC callback.
DEBUG    [2019-05-15 10:06:01,444] Thread-1                          [_reopen] Stream was already re-established.
INFO     [2019-05-15 10:06:01,444] Thread-1                          [_should_recover] Observed recoverable stream error 503 DNS resolution failed
DEBUG    [2019-05-15 10:06:01,444] Thread-1                          [_on_call_done] Re-opening stream from gRPC callback.
INFO     [2019-05-15 10:06:01,445] Thread-1                          [_reopen] Re-established stream
...

It appears that the streaming_pull() method is being repeatedly called in rapid succession despite the Retry settings that should result in exponentially increasing delays between successive calls.

The issue is reproducible on the current latest master branch, which already includes the aforementioned fix for #7709, thus my previous comment unfortunately does not apply. To reproduce it locally, simply disable your internet connection and then use the subscriber client for a streaming pull:

import time

# subscribe() runs the streaming pull in background threads and returns a
# future; the main thread only needs to stay alive.
future = subscriber.subscribe(SUBSCRIPTION_PATH, callback=my_callback)

while True:
    try:
        time.sleep(60)
    except KeyboardInterrupt:
        future.cancel()
        break

This busy "re-open the stream" loop consuming a lot of CPU is definitely not expected behavior.

@plamut added the "type: bug" and "priority: p1" labels and removed the "type: question" label on May 15, 2019
@sduskis
Contributor

sduskis commented May 17, 2019

I think that the "definition of done" for this issue will be to definitively prove that this is a gRPC-level issue. If so, we can open a gRPC issue and work with the Pub/Sub team to see if this is truly GA blocking.

@tseaver changed the title from "PubSub: Pull Subscriber unable to re-connect after a while" to "Pubsub: Pull Subscriber unable to re-connect after a while" on May 20, 2019
@Dan4London

Dan4London commented May 21, 2019

Hi,

I've been experiencing something similar, but with publishing. When the connection to the cloud is disrupted (error 503), gRPC keeps trying in the background to re-establish connectivity with "exponentially increasing delays" as described above. Despite using an async request (see the code sample below), this essentially hijacks the normal execution of the program until the connection is re-established.

This error can easily be replicated by manually removing/adding routes in the routing table to simulate disconnects from and reconnects to the cloud. I've managed to replicate the issue with sample code from the Pub/Sub documentation, as follows:

import datetime
from functools import partial

# PUBLISHER, TOPIC_PATH and TOPIC_NAME are created elsewhere with
# pubsub_v1.PublisherClient() and its topic_path() helper.

def callback(message_future, testOne, testTwo):
    # When timeout is unspecified, the exception method waits indefinitely.
    if message_future.exception(timeout=3):
        print('Publishing message on {} threw an Exception {}.'.format(
            TOPIC_NAME, message_future.exception()))
    else:
        print("The number is: {} and the refOne: {} plus refTwo {}".format(
            message_future.result(), testOne, testTwo))


for n in range(1, 6):
    data = u'Message number {}'.format(n)
    # Data must be a bytestring
    data = data.encode('utf-8')
    # When you publish a message, the client returns a Future.
    message_future = PUBLISHER.publish(TOPIC_PATH, data=data)
    message_future.add_done_callback(
        partial(callback, testOne=123, testTwo=str(datetime.datetime.now())))

print('Published message IDs:')

I am using the latest google-cloud-pubsub library available to date (0.41). Any suggestions/advice?

Thanks!

Dan

@plamut
Contributor

plamut commented May 22, 2019

Thanks for reporting this, @Dan4London. Could it be that this is the same as #8036, which reports a similar problem with the publisher blocking?

@plamut
Contributor

plamut commented May 22, 2019

I investigated the issue with the subscriber reconnect attempts skyrocketing. It seems that it has to do with the api_core.bidi.ResumableBidiRpc helper.

If there are network problems, the following can happen:

(if the network error gets resolved in the meantime, the behavior returns to normal)


One might wonder why the Retry wrapper that wraps the streaming_pull method does not kick in. The reason probably lies somewhere around the place where the streaming_pull actually gets invoked.

That call does not result in an exception, but instead the grpcio library returns a _Rendezvous object that seems to represent the result of a call. That object gets wrapped into StreamingResponseIterator, and the exception would only occur when next() is called on it.

Since the latter does not happen in that part of the code, the call returns normally (i.e. without an exception), thus the Retry wrapper around it does not kick in (and exponential backoff does not happen).

Instead, the aforementioned _on_call_done() callback is invoked, causing an unwanted retry loop at a different level.
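
To illustrate the mechanism with a toy example (a simplified sketch, not the actual library code): a retry wrapper can only react to exceptions raised while the wrapped call itself is executing, not to errors that only surface later from the object the call returned.

import time

def retry_with_backoff(func, max_attempts=5):
    # Toy retry wrapper: it only retries if calling func() itself raises.
    delay = 1.0
    for _ in range(max_attempts):
        try:
            return func()
        except Exception:
            time.sleep(delay)
            delay *= 2
    raise RuntimeError("gave up")

def open_stream():
    # Like the wrapped streaming pull call, this does not raise here; it
    # merely returns an iterator-like object ...
    def responses():
        # ... and the connection error only surfaces when a consumer thread
        # later calls next() on it.
        raise RuntimeError("503 Connect Failed")
        yield  # never reached; only makes this a generator
    return responses()

stream = retry_with_backoff(open_stream)  # "succeeds", so no backoff happens
next(stream)                              # the error pops up here instead

In the real code, the role of open_stream() is played by the wrapped streaming_pull call, and the role of next() by the consumer thread reading from the StreamingResponseIterator.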

@sduskis
Contributor

sduskis commented May 22, 2019

It sounds like there needs to be a backoff of some sort. This looks very much like this nodejs-pubsub issue.

@busunkim96, this sounds like something we should discuss with you.
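
To make the intent concrete, one possible shape of such a backoff is a throttle that enforces a growing minimum delay between successive re-open attempts (a sketch only, not the eventual fix):

import time

class ReopenThrottle(object):
    # Toy throttle: each wait() call sleeps long enough that re-open
    # attempts are at least `delay` seconds apart, with the delay growing
    # exponentially up to a cap.

    def __init__(self, initial=1.0, maximum=60.0, multiplier=2.0):
        self._initial = initial
        self._maximum = maximum
        self._multiplier = multiplier
        self._delay = initial
        self._last_attempt = 0.0

    def wait(self):
        remaining = self._delay - (time.time() - self._last_attempt)
        if remaining > 0:
            time.sleep(remaining)
        self._last_attempt = time.time()
        self._delay = min(self._delay * self._multiplier, self._maximum)

    def reset(self):
        # Call this once a re-opened stream has stayed healthy for a while.
        self._delay = self._initial

In such a scheme, every place that re-opens the stream (the retryable call and the gRPC callback alike) would call wait() first, so even a flood of 503s could not turn into a busy loop.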

@sduskis
Contributor

sduskis commented May 22, 2019

@plamut that makes sense. @busunkim96 should be involved, or at least aware, with any api_core related changes. @crwilcox mentioned offline that ResumableBidiRpc is used by Firestore as well, so I'm hereby pinging @BenWhitehead, since this will also likely affect Firestore.

@Dan4London

Thanks for your feedback, @plamut. #8036 describes precisely the problem I am experiencing. You can replicate the behaviour by re-executing the for loop before the previous gRPC publish() request has timed out.

Do you have any suggestions when a fix for this would be available?

Thanks!

@plamut
Contributor

plamut commented May 23, 2019

@Dan4London Unfortunately not (yet...), because it seems that the bug will have to be fixed in one of the Pub/Sub client's dependencies, and that will have to be coordinated with other teams that could be affected by the change. The bug is high on the priority list, though.

Edit: Oh, you were probably asking about the publisher issue?
Depends on the priority assigned to it, @sduskis could probably tell more.

Edit 2: If nothing else comes across, I can probably have a look at it tomorrow.

Edit 3: This issue has been prioritized over the publisher issue; I will look at the latter after this.

@yoshi-automation added and removed the "🚨 This issue needs some love." label on Jun 4, 2019
@plamut added and removed the "api: pubsub" label on Jun 5, 2019
@yoshi-automation added and removed the "🚨 This issue needs some love." label on Jun 11, 2019