
Deadlock in streaming_pull_manager and ResumableBidiRpc after receiving 504 deadline exceeded error #74

Closed
sweatybridge opened this issue Apr 16, 2020 · 10 comments · Fixed by #76
Labels
api: pubsub Issues related to the googleapis/python-pubsub API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@sweatybridge

sweatybridge commented Apr 16, 2020

Environment details

  • OS type and version: alpine 3.11
  • Python version: 3.7.7
  • pip version: 20.0.2
  • google-cloud-pubsub version: 1.4.2

Steps to reproduce

  1. Start subscriber client
  2. Wait for 10 minutes (server should return 504 Deadline Exceeded error)
  3. Messages no longer received and heartbeat no longer sent

Here are the debug logs just after 10 mins.

DEBUG:google.cloud.pubsub_v1.subscriber._protocol.leaser:The current deadline value is 10 seconds.
DEBUG:google.cloud.pubsub_v1.subscriber._protocol.leaser:Snoozing lease management for 7.152175 seconds.
INFO:google.cloud.pubsub_v1.subscriber._protocol.streaming_pull_manager:Observed non-terminating stream error 504 Deadline Exceeded
INFO:google.cloud.pubsub_v1.subscriber._protocol.streaming_pull_manager:Observed recoverable stream error 504 Deadline Exceeded
DEBUG:google.api_core.bidi:Re-opening stream from gRPC callback.
DEBUG:google.api_core.bidi:Empty queue and inactive call, exiting request generator.
DEBUG:google.cloud.pubsub_v1.subscriber._protocol.leaser:The current deadline value is 10 seconds.
DEBUG:google.cloud.pubsub_v1.subscriber._protocol.leaser:Snoozing lease management for 4.379558 seconds.
DEBUG:google.cloud.pubsub_v1.subscriber._protocol.leaser:The current deadline value is 10 seconds.
DEBUG:google.cloud.pubsub_v1.subscriber._protocol.leaser:Snoozing lease management for 7.720803 seconds.
DEBUG:google.cloud.pubsub_v1.subscriber._protocol.leaser:The current deadline value is 10 seconds.
DEBUG:google.cloud.pubsub_v1.subscriber._protocol.leaser:Snoozing lease management for 3.224825 seconds.
DEBUG:google.cloud.pubsub_v1.subscriber._protocol.leaser:The current deadline value is 10 seconds.
DEBUG:google.cloud.pubsub_v1.subscriber._protocol.leaser:Snoozing lease management for 0.857948 seconds.

Looking at the logs, I expected to see `_LOGGER.debug("Call to retryable %r caused %s.", method, exc)`, since it should be printed on error (https://github.com/googleapis/python-api-core/blob/5e5559202891f7e5b6c22c2cbc549e1ec26eb857/google/api_core/bidi.py#L508). However, the message was never printed, which suggests the thread was stuck acquiring `self._operational_lock`.

I've sampled the top functions running on all threads after getting into the deadlock state (attached as a stack trace below). The heartbeater's background thread seems to be frequently holding the lock while it performs the is_active check https://github.com/googleapis/python-pubsub/blob/master/google/cloud/pubsub_v1/subscriber/_protocol/streaming_pull_manager.py#L425. I'm not sure why it's not being released.
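The suspected failure mode can be shown in isolation. The sketch below is not the actual bidi.py code, just an illustration of the pattern described above: a plain (non-reentrant) `threading.Lock` that one thread holds while another path needs it to make progress. The names `operational_lock` and `try_reopen_stream` are stand-ins for illustration only.

```python
import threading

# Simplified sketch of the suspected deadlock: a non-reentrant
# threading.Lock, so any path that tries to take it while another
# thread holds it blocks indefinitely.
operational_lock = threading.Lock()

def try_reopen_stream():
    # Stands in for the stream re-open path, which needs the lock
    # before it can recover from a 504. Non-blocking acquire is used
    # here only so the demo cannot hang.
    acquired = operational_lock.acquire(blocking=False)
    if acquired:
        operational_lock.release()
    return acquired

# While a heartbeater-style thread holds the lock (e.g. during an
# is_active check), the re-open path cannot proceed...
operational_lock.acquire()
print(try_reopen_stream())  # False: would block forever with blocking=True

# ...and only succeeds once the lock is released.
operational_lock.release()
print(try_reopen_stream())  # True
```

If the holder never releases the lock (as the sampled stack trace suggests), the blocking acquire in the real code never returns, which matches the observed hang.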

Code example

import logging

from google.cloud.pubsub_v1 import SubscriberClient


def main():
    logging.basicConfig(level=logging.DEBUG)

    client = SubscriberClient()
    sub = client.subscription_path("project", "subscription")

    def callback(message):
        print(message)
        message.ack()

    future = client.subscribe(sub, callback)

    try:
        future.result()
    except KeyboardInterrupt:
        future.cancel()


if __name__ == "__main__":
    main()

Stack trace

Collecting samples from 'python main.py' (python v3.7.7)
Total Samples 2000
GIL: 0.00%, Active: 12.00%, Threads: 17

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
 10.00%  10.00%    1.94s     1.94s   _worker (concurrent/futures/thread.py:78)
  2.00%   2.00%   0.640s    0.640s   _recoverable (api_core/bidi.py:508)
  0.00%   0.00%   0.330s    0.330s   is_active (api_core/bidi.py:578)
  0.00%   0.00%   0.000s    0.330s   heartbeat (pubsub_v1/subscriber/_protocol/streaming_pull_manager.py:425)
  0.00%  12.00%   0.000s     2.91s   _bootstrap (threading.py:890)
  0.00%  12.00%   0.000s     2.91s   _bootstrap_inner (threading.py:926)
  0.00%   2.00%   0.000s    0.640s   recv (api_core/bidi.py:562)
  0.00%  12.00%   0.000s     2.91s   run (threading.py:870)
  0.00%   0.00%   0.000s    0.330s   heartbeat (pubsub_v1/subscriber/_protocol/heartbeater.py:40)
  0.00%   2.00%   0.000s    0.640s   _thread_main (api_core/bidi.py:655)
@product-auto-label product-auto-label bot added the api: pubsub Issues related to the googleapis/python-pubsub API. label Apr 16, 2020
@sweatybridge sweatybridge changed the title Deadlock in streaming_pull_manager Heartbeater and ResumableBidiRpc after receiving 504 deadline exceeded error Deadlock in streaming_pull_manager and ResumableBidiRpc after receiving 504 deadline exceeded error Apr 16, 2020
@plamut plamut added the type: question Request for information or clarification. Not an issue. label Apr 16, 2020
@sweatybridge
Author

By the way, this issue can be worked around by downgrading google-api-core to 1.16.0.

@plamut
Contributor

plamut commented Apr 16, 2020

@sweatybridge I am trying to see if I can reproduce it. The API core workaround is an interesting data point: the latest version (1.17.0) was released just yesterday, so this might be a regression in it.

@artefactop

Hi, we had a similar problem yesterday in our production environment, though I don't know whether it is fully related.

Environment details

  • OS type and version: debian slim-buster
  • Python version: 3.6
  • pip version: 20.0.2
  • google-cloud-pubsub version: 1.1.0

Steps to reproduce

  • Start subscriber client
  • Wait for 10 minutes (server should return 504 Deadline Exceeded error)
  • Messages no longer received and heartbeat no longer sent
  • After some hours of investigating what the problem could be, we realized that bidi does not re-establish the stream. Here are the logs just after 10 minutes.
[2020-04-15 20:08:59,485] [MainProcess (1801)] [ERROR] [_channel] Exception iterating requests!
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 195, in consume_request_iterator
    request = next(request_iterator)
ValueError: generator already executing
[2020-04-15 20:17:22,283] [MainProcess (1801)] [INFO] [streaming_pull_manager] Observed non-terminating stream error 504 Deadline Exceeded
[2020-04-15 20:17:22,285] [MainProcess (1801)] [INFO] [streaming_pull_manager] Observed recoverable stream error 504 Deadline Exceeded

We never saw a message like

[2020-04-16 07:37:45,286] [MainProcess (9)] [INFO] [bidi] Re-established stream

We solved it by pinning grpcio to version 1.27.2.
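For reference, `ValueError: generator already executing` is what CPython raises whenever two threads advance the same generator concurrently, which is what the traceback above shows happening to grpc's request iterator. A minimal sketch (unrelated to grpc internals; all names here are hypothetical) that triggers it deterministically:

```python
import threading

entered = threading.Event()
release = threading.Event()

def request_generator():
    # The first next() enters this frame and blocks, so the generator
    # stays "executing" from the interpreter's point of view.
    entered.set()
    release.wait()
    yield None

gen = request_generator()

# A consumer thread (like grpc's consume_request_iterator) is parked
# inside the generator frame...
consumer = threading.Thread(target=lambda: next(gen, None))
consumer.start()
entered.wait()

# ...so a second concurrent next() from another thread fails.
try:
    next(gen)
    error = None
except ValueError as exc:
    error = str(exc)

release.set()
consumer.join()
print(error)  # generator already executing
```

In the real client this would happen if the stream teardown/re-open path drives the same request generator from two threads at once.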

@plamut
Contributor

plamut commented Apr 16, 2020

I can confirm the reported behavior using google-api-core 1.17.0 and simulating a 504 Deadline Exceeded error (by disabling WiFi for a minute or so).

After switching the WiFi back on, the only thing that's left in the DEBUG logs is the leaser thread activity:

DEBUG    [2020-04-16 12:02:20,458] Thread-LeaseMaintainer            [google.cloud.pubsub_v1.subscriber._protocol.leaser] [leaser.py:186][maintain_leases] Snoozing lease management for 4.003695 seconds.
DEBUG    [2020-04-16 12:02:24,462] Thread-LeaseMaintainer            [google.cloud.pubsub_v1.subscriber._protocol.leaser] [leaser.py:134][maintain_leases] The current deadline value is 10 seconds.
DEBUG    [2020-04-16 12:02:24,463] Thread-LeaseMaintainer            [google.cloud.pubsub_v1.subscriber._protocol.leaser] [leaser.py:186][maintain_leases] Snoozing lease management for 4.865866 seconds.
DEBUG    [2020-04-16 12:02:29,330] Thread-LeaseMaintainer            [google.cloud.pubsub_v1.subscriber._protocol.leaser] [leaser.py:134][maintain_leases] The current deadline value is 10 seconds.
DEBUG    [2020-04-16 12:02:29,331] Thread-LeaseMaintainer            [google.cloud.pubsub_v1.subscriber._protocol.leaser] [leaser.py:186][maintain_leases] Snoozing lease management for 2.700914 seconds.
DEBUG    [2020-04-16 12:02:32,033] Thread-LeaseMaintainer            [google.cloud.pubsub_v1.subscriber._protocol.leaser] [leaser.py:134][maintain_leases] The current deadline value is 10 seconds.
DEBUG    [2020-04-16 12:02:32,034] Thread-LeaseMaintainer            [google.cloud.pubsub_v1.subscriber._protocol.leaser] [leaser.py:186][maintain_leases] Snoozing lease management for 2.199593 seconds.
DEBUG    [2020-04-16 12:02:34,235] Thread-LeaseMaintainer            [google.cloud.pubsub_v1.subscriber._protocol.leaser] [leaser.py:134][maintain_leases] The current deadline value is 10 seconds.
...

On the other hand, google-api-core==1.16.0 works fine: the stream gets re-established and messages are received again. I will classify this as P2, as an easy workaround exists, and we might also temporarily pin API core to < 1.17.0 until this gets resolved.


P.S.: I used the following log config for somewhat better readability (IMHO):

log_format = (
    "%(levelname)-8s [%(asctime)s] %(threadName)-33s "
    "[%(name)s] [%(filename)s:%(lineno)d][%(funcName)s] %(message)s"
)

logging.basicConfig(level=logging.DEBUG, format=log_format)

@plamut plamut added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed type: question Request for information or clarification. Not an issue. labels Apr 16, 2020
@plamut
Contributor

plamut commented Apr 16, 2020

Hi, we had a similar problem yesterday in our production env. But I don't know if it is fully related.
...
We solved it by pinning grpcio to version 1.27.2.

@artefactop I tried reproducing the issue @sweatybridge originally reported with grpcio==1.28.1 and google-api-core==1.16.0, and the stream was re-established just fine. The grpcio version did not seem to have an effect there; the problem is in api-core.

If you actually solved your problem just by pinning a different grpcio version, it appears to have been a different issue, and we might want to have a look at that, too. How often was the issue occurring in your production environment, and can it be consistently reproduced? Thanks!

@artefactop

artefactop commented Apr 16, 2020

@plamut yes, it could be reproduced every time a subscriber started. It went on for 12 hours before we realized the problem and pinned the grpcio version. The google-api-core version we had when the problem occurred was 1.17.0, so we pinned both google-api-core==1.16.0 and grpcio==1.27.2 just in case, but in the tests we did, pinning only grpcio==1.27.2 was enough to fix the issue.

I just found more logs related to this problem:

2020-04-15 14:37:06.705 CEST [2020-04-15 12:37:06,704] [MainProcess (1)] [INFO] [subscriber] Subscribe to: Topic
2020-04-15 14:37:06.712 CEST [2020-04-15 12:37:06,712] [MainProcess (1)] [INFO] [pubsub_consumers] Starting consumer, topic: Topic
2020-04-15 14:38:32.485 CEST [2020-04-15 12:38:32,484] [MainProcess (1)] [ERROR] [_channel] Exception iterating requests!
2020-04-15 14:38:32.485 CEST Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 195, in consume_request_iterator
    request = next(request_iterator)
ValueError: generator already executing
2020-04-15 14:38:32.499 CEST [2020-04-15 12:38:32,499] [MainProcess (1)] [INFO] [bidi] Thread-ConsumeBidirectionalStream exiting
2020-04-15 14:38:32.622 CEST [2020-04-15 12:38:32,622] [MainProcess (1)] [INFO] [leaser] Thread-LeaseMaintainer exiting.
2020-04-15 14:38:37.633 CEST [2020-04-15 12:38:37,632] [MainProcess (1)] [INFO] [heartbeater] Thread-Heartbeater exiting.
2020-04-15 14:47:01.892 CEST [2020-04-15 12:47:01,890] [MainProcess (1)] [INFO] [streaming_pull_manager] Observed non-terminating stream error 504 Deadline Exceeded
2020-04-15 14:47:01.892 CEST [2020-04-15 12:47:01,892] [MainProcess (1)] [INFO] [streaming_pull_manager] Observed recoverable stream error 504 Deadline Exceeded

Here the subscriber stops getting messages.

@zunger-humu

I'm unclear on the status of this bug. It's been marked as closed by pinning a dependency, but:

(1) Does that fix this for clients in general? What happens for clients that have something else pulling in a fresher version of api-core?
(2) Has this fix actually been released, or is the current version on pip still broken? If so, then this is still a P1 open issue, as people are broken in production right now because of it. (I say this, speaking as someone who appears to be broken in production right now because of it)

@plamut
Contributor

plamut commented Apr 20, 2020

@zunger-humu Yes, the fix was released a few days ago (version 1.4.3).

Even if some other library would otherwise pull in a more recent version of api-core, pip will still install 1.16.0 (as the pubsub client now imposes that restriction), unless that other library explicitly pins api-core to 1.17.0. In that case a version conflict occurs and one of the libraries will not be installed.
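The resolution described above amounts to an upper-bound check on candidate versions. A small sketch of the idea (the exact specifier in the pubsub client's setup.py is an assumption here; check the 1.4.3 release for the real constraint):

```python
def parse(version):
    # Naive version parser, adequate for plain "X.Y.Z" strings.
    return tuple(int(part) for part in version.split("."))

# Assumed upper bound imposed by google-cloud-pubsub 1.4.3 on
# google-api-core; the real specifier lives in the release's setup.py.
API_CORE_UPPER_BOUND = parse("1.17.0")

def acceptable(candidate):
    # A candidate satisfies the constraint only if it is strictly
    # below the upper bound.
    return parse(candidate) < API_CORE_UPPER_BOUND

print(acceptable("1.17.0"))  # False: rejected, forcing a downgrade
print(acceptable("1.16.0"))  # True: selected instead
```

If another installed package carries a hard `==1.17.0` pin, no candidate satisfies both specifiers at once, which is the conflict case mentioned above.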

I cannot say for certain how other client libraries will behave with api-core 1.17.0, though, as I am primarily familiar with BigQuery and PubSub.

You mention that you have a breakage in production - is that even after trying to install the newest PubSub client?

@zunger-humu

zunger-humu commented Apr 20, 2020 via email

@plamut
Contributor

plamut commented Apr 21, 2020

@zunger-humu That's indeed weird, hope you get to the bottom of it.

Just in case I tried installing PubSub 1.4.2 and api-core 1.17.0 locally, and then upgraded PubSub to 1.4.3. The install succeeded and the api-core was correctly downgraded to 1.16.0.

(I used pip 20.0.2, if it matters.)
