PubSub: deadlock when cancelling a subscribe call #8616

Closed
arthurdarcet opened this issue Jul 8, 2019 · 10 comments · Fixed by #9337
Assignees
Labels
api: pubsub Issues related to the Pub/Sub API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@arthurdarcet

Environment details

  1. OSX
  2. Python 3.7.3 in a venv
  3. google-cloud-pubsub 0.42.1

Steps to reproduce

Calling cancel() very soon after the subscriber.subscribe() call leads to a deadlock. See the example code below, which stops after somewhere between 0 and 4 loops on my computer.

Code example

import sys

from google.cloud import pubsub
from google.api_core.exceptions import NotFound

subscriber = pubsub.SubscriberClient()
publisher = pubsub.PublisherClient()

sub = subscriber.subscription_path('mocked-project', 'sub-name')
topic = publisher.topic_path('mocked-project', 'topic-name')

# Start from a clean slate: drop any leftover subscription and topic.
try:
    subscriber.delete_subscription(sub)
except NotFound:
    pass

try:
    publisher.delete_topic(topic)
except NotFound:
    pass

publisher.create_topic(topic)
subscriber.create_subscription(sub, topic)

i = 0
while True:
    # Cancelling immediately after subscribing eventually deadlocks.
    future = subscriber.subscribe(sub, lambda msg: None)
    print('Run', i, 'cancelling')
    future.cancel()
    print('Cancelled')
    print('--')
    sys.stdout.flush()
    i += 1

Stack trace

^CTraceback (most recent call last):
  File "a.py", line 25, in <module>
    future.cancel()
  File "/Users/arthur/dev/.venvs/Users--arthur--dev--wyl/lib/python3.7/site-packages/google/cloud/pubsub_v1/subscriber/futures.py", line 46, in cancel
    return self._manager.close()
  File "/Users/arthur/dev/.venvs/Users--arthur--dev--wyl/lib/python3.7/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/streaming_pull_manager.py", line 425, in close
    self._consumer.stop()
  File "/Users/arthur/dev/.venvs/Users--arthur--dev--wyl/lib/python3.7/site-packages/google/api_core/bidi.py", line 686, in stop
    self._thread.join()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 1032, in join
    self._wait_for_tstate_lock()
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 1048, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
@tseaver tseaver added api: clouddebugger Issues related to the Cloud Debugger API. api: pubsub Issues related to the Pub/Sub API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed api: clouddebugger Issues related to the Cloud Debugger API. labels Jul 8, 2019
@tseaver
Contributor

tseaver commented Jul 8, 2019

@plamut Can you take a look?

@plamut
Contributor

plamut commented Jul 8, 2019

I can check, but only in a few days' time at the earliest.

@tseaver
Contributor

tseaver commented Jul 8, 2019

@plamut No worries, seems like an odd edge case.

@arthurdarcet
Author

As far as I can see:

  • the RPC terminates with status UNKNOWN:

<_Rendezvous of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = ""
	debug_error_string = "{"created":"@1562672704.766455000","description":"Error received from peer ipv6:[::1]:8085","file":"src/core/lib/surface/call.cc","file_line":1052,"grpc_message":"","grpc_status":2}"

  • the emulator I'm using is not showing anything in its logs

This also locks up against the production Pub/Sub, not only the emulator.

@plamut
Contributor

plamut commented Jul 16, 2019

I can confirm this issue is reproducible.

Digging through the code, it appears that it sometimes gets stuck while trying to get the next response from the StreamingResponseIterator - line 79:

DEBUG    [2019-07-16 18:19:37,241] Thread-ConsumeBidirectionalStream [google.api_core.grpc_helpers] [grpc_helpers.py:82][next]  StreamingResponseIterator next <_Rendezvous object of in-flight RPC> 

When this occurs, I see several calls to oauth2.googleapis.com from multiple anonymous threads (from inside gRPC?):

DEBUG    [2019-07-16 18:26:23,936] Thread-21                         [urllib3.connectionpool] [connectionpool.py:393][_make_request] https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
DEBUG    [2019-07-16 18:26:23,939] Thread-27                         [urllib3.connectionpool] [connectionpool.py:393][_make_request] https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
DEBUG    [2019-07-16 18:26:23,943] Thread-24                         [urllib3.connectionpool] [connectionpool.py:393][_make_request] https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
DEBUG    [2019-07-16 18:26:23,964] Thread-30                         [urllib3.connectionpool] [connectionpool.py:393][_make_request] https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
DEBUG    [2019-07-16 18:26:23,989] Thread-33                         [urllib3.connectionpool] [connectionpool.py:393][_make_request] https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
DEBUG    [2019-07-16 18:26:23,998] Thread-20                         [urllib3.connectionpool] [connectionpool.py:393][_make_request] https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
DEBUG    [2019-07-16 18:26:24,000] Thread-36                         [urllib3.connectionpool] [connectionpool.py:393][_make_request] https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
DEBUG    [2019-07-16 18:26:24,019] Thread-39                         [urllib3.connectionpool] [connectionpool.py:393][_make_request] https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None

Could this be a bug in gRPC itself? In what cases, if any, does the next() call block for this long?


Update:
Leaving the script running for quite a while, I noticed that the next() call returned after 10 minutes:

DEBUG    [2019-07-16 19:12:39,582] Thread-ConsumeBidirectionalStream [google.api_core.bidi] [bidi.py:500][_recoverable] Call to retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7f91a813cd68>> caused 504 Deadline Exceeded.
INFO     [2019-07-16 19:12:39,582] Thread-ConsumeBidirectionalStream [google.cloud.pubsub_v1.subscriber._protocol.streaming_pull_manager] [streaming_pull_manager.py:562][_should_recover] Observed recoverable stream error 504 Deadline Exceeded
DEBUG    [2019-07-16 19:12:39,582] Thread-ConsumeBidirectionalStream [google.api_core.bidi] [bidi.py:508][_recoverable] Re-opening stream from retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7f91a813cd68>>.

The background stream is re-opened, and the same thing happens again when calling next() - the call blocks for 10 minutes and ends up with the same 504 error being raised.

The main thread does not see that, though, and remains blocked at the future.cancel() call. FWIW, the issue does not seem to occur if one injects a short delay (say 100 ms) between subscribe() and cancel() calls.

@tseaver
Contributor

tseaver commented Jul 16, 2019

I wonder if this issue is related to #7826. In that issue, the CANCELLED response triggers an unwanted traceback in the log. My PR #8650 handles CANCELLED (for Firestore) by shutting the BiDi stream down cleanly without raising an exception. It would be pretty trivial to extend the same semantics for Pub/Sub (in fact, I already have it done in a local git stash).
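
For reference, the predicate such a change would hinge on might look roughly like this; a minimal sketch, assuming the should_terminate hook of ResumableBidiRpc receives the stream's terminating exception (the helper name and exact checks are illustrative, not the actual Pub/Sub patch):

import grpc
from google.api_core import exceptions


def _should_terminate(exception):
    """Return True when the stream ended because we cancelled it ourselves.

    Passed as the ``should_terminate`` predicate of ``ResumableBidiRpc`` so
    that a CANCELLED status shuts the BiDi stream down cleanly instead of
    being treated as an error to recover from (or re-raise in the consumer
    thread).
    """
    if isinstance(exception, exceptions.Cancelled):
        return True
    return (
        isinstance(exception, grpc.Call)
        and exception.code() == grpc.StatusCode.CANCELLED
    )

The hookup itself would presumably live where StreamingPullManager constructs its ResumableBidiRpc.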

@plamut
Contributor

plamut commented Jul 17, 2019

Maybe; I can try that once #8650 is merged.

@tseaver
Contributor

tseaver commented Jul 17, 2019

@plamut I merged it today.

@plamut
Contributor

plamut commented Jul 18, 2019

I tried leveraging the new should_terminate parameter of ResumableBidiRpc, but no luck. The code gets stuck (for 10 minutes) when requesting the next item from a _Rendezvous instance.

I also noticed that this does not happen if a tiny delay is inserted between the subscribe() and cancel() calls:

future = subscriber.subscribe(sub, lambda msg: None)
logger.info('\x1b[1mRun %d cancelling \x1b[0m', i)
time.sleep(0.01)  # <--- This seems to avoid the problem
future.cancel()

With that, I had the test script successfully running for more than 7800 iterations, cancelling the subscriptions as expected, after which I terminated it. Of course, this is not a proper fix.

It seems that cancelling the stream too early causes something weird to happen in the grpc library (in channel._Rendezvous), or in its interaction with the ConsumeBidirectionalStream thread.

@tseaver
Contributor

tseaver commented Sep 27, 2019

This deadlock happens in BackgroundConsumer.stop:

def stop(self):
    """Stop consuming the stream and shutdown the background thread."""
    with self._operational_lock:
        self._bidi_rpc.close()

        if self._thread is not None:
            # Resume the thread to wake it up in case it is sleeping.
            self.resume()
            self._thread.join()

        self._thread = None

The blocking call to self._thread.join() presumes that the thread (which is daemonized) will eventually terminate. The thread's loop tests self._bidi_rpc.is_active, which should (in theory) be false after the call to self._bidi_rpc.close(). However, BidiRpc.is_active only checks that the object's call is set and, if so, returns the call's own is_active(). The call's cancel() method has been called, but the call apparently never actually terminates, so is_active remains true and the thread never exits.
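
The fix that eventually landed (#9337) suggests the general shape of a remedy: bound the join instead of waiting forever on a daemon thread. A minimal sketch of that idea for BackgroundConsumer.stop, with an assumed timeout constant and warning message rather than the exact patch:

import logging

_LOGGER = logging.getLogger(__name__)

# Illustrative bound; the real patch picks its own value.
_JOIN_TIMEOUT_SECONDS = 10.0


def stop(self):
    """Stop consuming the stream and shut down the background thread."""
    with self._operational_lock:
        self._bidi_rpc.close()

        if self._thread is not None:
            # Resume the thread to wake it up in case it is sleeping.
            self.resume()
            # Bound the wait: if the cancelled call never reports itself
            # inactive, give up after a while instead of blocking forever.
            # The thread is a daemon, so it cannot keep the process alive.
            self._thread.join(timeout=_JOIN_TIMEOUT_SECONDS)
            if self._thread.is_alive():
                _LOGGER.warning("Background thread did not exit.")

        self._thread = None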

tseaver added a commit that referenced this issue Oct 1, 2019
Toward clean shutdown of the subscriber's background thread.

See: #8616.
tseaver added a commit that referenced this issue Oct 2, 2019
Avoid blocking for ill-behaved daemon threads during BiDi shutdown.

Closes #8616, #9008.
busunkim96 pushed a commit to busunkim96/google-cloud-python that referenced this issue Oct 3, 2019
…s#9337)

Avoid blocking for ill-behaved daemon threads during BiDi shutdown.

Closes googleapis#8616, googleapis#9008.