Deadlock during close in Python client #20026

Open
jpeel opened this issue Aug 21, 2019 · 17 comments

jpeel commented Aug 21, 2019

What version of gRPC and what language are you using?

1.17.1 Python

What operating system (Linux, Windows,...) and version?

Linux 3.19.3.28478

What runtime / compiler are you using (e.g. python version or version of gcc)

Python 3.5.3

What did you do?

If possible, provide a recipe for reproducing the error. Try being specific and include code snippets if helpful.

I haven't found a good way to consistently reproduce the issue. Originally I thought unhandled exceptions were closing the channel spin thread, as in #19910, but that is not what is going on here because the channel spin thread is still running.

In our setup, a controller device runs the client and a peripheral device runs the gRPC server. A channel is opened on the client side and a function subscribes to connectivity changes. When connectivity issues occur (Wi-Fi dropouts, for instance), the channel state goes from READY to IDLE and we close the channel so that we can run service discovery again. However, the channel deadlocks in the close method, which is invoked from the main Python thread. Unfortunately, I don't know of any exceptions happening on threads other than the main Python thread, but I do have a stack trace.
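
For reference, a minimal sketch of the subscribe-and-close pattern described above (the target is a placeholder; in our application unsubscribe and close actually happen on the main Python thread):

  import grpc

  channel = grpc.insecure_channel('peripheral-device:50051')  # placeholder target

  def on_state_change(state):
      # When the channel drops from READY back to IDLE, tear it down so
      # service discovery can run again.
      if state == grpc.ChannelConnectivity.IDLE:
          channel.unsubscribe(on_state_change)
          channel.close()  # this is the call that deadlocks for us

  channel.subscribe(on_state_change, try_to_connect=True)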

We have a long-running unary_stream call being iterated over in a separate thread. When it fails, we retry a few times before giving up. There are also two unary_unary methods: one is used for sending requests and the other for sending events. All requests/events on these two methods are sent using the future method with a timeout of either 1 second or 5 seconds.
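
Roughly, those calls look like the following sketch (stub method and callback names are placeholders):

  def send_request(stub, request, on_done):
      # Unary-unary call issued via .future() with a 1 second deadline.
      call = stub.SendRequest.future(request, timeout=1.0)
      call.add_done_callback(on_done)
      return call

  def send_event(stub, event, on_done):
      # Same pattern for events, with a 5 second deadline.
      call = stub.SendEvent.future(event, timeout=5.0)
      call.add_done_callback(on_done)
      return call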

We have the channel keepalive timeout set to 90 seconds and the keepalive time set to 20 seconds. Keepalive is permitted even without calls, the minimum time between pings is 8 seconds, and max_pings_without_data is set to 0. We are seeing a keepalive timeout happen around the time the channel connectivity goes to IDLE and the main thread tries to close the channel. Several requests fail around this time with the keepalive timeout exception.
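
These settings roughly correspond to channel options like the following (values in milliseconds; the exact argument strings may vary slightly between gRPC versions, and the target is a placeholder):

  import grpc

  options = [
      ('grpc.keepalive_time_ms', 20000),               # keepalive time: 20 s
      ('grpc.keepalive_timeout_ms', 90000),            # keepalive timeout: 90 s
      ('grpc.keepalive_permit_without_calls', 1),      # ping even with no active calls
      ('grpc.http2.min_time_between_pings_ms', 8000),  # min 8 s between pings
      ('grpc.http2.max_pings_without_data', 0),        # unlimited pings without data
  ]
  channel = grpc.insecure_channel('peripheral-device:50051', options=options)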

The stack trace is very interesting. The close method is definitely waiting on the channel condition variable so it is either waiting for segregated calls, integrated calls, or connectivity polling to finish, but due to Cython I don't know which.

The unary_stream thread is waiting on the channel condition variable as well. I'm guessing that the connection disruption was somewhat temporary in this case and that this unary_stream was able to be recreated.

There is also a thread polling connectivity which is interesting because we call unsubscribe before calling close on the channel. The polling connectivity thread is in _poll_connectivity in _channel.py calling channel.watch_connectivity_state. Unfortunately the Cython code is preventing me from getting more insight here.

The last relevant thread is the channel spin thread that handles responses for unary_unary streams. It is still running and calling state.channel.next_call_event. Again, Cython keeps me from knowing more. The fact that the channel spin thread is still going means that there is still at least one open request.

One other thing to note is that we have disabled grpc forking support because it was causing us deadlock problems due to other parts of the Python app doing some forking.
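
For reference, fork support can be disabled with an environment variable that must be in place before grpc is imported; a sketch (other mechanisms may exist):

  import os

  # Must be set before grpc is imported/initialized, e.g. at the very top
  # of the entry point or in the process environment.
  os.environ['GRPC_ENABLE_FORK_SUPPORT'] = 'false'

  import grpc  # noqa: E402  -- deliberately imported after setting the variable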

To me this is clearly a problem that is either in the channel_spin code or the watch_connectivity_state code. If the problem is in watch_connectivity_state, I suspect that it is with grpc_channel_watch_connectivity_state or _latent_event not returning. If the problem is in _next_call_event, then it is probably held up in _latent_event or the on_success callback which calls _process_integrated_call_tag.

It has occurred to me that both of these functions use _latent_event with a deadline of None.

What did you expect to see?

The channel close would not deadlock.

What did you see instead?

The channel close deadlocked.

Make sure you include information that can help us debug (full error message, exception listing, stack trace, logs).

See TROUBLESHOOTING.md for how to diagnose problems better.

Anything else we should know about your project / environment?

jpeel commented Aug 21, 2019

By the way, we can't get this to happen every time. We've struggled to reproduce it consistently, but it happens fairly often given how many systems we have running this code.

jpeel commented Aug 21, 2019

Also, we have messages in our logs for when requests time out or error out in any way as well as when they are sent. It appears that there was one request open when the watchdog timeout happened and its callback was called. I see the caught exception indicating that it failed due to the watchdog timeout. This is making me lean more towards the connectivity code being the issue.

gnossen commented Aug 21, 2019

Any chance you can give us the stack trace(s) verbatim with any sensitive portions censored?

jpeel commented Aug 22, 2019

Thread: 1587803232 File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
self._bootstrap_inner()

File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()

File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)

File "device.py", line 574, in _receive_events

File "/usr/lib/python3.5/site-packages/grpc/_channel.py", line 363, in next
return self._next()

File "/usr/lib/python3.5/site-packages/grpc/_channel.py", line 348, in _next
self._state.condition.wait()

File "/usr/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()

Thread: 1671689312 File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
self._bootstrap_inner()

File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()

File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)

File "/usr/lib/python3.5/site-packages/grpc/_channel.py", line 772, in channel_spin
event = state.channel.next_call_event()

Thread: 1946305632 File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
self._bootstrap_inner()

File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()

File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)

File "threaded_call.py", line 40, in _make_call_thread

File "weak_callback_ref.py", line 32, in call

File "other_device.py", line 211, in _other_device_node_thread

File "/usr/lib/python3.5/queue.py", line 164, in get
self.not_empty.wait()

File "/usr/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()

Thread: 1906570336 File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
self._bootstrap_inner()

File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()

File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)

File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
work_item = work_queue.get(block=True)

File "/usr/lib/python3.5/queue.py", line 164, in get
self.not_empty.wait()

File "/usr/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()

Thread: 1722020960 File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
self._bootstrap_inner()

File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()

File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)

File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
work_item = work_queue.get(block=True)

File "/usr/lib/python3.5/queue.py", line 164, in get
self.not_empty.wait()

File "/usr/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()

Thread: 1612969056 File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
self._bootstrap_inner()

File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()

File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)

File "device.py", line 574, in _receive_events

File "/usr/lib/python3.5/site-packages/grpc/_channel.py", line 363, in next
return self._next()

File "/usr/lib/python3.5/site-packages/grpc/_channel.py", line 348, in _next
self._state.condition.wait()

File "/usr/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()

Thread: 1738798176 File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
self._bootstrap_inner()

File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()

File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)

File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
work_item = work_queue.get(block=True)

File "/usr/lib/python3.5/queue.py", line 164, in get
self.not_empty.wait()

File "/usr/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()

Thread: 1713632352 File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
self._bootstrap_inner()

File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()

File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)

File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
work_item = work_queue.get(block=True)

File "/usr/lib/python3.5/queue.py", line 164, in get
self.not_empty.wait()

File "/usr/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()

Thread: 1755575392 File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
self._bootstrap_inner()

File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()

File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)

File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
work_item = work_queue.get(block=True)

File "/usr/lib/python3.5/queue.py", line 164, in get
self.not_empty.wait()

File "/usr/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()

Thread: 1780741216 File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
self._bootstrap_inner()

File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()

File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)

File "threaded_call.py", line 40, in _make_call_thread

File "weak_callback_ref.py", line 32, in call

File "unix_socket.py", line 27, in read_call

Thread: 1562637408 File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
self._bootstrap_inner()

File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()

File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)

File "/usr/lib/python3.5/site-packages/grpc/_channel.py", line 903, in _poll_connectivity
time.time() + 0.2)

Thread: 1936454752 File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
self._bootstrap_inner()

File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()

File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)

File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
work_item = work_queue.get(block=True)

File "/usr/lib/python3.5/queue.py", line 164, in get
self.not_empty.wait()

File "/usr/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()

Thread: 1991458816 File "device_service.py", line 23, in

File "device.py", line 994, in _grpc_channel_state_changed

File "device.py", line 549, in transition_offline

File "device.py", line 1070, in on_device_offline

File "device.py", line 880, in on_address_updated

File "device.py", line 967, in _shutdown_grpc_channel

File "/usr/lib/python3.5/site-packages/grpc/_channel.py", line 1039, in close
self._close()

File "/usr/lib/python3.5/site-packages/grpc/_channel.py", line 1023, in _close
self._channel.close(cygrpc.StatusCode.cancelled, 'Channel closed!')

File "/usr/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()

File "daemon.py", line 861, in dump_stacks

Thread: 1772352608 File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
self._bootstrap_inner()

File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()

File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)

File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
work_item = work_queue.get(block=True)

File "/usr/lib/python3.5/queue.py", line 164, in get
self.not_empty.wait()

File "/usr/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()

Thread: 1926231136 File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
self._bootstrap_inner()

File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()

File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)

File "/usr/lib/python3.5/site-packages/paho/mqtt/client.py", line 2630, in _thread_main
self.loop_forever(retry_first_connection=True)

File "/usr/lib/python3.5/site-packages/paho/mqtt/client.py", line 1410, in loop_forever
rc = self.loop(timeout, max_packets)

File "/usr/lib/python3.5/site-packages/paho/mqtt/client.py", line 946, in loop
return self.loop_misc()

Thread: 1663300704 File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
self._bootstrap_inner()

File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()

File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)

File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
work_item = work_queue.get(block=True)

File "/usr/lib/python3.5/queue.py", line 164, in get
self.not_empty.wait()

File "/usr/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()

Thread: 1898181728 File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
self._bootstrap_inner()

File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()

File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)

File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
work_item = work_queue.get(block=True)

File "/usr/lib/python3.5/queue.py", line 164, in get
self.not_empty.wait()

File "/usr/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()

Thread: 1747186784 File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
self._bootstrap_inner()

File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()

File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)

File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
work_item = work_queue.get(block=True)

File "/usr/lib/python3.5/queue.py", line 164, in get
self.not_empty.wait()

File "/usr/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()

Thread: 1763964000 File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
self._bootstrap_inner()

File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()

File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)

File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
work_item = work_queue.get(block=True)

File "/usr/lib/python3.5/queue.py", line 164, in get
self.not_empty.wait()

File "/usr/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()

jpeel commented Aug 22, 2019

Another thing to note: the unary_stream call (which is started on its own thread) does not use a timeout on the unary part of the call:

  for event in self._client.Events(cc.EventsRequest()):
      self._process_event(event)

I'm confident that this is the source of the open request which is why there is a channel spin thread. I don't know why it isn't being closed out via the canceling of integrated calls in the channel's close method.

jpeel commented Aug 23, 2019

I'm going to experiment with not restarting the Events stream after a keepalive timeout. Instead, the Events stream thread will shut down until the connectivity goes back to READY, which makes more sense anyway. This will narrow down the situation so that we know whether sending a request after the keepalive timeout is necessary to cause the deadlock. We have a few test systems that hit this 1-2 times per day, so I should know by Monday if this improves things.
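
Roughly, the planned change looks like this (a sketch with placeholder names for our stub and handlers; grpc.channel_ready_future blocks until the channel reports READY):

  import grpc

  def receive_events(channel, stub, make_request, process_event):
      while True:
          try:
              for event in stub.Events(make_request()):
                  process_event(event)
          except grpc.RpcError:
              # After a failure (e.g. a keepalive timeout), wait for the
              # channel to become READY again instead of retrying at once.
              grpc.channel_ready_future(channel).result()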

gnossen commented Aug 29, 2019

To give you a quick update on this: I've tried a couple of different setups in a unit test bench based on the description of your configuration, but I've so far been unable to reproduce the deadlock.

jpeel commented Aug 29, 2019

We've found a way to reproduce it when integrated with our complete application, but I can't seem to reproduce it with a pared-down test script yet. We have confirmed that not sending a request to reopen the Events stream right after the keepalive watchdog timeout stops this from happening, so that is our current workaround. Also, a request on a unary_unary call has to error out with the keepalive error for this to happen.

jpeel commented Aug 30, 2019

I do think something extra in our code is involved, but I can't figure out what it is. The grpc client runs in a daemonized process (double-forked), and there is another fork before that daemonization in a process manager that starts up various services. In the process manager, SIGPIPE is set to ignore and SIGCHLD is given a callback that calls os.waitpid(-1, os.WNOHANG). My guess after seeing that was that the cpython.PyErr_CheckSignals() call in the completion queue code was getting stuck in that handler, but that doesn't appear to be the case. In the daemonized process, SIGPIPE and SIGCHLD are both set back to SIG_DFL. There are also signal handlers for SIGTERM, SIGSEGV, and SIGUSR1, but I don't see anything in those handlers that could cause this problem.

I also noticed that the _next code of the completion queue has changed since the grpc version we are using, specifically around the with nogil / with gil handling there. Maybe that change would help as well.
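
For context, the process manager's signal setup is roughly the following (an approximation, not the exact code):

  import os
  import signal

  # Process manager: ignore SIGPIPE and reap exited children on SIGCHLD.
  signal.signal(signal.SIGPIPE, signal.SIG_IGN)

  def reap_children(signum, frame):
      try:
          os.waitpid(-1, os.WNOHANG)  # non-blocking reap of any exited child
      except ChildProcessError:
          pass

  signal.signal(signal.SIGCHLD, reap_children)

  # The daemonized child later restores the defaults:
  # signal.signal(signal.SIGPIPE, signal.SIG_DFL)
  # signal.signal(signal.SIGCHLD, signal.SIG_DFL)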

I'm not sure what else to do right now, so if none of the above information is helpful enough, we can close this as not reproducible. I don't have any more time right now to dig into it, and we have a workaround for our application that is sufficient.

gnossen commented Aug 30, 2019

We have recently fixed a few problems related to signal handling. It's possible that #19481, for example, is related. If you get a chance, can you test with a nightly build and without your workaround in place?

jpeel commented Sep 3, 2019

We have created an internal ticket to try this out, but it might be a few weeks before it happens due to priorities.

Kaali commented Mar 26, 2020

I have also experienced a deadlock on channel close recently, but I haven't gotten any traces out of it because it happens in a container that doesn't have shell access.

The setup where it happens is a bit of a gRPC antipattern: a client polls the server for messages with a bidi stream RPC and closes the channel after every poll.

When the deadlock happens the loop is something like this:

Golang gRPC server (google.golang.org/grpc v.1.27.1)
Python 3.8.2 client with grpcio 1.27.2

  1. Client opens a channel
  2. Client sends a message to bidi RPC
  3. Server answers with status NotFound
  4. Client closes the channel
  5. Wait for a second then back to 1.

With this loop, the channel close in step 4 is almost guaranteed to deadlock within a day.
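
In code, the loop is roughly the following (a sketch; the stub, method name, and request type are placeholders for the real service):

  import time

  import grpc

  def poll_forever(target, make_stub, make_request):
      while True:
          channel = grpc.insecure_channel(target)                 # 1. open a channel
          try:
              stub = make_stub(channel)
              for response in stub.Poll(iter([make_request()])):  # 2. bidi stream RPC
                  pass
          except grpc.RpcError:
              pass                                                # 3. e.g. StatusCode.NOT_FOUND
          finally:
              channel.close()                                     # 4. this close deadlocks
          time.sleep(1)                                           # 5. wait a second, repeat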

gnossen commented Mar 26, 2020

@Kaali Any chance you have a deterministic reproduction case you can share?

Kaali commented Mar 27, 2020

I'll try to make one today. But so far it is not deterministic.

Kaali commented Mar 27, 2020

So far I haven't been able to make a reproducible case for this; I'm still working on it. I did get a possibly unrelated new error on macOS while running the same code that eventually gets stuck on the Linux cluster:

  File "/Users/kaali/dev/xxxx/runners/training_driver.py", line 41, in __init__
    self.stream = stub.TrainJob(iter(self.q.get, None))
  File "/Users/kaali/dev/xxxx/.venv/lib/python3.8/site-packages/grpc/_channel.py", line 1090, in __call__
    _consume_request_iterator(request_iterator, state, call,
  File "/Users/kaali/dev/xxxx/.venv/lib/python3.8/site-packages/grpc/_channel.py", line 256, in _consume_request_iterator
    consumption_thread.start()
  File "src/python/grpcio/grpc/_cython/_cygrpc/fork_posix.pyx.pxi", line 120, in grpc._cython.cygrpc.ForkManagedThread.start
  File "/nix/store/0hi6adsig2nnp0dxw4wdiswlqfamqvfs-python3-3.8.2/lib/python3.8/threading.py", line 852, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

This is with the same process I listed above. There are no threads created by the application itself.

Kaali commented Mar 27, 2020

Just verified that the thread count increases all the time when running a tight loop like that.

Repro case for the rising thread count at least on macOS: https://github.com/Kaali/grpcbugrepro

The code is similar to the one where close() deadlocks, but I haven't yet been able to reproduce the deadlock with it. Then again, I'm not running it in the production environment where the deadlock happened.

stale bot commented May 6, 2020

This issue/PR has been automatically marked as stale because it has not had any update (including commits, comments, labels, milestones, etc) for 30 days. It will be closed automatically if no further update occurs in 7 days. Thank you for your contributions!
