Deadlock during close in Python client #20026
By the way, we can't get this to happen every time. We've struggled to reproduce it consistently, but given how many systems are running this code, it happens fairly often.

Also, we log whenever requests are sent and whenever they time out or error out in any way. It appears that one request was open when the watchdog timeout happened and its callback was called; I can see the caught exception indicating that it failed due to the watchdog timeout. This makes me lean more towards the connectivity code being the issue.

Any chance you can give us the stack trace(s) verbatim, with any sensitive portions censored?
```
Thread: 1587803232
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 862, in run
  File "device.py", line 574, in _receive_events
  File "/usr/lib/python3.5/site-packages/grpc/_channel.py", line 363, in next
  File "/usr/lib/python3.5/site-packages/grpc/_channel.py", line 348, in _next
  File "/usr/lib/python3.5/threading.py", line 293, in wait
Thread: 1671689312
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 862, in run
  File "/usr/lib/python3.5/site-packages/grpc/_channel.py", line 772, in channel_spin
Thread: 1946305632
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 862, in run
  File "threaded_call.py", line 40, in _make_call_thread
  File "weak_callback_ref.py", line 32, in call
  File "other_device.py", line 211, in _other_device_node_thread
  File "/usr/lib/python3.5/queue.py", line 164, in get
  File "/usr/lib/python3.5/threading.py", line 293, in wait
Thread: 1906570336
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 862, in run
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
  File "/usr/lib/python3.5/queue.py", line 164, in get
  File "/usr/lib/python3.5/threading.py", line 293, in wait
Thread: 1722020960
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 862, in run
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
  File "/usr/lib/python3.5/queue.py", line 164, in get
  File "/usr/lib/python3.5/threading.py", line 293, in wait
Thread: 1612969056
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 862, in run
  File "device.py", line 574, in _receive_events
  File "/usr/lib/python3.5/site-packages/grpc/_channel.py", line 363, in next
  File "/usr/lib/python3.5/site-packages/grpc/_channel.py", line 348, in _next
  File "/usr/lib/python3.5/threading.py", line 293, in wait
Thread: 1738798176
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 862, in run
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
  File "/usr/lib/python3.5/queue.py", line 164, in get
  File "/usr/lib/python3.5/threading.py", line 293, in wait
Thread: 1713632352
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 862, in run
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
  File "/usr/lib/python3.5/queue.py", line 164, in get
  File "/usr/lib/python3.5/threading.py", line 293, in wait
Thread: 1755575392
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 862, in run
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
  File "/usr/lib/python3.5/queue.py", line 164, in get
  File "/usr/lib/python3.5/threading.py", line 293, in wait
Thread: 1780741216
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 862, in run
  File "threaded_call.py", line 40, in _make_call_thread
  File "weak_callback_ref.py", line 32, in call
  File "unix_socket.py", line 27, in read_call
Thread: 1562637408
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 862, in run
  File "/usr/lib/python3.5/site-packages/grpc/_channel.py", line 903, in _poll_connectivity
Thread: 1936454752
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 862, in run
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
  File "/usr/lib/python3.5/queue.py", line 164, in get
  File "/usr/lib/python3.5/threading.py", line 293, in wait
Thread: 1991458816
  File "device_service.py", line 23, in <module>
  File "device.py", line 994, in _grpc_channel_state_changed
  File "device.py", line 549, in transition_offline
  File "device.py", line 1070, in on_device_offline
  File "device.py", line 880, in on_address_updated
  File "device.py", line 967, in _shutdown_grpc_channel
  File "/usr/lib/python3.5/site-packages/grpc/_channel.py", line 1039, in close
  File "/usr/lib/python3.5/site-packages/grpc/_channel.py", line 1023, in _close
  File "/usr/lib/python3.5/threading.py", line 293, in wait
  File "daemon.py", line 861, in dump_stacks
Thread: 1772352608
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 862, in run
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
  File "/usr/lib/python3.5/queue.py", line 164, in get
  File "/usr/lib/python3.5/threading.py", line 293, in wait
Thread: 1926231136
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 862, in run
  File "/usr/lib/python3.5/site-packages/paho/mqtt/client.py", line 2630, in _thread_main
  File "/usr/lib/python3.5/site-packages/paho/mqtt/client.py", line 1410, in loop_forever
  File "/usr/lib/python3.5/site-packages/paho/mqtt/client.py", line 946, in loop
Thread: 1663300704
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 862, in run
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
  File "/usr/lib/python3.5/queue.py", line 164, in get
  File "/usr/lib/python3.5/threading.py", line 293, in wait
Thread: 1898181728
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 862, in run
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
  File "/usr/lib/python3.5/queue.py", line 164, in get
  File "/usr/lib/python3.5/threading.py", line 293, in wait
Thread: 1747186784
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 862, in run
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
  File "/usr/lib/python3.5/queue.py", line 164, in get
  File "/usr/lib/python3.5/threading.py", line 293, in wait
Thread: 1763964000
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
  File "/usr/lib/python3.5/threading.py", line 862, in run
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 64, in _worker
  File "/usr/lib/python3.5/queue.py", line 164, in get
  File "/usr/lib/python3.5/threading.py", line 293, in wait
```
Another thing to note: the way we use the unary_stream (which is started on its own thread) does not use a timeout on the unary part of the call. I'm confident that this is the source of the open request, which is why there is a channel spin thread. I don't know why it isn't being closed out via the cancellation of integrated calls in the channel's close method.
I'm going to experiment with not trying to restart the Events stream after a keepalive timeout. Instead, the Events stream thread will shut down until the connectivity goes back to READY, which makes more sense anyway. This will narrow down the situation so that we know whether sending a request after the keepalive timeout is necessary to cause the deadlock. We have a few test systems that are experiencing this 1-2 times per day, so I should know by Monday if this improves things.
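The gating described here can be sketched with a `threading.Event`. This is an illustrative reconstruction, not the reporter's code; the strings `"READY"`/`"IDLE"` stand in for `grpc.ChannelConnectivity` values, and the class name is hypothetical:

```python
import threading

class EventsStreamGate:
    """Instead of retrying the Events stream right after a keepalive
    timeout, the stream thread blocks here until the channel reports
    READY again."""

    def __init__(self):
        self._ready = threading.Event()

    def on_connectivity_change(self, state):
        # In real code this would be registered with
        # channel.subscribe(self.on_connectivity_change).
        if state == "READY":
            self._ready.set()
        else:
            self._ready.clear()

    def wait_until_ready(self, timeout=None):
        # The Events stream thread parks here rather than immediately
        # re-issuing a request after a keepalive timeout.
        return self._ready.wait(timeout)

gate = EventsStreamGate()
gate.on_connectivity_change("READY")
print(gate.wait_until_ready(timeout=0))  # True: stream may restart
gate.on_connectivity_change("IDLE")
print(gate.wait_until_ready(timeout=0))  # False: stay shut down
```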
To give you a quick update on this: I've tried a couple of different setups in a unit testbench based on the description of your configuration, but I've so far been unable to reproduce it.
We've found a way to reproduce it when integrated with our complete application, but I can't reproduce it with a pared-down test script yet. We have confirmed that not sending a request to reopen the Events stream right after the keepalive watchdog timeout stops this from happening, so that is our current workaround. Also, a request on a unary_unary stream has to be errored out with the keepalive error for this to happen.
I do think that it is something extra in our code, but I can't figure out what it is. The grpc client is run from a daemonized process (double-forked). There is also another fork before that daemonization, in a process manager that starts up various services. In the process manager, SIGPIPE is set to ignore and SIGCHLD sets a callback that calls … I'm not sure what else to do right now, so if none of the above information is helpful enough, we can close this as not reproducible. I don't have any more time right now to dig into it, and we have a workaround for our application that is sufficient.
We have recently fixed a few problems related to signal handling. It's possible that, e.g. #19481 could be related. If you get a chance, can you test with a nightly build and without your workaround in place? |
We have created an internal ticket to try this out, but it might be a few weeks before it happens due to priorities. |
I have also experienced a deadlock on channel close recently, but haven't gotten any traces out of it, as it happens in a container without shell access. The setup where it happens is a very antipattern type of gRPC system, where a client polls a server for messages with a bidi stream RPC and closes the channel after every poll. When the deadlock happens, the loop is something like this: Golang gRPC server (google.golang.org/grpc v1.27.1)
With this, the channel close at step 4 is almost guaranteed to deadlock after a day.
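The per-poll lifecycle described above can be sketched generically (a hedged reconstruction, since the original numbered steps were not preserved in this thread; `open_channel`, `make_stub`, and `poll` are placeholders for e.g. `grpc.insecure_channel`, the generated stub class, and the bidi polling call):

```python
def poll_once(open_channel, make_stub, poll):
    """One iteration of the polling loop: open a fresh channel, run one
    bidi poll against it, then close the channel. With real gRPC,
    open_channel would be e.g. `lambda: grpc.insecure_channel(target)`."""
    channel = open_channel()
    try:
        stub = make_stub(channel)
        return poll(stub)
    finally:
        channel.close()  # the close that is reported to deadlock
```

Opening and closing a channel per poll defeats gRPC's connection reuse, which is why this is called out as an antipattern; it also exercises `Channel.close()` far more often than typical workloads, making the deadlock much more likely to surface.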
@Kaali Any chance you have a deterministic reproduction case you can share? |
I'll try to make one today. But so far it is not deterministic. |
So far I haven't been able to make a reproducible case for this; still working on it. I did get a possibly unrelated new error on macOS while running the same code that eventually gets stuck on a Linux cluster:
This is with the same process I listed above. There are no threads created by the application itself.
Just verified that the thread count increases all the time when running a tight loop like that. Repro case for the rising thread count, at least on macOS: https://github.com/Kaali/grpcbugrepro And the code is similar to the one where
This issue/PR has been automatically marked as stale because it has not had any update (including commits, comments, labels, milestones, etc.) for 30 days. It will be closed automatically if no further update occurs in 7 days. Thank you for your contributions!
What version of gRPC and what language are you using?
1.17.1 Python
What operating system (Linux, Windows,...) and version?
Linux 3.19.3.28478
What runtime / compiler are you using (e.g. Python version or version of gcc)?
Python 3.5.3
What did you do?
If possible, provide a recipe for reproducing the error. Try being specific and include code snippets if helpful.
I haven't been able to find a good way to consistently reproduce the issue. Originally, I thought that unhandled exceptions causing the channel spin thread to close were causing the issue, as I found in #19910, but that is not what is going on because the channel spin thread is still going. The situation we have is this: a controller device runs the client and a peripheral device runs the grpc server. A channel is opened on the client side and a function subscribes to the connectivity changes. When connectivity issues (wifi, for instance) occur, the channel state goes from READY to IDLE and we close the channel so that we can try service discovery again. However, the channel is deadlocking in the close method, which is being invoked from the main Python thread. Unfortunately, I don't have any knowledge of exceptions that might be happening other than on the main Python thread, but I do have a stack trace.
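The teardown path described above can be sketched as follows. This is an illustrative reconstruction, not the reporter's code: the string `"IDLE"` stands in for `grpc.ChannelConnectivity.IDLE`, and `close_channel`/`restart_discovery` are hypothetical hooks for the deadlocking `Channel.close()` call and the service-discovery retry:

```python
def on_connectivity_change(state, close_channel, restart_discovery):
    """When the channel drops from READY to IDLE, close it and rerun
    service discovery, per the flow described above. In real code,
    `state` would be a grpc.ChannelConnectivity value and this callback
    would be registered via channel.subscribe(...)."""
    if state == "IDLE":
        close_channel()      # Channel.close() -- the call that deadlocks
        restart_discovery()  # look up the peripheral's address again
```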
We have a long-running unary_stream being iterated over in a separate thread. When it fails, we retry a few times before giving up. There are also two unary_unary streams. One is used for sending requests and the other for sending events. All requests/events on these two channels are sent using the future method on each stream with a timeout of either 1 second or 5 seconds.
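The long-running unary_stream consumer might be structured roughly like this (a hedged sketch; `open_stream` and `handle` are placeholders for the generated-stub call and the application's event handler, and the retry count and backoff are illustrative):

```python
import time

def receive_events(open_stream, handle, max_retries=3, backoff=1.0):
    """Iterate a server-streaming response on a dedicated thread,
    retrying a few times on failure before giving up, as described
    above."""
    for attempt in range(max_retries):
        try:
            for event in open_stream():  # e.g. stub.Events(request)
                handle(event)
            return True                  # stream ended cleanly
        except Exception:                # e.g. grpc.RpcError on keepalive timeout
            time.sleep(backoff * (attempt + 1))
    return False                         # gave up after max_retries failures
```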
We have the channel keepalive timeout set to 90 seconds and the keepalive time set to 20 seconds. Keepalive is permitted even without calls, the minimum time between pings is 8 seconds, and max_pings_without_data is set to 0. We are seeing a keepalive timeout happen when the channel connectivity goes to IDLE and the main thread tries to close the channel. Several requests fail around this time with the keepalive timeout exception.
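Expressed as gRPC Python channel options (the list of `(name, value)` tuples passed as `options=` to `grpc.insecure_channel`/`grpc.secure_channel`), that configuration would look roughly like this; the key strings are the standard channel-arg names, though their availability should be checked against the gRPC version in use:

```python
# Keepalive settings described above, as gRPC channel options. Plain
# tuples, so no grpc import is needed to build the list; pass it via
# grpc.insecure_channel(target, options=KEEPALIVE_OPTIONS).
KEEPALIVE_OPTIONS = [
    ("grpc.keepalive_time_ms", 20000),               # ping every 20 s
    ("grpc.keepalive_timeout_ms", 90000),            # 90 s keepalive timeout
    ("grpc.keepalive_permit_without_calls", 1),      # ping even with no calls
    ("grpc.http2.min_time_between_pings_ms", 8000),  # >= 8 s between pings
    ("grpc.http2.max_pings_without_data", 0),        # no cap on data-less pings
]
```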
The stack trace is very interesting. The close method is definitely waiting on the channel condition variable so it is either waiting for segregated calls, integrated calls, or connectivity polling to finish, but due to Cython I don't know which.
The unary_stream thread is waiting on the channel condition variable as well. I'm guessing that the connection disruption was somewhat temporary in this case and that this unary_stream was able to be recreated.
There is also a thread polling connectivity which is interesting because we call unsubscribe before calling close on the channel. The polling connectivity thread is in _poll_connectivity in _channel.py calling channel.watch_connectivity_state. Unfortunately the Cython code is preventing me from getting more insight here.
The last relevant thread is the channel spin thread that handles responses for unary_unary streams. It is still running and calling state.channel.next_call_event. Again, Cython keeps me from knowing more. The fact that the channel spin thread is still going means that there is still at least one open request.
One other thing to note is that we have disabled grpc forking support because it was causing us deadlock problems due to other parts of the Python app doing some forking.
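Fork support in the Python client is toggled with an environment variable that gRPC reads when the module is imported, so it must be set beforehand; disabling it as described might look like this (a minimal sketch):

```python
import os

# Must be set before `import grpc`; the gRPC runtime reads it at module
# import time. "0"/"false" disables the fork handlers.
os.environ["GRPC_ENABLE_FORK_SUPPORT"] = "0"

# ...and only after this point: import grpc
```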
To me this is clearly a problem that is either in the channel_spin code or the watch_connectivity_state code. If the problem is in watch_connectivity_state, I suspect that it is with grpc_channel_watch_connectivity_state or _latent_event not returning. If the problem is in _next_call_event, then it is probably held up in _latent_event or the on_success callback which calls _process_integrated_call_tag.
It has occurred to me that both of these functions use _latent_event with a deadline of None.
What did you expect to see?
The channel close would not deadlock.
What did you see instead?
The channel close deadlocked.
Make sure you include information that can help us debug (full error message, exception listing, stack trace, logs).
See TROUBLESHOOTING.md for how to diagnose problems better.
Anything else we should know about your project / environment?