Enforce a finite bound on the time gap between signal receipt and signal handler execution. #19481

gnossen · 2019-06-26T20:39:48Z

While drafting #19465, I found that the following simple snippet of code did not work:

stub = foo_pb2_grpc.FooStub(channel)
result_generator = stub.StreamingRPC(foo_pb2.FooRequest())

def cancel_request(unused_signum, unused_frame):
    result_generator.cancel()

signal.signal(signal.SIGINT, cancel_request)
for result in result_generator:
    do_stuff(result)

Signal handlers were only given a chance to run upon receipt of an entry in the RPC stream. Since there is no time bound on how long that might take, there can be an arbitrarily long time gap between receipt of the signal and the execution of the application's signal handlers.

It turns out that this issue was not limited to streaming RPCs. Unary RPCs exhibited the same property.

Signal handlers are only run on the main thread. The cpython implementation takes great care to ensure that the main thread does not block for an arbitrarily long period between signal checks.

Our indefinite blocking was due to wait() invocations on condition variables without a timeout.
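As an illustration of the general technique (not the code in this PR), the sketch below replaces a single unbounded wait() on a condition variable with a loop of bounded waits. The function name and the 0.05 second polling interval are assumptions made for the example. On interpreters where a blocking lock acquisition is not interrupted by signals, returning to Python code after every bounded wait gives the main thread a regular chance to run pending signal handlers.

```python
import threading


def wait_with_bounded_gaps(condition, predicate, poll_interval=0.05):
    """Block until predicate() is true, waking at least every poll_interval.

    `poll_interval` is a hypothetical value chosen for illustration.
    Returns the number of times the bounded wait returned.
    """
    wakeups = 0
    with condition:
        while not predicate():
            # A bounded wait returns control to the interpreter after at most
            # poll_interval seconds, even if nobody calls notify().
            condition.wait(timeout=poll_interval)
            wakeups += 1
    return wakeups
```

The trade-off discussed later in this thread applies here: a shorter interval tightens the bound on handler latency at the cost of more wakeups.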

gnossen · 2019-06-26T20:40:34Z

Manually verified that this allows for the desired simplifications in #19465, but I'm still working on unit tests. Just wanted to get eyes on this. @lidizheng

lidizheng · 2019-06-26T20:57:24Z

I encountered the same problem with wait in #19299.

I personally prefer setting an infinite timeout over checking periodically, because IMO the latter solution might have greater overhead while spinning.

gnossen · 2019-06-26T21:51:50Z

@lidizheng If we want to be responsive to handlers, some sort of spinning is going to be necessary. It's just a question of which layer it's going to be happening in. This is the code path that is exercised when you supply a timeout to wait(). However, reading this code, I can't help but feel that the C path is going to be faster than running this code in Python. It looks like the "infinite" timeout is going to be set to a spin with a period of 20 milliseconds. The infinite timeout looks like a good solution as long as we pair it with a test that verifies that cpython exhibits the behavior we're expecting.

Edit: I stand corrected. That second link I posted is for those platforms that do not have their own sem_timedwait. It looks like this behavior is dependent on the implementation of each individual platform. Perhaps we should actually quantify any changes in performance on the data path?

lidizheng

LGTM to the design! Good work!

Please take a look at failed test cases.

src/python/grpcio/grpc/_common.py

src/python/grpcio/grpc/_channel.py

lidizheng · 2019-07-01T23:21:35Z

src/python/grpcio/grpc/_common.py

+        spin_cb()
+
+
+def wait(wait_fn, wait_complete_fn, timeout=None, spin_cb=None):


Optional: if we move this function to Cython layer, we can gain free performance boost to reduce the overhead of introducing this spin wait mechanism.
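For readers without the diff at hand, a wrapper with the signature quoted above might look roughly like the following. Only the signature comes from the diff; the body, the helper `_wait_once`, the constant `_MAX_WAIT_SLICE_S` and its value, and the return convention (True on timeout, False on completion) are assumptions made for this sketch and may differ from the merged code.

```python
import time

# Hypothetical upper bound on any single blocking call, in seconds.
_MAX_WAIT_SLICE_S = 0.05


def _wait_once(wait_fn, timeout, spin_cb):
    # Perform one bounded wait, then give the caller a hook to run.
    wait_fn(timeout=timeout)
    if spin_cb is not None:
        spin_cb()


def wait(wait_fn, wait_complete_fn, timeout=None, spin_cb=None):
    """Call wait_fn in bounded slices until wait_complete_fn() is true.

    Returns True if `timeout` elapsed first, False otherwise (assumed
    convention for this sketch).
    """
    if timeout is None:
        while not wait_complete_fn():
            _wait_once(wait_fn, _MAX_WAIT_SLICE_S, spin_cb)
        return False
    deadline = time.monotonic() + timeout
    while not wait_complete_fn():
        remaining = min(deadline - time.monotonic(), _MAX_WAIT_SLICE_S)
        if remaining <= 0:
            return True
        _wait_once(wait_fn, remaining, spin_cb)
    return False
```

With a `threading.Condition` named `cond` that the caller already holds, a call could look like `wait(cond.wait, lambda: state.done)`, where `state.done` stands for whatever completion flag the caller tracks.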

src/python/grpcio_tests/tests/unit/_signal_client.py

lidizheng · 2019-07-01T23:35:50Z

Also, should we add a skip condition for when _common.wait is not run on the main thread, since in that case it won't block signal handling?

gnossen · 2019-07-01T23:43:58Z

@lidizheng I considered it, but I'm wondering if checking the TID on every call to wait would be more costly than just adding the timeout unconditionally. I'll do some more research on this.
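One way to put a rough number on that cost is sketched below. It assumes Python 3.4 or later, where `threading.main_thread()` is available; the helper names are hypothetical and the measurement only covers the check itself, not the rest of the wait path.

```python
import threading
import timeit


def on_main_thread():
    # Identity comparison against the interpreter's main thread object.
    return threading.current_thread() is threading.main_thread()


def measure_check_cost(iterations=100_000):
    # Average seconds per call of the main-thread check.
    return timeit.timeit(on_main_thread, number=iterations) / iterations


if __name__ == "__main__":
    print("seconds per check:", measure_check_cost())
```

The result depends on the interpreter and the machine, so it is best compared against the measured cost of one bounded wait on the same setup.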

gnossen · 2019-07-03T19:59:12Z

@lidizheng Thinking about it a little bit more, I don't think it makes sense to only add this behavior to the main thread. Suppose we're on some other thread and the application blocks indefinitely in a C-level function while holding the GIL (e.g. waiter.acquire()). Since it has the GIL, the main thread still will not be able to execute the signal handler.

gnossen · 2019-07-03T20:04:31Z

So I went ahead and disabled the gevent tests for this PR, as I'm bumping up against #18980. On a separate branch, I've determined the root cause and have a fix that makes this test pass under gevent (though perhaps not the ideal fix). My plan is to merge this to master and then remove the test from the blocklist in the follow-up PR that addresses #18980.

lidizheng · 2019-07-03T20:11:37Z

That makes sense. Thank you for thinking through this optimization.


gnossen · 2019-07-03T20:34:12Z

@lidizheng PTALA.

…dler Execution

Previously, signal handlers were only given a chance to run upon receipt of an entry in the RPC stream. Since there is no time bound on how long that might take, there can be an arbitrarily long time gap between receipt of the signal and the execution of the application's signal handlers.

Signal handlers are only run on the main thread. The cpython implementation takes great care to ensure that the main thread does not block for an arbitrarily long period between signal checks.

Our indefinite blocking was due to wait() invocations on condition variables without a timeout.

This changes all usages of wait() in the channel implementation to use a wrapper that is responsive to signals even while waiting on an RPC.

A test has been added to verify this. Tests are currently disabled under gevent due to grpc#18980, but a fix for that has been found and should be merged shortly.

src/python/grpcio_tests/tests/unit/_signal_client.py

src/python/grpcio/grpc/_common.py

gnossen · 2019-07-03T23:20:51Z

#19554

gnossen added the lang/Python, disposition/DO NOT MERGE, release notes: yes, area/concurrency, and kind/bug labels Jun 26, 2019

lidizheng mentioned this pull request Jun 26, 2019

Signal Handlers not Run While Client Iterates over Server-Streaming or Bidi RPC #19464

Closed

gnossen force-pushed the main_thread_starvation branch from cf0d5d6 to 88f7865 Compare July 1, 2019 23:09

gnossen removed the disposition/DO NOT MERGE label Jul 1, 2019

gnossen requested a review from lidizheng July 1, 2019 23:10

gnossen assigned lidizheng Jul 1, 2019

lidizheng reviewed Jul 1, 2019

gnossen mentioned this pull request Jul 3, 2019

Patched GeventWorker + gunicorn causes calls to subprocess to segfault #18980

Closed

gnossen changed the title ~~Enforce a Finite Time Gap Bound between Signal Receipt and Signal Handler Execution~~ Enforce a finite bound on the time gap between signal receipt and signal handler execution. Jul 3, 2019

gnossen force-pushed the main_thread_starvation branch from 8a3f89a to af1b09f Compare July 3, 2019 20:36

lidizheng approved these changes Jul 3, 2019

Add explanation to _signal_client

f7182fe

gnossen merged commit 8f044f7 into grpc:master Jul 3, 2019

lidizheng mentioned this pull request Jul 8, 2019

Remove the unused import that breaks import #19581

Merged

gnossen added a commit to gnossen/grpc that referenced this pull request Jul 8, 2019

Revert "Merge pull request grpc#19481 from gnossen/main_thread_starva…

2014a51

…tion" This reverts commit 8f044f7, reversing changes made to 5ae9afd.

gnossen mentioned this pull request Jul 8, 2019

Revert signal handling #19583

Merged

gnossen mentioned this pull request Aug 30, 2019

Deadlock during close in Python client #20026

Open

gnossen deleted the main_thread_starvation branch September 27, 2019 22:29

lock bot locked as resolved and limited conversation to collaborators Dec 26, 2019
