Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enforce a finite bound on the time gap between signal receipt and signal handler execution. #19481

Merged
merged 2 commits into from
Jul 3, 2019

Conversation

gnossen
Copy link
Contributor

@gnossen gnossen commented Jun 26, 2019

While drafting #19465, I found that the following simple snippet of code did not work:

stub = foo_pb2_grpc.FooStub(channel)
result_generator = stub.StreamingRPC(foo_pb2.FooReques())

def cancel_request(unused_signum, unused_frame):
    result_generator.cancel()

signal.signal(signal.SIGINT, cancel_request)
for result in result_generator:
    do_stuff(result)

Signal handlers were only given a chance to run upon receipt of an entry in the RPC stream. Since there is no time bound on how long that might take, there can be an arbitrarily long time gap between receipt of the signal and the execution of the application's signal handlers.

It turns out that this issue was not limited to streaming RPCs. Unary RPCs exhibited the same property.

Signal handlers are only run on the main thread. The cpython implementation takes great care to ensure that the main thread does not block for an arbitrarily long period between signal checks.

Our indefinite blocking was due to wait() invocations on condition variables without a timeout.

Related:

@gnossen
Copy link
Contributor Author

gnossen commented Jun 26, 2019

Manually verified that this allows for the desired simplifications in #19465, but I'm still working on unit tests. Just wanted to get eyes on this. @lidizheng

@lidizheng
Copy link
Contributor

I encountered the same problem with wait in #19299.

I personally prefer setting infinite timeout than periodically check, because IMO the later solution might have greater overhead while spinning.

@gnossen
Copy link
Contributor Author

gnossen commented Jun 26, 2019

@lidizheng If we want to be responsive to handlers, some sort of spinning is going to be necessary. It's just a question of which layer it's going to be happening in. This is the code path that is exercised when you supply a timeout to wait(). However, reading this code, I can't help but feel that the C path is going to be faster than running this code in Python. It looks like the "infinite" timeout is going to be set to a spin with a period of 20 milliseconds. The infinite timeout looks like a good solution as long as we pair it with a test that verifies that cpython exhibits the behavior we're expecting.

Edit: I stand corrected. That second link I posted is for those platforms that do not have their own sem_timedwait. It looks like this behavior is dependent on the implementation of each individual platform. Perhaps we should actually quantify any changes in performance on the data path?

Copy link
Contributor

@lidizheng lidizheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM to the design! Good work!

Please take a look at failed test cases.

src/python/grpcio/grpc/_common.py Outdated Show resolved Hide resolved
src/python/grpcio/grpc/_common.py Outdated Show resolved Hide resolved
src/python/grpcio/grpc/_channel.py Show resolved Hide resolved
spin_cb()


def wait(wait_fn, wait_complete_fn, timeout=None, spin_cb=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: if we move this function to Cython layer, we can gain free performance boost to reduce the overhead of introducing this spin wait mechanism.

@lidizheng
Copy link
Contributor

Also, should we add a skip condition if the _common.wait is not ran in main thread? Then it won't block signal handling?

@gnossen
Copy link
Contributor Author

gnossen commented Jul 1, 2019

@lidizheng I considered it, but I'm wondering if checking the TID on every call to wait would be more costly than just adding the timeout unconditionally. I'll do some more research on this.

@gnossen
Copy link
Contributor Author

gnossen commented Jul 3, 2019

@lidizheng Thinking about it a little bit more, I don't think it makes sense to only add this behavior to the main thread. Suppose we're on some other thread and the application blocks indefinitely in a C-level function while holding the GIL (e.g. waiter.acquire()). Since it has the GIL, the main thread still will not be able to execute the signal handler.

@gnossen
Copy link
Contributor Author

gnossen commented Jul 3, 2019

So I went ahead and disabled the gevent tests for this PR, as I'm bumping up against #18980. On a separate branch, I've determined the root cause and have a fix that makes this test pass under gevent (though perhaps not the ideal fix). My plan is to merge this to master and then remove the test from the blocklist in the follow-up PR that addresses #18980.

@gnossen gnossen changed the title Enforce a Finite Time Gap Bound between Signal Receipt and Signal Handler Execution Enforce a finite bound on the time gap between signal receipt and signal handler execution. Jul 3, 2019
@lidizheng
Copy link
Contributor

lidizheng commented Jul 3, 2019 via email

@gnossen
Copy link
Contributor Author

gnossen commented Jul 3, 2019

@lidizheng PTALA.

…dler Execution

Previously, signal handlers were only given a chance to run upon receipt of an
entry in the RPC stream. Since there is no time bound on how long that might
take, there can be an arbitrarily long time gap between receipt of the signal
and the execution of the application's signal handlers.

Signal handlers are only run on the main thread. The cpython implementation
takes great care to ensure that the main thread does not block for an
arbitrarily long period between signal checks.

Our indefinite blocking was due to wait() invocations on condition variables
without a timeout.

This changes all usages of wait() in the the channel implementation to use a
wrapper that is responsive to signals even while waiting on an RPC.

A test has been added to verify this.

Tests are currently disabled under gevent due to
grpc#18980, but a fix for that has been
found and should be merged shortly.
@gnossen
Copy link
Contributor Author

gnossen commented Jul 3, 2019

#19554

@gnossen gnossen merged commit 8f044f7 into grpc:master Jul 3, 2019
gnossen added a commit to gnossen/grpc that referenced this pull request Jul 8, 2019
…tion"

This reverts commit 8f044f7, reversing
changes made to 5ae9afd.
@gnossen gnossen mentioned this pull request Jul 8, 2019
@gnossen gnossen deleted the main_thread_starvation branch September 27, 2019 22:29
@lock lock bot locked as resolved and limited conversation to collaborators Dec 26, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area/concurrency kind/bug lang/Python release notes: yes Indicates if PR needs to be in release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants