New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pause all PantsService threads before forking a pantsd-runner #6671

Merged
merged 7 commits into from Oct 29, 2018

Conversation

Projects
None yet
3 participants
@stuhood
Member

stuhood commented Oct 22, 2018

Problem

As described in #6565, I had a fundamental misunderstanding of what was legal during a fork, which resulted in intermittent hangs when a PantsService thread (usually FSEventService, but potentially the StoreGCService threads as well) were blocked on locks while attempting to acquire resources pre-fork.

As mentioned on the ticket, the fork_lock used to act as a honeypot to block any threads that might be attempting to interact with the Engine. This mostly had the right effect, but it was possible to forget to wrap some other lock access in the fork lock (because, in effect, all shared lock access needed to be protected by the fork_lock). But also, after the fork the fork_lock would be poisoned. This represented a challenging line to walk, where we needed to protect all shared locks, but we also needed to avoid that protection post fork in order to avoid deadlock.

Solution

Rather than using a lock to approximate all of the other locks, instead ensure that all Service threads that might interact with any non-private locks are "paused" in well-known locations ("safe points"?) while we fork. Since we would like to have more and more-fine-grained locks over time, while keeping the number of un-pooled threads relatively constrained, adding this bookkeeping to service threads seems like the right tradeoff.

While we continue to move toward a world (via --v2 and @console_rule) where forking is not necessary, we'd also like to be able to incrementally gain benefit from the daemon by porting portions of --v1 pipelines pre-fork.

Result

Fixes #6565 and fixes #6621.

@stuhood

This comment has been minimized.

Member

stuhood commented Oct 22, 2018

This depends on the first 3 commits as independent PRs: #6667, #6669, #6670.

The last 3 commits are independently reviewable.

@jsirois

This comment has been minimized.

Member

jsirois commented Oct 22, 2018

I'm going to wait on #6667, #6669, #6670 to land to review the rest.

@stuhood stuhood force-pushed the twitter:stuhood/service-fork branch from 64401fd to 3f9e13e Oct 27, 2018

@stuhood stuhood force-pushed the twitter:stuhood/service-fork branch from 3f9e13e to eba7bcb Oct 28, 2018

@stuhood stuhood force-pushed the twitter:stuhood/service-fork branch from eba7bcb to 5d70958 Oct 28, 2018

@stuhood

This comment has been minimized.

Member

stuhood commented Oct 28, 2018

This no longer has dependencies, and should now be reviewable.

@jsirois

This comment has been minimized.

Member

jsirois commented Oct 29, 2018

@stuhood the failures look legit and related. I'll back off review a while longer until you fix or disabuse me of the notion the failures are related.

@stuhood

This comment has been minimized.

Member

stuhood commented Oct 29, 2018

@jsirois : The failures are related, but my hypothesis right now is that this change has made a previously non-determinstic failure in that test (see #6621) deterministic.

I'm going to spend some time today looking at it, but it feels unlikely that fixing it will require an overhaul of this patch.

@jsirois

Setting next lease time is broken, but this otherwise looks good.

@@ -207,20 +206,23 @@ def _setup_services(build_root, bootstrap_options, legacy_graph_scheduler, watch
store_gc_service = StoreGCService(legacy_graph_scheduler.scheduler)
return (
return PantsServices(
# Services.
(fs_event_service, scheduler_service, pailgun_service, store_gc_service),

This comment has been minimized.

@jsirois

jsirois Oct 29, 2018

Member

Leveraging your dataype for keywords and comment kills would be great. Ditto keywords above on line 155, or even leverage the work you put into __new__ to just PantsServices().

Show resolved Hide resolved src/python/pants/pantsd/service/pants_service.py
If a timeout is specified, the method may return False to indicate a timeout: with no timeout
it will always (eventually) return True.
Raises if the service is not currently in either the Pausing or Running state.

This comment has been minimized.

@jsirois

jsirois Oct 29, 2018

Member

Right now await_paused can be called while Running, and that raises; so doc or code needs adjustment.

timeout = deadline - time.time() if deadline else None
if timeout and timeout <= 0:
return False
self._condition.wait(timeout)

This comment has been minimized.

@jsirois

jsirois Oct 29, 2018

Member

Nit: you use the kwarg when waiting with timeout in maybe_pause, one or the other.

"""Called by the service to indicate that it is pausable.
If the service calls this method while the state is `Pausing`, the state will transition
to `Paused`, and the service will block here until it is marked `Running` again.

This comment has been minimized.

@jsirois

jsirois Oct 29, 2018

Member

It could also unblock if mark_terminating is called after it parks on line 180. I think some of the docs assume usage and are not fully self-contained.

to `Paused`, and the service will block here until it is marked `Running` again.
If the state is not currently `Pausing`, and a timeout is not passed, this method returns
immediately. If a timeout is passed, this method block up to that number of seconds to wait

This comment has been minimized.

@jsirois

jsirois Oct 29, 2018

Member

blocks

self._next_gc = time.time() + self._GARBAGE_COLLECTION_INTERVAL_SECONDS
def _set_next_lease_extension(self):
self._next_lease_extension = time.time() + self._GARBAGE_COLLECTION_INTERVAL_SECONDS

This comment has been minimized.

@jsirois

jsirois Oct 29, 2018

Member

_LEASE_EXTENSION_INTERVAL_SECONDS

@stuhood

This comment has been minimized.

Member

stuhood commented Oct 29, 2018

Applied review feedback (thanks!), and fixed #6621 by increasing the max_user_watches setting. Will merge on green CI.

@ity

ity approved these changes Oct 29, 2018

# Mark all services pausing (to allow them to concurrently pause), and then wait for them
# to have paused.
# NB: PailgunServer ensures that the entire run occurs under the lifecycle_lock.
for service in self._services.services:

This comment has been minimized.

@ity

ity Oct 29, 2018

Member

I was trying to reconcile whether the two steps are necessary since mark_pausing raises if its not in a Running state anyway. then realized the first step is merely a marker and the second is the one pausing.

@stuhood stuhood merged commit 835e0bf into pantsbuild:master Oct 29, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@stuhood stuhood deleted the twitter:stuhood/service-fork branch Oct 29, 2018

@stuhood stuhood added this to the 1.10.x milestone Oct 29, 2018

stuhood added a commit that referenced this pull request Oct 29, 2018

Pause all PantsService threads before forking a pantsd-runner (#6671)
As described in #6565, I had a fundamental misunderstanding of what was legal during a fork, which resulted in intermittent hangs when a `PantsService` thread (usually `FSEventService`, but potentially the `StoreGCService` threads as well) were blocked on locks while attempting to acquire resources pre-fork.

As mentioned on the ticket, the `fork_lock` used to act as a honeypot to block any threads that might be attempting to interact with the Engine. This mostly had the right effect, but it was possible to forget to wrap some other lock access in the fork lock (because, in effect, _all_ shared lock access needed to be protected by the `fork_lock`). But also, after the fork the `fork_lock` would be poisoned. This represented a challenging line to walk, where we needed to protect all shared locks, but we also needed to avoid that protection post fork in order to avoid deadlock.

Rather than using a lock to approximate all of the other locks, instead ensure that all Service threads that might interact with any non-private locks are "paused" in well-known locations ("safe points"?) while we fork. Since we would like to have more and more-fine-grained locks over time, while keeping the number of un-pooled threads relatively constrained, adding this bookkeeping to service threads seems like the right tradeoff.

While we continue to move toward a world (via `--v2` and `@console_rule`) where forking is not necessary, we'd also like to be able to incrementally gain benefit from the daemon by porting portions of `--v1` pipelines pre-fork.

Fixes #6565 and fixes #6621.

stuhood added a commit that referenced this pull request Oct 29, 2018

Pause all PantsService threads before forking a pantsd-runner (#6671)
As described in #6565, I had a fundamental misunderstanding of what was legal during a fork, which resulted in intermittent hangs when a `PantsService` thread (usually `FSEventService`, but potentially the `StoreGCService` threads as well) were blocked on locks while attempting to acquire resources pre-fork.

As mentioned on the ticket, the `fork_lock` used to act as a honeypot to block any threads that might be attempting to interact with the Engine. This mostly had the right effect, but it was possible to forget to wrap some other lock access in the fork lock (because, in effect, _all_ shared lock access needed to be protected by the `fork_lock`). But also, after the fork the `fork_lock` would be poisoned. This represented a challenging line to walk, where we needed to protect all shared locks, but we also needed to avoid that protection post fork in order to avoid deadlock.

Rather than using a lock to approximate all of the other locks, instead ensure that all Service threads that might interact with any non-private locks are "paused" in well-known locations ("safe points"?) while we fork. Since we would like to have more and more-fine-grained locks over time, while keeping the number of un-pooled threads relatively constrained, adding this bookkeeping to service threads seems like the right tradeoff.

While we continue to move toward a world (via `--v2` and `@console_rule`) where forking is not necessary, we'd also like to be able to incrementally gain benefit from the daemon by porting portions of `--v1` pipelines pre-fork.

Fixes #6565 and fixes #6621.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment