Move fork context management to rust #5521

Merged
stuhood merged 5 commits into pantsbuild:master from twitter:stuhood/fork-lock-in-rust on Aug 28, 2018

Conversation

@stuhood (Member) commented Feb 26, 2018

Problem

As described in #6356, we currently suspect that there are cases where resources within the engine are being used during a fork. The python-side `fork_lock` attempts to approximate a bunch of other locks that it would be more accurate to acquire directly.

Solution

Move "fork context" management to rust, and execute our double fork for DaemonPantsRunner inside the scheduler's fork context. This acquires all existing locks, which removes the need for a fork_lock that would approximate those locks. Also has the benefit that we can eagerly re-start the scheduler's CpuPool.

Result

It should be easier to add additional threads and locks on the rust side without worrying about whether we have acquired the `fork_lock` in enough places.

A series of replays of our internal benchmarks no longer reproduces the hang described in #6356, so this change likely fixes that issue.

@stuhood (Member) commented Feb 26, 2018

Reviewable.

@illicitonion (Contributor) left a comment

It's not immediately apparent to me how this leads to being able to run background threads, but explicit fine-grained locking is probably good regardless, and I'm sure it's an important step :)

@@ -401,6 +401,14 @@ def _kill(self, kill_sig):
    if self.pid:
      os.kill(self.pid, kill_sig)

  def _noop_fork_context(self, func):

@illicitonion (Contributor) commented Feb 27, 2018

When is this the correct thing to use?

@stuhood (Member) commented Mar 15, 2018

Inlined and moved this docstring into the daemonize pydoc.


  def visualize_to_dir(self):
    return self._native.visualize_to_dir

  def to_keys(self, subjects):
    return list(self._to_key(subject) for subject in subjects)

  def pre_fork(self):
    self._native.lib.scheduler_pre_fork(self._scheduler)
  def with_fork_context(self, func):

@illicitonion (Contributor) commented Feb 27, 2018

Can you add a quick pydoc explaining what this is and how it should be used?

@stuhood (Member) commented Mar 15, 2018

I'll refer to the rust docs on the topic (in lib.rs).


# Perform the double fork under the fork_context. Three outcomes are possible after the double
# fork: we're either the original process, the double-fork parent, or the double-fork child.
# These are represented by parent_or_child being None, True, or False, respectively.

@illicitonion (Contributor) commented Feb 27, 2018

Maybe this could use a tuple(is_original, is_parent) or something more enum-y, rather than a tri-state boolean?

if parent_or_child:
  ...
else:
  ...

doesn't read fantastically...

@stuhood (Member) commented Mar 15, 2018

I ache for actual enums... sigh.
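
(Purely for illustration, and not code from this change: on the rust side the three outcomes really could be a proper enum, e.g. something like the following sketch.)

// Illustrative only: the tri-state as an enum instead of None/True/False.
enum ForkOutcome {
  Original,   // the process that initiated the daemonization
  ForkParent, // the intermediate double-fork parent
  ForkChild,  // the daemonized double-fork child
}

fn describe(outcome: ForkOutcome) -> &'static str {
  match outcome {
    ForkOutcome::Original => "original process",
    ForkOutcome::ForkParent => "double-fork parent",
    ForkOutcome::ForkChild => "double-fork child",
  }
}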

@@ -37,20 +37,16 @@ def _launch_thread(f):
  def _extend_lease(self):
    while 1:
      # Use the fork lock to ensure this thread isn't cloned via fork while holding the graph lock.
      with self.fork_lock:
        self._logger.debug('Extending leases')

@illicitonion (Contributor) commented Feb 27, 2018

Can I have my logging back please? :)

@stuhood (Member) commented Mar 15, 2018

Whoops. Yep.

///
/// Run a function while the pool is shut down, and restore the pool after it completes.
///
pub fn with_shutdown<F, T>(&self, f: F) -> T

@illicitonion (Contributor) commented Feb 27, 2018

Could we do away with the lock entirely by making with_shutdown take &mut self?

(I can believe this is impractical, but it would be nice if possible...)

@stuhood (Member) commented Mar 15, 2018

I don't think so, no... we have a reference to the pool via an Arc, and getting a mutable reference into that would require either cloning or something potentially panicky.
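
To illustrate that constraint with a hedged sketch (the Vec<u32> merely stands in for the pool): Arc::get_mut only hands out a mutable reference when there are no other owners, so a &mut self-based with_shutdown would be forced into either cloning (Arc::make_mut requires Clone) or a fallible unwrap, which is why the pool keeps an internal lock instead.

use std::sync::Arc;

// Hypothetical sketch: mutating a value shared via Arc without interior mutability.
fn shut_down_in_place(pool: &mut Arc<Vec<u32>>) {
  // Arc::get_mut only returns Some when this is the sole owner; with other
  // clones of the Arc alive (as in the scheduler), it returns None, so the
  // remaining options are Arc::make_mut (requires Clone, i.e. cloning the
  // data) or unwrapping, which can panic.
  if let Some(exclusive) = Arc::get_mut(pool) {
    exclusive.clear();
  }
}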

@@ -190,6 +201,8 @@ pub fn unsafe_call(func: &Function, args: &[Value]) -> Value {
/////////////////////////////////////////////////////////////////////////////////////////

lazy_static! {
  // NB: Unfortunately, it's not currently possible to merge these locks, because mutating

@illicitonion (Contributor) commented Feb 27, 2018

Nice comment :)

@dotordogh (Contributor) left a comment

Just out of curiosity, is running everything to see if the behavior remains the same the best way to test something like this?

@@ -448,32 +456,43 @@ def daemonize(self, pre_fork_opts=None, post_fork_parent_opts=None, post_fork_ch
    daemons. Having a disparate umask from pre-vs-post fork causes files written in each phase to
    differ in their permissions without good reason - in this case, we want to inherit the umask.
    """
    fork_context = fork_context or self._noop_fork_context

    def double_fork():

@dotordogh (Contributor) commented Feb 28, 2018

Is it worth explaining in this context why double forking is necessary?

@stuhood (Member) commented Mar 15, 2018

It's explained in the comment above.

@dotordogh (Contributor) commented Mar 15, 2018

I missed that! Sorry!

stuhood changed the title from "Move the fork context management to rust" to "Move fork context management to rust" on Mar 15, 2018

stuhood force-pushed the twitter:stuhood/fork-lock-in-rust branch from c51ffad to e49fd02 on Mar 15, 2018

@stuhood (Member) commented Mar 15, 2018

> Just out of curiosity, is running everything to see if the behavior remains the same the best way to test something like this?

Yeah, basically. Kris has added a lot of tests to cover daemon use cases, so we can be pretty confident that nothing is fundamentally broken.

stuhood force-pushed the twitter:stuhood/fork-lock-in-rust branch from e49fd02 to 6c47781 on Mar 15, 2018

@illicitonion (Contributor) left a comment

Looks great :) Thanks!

stuhood force-pushed the pantsbuild:master branch from b6bb42d to 9e2fdb5 on May 11, 2018

@stuhood (Member) commented Jul 13, 2018

I'm going to hold onto this branch, but the direction we're headed in will no longer require forking.

@stuhood (Member) commented Aug 22, 2018

I believe that this is related to #6356, so I'm re-opening it in order to push a rebase and resume progress.

stuhood reopened this on Aug 22, 2018

stuhood force-pushed the twitter:stuhood/fork-lock-in-rust branch from 6c47781 to 650af24 on Aug 22, 2018

@stuhood (Member) commented Aug 22, 2018

This rebased version of the patch has the same "shape" as the old version (with_shutdown methods on resources), although it now needs to deal with significantly more resources. With this many resources in play, with_shutdown "context managers" get a bit hairy, so I'd like to experiment with one other idea before landing this. But I've eagerly pushed it in order to get a CI run.

stuhood force-pushed the twitter:stuhood/fork-lock-in-rust branch 3 times, most recently from e04d1af to d7a4105 on Aug 22, 2018

stuhood added a commit that referenced this pull request Aug 22, 2018

stuhood force-pushed the twitter:stuhood/fork-lock-in-rust branch 2 times, most recently from 2cd2b97 to 166c3c8 on Aug 27, 2018

stuhood force-pushed the twitter:stuhood/fork-lock-in-rust branch from 166c3c8 to c3c8afa on Aug 28, 2018

@stuhood (Member) commented Aug 28, 2018

I didn't see a clear way to do "composition of a bunch of resettable objects" without running into object safety, so I'm planning to land this as-is.
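
To spell out the object-safety issue with a sketch (the trait and function names are assumed): the natural trait for such a resource has a generic with_shutdown method, and a trait with a generic method cannot be used as a trait object, so a heterogeneous collection of resettable resources cannot be composed through it.

// Sketch only: a trait with a generic method is not object safe.
trait WithShutdown {
  fn with_shutdown<F, T>(&self, f: F) -> T
  where
    F: FnOnce() -> T;
}

// This is the composition that does not compile:
//   error[E0038]: the trait `WithShutdown` cannot be made into an object
// fn shutdown_all(resources: &[Box<dyn WithShutdown>], f: &mut FnMut()) { /* ... */ }
//
// Erasing the generics (taking `&mut FnMut()` directly, as in the diff excerpt
// below) restores object safety at the cost of a clunkier signature.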

self.fs_pool.reset();
fn with_shutdown(&self, f: &mut FnMut() -> ()) {
  // TODO: Although we have a Resettable<CpuPool>, we do not shut it down, because our caller
  // will (and attempting to shut things down twice guarantees a deadlock because Resettable is

@stuhood (Member) commented Aug 28, 2018

Rather than deadlock, you'd actually panic: RwLock panics on reentrance.
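
For context, a minimal sketch of what a Resettable looks like, assuming rather than quoting the real implementation: a value behind an RwLock whose write lock is held across the entire reset, which is why shutting the same instance down twice re-enters the lock.

use std::sync::RwLock;

// Minimal sketch of a Resettable<T> (assumed details): the write lock is held
// while the value is dropped, the callback runs, and the value is re-created.
struct Resettable<T> {
  item: RwLock<Option<T>>,
  make: fn() -> T,
}

impl<T> Resettable<T> {
  fn new(make: fn() -> T) -> Resettable<T> {
    Resettable { item: RwLock::new(Some(make())), make }
  }

  // Shut the value down around `f`. Calling this re-entrantly on the same
  // instance (e.g. shutting the CpuPool down twice) re-acquires the write
  // lock from the same thread, which is the failure mode discussed above.
  fn with_reset<F, R>(&self, f: F) -> R
  where
    F: FnOnce() -> R,
  {
    let mut item = self.item.write().unwrap();
    *item = None;
    let result = f();
    *item = Some((self.make)());
    result
  }

  // Ordinary access takes the read lock, so resets cannot race with users.
  fn with<F, R>(&self, f: F) -> R
  where
    F: FnOnce(&T) -> R,
  {
    let item = self.item.read().unwrap();
    f(item.as_ref().expect("Resettable is mid-reset"))
  }
}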

@illicitonion (Contributor) left a comment

Thanks!

stuhood merged commit ec931f8 into pantsbuild:master on Aug 28, 2018

1 check passed

continuous-integration/travis-ci/pr: The Travis CI build passed

stuhood deleted the twitter:stuhood/fork-lock-in-rust branch on Aug 28, 2018

stuhood added this to the 1.9.x milestone on Aug 28, 2018

stuhood added a commit that referenced this pull request Aug 28, 2018

Move fork context management to rust (#5521)