Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closure transfer fix for sched::clone #920

Open
wants to merge 26 commits into
base: master
Choose a base branch
from

Conversation

xurtis
Copy link

@xurtis xurtis commented Jul 4, 2018

Fixes #919

Annotate the correct lifetime ('static) for the closure transferred to
the child process of the clone.

Properly box the closure in the parent then rebox it in the child to
ensure that the closure is dropped at the end of the invocation in the
child rather than at the end of the call to clone in the parent.

For some reason, unwrapping the boxed function itself caused all kinds
of mayhem, so it instead ends up wrapped in an additional box from which
it is then unwrapped.

Copy link
Member

@asomers asomers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to double box cb? I don't understand why you can't just use a single Box.

@xurtis
Copy link
Author

xurtis commented Jul 5, 2018

After looking at how std::thread achieves the same result, I changed the patch to mimic that instead of doing strange dances with boxes. The (much simpler) solution turns out to be just forgetting the boxed function rather than dropping it.

@xurtis
Copy link
Author

xurtis commented Jul 5, 2018

It also appears that even at its lowest level, in order to implement the FnOnce semantics, libstd::thread uses a Box<Box<FnBox()>> which it then unwraps, sends, then boxes again before calling. Still not sure on the why though.

I think it would make more sense to require the callback to be a FnOnce, but I can't see a way of doing that in stable currently.

@asomers
Copy link
Member

asomers commented Jul 5, 2018

I'm skeptical of the mem::forget. That function deliberately leaks memory, and is normally used when you've made a low-level (unsafe) copy of a reference somewhere. But normally at least one copy gets destroyed, and I don't see that happening. Is it really necessary to forget the callback, when it's already static?

@xurtis
Copy link
Author

xurtis commented Jul 5, 2018

In this case the 'static refers to the fact that the function that the box wraps is not bound to the lifetime of any other object (i.e. all items that it refers to have been moved into it). It does not mean that the boxed closure is static.

The existing implementation of clones as you have ti will only work when the child thread does not use the CLONE_VM flag, meaning it gets its own entire copy of the VM. The reason it doesn't work is because the boxed closure will be dropped by the end of the call to clone, as it is moved into that function and that function consumes it. There is only one copy to this box and the child thread gets a reference to it, but if that box is dropped be for the child thread exits, then the child thread could rely entirely on invalid memory and, in most cases, won't even get to the point of executing the closure.

Either, the box needs to be copied (which it can't) or have its ownership transferred to the child process, or forgotten and never dropped at all. The only to transfer ownership to the child process involves wrapping a the boxed closure in an additional box (I'm still not entirely clear on the why there).

@xurtis
Copy link
Author

xurtis commented Jul 5, 2018

Another alternative fix that could be used here is raising that concern to the caller and using a more direct analogous type to what the syscall is actually expecting and have the wrapper take a fn(T) -> isize (a function pointer rather than a boxed closure) and a Box<T> (where T: Send + 'static) instead. This would leave more abstract implementations of executing closures to caller instead, whcih may be preferable in this case.

@asomers
Copy link
Member

asomers commented Jul 5, 2018

I agree. But I'm not comfortable with the double-box solution without knowing why it's necessary. Did you check the history of the file in libstd that uses that technique? That might be instructive.

@xurtis
Copy link
Author

xurtis commented Jul 6, 2018

So, on your recommendation, I'm going back through the years of commits to https://github.com/rust-lang/rust/blob/master/src/libstd/sys/unix/thread.rs and it goes back quite some way through pre-stable rust (back before ~ notation for boxes was a thing).

I asked around on #rust and got some help answering the question. The short version is, don't cast between pointers of static and DSTs. DST pointers are actually 'fat' pointers which are a tuple of a pointer to data and a pointer to a vtable. When those get cast to a pointer to a void, the vtable data is dropped. To resolve this, an additional box should be used, so that you have a pointer to a static type (which contains a DST pointer) that can be safely casted to pointer to a static type, like *mut c_void.

The existing code masked a fatal compile error that would have caught this using a mem::transmute.

@xurtis
Copy link
Author

xurtis commented Jul 6, 2018

I've fixed it up in a way that should disrupt the interface as little as possible using the double boxing. I've also noticed that the way the stack is passed into clone as a &mut [u8] is troublesome as that could mean that child thread outlives the memory for its own stack. Fixing that would break the interface.

xurtis added 11 commits July 6, 2018 21:53
Fixes nix-rust#919

Annotate the correct lifetime (`'static`) for the closure transferred to
the child process of the clone.

Properly box the closure in the parent then rebox it in the child to
ensure that the closure is dropped at the end of the invocation in the
child rather than at the end of the call to `clone` in the parent.

For some reason, unwrapping the boxed function itself caused all kinds
of mayhem, so it instead ends up wrapped in an additional box from which
it is then unwrapped.
This test appears to be incredibly flaky and causing errors in the
`sys::tests::test_sigwait` test which can't be replicated locally.
Copy link
Member

@asomers asomers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that explanation for double-boxing makes sense. I think it'll make even more sense when expressing things using the new dyn Trait syntax introduced by the latest compiler.

Don't forget to add a CHANGELOG entry.

src/sched.rs Outdated
@@ -119,3 +119,34 @@ pub fn setns(fd: RawFd, nstype: CloneFlags) -> Result<()> {

Errno::result(res).map(drop)
}

#[cfg(not)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cfg what?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to see if the tests on travis would pass if the test was disabled. It seems that with this test in place, both this test and the sys::wait::tests::test_waitpid test would fail in various situations, but not in a way that I could replicate locally, even when building for targets that were failing in travis.

src/sched.rs Outdated
#[test]
fn simple_clone() {
// Stack *must* outlive the child.
let mut stack = Vec::new();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May as well combine these lines into let mut stack = vec![0u8; 4096]

@jD91mZM2
Copy link

jD91mZM2 commented Jan 7, 2021

Anyone still working on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants