Test failures installing NativeCall with Rakudo-2003.03 #23

Closed
mrhdias opened this Issue Mar 28, 2013 · 28 comments

7 participants

@mrhdias

Rakudo-2003.03
nqp-2003.03
parrot-5.1.0

$ panda install NativeCall
==> Fetching NativeCall
==> Building NativeCall
Compiling lib/NativeCall.pm6
==> Testing NativeCall
t/01-argless.t ......... ok
t/02-simple-args.t ..... ok
t/03-simple-returns.t .. ok
t/04-pointers.t ........ ok
t/05-arrays.t .......... ok
t/06-struct.t .......... ok
t/07-writebarrier.t .... ok
t/08-callbacks.t ....... Failed 4/9 subtests
Test Summary Report
t/08-callbacks.t (Wstat: 11 Tests: 5 Failed: 0)
Non-zero wait status: 11
Parse errors: Bad plan. You planned 9 tests but ran 5.
Files=8, Tests=90, 22 wallclock secs ( 0.07 usr 0.01 sys + 19.63 cusr 2.06 csys = 21.77 CPU)
Result: FAIL
test stage failed for NativeCall: Tests failed

@timo

Running the tests with just perl6 without prove shows, that it outputs "ok 1" through "ok 5" and then just "ok" for the rest and then segfaults after the ninth test case.

@arnsholt
Collaborator

Thanks for chiming in as well, timo. I'm not having luck reproducing the issue so far (unfortunately), but could you do two things, please:

1) Run Perl 6 under gdb and show the backtrace you get from the segfault. IIRC, it seemed to be GC-related.
2) Wrap the tests in 08-callbacks.t in a loop, say for ^10 { ... }, and see if the segfault still happens at the very end or somewhere in the middle.

@mrhdias could you do the same, as well? I think it'd be especially useful to have backtraces from both of you, to see where exactly the error crops up.

Oh, and @timo, which Parrot are you running on?

@timo

Good eye. Here's what happens with a for ^10:

ok 19 - struct callback string argument
ok zsh: segmentation fault (core dumped) perl6 -Iblib/lib t/08-callbacks.t

Here's the stacktrace: https://gist.github.com/timo/630c0dbfce0f6eeb50d9

My parrot is built from the RELEASE_5_1_0 tag.

@arnsholt
Collaborator

Hmm. So you're both on Parrot 5.1.0, while I'm on 4.10.0 (What you get from nqp --gen-parrot), so it might be a Parrot thing. Thanks for the backtrace (and especially for having compiled stuff with debug symbols! That helps!).

@arnsholt
Collaborator

So bad the bad news are that I've so far been unable to reproduce this bug on my machine (using both RELEASE_5_1_0 and RELEASE_5_2_0). On the other hand, I'm pretty certain there is a bug here. It's just a bit hard to nail down when I can't reproduce it locally. From the looks of it, a PMC is malformed (I'm assuming it's a PMC created in connection with callbacks) and it blows up when it's checked for GC.

Also, @moritz has run into what appears to be the same bug in some different code, so it looks to be a quite serious bug.

@arnsholt
Collaborator

Update: I've managed to reproduce the bug (finally), on both 5.1.0 and 4.10.0. I've still no idea what's going on, but now that I can reproduce it, I can find out.

For those interested, adding pir::sweep__vi(1); after the first testcase (TakeACallback(&simple_callback);) should trigger the segfault, so it seems to be a problem in something relatively basic (I suspect the callback caching mechanism).

@leto

This leads me to believe that this bug has been there all along, but it is only triggered by a GC sweep. This may be some kind of bug in the manual escape analysis of NQP.

I am not trying to throw blame around, though. It could be a bug in the Parrot GC, but right now, it is hard to tell.

@arnsholt
Collaborator

I agree, this is definitely an old bug.

Given that placing the pir::sweep before the first use of callbacks doesn't trigger the fault, but after it does trigger it, I'm inclined to believe that this is a problem in my C code at either end of the callback handling. Either I do something weird when creating the function pointer passed to the C code, or something weird when setting up the call back into Parrot when the callback is invoked. Do you have any suggestions on how I can track down where the corrupted (assuming that's what it is) PMC originates?

@leto

@arnsholt is there any way you can try to create the smallest test case that exhibits the bug? That will make things a bit easier to debug.

Also, it may tells us more details about exactly when this lovely little bug gets triggered.

@arnsholt
Collaborator

The simplest testcase I can think of right now is stripping away everything but the first testcase from the callbacks file: https://gist.github.com/arnsholt/5295730

Initially I tried to do something with atexit, but apparently that didn't live in my libc.so.6.

@jnthn suggested it might be die to a missing write barrier; does that sound plausible? On the other hand, I think all the PMC manipulations I do are through the VTABLE macros, in which case they should write barrier themselves, no?

@arnsholt
Collaborator

@leto: Is there a doc somewhere the outlines which kinds of operations require write-barriering?

@leto

@arnsholt sadly this is ill-documented and not well understood by developers that have worked outside the GC subsystem. I suggest asking @bacek and possibly @Benabik

@bacek

op nqp_native_call_build is breaking contract with GC by directly poking inside GCable without setting write barrier before doing it.

@pmichaud
Collaborator

Are the terms of the "contract with GC" written down anywhere? ;-)

Should this be added as a Parrot issue?

Pm

@leto

@pmichaud I suspect we have out of date and/or wrong docs somewhere, the C comments in the GC subsystem are the most likely to correspond to reality. Yes, I would love to help improve the GC docs :)

@pmichaud: If I can translate @bacek, he is saying that the NQP opcode native_call_build is not obeying the semantics of setting a write barrier before doing any operations that mutate state in a GCable. The barrier needs to be removed at the end of any possible state change, as well.

@arnsholt
Collaborator

So, trying to figure this one out again.

So, if I'm understanding this correctly, I need to do:

 PARROT_GC_WRITE_BARRIER(interp, my_pmc);
 /* Mutate PMC here. */
/* Remove barrier here. How? */

It's not clear to me how to remove the barrier though. Grepping through parrot/include the only thing I found that looks relevant is PARROT_GC_WRITE_BARRIER.

And docs on the contract here would be great. Which files should I start with, if source-diving is the only option?

@leto

There is no need to remove a barrier. The barrier only has effect on the single use of the argument you provide it, as far as I understand. Here is the definition of the macro:

include/parrot/gc_api.h
34:#define PARROT_GC_WRITE_BARRIER(i, p) do { if (PObj_GC_need_write_barrier_TEST((p))) Parrot_gc_write_barrier((i), (p)); } while(0)

This is my interpretation of the "write barrier": don't allow writing to anything that could possibly change the memory behind the data in the given object, while it is being looked up.

We need things like this because of parallel algorithms in the GC. We can't have the rug pulled out from under us while we are searching through the fibers of the rug.

Does that make sense, @arnsholt ?

@arnsholt
Collaborator

Right, so the write barrier registers some piece of code to be run when the object is written to.

The next question then, is: What kind of operations require me to set a write-barrier (and when, relative to my code)?

@arnsholt arnsholt closed this May 18, 2013
@arnsholt arnsholt reopened this May 18, 2013
@arnsholt
Collaborator

Whoops. Hit the close and comment rather than comment button.

@leto

This is a better philosophy: if you have any question about whether a
write barrier is needed, use a write barrier.

Later, as an optimization, you can make a pass through and remove
unnecessary write barriers. This would preferably have a code review
from a Parrot internals dev. Sometimes your code will work on one system without
a write barrier, but it will fail horribly on a different architecture,
or a machine with a different amount or layout of memory, etc...

Needless to say, it is best to just always use a write barrier and then
later remove them selectively with lots of cross-platform testing.

@Benabik

The write barrier isn't registering code. It's just marking that the GC needs to pay more attention to the object in question.

Generational GCs attempt to gain some performance by not having to check every object on the theory that older objects change far less frequently. The downside is that if you don't tell it you changed an old object, it won't check it when doing reachability checks and may collect things it shouldn't.

The simple rule of thumb is you need a write barrier whenever you alter an object. The more complex one is that you need a write barrier when you alter a pointer to another object. Individual write barriers are pretty cheap, and the cost for missing a needed one is fairly high, so you might as well add them liberally.

@arnsholt
Collaborator

Right, that makes sense. An object in old-space will not point to an object in new-space unless it's mutated.

So first I mutate my object, then I mark it with the WRITE_BARRIER macro. Which mutations require a write barrier though? I guess anything which changes which objects are marked during the mark phase, but are there any of those, apart from the obvious ones I know about from my object's internals?

Also, should I write barrier immediately after changing the internals, or is there some timing slack?

@leto

Just to recap, nqp_native_call_build is directly modifying the internals of a GCable (just about everything, including any object) instead of using a public API. This "breaking of a contract", which in my opinion is barely documented at best, means that nefarious GC bugs exist.

They will only be triggered on special memory configurations, on particular OS's, on certain hardware, with particular memory sizes and only on the waxing of a gibbous moon. Nasal demons usually appear, too.

This is not a Parrot bug, but I will gladly help get this stuff documented in a more useful way and help in any other way that I can.

@arnsholt
Collaborator

I agree, this isn't a Parrot bug. But it's not entirely clear to me why native_call_build is breaking the contract? Yes, it's manipulating the internals of an object, but it's an object of a known type whose implementation and structure we control. And it's not really reaching into the privates of the PMC, it's modifying a struct that's pointed to by the PMC data pointer, which I thought was for this kind of stuff.

Sure, I could move the logic to a function somewhere else, but the implementation wouldn't change noticeably just because of that.

@leto

"it's modifying a struct that's pointed to by the PMC data pointer"

You are breaking the contract by doing that, but not setting a write barrier.

@arnsholt
Collaborator

Right, right. There is a write barrier set, at the very end of the function, but that might be too late, I guess. Unfortunately, hoisting the write barrier up to just after where arg_info[i] is set (line 881) doesn't fix the problem (I tried immediately before as well, for good measure). It still segfaults in the same way.

In the op, $1 is the PMC representing the native function we're invoking and body is the PMC's PMC_data. The only GCables contained in the struct are the non-NULL elements of arg_info.

Unless I'm completely misunderstanding how to set the write barriers, it seems to me native_call_build isn't the culprit.

@arnsholt
Collaborator

My currently minimal test case for this bug: https://gist.github.com/arnsholt/5690787

@arnsholt
Collaborator

The latest NQP commit (nqp/61d80b9) should fix this problem. If you could confirm that it's fixed for you too, that'd be great.

@arnsholt arnsholt closed this Jun 6, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment