
runtime.LockOSThread() related crash #886

Closed

nsf opened this issue Jun 26, 2010 · 14 comments

nsf commented Jun 26, 2010

The attached code crashes if runtime.LockOSThread() is called; otherwise it works
correctly. The crash output doesn't say anything meaningful.

See also:
http://groups.google.com/group/golang-nuts/browse_thread/thread/ffdabac62768d410

Attachments:

  1. test.go (1920 bytes)

nsf commented Jun 26, 2010

Comment 1:

Forgot the machine spec: 
 113ec27f29f1 release.2010-06-21/release 
 linux 
 386


nsf commented Jun 27, 2010

Comment 2:

Also note: it runs fine with the GC turned off, 
e.g.:
GOGC=off ./test
-----------------------------------------------------------------------
Given that clue, I've tracked the error source down to the ReleaseN function in
'runtime/mcache.c'.
Here is a slightly modified version of that function (note the inserted printf statements):
-----------------------------------------------------------------------
// Take n elements off l and return them to the central free list.
static void
ReleaseN(MCache *c, MCacheList *l, int32 n, int32 sizeclass)
{
    MLink *first, **lp;
    int32 i;
    // Cut off first n elements.
    first = l->list;
    lp = &l->list;
    printf("list: %d (sizeclass: %d, size: %d)\n", n, sizeclass, class_to_size[sizeclass]);
    for(i=0; i<n; i++) {
        printf("lp: %d: %p, %p\n", i, lp, &(*lp)->next);
        lp = &(*lp)->next;
    }
    l->list = *lp;
    *lp = nil;
    l->nlist -= n;
    if(l->nlist < l->nlistmin)
        l->nlistmin = l->nlist;
    c->size -= n*class_to_size[sizeclass];
    // Return them to central free list.
    MCentral_FreeList(&mheap.central[sizeclass], n, first);
}
-----------------------------------------------------------------------
It prints this (almost every time the error happens):
list: 215 (sizeclass: 1, size: 8)
lp: 0: 0xb76fd344, 0xb75fd8c0
lp: 1: 0xb75fd8c0, 0x0
-----------------------------------------------------------------------
It says the number of objects to remove from the list is 215, but the third element is a
null pointer: an obvious null pointer dereference. It always happens with sizeclass == 1,
which has size == 8.
Continuing the bug hunt. Now that I know what is wrong, I need to figure out why.
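For illustration, here is a minimal Go sketch of the same failure mode, using hypothetical
list types (not the runtime's own): the walk trusts a count of 215, but the list only
holds two nodes, so the third step dereferences nil.
----------------------------------------------------------------
package main

// Hypothetical sketch of the crash above: releaseN trusts the caller's
// count n, but the list is shorter, so the walk dereferences nil.

type link struct {
    next *link
}

type list struct {
    head  *link
    nlist int
}

// releaseN cuts the first n elements off l, mirroring ReleaseN above.
func releaseN(l *list, n int) *link {
    first := l.head
    lp := &l.head
    for i := 0; i < n; i++ {
        lp = &(*lp).next // panics once *lp is nil
    }
    l.head = *lp
    *lp = nil
    l.nlist -= n
    return first
}

func main() {
    // Two nodes but a stale count of 215, as in the log above:
    // the loop panics on its third iteration.
    l := &list{head: &link{next: &link{}}, nlist: 215}
    releaseN(l, l.nlist)
}
----------------------------------------------------------------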


nsf commented Jun 27, 2010

Comment 3:

The 'nlist' value appears to be correct, so there is no memory corruption here;
investigating the reason for the null pointer in the middle of the list...
-----------------------------------------------------------------------
Looks like a race condition to me.
stealcache() in mgc0.c runs MCache_ReleaseAll() on all Ms. I guess it is assumed that
all threads are asleep. That is probably a wrong assumption.
I've written a function that checks whether a null pointer exists in the cache list of a
particular mcache and inserted it before MCache_ReleaseAll(). Sometimes it says that
there is no null pointer, but MCache_ReleaseAll() dies right after. I can't explain that
other than by parallel write access to the mcache.
Or a horrible buffer overflow somewhere else. Any hints on that would be helpful. I'm
done for today. :D
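To make the suspicion concrete, here is a sketch of that race with hypothetical cache
types (assumed, simplified; not the runtime's own): an "owner" goroutine keeps mutating
its list while a "stealer" walks n elements off it based on a count read moments earlier,
the same pattern as stealcache()/MCache_ReleaseAll(). It usually dies with the same nil
dereference, even when a check just before the walk finds no nil in the list.
----------------------------------------------------------------
package main

// Hypothetical sketch of the suspected race: the stealer trusts a
// count of list elements that the owner invalidates concurrently.

import "sync"

type link struct{ next *link }

type cache struct {
    head  *link
    nlist int
}

func (c *cache) push() {
    c.head = &link{next: c.head}
    c.nlist++
}

func (c *cache) pop() {
    if c.head != nil {
        c.head = c.head.next
        c.nlist--
    }
}

func main() {
    c := &cache{}
    for i := 0; i < 100; i++ {
        c.push()
    }

    var wg sync.WaitGroup
    wg.Add(1)
    go func() { // the owner: not asleep, unlike what stealcache assumes
        defer wg.Done()
        for i := 0; i < 1000000; i++ {
            c.push()
            c.pop()
            c.pop()
            c.push()
        }
    }()

    // The stealer: walk nlist elements, trusting a stale count.
    for i := 0; i < 1000000; i++ {
        n := c.nlist
        lp := &c.head
        for j := 0; j < n; j++ {
            lp = &(*lp).next // nil deref once the list shrank under us
        }
    }
    wg.Wait()
}
----------------------------------------------------------------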


nsf commented Jun 28, 2010

Comment 4:

Just a random guess, but it appears that this patch fixes the problem:
------------------------------------------------------
diff -r 23d5aba3fbc9 src/pkg/runtime/proc.c
--- a/src/pkg/runtime/proc.c    Tue Jun 22 15:22:49 2010 +0100
+++ b/src/pkg/runtime/proc.c    Mon Jun 28 06:54:44 2010 +0600
@@ -168,7 +168,7 @@
    M *m;
 
    // If g is wired, hand it off directly.
-   if((m = g->lockedm) != nil) {
+   if(sched.mcpu < sched.mcpumax && (m = g->lockedm) != nil) {
        mnextg(m, g);
        return;
    }
------------------------------------------------------
But I have no idea whether it is correct or not. However, it fits. The garbage collector
relies on the fact that mcpumax == 1 will keep Ms from running and hold them asleep. But
for a locked goroutine (e.g. after LockOSThread), this is the only place where that rule
is bypassed, resulting in mcpu == 2 while mcpumax == 1, which is unexpected. The patch
above prevents that unexpected state and therefore the race condition. My test apps show
that the error is gone.
Hope it helps.
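In other words, the scheduling gate the collector depends on looks roughly like this (a
simplified sketch with assumed names; not the runtime's actual code):
----------------------------------------------------------------
package main

// Simplified sketch of the gate described above: the collector sets
// mcpumax = 1 to stop the world and trusts every path that wakes an M
// to check the cap first; the unpatched lockedm hand-off skipped it.

import (
    "fmt"
    "sync"
)

type scheduler struct {
    mu      sync.Mutex
    mcpu    int // Ms currently running Go code
    mcpumax int // cap; the GC drops this to 1
}

// tryRun is the guarded path: an M may only start under the cap.
func (s *scheduler) tryRun() bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.mcpu < s.mcpumax {
        s.mcpu++
        return true
    }
    return false
}

// handoffLocked models the unpatched wired-goroutine path: it woke the
// locked M unconditionally, so mcpu could exceed mcpumax mid-collection.
func (s *scheduler) handoffLocked() {
    s.mu.Lock()
    s.mcpu++ // no cap check: the bug
    s.mu.Unlock()
}

func main() {
    s := &scheduler{mcpu: 1, mcpumax: 1}               // world stopped at one M
    fmt.Println("guarded start allowed:", s.tryRun())  // prints false
    s.handoffLocked()
    fmt.Println("mcpu:", s.mcpu, "mcpumax:", s.mcpumax) // 2 > 1: broken invariant
}
----------------------------------------------------------------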


nsf commented Jun 28, 2010

Comment 5:

A few more tests show that the patch changes the scheduler behaviour a bit, but anyway,
I'm pretty sure that the bug itself is in this exact spot. There is a problem with how
the garbage collector's assumptions and the overall scheduler mechanics work together.
Good luck fixing that. :D


adg commented Jun 29, 2010

Comment 6:

Thanks for looking into this.
Is it possible to cut your test program down to something smaller?

Status changed to WaitingForReply.


rsc commented Jun 29, 2010

Comment 7:

I agree with your analysis.  Thanks for the detailed report.

Owner changed to r...@golang.org.

Status changed to Accepted.


nsf commented Jun 29, 2010

Comment 8:

> Thanks for looking into this.
> Is it possible to cut your test program down to something smaller?
Well, now that I know where the bug is, it's pretty easy to simulate the situation. But
my example is very good, really, because it crashes very often and with a clean null
pointer dereference error that is easy to track down. For example, this code crashes too
(the first thing I was able to come up with):
----------------------------------------------------------------
package main

import (
    "runtime"
    "time"
)

// doALotAllocs allocates continuously so the garbage collector runs often.
func doALotAllocs() {
    for {
        data := make([]int, 65536)
        for i := 0; i < 65536; i++ {
            data[i] = i * i
        }
    }
}

func main() {
    runtime.LockOSThread() // wiring main to its OS thread triggers the bug
    go doALotAllocs()
    for {
        data := make([]int, 65536)
        for i := 0; i < 65536; i++ {
            data[i] = i * i
        }
        time.Sleep(2000) // 2000 ns; time.Sleep took nanoseconds back then
    }
}
----------------------------------------------------------------


rsc commented Jun 30, 2010

Comment 9:

This issue was closed by revision 7743336.

Status changed to Fixed.


nsf commented Jun 30, 2010

Comment 10:

As I said in comment #5, that fix is wrong: it alters scheduler behaviour. Cgo calls use
the same locking mechanics, and this fix prevents cgo from getting extra threads. Since
cgo should receive an extra thread when it ends up in a blocking syscall or the like,
that fix really breaks things. Now, no matter what, you can't get more than GOMAXPROCS
threads (which is 1 by default), and this is really bad. Correct me if I'm wrong.


nsf commented Jun 30, 2010

Comment 11:

See these lines:
runtime/cgocall.c:28
runtime/cgocall.c:29
Cgo uses the same locking mechanics as LockOSThread, and therefore later, in
'entersyscall', it uses the same code paths in the scheduler. This is why your fix
affects the way cgo calls behave. I can't say exactly how the patch affects the
behaviour, but apparently it's _you_ who is supposed to know that.
If you know someone who is familiar with the scheduler's guts, please ask them to review
your patch. The current way the GC interacts with the scheduler (using mcpu/mcpumax) is
totally broken.
Also note that what I'm talking about probably also causes issue #785.
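A hypothetical sketch of that point (assumed, simplified; the real path lives in
runtime/cgocall.c and is not exposed like this): from the scheduler's perspective, a cgo
call behaves like LockOSThread plus a system call, so any change to the locked-goroutine
path also changes how cgo calls get rescheduled.
----------------------------------------------------------------
package main

// callC is an illustration only, not the runtime's API: it models how
// a cgo call wires the goroutine to its OS thread before the foreign
// call, which is why it shares the lockedm code paths discussed above.

import "runtime"

func callC(fn func()) {
    runtime.LockOSThread() // wire the goroutine to this M, as cgocall.c does
    defer runtime.UnlockOSThread()
    // In the real runtime, entersyscall()/exitsyscall() would bracket
    // the foreign call here, using the same scheduler paths as above.
    fn()
}

func main() {
    callC(func() { /* stand-in for a blocking C call */ })
}
----------------------------------------------------------------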


rsc commented Jun 30, 2010

Comment 12:

I don't understand what you mean in #5, #10, or #11.

If GOMAXPROCS is 1, calls into C are treated the same as system calls: they don't count
against the one active Go thread, and if on system call exit there is already an active
Go thread, the system call exit blocks until it can run again.

The bug fix was that using LockOSThread was not respecting the cpumax, so that this kind
of system call exit would not block, and you'd have Go code running at the same time as
the "stop the world" garbage collector.
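A sketch of that rule (simplified, with assumed names; not the runtime's actual code):
entering a system call or C call gives up the thread's slot under mcpumax, and exiting
must re-acquire it, blocking while another Go thread or the stop-the-world collector
holds the cap.
----------------------------------------------------------------
package main

// Sketch of the syscall-exit rule described above: exitsyscall blocks
// until running Go code again fits under the cap. The bug was that a
// locked goroutine skipped this wait.

import (
    "fmt"
    "sync"
)

type scheduler struct {
    mu      sync.Mutex
    cond    *sync.Cond
    mcpu    int
    mcpumax int
}

func newScheduler(max int) *scheduler {
    s := &scheduler{mcpumax: max}
    s.cond = sync.NewCond(&s.mu)
    return s
}

// entersyscall: the thread stops counting against the cap while it is
// blocked in the kernel (or in C).
func (s *scheduler) entersyscall() {
    s.mu.Lock()
    s.mcpu--
    s.cond.Broadcast()
    s.mu.Unlock()
}

// exitsyscall: wait until running Go code fits under the cap again.
func (s *scheduler) exitsyscall() {
    s.mu.Lock()
    for s.mcpu >= s.mcpumax {
        s.cond.Wait()
    }
    s.mcpu++
    s.mu.Unlock()
}

func main() {
    s := newScheduler(1)
    s.exitsyscall()  // acquire the single slot to run Go code
    s.entersyscall() // release it around a blocking call
    s.exitsyscall()  // re-acquire it on the way back into Go code
    fmt.Println("back in Go code")
}
----------------------------------------------------------------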


nsf commented Jun 30, 2010

Comment 13:

Well, I don't quite understand what's happening either, but the thing is: when I applied
this patch, I saw different behaviour in cgo-based applications. For example, my
'gomandel' mandelbrot demo became very unresponsive, which basically means that while it
was doing hard CPU work, the main OpenGL thread for some reason wasn't active.
On the other hand, it looks like there is the correct number of threads and other apps
behave correctly. Still, it gives me reason to think that something is not the same as it
was before. I guess we should wait until the release then and see whether people complain
about cgo issues. If there are no complaints, then good.


nsf commented Jul 1, 2010

Comment 14:

OK, I've finally figured out what wasn't right. It turns out that the scheduler behaviour
was buggy before that fix, and it works as expected now.
Thanks a lot for the fix (although I did most of it :D).

nsf added the fixed label on Jul 1, 2010
nsf assigned rsc on Jul 1, 2010
golang locked and limited conversation to collaborators on Jun 24, 2016
rsc removed their assignment on Jun 22, 2022
This issue was closed.