
runtime.LockOSThread() related crash #886

Closed

nsf opened this issue Jun 26, 2010 · 14 comments

nsf commented Jun 26, 2010

The attached code crashes if runtime.LockOSThread() is called; otherwise it works
correctly. The crash output doesn't say anything meaningful.

See also:
http://groups.google.com/group/golang-nuts/browse_thread/thread/ffdabac62768d410

Attachments:

  1. test.go (1920 bytes)

nsf commented Jun 26, 2010

Comment 1:

Forgot the machine spec: 
 113ec27f29f1 release.2010-06-21/release 
 linux 
 386


nsf commented Jun 27, 2010

Comment 2:

Also note: it runs fine with the GC turned off, 
e.g.:
GOGC=off ./test
-----------------------------------------------------------------------
Given that clue, I've tracked the error source down to the ReleaseN function in
'runtime/mcache.c'.
Here is a slightly modified version of that function (note the inserted printf statements):
-----------------------------------------------------------------------
// Take n elements off l and return them to the central free list.
static void
ReleaseN(MCache *c, MCacheList *l, int32 n, int32 sizeclass)
{
    MLink *first, **lp;
    int32 i;
    // Cut off first n elements.
    first = l->list;
    lp = &l->list;
    printf("list: %d (sizeclass: %d, size: %d)\n", n, sizeclass, class_to_size[sizeclass]);
    for(i=0; i<n; i++) {
        printf("lp: %d: %p, %p\n", i, lp, &(*lp)->next);
        lp = &(*lp)->next;
    }
    l->list = *lp;
    *lp = nil;
    l->nlist -= n;
    if(l->nlist < l->nlistmin)
        l->nlistmin = l->nlist;
    c->size -= n*class_to_size[sizeclass];
    // Return them to central free list.
    MCentral_FreeList(&mheap.central[sizeclass], n, first);
}
-----------------------------------------------------------------------
It prints this (almost every time the error happens):
list: 215 (sizeclass: 1, size: 8)
lp: 0: 0xb76fd344, 0xb75fd8c0
lp: 1: 0xb75fd8c0, 0x0
-----------------------------------------------------------------------
It says the number of objects to remove from the list is 215, but the third element is a
null pointer: an obvious null pointer dereference. It always happens with sizeclass == 1,
which has size == 8.
Continuing the bug hunt. Now that I know what is wrong, I need to figure out why.
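For illustration, here is a minimal Go sketch of the same failure mode, using hypothetical
list types (not the runtime's own): the walk trusts a count of 215, but the list only
holds two nodes, so the third step dereferences nil.
----------------------------------------------------------------
package main

// Hypothetical sketch of the crash above: releaseN trusts the caller's
// count n, but the list is shorter, so the walk dereferences nil.

type link struct {
    next *link
}

type list struct {
    head  *link
    nlist int
}

// releaseN cuts the first n elements off l, mirroring ReleaseN above.
func releaseN(l *list, n int) *link {
    first := l.head
    lp := &l.head
    for i := 0; i < n; i++ {
        lp = &(*lp).next // panics once *lp is nil
    }
    l.head = *lp
    *lp = nil
    l.nlist -= n
    return first
}

func main() {
    // Two nodes but a stale count of 215, as in the log above:
    // the loop panics on its third iteration.
    l := &list{head: &link{next: &link{}}, nlist: 215}
    releaseN(l, l.nlist)
}
----------------------------------------------------------------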


nsf commented Jun 27, 2010

Comment 3:

The 'nlist' value appears to be correct, so there is no memory corruption here;
investigating the reason for the null pointer in the middle of the list...
-----------------------------------------------------------------------
Looks like a race condition to me.
stealcache() in mgc0.c runs MCache_ReleaseAll() on all Ms. I guess it is assumed that
all threads are asleep. That is probably a wrong assumption.
I've written a function that checks whether a null pointer exists in the cache list of a
particular mcache and inserted it before MCache_ReleaseAll(). Sometimes it says that
there is no null pointer, but MCache_ReleaseAll() dies right after. I can't explain that
other than by parallel write access to the mcache.
Or a horrible buffer overflow somewhere else. Any hints on that would be helpful. I'm
done for today. :D
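To make the suspicion concrete, here is a sketch of that race with hypothetical cache
types (assumed, simplified; not the runtime's own): an "owner" goroutine keeps mutating
its list while a "stealer" walks n elements off it based on a count read moments earlier,
the same pattern as stealcache()/MCache_ReleaseAll(). It usually dies with the same nil
dereference, even when a check just before the walk finds no nil in the list.
----------------------------------------------------------------
package main

// Hypothetical sketch of the suspected race: the stealer trusts a
// count of list elements that the owner invalidates concurrently.

import "sync"

type link struct{ next *link }

type cache struct {
    head  *link
    nlist int
}

func (c *cache) push() {
    c.head = &link{next: c.head}
    c.nlist++
}

func (c *cache) pop() {
    if c.head != nil {
        c.head = c.head.next
        c.nlist--
    }
}

func main() {
    c := &cache{}
    for i := 0; i < 100; i++ {
        c.push()
    }

    var wg sync.WaitGroup
    wg.Add(1)
    go func() { // the owner: not asleep, unlike what stealcache assumes
        defer wg.Done()
        for i := 0; i < 1000000; i++ {
            c.push()
            c.pop()
            c.pop()
            c.push()
        }
    }()

    // The stealer: walk nlist elements, trusting a stale count.
    for i := 0; i < 1000000; i++ {
        n := c.nlist
        lp := &c.head
        for j := 0; j < n; j++ {
            lp = &(*lp).next // nil deref once the list shrank under us
        }
    }
    wg.Wait()
}
----------------------------------------------------------------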


nsf commented Jun 28, 2010

Comment 4:

Just a random guess, but it appears that this patch fixes the problem:
------------------------------------------------------
diff -r 23d5aba3fbc9 src/pkg/runtime/proc.c
--- a/src/pkg/runtime/proc.c    Tue Jun 22 15:22:49 2010 +0100
+++ b/src/pkg/runtime/proc.c    Mon Jun 28 06:54:44 2010 +0600
@@ -168,7 +168,7 @@
    M *m;
 
    // If g is wired, hand it off directly.
-   if((m = g->lockedm) != nil) {
+   if(sched.mcpu < sched.mcpumax && (m = g->lockedm) != nil) {
        mnextg(m, g);
        return;
    }
------------------------------------------------------
But I have no idea whether it is correct or not. However, it fits. The garbage collector
relies on the fact that mcpumax == 1 will keep Ms from running and hold them asleep. But
for a locked goroutine (e.g. after LockOSThread), this is the only place where that rule
is bypassed, resulting in mcpu == 2 while mcpumax == 1, which is unexpected. The patch
above prevents that unexpected state and therefore the race condition. My test apps show
that the error is gone.
Hope it helps.
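In other words, the scheduling gate the collector depends on looks roughly like this (a
simplified sketch with assumed names; not the runtime's actual code):
----------------------------------------------------------------
package main

// Simplified sketch of the gate described above: the collector sets
// mcpumax = 1 to stop the world and trusts every path that wakes an M
// to check the cap first; the unpatched lockedm hand-off skipped it.

import (
    "fmt"
    "sync"
)

type scheduler struct {
    mu      sync.Mutex
    mcpu    int // Ms currently running Go code
    mcpumax int // cap; the GC drops this to 1
}

// tryRun is the guarded path: an M may only start under the cap.
func (s *scheduler) tryRun() bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.mcpu < s.mcpumax {
        s.mcpu++
        return true
    }
    return false
}

// handoffLocked models the unpatched wired-goroutine path: it woke the
// locked M unconditionally, so mcpu could exceed mcpumax mid-collection.
func (s *scheduler) handoffLocked() {
    s.mu.Lock()
    s.mcpu++ // no cap check: the bug
    s.mu.Unlock()
}

func main() {
    s := &scheduler{mcpu: 1, mcpumax: 1}               // world stopped at one M
    fmt.Println("guarded start allowed:", s.tryRun())  // prints false
    s.handoffLocked()
    fmt.Println("mcpu:", s.mcpu, "mcpumax:", s.mcpumax) // 2 > 1: broken invariant
}
----------------------------------------------------------------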


nsf commented Jun 28, 2010

Comment 5:

A few more tests show that the patch changes the scheduler behaviour a bit, but anyway,
I'm pretty sure that the bug itself is in this exact spot. There is a problem with how
the garbage collector's assumptions and the overall scheduler mechanics work together.
Good luck fixing that. :D


adg commented Jun 29, 2010

Comment 6:

Thanks for looking into this.
Is it possible to cut your test program down to something smaller?

Status changed to WaitingForReply.


rsc commented Jun 29, 2010

Comment 7:

I agree with your analysis.  Thanks for the detailed report.

Owner changed to r...@golang.org.

Status changed to Accepted.


nsf commented Jun 29, 2010

Comment 8:

> Thanks for looking into this.
> Is it possible to cut your test program down to something smaller?
Well, now that I know where the bug is, it's pretty easy to simulate the situation. But
my example is very good, really, because it crashes very often and with a clean null
pointer dereference error that is easy to track down. For example, this code crashes too
(the first thing I was able to come up with):
----------------------------------------------------------------
package main

import (
    "runtime"
    "time"
)

// doALotAllocs allocates continuously so the garbage collector runs often.
func doALotAllocs() {
    for {
        data := make([]int, 65536)
        for i := 0; i < 65536; i++ {
            data[i] = i * i
        }
    }
}

func main() {
    runtime.LockOSThread() // wiring main to its OS thread triggers the bug
    go doALotAllocs()
    for {
        data := make([]int, 65536)
        for i := 0; i < 65536; i++ {
            data[i] = i * i
        }
        time.Sleep(2000) // 2000 ns; time.Sleep took nanoseconds back then
    }
}
----------------------------------------------------------------


rsc commented Jun 30, 2010

Comment 9:

This issue was closed by revision 7743336.

Status changed to Fixed.


nsf commented Jun 30, 2010

Comment 10:

As I said in comment #5, that fix is wrong: it alters scheduler behaviour. Cgo calls use
the same locking mechanics, and this fix prevents cgo from getting extra threads. Since
cgo should receive an extra thread when it ends up in a blocking syscall or the like,
that fix really breaks things. Now, no matter what, you can't get more than GOMAXPROCS
threads (which is 1 by default), and this is really bad. Correct me if I'm wrong.


nsf commented Jun 30, 2010

Comment 11:

See these lines:
runtime/cgocall.c:28
runtime/cgocall.c:29
Cgo uses the same locking mechanics as LockOSThread, and therefore later, in
'entersyscall', it uses the same code paths in the scheduler. This is why your fix
affects the way cgo calls behave. I can't say exactly how the patch affects the
behaviour, but apparently it's _you_ who is supposed to know that.
If you know someone who is familiar with the scheduler's guts, please ask them to review
your patch. The current way the GC interacts with the scheduler (using mcpu/mcpumax) is
totally broken.
Also note that what I'm talking about probably also causes issue #785.
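A hypothetical sketch of that point (assumed, simplified; the real path lives in
runtime/cgocall.c and is not exposed like this): from the scheduler's perspective, a cgo
call behaves like LockOSThread plus a system call, so any change to the locked-goroutine
path also changes how cgo calls get rescheduled.
----------------------------------------------------------------
package main

// callC is an illustration only, not the runtime's API: it models how
// a cgo call wires the goroutine to its OS thread before the foreign
// call, which is why it shares the lockedm code paths discussed above.

import "runtime"

func callC(fn func()) {
    runtime.LockOSThread() // wire the goroutine to this M, as cgocall.c does
    defer runtime.UnlockOSThread()
    // In the real runtime, entersyscall()/exitsyscall() would bracket
    // the foreign call here, using the same scheduler paths as above.
    fn()
}

func main() {
    callC(func() { /* stand-in for a blocking C call */ })
}
----------------------------------------------------------------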


rsc commented Jun 30, 2010

Comment 12:

I don't understand what you mean in #5, #10, or #11.

If GOMAXPROCS is 1, calls into C are treated the same as system calls: they don't count
against the one active Go thread, and if on system call exit there is already an active
Go thread, the system call exit blocks until it can run again.

The bug fix was that using LockOSThread was not respecting the cpumax, so that this kind
of system call exit would not block, and you'd have Go code running at the same time as
the "stop the world" garbage collector.
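A sketch of that rule (simplified, with assumed names; not the runtime's actual code):
entering a system call or C call gives up the thread's slot under mcpumax, and exiting
must re-acquire it, blocking while another Go thread or the stop-the-world collector
holds the cap.
----------------------------------------------------------------
package main

// Sketch of the syscall-exit rule described above: exitsyscall blocks
// until running Go code again fits under the cap. The bug was that a
// locked goroutine skipped this wait.

import (
    "fmt"
    "sync"
)

type scheduler struct {
    mu      sync.Mutex
    cond    *sync.Cond
    mcpu    int
    mcpumax int
}

func newScheduler(max int) *scheduler {
    s := &scheduler{mcpumax: max}
    s.cond = sync.NewCond(&s.mu)
    return s
}

// entersyscall: the thread stops counting against the cap while it is
// blocked in the kernel (or in C).
func (s *scheduler) entersyscall() {
    s.mu.Lock()
    s.mcpu--
    s.cond.Broadcast()
    s.mu.Unlock()
}

// exitsyscall: wait until running Go code fits under the cap again.
func (s *scheduler) exitsyscall() {
    s.mu.Lock()
    for s.mcpu >= s.mcpumax {
        s.cond.Wait()
    }
    s.mcpu++
    s.mu.Unlock()
}

func main() {
    s := newScheduler(1)
    s.exitsyscall()  // acquire the single slot to run Go code
    s.entersyscall() // release it around a blocking call
    s.exitsyscall()  // re-acquire it on the way back into Go code
    fmt.Println("back in Go code")
}
----------------------------------------------------------------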


nsf commented Jun 30, 2010

Comment 13:

Well, I don't quite understand what's happening either, but the thing is: when I applied
this patch, I saw different behaviour in cgo-based applications. For example, my
'gomandel' mandelbrot demo became very unresponsive, which basically means that while it
was doing hard CPU work, the main OpenGL thread for some reason wasn't active.
On the other hand, it looks like there is the correct number of threads and other apps
behave correctly. Still, it gives me reason to think that something is not the same as it
was before. I guess we should wait until the release then and see whether people complain
about cgo issues. If there are no complaints, then good.


nsf commented Jul 1, 2010

Comment 14:

OK, I've finally figured out what wasn't right. It turns out that the scheduler behaviour
was buggy before that fix, and it works as expected now.
Thanks a lot for the fix (although I did most of it :D).

nsf added the fixed label on Jul 1, 2010
nsf assigned rsc on Jul 1, 2010
golang locked and limited conversation to collaborators on Jun 24, 2016
rsc removed their assignment on Jun 22, 2022
This issue was closed.