runtime.LockOSThread() related crash #886
Also, note: it runs fine with the GC turned off, e.g.: GOGC=off ./test

-----------------------------------------------------------------------
Given that clue I've tracked an error source down to the 'ReleaseN' function in 'runtime/mcache.c'. Here is a slightly modified version of that function (notice the inserted printf statements):

-----------------------------------------------------------------------
// Take n elements off l and return them to the central free list.
static void
ReleaseN(MCache *c, MCacheList *l, int32 n, int32 sizeclass)
{
	MLink *first, **lp;
	int32 i;

	// Cut off first n elements.
	first = l->list;
	lp = &l->list;
	printf("list: %d (sizeclass: %d, size: %d)\n", n, sizeclass, class_to_size[sizeclass]);
	for(i=0; i<n; i++) {
		printf("lp: %d: %p, %p\n", i, lp, &(*lp)->next);
		lp = &(*lp)->next;
	}
	l->list = *lp;
	*lp = nil;
	l->nlist -= n;
	if(l->nlist < l->nlistmin)
		l->nlistmin = l->nlist;
	c->size -= n*class_to_size[sizeclass];

	// Return them to central free list.
	MCentral_FreeList(&mheap.central[sizeclass], n, first);
}
-----------------------------------------------------------------------

It prints this (almost every time the error happens):

list: 215 (sizeclass: 1, size: 8)
lp: 0: 0xb76fd344, 0xb75fd8c0
lp: 1: 0xb75fd8c0, 0x0
-----------------------------------------------------------------------

It says the number of objects to remove from the list is 215, but the third object is a null pointer. Obviously a null pointer dereference. It happens _always_ with sizeclass == 1, which has size == 8. Continuing the bug hunting process... Knowing what is wrong, I now need to figure out why.
The 'nlist' value appears to be correct, so no memory corruption there; investigating the reason for the null pointer in the middle of the list...

-----------------------------------------------------------------------
This looks like a race condition to me. stealcache() in mgc0.c runs MCache_ReleaseAll() on all ms. I guess it is assumed that all threads are asleep. Probably a wrong assumption. I've written a function that checks whether a null pointer exists in a cache list for a particular mcache and inserted it before MCache_ReleaseAll(). Sometimes it says that there is no null pointer, but MCache_ReleaseAll() dies after that. I can't explain it other than parallel write access to the mcache, or a horrible buffer overflow somewhere else. Any hints on that will be helpful. I'm done for today. :D
Just a random guess, but it appears that this patch fixes the problem:

------------------------------------------------------
diff -r 23d5aba3fbc9 src/pkg/runtime/proc.c
--- a/src/pkg/runtime/proc.c	Tue Jun 22 15:22:49 2010 +0100
+++ b/src/pkg/runtime/proc.c	Mon Jun 28 06:54:44 2010 +0600
@@ -168,7 +168,7 @@
 	M *m;
 
 	// If g is wired, hand it off directly.
-	if((m = g->lockedm) != nil) {
+	if(sched.mcpu < sched.mcpumax && (m = g->lockedm) != nil) {
 		mnextg(m, g);
 		return;
 	}
------------------------------------------------------

But I have no idea whether it is correct or not. However, it fits. The garbage collector relies on the fact that mcpumax == 1 will prevent ms from running and keep them asleep. But in the case of a locked goroutine (e.g. after LockOSThread) this is the only place that bypasses that rule, resulting in mcpu == 2 while mcpumax == 1, which is unexpected. The patch above fixes that unexpected result and therefore prevents the race condition. My test apps show that the error is gone. Hope it helps.
I agree with your analysis. Thanks for the detailed report.

Owner changed to r...@golang.org.
Status changed to Accepted.
> Thanks for looking into this.
> Is it possible to cut your test program down to something smaller?

Well, since I now know where the bug is, it's pretty easy to simulate that situation. But my example is a very good one, really, because it crashes very often and with a clean null pointer dereference error that is easy to track down. For example, this code crashes too (the first thing I was able to come up with):

----------------------------------------------------------------
package main

import (
	"runtime"
	"time"
)

func doALotAllocs() {
	for {
		data := make([]int, 65536)
		for i := 0; i < 65536; i++ {
			data[i] = i * i
		}
	}
}

func main() {
	runtime.LockOSThread()
	go doALotAllocs()
	for {
		data := make([]int, 65536)
		for i := 0; i < 65536; i++ {
			data[i] = i * i
		}
		time.Sleep(2000)
	}
}
----------------------------------------------------------------
This issue was closed by revision 7743336.
Status changed to Fixed.
As I said in comment #5, that fix is wrong: it alters scheduler behaviour. Cgo calls use the same locking mechanics, and this fix prevents cgo from getting extra threads. Since cgo should receive an extra thread when it ends up in a blocking syscall or similar, that fix really breaks things. Now, no matter what, you can't get more than GOMAXPROCS threads (which is 1 by default), and this is really bad. Correct me if I'm wrong.
See these lines:

runtime/cgocall.c:28
runtime/cgocall.c:29

They use the same locking mechanics as LockOSThread, and therefore 'entersyscall' later takes the same code paths in the scheduler. This is why your fix affects the way cgo calls behave. I can't say exactly how the patch affects the behaviour, but apparently it's _you_ who is supposed to know that. If you know someone who is familiar with the scheduler's guts, please ask them to review your patch. The current way the GC interacts with the scheduler (using mcpu/mcpumax) is totally broken. Also note that what I'm talking about probably causes issue #785.
I don't understand what you mean in #5, #10, or #11. If GOMAXPROCS is 1, calls into C are treated the same as system calls: they don't count against the one active Go thread, and if on system call exit there is already an active Go thread, the system call exit blocks until it can run again. The bug fix was that using LockOSThread was not respecting the cpumax, so that this kind of system call exit would not block, and you'd have Go code running at the same time as the "stop the world" garbage collector. |
Well, I don't quite understand what's happening either, but the thing is: when I applied this patch, I experienced different behaviour in cgo-based applications. For example, my 'gomandel' mandelbrot demo became very unresponsive, which basically means that while it was doing hard CPU work, the main OpenGL thread for some reason wasn't active. On the other hand, it looks like there is a correct number of threads and other apps behave correctly. However, it gives me reason to think that something is not the same as it was before. I guess we should wait until the release then and see whether people complain about cgo issues (if any). If there are no complaints, then good.
This issue was closed.