Linux x86/x86_64 - chance of hang in the tests at process exit #16

Closed
harningt opened this issue Feb 18, 2012 · 23 comments

Comments

@harningt

It seems that some of the tests have a problem closing out when Lua is tearing itself down.

32-bit machine uname -a: Linux pinetrail 3.0.0-13-generic #22-Ubuntu SMP Wed Nov 2 13:25:36 UTC 2011 i686 i686 i386 GNU/Linux
64-bit machine uname -a: Linux ionic 3.0.0-13-generic #22-Ubuntu SMP Wed Nov 2 13:27:26 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

I have tried GDB on a 32-bit and 64-bit machine and it complains that it cannot track the threads.

Starting program: /usr/bin/lua5.1 tests/errhangtest.lua
[Thread debugging using libthread_db enabled]
Cannot find new threads: generic error

It seems that each of the tests has a chance of hitting this hang. If you know any trick for getting around this generic error, let me know.

Encountered at revision: v3.0-beta-10-g60a0e70 60a0e70

@benoit-germain
Member

I am afraid that someone more knowledgeable about pthreads than I am will have to look into this, as I don't have the slightest idea what the problem could be.

@davidm

davidm commented Apr 27, 2012

I notice this on Ubuntu as well when trying to set up tests for LuaDist (LuaDist/Repository#76). The problem does not seem to occur on OS X or Cygwin.

@benoit-germain
Member

Just curious if 32aa701 fixes it?

@benoit-germain
Member

Forget it, I believe the fix is more complex. In fact, I'm not even sure I can avoid resorting to a dirty hack to fix it.

@benoit-germain
Member

v3.1.6 contains a fix. Let's hope there is only one such bug :-).

@hinrik

hinrik commented Oct 27, 2012

I tried the latest git version and this issue is still present for me (LuaJIT 2.0.0 beta9 and Lua 5.1.5 on Linux 3.2.0, Debian Squeeze).

@benoit-germain
Member

I've been trying to set up a bootable Debian USB drive to check this, but without success so far (the network doesn't work yet). But I haven't forgotten :-).

@benoit-germain
Member

It looks like this crash at application shutdown occurs when the main thread invokes atexit_close_keepers(). The crash disappears when I don't register it anymore. It is as if the function pointer is invoked after the lanes SO is unloaded.
I suppose the unload happens when the main Lua state is closed, and therefore before the handler is called.
I don't know why this doesn't crash on Windows. Maybe the handling differs on Windows and the DLL doesn't actually get unloaded, although the respective documentation says the behavior is basically the same.

I'll do some tests with this behavior removed and see how things fare.

@mkottman
Contributor

I'd like to comment on this:

Starting program: /usr/bin/lua5.1 tests/errhangtest.lua
[Thread debugging using libthread_db enabled]
Cannot find new threads: generic error

This is because Lua is not compiled with pthread support. When Lua is loaded into gdb, gdb sees there is no support for pthreads and sets itself up accordingly. When you later load a pthread-enabled module into Lua, gdb gets confused and spits out this error.

I usually handle it by compiling a custom version of Lua with pthread support and debug symbols enabled, which I call luad so it does not mess with my system Lua:

--- lua-5.2.1/src/Makefile  2012-03-09 17:32:16.000000000 +0100
+++ lua-5.2.1-pthread/src/Makefile  2012-11-21 10:03:55.284051778 +0100
@@ -103,7 +103,7 @@
 generic: $(ALL)

 linux:
-   $(MAKE) $(ALL) SYSCFLAGS="-DLUA_USE_LINUX" SYSLIBS="-Wl,-E -ldl -lreadline -lncurses"
+   $(MAKE) $(ALL) SYSCFLAGS="-DLUA_USE_LINUX -pthread -ggdb" SYSLIBS="-Wl,-E -ldl -lreadline -lncurses -pthread -ggdb"

 macosx:
    $(MAKE) $(ALL) SYSCFLAGS="-DLUA_USE_MACOSX" SYSLIBS="-lreadline"

@benoit-germain
Member

Another option is to add this to your .gdbinit so that gdb always loads pthread by itself even if the debugged image isn't pthread-enabled:
set env LD_PRELOAD /lib/libpthread.so.0

@benoit-germain
Member

fixed by f154e1f

@hinrik

hinrik commented Nov 23, 2012

I hate to be the bearer of bad news, but this didn't fix the issue for me. errhangtest.lua still hangs (or segfaults) at exit sometimes. It happens on two different x86-64 Debian machines I have. If you can't reproduce it, I can give you shell access to one of them (just drop me a line at hinrik.sig@gmail.com or literal on irc.freenode.net) so you can troubleshoot it.

@benoit-germain
Member

Ok I reproduced it with Debian Squeeze amd64 as well. However, it won't hang with Debian Squeeze i386.

@benoit-germain
Member

Here is the callstack I get:

(gdb) bt
#0  0x00007ffff7bce1fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007ffff6d4da5b in THREAD_WAIT_IMPL () from /usr/local/lib/lua/5.1/lanes/core.so
#2  0x00007ffff6d44940 in selfdestruct_gc () from /usr/local/lib/lua/5.1/lanes/core.so
#3  0x0000000000408696 in ?? ()
#4  0x0000000000408ba9 in ?? ()
#5  0x000000000040a39f in ?? ()
#6  0x000000000040aa28 in ?? ()
#7  0x0000000000408287 in ?? ()
#8  0x000000000040e14e in lua_close ()
#9  0x00000000004041c1 in main ()

If I change the selfdestruct_gc code at line 1213 to perform full selfdestruct chain processing, as on Windows, the application hangs much less often, but I still get an occasional crash:

(gdb) r errhangtest.lua
Starting program: /usr/bin/lua errhangtest.lua
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff6b11700 (LWP 7242)]
true    true
false   tried to copy unsupported types
oh boy

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6b11700 (LWP 7242)]
0x00007ffff6d43ccf in ?? ()
(gdb) bt
#0  0x00007ffff6d43ccf in ?? ()
#1  0x00007ffff6b11700 in ?? ()
#2  0x000000000065d720 in ?? ()
#3  0x000000000065d720 in ?? ()
#4  0x0000000000000002 in ?? ()
#5  0x00000000006503f0 in ?? ()
#6  0x00007ffff6b11700 in ?? ()
#7  0x00007ffff6b11700 in ?? ()
#8  0x00007ffff6b11700 in ?? ()
#9  0x0000000000000000 in ?? ()

It looks like the crash occurs inside the timer lane's thread (which is the only one the test creates).
But I certainly don't know why there is a difference between the 32-bit and 64-bit builds in that regard.
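
The first backtrace above shows the main thread blocked in pthread_cond_wait() underneath selfdestruct_gc, which is why the process never exits if the wake-up never comes. As a generic illustration only (this is not the Lanes code, and the done_signal type and its functions are hypothetical names invented for this sketch), a shutdown path can bound such a wait with pthread_cond_timedwait() so that a missed signal turns into a timeout instead of a hang:

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical completion flag shared between a worker and the waiter. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            done;
} done_signal;

static void done_signal_init(done_signal *s)
{
    pthread_mutex_init(&s->mutex, NULL);
    pthread_cond_init(&s->cond, NULL);
    s->done = false;
}

/* Worker side: record completion and wake the waiter. */
static void done_signal_set(done_signal *s)
{
    pthread_mutex_lock(&s->mutex);
    s->done = true;
    pthread_cond_signal(&s->cond);
    pthread_mutex_unlock(&s->mutex);
}

/* Waiter side: returns 0 once `done` is set, or ETIMEDOUT if
 * timeout_ms milliseconds elapse first. */
static int done_signal_wait(done_signal *s, long timeout_ms)
{
    /* pthread_cond_timedwait takes an absolute deadline; a condition
     * variable created with default attributes uses CLOCK_REALTIME. */
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec  += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    int rc = 0;
    pthread_mutex_lock(&s->mutex);
    /* Loop to absorb spurious wake-ups; stop on timeout or error. */
    while (!s->done && rc == 0)
        rc = pthread_cond_timedwait(&s->cond, &s->mutex, &deadline);
    pthread_mutex_unlock(&s->mutex);

    return s->done ? 0 : rc;
}
```

A caller that gets ETIMEDOUT back still has to decide what to do with the thread that did not finish, so this only converts a silent hang into a visible condition; it does not explain why the wake-up was missed.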

@benoit-germain
Member

I caused the program to crash while inside gdb and inspected the process. Here is what I see:

benoit@benoit-germain-debian64:~$ lsof -p 10238
COMMAND   PID   USER   FD   TYPE DEVICE SIZE/OFF   NODE NAME
lua     10238 benoit  cwd    DIR    8,1     4096  82584 /home/benoit/lanes-master/tests
lua     10238 benoit  rtd    DIR    8,1     4096      2 /
lua     10238 benoit  txt    REG    8,1   167904 463577 /usr/bin/lua5.1
lua     10238 benoit  mem    REG    8,1   286776 434221 /lib/libncurses.so.5.7
lua     10238 benoit  mem    REG    8,1  1437064 434187 /lib/libc-2.11.3.so
lua     10238 benoit  mem    REG    8,1   273840 434307 /lib/libreadline.so.6.1
lua     10238 benoit  mem    REG    8,1    14696 434199 /lib/libdl-2.11.3.so
lua     10238 benoit  mem    REG    8,1   530736 434200 /lib/libm-2.11.3.so
lua     10238 benoit  mem    REG    8,1   131258 434182 /lib/libpthread-2.11.3.so
lua     10238 benoit  mem    REG    8,1   128744 434183 /lib/ld-2.11.3.so
lua     10238 benoit    0u   CHR  136,0      0t0      3 /dev/pts/0
lua     10238 benoit    1u   CHR  136,0      0t0      3 /dev/pts/0
lua     10238 benoit    2u   CHR  136,0      0t0      3 /dev/pts/0
lua     10238 benoit    3r  FIFO    0,8      0t0  29125 pipe
lua     10238 benoit    4w  FIFO    0,8      0t0  29125 pipe

As you can see, lanes/core.so is no longer loaded, so it was unloaded before all objects were garbage collected, including the one Lanes registers so that its __gc metamethod performs thread cleanup. This seems to be related to a known Lua issue that has existed since Lua 5.1 and is fixed in Lua 5.2.1. But again, why does it not happen 100% of the time, and why should it work fine on 32-bit flavors?
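
The same check that lsof performs from outside can also be made from inside the process. The following is a minimal sketch, assuming a Linux system where /proc/self/maps is readable; is_mapped is a hypothetical helper written for this illustration and is not part of Lua or Lanes:

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if any memory mapping of the current process mentions
 * `needle`, 0 if none does, and -1 if /proc/self/maps cannot be
 * opened (for example on a non-Linux system). */
static int is_mapped(const char *needle)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps == NULL)
        return -1;

    /* One mapping per line; the buffer is assumed to be large enough
     * for the address fields plus a full path. */
    char line[4096];
    int found = 0;
    while (fgets(line, sizeof line, maps) != NULL) {
        if (strstr(line, needle) != NULL) {
            found = 1;
            break;
        }
    }
    fclose(maps);
    return found;
}
```

In a host program that embeds Lua, calling something like is_mapped("lanes/core.so") just before and just after lua_close() would be one way to see at which point the module disappears from the address space.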

@mwild1

mwild1 commented Aug 12, 2013

I can reproduce this crash on 32-bit. Do you have any ideas for how this might be fixed on the Lanes side?

@benoit-germain
Member

A simple repro case would help answer this. If I can debug it, I should be able to see what's wrong. But so far I haven't had any issue (I don't work on Linux, and even a few tries in VirtualBox didn't crash).

@benoit-germain
Member

Just in case: is it fixed in version 3.6.4?

@hinrik

hinrik commented Oct 2, 2013

Not for me. Here's what happens on both 3.6.4 and 3.6.6:

$ lua errhangtest.lua 
true    true
false   tried to copy unsupported types
oh boy
Segmentation fault

Got the same result on both Lua 5.1.5 and 5.2.2 (Debian jessie x86-64)

@benoit-germain
Member

I am somewhat stumped. I have a 64-bit Debian 6.0.6 VirtualBox VM where it works just fine:

$ lua errhangtest.lua
3.6.6
true    true
false   tried to copy unsupported types
oh boy

@benoit-germain
Member

938ee19 fixes a possible shutdown sequence crash. I don't really think this could be the actual issue, since it fixes something related to the protect_allocator feature, but who knows :-). Can someone give it a try?

@hinrik

hinrik commented Oct 9, 2013

It works now. On both 5.1.5 and 5.2.2.

@benoit-germain
Member

w00t!
