deadlock in je_prof_boot2 #585
Using libgcc creates a deadlock on start. See jemalloc/jemalloc#585.
Thanks for the report. Can you grab a stack trace of the deadlock? I would not be shocked if we're letting some glibc assumptions sneak in. |
Sure.
$ gdb a.out
(gdb) run
Starting program: /jemalloc/jemalloc-4.4.0/a.out
^C
Program received signal SIGINT, Interrupt.
__syscall () at src/internal/x86_64/syscall.s:13
13 src/internal/x86_64/syscall.s: No such file or directory.
(gdb) bt
#0 __syscall () at src/internal/x86_64/syscall.s:13
#1 0x0000000000460cc8 in __timedwait_cp (addr=addr@entry=0x670024 <init_lock+4>, val=val@entry=-2147483632, clk=clk@entry=0, at=at@entry=0x0,
priv=priv@entry=128) at src/thread/__timedwait.c:31
#2 0x0000000000460d51 in __timedwait (addr=addr@entry=0x670024 <init_lock+4>, val=-2147483632, clk=clk@entry=0, at=at@entry=0x0,
priv=priv@entry=128) at src/thread/__timedwait.c:43
#3 0x000000000045fc70 in __pthread_mutex_timedlock (m=0x670020 <init_lock>, at=at@entry=0x0) at src/thread/pthread_mutex_timedlock.c:27
#4 0x000000000045fba3 in __pthread_mutex_lock (m=m@entry=0x670020 <init_lock>) at src/thread/pthread_mutex_lock.c:11
#5 0x00000000004030a2 in je_malloc_mutex_lock (tsdn=0x0, mutex=0x670020 <init_lock>) at include/jemalloc/internal/mutex.h:101
#6 malloc_init_hard () at src/jemalloc.c:1480
#7 0x0000000000406531 in malloc_init () at src/jemalloc.c:317
#8 ialloc_body (slow_path=true, usize=<synthetic pointer>, tsdn=<synthetic pointer>, zero=false, size=4728) at src/jemalloc.c:1577
#9 malloc (size=size@entry=4728) at src/jemalloc.c:1641
#10 0x000000000045dd21 in start_fde_sort (count=589, accu=0x7fffffffdf00)
at /home/buildozer/aports/main/gcc/src/gcc-6.2.0/libgcc/unwind-dw2-fde.c:409
#11 init_object (ob=0x670140 <object>) at /home/buildozer/aports/main/gcc/src/gcc-6.2.0/libgcc/unwind-dw2-fde.c:771
#12 search_object (ob=ob@entry=0x670140 <object>, pc=pc@entry=0x45cf31 <_Unwind_Backtrace+55>)
at /home/buildozer/aports/main/gcc/src/gcc-6.2.0/libgcc/unwind-dw2-fde.c:961
#13 0x000000000045e4ff in _Unwind_Find_registered_FDE (bases=0x7fffffffe298, pc=0x45cf31 <_Unwind_Backtrace+55>)
at /home/buildozer/aports/main/gcc/src/gcc-6.2.0/libgcc/unwind-dw2-fde.c:1025
#14 _Unwind_Find_FDE (pc=0x45cf31 <_Unwind_Backtrace+55>, bases=bases@entry=0x7fffffffe298)
at /home/buildozer/aports/main/gcc/src/gcc-6.2.0/libgcc/unwind-dw2-fde-dip.c:454
#15 0x000000000045bb0c in uw_frame_state_for (context=context@entry=0x7fffffffe1f0, fs=fs@entry=0x7fffffffe040)
at /home/buildozer/aports/main/gcc/src/gcc-6.2.0/libgcc/unwind-dw2.c:1241
#16 0x000000000045c769 in uw_init_context_1 (context=context@entry=0x7fffffffe1f0, outer_cfa=outer_cfa@entry=0x7fffffffe4a0,
outer_ra=0x44b195 <je_prof_boot2+37>) at /home/buildozer/aports/main/gcc/src/gcc-6.2.0/libgcc/unwind-dw2.c:1562
#17 0x000000000045cf32 in _Unwind_Backtrace (trace=trace@entry=0x4434e0 <prof_unwind_init_callback>, trace_argument=trace_argument@entry=0x0)
at /home/buildozer/aports/main/gcc/src/gcc-6.2.0/libgcc/unwind.inc:283
#18 0x000000000044b195 in je_prof_boot2 (tsd=tsd@entry=0x681478 <builtin_tls+24>) at src/prof.c:2272
#19 0x000000000040330a in malloc_init_hard () at src/jemalloc.c:1501
#20 0x00000000004001f5 in malloc_init () at src/jemalloc.c:317
#21 jemalloc_constructor () at src/jemalloc.c:2801
#22 0x000000000045e92d in libc_start_init () at src/env/__libc_start_main.c:61
#23 0x000000000045e960 in __libc_start_main (main=0x4003b1 <main>, argc=1, argv=0x7fffffffe568) at src/env/__libc_start_main.c:71
#24 0x00000000004002aa in _start_c (p=<optimized out>) at crt/crt1.c:17
#25 0x0000000000400282 in _start () |
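Frame #17 in the trace above is a call to `_Unwind_Backtrace` with a callback. For readers unfamiliar with that interface, here is a minimal sketch of how a caller drives it through `<unwind.h>`. The names `count_frame` and `count_stack_frames` are hypothetical and are not taken from jemalloc; the sketch only counts frames and does not reproduce what `prof_unwind_init_callback` does.

```c
#include <stddef.h>
#include <unwind.h>

/* Hypothetical callback: the unwinder invokes it once per frame it visits.
 * `arg` is the pointer passed as the second argument of _Unwind_Backtrace. */
static _Unwind_Reason_Code count_frame(struct _Unwind_Context *ctx, void *arg) {
    size_t *count = arg;
    (void)ctx;            /* the context is not inspected in this sketch */
    (*count)++;
    return _URC_NO_REASON; /* tell the unwinder to keep walking */
}

/* Hypothetical helper: walks the current stack and returns the frame count. */
size_t count_stack_frames(void) {
    size_t count = 0;
    _Unwind_Backtrace(count_frame, &count);
    return count;
}
```

The relevant point for this issue is that the first such call may make the unwinder set up internal state, which is where the `malloc` in frame #9 comes from.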
Interestingly, switching to GCC intrinsics causes an immediate segfault when profiling is enabled (we're still linking to glibc, not musl).
Oh, interesting; I had assumed this was a musl thing. Incidentally, how badly does this bug hurt you? I've been delaying looking into it because some of our other efforts might fix it along the way, but I'll reprioritize if this is actually blocking you on something. |
We ran into this bug in an attempt to make CockroachDB support older Linux distributions (namely CentOS 6), and broadly, there are two ways to do this:
So we're somewhat stuck - we'd ideally like to provide both kinds of binaries, but there is no configuration of jemalloc that works with both. Fixing this issue would allow us to use libgcc with both musl and glibc, which would hopefully resolve the segfault issue. |
Can you see if --enable-prof-libunwind (instead of --enable-prof) works for you? I think the issue with gcc intrinsics is fundamental, and the one with libgcc will be a while to fix. |
Actually, I suppose that if you're shipping a custom musl, you can compile it without -fomit-frame-pointer; that might get the gcc intrinsics working. |
Adding Somewhat expectedly,
Any other ideas? |
Just to double-check: you were building musl with -fno-omit-frame-pointer, not jemalloc? @jasone, any ideas?
Oh, no, I did not try rebuilding musl. Also to double check: you're suggesting that musl built with -fno-omit-frame-pointer would work with libgcc (and not deadlock)? Can you help me understand why you'd expect that to work?
Is it possible to install libunwind? That's the best bet. Otherwise you could experiment with moving the block of code that calls |
Quick update: I'm now using a cross-compilation toolchain with gcc 6.3.0 (built with crosstool-ng) to target musl 1.1.16, and I get the deadlock whether I use libgcc or gcc intrinsics. Regarding libunwind: anything's possible, but documentation seems scant, and I'm also not sure what is meant by libunwind - are we talking about http://www.nongnu.org/libunwind/ or https://github.com/llvm-mirror/libunwind ? |
We're talking about the project at http://www.nongnu.org/libunwind/ . In my experience it has worked well other than on obscure old platforms and ARM-based systems with otherwise brittle toolchains. |
This allows musl builds to avoid profiling which causes deadlock. See jemalloc/jemalloc#585.
My guess is that musl compiled with frame pointers + gcc intrinsics (not libgcc) may work. My experience with the gcc intrinsics is that they'll happily crash the process if confused by the stack layout. Musl compiles with -fomit-frame-pointer by default, which is the sort of thing that can confuse the intrinsics.
@davidtgoldblatt recall that our problem with musl is deadlock, not crash. We were seeing the crash with gcc intrinsics, debian jessie glibc 2.19, and gcc 4.9.2. We are no longer seeing this crash with gcc 4.9.3 targeting glibc 2.12. Going back to the original issue here: I've gone and disabled jemalloc profiling in our musl builds since we see these deadlocks with both libgcc and gcc intrinsics. For now, this is not worth the headache of installing libunwind. So, again, it would be great if you guys could figure out what's causing this deadlock. |
Sorry, I omitted some context -- when doing the repro, I tried it with the gcc intrinsics + musl and saw crashes that went away with the change to musl compilation. The deadlock itself is pretty straightforward - grabbing a backtrace with libgcc may call malloc (at https://github.com/gcc-mirror/gcc/blob/1cb6c2eb3b8361d850be8e8270c597270a1a7967/libgcc/unwind-dw2-fde.c#L437 in this case), and we can't in general handle reentrancy (with this being particularly true during bootstrapping). |
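The reentrancy described above can be reproduced in miniature. This is a sketch under stated assumptions, not jemalloc code: `toy_lock`, `toy_malloc`, `toy_backtrace`, and `toy_init` are hypothetical stand-ins for `init_lock`, `malloc`, the libgcc unwinder, and `malloc_init_hard`. The toy lock reports a second acquisition instead of blocking, so the example terminates where a real non-recursive mutex would hang.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static bool toy_lock_held = false; /* stand-in for a non-recursive init lock */
static int toy_reentries = 0;      /* how often the lock was re-requested */
static char toy_arena[256];        /* backing storage for the toy allocator */

/* Returns false where a real non-recursive mutex would block forever. */
static bool toy_lock(void) {
    if (toy_lock_held) {
        toy_reentries++;
        return false;
    }
    toy_lock_held = true;
    return true;
}

static void toy_unlock(void) {
    assert(toy_lock_held);
    toy_lock_held = false;
}

/* Stand-in for malloc(): it needs the init lock before it can serve memory. */
void *toy_malloc(size_t size) {
    if (size > sizeof(toy_arena) || !toy_lock()) {
        return NULL;
    }
    toy_unlock();
    return toy_arena;
}

/* Stand-in for the unwinder, which may allocate on its first use. */
static void toy_backtrace(void) {
    (void)toy_malloc(64);
}

/* Stand-in for allocator bootstrap: takes a backtrace while holding the lock.
 * Returns the number of reentrant lock attempts observed. */
int toy_init(void) {
    toy_reentries = 0;
    if (!toy_lock()) {
        return -1;
    }
    toy_backtrace(); /* re-enters toy_malloc() with the lock already held */
    toy_unlock();
    return toy_reentries;
}
```

Calling `toy_init()` reports one reentrant attempt, which corresponds to frames #5 through #18 of the trace: the second `malloc_init_hard` waits on a lock that the first one still holds.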
How come there's no deadlock with glibc? |
So, the proximate cause is that musl generates a call to __register_frame_info_bases, so that unseen_objects (in unwind-dw2-fde.c in libgcc) is non-null in musl at the time of the backtrace, but null in glibc. When the unseen_objects list is nonempty (i.e. in bootstrapping with musl), _Unwind_Find_FDE will go through it, and initialize a sorted array of pointers to allow subsequent lookups to use binary search instead of linear search. This array is malloc'd on first search attempt, causing the reentry. The upshot is, the search code checks for malloc failure and falls back to linear search in that case, so we can fix this by detecting reentrancy early on in bootstrapping and returning null. @jasone, any problems leaping out at you?
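A rough sketch of the proposed fix, with hypothetical names throughout (`toy_alloc`, `toy_set_bootstrapping`, `toy_find`); none of this is libgcc or jemalloc code. The allocator returns null while a bootstrapping flag is set, and the lookup falls back to a linear scan when it cannot allocate its sorted index. Unlike libgcc, this sketch rebuilds the sorted copy on every call rather than caching it, purely to stay short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

static bool toy_bootstrapping = false; /* hypothetical reentrancy flag */

void toy_set_bootstrapping(bool on) {
    toy_bootstrapping = on;
}

/* Hypothetical allocator entry point: refuses requests during bootstrap
 * instead of trying to take a lock the caller may already hold. */
void *toy_alloc(size_t size) {
    if (toy_bootstrapping) {
        return NULL;
    }
    return malloc(size);
}

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Returns 1 if key was found by binary search over a sorted copy,
 * 2 if it was found by the linear fallback, and 0 if it is absent. */
int toy_find(const int *items, size_t n, int key) {
    int *sorted = toy_alloc(n * sizeof *sorted);
    if (sorted != NULL) {
        memcpy(sorted, items, n * sizeof *sorted);
        qsort(sorted, n, sizeof *sorted, cmp_int);
        bool hit = bsearch(&key, sorted, n, sizeof *sorted, cmp_int) != NULL;
        free(sorted);
        return hit ? 1 : 0;
    }
    /* Allocation failed: scan the unsorted input directly. */
    for (size_t i = 0; i < n; i++) {
        if (items[i] == key) {
            return 2;
        }
    }
    return 0;
}
```

With the flag cleared, `toy_find` takes the binary-search path; with it set, the same lookup still succeeds through the linear path, which is the property the fix relies on.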
If we return |
No, it will keep trying to allocate on each search if it failed the first time (link: https://github.com/gcc-mirror/gcc/blob/035409c33a6cf53ea48956f723c3e7ef2c68a04b/libgcc/unwind-dw2-fde.c#L983 ). I agree about reentry. Though, note that our current plans can't handle "reentrancy during bootstrapping". I think tracking this down has pushed me into the "we should have a lock-free base allocator" camp.
Potentially same issue as this one? redis/redis#3799 |
Hmm, probably not; I've only ever seen this one manifest as blocking deadlocks or crashes. I'll jump in on the redis issue. |
Thank you @davidtgoldblatt |
It seems that I've encountered the same problem, even when I use a static libunwind.a to compile.
Do you have a plan to fix this?
It's something I'd like to get to in the abstract, but realistically the interactions with libc during bootstrapping are always going to be complex enough that the whack-a-mole game with uncommon libcs is hard to justify. I've got some ideas about how to re-do bootstrapping to fix this class of issues in a more principled way, but we're stretched pretty thin at the moment. |
@davidtgoldblatt Thanks for your attention. Hope that you will get fatter (joking). Best wishes!
According to jemalloc/jemalloc#585, enabling memory profiling can cause a deadlock on some platforms and with some versions of glibc, so this PR disables it by default for safety. Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
An atexit call can allocate, which may cause a deadlock like jemalloc#585. Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
Occurs when building with profiling enabled on Alpine Linux with gcc 6.2. Full repro (I've left in the entire configure output in case that's useful):
cc @petermattis @bdarnell