Crash when compiling Terminix on armhf (and i386) #2022
Comments
|
Probably |
|
With upgrading Terminix to 1.5.2, the i386 issue has fixed itself (rather worked around, I guess), while the FTBFS on armhf persisted. We now have an RC bug against LDC about this, unfortunately, which makes this issue quite high-priority to keep LDC in the next Debian release: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=857085 |
|
@ximion Did you try to cross-compile to armhf to reproduce (for dustmiting)? Not sure about the triple, perhaps |
|
Looks like it's failing again after just slightly modifying the mentioned patch: https://buildd.debian.org/status/fetch.php?pkg=terminix&arch=armhf&ver=1.5.2-3&stamp=1488930891&raw=0 |
|
A way to quickly reproduce this on x86, e.g., via archiving the required files and providing a command line incl. target triple to make it crash, would be extremely helpful so that we can immediately move on to debugging. Terminix doesn't compile on Win64, I already tried that (the parts/dependencies that didn't compile didn't crash, just missing POSIX imports etc.). |
|
Unfortunately Terminix is a large codebase and I can only guess what is relevant for the issue here by the patches I sent and the behavior I observed. |
|
Oh, and just in case: Sorry for the offhand bug description... I guess I was a bit frustrated by hitting compiler bugs so often when writing it (and at that time I hoped to get Terminix updated in Stretch, which I think won't happen now, so this crash came at a really bad time). You're doing an amazing job on LDC, and I'll update the description when I have some better information on what's actually going on (its only current content is pretty much "there's a bug when compiling X" :P) |
Just on a side note: How did we end up with a situation like this in the first place? I thought it was clear that non-x86 support is on a bit of a tentative basis for now. For example, we never really had a CI setup for armhf or PPC on a permanent basis (Kai started to set something up, but it never quite entered the normal development cycle). If I were to choose, I'd rather not have LDC packaged on other platforms at all than x86 support suffering from it. (Of course, we'll want to rectify the CI/testing situation as soon as possible, but until then…) |
|
@klickverbot Debian packages build on all architectures and we are encouraged to support as many platforms as possible. LDC won't get dropped from the Stretch release; before that happens I would rather negotiate something with the release team to drop the faulty architecture. Not sure what it's gonna be yet. Apparently armhf stuff built with LDC works though, so does ppc64el - some support is better than none. On a related note: I think LDC should really get CI set up for multiple architectures. Since D has a foundation now, I think it would be hugely beneficial to get some quota from an arm/x86/amd64 cloud provider and a Jenkins instance to run tests easily (would also help GDC and potentially DMD). I can give you access to Debian porterboxes too, but that is always temporary and not really a good permanent solution. |
kalev
commented
Mar 8, 2017
|
For what it's worth, Fedora is pretty much in the same position and strongly encouraged to support as many architectures as possible. We're currently building ldc for armv7hl, i686, ppc64, ppc64le, x86_64. |
|
Building on other architectures is very welcome – we (myself included) did spend considerable development effort on non-x86 archs, and if it makes it easier for users to evaluate where we stand, then all the better. It's just that we can't offer the same level of support yet as for the production-quality x86 compiler (well, as production-quality as any D compiler is). |
|
Crosscompiling with |
|
Okay, no crash in armhf chroot either, so emulation doesn't work. Could this maybe be another case of NEON being (not) present? |
ximion
changed the title from
Crash when compiling Terminix on i386 and armhf
to
Crash when compiling Terminix on armhf (and i386)
Mar 11, 2017
|
Looks like all porterboxes for armhf are unreachable at this time too... |
|
The porterboxes are accessible again, since we have Dustmite in the archive I will try to create a minimal testcase there. EDIT: Crap, this bug seems to be of the unstable kind, sometimes it happens and sometimes it doesn't, and it seems to especially not-happen when running under GDB. I wonder if Dustmite will yield anything useful under these conditions (it will likely run for many hours, at least :-/ ) |
|
This will run a few days longer - I helped it out a bit by removing stuff, but this bug is very evasive. It doesn't appear under GDB, and the slightest change on the sources can make it disappear. And sometimes it is just flaky for no reason. So something is really weird here. |
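For a crash this flaky, one option is to hand DustMite a test command that retries the compiler several times and counts the reduction as "still reproducing" if any run segfaults. The following is only a minimal sketch of that idea in Python, not what was actually used here: the script name, the attempt count and the command line are all hypothetical, and it assumes a POSIX system and the usual DustMite convention that the test command exits with 0 while the problem is still present.

```python
import signal
import subprocess
import sys


def crashed_with_segv(cmd):
    """Run cmd once and report whether it was killed by SIGSEGV."""
    proc = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    # On POSIX, subprocess reports "killed by signal N" as return code -N.
    return proc.returncode == -signal.SIGSEGV


def reproduces(cmd, attempts=10):
    """Return True as soon as one of `attempts` runs segfaults."""
    return any(crashed_with_segv(cmd) for _ in range(attempts))


# Hypothetical invocation: flaky_test.py <attempts> <compiler command ...>
if __name__ == "__main__" and len(sys.argv) > 2:
    attempts = int(sys.argv[1])
    # Exit 0 means "the reduced sources still crash the compiler".
    sys.exit(0 if reproduces(sys.argv[2:], attempts) else 1)
```

Retrying only lowers the chance of wrongly rejecting a reduction that still crashes; it multiplies the runtime of every step and does nothing about the crash vanishing under GDB.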
|
Darn... Can you get a core dump and maybe gain some idea about the details from the back trace? |
|
Hah, I completely forgot about coredumps ^^ |
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0xb6e1c710 in TemplateInstance::needsCodegen() ()
(gdb) bt full
#0 0xb6e1c710 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#1 0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#2 0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#3 0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#4 0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#5 0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#6 0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#7 0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#8 0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
=> To infinity!
Looks like this recursion never stops - I'll try to maybe generate a better backtrace. |
|
Depending on how flaky the issue is, you should be able to do a release+debug or debug build as well. Then, you could also dump the source location information in the debugger to get further hints as to what causes the issue. Right now, I can only guess that it is a memory corruption issue in the compiler leading to invalid AST... Is the issue reproducible in Valgrind? |
|
More information:
#10786 0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#10787 0xb6e1cc20 in TemplateInstance::needsCodegen() ()
No symbol table info available.
warning: Could not find DWO CU CMakeFiles/LDCShared.dir/gen/declarations.cpp.dwo(0xf48a14f6c605bd93) referenced by CU at offset 0xa68 [in module /usr/lib/debug/.build-id/fc/bc27ce25e3f250c055847441ce0277f089a80c.debug]
#10788 0xb6ef03d0 in CodegenVisitor::visit(TemplateInstance*) () at ./gen/declarations.cpp:448
No locals.
#10789 0xb6ef0956 in Declaration_codegen(Dsymbol*) () at ./gen/declarations.cpp:576
No locals.
warning: Could not find DWO CU CMakeFiles/LDCShared.dir/gen/modules.cpp.dwo(0xb03109d7010ced7c) referenced by CU at offset 0x1a4 [in module /usr/lib/debug/.build-id/fc/bc27ce25e3f250c055847441ce0277f089a80c.debug]
#10790 0xb6e9c770 in codegenModule(IRState*, Module*) () at ./gen/modules.cpp:635
No locals.
warning: Could not find DWO CU CMakeFiles/LDCShared.dir/driver/codegenerator.cpp.dwo(0x178f2d9de69b88a2) referenced by CU at offset 0xc48 [in module /usr/lib/debug/.build-id/fc/bc27ce25e3f250c055847441ce0277f089a80c.debug]
#10791 0xb6efb516 in ldc::CodeGenerator::emit(Module*) () at ./driver/codegenerator.cpp:234
No locals.
warning: Could not find DWO CU CMakeFiles/LDCShared.dir/driver/main.cpp.dwo(0xabaacaa5774a7d6e) referenced by CU at offset 0x864 [in module /usr/lib/debug/.build-id/fc/bc27ce25e3f250c055847441ce0277f089a80c.debug]
#10792 0xb6ee1fc4 in codegenModules(Array<Module*>&) () at ./driver/main.cpp:1047
No locals.
#10793 0xb6d7caa4 in mars_mainBody(Array<char const*>&, Array<char const*>&) ()
No symbol table info available.
#10794 0xb6ee33ae in cppmain(int, char**) () at ./driver/main.cpp:1021
No locals.
#10795 0xb6cc2f7c in D main ()
No symbol table info available.
Any debugging on this machine takes ages... |
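With more than ten thousand frames, a backtrace like the ones above is easier to read once runs of identical frames are collapsed. A rough sketch of such a helper in Python follows; the function name and regular expression are my own and not part of GDB or LDC. It assumes frame lines of the form `#N 0xADDR in function(...) ...`, strips argument lists (so overloads are merged), and ignores the "No symbol table info" and DWO warning lines.

```python
import re
from itertools import groupby

# Matches GDB frame lines such as
#   "#10788 0xb6ef03d0 in CodegenVisitor::visit(TemplateInstance*) () at ./gen/declarations.cpp:448"
# and captures the function name up to the first "(".
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([^(]+?)\s*\(")


def collapse_frames(backtrace):
    """Return (function, run_length) pairs for consecutive identical frames."""
    names = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:  # lines that are not frames are skipped
            names.append(match.group(1))
    return [(name, len(list(run))) for name, run in groupby(names)]


# Frame lines copied from the backtrace in this thread.
SAMPLE = """\
#10786 0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#10787 0xb6e1cc20 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#10788 0xb6ef03d0 in CodegenVisitor::visit(TemplateInstance*) () at ./gen/declarations.cpp:448
No locals.
#10789 0xb6ef0956 in Declaration_codegen(Dsymbol*) () at ./gen/declarations.cpp:576
"""

for name, count in collapse_frames(SAMPLE):
    print(f"{count:6d} x {name}")
```

Fed the full `bt` output, this should boil the recursion down to one `TemplateInstance::needsCodegen` entry with its depth, followed by the handful of caller frames.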
|
Valgrind isn't super useful... EDIT: [removed clutter] |
|
Seems like you ran valgrind for |
|
Yeah, I noticed this right after writing the entry on Github (the suspiciously fast time Valgrind was running got me to examine the thing more). |
|
The uninitialized reads from the GC are benign, but these look potentially interesting (albeit probably not related?):
|
|
Dustmite is removing around 40 source-code lines per day, so we will only need to wait 400 days for this process to minimize everything down to zero... (around 16960 lines still exist) |
|
It should crash on 32-bit x86 with that 'special' terminix src too [at least sometimes], right? Just asking because that would at least be debuggable by us directly. |
|
@kinke I don't know... I was crashing before with a different source, then something was changed and the crash disappeared. It does definitely not crash when cross-compiling. |
|
Our CI systems use x86_64 compilers only except for the Win32 AppVeyor job, that's the only native 32-bit one. On Windows, we don't support shared runtime libs etc. Is the crashing x86 LDC linked against static or shared druntime/Phobos? And what was its D host compiler? Edit: From the logs apparently LDC 1.1 as host compiler too + shared druntime/Phobos. Same for your LDC used on ARM? Note that afaik, the LDCs we test in CI are all linked against static D runtime libs. |
|
Hmm, spurious crashes at program shutdown on OSX and potentially Linux with both shared and static runtime libs have been fixed with LDC 1.2 only (dlang/druntime#1655 and ldc-developers/druntime@6ce3c20). @klickverbot: Right? But we've only seen it fail for OSX so far. That might explain spurious i386 crashes. |
|
@kinke For things in Linux distributions you can almost always assume that shared libraries are used ;-) I might try to compile the thing with LDC 1.2 to see if that fixes the bug... Would be strange though, since the problem seems to be LDC not getting out of a |
|
Yep the ARM issue appears to be something else. |
|
Heureka! After about 5-6 days of runtime, we have a minimized testcase! The build now also fails:
(using ldmd here, but ldc alone crashes as well) Testcase is here: ldc-terminix-sigsegv-armhf.tar.gz So, we are not closer to fixing this bug, but at least LDC gets better in the process... |
Yep, unfortunately this looks very much like a crash on invalid code. |
|
Yep, and DMD 2.073.0 crashes too, so I guess it's a front-end bug (I'm at work)... |
|
I'm afraid I can't give any more information on this issue. Since the crash doesn't happen with GDB, narrowing it down with Dustmite won't work unless the other crash and potentially more are fixed. If there's anything more you need or any idea you have, let me know. |
|
I filed an upstream issue regarding your unrelated latest crash on invalid code. |
|
@kinke Thanks! As per https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=857085#26 it looks like this bug might actually be a regression between LDC 1.1 Beta3/6 and the final LDC 1.1 release... I wonder whether this has something to do with Debian's machines not allowing NEON. In any case, it would be great to have this fixed, but I have no idea how to track it down properly. The best that came out was the GDB backtrace so far... |
|
It looks like the i386 crash is back with Tilix (the latest version of Terminix after its name change): https://buildd.debian.org/status/package.php?p=tilix This might be a generic 32bit problem... |
|
Alright, if time allows, I'll debug it these days with a 32-bit LDC on Ubuntu; the crashing command-line isn't horrendously long... ;) |
|
Last time I could remove at least the X11 stuff without killing the crash as well. But it's pretty great that the issue happens on ia32 again, that makes debugging much easier and potential Dustmiting much faster :-) |
|
I performed some tests on a 32-bit Xubuntu 16.04 Live-DVD in a VM. I extracted the vanilla Tilix 1.5.4 source as well as the GtkD 3.5.1 source and then tested this command-line (I had to add GtkD's
I performed 10 runs with our 1.1.1 release package, 10 runs with our 1.2.0-beta2 package and finally 10 runs with master (compiled by 1.2.0-beta2 and using Ubuntu's LLVM 3.8). And you guessed it, no issues. All of these were linked against static druntime/Phobos fwiw. Edit: I let master rebuild itself and linked it against the shared runtime libs; no problems after |
|
(Trimmed) Valgrind log for the master build linked against shared runtime libs: http://paste.debian.net/926447/ |
|
This one looks potentially non-benign:
|
|
Any news on this? (sorry for nagging, but this issue is critical for the next Debian release, and I need a plan on how to deal with it - ideally that would be fixing the issue, but we could also drop Terminix from the release on armhf). |
|
So far, nobody was able to reproduce the x86 issue yet. I just got an ARM VPS to try and reproduce the issue there (don't have my dev boards handy). No guarantees I'll get anything done over the holidays, though. |
|
Can't reproduce on x86 or
either. With gtk-d and tilix from the Debian sid source repos:
|
|
Side note: Please avoid links to logs/pastes that expire quickly. |
|
@ximion: I'm building a compiler from the 1.1 release to verify, will have the results tomorrow morning. Also retrying with the exact same command line you used (I just used ./configure before). At this point, it looks like we have to consider a miscompilation of the LDC binary you are using. How are you building/bootstrapping the compiler? |
|
also works. |
|
(For reference, this is on MV78460 Marvell Armada XP/370 SoC with 2 GiB RAM running Arch Linux, armv7l-linux-gnueabihf (NEON disabled), GCC 6.3.1, GNU ld 2.28.0.20170322.) |
|
Ping @markos for bootstrapping questions. |
|
Manually bootstrapping from 0.17.3 is what I did above, yes. |
|
Same command line as above works with
as well. |
|
@klickverbot Just to be safe: You're not cross-compiling anything and are on a real machine? |
|
Yes, all happened on the above |
|
I suppose I should try a self-hosted 1.1.1 build. Give me |
built with the above 1.1.1 (i.e. built in turn by 0.17.3) also works. I guess the notification spam on this issue is over for now, with the conclusion that neither @kinke nor me can reproduce the issue on i686 and armhf. If it helps for tracking down any specifics of your setup, I can give you SSH access to the box I've done this on. |
|
@klickverbot You did all your experiments on Arch, right?
on i386 - for the LTS compiler build! So, now I wonder whether this might have something to do with LLVM. The full build log is here: https://buildd.debian.org/status/fetch.php?pkg=ldc&arch=i386&ver=1%3A1.1.1-2&stamp=1492986438&raw=0 |
|
Building ltsmaster works fine on i686 with Ubuntu 16.04 using its LLVM 3.8 libs (static ones apparently by default). |
Yes.
I wouldn't think so – the infinite recursion seems to happen in the frontend (in this case compiled by GCC), so it would have to be something like the AST being corrupted by a memory issue within LLVM, ABI issues messing up the stack due to header/executable issues, etc. That one Valgrind issue from @kinke's post I pointed out above occurs after IR generation is done (where But just in case, I was using the Arch Linux/ARM packages for LLVM earlier, while a straightforward source build is used for the binary releases. I suppose there is a slim chance that the invalid writes in that other Valgrind log end up corrupting the AST in a way for the infinite recursion to happen. Do they still occur with the ltsmaster compiler? Do you know whether the issue occurs on all i686 Debian Sid boxes (cf. us not being able to reproduce it on Ubuntu)? If you swap out the compiler for one from a binary release, does that crash as well? (Unfortunately, I'm a bit short on time right now.) |
I could switch back to LLVM 3.8 instead of 3.9.1 to see if that changes anything... @klickverbot I can probably test these things tomorrow :-) I still find it an interesting detail that this only seems to affect 32bit architectures, as amd64 and ppc64el are fine. |
|
Compiling in an i386 chroot doesn't trigger the same behavior. |
petterreinholdtsen
commented
May 22, 2017
|
LDC and all its dependencies were removed from Debian testing (aka the next release) today because of this issue, see https://tracker.debian.org/pkg/ldc for the latest status. |
|
Yes, and I am not happy with how the release team handled this matter. In any case, I apologize for this and hope to be able to introduce LDC to Stretch backports instead. This issue still needs to be resolved for that to happen, though. FTR: Link to the downstream bug report again https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=857085 |
|
Welp, that's not great. I am not sure what we could have done differently as the upstream devs (apart from directly engaging with release people, which probably they wouldn't like), given that we still can't reproduce the issues on a number of systems/configurations. |
|
(@ximion Many thanks for your work though, it is very much appreciated!) |
There's not really much you could have done - I was discussing this with people on IRC yesterday, and decided to just drop the armhf port, since it didn't look like we could resolve the issue prior to the release. I submitted an upload doing that, and apparently one of the release-team members who listened and commented on our discussion force-removed the package today without giving any prior warning or reason. Regarding the bug, I summarized what we know at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=857085#41 - hopefully we can resolve this somehow. I might just try to build LDC with LLVM in Debian experimental now, to see if that makes a difference (with no chance of getting LDC into the release anymore, I can at least go crazy in unstable to find this bug). Unfortunately I have never been able to reproduce this issue locally, it seems to happen exclusively on Debian buildds. |
I thought Debian was moving towards reproducible builds? (That is, isn't there some way to get a chroot with the same environment locally?) |
|
Yes you can, but that still makes the build not-fail here. Which makes me believe that it might have something to do with the exact architecture used. |
petterreinholdtsen
commented
May 22, 2017
|
[Matthias Klumpp]
Yes you can, but that still makes the build not-fail here. Which makes
me believe that it might have something to do with the exact
architecture used.
Two things you can do in general are to make sure valgrind does not report
any problems when running the program, and ask for access to Coverity
via <URL: https://scan.coverity.com/projects > and fix as many bugs
discovered by Coverity as possible. It would reduce the set of possible
problems that can cause the segfault. :)
…
|
|
@petterreinholdtsen In general good advice, but this particular issue is an infinite recursion which happens for no obvious reason (I looked through the relevant pieces of code). |
|
Building with LLVM 4.0 made the FTBFS of LDC itself on the i386 architecture disappear. Interesting. I am very curious if this maybe fixes the Tilix build as well. |
|
LDC with LLVM 4.0 builds Tilix 1.5.6 flawlessly on all architectures. The LLVM version is also the biggest difference between the Fedora and Debian toolchain. Unfortunately, LLVM 4.0 won't be in the next Debian release, which is why I have not explored that option until now, because it wouldn't have helped the cause of getting LDC to work on Debian 9. The failing Tilix version was 1.5.4 though, before. So, to be 100% certain that LLVM was indeed the culprit here, I will need to build that exact version of Tilix with the updated LLVM 4 LDC on armhf and see if that resolves the issue. I think the answer will be yes. In any case, it now seems to be highly likely that the thing we were after for months is actually a bug somewhere in LLVM, that got resolved with the 4.0 release (or worked around / not triggered). |
|
Okay, I tried to reproduce the issue with the new LLVM4 LDC and the exact same sources on a Debian armhf porterbox that I used before and which was the only place where I could reliably reproduce the bug: The issue did not appear anymore! On a hunch, I also tried the old LDC (same version, but with LLVM 3.8) binary, and the bug also didn't show itself. I also had a manually compiled version lying around which I had used before to reproduce the issue, that was compiled with LLVM 3.9: The bug also was gone with that one. The LLVM changelog reads:
None of these changes look like they could have caused this bug. But I think it's relatively safe to assume now that this issue might not actually be an issue in LDC at all. |
|
I think we can close this. If it ever happens again, I will file a new bug report. Thanks for all your help! |
ximion commented Mar 3, 2017
Another one bites the dust...
See https://buildd.debian.org/status/fetch.php?pkg=terminix&arch=i386&ver=1.4.2-4&stamp=1488569684&raw=0
and https://buildd.debian.org/status/fetch.php?pkg=terminix&arch=armhf&ver=1.4.2-4&stamp=1488575872&raw=0 respectively.
The build started to fail after these three patches had been applied:
https://anonscm.debian.org/git/pkg-gnome/terminix.git/tree/debian/patches/03_check-timeout-null.patch
https://anonscm.debian.org/git/pkg-gnome/terminix.git/tree/debian/patches/04_no-timeout-null.patch
https://anonscm.debian.org/git/pkg-gnome/terminix.git/tree/debian/patches/05_resolve-bug856153.patch