Crash when compiling Terminix on armhf (and i386) #2022

Closed
ximion opened this Issue Mar 3, 2017 · 76 comments

Member

kinke commented Mar 4, 2017

Probably a duplicate of, or at least highly related to, #1996.

Contributor

ximion commented Mar 7, 2017

With upgrading Terminix to 1.5.2, the i386 issue has fixed itself (rather worked around, I guess), while the FTBFS on armhf persisted.
Oddly, after fixing an unrelated build failure on ppc64el via this patch: https://github.com/ximion/terminix/commit/da12d1322ac0d94c62527b93a804d48c1da0e78d builds started working without LDC segfault on armhf too.
Because of this I assume this crash is triggered by LDC doing something bad with different integer types on the respective architectures.

We now have an RC bug against LDC about this, unfortunately, which makes this issue quite high-priority to keep LDC in the next Debian release: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=857085
Maybe it can be downgraded if a workaround is found, but having a fix would make me (and the release team) happier for sure.
Unfortunately I can't dustmite this on armhf since I don't have this architecture here (maybe a Raspberry Pi would do to reproduce...)

Member

JohanEngelen commented Mar 7, 2017

@ximion Did you try to cross-compile to armhf to reproduce (for dustmiting)? Not sure about the triple, perhaps -mtriple=arm-linux-gnueabihf.

Contributor

ximion commented Mar 7, 2017

Looks like it's failing again after just slightly modifying the mentioned patch: https://buildd.debian.org/status/fetch.php?pkg=terminix&arch=armhf&ver=1.5.2-3&stamp=1488930891&raw=0
Meh... But ppc64el works now. This is the stupidest game of whack-a-mole I've ever played.

Member

kinke commented Mar 8, 2017

A way to quickly reproduce this on x86, e.g., via archiving the required files and providing a command line incl. target triple to make it crash, would be extremely helpful so that we can immediately move on to debugging. Terminix doesn't compile on Win64, I already tried that (the parts/dependencies that didn't compile didn't crash, just missing POSIX imports etc.).

Contributor

ximion commented Mar 8, 2017

Unfortunately Terminix is a large codebase and I can only guess what is relevant for the issue here by the patches I sent and the behavior I observed.
The best shot is likely to fire Dustmite at it and let it run for a while. I'll try the suggestion of @JohanEngelen tomorrow (or Thursday) to see if I can get anything useful out.

Contributor

ximion commented Mar 8, 2017

Oh, and just in case: Sorry for the offhand bug description... I guess I was a bit frustrated by hitting compiler bugs so often when writing it (and at that time I hoped to get Terminix updated in Stretch, which I think won't happen now, so this crash came at a really bad time). You're doing an amazing job on LDC, and I'll update the description when I have some better information on what's actually going on (its only current content is pretty much "there's a bug when compiling X" :P)

Owner

klickverbot commented Mar 8, 2017

> We now have an RC bug against LDC about this, unfortunately, which makes this issue quite high-priority to keep LDC in the next Debian release

Just on a side note: How did we end up with a situation like this in the first place? I thought it was clear that non-x86 support is on a bit of a tentative basis for now. For example, we never really had a CI setup for armhf or PPC on a permanent basis (Kai started to set something up, but it never quite entered the normal development cycle).

If I were to choose, I'd rather not have LDC packaged on other platforms at all than x86 support suffering from it. (Of course, we'll want to rectify the CI/testing situation as soon as possible, but until then…)

Contributor

ximion commented Mar 8, 2017

@klickverbot Debian packages build on all architectures and we are encouraged to support as many platforms as possible. LDC won't get dropped from the Stretch release, before that happens I would rather negotiate something with the release team to drop the faulty architecture. Not sure what's it gonna be yet. Apparently armhf stuff built with LDC works though, so does ppc64el - some support is better than none.
Unfortunately I uploaded a fix for a Terminix crash which triggered this bug :P (and it comes at a really bad time, since I requested a freeze exception to get the final 1.1 release into the Stretch release a few weeks ago)

On a related note: I think LDC should really get CI set up for multiple architectures. Since D has a foundation now, I think it would be hugely beneficial to get some quota from an arm/x86/amd64 cloud provider and a Jenkins instance to run tests easily (would also help GDC and potentially DMD). I can give you access to Debian porterboxes too, but that is always temporary and not really a good permanent solution.

kalev commented Mar 8, 2017

For what it's worth, Fedora is pretty much in the same position and strongly encouraged to support as many architectures as possible. We're currently building ldc for armv7hl, i686, ppc64, ppc64le, x86_64.

Owner

klickverbot commented Mar 8, 2017

Building on other architectures is very welcome – we (myself included) did spend considerable development effort on non-x86 archs, and if it makes it easier for users to evaluate where we stand, then all the better. It's just that we can't offer the same level of support yet as for the production-quality x86 compiler (well, as production-quality as any D compiler is).

Contributor

ximion commented Mar 10, 2017

Cross-compiling with -mtriple=arm-linux-gnueabihf didn't trigger this crash, unfortunately.
I built myself an armhf chroot now, maybe that helps...

Contributor

ximion commented Mar 10, 2017

Okay, no crash in armhf chroot either, so emulation doesn't work. Could this maybe be another case of NEON being (not) present?

@ximion ximion changed the title from Crash when compiling Terminix on i386 and armhf to Crash when compiling Terminix on armhf (and i386) Mar 11, 2017

Contributor

ximion commented Mar 11, 2017

Looks like all porterboxes for armhf are unreachable at this time too...
Maybe I can roll back Terminix to when LDC crashed on i386 as well and run Dustmite on that bug later.

Contributor

ximion commented Mar 18, 2017

The porterboxes are accessible again, since we have Dustmite in the archive I will try to create a minimal testcase there.

EDIT: Crap, this bug seems to be of the unstable kind, sometimes it happens and sometimes it doesn't, and it seems to especially not-happen when running under GDB. I wonder if Dustmite will yield anything useful under these conditions (it will likely run for many hours, at least :-/ )

Contributor

ximion commented Mar 21, 2017

This will run a few days longer - I helped it out a bit by removing stuff, but this bug is very evasive. It doesn't appear under GDB, and the slightest change on the sources can make it disappear.
Also, interestingly, it doesn't appear when not writing an output file (-o- instead of -of).

And sometimes it is just flaky for no reason. So something is really weird here.
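Because the crash is intermittent, a Dustmite test command that runs the compiler only once would misjudge many reduction steps. Below is a minimal sketch of a retrying tester. It assumes the compiler is invoked directly (not through ldmd2), so that a segfault shows up as termination by SIGSEGV, and it relies on Dustmite's convention that exit status 0 means "the reduced source still reproduces the problem". The command line and the attempt count are placeholders, not the ones used in this thread.

```python
import signal
import subprocess


def crashed_with_segv(returncode):
    """True if a POSIX child was killed by SIGSEGV (subprocess reports -signum)."""
    return returncode == -signal.SIGSEGV


def reproduces(run_once, attempts):
    """Call run_once() up to `attempts` times; True as soon as one run segfaults."""
    for _ in range(attempts):
        if crashed_with_segv(run_once()):
            return True
    return False


def run_compiler():
    # Placeholder command line; substitute the real ldc2 invocation.
    cmd = ["ldc2", "-c", "source/app.d", "-oftilix.o"]
    return subprocess.run(cmd, stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode


def main():
    # Exit status 0 ("still interesting") only if at least one of 20 runs crashed.
    return 0 if reproduces(run_compiler, attempts=20) else 1
```

To use it as a tester, end the script with `raise SystemExit(main())`. The trade-off is time: every reduction step that no longer crashes costs the full number of attempts.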

Owner

klickverbot commented Mar 21, 2017

Darn... Can you get a core dump and maybe gain some idea about the details from the back trace?

Contributor

ximion commented Mar 22, 2017

Hah, I completely forgot about coredumps ^^
The porterbox I use is very restricted, but I see no reason why it wouldn't allow me to generate a coredump if I set the right limits - I'll get back with one tomorrow. Also, dustmite got slightly faster now (but it's still crazy slow, a single-core armhf machine isn't up to this task - it's running for three days straight now, with one small interruption. At least dustmite is narrowing the issue down a little).
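For reference, "setting the right limits" can be sketched as follows, assuming a POSIX system where the hard core-file limit is non-zero. The helper names are made up for illustration, and where the core file ends up depends on the system's core pattern setting, which this sketch does not touch.

```python
import resource
import subprocess


def enable_core_dumps():
    """Raise the soft core-file size limit to the hard limit; return the new pair."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


def run_allowing_core(argv):
    """Run argv with core dumps enabled in the child only; return its exit status."""
    return subprocess.run(argv, preexec_fn=enable_core_dumps).returncode
```

If the hard limit itself is 0, an unprivileged user cannot raise it, and the machine's administrators would have to change it.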

Contributor

ximion commented Mar 22, 2017

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0xb6e1c710 in TemplateInstance::needsCodegen() ()
(gdb) bt full
#0  0xb6e1c710 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#1  0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#2  0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#3  0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#4  0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#5  0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#6  0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#7  0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#8  0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
=> To infinity!

Looks like this recursion never stops - I'll try to maybe generate a better backtrace.

Owner

klickverbot commented Mar 22, 2017

Depending on how flaky the issue is, you should be able to do a release+debug or debug build as well. Then, you could also dump the source location information in the debugger to get further hints as to what causes the issue. Right now, I can only guess that it is a memory corruption issue in the compiler leading to invalid AST... Is the issue reproducible in Valgrind?

Contributor

ximion commented Mar 22, 2017

More information:

#10786 0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#10787 0xb6e1cc20 in TemplateInstance::needsCodegen() ()
No symbol table info available.
warning: Could not find DWO CU CMakeFiles/LDCShared.dir/gen/declarations.cpp.dwo(0xf48a14f6c605bd93) referenced by CU at offset 0xa68 [in module /usr/lib/debug/.build-id/fc/bc27ce25e3f250c055847441ce0277f089a80c.debug]
#10788 0xb6ef03d0 in CodegenVisitor::visit(TemplateInstance*) () at ./gen/declarations.cpp:448
No locals.
#10789 0xb6ef0956 in Declaration_codegen(Dsymbol*) () at ./gen/declarations.cpp:576
No locals.
warning: Could not find DWO CU CMakeFiles/LDCShared.dir/gen/modules.cpp.dwo(0xb03109d7010ced7c) referenced by CU at offset 0x1a4 [in module /usr/lib/debug/.build-id/fc/bc27ce25e3f250c055847441ce0277f089a80c.debug]
#10790 0xb6e9c770 in codegenModule(IRState*, Module*) () at ./gen/modules.cpp:635
No locals.
warning: Could not find DWO CU CMakeFiles/LDCShared.dir/driver/codegenerator.cpp.dwo(0x178f2d9de69b88a2) referenced by CU at offset 0xc48 [in module /usr/lib/debug/.build-id/fc/bc27ce25e3f250c055847441ce0277f089a80c.debug]
#10791 0xb6efb516 in ldc::CodeGenerator::emit(Module*) () at ./driver/codegenerator.cpp:234
No locals.
warning: Could not find DWO CU CMakeFiles/LDCShared.dir/driver/main.cpp.dwo(0xabaacaa5774a7d6e) referenced by CU at offset 0x864 [in module /usr/lib/debug/.build-id/fc/bc27ce25e3f250c055847441ce0277f089a80c.debug]
#10792 0xb6ee1fc4 in codegenModules(Array<Module*>&) () at ./driver/main.cpp:1047
No locals.
#10793 0xb6d7caa4 in mars_mainBody(Array<char const*>&, Array<char const*>&) ()
No symbol table info available.
#10794 0xb6ee33ae in cppmain(int, char**) () at ./driver/main.cpp:1021
No locals.
#10795 0xb6cc2f7c in D main ()
No symbol table info available.

Any debugging on this machine takes ages...

Contributor

ximion commented Mar 22, 2017

Valgrind isn't super useful...

EDIT: [removed clutter]

Member

kinke commented Mar 22, 2017

Seems like you ran valgrind for ldmd2. You'll want ldc2, as LDMD only translates the command line args and then starts an ldc2 process. [Adding the LDMD switch -vdmd outputs the ldc2 command line.]

Contributor

ximion commented Mar 22, 2017

Yeah, I noticed this right after writing the entry on Github (the suspiciously fast time Valgrind was running got me to examine the thing more).
So, now I ran it properly, and it looks like the segfault doesn't occur: http://paste.debian.net/923446/
This bug sucks.

Owner

klickverbot commented Mar 22, 2017

The uninitialized reads from the GC are benign, but these look potentially interesting (albeit probably not related?):

==13396== 273 errors in context 3 of 15:
==13396== Invalid write of size 4
==13396==    at 0x1E0190: Import::semantic(Scope*) (in /usr/bin/ldc2)
==13396==  Address 0x864279c is 4 bytes inside a block of size 7 alloc'd
==13396==    at 0x4840E94: realloc (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)
==13396== 
==13396== 
==13396== 395 errors in context 4 of 15:
==13396== Invalid write of size 4
==13396==    at 0x1E0080: Import::semantic(Scope*) (in /usr/bin/ldc2)
==13396==  Address 0x8642764 is 4 bytes inside a block of size 7 alloc'd
==13396==    at 0x4840E94: realloc (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)
==13396== 
==13396== 
==13396== 2076 errors in context 5 of 15:
==13396== Invalid write of size 4
==13396==    at 0x1E0240: Import::semantic(Scope*) (in /usr/bin/ldc2)
==13396==  Address 0x9786cc4 is 4 bytes inside a block of size 7 alloc'd
==13396==    at 0x483E4B0: malloc (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)
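
To spell out what these reports say: a 4-byte write that starts 4 bytes into a 7-byte block covers offsets 4 through 7, so its last byte lands one byte past the end of the block. A small sketch that pulls those numbers out of the two relevant Memcheck lines (the function name is made up for illustration):

```python
import re

WRITE = re.compile(r"Invalid write of size (\d+)")
BLOCK = re.compile(r"is (\d+) bytes inside a block of size (\d+)")


def overrun_bytes(report):
    """Bytes written past the end of the block described in a Memcheck report."""
    size = int(WRITE.search(report).group(1))
    offset, block = map(int, BLOCK.search(report).groups())
    return max(0, offset + size - block)
```

All three contexts quoted above have the same shape (size 4, offset 4, block size 7), so each overruns its block by a single byte.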

Contributor

ximion commented Mar 24, 2017

Dustmite is removing around 40 source-code lines per day, so we will only need to wait 400 days for this process to minimize everything down to zero... (around 16960 lines still exist)

Member

kinke commented Mar 24, 2017

It should crash on 32-bit x86 with that 'special' terminix src too [at least sometimes], right? Just asking because that would at least be debuggable by us directly.

Contributor

ximion commented Mar 24, 2017

@kinke I don't know... It was crashing before with a different source, then something was changed and the crash disappeared. It definitely does not crash when cross-compiling.
The version that broke on i386 can still be fetched from http://snapshot.debian.org/package/terminix/1.4.2-4/, but last time I checked it wasn't always possible to reproduce the error. It appears like version 1.4.2-1 also crashed when compiling on x86.
If this is really the same bug, then debugging on x86 is of course much nicer.

Member

kinke commented Mar 24, 2017

Our CI systems use x86_64 compilers only except for the Win32 AppVeyor job, that's the only native 32-bit one. On Windows, we don't support shared runtime libs etc. Is the crashing x86 LDC linked against static or shared druntime/Phobos? And what was its D host compiler? Edit: From the logs apparently LDC 1.1 as host compiler too + shared druntime/Phobos. Same for your LDC used on ARM? Note that afaik, the LDCs we test in CI are all linked against static D runtime libs.

Member

kinke commented Mar 24, 2017

Hmm, spurious crashes at program shutdown on OSX and potentially Linux with both shared and static runtime libs have been fixed with LDC 1.2 only (dlang/druntime#1655 and ldc-developers/druntime@6ce3c20). @klickverbot: Right? But we've only seen it fail for OSX so far. That might explain spurious i386 crashes.

Contributor

ximion commented Mar 24, 2017

@kinke For things in Linux distributions you can almost always assume that shared libraries are used ;-)
LDC and the stuff it compiles links against the shared runtime/stdlib libraries on all architectures.

I might try to compile the thing with LDC 1.2 to see if that fixes the bug... Would be strange though, since the problem seems to be LDC not getting out of a TemplateInstance::needsCodegen() recursion for unknown reasons.

Member

kinke commented Mar 24, 2017

Yep the ARM issue appears to be something else.

Contributor

ximion commented Mar 28, 2017

Heureka! After about 5-6 days of runtime, we have a minimized testcase!
I only had Dustmite search for LDC returning a segmentation fault, so unfortunately this is highly likely a different crash, as the output is different and it also happens on any other architecture.

The build now also fails:

./application.d(8): Error: type SETTINGS_THEME_VARIANT_KEY has no value
Error: Error executing /usr/bin/ldc2: Segmentation fault

(using ldmd here, but ldc alone crashes as well)

Testcase is here: ldc-terminix-sigsegv-armhf.tar.gz

So, we are not closer to fixing this bug, but at least LDC gets better in the process...
Unfortunately this means that Dustmite won't be very useful to us anymore - running Dustmite and GDB in parallel will likely be impractical due to the infinite loop and massive GDB output, which will slow down Dustmite even more.

Owner

klickverbot commented Mar 28, 2017

> so unfortunately this is highly likely a different crash

Yep, unfortunately this looks very much like a crash on invalid code.

Member

kinke commented Mar 28, 2017

Yep, and DMD 2.073.0 crashes too, so I guess it's a front-end bug (I'm at work)...

Contributor

ximion commented Mar 28, 2017

I'm afraid I can't give any more information on this issue. Since the crash doesn't happen with GDB, narrowing it down with Dustmite won't work unless the other crash and potentially more are fixed.
GDB in itself isn't super useful either, I am not sure what to make out of Valgrind.

If there's anything more you need or any idea you have, let me know.

Member

kinke commented Mar 29, 2017

I filed an upstream issue regarding your unrelated latest crash on invalid code.

Contributor

ximion commented Mar 31, 2017

@kinke Thanks!

As per https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=857085#26 it looks like this bug might actually be a regression between LDC 1.1 Beta3/6 and the final LDC 1.1 release...
Or at least the bug became more easily reproducible.

I wonder whether this has something to do with Debian's machines not allowing NEON.

In any case, it would be great to have this fixed, but I have no idea how to track it down properly. The best that came out so far was the GDB backtrace...

Contributor

ximion commented Apr 4, 2017

It looks like the i386 crash is back with Tilix (the latest version of Terminix after its name change): https://buildd.debian.org/status/package.php?p=tilix

This might be a generic 32bit problem...

Member

kinke commented Apr 4, 2017

Alright, if time allows, I'll debug it these days with a 32-bit LDC on Ubuntu; the crashing command-line isn't horrendously long... ;)

Contributor

ximion commented Apr 4, 2017

Last time I could remove at least the X11 stuff without killing the crash as well. But it's pretty great that the issue happens on ia32 again, that makes debugging much easier and potential Dustmiting much faster :-)

@ximion ximion referenced this issue in gnunn1/tilix Apr 4, 2017

Closed

Need package maintainers #25

Member

kinke commented Apr 7, 2017

I performed some tests on a 32-bit Xubuntu 16.04 Live-DVD in a VM. I extracted the vanilla Tilix 1.5.4 source as well as the GtkD 3.5.1 source and then tested this command-line (I had to add GtkD's srcvte subdir as additional include directory, otherwise equivalent to the crashing Debian command-line):

<...>/bin/ldmd2 -O -inline -release -g -version=StdLoggerDisableTrace -I/home/xubuntu/GtkD-3.5.1/src/ -I/home/xubuntu/GtkD-3.5.1/srcvte/ -c source/app.d source/gx/gtk/actions.d source/gx/gtk/cairo.d source/gx/gtk/clipboard.d source/gx/gtk/dialog.d source/gx/gtk/resource.d source/gx/gtk/settings.d source/gx/gtk/threads.d source/gx/gtk/util.d source/gx/gtk/vte.d source/gx/gtk/x11.d source/gx/i18n/l10n.d source/gx/tilix/application.d source/gx/tilix/appwindow.d source/gx/tilix/bookmark/bmchooser.d source/gx/tilix/bookmark/bmeditor.d source/gx/tilix/bookmark/bmtreeview.d source/gx/tilix/bookmark/manager.d source/gx/tilix/closedialog.d source/gx/tilix/cmdparams.d source/gx/tilix/colorschemes.d source/gx/tilix/common.d source/gx/tilix/constants.d source/gx/tilix/customtitle.d source/gx/tilix/encoding.d source/gx/tilix/prefeditor/bookmarkeditor.d source/gx/tilix/prefeditor/prefdialog.d source/gx/tilix/prefeditor/profileeditor.d source/gx/tilix/prefeditor/titleeditor.d source/gx/tilix/preferences.d source/gx/tilix/session.d source/gx/tilix/sessionswitcher.d source/gx/tilix/shortcuts.d source/gx/tilix/sidebar.d source/gx/tilix/terminal/actions.d source/gx/tilix/terminal/advpaste.d source/gx/tilix/terminal/exvte.d source/gx/tilix/terminal/layout.d source/gx/tilix/terminal/password.d source/gx/tilix/terminal/regex.d source/gx/tilix/terminal/search.d source/gx/tilix/terminal/terminal.d source/gx/tilix/terminal/util.d source/gx/util/array.d source/gx/util/string.d source/secret/Collection.d source/secret/Item.d source/secret/Prompt.d source/secret/Schema.d source/secret/SchemaAttribute.d source/secret/Secret.d source/secret/Service.d source/secret/Value.d source/secretc/secret.d source/secretc/secrettypes.d source/x11/X.d source/x11/Xlib.d -oftilix.o

I performed 10 runs with our 1.1.1 release package, 10 runs with our 1.2.0-beta2 package and finally 10 runs with master (compiled by 1.2.0-beta2 and using Ubuntu's LLVM 3.8). And you guessed it, no issues. All of these were linked against static druntime/Phobos fwiw.

Edit: I let master rebuild itself and linked it against the shared runtime libs; no problems after 100 runs either.
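
A clean run count only bounds how often the crash can occur; it cannot rule the crash out. Assuming runs are independent and each one crashes with the same probability p (a simplification, since the crash may depend on the environment rather than on chance), the probability of n clean runs in a row is (1 - p)^n. That gives an upper bound on p at a chosen confidence level:

```python
def max_crash_rate(clean_runs, confidence=0.95):
    """Largest per-run crash probability still consistent with `clean_runs`
    consecutive clean runs at the given confidence level.

    Solves (1 - p) ** clean_runs == 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_runs)
```

Under that model, 10 clean runs only exclude crash rates above roughly 26%, and 100 clean runs exclude rates above roughly 3%.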

Member

kinke commented Apr 7, 2017

(Trimmed) Valgrind log for the master build linked against shared runtime libs: http://paste.debian.net/926447/
Some more uninitialized values, but no invalid writes.

Owner

klickverbot commented Apr 7, 2017

This one looks potentially non-benign:

==1233== 34 errors in context 1 of 43310:
==1233== Conditional jump or move depends on uninitialised value(s)
==1233==    at 0x951A28E: llvm::ScalarEvolution::computeShiftCompareExitLimit(llvm::Value*, llvm::Value*, llvm::Loop const*, llvm::CmpInst::Predicate) (in /home/xubuntu/build-ldc-shared/bin/ldc2)
==1233==    by 0x9527CA9: llvm::ScalarEvolution::computeExitLimitFromICmp(llvm::Loop const*, llvm::ICmpInst*, llvm::BasicBlock*, llvm::BasicBlock*, bool) (in /home/xubuntu/build-ldc-shared/bin/ldc2)
==1233==    by 0x95281B8: llvm::ScalarEvolution::computeExitLimitFromCond(llvm::Loop const*, llvm::Value*, llvm::BasicBlock*, llvm::BasicBlock*, bool) (in /home/xubuntu/build-ldc-shared/bin/ldc2)
==1233==    by 0x952885D: llvm::ScalarEvolution::computeExitLimit(llvm::Loop const*, llvm::BasicBlock*) (in /home/xubuntu/build-ldc-shared/bin/ldc2)
==1233==    by 0x95289FA: llvm::ScalarEvolution::computeBackedgeTakenCount(llvm::Loop const*) (in /home/xubuntu/build-ldc-shared/bin/ldc2)
==1233==    by 0x9528CCE: llvm::ScalarEvolution::getBackedgeTakenInfo(llvm::Loop const*) (in /home/xubuntu/build-ldc-shared/bin/ldc2)
==1233==    by 0x952950D: llvm::ScalarEvolution::getBackedgeTakenCount(llvm::Loop const*) (in /home/xubuntu/build-ldc-shared/bin/ldc2)
==1233==    by 0x90E11EB: (anonymous namespace)::IndVarSimplify::runOnLoop(llvm::Loop*, llvm::LPPassManager&) [clone .part.287] [clone .constprop.302] (in /home/xubuntu/build-ldc-shared/bin/ldc2)
==1233==    by 0x94DDC72: llvm::LPPassManager::runOnFunction(llvm::Function&) (in /home/xubuntu/build-ldc-shared/bin/ldc2)
==1233==    by 0x966E4B7: llvm::FPPassManager::runOnFunction(llvm::Function&) (in /home/xubuntu/build-ldc-shared/bin/ldc2)
==1233==    by 0x9445511: (anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) (in /home/xubuntu/build-ldc-shared/bin/ldc2)
==1233==    by 0x966E0B9: llvm::legacy::PassManagerImpl::run(llvm::Module&) (in /home/xubuntu/build-ldc-shared/bin/ldc2)
==1233==  Uninitialised value was created by a stack allocation
==1233==    at 0x9519F7F: llvm::ScalarEvolution::computeShiftCompareExitLimit(llvm::Value*, llvm::Value*, llvm::Loop const*, llvm::CmpInst::Predicate) (in /home/xubuntu/build-ldc-shared/bin/ldc2)

Contributor

ximion commented Apr 15, 2017

Any news on this? (sorry for nagging, but this issue is critical for the next Debian release, and I need a plan on how to deal with it - ideally that would be fixing the issue, but we could also drop Terminix from the release on armhf).

Owner

klickverbot commented Apr 16, 2017

So far, nobody was able to reproduce the x86 issue yet. I just got an ARM VPS to try and reproduce the issue there (don't have my dev boards handy). No guarantees I'll get anything done over the holidays, though.

Owner

klickverbot commented Apr 17, 2017

Can't reproduce on x86 or

/build/work/ldc-system/bin/ldmd2 --version
LDC - the LLVM D compiler (1.3.0git-3c297dc-dirty):
  based on DMD v2.073.2 and LLVM 3.9.1
  built with LDC - the LLVM D compiler (0.17.3)
  Default target: armv7l-unknown-linux-gnueabihf

either.

With gtk-d and tilix from the Debian sid source repos:

$ /build/work/ldc-system/bin/ldmd2 -I/build/src/gtk-d/srcvte -I/build/src/gtk-d/src -c source/app.d source/gx/gtk/actions.d source/gx/gtk/cairo.d source/gx/gtk/clipboard.d source/gx/gtk/dialog.d source/gx/gtk/resource.d source/gx/gtk/settings.d source/gx/gtk/threads.d source/gx/gtk/util.d source/gx/gtk/vte.d source/gx/gtk/x11.d source/gx/i18n/l10n.d source/gx/tilix/application.d source/gx/tilix/appwindow.d source/gx/tilix/bookmark/bmchooser.d source/gx/tilix/bookmark/bmeditor.d source/gx/tilix/bookmark/bmtreeview.d source/gx/tilix/bookmark/manager.d source/gx/tilix/closedialog.d source/gx/tilix/cmdparams.d source/gx/tilix/colorschemes.d source/gx/tilix/common.d source/gx/tilix/constants.d source/gx/tilix/customtitle.d source/gx/tilix/encoding.d source/gx/tilix/prefeditor/bookmarkeditor.d source/gx/tilix/prefeditor/prefdialog.d source/gx/tilix/prefeditor/profileeditor.d source/gx/tilix/prefeditor/titleeditor.d source/gx/tilix/preferences.d source/gx/tilix/session.d source/gx/tilix/sessionswitcher.d source/gx/tilix/shortcuts.d source/gx/tilix/sidebar.d source/gx/tilix/terminal/actions.d source/gx/tilix/terminal/advpaste.d source/gx/tilix/terminal/exvte.d source/gx/tilix/terminal/layout.d source/gx/tilix/terminal/password.d source/gx/tilix/terminal/regex.d source/gx/tilix/terminal/search.d source/gx/tilix/terminal/terminal.d source/gx/tilix/terminal/util.d source/gx/util/array.d source/gx/util/string.d source/secret/Collection.d source/secret/Item.d source/secret/Prompt.d source/secret/Schema.d source/secret/SchemaAttribute.d source/secret/Secret.d source/secret/Service.d source/secret/Value.d source/secretc/secret.d source/secretc/secrettypes.d source/x11/X.d source/x11/Xlib.d -oftilix.o

Owner

klickverbot commented Apr 17, 2017

> http://paste.debian.net/926447/

Side note: Please avoid links to logs/pastes that expire quickly.

Owner

klickverbot commented Apr 17, 2017

@ximion: I'm building a compiler from the 1.1 release to verify, will have the results tomorrow morning. Also retrying with the exact same command line you used (I just used ./configure before).

At this point, it looks like we have to consider a miscompilation of the LDC binary you are using. How are you building/bootstrapping the compiler?

Owner

klickverbot commented Apr 17, 2017

/build/work/ldc-system/bin/ldmd2 -I/build/src/gtk-d/srcvte -I/build/src/gtk-d/src -O -inline -release -g -version=StdLoggerDisableTrace -I/usr/include/d/gtkd-3/ -L-lvted-3 -L-L/usr/lib/arm-linux-gnueabihf/ -L-lgtkd-3 -L-ldl -c source/app.d source/gx/gtk/actions.d source/gx/gtk/cairo.d source/gx/gtk/clipboard.d source/gx/gtk/dialog.d source/gx/gtk/resource.d source/gx/gtk/settings.d source/gx/gtk/threads.d source/gx/gtk/util.d source/gx/gtk/vte.d source/gx/gtk/x11.d source/gx/i18n/l10n.d source/gx/tilix/application.d source/gx/tilix/appwindow.d source/gx/tilix/bookmark/bmchooser.d source/gx/tilix/bookmark/bmeditor.d source/gx/tilix/bookmark/bmtreeview.d source/gx/tilix/bookmark/manager.d source/gx/tilix/closedialog.d source/gx/tilix/cmdparams.d source/gx/tilix/colorschemes.d source/gx/tilix/common.d source/gx/tilix/constants.d source/gx/tilix/customtitle.d source/gx/tilix/encoding.d source/gx/tilix/prefeditor/bookmarkeditor.d source/gx/tilix/prefeditor/prefdialog.d source/gx/tilix/prefeditor/profileeditor.d source/gx/tilix/prefeditor/titleeditor.d source/gx/tilix/preferences.d source/gx/tilix/session.d source/gx/tilix/sessionswitcher.d source/gx/tilix/shortcuts.d source/gx/tilix/sidebar.d source/gx/tilix/terminal/actions.d source/gx/tilix/terminal/advpaste.d source/gx/tilix/terminal/exvte.d source/gx/tilix/terminal/layout.d source/gx/tilix/terminal/password.d source/gx/tilix/terminal/regex.d source/gx/tilix/terminal/search.d source/gx/tilix/terminal/terminal.d source/gx/tilix/terminal/util.d source/gx/util/array.d source/gx/util/string.d source/secret/Collection.d source/secret/Item.d source/secret/Prompt.d source/secret/Schema.d source/secret/SchemaAttribute.d source/secret/Secret.d source/secret/Service.d source/secret/Value.d source/secretc/secret.d source/secretc/secrettypes.d source/x11/X.d source/x11/Xlib.d -oftilix.o

also works.

Owner

klickverbot commented Apr 17, 2017

(For reference, this is on MV78460 Marvell Armada XP/370 SoC with 2 GiB RAM running Arch Linux, armv7l-linux-gnueabihf (NEON disabled), GCC 6.3.1, GNU ld 2.28.0.20170322.)

Contributor

ximion commented Apr 17, 2017

Ping @markos for bootstrapping questions.
The compiler on armhf was manually bootstrapped with the LTS branch. Maybe re-bootstrapping with the latest LTS C++ compiler is an option, just to be safe?

Owner

klickverbot commented Apr 17, 2017

Manually bootstrapping from 0.17.3 is what I did above, yes.

Owner

klickverbot commented Apr 17, 2017

Same command line as above works with

LDC - the LLVM D compiler (1.1.1):
  based on DMD v2.071.2 and LLVM 3.9.1
  built with LDC - the LLVM D compiler (0.17.3)
  Default target: armv7l-unknown-linux-gnueabihf

as well.

Contributor

ximion commented Apr 17, 2017

@klickverbot Just to be safe: You're not cross-compiling anything and are on a real machine?
A bootstrap error which generated some subtle breakage in the armhf LDC would explain pretty much all behavior we are seeing...
Maybe @markos has a bit of time to once again bootstrap an LDC...
(I am starting to like Fedora here which can auto-bootstrap relatively easily. Maybe we should set up something like this for Debian's LDC package as well, now that Debian packages can have multiple source packages - I have seen no package using this feature for bootstrapping yet, though...).

Owner

klickverbot commented Apr 17, 2017

Yes, all happened on the above host == build == target, an up-to-date Arch Linux/ARM on armv7l-linux-gnueabihf, cortex-a8,-neon.

Owner

klickverbot commented Apr 17, 2017

I suppose I should try a self-hosted 1.1.1 build. Give me until tomorrow morning.

Owner

klickverbot commented Apr 17, 2017

LDC - the LLVM D compiler (1.1.1):
  based on DMD v2.071.2 and LLVM 3.9.1
  built with LDC - the LLVM D compiler (1.1.1)
  Default target: armv7l-unknown-linux-gnueabihf

built with the above 1.1.1 (i.e. built in turn by 0.17.3) also works.

I guess the notification spam on this issue is over for now, with the conclusion that neither @kinke nor I can reproduce the issue on i686 or armhf.

If it helps for tracking down any specifics of your setup, I can give you SSH access to the box I've done this on.
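
When comparing setups, the bootstrap chain can be read off the `--version` banner. A minimal sketch that parses banners of the shape pasted in this thread; it assumes the host compiler is itself an LDC (a banner from a build made with another host compiler may look different), and `parse_ldc_banner` is a name invented for this example.

```python
import re


def parse_ldc_banner(text):
    """Extract version, front end, LLVM, host compiler and target from an LDC banner."""
    patterns = {
        "version": r"^LDC - the LLVM D compiler \((.+?)\):",
        "frontend": r"based on DMD v(\S+)",
        "llvm": r"and LLVM (\S+)",
        "host": r"built with LDC - the LLVM D compiler \((.+?)\)",
        "target": r"Default target: (\S+)",
    }
    result = {}
    for key, pattern in patterns.items():
        m = re.search(pattern, text, re.MULTILINE)
        # Fields missing from the banner are reported as None.
        result[key] = m.group(1) if m else None
    return result
```

Comparing the `host` and `llvm` fields between a crashing and a working compiler is a quick first check for differences in how the two were built.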

Contributor

ximion commented Apr 23, 2017

@klickverbot You did all your experiments on Arch, right?
I changed our LDC packaging to always re-bootstrap LDC with the LTS compiler branch, and I got:

cd /«PKGBUILDDIR»/bootstrap && /«PKGBUILDDIR»/bootstrap/b/bin/ldc2 --output-o -c -I/«PKGBUILDDIR»/bootstrap/runtime/druntime/src -I/«PKGBUILDDIR»/bootstrap/runtime/druntime/src/gc /«PKGBUILDDIR»/bootstrap/runtime/phobos/std/regex/internal/ir.d -of/«PKGBUILDDIR»/bootstrap/b/runtime/std/regex/internal/ir-debug.o -w -relocation-model=pic -g -link-debuglib -I/«PKGBUILDDIR»/bootstrap/runtime/phobos
0  libLLVM-3.9.so.1 0xf486eafd llvm::sys::PrintStackTrace(llvm::raw_ostream&) + 45
1  libLLVM-3.9.so.1 0xf486ef4d
2  libLLVM-3.9.so.1 0xf486cb60 llvm::sys::RunSignalHandlers() + 64
3  libLLVM-3.9.so.1 0xf486cc8b
4  linux-gate.so.1  0xf735bd40 __kernel_sigreturn + 0
5  ldc2             0xf753255d TemplateInstance::needsCodegen() + 573
6  ldc2             0xf753256a TemplateInstance::needsCodegen() + 586
7  ldc2             0xf753256a TemplateInstance::needsCodegen() + 586
8  ldc2             0xf753256a TemplateInstance::needsCodegen() + 586
9  ldc2             0xf753256a TemplateInstance::needsCodegen() + 586
10 ldc2             0xf753256a TemplateInstance::needsCodegen() + 586
11 ldc2             0xf753256a TemplateInstance::needsCodegen() + 586
12 ldc2             0xf753256a TemplateInstance::needsCodegen() + 586
13 ldc2             0xf753256a TemplateInstance::needsCodegen() + 586
14 ldc2             0xf753256a TemplateInstance::needsCodegen() + 586
...
255 ldc2             0xf75325de TemplateInstance::needsCodegen() + 702
Segmentation fault

on i386 - for the LTS compiler build!

So, now I wonder whether this might have something to do with LLVM. The full build log is here: https://buildd.debian.org/status/fetch.php?pkg=ldc&arch=i386&ver=1%3A1.1.1-2&stamp=1492986438&raw=0
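For long traces like the one above, collapsing consecutive identical frames makes the recursion pattern easier to see at a glance. Here is a minimal sketch in Python, assuming the frames follow the `N module address symbol + offset` layout shown in the trace. The helper name `collapse_frames` is made up for illustration and is not part of LDC or LLVM:

```python
import re
from itertools import groupby

# Matches frames such as:
#   5  ldc2  0xf753255d TemplateInstance::needsCodegen() + 573
# Frames without a symbol (e.g. "1  libLLVM-3.9.so.1 0xf486ef4d") are kept
# and reported with the placeholder symbol "??".
FRAME_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)(?:\s+(.*?)\s+\+\s+\d+)?\s*$"
)

def collapse_frames(trace):
    """Return (module, symbol, count) for each run of consecutive identical frames."""
    frames = []
    for line in trace.splitlines():
        m = FRAME_RE.match(line)
        if m:  # non-frame lines ("...", "Segmentation fault") are skipped
            frames.append((m.group(2), m.group(4) or "??"))
    return [(mod, sym, len(list(run))) for (mod, sym), run in groupby(frames)]
```

Feeding it the trace text should yield one entry per run, so the long tail of `TemplateInstance::needsCodegen()` frames shrinks to a single line with a count.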

Member

kinke commented Apr 23, 2017

Building ltsmaster works fine on i686 with Ubuntu 16.04 using its LLVM 3.8 libs (static ones apparently by default).

Owner

klickverbot commented Apr 23, 2017

You did all your experiments on Arch, right?

Yes.

So, now I wonder whether this might have something to do with LLVM.

I wouldn't think so – the infinite recursion seems to happen in the frontend (in this case compiled by GCC), so it would have to be something like the AST being corrupted by a memory issue within LLVM, ABI issues messing up the stack due to header/executable issues, etc. That one Valgrind issue from @kinke's post I pointed out above occurs after IR generation is done (where TemplateInstance::needsCodegen is called).
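To illustrate the general failure mode (this is not the actual needsCodegen logic): a recursive walk over linked nodes terminates on a well-formed chain, but never returns if a stray write turns the chain into a cycle. A rough sketch in Python, where the `Node` class, the `next` field and both function names are hypothetical:

```python
class Node:
    def __init__(self, name, needs_codegen=False):
        self.name = name
        self.needs_codegen = needs_codegen
        self.next = None  # hypothetical link to the next node in the chain

def any_needs_codegen(node):
    """Recursive walk with no protection against cycles."""
    if node is None:
        return False
    if node.needs_codegen:
        return True
    return any_needs_codegen(node.next)

def any_needs_codegen_guarded(node):
    """Iterative walk that stops as soon as a node is visited twice."""
    seen = set()
    while node is not None and id(node) not in seen:
        if node.needs_codegen:
            return True
        seen.add(id(node))
        node = node.next
    return False
```

With two nodes linked `a -> b`, both functions return `False`. Once `b.next` is set back to `a`, the unguarded version recurses until Python raises `RecursionError`, which is the analogue of the native stack overflow and segfault in the trace, while the guarded version still returns.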

But just in case, I was using the Arch Linux/ARM packages for LLVM earlier, while a straightforward source build is used for the binary releases.

I suppose there is a slim chance that the invalid writes in that other Valgrind log end up corrupting the AST in a way for the infinite recursion to happen. Do they still occur with the ltsmaster compiler?
Perhaps you could have a look at what is going on (e.g. using Valgrind's gdbserver) to see if that is an issue?

Do you know whether the issue occurs on all i686 Debian Sid boxes (cf. us not being able to reproduce it on Ubuntu)? If you swap out the compiler for one from a binary release, does that crash as well?

(Unfortunately, I'm a bit short on time right now.)

Contributor

ximion commented Apr 23, 2017

@kinke

Building ltsmaster works fine on i686 with Ubuntu 16.04 using its LLVM 3.8 libs

I could switch back to LLVM 3.8 instead of 3.9.1 to see if that changes anything...

@klickverbot I can probably test these things tomorrow :-)

I still find it an interesting detail that this only seems to affect 32bit architectures, as amd64 and ppc64el are fine.

Contributor

ximion commented Apr 24, 2017

Compiling in an i386 chroot doesn't trigger the same behavior.

LDC and all its dependencies were removed from Debian testing (aka the next release) today because of this issue, see https://tracker.debian.org/pkg/ldc for the latest status.

Contributor

ximion commented May 22, 2017

Yes, and I am not happy with how the release team handled this matter. In any case, I apologize for this and hope to be able to introduce LDC to Stretch backports instead.

This issue still needs to be resolved for that to happen, though.

FTR: Link to the downstream bug report again https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=857085

Owner

klickverbot commented May 22, 2017

Welp, that's not great. I am not sure what we could have done differently as the upstream devs (apart from directly engaging with release people, which probably they wouldn't like), given that we still can't reproduce the issues on a number of systems/configurations.

Owner

klickverbot commented May 22, 2017

(@ximion Many thanks for your work though, it is very much appreciated!)

Contributor

ximion commented May 22, 2017

@klickverbot

Welp, that's not great. I am not sure what we could have done differently as the upstream devs (apart from directly engaging with release people, which probably they wouldn't like), given that we still can't reproduce the issues on a number of systems/configurations.

There's not really much you could have done - I was discussing this with people on IRC yesterday, and decided to just drop the armhf port, since it didn't look like we could resolve the issue prior to the release. I submitted an upload doing that, and apparently one of the release-team members who listened and commented on our discussion force-removed the package today without giving any prior warning or reason.
I am not happy.
But, as said, our fault, not yours.

Regarding the bug, I summarized what we know at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=857085#41 - hopefully we can resolve this somehow. I might just try to build LDC with LLVM in Debian experimental now, to see if that makes a difference (with no chance of getting LDC into the release anymore, I can at least go crazy in unstable to find this bug).

Unfortunately I have never been able to reproduce this issue locally, it seems to happen exclusively on Debian buildds.

Owner

klickverbot commented May 22, 2017

Unfortunately I have never been able to reproduce this issue locally, it seems to happen exclusively on Debian buildds.

I thought Debian was moving towards reproducible builds? (That is, isn't there some way to get a chroot with the same environment locally?)

Contributor

ximion commented May 22, 2017

Yes, you can, but the build still doesn't fail here. Which makes me believe that it might have something to do with the exact architecture used.

Contributor

ximion commented May 22, 2017

@petterreinholdtsen In general good advice, but this particular issue is an infinite recursion which happens for no obvious reason (I looked through the relevant pieces of code).
Randomly fixing warnings and hoping that it makes things better would take quite a while ^^
But yeah, in general Coverity is awesome and a great QA tool.

Contributor

ximion commented May 22, 2017

Building with LLVM 4.0 made the FTBFS of LDC itself on the i386 architecture disappear. Interesting. I am very curious if this maybe fixes the Tilix build as well.

Contributor

ximion commented May 23, 2017

LDC with LLVM 4.0 builds Tilix 1.5.6 flawlessly on all architectures. The LLVM version is also the biggest difference between the Fedora and Debian toolchains. Unfortunately, LLVM 4.0 won't be in the next Debian release, which is why I have not explored that option until now, because it wouldn't have helped the cause of getting LDC to work on Debian 9.

The Tilix version that failed before was 1.5.4, though. So, to be 100% certain that LLVM was indeed the culprit here, I will need to build that exact version of Tilix with the updated LLVM 4 LDC on armhf and see if that resolves the issue. I think the answer will be yes.

In any case, it now seems to be highly likely that the thing we were after for months is actually a bug somewhere in LLVM, that got resolved with the 4.0 release (or worked around / not triggered).

Contributor

ximion commented May 23, 2017

Okay, I tried to reproduce the issue with the new LLVM4 LDC and the exact same sources on a Debian armhf porterbox that I used before and which was the only place where I could reliably reproduce the bug: The issue did not appear anymore!

On a hunch, I also tried the old LDC (same version, but with LLVM 3.8) binary, and the bug also didn't show itself. I also had a manually compiled version lying around which I had used before to reproduce the issue, that was compiled with LLVM 3.9: The bug was gone with that one as well.

The LLVM changelog reads:

llvm-toolchain-3.9 (1:3.9.1-8) unstable; urgency=medium

  * Really fix "use versioned symbols" for llvm
    Thanks to Julien Cristau for the patch (Closes: #849098)

 -- Sylvestre Ledru <sylvestre@debian.org>  Tue, 25 Apr 2017 15:10:10 +0200

llvm-toolchain-3.9 (1:3.9.1-7) unstable; urgency=medium

  * Limit the archs where the ocaml binding is built
    Should fix the FTBFS
    Currently amd64 arm64 armel armhf i386

 -- Sylvestre Ledru <sylvestre@debian.org>  Sat, 15 Apr 2017 12:03:30 +0200

llvm-toolchain-3.9 (1:3.9.1-6) unstable; urgency=medium

  * Upload in unstable
  * Bring back ocaml. Thanks to Cyril Soldani (Closes: #858626)

 -- Sylvestre Ledru <sylvestre@debian.org>  Fri, 14 Apr 2017 10:02:03 +0200

llvm-toolchain-3.9 (1:3.9.1-6~exp2) experimental; urgency=medium

  * Add override_dh_makeshlibs for the libllvm or liblldb versions
    Thanks to Julien Cristau for the patch
  * change the min version of the libclang1 symbols to 1:3.9.1-6~
  * Fix the symlink on scan-build-py

 -- Sylvestre Ledru <sylvestre@debian.org>  Tue, 28 Mar 2017 06:32:40 +0200

llvm-toolchain-3.9 (1:3.9.1-6~exp1) experimental; urgency=medium

  [ Rebecca N. Palmer ]
  * Allow '!pointer' in OpenCL (Closes: #857623)
  * Add missing liblldb symlink (Closes: #857683)
  * Use versioned symbols (Closes: #848368)

 -- Sylvestre Ledru <sylvestre@debian.org>  Sun, 19 Mar 2017 10:12:03 +0100

llvm-toolchain-3.9 (1:3.9.1-5) unstable; urgency=medium

  * Fix the incorrect symlink to scan-build-py (Closes: #856869)

 -- Sylvestre Ledru <sylvestre@debian.org>  Sun, 12 Mar 2017 10:01:10 +0100

None of these changes look like they could have caused this bug. But I think it's relatively safe to assume now that this issue might not actually be an issue in LDC at all.

Contributor

ximion commented Sep 6, 2017

I think we can close this. If it ever happens again, I will file a new bug report. Thanks for all your help!

@ximion ximion closed this Sep 6, 2017
