
Crash when compiling Terminix on armhf (and i386) #2022

Closed
ximion opened this issue Mar 3, 2017 · 76 comments

Comments

@kinke (Member) commented Mar 4, 2017

Probably a duplicate of, or at least highly related to, #1996.

@ximion (Contributor, Author) commented Mar 7, 2017

After upgrading Terminix to 1.5.2, the i386 issue has fixed itself (or rather got worked around, I guess), while the FTBFS on armhf persisted.
Oddly, after fixing an unrelated build failure on ppc64el via this patch: https://github.com/ximion/terminix/commit/da12d1322ac0d94c62527b93a804d48c1da0e78d builds started working without LDC segfault on armhf too.
Because of this I assume this crash is triggered by LDC doing something bad with different integer types on the respective architectures.

We now have an RC bug against LDC about this, unfortunately, which makes this issue quite high-priority to keep LDC in the next Debian release: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=857085
Maybe it can be downgraded if a workaround is found, but having a fix would make me (and the release team) happier for sure.
Unfortunately I can't dustmite this on armhf since I don't have this architecture here (maybe a Raspberry Pi would do to reproduce...)

@JohanEngelen (Member)

@ximion Did you try to cross-compile to armhf to reproduce (for dustmiting)? Not sure about the triple, perhaps -mtriple=arm-linux-gnueabihf.

@ximion (Contributor, Author) commented Mar 7, 2017

Looks like it's failing again after just slightly modifying the mentioned patch: https://buildd.debian.org/status/fetch.php?pkg=terminix&arch=armhf&ver=1.5.2-3&stamp=1488930891&raw=0
Meh... But ppc64el works now. This is the stupidest game of whack-a-mole I've ever played.

@kinke (Member) commented Mar 8, 2017

A way to quickly reproduce this on x86, e.g., by archiving the required files and providing a command line (incl. target triple) that makes it crash, would be extremely helpful so that we can immediately move on to debugging. Terminix doesn't compile on Win64; I already tried that (the parts/dependencies that didn't compile failed due to missing POSIX imports etc., not crashes).

@ximion (Contributor, Author) commented Mar 8, 2017

Unfortunately Terminix is a large codebase, and I can only guess what is relevant for the issue here from the patches I sent and the behavior I observed.
The best shot is likely to fire Dustmite at it and let it run for a while; I'll try @JohanEngelen's suggestion tomorrow (or Thursday) to see if I can get anything useful out.

@ximion (Contributor, Author) commented Mar 8, 2017

Oh, and just in case: sorry for the offhand bug description... I guess I was a bit frustrated from hitting compiler bugs so often when writing it (and at that time I hoped to get Terminix updated in Stretch, which I think won't happen now, so this crash came at a really bad time). You're doing an amazing job on LDC, and I'll update the description when I have some better information on what's actually going on (its only current content is pretty much "there's a bug when compiling X" :P)

@dnadlinger (Member)

We now have an RC bug against LDC about this, unfortunately, which makes this issue quite high-priority to keep LDC in the next Debian release

Just on a side note: how did we end up with a situation like this in the first place? I thought it was clear that non-x86 support is on a bit of a tentative basis for now. For example, we never really had a CI setup for armhf or PPC on a permanent basis (Kai started to set something up, but it never quite entered the normal development cycle).

If I were to choose, I'd rather not have LDC packaged on other platforms at all than have x86 support suffer from it. (Of course, we'll want to rectify the CI/testing situation as soon as possible, but until then…)

@ximion (Contributor, Author) commented Mar 8, 2017

@klickverbot Debian packages build on all architectures, and we are encouraged to support as many platforms as possible. LDC won't get dropped from the Stretch release; before that happens, I would rather negotiate something with the release team to drop the faulty architecture. Not sure what it's gonna be yet. Apparently armhf stuff built with LDC works, though, and so does ppc64el - some support is better than none.
Unfortunately I uploaded a fix for a Terminix crash which triggered this bug :P (and it comes at a really bad time, since I requested a freeze exception to get the final 1.1 release into the Stretch release a few weeks ago)

On a related note: I think LDC should really get CI set up for multiple architectures. Since D has a foundation now, I think it would be hugely beneficial to get some quota from an arm/x86/amd64 cloud provider and a Jenkins instance to run tests easily (would also help GDC and potentially DMD). I can give you access to Debian porterboxes too, but that is always temporary and not really a good permanent solution.

@kalev (Contributor) commented Mar 8, 2017

For what it's worth, Fedora is pretty much in the same position and strongly encouraged to support as many architectures as possible. We're currently building ldc for armv7hl, i686, ppc64, ppc64le, x86_64.

@dnadlinger (Member)

Building on other architectures is very welcome – we (myself included) did spend considerable development effort on non-x86 archs, and if it makes it easier for users to evaluate where we stand, then all the better. It's just that we can't offer the same level of support yet as for the production-quality x86 compiler (well, as production-quality as any D compiler is).

@ximion (Contributor, Author) commented Mar 10, 2017

Cross-compiling with -mtriple=arm-linux-gnueabihf didn't trigger this crash, unfortunately.
I'm building myself an armhf chroot now, maybe that helps...

@ximion (Contributor, Author) commented Mar 10, 2017

Okay, no crash in armhf chroot either, so emulation doesn't work. Could this maybe be another case of NEON being (not) present?

@ximion ximion changed the title Crash when compiling Terminix on i386 and armhf Crash when compiling Terminix on armhf (and i386) Mar 11, 2017
@ximion (Contributor, Author) commented Mar 11, 2017

Looks like all porterboxes for armhf are unreachable at the moment, too...
Maybe I can roll back Terminix to when LDC crashed on i386 as well and run Dustmite on that bug later.

@ximion (Contributor, Author) commented Mar 18, 2017

The porterboxes are accessible again, since we have Dustmite in the archive I will try to create a minimal testcase there.

EDIT: Crap, this bug seems to be of the unstable kind: sometimes it happens and sometimes it doesn't, and it seems to especially not happen when running under GDB. I wonder if Dustmite will yield anything useful under these conditions (it will likely run for many hours, at least :-/ )
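For reference, a hypothetical DustMite setup for a flaky crash like this could look as follows (the ldc2 command line, file paths, and retry count are all assumptions, not taken from the actual reduction). The tester script must exit 0 only while the reduced source still reproduces the bug, so for an intermittent segfault it retries a few times before declaring the bug gone:

```shell
# Write a tester script for DustMite (paths/flags are illustrative).
cat > is_crashing.sh <<'EOF'
#!/bin/sh
# Exit 0 only if the compiler still segfaults on the reduced source.
for attempt in 1 2 3; do
    ldc2 -c source/app.d -ofapp.o
    # A segfault usually yields exit status 139 (128 + SIGSEGV).
    [ $? -eq 139 ] && exit 0
done
exit 1
EOF
chmod +x is_crashing.sh
# Actual reduction (can run for days on a slow armhf box):
#   dustmite source ./is_crashing.sh
```

The retry loop trades reduction speed for reliability: without it, DustMite would treat any run where the flaky crash happens not to trigger as a successful reduction step and remove code that is actually needed to reproduce the bug.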

@ximion (Contributor, Author) commented Mar 21, 2017

This will run a few days longer - I helped it out a bit by removing stuff, but this bug is very evasive. It doesn't appear under GDB, and the slightest change on the sources can make it disappear.
Also, interestingly, it doesn't appear when not writing an output file (-o- instead of -of).

And sometimes it is just flaky for no reason. So something is really weird here.

@dnadlinger (Member)

Darn... Can you get a core dump and maybe gain some idea about the details from the back trace?

@ximion (Contributor, Author) commented Mar 22, 2017

Hah, I completely forgot about coredumps ^^
The porterbox I use is very restricted, but I see no reason why it wouldn't allow me to generate a coredump if I set the right limits - I'll get back with one tomorrow. Also, Dustmite got slightly faster now (but it's still crazy slow; a single-core armhf machine isn't up to this task - it's been running for three days straight now, with one small interruption. At least Dustmite is narrowing the issue down a little).
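A minimal sketch of the core-dump workflow in question (the ldc2 invocation and file paths are illustrative assumptions, not the exact commands used on the porterbox). Core files are often suppressed by a size limit of 0, so the limit has to be raised in the same shell that runs the crashing compiler:

```shell
# Allow core files in this shell; the default limit of 0 suppresses them.
ulimit -c unlimited
ulimit -c   # prints "unlimited" if the new limit took effect

# Then rerun the crashing build and inspect the resulting core file:
#   ldc2 -c source/app.d -ofapp.o
#   gdb /usr/bin/ldc2 core -ex 'bt full' -ex quit
```

Where exactly the kernel places the core file (and its name) depends on kernel.core_pattern, which may need checking on a restricted machine.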

@ximion (Contributor, Author) commented Mar 22, 2017

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0xb6e1c710 in TemplateInstance::needsCodegen() ()
(gdb) bt full
#0  0xb6e1c710 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#1  0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#2  0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#3  0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#4  0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#5  0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#6  0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#7  0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#8  0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
=> To infinity!

Looks like this recursion never stops - I'll try to maybe generate a better backtrace.

@dnadlinger (Member)

Depending on how flaky the issue is, you should be able to do a release+debug or debug build as well. Then, you could also dump the source location information in the debugger to get further hints as to what causes the issue. Right now, I can only guess that it is a memory corruption issue in the compiler leading to invalid AST... Is the issue reproducible in Valgrind?

@ximion (Contributor, Author) commented Mar 22, 2017

More information:

#10786 0xb6e1c898 in TemplateInstance::needsCodegen() ()
No symbol table info available.
#10787 0xb6e1cc20 in TemplateInstance::needsCodegen() ()
No symbol table info available.
warning: Could not find DWO CU CMakeFiles/LDCShared.dir/gen/declarations.cpp.dwo(0xf48a14f6c605bd93) referenced by CU at offset 0xa68 [in module /usr/lib/debug/.build-id/fc/bc27ce25e3f250c055847441ce0277f089a80c.debug]
#10788 0xb6ef03d0 in CodegenVisitor::visit(TemplateInstance*) () at ./gen/declarations.cpp:448
No locals.
#10789 0xb6ef0956 in Declaration_codegen(Dsymbol*) () at ./gen/declarations.cpp:576
No locals.
warning: Could not find DWO CU CMakeFiles/LDCShared.dir/gen/modules.cpp.dwo(0xb03109d7010ced7c) referenced by CU at offset 0x1a4 [in module /usr/lib/debug/.build-id/fc/bc27ce25e3f250c055847441ce0277f089a80c.debug]
#10790 0xb6e9c770 in codegenModule(IRState*, Module*) () at ./gen/modules.cpp:635
No locals.
warning: Could not find DWO CU CMakeFiles/LDCShared.dir/driver/codegenerator.cpp.dwo(0x178f2d9de69b88a2) referenced by CU at offset 0xc48 [in module /usr/lib/debug/.build-id/fc/bc27ce25e3f250c055847441ce0277f089a80c.debug]
#10791 0xb6efb516 in ldc::CodeGenerator::emit(Module*) () at ./driver/codegenerator.cpp:234
No locals.
warning: Could not find DWO CU CMakeFiles/LDCShared.dir/driver/main.cpp.dwo(0xabaacaa5774a7d6e) referenced by CU at offset 0x864 [in module /usr/lib/debug/.build-id/fc/bc27ce25e3f250c055847441ce0277f089a80c.debug]
#10792 0xb6ee1fc4 in codegenModules(Array<Module*>&) () at ./driver/main.cpp:1047
No locals.
#10793 0xb6d7caa4 in mars_mainBody(Array<char const*>&, Array<char const*>&) ()
No symbol table info available.
#10794 0xb6ee33ae in cppmain(int, char**) () at ./driver/main.cpp:1021
No locals.
#10795 0xb6cc2f7c in D main ()
No symbol table info available.

Any debugging on this machine takes ages...

@ximion (Contributor, Author) commented Mar 22, 2017

Valgrind isn't super useful...

EDIT: [removed clutter]

@kinke (Member) commented Mar 22, 2017

Seems like you ran valgrind for ldmd2. You'll want ldc2, as LDMD only translates the command line args and then starts an ldc2 process. [Adding the LDMD switch -vdmd outputs the ldc2 command line.]

@ximion (Contributor, Author) commented Mar 22, 2017

Yeah, I noticed this right after writing the entry on GitHub (the suspiciously short time Valgrind was running got me to examine the thing more).
So, now I ran it properly, and it looks like the segfault doesn't occur: http://paste.debian.net/923446/
This bug sucks.

@dnadlinger (Member)

The uninitialized reads from the GC are benign, but these look potentially interesting (albeit probably not related?):

==13396== 273 errors in context 3 of 15:
==13396== Invalid write of size 4
==13396==    at 0x1E0190: Import::semantic(Scope*) (in /usr/bin/ldc2)
==13396==  Address 0x864279c is 4 bytes inside a block of size 7 alloc'd
==13396==    at 0x4840E94: realloc (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)
==13396== 
==13396== 
==13396== 395 errors in context 4 of 15:
==13396== Invalid write of size 4
==13396==    at 0x1E0080: Import::semantic(Scope*) (in /usr/bin/ldc2)
==13396==  Address 0x8642764 is 4 bytes inside a block of size 7 alloc'd
==13396==    at 0x4840E94: realloc (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)
==13396== 
==13396== 
==13396== 2076 errors in context 5 of 15:
==13396== Invalid write of size 4
==13396==    at 0x1E0240: Import::semantic(Scope*) (in /usr/bin/ldc2)
==13396==  Address 0x9786cc4 is 4 bytes inside a block of size 7 alloc'd
==13396==    at 0x483E4B0: malloc (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)

@ximion (Contributor, Author) commented Mar 24, 2017

Dustmite is removing around 40 source-code lines per day, so we will only need to wait 400 days for this process to minimize everything down to zero... (around 16960 lines still exist)

@kinke (Member) commented Mar 24, 2017

It should crash on 32-bit x86 with that 'special' terminix src too [at least sometimes], right? Just asking because that would at least be debuggable by us directly.

@ximion (Contributor, Author) commented Mar 24, 2017

@kinke I don't know... It was crashing before with a different source, then something was changed and the crash disappeared. It definitely does not crash when cross-compiling.
The version that broke on i386 can still be fetched from http://snapshot.debian.org/package/terminix/1.4.2-4/, but last time I checked it wasn't always possible to reproduce the error. It appears like version 1.4.2-1 also crashed when compiling on x86.
If this is really the same bug, then debugging on x86 is of course much nicer.

@kinke (Member) commented Mar 24, 2017

Our CI systems use x86_64 compilers only, except for the Win32 AppVeyor job; that's the only native 32-bit one. On Windows, we don't support shared runtime libs etc. Is the crashing x86 LDC linked against static or shared druntime/Phobos? And what was its D host compiler? Edit: from the logs, apparently LDC 1.1 as host compiler too + shared druntime/Phobos. Same for your LDC used on ARM? Note that AFAIK, the LDCs we test in CI are all linked against static D runtime libs.

@dnadlinger (Member)

http://paste.debian.net/926447/

Side note: Please avoid links to logs/pastes that expire quickly.

@dnadlinger (Member)

@ximion: I'm building a compiler from the 1.1 release to verify, will have the results tomorrow morning. Also retrying with the exact same command line you used (I just used ./configure before).

At this point, it looks like we have to consider a miscompilation of the LDC binary you are using. How are you building/bootstrapping the compiler?

@dnadlinger (Member)

/build/work/ldc-system/bin/ldmd2 -I/build/src/gtk-d/srcvte -I/build/src/gtk-d/src -O -inline -release -g -version=StdLoggerDisableTrace -I/usr/include/d/gtkd-3/ -L-lvted-3 -L-L/usr/lib/arm-linux-gnueabihf/ -L-lgtkd-3 -L-ldl -c source/app.d source/gx/gtk/actions.d source/gx/gtk/cairo.d source/gx/gtk/clipboard.d source/gx/gtk/dialog.d source/gx/gtk/resource.d source/gx/gtk/settings.d source/gx/gtk/threads.d source/gx/gtk/util.d source/gx/gtk/vte.d source/gx/gtk/x11.d source/gx/i18n/l10n.d source/gx/tilix/application.d source/gx/tilix/appwindow.d source/gx/tilix/bookmark/bmchooser.d source/gx/tilix/bookmark/bmeditor.d source/gx/tilix/bookmark/bmtreeview.d source/gx/tilix/bookmark/manager.d source/gx/tilix/closedialog.d source/gx/tilix/cmdparams.d source/gx/tilix/colorschemes.d source/gx/tilix/common.d source/gx/tilix/constants.d source/gx/tilix/customtitle.d source/gx/tilix/encoding.d source/gx/tilix/prefeditor/bookmarkeditor.d source/gx/tilix/prefeditor/prefdialog.d source/gx/tilix/prefeditor/profileeditor.d source/gx/tilix/prefeditor/titleeditor.d source/gx/tilix/preferences.d source/gx/tilix/session.d source/gx/tilix/sessionswitcher.d source/gx/tilix/shortcuts.d source/gx/tilix/sidebar.d source/gx/tilix/terminal/actions.d source/gx/tilix/terminal/advpaste.d source/gx/tilix/terminal/exvte.d source/gx/tilix/terminal/layout.d source/gx/tilix/terminal/password.d source/gx/tilix/terminal/regex.d source/gx/tilix/terminal/search.d source/gx/tilix/terminal/terminal.d source/gx/tilix/terminal/util.d source/gx/util/array.d source/gx/util/string.d source/secret/Collection.d source/secret/Item.d source/secret/Prompt.d source/secret/Schema.d source/secret/SchemaAttribute.d source/secret/Secret.d source/secret/Service.d source/secret/Value.d source/secretc/secret.d source/secretc/secrettypes.d source/x11/X.d source/x11/Xlib.d -oftilix.o

also works.

@dnadlinger (Member)

(For reference, this is on MV78460 Marvell Armada XP/370 SoC with 2 GiB RAM running Arch Linux, armv7l-linux-gnueabihf (NEON disabled), GCC 6.3.1, GNU ld 2.28.0.20170322.)

@ximion (Contributor, Author) commented Apr 17, 2017

Ping @markos for bootstrapping questions.
The compiler on armhf was manually bootstrapped with the LTS branch. Maybe re-bootstrapping with the latest LTS C++ compiler is an option, just to be safe?

@dnadlinger (Member)

Manually bootstrapping from 0.17.3 is what I did above, yes.

@dnadlinger (Member)

Same command line as above works with

LDC - the LLVM D compiler (1.1.1):
  based on DMD v2.071.2 and LLVM 3.9.1
  built with LDC - the LLVM D compiler (0.17.3)
  Default target: armv7l-unknown-linux-gnueabihf

as well.

@ximion (Contributor, Author) commented Apr 17, 2017

@klickverbot Just to be safe: You're not cross-compiling anything and are on a real machine?
A bootstrap error which generated some subtle breakage in the armhf LDC would explain pretty much all behavior we are seeing...
Maybe @markos has a bit of time to once again bootstrap an LDC...
(I am starting to like Fedora here, which can auto-bootstrap relatively easily. Maybe we should set up something like this for Debian's LDC package as well, now that Debian packages can have multiple source packages - I have seen no package using this feature for bootstrapping yet, though...)

@dnadlinger (Member) commented Apr 17, 2017

Yes, all happened on the above host == build == target, an up-to-date Arch Linux/ARM on armv7l-linux-gnueabihf, cortex-a8,-neon.

@dnadlinger (Member) commented Apr 17, 2017

I suppose I should try a self-hosted 1.1.1 build. Give me a second until tomorrow morning.

@dnadlinger (Member)

LDC - the LLVM D compiler (1.1.1):
  based on DMD v2.071.2 and LLVM 3.9.1
  built with LDC - the LLVM D compiler (1.1.1)
  Default target: armv7l-unknown-linux-gnueabihf

built with the above 1.1.1 (i.e. built in turn by 0.17.3) also works.

I guess the notification spam on this issue is over for now, with the conclusion that neither @kinke nor I can reproduce the issue on i686 and armhf.

If it helps for tracking down any specifics of your setup, I can give you SSH access to the box I've done this on.

@ximion (Contributor, Author) commented Apr 23, 2017

@klickverbot You did all your experiments on Arch, right?
I changed our LDC packaging to always re-bootstrap LDC with the LTS compiler branch, and I got:

cd /«PKGBUILDDIR»/bootstrap && /«PKGBUILDDIR»/bootstrap/b/bin/ldc2 --output-o -c -I/«PKGBUILDDIR»/bootstrap/runtime/druntime/src -I/«PKGBUILDDIR»/bootstrap/runtime/druntime/src/gc /«PKGBUILDDIR»/bootstrap/runtime/phobos/std/regex/internal/ir.d -of/«PKGBUILDDIR»/bootstrap/b/runtime/std/regex/internal/ir-debug.o -w -relocation-model=pic -g -link-debuglib -I/«PKGBUILDDIR»/bootstrap/runtime/phobos
0  libLLVM-3.9.so.1 0xf486eafd llvm::sys::PrintStackTrace(llvm::raw_ostream&) + 45
1  libLLVM-3.9.so.1 0xf486ef4d
2  libLLVM-3.9.so.1 0xf486cb60 llvm::sys::RunSignalHandlers() + 64
3  libLLVM-3.9.so.1 0xf486cc8b
4  linux-gate.so.1  0xf735bd40 __kernel_sigreturn + 0
5  ldc2             0xf753255d TemplateInstance::needsCodegen() + 573
6  ldc2             0xf753256a TemplateInstance::needsCodegen() + 586
7  ldc2             0xf753256a TemplateInstance::needsCodegen() + 586
8  ldc2             0xf753256a TemplateInstance::needsCodegen() + 586
9  ldc2             0xf753256a TemplateInstance::needsCodegen() + 586
10 ldc2             0xf753256a TemplateInstance::needsCodegen() + 586
11 ldc2             0xf753256a TemplateInstance::needsCodegen() + 586
12 ldc2             0xf753256a TemplateInstance::needsCodegen() + 586
13 ldc2             0xf753256a TemplateInstance::needsCodegen() + 586
14 ldc2             0xf753256a TemplateInstance::needsCodegen() + 586
...
255 ldc2             0xf75325de TemplateInstance::needsCodegen() + 702
Segmentation fault

on i386 - for the LTS compiler build!

So, now I wonder whether this might have something to do with LLVM. The full build log is here: https://buildd.debian.org/status/fetch.php?pkg=ldc&arch=i386&ver=1%3A1.1.1-2&stamp=1492986438&raw=0

@kinke (Member) commented Apr 23, 2017

Building ltsmaster works fine on i686 with Ubuntu 16.04 using its LLVM 3.8 libs (static ones apparently by default).

@dnadlinger (Member)

You did all your experiments on Arch, right?

Yes.

So, now I wonder whether this might have something to do with LLVM.

I wouldn't think so – the infinite recursion seems to happen in the frontend (in this case compiled by GCC), so it would have to be something like the AST being corrupted by a memory issue within LLVM, ABI issues messing up the stack due to header/executable issues, etc. That one Valgrind issue from @kinke's post I pointed out above occurs after IR generation is done (where TemplateInstance::needsCodegen is called).

But just in case, I was using the Arch Linux/ARM packages for LLVM earlier, while a straightforward source build is used for the binary releases.

I suppose there is a slim chance that the invalid writes in that other Valgrind log end up corrupting the AST in a way for the infinite recursion to happen. Do they still occur with the ltsmaster compiler?
Perhaps you could have a look at what is going on (e.g. using Valgrind's gdbserver) to see if that is an issue?
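The Valgrind gdbserver workflow mentioned above would look roughly like the following (the ldc2 command line is an illustrative assumption; the commands are assembled into variables here rather than executed, since they need the crashing setup on the porterbox). --vgdb-error=0 makes Valgrind pause at the first reported error so GDB can be attached before anything runs further:

```shell
# Terminal 1: run the compiler under Valgrind's embedded gdbserver.
run_under_valgrind="valgrind --vgdb=yes --vgdb-error=0 ldc2 -c source/app.d"

# Terminal 2: attach GDB to the paused process via the vgdb relay.
attach_gdb="gdb /usr/bin/ldc2 -ex 'target remote | vgdb'"

printf '%s\n' "$run_under_valgrind"
printf '%s\n' "$attach_gdb"
```

Once attached, GDB can inspect the exact state at each invalid write Valgrind reports, which would show whether those Import::semantic writes really corrupt anything the later codegen walks over.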

Do you know whether the issue occurs on all i686 Debian Sid boxes (cf. us not being able to reproduce it on Ubuntu)? If you swap out the compiler for one from a binary release, does that crash as well?

(Unfortunately, I'm a bit short on time right now.)

@ximion (Contributor, Author) commented Apr 23, 2017

@kinke

Building ltsmaster works fine on i686 with Ubuntu 16.04 using its LLVM 3.8 libs

I could switch back to LLVM 3.8 instead of 3.9.1 to see if that changes anything...

@klickverbot I can probably test these things tomorrow :-)

I still find it an interesting detail that this only seems to affect 32-bit architectures, as amd64 and ppc64el are fine.

@ximion (Contributor, Author) commented Apr 24, 2017

Compiling in an i386 chroot doesn't trigger the same behavior.

@petterreinholdtsen

LDC and all its dependencies were removed from Debian testing (aka the next release) today because of this issue; see https://tracker.debian.org/pkg/ldc for the latest status.

@ximion (Contributor, Author) commented May 22, 2017

Yes, and I am not happy with how the release team handled this matter. In any case, I apologize for this and hope to be able to introduce LDC to Stretch backports instead.

This issue still needs to be resolved for that to happen, though.

FTR: Link to the downstream bug report again https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=857085

@dnadlinger (Member)

Welp, that's not great. I am not sure what we could have done differently as the upstream devs (apart from directly engaging with the release people, which they probably wouldn't have liked), given that we still can't reproduce the issues on a number of systems/configurations.

@dnadlinger (Member)

(@ximion Many thanks for your work though, it is very much appreciated!)

@ximion (Contributor, Author) commented May 22, 2017

@klickverbot

Welp, that's not great. I am not sure what we could have done differently as the upstream devs (apart from directly engaging with release people, which probably they wouldn't like), given that we still can't reproduce the issues on a number of systems/configurations.

There's not really much you could have done - I was discussing this with people on IRC yesterday, and decided to just drop the armhf port, since it didn't look like we could resolve the issue prior to the release. I submitted an upload doing that, and apparently one of the release-team members who listened and commented on our discussion force-removed the package today without giving any prior warning or reason.
I am not happy.
But, as said, our fault, not yours.

Regarding the bug, I summarized what we know at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=857085#41 - hopefully we can resolve this somehow. I might just try to build LDC with LLVM in Debian experimental now, to see if that makes a difference (with no chance of getting LDC into the release anymore, I can at least go crazy in unstable to find this bug).

Unfortunately I have never been able to reproduce this issue locally, it seems to happen exclusively on Debian buildds.

@dnadlinger (Member)

Unfortunately I have never been able to reproduce this issue locally, it seems to happen exclusively on Debian buildds.

I thought Debian was moving towards reproducible builds? (That is, isn't there some way to get a chroot with the same environment locally?)

@ximion (Contributor, Author) commented May 22, 2017

Yes, you can, but that still doesn't make the build fail here. Which makes me believe that it might have something to do with the exact architecture used.

@petterreinholdtsen commented May 22, 2017 via email

@ximion (Contributor, Author) commented May 22, 2017

@petterreinholdtsen In general good advice, but this particular issue is an infinite recursion which happens for no obvious reason (I looked through the relevant pieces of code).
Randomly fixing warnings and hoping that it betters things would take quite a while ^^
But yeah, in general Coverity is awesome and a great QA tool.

@ximion (Contributor, Author) commented May 22, 2017

Building with LLVM 4.0 made the FTBFS of LDC itself on the i386 architecture disappear. Interesting. I am very curious if this maybe fixes the Tilix build as well.

@ximion (Contributor, Author) commented May 23, 2017

LDC with LLVM 4.0 builds Tilix 1.5.6 flawlessly on all architectures. The LLVM version is also the biggest difference between the Fedora and Debian toolchains. Unfortunately, LLVM 4.0 won't be in the next Debian release, which is why I hadn't explored that option until now: it wouldn't have helped the cause of getting LDC to work on Debian 9.

The previously failing Tilix version was 1.5.4, though. So, to be 100% certain that LLVM was indeed the culprit here, I will need to build that exact version of Tilix with the updated LLVM 4 LDC on armhf and see if that resolves the issue. I think the answer will be yes.

In any case, it now seems highly likely that the thing we were after for months is actually a bug somewhere in LLVM that got resolved with the 4.0 release (or was worked around / is no longer triggered).

@ximion (Contributor, Author) commented May 23, 2017

Okay, I tried to reproduce the issue with the new LLVM4 LDC and the exact same sources on a Debian armhf porterbox that I used before and which was the only place where I could reliably reproduce the bug: The issue did not appear anymore!

On a hunch, I also tried the old LDC binary (same version, but with LLVM 3.8), and the bug also didn't show itself. I also had a manually compiled version lying around, built with LLVM 3.9, which I had used before to reproduce the issue: the bug was gone with that one as well.

The LLVM changelog reads:

llvm-toolchain-3.9 (1:3.9.1-8) unstable; urgency=medium

  * Really fix "use versioned symbols" for llvm
    Thanks to Julien Cristau for the patch (Closes: #849098)

 -- Sylvestre Ledru <sylvestre@debian.org>  Tue, 25 Apr 2017 15:10:10 +0200

llvm-toolchain-3.9 (1:3.9.1-7) unstable; urgency=medium

  * Limit the archs where the ocaml binding is built
    Should fix the FTBFS
    Currently amd64 arm64 armel armhf i386

 -- Sylvestre Ledru <sylvestre@debian.org>  Sat, 15 Apr 2017 12:03:30 +0200

llvm-toolchain-3.9 (1:3.9.1-6) unstable; urgency=medium

  * Upload in unstable
  * Bring back ocaml. Thanks to Cyril Soldani (Closes: #858626)

 -- Sylvestre Ledru <sylvestre@debian.org>  Fri, 14 Apr 2017 10:02:03 +0200

llvm-toolchain-3.9 (1:3.9.1-6~exp2) experimental; urgency=medium

  * Add override_dh_makeshlibs for the libllvm or liblldb versions
    Thanks to Julien Cristau for the patch
  * change the min version of the libclang1 symbols to 1:3.9.1-6~
  * Fix the symlink on scan-build-py

 -- Sylvestre Ledru <sylvestre@debian.org>  Tue, 28 Mar 2017 06:32:40 +0200

llvm-toolchain-3.9 (1:3.9.1-6~exp1) experimental; urgency=medium

  [ Rebecca N. Palmer ]
  * Allow '!pointer' in OpenCL (Closes: #857623)
  * Add missing liblldb symlink (Closes: #857683)
  * Use versioned symbols (Closes: #848368)

 -- Sylvestre Ledru <sylvestre@debian.org>  Sun, 19 Mar 2017 10:12:03 +0100

llvm-toolchain-3.9 (1:3.9.1-5) unstable; urgency=medium

  * Fix the incorrect symlink to scan-build-py (Closes: #856869)

 -- Sylvestre Ledru <sylvestre@debian.org>  Sun, 12 Mar 2017 10:01:10 +0100

None of these changes look like they could have caused this bug. But I think it's relatively safe to assume now that this issue might not actually be an issue in LDC at all.

@ximion (Contributor, Author) commented Sep 6, 2017

I think we can close this. If it ever happens again, I will file a new bug report. Thanks for all your help!

@ximion ximion closed this as completed Sep 6, 2017