Cannot build lean on linux armv7l (ubuntu 16) due to segfault compiling library #1679
Comments
Here's a zipfile containing the core dump.
At a glance, it would appear that (1) scanner.m_spos is a signed int, when it should probably be unsigned, and (2) some bug has caused it to overflow during the scanning of some file. Here's some more context on the exception that was being thrown:
Stepping through the live code in gdb, the problem seems to be that the test
This does seem to be borne out by the compiler warning:
For reference, here is the location of the EOF definition:
... and here's a small example showing the difference in architecture:
When I compile that with
Actually, here's an even smaller example:
This returns 1 on amd64 and 0 on armhf when simply compiled with
So I guess |
Anyway, it seems like the least invasive fix for this is to add |
@abliss Thanks for reporting the problem. I checked the C++ documentation, and the meaning of
BTW, according to the standard EOF is a macro definition of type
@gebner @Kha @jroesch I think we should go with @abliss' suggestion (i.e., add |
BTW, note that according to the standard there is no guarantee that |
@abliss I don't have an ARM machine to make experiments.
|
Thanks @leodemoura , I'll give that a try and report back. Meanwhile, adding
So it seems that there may be another arch-specific assumption buried in there somewhere. |
I think it is the same assumption. The whole code base is assuming |
Sorry, my previous comment doesn't make sense. The |
Yeah, I figured that the
BTW, if you want to experiment locally, you might have some luck installing the Android NDK on your x86 machine. I believe it comes with a cross-compiling toolkit and a QEMU emulator. (I'm running out of an Ubuntu chroot on my quad-core Nexus 7; the QEMU performance will likely be much slower.) |
BTW is there documentation on the |
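@abliss
    #include <cstdio>    // declares EOF; <iostream> is not guaranteed to pull it in
    #include <iostream>
    int main() {
        char c = -1;         // becomes 255 where plain char is unsigned
        return (c == EOF);   // exit status is 1 if the comparison holds, 0 otherwise
    }
|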
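To reproduce the architecture difference on a single machine, the signedness can be spelled out explicitly instead of relying on plain `char`. The following is only a minimal sketch, not code from Lean; all function names are made up for illustration. It also shows the conversion through `std::char_traits<char>::to_int_type`, which maps every byte to a non-negative `int` and therefore cannot collide with `EOF`.

```cpp
#include <cassert>
#include <cstdio>   // EOF
#include <string>   // std::char_traits

// Plain char behaves like one of these two, depending on the platform ABI
// (signed on amd64 Linux, unsigned on ARM Linux).
bool equals_eof_signed(signed char c)     { return c == EOF; }
bool equals_eof_unsigned(unsigned char c) { return c == EOF; }  // always false: EOF is negative

// Portable alternative: widen the byte to a non-negative int first.
int byte_to_int(char c) {
    int v = std::char_traits<char>::to_int_type(c);
    assert(v >= 0 && v <= 255);  // holds for 8-bit char regardless of signedness
    return v;
}

// Compare against EOF only after widening, never on a raw char.
bool is_eof(int c) { return c == std::char_traits<char>::eof(); }
```

With a C library that defines `EOF` as -1 (glibc does), `equals_eof_signed(-1)` is true, while `equals_eof_unsigned(0xFF)` is false on every platform. That matches the 1 versus 0 exit status reported for amd64 and armhf.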
@leodemoura The
I have now rebooted my Raspberry Pi for the first time in 2 years and will try to fix this issue:
|
The problem already happens when reading
I'm now compiling a debug build with the latest changes. Each build takes about 8 hours on my rpi, so I'm afraid this will take a bit of time. |
@gebner Thanks again for your heroic efforts! |
I tried reducing the library to just
Is there expected variation in a produced .olean file (timestamps, random numbers, threading?) or should it be deterministic based on the inputs?
(I did notice that if I run it under valgrind on ARM, it reported a handful of operations on undefined data, but this was due to a bug in valgrind: https://lists.launchpad.net/touch-packages/msg70519.html. After upgrading valgrind from 3.11 to 3.13, it seems to be running cleanly on ARM as well so far.)
I tried this experiment to answer my own question:
On ARM, this produced no successes, and a string of failures in 4 different flavors:
(Without the -j 1 flag, I seem to get different output every time.)
On amd64, this produced a string of 15 identical successes:
Moving one of these successful core.olean files to the ARM machine allowed it to run correctly, so it seems that deserialization is fine, but serialization is flaky. I wonder if there might be a thread synch issue (even with -j1), which for some reason only affects ARM (different scheduler, or maybe just slower). Other than that I don't know what could be causing the nondeterministic corruption. |
Thanks for figuring out that it works with the olean file generated on amd64. I have tried to add a few checks in the code to narrow down the problem. From a top-down view, an olean file consists of a bit of metadata (lean version, module dependencies), followed by the actual content (a byte array) plus a hash code (which we don't check for performance reasons). For some reason, when reading the olean file, already this actual content part is corrupted and the hash code is wrong.
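The layout described above (content blob followed by a hash code) lends itself to a read-side check that reports corruption instead of crashing later. The sketch below is purely illustrative: the field order, the 32-bit little-endian size prefix, the FNV-1a hash, and every function name are assumptions, not Lean's actual .olean format.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// FNV-1a, standing in for whatever hash the real format uses.
std::uint32_t fnv1a(const std::string& bytes) {
    std::uint32_t h = 2166136261u;
    for (unsigned char b : bytes) { h ^= b; h *= 16777619u; }
    return h;
}

void write_u32(std::ostream& out, std::uint32_t v) {
    for (int i = 0; i < 4; i++) out.put(static_cast<char>((v >> (8 * i)) & 0xFF));
}

bool read_u32(std::istream& in, std::uint32_t& v) {
    v = 0;
    for (int i = 0; i < 4; i++) {
        int c = in.get();  // int, so a 0xFF byte (255) and end-of-file stay distinct
        if (c == std::char_traits<char>::eof()) return false;
        v |= static_cast<std::uint32_t>(c) << (8 * i);
    }
    return true;
}

// Hypothetical layout: size, blob bytes, hash of the blob.
void write_blob(std::ostream& out, const std::string& blob) {
    assert(blob.size() <= 0xFFFFFFFFu);
    write_u32(out, static_cast<std::uint32_t>(blob.size()));
    out.write(blob.data(), static_cast<std::streamsize>(blob.size()));
    write_u32(out, fnv1a(blob));
}

// Returns false if the stream is truncated or the hash does not match.
bool read_blob(std::istream& in, std::string& blob) {
    std::uint32_t size = 0, hash = 0;
    if (!read_u32(in, size)) return false;
    blob.resize(size);
    in.read(&blob[0], static_cast<std::streamsize>(size));
    if (in.gcount() != static_cast<std::streamsize>(size)) return false;
    if (!read_u32(in, hash)) return false;
    return fnv1a(blob) == hash;
}
```

A check like this costs one pass over the blob, which is the performance trade-off mentioned above; it could be limited to debug builds.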
Yes, mainly due to multi-threading. AFAICT we don't use random numbers or store timestamps. The variation comes mainly from the
You can run lean with the
Typically these non-deterministic threading bugs arise due to timing differences: the ARM machine is simply slower and hence hits some corner cases. We've had this situation with slow debug builds in the past as well. |
Thanks for those details. I can confirm that the bug still happens for me with |
This is terribly, terribly weird. We seem to lose 3 bytes when writing the blob section of the olean file. All of these bytes have the value
Stepping through with the debugger, when writing, the blob has size 152240; this is consistent between amd64 and arm, and it is also the size we write into the file. However, 3 out of those 152240 bytes don't make it to the file. So when reading, we are 3 bytes short. Here is the full diff:
|
Are these 3 bytes the only
Even worse: they get lost at |
That was my first thought as well, but no, other 0xFF bytes survive just fine.
Good catch! Now the issue becomes interesting! Even better, it only happens at some 4096*n page boundaries, namely n=10, n=14, and n=34 ?!? |
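The kind of comparison done by hand here (which bytes are missing, and do they sit on a 4096-byte boundary?) can be sketched in a few lines. This is only an illustration with invented function names. The greedy alignment assumes bytes were only deleted, never inserted or changed, and inside a run of identical bytes it reports the last byte of the run rather than the one that was really dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Offsets (into `expected`) of bytes that are missing from `actual`.
std::vector<std::size_t> dropped_offsets(const std::string& expected,
                                         const std::string& actual) {
    assert(actual.size() <= expected.size());
    std::vector<std::size_t> dropped;
    std::size_t i = 0, j = 0;
    while (i < expected.size()) {
        if (j < actual.size() && expected[i] == actual[j]) {
            ++i; ++j;                 // bytes agree, advance both cursors
        } else {
            dropped.push_back(i);     // treat expected[i] as lost
            ++i;
        }
    }
    return dropped;
}

// True if `offset` lies within `slack` bytes of a multiple of `block`.
bool near_multiple_of(std::size_t offset, std::size_t block, std::size_t slack) {
    std::size_t r = offset % block;
    return r <= slack || block - r <= slack;
}
```

Feeding it the in-memory blob and the bytes read back from disk, then filtering with `near_multiple_of(offset, 4096, 1)`, would show whether every loss falls on a page boundary.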
The 4096 byte boundaries align perfectly with the 8192 byte buffer that libstdc++ uses:
Well, except for the times where it flushes a buffer with 8191 bytes instead of 8192?!?! |
And I have a reproducible minimal test:
    #include <iostream>
    #include <fstream>
    int main() {
        constexpr auto fn = "buffer_0xff.out";
        constexpr unsigned size = 1000000;
        { // write a million times 0xFF
            std::ofstream out(fn, std::ios_base::binary);
            for (unsigned i = 0; i < size; i++) {
                out.put(0xFF);
            }
        }
        auto actual = std::ifstream(fn, std::ios_base::ate | std::ios_base::binary).tellg();
        std::cout << "The file " << fn << " has size " << actual
                  << " (expected: " << size << ")" << std::endl;
    }
The output depends on
|
Upstream knows about this breakage: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47275
So, this means we must not use
We'll have to fix all places where we assume that char is signed. |
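As an illustration of what "assuming that char is signed" can look like, here is a small sketch around UTF-8 byte classification. It is not taken from the Lean sources; the function names are hypothetical. The fragile variant is shown in a comment, and the portable variant goes through `unsigned char` so that it behaves the same on amd64 and ARM.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Fragile: only correct where plain char is signed.
//   bool is_non_ascii(char c) { return c < 0; }

// Portable: inspect the byte value through unsigned char.
bool is_non_ascii(char c) {
    return static_cast<unsigned char>(c) >= 0x80;
}

// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
bool is_utf8_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Length of the UTF-8 sequence introduced by a lead byte.
std::size_t utf8_sequence_length(char lead) {
    assert(!is_utf8_continuation(lead));
    unsigned char b = static_cast<unsigned char>(lead);
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

// Counts code points by skipping continuation bytes.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (char c : s) {
        if (!is_utf8_continuation(c)) ++n;
    }
    return n;
}
```

The same pattern applies wherever a `char` is used as an array index, compared against a numeric range, or shifted: convert to `unsigned char` first.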
Amazing. I get the exact same result on my device (999877 with -fsigned-char, 1000000 without). Nice detective work. Sorry for suggesting |
@abliss The latest git version can now compile the standard library (well, half of it, my rpi is slow). Can you confirm that it compiles on your device, and could you please run the tests to make sure everything works? |
I think the curr() function should return int, because EOF is an int, not a char.
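A minimal sketch of that suggestion, assuming a much simpler scanner than Lean's: `mini_scanner` and its members are invented for illustration, and only the name `curr()` comes from the discussion. Because `curr()` returns `int`, a 0xFF byte in the input (255) and the end of input (`EOF`, a negative value) can no longer be confused, whatever the signedness of `char`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

class mini_scanner {
    std::string m_input;
    std::size_t m_pos = 0;
public:
    using traits = std::char_traits<char>;

    explicit mini_scanner(std::string input) : m_input(std::move(input)) {}

    // Returns 0..255 for a byte of input, traits::eof() once the input is exhausted.
    int curr() const {
        return m_pos < m_input.size() ? traits::to_int_type(m_input[m_pos])
                                      : traits::eof();
    }

    // Advances to the next byte; must not be called at end of input.
    void next() {
        assert(m_pos < m_input.size());
        ++m_pos;
    }
};

// Counts the bytes a scanner yields before reaching end of input.
std::size_t count_bytes(mini_scanner s) {
    std::size_t n = 0;
    while (s.curr() != mini_scanner::traits::eof()) {
        s.next();
        ++n;
    }
    return n;
}
```

With a `char`-returning `curr()`, the loop in `count_bytes` would either stop early at the first 0xFF byte (signed `char`) or never terminate (unsigned `char`).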
Thanks for the fix @gebner . Here's what I've got on my device at commit ce5ca7:
(btw it failed at first because |
Thanks so much for running the tests! I'll take a look at
Yes, unfortunately there is no way to add additional commands to the |
Description
The build of lean fails due to a segfault.
Steps to Reproduce
Expected behavior: [What you expect to happen]
make should succeed
Actual behavior: [What actually happens]
Make segfaults, apparently during a compile of the lean core?
(The exact target printed in the last line varies slightly.)
Reproduces how often: [What percentage of the time does it reproduce?]
5/5
Versions
Additional Information
I'll try attaching a core dump, but it's 46M and I don't know if github will allow it. It seems to be some problem during the handling of an exception: