internal compiler error when compiling on ARM v8 ThunderX2 #13622

Closed
undertherain opened this issue May 25, 2019 · 20 comments

@undertherain

undertherain commented May 25, 2019

Compilation of NumPy 1.16.x fails when I compile it on a machine with Cavium ThunderX2 CPU

Versions 1.15.4 and below compile normally

Reproducing code example:

Get the latest stable release and run python3 setup.py install, or install it with pip install.

Error message:

The first error that does not come from _configtest.c is:

In file included from numpy/core/include/numpy/npy_math.h:580:0,
                 from numpy/core/src/npymath/npy_math_common.h:9,
                 from numpy/core/src/npymath/npy_math_complex.c.src:34:
numpy/core/src/npymath/npy_math_internal.h.src: In function ‘npy_cacoshf’:
numpy/core/src/npymath/npy_math_internal.h.src:482:12: internal compiler error: Segmentation fault
     return @kind@@c@(x, y);
            ^~~~~~~~~~~~~~~
0x994aff crash_signal

full compilation log: https://pastebin.com/83Jj3knk

Environment:

OS: CentOS Linux release 7.6.1810 (AltArch)
Python 3.7.2, tried several other versions.
GCC 7.4.0, also tried system's default 4.8.5
CPU:

Family: ARM
Manufacturer: Cavium Inc.
ID: F1 0A 1F 43 00 00 00 00
Signature: Implementor 0x43, Variant 0x1, Architecture 15, Part 0x0af, Revision 1
Version: Cavium ThunderX2(R) CPU CN9975 v2.1 @ 2.0GHz
@tylerjereddy
Contributor

Hmm, ARMv8 testing is part of our CI these days--I wonder what is so different about your setup? Pretty sure shippable is using Ubuntu for their native builds--so that's one difference.

@undertherain
Author

undertherain commented May 29, 2019

Well, one thing I should mention is that we have x64 nodes with an almost identical software stack (same OS, same package versions, Python installed with the same spack command, etc.), and NumPy 1.16.x compiles fine there.
I'll only have physical access to the servers in a couple of weeks. If no other solution emerges by then, I'll live-boot an Ubuntu ARM server image and try compiling on that.

@crbaird

crbaird commented Jun 7, 2019

I'm also seeing this issue on CentOS 7.6. Interestingly, it builds just fine on the same hardware with SLE 12 SP4.

@mattip
Member

mattip commented Jun 11, 2019

I am not sure we can help with this too much since our CI succeeds. What compiler is SLE 12 SP4 using?

You might be able to extract this as a stand-alone compiler bug by looking a few lines up for the actual gcc call and the gcc-options, then whittling down the example until it fails, starting off by removing unneeded include paths. Mine is below. Note the file you want to compile is actually the processed numpy/core/src/npymath/npy_math.c

```
x86_64-linux-gnu-gcc \
    -Ibuild/src.linux-x86_64-3.6/numpy/core/src/npymath \
    -Inumpy/core/include \
    -Ibuild/src.linux-x86_64-3.6/numpy/core/include/numpy \
    -Inumpy/core/src/common \
    -Inumpy/core/src \
    -Inumpy/core \
    -Inumpy/core/src/npymath \
    -Inumpy/core/src/multiarray \
    -Inumpy/core/src/umath \
    -Inumpy/core/src/npysort \
    -I/usr/include/python3.6m \
    -I/path/to/python/include/python3.6m \
    -Ibuild/src.linux-x86_64-3.6/numpy/core/src/common \
    -Ibuild/src.linux-x86_64-3.6/numpy/core/src/npymath \
    -Ibuild/src.linux-x86_64-3.6/numpy/core/src/common \
    -Ibuild/src.linux-x86_64-3.6/numpy/core/src/npymath \
    -c numpy/core/src/npymath/npy_math.c
```
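
The whittling-down step suggested above can be partly mechanised. The following is only a sketch of one way to do it, not something used in this thread: the helper names `dedupe_includes` and `drop_one_include` are hypothetical, and the command in the demo is a shortened placeholder rather than the real call from a build log. It removes repeated `-I` flags and then lists every variant of the command with exactly one include path dropped, so each variant can be tried by hand.

```python
import shlex


def dedupe_includes(args):
    """Drop repeated -I flags, keeping the first occurrence and the order.

    Only the joined form (-Ipath) is handled, which is the form used in
    the command quoted above.
    """
    seen = set()
    result = []
    for arg in args:
        if arg.startswith("-I"):
            if arg in seen:
                continue
            seen.add(arg)
        result.append(arg)
    return result


def drop_one_include(args):
    """Yield copies of the command, each with a single -I flag removed."""
    for index, arg in enumerate(args):
        if arg.startswith("-I"):
            yield args[:index] + args[index + 1:]


if __name__ == "__main__":
    # Shortened placeholder; paste the real gcc call from the build log.
    command = (
        "gcc -Inumpy/core/include -Inumpy/core/src "
        "-Inumpy/core/include -c npy_math.c"
    )
    base = dedupe_includes(shlex.split(command))
    for candidate in drop_one_include(base):
        print(" ".join(shlex.quote(part) for part in candidate))
```

Any variant that still reproduces the crash becomes the new starting point, and the process repeats until no include path can be removed.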

@eric-wieser
Member

eric-wieser commented Jun 11, 2019

It would be useful to know what the value of NPY_USE_C99_COMPLEX is on your build - that seems like the most likely difference between 1.15 and 1.16

I attempted to reduce it here but it didn't fail on any of the compilers I tried.

@linedot

linedot commented Jun 17, 2019

Experiencing the same issue on CentOS 7.6 on Huawei Taishan 2280 ARM64 servers.
The reduced example compiles without error on native GCC versions 8.2.0, 8.3.0 and 9.1.0.
Any advice on further debugging?

Edit: Bug does not occur when using -O0 on GCC 8.2.0 and does not occur at all with GCC 9.1.0

@ginomcevoy

ginomcevoy commented Aug 27, 2019

I am also experiencing this issue in the following environment:
Processor: ThunderX2
OS: RHEL 7.5 with updates and a custom Python 3.6 installation
compiler: GCC 7.2.1 (RHSCL 3.0)

I believe that the difference between the aarch64 CI and the failing environments could be related to the GLIBC version. NumPy uses its own "numpy" version of npy_cacoshf (npy_math_complex.c.src:1389 in the source) if the GLIBC version is below 2.18, and the "glibc" version otherwise (npy_math_complex.c:5343 in the build). CentOS 7.6 uses GLIBC 2.17.
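
To see which side of that threshold a given machine falls on, the C library version can be read from Python. This is a minimal sketch that assumes the 2.18 cutoff described in the comment above. The helper name `uses_numpy_fallback` is hypothetical and is not part of NumPy.

```python
import platform

# Cutoff taken from the comment above: below glibc 2.18, NumPy is said to
# use its own npy_cacoshf implementation. This is an assumption of the sketch.
GLIBC_CUTOFF = (2, 18)


def parse_version(text):
    """Turn a version string such as '2.17' into a tuple of two ints."""
    return tuple(int(part) for part in text.split(".")[:2])


def uses_numpy_fallback(glibc_version):
    """Hypothetical helper: True if the version is below the cutoff."""
    return parse_version(glibc_version) < GLIBC_CUTOFF


if __name__ == "__main__":
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        print(lib, version, "fallback:", uses_numpy_fallback(version))
    else:
        print("not a glibc system, or version not detected")
```

Running it on both a failing and a working host gives a quick way to confirm or rule out the GLIBC version as the difference between them.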

After playing with the compiler/numpy flags, I confirmed the issue occurs when GCC tries to inline the function. Adding either -O0 or -fno-inline flags makes the compilation succeed, but this is not desirable.

The only (ugly) workaround that I found so far that did not affect other functions was to force -O0 on the function definition. Here is the source diff for the workaround (GNU GCC only of course):

diff --git a/numpy/core/src/npymath/npy_math_complex.c.src b/numpy/core/src/npymath/npy_math_complex.c.src
index dad3812..adcd83c 100644
--- a/numpy/core/src/npymath/npy_math_complex.c.src
+++ b/numpy/core/src/npymath/npy_math_complex.c.src
@@ -1385,7 +1385,7 @@ npy_catan@c@(@ctype@ z)
 #endif

 #ifndef HAVE_CACOSH@C@
-@ctype@
+@ctype@ __attribute__((optimize("-O0")))
 npy_cacosh@c@(@ctype@ z)
 {
     /*

More debugging info:

  • noinline / __attribute__((noinline)) / __attribute__((gnu_inline)) attributes didn't work for me
  • Setting -std=c89 or -std=gnu89 works for that file, but numpy expects C99 standard in other places
  • GCC 4.8.5 can compile the file, even with -std=c99, for some reason.

@charris
Member

charris commented Aug 27, 2019

@ginomcevoy Thanks for the informative debug info. You should be able to compile with GCC 7.2.1 even without the -std=c99 flag, I believe it is only needed for the 4.8 series. But that shouldn't make any difference here. Hmm...

@charris
Member

charris commented Aug 27, 2019

Note that c99 is only needed for NumPy >= 1.17.

@ginomcevoy

Yes, I forgot to mention that I was trying to install NumPy 1.17, that is why I didn't go for -std=c89 which works for that particular file. c99 is the default value for -std in GCC 7.2.1, and it fails with that (and with any value higher than that). Had the same problem for GCC 8.2.0, but not for GCC 4.8.5 (which is terrible at optimizing code for ThunderX2).

I was able to compile NumPy after that ugly fix, and only one test failed for me (TestComplexFunctions.test_loss_of_precision[complex256]).

@siddhesh
Contributor

siddhesh commented Oct 9, 2019

An ICE would typically mean a compiler bug. I'll see if I can find a centos box to reproduce this.

EDIT: I should add that I just built numpy with gcc 8.3.0, so whatever the bug, it's likely to have already been fixed and only a backport may be necessary to get this working again.

EDIT2: Fixed gcc version. Sorry, checked the wrong machine, too many tabs open :/

@siddhesh
Contributor

Sorry I couldn't find a suitable machine to reproduce this. If someone can give me access I'll be happy to help isolate the problem.

@maxim-kuvyrkov

I've reproduced this on cortex-a72 with both CentOS 7.6 and Ubuntu 18.04. I'm looking into this.
The hunch that this is a glibc bug (also my first guess) seems to be wrong. The crash reproduces with glibc-2.27.

Please report bugs like this to the upstream community (https://gcc.gnu.org/bugzilla/). The GNU Toolchain community prioritizes ICE (internal compiler error) and wrong-code bugs. Most of the time you just need to attach the preprocessed source file and the cc1 command line; add "-v -save-temps" to the compilation flags to get these.
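
As an illustration of that last step, the two flags can be appended to the failing compiler call before re-running it. This is only a sketch: the helper name `with_repro_flags` is hypothetical, and the command in the demo is a placeholder for the real gcc call from the build log.

```python
import shlex

# Flags named in the comment above. -v makes gcc print the commands it runs
# (including the cc1 command line); -save-temps keeps intermediate files
# such as the preprocessed source.
REPRO_FLAGS = ["-v", "-save-temps"]


def with_repro_flags(command):
    """Return the compiler command as a list, with the repro flags added."""
    args = shlex.split(command)
    for flag in REPRO_FLAGS:
        if flag not in args:
            args.append(flag)
    return args


if __name__ == "__main__":
    # Placeholder command; substitute the failing gcc call from the build log.
    args = with_repro_flags("gcc -O2 -c npy_math.c")
    print(" ".join(shlex.quote(part) for part in args))
```

The printed command can then be run in the NumPy source tree, and the resulting preprocessed file attached to the bug report.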

@nSircombe

nSircombe commented Oct 23, 2019

I think it was here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90075

...and fixed, I believe.

NumPy builds with 9.2.0 on Aarch64 for me now.

@maxim-kuvyrkov

@nSircombe, thanks! This saved me some cycles digging into this further. The patch was backported to the gcc-7 and gcc-8 release branches and will be in the next update releases, which distros should pick up.
GCC 7.5 is expected to be released in the next couple of weeks and will be the final GCC 7 update.
Is anything else required from the GCC side of things? Or should this be closed?

@nSircombe

I've not tested the newer versions of GCC 7 & 8. But I can confirm that pip install numpy works for 9.2.0.

@maxim-kuvyrkov

Confirmed that GCC 7 built from current gcc-7-branch works fine.

@mattip
Member

mattip commented Oct 23, 2019

Thanks for the detective work. Should we wait for the toolchain to be released to close this?

@BaptisteGerondeau

Release 1.17.4 has moved things around, and the workaround patch above no longer applies. Here is what works for me at the moment:

--- numpy/core/src/npymath/npy_math_internal.h.src.orig	2019-11-14 12:20:01.387180922 +0000
+++ numpy/core/src/npymath/npy_math_internal.h.src	2019-11-14 12:19:17.960646234 +0000
@@ -477,7 +477,7 @@
  * #KIND = ATAN2,HYPOT,POW,FMOD,COPYSIGN#
  */
 #ifdef HAVE_@KIND@@C@
-NPY_INPLACE @type@ npy_@kind@@c@(@type@ x, @type@ y)
+NPY_INPLACE __attribute__((optimize("-O0"))) @type@ npy_@kind@@c@(@type@ x, @type@ y)
 {
     return @kind@@c@(x, y);
 }

Not sure if it is the "best" (least bad) workaround though, so feel free to give some feedback!

@mattip
Member

mattip commented Mar 7, 2020

Closing, please reopen if there are still problems with this particular toolchain.
