Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

inlining failed in call to ‘always_inline’ ‘void vst1q_u64(uint64_t*, uint64x2_t)’: target specific option mismatch 11002 | vst1q_u64 (uint64_t * __a, uint64x2_t __b) #834

Closed
stefson opened this issue Jul 7, 2022 · 43 comments · Fixed by #965

Comments

@stefson
Copy link

stefson commented Jul 7, 2022

hi, this is most likely a regression from d8867c9

compiler is gcc-10.4.0 on armhf

[8/38] /usr/bin/armv7a-unknown-linux-gnueabihf-g++ -DHWY_SHARED_DEFINE -Dhwy_contrib_EXPORTS -I/var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999  -O2 -pipe -fomit-frame-pointer -fPIC -fvisibility=hidden -fvisibility-inlines-hidden -Wno-builtin-macro-redefined -D__DATE__=\"redacted\" -D__TIMESTAMP__=\"redacted\" -D__TIME__=\"redacted\" -fmerge-all-constants -Wall -Wextra -Wconversion -Wsign-conversion -Wvla -Wnon-virtual-dtor -fmath-errno -fno-exceptions -MD -MT CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_128a.cc.o -MF CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_128a.cc.o.d -o CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_128a.cc.o -c /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/vqsort_128a.cc
FAILED: CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_128a.cc.o 
/usr/bin/armv7a-unknown-linux-gnueabihf-g++ -DHWY_SHARED_DEFINE -Dhwy_contrib_EXPORTS -I/var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999  -O2 -pipe -fomit-frame-pointer -fPIC -fvisibility=hidden -fvisibility-inlines-hidden -Wno-builtin-macro-redefined -D__DATE__=\"redacted\" -D__TIMESTAMP__=\"redacted\" -D__TIME__=\"redacted\" -fmerge-all-constants -Wall -Wextra -Wconversion -Wsign-conversion -Wvla -Wnon-virtual-dtor -fmath-errno -fno-exceptions -MD -MT CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_128a.cc.o -MF CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_128a.cc.o.d -o CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_128a.cc.o -c /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/vqsort_128a.cc
In file included from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/foreach_target.h:23,
                 from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/vqsort_128a.cc:20:
/var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/targets.h: In function ‘std::vector<unsigned int> hwy::SupportedAndGeneratedTargets()’:
/var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/detect_targets.h:438:50: warning: integer overflow in expression of type ‘int’ results in ‘-2147483648’ [-Woverflow]
  438 | #define HWY_TARGETS (HWY_ATTAINABLE_TARGETS & (2 * HWY_STATIC_TARGET - 1))
      |                                                  ^
/var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/targets.h:69:48: note: in expansion of macro ‘HWY_TARGETS’
   69 |   for (uint32_t targets = SupportedTargets() & HWY_TARGETS; targets != 0;
      |                                                ^~~~~~~~~~~
In file included from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/highway.h:25,
                 from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/shared-inl.h:103,
                 from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/traits128-inl.h:27,
                 from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/vqsort_128a.cc:23,
                 from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/foreach_target.h:81,
                 from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/vqsort_128a.cc:20:
/var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/targets.h: In member function ‘size_t hwy::ChosenTarget::GetIndex() const’:
/var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/detect_targets.h:438:50: warning: integer overflow in expression of type ‘int’ results in ‘-2147483648’ [-Woverflow]
  438 | #define HWY_TARGETS (HWY_ATTAINABLE_TARGETS & (2 * HWY_STATIC_TARGET - 1))
      |                                                  ^
/var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/targets.h:157:7: note: in definition of macro ‘HWY_CHOSEN_TARGET_SHIFT’
  157 |   ((((X) >> (HWY_HIGHEST_TARGET_BIT + 1 - HWY_MAX_DYNAMIC_TARGETS)) & \
      |       ^
/var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/targets.h:163:28: note: in expansion of macro ‘HWY_TARGETS’
  163 |   (HWY_CHOSEN_TARGET_SHIFT(HWY_TARGETS) | HWY_CHOSEN_TARGET_MASK_SCALAR | 1u)
      |                            ^~~~~~~~~~~
/var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/targets.h:267:47: note: in expansion of macro ‘HWY_CHOSEN_TARGET_MASK_TARGETS’
  267 |                                               HWY_CHOSEN_TARGET_MASK_TARGETS);
      |                                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/ops/arm_neon-inl.h:29,
                 from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/highway.h:322,
                 from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/shared-inl.h:103,
                 from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/traits128-inl.h:27,
                 from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/vqsort_128a.cc:23,
                 from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/foreach_target.h:81,
                 from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/vqsort_128a.cc:20:
/usr/lib/gcc/armv7a-unknown-linux-gnueabihf/10.4.0/include/arm_neon.h: In function ‘void hwy::N_NEON::StoreU(hwy::N_NEON::Vec128<long long unsigned int, 2>, hwy::N_NEON::Full128<long long unsigned int>, uint64_t*)’:
/usr/lib/gcc/armv7a-unknown-linux-gnueabihf/10.4.0/include/arm_neon.h:11002:1: error: inlining failed in call to ‘always_inline’ ‘void vst1q_u64(uint64_t*, uint64x2_t)’: target specific option mismatch
11002 | vst1q_u64 (uint64_t * __a, uint64x2_t __b)
      | ^~~~~~~~~
In file included from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/highway.h:322,
                 from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/shared-inl.h:103,
                 from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/traits128-inl.h:27,
                 from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/vqsort_128a.cc:23,
                 from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/foreach_target.h:81,
                 from /var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/vqsort_128a.cc:20:
/var/tmp/portage/portage/dev-cpp/highway-9999/work/highway-9999/hwy/ops/arm_neon-inl.h:2744:12: note: called from here
 2744 |   vst1q_u64(unaligned, v.raw);
      |   ~~~~~~~~~^~~~~~~~~~~~~~~~~~

full build log: build.log.zip

@jan-wassenberg
Copy link
Member

@stefson
hm, I'm mystified 😦 Here is the function definition I see from armhf gcc 10:

__extension__ extern __inline void
__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
vst1q_u64 (uint64_t * __a, uint64x2_t __b)
{
  __builtin_neon_vst1v2di ((__builtin_neon_di *) __a, (int64x2_t) __b);
}

Unlike the p64 version, which sets extra attributes, I don't see any here. If you comment out the call to vst1q_u64, are there any other errors?

For the warning, I think we can fix that by replacing 2 with 2ULL; we'll anyway soon make targets 64-bit.

@jan-wassenberg
Copy link
Member

@stefson do you have any idea what might be happening? If not, it's an option to disable runtime dispatch for arm7 on this version of GCC.

@stefson
Copy link
Author

stefson commented Jul 22, 2022

could you guide me a little bit in how to disable runtime dispatch?

@jan-wassenberg
Copy link
Member

Sure, in detect_targets.h we have a line #if HWY_ARCH_X86 || (HWY_ARCH_ARM && HWY_COMPILER_GCC_ACTUAL && HWY_OS_LINUX). You can for example change HWY_ARCH_ARM to HWY_ARCH_ARM_A64.

@stefson
Copy link
Author

stefson commented Jul 25, 2022

do you mean as in:

diff --git a/hwy/detect_targets.h b/hwy/detect_targets.h
index afc9154..7be7770 100644
--- a/hwy/detect_targets.h
+++ b/hwy/detect_targets.h
@@ -372,7 +372,7 @@
 
 // x86 compilers generally allow runtime dispatch. On Arm, currently only GCC
 // does, and we require Linux to detect CPU capabilities.
-#if HWY_ARCH_X86 || (HWY_ARCH_ARM && HWY_COMPILER_GCC_ACTUAL && HWY_OS_LINUX)
+#if HWY_ARCH_X86 || (HWY_ARCH_ARM_A64 && HWY_COMPILER_GCC_ACTUAL && HWY_OS_LINUX)
 #define HWY_HAVE_RUNTIME_DISPATCH 1
 #else
 #define HWY_HAVE_RUNTIME_DISPATCH 0

?

@jan-wassenberg
Copy link
Member

Yes, looks good :) If this helps you, feel free to send this as a pull request, or we can do it if you prefer.

@stefson
Copy link
Author

stefson commented Jul 25, 2022

it helps indeed, but I need more time to iron this out - arm hardware really is slow.

copybara-service bot pushed a commit that referenced this issue Sep 5, 2022
copybara-service bot pushed a commit that referenced this issue Sep 5, 2022
copybara-service bot pushed a commit that referenced this issue Sep 5, 2022
@stefson
Copy link
Author

stefson commented Sep 5, 2022

I wonder about a sensible strategy for a fix on the compiler side?

@malaterre
Copy link
Contributor

malaterre commented Sep 5, 2022

I cannot reproduce any compilation issue on armhf/gcc10|11|12.

For reference:

% grep -3 vst1q_u64 /usr/lib/gcc/arm-linux-gnueabihf/*/include/arm_neon.h
/usr/lib/gcc/arm-linux-gnueabihf/10/include/arm_neon.h-
/usr/lib/gcc/arm-linux-gnueabihf/10/include/arm_neon.h-__extension__ extern __inline void
/usr/lib/gcc/arm-linux-gnueabihf/10/include/arm_neon.h-__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
/usr/lib/gcc/arm-linux-gnueabihf/10/include/arm_neon.h:vst1q_u64 (uint64_t * __a, uint64x2_t __b)
/usr/lib/gcc/arm-linux-gnueabihf/10/include/arm_neon.h-{
/usr/lib/gcc/arm-linux-gnueabihf/10/include/arm_neon.h-  __builtin_neon_vst1v2di ((__builtin_neon_di *) __a, (int64x2_t) __b);
/usr/lib/gcc/arm-linux-gnueabihf/10/include/arm_neon.h-}
--
/usr/lib/gcc/arm-linux-gnueabihf/11/include/arm_neon.h-
/usr/lib/gcc/arm-linux-gnueabihf/11/include/arm_neon.h-__extension__ extern __inline void
/usr/lib/gcc/arm-linux-gnueabihf/11/include/arm_neon.h-__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
/usr/lib/gcc/arm-linux-gnueabihf/11/include/arm_neon.h:vst1q_u64 (uint64_t * __a, uint64x2_t __b)
/usr/lib/gcc/arm-linux-gnueabihf/11/include/arm_neon.h-{
/usr/lib/gcc/arm-linux-gnueabihf/11/include/arm_neon.h-  __builtin_neon_vst1v2di ((__builtin_neon_di *) __a, (int64x2_t) __b);
/usr/lib/gcc/arm-linux-gnueabihf/11/include/arm_neon.h-}
--
/usr/lib/gcc/arm-linux-gnueabihf/12/include/arm_neon.h-
/usr/lib/gcc/arm-linux-gnueabihf/12/include/arm_neon.h-__extension__ extern __inline void
/usr/lib/gcc/arm-linux-gnueabihf/12/include/arm_neon.h-__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
/usr/lib/gcc/arm-linux-gnueabihf/12/include/arm_neon.h:vst1q_u64 (uint64_t * __a, uint64x2_t __b)
/usr/lib/gcc/arm-linux-gnueabihf/12/include/arm_neon.h-{
/usr/lib/gcc/arm-linux-gnueabihf/12/include/arm_neon.h-  __builtin_neon_vst1v2di ((__builtin_neon_di *) __a, (int64x2_t) __b);
/usr/lib/gcc/arm-linux-gnueabihf/12/include/arm_neon.h-}

% gcc-10 --version
gcc-10 (Debian 10.4.0-4) 10.4.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

% gcc-11 --version
gcc-11 (Debian 11.3.0-5) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

% gcc-12 --version
gcc-12 (Debian 12.2.0-1) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

@stefson
Copy link
Author

stefson commented Sep 5, 2022

I see the same results as you for my gcc-10 armv7a cross compiler, but still it gives me the error without the now pushed patch.

@malaterre
Copy link
Contributor

I see the same results as you for my gcc-10 armv7a cross compiler, but still it gives me the error without the now pushed patch.

Add '--verbose' to the compilation line that is failing and post back. Eg.:

/usr/bin/armv7a-unknown-linux-gnueabihf-g++ --verbose -DHWY_SHARED_DEFINE -Dhwy_contrib_EXPORTS [...]

@stefson
Copy link
Author

stefson commented Sep 5, 2022

hey, here is my output from --verbose, it is with commit 9b3bd6d to not hide the problem with the current workaround:

LANG=C /usr/bin/armv7a-unknown-linux-gnueabihf-g++ --verbose -DHWY_SHARED_DEFINE -Dhwy_contrib_EXPORTS -I/var/tmp/portage/dev-cpp/highway-9999/work/highway-9999  -O2 -pipe -fomit-frame-pointer -fPIC -fvisibility=hidden -fvisibility-inlines-hidden -Wno-builtin-macro-redefined -D__DATE__=\"redacted\" -D__TIMESTAMP__=\"redacted\" -D__TIME__=\"redacted\" -fmerge-all-constants -Wall -Wextra -Wconversion -Wsign-conversion -Wvla -Wnon-virtual-dtor -fmath-errno -fno-exceptions -MD -MT CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_i64d.cc.o -MF CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_i64d.cc.o.d -o CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_i64d.cc.o -c /var/tmp/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/vqsort_i64d.cc
Using built-in specs.
COLLECT_GCC=/usr/bin/armv7a-unknown-linux-gnueabihf-g++
Target: armv7a-unknown-linux-gnueabihf
Configured with: /var/tmp/portage/cross-armv7a-unknown-linux-gnueabihf/gcc-10.4.0/work/gcc-10.4.0/configure --host=x86_64-pc-linux-gnu --target=armv7a-unknown-linux-gnueabihf --build=x86_64-pc-linux-gnu --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/armv7a-unknown-linux-gnueabihf/gcc-bin/10.4.0 --includedir=/usr/lib/gcc/armv7a-unknown-linux-gnueabihf/10.4.0/include --datadir=/usr/share/gcc-data/armv7a-unknown-linux-gnueabihf/10.4.0 --mandir=/usr/share/gcc-data/armv7a-unknown-linux-gnueabihf/10.4.0/man --infodir=/usr/share/gcc-data/armv7a-unknown-linux-gnueabihf/10.4.0/info --with-gxx-include-dir=/usr/lib/gcc/armv7a-unknown-linux-gnueabihf/10.4.0/include/g++-v10 --with-python-dir=/share/gcc-data/armv7a-unknown-linux-gnueabihf/10.4.0/python --enable-languages=c,c++,fortran --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --enable-nls --without-included-gettext --disable-libunwind-exceptions --enable-checking=release --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo 10.4.0 p5' --disable-esp --enable-libstdcxx-time --disable-libstdcxx-pch --enable-poison-system-directories --with-sysroot=/usr/armv7a-unknown-linux-gnueabihf --disable-bootstrap --enable-__cxa_atexit --enable-clocale=gnu --disable-multilib --disable-fixed-point --with-float=hard --with-arch=armv7-a --with-float=hard --with-fpu=vfpv3-d16 --enable-libgomp --disable-libssp --disable-libada --disable-cet --disable-systemtap --disable-vtable-verify --disable-libvtv --without-zstd --enable-lto --without-isl --enable-default-pie --enable-default-ssp
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 10.4.0 (Gentoo 10.4.0 p5) 
COLLECT_GCC_OPTIONS='-v' '-D' 'HWY_SHARED_DEFINE' '-D' 'hwy_contrib_EXPORTS' '-I' '/var/tmp/portage/dev-cpp/highway-9999/work/highway-9999' '-O2' '-pipe' '-fomit-frame-pointer' '-fPIC' '-fvisibility=hidden' '-fvisibility-inlines-hidden' '-Wno-builtin-macro-redefined' '-D' '__DATE__="redacted"' '-D' '__TIMESTAMP__="redacted"' '-D' '__TIME__="redacted"' '-fmerge-all-constants' '-Wall' '-Wextra' '-Wconversion' '-Wsign-conversion' '-Wvla' '-Wnon-virtual-dtor' '-fmath-errno' '-fno-exceptions' '-MD' '-MT' 'CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_i64d.cc.o' '-MF' 'CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_i64d.cc.o.d' '-o' 'CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_i64d.cc.o' '-c' '-shared-libgcc'  '-mfloat-abi=hard' '-mfpu=vfpv3-d16' '-mtls-dialect=gnu' '-marm' '-mlibarch=armv7-a+fp' '-march=armv7-a+fp'
 /usr/libexec/gcc/armv7a-unknown-linux-gnueabihf/10.4.0/cc1plus -quiet -v -I /var/tmp/portage/dev-cpp/highway-9999/work/highway-9999 -MD CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_i64d.cc.d -MF CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_i64d.cc.o.d -MT CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_i64d.cc.o -D_GNU_SOURCE -D HWY_SHARED_DEFINE -D hwy_contrib_EXPORTS -D __DATE__="redacted" -D __TIMESTAMP__="redacted" -D __TIME__="redacted" /var/tmp/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/vqsort_i64d.cc -quiet -dumpbase vqsort_i64d.cc -mfloat-abi=hard -mfpu=vfpv3-d16 -mtls-dialect=gnu -marm -mlibarch=armv7-a+fp -march=armv7-a+fp -auxbase-strip CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_i64d.cc.o -O2 -Wno-builtin-macro-redefined -Wall -Wextra -Wconversion -Wsign-conversion -Wvla -Wnon-virtual-dtor -version -fomit-frame-pointer -fPIC -fvisibility=hidden -fvisibility-inlines-hidden -fmerge-all-constants -fmath-errno -fno-exceptions -o - |
 /usr/libexec/gcc/armv7a-unknown-linux-gnueabihf/as -v -I /var/tmp/portage/dev-cpp/highway-9999/work/highway-9999 -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -meabi=5 -o CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_i64d.cc.o
GNU assembler version 2.36.1 (armv7a-unknown-linux-gnueabihf) using BFD version (Gentoo 2.36.1 p5) 2.36.1
Assembler messages:
Fatal error: can't create CMakeFiles/hwy_contrib.dir/hwy/contrib/sort/vqsort_i64d.cc.o: No such file or directory
GNU C++14 (Gentoo 10.4.0 p5) version 10.4.0 (armv7a-unknown-linux-gnueabihf)
	compiled by GNU C version 10.4.0, GMP version 6.2.1, MPFR version 4.1.0-p13, MPC version 1.2.1, isl version none
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
ignoring nonexistent directory "/usr/armv7a-unknown-linux-gnueabihf/usr/local/include"
ignoring nonexistent directory "/usr/lib/gcc/armv7a-unknown-linux-gnueabihf/10.4.0/../../../../armv7a-unknown-linux-gnueabihf/include"
#include "..." search starts here:
#include <...> search starts here:
 /var/tmp/portage/dev-cpp/highway-9999/work/highway-9999
 /usr/lib/gcc/armv7a-unknown-linux-gnueabihf/10.4.0/include/g++-v10
 /usr/lib/gcc/armv7a-unknown-linux-gnueabihf/10.4.0/include/g++-v10/armv7a-unknown-linux-gnueabihf
 /usr/lib/gcc/armv7a-unknown-linux-gnueabihf/10.4.0/include/g++-v10/backward
 /usr/lib/gcc/armv7a-unknown-linux-gnueabihf/10.4.0/include
 /usr/lib/gcc/armv7a-unknown-linux-gnueabihf/10.4.0/include-fixed
 /usr/armv7a-unknown-linux-gnueabihf/usr/include
End of search list.
GNU C++14 (Gentoo 10.4.0 p5) version 10.4.0 (armv7a-unknown-linux-gnueabihf)
	compiled by GNU C version 10.4.0, GMP version 6.2.1, MPFR version 4.1.0-p13, MPC version 1.2.1, isl version none
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: d32c7f800b89674769804ef9c6a8ad26
In file included from /var/tmp/portage/dev-cpp/highway-9999/work/highway-9999/hwy/ops/arm_neon-inl.h:29,
                 from /var/tmp/portage/dev-cpp/highway-9999/work/highway-9999/hwy/highway.h:358,
                 from /var/tmp/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/shared-inl.h:103,
                 from /var/tmp/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/traits-inl.h:27,
                 from /var/tmp/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/vqsort_i64d.cc:23,
                 from /var/tmp/portage/dev-cpp/highway-9999/work/highway-9999/hwy/foreach_target.h:81,
                 from /var/tmp/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/vqsort_i64d.cc:20:
/usr/lib/gcc/armv7a-unknown-linux-gnueabihf/10.4.0/include/arm_neon.h: In function 'void hwy::N_NEON::StoreU(hwy::N_NEON::Vec128<long long int, 2>, hwy::N_NEON::Full128<long long int>, int64_t*)':
/usr/lib/gcc/armv7a-unknown-linux-gnueabihf/10.4.0/include/arm_neon.h:10958:1: error: inlining failed in call to 'always_inline' 'void vst1q_s64(int64_t*, int64x2_t)': target specific option mismatch
10958 | vst1q_s64 (int64_t * __a, int64x2_t __b)
      | ^~~~~~~~~
In file included from /var/tmp/portage/dev-cpp/highway-9999/work/highway-9999/hwy/highway.h:358,
                 from /var/tmp/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/shared-inl.h:103,
                 from /var/tmp/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/traits-inl.h:27,
                 from /var/tmp/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/vqsort_i64d.cc:23,
                 from /var/tmp/portage/dev-cpp/highway-9999/work/highway-9999/hwy/foreach_target.h:81,
                 from /var/tmp/portage/dev-cpp/highway-9999/work/highway-9999/hwy/contrib/sort/vqsort_i64d.cc:20:
/var/tmp/portage/dev-cpp/highway-9999/work/highway-9999/hwy/ops/arm_neon-inl.h:2765:12: note: called from here
 2765 |   vst1q_s64(unaligned, v.raw);
      |   ~~~~~~~~~^~~~~~~~~~~~~~~~~~

frankly I don't see any difference in the error, does the other information tell you something?

@malaterre
Copy link
Contributor

malaterre commented Sep 5, 2022

@jan-wassenberg Do you believe it makes sense to compile highway with neon support using the default -mfpu=vfpv3-d16 ( generic-armv7-a defaults to vfpv3-d16.) ...

@jan-wassenberg
Copy link
Member

@malaterre, good catch, thanks for pointing to that. vfpv4 is supported since 2009, I'd be surprised if anyone still cares about vfpv3.
set_macros-inl.h does:

#if HWY_ARCH_ARM_V7
#define HWY_TARGET_STR "+neon-vfpv4"

It makes sense that the compiler complains because arm_neon.h is compiled with the default target and only for Highway implementation and user code do we set vfpv4.

Here's an idea @stefson : does it help to, in arm_neon-inl.h move the following block to the line after HWY_BEFORE_NAMESPACE();?

HWY_DIAGNOSTICS(push)
HWY_DIAGNOSTICS_OFF(disable : 4701, ignored "-Wuninitialized")
#include <arm_neon.h>
HWY_DIAGNOSTICS(pop)

@stefson
Copy link
Author

stefson commented Sep 5, 2022

Can you please post a patch against latest git for your idea? The risk of a missunderstanding is too high if you ask me that way :D

@jan-wassenberg
Copy link
Member

Sure, sent :)

@stefson
Copy link
Author

stefson commented Sep 5, 2022

with gcc-10.4.0: latest-git+patch.log.gz

this does not look good :-S

@stefson
Copy link
Author

stefson commented Sep 6, 2022

armv7-gcc still broken with commit 9934046 , here is the build log: build.log.gz

@jan-wassenberg
Copy link
Member

Thanks for sharing the result. I was unable to reproduce it with GCC 10.3 (godbolt lacks 10.4) and -O2 -march=armv7-a -mfpu=vfpv3-d16, and your -O2 -mfloat-abi=hard -mfpu=vfpv3-d16 -marm -mlibarch=armv7-a+fp -march=armv7-a+fp.
https://gcc.godbolt.org/z/KrYz818xY

@stefson
Copy link
Author

stefson commented Sep 7, 2022

can you please name me the gcc versions (gcc-10.3.0 and later) which godbolt offers you? (Edit: I meant versions :D )

@jan-wassenberg
Copy link
Member

You can see them in the dropdown menu in the link above, where it currently says "ARM GCC 10.3.1" :) The next higher one is 11.1.

@stefson
Copy link
Author

stefson commented Sep 7, 2022

ah, got it! :D

I can offer you a log of failed compile with gcc-11.3.0, which seems identically to me: gcc-11.3.0-armv7a.log.gz

@jan-wassenberg
Copy link
Member

:)
The question is not whether we can get it to fail with other compilers. Instead the problem appears to be the configuration of the compiler, because it works (see godbolt link) with 11.3 and the flags specified there.
Have you compiled gcc from source, or is it from a binary release?

@stefson
Copy link
Author

stefson commented Sep 7, 2022

I've opened a gentoo bug for this, hopefully someone more experienced from toolchain will take a look at it: https://bugs.gentoo.org/869077

@jan-wassenberg
Copy link
Member

Nice, glad you've got feedback from the Gentoo bug :) Is the HWY_CMAKE_ARM7 solution enough for you?

(I still don't know why the default flags cause this conflict, but I imagine most use cases will be fine with setting compiler flags that require an Arm from 2009 or later.)

@stefson
Copy link
Author

stefson commented Sep 18, 2022

so, I did compile on armv7a with neon and with gcc-10.4.0, and it passes. Also all tests are passing, yay! test compiling on armv7a+musl is still on my queue though.

it seems libjxl now has a more or less similar hickup: libjxl/libjxl#1748

jolivain added a commit to jolivain/highway that referenced this issue Feb 20, 2023
When using a armv7 gcc >= 8 toolchain (like [1]) with Highway
configured with -DHWY_CMAKE_ARM7=OFF and HWY_ENABLE_CONTRIB=ON,
compilation fails with error:

    In file included from /build/highway-1.0.3/hwy/ops/arm_neon-inl.h:33,
                     from /build/highway-1.0.3/hwy/highway.h:358,
                     from /build/highway-1.0.3/hwy/contrib/sort/shared-inl.h:104,
                     from /build/highway-1.0.3/hwy/contrib/sort/traits128-inl.h:27,
                     from /build/highway-1.0.3/hwy/contrib/sort/vqsort_128d.cc:23,
                     from /build/highway-1.0.3/hwy/foreach_target.h:81,
                     from /build/highway-1.0.3/hwy/contrib/sort/vqsort_128d.cc:20:
    /toolchain/lib/gcc/arm-buildroot-linux-gnueabihf/12.2.0/include/arm_neon.h: In function ‘void hwy::N_NEON::StoreU(Vec128<long long unsigned int, 2>, Full128<long long unsigned int>, uint64_t*)’:
    /toolchain/lib/gcc/arm-buildroot-linux-gnueabihf/12.2.0/include/arm_neon.h:11052:1: error: inlining failed in call to ‘always_inline’ ‘void vst1q_u64(uint64_t*, uint64x2_t)’: target specific option mismatch
    11052 | vst1q_u64 (uint64_t * __a, uint64x2_t __b)
          | ^~~~~~~~~
    /build/highway-1.0.3/hwy/ops/arm_neon-inl.h:2786:12: note: called from here
     2786 |   vst1q_u64(unaligned, v.raw);
          |   ~~~~~~~~~^~~~~~~~~~~~~~~~~~

The same errors happen when configured with HWY_ENABLE_EXAMPLES=ON,
or from client libraries like libjxl (at other places).

The issue is that Highway Arm NEON ops have a dependency on the
Advanced SIMD (Neon) v2 and the VFPv4 floating-point instructions.
The SIMD (Neon) v1 and VFPv3 instructions are not supported.

There was several attempts to fix variants of this issues.
See google#834 and google#1032.

HWY_NEON target is selected only if __ARM_NEON is defined. See:
https://github.com/google/highway/blob/1.0.3/hwy/detect_targets.h#L251

This test is not sufficient since __ARM_NEON will be predefined in
any cases when Neon is enabled (neon-vfpv3, neon-vfpv4).

The issue is that HWY_CMAKE_ARM7=ON implies VFPv4 / NEON SIMD v2.
When setting HWY_CMAKE_ARM7=OFF, "neon-vfpv4" will not be forced,
but the code is still using intrinsics assuming VFPv4. Gcc will fail
with error because code cannot be generated for the selected
architecture.

This issue can be avoided by adding "-DHWY_DISABLED_TARGETS=HWY_NEON" in
CXXFLAGS. The problem with this solution is that every client program will
also need to do the same. This goes against the very purpose of
"hwy/detect_targets.h".

Technically, Armv7-a processors with VFPv4 can be detected using some
ACLE (Arm C Language Extensions [2]) predefined macros:

Basically, we want Highway to define HWY_NEON only when the target
supports SIMDv2/VFPv4 or higher. An older target with vfpv3 only
(e.g. Cortex-A8, A9, ...) would NOT define HWY_NEON, and therefore
would fallback on HWY_SCALAR implementation.

However, not all compiler completely support ACLE. There is also
several versions too. So we cannot easily rely on macros like
"__ARM_VFPV4__" (which clang predefine, but not gcc).

The alternative solution proposed in this patch, is to declare the
HWY_NEON target architecture as broken, when we detect the target is
Armv7-A, but mandatory features for vfpv4 (namely half-float, FMA)
are missing. Half-floats are tested using the macro __ARM_NEON_FP,
and the FMA with the macro __ARM_FEATURE_FMA. See ACLE [2]. The
intent of declaring the target as broken, rather than selecting
HWY_NEON only if vfpv4 features are detected is to remain a bit
conservative, since the detection is slithly inaccurate.

For a given compiler/cflags, predefined macros for Arm/ACLE can be
reviewed with commands like:

    arm-linux-gnueabihf-gcc -mcpu=cortex-a9 -mfpu=neon-vfpv3 -Wp,-dM -E -c - < /dev/null | grep -Fi arm | sort
    arm-linux-gnueabihf-gcc -mcpu=cortex-a7 -mfpu=neon-vfpv4 -Wp,-dM -E -c - < /dev/null | grep -Fi arm | sort
    clang -target armv7a -mcpu=cortex-a9 -mfpu=neon-vfpv3 -mfloat-abi=hard -Wp,-dM -E -c - < /dev/null | grep -Fi arm | sort
    clang -target armv7a -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard -Wp,-dM -E -c - < /dev/null | grep -Fi arm | sort

The different values of __ARM_NEON_FP can be seen, depending which
"-mfpu" is passed. Same for __ARM_FEATURE_FMA.

[1] https://toolchains.bootlin.com/downloads/releases/toolchains/armv7-eabihf/tarballs/armv7-eabihf--glibc--bleeding-edge-2022.08-1.tar.bz2
[2] https://github.com/ARM-software/acle/

Signed-off-by: Julien Olivain <ju.o@free.fr>
@jan-wassenberg
Copy link
Member

I believe this is now fully fixed thanks to @jolivain, please feel free to re-open if anything else comes up.

@MBkkt
Copy link

MBkkt commented Mar 16, 2023

06:52:21 /home/jenkins/workspace/iresearch-release_aarch64/external/highway/hwy/ops/arm_neon-inl.h: In static member function ‘static decltype(auto) irs::simd::simd_helper<Aligned>::load(Simd, Ptr) [with Simd = hwy::N_NEON::Simd<unsigned int, 4, 0>; Ptr = unsigned int*; bool Aligned = false]’:
06:52:21 /home/jenkins/workspace/iresearch-release_aarch64/external/highway/hwy/ops/arm_neon-inl.h:2599:26: error: inlining failed in call to ‘always_inline’ ‘hwy::N_NEON::Vec128<unsigned int, 4> hwy::N_NEON::LoadU(hwy::N_NEON::Full128<unsigned int>, const uint32_t*)’: target specific option mismatch
06:52:21  2599 | HWY_API Vec128<uint32_t> LoadU(Full128<uint32_t> /* tag */,
  template<typename Simd, typename Ptr>
  static decltype(auto) load(const Simd simd_tag, Ptr p) {
    if constexpr (Aligned) {
      return Load(simd_tag, p);
    } else {
      return LoadU(simd_tag, p);
    }
  }

Could you help please?

Probably some incorrect/missing flags on aarch64, but maybe you have an idea on what I should look?

Just in case, repo is public, but I don't expect you will look
iresearch-toolkit/iresearch#485

@jan-wassenberg
Copy link
Member

hm, I believe the fix for this issue came after the 1.0.3 release that you seem to be using. We will do another release in the next few days.

Just to confirm: you are building for aarch64, right? If it's actually Arm V7, then you will want to add -DHWY_CMAKE_ARM7=ON to the CMake command line.

@MBkkt
Copy link

MBkkt commented Mar 16, 2023

@jan-wassenberg Thanks for answer!

Yes it's definitely armv8, we don't have armv7 machines

I commented on this issue, because I tried some time ago and faced with it, then I see it resolved, decided to try another time, and faced with new issue.

In general I really suspect this cmake file doing something wrong :(
https://github.com/iresearch-toolkit/iresearch/blob/master/cmake/OptimizeForArchitecture.cmake

@stefson
Copy link
Author

stefson commented Mar 16, 2023

I've tested all kind of combinations, armv7 and aarch64, without any breakage with current git. Whats your toolchain and cpu, please?

@jan-wassenberg
Copy link
Member

Thanks @stefson . I think one problem is indeed that they are overriding the compiler flags with a separate CMake file outside of Highway.

Seems that these lines are insufficient and possibly counterproductive:
https://github.com/iresearch-toolkit/iresearch/blob/master/cmake/OptimizeForArchitecture.cmake#L587

Does it help to replace with -march=armv8.2-a?

But another issue is that they are testing with Highway 1.0.3 which came before some of our fixes here.

@MBkkt
Copy link

MBkkt commented Mar 16, 2023

@stefson
I build our public repo with

cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DUSE_TESTS=On -DUSE_IPO=Off -DCMAKE_C_COMPILER=gcc-10 -DCMAKE_CXX_COMPILER=g++-10 -G 'Unix Makefiles' ..

unfortunately it needs some external packages: icu, boost, lz4

Then I see this from cmake.

06:49:35 CMake Warning at cmake/OptimizeForArchitecture.cmake:585 (message):
06:49:35   Architecture auto-detection for CMAKE_SYSTEM_PROCESSOR 'aarch64' is not
06:49:35   supported by OptimizeForArchitecture.cmake on ARM
06:49:35 Call Stack (most recent call first):
06:49:35   cmake/OptimizeForArchitecture.cmake:622 (OptimizeForArchitectureArm)
06:49:35   CMakeLists.txt:204 (OptimizeForArchitecture)

So I assume https://github.com/iresearch-toolkit/iresearch/blob/master/cmake/OptimizeForArchitecture.cmake#L622
this called and do nothing.
Also I assume it's real aarch64, probably I need to specify some flags?

@MBkkt
Copy link

MBkkt commented Mar 16, 2023

Anyway thanks, probably I should do something with this broken cmake script before update highway

@MBkkt
Copy link

MBkkt commented Mar 16, 2023

Seems that these lines are insufficient and possibly counterproductive:

Agreed, because for us machine which run code can be with different processor but same architecture (in general we only support x86/armv8).

I think one problem is indeed that they are overriding the compiler flags with a separate CMake file outside of Highway.

I want to remove this script, and measure difference, should I specify something additional?
Or maybe some approach that you can recommend?

Maybe I can specify highway cmake first and it will specify all needed flags instead of me?)

@jan-wassenberg
Copy link
Member

Ah, it's good news that we're hitting the "auto" case, then this CMake file is indeed not doing anything.

On GCC+aarch64, Highway trunk works out of the box and generates code for all targets without any required flags:
https://gcc.godbolt.org/z/e4M94doPc

The 1.0.3 release might benefit from you adding -march=armv8.2-a.

copybara-service bot pushed a commit that referenced this issue Apr 9, 2024
… dispatch)

This reverts the workaround in #834. The root cause was our requiring VFPv4 on Armv7, without reliably being able to detect it. Now, we have the HWY_CMAKE_ARM7 flag to explicitly opt-in and set the required compiler flags, or the check in #1143 that disables NEON when VFPv4 prereqs are detected as missing.

Thus it is no longer necessary for us to set VFPv4 attributes on the intrinsics. This caused build failures in runtime-dispatch mode because the first compiled target was HWY_NEON, which added +crypto to intrinsics, but that caused HWY_NEON_WITHOUT_AES to fail because it inlined those intrinsics into non-crypto wrapper functions.

PiperOrigin-RevId: 623081868
copybara-service bot pushed a commit that referenced this issue Apr 9, 2024
… dispatch)

This reverts the workaround in #834. The root cause was our requiring VFPv4 on Armv7, without reliably being able to detect it. Now, we have the HWY_CMAKE_ARM7 flag to explicitly opt-in and set the required compiler flags, or the check in #1143 that disables NEON when VFPv4 prereqs are detected as missing.

Thus it is no longer necessary for us to set VFPv4 attributes on the intrinsics. This caused build failures in runtime-dispatch mode because the first compiled target was HWY_NEON, which added +crypto to intrinsics, but that caused HWY_NEON_WITHOUT_AES to fail because it inlined those intrinsics into non-crypto wrapper functions.

PiperOrigin-RevId: 623081868
copybara-service bot pushed a commit that referenced this issue Apr 9, 2024
… dispatch)

This reverts the workaround in #834. The root cause was our requiring VFPv4 on Armv7, without reliably being able to detect it. Now, we have the HWY_CMAKE_ARM7 flag to explicitly opt-in and set the required compiler flags, or the check in #1143 that disables NEON when VFPv4 prereqs are detected as missing.

Thus it is no longer necessary for us to set VFPv4 attributes on the intrinsics. This caused build failures in runtime-dispatch mode because the first compiled target was HWY_NEON, which added +crypto to intrinsics, but that caused HWY_NEON_WITHOUT_AES to fail because it inlined those intrinsics into non-crypto wrapper functions.

PiperOrigin-RevId: 623081868
copybara-service bot pushed a commit that referenced this issue Apr 9, 2024
… dispatch)

This reverts the workaround in #834. The root cause was our requiring VFPv4 on Armv7, without reliably being able to detect it. Now, we have the HWY_CMAKE_ARM7 flag to explicitly opt-in and set the required compiler flags, or the check in #1143 that disables NEON when VFPv4 prereqs are detected as missing.

Thus it is no longer necessary for us to set VFPv4 attributes on the intrinsics. This caused build failures in runtime-dispatch mode because the first compiled target was HWY_NEON, which added +crypto to intrinsics, but that caused HWY_NEON_WITHOUT_AES to fail because it inlined those intrinsics into non-crypto wrapper functions.

PiperOrigin-RevId: 623255087
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants