BUG: Build failure (of 1.26.2) on SapphireRapids (avx512_spr) due to multiple definition of avx512_qsort and avx512_qselect #25274

Closed
branfosj opened this issue Nov 29, 2023 · 16 comments · Fixed by #25376
Labels: 00 - Bug, component: SIMD

branfosj commented Nov 29, 2023

Describe the issue:

Building 1.26.2 on SapphireRapids with

spin build -- -Dcpu-baseline=native

or, on IceLake with

spin build -- -Dcpu-baseline=avx512_spr

fails due to multiple definitions of void avx512_qsort<_Float16>(_Float16*, long) and void avx512_qselect<_Float16>(_Float16*, long, long).

Reproduce the code example:

n/a

Error message:

FAILED: numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so                                                                                                                                          c++  -o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/meson-generated_arraytypes.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/meson-generated_einsum.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/meson-generated_einsum_sumprod.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/meson-generated_lowlevel_strided_loops.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/meson-generated_nditer_templ.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/meson-generated_scalartypes.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/meson-generated_loops.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/meson-generated_matmul.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/meson-generated_scalarmath.c.o ../numpy/core/src/umath/svml/linux/avx512/svml_z0_acos_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_acos_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_acos_d_ha.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_acosh_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_acosh_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_acosh_d_ha.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_asin_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_asin_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_asin_d_ha.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_asinh_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_asinh_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_asinh_d_ha.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_atan2_d_la.s 
../numpy/core/src/umath/svml/linux/avx512/svml_z0_atan2_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_atan2_d_ha.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_atan_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_atan_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_atan_d_ha.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_atanh_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_atanh_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_atanh_d_ha.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_cbrt_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_cbrt_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_cbrt_d_ha.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_cos_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_cos_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_cos_d_ha.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_cosh_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_cosh_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_cosh_d_ha.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_exp2_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_exp2_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_exp2_d_ha.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_exp_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_exp_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_exp_d_ha.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_expm1_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_expm1_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_expm1_d_ha.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_log10_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_log10_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_log10_d_ha.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_log1p_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_log1p_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_log1p_d_ha.s 
../numpy/core/src/umath/svml/linux/avx512/svml_z0_log2_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_log2_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_log2_d_ha.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_log_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_log_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_log_d_ha.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_pow_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_pow_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_pow_d_ha.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_sin_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_sin_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_sin_d_ha.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_sinh_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_sinh_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_sinh_d_ha.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_tan_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_tan_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_tan_d_ha.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_tanh_d_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_tanh_s_la.s ../numpy/core/src/umath/svml/linux/avx512/svml_z0_tanh_d_ha.s numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_abstractdtypes.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_alloc.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_arrayobject.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_array_coercion.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_array_method.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_array_assign_scalar.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_array_assign_array.c.o 
numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_arrayfunction_override.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_buffer.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_calculation.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_compiled_base.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_common.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_common_dtype.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_convert.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_convert_datatype.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_conversion_utils.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_ctors.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_datetime.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_datetime_strings.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_datetime_busday.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_datetime_busdaycal.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_descriptor.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_dlpack.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_dtypemeta.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_dragon4.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_dtype_transfer.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_dtype_traversal.c.o 
numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_experimental_public_dtype_api.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_flagsobject.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_getset.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_hashdescr.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_item_selection.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_iterators.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_legacy_dtype_implementation.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_mapping.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_methods.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_multiarraymodule.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_nditer_api.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_nditer_constr.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_nditer_pywrap.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_number.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_refcount.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_sequence.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_shape.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_scalarapi.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_strfuncs.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_temp_elide.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_typeinfo.c.o 
numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_usertypes.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_vdot.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_npysort_quicksort.cpp.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_npysort_mergesort.cpp.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_npysort_timsort.cpp.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_npysort_heapsort.cpp.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_npysort_radixsort.cpp.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_npysort_selection.cpp.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_npysort_binsearch.cpp.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_textreading_conversions.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_textreading_field_types.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_textreading_growth.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_textreading_readtext.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_textreading_rows.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_textreading_stream_pyobject.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_textreading_str_to_int.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_multiarray_textreading_tokenize.cpp.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_npymath_arm64_exports.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_common_array_assign.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_common_mem_overlap.c.o 
numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_common_npy_argparse.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_common_npy_hashtable.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_common_npy_longdouble.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_common_ucsnarrow.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_common_ufunc_override.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_common_numpyos.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_common_npy_cpu_features.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_common_cblasfuncs.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_common_python_xerbla.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_umath_ufunc_type_resolution.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_umath_clip.cpp.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_umath_dispatching.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_umath_extobj.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_umath_legacy_array_method.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_umath_override.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_umath_reduction.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_umath_ufunc_object.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_umath_umathmodule.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_umath_string_ufuncs.cpp.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_umath_wrapping_array_method.c.o numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so.p/src_umath__scaled_float_dtype.c.o -Wl,--as-needed -Wl,--allow-shlib-undefined -shared -fPIC -Wl,--start-group 
numpy/core/libnpymath.a numpy/core/lib_multiarray_umath_mtargets.a /rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/OpenBLAS/0.3.24-GCC-13.2.0/lib/libopenblas.so -Wl,--end-group
/rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/binutils/2.40-GCCcore-13.2.0/bin/ld: numpy/core/libsimd_qsort_16bit.dispatch.h_AVX512_SPR.a.p/src_npysort_simd_qsort_16bit.dispatch.cpp.o: in function `void avx512_qsort<_Float16>(_Float16*, long)':
/rds/projects/2017/branfosj-rse/ProblemSolving/numpy/numpy-1.26.2/build/../numpy/core/src/npysort/x86-simd-sort/src/avx512fp16-16bit-qsort.hpp:161: multiple definition of `void avx512_qsort<_Float16>(_Float16*, long)'; numpy/core/libsimd_qsort_16bit.dispatch.h_AVX512_ICL.a.p/src_npysort_simd_qsort_16bit.dispatch.cpp.o:/rds/projects/2017/branfosj-rse/ProblemSolving/numpy/numpy-1.26.2/build/../numpy/core/src/npysort/x86-simd-sort/src/avx512fp16-16bit-qsort.hpp:161: first defined here
/rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/binutils/2.40-GCCcore-13.2.0/bin/ld: numpy/core/libsimd_qsort_16bit.dispatch.h_AVX512_SPR.a.p/src_npysort_simd_qsort_16bit.dispatch.cpp.o: in function `void avx512_qselect<_Float16>(_Float16*, long, long)':
/rds/projects/2017/branfosj-rse/ProblemSolving/numpy/numpy-1.26.2/build/../numpy/core/src/npysort/x86-simd-sort/src/avx512fp16-16bit-qsort.hpp:149: multiple definition of `void avx512_qselect<_Float16>(_Float16*, long, long)'; numpy/core/libsimd_qsort_16bit.dispatch.h_AVX512_ICL.a.p/src_npysort_simd_qsort_16bit.dispatch.cpp.o:/rds/projects/2017/branfosj-rse/ProblemSolving/numpy/numpy-1.26.2/build/../numpy/core/src/npysort/x86-simd-sort/src/avx512fp16-16bit-qsort.hpp:149: first defined here
collect2: error: ld returned 1 exit status

Runtime information:

The Meson build system 
Version: 1.2.99
Source dir: /rds/projects/2017/branfosj-rse/ProblemSolving/numpy/numpy-1.26.2
Build dir: /rds/projects/2017/branfosj-rse/ProblemSolving/numpy/numpy-1.26.2/build
Build type: native build
Project name: NumPy
Project version: 1.26.2
C compiler for the host machine: cc (gcc 13.2.0 "cc (GCC) 13.2.0")
C linker for the host machine: cc ld.bfd 2.40
C++ compiler for the host machine: c++ (gcc 13.2.0 "c++ (GCC) 13.2.0")
C++ linker for the host machine: c++ ld.bfd 2.40
Cython compiler for the host machine: cython (cython 3.0.4)
Host machine cpu family: x86_64
Host machine cpu: x86_64
Program python3 found: YES (/rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/Python/3.11.5-GCCcore-13.2.0/bin/python)
Found pkg-config: /rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/pkgconf/2.0.3-GCCcore-13.2.0/bin/pkg-config (2.0.3)
Run-time dependency python found: YES 3.11
Has header "Python.h" with dependency python-3.11: YES
Compiler for C supports arguments -fno-strict-aliasing: YES
Test features "SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL AVX512_SPR" : Supported
Test features "AVX512_KNL" : Supported
Test features "AVX512_KNM" : Supported
Configuring npy_cpu_dispatch_config.h using configuration
Message:
     CPU Optimization Options
baseline:
        Requested : avx512_spr
        Enabled   : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL AVX512_SPR
dispatch:
        Requested : max -xop -fma4
        Enabled   : AVX512_KNL AVX512_KNM
Library m found: YES
Run-time dependency scipy-openblas found: NO (tried pkgconfig)
Run-time dependency mkl found: NO (tried pkgconfig and system)
Run-time dependency mkl found: NO (tried pkgconfig and system)
Run-time dependency accelerate found: NO (tried system)
Run-time dependency openblas found: YES 0.3.24
Message: BLAS symbol suffix:
Run-time dependency mkl found: NO (tried pkgconfig and system)
Run-time dependency accelerate found: NO (tried system)
Run-time dependency openblas found: YES 0.3.24

And the last part of the configure output:

Generating multi-targets for "_umath_tests.dispatch.h"
  Enabled targets: baseline
Generating multi-targets for "argfunc.dispatch.h"
  Enabled targets: baseline
Generating multi-targets for "simd_qsort.dispatch.h"
  Enabled targets: AVX512_SKX
Generating multi-targets for "simd_qsort_16bit.dispatch.h"
  Enabled targets: AVX512_SPR, AVX512_ICL
Generating multi-targets for "loops_arithm_fp.dispatch.h"
  Enabled targets: baseline
Generating multi-targets for "loops_arithmetic.dispatch.h"
  Enabled targets: baseline
Generating multi-targets for "loops_comparison.dispatch.h"
  Enabled targets: baseline
Generating multi-targets for "loops_exponent_log.dispatch.h"
  Enabled targets: baseline
Generating multi-targets for "loops_hyperbolic.dispatch.h"
  Enabled targets: baseline
Generating multi-targets for "loops_logical.dispatch.h"
  Enabled targets: baseline
Generating multi-targets for "loops_minmax.dispatch.h"
  Enabled targets: baseline
Generating multi-targets for "loops_modulo.dispatch.h"
  Enabled targets: baseline
Generating multi-targets for "loops_trigonometric.dispatch.h"
  Enabled targets: baseline
Generating multi-targets for "loops_umath_fp.dispatch.h"
  Enabled targets: baseline
Generating multi-targets for "loops_unary.dispatch.h"
  Enabled targets: baseline
Generating multi-targets for "loops_unary_fp.dispatch.h"
  Enabled targets: baseline
Generating multi-targets for "loops_unary_fp_le.dispatch.h"
  Enabled targets: baseline
Generating multi-targets for "loops_unary_complex.dispatch.h"
  Enabled targets: baseline
Generating multi-targets for "loops_autovec.dispatch.h"
  Enabled targets: baseline
Generating multi-targets for "_simd.dispatch.h"
  Enabled targets: baseline
Build targets in project: 62

NumPy 1.26.2

  User defined options
    prefix      : /usr
    cpu-baseline: avx512_spr
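
For anyone comparing configure logs, here is a rough Python sketch that lists which dispatch headers have enabled targets that are also part of the baseline. The function names are made up for illustration and are not part of NumPy or Meson, and the parsing assumes the "Generating multi-targets for ... / Enabled targets: ..." wording shown above. It only reports the overlap; whether such overlap is intended is a question for the build system.

```python
import re


def enabled_dispatch_targets(log_text):
    """Return {dispatch header: [enabled targets]} parsed from configure output.

    Hypothetical helper: it relies only on the log wording shown in this issue.
    """
    targets = {}
    for chunk in log_text.split("Generating multi-targets for ")[1:]:
        name = re.match(r'"([^"]+)"', chunk)
        _, sep, rest = chunk.partition("Enabled targets:")
        if name is None or not sep:
            continue
        # Only the first line after "Enabled targets:" belongs to this header.
        lines = rest.strip().splitlines()
        first_line = lines[0] if lines else ""
        targets[name.group(1)] = [t.strip() for t in first_line.split(",") if t.strip()]
    return targets


def overlap_with_baseline(log_text, baseline_features):
    """Return dispatch headers whose enabled targets also appear in the baseline."""
    baseline = set(baseline_features)
    overlap = {}
    for header, enabled in enabled_dispatch_targets(log_text).items():
        shared = [t for t in enabled if t in baseline]
        if shared:
            overlap[header] = shared
    return overlap


# Abbreviated excerpt of the log above, including one padded, merged line.
sample_log = (
    ' Generating multi-targets for "simd_qsort.dispatch.h"      '
    'Enabled targets: AVX512_SKX      '
    'Generating multi-targets for "simd_qsort_16bit.dispatch.h"    '
    'Enabled targets: AVX512_SPR, AVX512_ICL   \n'
    'Generating multi-targets for "loops_unary.dispatch.h"\n'
    '  Enabled targets: baseline\n'
    'Build targets in project: 62\n'
)
baseline = ["AVX512_SKX", "AVX512_ICL", "AVX512_SPR"]
print(overlap_with_baseline(sample_log, baseline))
```

With the excerpt above, simd_qsort_16bit.dispatch.h is reported with both AVX512_SPR and AVX512_ICL, which are the two objects named in the linker error.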

Context for the issue:

No response

@r-devulap
Member

@branfosj thanks for reporting. I am taking a look.

@r-devulap
Member

This looks like a bug in the build system. The issue seems to be that the qsort_16bit dispatch file is built with the baseline CPU flags on top of the target-specific dispatch flags, which, if I understand correctly, is not intended. When using -Dcpu-baseline=avx512_spr, the avx512_icl dispatch target is effectively built with the avx512_spr flags as well (note -mavx512fp16 in both commands), leading to the multiple definition error. See the commands used to build the x86_simd_qsort_16bit.dispatch.cpp file below:

{
    "directory": "/home/raghuveer/MyFiles/src/wrkdir_numpy/numpy/build",
    "command": "g++-12 -Inumpy/_core/libx86_simd_qsort_16bit.dispatch.h_AVX512_SPR.a.p -Inumpy/_core -I../numpy/_core -Inumpy/_core/include -I../numpy/_core/include -I../numpy/_core/src/common -I../numpy/_core/src/multiarray -I../numpy/_core/src/npymath -I../numpy/_core/src/umath -I../numpy/_core/src/highway -I/home/raghuveer/anaconda3/envs/np-dev/include/python3.11 -I/home/raghuveer/MyFiles/src/wrkdir_numpy/numpy/build/meson_cpu -fdiagnostics-color=always -Wall -Winvalid-pch -std=c++17 -O2 -g -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 -mno-mmx -mavx512f -mavx512cd -mavx512vl -mavx512bw -mavx512dq -mavx512vnni -mavx512ifma -mavx512vbmi -mavx512vbmi2 -mavx512bitalg -mavx512vpopcntdq -mavx512fp16 -fPIC -DNPY_INTERNAL_BUILD -DHAVE_NPY_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1 -D_LARGEFILE64_SOURCE=1 -D__STDC_VERSION__=0 -fno-exceptions -fno-rtti -O3 -DNPY_HAVE_SSE2 -DNPY_HAVE_SSE -DNPY_HAVE_SSE3 -DNPY_HAVE_SSSE3 -DNPY_HAVE_SSE41 -DNPY_HAVE_POPCNT -DNPY_HAVE_SSE42 -DNPY_HAVE_AVX -DNPY_HAVE_F16C -DNPY_HAVE_FMA3 -DNPY_HAVE_AVX2 -DNPY_HAVE_AVX512F -DNPY_HAVE_AVX512F_REDUCE -DNPY_HAVE_AVX512CD -DNPY_HAVE_AVX512_SKX -DNPY_HAVE_AVX512VL -DNPY_HAVE_AVX512BW -DNPY_HAVE_AVX512DQ -DNPY_HAVE_AVX512BW_MASK -DNPY_HAVE_AVX512DQ_MASK -DNPY_HAVE_AVX512_CLX -DNPY_HAVE_AVX512VNNI -DNPY_HAVE_AVX512_CNL -DNPY_HAVE_AVX512IFMA -DNPY_HAVE_AVX512VBMI -DNPY_HAVE_AVX512_ICL -DNPY_HAVE_AVX512VBMI2 -DNPY_HAVE_AVX512BITALG -DNPY_HAVE_AVX512VPOPCNTDQ -DNPY_HAVE_AVX512_SPR -DNPY_HAVE_AVX512FP16 -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 -mno-mmx -mavx512f -mavx512cd -mavx512vl -mavx512bw -mavx512dq -mavx512vnni -mavx512ifma -mavx512vbmi -mavx512vbmi2 -mavx512bitalg -mavx512vpopcntdq -mavx512fp16 -DNPY_MTARGETS_CURRENT=AVX512_SPR -MD -MQ numpy/_core/libx86_simd_qsort_16bit.dispatch.h_AVX512_SPR.a.p/src_npysort_x86_simd_qsort_16bit.dispatch.cpp.o -MF 
numpy/_core/libx86_simd_qsort_16bit.dispatch.h_AVX512_SPR.a.p/src_npysort_x86_simd_qsort_16bit.dispatch.cpp.o.d -o numpy/_core/libx86_simd_qsort_16bit.dispatch.h_AVX512_SPR.a.p/src_npysort_x86_simd_qsort_16bit.dispatch.cpp.o -c ../numpy/_core/src/npysort/x86_simd_qsort_16bit.dispatch.cpp",
    "file": "../numpy/_core/src/npysort/x86_simd_qsort_16bit.dispatch.cpp",
    "output": "numpy/_core/libx86_simd_qsort_16bit.dispatch.h_AVX512_SPR.a.p/src_npysort_x86_simd_qsort_16bit.dispatch.cpp.o"
  },
  {
    "directory": "/home/raghuveer/MyFiles/src/wrkdir_numpy/numpy/build",
    "command": "g++-12 -Inumpy/_core/libx86_simd_qsort_16bit.dispatch.h_AVX512_ICL.a.p -Inumpy/_core -I../numpy/_core -Inumpy/_core/include -I../numpy/_core/include -I../numpy/_core/src/common -I../numpy/_core/src/multiarray -I../numpy/_core/src/npymath -I../numpy/_core/src/umath -I../numpy/_core/src/highway -I/home/raghuveer/anaconda3/envs/np-dev/include/python3.11 -I/home/raghuveer/MyFiles/src/wrkdir_numpy/numpy/build/meson_cpu -fdiagnostics-color=always -Wall -Winvalid-pch -std=c++17 -O2 -g -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 -mno-mmx -mavx512f -mavx512cd -mavx512vl -mavx512bw -mavx512dq -mavx512vnni -mavx512ifma -mavx512vbmi -mavx512vbmi2 -mavx512bitalg -mavx512vpopcntdq -mavx512fp16 -DNPY_HAVE_AVX512_SPR -DNPY_HAVE_AVX512FP16 -fPIC -DNPY_INTERNAL_BUILD -DHAVE_NPY_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1 -D_LARGEFILE64_SOURCE=1 -D__STDC_VERSION__=0 -fno-exceptions -fno-rtti -O3 -DNPY_HAVE_SSE2 -DNPY_HAVE_SSE -DNPY_HAVE_SSE3 -DNPY_HAVE_SSSE3 -DNPY_HAVE_SSE41 -DNPY_HAVE_POPCNT -DNPY_HAVE_SSE42 -DNPY_HAVE_AVX -DNPY_HAVE_F16C -DNPY_HAVE_FMA3 -DNPY_HAVE_AVX2 -DNPY_HAVE_AVX512F -DNPY_HAVE_AVX512F_REDUCE -DNPY_HAVE_AVX512CD -DNPY_HAVE_AVX512_SKX -DNPY_HAVE_AVX512VL -DNPY_HAVE_AVX512BW -DNPY_HAVE_AVX512DQ -DNPY_HAVE_AVX512BW_MASK -DNPY_HAVE_AVX512DQ_MASK -DNPY_HAVE_AVX512_CLX -DNPY_HAVE_AVX512VNNI -DNPY_HAVE_AVX512_CNL -DNPY_HAVE_AVX512IFMA -DNPY_HAVE_AVX512VBMI -DNPY_HAVE_AVX512_ICL -DNPY_HAVE_AVX512VBMI2 -DNPY_HAVE_AVX512BITALG -DNPY_HAVE_AVX512VPOPCNTDQ -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 -mno-mmx -mavx512f -mavx512cd -mavx512vl -mavx512bw -mavx512dq -mavx512vnni -mavx512ifma -mavx512vbmi -mavx512vbmi2 -mavx512bitalg -mavx512vpopcntdq -DNPY_MTARGETS_CURRENT=AVX512_ICL -MD -MQ numpy/_core/libx86_simd_qsort_16bit.dispatch.h_AVX512_ICL.a.p/src_npysort_x86_simd_qsort_16bit.dispatch.cpp.o -MF 
numpy/_core/libx86_simd_qsort_16bit.dispatch.h_AVX512_ICL.a.p/src_npysort_x86_simd_qsort_16bit.dispatch.cpp.o.d -o numpy/_core/libx86_simd_qsort_16bit.dispatch.h_AVX512_ICL.a.p/src_npysort_x86_simd_qsort_16bit.dispatch.cpp.o -c ../numpy/_core/src/npysort/x86_simd_qsort_16bit.dispatch.cpp",
    "file": "../numpy/_core/src/npysort/x86_simd_qsort_16bit.dispatch.cpp",
    "output": "numpy/_core/libx86_simd_qsort_16bit.dispatch.h_AVX512_ICL.a.p/src_npysort_x86_simd_qsort_16bit.dispatch.cpp.o"
  },
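
To check this without reading the full command lines by eye, a small Python sketch along these lines can group the -m flags by dispatch target. The helper names are hypothetical, and it assumes each entry has a "command" string containing -DNPY_MTARGETS_CURRENT=..., as in the two entries above. The sample commands are heavily abbreviated versions of those entries.

```python
import shlex


def machine_flags_by_target(entries):
    """Map each NPY_MTARGETS_CURRENT value to the set of -m flags in its command."""
    flags = {}
    for entry in entries:
        args = shlex.split(entry["command"])
        target = next(
            (a.split("=", 1)[1] for a in args
             if a.startswith("-DNPY_MTARGETS_CURRENT=")),
            None,
        )
        if target is None:
            continue  # not a dispatch object
        # Lowercase -m only, so -MD/-MQ/-MF dependency options are not picked up.
        flags.setdefault(target, set()).update(a for a in args if a.startswith("-m"))
    return flags


def targets_with_flag(entries, flag):
    """Return the sorted dispatch targets whose compile command contains `flag`."""
    return sorted(t for t, f in machine_flags_by_target(entries).items() if flag in f)


# Abbreviated stand-ins for the two compile commands quoted above.
sample = [
    {"command": "g++-12 -mavx512f -mavx512fp16 -DNPY_MTARGETS_CURRENT=AVX512_SPR -c x.cpp"},
    {"command": "g++-12 -mavx512f -mavx512fp16 -DNPY_MTARGETS_CURRENT=AVX512_ICL -c x.cpp"},
]
print(targets_with_flag(sample, "-mavx512fp16"))  # ['AVX512_ICL', 'AVX512_SPR']
```

On a real build tree the entries could be loaded with json.load from the compile_commands.json file that Meson writes into the build directory. If AVX512_ICL shows up in the result for -mavx512fp16, the ICL object is being compiled with the SPR-only flag.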

@r-devulap
Member

ping @seiko2plus

@tylerjereddy
Contributor

tylerjereddy commented Dec 11, 2023

See also SapphireRapids and IceLake sorting concerns on main reproduced at: #24842 (comment). Not sure if related though.

charris added this to the 1.26.3 release milestone Dec 11, 2023

@seiko2plus
Member

This looks like a bug in the build system. The issue seems to be that qsort_16bit dispatch file is built with baseline cpu flags on top of the specific dispatch flags which, if I understand correctly, is not intended. When using -Dcpu-baseline=avx512_spr, the avx512_icl dispatch essentially gets built with avx512_spr leading to multiple definition error. See commands used to build the x86_simd_qsort_16bit.dispatch.cpp file below:

{
    "directory": "/home/raghuveer/MyFiles/src/wrkdir_numpy/numpy/build",
    "command": "g++-12 -Inumpy/_core/libx86_simd_qsort_16bit.dispatch.h_AVX512_SPR.a.p -Inumpy/_core -I../numpy/_core -Inumpy/_core/include -I../numpy/_core/include -I../numpy/_core/src/common -I../numpy/_core/src/multiarray -I../numpy/_core/src/npymath -I../numpy/_core/src/umath -I../numpy/_core/src/highway -I/home/raghuveer/anaconda3/envs/np-dev/include/python3.11 -I/home/raghuveer/MyFiles/src/wrkdir_numpy/numpy/build/meson_cpu -fdiagnostics-color=always -Wall -Winvalid-pch -std=c++17 -O2 -g -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 -mno-mmx -mavx512f -mavx512cd -mavx512vl -mavx512bw -mavx512dq -mavx512vnni -mavx512ifma -mavx512vbmi -mavx512vbmi2 -mavx512bitalg -mavx512vpopcntdq -mavx512fp16 -fPIC -DNPY_INTERNAL_BUILD -DHAVE_NPY_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1 -D_LARGEFILE64_SOURCE=1 -D__STDC_VERSION__=0 -fno-exceptions -fno-rtti -O3 -DNPY_HAVE_SSE2 -DNPY_HAVE_SSE -DNPY_HAVE_SSE3 -DNPY_HAVE_SSSE3 -DNPY_HAVE_SSE41 -DNPY_HAVE_POPCNT -DNPY_HAVE_SSE42 -DNPY_HAVE_AVX -DNPY_HAVE_F16C -DNPY_HAVE_FMA3 -DNPY_HAVE_AVX2 -DNPY_HAVE_AVX512F -DNPY_HAVE_AVX512F_REDUCE -DNPY_HAVE_AVX512CD -DNPY_HAVE_AVX512_SKX -DNPY_HAVE_AVX512VL -DNPY_HAVE_AVX512BW -DNPY_HAVE_AVX512DQ -DNPY_HAVE_AVX512BW_MASK -DNPY_HAVE_AVX512DQ_MASK -DNPY_HAVE_AVX512_CLX -DNPY_HAVE_AVX512VNNI -DNPY_HAVE_AVX512_CNL -DNPY_HAVE_AVX512IFMA -DNPY_HAVE_AVX512VBMI -DNPY_HAVE_AVX512_ICL -DNPY_HAVE_AVX512VBMI2 -DNPY_HAVE_AVX512BITALG -DNPY_HAVE_AVX512VPOPCNTDQ -DNPY_HAVE_AVX512_SPR -DNPY_HAVE_AVX512FP16 -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 -mno-mmx -mavx512f -mavx512cd -mavx512vl -mavx512bw -mavx512dq -mavx512vnni -mavx512ifma -mavx512vbmi -mavx512vbmi2 -mavx512bitalg -mavx512vpopcntdq -mavx512fp16 -DNPY_MTARGETS_CURRENT=AVX512_SPR -MD -MQ numpy/_core/libx86_simd_qsort_16bit.dispatch.h_AVX512_SPR.a.p/src_npysort_x86_simd_qsort_16bit.dispatch.cpp.o -MF 
numpy/_core/libx86_simd_qsort_16bit.dispatch.h_AVX512_SPR.a.p/src_npysort_x86_simd_qsort_16bit.dispatch.cpp.o.d -o numpy/_core/libx86_simd_qsort_16bit.dispatch.h_AVX512_SPR.a.p/src_npysort_x86_simd_qsort_16bit.dispatch.cpp.o -c ../numpy/_core/src/npysort/x86_simd_qsort_16bit.dispatch.cpp",
    "file": "../numpy/_core/src/npysort/x86_simd_qsort_16bit.dispatch.cpp",
    "output": "numpy/_core/libx86_simd_qsort_16bit.dispatch.h_AVX512_SPR.a.p/src_npysort_x86_simd_qsort_16bit.dispatch.cpp.o"
  },
  {
    "directory": "/home/raghuveer/MyFiles/src/wrkdir_numpy/numpy/build",
    "command": "g++-12 -Inumpy/_core/libx86_simd_qsort_16bit.dispatch.h_AVX512_ICL.a.p -Inumpy/_core -I../numpy/_core -Inumpy/_core/include -I../numpy/_core/include -I../numpy/_core/src/common -I../numpy/_core/src/multiarray -I../numpy/_core/src/npymath -I../numpy/_core/src/umath -I../numpy/_core/src/highway -I/home/raghuveer/anaconda3/envs/np-dev/include/python3.11 -I/home/raghuveer/MyFiles/src/wrkdir_numpy/numpy/build/meson_cpu -fdiagnostics-color=always -Wall -Winvalid-pch -std=c++17 -O2 -g -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 -mno-mmx -mavx512f -mavx512cd -mavx512vl -mavx512bw -mavx512dq -mavx512vnni -mavx512ifma -mavx512vbmi -mavx512vbmi2 -mavx512bitalg -mavx512vpopcntdq -mavx512fp16 -DNPY_HAVE_AVX512_SPR -DNPY_HAVE_AVX512FP16 -fPIC -DNPY_INTERNAL_BUILD -DHAVE_NPY_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1 -D_LARGEFILE64_SOURCE=1 -D__STDC_VERSION__=0 -fno-exceptions -fno-rtti -O3 -DNPY_HAVE_SSE2 -DNPY_HAVE_SSE -DNPY_HAVE_SSE3 -DNPY_HAVE_SSSE3 -DNPY_HAVE_SSE41 -DNPY_HAVE_POPCNT -DNPY_HAVE_SSE42 -DNPY_HAVE_AVX -DNPY_HAVE_F16C -DNPY_HAVE_FMA3 -DNPY_HAVE_AVX2 -DNPY_HAVE_AVX512F -DNPY_HAVE_AVX512F_REDUCE -DNPY_HAVE_AVX512CD -DNPY_HAVE_AVX512_SKX -DNPY_HAVE_AVX512VL -DNPY_HAVE_AVX512BW -DNPY_HAVE_AVX512DQ -DNPY_HAVE_AVX512BW_MASK -DNPY_HAVE_AVX512DQ_MASK -DNPY_HAVE_AVX512_CLX -DNPY_HAVE_AVX512VNNI -DNPY_HAVE_AVX512_CNL -DNPY_HAVE_AVX512IFMA -DNPY_HAVE_AVX512VBMI -DNPY_HAVE_AVX512_ICL -DNPY_HAVE_AVX512VBMI2 -DNPY_HAVE_AVX512BITALG -DNPY_HAVE_AVX512VPOPCNTDQ -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2 -mno-mmx -mavx512f -mavx512cd -mavx512vl -mavx512bw -mavx512dq -mavx512vnni -mavx512ifma -mavx512vbmi -mavx512vbmi2 -mavx512bitalg -mavx512vpopcntdq -DNPY_MTARGETS_CURRENT=AVX512_ICL -MD -MQ numpy/_core/libx86_simd_qsort_16bit.dispatch.h_AVX512_ICL.a.p/src_npysort_x86_simd_qsort_16bit.dispatch.cpp.o -MF 
numpy/_core/libx86_simd_qsort_16bit.dispatch.h_AVX512_ICL.a.p/src_npysort_x86_simd_qsort_16bit.dispatch.cpp.o.d -o numpy/_core/libx86_simd_qsort_16bit.dispatch.h_AVX512_ICL.a.p/src_npysort_x86_simd_qsort_16bit.dispatch.cpp.o -c ../numpy/_core/src/npysort/x86_simd_qsort_16bit.dispatch.cpp",
    "file": "../numpy/_core/src/npysort/x86_simd_qsort_16bit.dispatch.cpp",
    "output": "numpy/_core/libx86_simd_qsort_16bit.dispatch.h_AVX512_ICL.a.p/src_npysort_x86_simd_qsort_16bit.dispatch.cpp.o"
  },

The behavior observed is not a bug but a consequence of defining a non-inline function in the C++ header files by x86_simd_qsort_, leading to a violation of the One Definition Rule (ODR). To resolve this, functions like qsort or any other non-inline functions should either be declared inline, which treats the function as a weak symbol and allows multiple definitions across translation units, or they should be defined within an anonymous namespace. The anonymous namespace approach ensures that each translation unit has its own unique version of the function, effectively preventing ODR violations while maintaining encapsulation.
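A minimal sketch of the two fixes described above, written as if it were the contents of a header included by several translation units. The names `sort_kernel` and `sort_kernel_local` are hypothetical and are not the actual x86-simd-sort functions; `std::sort` stands in for the SIMD kernel.

```cpp
#include <algorithm>
#include <cstdint>

// Primary template: declaration only, as in the pattern discussed in this issue.
template <typename T>
void sort_kernel(T *arr, int64_t arrsize);

// An explicit specialization is an ordinary function, NOT implicitly inline.
// Defined in a header without `inline`, every TU that includes the header
// emits a strong (global) symbol and the link fails with "multiple definition".
// Marking it `inline` lets the linker merge the duplicates instead.
template <>
inline void sort_kernel<int16_t>(int16_t *arr, int64_t arrsize)
{
    std::sort(arr, arr + arrsize); // placeholder for the real kernel
}

// Alternative: internal linkage via an anonymous namespace, so each TU
// keeps its own private copy and nothing is shared at link time.
namespace {
template <typename T>
void sort_kernel_local(T *arr, int64_t arrsize)
{
    std::sort(arr, arr + arrsize); // placeholder for the real kernel
}
} // namespace
```

The `inline` variant still assumes every TU compiles an identical body, which is the point debated in the rest of this thread.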

@Flamefire
Contributor

@seiko2plus I concluded the same in intel/x86-simd-sort#111 (comment)

TLDR: Indeed an inline fixes the double symbol. However I fear it is numpy violating the ODR:

  • "inline" allows the symbol to be defined by multiple TUs with the linker taking any of them assuming all are the same (code)
  • template functions are "inline" by default
  • numpy compiles the same file, with the same includes multiple times with different AVX flags
  • generated code of those inline functions is now different (e.g. function is compiled with AVX2 and AVX512 in 2 cpp files)
  • linking those object files together discards all but 1 of those inline function instances -> You'll end up with only AVX2 or AVX512 (or their sub-variants) as a single function
  • dispatching now (likely, haven't fully investigated how the numpy dispatching works) will dispatch to a supposedly AVX2 function but the linker might have chosen the AVX512 function -> crash at runtime on CPUs not supporting AVX512
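To illustrate why the list above is a concern, here is a small sketch of an inline function whose body depends on the flags of the translation unit that compiles it. The function name `compiled_isa` is hypothetical; `__AVX512F__` and `__AVX2__` are the predefined macros GCC and Clang set for `-mavx512f` and `-mavx2`.

```cpp
#include <string>

// Imagine this lives in a header included by several .cpp files, each built
// with different -m flags. Every TU emits the SAME symbol name with a
// DIFFERENT body, and the linker keeps only one of them.
inline std::string compiled_isa()
{
#if defined(__AVX512F__)
    return "AVX512F";
#elif defined(__AVX2__)
    return "AVX2";
#else
    return "baseline";
#endif
}
```

If two object files built with `-mavx2` and `-mavx512f` are linked together, all callers see whichever copy the linker kept, which is the mismatch described in the last bullet.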

Giving all functions internal linkage fixes that:

  • many functions in x86-simd-sort are even defined as static inline which is similar to the anonymous namespace
  • Trouble: static inline cannot be applied to this template specialization, so the anonymous namespace needs to be used causing this to be required at more places
  • numpy includes x86-simd-sort, so all functions in x86-simd-sort need to have internal linkage. That pessimizes all other users of the latter: under "normal circumstances" inline would be enough, but now there will be multiple copies of each function in a binary even though one would have been enough
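A sketch of the internal-linkage variant, assuming hypothetical names (`qsort_impl` is not the real x86-simd-sort API). Because an explicit specialization may not carry a storage-class specifier such as `static`, the primary template and its specializations are wrapped in an anonymous namespace instead.

```cpp
#include <algorithm>
#include <cstdint>

namespace { // everything in here has internal linkage

template <typename T>
void qsort_impl(T *arr, int64_t arrsize);

// `static` is not allowed on an explicit specialization, so the enclosing
// anonymous namespace is what provides the internal linkage here.
template <>
void qsort_impl<uint16_t>(uint16_t *arr, int64_t arrsize)
{
    std::sort(arr, arr + arrsize); // placeholder for the real kernel
}

template <>
void qsort_impl<int16_t>(int16_t *arr, int64_t arrsize)
{
    std::sort(arr, arr + arrsize); // placeholder for the real kernel
}

} // namespace
```

The cost mentioned in the last bullet is visible here: every TU that includes such a header carries its own copy of each kernel.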

And finally I think in both numpy and x86-simd-sort the use of templates and explicit specializations is overused or even misused. A common pattern seems to be:

template <typename T>
void foo(T *arr, int64_t arrsize);
template <>
void foo(int16_t *arr, int64_t arrsize) { /* impl */ }
template <>
void foo(uint16_t *arr, int64_t arrsize) { /* impl */ }
// ...

Why are those templates and not simply overloaded functions? That would make, e.g., the above addition of static for internal linkage much easier.
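For comparison, a sketch of the overload-based alternative suggested here. The name `foo` follows the example above; the bodies use `std::sort` purely as a placeholder.

```cpp
#include <algorithm>
#include <cstdint>

// Plain overloads instead of explicit specializations: `static` can be
// applied directly, giving each TU its own internal-linkage copy.
static inline void foo(int16_t *arr, int64_t arrsize)
{
    std::sort(arr, arr + arrsize); // placeholder implementation
}

static inline void foo(uint16_t *arr, int64_t arrsize)
{
    std::sort(arr, arr + arrsize); // placeholder implementation
}
```

Callers are unchanged: `foo(ptr, n)` still picks the right function from the pointer type, only through overload resolution rather than template argument deduction.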

@r-devulap
Member

#25376 should fix this build issue. Could you please verify?

@Flamefire
Contributor

I checked out the PR locally and did a pip install . in an environment where it failed before and it does succeed with that PR.

However, I still think this is an ODR violation by numpy: it links together functions compiled with different architecture flags, which may lead to runtime crashes depending on the linker and the target environment/CPU.

@seiko2plus
Member

numpy compiles the same file, with the same includes multiple times with different AVX flags

That's true, but each compilation exports symbols with unique suffixes based on a compiler definition, e.g. -DNPY_MTARGETS_CURRENT=AVX512_ICL
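The suffixing idea can be sketched with token pasting. This is an illustration only: the macros `MY_CAT`, `MY_WITH_SUFFIX`, and `MY_TARGET` and the function `sum_kernel` are hypothetical and are not NumPy's actual dispatch macros; in a real build the target name would come from a `-D` flag such as the one quoted above.

```cpp
// Hypothetical stand-in for a per-target define passed on the command line,
// e.g. -DMY_TARGET=AVX512_SPR for one object and -DMY_TARGET=AVX512_ICL for another.
#ifndef MY_TARGET
#define MY_TARGET AVX512_ICL
#endif

// Two-level macro so MY_TARGET is expanded before pasting.
#define MY_CAT_(a, b) a##_##b
#define MY_CAT(a, b) MY_CAT_(a, b)
#define MY_WITH_SUFFIX(name) MY_CAT(name, MY_TARGET)

// Expands to sum_kernel_AVX512_ICL with the default above, so objects built
// for different targets export differently named symbols.
int MY_WITH_SUFFIX(sum_kernel)(const int *data, int n)
{
    int total = 0;
    for (int i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

Only the functions wrapped in the macro get a unique name; helpers pulled in from headers keep their plain names, which is the gap discussed in the following comments.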

generated code of those inline functions is now different (e.g. function is compiled with AVX2 and AVX512 in 2 cpp files)
dispatching now (likely, haven't fully investigated how the numpy dispatching works) will dispatch to a supposedly AVX2 function but the linker might have chosen the AVX512 function -> crash at runtime on CPUs not supporting AVX512

If the compiler fails to inline a function, then priority goes to the lowest-interest target; that's why we tend to export unique weak symbols for each TU. See #25045 (comment) for more clarification.

And finally I think in both numpy and x86-simd-sort the use of templates and explicit specializations is overused or even misused. A common pattern seems to be:

Both NumPy and x86-simd-sort use suffixed functions for SIMD kernels in both C and C++ sources to avoid ODR violations.

@Flamefire
Contributor

But the suffixing is not exhaustive as can be seen by this issue: avx512_qsort is called from a file compiled with -mavx512fp16 and again from a file without that as shown in #25274 (comment). It might be possible that this function and all possibly called functions are not fully inlined into the TU

This applies similarly to most functions in x86-simd-sort, which are templates that are only "inline", i.e. weak symbols and not unique (which would require static or anonymous namespaces). So this relies heavily on the linker to hopefully sort it out correctly.

@seiko2plus
Member

But the suffixing is not exhaustive as can be seen by this issue: avx512_qsort is called from a file compiled with -mavx512fp16

That's because you are raising the ceiling of the baseline features. During the loading of the NumPy module, there is a validation step that raises a Python runtime error if the running machine does not support the baseline features, in order to avoid illegal instruction errors.
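A rough sketch of what such a load-time baseline check can look like. It is not NumPy's implementation: `cpu_has_avx512f` and `baseline_ok` are hypothetical helpers, and the check relies on the GCC/Clang builtin `__builtin_cpu_supports`, falling back to `false` on other compilers or architectures.

```cpp
// Returns true if the running CPU reports AVX512F support.
inline bool cpu_has_avx512f()
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init(); // initialize the CPU feature data before querying it
    return __builtin_cpu_supports("avx512f") != 0;
#else
    return false; // unknown platform: assume the feature is absent
#endif
}

// `built_with_avx512f` would normally be derived from the baseline build flags.
// The check passes if the feature is not required or the CPU provides it.
inline bool baseline_ok(bool built_with_avx512f)
{
    return !built_with_avx512f || cpu_has_avx512f();
}
```

A module initializer would call `baseline_ok(...)` once and report an error instead of continuing, so that an illegal-instruction crash cannot happen later.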

It might be possible that this function and all possibly called functions are not fully inlined into the TU
This applies similar to most functions in x86-simd-sort which are templates that are only "inline", i.e. weak symbols and not

Let's differentiate between two situations. The first involves weak symbols, which occur when the C++ compiler fails to inline inline functions; in this case, the linker silently selects the first duplicated symbol. The second involves global symbols, which occur when the C++ compiler fails to inline non-inline functions (our issue here); this leads to a link-time error if there are any duplicated symbols.

To deal with the possibility of duplicated weak symbols, the current approach is safe as long as SIMD kernels have unique symbols and non-suffixed functions of the lowest interest are chosen. Regarding duplicated global symbols, we shouldn't encounter that issue if we adhere to the standard. Moreover, such duplications can be detected at build-time.

@seiko2plus
Member

re-opened this issue till the backport of #25376 gets merged.

@Flamefire
Contributor

re-opened this issue till the backport of #25376 gets merged.

Didn't you merge this 2 minutes prior which is why this got closed?

@seiko2plus
Member

Didn't you merge this 2 minutes prior which is why this got closed?

Yes, I forgot to unlink it. However, since this issue relates to 1.26.x, I thought it would be better to leave it open until the backport is merged and the author of the issue confirms the fix.

@Flamefire
Contributor

Ok, so you meant until that is merged into 1.26.x

FWIW: I already made backport patches for 1.25.1 that also apply on 1.26.2:

@branfosj confirmed that those solve the issue on his system too.

@seiko2plus
Member

Backported by #25475, thank you everyone.
