BUG, SIMD: Segmentation fault during testing quicksort/half-precision on AMD/ZEN4 #25382

seiko2plus · 2023-12-13T10:56:51Z

Describe the issue:

Segmentation fault during testing quicksort/half-precision on AMD/ZEN4

Reproduce the code example:

spin test

Error message:

numpy/_core/tests/test_half.py .............................
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007ffff5930a03 in _mm256_maskz_loadu_epi16 (__P=0x7fffe785c010, __U=3) at /usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1/include/avx512vlbwintrin.h:137
137	 return (__m256i) __builtin_ia32_loaddquhi256_mask ((const short *) __P,
(gdb) bt
#0  0x00007ffff5930a03 in _mm256_maskz_loadu_epi16 (__P=0x7fffe785c010, __U=3) at /usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1/include/avx512vlbwintrin.h:137
#1  replace_nan_with_inf<zmm_vector<float16>, unsigned short> (arrsize=18446744073682482594, arr=0x7fffe785c010) at ../numpy/_core/src/npysort/x86-simd-sort/src/avx512-16bit-qsort.hpp:503
#2  avx512_qsort_fp16 (hasnan=true, arrsize=63490, arr=0x7fffe449bb50) at ../numpy/_core/src/npysort/x86-simd-sort/src/avx512-16bit-qsort.hpp:527
#3  np::qsort_simd::QSort_AVX512_ICL<np::Half> (arr=0x7fffe449bb50, size=63490) at ../numpy/_core/src/npysort/x86_simd_qsort_16bit.dispatch.cpp:77
#4  0x00007ffff557bb9b in quicksort_dispatch<np::Half> (start=start@entry=0x7fffe449bb50, num=num@entry=63490) at ../numpy/_core/src/npysort/quicksort.cpp:100
#5  0x00007ffff557b701 in quicksort_half (start=0x7fffe449bb50, n=63490, __NPY_UNUSED_TAGGEDvarr=__NPY_UNUSED_TAGGEDvarr@entry=0x7fffd3775410) at ../numpy/_core/src/npysort/quicksort.cpp:793
#6  0x00007ffff5531482 in _new_sortlike (op=0x7fffd3775410, axis=<optimized out>, sort=0x7ffff557b6f0 <quicksort_half(void*, npy_intp, void*)>, part=part@entry=0x0, kth=kth@entry=0x0, nkth=nkth@entry=0)
    at ../numpy/_core/src/multiarray/item_selection.c:1255
#7  0x00007ffff5535a9e in PyArray_Sort (op=<optimized out>, axis=<optimized out>, which=<optimized out>) at ../numpy/_core/src/multiarray/item_selection.c:1549
#8  0x00007ffff5546e1a in array_sort (self=0x7fffd3775410, args=<optimized out>, len_args=<optimized out>, kwnames=<optimized out>) at ../numpy/_core/src/multiarray/methods.c:1285
#9  0x00007ffff7a2375f in method_vectorcall_FASTCALL_KEYWORDS (func=0x7ffff5c304a0, args=0x7ffff7e6c760, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/descrobject.c:426
#10 0x00007ffff79f2987 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x7ffff5c304a0, tstate=0x7ffff7d89378 <_PyRuntime+166328>)
    at ./Include/internal/pycore_call.h:92
#11 PyObject_Vectorcall (callable=0x7ffff5c304a0, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:299
#12 0x00007ffff79e4c23 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4760
#13 0x00007ffff7a2e223 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7e6c6f8, tstate=0x7ffff7d89378 <_PyRuntime+166328>) at ./Include/internal/pycore_ceval.h:73
#14 _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=0x7fffffff9e38, locals=0x0, func=0x7ffff3637ec0, tstate=0x7ffff7d89378 <_PyRuntime+166328>) at Python/ceval.c:6425
#15 _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, stack=0x7fffffff9e38, func=0x7ffff3637ec0) at Objects/call.c:393
#16 _PyObject_VectorcallTstate (tstate=0x7ffff7d89378 <_PyRuntime+166328>, callable=0x7ffff3637ec0, args=0x7fffffff9e38, nargsf=<optimized out>, kwnames=<optimized out>) at ./Include/internal/pycore_call.h:92
#17 0x00007ffff7a2d540 in method_vectorcall (method=<optimized out>, args=0x7ffff7d6eff0 <_PyRuntime+58928>, nargsf=<optimized out>, kwnames=0x0) at Objects/classobject.c:67
#18 0x00007ffff79e8f3d in do_call_core (use_tracing=<optimized out>, kwdict=0x7fffd2fe08c0, callargs=0x7ffff7d6efd8 <_PyRuntime+58904>, func=0x7ffff368fd00, tstate=<optimized out>) at Python/ceval.c:7343
#19 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:5367
#20 0x00007ffff7a0bb90 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7e6c640, tstate=0x7ffff7d89378 <_PyRuntime+166328>) at ./Include/internal/pycore_ceval.h:73
#21 _PyEval_Vector (kwnames=<optimized out>, argcount=1, args=0x7fffd5bdd0a8, locals=0x0, func=0x7ffff69f5120, tstate=0x7ffff7d89378 <_PyRuntime+166328>) at Python/ceval.c:6425
#22 _PyFunction_Vectorcall (func=0x7ffff69f5120, stack=0x7fffd5bdd0a8, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:393
#23 0x00007ffff79e8f3d in do_call_core (use_tracing=<optimized out>, kwdict=0x0, callargs=0x7fffd5bdd090, func=0x7ffff69f5120, tstate=<optimized out>) at Python/ceval.c:7343
#24 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:5367
#25 0x00007ffff7a0bb90 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7e6c3e8, tstate=0x7ffff7d89378 <_PyRuntime+166328>) at ./Include/internal/pycore_ceval.h:73
#26 _PyEval_Vector (kwnames=<optimized out>, argcount=1, args=0x7fffd375bab8, locals=0x0, func=0x7ffff6d5cae0, tstate=0x7ffff7d89378 <_PyRuntime+166328>) at Python/ceval.c:6425
#27 _PyFunction_Vectorcall (func=0x7ffff6d5cae0, stack=0x7fffd375bab8, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:393

Python and NumPy Versions:

>>> import sys, numpy; print(numpy.__version__); print(sys.version)
2.0.0.dev0+git20231209.35c4319
3.11.6 (main, Nov 14 2023, 09:36:21) [GCC 13.2.1 20230801]

show_config()

>>> np.show_config()
/home/seiko/repos/numpy/build-install/usr/lib/python3.11/site-packages/numpy/__config__.py:155: UserWarning: Install `pyyaml` for better output
  warnings.warn("Install `pyyaml` for better output", stacklevel=1)
{
  "Compilers": {
    "c": {
      "name": "gcc",
      "linker": "ld.bfd",
      "version": "13.2.1",
      "commands": "cc"
    },
    "cython": {
      "name": "cython",
      "linker": "cython",
      "version": "3.0.6",
      "commands": "cython"
    },
    "c++": {
      "name": "gcc",
      "linker": "ld.bfd",
      "version": "13.2.1",
      "commands": "c++"
    }
  },
  "Machine Information": {
    "host": {
      "cpu": "x86_64",
      "family": "x86_64",
      "endian": "little",
      "system": "linux"
    },
    "build": {
      "cpu": "x86_64",
      "family": "x86_64",
      "endian": "little",
      "system": "linux"
    }
  },
  "Build Dependencies": {
    "blas": {
      "name": "auto"
    },
    "lapack": {
      "name": "dep140476350174480",
      "found": true,
      "version": "2.0.0.dev0+git20231209.35c4319",
      "detection method": "internal",
      "include directory": "unknown",
      "lib directory": "unknown",
      "openblas configuration": "unknown",
      "pc file directory": "unknown"
    }
  },
  "Python Information": {
    "path": "/home/seiko/repos/numpy-dev/bin/python",
    "version": "3.11"
  },
  "SIMD Extensions": {
    "baseline": [
      "SSE",
      "SSE2",
      "SSE3"
    ],
    "found": [
      "SSSE3",
      "SSE41",
      "POPCNT",
      "SSE42",
      "AVX",
      "F16C",
      "FMA3",
      "AVX2",
      "AVX512F",
      "AVX512CD",
      "AVX512_SKX",
      "AVX512_CLX",
      "AVX512_CNL",
      "AVX512_ICL"
    ],
    "not found": [
      "AVX512_KNL",
      "AVX512_KNM",
      "AVX512_SPR"
    ]
  }
}

seiko2plus · 2023-12-13T11:23:39Z

@r-devulap, The 512-bit vector should have 32 lanes of float16:

https://github.com/intel/x86-simd-sort/blob/7060e3c768992441aa6454a6f9320a9fe1f870da/src/avx512-16bit-qsort.hpp#L497-L515

seiko2plus · 2023-12-13T11:27:20Z

@r-devulap, The 512-bit vector should have 32 lanes of float16:

Aha, the bug is fixed by intel/x86-simd-sort#113.

r-devulap · 2023-12-13T16:55:37Z

@r-devulap, The 512-bit vector should have 32 lanes of float16:

https://github.com/intel/x86-simd-sort/blob/7060e3c768992441aa6454a6f9320a9fe1f870da/src/avx512-16bit-qsort.hpp#L497-L515

We actually use only 16 lanes, because we convert float16 to float32 to detect NaN. There is a better way to do it, but this works for now and hardly affects performance.
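For illustration, one way to detect NaN without widening to float32 is to test the raw binary16 bit pattern. The sketch below is scalar and assumes the IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits). The names `half_is_nan` and `replace_nan_with_inf_scalar` are hypothetical; this is not the x86-simd-sort implementation, and not necessarily the "better way" referred to above.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: IEEE 754 binary16 layout. A value is NaN when all five
// exponent bits are set and the mantissa is non-zero, i.e. the magnitude
// bits are strictly greater than the +inf pattern 0x7C00.
constexpr bool half_is_nan(std::uint16_t bits)
{
    return (bits & 0x7FFFu) > 0x7C00u;
}

// Hypothetical scalar helper: overwrite every NaN with +inf (0x7C00)
// in place and return how many values were replaced.
inline std::size_t replace_nan_with_inf_scalar(std::uint16_t* arr, std::size_t n)
{
    std::size_t nan_count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (half_is_nan(arr[i])) {
            arr[i] = 0x7C00u;
            ++nan_count;
        }
    }
    return nan_count;
}
```

Because the check works on 16-bit integers, a vectorized variant of the same comparison could in principle use all 32 lanes of a 512-bit register; that is an assumption about a possible design, not something confirmed in this thread.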

r-devulap · 2023-12-13T16:58:00Z

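The bug was introduced when we started using size_t instead of int64_t to represent array sizes. With size_t, arrsize - 16 underflows (wraps around to a huge value) once fewer than 16 elements remain, so the loop never terminated, kept reading past the end of the array, and eventually segfaulted.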
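The wraparound can be reproduced in isolation. The sketch below is not the x86-simd-sort source: `count_chunks` and `count_chunks_safe` are hypothetical helpers that only count loop iterations, and the `cap` parameter stands in for the end of mapped memory so the unsigned variant stops instead of crashing.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: counts how many 16-element chunks a loop of the
// form `while (n > 0) { ...; n -= 16; }` visits. With a signed SizeT the
// counter goes negative and the loop ends; with an unsigned SizeT it
// wraps to a huge value and the loop only stops at `cap`.
template <typename SizeT>
std::size_t count_chunks(SizeT n, std::size_t cap)
{
    std::size_t chunks = 0;
    while (n > 0 && chunks < cap) {
        ++chunks;
        n -= 16;  // wraps when SizeT is unsigned and n < 16
    }
    return chunks;
}

// One possible unsigned-safe form: never subtract more than what is left.
inline std::size_t count_chunks_safe(std::size_t n)
{
    std::size_t chunks = 0;
    while (n > 0) {
        const std::size_t step = n < 16 ? n : 16;
        ++chunks;
        n -= step;
    }
    return chunks;
}
```

With the array size from the backtrace, `count_chunks<std::int64_t>(63490, 10000)` and `count_chunks_safe(63490)` both visit 3969 chunks (3968 full chunks plus a tail of 2 elements, which matches the load mask `__U=3` in frame #0), whereas `count_chunks<std::size_t>(63490, 10000)` runs until it hits the cap.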

seiko2plus · 2023-12-15T10:40:17Z

Thank you for the clarification.

r-devulap · 2023-12-15T16:26:02Z

Closing. Resolved by #25376.

seiko2plus added the "00 - Bug" and "component: SIMD" (Issues in SIMD (fast instruction sets) code or machinery) labels Dec 13, 2023

seiko2plus mentioned this issue Dec 13, 2023

BUG: Fix build issues on SPR and avx512_qsort float16 #25376

Merged

r-devulap self-assigned this Dec 13, 2023

r-devulap closed this as completed Dec 15, 2023
