Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG, SIMD: Segmentation fault during testing quicksort/half-precision on AMD/ZEN4 #25382

Closed
seiko2plus opened this issue Dec 13, 2023 · 6 comments
Assignees
Labels
00 - Bug component: SIMD Issues in SIMD (fast instruction sets) code or machinery

Comments

@seiko2plus
Copy link
Member

seiko2plus commented Dec 13, 2023

Describe the issue:

Segmentation fault during testing quicksort/half-precision on AMD/ZEN4

Reproduce the code example:

spin test

Error message:

numpy/_core/tests/test_half.py .............................
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007ffff5930a03 in _mm256_maskz_loadu_epi16 (__P=0x7fffe785c010, __U=3) at /usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1/include/avx512vlbwintrin.h:137
137	 return (__m256i) __builtin_ia32_loaddquhi256_mask ((const short *) __P,
(gdb) bt
#0  0x00007ffff5930a03 in _mm256_maskz_loadu_epi16 (__P=0x7fffe785c010, __U=3) at /usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1/include/avx512vlbwintrin.h:137
#1  replace_nan_with_inf<zmm_vector<float16>, unsigned short> (arrsize=18446744073682482594, arr=0x7fffe785c010) at ../numpy/_core/src/npysort/x86-simd-sort/src/avx512-16bit-qsort.hpp:503
#2  avx512_qsort_fp16 (hasnan=true, arrsize=63490, arr=0x7fffe449bb50) at ../numpy/_core/src/npysort/x86-simd-sort/src/avx512-16bit-qsort.hpp:527
#3  np::qsort_simd::QSort_AVX512_ICL<np::Half> (arr=0x7fffe449bb50, size=63490) at ../numpy/_core/src/npysort/x86_simd_qsort_16bit.dispatch.cpp:77
#4  0x00007ffff557bb9b in quicksort_dispatch<np::Half> (start=start@entry=0x7fffe449bb50, num=num@entry=63490) at ../numpy/_core/src/npysort/quicksort.cpp:100
#5  0x00007ffff557b701 in quicksort_half (start=0x7fffe449bb50, n=63490, __NPY_UNUSED_TAGGEDvarr=__NPY_UNUSED_TAGGEDvarr@entry=0x7fffd3775410) at ../numpy/_core/src/npysort/quicksort.cpp:793
#6  0x00007ffff5531482 in _new_sortlike (op=0x7fffd3775410, axis=<optimized out>, sort=0x7ffff557b6f0 <quicksort_half(void*, npy_intp, void*)>, part=part@entry=0x0, kth=kth@entry=0x0, nkth=nkth@entry=0)
    at ../numpy/_core/src/multiarray/item_selection.c:1255
#7  0x00007ffff5535a9e in PyArray_Sort (op=<optimized out>, axis=<optimized out>, which=<optimized out>) at ../numpy/_core/src/multiarray/item_selection.c:1549
#8  0x00007ffff5546e1a in array_sort (self=0x7fffd3775410, args=<optimized out>, len_args=<optimized out>, kwnames=<optimized out>) at ../numpy/_core/src/multiarray/methods.c:1285
#9  0x00007ffff7a2375f in method_vectorcall_FASTCALL_KEYWORDS (func=0x7ffff5c304a0, args=0x7ffff7e6c760, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/descrobject.c:426
#10 0x00007ffff79f2987 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x7ffff5c304a0, tstate=0x7ffff7d89378 <_PyRuntime+166328>)
    at ./Include/internal/pycore_call.h:92
#11 PyObject_Vectorcall (callable=0x7ffff5c304a0, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:299
#12 0x00007ffff79e4c23 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4760
#13 0x00007ffff7a2e223 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7e6c6f8, tstate=0x7ffff7d89378 <_PyRuntime+166328>) at ./Include/internal/pycore_ceval.h:73
#14 _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=0x7fffffff9e38, locals=0x0, func=0x7ffff3637ec0, tstate=0x7ffff7d89378 <_PyRuntime+166328>) at Python/ceval.c:6425
#15 _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, stack=0x7fffffff9e38, func=0x7ffff3637ec0) at Objects/call.c:393
#16 _PyObject_VectorcallTstate (tstate=0x7ffff7d89378 <_PyRuntime+166328>, callable=0x7ffff3637ec0, args=0x7fffffff9e38, nargsf=<optimized out>, kwnames=<optimized out>) at ./Include/internal/pycore_call.h:92
#17 0x00007ffff7a2d540 in method_vectorcall (method=<optimized out>, args=0x7ffff7d6eff0 <_PyRuntime+58928>, nargsf=<optimized out>, kwnames=0x0) at Objects/classobject.c:67
#18 0x00007ffff79e8f3d in do_call_core (use_tracing=<optimized out>, kwdict=0x7fffd2fe08c0, callargs=0x7ffff7d6efd8 <_PyRuntime+58904>, func=0x7ffff368fd00, tstate=<optimized out>) at Python/ceval.c:7343
#19 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:5367
#20 0x00007ffff7a0bb90 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7e6c640, tstate=0x7ffff7d89378 <_PyRuntime+166328>) at ./Include/internal/pycore_ceval.h:73
#21 _PyEval_Vector (kwnames=<optimized out>, argcount=1, args=0x7fffd5bdd0a8, locals=0x0, func=0x7ffff69f5120, tstate=0x7ffff7d89378 <_PyRuntime+166328>) at Python/ceval.c:6425
#22 _PyFunction_Vectorcall (func=0x7ffff69f5120, stack=0x7fffd5bdd0a8, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:393
#23 0x00007ffff79e8f3d in do_call_core (use_tracing=<optimized out>, kwdict=0x0, callargs=0x7fffd5bdd090, func=0x7ffff69f5120, tstate=<optimized out>) at Python/ceval.c:7343
#24 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:5367
#25 0x00007ffff7a0bb90 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7e6c3e8, tstate=0x7ffff7d89378 <_PyRuntime+166328>) at ./Include/internal/pycore_ceval.h:73
#26 _PyEval_Vector (kwnames=<optimized out>, argcount=1, args=0x7fffd375bab8, locals=0x0, func=0x7ffff6d5cae0, tstate=0x7ffff7d89378 <_PyRuntime+166328>) at Python/ceval.c:6425
#27 _PyFunction_Vectorcall (func=0x7ffff6d5cae0, stack=0x7fffd375bab8, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:393

Python and NumPy Versions:

>>> import sys, numpy; print(numpy.__version__); print(sys.version)
2.0.0.dev0+git20231209.35c4319
3.11.6 (main, Nov 14 2023, 09:36:21) [GCC 13.2.1 20230801]
show_config()
>>> np.show_config()
/home/seiko/repos/numpy/build-install/usr/lib/python3.11/site-packages/numpy/__config__.py:155: UserWarning: Install `pyyaml` for better output
  warnings.warn("Install `pyyaml` for better output", stacklevel=1)
{
  "Compilers": {
    "c": {
      "name": "gcc",
      "linker": "ld.bfd",
      "version": "13.2.1",
      "commands": "cc"
    },
    "cython": {
      "name": "cython",
      "linker": "cython",
      "version": "3.0.6",
      "commands": "cython"
    },
    "c++": {
      "name": "gcc",
      "linker": "ld.bfd",
      "version": "13.2.1",
      "commands": "c++"
    }
  },
  "Machine Information": {
    "host": {
      "cpu": "x86_64",
      "family": "x86_64",
      "endian": "little",
      "system": "linux"
    },
    "build": {
      "cpu": "x86_64",
      "family": "x86_64",
      "endian": "little",
      "system": "linux"
    }
  },
  "Build Dependencies": {
    "blas": {
      "name": "auto"
    },
    "lapack": {
      "name": "dep140476350174480",
      "found": true,
      "version": "2.0.0.dev0+git20231209.35c4319",
      "detection method": "internal",
      "include directory": "unknown",
      "lib directory": "unknown",
      "openblas configuration": "unknown",
      "pc file directory": "unknown"
    }
  },
  "Python Information": {
    "path": "/home/seiko/repos/numpy-dev/bin/python",
    "version": "3.11"
  },
  "SIMD Extensions": {
    "baseline": [
      "SSE",
      "SSE2",
      "SSE3"
    ],
    "found": [
      "SSSE3",
      "SSE41",
      "POPCNT",
      "SSE42",
      "AVX",
      "F16C",
      "FMA3",
      "AVX2",
      "AVX512F",
      "AVX512CD",
      "AVX512_SKX",
      "AVX512_CLX",
      "AVX512_CNL",
      "AVX512_ICL"
    ],
    "not found": [
      "AVX512_KNL",
      "AVX512_KNM",
      "AVX512_SPR"
    ]
  }
}
@seiko2plus seiko2plus added 00 - Bug component: SIMD Issues in SIMD (fast instruction sets) code or machinery labels Dec 13, 2023
@seiko2plus
Copy link
Member Author

@seiko2plus
Copy link
Member Author

@r-devulap, The 512-bit vector should have 32 lanes of float16:

Aha Bug is fixed by intel/x86-simd-sort#113

@r-devulap
Copy link
Member

@r-devulap, The 512-bit vector should have 32 lanes of float16:

https://github.com/intel/x86-simd-sort/blob/7060e3c768992441aa6454a6f9320a9fe1f870da/src/avx512-16bit-qsort.hpp#L497-L515

We actually use only 16 lanes, because we convert float16 to float32 to detect NAN. There is a better way to do it, but this works for now and hardly affects perf.

@r-devulap
Copy link
Member

The bug was introduced when we started using size_t instead of int64_t to represent array sizes. arrsize - 16 underflows when it is size_t and looped forever and segfaulted.

@r-devulap r-devulap self-assigned this Dec 13, 2023
@seiko2plus
Copy link
Member Author

Thank you for the clarification.

@r-devulap
Copy link
Member

Closing. resolved by #25376.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
00 - Bug component: SIMD Issues in SIMD (fast instruction sets) code or machinery
Projects
None yet
Development

No branches or pull requests

2 participants