From f34f2281640e06669c183fdcaaf1c8e506e6ca01 Mon Sep 17 00:00:00 2001
From: mattip
Date: Mon, 25 Nov 2019 17:01:41 -0800
Subject: [PATCH 1/3] NEP: propose SIMD NEP

---
 doc/neps/nep-XXXX-SIMD-optimizations.rst | 249 +++++++++++++++++++++++
 1 file changed, 249 insertions(+)
 create mode 100644 doc/neps/nep-XXXX-SIMD-optimizations.rst

diff --git a/doc/neps/nep-XXXX-SIMD-optimizations.rst b/doc/neps/nep-XXXX-SIMD-optimizations.rst
new file mode 100644
index 000000000000..ac5e9cee07cb
--- /dev/null
+++ b/doc/neps/nep-XXXX-SIMD-optimizations.rst
@@ -0,0 +1,249 @@
+===============================================================
+NEP XXXX — Using SIMD optimization instructions for performance
+===============================================================
+
+:Author: Sayed Adel, Matti Picus, Ralf Gommers
+:Status: Draft
+:Type: Standards
+:Created: 2019-11-25
+:Resolution: none
+
+
+Abstract
+--------
+
+While compilers are getting better at using hardware-specific routines to
+optimize code, they sometimes do not produce optimal results. Also, we would
+like to be able to copy binary C-extension modules from one machine to another
+with the same architecture without recompiling.
+
+We have a mechanism in the ufunc machinery to `build alternative loops`_
+indexed by CPU feature name. At import (in ``InitOperators``), the loop
+function that matches the run-time CPU info `is chosen`_ from the candidates. This
+NEP proposes a mechanism to build on that for many more features and
+architectures. The steps proposed are to:
+
+- Establish a set of well-defined, architecture-agnostic, universal intrinsics
+  which capture features available across architectures.
+- Capture these universal intrinsics in a set of C macros and use the macros
+  to build code paths for sets of features from the baseline up to the maximum
+  set of features available on that architecture. Offer these as a limited
+  number of compiled alternative code paths.
+- At runtime, discover which CPU features are available, and choose from among
+  the possible code paths accordingly.
+
+
+Motivation and Scope
+--------------------
+
+Traditionally NumPy has counted on the compilers to generate optimal code
+specifically for the target architecture.
+However, few users today compile NumPy locally for their machines. Most use the
+binary packages, which must provide run-time support for the lowest-common
+denominator CPU architecture. Thus NumPy cannot take advantage of the more
+advanced features of modern CPUs, since they may not be available on all
+users' systems. The ufunc machinery already has a loop-selection protocol
+based on dtypes, so it is easy to extend this to also select an optimal loop
+for the CPU features available at runtime.
+
+Traditionally, these features have been exposed through `intrinsics`_, which
+are compiler-specific instructions that map directly to assembly instructions.
+Recently there were discussions about the effectiveness of adding more
+intrinsics (e.g., `gh-11113`_ for AVX optimizations for floats). In the past,
+architecture-specific code was added to NumPy for `fast avx512 routines`_ in
+various ufuncs, using the mechanism described above to choose the best loop
+for the architecture. However, the code is not generic and does not generalize
+to other architectures.
+
+Recently, OpenCV moved to using `universal intrinsics`_ in the Hardware
+Abstraction Layer (HAL), which provides a nice abstraction for common shared
+Single Instruction Multiple Data (SIMD) constructs. This NEP proposes a
+similar mechanism for NumPy. There are three stages to using the mechanism:
+
+- Infrastructure is provided in the code for abstract intrinsics. The ufunc
+  machinery will be extended using sets of these abstract intrinsics, so that
+  a single ufunc will be expressed as a set of loops, going from a minimal to
+  a maximal set of possibly available intrinsics.
+- At compile time, compiler macros and CPU detection are used to turn the
+  abstract intrinsics into concrete intrinsic calls. Any intrinsics not
+  available on the platform, either because the CPU does not support them
+  (and so they cannot be tested) or because the abstract intrinsic has no
+  parallel concrete intrinsic on the platform, will not cause an error;
+  instead, the corresponding loop is simply not produced and not added to the
+  set of possibilities.
+- At runtime, the CPU detection code will further limit the set of loops
+  available, and the optimal one will be chosen for the ufunc.
+
+The current NEP proposes only to use the runtime feature detection and optimal
+loop selection mechanism for ufuncs. Future NEPs may propose other uses for the
+proposed solution.
+
+Usage and Impact
+----------------
+
+The end user will be able to get a list of intrinsics available for their
+platform and compiler. Optionally,
+the user may be able to specify which of the loops available at runtime will be
+used, perhaps via an environment variable to enable benchmarking the impact of
+the different loops. There should be no direct impact to naive end users, the
+results of all the loops should be identical to within a small number (1-3)
+ULPs. On the other hand, users with more powerful machines should notice a
+performance boost.
+
+Binary releases - wheels on PyPI and conda packages
+```````````````````````````````````````````````````
+
+The binaries released by this process will be larger since they include all
+possible loops for the architecture. Some packagers may prefer to limit the
+number of loops in order to limit the size of the binaries, we would hope they
+would still support a wide range of families of architectures. Note this
+problem already exists in the Intel MKL offering.
+
+Source builds
+`````````````
+TBD
+- Setting the baseline and set of runtime-dispatchable ISA extensions
+- Behavior when compiler or hardware doesn't support a requested ISA extension
+
+
+How to run benchmarks to assess performance benefits
+````````````````````````````````````````````````````
+
+Adding more code which uses intrinsics will make the code harder to maintain.
+Therefore, such code should only be added if it yields a significant
+performance benefit. Assessing this performance benefit can be nontrivial.
+To aid with this, the implementation for this NEP will add a way to select
+which instruction sets can be used at *runtime* via environment variables.
+(name TBD).
+
+
+Diagnostics
+```````````
+
+A new dictionary `__cpu_features__` will be available to Python. The keys are
+the available features; the values are booleans indicating whether each
+feature is available. Various new private
+C functions will be used internally to query available features. These
+might be exposed via specific C-extension modules for testing.
+
+
+Workflow for adding a new CPU architecture-specific optimization
+````````````````````````````````````````````````````````````````
+
+NumPy will always have a baseline C implementation for any code that may be
+a candidate for SIMD vectorization. If a contributor wants to add SIMD
+support for some architecture (typically the one of most interest to them),
+this is the proposed workflow:
+
+TODO (see https://github.com/numpy/numpy/pull/13516#issuecomment-558859638,
+needs to be worked out more)
+
+Reuse by other projects
+```````````````````````
+
+It would be nice if the universal intrinsics were available to other
+libraries like SciPy or Astropy that also build ufuncs, but that is not an
+explicit goal of the first implementation of this NEP.
+
+Backward compatibility
+----------------------
+
+There should be no impact on backwards compatibility.
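The runtime behavior discussed above (a ``__cpu_features__`` mapping plus selection of the best available loop) can be sketched in Python. This is an illustrative sketch only: the dictionary shape (feature name to boolean) follows the NEP's description, but the feature values, loop names, and the ``pick_loop`` helper are invented for the example and are not part of NumPy's API.

```python
# Illustrative sketch: __cpu_features__ maps feature names to booleans per the
# NEP; the values, loop names, and pick_loop helper here are invented.
cpu_features = {"SSE3": True, "AVX2": True, "AVX512F": False}

def pick_loop(candidates, features):
    """Return the first loop whose required feature is present.

    candidates: list of (required_feature_or_None, loop_name), best first.
    A ``None`` requirement marks the baseline loop, which always matches.
    """
    for required, loop in candidates:
        if required is None or features.get(required, False):
            return loop
    raise RuntimeError("no loop available for this CPU")

loops = [("AVX512F", "loop_avx512f"), ("AVX2", "loop_avx2"), (None, "loop_baseline")]
print(pick_loop(loops, cpu_features))  # -> loop_avx2 (AVX512F is unavailable)
```

An environment variable, as proposed for benchmarking, would simply filter the candidate list before dispatch.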
+
+
+Detailed description
+--------------------
+
+Two new build options are available to ``runtests.py`` and ``setup.py build``.
+The absolute minimum required features to compile are defined by
+``--cpu-baseline``. For instance, on ``x86_64`` this defaults to ``SSE3``. The
+set of additional intrinsics that can be detected and used as sets of
+requirements to dispatch on are set by ``--cpu-dispatch``. For instance, on
+``x86_64`` this defaults to ``[SSSE3, SSE41, POPCNT, SSE42, AVX, F16C, XOP,
+FMA4, FMA3, AVX2, AVX512F, AVX512CD, AVX512_KNL, AVX512_KNM, AVX512_SKX,
+AVX512_CLX, AVX512_CNL, AVX512_ICL]``. These features are all mapped to a
+C-level boolean array ``npy__cpu_have``, and a C-level convenience function
+``npy_cpu_have(int feature_id)`` queries this array.
+
+The CPU-specific features are then mapped to universal intrinsics, which are
+shared across all x86 SIMD variants, ARM SIMD variants, etc. For example, the
+NumPy universal intrinsic ``npyv_load_u32`` maps to:
+
+* ``vld1q_u32`` for ARM-based NEON
+* ``_mm256_loadu_si256`` for x86-based AVX2
+* ``_mm512_loadu_si512`` for x86-based AVX-512
+
+Anyone writing a SIMD loop will use the ``npyv_load_u32`` macro instead of the
+architecture-specific intrinsic. The code also supplies guard macros for
+compilation and runtime, so that the proper loops can be chosen.
+
+Related Work
+------------
+
+- PIXMAX TBD: what is it?
+- `Eigen`_ is a C++ template library for linear algebra: matrices, vectors,
+  numerical solvers, and related algorithms. It is a higher-level abstraction
+  than the intrinsics discussed here.
+- `xsimd`_ is a header-only C++ library for x86 and ARM that implements the
+  mathematical functions used in the algorithms of ``boost.SIMD``.
+- OpenCV used to have the one-implementation-per-architecture design, but more
+  recently moved to a design that is quite similar to what is proposed in this
+  NEP.
The top-level `dispatch code`_ includes a `generic header`_ that is
  `specialized at compile time`_ by the CMake build system.
+
+
+Implementation
+--------------
+
+Current PRs:
+
+- `gh-13421: improve runtime detection of CPU features <https://github.com/numpy/numpy/pull/13421>`_
+- `gh-13516: enable multi-platform SIMD compiler optimizations <https://github.com/numpy/numpy/pull/13516>`_
+
+The compile-time and runtime code infrastructure is supplied by the first PR.
+The second adds a demonstration of use of the infrastructure for a loop. Once
+the NEP is approved, more work is needed to write loops using the mechanisms
+provided by the NEP.
+
+Alternatives
+------------
+
+A proposed alternative in gh-13516_ is to implement loops for each CPU
+architecture separately by hand, without trying to abstract common patterns in
+the SIMD intrinsics (e.g., have `loops.avx512.c.src`, `loops.avx2.c.src`,
+`loops.sse.c.src`, `loops.vsx.c.src`, `loops.neon.c.src`, etc.). This is more
+similar to what PIXMAX does. There's a lot of duplication here though, and it
+the manual code duplication requires a champion who will be dedicated to
+implementing and maintaining that platform's loop code.
+
+
+Discussion
+----------
+
+.. note::
+   Will include a summary of the discussion once the NEP is published
+
+References and Footnotes
+------------------------
+
+.. _`build alternative loops`: https://github.com/numpy/numpy/blob/v1.17.4/numpy/core/code_generators/generate_umath.py#L50
+.. _`is chosen`: https://github.com/numpy/numpy/blob/v1.17.4/numpy/core/code_generators/generate_umath.py#L1038
+.. _`gh-11113`: https://github.com/numpy/numpy/pull/11113
+.. _`fast avx512 routines`: https://github.com/numpy/numpy/pulls?q=is%3Apr+avx512+is%3Aclosed
+
+.. [1] Each NEP must either be explicitly labeled as placed in the public domain (see
+   this NEP as an example) or licensed under the `Open Publication License`_.
+
+.. _Open Publication License: https://www.opencontent.org/openpub/
+
+.. _`xsimd`: https://xsimd.readthedocs.io/en/latest/
+.. _`Eigen`: http://eigen.tuxfamily.org/index.php?title=Main_Page
+.. _`dispatch code`: https://github.com/opencv/opencv/blob/4.1.2/modules/core/src/arithm.dispatch.cpp
+.. _`generic header`: https://github.com/opencv/opencv/blob/4.1.2/modules/core/src/arithm.simd.hpp
+.. _`specialized at compile time`: https://github.com/opencv/opencv/blob/4.1.2/modules/core/CMakeLists.txt#L3-#L13
+.. _`intrinsics`: https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-intrinsics
+.. _`universal intrinsics`: https://docs.opencv.org/master/df/d91/group__core__hal__intrin.html
+
+Copyright
+---------
+
+This document has been placed in the public domain. [1]_

From cbed54dd4da52c17fb595e27773643ec7bb96a9b Mon Sep 17 00:00:00 2001
From: mattip
Date: Wed, 15 Jan 2020 09:04:37 +1100
Subject: [PATCH 2/3] NEP: changes from review

---
 doc/neps/nep-XXXX-SIMD-optimizations.rst | 32 +++++++++++++-----------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/doc/neps/nep-XXXX-SIMD-optimizations.rst b/doc/neps/nep-XXXX-SIMD-optimizations.rst
index ac5e9cee07cb..a19cded6b233 100644
--- a/doc/neps/nep-XXXX-SIMD-optimizations.rst
+++ b/doc/neps/nep-XXXX-SIMD-optimizations.rst
@@ -1,6 +1,6 @@
-===============================================================
-NEP XXXX — Using SIMD optimization instructions for performance
-===============================================================
+=============================================================
+NEP 38 — Using SIMD optimization instructions for performance
+=============================================================
 
 :Author: Sayed Adel, Matti Picus, Ralf Gommers
 :Status: Draft
@@ -14,8 +14,9 @@
 While compilers are getting better at using hardware-specific routines to
 optimize code, they sometimes do not produce optimal results. Also, we would
-like to be able to copy binary C-extension modules from one machine to another
-with the same architecture without recompiling.
+like to be able to copy binary optimized C-extension modules from one machine
+to another with the same base architecture (x86, ARM, PowerPC) but with
+different capabilities without recompiling.
 
 We have a mechanism in the ufunc machinery to `build alternative loops`_
 indexed by CPU feature name. At import (in ``InitOperators``), the loop
@@ -85,9 +86,9 @@ platform and compiler. Optionally,
 the user may be able to specify which of the loops available at runtime will be
 used, perhaps via an environment variable to enable benchmarking the impact of
 the different loops. There should be no direct impact to naive end users, the
-results of all the loops should be identical to within a small number (1-3)
+results of all the loops should be identical to within a small number (1-3?)
 ULPs. On the other hand, users with more powerful machines should notice a
-performance boost.
+significant performance boost.
 
 Binary releases - wheels on PyPI and conda packages
 ```````````````````````````````````````````````````
@@ -96,14 +97,17 @@
 The binaries released by this process will be larger since they include all
 possible loops for the architecture. Some packagers may prefer to limit the
 number of loops in order to limit the size of the binaries, we would hope they
 would still support a wide range of families of architectures. Note this
-problem already exists in the Intel MKL offering.
+problem already exists in the Intel MKL offering, where the binary package
+includes an extensive set of alternative shared objects (DLLs) for various CPU
+alternatives.
 
 Source builds
 `````````````
-TBD
-- Setting the baseline and set of runtime-dispatchable ISA extensions
-- Behavior when compiler or hardware doesn't support a requested ISA extension
+See "Detailed Description" below. A source build where the packager knows
+details of the target machine could theoretically produce a smaller binary by
+choosing to compile only the loops needed by the target via command line
+arguments.
 
 
 How to run benchmarks to assess performance benefits
 ````````````````````````````````````````````````````
@@ -113,7 +117,7 @@
 Therefore, such code should only be added if it yields a significant
 performance benefit. Assessing this performance benefit can be nontrivial.
 To aid with this, the implementation for this NEP will add a way to select
 which instruction sets can be used at *runtime* via environment variables.
-(name TBD).
+(name TBD). This ability is critical for CI code verification.
 
 
 Diagnostics
 ```````````
@@ -211,8 +215,8 @@
 A proposed alternative in gh-13516_ is to implement loops for each CPU
 architecture separately by hand, without trying to abstract common patterns in
 the SIMD intrinsics (e.g., have `loops.avx512.c.src`, `loops.avx2.c.src`,
 `loops.sse.c.src`, `loops.vsx.c.src`, `loops.neon.c.src`, etc.). This is more
-similar to what PIXMAX does. There's a lot of duplication here though, and it
-the manual code duplication requires a champion who will be dedicated to
+similar to what PIXMAX does. There's a lot of duplication here though, and the
+manual code duplication requires a champion who will be dedicated to
 implementing and maintaining that platform's loop code.
 

From 7904b1a6e43d2dc2bbdab8e7fda03ad44bd60ca1 Mon Sep 17 00:00:00 2001
From: mattip
Date: Wed, 15 Jan 2020 09:06:27 +1100
Subject: [PATCH 3/3] NEP: rename nep

---
 ...XXX-SIMD-optimizations.rst => nep-0038-SIMD-optimizations.rst} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename doc/neps/{nep-XXXX-SIMD-optimizations.rst => nep-0038-SIMD-optimizations.rst} (100%)

diff --git a/doc/neps/nep-XXXX-SIMD-optimizations.rst b/doc/neps/nep-0038-SIMD-optimizations.rst
similarity index 100%
rename from doc/neps/nep-XXXX-SIMD-optimizations.rst
rename to doc/neps/nep-0038-SIMD-optimizations.rst