
ENH: Allows building npy_math with static inlining #8754

Conversation

AndresGuzman-Ballen
Contributor

@AndresGuzman-Ballen AndresGuzman-Ballen commented Mar 7, 2017

Code Overview

Numpy currently decouples the math function definitions in npy_math.c.src from the function declarations found in npy_math.h. This patch allows those definitions to be pulled in when the npy_math.h header is included.

Motivation

Keeping declarations and definitions separate is usually the right approach, but mathematical code like this may be better off as an exception to that practice. Because the definitions live in the source file instead of the header, the compiler has no visibility into these math functions and therefore cannot apply important optimizations such as inlining and vectorization. Extensions that use these functions, specifically loops.c.src from the umath extension, could benefit greatly.

This PR is the first step toward upstreaming optimizations implemented for the Intel(R) Distribution for Python. For the whole story, see info/recipe/umath_optimizations.patch in the NumPy package available from, for instance, https://anaconda.org/intel/numpy/files

Another PR I will be proposing in the future uses direct function definitions instead of function pointers, since the compiler cannot optimize calls made through function pointers.

Implementation Details

This PR does the following:

  • Renames npy_math.c.src to npy_math_internal.h.src
  • Generates npy_math_internal.h from template by adding to npymath_sources list and adding npymath directory to include paths in generate_numpyconfig_h function of numpy/core/setup.py
  • Numpy’s core distutils defines the NPY_INTERNAL_BUILD macro to make sure npy_math_internal.h is not included when other modules include the public header npy_math.h
    • We currently do not know how to ship headers generated from template files
  • Adds npy_math.c, a file that includes the npy_math_internal.h.src file (but does not add NPY_INLINE static)
    • This is to keep the same static npy_math library as it exists now
  • Appends npy_math_internal.h to numpy/npy_math.h on the condition that it is not being included from npy_math.c.src
    • The conditional macros mean loops.c.src will have definitions included, and the compiler will vectorize accordingly
  • Adds NPY_INLINE static to function declarations and definitions when necessary
  • Replaces sqrtf with npy_sqrtf in numpy/core/src/umath/umath_tests.c to make the function portable
    • _sqrtf was not found on certain Windows environments compiling with Py2
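The conditional-inclusion scheme described in the bullets above can be sketched in a single C file (the macro name MY_INPLACE is a stand-in for the PR's actual NPY_INLINE static machinery; the exact guard logic in the PR may differ):

```c
/* Sketch of the scheme: the template (npy_math_internal.h.src in the PR)
 * defines each function with a configurable storage-class macro.
 * Internal consumers define it as "static inline" so the definitions can
 * be inlined at every call site; npy_math.c would leave it empty so real
 * symbols are still emitted into the static npymath library, exactly as
 * before. */
#ifndef MY_INPLACE
#define MY_INPLACE static inline
#endif

MY_INPLACE double my_deg2rad(double x)
{
    return x * (3.141592653589793 / 180.0);
}
```

A consumer such as loops.c.src would include this with `MY_INPLACE` left at its default and get inlinable definitions; the library translation unit would define `MY_INPLACE` as empty first.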

The branch compiles successfully and passes the full test suite with both GCC and ICC.

Benchmarking Results

Initial benchmarking has shown that for functions that do not use function pointers (like DOUBLE_frexp or DOUBLE_floor_divide), these changes alone yield an 11–20% speedup:

  • python -m timeit -s 'import numpy; a = numpy.empty(1000); a.fill(4.01)' 'numpy.frexp(a)'
  • python -m timeit -s 'import numpy; a = numpy.empty(1000); a.fill(4.01)' 'numpy.floor_divide(a, a*2.1)'

Public vs Internal Only

In order to simplify the first iteration of this patch, we decided to limit its effect to Numpy itself. npy_math.h is actually a public header, and other Python packages like Scipy might benefit in the same way from inlining the math functions used through it; this patch does not allow that, because npy_math_internal.h is not installed as a public header. One barrier is the question of how to put the result of template processing into public headers: there is no precedent in Numpy at the moment, so your advice is needed.

@anton-malakhov

Thank you Andres! We are happy to introduce our optimizations back to the community! Please review and suggest how to proceed further @njsmith

npymath_build_dir = join('build', 'src.%s-%s.%s' % (sysconfig.get_platform(), sys.version_info[0], sys.version_info[1]),
'numpy', 'core', 'src', 'npymath')
config.add_include_dirs(npymath_build_dir)
config.add_define_macros([("NPY_BUILD", "1")]) # this macro indicates that Numpy build is in process
Contributor

one can probably just use HAVE_NPY_CONFIG_H for this instead of adding a new macro; it marks the internal build so that the private configuration header is included.
It might make sense to rename it to NPY_INTERNAL_BUILD to make its usage clearer.

Contributor Author

Ah true. For the sake of those reading through the code, the latter looks like a better option.

@@ -605,6 +606,12 @@ def generate_api(ext, build_dir):
config.add_include_dirs(join('src', 'umath'))
config.add_include_dirs(join('src', 'npysort'))

# Need to add location of npy_math_internal.h, generated from src/npymath/npy_math_internal.h.src
npymath_build_dir = join('build', 'src.%s-%s.%s' % (sysconfig.get_platform(), sys.version_info[0], sys.version_info[1]),
Contributor

this should not be necessary, we already create headers via templates which get the correct build directory, see generate_multiarray_templated_sources

Contributor

ah this is npymath which might not have the correct setup to do so. Copying the method from multiarray and umath module may be nicer.

Contributor Author

I'm glad you brought this up. How does distutils use these generate_xyz_templated_sources functions? If I grep for that function call, I only find its definition (although if you check out v1.9.2, it was once used in numpy/core/setup.py around line 846, like this):

if not ENABLE_SEPARATE_COMPILATION:
    multiarray_deps.extend(multiarray_src)
    multiarray_src = [join('src', 'multiarray', 'multiarraymodule_onefile.c')]
    multiarray_src.append(generate_multiarray_templated_sources)

Contributor

interesting, these functions seem to be cruft from a removed compilation mode.

But still, explicitly setting the build directory should not be necessary; npysort, a static utility library similar to npymath, has a generated header, npy_partition.h.src. Have you tried just adding the .src file to the sources list and nothing else?

Contributor

ah, now I remember how this worked. You have to add a function to the sources list of add_extension; that function receives the build directory as an argument, see generate_numpyconfig_h.

I guess one could add the include directory there as with the private header directory, add a new function just for the multiarray math include directory, or move the header to the private header location.
Probably the first option is simplest, albeit a bit ugly, as generate_numpyconfig_h technically has a different job ...


@juliantaylor Ok, good to know you would consider installing inlinable headers. So this discussion is probably somewhat misplaced, since it does not relate to the code above for adding the include directory.

I don't insist we should work on installing inlinable headers right now and right here; I just want to clarify my understanding of your concerns. Let's assume we put inlinable headers into the include/numpy/ directory. What I'm saying is that even if the configuration is not inferred correctly in a third-party library, including npy_math.h with the inlined implementation would not hurt, because it has fallback implementations which can be enabled even in the presence of compiler support, when [some] built-ins are mistakenly not detected. Is your concern that these fallback implementations are somewhat less correct than built-ins which might be available but not used? And there is certainly no problem when they are detected correctly.

Should we open an issue/feature request to document this discussion?

Contributor

the concern is for example we have:

#if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ > 2))
#define npy_copysign copysign
#endif

Now suppose a compiler comes along that pretends to be gcc > 4.2 but doesn't have the copysign builtin, and the system doesn't provide one either: you get a compile error or an undefined reference.
As compiler version checks are the only way to determine whether a builtin exists with gcc, this will always be fragile. (Of course one could change the check to only inline with compilers that offer a good way to detect this.)

In practice this is probably not such a big deal, but there also just wasn't a good reason to bother. Except for some of the floating-point property functions, all the math functions are so slow that inlining didn't make much difference.
Now that vector math libraries are more common, it might be worth risking the unlikely case of odd configurations breaking.
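One way to make such checks less fragile is to prefer direct feature detection over version sniffing where the compiler supports it (a sketch; `__has_builtin` exists in clang and in recent GCC, and the `my_copysign`/`checked_copysign` names are hypothetical):

```c
#include <math.h>

/* Prefer direct feature detection over compiler-version sniffing where
 * the compiler supports it; otherwise fall back to the libm function. */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_copysign)
#    define my_copysign __builtin_copysign
#  endif
#endif

#ifndef my_copysign
#  define my_copysign copysign  /* portable libm fallback */
#endif

double checked_copysign(double x, double y)
{
    return my_copysign(x, y);
}
```

A compiler that merely pretends to be a newer GCC without the builtin simply takes the fallback path instead of producing an undefined reference.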


I see. Yes, I would be the one enabling the good optimization only on known-good platforms. That is enough for practical purposes. If a compiler reports its version incorrectly, that is a compiler bug; just do not support such a compiler. Otherwise it is a bug in the condition, e.g. when only the compiler version is checked but the code really depends on the libc version rather than the compiler version.
I worked on a cross-platform project where all the configuration was done in a header config file. It is verbose and ugly, but portable, and certainly feasible even in the absence of features like __has_include or __has_builtin.

Anyway, thanks for the discussion, we'll open a separate issue on this.

Contributor

It is not an uncommon "bug": clang reports itself as gcc 4.2 but does not support everything gcc does.
numpy is built on a lot of odd platforms and compilers, but I also prefer that these be sorted out after the mainstream platforms have the best code possible.


But clang also defines the __llvm__ and __clang__ macros. A similar story holds for ICC, which presents itself as the platform-default compiler such as GCC or MSVC. However, it is easy to distinguish both clang and ICC from GCC.
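Such a vendor check might look roughly like this (a sketch; the macros are the standard vendor-identification ones, and the key point is to test the vendor-specific macros before `__GNUC__`, which clang and ICC also define for compatibility):

```c
/* Identify the compiler vendor, testing vendor-specific macros first
 * because clang and ICC both define __GNUC__ for compatibility. */
const char *compiler_vendor(void)
{
#if defined(__INTEL_COMPILER)
    return "icc";
#elif defined(__clang__)
    return "clang";
#elif defined(__GNUC__)
    return "gcc";     /* only reached by genuine GCC */
#elif defined(_MSC_VER)
    return "msvc";
#else
    return "unknown";
#endif
}
```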

@@ -10,6 +10,15 @@
/* need Python.h for npy_intp, npy_uintp */
#include <Python.h>

/* using static inline modifiers when defining npy_math functions
Contributor

our comment style is newline after /* and before */

@AndresGuzman-Ballen
Contributor Author

Just pushed a commit that fixes the type mismatches between definitions and declarations of npy_math functions.

@AndresGuzman-Ballen
Contributor Author

@juliantaylor I've pushed the commits that would allow this to work on all the platforms. Is this ready to merge as is or is there something else that needs to be done first? Thanks!

@@ -227,7 +228,7 @@ static void
ptr_this += stride_d;
ptr_that += stride_d;
}
*(@typ@ *)data_out = @sqrt_func@(out);
*(@typ@ *)data_out = npy_@sqrt_func@(out);
Contributor

this looks correct, but can you explain why this worked before?

Contributor Author

I have tried to figure out why this has not caused appveyor to fail before, but have been unable to come to a conclusion. This code compiled successfully using regular Visual Studio and Visual C++ for Python 2.7 (both on Windows 2012 R2 and my local Windows 8.1).

It makes no sense, because I can't find anything in my code that could possibly affect umath_tests.c, and it's not as if the npymath library is affecting this :(

@juliantaylor
Contributor

juliantaylor commented Mar 10, 2017

I don't like the explicit build directory mention; putting the include directory into `generate_numpyconfig_h` should be good enough for now, the setup.py needs cleaning up anyway in future.

@juliantaylor
Contributor

After that please squash the commits into one with the ENH: ... subject.

Code Overview:
    Numpy currently decouples the math function definitions in `npy_math.c.src`
    from the function declarations found in `npy_math.h`. This patch allows
    definitions to be included along with the inclusion of the `npy_math.h`
    header.

    Keeping the declarations and definitions separate is usually the right
    approach, but mathematical code like this might be better off as an
    exception to this common practice. Because the definitions are in the source
    file instead of the header, the compiler does not have any clue what lies
    underneath these math functions. This means the compiler can't make
    important optimizations like inlining and vectorization. Extensions that
    utilize these functions could greatly benefit from this, specifically
    `loops.c.src` from the umath extension.

Implementation Details:
    + Renames `npy_math.c.src` to `npy_math_internal.h.src`
    + Generates `npy_math_internal.h` from template by adding to
      `npymath_sources` list and adding `npymath` directory to include paths in
      `generate_numpyconfig_h` function of `numpy/core/setup.py`
    + Numpy's core distutils defines `#NPY_INTERNAL_BUILD` macro to make sure
      `npy_math_internal.h` is not included when other modules try to include
      public header `npy_math.h`
        - Currently do not know how to ship headers generated from template
          files
    + Adds `npy_math.c`, a file that includes the `npy_math_internal.h.src`
      file (but does not add NPY_INLINE static)
        - This is to keep the same static npy_math library as it exists now
    + Appends `numpy/npy_math.h` with `npy_math_internal.h` under condition that
      it's not being included in npy_math.c.src
        - The conditional macros mean `loops.c.src` will have definitions
          included, and the compiler will vectorize accordingly
    + Adds `NPY_INLINE` static to function declarations and definitions when
      necessary
    + Replaces `sqrtf` with `npy_sqrtf` in `numpy/core/src/umath/umath_tests.c`
      to make function portable
        - `_sqrtf` was not found on certain Windows environments compiling with
          Py2
@AndresGuzman-Ballen
Contributor Author

Pasting the following URL for future reference. It will be useful when we have a discussion on potentially making npy_math_internal.h publicly available: #8754 (comment)

@AndresGuzman-Ballen
Contributor Author

@juliantaylor I've squashed the changes and amended the commit message to include everything regarding this PR

@juliantaylor
Contributor

thanks

@juliantaylor juliantaylor merged commit 9bba99d into numpy:master Mar 10, 2017
@charris
Member

charris commented Mar 11, 2017

Are there many loops affected by this? Many of the functions in npy_math are also fallbacks for when they are not available in the system library or the system versions are buggy; in the long term, only a few numpy-specific functions should be needed. Possibly the complex functions would benefit more, or perhaps the ieee functions. Note that in a few cases the npy_math code has been pasted into the loops to the same effect.

A bit more explanation might be helpful in npy_math.c to the effect that the inline modifier is removed so that actual code is generated.

@charris
Member

charris commented Mar 11, 2017

ISTR that one of the other reasons for the npy_ prefix added to the functions was to avoid compiler generated inline functions/instructions when we needed a function pointer. That was particularly a problem on Windows. Probably not a problem here, but something to bear in mind.

@AndresGuzman-Ballen
Contributor Author

@charris As of now, there aren't many loops affected by this, but the next PR I'll be making adds many of the unary functions to loops.c.src to take advantage of the optimizations the compiler can make when it uses explicit function definitions instead of function pointers. The npy_math_complex source code could benefit from this as well, but I have not had the chance to work on it yet. The same goes for the ieee functions, which is why I have not made these modifications to the two functions defined in ieee754.c.src (https://github.com/numpy/numpy/blob/master/numpy/core/include/numpy/npy_math.h#L159).

I can create a separate PR that cleans up a few things and adds your suggested comment to npy_math.c. There are already a few things in numpy/core/setup.py that I noticed need to be removed since they are no longer used anyway (like the generate_umath_templated_sources function here: https://github.com/numpy/numpy/blob/master/numpy/core/setup.py#L862 or the generate_multiarray_templated_sources function here: https://github.com/numpy/numpy/blob/master/numpy/core/setup.py#L707). @juliantaylor, would this be the right approach? Was there a reason these functions were left in?

@juliantaylor
Contributor

my concern is that you are probably going to end up using Intel SVML-like code? That is the same code used in glibc's libmvec for some functions, right?
I played around with those and they made a whole bunch of scipy tests fail due to too-low accuracy. I think svml has a bunch of different accuracy settings; do you know which one was used in glibc?
For the code, loops.c.src has LOOP_FAST macros you can use to get compiler-friendly code for the new functions. I assume you are planning to add more special ufuncs, as was already done for sqrt.

the functions in setup.py can go; it's on my todo list, but if you want to do the cleanup go ahead.

@AndresGuzman-Ballen
Contributor Author

@juliantaylor you're right that Intel SVML can fail a ton of scipy tests, and some of that can be controlled with flags like -prec-sqrt, -fimf-precision=high, and -prec-div. Because of the different flags one could use for performance versus precision, I thought it would be better to let the user play with the compiler flags, but I understand the concerns about the problems that could arise from giving the user that power. I know at one point you brought up functions like np.vsqrt, but I don't think that makes for a nice out-of-the-box experience, which is probably why you suggested a context manager would be the right approach.

How would one go about doing that anyway? Would that mean we would have to compile two umath libraries, and at runtime the user could change an attribute flag so that np.sqrt, for instance, would use the respective .so depending on np.core.umath.perf=True?

@juliantaylor
Contributor

juliantaylor commented Mar 14, 2017

basically yes, we compile multiple variants of the functions and at runtime the user can choose what to use via some sane method.
ufuncs can be changed at runtime already, see for example this project that already allows using MKL from numpy:
https://github.com/geggo/uvml

Or see also the AVX2 integer functions numpy uses; they are chosen at startup depending on the capabilities of the CPU in ./numpy/core/code_generators/generate_umath.py (see 37740eb).
They use the gcc target attributes for ease of implementation, as using per-file compiler flags is non-trivial with the poor python distutils.
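The startup-time variant selection described above can be sketched as a function pointer bound once after a CPU-capability probe (a hypothetical sketch; the probe and both loop variants are stand-ins, and NumPy's real mechanism lives in generate_umath.py and the ufunc machinery):

```c
/* Two hypothetical compiled variants of the same inner loop: a baseline
 * and an "AVX2" variant (which would use intrinsics in reality). */
static double add_baseline(double a, double b) { return a + b; }
static double add_avx2(double a, double b)     { return a + b; }

/* The dispatch pointer is bound once, after probing the CPU. */
static double (*add_loop)(double, double) = 0;

static int cpu_has_avx2(void)
{
    return 0;  /* stand-in for a real CPUID probe */
}

double dispatched_add(double a, double b)
{
    if (!add_loop)  /* bind once, at first use or module startup */
        add_loop = cpu_has_avx2() ? add_avx2 : add_baseline;
    return add_loop(a, b);
}
```

The same pattern generalizes to user-selectable variants: the "sane method" mentioned above would simply rebind the pointer at runtime.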

@charris
Member

charris commented Mar 15, 2017

I am wary of attaching too many knobs to numpy. Reproducibility is important, and at the moment the best that can be done is to document version and platform. Let's not make that any worse.

@charris
Member

charris commented Mar 15, 2017

If we need to make a choice, I'd prefer accuracy over performance.

@ghost

ghost commented Mar 20, 2017

Jumping in this great discussion...

Precision over performance is a reasonable choice. None of the commits proposed by Andres will degrade accuracy: all accuracy settings are 'high', i.e. within 1 ulp. Andres is also ensuring all tests pass with these changes. If we observe an unexpected test failure, I suggest we dig into the details case by case. Our intention is to keep the floating-point behavior the same.

I would not compare against glibc's libmvec -- it is very limited in functionality. Intel's SVML provides all needed accuracy choices, including 'high'.

The proposed implementation will be a combination of SVML (for short numpy vectors) and VML (for long vectors). The motivation is as follows: SVML is the most performant on short vectors; when vectors are long enough, VML is preferable because it does automatic threading for performance on multiple cores. Behavior is nearly identical, since both SVML and VML share the same code base.

I am in agreement that the number of knobs controlling numpy behavior should be minimal.

Thank you again for this helpful discussion,
Sergey
