
ENH: Allows building npy_math with static inlining #8754

Conversation

AndresGuzman-Ballen
Contributor

@AndresGuzman-Ballen AndresGuzman-Ballen commented Mar 7, 2017

Code Overview

Numpy currently decouples the math function definitions in npy_math.c.src from the function declarations found in npy_math.h. This patch allows those definitions to be pulled in when the npy_math.h header is included.

Motivation

Keeping declarations and definitions separate is usually the right approach, but mathematical code like this may be better off as an exception to that practice. Because the definitions live in the source file instead of the header, the compiler has no visibility into these math functions and therefore cannot apply important optimizations such as inlining and vectorization. Extensions that use these functions, specifically loops.c.src from the umath extension, could benefit greatly.

This PR is the first step toward upstreaming optimizations implemented for the Intel(R) Distribution for Python. For the whole story, see info/recipe/umath_optimizations.patch in the NumPy package available from, for instance, https://anaconda.org/intel/numpy/files

Another PR I will be proposing in the future uses direct function definitions instead of function pointers, since the compiler cannot optimize calls made through function pointers.

Implementation Details

This PR does the following:

  • Renames npy_math.c.src to npy_math_internal.h.src
  • Generates npy_math_internal.h from template by adding to npymath_sources list and adding npymath directory to include paths in generate_numpyconfig_h function of numpy/core/setup.py
  • Numpy’s core distutils defines the NPY_INTERNAL_BUILD macro to make sure npy_math_internal.h is not included when other modules include the public header npy_math.h
    • We currently do not know how to ship headers generated from template files
  • Adds npy_math.c, a file that includes the npy_math_internal.h.src file (but does not add NPY_INLINE static)
    • This is to keep the same static npy_math library as it exists now
  • Appends npy_math_internal.h to numpy/npy_math.h on the condition that it is not being included from npy_math.c.src
    • The conditional macros mean loops.c.src will have definitions included, and the compiler will vectorize accordingly
  • Adds NPY_INLINE static to function declarations and definitions when necessary
  • Replaces sqrtf with npy_sqrtf in numpy/core/src/umath/umath_tests.c to make the function portable
    • _sqrtf was not found on certain Windows environments compiling with Py2
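The conditional-inclusion scheme described in the bullets above can be sketched in a single C file (the macro name MY_INPLACE is a stand-in for the PR's actual NPY_INLINE static machinery; the exact guard logic in the PR may differ):

```c
/* Sketch of the scheme: the template (npy_math_internal.h.src in the PR)
 * defines each function with a configurable storage-class macro.
 * Internal consumers define it as "static inline" so the definitions can
 * be inlined at every call site; npy_math.c would leave it empty so real
 * symbols are still emitted into the static npymath library, exactly as
 * before. */
#ifndef MY_INPLACE
#define MY_INPLACE static inline
#endif

MY_INPLACE double my_deg2rad(double x)
{
    return x * (3.141592653589793 / 180.0);
}
```

A consumer such as loops.c.src would include this with `MY_INPLACE` left at its default and get inlinable definitions; the library translation unit would define `MY_INPLACE` as empty first.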

The branch compiles successfully and passes the full test suite with both GCC and ICC.

Benchmarking Results

Initial benchmarking has shown that for functions that do not use function pointers (like DOUBLE_frexp or DOUBLE_floor_divide), these changes alone yield an 11–20% speedup:

  • python -m timeit -s 'import numpy; a = numpy.empty(1000); a.fill(4.01)' 'numpy.frexp(a)'
  • python -m timeit -s 'import numpy; a = numpy.empty(1000); a.fill(4.01)' 'numpy.floor_divide(a, a*2.1)'

Public vs Internal Only

In order to simplify the first iteration of this patch, we decided to limit its effect to Numpy itself. npy_math.h is actually a public header, and other Python packages like Scipy might benefit in the same way from inlining the math functions used through it; this patch does not allow that, because npy_math_internal.h is not installed as a public header. One barrier is the question of how to put the result of template processing into public headers: there is no precedent in Numpy at the moment, so your advice is needed.

@anton-malakhov

Thank you Andres! We are happy to introduce our optimizations back to the community! Please review and suggest how to proceed further @njsmith

npymath_build_dir = join('build', 'src.%s-%s.%s' % (sysconfig.get_platform(), sys.version_info[0], sys.version_info[1]),
'numpy', 'core', 'src', 'npymath')
config.add_include_dirs(npymath_build_dir)
config.add_define_macros([("NPY_BUILD", "1")]) # this macro indicates that Numpy build is in process
Contributor

one can probably just use HAVE_NPY_CONFIG_H for this instead of adding a new macro; it marks the internal build so that the private configuration header is included.
It might make sense to rename it to NPY_INTERNAL_BUILD to make its usage clearer.

Contributor Author

Ah true. For the sake of those reading through the code, the latter looks like a better option.

@@ -605,6 +606,12 @@ def generate_api(ext, build_dir):
config.add_include_dirs(join('src', 'umath'))
config.add_include_dirs(join('src', 'npysort'))

# Need to add location of npy_math_internal.h, generated from src/npymath/npy_math_internal.h.src
npymath_build_dir = join('build', 'src.%s-%s.%s' % (sysconfig.get_platform(), sys.version_info[0], sys.version_info[1]),
Contributor

this should not be necessary, we already create headers via templates which get the correct build directory, see generate_multiarray_templated_sources

Contributor

ah this is npymath which might not have the correct setup to do so. Copying the method from multiarray and umath module may be nicer.

Contributor Author

I'm glad you brought this up. How does distutils use these generate_xyz_templated_sources functions? If I grep for that function call, I only find its definition (although if you check out v1.9.2, it was once used in numpy/core/setup.py around line 846, like this):

if not ENABLE_SEPARATE_COMPILATION:
    multiarray_deps.extend(multiarray_src)
    multiarray_src = [join('src', 'multiarray', 'multiarraymodule_onefile.c')]
    multiarray_src.append(generate_multiarray_templated_sources)

Contributor

interesting, these functions seem to be cruft from a removed compilation mode.

But still, explicitly setting the build directory should not be necessary; npysort, a static utility library similar to npymath, has a generated header, npy_partition.h.src. Have you tried just adding the .src file to the sources list and nothing else?

Contributor

ah, now I remember how this worked. You have to add a function to the sources list of add_extension; that function receives the build directory as an argument, see generate_numpyconfig_h.

I guess one could add the include directory there as with the private header directory, add a new function just for the multiarray math include directory, or move the header to the private header location.
Probably the first option is simplest, albeit a bit ugly, as generate_numpyconfig_h technically has a different job ...


@juliantaylor Ok, good to know you would consider installing inlinable headers. So this discussion is probably somewhat misplaced, since it does not relate to the code above for adding the include directory.

I don't insist we should work on installing inlinable headers right now and right here; I just want to clarify my understanding of your concerns. Let's assume we put inlinable headers into the include/numpy/ directory. What I'm saying is that even if the configuration is not inferred correctly in a third-party library, including npy_math.h with the inlined implementation would not hurt, because it has fallback implementations which can be enabled even in the presence of compiler support, when [some] built-ins are mistakenly not detected. Is your concern that these fallback implementations are somewhat less correct than built-ins which might be available but not used? And there is certainly no problem when they are detected correctly.

Should we open an issue/feature request to document this discussion?

Contributor

the concern is for example we have:

#if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ > 2))
#define npy_copysign copysign
#endif

Now suppose a compiler comes along that pretends to be gcc > 4.2 but doesn't have the copysign builtin, and the system doesn't provide one either: you get a compile error or an undefined reference.
As compiler version checks are the only way to determine whether a builtin exists with gcc, this will always be fragile. (Of course one could change the check to only inline with compilers that offer a good way to detect this.)

In practice this is probably not such a big deal, but there also just wasn't a good reason to bother. Except for some of the floating-point property functions, all the math functions are so slow that inlining didn't make much difference.
Now that vector math libraries are more common, it might be worth risking the unlikely case of odd configurations breaking.
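One way to make such checks less fragile is to prefer direct feature detection over version sniffing where the compiler supports it (a sketch; `__has_builtin` exists in clang and in recent GCC, and the `my_copysign`/`checked_copysign` names are hypothetical):

```c
#include <math.h>

/* Prefer direct feature detection over compiler-version sniffing where
 * the compiler supports it; otherwise fall back to the libm function. */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_copysign)
#    define my_copysign __builtin_copysign
#  endif
#endif

#ifndef my_copysign
#  define my_copysign copysign  /* portable libm fallback */
#endif

double checked_copysign(double x, double y)
{
    return my_copysign(x, y);
}
```

A compiler that merely pretends to be a newer GCC without the builtin simply takes the fallback path instead of producing an undefined reference.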


I see. Yes, I would be the one enabling the good optimization only on known-good platforms. That is enough for practical purposes. If a compiler reports its version incorrectly, that is a compiler bug; just do not support such a compiler. Otherwise it is a bug in the condition, e.g. when only the compiler version is checked but the code really depends on the libc version rather than the compiler version.
I worked on a cross-platform project where all the configuration was done in a header config file. It is verbose and ugly, but portable, and certainly feasible even in the absence of features like __has_include or __has_builtin.

Anyway, thanks for the discussion, we'll open a separate issue on this.

Contributor

It is not an uncommon "bug": clang reports itself as gcc 4.2 but does not support everything gcc does.
numpy is built on a lot of odd platforms and compilers, but I also prefer that these be sorted out after the mainstream platforms have the best code possible.


But clang also defines the __llvm__ and __clang__ macros. A similar story holds for ICC, which presents itself as the platform-default compiler such as GCC or MSVC. However, it is easy to distinguish both clang and ICC from GCC.
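Such a vendor check might look roughly like this (a sketch; the macros are the standard vendor-identification ones, and the key point is to test the vendor-specific macros before `__GNUC__`, which clang and ICC also define for compatibility):

```c
/* Identify the compiler vendor, testing vendor-specific macros first
 * because clang and ICC both define __GNUC__ for compatibility. */
const char *compiler_vendor(void)
{
#if defined(__INTEL_COMPILER)
    return "icc";
#elif defined(__clang__)
    return "clang";
#elif defined(__GNUC__)
    return "gcc";     /* only reached by genuine GCC */
#elif defined(_MSC_VER)
    return "msvc";
#else
    return "unknown";
#endif
}
```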

@@ -10,6 +10,15 @@
/* need Python.h for npy_intp, npy_uintp */
#include <Python.h>

/* using static inline modifiers when defining npy_math functions
Contributor

our comment style is newline after /* and before */

@AndresGuzman-Ballen
Contributor Author

Just pushed a commit that fixes the type mismatches between definitions and declarations of npy_math functions.

@AndresGuzman-Ballen
Contributor Author

@juliantaylor I've pushed the commits that would allow this to work on all the platforms. Is this ready to merge as is or is there something else that needs to be done first? Thanks!

@@ -227,7 +228,7 @@ static void
ptr_this += stride_d;
ptr_that += stride_d;
}
*(@typ@ *)data_out = @sqrt_func@(out);
*(@typ@ *)data_out = npy_@sqrt_func@(out);
Contributor

this looks correct, but can you explain why this worked before?

Contributor Author

I have tried to figure out why this has not caused appveyor to fail before, but have been unable to come to a conclusion. This code compiled successfully using regular Visual Studio and Visual C++ for Python 2.7 (both on Windows 2012 R2 and my local Windows 8.1).

It makes no sense, because I can't find anything in my code that could possibly affect umath_tests.c, and it's not as if the npymath library is affecting this :(

@juliantaylor
Contributor

juliantaylor commented Mar 10, 2017

I don't like the explicit build directory mention; putting the include directory into `generate_numpyconfig_h` should be good enough for now, the setup.py needs cleaning up anyway in future.

@juliantaylor
Contributor

After that please squash the commits into one with the ENH: ... subject.

Code Overview:
    Numpy currently decouples the math function definitions in `npy_math.c.src`
    from the function declarations found in `npy_math.h`. This patch allows
    definitions to be included along with the inclusion of the `npy_math.h`
    header.

    Keeping the declarations and definitions separate is usually the right
    approach, but mathematical code like this might be better off as an
    exception to this common practice. Because the definitions are in the source
    file instead of the header, the compiler does not have any clue what lies
    underneath these math functions. This means the compiler can't make
    important optimizations like inlining and vectorization. Extensions that
    utilize these functions could greatly benefit from this, specifically
    `loops.c.src` from the umath extension.

Implementation Details:
    + Renames `npy_math.c.src` to `npy_math_internal.h.src`
    + Generates `npy_math_internal.h` from template by adding to
      `npymath_sources` list and adding `npymath` directory to include paths in
      `generate_numpyconfig_h` function of `numpy/core/setup.py`
    + Numpy's core distutils defines `#NPY_INTERNAL_BUILD` macro to make sure
      `npy_math_internal.h` is not included when other modules try to include
      public header `npy_math.h`
        - Currently do not know how to ship headers generated from template
          files
    + Adds `npy_math.c`, a file that includes the `npy_math_internal.h.src`
      file (but does not add NPY_INLINE static)
        - This is to keep the same static npy_math library as it exists now
    + Appends `numpy/npy_math.h` with `npy_math_internal.h` under condition that
      it's not being included in npy_math.c.src
        - The conditional macros mean `loops.c.src` will have definitions
          included, and the compiler will vectorize accordingly
    + Adds `NPY_INLINE` static to function declarations and definitions when
      necessary
    + Replaces `sqrtf` with `npy_sqrtf` in `numpy/core/src/umath/umath_tests.c`
      to make function portable
        - `_sqrtf` was not found on certain Windows environments compiling with
          Py2
@AndresGuzman-Ballen
Contributor Author

Pasting the following URL for future reference. It will be useful when we have a discussion on potentially making npy_math_internal.h publicly available: #8754 (comment)

@AndresGuzman-Ballen
Contributor Author

@juliantaylor I've squashed the changes and amended the commit message to include everything regarding this PR

@juliantaylor
Contributor

thanks

@juliantaylor juliantaylor merged commit 9bba99d into numpy:master Mar 10, 2017
@charris
Member

charris commented Mar 11, 2017

Are there many loops affected by this? Many of the functions in npy_math are also fallbacks for when they are not available in the system library or the system versions are buggy; in the long term, only a few numpy-specific functions should be needed. Possibly the complex functions would benefit more, or perhaps the ieee functions. Note that in a few cases the npy_math code has been pasted into the loops to the same effect.

A bit more explanation might be helpful in npy_math.c to the effect that the inline modifier is removed so that actual code is generated.

@charris
Member

charris commented Mar 11, 2017

ISTR that one of the other reasons for the npy_ prefix added to the functions was to avoid compiler generated inline functions/instructions when we needed a function pointer. That was particularly a problem on Windows. Probably not a problem here, but something to bear in mind.

@AndresGuzman-Ballen
Contributor Author

@charris As of now, there aren't many loops affected by this, but the next PR I'll be making adds many of the unary functions to loops.c.src to take advantage of the optimizations the compiler can make when it uses explicit function definitions instead of function pointers. The npy_math_complex source code could benefit from this as well, but I have not had the chance to work on it yet. The same goes for the ieee functions, which is why I have not made these modifications to the two functions defined in ieee754.c.src (https://github.com/numpy/numpy/blob/master/numpy/core/include/numpy/npy_math.h#L159).

I can create a separate PR that cleans up a few things and adds your suggested comment to npy_math.c. There are already a few things in numpy/core/setup.py that I noticed need to be removed since they are no longer used anyway (like the generate_umath_templated_sources function here: https://github.com/numpy/numpy/blob/master/numpy/core/setup.py#L862 or the generate_multiarray_templated_sources function here: https://github.com/numpy/numpy/blob/master/numpy/core/setup.py#L707). @juliantaylor, would this be the right approach? Was there a reason these functions were left in?

@juliantaylor
Contributor

my concern is that you are probably going to end up using Intel SVML-like code? That is the same code used in glibc's libmvec for some functions, right?
I played around with those and they made a whole bunch of scipy tests fail due to too-low accuracy. I think svml has a bunch of different accuracy settings; do you know which one was used in glibc?
For the code, loops.c.src has LOOP_FAST macros you can use to get compiler-friendly code for the new functions. I assume you are planning to add more special ufuncs, as was already done for sqrt.

the functions in setup.py can go; it's on my todo list, but if you want to do the cleanup go ahead.

@AndresGuzman-Ballen
Contributor Author

@juliantaylor you're right that Intel SVML can fail a ton of scipy tests, and some of that can be controlled with flags like -prec-sqrt, -fimf-precision=high, and -prec-div. Because of the different flags one could use for performance versus precision, I thought it would be better to let the user play with the compiler flags, but I understand the concerns about the problems that could arise from giving the user that power. I know at one point you brought up functions like np.vsqrt, but I don't think that makes for a nice out-of-the-box experience, which is probably why you suggested a context manager would be the right approach.

How would one go about doing that anyway? Would that mean we would have to compile two umath libraries, and at runtime the user could change an attribute flag so that np.sqrt, for instance, would use the respective .so depending on np.core.umath.perf=True?

@juliantaylor
Contributor

juliantaylor commented Mar 14, 2017

basically yes, we compile multiple variants of the functions and at runtime the user can choose what to use via some sane method.
ufuncs can be changed at runtime already, see for example this project that already allows using MKL from numpy:
https://github.com/geggo/uvml

Or see also the AVX2 integer functions numpy uses; they are chosen at startup depending on the capabilities of the CPU in ./numpy/core/code_generators/generate_umath.py (see 37740eb).
They use the gcc target attributes for ease of implementation, as using per-file compiler flags is non-trivial with the poor python distutils.
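The startup-time variant selection described above can be sketched as a function pointer bound once after a CPU-capability probe (a hypothetical sketch; the probe and both loop variants are stand-ins, and NumPy's real mechanism lives in generate_umath.py and the ufunc machinery):

```c
/* Two hypothetical compiled variants of the same inner loop: a baseline
 * and an "AVX2" variant (which would use intrinsics in reality). */
static double add_baseline(double a, double b) { return a + b; }
static double add_avx2(double a, double b)     { return a + b; }

/* The dispatch pointer is bound once, after probing the CPU. */
static double (*add_loop)(double, double) = 0;

static int cpu_has_avx2(void)
{
    return 0;  /* stand-in for a real CPUID probe */
}

double dispatched_add(double a, double b)
{
    if (!add_loop)  /* bind once, at first use or module startup */
        add_loop = cpu_has_avx2() ? add_avx2 : add_baseline;
    return add_loop(a, b);
}
```

The same pattern generalizes to user-selectable variants: the "sane method" mentioned above would simply rebind the pointer at runtime.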

@charris
Member

charris commented Mar 15, 2017

I am wary of attaching too many knobs to numpy. Reproducibility is important, and at the moment the best that can be done is to document version and platform. Let's not make that any worse.

@charris
Member

charris commented Mar 15, 2017

If we need to make a choice, I'd prefer accuracy over performance.

@ghost

ghost commented Mar 20, 2017

Jumping in this great discussion...

Precision over performance is a reasonable choice. None of the commits proposed by Andres will degrade accuracy: all accuracy settings are 'high', i.e. within 1 ulp. Andres is also ensuring all tests pass with these changes. If we observe an unexpected test failure, I suggest we dig into the details case by case. Our intention is to keep the floating-point behavior the same.

I would not compare against glibc's libmvec -- it is very limited in functionality. Intel's SVML provides all needed accuracy choices, including 'high'.

The proposed implementation will be a combination of SVML (for short numpy vectors) and VML (for long vectors). The motivation is as follows: SVML is the most performant on short vectors; when vectors are long enough, VML is preferable because it does automatic threading for performance on multiple cores. Behavior is nearly identical, since both SVML and VML share the same code base.

I am in agreement that the number of knobs controlling numpy behavior should be minimal.

Thank you again for this helpful discussion,
Sergey
