ENH: Allows building npy_math with static inlining #8754
Conversation
Thank you Andres! We are happy to contribute our optimizations back to the community! Please review and suggest how to proceed further @njsmith
numpy/core/setup.py (outdated)

```python
npymath_build_dir = join('build', 'src.%s-%s.%s' % (sysconfig.get_platform(), sys.version_info[0], sys.version_info[1]),
                        'numpy', 'core', 'src', 'npymath')
config.add_include_dirs(npymath_build_dir)
config.add_define_macros([("NPY_BUILD", "1")])  # this macro indicates that a Numpy build is in process
```
One can probably just use `HAVE_NPY_CONFIG_H` for this instead of adding a new one; it marks the internal build that includes the private configuration header. It might make sense to rename it to `NPY_INTERNAL_BUILD` to make its usage clearer.
Ah true. For the sake of those reading through the code, the latter looks like the better option.
numpy/core/setup.py (outdated)

```diff
@@ -605,6 +606,12 @@ def generate_api(ext, build_dir):
     config.add_include_dirs(join('src', 'umath'))
     config.add_include_dirs(join('src', 'npysort'))

+    # Need to add location of npy_math_internal.h, generated from src/npymath/npy_math_internal.h.src
+    npymath_build_dir = join('build', 'src.%s-%s.%s' % (sysconfig.get_platform(), sys.version_info[0], sys.version_info[1]),
```
This should not be necessary; we already create headers via templates which end up in the correct build directory, see `generate_multiarray_templated_sources`.
Ah, this is npymath, which might not have the correct setup to do so. Copying the method from the multiarray and umath modules may be nicer.
I'm glad you brought this up. How does distutils utilize these `generate_xyz_templated_sources`? If I grep for that function call, I only find a definition of it (although if you check out v1.9.2, it once was used in numpy/core/setup.py, like this at line 846):

```python
if not ENABLE_SEPARATE_COMPILATION:
    multiarray_deps.extend(multiarray_src)
    multiarray_src = [join('src', 'multiarray', 'multiarraymodule_onefile.c')]
    multiarray_src.append(generate_multiarray_templated_sources)
```
Interesting, these functions seem to be cruft from a removed compilation mode.
But explicitly setting the build directory should still not be necessary: npysort, a static utility library similar to npymath, has a generated header, `npy_partition.h.src`. Have you tried just adding the .src file to the sources list and nothing else?
Ah, now I remember how this worked. You have to add a function to the sources list of `add_extension`; that function receives the build directory as an argument, see `generate_numpyconfig_h`.
I guess one could add the include directory there as with the private header directory, add a new function just for the multiarray math include directory, or move the header to the private header location.
Probably the first option is simplest, albeit a bit ugly, as `generate_numpyconfig_h` technically has a different job ...
@juliantaylor Ok, good to know you would consider installing inlinable headers. So this discussion is probably somewhat misplaced, since it does not relate to the code above for adding the include directory.
I don't insist we work on installing inlinable headers right now and right here; I just want to clarify my understanding of your concerns. Let's assume we put the inlinable headers into the include/numpy/ directory. What I'm saying is that even if the configuration is not inferred correctly in a third-party library, including npy_math.h with an inlined implementation would not hurt, because it has fallback implementations which can be enabled even in the presence of compiler support, when [some] built-ins are mistakenly not detected. Is your concern that these fallback implementations are somewhat less correct than built-ins which might be available but not used?
And there is certainly no problem when they are detected correctly.
Should we open an issue/feature request to document this discussion?
The concern is, for example, we have:

```c
#if GCC_VERSION > 4.2
#define npy_copysign copysign
#endif
```

Now a compiler comes along that pretends to be gcc 4.2 but doesn't have the copysign builtin, and the system doesn't have one either: you get a compile error/undefined reference.
As compiler version checks are the only way to determine whether a builtin exists with gcc, this will always be fragile. (Of course, one could change the check to only inline with compilers that offer a good way to detect this.)
In practice this is probably not such a big deal, but there just wasn't a good reason to bother either. Except for some of the floating point property functions, all math functions are so slow that inlining didn't make much difference.
Now that vector math libraries are more common, it might be worth risking the unlikely case of odd configurations breaking.
I see. Yes, I would be the one who enables good optimization only on good platforms. That's enough for practical purposes. If a compiler reports its version incorrectly, that is a compiler bug; just do not support such a compiler. Otherwise, it is a bug in the condition, e.g. when only the compiler version is checked but the code really depends on the libc version rather than the compiler version.
I worked on a cross-platform project where all the configuration was done in a header config file. It is verbose and ugly, but portable, and certainly feasible even in the absence of features like `__has_include` or `__has_builtin`.
Anyway, thanks for the discussion; we'll open a separate issue on this.
It is not an uncommon "bug": clang reports itself as gcc 4.2 but does not support everything gcc does.
numpy is built on a lot of odd platforms and compilers, but I also prefer that these be sorted out after the mainstream platforms have the best code possible.
But clang also defines the `__llvm__` and `__clang__` macros. A similar story holds for ICC, which pretends to be the platform-default compiler, like GCC or MSVC. However, it is easy to distinguish both clang and ICC from GCC.
```diff
@@ -10,6 +10,15 @@
 /* need Python.h for npy_intp, npy_uintp */
 #include <Python.h>

+/* using static inline modifiers when defining npy_math functions
```
Our comment style is a newline after `/*` and before `*/`.
Force-pushed 3af45be to e436a72 (Compare)
Just pushed a commit that fixes the type mismatches between definitions and declarations of npy_math functions.
@juliantaylor I've pushed the commits that should allow this to work on all platforms. Is this ready to merge as is, or is there something else that needs to be done first? Thanks!
```diff
@@ -227,7 +228,7 @@ static void
         ptr_this += stride_d;
         ptr_that += stride_d;
     }
-    *(@typ@ *)data_out = @sqrt_func@(out);
+    *(@typ@ *)data_out = npy_@sqrt_func@(out);
```
this looks correct, but can you explain why this worked before?
I have tried to figure out why this has not caused appveyor to fail before, but have been unable to come to a conclusion. This code has successfully compiled using regular Visual Studio and Visual C++ for Python 2.7 (both on Windows 2012 R2 and my local Windows 8.1).
It makes no sense, because I can't find anything in my code that could possibly affect umath_tests.c, and it's not as if the npymath library affects it :(
I don't like the explicit build directory mention; putting the include directory into `generate_numpyconfig_h` should be good enough for now. The setup.py needs cleaning up anyway in the future.
After that, please squash the commits into one with the ENH: ... subject.
Force-pushed 0b1c050 to 21d2c7e (Compare)
Pasting the following URL for future reference. It will be useful when we have a discussion on potentially making npy_math_internal.h publicly available: #8754 (comment)
@juliantaylor I've squashed the changes and amended the commit message to include everything regarding this PR.
thanks
Are there many loops affected by this? Many of the functions in npy_math are also fallback functions for when they are not available in the system library or the system versions are buggy; in the long term, only a few numpy-specific functions should be needed. Possibly the complex functions would benefit more, or perhaps the ieee functions. Note that in a few cases the npy_math code has been pasted into the loops to the same effect. A bit more explanation might be helpful in […]
ISTR that one of the other reasons for the […]
@charris As of now, there aren't many loops that are affected by this, but the next PR I'll be making adds many of the unary functions to […]. I can create a separate PR that cleans up a few things and adds your suggestion in the […]
My concern is that you are probably going to end up using Intel SVML-like code? That is the same code used in glibc's libmvec for some functions, right? The functions in setup.py can go; it's on my todo list, but if you want to do the cleanup, go ahead.
@juliantaylor You're right about how Intel SVML can fail a ton of scipy tests, and some can be controlled with flags like -prec-sqrt, -fimf-precision=high, and -prec-div. Because of the different kinds of flags you could use for performance and precision, I thought it'd be better to give the user the ability to play around with the compiler flags, but I understand the concerns about the potential problems that could arise from giving the user this power. I know at one point you brought up using functions like np.vsqrt, but I don't think that makes for a nice out-of-the-box experience, and that's probably why you had suggested that a context manager would be the right approach. How would one go about doing that anyway? Would that mean we would have to compile two umath libraries, and then at runtime the user can change an attribute flag so that np.sqrt, for instance, would use the respective .so depending on np.core.umath.perf=True?
Basically yes; we compile multiple variants of the functions, and at runtime the user can choose what to use via some sane method. Or see also the AVX2 integer functions numpy uses; they are chosen at startup depending on the capabilities of the CPU in […]
I am wary of attaching too many knobs to numpy. Reproducibility is important, and at the moment the best that can be done is to document version and platform. Let's not make that any worse.
If we need to make a choice, I'd prefer accuracy over performance.
Jumping into this great discussion... Precision over performance is a reasonable choice. None of the commits proposed by Andres will degrade accuracy: all accuracy settings are 'high', i.e. within 1 ulp. Andres is also ensuring all tests pass with these changes. If we observe some unexpected test failure, I suggest we dig into the details case by case. Our intention is to keep the floating-point behavior the same. I would not compare against glibc's libmvec; it is very limited in functionality. Intel's SVML provides all needed accuracy choices, including 'high'. The proposed implementation will be a combination of SVML (for short numpy vectors) and VML (for long vectors). The motivation is as follows: SVML is the most performant on short vectors; when vectors are long enough, VML is preferable because it does automatic threading for performance on multiple cores. Behavior is nearly identical since both SVML and VML share the same code base. I am in agreement that the number of knobs controlling numpy behavior should be minimal. Thank you again for this helpful discussion.
Code Overview

Numpy currently decouples the math function definitions in `npy_math.c.src` from the function declarations found in `npy_math.h`. This patch allows the definitions to be included along with the inclusion of the `npy_math.h` header.

Motivation

Keeping the declarations and definitions separate is usually the right approach, but mathematical code like this might be better off as an exception to this common practice. Because the definitions are in the source file instead of the header, the compiler does not have any clue what lies underneath these math functions. This means the compiler can't make important optimizations like inlining and vectorization. Extensions that utilize these functions could greatly benefit from this, specifically `loops.c.src` from the umath extension.

This PR is the first step toward upstreaming optimizations implemented for the Intel(R) Distribution for Python. You can look into info/recipe/umath_optimizations.patch for the whole story when you install/download the Numpy package from, for instance, https://anaconda.org/intel/numpy/files

Another PR I will be proposing in the future uses function definitions instead of function pointers, since the compiler is unable to optimize code that calls through function pointers.

Implementation Details

This PR does the following:

+ Renames `npy_math.c.src` to `npy_math_internal.h.src`
+ Generates `npy_math_internal.h` from the template by adding it to the `npymath_sources` list and adding the `npymath` directory to the include paths in the `generate_numpyconfig_h` function of `numpy/core/setup.py`
+ Numpy's core distutils defines the `NPY_INTERNAL_BUILD` macro to make sure `npy_math_internal.h` is not included when other modules include the public header `npy_math.h`
  - We currently do not know how to ship headers generated from template files
+ Adds `npy_math.c`, a file that includes the `npy_math_internal.h.src` file (but does not add `static NPY_INLINE`)
  - This keeps the static npymath library as it exists now
+ Appends `npy_math_internal.h` to `numpy/npy_math.h` under the condition that it is not being included from npy_math.c.src
  - The conditional macros mean `loops.c.src` will have the definitions included, and the compiler will vectorize accordingly
+ Adds `static NPY_INLINE` to function declarations and definitions where necessary
+ Replaces `sqrtf` with `npy_sqrtf` in `numpy/core/src/umath/umath_tests.c` to make the function portable
  - `_sqrtf` was not found on certain Windows environments compiling with Py2

It successfully compiles and completes the full test suite for GCC and ICC.

Benchmarking Results

Initial benchmarking has shown that for functions that do not use function pointers (like DOUBLE_frexp or DOUBLE_floor_divide), these changes alone yield an 11-20% speedup.

Public vs Internal Only

In order to simplify the first iteration of this patch, we decided to limit its effect to Numpy only. `npy_math.h` is actually a public header, and other Python packages like Scipy might benefit the same way from inlining the math functions used through this header; this patch does not allow that, because `npy_math_internal.h` is not installed as a public header. One barrier is the question of how to put the result of template processing into public headers: there is no precedent in Numpy at this moment, thus your advice is needed.