
ENH: avoid temporary arrays in expressions (again) #7997

Merged
merged 3 commits into numpy:master from juliantaylor:temporary-elide Feb 25, 2017

@juliantaylor (Contributor) commented Aug 30, 2016

Temporary arrays generated in expressions are expensive as they imply
extra memory bandwidth, which is the bottleneck in most numpy operations.
For example:

 r = -a + b * c**2

One can avoid temporaries when the reference count of one of the
operands is one, as it cannot be referenced by any other python code
(an rvalue in C++ terms).
Python itself uses this method to optimize string concatenation.
The tricky part is that via the C-API one can call the PyNumber_
functions directly and skip increasing the reference count when it is
not needed.
The python stack cannot be used to determine this as python does not
export the stack size, only the current position.
To avoid the problem we collect the backtrace of the call and, if it
consists exclusively of functions inside python or the c-library, we can
assume no other library is using the C-API in between.

One issue is that the reliability of backtraces is unknown on non-GNU
platforms. On GNU and amd64 it should be reliable enough, even without frame
pointers, as glibc will use GCC's stack unwinder via the mandatory DWARF
stack annotations.
Other platforms with backtrace need to be tested.
Another problem is that the stack unwinding is very expensive. Unwinding
a 100-function-deep stack (which is not uncommon from python) takes
around 35us, so the elision can only be done for relatively large
arrays.
Heuristically it seems to be beneficial around 400 KiB array sizes (which is
about the typical L2 cache size).

The performance gain is quite significant: operations that previously
created temporaries become around 1.5 to 2 times faster. The speed is
similar to manually rewriting the operations as in-place operations.
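To make the comparison concrete, here is a minimal benchmark sketch (mine, not part of the PR; the array size and repeat count are arbitrary) that times the expression as written against a manual in-place rewrite, which is roughly the speed the elision aims to reach:

```python
import timeit

import numpy as np

n = 1000 * 1000  # large enough that memory bandwidth, not compute, dominates
a, b, c = (np.random.rand(n) for _ in range(3))

def with_temporaries():
    # each intermediate (c**2, b * c**2, the negation, the sum) may allocate a new array
    return -a + b * c**2

def in_place():
    # reuse a single buffer instead of allocating temporaries
    r = c**2
    r *= b
    r -= a
    return r

print("expression:", timeit.timeit(with_temporaries, number=100))
print("in-place:  ", timeit.timeit(in_place, number=100))
```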

@juliantaylor force-pushed the juliantaylor:temporary-elide branch 3 times, most recently from 5ca9a16 to 7e520b8 Aug 30, 2016

@njsmith (Member) commented Aug 31, 2016

First, let me say that this is terrifying and this comment should not be taken as condoning it.

That said -- I think that all we need for the optimization to be safe is for the frames up to the last call to Py_EvalFrameEx to belong to cpython/numpy. (1) Is this true? (2) Does it allow a faster version of this where we only calculate the backtrace for, like, 5 frames deep or so, on the optimistic assumption that in almost any case where the optimization can apply, that will be enough to reach the last Py_EvalFrameEx?

(I might have the function name slightly wrong. You know what I mean.)

@juliantaylor (Author, Contributor) commented Aug 31, 2016

I thought about this method too. It probably works: while you can call back into python from a c-extension, I can't think of a way you could get that code to operate on a refcount-1 array which is not a temporary.
So this is a possible optimization, but the code would be very similar: instead of determining the python DSO address range we determine the EvalFrameEx range (as calling dladdr for the name is expensive), and then we need to find the longest call chain that ends up in multiarray.so. An issue here might be that the compiler could perform deinlining, which would spread the function over a non-contiguous address range. Keeping track of all entry-point addresses instead might be a way to avoid this problem.
This could get the overhead down to around 5-10us, which would allow us to reduce the minimum array size where we elide.
But then the profitable size is currently already in the L2 cache range. Unless we can reduce it to L1 size (which I doubt with a 5us budget) there might not be a significant performance gain.

@rgommers (Member) commented Sep 2, 2016

Naive question: the benefit of 1.5x-2x is clear, but how common are the use cases where this would hurt? And how badly?

@juliantaylor (Author, Contributor) commented Sep 3, 2016

We don't try it when the arrays aren't large enough to make the check worthwhile. The size depends on the CPU's cache sizes and is between 400 and 800 KiB. At that size array operations take from 150 to 500 us.
So the worst case, a c-extension using refcount-1 arrays in the cache sweet spot with our check tuned too optimistically, would have an overhead of ~20%; for larger arrays it quickly becomes insignificant.
But this should be a pretty rare situation: while you can construct a case where you need the stacktrace check (see our incref_elide tests), I don't think it happens often in practice. I haven't seen anything do it yet.

@juliantaylor force-pushed the juliantaylor:temporary-elide branch from f08caa3 to cd79fc1 Sep 3, 2016

@juliantaylor (Author, Contributor) commented Sep 3, 2016

Updated so it now also elides when it can do safe casting; this allows e.g. double += float or int += smaller-int.
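As an illustration of what "safe casting" admits, a Python-level sketch (my own; the real decision is made in the C elision code, np.can_cast is only used here to mirror it):

```python
import numpy as np

# A temporary can only be reused as the output if the other operand casts
# safely into the temporary's dtype.
print(np.can_cast(np.float32, np.float64))  # True:  a float64 temporary can absorb a float32 operand
print(np.can_cast(np.int16, np.int64))      # True:  an int64 temporary can absorb an int16 operand
print(np.can_cast(np.float64, np.float32))  # False: a float32 temporary cannot hold float64 data
```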

There is still a problem on systems where libpython is not statically linked into the python binary, as is the case on e.g. Debian:
before main there is a symbol on the stack that doesn't seem to be identifiable with backtrace. A solution would be to check the Py_Main address (and maybe others for embedding) and stop when that is found, or to switch to stopping at EvalFrameEx, which would also be a bit faster.

Aside from this, how do we want to proceed?
It still needs testing on mac: while it seems to have backtrace(), I don't know how good it is performance- and accuracy-wise. Does someone want to test that?
Windows likely won't work, though there are probably equivalent Windows API functions one could use. Someone with more Windows knowledge should look into that.

Then we need to tune the cutoff to avoid performance penalties. We could set it rather high so we have basically no penalty ever; somewhere around 1 MiB of data should be good, since even if that fits into a CPU cache the operation currently takes long enough that the checking overhead is negligible.
But a too high value also reduces the benefit, so we should spend a little effort tuning it.
I have tested a couple of linux machines with different caches: an AMD with L2/L3 512/6144, an intel core2 with only an L2 of 6144, and intels with L2/L3 256/3072 and L2/L3 512/18000; the threshold where (d + 1) + 1 becomes profitable varies from 400 KiB to 800 KiB. It doesn't seem easy to relate it directly to the cache sizes, in particular when an L3 is available.
Maybe we could write a little data-gathering script and send it to the mailing list for testing?
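A rough sketch of such a data-gathering script (my own, not the `check.ipy` benchmark mentioned later in this thread): it times `(d + 1) + 1` against the in-place rewrite over a range of sizes so the crossover point can be read off per machine:

```python
import timeit

import numpy as np

def expr(d):
    return (d + 1) + 1        # the second add can elide the temporary from (d + 1)

def inplace(d):
    r = d + 1
    r += 1                    # explicit in-place rewrite for comparison
    return r

for kib in (64, 128, 256, 512, 1024, 2048):
    n = kib * 1024 // 8       # number of float64 elements for this size
    d = np.ones(n)
    t_expr = timeit.timeit(lambda: expr(d), number=200)
    t_inpl = timeit.timeit(lambda: inplace(d), number=200)
    print("%5d KiB  expr/inplace time ratio: %.2f" % (kib, t_expr / t_inpl))
```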

@juliantaylor force-pushed the juliantaylor:temporary-elide branch from 1a2cf0c to 76efa23 Sep 3, 2016

@juliantaylor (Author, Contributor) commented Sep 10, 2016

Updated to stop at the frame evaluation function instead of checking everything, probably good enough.
That reduces the overhead to 5-10us, so one could enable it already at around 100 KiB array sizes, but I set it at 160 KiB just in case.
Note the function name changed in python 3.6, which implements the juicy PEP 523; that should allow us to do all this with zero overhead (and maybe more), but I'll leave that for later.

I think this should be ready for some wider testing.

can_elide_temp(PyArrayObject * m1, PyObject * m2, int * cannot)
{
    int i;
    if (Py_REFCNT(m1) != 1 || !PyArray_CheckExact(m1) ||

@mattip (Member) commented Sep 13, 2016

Py_REFCNT should be the NumPy-specific PyArray_REFCOUNT.

This way the whole function will not compile on PyPy, which is good as it is very CPython-specific.

static int can_elide_temp_unary(PyArrayObject * m1)
{
    int cannot;
    if (Py_REFCNT(m1) != 1 || !PyArray_CheckExact(m1) ||

@mattip (Member) commented Sep 13, 2016

Py_REFCNT should be the NumPy-specific PyArray_REFCOUNT (see comment above).

@charris (Member) commented Feb 18, 2017

@juliantaylor Any response to this?

@charris (Member) commented Feb 18, 2017

@mattip I note that Cython also uses Py_REFCNT in the generated mtrand.c file.

@juliantaylor (Author, Contributor) commented Feb 18, 2017

Under PyPy this code is not used; it is excluded via an ifdef above.

@juliantaylor force-pushed the juliantaylor:temporary-elide branch from 0ca7b11 to a70ffa3 Sep 30, 2016

@rainwoodman (Contributor) commented Sep 30, 2016

@juliantaylor Very naive question. Could you elaborate a bit more on "The tricky part is that via the C-API one can use the PyNumber_ directly and skip increasing the reference count when not needed"? Do you mean that if one writes down

 r = -a + b * c**2

then the refcounts of a, b and c are at least two during the evaluation of the expression? How?

@juliantaylor (Author, Contributor) commented Sep 30, 2016

One count is in the variable itself and the other is the reference to the variable on the interpreter stack.
The expression a + b + b basically compiles to this bytecode:

          0 LOAD_FAST                0 (a)
          3 LOAD_FAST                1 (b)
          6 BINARY_ADD          
          7 LOAD_FAST                1 (b)
         10 BINARY_ADD      

The loads onto the stack increase the reference count, and BINARY_ADD reduces them both again.
So a non-temporary coming from the python interpreter always has a count of at least two, but the temporary result of the addition is a new object not assigned to a variable (no store), so it has a count of one.

This is only valid from the interpreter; a C extension could legitimately call PyNumber_Add on two refcount-1 objects directly, so we have to do the backtrace checking.
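Both halves of this can be checked from Python (an illustrative sketch, not from the PR): `dis` shows the load/add pattern quoted above, and `sys.getrefcount` shows why a named array is never seen with a count of one:

```python
import dis
import sys

import numpy as np

def f(a, b):
    return a + b + b

# On the CPython versions discussed here this prints the LOAD_FAST /
# BINARY_ADD sequence shown above; newer interpreters emit BINARY_OP instead.
dis.dis(f)

a = np.ones(10)
# The reported count includes the variable itself plus the temporary
# reference created for the getrefcount call, so a named array is at least 2.
print(sys.getrefcount(a))
```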

@juliantaylor (Author, Contributor) commented Nov 2, 2016

Any further comments on the implementation? Otherwise I'll squash the commits and prepare it for merging (release notes etc.).
I'll probably double the threshold just to be sure not to cause performance regressions, and check whether divisions can be added (I think my earlier concerns about division changing types should be handled by the safe can_cast check now).
Some testing on the mac would be great; a benchmark is available as `check.ipy` here:
https://github.com/juliantaylor/numpy/tree/elide-bench
https://mail.scipy.org/pipermail/numpy-discussion/2016-September/076034.html

@charris (Member) commented Nov 2, 2016

@juliantaylor I'd rather push this off to 1.13.0 along with @pv's overlap inplace work. There is already a lot of stuff in 1.12 and the last big item I want in it is __numpy_ufunc__, and I may push that off also.

@charris added this to the 1.13.0 release milestone Nov 4, 2016

@charris (Member) commented Nov 6, 2016

1.12.x is branched, might be a good time to clean up the history.

@homu (Contributor) commented Dec 13, 2016

☔️ The latest upstream changes (presumably #8367) made this pull request unmergeable. Please resolve the merge conflicts.

@juliantaylor force-pushed the juliantaylor:temporary-elide branch from ff0bbc2 to 94f711d Dec 17, 2016

@juliantaylor force-pushed the juliantaylor:temporary-elide branch 2 times, most recently from 8c2d717 to 8537ed4 Jan 18, 2017

@juliantaylor (Author, Contributor) commented Jan 18, 2017

squashed the history

}
}

/* left ok libraries and not reached PyEval -> no elide */

@charris (Member) commented Feb 24, 2017

Maybe No longer in ok ....

can_elide_temp(PyArrayObject * alhs, PyObject * orhs, int * cannot)
{
    if (Py_REFCNT(alhs) != 1 || !PyArray_CheckExact(alhs) ||
        PyArray_DESCR(alhs)->type_num == NPY_VOID ||

@charris (Member) commented Feb 24, 2017

Ah, so we don't worry about void types.

@charris (Member) commented Feb 24, 2017

String types don't have these operations (?), might be worth mentioning that.

@juliantaylor (Author, Contributor) commented Feb 24, 2017

Strings don't have these operations, but it might be a good idea to blacklist them too.

* A possible future improvement would be to change cpython to give us access
* to the top of the stack. Then we could just check that the objects involved
* are on the cpython stack instead of checking the function callstack.
*/

@charris (Member) commented Feb 24, 2017

It would be good to list the acceptable types and the valid operators and the distinction of commutative vs non-commutative operators. Maybe define a few terms as well such as elision and DSO. I know this is in the code, but it would be good if one could read this and know more detail of what is going on. Might also mention how you can determine if the stack objects are in python, numpy, etc. I'm also not clear on why it is OK for a temporary to be in numpy, or is it the case that you only want to know that the possible non-temporaries come from numpy?

@juliantaylor (Author, Contributor) commented Feb 24, 2017

The assumption, for this to work, is that numpy itself does not call PyNumber_* functions. The whole backtrace thing is only necessary because other libraries (e.g. cython) can do that.
To my knowledge numpy does not, for basic types.

Updated the docs some more.

@charris (Member) commented Feb 24, 2017

I think this is looking good, but more documentation is desirable.

@juliantaylor force-pushed the juliantaylor:temporary-elide branch 2 times, most recently from a5668d8 to e53ee05 Feb 24, 2017

@juliantaylor (Author, Contributor) commented Feb 24, 2017

Odd, why is windows failing now? It shouldn't even be enabled there. Does the windows compiler not handle our typenum enum like the standard says it should?

juliantaylor added some commits Aug 30, 2016

ENH: avoid temporary arrays in expressions
Temporary arrays generated in expressions are expensive as they imply
extra memory bandwidth, which is the bottleneck in most numpy operations.
For example:

     r = -a + b

One can avoid temporaries when the reference count of one of the
operands is one, as it cannot be referenced by any other python code
(an rvalue in C++ terms).
Python itself uses this method to optimize string concatenation.
The tricky part is that via the C-API one can call the PyNumber_
functions directly and skip increasing the reference count when it is
not needed.
The python stack cannot be used to determine this as python does not
export the stack size, only the current position.
To avoid the problem we collect the backtrace of the call until the
python frame evaluation function and, if it consists exclusively of
functions inside python or the c-library, we can assume no other library
is using the C-API in between.

One issue is that the reliability of backtraces is unknown on non-GNU
platforms. On GNU and amd64 it should be reliable enough, even without frame
pointers, as glibc will use GCC's stack unwinder via the mandatory DWARF
stack annotations.
Other platforms with backtrace need to be tested.
Another problem is that the stack unwinding is very expensive. Unwinding
a 100-function-deep stack (which is not uncommon from python) takes
around 35us, so the elision can only be done for relatively large
arrays.
Heuristically it seems to be beneficial around 256 KiB array sizes (which is
about the typical L2 cache size).

The performance gain is quite significant: operations that previously
created temporaries become around 1.5 to 2 times faster. The speed is
similar to manually rewriting the operations as in-place operations.

@juliantaylor force-pushed the juliantaylor:temporary-elide branch from e53ee05 to c3e24b2 Feb 24, 2017

@juliantaylor (Author, Contributor) commented Feb 24, 2017

Odd, must have been a CI fluke or the intp bug that's still in master.

* Only check until the python frame evaluation function is found
* approx 10us overhead for stack size of 10
*
* TODO some calls go over scalarmath in umath but we cannot get the base

@charris (Member) commented Feb 25, 2017

At some point I'd like to move scalarmath back into multiarray. The idea would be to put all needed functions in npymath. I think there are only a few ufuncs still needed.

@charris merged commit 7bc881d into numpy:master Feb 25, 2017

2 checks passed:
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@charris (Member) commented Feb 25, 2017

In it goes. Thanks Julian.

@charris (Member) commented Feb 25, 2017

For future reference, PyNumber functions are called from several places in multiarray with arrays of reference count 1. AFAICT with a quick scan, it would be OK to use the passed arrays like temporaries. Am I missing something, or have you looked at those places?

EDIT: I assume there is nothing on the interpreter call stack in those cases.

@@ -84,6 +86,14 @@ C API
New Features
============

Temporary elision
-----------------
On platforms providing the `backtrace` function NumPy will now not create

@rgommers (Member) commented Feb 25, 2017

Naming the platforms (all but Windows?) would be clearer

@@ -7,6 +7,8 @@ This release supports Python 2.7 and 3.4 - 3.6.
Highlights
==========

* Operations like `a + b + c` will create less temporaries on some platforms

@rgommers (Member) commented Feb 25, 2017

needs double backticks, here and in all cases (including `backtrace`) below

@rgommers (Member) commented Feb 25, 2017

Documentation LGTM now, much more understandable

@juliantaylor (Author, Contributor) commented Feb 25, 2017

Most of them are for object arrays, which are not handled here; the handful of non-object ones look fine, either reference count != 1 or cases which can actually be done in place. Of course I might have overlooked some, which would mean we have holes in our test coverage.

@juliantaylor deleted the juliantaylor:temporary-elide branch Feb 25, 2017

@charris (Member) commented Feb 25, 2017

@juliantaylor It is a bit concerning as it introduces one more thing to watch out for when writing or reviewing C code. Apropos Ralf's comment, which platforms do we know this will work on? Is there an easy way to tell?

@rgommers I'll fix those backticks when it comes time to edit the release notes.

@juliantaylor (Author, Contributor) commented Feb 25, 2017

I don't think it is something to worry about: usage of PyNumber_ in numpy outside of object arrays is very rare (most of it is a C-API stddev computation I doubt anyone is actually using), and new code must have tests anyway.

The release notes are as accurate as I can be: platforms that have the backtrace() function. I have tested linux and mac; bsd and other unixes may also have it, but I don't know.

@jakirkham (Contributor) commented Feb 27, 2017

Did you guys look into whether this could work with libunwind?

@juliantaylor (Author, Contributor) commented Feb 27, 2017

It works with libunwind too, but it's unnecessary at least on the platforms I could test; backtrace with glibc on x86 is pretty good and doesn't need frame pointers.
Possibly libunwind is more reliable for non-x86 and non-glibc architectures. Restricting to those known-good systems might be a good idea at release time, but for master it's good to leave it unrestricted, to find possible non-working systems.

@mhvk referenced this pull request Apr 4, 2017

Closed

Deferred execution NEP #8889

@mhvk referenced this pull request Jul 10, 2018

Closed

Creation of composite units is relatively slow #7630
