
ENH: perform inplace operations if one operand is a temporary #4322

Closed
wants to merge 6 commits

Conversation

juliantaylor
Contributor

Temporaries are identified by having a reference count of 1; when that is the case we can perform the operation in place and avoid creating another temporary.
E.g.

a = b + c + d

contains one temporary (c + d) with reference count 1 while the other operands have count 2, so b + temp can be performed in place.
This saves memory allocation overhead and improves cache utilization.
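The transformation can be written out by hand in Python; the names below mirror the example above, and the in-place form is what the patch would produce automatically:

```python
import numpy as np

# Hand-written equivalent of the elision (illustrative sketch): in
# `a = b + c + d`, the (c + d) result is a temporary, so the second
# addition can reuse its buffer instead of allocating a new one.
b = np.arange(5.0)
c = np.arange(5.0) * 2
d = np.arange(5.0) * 3

a_plain = b + c + d   # allocates two result buffers

t = c + d             # the one unavoidable temporary
t += b                # performed in place, reusing t's buffer
a_elided = t

assert np.array_equal(a_plain, a_elided)
```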

@juliantaylor
Contributor Author

It passes the SciPy and pandas test suites, though whether Cython creates c = a + b with reference count 1 still needs to be checked.

example benchmark:

d = np.arange(1000000)
%timeit d  + d + d + d

improves by about 30% with the branch (9.79 ms vs 14.8 ms), as it saves zeroing memory pages under Linux
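The %timeit line above translates to plain `timeit` roughly as follows (array size as in the benchmark, repeat count illustrative; absolute numbers will differ by machine):

```python
import timeit
import numpy as np

# Reproduction sketch of the benchmark: chained adds on a large array,
# where elision can reuse intermediate buffers and skip page zeroing.
d = np.arange(1000000)
per_call = timeit.timeit(lambda: d + d + d + d, number=100) / 100
print("d + d + d + d: %.2f ms per call" % (per_call * 1e3))
```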

{
if (Py_REFCNT(m1) == 1 &&
PyArray_CheckExact(m1) && PyArray_CheckExact(m2) &&
PyArray_DESCR(m1)->type_num == PyArray_DESCR(m2)->type_num &&
Member

IIUC this type check is assuming that if m1.dtype == m2.dtype then m1.dtype == result.dtype as well. But I don't think this is true in all cases -- true_divide violates it all over the place, but also there's 'mm->d' for np.divide, and in the future arithmetic on bools will probably cast to int.
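The review point is easy to demonstrate from Python: equal operand dtypes do not imply the result has that dtype, so reusing m1's buffer for the output can be wrong:

```python
import numpy as np

a = np.array([1, 2, 3])            # integer dtype
b = np.array([4, 5, 6])            # same dtype as a
c = np.true_divide(a, b)           # result is float64, not the operand dtype
assert a.dtype == b.dtype
assert c.dtype == np.float64 and c.dtype != a.dtype

# The 'mm->d' loop mentioned above: timedelta / timedelta yields float64.
m = np.array([1, 2], dtype='timedelta64[s]')
assert (m / m).dtype == np.float64
```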

Member

Also, what if the dtype is NPY_VOID, but the fields are different?
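For example, two structured dtypes with different fields share the same type number, so a type_num comparison alone would treat them as compatible:

```python
import numpy as np

dt1 = np.dtype([('a', 'i4')])      # void dtype with an int32 field
dt2 = np.dtype([('b', 'f8')])      # void dtype with a float64 field
assert dt1.num == dt2.num          # both report NPY_VOID as their type number
assert dt1 != dt2                  # yet the dtypes are not interchangeable
```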

Contributor Author

Hm, I don't think there is a way to check these special casting rules in this place.
Should we instead just whitelist operations?

@njsmith
Member

njsmith commented Feb 18, 2014

Should consider whether to support using the second operand as the output as well. And possibly wire this deeper into the ufunc machinery, and implement for other operations (|, &, ~, ...). But perhaps this is a good start to see if this is feasible at all :-)

@seberg
Member

seberg commented Feb 18, 2014

Might be a bit tricky to put it deeper, since numpy also likes to increase ref counts when calling into the umath machinery. May be nonsense, but I was wondering if we can/should corrupt the original array object so that unsafe direct C-API usage may be found.

For the tests, I am not sure what the stride checks are for, but I think we need to check that m2 does not have possible memory overlap. I was going to suggest a contiguity check for m1, but that may be overly strict; still, I don't like the idea that someone creates arrays with overlapping memory (usually those should not have refcount == 1 and OWNDATA, but I think you could arrange that).
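Overlapping arrays of the kind this comment worries about can be constructed with stride tricks, for instance (illustrative only; `as_strided` is explicitly unsafe API):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Build two rows that alias the same buffer; an in-place binary op on
# such views could read elements already overwritten by earlier writes.
base = np.arange(4.0)
view = as_strided(base, shape=(2, 3),
                  strides=(base.itemsize, base.itemsize))
assert np.shares_memory(view[0], view[1])   # rows overlap in memory
```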


/* check if m1 is a temporary (refcnt == 1) so we can do inplace operations
* instead of creating a new temporary */
static int can_alide_temp(PyArrayObject * m1, PyObject * m2)
Member

s/alide/elide/

@juliantaylor
Contributor Author

Added some WIP commits, including a check for m2.ref_cnt == 1 for associative operations, to trigger more elisions and find more issues in test suites. The implementation is still very simplistic.

How can I find out whether an array has an inexact type, so temporaries can be reused in true_divide (besides hardcoding all types)?
PyObject_TypeCheck(m1, &PyInexactArrType_Type) does not seem to work.
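For reference, the corresponding dtype-level check on the Python side goes through `np.issubdtype` (the `PyInexactArrType_Type` check above tests the array *scalar* type, which may be why it fails on array objects):

```python
import numpy as np

a = np.ones(3, dtype=np.float64)
b = np.ones(3, dtype=np.int64)
assert np.issubdtype(a.dtype, np.inexact)
assert not np.issubdtype(b.dtype, np.inexact)

# Only for inexact operands does true_divide keep the operand dtype,
# which is the property an elision check would need.
assert np.true_divide(a, a).dtype == a.dtype
assert np.true_divide(b, b).dtype != b.dtype   # int / int -> float64
```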

@juliantaylor
Contributor Author

An option to move the temporary reuse into the ufunc machinery would be to mark the arrays as temporary in the tp_ slot layer; that way the temporary status can be checked on the way down to the ufunc functions while still keeping the C API unchanged.

@njsmith
Member

njsmith commented Feb 19, 2014

Yeah, it's a bit ugly but we could use an array flag for this. (Just have to be careful that every path which sets the flag also unsets it. Guess that's easy enough by making sure that the tp_ function unconditionally unsets it after calling down to the ufunc layer, regardless of exceptions.)

For avoiding the issue with parametrized dtypes like void/str/datetime, a sufficient hack would be to check for exact dtype object equality. But I don't see how anyone outside the ufunc layer can tell what the result type will be.

The current hack may be good enough for testing things like Cython compatibility though.

@sturlamolden
Contributor

This might not be safe in Cython. Stefan Behnel wrote this on the Cython users list:

Nathaniel Smith wrote:

Does anyone see any issue we might be overlooking in this refcount == 1 optimization for the python api? I'll post a PR with the change shortly.
It occurs to me belatedly that Cython code like a = np.arange(10); b = np.arange(10); c = a + b might end up calling tp_add with refcnt 1 arrays. Ditto for the same with cdef np.ndarray or cdef object added. We should check...

That can happen, yes. Cython only guarantees that the object it passes is safely owned so that the reference cannot go away while it's being processed by a function. If it's in a local (non-closure) variable (or a Cython temporary variable), that guarantee holds, so it's safe to pass objects with only a single reference into a C function, and Cython will do that.

Stefan

@juliantaylor
Contributor Author

Hm, too bad; I guess that kills these plans.
Though one could probably have a compile-time option to enable it for people who don't use Cython.

@charris
Member

charris commented Feb 21, 2014

Somewhat related #2802.

@charris
Member

charris commented Mar 10, 2014

@juliantaylor Should this be closed?

@njsmith
Member

njsmith commented Mar 16, 2014

Today I Learned that CPython itself makes a similar optimization in some circumstances: http://utcc.utoronto.ca/~cks/space/blog/python/ExaminingStringConcatOpt

It's probably worth looking into how exactly they pull this off. It seems pretty plausible that either CPython also assumes that refcnt == 1 implies a temporary, in which case Cython is already broken and we could hassle Cython into fixing it, or else there's some way to tell the difference between a string concat that comes in via a public API like PyString_Concat/PyNumber_Add versus one that comes in through bytecode interpretation, and maybe we could steal the same trick.

@juliantaylor
Contributor Author

In CPython this is an optimization applied on the BINARY_ADD bytecode, so Cython is not broken.
In principle CPython could pass the refcount == 1 information from the bytecode on to numpy, but that would mean adding more checks to the performance-critical evaluation loop; I don't think the python devs and other users would be happy about that.

The only options I see are to make it an optional build flag for non-Cython users, or to let Cython set some flag telling numpy not to do this optimization.
Maybe we are better off adding temporary caching instead.
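The opcode in question is visible with the `dis` module (named BINARY_ADD up to Python 3.10 and folded into the generic BINARY_OP from 3.11 on):

```python
import dis

ops = [i.opname for i in
       dis.get_instructions(compile("a + b + c", "<expr>", "eval"))]
# Two add instructions, one per '+' in the expression.
assert sum(op in ("BINARY_ADD", "BINARY_OP") for op in ops) == 2
print(ops)
```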

@njsmith
Member

njsmith commented Mar 31, 2014

Ah, I see it. Well, foo.

Optional build flags seem pointless because no one will set them. (And
anyway, what's a "non-cython user", given that numpy itself uses cython...)

But I don't see how to make temporary caching work without nasty side
effects, like randomly holding onto multi-gigabyte chunks of memory for
arbitrary or nondeterministic amounts of time. (Also it's much worse than
temporary elision; out-of-place ops have 50% more cache thrashing than
in-place ones.)

The cython flag idea might work, but there might be other extensions that
do something similar, if it's a valid use of the cpython api.

If this optimization is really as valuable as it looks, then I think we
should at least ask python-dev if they have any ideas. Maybe it's possible
to somehow peek at the interpreter stack (notice that for all binary ops,
the top of the stack is the left argument), or something. Even if it
requires some new api, well, I guess a lot of numpy users will probably be
switching to 3.5 anyway...

@juliantaylor
Contributor Author

It seems one can get the current frame with PyEval_GetFrame; possibly one can figure out whether the current instruction is e.g. BINARY_ADD, which should mean we are not being called from Cython.

@juliantaylor
Contributor Author

something along the lines of:

#0  array_add (m1=0x7ffff7e6ca80, m2=1) at numpy/core/src/multiarray/number.c:339
339 {
(gdb) call (((PyStringObject *)PyEval_GetFrame()->f_code->co_code))->ob_sval[PyEval_GetFrame()->f_lasti]
$17 = 23 '\027'

23 is BINARY_ADD
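The same peek can be done from Python via the frame object (a CPython implementation detail, sketched only for illustration; on 3.11+ the opcode at `f_lasti` will be BINARY_OP or a specialized variant rather than BINARY_ADD):

```python
import sys

def caller_opcode():
    # Read the byte of the caller's bytecode at its last instruction
    # offset, i.e. around the instruction that invoked us.
    f = sys._getframe(1)
    return f.f_code.co_code[f.f_lasti]

op = caller_opcode()
assert 0 <= op < 256   # a valid bytecode byte
```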

@nouiz
Contributor

nouiz commented Mar 31, 2014

There is one idea that was raised but seems to have been discarded, and I don't understand why.

Does Cython call the C function that corresponds to add directly, or does it go through the interface that looks up the function to call? Said another way, if we register a new function for add, does Cython call the old function or the new one?


@juliantaylor
Contributor Author

As far as I know Cython just calls PyNumber_Add, which calls the nb_add slot (== the __add__ python function),
so it uses the same interface as CPython, just not from interpreted bytecode but from a statically compiled sequence.

@njsmith
Member

njsmith commented Apr 1, 2014

Not sure that checking the current instruction is enough - it could be that
the eval loop called __add__ on a Cython type, which then called __add__ on
an ndarray. Or does Cython insert its own frames?

@juliantaylor
Contributor Author

Right, unwinding the stack to see our caller (e.g. with libunwind) might be an option.

@nouiz
Contributor

nouiz commented Apr 1, 2014

Thanks for the explanation.


@juliantaylor
Contributor Author

Unwinding seems to work nicely, though it has about 10 µs of overhead, so we would only want to do it for larger arrays.

@njsmith
Member

njsmith commented Apr 1, 2014

Unwinding seems like it might be fragile? (Depending on architecture, build
options, etc.?)

If we can get hold of the Python stack pointer, and op == BINARY_ADD,
top-of-stack == left PyObject*, just-above-top-of-stack == right PyObject*,
and both left and right are base-class ndarrays, then we can be pretty darn
certain that nothing funny is going on. And potentially this could be done
with plain ol' C...


@juliantaylor
Contributor Author

libunwind only works on a couple of Linux arches (x86, ia64, mips and arm as far as I know); macOS and Windows would need to do something else (Windows has APIs for this).

I don't think you can get the stack pointer; it's probably a local variable of the bytecode loop in the frame evaluator.

@charris
Member

charris commented Apr 10, 2014

Do we want to pursue this idea here? It sounds like there are a number of complications to deal with before it would be reliable.

@juliantaylor
Contributor Author

Depends on whether a Linux-only solution is acceptable; using libunwind for this is just a couple of lines of code.

@njsmith
Member

njsmith commented Apr 10, 2014

Do we have any compelling examples of why this optimization is useful, on
anything less microbenchmarky than 'a + a + a + a'? The laplace code was
cited in the original thread
http://yarikoptic.github.io/numpy-vbench/vb_vb_app.html#laplace-normal
but looking at those graphs I have trouble telling how big the difference
between the regular and in-place versions is in the first place.

(Asking because I'm considering asking python-dev and it would be good to
have a simple clear example of why temporary elision matters in real life.)


@juliantaylor
Contributor Author

It's about 10-15% in that case due to saving some cache; on large data you save the page zeroing that needs to be done on the temporary memory.
I don't see a good way for python to help out here.

numexpr or similar approaches (like compiling the AST into more efficient numpy-only code) are probably the way to go.

@njsmith
Member

njsmith commented Jun 5, 2014

Well, it seems worth a try, anyway, we can see what happens: http://thread.gmane.org/gmane.comp.python.devel/148001

@juliantaylor
Contributor Author

I have been poking around the python eval loop a bit again, and while the stack pointer and current instruction are local variables, it does seem to update the last instruction offset after most operations.
The last instruction + 1 may not necessarily point to the current instruction, as python can fuse two operations into one, but possibly we could just scan last instruction + some small range for a PyNumber operation and branch on that?
It does seem a bit hackish, but the python eval loop has not changed very much since at least 2.7, so it could be stable enough for us to use in this way.
It would avoid a modification of python itself (e.g. the tp_can_elide slot proposed in the python thread).

@juliantaylor
Contributor Author

I forgot we need the stack pointer too, to check that the operands are known to python.
We could reconstruct the stack from the frame offset and the bytecode, but that's likely too much work and too fragile.
But we could scan the locals of the frame: if the operands are objects in the frame, they can't be a refcount-elided cython construct, can they?

@njsmith
Member

njsmith commented Jul 2, 2014

Are you trying to figure out specifically how to make this work on older
python versions?

For 3.5 purposes, I think the best strategy would be to prototype our ideal
changes to ceval.c (pick our favorite approach from that thread), and see
whether they actually work and whether they slow things down on regular
code. Something like my proposal shouldn't require too much in the way of
changes, I think...?


@juliantaylor
Contributor Author

I think we might be able to solve it without python changes, maybe even without relying on the internals of the last instruction counter.
We can access what is on the stack of a frame, but not directly the offset python currently has into it. But any ndarray on the stack must come from real python code and thus have correct refcounts. So we only need to check whether our operand pointers are on this stack.
Or am I missing something?

@juliantaylor
Contributor Author

Pushed a commit with this change.

Tested with a small cythonized file:

import numpy as np
class Test:
    def __add__(a, b):
        d = np.ones(5)
        x = d + d + d
        print (d)
        return x

called via

t = obj.Test()
t + t

Here the top opcode is BINARY_ADD and cython does not increase the refcount of d, so without the stack check d would wrongly end up containing 2s instead of 1s (the first addition being elided into d); with the stack check d is correctly unchanged.

It does still rely on implementation details, like the existence of the stack and that the stack is only cleared after the number operation is performed (I think it's never cleared until the frame ends).

@juliantaylor reopened this Jul 2, 2014
@njsmith
Member

njsmith commented Jul 2, 2014

So your idea is that we don't actually need to know what the arguments of
the last operation were, we can just scan the whole stack?


@juliantaylor
Contributor Author

yes

@njsmith
Member

njsmith commented Jul 2, 2014

So let's see. If it's on the stack, then the stack has a refcnt for it --
unless it's a stray pointer in the "unused" area above the top of the
stack. (Question 1: can such stray pointers mess us up?) And if we dispatch
from the stack via tp_whatever, and PyArray_CheckExact returns true, then
we know that the dispatch was directly to us, so the stack is the only
thing holding a pointer, which it will release as soon as the operation
finishes. (Question 2: do all such operations pop the stack? Question 3: do
we ever do internal dispatch from one numpy tp_whatever method to another?)
And other operations that call into user code that might potentially call
PyNumber_Add (e.g., function calls) will increment the refcnt. (Question 4:
are there any cases to worry about besides function calls? Question 5: are
function calls guaranteed safe?)

Re: Question 1: I think a = [1, 1, 1, 1, np.zeros(...)] will result in a
situation where the ndarray has only a single refcnt (from 'a'), and it is
present "above the stack" (because it gets pushed during construction of
the list -- the extra 1s are to guarantee that this grows the stack a bit
so it doesn't get immediately overwritten).

I think code like this therefore stands a chance of triggering a bug
(depending on how exactly how aggressively cython manages the list access
and refcnt elision):

cdef do_something(list a):
    return a[4] + 1

my_cython_mod.do_something([1, 1, 1, 1, np.zeros(10)])


@juliantaylor
Contributor Author

Urg, you are right about the list case; cython currently doesn't do this optimization, but it could. Too bad.

@njsmith
Member

njsmith commented Jul 3, 2014

So how much would it cost for the interpreter to put the top of the stack
somewhere we could see it?

@juliantaylor
Contributor Author

Probably not much, but I really would like a solution that does not require a new python version :/

@njsmith
Member

njsmith commented Jul 3, 2014

I know, that would be pretty sweet :-/ But even if it's possible it'll be a
pretty fragile hack, so we'll want to get fixes into CPython for the long
term anyway. And I wouldn't be surprised if a large portion of our userbase
is on 3.5 within 18-24 months, which at current rates is only ~2-3 numpy
release cycles anyway, versus ~1 numpy release cycle if we got it working
now.


@juliantaylor
Contributor Author

How about storing the current frame, the current instruction, and the produced array pointer in thread-local storage?
On each operation we check whether the current instruction has changed, whether we are on the same frame, and whether an argument is the previously produced array. If so we can elide the temporary.
C API calls skipping refcounts would not have changed the frame or instruction, so this would not trigger.

temporaries are identified by having a reference count of 1, if this is
the case we can perform inplace operations to avoid creating another
temporary.
E.g.

    a = b + c + d

contains one temporary (c + d) with reference count 1 while the other
operands have count 2, so b + temp can be performed inplace.
This saves memory allocation overheads and improves cache utilization.
implementation not good, just improve checking for issues in testsuite

if it is on the stack it must have correct refcounts, unlike objects from e.g. cython where refcount increases can be skipped for internal objects.

benchmark:

import numpy as np
n = 10000
d = np.ones(n)
e = np.ones(n)
f = np.ones(n)

def fun(d, e, f):
    t = d + e
    t += f

%timeit -r 20 d+e+f
%timeit -r 20 fun(d, e, f)

n = 100000
d = np.ones(n)
e = np.ones(n)
f = np.ones(n)
%timeit -r 20 d+e+f
%timeit -r 20 fun(d, e, f)

n = 1000000
d = np.ones(n)
e = np.ones(n)
f = np.ones(n)
%timeit -r 20 d+e+f
%timeit -r 20 fun(d, e, f)
@charris
Member

charris commented Dec 10, 2015

@juliantaylor I'm going to close this. Feel free to reopen if you want to pursue it.

@charris charris closed this Dec 10, 2015
@eric-wieser
Member

Replaced by #7997 I think?

9 participants