Broken __iadd__ behavior on views #6119

Closed
akhmerov opened this issue Jul 26, 2015 · 43 comments

Comments

@akhmerov

Using += with a view of the same array is sometimes broken. To reproduce, run:

import numpy as np

for n in range(80, 100):
    h = np.random.randn(n, n)
    h += h.T                              # in-place add with a transposed view of h
    print(n, np.linalg.norm(h - h.T))

This produces a bunch of zeros (correct result), followed by a bunch of finite numbers of order 1 (wrong). For me the result is wrong for n > 90.

Replacing h += h.T with h = h + h.T or anything equivalent fixes the issue.

I have verified that the issue appears in a Linux pip installation of numpy v1.9.2, under both Python 2 and Python 3. In particular, it is reproducible on http://try.jupyter.org.

@charris
Member

charris commented Jul 26, 2015

The problem here is that numpy's __iadd__ actually works in place and h.T is a view of h, so the data in h.T gets overwritten during the add. It has been proposed that we somehow attempt to detect overlap of the right and left sides, but the problem is difficult in general, and the order in which the data is overwritten is not guaranteed. The best current advice is that the programmer know what is going on and avoid the situation. I'm going to close this as a known problem, but you should feel free to open another discussion on the mailing list ;)
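For illustration, here is a minimal sketch of the difference (assuming a NumPy of this era, where no overlap handling is done for the in-place form; the exact n at which it breaks depends on buffering details):

import numpy as np

n = 200
h = np.random.randn(n, n)

bad = h.copy()
bad += bad.T                # roughly np.add(bad, bad.T, out=bad): the add
                            # writes into the same buffer it is still reading

good1 = h + h.T             # allocates a fresh output array first
good2 = h.copy()
good2 += good2.T.copy()     # or materialize the view before adding in place

print(np.linalg.norm(good1 - good1.T))   # 0.0
print(np.linalg.norm(good2 - good2.T))   # 0.0
print(np.linalg.norm(bad - bad.T))       # typically nonzero here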

@charris charris closed this as completed Jul 26, 2015
@argriffing
Contributor

I'm going to close this as a known problem

In particular #1683, #4802, #5241.

@pv
Member

pv commented Jul 27, 2015

My understanding of this issue was that it indeed is a bug, with the correct behavior being that of x += x.T.copy(). However, it needs to be fixed in such a way that performance is not sacrificed in the cases that work. The complete question "does the operation give the correct result without a copy" seems difficult to answer exactly, but it should be possible to detect the most important simple cases that work --- probably only 1D cases come into question here, because for multidim arrays there are memory access pattern optimizations.

@akhmerov
Author

I've sent an email to the mailing list as suggested by @charris. I might be missing something there, but isn't numpy.may_share_memory the exact same check @pv discusses?

@pv
Member

pv commented Jul 27, 2015

Not completely the same --- consider x[:-1] += x[1:], which is OK due to the traversal order. For ndim > 1 there are no guarantees about traversal order (or, they are more difficult to predict), so a similar overlap check would indeed work there. may_share_memory can also return false positives, so it's not a complete answer, but it is probably good enough for this purpose.
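A small illustration of both points (the left-to-right traversal of the 1D case is an implementation detail, not a documented guarantee):

import numpy as np

x = np.arange(6.0)
x[:-1] += x[1:]         # happens to be safe: each RHS element is read
print(x)                # before it is overwritten, given the traversal order

a = np.ones((4, 4))
# may_share_memory only compares overall memory bounds, so it reports a
# possible overlap even for two columns that never touch the same element:
print(np.may_share_memory(a[:, 0], a[:, 1]))   # True (a false positive)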

@akhmerov
Author

I would already be happy if the arrays with ndim > 1 were covered. Are there any other important and safe use cases?

@pv
Member

pv commented Jul 27, 2015

I don't recall seeing too many "safe" examples given, although there may be some additional ones in old threads on the mailing list. I suspect there's not much (correct) code that actually relies on this as a performance optimization, as it is cumbersome to reason about, but AFAIK it worked like this also in Numeric (pre-2006), so some may exist.

@jaimefrio
Member

I would be a little wary of providing features, or documenting behaviors, that rely on implementation details like array traversal order. It may be shooting ourselves in the foot if someone ever comes up with some iterator optimization scheme that needs to change this, and we are stuck with backwards compatibility.

You can get the original example to work playing around with the iterator buffer size, but it's the kind of thing that should come with a big fat DO NOT TRY THIS AT HOME sign.

@pv
Member

pv commented Jul 27, 2015

@jaimefrio: the suggestion here (gh-1683) was to have it always behave as if a copy of the RHS was made. The ufunc implementation in Numpy knows what order it is going to do the traversal in (and about buffering etc.), and can skip the copies in some cases. The current behavior is implementation-defined, but as seen from the issue being discussed at regular intervals, it's fairly surprising.
I see divergent opinions (gh-5241) have also been pointed out, so the mailing list is probably a better place for this...

@akhmerov
Author

If the current behavior does qualify as a bug, then wouldn't the optimal resolution be to first fix it, while only accounting for the most common safe case (np.may_share_memory(rhs, lhs) == False)?

On a related note, is there a way to automatically test a few thousand lines of an existing codebase for possible appearances of this undesired behavior?

@pv
Member

pv commented Jul 27, 2015

Yes --- given the assumption it is a bug and not a wontfix. (However, allowances for the 1D cases are probably fairly simple to implement.)

At runtime, you can use np.set_numeric_ops to replace the binary operations with custom ones that use np.may_share_memory to check for overlaps between input and output arrays. The functions take 3 arguments when in-place. Additionally, numpy.dot and other functions with output arguments can be monkeypatched. Running a test suite with such modifications will likely catch many issues of this type.
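A rough sketch of that kind of runtime check, assuming the np.set_numeric_ops hook (as it existed at the time) and that in-place operations pass the output array as a third positional argument; the test is deliberately conservative:

import warnings
import numpy as np

def checked_add(a, b, *out):
    # For a += b, the registered 'add' is called with the output array as
    # a third argument; warn if the RHS may overlap with the destination.
    if out and out[0] is not b and np.may_share_memory(b, out[0]):
        warnings.warn("in-place add with possibly overlapping operands",
                      stacklevel=2)
    return np.add(a, b, *out)

# Route ndarray's + and += through the wrapper; subtract, multiply, etc.
# (and functions like np.dot) could be wrapped the same way.
np.set_numeric_ops(add=checked_add)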

@akhmerov
Author

Firstly: should I continue the discussion here (the issue is still closed), in #1683 (same issue, still open), or on http://thread.gmane.org/gmane.comp.python.numeric.general/60880/focus=60884 ?

I would strongly argue for considering it a bug for the following reasons:

  • The wrong behavior is not mentioned anywhere: not in the add or ufunc docstrings, nor in the docstrings of other affected functions such as dot.
  • Basic operations don't even have user-accessible docstrings. For example, I only realized after @pv's clarification that a.__iadd__(b) is the same as add(a, b, out=a) (see the short sketch after this list).
  • The overall numpy design ensures that things don't fail. For example, it isn't easy to turn off bounds checks even though they reduce performance.
  • Existing efficient code like a[:-1] += a[1:] relies on an assumption that is not declared anywhere in numpy. It may well break in the future. For an extremely similar situation see https://bugzilla.redhat.com/show_bug.cgi?id=638477
  • As discussed above, and seemingly contrary to what @charris wrote in his initial comment, the check can apparently be implemented without compromising efficiency, since it would essentially amount to a np.may_share_memory call.
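A tiny sketch of that equivalence, for the record:

import numpy as np

a = np.arange(4.0)
b = np.arange(4.0)

r = a.__iadd__(b)                  # what `a += b` invokes
print(r is a)                      # True: a itself was modified in place
print(np.add(a, b, out=a) is a)    # True: the equivalent ufunc spelling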

@njsmith
Member

njsmith commented Jul 29, 2015

I see two potentially productive action items: (a) improve the docs, (b) implement a useful definitely_share_memory. (may_share_memory does not help for the reasons mentioned in my comment above - note my example does not depend on any quirky or undefined behavior at all). Either or both would be welcome.

@akhmerov
Author

Sorry, which example? I don't see any other messages from you in this issue discussion.

@argriffing
Contributor

@akhmerov I'm quite sure that @njsmith is referring to this kind of thing:

>>> import numpy as np
>>> a = np.ones((4, 4))
>>> a
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])
>>> a[:, 0] += a[:, 1]
>>> a
array([[ 2.,  1.,  1.,  1.],
       [ 2.,  1.,  1.,  1.],
       [ 2.,  1.,  1.,  1.],
       [ 2.,  1.,  1.,  1.]])
>>> np.may_share_memory(a[:, 0], a[:, 1])
True

In this example we are adding the second column of a to the first column of a in-place. When @njsmith calls this a 'trivial safe case' in #5241 (comment), I think he is referring to the fact that the behavior of this example doesn't depend on the traversal order of the copy, and in particular would not be affected by issues like https://bugzilla.redhat.com/show_bug.cgi?id=638477. Therefore it's not completely correct to say that "the check can seemingly be implemented without compromising efficiency since it would essentially amount to a np.may_share_memory call".

For what it's worth, I'd personally be OK with a policy where a += b means the same thing as a += b.copy() for ndarrays even if it cannot be implemented without any loss of efficiency.

Also I don't understand @njsmith's suggestion that "what we need here is a definitely_share_memory function." If this function has leeway to return False even if memory is shared, then this could not be used to implement a policy that a += b always has the same behavior as a += b.copy() for ndarrays, right?

@rkern
Member

rkern commented Jul 29, 2015

I believe that @njsmith is suggesting that definitely_share_memory() be implemented to return True if and only if an addressable element of one array does overlap in memory with an addressable element of the other array.

@pv
Member

pv commented Jul 29, 2015 via email

I'm not sure it's productive to insist on definitely_share_memory(), given that the scaling is non-polynomial (in the number of dimensions), so it's likely going to be slow for high-dimensional arrays. For a compiler this might be OK, but Numpy has to do the calculation on the fly.

@charris
Member

charris commented Jul 29, 2015

The simplest solution might be to use the copy semantics whenever there is a possible overlap. Less efficient in some cases, but that is pretty much my standard procedure to avoid problems. The same might be appropriate for the out parameter of binary ufuncs.

@njsmith I mentioned the memcpy case because we don't have documented behavior for the one-dimensional case. If we want to keep that behavior, it should be documented and tested.
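A minimal sketch of those copy semantics for a single ufunc, using the conservative np.may_share_memory test (a hypothetical wrapper, not how np.add is actually implemented):

import numpy as np

def add_with_copy_semantics(a, b, out=None):
    # If either input may alias the output (and is not the output itself),
    # operate on a defensive copy so the result matches out[...] = a + b.
    if out is not None:
        if a is not out and np.may_share_memory(a, out):
            a = a.copy()
        if b is not out and np.may_share_memory(b, out):
            b = b.copy()
    return np.add(a, b, out=out)

h = np.random.randn(300, 300)
add_with_copy_semantics(h, h.T, out=h)   # like h += h.T, but always correct
print(np.linalg.norm(h - h.T))           # 0.0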

@njsmith
Member

njsmith commented Jul 29, 2015

Sorry about getting confused about the different issue threads there.

The problem is that without definitely_share_memory, we would be straight up no joke removing the ability to efficiently perform operations like a[:, 0] += a[:, 1] from numpy. Maybe I'm wrong and everyone considers that a good tradeoff, but I really doubt you'll be able to get consensus that removing functionality is a good tradeoff. Numpy generally errs on the side of exposing the raw metal.

If definitely_share_memory is allowed to return False even in cases where the arrays do share memory, then it could still be useful at least for issuing warnings. Or if it managed to be both precise and essentially instantaneous for essentially every situation that arises in practice, then we might have a case for doing the copies. Certainly it seems plausible that some simple heuristics could catch 99% of the cases, but until someone does the work it's impossible to know.

@rkern
Member

rkern commented Jul 29, 2015

We have a firm upper bound on the number of dimensions and almost all arrays have a tiny number of dimensions. Without an implementation to test, I wouldn't discount it as slow out of hand.

@njsmith
Member

njsmith commented Jul 29, 2015

To be clear, I don't necessarily oppose making copies just-in-case myself, I merely doubt that everyone else can be convinced :-). If someone does want to try going forward with that, then I think the next step would be to send an email to numpy-discussion that (a) is very clear about what functionality would be removed, the consequences for existing code, and workarounds (or lack thereof), and (b) makes a case for doing it anyway.

(NB that such a case will become much stronger if it can be demonstrated that doing exact checking really is too slow. This issue keeps coming up but AFAICT so far no one has cared enough to even attempt the "proper" fix, so it's hard to evaluate...)

@argriffing
Contributor

Certainly it seems plausible that some simple heuristics could catch 99% of the cases, but until someone does the work it's impossible to know.

Would it seem as plausible to improve may_share_memory so that it rules out memory sharing in relatively simple cases like a[:, 0] vs. a[:, 1]? If so, would it make sense to then use copy semantics whenever there is a possible overlap? This idea would not require another function like definitely_share_memory.

@charris
Member

charris commented Jul 29, 2015

Ignoring the variable details of order of access, the contiguous case should be easy to handle. If the step size is fixed throughout, that is sort of like the contiguous case. The complicated cases can be done (I think) using GCD and checks, or we could simply make a copy in that situation.
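For the contiguous (and fixed-step) case the check really is just an interval intersection on the byte ranges; a small sketch, which is essentially the bounds test behind np.may_share_memory:

import numpy as np

def bounds_overlap(x, y):
    # Do the raw byte ranges spanned by the two arrays intersect?
    x_lo, x_hi = np.byte_bounds(x)
    y_lo, y_hi = np.byte_bounds(y)
    return x_lo < y_hi and y_lo < x_hi

a = np.ones(10)
print(bounds_overlap(a[:5], a[5:]))   # False: disjoint halves
print(bounds_overlap(a[:6], a[4:]))   # True: the slices share a[4:6]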

@njsmith
Member

njsmith commented Jul 29, 2015

Would it seem as plausible to improve may_share_memory so that it rules out memory sharing in relatively simple cases

I dunno, maybe! The thing to do is to try it :-)

@argriffing
Contributor

I dunno, maybe! The thing to do is to try it :-)

For what it's worth, I'd personally still favor copying if the possibility of aliasing is detected, even without waiting for someone to take up this challenge to improve may_share_memory.

@charris
Member

charris commented Jul 29, 2015

I think the priority should be correctness rather than efficiency. Here we have an example of someone getting wrong answers for reasons that were not obvious to him; code that was correct with one set of variables was not correct for another. That is probably not the sort of reputation that we want to develop.

@njsmith
Member

njsmith commented Jul 29, 2015

So make the case on the list?


@pv
Member

pv commented Jul 30, 2015

The usual structure (permute to sorted strides) stride[i]*shape[i] <= stride[i+1] could allow for useful heuristics:
https://gist.github.com/pv/fe407dee3170a7e2243a

As for solving the integer program exactly --- there are worst cases, and they appear also e.g. in LP-SOLVE (http://lpsolve.sourceforge.net/):

0;
x1 <= 100;
x2 <= 99;
x3 <= 96;
x4 <= 92;
x5 <= 91;
999*x1 + 1009*x2 + 1040*x3 + 1081*x4 + 1088*x5 = 20196;
int x1;
int x2;
int x3;
int x4;
int x5;

It takes 0.2 s to determine this is infeasible; maybe we could manage to do 1000x better with a more specialized algorithm. In contrast, replacing the rhs of the equality constraint by 10000 gives a solution in essentially zero time. Granted, the above corresponds to huge arrays with strange strides, but it probably means it would be best to cap the work done to guard against spending too much time solving the overlap problem.

@njsmith
Member

njsmith commented Jul 30, 2015

Here's a chapter that claims to provide an algorithm for solving bounded linear Diophantine equations (which is the formal name for our problem here... well, we care about the decision problem version, but in practice they're about the same): https://scholar.google.com/scholar?cluster=2827314101877297017


@charris
Member

charris commented Jul 30, 2015

I think intersection of (shifted?) lattices might be the thing to look at first. One problem is likely to be integer overflow, as some of the basis vectors of the intersection can be very, very large. In any case, that would give all common points, and from there it might be possible to check which of those are in bounds in a relatively efficient manner. I don't know if any of that would be efficient enough, however. And while the strict lattice case is straightforward, the shifted case looks to be tricky. I suspect William Stein is the guy to ask here, he did his PhD under Hendrik Lenstra ;)

@pv
Member

pv commented Jul 30, 2015

I read the Ramachandran paper some time ago. If the runtimes there are in seconds, it was of order 10 s for 9 variables (Java implementation).
Lenstra also has a bounded-Diophantine paper from 1998. A simple depth-first search in pure Python is in the 50 ms range for 10 vars.

@pv
Member

pv commented Jul 30, 2015

@charris: If I understood correctly, the unbounded problem is not hard, but the bounded one is.

@njsmith
Member

njsmith commented Jul 30, 2015

Yeah, IIUC finding the space of all possible solutions to a linear Diophantine equation is basically trivial (just apply Euclid's algorithm repeatedly, in polynomial time), but checking whether that space intersects with the bounds is still NP-hard.

For our purposes though there are probably a lot of cases where solving the unbounded equation will immediately reveal that overlap is impossible. If the two arrays are aligned and have the same size elements, then one can simplify out the element-size entry in the equation, and if you can do that then you can immediately rule out cases like a[:, 0] and a[:, 1] as being non-overlapping regardless of how large the arrays are.
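As a concrete illustration of ruling out cases like a[:, 0] vs a[:, 1] from the unbounded equation alone, here is a rough sketch (a hypothetical helper, not a NumPy function; it can only prove non-overlap, never overlap, and assumes equal itemsizes with element-aligned strides and offsets):

import numpy as np
from math import gcd
from functools import reduce

def provably_disjoint(x, y):
    # Element starts of x are base(x) + sum_i sx_i * k_i, and similarly for
    # y.  Dropping the shape bounds, a common address exists iff the gcd of
    # all strides (in element units) divides the base-address difference.
    if x.itemsize != y.itemsize:
        return False                                  # give up: maybe overlap
    item = x.itemsize
    delta = abs(x.__array_interface__['data'][0]
                - y.__array_interface__['data'][0])
    strides = [abs(s) for s in x.strides + y.strides]
    if delta % item or any(s % item for s in strides):
        return False                                  # not element-aligned
    g = reduce(gcd, [s // item for s in strides], 0)
    return g != 0 and (delta // item) % g != 0

a = np.ones((4, 4))
print(provably_disjoint(a[:, 0], a[:, 1]))   # True: the columns never meet
print(provably_disjoint(a[:, 0], a[0, :]))   # False: they do share a[0, 0]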

@charris
Member

charris commented Jul 30, 2015

I think we could subdivide the ranges of each array, say into the subarrays indexed by the index with the largest stride, and check for possible overlap between pairs. Then recurse on the possible pairs until the subarrays are small enough to check explicitly. Of course, that doesn't solve the problem of when it is OK to use arrays with common element addresses. It needs a bit of cleverness to avoid blowup of the n**2 variety, but that might be possible by subdividing just the largest (sub)arrays each time into, say, at most four parts.

Stride trick arrays pretty much mess up everything as any element can have multiple indices. But in the case of unique indices I think it is doable.

@pv
Member

pv commented Jul 30, 2015 via email

@charris
Member

charris commented Jul 30, 2015

In fact, in the unique index case the element addresses can be accessed in sorted order, so the upper bound on comparisons is O(n1 + n2); it's like merging two sorted lists.
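A brute-force reference for that merge idea might look like this (assuming equal, aligned itemsizes so that comparing element start addresses suffices; only meant as a sketch for small arrays):

import numpy as np

def element_addresses(x):
    # Byte address of the start of every element of x, sorted.
    base = x.__array_interface__['data'][0]
    idx = np.indices(x.shape).reshape(x.ndim, -1)
    return np.sort(base + np.dot(np.asarray(x.strides), idx))

def share_memory_exact(x, y):
    # Walk the two sorted address lists in step: O(n1 + n2) comparisons,
    # exactly like merging two sorted lists.
    a, b = element_addresses(x), element_addresses(y)
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            return True
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return False

a = np.ones((4, 4))
print(share_memory_exact(a[:, 0], a[:, 1]))     # False
print(share_memory_exact(a[1:, :], a[:-1, :]))  # True: the middle rows overlap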

@njsmith
Member

njsmith commented Jul 30, 2015

@pv: from a quick skim, it looks like that code is currently not taking into account non-unit element sizes - is that right? (It'd be easy to add, of course, as an extra two terms with unit stride and elsize bound.)

@pv
Member

pv commented Jul 30, 2015

Yes, it wasn't considering itemsize; added now as an extra dimension.

The unique index cases (usual strided arrays) are probably easy for the GCD algorithm, which it might be useful to actually try to show. The worst case scales exponentially with dimension, and you can easily find, by random trial, cases in the 0.1 s range (for the C code) at 10 variables. I don't think these correspond to something you'd ever have to solve when dealing with Numpy arrays, but I'd prefer to cap the maximum work done in order to not add potentially exponentially long runtimes.

@argriffing
Contributor

Would it make sense to upgrade the gist to a repo? I guess it's a work in progress to reduce the false positive rate of np.may_share_memory while keeping zero false negatives.

@pv
Member

pv commented Aug 3, 2015 via email

@argriffing
Contributor

Yes, it wasn't considering itemsize; added now as an extra dimension.

This has been added to the cython may_share_memory, but not to the maybe.py testing code, right?

Also I think it causes an issue with the scalar detection. I think scalars with itemsize larger than 1 are no longer detected as scalars?

@argriffing
Contributor

In test_ramachandran I think the n in for n in xrange(2, 64): is clobbered by d_ub = sum(a*(n-1) for a, n in zip(A, N)).

@pv
Member

pv commented Aug 5, 2015

Here goes, not yet for ufuncs though: gh-6166
