Broken __iadd__ behavior on views #6119

Closed
akhmerov opened this issue Jul 26, 2015 · 43 comments

Comments

@akhmerov

Using += with a view of the same array is sometimes broken. To reproduce, run:

import numpy as np

for n in range(80, 100):
    h = np.random.randn(n, n)
    h += h.T                              # in-place add with a transposed view of h
    print(n, np.linalg.norm(h - h.T))

This produces a bunch of zeros (correct result), followed by a bunch of finite numbers of order 1 (wrong). For me the result is wrong for n > 90.

Replacing h += h.T with h = h + h.T or anything equivalent fixes the issue.

I have verified that the issue appears in a Linux pip installation of numpy v1.9.2, under both Python 2 and Python 3. In particular, it is reproducible on http://try.jupyter.org.

@charris
Member

charris commented Jul 26, 2015

The problem here is that numpy's __iadd__ actually works in place and h.T is a view of h, so the data in h.T gets overwritten during the add. It has been proposed that we somehow attempt to detect overlap of the right and left sides, but the problem is difficult in general, and the order in which the data is overwritten is not guaranteed. The best current advice is that the programmer know what is going on and avoid the situation. I'm going to close this as a known problem, but you should feel free to open another discussion on the mailing list ;)
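For illustration, here is a minimal sketch of the difference (assuming a NumPy of this era, where no overlap handling is done for the in-place form; the exact n at which it breaks depends on buffering details):

import numpy as np

n = 200
h = np.random.randn(n, n)

bad = h.copy()
bad += bad.T                # roughly np.add(bad, bad.T, out=bad): the add
                            # writes into the same buffer it is still reading

good1 = h + h.T             # allocates a fresh output array first
good2 = h.copy()
good2 += good2.T.copy()     # or materialize the view before adding in place

print(np.linalg.norm(good1 - good1.T))   # 0.0
print(np.linalg.norm(good2 - good2.T))   # 0.0
print(np.linalg.norm(bad - bad.T))       # typically nonzero here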

@charris charris closed this as completed Jul 26, 2015
@argriffing
Contributor

I'm going to close this as a known problem

In particular #1683, #4802, #5241.

@pv
Member

pv commented Jul 27, 2015

My understanding of this issue was that it indeed is a bug, with the correct behavior being that of x += x.T.copy(). However, it needs to be fixed in such a way that performance is not sacrificed in the cases that work. The complete question "does the operation give the correct result without a copy" seems difficult to answer exactly, but it should be possible to detect the most important simple cases that work --- probably only 1D cases come into question here, because for multidim arrays there are memory access pattern optimizations.

@akhmerov
Author

I've sent an email to the mailing list as suggested by @charris. I might be missing something there, but isn't numpy.may_share_memory the exact same check @pv discusses?

@pv
Member

pv commented Jul 27, 2015

Not completely the same --- consider x[:-1] += x[1:], which is OK due to the traversal order. For ndim > 1 there are no guarantees about traversal order (or, they are more difficult to predict), so a similar overlap check would indeed work there. may_share_memory can also return false positives, so it's not a complete answer, but it is probably good enough for this purpose.
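A small illustration of both points (the left-to-right traversal of the 1D case is an implementation detail, not a documented guarantee):

import numpy as np

x = np.arange(6.0)
x[:-1] += x[1:]         # happens to be safe: each RHS element is read
print(x)                # before it is overwritten, given the traversal order

a = np.ones((4, 4))
# may_share_memory only compares overall memory bounds, so it reports a
# possible overlap even for two columns that never touch the same element:
print(np.may_share_memory(a[:, 0], a[:, 1]))   # True (a false positive)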

@akhmerov
Author

I would already be happy if the arrays with ndim > 1 were covered. Are there any other important and safe use cases?

@pv
Member

pv commented Jul 27, 2015

I don't recall seeing too many "safe" examples given, although there may be some additional ones in old threads on the mailing list. I suspect there's not much (correct) code that actually relies on this as a performance optimization, as it is cumbersome to reason about, but AFAIK it worked like this also in Numeric (pre-2006), so some may exist.

@jaimefrio
Member

I would be a little wary of providing features, or documenting behaviors, that rely on implementation details like array traversal order. It may be shooting ourselves in the foot if someone ever comes up with some iterator optimization scheme that needs to change this, and we are stuck with backwards compatibility.

You can get the original example to work playing around with the iterator buffer size, but it's the kind of thing that should come with a big fat DO NOT TRY THIS AT HOME sign.

@pv
Member

pv commented Jul 27, 2015

@jaimefrio: the suggestion here (gh-1683) was to have it always behave as if a copy of the RHS was made. The ufunc implementation in Numpy knows what order it is going to do the traversal in (and about buffering etc.), and can skip the copies in some cases. The current behavior is implementation-defined, but as seen from the issue being discussed at regular intervals, it's fairly surprising.
I see divergent opinions (gh-5241) have also been pointed out, so the mailing list is probably a better place for this...

@akhmerov
Author

If the current behavior does qualify as a bug, then wouldn't the optimal resolution be to first fix it, while only accounting for the most common safe case (np.may_share_memory(rhs, lhs) == False)?

On a related note, is there a way to automatically test a few thousand lines of an existing codebase for possible appearances of this undesired behavior?

@pv
Member

pv commented Jul 27, 2015

Yes --- given the assumption it is a bug and not a wontfix. (However, allowances for the 1D cases are probably fairly simple to implement.)

At runtime, you can use np.set_numeric_ops to replace the binary operations with custom ones that use np.may_share_memory to check for overlaps between input and output arrays. The functions take 3 arguments when in-place. Additionally, numpy.dot and other functions with output arguments can be monkeypatched. Running a test suite with such modifications will likely catch many issues of this type.
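A rough sketch of that kind of runtime check, assuming the np.set_numeric_ops hook (as it existed at the time) and that in-place operations pass the output array as a third positional argument; the test is deliberately conservative:

import warnings
import numpy as np

def checked_add(a, b, *out):
    # For a += b, the registered 'add' is called with the output array as
    # a third argument; warn if the RHS may overlap with the destination.
    if out and out[0] is not b and np.may_share_memory(b, out[0]):
        warnings.warn("in-place add with possibly overlapping operands",
                      stacklevel=2)
    return np.add(a, b, *out)

# Route ndarray's + and += through the wrapper; subtract, multiply, etc.
# (and functions like np.dot) could be wrapped the same way.
np.set_numeric_ops(add=checked_add)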

@akhmerov
Author

Firstly: should I continue the discussion here (the issue is still closed), in #1683 (same issue, still open), or on http://thread.gmane.org/gmane.comp.python.numeric.general/60880/focus=60884 ?

I would strongly argue for considering it a bug for the following reasons:

  • The wrong behavior is not mentioned anywhere: not in the add or ufunc docstrings, nor in the docstrings of other affected functions such as dot.
  • Basic operations don't even have user-accessible docstrings. For example, I only realized after @pv's clarification that a.__iadd__(b) is the same as add(a, b, out=a) (see the short sketch after this list).
  • The overall numpy design ensures that things don't fail. For example, it isn't easy to turn off bounds checks even though they reduce performance.
  • Existing efficient code like a[:-1] += a[1:] relies on an assumption that is not declared anywhere in numpy. It may well break in the future. For an extremely similar situation see https://bugzilla.redhat.com/show_bug.cgi?id=638477
  • As discussed above, and seemingly contrary to what @charris wrote in his initial comment, the check can apparently be implemented without compromising efficiency, since it would essentially amount to a np.may_share_memory call.
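A tiny sketch of that equivalence, for the record:

import numpy as np

a = np.arange(4.0)
b = np.arange(4.0)

r = a.__iadd__(b)                  # what `a += b` invokes
print(r is a)                      # True: a itself was modified in place
print(np.add(a, b, out=a) is a)    # True: the equivalent ufunc spelling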

@njsmith
Member

njsmith commented Jul 29, 2015

I see two potentially productive action items: (a) improve the docs, (b) implement a useful definitely_share_memory. (may_share_memory does not help for the reasons mentioned in my comment above - note my example does not depend on any quirky or undefined behavior at all). Either or both would be welcome.

@akhmerov
Author

Sorry, which example? I don't see any other messages from you in this issue discussion.

@argriffing
Contributor

@akhmerov I'm quite sure that @njsmith is referring to this kind of thing:

>>> import numpy as np
>>> a = np.ones((4, 4))
>>> a
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])
>>> a[:, 0] += a[:, 1]
>>> a
array([[ 2.,  1.,  1.,  1.],
       [ 2.,  1.,  1.,  1.],
       [ 2.,  1.,  1.,  1.],
       [ 2.,  1.,  1.,  1.]])
>>> np.may_share_memory(a[:, 0], a[:, 1])
True

In this example we are adding the second column of a to the first column of a in-place. When @njsmith calls this a 'trivial safe case' in #5241 (comment), I think he is referring to the fact that the behavior of this example doesn't depend on the traversal order of the copy, and in particular would not be affected by issues like https://bugzilla.redhat.com/show_bug.cgi?id=638477. Therefore it's not completely correct to say that "the check can seemingly be implemented without compromising efficiency since it would essentially amount to a np.may_share_memory call".

For what it's worth, I'd personally be OK with a policy where a += b means the same thing as a += b.copy() for ndarrays even if it cannot be implemented without any loss of efficiency.

Also I don't understand @njsmith's suggestion that "what we need here is a definitely_share_memory function." If this function has leeway to return False even if memory is shared, then this could not be used to implement a policy that a += b always has the same behavior as a += b.copy() for ndarrays, right?

@rkern
Member

rkern commented Jul 29, 2015

I believe that @njsmith is suggesting that definitely_share_memory() be implemented to return True if and only if an addressable element of one array does overlap in memory with an addressable element of the other array.

@pv
Member

pv commented Jul 29, 2015 via email

I'm not sure it's productive to insist on definitely_share_memory(), given that the scaling is non-polynomial (in the number of dimensions), so it's likely going to be slow for high-dimensional arrays. For a compiler this might be OK, but Numpy has to do the calculation on the fly.

@charris
Member

charris commented Jul 29, 2015

The simplest solution might be to use the copy semantics whenever there is a possible overlap. Less efficient in some cases, but that is pretty much my standard procedure to avoid problems. The same might be appropriate for the out parameter of binary ufuncs.

@njsmith I mentioned the memcpy case because we don't have documented behavior for the one-dimensional case. If we want to keep that behavior, it should be documented and tested.
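A minimal sketch of those copy semantics for a single ufunc, using the conservative np.may_share_memory test (a hypothetical wrapper, not how np.add is actually implemented):

import numpy as np

def add_with_copy_semantics(a, b, out=None):
    # If either input may alias the output (and is not the output itself),
    # operate on a defensive copy so the result matches out[...] = a + b.
    if out is not None:
        if a is not out and np.may_share_memory(a, out):
            a = a.copy()
        if b is not out and np.may_share_memory(b, out):
            b = b.copy()
    return np.add(a, b, out=out)

h = np.random.randn(300, 300)
add_with_copy_semantics(h, h.T, out=h)   # like h += h.T, but always correct
print(np.linalg.norm(h - h.T))           # 0.0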

@njsmith
Member

njsmith commented Jul 29, 2015

Sorry about getting confused about the different issue threads there.

The problem is that without definitely_share_memory, we would be straight up no joke removing the ability to efficiently perform operations like a[:, 0] += a[:, 1] from numpy. Maybe I'm wrong and everyone considers that a good tradeoff, but I really doubt you'll be able to get consensus that removing functionality is a good tradeoff. Numpy generally errs on the side of exposing the raw metal.

If definitely_share_memory is allowed to return False even in cases where the arrays do share memory, then it could still be useful at least for issuing warnings. Or if it managed to be both precise and essentially instantaneous for essentially every situation that arises in practice, then we might have a case for doing the copies. Certainly it seems plausible that some simple heuristics could catch 99% of the cases, but until someone does the work it's impossible to know.

@rkern
Member

rkern commented Jul 29, 2015

We have a firm upper bound on the number of dimensions and almost all arrays have a tiny number of dimensions. Without an implementation to test, I wouldn't discount it as slow out of hand.

@njsmith
Member

njsmith commented Jul 29, 2015

To be clear, I don't necessarily oppose making copies just-in-case myself, I merely doubt that everyone else can be convinced :-). If someone does want to try going forward with that, then I think the next step would be to send an email to numpy-discussion that (a) is very clear about what functionality would be removed, the consequences for existing code, and workarounds (or lack thereof), and (b) makes a case for doing it anyway.

(NB that such a case will become much stronger if it can be demonstrated that doing exact checking really is too slow. This issue keeps coming up but AFAICT so far no one has cared enough to even attempt the "proper" fix, so it's hard to evaluate...)

@argriffing
Contributor

Certainly it seems plausible that some simple heuristics could catch 99% of the cases, but until someone does the work it's impossible to know.

Would it seem as plausible to improve may_share_memory so that it rules out memory sharing in relatively simple cases like a[:, 0] vs. a[:, 1]? If so, would it make sense to then use copy semantics whenever there is a possible overlap? This idea would not require another function like definitely_share_memory.

@charris
Member

charris commented Jul 29, 2015

Ignoring the variable details of order of access, the contiguous case should be easy to handle. If the step size is fixed throughout, that is sort of like the contiguous case. The complicated cases can be done (I think) using GCD and checks, or we could simply make a copy in that situation.
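For the contiguous (and fixed-step) case the check really is just an interval intersection on the byte ranges; a small sketch, which is essentially the bounds test behind np.may_share_memory:

import numpy as np

def bounds_overlap(x, y):
    # Do the raw byte ranges spanned by the two arrays intersect?
    x_lo, x_hi = np.byte_bounds(x)
    y_lo, y_hi = np.byte_bounds(y)
    return x_lo < y_hi and y_lo < x_hi

a = np.ones(10)
print(bounds_overlap(a[:5], a[5:]))   # False: disjoint halves
print(bounds_overlap(a[:6], a[4:]))   # True: the slices share a[4:6]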

@njsmith
Member

njsmith commented Jul 29, 2015

Would it seem as plausible to improve may_share_memory so that it rules out memory sharing in relatively simple cases

I dunno, maybe! The thing to do is to try it :-)

@argriffing
Contributor

I dunno, maybe! The thing to do is to try it :-)

For what it's worth, I'd personally still favor copying if the possibility of aliasing is detected, even without waiting for someone to take up this challenge to improve may_share_memory.

@charris
Member

charris commented Jul 29, 2015

I think the priority should be correctness rather than efficiency. Here we have an example of someone getting wrong answers for reasons that were not obvious to him; code that was correct with one set of variables was not correct for another. That is probably not the sort of reputation that we want to develop.

@njsmith
Member

njsmith commented Jul 29, 2015

So make the case on the list?


@pv
Member

pv commented Jul 30, 2015

The usual structure (permute to sorted strides) stride[i]*shape[i] <= stride[i+1] could allow for useful heuristics:
https://gist.github.com/pv/fe407dee3170a7e2243a

As for solving the integer program exactly --- there are worst cases, and they appear also e.g. in LP-SOLVE (http://lpsolve.sourceforge.net/):

0;
x1 <= 100;
x2 <= 99;
x3 <= 96;
x4 <= 92;
x5 <= 91;
999*x1 + 1009*x2 + 1040*x3 + 1081*x4 + 1088*x5 = 20196;
int x1;
int x2;
int x3;
int x4;
int x5;

It takes 0.2 s to determine this is infeasible; maybe we could manage to do 1000x better with a more specialized algorithm. In contrast, replacing the rhs of the equality constraint by 10000 gives a solution in essentially zero time. Granted, the above corresponds to huge arrays with strange strides, but it probably means it would be best to cap the work done to guard against spending too much time solving the overlap problem.

@njsmith
Member

njsmith commented Jul 30, 2015

Here's a chapter that claims to provide an algorithm for solving bounded linear Diophantine equations (which is the formal name for our problem here... well, we care about the decision problem version, but in practice they're about the same): https://scholar.google.com/scholar?cluster=2827314101877297017


@charris
Member

charris commented Jul 30, 2015

I think intersection of (shifted?) lattices might be the thing to look at first. One problem is likely to be integer overflow, as some of the basis vectors of the intersection can be very, very large. In any case, that would give all common points, and from there it might be possible to check which of those are in bounds in a relatively efficient manner. I don't know if any of that would be efficient enough, however. And while the strict lattice case is straightforward, the shifted case looks to be tricky. I suspect William Stein is the guy to ask here, he did his PhD under Hendrik Lenstra ;)

@pv
Member

pv commented Jul 30, 2015

I read the Ramachandran paper some time ago. If the runtimes there are in seconds, it was of order 10 s for 9 variables (Java implementation).
Lenstra also has a bounded-Diophantine paper from 1998. A simple depth-first search in pure Python is in the 50 ms range for 10 vars.

@pv
Member

pv commented Jul 30, 2015

@charris: If I understood correctly, the unbounded problem is not hard, but the bounded one is.

@njsmith
Member

njsmith commented Jul 30, 2015

Yeah, IIUC finding the space of all possible solutions to a linear Diophantine equation is basically trivial (just apply Euclid's algorithm repeatedly, in polynomial time), but checking whether that space intersects with the bounds is still NP-hard.

For our purposes though there are probably a lot of cases where solving the unbounded equation will immediately reveal that overlap is impossible. If the two arrays are aligned and have the same size elements, then one can simplify out the element-size entry in the equation, and if you can do that then you can immediately rule out cases like a[:, 0] and a[:, 1] as being non-overlapping regardless of how large the arrays are.
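As a concrete illustration of ruling out cases like a[:, 0] vs a[:, 1] from the unbounded equation alone, here is a rough sketch (a hypothetical helper, not a NumPy function; it can only prove non-overlap, never overlap, and assumes equal itemsizes with element-aligned strides and offsets):

import numpy as np
from math import gcd
from functools import reduce

def provably_disjoint(x, y):
    # Element starts of x are base(x) + sum_i sx_i * k_i, and similarly for
    # y.  Dropping the shape bounds, a common address exists iff the gcd of
    # all strides (in element units) divides the base-address difference.
    if x.itemsize != y.itemsize:
        return False                                  # give up: maybe overlap
    item = x.itemsize
    delta = abs(x.__array_interface__['data'][0]
                - y.__array_interface__['data'][0])
    strides = [abs(s) for s in x.strides + y.strides]
    if delta % item or any(s % item for s in strides):
        return False                                  # not element-aligned
    g = reduce(gcd, [s // item for s in strides], 0)
    return g != 0 and (delta // item) % g != 0

a = np.ones((4, 4))
print(provably_disjoint(a[:, 0], a[:, 1]))   # True: the columns never meet
print(provably_disjoint(a[:, 0], a[0, :]))   # False: they do share a[0, 0]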

@charris
Member

charris commented Jul 30, 2015

I think we could subdivide the ranges of each array, say into the subarrays indexed by the index with the largest stride, and check for possible overlap between pairs. Then recurse on the possible pairs until the subarrays are small enough to check explicitly. Of course, that doesn't solve the problem of when it is OK to use arrays with common element addresses. It needs a bit of cleverness to avoid blowup of the n**2 variety, but that might be possible by subdividing just the largest (sub)arrays each time into, say, at most four parts.

Stride trick arrays pretty much mess up everything as any element can have multiple indices. But in the case of unique indices I think it is doable.

@pv
Member

pv commented Jul 30, 2015 via email

@charris
Member

charris commented Jul 30, 2015

In fact, in the unique index case the element addresses can be accessed in sorted order, so the upper bound on comparisons is O(n1 + n2); it's like merging two sorted lists.
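A brute-force reference for that merge idea might look like this (assuming equal, aligned itemsizes so that comparing element start addresses suffices; only meant as a sketch for small arrays):

import numpy as np

def element_addresses(x):
    # Byte address of the start of every element of x, sorted.
    base = x.__array_interface__['data'][0]
    idx = np.indices(x.shape).reshape(x.ndim, -1)
    return np.sort(base + np.dot(np.asarray(x.strides), idx))

def share_memory_exact(x, y):
    # Walk the two sorted address lists in step: O(n1 + n2) comparisons,
    # exactly like merging two sorted lists.
    a, b = element_addresses(x), element_addresses(y)
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            return True
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return False

a = np.ones((4, 4))
print(share_memory_exact(a[:, 0], a[:, 1]))     # False
print(share_memory_exact(a[1:, :], a[:-1, :]))  # True: the middle rows overlap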

@njsmith
Member

njsmith commented Jul 30, 2015

@pv: from a quick skim, it looks like that code is currently not taking into account non-unit element sizes - is that right? (It'd be easy to add, of course, as an extra two terms with unit stride and elsize bound.)

@pv
Member

pv commented Jul 30, 2015

Yes, it wasn't considering itemsize; added now as an extra dimension.

The unique index cases (usual strided arrays) are probably easy for the GCD algorithm, which it might be useful to actually try to show. The worst case scales exponentially with dimension, and you can easily find, by random trial, cases in the 0.1 s range (for the C code) at 10 variables. I don't think these correspond to something you'd ever have to solve when dealing with Numpy arrays, but I'd prefer to cap the maximum work done in order to not add potentially exponentially long runtimes.

@argriffing
Contributor

Would it make sense to upgrade the gist to a repo? I guess it's a work in progress to reduce the false positive rate of np.may_share_memory while keeping zero false negatives.

@pv
Member

pv commented Aug 3, 2015 via email

@argriffing
Contributor

Yes, it wasn't considering itemsize; added now as an extra dimension.

This has been added to the cython may_share_memory, but not to the maybe.py testing code, right?

Also I think it causes an issue with the scalar detection. I think scalars with itemsize larger than 1 are no longer detected as scalars?

@argriffing
Contributor

In test_ramachandran I think the n in for n in xrange(2, 64): is clobbered by d_ub = sum(a*(n-1) for a, n in zip(A, N)).

@pv
Member

pv commented Aug 5, 2015

Here goes, not yet for ufuncs though: gh-6166
