Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Correctly distinguish between 0d arrays and scalars in MaskedArray.__getitem__ #8905

Merged
merged 5 commits into from
Apr 20, 2017

Conversation

eric-wieser
Copy link
Member

@eric-wieser eric-wieser commented Apr 6, 2017

This fixes #8684.

There's still a can of worms waiting to be opened here regarding object arrays that themselves contain ndarrays, but they would fail before this patch too

Edit: This can of worms has been opened, and fixed at the expense of speed

@ahaldane
Copy link
Member

ahaldane commented Apr 19, 2017

So it looks like this is fixing two things.

  1. Indexing maskedarray with ... gives masked #8684 is fixed by effectively changing if not getattr(dout, 'ndim', False) to not isinstance(dout, np.ndarray).

  2. There are special cases for object arrays.

If it is possible and not too clumsy, it might be nice to separate out the object array case into its own special case if-statement (eg if self.dtype.type is np.object_) at the start. Right now I find it confusing as the object array special cases are interleaved with the rest. It seems to me most of the extra code and rearrangement in this PR is to account for the object array case, right?, and all that is needed to fix #8684 is the one-line change noted above.

@eric-wieser
Copy link
Member Author

eric-wieser commented Apr 19, 2017

Replacing just getattr(dout, 'ndim', False) doesn't do the job.

The problem is that some array subclasses decide that indexing should never cause scalars to unwrap, such that both data[0] and data[0,...] (on a 1-d array) return ndarrays.

However, there is a test in ma/test_subclassing that even when the data array does this, that m[0] is masked. So therefore, the check of isinstance(dout, np.ndarray) is not sufficient. I'll update that bit of the tests, since it's missing tests of ....

There are special cases for object arrays.

These were already here, and unfortunately, by design (this is documented behaviour).

Independently, it happens that one of the heuristics for "is this a scalar" is "is the dtype of the result array different from the input", which can only happen for object arrays

@eric-wieser eric-wieser force-pushed the fix-8684 branch 2 times, most recently from f1c55a1 to d194239 Compare April 19, 2017 09:09
numpy/ma/core.py Outdated
expect_scalar = not isinstance(mout, np.ndarray)
else:
# much harder to tell if we should get a scalar here - not quite
# correct
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In an ideal world, we would call getmaskarray(), and do the same as the above code path. Unfortunately, this would add a hefty performance cost for the most common cases.

@eric-wieser
Copy link
Member Author

If it is possible and not too clumsy, it might be nice to separate out the object array case into its own special case if-statement

You're on to something here - there were two different checks for object arrays - which happened to be similar, but there for totally different reasons. The latest patch separates the heuristic from the behaviour, and refers to the PR that introduced the special case for object arrays

dout = mvoid(dout, mask=mask, hardmask=self._hardmask)
return mvoid(dout, mask=mout, hardmask=self._hardmask)

# special case introduced in gh-5962
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#5962. This elif could be removed to fix #8906, but that might have backwards-compatibility implications

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the test failure can be avoided by having here

elif self.dtype.type is np.object_ and dout is not masked and isinstance(dout, np.ndarray):

(And, yes, having a different .data property is more than a little annoying. But it probably was a logical choice for the original Column class, and then what do you do for MaskedColumn...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this is probably yet another argument to not have a special case for object arrays holding ndarray and just return masked...

Do note, that #5962 didn't quite introduce this -- it just ensured that the pre-existing behaviour continued...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't see where the MaskedConstant is being __new__'d from, causing the problem you described below

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, yes I do - this is #8679 rearing its head. np.ma.array(np.ma.masked, copy=False) creates a new masked constant!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, that closes the loop...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a commit to fix that - but I need your fix as well

@eric-wieser
Copy link
Member Author

@ahaldane: Hit it with another rewrite to try and separate more things.

If we got rid of the nomask shorthand, most of this code could go

return MaskedArray(dout, mask=True)
else:
if mout:
return masked
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR breaks a test in astropy in a very strange way, in that what is returned here fails a test where masked_column[0] is np.ma.masked (even though both seem to be masked). I'm utterly confused and this may just be a build issue, but I don't get the same problem with current master.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you construct a minimal example, or at least point me to that test?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eric-wieser - it is this test
A more minimal example:

from astropy.table import MaskedColumn
import numpy as np
ma = MaskedColumn([1], dtype='O')
ma.mask = True
ma[0]
# masked
ma[0] is np.ma.masked
# False
type(ma[0]) is type(np.ma.masked)
# True

Since we overwrite __getitem__ with a CPython shim (for speed with ndarray, I also checked that:

np.ma.MaskedArray.__getitem__(ma, 0) is np.ma.masked
# False

But it has something to do with our subclass:

np.ma.MaskedArray.__getitem__(ma.view(np.ma.MaskedArray), 0) is np.ma.masked
# True

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For completeness, I added just before this return in core.py:

print(mout, dout, masked is np.ma.masked)

and the output always is True 1 True.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably, np.ma.MaskedArray.__getitem__(ma, 0) still looks like ma.masked?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dout is masked, or dout looks like masked but upon closer inspection is not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would using _data instead of data work, or would that break other bits of MaskedColumn?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, using _data would work -- that is not used in MaskedColumn. But it needs some thought whether that is the right approach, i.e., to what extent one wants subclasses to be able to override how data is interpreted. Though on the other hand arguably this is so integral to how MaskedArray works that it would be reasonable to use a private property, which indicates more clearly that "if you override this, expect breakage".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either way, I managed to reproduce this problem without touching the data attribute at all, so I'm inclined to leave it as is for now (see the added tests)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

The correct behavior here is to return a new MaskedArray, because `MaskedConstant` is pretending to be a scalar
@eric-wieser
Copy link
Member Author

Thanks @mhvk for spotting the problem.

There's an unrelated bugfix commit in here (BUG: np.ma.array(masked) should not return a new MaskedConstant) now too

# All element masked
self.assertTrue(isinstance(a[-2], MaskedArray))
assert_equal(type(a[-2]), mvoid)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behaviour here hasn't changed, the test was just less strict than it should be

numpy/ma/core.py Outdated
else:
# we must never do .view(MaskedConstant), as that would create a new
# instance of np.ma.masked, which would be confusing
if isinstance(data, cls) and subok and not isinstance(data, MaskedConstant):
_data = ndarray.view(_data, type(data))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get so confused about this: why exactly is a view taken here? For isinstance(data, cls) to hold, data must be a subclass of ndarray as well, and if subok was true, then the above assignment of _data = np.array(data, subok=sub) would already have ensured that type(_data) is type(data) (independently of the other arguments). Am I missing something? If so, do add that as a comment....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason might be to ensure that a copy of the wrapper is made - as otherwise we'd modify properties of the input array. isinstance(data, cls) is checking for a MaskedArray subclass.

All I did here was swap the order of the if in order to make it easier to see what is happening.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think that makes sense. If you do have another round, do add a comment...

_mask = self._mask

def _is_scalar(m):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, it is not all that helpful to have these as helper functions, and I wonder a bit about the additional hit in performance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the other one has to be a helper function in order for the flow control to work. I made this one a helper function because we need it twice, and it makes the lower-down code generally nicer.

I could move these to be free functions, if you think they'd be better outside the method

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm heavily biased by having used in-method functions only in cases where I want access to some of the local variables; as this is not the case here, moving it outside might be better indeed. But this is not a strong preference, so don't worry about -- my main preference would be not to have them at all, but I can see that the code flow would be a bit odd (probably would need to start by assigning scalar_expected = None, and then at the end check whether it has change; not great either).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only concern with moving it outside, is that it becomes distant compared to the rest of the logic. On the other hand, there's a good chance there's a bug waiting in __setitem__ that will need the same code

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in cases where I want access to some of the local variables

Note that the original implementation did exactly that, but I realized it was pretty trivial to convert them into parameters, as there were only two

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe best to leave as is -- this is substantially clearer than it was!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving it as is sounds good to me!

I think I've addressed your other concerns too - and I reckon we leave any kind of rollback of #5962 to another PR, since it's not needed to fix the bug.

@@ -4376,16 +4376,27 @@ def test_getitem(self):
[1, 0, 0, 0, 0, 0, 0, 0, 1, 0])),
dtype=[('a', bool), ('b', bool)])
# No mask
self.assertTrue(isinstance(a[1], MaskedArray))
assert_equal(type(a[1]), mvoid)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps also check equality of _data and _mask as done for element 0 below?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll wrap this up in a helper to avoid the code duplication

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing so

@mhvk
Copy link
Contributor

mhvk commented Apr 19, 2017

This looks good, modulo the small comments (of which the in-method functions is just a preference).

numpy/ma/core.py Outdated

# special case introduced in gh-5962
elif (self.dtype.type is np.object_ and
isinstance(dout, np.ndarray) and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hate giving pep8 comments, but no need for the extra indent here (only for if clauses is it needed, since if ( takes 4 characters...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in, you want columnar alignment here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, with self and isinstance aligned; that way it is clearest what the parentheses gather together.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I wasn't quite sure whether alignment or indenting was right here. Does PEP8 actually specify to wrap if differently to elif?

Either way, fixed, since you seem to think this is wrong, and I knew that I was unsure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alignment is standard, with the problem of if ( being 4 characters very specifically discussed - https://www.python.org/dev/peps/pep-0008/#indentation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, my real "knowledge" just comes from using emacs with flymake -- see http://docs.astropy.org/en/latest/development/codeguide_emacs.html -- which handily marks things that don't follow style, or uninitialized/unused variables, etc.

@mhvk
Copy link
Contributor

mhvk commented Apr 19, 2017

I'm happy with this. @ahaldane - do you want to have a look as well? (I'll merge tomorrow unless I hear otherwise).

Copy link
Member

@ahaldane ahaldane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, though it's unfortunate we need to take so many special cases into account.

I'm happy you wrote so many tests - I think these will be very helpful to anyone who needs to modiify this code in the future.

I'll leave it open overnight. First one to get to it tomorrow can merge!

@eric-wieser
Copy link
Member Author

eric-wieser commented Apr 20, 2017

it's unfortunate we need to take so many special cases into account.

Note that the entire thing could become iscalar = not isinstance(getmaskarray(self)[indx]), but for performance does a bunch of faster checks first, since getmaskarray could incur a hefty allocation.

@mhvk mhvk merged commit 1a9f43f into numpy:master Apr 20, 2017
@mhvk mhvk added this to the 1.13.0 release milestone Apr 20, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Indexing maskedarray with ... gives masked
3 participants