BUG: Correctly distinguish between 0d arrays and scalars in MaskedArray.getitem #8905

eric-wieser · 2017-04-06T23:37:35Z

This fixes #8684.

~~There's still a can of worms waiting to be opened here regarding object arrays that themselves contain ndarrays, but they would fail before this patch too~~

Edit: This can of worms has been opened, and fixed at the expense of speed

ahaldane · 2017-04-19T03:54:25Z

So it looks like this is fixing two things.

Indexing maskedarray with ... gives masked #8684 is fixed by effectively changing if not getattr(dout, 'ndim', False) to not isinstance(dout, np.ndarray).
There are special cases for object arrays.

If it is possible and not too clumsy, it might be nice to separate out the object array case into its own special case if-statement (eg if self.dtype.type is np.object_) at the start. Right now I find it confusing as the object array special cases are interleaved with the rest. It seems to me most of the extra code and rearrangement in this PR is to account for the object array case, right?, and all that is needed to fix #8684 is the one-line change noted above.

eric-wieser · 2017-04-19T09:01:51Z

Replacing just getattr(dout, 'ndim', False) doesn't do the job.

The problem is that some array subclasses decide that indexing should never cause scalars to unwrap, such that both data[0] and data[0,...] (on a 1-d array) return ndarrays.

However, there is a test in ma/test_subclassing that even when the data array does this, that m[0] is masked. So therefore, the check of isinstance(dout, np.ndarray) is not sufficient. I'll update that bit of the tests, since it's missing tests of ....

There are special cases for object arrays.

These were already here, and unfortunately, by design (this is documented behaviour).

Independently, it happens that one of the heuristics for "is this a scalar" is "is the dtype of the result array different from the input", which can only happen for object arrays

eric-wieser · 2017-04-19T09:29:03Z

numpy/ma/core.py

+            expect_scalar = not isinstance(mout, np.ndarray)
+        else:
+            # much harder to tell if we should get a scalar here - not quite
+            # correct


In an ideal world, we would call getmaskarray(), and do the same as the above code path. Unfortunately, this would add a hefty performance cost for the most common cases.

eric-wieser · 2017-04-19T10:13:33Z

If it is possible and not too clumsy, it might be nice to separate out the object array case into its own special case if-statement

You're on to something here - there were two different checks for object arrays - which happened to be similar, but there for totally different reasons. The latest patch separates the heuristic from the behaviour, and refers to the PR that introduced the special case for object arrays

eric-wieser · 2017-04-19T10:16:53Z

numpy/ma/core.py

-                dout = mvoid(dout, mask=mask, hardmask=self._hardmask)
+                return mvoid(dout, mask=mout, hardmask=self._hardmask)
+
+            # special case introduced in gh-5962


#5962. This elif could be removed to fix #8906, but that might have backwards-compatibility implications

So, the test failure can be avoided by having here

elif self.dtype.type is np.object_ and dout is not masked and isinstance(dout, np.ndarray):

(And, yes, having a different .data property is more than a little annoying. But it probably was a logical choice for the original Column class, and then what do you do for MaskedColumn...)

Overall, this is probably yet another argument to not have a special case for object arrays holding ndarray and just return masked...

Do note, that #5962 didn't quite introduce this -- it just ensured that the pre-existing behaviour continued...

I still don't see where the MaskedConstant is being __new__'d from, causing the problem you described below

Oh wait, yes I do - this is #8679 rearing its head. np.ma.array(np.ma.masked, copy=False) creates a new masked constant!

Ah, yes, that closes the loop...

Added a commit to fix that - but I need your fix as well

This fixes numpy#8684

eric-wieser · 2017-04-19T13:51:41Z

@ahaldane: Hit it with another rewrite to try and separate more things.

If we got rid of the nomask shorthand, most of this code could go

mhvk · 2017-04-19T14:15:28Z

numpy/ma/core.py

-                return MaskedArray(dout, mask=True)
+            else:
+                if mout:
+                    return masked


This PR breaks a test in astropy in a very strange way, in that what is returned here fails a test where masked_column[0] is np.ma.masked (even though both seem to be masked). I'm utterly confused and this may just be a build issue, but I don't get the same problem with current master.

Can you construct a minimal example, or at least point me to that test?

@eric-wieser - it is this test
A more minimal example:

from astropy.table import MaskedColumn import numpy as np ma = MaskedColumn([1], dtype='O') ma.mask = True ma[0] # masked ma[0] is np.ma.masked # False type(ma[0]) is type(np.ma.masked) # True

Since we overwrite __getitem__ with a CPython shim (for speed with ndarray, I also checked that:

np.ma.MaskedArray.__getitem__(ma, 0) is np.ma.masked # False

But it has something to do with our subclass:

np.ma.MaskedArray.__getitem__(ma.view(np.ma.MaskedArray), 0) is np.ma.masked # True

For completeness, I added just before this return in core.py:

print(mout, dout, masked is np.ma.masked)

and the output always is True 1 True.

Presumably, np.ma.MaskedArray.__getitem__(ma, 0) still looks like ma.masked?

dout is masked, or dout looks like masked but upon closer inspection is not?

Would using _data instead of data work, or would that break other bits of MaskedColumn?

Yes, using _data would work -- that is not used in MaskedColumn. But it needs some thought whether that is the right approach, i.e., to what extent one wants subclasses to be able to override how data is interpreted. Though on the other hand arguably this is so integral to how MaskedArray works that it would be reasonable to use a private property, which indicates more clearly that "if you override this, expect breakage".

Either way, I managed to reproduce this problem without touching the data attribute at all, so I'm inclined to leave it as is for now (see the added tests)

The correct behavior here is to return a new MaskedArray, because `MaskedConstant` is pretending to be a scalar

eric-wieser · 2017-04-19T17:31:56Z

Thanks @mhvk for spotting the problem.

There's an unrelated bugfix commit in here (BUG: np.ma.array(masked) should not return a new MaskedConstant) now too

eric-wieser · 2017-04-19T17:35:19Z

numpy/ma/tests/test_core.py

        # All element masked
-        self.assertTrue(isinstance(a[-2], MaskedArray))
+        assert_equal(type(a[-2]), mvoid)


The behaviour here hasn't changed, the test was just less strict than it should be

mhvk · 2017-04-19T18:34:09Z

numpy/ma/core.py

-        else:
+        # we must never do .view(MaskedConstant), as that would create a new
+        # instance of np.ma.masked, which would be confusing
+        if isinstance(data, cls) and subok and not isinstance(data, MaskedConstant):
            _data = ndarray.view(_data, type(data))


I get so confused about this: why exactly is a view taken here? For isinstance(data, cls) to hold, data must be a subclass of ndarray as well, and if subok was true, then the above assignment of _data = np.array(data, subok=sub) would already have ensured that type(_data) is type(data) (independently of the other arguments). Am I missing something? If so, do add that as a comment....

I think the reason might be to ensure that a copy of the wrapper is made - as otherwise we'd modify properties of the input array. isinstance(data, cls) is checking for a MaskedArray subclass.

All I did here was swap the order of the if in order to make it easier to see what is happening.

OK, I think that makes sense. If you do have another round, do add a comment...

mhvk · 2017-04-19T18:37:28Z

numpy/ma/core.py

        _mask = self._mask
+
+        def _is_scalar(m):


To me, it is not all that helpful to have these as helper functions, and I wonder a bit about the additional hit in performance.

I think the other one has to be a helper function in order for the flow control to work. I made this one a helper function because we need it twice, and it makes the lower-down code generally nicer.

I could move these to be free functions, if you think they'd be better outside the method

I'm heavily biased by having used in-method functions only in cases where I want access to some of the local variables; as this is not the case here, moving it outside might be better indeed. But this is not a strong preference, so don't worry about -- my main preference would be not to have them at all, but I can see that the code flow would be a bit odd (probably would need to start by assigning scalar_expected = None, and then at the end check whether it has change; not great either).

The only concern with moving it outside, is that it becomes distant compared to the rest of the logic. On the other hand, there's a good chance there's a bug waiting in __setitem__ that will need the same code

in cases where I want access to some of the local variables

Note that the original implementation did exactly that, but I realized it was pretty trivial to convert them into parameters, as there were only two

Maybe best to leave as is -- this is substantially clearer than it was!

Leaving it as is sounds good to me!

I think I've addressed your other concerns too - and I reckon we leave any kind of rollback of #5962 to another PR, since it's not needed to fix the bug.

mhvk · 2017-04-19T18:44:42Z

numpy/ma/tests/test_core.py

@@ -4376,16 +4376,27 @@ def test_getitem(self):
                                   [1, 0, 0, 0, 0, 0, 0, 0, 1, 0])),
                          dtype=[('a', bool), ('b', bool)])
        # No mask
-        self.assertTrue(isinstance(a[1], MaskedArray))
+        assert_equal(type(a[1]), mvoid)


Perhaps also check equality of _data and _mask as done for element 0 below?

I'll wrap this up in a helper to avoid the code duplication

mhvk · 2017-04-19T18:48:56Z

This looks good, modulo the small comments (of which the in-method functions is just a preference).

mhvk · 2017-04-19T21:42:42Z

numpy/ma/core.py

+
+            # special case introduced in gh-5962
+            elif (self.dtype.type is np.object_ and
+                    isinstance(dout, np.ndarray) and


I hate giving pep8 comments, but no need for the extra indent here (only for if clauses is it needed, since if ( takes 4 characters...)

As in, you want columnar alignment here?

yes, with self and isinstance aligned; that way it is clearest what the parentheses gather together.

Yeah, I wasn't quite sure whether alignment or indenting was right here. Does PEP8 actually specify to wrap if differently to elif?

Either way, fixed, since you seem to think this is wrong, and I knew that I was unsure

The alignment is standard, with the problem of if ( being 4 characters very specifically discussed - https://www.python.org/dev/peps/pep-0008/#indentation

Of course, my real "knowledge" just comes from using emacs with flymake -- see http://docs.astropy.org/en/latest/development/codeguide_emacs.html -- which handily marks things that don't follow style, or uninitialized/unused variables, etc.

mhvk · 2017-04-19T21:44:32Z

I'm happy with this. @ahaldane - do you want to have a look as well? (I'll merge tomorrow unless I hear otherwise).

ahaldane

Looks good, though it's unfortunate we need to take so many special cases into account.

I'm happy you wrote so many tests - I think these will be very helpful to anyone who needs to modiify this code in the future.

I'll leave it open overnight. First one to get to it tomorrow can merge!

eric-wieser · 2017-04-20T09:20:08Z

it's unfortunate we need to take so many special cases into account.

Note that the entire thing could become iscalar = not isinstance(getmaskarray(self)[indx]), but for performance does a bunch of faster checks first, since getmaskarray could incur a hefty allocation.

eric-wieser added 00 - Bug component: numpy.ma masked arrays labels Apr 6, 2017

eric-wieser force-pushed the fix-8684 branch from 5edcef7 to 5bcdb6f Compare April 7, 2017 09:53

eric-wieser mentioned this pull request Apr 11, 2017

BUG: Fix double-wrapping of object scalars #8643

Merged

eric-wieser requested a review from ahaldane April 18, 2017 17:26

eric-wieser force-pushed the fix-8684 branch 2 times, most recently from f1c55a1 to d194239 Compare April 19, 2017 09:09

eric-wieser commented Apr 19, 2017

View reviewed changes

eric-wieser force-pushed the fix-8684 branch from d194239 to bc8d315 Compare April 19, 2017 10:14

eric-wieser commented Apr 19, 2017

View reviewed changes

eric-wieser force-pushed the fix-8684 branch from bc8d315 to 42ad990 Compare April 19, 2017 10:51

eric-wieser mentioned this pull request Apr 19, 2017

BUG: Masked object array of ndarrays attaches mask to masked array value #8906

Open

eric-wieser force-pushed the fix-8684 branch from 42ad990 to 4185284 Compare April 19, 2017 13:41

BUG: Correctly distinguish between 0d arrays and scalars

c926d0a

This fixes numpy#8684

eric-wieser force-pushed the fix-8684 branch from 4185284 to c926d0a Compare April 19, 2017 13:49

mhvk reviewed Apr 19, 2017

View reviewed changes

BUG: np.ma.array(masked) should not return a new MaskedConstant

c08ec57

The correct behavior here is to return a new MaskedArray, because `MaskedConstant` is pretending to be a scalar

eric-wieser commented Apr 19, 2017

View reviewed changes

mhvk reviewed Apr 19, 2017

View reviewed changes

BUG: Fix __getitem__ on masked arrays containing np.ma.masked

ed75d97

eric-wieser added 2 commits April 19, 2017 23:06

TST: Add previously-missing test of __setitem__

2597a16

DOC: Improve comments in ma.array.__new__

73c15f7

eric-wieser force-pushed the fix-8684 branch from 44f2bac to 73c15f7 Compare April 19, 2017 22:07

ahaldane approved these changes Apr 20, 2017

View reviewed changes

mhvk merged commit 1a9f43f into numpy:master Apr 20, 2017

mhvk added this to the 1.13.0 release milestone Apr 20, 2017

ahaldane mentioned this pull request May 15, 2017

numpy masked array regression in 1.13.0rc1 #9121

Closed

BUG: Correctly distinguish between 0d arrays and scalars in MaskedArray.__getitem__ #8905

BUG: Correctly distinguish between 0d arrays and scalars in MaskedArray.__getitem__ #8905

Conversation

eric-wieser commented Apr 6, 2017 • edited

ahaldane commented Apr 19, 2017 • edited

eric-wieser commented Apr 19, 2017 • edited

Choose a reason for hiding this comment

eric-wieser commented Apr 19, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eric-wieser Apr 19, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eric-wieser commented Apr 19, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eric-wieser Apr 19, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eric-wieser commented Apr 19, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eric-wieser Apr 19, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mhvk commented Apr 19, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mhvk commented Apr 19, 2017

ahaldane left a comment

Choose a reason for hiding this comment

eric-wieser commented Apr 20, 2017 • edited

BUG: Correctly distinguish between 0d arrays and scalars in MaskedArray.getitem #8905

BUG: Correctly distinguish between 0d arrays and scalars in MaskedArray.getitem #8905

eric-wieser commented Apr 6, 2017 •

edited

ahaldane commented Apr 19, 2017 •

edited

eric-wieser commented Apr 19, 2017 •

edited

eric-wieser Apr 19, 2017 •

edited

eric-wieser Apr 19, 2017 •

edited

eric-wieser Apr 19, 2017 •

edited

eric-wieser commented Apr 20, 2017 •

edited