Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Adding __array_ufunc__ capability to MaskedArrays #21977

Merged
merged 4 commits into from
Jul 14, 2022

Conversation

greglucas
Copy link
Contributor

Redo of: #16022 which was merged and then reverted due to a failing docstring test with flattened masks. Thanks to @rcomer for adding the final commit that fixed this case. That PR has a lot of discussion in it for background too.

This enables any ufunc numpy operations that are called on a MaskedArray to use the masked version of that function automatically without needing to resort to np.ma.func() calls directly.

Example

import numpy as np
a = np.ma.array(np.arange(10), [0, 1] * 5)
x = a < 10
# x.data == [True,  True,  True,  True,  True, False, False, False, False, False]
y = np.ma.less(a, 10)
# y.data == [True,  True,  True,  True,  True,  True, False,  True, False, True]

Even though data values in a are masked, they are still acted upon and evaluated. Changing that underlying data. However, calling the masked less version doesn't do anything with the data in the array. Note that this is a big change but to a somewhat ambiguous case of what happens to masked values under function evaluation.

The real reason I began looking into this was to not evaluate the function in the masked values because I'd already premasked the condition that would throw warnings at me. See #4959 for evaluating log10 at locations less than 0 that were under the mask, so still threw warnings.

import numpy as np
A = arange(-2, 5)/2
Am = numpy.ma.masked_less_equal(A, 0)
np.log10(Am) # Previously threw a warning, now works as expected
np.ma.log10(Am) # No warnings before or after

Implementation

I haven't added any documentation yet because I wanted to check and see if this was even the right way of approaching this idea. I struggled to figure out how to pass the out= object pointers around in a consistent manner, so I am definitely interested in suggestions on improving that. I had to update quite a few seemingly random areas of the Masked codebase to make that work everywhere. There are a few tests that I had to update because values under the mask were being compared, and those have now changed with this implementation applying the masked version of ufuncs everywhere.

This is a pretty major (underlying) API change, so I expect there to be quite a bit of discussion on the best approaches here.

Linked Issues

Fixes #4959
Closes #15200

This enables any ufunc numpy operations that are called on a
MaskedArray to use the masked version of that function automatically
without needing to resort to np.ma.func() calls.
greglucas and others added 3 commits July 13, 2022 11:39
This test makes sure that a MaskedArray defers properly to another
class if it doesn't know how to handle it. See numpy#15200.
@mattip
Copy link
Member

mattip commented Jul 13, 2022

Just to reiterate my subjective summary of the comments from the original PR:

In general, we should probably aim to split masked arrays out of numpy into a separate library. That would require a lot of work, so in the mean time this PR moves in the direction of localizing masked array operations into one area, instead of needing to have lots of if is_masked_array(arg) scattered across NumPy, so it is a GoodThing.

@mattip mattip merged commit cb28cd1 into numpy:main Jul 14, 2022
@mattip
Copy link
Member

mattip commented Jul 14, 2022

Thanks @greglucas

@seberg
Copy link
Member

seberg commented Jul 16, 2022

SciPy ran into an issue, and I suspect it is due to this: scipy/scipy#16610

The code sample is this failing, which was not failing before:

import numpy as np
data = [[0, 1, 2, 3, 4, 9], [5, 5, 0, 9, 3, 3]]
mask = [[0, 0, 0, 0, 0, 1], [0, 0, 1, 1, 0, 0]]
a = np.ma.masked_array(data, mask=mask)
np.ma.minimum.reduce(a, axis=1)

rcomer added a commit to rcomer/numpy that referenced this pull request Jul 16, 2022
rcomer added a commit to rcomer/numpy that referenced this pull request Jul 16, 2022
Fixes the problem reported at
numpy#21977 (comment)

The reduce method here effectively calls itself with an unmasked
MaskedArray (mask=nomask) and then expects either a MaskedArray or
a scalar.  This change ensures that an ordinary ndarray is
converted to a MaskedArray, following the pattern already used in
mean and var in this module.
@greglucas
Copy link
Contributor Author

Unfortunately, I'm on vacation again and won't have much access to a computer this next week. I don't think this will be too hard to fix though, and still think it would be better to improve this than revert... At first glance this might be something as simple as elif m is masked instead of just the elif m check. Someone else should feel free to take this on if they have time!

I agree with your assessment in the other thread that a few more eyes would be valuable on this. My guess is there might be some other downstream packages that this hits as well and I may have missed some corner cases.

I'm pretty sure Matplotlib is hitting this same issue too in that weekly job testing Numpy main: matplotlib/matplotlib#23431

@rcomer
Copy link
Contributor

rcomer commented Jul 16, 2022

I have proposed a fix for the minimum.reduce issue at #21993.

@pllim
Copy link
Contributor

pllim commented Jul 18, 2022

Might have broken astropy as well, see astropy/astropy#13464 (cc @mhvk )

@seberg
Copy link
Member

seberg commented Jul 18, 2022

If this is a problem (and we can confirm its this PR), we could even revert it (again). For this larger change, it really makes sense to require checking against astropy first, since that is one of the largest users to begin with.

I like having __array_ufunc__, but unfortunately, I am not surprised that it is tricky...

@rcomer
Copy link
Contributor

rcomer commented Jul 18, 2022

We are using isinstance to check which types the MaskedArray‘s ufunc can process. Do we need a stricter check that would not include subclasses?

numpy/numpy/ma/core.py

Lines 3152 to 3156 in fe2bb38

# Determine what class types we are compatible with and return
# NotImplemented if we don't know how to handle them
for arg in args + outputs:
if not isinstance(arg, (ndarray, np.bool_, Number, list, str)):
return NotImplemented

@seberg
Copy link
Member

seberg commented Jul 18, 2022

@rcomer I think we might need the opposite and allow everything.

That is I think the design here may be wrong, and what MaskedArray should be doing is more simiarl to what astropy.units.Quantity is doing:

  1. Allow everything
  2. Unpack all masked arrays and apply whatever logic is necessary (mutated where or keep track of the new mask)
  3. Call the ufunc again after unpacking all masked arrays, so that the next call will not recruse (and needs no information about masked arrays at all).
  4. Wrap the result into a masked array again and return

So in that sense, I am not sure the design is quite right.

@seberg
Copy link
Member

seberg commented Jul 18, 2022

The weird thing about that is, that there should be an inherent order to things. If Quantity also allows everything, then quantity + masked could return a masked(quantity) or a quantity(masked) depending on the order of operations.

So maybe that is not right, and since maskedArray is in NumPy it is not allowed to wrap anything.

@rcomer so I think you may be right, and this should just be a strict type check for ndarray. Rather than checking for Number, you should also check for not hasattr("__array_ufunc__"), probably (or something like that).
(EDIT: Do we even forward those who do not implement __array_ufunc__? Because in that case, maybe allowing those ins not necessary to begin with?)

(Yes, we could auto-wrap more than just ndarray, but I guess we have to make that a problem of downstream, and I doubt that it matters in practice.)

outputs = kwargs.pop('out', ())
if not isinstance(outputs, tuple):
outputs = (outputs,)
outputs += inputs[np_ufunc.nin:]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also not necessary, all outputs are normalized into out by the ufunc machinery here.

@mhvk
Copy link
Contributor

mhvk commented Jul 19, 2022

Just to quickly note that the failures in astropy are with our MaskedColumn class, not Quantity; from a first guess, it looks like __array_finalize__ may need to be called.

On a more general note, since the way MaskedArray is designed, it really wraps a subclass, it seems logical to pursue the idea that it can act independent of subclass (in fact, this was argued in the subclassing section of the __array_ufunc__ documentation), i.e., unpack, re-call, and repack the result.

That said, in my own Masked implementation, I ended up going the route of trying to keep the mask as close to the array as possible, since that turned out to be much easier to wrap inside Quantity (leaving the quantity-specific methods exposed, unlike in a MaskedArray or a quantity).

else:
# The masked power function doesn't support extra args
if np_ufunc.__name__ in ('power'):
kwargs = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be an error instead?

@@ -3292,7 +3384,7 @@ def _scalar_heuristic(arr, elem):
return dout
else:
# Force dout to MA
dout = dout.view(type(self))
dout = MaskedArray(dout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks getting items on subclasses:

from astropy.table import MaskedColumn
mc = MaskedColumn([1., 2.], mask=[True, False])
mc[:1]
Out[4]: 
masked_BaseColumn(data=[--],
                  mask=[ True],
            fill_value=1e+20,
                 dtype=float64)

This should return a MaskedColumn...

@mhvk
Copy link
Contributor

mhvk commented Jul 19, 2022

There are also some real regressions: e.g.,


In [8]: ma = np.ma.MaskedArray([1., 2.], mask=[True, False])

In [9]: np.sign(ma)
Out[9]: array([1., 1.])

while in previous versions this correctly gave

masked_array(data=[--, 1.0],
             mask=[ True, False],
       fill_value=1e+20)

@@ -4200,7 +4314,7 @@ def __add__(self, other):
"""
if self._delegate_binop(other):
return NotImplemented
return add(self, other)
return np.add(self, other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that for all these, one can now safely remove the overrides, since __array_ufunc__ will take care.

# prevent infinite recursion.

# Make ndarray views of the input arguments
args = [getdata(input) if isinstance(input, MaskedArray)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One can prevent infinite recursion more easily by just calling the ndarray.__array_ufunc__, i.e.,

return super().__array_ufunc__(np_ufunc, method, *inputs, **kwargs)

# Determine what class types we are compatible with and return
# NotImplemented if we don't know how to handle them
for arg in args + outputs:
if not isinstance(arg, (ndarray, np.bool_, Number, list, str)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For outputs, one should be stricter: they need to be ndarray - indeed, for most functions, they should be MaskedArray, otherwise the mask cannot be stored.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one could be a bit less strict on the inputs - after all, the unwrapped call will do its own checks anyway.

@mhvk
Copy link
Contributor

mhvk commented Jul 19, 2022

I put some in-line comments; I think this may need a bit more thought to get out the wrinkles. Not sure whether it is best to have a PR with fixes or revert and start with a new PR.

@seberg
Copy link
Member

seberg commented Jul 19, 2022

I am tempted to revert (unfortunately), unless @greglucas has bandwidth to look into it soon.

@mattip
Copy link
Member

mattip commented Jul 19, 2022

I agree. I did think the numpy test suite was more comprehensive, but it seems we have a way to go.

@greglucas
Copy link
Contributor Author

Unfortunately I'm on vacation away from a computer and won't be able to pick it up for a week or so at least and even then it will take me a bit to get back up to speed since the majority of the work was done a couple years ago. Reverting is probably the right move, and I appreciate that the dependent libraries are testing main and can hopefully help identify some good test cases for these updates when I go digging through those failure logs. Thanks for the ping and helpful comments so far @seberg and @mhvk. Probably good for a two review PR on the next one as well just to ensure we've had enough eyes on it.

greglucas pushed a commit to greglucas/numpy that referenced this pull request Jan 3, 2023
Fixes the problem reported at
numpy#21977 (comment)

The reduce method here effectively calls itself with an unmasked
MaskedArray (mask=nomask) and then expects either a MaskedArray or
a scalar.  This change ensures that an ordinary ndarray is
converted to a MaskedArray, following the pattern already used in
mean and var in this module.
greglucas pushed a commit to greglucas/numpy that referenced this pull request Jan 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants