Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: make some masked array methods behave more like ndarray methods #5706

Merged
merged 3 commits into from
Apr 4, 2016

Conversation

ahaldane
Copy link
Member

I recently learned about masked arrays and was trying them out in a project, and found that the masked array functions mean, var, sum, amax and similar do not behave quite the same as their ndarray counterparts.

They don't support multiple axes like ndarrays, eg

>>> a = np.random.rand(4,4)
>>> a.mean(axis=(0,1)) # works
>>> b = np.ma.array(a, mask=np.random.rand(4,4)<0.5)
>>> b.mean(axis=(0,1)) # fails with IndexError

they don't have a keepdims argument, and looking at the code the behavior is a little different.

In this PR I modified np.ma.mean, np.ma.sum and np.ma.count to behave more like ndarray methods, as proof of concept.

Does it look like a good idea, and is the code OK so far? If so I'll do the same thing for the rest of the functions (var, sum etc).

Note that I modified a test to get it to pass: I removed the test that the return type of np.ma.count was of type np.intp. The return type can now be an ndarray (for partial axes) or an int. The return type was discussed in #4698, and I think the problem at that time was that arr.size could return variable types, but I removed dependence on arr.size so I don't think that's a problem any more.

@ahaldane ahaldane force-pushed the ma_methods_args branch 3 times, most recently from d11dbf5 to 3cd5514 Compare March 22, 2015 05:28
@charris
Copy link
Member

charris commented Mar 22, 2015

Closing and reopening to see if the tests can be rejiggered.

@charris charris closed this Mar 22, 2015
@charris charris reopened this Mar 22, 2015
@charris
Copy link
Member

charris commented Mar 22, 2015

Yep, that works. Restarting the tests doesn't have the same effect.

@charris
Copy link
Member

charris commented Mar 22, 2015

Try again.

@ahaldane
Copy link
Member Author

Updated np.ma.mean to treat dtype parameter more carefully (like in np.mean).

Also I noticed that many ma operations will show a warning if you perform an invalid operation on an element even if that element is masked. Eg,

>>> np.log(np.ma.array([nan, 1, 2], mask=[True, False, False]))
>>> np.log(np.ma.array([0, 1, 2], mask=[True, False, False]))

will raise a warning, which doesn't seem right. (The first of these also fails when using np.ma.log, but not the second). I started to fix this by disabling warnings during the domain calculation. I don't think I'm disabling any legitimate warnings.

But actually the correct fix may instead be to do the domain check with the masked array, not the raw data. However, in the case of _DomainSafeDivide this isn't currently possible since it converts to ndarrays, but that is a recent change that might be wrong.

@abalkin
Copy link
Contributor

abalkin commented Mar 23, 2015

I recently learned about masked arrays and was trying them out in a project, and found that the masked array functions mean, var, sum, amax and similar do not behave quite the same as their ndarray counterparts.

Keeping masked arrays interfaces in sync with ndarray has been a loosing battle in the past because most of the people who enhance ndarray either don't know or don't care about masked arrays.

The things only got worse when masked array were reimplemented (over objections of original designers) to inherit from ndarray. After that, any new method added to ndarray has automatically became a broken method of masked arrays. Case in point: the dot method. See #5185.

I can see only one way to fix this situation for the long term: add a test that would introspect andarray and masked arrays methods and make sure they match. Unfortunately, introspecting ndarray methods are not always possible because some of them are implemented in C, but in those cases we can probably parse the docstring to extract the signature. See #4356.

Finally, I think this issue overlaps with #4537.

if isinstance(out, MaskedArray):
outmask = getattr(out, '_mask', nomask)
if (outmask is nomask):
outmask = out._mask = make_mask_none(out.shape)
outmask.flat = newmask
outmask[()] = newmask.reshape(result.shape)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is more idiomatic to use ... when assigning to the entire array. Empty tuple indexing should only be used in generic code where ndim can happen to be 0. Literal a[()] is confusing and unnecessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the second thought, what was wrong with the old code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm you're right using .flat works. I was worried that a non-1d newmask would cause problems.

I still think outmask[...] = is better (thanks for explaining () vs ...). I guess I am biased against .flat after reading some other devs say they didn't like it!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always found assigning to .flat confusing, so as someone who might read your code at some point in the future I'd suggest keeping your change here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I only objected to the use of [()]. Getting rid of .flat is an improvement, I was just curious if that change was necessary to implement keepdims.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the end I don't think I can replace .flat = with [...] =.

Using .flat is currently necessary for interopreability of ma and matrices. See #4585 (comment), #4615.

Eg, if you sum a 3x3 masked-matrix along the 2nd axis, the resulting data shape is (3,1), but the mask shape becomes (3,) (since mask is an ndarray). [...] = fails when trying to assign a (3,) shape to a (3,1) shape, but .flat = doesn't.

If we really want to get rid of a.flat = b here, something like this might be a replacement:

a[...] = b.reshape(a.shape) if not np.isscalar(b) else b

but that seems a bit ugly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahaldane - ah, bitten by matrices again... Yes, in that case, I would also just stick with .flat.

@mhvk
Copy link
Contributor

mhvk commented Mar 23, 2015

In my opinion, subclassing from ndarray was a logical thing to do, and not bad in itself, though from my trials to get MaskedArray to work with astropy's Quantity (#3907, #4586, #4617), I would have to say it was not done quite carefully enough. In #3907 (comment) @charris suggested an overhaul using __numpy_ufunc__ -- I think that would indeed be the way forward (though don't let that detract from at least getting some things work better!).

@@ -28,12 +28,14 @@

import numpy as np
import numpy.core.umath as umath
from numpy.core import multiarray as mu
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point I'd really like to get rid of most of these imports, plain old np.* seems to work in most cases.

@charris
Copy link
Member

charris commented Mar 25, 2015

Might want to look at the corresponding nan functions, I expect there is some commonality. There needs to be tests for all the new functionality, might be able to copy some of those from the nanfunction tests also. Making the masked functions conform is good. We are looking for a masked array maintainer if anyone is interested in pursuing that ;)

@ahaldane
Copy link
Member Author

@mhvk, @charris Using __numpy_ufunc__ sounds like a good idea. I'll keep it in mind, but I don't think I'll touch that here just to keep things manageable.

@ahaldane
Copy link
Member Author

Updated. Summary so far:

  • reworked count method (to allow multiple axes)
  • added keepdims arg to all, any, sum, prod, mean, var, std, min, max
  • added (unused) arg to same methods in matrix class
  • updated docstring for those funcs, plus cumsum, cumprod, std, round.
  • rewrote internals of mean and var. (var was pretty buggy!)
  • removed anom method, which was just used in var before
  • hid warnings during domain calculations

I haven't tested it much yet.

By the way, for a separate PR, I noticed the ndarray doctrings are missing keepdim args. Eg in ndarray.mean and others.

I also notices that the C-Api function PyArray_Mean is implemented (in calculations.c) quite differently from ndarray.mean (in core._methods.py). (and var and others too) Probably it should simply call array_mean.

@ahaldane ahaldane force-pushed the ma_methods_args branch 4 times, most recently from 6814cfa to 31afad5 Compare March 29, 2015 00:44
@ahaldane
Copy link
Member Author

More changes: I got burned due to the np.ma.masked constant being writeable, which caused some tests to fail depending on the order the tests were run.

So I made np.ma.masked readonly, which uncovered a bug in np.ma.ptp, which (for now) I fixed with some modifications of the MaskedConstant class, but probably something nicer can be done.

@ahaldane
Copy link
Member Author

Test failure is spurious, happens because http://www.rabbitmq.com is down.

@mhvk
Copy link
Contributor

mhvk commented Mar 30, 2016

Apart from the two small comments (one of which can be ignored if you wish), to me this seems all ready to go in. Sorry for it having been such a long slog...

@seberg
Copy link
Member

seberg commented Mar 30, 2016

Seems like this is a bit cleaned up/less compared to before. I will try to look over it again sooner rather then later, so hopefully by the weekend, don't feel like it now, but poke me if I don't (though if someone else beats me to it, would be more than happy), and since @mhvk already reviewed it, I think it should be good anyway. That way the changes can hopefully sit at least a little bit on master to notice oddities.

@mhvk
Copy link
Contributor

mhvk commented Mar 30, 2016

@ahaldane - OK, all seems fine now!

`None`, all non-masked elements are counted.
axis : None or int or tuple of ints, optional
Axis or axes along which the count is performed.
The default (`axis` = `None`) is perform the count sum over all
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are single ticks OK here (honestly not sure)? The sentence reads a bit funny. "is to perform" or "performs" sound fine to me, though who knows english has some weird grammar ;).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the docs as I read them, single-backticks go around variable, module, function, and class names, while double-backticks go around inline code. So I think it's right here. (Edit: and note that this docstring was copied from the np.any docstring, though slightly mangled)

I'll fix the grammar though.

@seberg
Copy link
Member

seberg commented Apr 2, 2016

Just some silly questions/nitpicks. Then I guess I will put it in and hope nothing annoying crops up (with MA I won't try to guess).

@ahaldane
Copy link
Member Author

ahaldane commented Apr 4, 2016

Updated.

  • count now raises ValueError for 0-d arrays if axis > 0
  • fixed grammar in average and count docstrings.
  • I think backtick usage in docstrings is correct
  • fixed read-only return value of scl in average
  • added keepdims argument to _check_mask_axis, and used it everywhere

@seberg
Copy link
Member

seberg commented Apr 4, 2016

OK, lets give it a shot. Thanks a lot @ahaldane was a long work....

@seberg seberg merged commit c6e65b7 into numpy:master Apr 4, 2016
@ahaldane
Copy link
Member Author

ahaldane commented Apr 4, 2016

Whew! Good thing I had some tenacious reviewers to help, thanks both of you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants