ENH: Update return value of ma.average and allow weights to be masked. #14668
4a07047 to b2a9e92
This commit makes ma.average behave more like np.average regarding dtypes and result handling. The change required code similar to the code in np.mean, as suggested by Sebastian Berg, but also some handling in np.ma.average. As a result, ma.mean handles dtypes like np.mean. Similar changes have been made to ma.var, so that it also keeps proper dtypes and uses np.float64 for integral types. While getting my head around the code, I inevitably introduced a change in weights handling: np.ma.average now handles masked weights such that only values which are not masked in either the data or the weights are taken into account. Additional tests and updated documentation are included. Resolves numpy#14462.
b2a9e92 to f4dd770
Skipping one failing test because on many systems |
@shoeffner, I made some suggestions and added some questions inline, but I haven't reviewed all the code changes. If you are still interested in working on this, I'll continue with a more detailed review.
One issue that worries me is this change in behavior. Before this pull request:
With this pull request, we get a warning and an explicit nan:
One can argue that the new behavior is preferred, but it is a backwards incompatible change that might affect existing code. Like it or not, the masked array functions tend to automatically convert values that would normally be
With a regular array, we get a warning and the result includes
Thank you very much for your valuable feedback! I will go through your comments and adjust the code accordingly (at some point in the next few days). If I remember correctly, most of the changes (like the I very much appreciate the comment on the change of functionality: I honestly was not particularly aware of that issue and will a) check whether I have any unit tests for it and b) investigate how to keep the backwards compatibility.
Uses scalar types like np.float64 instead of dtype instances in ma.mean and ma.var.
To retain backwards compatibility, results of division by 0 in masked arrays are masked out by using true_divide from numpy.ma rather than a plain /.
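A minimal sketch of the difference this commit message refers to, not the code from the pull request itself: `np.ma.true_divide` masks entries where the divisor is zero, while the plain ndarray division produces `inf` (and normally a warning). The array values are made up for illustration.

```python
import numpy as np

numerator = np.ma.array([1.0, 4.0])
denominator = np.ma.array([0.0, 2.0])

# Masked-array division: the entry divided by zero is masked, no warning.
masked_result = np.ma.true_divide(numerator, denominator)
print(masked_result)

# Plain ndarray division: the same entry becomes inf. The warning is
# silenced here only to keep the example output clean.
with np.errstate(divide="ignore"):
    plain_result = numerator.data / denominator.data
print(plain_result)
```

The masked result keeps the valid quotient and hides the undefined one, which is the long-standing masked-array behavior the review comment asked to preserve.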
I addressed
Now I am waiting for the CI to tell me what I broke but didn't test locally. |
@shoeffner, sorry for letting this slip. I'll get back to this over the weekend, if not sooner. |
Is there a reason that, when the input is type EDIT: Seems to be just |
@shoeffner, the changes to One of the goals of this pull request is to make the behavior of the
I would expect
(The results do have the same Numerically, the value 300.0 is closer to the answer one gets with higher precision, 299.9166666666667. In this case, it is the ndarray method that is not boosting the precision of the intermediate calculation. We can verify that by using an explicit
So, to maintain consistency with the ndarray (Unless, that is, we also modify the ndarray |
Adding tests to check that var and mean behave the same as their non-masked counterparts.
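A hedged sketch of the kind of check such tests might make; it is not taken from the pull request's test suite. It compares `mean` and `var` of an integer masked array with the ndarray results on the unmasked values, including the `np.float64` result dtype for integral input. The data are illustrative.

```python
import numpy as np

# Integer data with one masked entry (illustrative values).
m = np.ma.array([1, 2, 3, 4], mask=[False, False, False, True])
valid = m.compressed()  # plain ndarray holding only the unmasked values

ma_mean = m.mean()
ma_var = m.var()

# Reference values from the non-masked functions on the valid data.
ref_mean = np.mean(valid)
ref_var = np.var(valid)

print(ma_mean, ref_mean)
print(ma_var, ref_var)
print(ma_mean.dtype, ma_var.dtype)
```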
That is indeed copied; I thought the idea was, exactly as you said, to keep a higher precision for the intermediate divisions. I changed it so that it's more consistent with the behaviour of Thank you for your thorough tests and feedback, @WarrenWeckesser!
@WarrenWeckesser float16 arithmetic is implemented in C by casting to float32, doing the operation, and downcasting afterwards. You can see that in the ufunc loops. So there is probably no problem with doing that here. The idea is that float16 is mostly a storage format rather than one for computation, although I think that has changed over the years with some GPUs implementing the operations. Converting whole arrays is probably not memory efficient, but no doubt much better numerically.
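To make the "storage format" idea concrete, here is a small sketch (an illustration, not code from this pull request) that keeps the data in float16 but asks `np.var` to accumulate in float32 through its `dtype` argument, and compares both against a float64 reference. The input values are arbitrary.

```python
import numpy as np

# Data stored as float16 (arbitrary illustrative values).
x16 = np.linspace(0, 100, 25).astype(np.float16)

# Variance with the default dtype handling for float16 input.
var_default = np.var(x16)

# Variance with float32 intermediates requested explicitly.
var_f32 = np.var(x16, dtype=np.float32)

# Reference computed entirely in float64 from the same stored values.
var_ref = np.var(x16.astype(np.float64))

print(var_default, var_f32, var_ref)
print(var_default.dtype, var_f32.dtype, var_ref.dtype)
```

Passing `dtype` avoids converting the whole array up front, at the cost of a result whose dtype differs from the input's unless it is cast back afterwards.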
I do not think there is a general policy; right now it seems pretty random. A general policy would be nice, but I am not sure how feasible it is in the short term. Some reduction-like operations upcast for sure; in general, ufunc reductions upcast along the fast axis. That means that along the fast axis, you have rounded float32 precision, but along the slow axis you are limited by float16 precision (a somewhat annoying side effect is that it is not just less precise but also slower). A long-term policy could be to always use float32 precision for float16 calculations, but at least for reductions the difference between slow and fast axis would need to be removed, so that it is transparent when the downcast to float16 occurs. As Chuck says, it is mostly meant to be a storage format, so I am not sure that is even a good policy...
For 982637f, the azure pipelines failed for pypy3 ( So should I now keep the behavior as it is right now (i.e., identical behavior between ma and ndarray) or should I switch to up/down-casting for all operations? Do you have any other things I can address for the PR? :) |
Indeed the PyPy failure is both unrelated and annoying. |
This pull request makes changes to the np.ma.average function, and with it to the MaskedArray.mean and MaskedArray.var methods.
They behave more predictably and in line with their np. counterparts with regard to dtype handling.
Additionally, np.ma.average can now handle masked arrays as weights. Please see the commit messages below for more details.
As this is my first numpy contribution, any feedback is highly appreciated! :-)
DOC: typos in np.average
ENH: ma.average like np.average; allow weight mask
This commit makes ma.average behave more like np.average regarding
dtypes and result handling.
The change required code similar to the code in np.mean, as suggested by
Sebastian Berg, but also some handling in np.ma.average. As a result,
ma.mean handles dtypes like np.mean. Similar changes have been made
to ma.var, so that it also keeps proper dtypes and uses np.float64 for
integral types.
While getting my head around the code, I inevitably introduced a change
in weights handling: np.ma.average now handles masked weights such that
only values which are not masked in either the data or the weights are
taken into account.
Additional tests and updated documentation are included.
Resolves #14462.
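The masked-weights rule described in the commit message can be sketched as follows. This is an illustration under the assumption of a NumPy version that includes this change, with made-up data; the manual computation spells out the rule (only entries unmasked in both data and weights contribute) and is independent of the library call.

```python
import numpy as np

# Illustrative data: index 2 is masked in the data, index 1 in the weights.
data = np.ma.array([1.0, 2.0, 3.0, 4.0], mask=[False, False, True, False])
weights = np.ma.array([3.0, 1.0, 1.0, 1.0], mask=[False, True, False, False])

# Manual version of the rule: keep entries unmasked in BOTH arrays.
valid = ~(np.ma.getmaskarray(data) | np.ma.getmaskarray(weights))
manual = np.sum(data.data[valid] * weights.data[valid]) / np.sum(weights.data[valid])

# Library call with masked weights (assumes a NumPy release with this change).
avg = np.ma.average(data, weights=weights)

print(manual, avg)
```

Here only indices 0 and 3 contribute, so the weighted average is (1·3 + 4·1) / (3 + 1).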