
ENH: Update return value of ma.average and allow weights to be masked. #14668

Open · wants to merge 9 commits into base: main

Conversation

shoeffner

This pull request makes changes to the np.ma.average function and, with it, to the MaskedArray.mean and MaskedArray.var methods.
They behave more predictably and in line with their np. counterparts with regard to dtype handling.
Additionally, np.ma.average can now handle masked arrays as weights. Please see the commit messages below for more details.

As this is my first numpy contribution, any feedback is highly appreciated! :-)
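To illustrate the new masked-weights behavior, here is a minimal sketch (hypothetical values; with this change, only elements unmasked in both the data and the weights contribute):

import numpy as np

# Element 1 has a masked weight, element 2 has masked data.
x = np.ma.masked_array([1.0, 2.0, 3.0], mask=[0, 0, 1])
w = np.ma.masked_array([1.0, 1.0, 1.0], mask=[0, 1, 0])

# Only x[0] is unmasked in both x and w, so the expected result is 1.0.
print(np.ma.average(x, weights=w))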


DOC: typos in np.average


ENH: ma.average like np.average; allow weight mask

This commit makes ma.average behave more like np.average regarding
dtypes and result handling.

The change required code similar to that in np.mean, as suggested by
Sebastian Berg, but also some handling in np.ma.average. As a result,
ma.mean handles dtypes like np.mean does. Similar changes have been made
to ma.var, so that it also keeps proper dtypes and uses np.float64 for
integral types.

While getting my head around the code, I inevitably introduced a change
in weights handling: np.ma.average now handles masked weights such that
only values that are unmasked in both the data and the weights are
taken into account.

Additional tests and updated documentation are included.

Resolves #14462.
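A quick check of the dtype handling described above (a minimal sketch; np.mean uses np.float64 for integral inputs, and ma.mean is meant to match):

import numpy as np

a = np.ma.masked_array([1, 2, 3, 4], mask=[0, 0, 0, 1])

# The mean of the unmasked values 1, 2, 3 is 2.0, and the result
# should be np.float64 even though the input is integral.
print(a.mean(), a.mean().dtype)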

shoeffner (Author) commented Oct 9, 2019

Skipping one failing test, because np.float128 does not seem to be available on many systems (https://dev.azure.com/numpy/numpy/_build/results?buildId=5837&view=logs).
Additionally, I fixed the behavior when no mask is given for the weights, so that the doctests pass.

mattip added the "01 - Enhancement", "component: numpy.ma", and "triage review" labels on Dec 8, 2019
mattip removed the "triage review" label on Dec 18, 2019
WarrenWeckesser self-assigned this on Dec 18, 2019
WarrenWeckesser (Member) left a comment

@shoeffner, I made some suggestions and added some questions inline, but I haven't reviewed all the code changes. If you are still interested in working on this, I'll continue with a more detailed review.

(Eight review comments on numpy/ma/core.py, all outdated and resolved.)
WarrenWeckesser (Member) commented

One issue that worries me is this change in behavior. Before this pull request:

In [15]: x = np.ma.masked_array([[1, 2]], mask=[[0, 1]])                                         

In [16]: w = np.ma.masked_array([[0, 0]], mask=[[0, 0]])                                         

In [17]: np.ma.average(x, weights=w, axis=1)                                                     
Out[17]: 
masked_array(data=[--],
             mask=[ True],
       fill_value=1e+20,
            dtype=float64)

With this pull request, we get a warning and an explicit nan:

In [18]: x = np.ma.masked_array([[1, 2]], mask=[[0, 1]])                        

In [19]: w = np.ma.masked_array([[0, 0]], mask=[[0, 0]])                        

In [20]: np.ma.average(x, weights=w, axis=1)                                    
/Users/warren/mc37np/lib/python3.7/site-packages/numpy-1.18.0.dev0+f4dd770-py3.7-macosx-10.9-x86_64.egg/numpy/ma/extras.py:694: RuntimeWarning: invalid value encountered in true_divide
  avg = prod.sum(axis=axis, dtype=result_dtype) / scl
Out[20]: 
masked_array(data=[nan],
             mask=[False],
       fill_value=1e+20)

One can argue that the new behavior is preferred, but it is a backwards incompatible change that might affect existing code. Like it or not, the masked array functions tend to automatically convert values that would normally be nan or inf into a mask. E.g.

In [19]: x = np.ma.masked_array([0, 1, 2])

In [20]: 1/x
Out[20]: 
masked_array(data=[--, 1.0, 0.5],
             mask=[ True, False, False],
       fill_value=1e+20)

With a regular array, we get a warning and the result includes inf:

In [21]: 1/np.arange(3)                                                                          
/Users/warren/a2019.10/bin/ipython:1: RuntimeWarning: divide by zero encountered in true_divide
  #!/Users/warren/a2019.10/bin/python
Out[21]: array([inf, 1. , 0.5])

shoeffner (Author) commented

Thank you very much for your valuable feedback!

I will go through your comments and adjust the code accordingly over the next few days. If I remember correctly, most of the changes (like the views and the ntypes.dtypes) were copied from the normal array operations, but I will have to take a closer look and see why I decided on that. It might also be that I found them in the test suite somewhere.

I very much appreciate the comment on the change of functionality: I honestly was not aware of that issue and will a) check whether I have any unit tests covering it and b) investigate how to keep backwards compatibility.

Uses scalar types like np.float64 instead of dtype instances in ma.mean
and ma.var.
To retain backwards compatibility, division-by-zero entries in masked
arrays are masked out in the results by using true_divide from
numpy.ma rather than a plain /.
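The masking behavior of numpy.ma's division can be seen in a short sketch like this (hypothetical values; the zero divisor ends up masked, just like in Warren's 1/x example above):

import numpy as np

num = np.ma.masked_array([1.0, 2.0, 3.0])
den = np.ma.masked_array([2.0, 0.0, 4.0])

# true_divide masks the entry where the division is undefined,
# instead of producing a warning and an explicit inf/nan.
print(np.ma.true_divide(num, den))  # expected: [0.5, --, 0.75]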
shoeffner (Author) commented

I addressed your comments:

  • switched to scalar types instead of dtype objects, as you recommended,
  • removed the views, as they seemed to be unnecessary,
  • added tests for the scalar types to the ma.mean tests to increase coverage,
  • added your regression example as a unit test and as a doctest example, so that one can see that values might be masked out in case of a division by zero,
  • used true_divide where I had divide or / before, as this seems to handle the masking of division-by-zero values automatically.

Now I am waiting for the CI to tell me what I broke but didn't test locally.

(One review comment on numpy/ma/extras.py, outdated and resolved.)
WarrenWeckesser (Member) commented

@shoeffner, sorry for letting this slip. I'll get back to this over the weekend, if not sooner.

charris changed the title from "Return data types of np.average and np.ma.average behave the same. Weights for np.ma.average can be masked." to "ENH: Update return value of ma.average and allow weights to be masked." on Mar 13, 2020
WarrenWeckesser (Member) commented Mar 16, 2020

Is there a reason that, when the input is type float16, the calculation is upcast to float32 and later cast back to float16? @shoeffner, I realize this is copied from the behavior of mean, so my question is really directed at other numpy devs, e.g. @seberg, @eric-wieser, etc. Is there a general policy of bumping float16 inputs to float32 for computations, or is this just a special behavior for mean?

EDIT: It seems to be just mean, but I haven't done a thorough review of the code to check.

WarrenWeckesser (Member) commented

@shoeffner, the changes to mean look good.

One of the goals of this pull request is to make the behavior of the mean and var methods of the masked arrays consistent with the ndarray methods. I noticed the following discrepancy in the values computed by the var method for the float16 array x:

In [101]: x = np.arange(-30, 30).astype(np.float16)                                                      

In [102]: mx = np.ma.masked_array(x, mask=np.zeros(x.shape, dtype=bool))

x and mx have the same shape, dtype and values. mx has an explicit mask, but no values are masked.

I would expect x.var() and mx.var() to return the same value, but the following shows otherwise:

In [103]: x.var(), x.var().dtype
Out[103]: (299.8, dtype('float16'))

In [104]: mx.var(), mx.var().dtype
Out[104]: (300.0, dtype('float16'))

(The results do have the same dtype, which is part of the improvement made in this pull request, so we're making progress!)

Numerically, the value 300.0 is closer to the answer one gets with higher precision, 299.9166666666667. In this case, it is the ndarray method that is not boosting the precision of the intermediate calculation. We can verify that by using an explicit dtype argument:

In [228]: x.var(dtype=np.float32).astype(np.float16)
Out[228]: 300.0

So, to maintain consistency with the ndarray var method, the masked array version of var shouldn't cast float16 to float32 internally.

(Unless, that is, we also modify the ndarray var method to internally cast float16 to float32, like mean does. But that would require discussion from other numpy devs, and I suspect if that was desired, it would have been done already, perhaps at the same time mean was changed.)

Adding tests to check that var and mean behave the same as their non-masked counterparts.
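A minimal sketch of such a consistency check (hypothetical test name, mirroring Warren's example above):

import numpy as np

def test_var_float16_matches_ndarray():
    # After this PR's change: with an all-False mask, ma.var should agree
    # with ndarray.var in both value and dtype.
    x = np.arange(-30, 30).astype(np.float16)
    mx = np.ma.masked_array(x, mask=np.zeros(x.shape, dtype=bool))
    assert x.var() == mx.var()
    assert x.var().dtype == mx.var().dtype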
shoeffner (Author) commented

That is indeed copied; I thought the idea was, exactly as you said, to keep a higher precision for the intermediate divisions. I changed it so that it is more consistent with the behaviour of ndarray.var. I also added your example using different dtypes as a test to check that the behavior stays the same for ndarray and ma.

Thank you for your thorough tests and feedback, @WarrenWeckesser!

charris (Member) commented Mar 16, 2020

@WarrenWeckesser float16 arithmetic is implemented in C by casting to float32, doing the operation, and downcasting afterwards. You can see that in the ufunc loops. So there is probably no problem with doing that here. The idea is that float16 is mostly a storage format rather than a computation format, although I think that has changed over the years with some GPUs implementing the operations. Converting whole arrays is probably not memory efficient, but no doubt much better numerically.
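The pattern charris describes can be written out at the array level (an illustrative sketch, not the actual C ufunc loop):

import numpy as np

a = np.arange(10, dtype=np.float16)

# Upcast to float32, compute, then downcast the result back to float16,
# which is what the float16 ufunc loops do per element.
result = (a.astype(np.float32) * a.astype(np.float32)).astype(np.float16)
print(result, result.dtype)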

(One review comment on numpy/ma/tests/test_core.py, outdated and resolved.)
seberg (Member) commented Mar 16, 2020

I do not think there is a general policy; right now it seems pretty random. A general policy would be nice, but I am not sure how feasible it is in the short term. Some reduction-like operations upcast for sure; in general, ufunc reductions upcast along the fast axis. That means that along the fast axis you have rounded float32 precision, but along the slow axis you are limited by float16 precision (a somewhat annoying side effect is that it is not just less precise but also slower).

A long-term policy could be to try to always use float32 precision for float16 calculations, but at least for reductions the difference between the slow and fast axes would need to be removed, so that it is transparent when the downcast to float16 occurs. As Chuck says, float16 is mostly meant to be a storage format, so I am not sure that is even a good policy...
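One way to observe the fast-axis/slow-axis asymmetry seberg describes (an illustrative sketch; the exact outputs depend on the NumPy version and build):

import numpy as np

# Both arrays hold 10000 copies of 0.1 per output element; the true sum
# is about 1000, well within float16 range.
fast = np.full((2, 10000), 0.1, dtype=np.float16)
slow = np.full((10000, 2), 0.1, dtype=np.float16)

# Reducing along the contiguous (fast) axis can use a float32 accumulator,
# so the result stays close to 1000.
print(fast.sum(axis=1))

# Reducing along the slow axis accumulates in float16; the running sum
# stalls once the float16 spacing exceeds the addend, far below 1000.
print(slow.sum(axis=0))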

shoeffner (Author) commented

For 982637f, the Azure Pipelines failed for pypy3 (numpy/core/tests/test_einsum.py ...........Fatal Python error: Illegal instruction); I hope this is unrelated...

So should I now keep the behavior as it is (i.e., identical behavior between ma and ndarray), or should I switch to up/down-casting for all operations?

Do you have any other things I can address for the PR? :)

mattip (Member) commented Mar 18, 2020

Indeed the PyPy failure is both unrelated and annoying.

Base automatically changed from master to main March 4, 2021 02:04
Projects

Status: Awaiting a code review
Development

Successfully merging this pull request may close these issues.

Return data types of np.average and np.ma.average behave differently
6 participants