
ENH: Update return value of ma.average and allow weights to be masked. #14668

Open · wants to merge 9 commits into base: main

Conversation

shoeffner

This pull request makes changes to the np.ma.average function and, with it, to the MaskedArray.mean and MaskedArray.var methods.
They behave more predictably and in line with their np. counterparts with regard to dtype handling.
Additionally, np.ma.average can now handle masked arrays as weights. Please see the commit messages below for more details.

As this is my first numpy contribution, any feedback is highly appreciated! :-)
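To illustrate the new masked-weights behavior, here is a minimal sketch (hypothetical values; with this change, only elements unmasked in both the data and the weights contribute):

import numpy as np

# Element 1 has a masked weight, element 2 has masked data.
x = np.ma.masked_array([1.0, 2.0, 3.0], mask=[0, 0, 1])
w = np.ma.masked_array([1.0, 1.0, 1.0], mask=[0, 1, 0])

# Only x[0] is unmasked in both x and w, so the expected result is 1.0.
print(np.ma.average(x, weights=w))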


DOC: typos in np.average


ENH: ma.average like np.average; allow weight mask

This commit makes ma.average behave more like np.average regarding
dtypes and result handling.

The change required code similar to that in np.mean, as suggested by
Sebastian Berg, but also some handling in np.ma.average. As a result,
ma.mean handles dtypes like np.mean does. Similar changes have been made
to ma.var, so that it also keeps proper dtypes and uses np.float64 for
integral types.

While getting my head around the code, I inevitably introduced a change
in weights handling: np.ma.average now handles masked weights such that
only values that are unmasked in both the data and the weights are
taken into account.

Additional tests and updated documentation are included.

Resolves #14462.
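A quick check of the dtype handling described above (a minimal sketch; np.mean uses np.float64 for integral inputs, and ma.mean is meant to match):

import numpy as np

a = np.ma.masked_array([1, 2, 3, 4], mask=[0, 0, 0, 1])

# The mean of the unmasked values 1, 2, 3 is 2.0, and the result
# should be np.float64 even though the input is integral.
print(a.mean(), a.mean().dtype)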

shoeffner (Author) commented Oct 9, 2019

Skipping one failing test, because np.float128 does not seem to be available on many systems (https://dev.azure.com/numpy/numpy/_build/results?buildId=5837&view=logs).
Additionally, I fixed the behavior when no mask is given for the weights, so that the doctests pass.

mattip added the "01 - Enhancement", "component: numpy.ma", and "triage review" labels on Dec 8, 2019
mattip removed the "triage review" label on Dec 18, 2019
WarrenWeckesser self-assigned this on Dec 18, 2019
WarrenWeckesser (Member) left a comment

@shoeffner, I made some suggestions and added some questions inline, but I haven't reviewed all the code changes. If you are still interested in working on this, I'll continue with a more detailed review.

(Eight review comments on numpy/ma/core.py, all outdated and resolved.)
WarrenWeckesser (Member) commented

One issue that worries me is this change in behavior. Before this pull request:

In [15]: x = np.ma.masked_array([[1, 2]], mask=[[0, 1]])                                         

In [16]: w = np.ma.masked_array([[0, 0]], mask=[[0, 0]])                                         

In [17]: np.ma.average(x, weights=w, axis=1)                                                     
Out[17]: 
masked_array(data=[--],
             mask=[ True],
       fill_value=1e+20,
            dtype=float64)

With this pull request, we get a warning and an explicit nan:

In [18]: x = np.ma.masked_array([[1, 2]], mask=[[0, 1]])                        

In [19]: w = np.ma.masked_array([[0, 0]], mask=[[0, 0]])                        

In [20]: np.ma.average(x, weights=w, axis=1)                                    
/Users/warren/mc37np/lib/python3.7/site-packages/numpy-1.18.0.dev0+f4dd770-py3.7-macosx-10.9-x86_64.egg/numpy/ma/extras.py:694: RuntimeWarning: invalid value encountered in true_divide
  avg = prod.sum(axis=axis, dtype=result_dtype) / scl
Out[20]: 
masked_array(data=[nan],
             mask=[False],
       fill_value=1e+20)

One can argue that the new behavior is preferred, but it is a backwards incompatible change that might affect existing code. Like it or not, the masked array functions tend to automatically convert values that would normally be nan or inf into a mask. E.g.

In [19]: x = np.ma.masked_array([0, 1, 2])

In [20]: 1/x
Out[20]: 
masked_array(data=[--, 1.0, 0.5],
             mask=[ True, False, False],
       fill_value=1e+20)

With a regular array, we get a warning and the result includes inf:

In [21]: 1/np.arange(3)                                                                          
/Users/warren/a2019.10/bin/ipython:1: RuntimeWarning: divide by zero encountered in true_divide
  #!/Users/warren/a2019.10/bin/python
Out[21]: array([inf, 1. , 0.5])

shoeffner (Author) commented

Thank you very much for your valuable feedback!

I will go through your comments and adjust the code accordingly over the next few days. If I remember correctly, most of the changes (like the views and the ntypes.dtypes) were copied from the normal array operations, but I will have to take a closer look and see why I decided on that. It might also be that I found them in the test suite somewhere.

I very much appreciate the comment on the change of functionality: I honestly was not aware of that issue and will a) check whether I have any unit tests covering it and b) investigate how to keep backwards compatibility.

Uses scalar types like np.float64 instead of dtype instances in ma.mean
and ma.var.
To retain backwards compatibility, division-by-zero entries in masked
arrays are masked out in the results by using true_divide from
numpy.ma rather than a plain /.
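The masking behavior of numpy.ma's division can be seen in a short sketch like this (hypothetical values; the zero divisor ends up masked, just like in Warren's 1/x example above):

import numpy as np

num = np.ma.masked_array([1.0, 2.0, 3.0])
den = np.ma.masked_array([2.0, 0.0, 4.0])

# true_divide masks the entry where the division is undefined,
# instead of producing a warning and an explicit inf/nan.
print(np.ma.true_divide(num, den))  # expected: [0.5, --, 0.75]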
shoeffner (Author) commented

I addressed your comments:

  • switched to scalar types instead of dtype objects, as you recommended,
  • removed the views, as they seemed to be unnecessary,
  • added tests for the scalar types to the ma.mean tests to increase coverage,
  • added your regression example as a unit test and as a doctest example, so that one can see that values might be masked out in case of a division by zero,
  • used true_divide where I had divide or / before, as this seems to handle the masking of division-by-zero values automatically.

Now I am waiting for the CI to tell me what I broke but didn't test locally.

(One review comment on numpy/ma/extras.py, outdated and resolved.)
WarrenWeckesser (Member) commented

@shoeffner, sorry for letting this slip. I'll get back to this over the weekend, if not sooner.

charris changed the title from "Return data types of np.average and np.ma.average behave the same. Weights for np.ma.average can be masked." to "ENH: Update return value of ma.average and allow weights to be masked." on Mar 13, 2020
WarrenWeckesser (Member) commented Mar 16, 2020

Is there a reason that, when the input is type float16, the calculation is upcast to float32 and later cast back to float16? @shoeffner, I realize this is copied from the behavior of mean, so my question is really directed at other numpy devs, e.g. @seberg, @eric-wieser, etc. Is there a general policy of bumping float16 inputs to float32 for computations, or is this just a special behavior for mean?

EDIT: It seems to be just mean, but I haven't done a thorough review of the code to check.

WarrenWeckesser (Member) commented

@shoeffner, the changes to mean look good.

One of the goals of this pull request is to make the behavior of the mean and var methods of the masked arrays consistent with the ndarray methods. I noticed the following discrepancy in the values computed by the var method for the float16 array x:

In [101]: x = np.arange(-30, 30).astype(np.float16)                                                      

In [102]: mx = np.ma.masked_array(x, mask=np.zeros(x.shape, dtype=bool))

x and mx have the same shape, dtype and values. mx has an explicit mask, but no values are masked.

I would expect x.var() and mx.var() to return the same value, but the following shows otherwise:

In [103]: x.var(), x.var().dtype
Out[103]: (299.8, dtype('float16'))

In [104]: mx.var(), mx.var().dtype
Out[104]: (300.0, dtype('float16'))

(The results do have the same dtype, which is part of the improvement made in this pull request, so we're making progress!)

Numerically, the value 300.0 is closer to the answer one gets with higher precision, 299.9166666666667. In this case, it is the ndarray method that is not boosting the precision of the intermediate calculation. We can verify that by using an explicit dtype argument:

In [228]: x.var(dtype=np.float32).astype(np.float16)
Out[228]: 300.0

So, to maintain consistency with the ndarray var method, the masked array version of var shouldn't cast float16 to float32 internally.

(Unless, that is, we also modify the ndarray var method to internally cast float16 to float32, like mean does. But that would require discussion from other numpy devs, and I suspect if that was desired, it would have been done already, perhaps at the same time mean was changed.)

Adding tests to check that var and mean behave the same as their non-masked counterparts.
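A minimal sketch of such a consistency check (hypothetical test name, mirroring Warren's example above):

import numpy as np

def test_var_float16_matches_ndarray():
    # After this PR's change: with an all-False mask, ma.var should agree
    # with ndarray.var in both value and dtype.
    x = np.arange(-30, 30).astype(np.float16)
    mx = np.ma.masked_array(x, mask=np.zeros(x.shape, dtype=bool))
    assert x.var() == mx.var()
    assert x.var().dtype == mx.var().dtype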
shoeffner (Author) commented

That is indeed copied; I thought the idea was, exactly as you said, to keep a higher precision for the intermediate divisions. I changed it so that it is more consistent with the behaviour of ndarray.var. I also added your example using different dtypes as a test to check that the behavior stays the same for ndarray and ma.

Thank you for your thorough tests and feedback, @WarrenWeckesser!

charris (Member) commented Mar 16, 2020

@WarrenWeckesser float16 arithmetic is implemented in C by casting to float32, doing the operation, and downcasting afterwards. You can see that in the ufunc loops. So there is probably no problem with doing that here. The idea is that float16 is mostly a storage format rather than a computation format, although I think that has changed over the years with some GPUs implementing the operations. Converting whole arrays is probably not memory efficient, but no doubt much better numerically.
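The pattern charris describes can be written out at the array level (an illustrative sketch, not the actual C ufunc loop):

import numpy as np

a = np.arange(10, dtype=np.float16)

# Upcast to float32, compute, then downcast the result back to float16,
# which is what the float16 ufunc loops do per element.
result = (a.astype(np.float32) * a.astype(np.float32)).astype(np.float16)
print(result, result.dtype)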

(One review comment on numpy/ma/tests/test_core.py, outdated and resolved.)
seberg (Member) commented Mar 16, 2020

I do not think there is a general policy; right now it seems pretty random. A general policy would be nice, but I am not sure how feasible it is in the short term. Some reduction-like operations upcast for sure; in general, ufunc reductions upcast along the fast axis. That means that along the fast axis you have rounded float32 precision, but along the slow axis you are limited by float16 precision (a somewhat annoying side effect is that it is not just less precise but also slower).

A long-term policy could be to try to always use float32 precision for float16 calculations, but at least for reductions the difference between the slow and fast axes would need to be removed, so that it is transparent when the downcast to float16 occurs. As Chuck says, float16 is mostly meant to be a storage format, so I am not sure that is even a good policy...
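One way to observe the fast-axis/slow-axis asymmetry seberg describes (an illustrative sketch; the exact outputs depend on the NumPy version and build):

import numpy as np

# Both arrays hold 10000 copies of 0.1 per output element; the true sum
# is about 1000, well within float16 range.
fast = np.full((2, 10000), 0.1, dtype=np.float16)
slow = np.full((10000, 2), 0.1, dtype=np.float16)

# Reducing along the contiguous (fast) axis can use a float32 accumulator,
# so the result stays close to 1000.
print(fast.sum(axis=1))

# Reducing along the slow axis accumulates in float16; the running sum
# stalls once the float16 spacing exceeds the addend, far below 1000.
print(slow.sum(axis=0))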

shoeffner (Author) commented

For 982637f, the Azure Pipelines failed for pypy3 (numpy/core/tests/test_einsum.py ...........Fatal Python error: Illegal instruction); I hope this is unrelated...

So should I now keep the behavior as it is (i.e., identical behavior between ma and ndarray), or should I switch to up/down-casting for all operations?

Do you have any other things I can address for the PR? :)

mattip (Member) commented Mar 18, 2020

Indeed the PyPy failure is both unrelated and annoying.

Base automatically changed from master to main March 4, 2021 02:04
Projects

Status: Awaiting a code review
Development

Successfully merging this pull request may close these issues.

Return data types of np.average and np.ma.average behave differently
6 participants