Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add prod to masked_reductions #33442

Merged
merged 16 commits into from
Apr 12, 2020
Merged

ENH: Add prod to masked_reductions #33442

merged 16 commits into from
Apr 12, 2020

Conversation

dsaxton
Copy link
Member

@dsaxton dsaxton commented Apr 9, 2020

Adding prod to /core/array_algos/masked_reductions.py and using them for IntegerArray and BooleanArray. This also seems to offer a decent speedup over nanops like the other reductions:

# Branch

[ins] In [3]: %timeit arr.prod()  # arr = pd.Series([None, 0, 1, 2] * 10_000, dtype="Int64")
102 µs ± 379 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

[ins] In [5]: %timeit arr.prod()  # arr = pd.Series([0, 0, 1, 2] * 10_000, dtype="Int64")
82.6 µs ± 752 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# Master

[ins] In [4]: %timeit arr.prod()  # arr = pd.Series([None, 0, 1, 2] * 10_000, dtype="Int64")
291 µs ± 6.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

[ins] In [6]: %timeit arr.prod()  # arr = pd.Series([0, 0, 1, 2] * 10_000, dtype="Int64")
78.6 µs ± 2.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

@jorisvandenbossche jorisvandenbossche changed the title REF: Add prod to masked_reductions ENH: Add prod to masked_reductions Apr 9, 2020
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
Added a small comment

else:
subset = values[~mask]
if subset.size:
if not mask.all():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing a mask.all() might on average be expensive here, since in most cases you will actually need the subset afterwards

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, I see what you're saying. I'll switch that back

@jorisvandenbossche
Copy link
Member

Can you also add prod to the existing whatsnew note about this?

@jorisvandenbossche jorisvandenbossche added Enhancement NA - MaskedArrays Related to pd.NA and nullable extension arrays Performance Memory or execution speed performance and removed Enhancement labels Apr 9, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Apr 9, 2020
@dsaxton
Copy link
Member Author

dsaxton commented Apr 9, 2020

Hmm, looks like we are hitting an overflow issue in /tests/extension/base/reduce.test_reduce_series. This is interesting because I think the current implementation is only passing this test "by accident" since the input gets cast to float64 whenever it contains NA values (without this casting we would still get an overflow, which would be the case if the test input happened not to have any missing data): https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/integer.py#L571

Is this test actually correct (i.e., if an integer input would overflow, do we "expect" the output we would get if it was first cast as float)? https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/base/reduce.py#L19

@jbrockmendel
Copy link
Member

@dsaxton shot in the dark maybe algorithms.checked_add_with_arr has something useful?

@jorisvandenbossche
Copy link
Member

I suppose this is just the test that was not correct. Also for int64 dtype (default numpy one), we let it silently overflow instead of casting to float:

In [1]: a = np.arange(1, 101)  

In [2]: pd.Series(a).prod()   
Out[2]: 0

# no missing values -> use int algo from numpy
In [3]: pd.Series(a, dtype="Int64").prod()     
Out[3]: 0

# on this branch, also this returns 0
In [4]: pd.Series(list(a) + [None], dtype="Int64").prod()     
Out[4]: 93326215443944102188325606108575267240944254854960571509166910400407995064242937148632694030450512898042989296944474898258737204311236641477561877016501813248

So certainly having it differ depending on the presence of missing values or not as the current implementation gives is clearly a bug.

@pep8speaks
Copy link

pep8speaks commented Apr 10, 2020

Hello @dsaxton! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-11 16:42:43 UTC

result = op(data, mask, skipna=skipna, **kwargs)

# if we have numeric op that would result in an int, coerce to int if possible
if name == "prod" and notna(result):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you need to add this back?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was getting a failure on some builds for \tests\arrays\boolean\test_reduction.py where we're checking that the product has int dtype so it might be needed:

        elif op == "prod":
>           assert isinstance(getattr(s, op)(), np.int64)
E           AssertionError: assert False

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, then maybe the test needs to be edited. In any case, we should investigate why it is failing / what we should be expecting.

From a quick test, it seems that numpy returns int64:

In [6]: np.array([], dtype="bool").prod()                                                                                                                                                                          
Out[6]: 1

In [7]: type(_)                                                                                                                                                                                                    
Out[7]: numpy.int64

In [8]: np.array([True, False], dtype="bool").prod()                                                                                                                                                               
Out[8]: 0

In [9]: type(_)                                                                                                                                                                                                    
Out[9]: numpy.int64

So I think we should follow that behaviour here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

op = getattr(masked_reductions, name)
return op(data, mask, skipna=skipna, **kwargs)
result = op(data, mask, skipna=skipna, **kwargs)

Copy link
Contributor

@jreback jreback Apr 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you try using maybe_cast_result_dtype here (you will have to add the prod operation in side that)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand, that should not be needed. Numpy should give the desired result for booleans.

result = op(data, mask, skipna=skipna, **kwargs)
dtype = maybe_cast_result_dtype(dtype=data.dtype, how=name)
if notna(result) and (dtype != result.dtype):
result = result.astype(dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still not sure we actually need to do this. We could also choose to follow numpy's behaviour to return platform int (any idea what we do for "bool" dtype?)

BTW, this should also change the result for sum, so this was not tested?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche I think you're right and it was actually the test that needed to change here (np.int64 -> np.int_)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, in any case that's what I did for sum, I see now (so if we decide on following numpy vs always returning int64, we should do it for both sum and prod)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(but I am fine with following numpy)

@jreback jreback merged commit 6658d89 into pandas-dev:master Apr 12, 2020
@jreback
Copy link
Contributor

jreback commented Apr 12, 2020

thanks @dsaxton

@dsaxton dsaxton deleted the masked-prod branch April 12, 2020 23:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NA - MaskedArrays Related to pd.NA and nullable extension arrays Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants