
PERF: masked ops for reductions (sum) #30982

Merged
merged 24 commits into pandas-dev:master on Mar 27, 2020

Conversation

jorisvandenbossche
Member

The current nanops module has quite a bit of complexity that is not needed for the masked arrays. This is a small proof of concept with separate implementations for our nullable masked arrays, taking the sum case (and still ignoring the additional kwargs).

This is also quite a bit faster. On master:

In [1]: a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]*1000)  

In [2]: s1 = pd.Series(a, dtype="Int64")

In [3]: s2 = pd.Series(a, dtype="float64") 

In [4]: %timeit s1.sum()   
79.1 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit s2.sum() 
79.1 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So on master, the nullable Int64 basically does the same work as the nanops implementation for float.
With this PR:

In [4]: %timeit s1.sum()  
21.5 µs ± 69.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit s2.sum() 
79.8 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

(When using bigger arrays, the relative speed-up shrinks. Here it's almost 4x with 10k elements, but with 1M it's around 2x.)

I think it would be interesting to gradually implement some of the ops specifically for the nullable masked arrays. Personally I think it will be easier to do this in new functions than trying to fit it into the existing nanops function.
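The approach can be sketched roughly as follows. This is a hypothetical, simplified version of a masked sum, not the PR's exact code: the real implementation returns `pd.NA` and supports `min_count`, while here `None` stands in for the NA scalar.

```python
import numpy as np

def masked_sum(values: np.ndarray, mask: np.ndarray, skipna: bool = True):
    """Sum of `values`, where True entries in `mask` mark missing values.

    Simplified sketch; `None` stands in for pd.NA.
    """
    if not skipna:
        if mask.any():
            return None  # a missing value propagates when skipna=False
        return np.sum(values)
    # numpy >= 1.17: reduce only over the unmasked elements
    return np.sum(values, where=~mask)
```

Because the mask already identifies the missing entries, there is no need for the fill-value machinery or bottleneck dispatch that nanops carries.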

@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Jan 13, 2020
@jbrockmendel
Member

Looks promising

@gfyoung gfyoung added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance labels Jan 15, 2020
@jorisvandenbossche jorisvandenbossche added the NA - MaskedArrays Related to pd.NA and nullable extension arrays label Jan 30, 2020
@jorisvandenbossche jorisvandenbossche removed the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Feb 13, 2020
@jorisvandenbossche jorisvandenbossche changed the title POC masked ops for reductions PERF: masked ops for reductions (sum) Feb 13, 2020
@jorisvandenbossche
Member Author

I updated this to finalize the "sum" case (basically added min_count support).

cc @TomAugspurger @jreback

Contributor

@TomAugspurger TomAugspurger left a comment

This looks quite nice. I also think it's worth doing these separately, rather than trying to shoehorn it into our current nanops functions (which I've tried and failed to do a couple times).

Numpy array with the values (can be of any dtype that support the
operation).
mask : np.ndarray
Boolean numpy array (False for missing)
Contributor

What does "False for missing" mean here?

Member Author

Hmm, not sure; it should actually be "True for missing" if it is meant to explain that the True values in the mask are missing values (maybe I was first planning to pass an inverted mask).
Will update that.

@jorisvandenbossche
Member Author

More comments?

return np.sum(values, where=~mask)


def _below_min_count(values, mask, min_count):
Contributor

we already do this in the nanops sum, yes?

can you de-duplicate?

Member Author

Yes, it's here:

pandas/pandas/core/nanops.py

Lines 1238 to 1271 in 37a7006

def _maybe_null_out(
result: np.ndarray,
axis: Optional[int],
mask: Optional[np.ndarray],
shape: Tuple,
min_count: int = 1,
) -> float:
"""
Returns
-------
Dtype
The product of all elements on a given axis. ( NaNs are treated as 1)
"""
if mask is not None and axis is not None and getattr(result, "ndim", False):
null_mask = (mask.shape[axis] - mask.sum(axis) - min_count) < 0
if np.any(null_mask):
if is_numeric_dtype(result):
if np.iscomplexobj(result):
result = result.astype("c16")
else:
result = result.astype("f8")
result[null_mask] = np.nan
else:
# GH12941, use None to auto cast null
result[null_mask] = None
elif result is not NaT:
if mask is not None:
null_mask = mask.size - mask.sum()
else:
null_mask = np.prod(shape)
if null_mask < min_count:
result = np.nan
return result

But I would prefer not to deduplicate, as the one in nanops is mixed with other logic / more complex because it needs to handle more dimensions.
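For comparison, a 1D-only version of that check is indeed much smaller than `_maybe_null_out`. This is a hypothetical sketch, not the PR's exact helper:

```python
import numpy as np

def below_min_count(mask: np.ndarray, min_count: int) -> bool:
    # True when fewer than `min_count` valid (unmasked) values remain,
    # in which case the reduction result should be NA
    if min_count > 0:
        return bool(mask.size - mask.sum() < min_count)
    return False
```

With no axis handling, no dtype casting, and no NaT special case, the 1D helper stays a few lines long.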

Contributor

disagree this is more code to maintain which keeps happening with ios

Member Author

@jorisvandenbossche jorisvandenbossche Feb 20, 2020

Sorry, what's "with ios" ?

if _np_version_under1p17:
return np.sum(values[~mask])
else:
return np.sum(values, where=~mask)
Member

out of curiosity, is this more efficient than the version used in older numpy?

Member Author

@jorisvandenbossche jorisvandenbossche Feb 20, 2020

Good question. With the array in my top post it gives only a slight boost (a 5-8% speed-up), but it varies quite a bit (I could also have it slower with more np.nan values).
But when going to larger arrays, there seems to be a clearer difference:

In [40]: a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]*1_000_000)   

In [41]: mask = np.isnan(a)  

In [42]: %timeit np.sum(a, where=~mask)    
17.7 ms ± 861 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [43]: %timeit np.sum(a[~mask])  
43.3 ms ± 4.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Although the difference gets smaller when there are more missing values (eg with 50% missing values instead of 10% in the above, it becomes 25 vs 32ms).

But I assume there is also a memory benefit (avoiding the temporary array), although I am not fully familiar with the inner workings of this in numpy.
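The two code paths compute the same result; the difference is that fancy indexing materializes a temporary copy of the valid values, while `where=` skips the masked elements during the reduction itself:

```python
import numpy as np

a = np.array([1.0, 2.0, np.nan, 4.0])
mask = np.isnan(a)

# pre-1.17 path: a[~mask] allocates a temporary filtered array first
s_copy = np.sum(a[~mask])

# numpy >= 1.17 path: reduces over the unmasked elements directly,
# avoiding the temporary allocation
s_where = np.sum(a, where=~mask)

assert s_copy == s_where == 7.0
```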

@@ -533,13 +533,14 @@ def test_sum_inf(self):
res = nanops.nansum(arr, axis=1)
assert np.isinf(res).all()

@pytest.mark.parametrize("dtype", ["float64", "Int64", "boolean"])
Member

should object be included here?

Member

maybe None?

Member Author

Added object dtype

The required number of valid values to perform the operation. If fewer than
``min_count`` non-NA values are present the result will be NA.
"""
if not skipna:
Member

most of the code here is not sum-specific. is the idea to eventually make a template that gets re-used?

Member Author

Yes, that's possible. I mainly want to first implement a single one, so we can review this and agree on the principles. Afterwards there can be more PRs implementing the other reductions, and then we should indeed see what's the best way to avoid duplication.
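One way such a template could look — a hypothetical factory that wraps a numpy reduction into a masked variant (only the simple skipna logic; sum/prod would additionally need min_count handling, and `None` again stands in for `pd.NA`):

```python
import numpy as np

def make_masked_reduction(np_func):
    """Hypothetical factory: turn a numpy reduction into a masked one."""
    def reduction(values: np.ndarray, mask: np.ndarray, skipna: bool = True):
        if not skipna and mask.any():
            return None  # a missing value propagates when skipna=False
        # filter out the missing entries before reducing
        return np_func(values[~mask])
    return reduction

masked_min = make_masked_reduction(np.min)
masked_max = make_masked_reduction(np.max)
```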

@jbrockmendel
Member

Some of the code here looks similar to the min/max methods in DatetimeIndex/TimedeltaIndex (specifically the Indexes, not the EAs) since those use their cached masks. It'd be nice if some of that could be shared

@jorisvandenbossche
Member Author

jorisvandenbossche commented Feb 20, 2020

Some of the code here looks similar to the min/max methods in DatetimeIndex/TimedeltaIndex (specifically the Indexes, not the EAs) since those use their cached masks. It'd be nice if some of that could be shared

Do you mean this one?

def min(self, axis=None, skipna=True, *args, **kwargs):
"""
Return the minimum value of the Index or minimum along
an axis.
See Also
--------
numpy.ndarray.min
Series.min : Return the minimum value in a Series.
"""
nv.validate_min(args, kwargs)
nv.validate_minmax_axis(axis)
if not len(self):
return self._na_value
i8 = self.asi8
try:
# quick check
if len(i8) and self.is_monotonic:
if i8[0] != iNaT:
return self._box_func(i8[0])
if self.hasnans:
if skipna:
min_stamp = self[~self._isnan].asi8.min()
else:
return self._na_value
else:
min_stamp = i8.min()
return self._box_func(min_stamp)
except ValueError:
return self._na_value

There are indeed some similar concepts, but I am not sure it is that easy to share code (if you have specific ideas, feel free to share). E.g. the above doesn't need to handle min_count, and in my version I can hardcode pd.NA instead of needing to make it a variable. And it would only be the lines 262-269 in the above code that could be replaced with a function call.

Maybe if we have a "templated" version (eg where you pass the numpy function to call as variable), as mentioned in one of the other inline comments, this would become easier. But I would personally only look into it when also implementing min/max in mask_ops (sum/prod are a bit special with the min_count).

@jreback
Contributor

jreback commented Feb 20, 2020

@jorisvandenbossche we need to share more code with the very well tested internal functions

adding more and more exclusively for the extension types just adds more bugs

i would refactor much sooner than later (now)

rather than continuing to add things

@jorisvandenbossche
Member Author

jorisvandenbossche commented Feb 20, 2020

So I know this is adding code, but reasons to do this separately:

  • The current nanops have a lot of complexity that is not needed for the masked arrays:
    • It deals with bottleneck switching, which is not needed here without NaNs
    • It deals with 2D, which is not needed here
  • The nanops also take a different approach: it fills the array with a fill_value appropriate for the specific op / dtype combo, while here I am applying the mask to filter out missing values
  • Trying to integrate the code and concepts I use here into nanops will make the already complex code in nanops even more complex. So while this PR adds a separate implementation, I think it actually gives a simpler and easier to understand end result than trying to combine both.
  • This new implementation here is (IMO, but I also wrote it, so I am biased) much more readable than the existing nanops.

I could try to do a POC of combining them so we have something to compare with. But even if that's possible without too many hooks, for the above reasons I prefer to keep the "masked ops" separate.

Personally, I would (short term) rather put effort into sharing the testing code instead of sharing the actual implementation: improve testing by ensuring we re-use many more of our extensive, existing tests for the numeric ops (so run those tests also for the nullable dtypes, instead of testing them separately as we do now).

Another way of sharing code could also be the other way around: instead of trying to integrate this into nanops; we could also start looking into gradually using those "masked ops" more for some of the existing dtypes. For example, as @jbrockmendel mentioned above, the Datetime/TimedeltaIndex min/max functions could use a masked op version of min/max.

@jbrockmendel
Member

There are indeed some similar concepts, but I am not sure if it is that easy to share code (if you have specific ideas, feel free to share).

The closest thing I had to a specific idea was "huh, this looks kinda similar to that other thing"

@jorisvandenbossche
Member Author

OK, moved, so this is ready for another look

Contributor

@jreback jreback left a comment

lgtm. Small follow-on request: this needs a whatsnew note?

@jreback
Contributor

jreback commented Mar 26, 2020

does this need asv's?

@jorisvandenbossche
Member Author

does this need asv's?

Added the Int64/boolean dtype to the parametrization of some existing benchmarks (Any/All/NanOps and FrameOps).

And added a short whatsnew notice.

@jreback
Contributor

jreback commented Mar 26, 2020

lgtm @jbrockmendel @TomAugspurger if any comments

@TomAugspurger
Contributor

TomAugspurger commented Mar 26, 2020 via email

@@ -229,6 +229,8 @@ Performance improvements
sparse values from ``scipy.sparse`` matrices using the
:meth:`DataFrame.sparse.from_spmatrix` constructor (:issue:`32821`,
:issue:`32825`, :issue:`32826`, :issue:`32856`, :issue:`32858`).
- Performance improvement in :meth:`Series.sum` for nullable (integer and boolean) dtypes (:issue:`30982`).
Member

is DataFrame.sum affected?

Member Author

No, since we don't use the EA implementation for reductions in DataFrame reductions; that's my other PR.

from pandas.core.nanops import _below_min_count


def sum(
Member

can we avoid overriding a built-in name?

Member Author

can we avoid overriding a built-in name?

Do we care? In practice this is only used as masked_reductions.sum(..), so there is no risk of confusing it (eg numpy also has a sum ..).

Member Author

I also don't have a problem with renaming it to eg masked_sum, though (but all together it then becomes a bit long masked_reductions.masked_sum(..))

Member

nansum? Not a strong enough opinion to be a blocker.

Member Author

There are no nan's involved here ;), that's exactly the difference with nanops

Member

ha, sounds good

Boolean numpy array (True values indicate missing values).
skipna : bool, default True
Whether to skip NA.
min_count : int, default 0
Member

required >=0?

Member Author

It's a count, so yes
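The min_count semantics under discussion can be illustrated with a small, simplified sketch (`None` stands in for `pd.NA`; the hypothetical helper name is for illustration only):

```python
import numpy as np

def sum_with_min_count(values: np.ndarray, mask: np.ndarray, min_count: int = 0):
    # return NA (here: None) when fewer than min_count valid values exist
    if min_count > 0 and mask.size - mask.sum() < min_count:
        return None
    return np.sum(values, where=~mask)
```

With min_count=0 (the default) an all-missing array sums to 0; a positive min_count turns insufficiently populated results into NA.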

@@ -1238,7 +1238,7 @@ def _maybe_null_out(
result: np.ndarray,
axis: Optional[int],
mask: Optional[np.ndarray],
shape: Tuple,
shape: Tuple[int],
Member

does Tuple[int] describe a 1-tuple, or any-length-tuple?

Member Author

Ah, yes, apparently that needs to be Tuple[int, ...] for a variable length tuple

@jreback
Contributor

jreback commented Mar 27, 2020

@jbrockmendel merge if you are good with responses

@jbrockmendel jbrockmendel merged commit 5e3a5f6 into pandas-dev:master Mar 27, 2020
@jbrockmendel
Member

thanks @jorisvandenbossche

@jorisvandenbossche
Member Author

And thanks for the reviews!

@jorisvandenbossche jorisvandenbossche deleted the masked-ops branch March 27, 2020 20:00
if _np_version_under1p17:
return np.sum(values[~mask])
else:
return np.sum(values, where=~mask)
Member

@jorisvandenbossche This seems like a useful pattern for other reductions like min and max (maybe a special case is needed if we have a min_count?), wondering if it might make sense to generalize this to those? Could be more performant than using nanops, and could also be done in a way that's extensible to string dtype (where nanops can't be used)

Member Author

Yes, that's indeed the goal of this PR (I just wanted to limit it to a single function for the initial PR). For min/max, I think it will be better to write a separate function in the module, since they don't use min_count, but prod should be able to reuse this.
I was planning to do follow-up PRs at the end of this week, but feel free to pick it up!
