
PERF: masked ops for reductions (sum) #30982

Merged
merged 24 commits into pandas-dev:master on Mar 27, 2020

Conversation

jorisvandenbossche
Member

The current nanops module has quite a bit of complexity that is not needed for the masked arrays. This is a small proof of concept with separate implementations for our nullable masked arrays, taking the sum case (and still ignoring the additional kwargs).

This is also quite a bit faster. On master:

In [1]: a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]*1000)  

In [2]: s1 = pd.Series(a, dtype="Int64")

In [3]: s2 = pd.Series(a, dtype="float64") 

In [4]: %timeit s1.sum()   
79.1 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit s2.sum() 
79.1 µs ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So on master, the nullable Int64 basically does the same work as the nanops implementation for float.
With this PR:

In [4]: %timeit s1.sum()  
21.5 µs ± 69.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit s2.sum() 
79.8 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

(When using bigger arrays, the relative speed-up shrinks. Here it's almost 4x with 10k elements, but with 1M it's around 2x.)

I think it would be interesting to gradually implement some of the ops specifically for the nullable masked arrays. Personally I think it will be easier to do this in new functions than trying to fit it into the existing nanops function.
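The approach can be sketched roughly as follows. This is a hypothetical, simplified version of a masked sum, not the PR's exact code: the real implementation returns `pd.NA` and supports `min_count`, while here `None` stands in for the NA scalar.

```python
import numpy as np

def masked_sum(values: np.ndarray, mask: np.ndarray, skipna: bool = True):
    """Sum of `values`, where True entries in `mask` mark missing values.

    Simplified sketch; `None` stands in for pd.NA.
    """
    if not skipna:
        if mask.any():
            return None  # a missing value propagates when skipna=False
        return np.sum(values)
    # numpy >= 1.17: reduce only over the unmasked elements
    return np.sum(values, where=~mask)
```

Because the mask already identifies the missing entries, there is no need for the fill-value machinery or bottleneck dispatch that nanops carries.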

@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Jan 13, 2020
@jbrockmendel
Member

Looks promising

@gfyoung gfyoung added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance labels Jan 15, 2020
@jorisvandenbossche jorisvandenbossche added the NA - MaskedArrays Related to pd.NA and nullable extension arrays label Jan 30, 2020
@jorisvandenbossche jorisvandenbossche removed the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Feb 13, 2020
@jorisvandenbossche jorisvandenbossche changed the title POC masked ops for reductions PERF: masked ops for reductions (sum) Feb 13, 2020
@jorisvandenbossche
Member Author

I updated this to finalize the "sum" case (basically added min_count support).

cc @TomAugspurger @jreback

Contributor

@TomAugspurger TomAugspurger left a comment

This looks quite nice. I also think it's worth doing these separately, rather than trying to shoehorn it into our current nanops functions (which I've tried and failed to do a couple times).

Numpy array with the values (can be of any dtype that support the
operation).
mask : np.ndarray
Boolean numpy array (False for missing)
Contributor

What does "False for missing" mean here?

Member Author

Hmm, not sure; it should actually be "True for missing" if it is meant to explain that the True values in the mask are missing values (maybe I was first planning to pass an inverted mask).
Will update that.

@jorisvandenbossche
Member Author

More comments?

return np.sum(values, where=~mask)


def _below_min_count(values, mask, min_count):
Contributor

we already do this in the nanops sum, yes?

can you de-duplicate?

Member Author

Yes, it's here:

pandas/pandas/core/nanops.py

Lines 1238 to 1271 in 37a7006

def _maybe_null_out(
result: np.ndarray,
axis: Optional[int],
mask: Optional[np.ndarray],
shape: Tuple,
min_count: int = 1,
) -> float:
"""
Returns
-------
Dtype
The product of all elements on a given axis. ( NaNs are treated as 1)
"""
if mask is not None and axis is not None and getattr(result, "ndim", False):
null_mask = (mask.shape[axis] - mask.sum(axis) - min_count) < 0
if np.any(null_mask):
if is_numeric_dtype(result):
if np.iscomplexobj(result):
result = result.astype("c16")
else:
result = result.astype("f8")
result[null_mask] = np.nan
else:
# GH12941, use None to auto cast null
result[null_mask] = None
elif result is not NaT:
if mask is not None:
null_mask = mask.size - mask.sum()
else:
null_mask = np.prod(shape)
if null_mask < min_count:
result = np.nan
return result

But I would prefer not to deduplicate, as the one in nanops is mixed with other logic / more complex because it needs to handle more dimensions.
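For comparison, a 1D-only version of that check is indeed much smaller than `_maybe_null_out`. This is a hypothetical sketch, not the PR's exact helper:

```python
import numpy as np

def below_min_count(mask: np.ndarray, min_count: int) -> bool:
    # True when fewer than `min_count` valid (unmasked) values remain,
    # in which case the reduction result should be NA
    if min_count > 0:
        return bool(mask.size - mask.sum() < min_count)
    return False
```

With no axis handling, no dtype casting, and no NaT special case, the 1D helper stays a few lines long.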

Contributor

disagree this is more code to maintain which keeps happening with ios

Member Author

@jorisvandenbossche jorisvandenbossche Feb 20, 2020

Sorry, what's "with ios" ?

if _np_version_under1p17:
return np.sum(values[~mask])
else:
return np.sum(values, where=~mask)
Member

out of curiosity, is this more efficient than the version used in older numpy?

Member Author

@jorisvandenbossche jorisvandenbossche Feb 20, 2020

Good question. With the array in my top post it gives only a slight boost (a 5-8% speed-up), but it varies quite a bit (I could also have it slower with more np.nan values).
But when going to larger arrays, there seems to be a clearer difference:

In [40]: a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]*1_000_000)   

In [41]: mask = np.isnan(a)  

In [42]: %timeit np.sum(a, where=~mask)    
17.7 ms ± 861 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [43]: %timeit np.sum(a[~mask])  
43.3 ms ± 4.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Although the difference gets smaller when there are more missing values (eg with 50% missing values instead of 10% in the above, it becomes 25 vs 32ms).

But I assume there is also a memory benefit (avoiding the temporary array), although I am not fully familiar with the inner workings of this in numpy.
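The two code paths compute the same result; the difference is that fancy indexing materializes a temporary copy of the valid values, while `where=` skips the masked elements during the reduction itself:

```python
import numpy as np

a = np.array([1.0, 2.0, np.nan, 4.0])
mask = np.isnan(a)

# pre-1.17 path: a[~mask] allocates a temporary filtered array first
s_copy = np.sum(a[~mask])

# numpy >= 1.17 path: reduces over the unmasked elements directly,
# avoiding the temporary allocation
s_where = np.sum(a, where=~mask)

assert s_copy == s_where == 7.0
```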

@@ -533,13 +533,14 @@ def test_sum_inf(self):
res = nanops.nansum(arr, axis=1)
assert np.isinf(res).all()

@pytest.mark.parametrize("dtype", ["float64", "Int64", "boolean"])
Member

should object be included here?

Member

maybe None?

Member Author

Added object dtype

The required number of valid values to perform the operation. If fewer than
``min_count`` non-NA values are present the result will be NA.
"""
if not skipna:
Member

most of the code here is not sum-specific. is the idea to eventually make a template that gets re-used?

Member Author

Yes, that's possible. I mainly want to first implement a single one, so we can review this and agree on the principles. Afterwards there can be more PRs implementing the other reductions, and then we should indeed see what's the best way to avoid duplication.
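One way such a template could look — a hypothetical factory that wraps a numpy reduction into a masked variant (only the simple skipna logic; sum/prod would additionally need min_count handling, and `None` again stands in for `pd.NA`):

```python
import numpy as np

def make_masked_reduction(np_func):
    """Hypothetical factory: turn a numpy reduction into a masked one."""
    def reduction(values: np.ndarray, mask: np.ndarray, skipna: bool = True):
        if not skipna and mask.any():
            return None  # a missing value propagates when skipna=False
        # filter out the missing entries before reducing
        return np_func(values[~mask])
    return reduction

masked_min = make_masked_reduction(np.min)
masked_max = make_masked_reduction(np.max)
```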

@jbrockmendel
Member

Some of the code here looks similar to the min/max methods in DatetimeIndex/TimedeltaIndex (specifically the Indexes, not the EAs) since those use their cached masks. It'd be nice if some of that could be shared

@jorisvandenbossche
Member Author

jorisvandenbossche commented Feb 20, 2020

Some of the code here looks similar to the min/max methods in DatetimeIndex/TimedeltaIndex (specifically the Indexes, not the EAs) since those use their cached masks. It'd be nice if some of that could be shared

Do you mean this one?

def min(self, axis=None, skipna=True, *args, **kwargs):
"""
Return the minimum value of the Index or minimum along
an axis.
See Also
--------
numpy.ndarray.min
Series.min : Return the minimum value in a Series.
"""
nv.validate_min(args, kwargs)
nv.validate_minmax_axis(axis)
if not len(self):
return self._na_value
i8 = self.asi8
try:
# quick check
if len(i8) and self.is_monotonic:
if i8[0] != iNaT:
return self._box_func(i8[0])
if self.hasnans:
if skipna:
min_stamp = self[~self._isnan].asi8.min()
else:
return self._na_value
else:
min_stamp = i8.min()
return self._box_func(min_stamp)
except ValueError:
return self._na_value

There are indeed some similar concepts, but I am not sure it is that easy to share code (if you have specific ideas, feel free to share). E.g. the above doesn't need to handle min_count, and in my version I can hardcode pd.NA instead of needing to make it a variable. And it would only be the lines 262-269 in the above code that could be replaced with a function call.

Maybe if we have a "templated" version (eg where you pass the numpy function to call as variable), as mentioned in one of the other inline comments, this would become easier. But I would personally only look into it when also implementing min/max in mask_ops (sum/prod are a bit special with the min_count).

@jreback
Contributor

jreback commented Feb 20, 2020

@jorisvandenbossche we need to share more code with the very well tested internal functions

adding more and more exclusively for the extension types just adds more bugs

i would refactor much sooner than later (now)

rather than continuing to add things

@jorisvandenbossche
Member Author

jorisvandenbossche commented Feb 20, 2020

So I know this is adding code, but reasons to do this separately:

  • The current nanops have a lot of complexity that is not needed for the masked arrays:
    • It deals with bottleneck switching, which is not needed here without NaNs
    • It deals with 2D, which is not needed here
  • The nanops also take a different approach: it fills the array with a fill_value appropriate for the specific op / dtype combo, while here I am applying the mask to filter out missing values
  • Trying to integrate the code and concepts I use here into nanops will make the already complex code in nanops even more complex. So while this PR adds a separate implementation, I think it actually gives a simpler and easier to understand end result than trying to combine both.
  • This new implementation here is (IMO, but I also wrote it, so I am biased) much more readable than the existing nanops.

I could try to do a POC of combining them so we have something to compare with. But even if that's possible without too many hooks, for the above reasons I prefer to keep the "masked ops" separate.

Personally, I would (short term) rather put effort into sharing the testing code instead of sharing the actual implementation: improve testing by ensuring we re-use many more of our extensive, existing tests for the numeric ops (so run those tests also for the nullable dtypes, instead of testing them separately as we do now).

Another way of sharing code could also be the other way around: instead of trying to integrate this into nanops; we could also start looking into gradually using those "masked ops" more for some of the existing dtypes. For example, as @jbrockmendel mentioned above, the Datetime/TimedeltaIndex min/max functions could use a masked op version of min/max.

@jbrockmendel
Member

There are indeed some similar concepts, but I am not sure if it is that easy to share code (if you have specific ideas, feel free to share).

The closest thing I had to a specific idea was "huh, this looks kinda similar to that other thing"

@jorisvandenbossche
Member Author

OK, moved, so this is ready for another look

Contributor

@jreback jreback left a comment

lgtm. Small follow-on request: this needs a whatsnew note?

@jreback
Contributor

jreback commented Mar 26, 2020

does this need asv's?

@jorisvandenbossche
Member Author

does this need asv's?

Added the Int64/boolean dtype to the parametrization of some existing benchmarks (Any/All/NanOps and FrameOps).

And added a short whatsnew notice.

@jreback
Contributor

jreback commented Mar 26, 2020

lgtm @jbrockmendel @TomAugspurger if any comments

@TomAugspurger
Contributor

TomAugspurger commented Mar 26, 2020 via email

@@ -229,6 +229,8 @@ Performance improvements
sparse values from ``scipy.sparse`` matrices using the
:meth:`DataFrame.sparse.from_spmatrix` constructor (:issue:`32821`,
:issue:`32825`, :issue:`32826`, :issue:`32856`, :issue:`32858`).
- Performance improvement in :meth:`Series.sum` for nullable (integer and boolean) dtypes (:issue:`30982`).
Member

is DataFrame.sum affected?

Member Author

No, since we don't use the EA implementation for reductions in DataFrame reductions; that's my other PR.

from pandas.core.nanops import _below_min_count


def sum(
Member

can we avoid overriding a built-in name?

Member Author

can we avoid overriding a built-in name?

Do we care? In practice this is only used as masked_reductions.sum(..), so there is no risk of confusing it (eg numpy also has a sum ..).

Member Author

I also don't have a problem with renaming it to eg masked_sum, though (but all together it then becomes a bit long masked_reductions.masked_sum(..))

Member

nansum? Not a strong enough opinion to be a blocker.

Member Author

There are no nan's involved here ;), that's exactly the difference with nanops

Member

ha, sounds good

Boolean numpy array (True values indicate missing values).
skipna : bool, default True
Whether to skip NA.
min_count : int, default 0
Member

required >=0?

Member Author

It's a count, so yes
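The min_count semantics under discussion can be illustrated with a small, simplified sketch (`None` stands in for `pd.NA`; the hypothetical helper name is for illustration only):

```python
import numpy as np

def sum_with_min_count(values: np.ndarray, mask: np.ndarray, min_count: int = 0):
    # return NA (here: None) when fewer than min_count valid values exist
    if min_count > 0 and mask.size - mask.sum() < min_count:
        return None
    return np.sum(values, where=~mask)
```

With min_count=0 (the default) an all-missing array sums to 0; a positive min_count turns insufficiently populated results into NA.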

@@ -1238,7 +1238,7 @@ def _maybe_null_out(
result: np.ndarray,
axis: Optional[int],
mask: Optional[np.ndarray],
shape: Tuple,
shape: Tuple[int],
Member

does Tuple[int] describe a 1-tuple, or any-length-tuple?

Member Author

Ah, yes, apparently that needs to be Tuple[int, ...] for a variable length tuple

@jreback
Contributor

jreback commented Mar 27, 2020

@jbrockmendel merge if you are good with responses

@jbrockmendel jbrockmendel merged commit 5e3a5f6 into pandas-dev:master Mar 27, 2020
@jbrockmendel
Member

thanks @jorisvandenbossche

@jorisvandenbossche
Member Author

And thanks for the reviews!

@jorisvandenbossche jorisvandenbossche deleted the masked-ops branch March 27, 2020 20:00
if _np_version_under1p17:
return np.sum(values[~mask])
else:
return np.sum(values, where=~mask)
Member

@jorisvandenbossche This seems like a useful pattern for other reductions like min and max (maybe a special case is needed if we have a min_count?), wondering if it might make sense to generalize this to those? Could be more performant than using nanops, and could also be done in a way that's extensible to string dtype (where nanops can't be used)

Member Author

Yes, that's indeed the goal of this PR (I just wanted to limit it to a single function for the initial PR). For min/max, I think it will be better to write a separate function in the module, since they don't use min_count, but prod should be able to reuse this.
I was planning to do follow-up PRs at the end of this week, but feel free to pick it up!
