
PERF: improve 2D array access / transpose() for masked dtypes #52083

Conversation

jorisvandenbossche (Member):

xref #52016

transpose(), and operations that do a transpose under the hood (e.g. typically operations with axis=1), are generally expensive if you have extension dtypes. This PR explores the possibility of adding a fast path for the specific case of a DataFrame where all columns have masked dtypes.

Using the example dataframe from #52016, this gives a significant speed-up:

shape = 250_000, 100
mask = pd.DataFrame(np.random.randint(0, 2, size=shape))  # random 0/1 values

np_mask = mask.astype(bool)
pd_mask = mask.astype(pd.BooleanDtype())

%timeit pd_mask.transpose()
# 9.13 s ± 347 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   <-- main
# 689 ms ± 28.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- PR

Note that several different optimizations are combined here (they can be split up, and the code still needs to be cleaned up in general).
The main speed-up comes from calling BooleanArray(..) instead of BooleanArray._from_sequence(..) (and likewise for any other masked array class), because the generic _from_sequence has a lot of overhead that adds up when calling it 250,000 times as in this example. To be able to use the main __init__, I added a variant of mgr.as_array() that returns separate data and mask arrays.
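To illustrate the difference from the public API (this is not the PR's internal code path): constructing a BooleanArray directly from pre-built data and mask arrays skips the per-element inference and validation that the generic _from_sequence path performs.

```python
import numpy as np
import pandas as pd

# data holds the values, mask marks missing entries (True = missing)
data = np.array([True, False, True])
mask = np.array([False, False, True])

# Direct construction: no inference, just wraps the two arrays
arr = pd.arrays.BooleanArray(data, mask)

# The generic path: infers values and mask from a sequence of scalars,
# with considerably more overhead per call
arr2 = pd.arrays.BooleanArray._from_sequence(
    [True, False, None], dtype="boolean"
)

print(arr[2] is pd.NA)  # the masked position reads back as pd.NA
```

Both produce equivalent arrays; the direct constructor is what makes the per-row calls cheap when it is invoked hundreds of thousands of times.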

Further, after the arrays are constructed, the BlockManager construction can be optimized by passing verify_integrity=False and by knowing that we have all 1D EAs (so there is no need to do block dtype grouping etc).
Another tiny optimization is in BooleanArray(..): avoiding the creation of a new dtype instance every time (this can easily be split off; we might want to do that generally for all primitive dtypes, and perhaps rather in the __init__ of the dtype).

@jorisvandenbossche jorisvandenbossche added Performance Memory or execution speed performance NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Mar 20, 2023
Comment on lines +3618 to +3625
if isinstance(dtype, BaseMaskedDtype):
    data, mask = self._mgr.as_array_masked()
    new_values = [arr_type(data[i], mask[i]) for i in range(self.shape[0])]
else:
    values = self.values
    new_values = [
        arr_type._from_sequence(row, dtype=dtype) for row in values
    ]
jorisvandenbossche (Member Author):

So this is the main change (it brings the largest part of the speed-up for transpose).

Comment on lines +10954 to +10967
# if len(df._mgr) > 0:
# common_dtype = find_common_type(list(df._mgr.get_dtypes()))
# is_masked_ea = isinstance(common_dtype, BaseMaskedDtype)
# is_np = isinstance(common_dtype, np.dtype)
# else:
# common_dtype = None

# if axis == 1 and common_dtype and is_masked_ea:
# data, mask = self._mgr.as_array_masked()
# ea2d = common_dtype.construct_array_type()(data, mask)
# result = ea2d._reduce(name, axis=axis, skipna=skipna, **kwds)
# labels = self._get_agg_axis(axis)
# result = self._constructor_sliced(result, index=labels, copy=False)
# return result
jorisvandenbossche (Member Author):

@rhshadrach this relates to #51923, and making use of the as_array_masked added in this PR could be a potential different way of tackling it.

Manually running one of the benchmarks you mentioned in #51955 (the bottom row of - 8.94±0s 24.3±0.3ms 0.00 stat_ops.FrameOps.time_op('mean', 'Int64', 1)), with the above uncommented, I get:

In [1]: values = np.random.randn(100000, 4)
   ...: df = pd.DataFrame(values.astype(int)).astype("Int64")

In [2]: %timeit df.mean(axis=1)
5.97 s ± 434 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   # <-- main
1.74 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   # <-- PR

Now, this is probably a bit too optimistic, because as_array_masked currently assumes that all columns have the same dtype; with the use of find_common_type here, it would need to handle some extra cases as well. But in general it should also give a way to speed up this specific case (while using the actual masked reduction implementations, which already support 2D with axis=1).

Member:

Now, this is probably a bit too optimistic, because the as_array_masked is currently assuming all the same dtypes, but with the use of find_common_type here, it would need to handle some extra cases as well.

As long as we're not doing inference, the transpose is guaranteed to be a single dtype. This is because two different dtypes in the transpose could only arise from two different dtypes in the rows of the original. So we can determine the common dtype, cast to a single dtype, and then transpose.
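This argument can be checked from the public API: cast to one masked dtype first, then transpose, and every column of the result is guaranteed to carry that dtype. A minimal sketch (not the PR's code path):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": pd.array([1, 2, None], dtype="Int64"),
        "b": pd.array([1.5, 2.5, 3.5], dtype="Float64"),
    }
)

# Determine/choose a common masked dtype, cast, then transpose:
# the transposed frame is homogeneous by construction.
dft = df.astype("Float64").T

print(dft.dtypes.unique())  # every column is Float64
```

With the cast done up front, a helper like as_array_masked only ever has to handle the single-dtype case.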

jorisvandenbossche (Member Author):

Doing a cast beforehand might indeed be the easiest, then as_array_masked doesn't need to handle the case of multiple dtypes itself.

@jbrockmendel (Member):

Is the concern about transpose specifically or more about the axis=1 reductions? If the latter, would a reduce_axis1 approach that avoids the transpose altogether be more robust? It'd be nice to avoid special-casing our internal EAs.

Can any of this be extended eventually to arrow-backed cases?

@rhshadrach (Member):

Is the concern about transpose specifically or more about the axis=1 reductions? If the latter, would a reduce_axis1 approach that avoids the transpose altogether be more robust?

For those ops where there is a simple online algorithm (e.g. mean, sum, stddev, variance, sem, prod, kurtosis), I do think this would be a good option to look into. However there will be ops for which this won't work (median, quantile, rank, mode, nunique).
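For the ops with a simple vectorized form, a row-wise reduction can be computed straight from the 2D data and mask arrays without any transpose. A minimal numpy sketch of an axis=1 mean, assuming the masked-array convention that True in the mask marks a missing value:

```python
import numpy as np

# 2D data plus a boolean mask (True = missing), as a masked array stores them
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
mask = np.array([[False, True, False],
                 [False, False, False]])

# Row-wise mean that skips masked entries, computed without any transpose:
# zero out the missing values, sum per row, divide by the valid count
sums = np.where(mask, 0.0, data).sum(axis=1)
counts = (~mask).sum(axis=1)
means = sums / counts

print(means)  # [2. 5.]
```

Ops like median or quantile need the full set of valid values per row, which is why the same trick does not directly apply to them.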

@jorisvandenbossche (Member Author):

Yes, I also mentioned that in #52016 (comment). We can certainly try to avoid the transpose altogether, but improving transpose itself might also be worth it (plus, as Richard just commented, there are methods where the transpose cannot easily be avoided).
(the main new function here that I added, as_array_masked, could also be used for both, see my inline comment above at #52083 (comment))

Can any of this be extended eventually to arrow-backed cases?

Yes, in general the arrow dtypes will suffer from the same problems for row-based operations (as long as Arrow doesn't add specific functionality for that). I think it could be quite easy to extend the approach here to work for arrow dtypes as well (converting their bitmask to a bytemask).
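The bitmask-to-bytemask conversion mentioned here can be sketched with numpy. Arrow validity bitmasks are LSB-ordered with a set bit meaning "valid", while the masked arrays use one byte per value with True meaning "missing", so the unpacked bits need to be inverted. A sketch under those assumptions:

```python
import numpy as np

# Arrow-style validity bitmask for 3 values: bits 0 and 2 set -> valid,
# bit 1 unset -> missing (LSB bit order, 1 = valid)
bitmask = np.array([0b00000101], dtype=np.uint8)
n_values = 3

# Unpack to one byte per value, trim the padding bits, then invert to get
# a pandas-style mask where True marks a missing value
valid = np.unpackbits(bitmask, bitorder="little")[:n_values].astype(bool)
bytemask = ~valid

print(bytemask)  # [False  True False]
```

The resulting bytemask could then feed the same data/mask fast path that this PR adds for the masked dtypes.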

@rhshadrach rhshadrach mentioned this pull request Apr 5, 2023
Comment on lines +1758 to +1761
# # TODO(CoW) handle case where resulting array is a view
# if len(self.blocks) == 0:
# arr = np.empty(self.shape, dtype=float)
# return arr.transpose()
Member:

@jorisvandenbossche - I don't follow this todo - can you elaborate?

@@ -2544,6 +2548,7 @@ def _from_arrays(
dtype=dtype,
verify_integrity=verify_integrity,
typ=manager,
is_1d_ea_only=is_1d_ea_only,
Member:

arrays_to_mgr already has a consolidate keyword; could we use that instead?

@github-actions (Contributor):

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label May 15, 2023
@mroeschke (Member):

Closing to clear the queue, but feel free to reopen when you have time to revisit

@mroeschke mroeschke closed this Aug 1, 2023
@mroeschke mroeschke added the Mothballed Temporarily-closed PR the author plans to return to label Aug 1, 2023