
PERF: improve 2D array access / transpose() for masked dtypes #52083

Conversation

jorisvandenbossche (Member):

xref #52016

transpose(), and operations that do a transpose under the hood (e.g. typically operations with axis=1), are generally expensive if you have extension dtypes. This PR explores the possibility of adding a fast path for the specific case of a DataFrame where all columns have masked dtypes.

Using the example dataframe from #52016, this gives a significant speed-up:

shape = 250_000, 100
mask = pd.DataFrame(np.random.randint(0, 2, size=shape))  # random 0/1 values

np_mask = mask.astype(bool)
pd_mask = mask.astype(pd.BooleanDtype())

%timeit pd_mask.transpose()
# 9.13 s ± 347 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   <-- main
# 689 ms ± 28.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- PR

Note that several different optimizations are combined here (they can be split up, and the code still needs to be cleaned up in general).
The main speed-up comes from calling BooleanArray(..) instead of BooleanArray._from_sequence(..) (and likewise for any other masked array class), because the generic _from_sequence has a lot of overhead that adds up when calling it 250,000 times as in this example. To be able to use the main __init__, I added a variant of mgr.as_array() that returns separate data and mask arrays.
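To illustrate the difference from the public API (this is not the PR's internal code path): constructing a BooleanArray directly from pre-built data and mask arrays skips the per-element inference and validation that the generic _from_sequence path performs.

```python
import numpy as np
import pandas as pd

# data holds the values, mask marks missing entries (True = missing)
data = np.array([True, False, True])
mask = np.array([False, False, True])

# Direct construction: no inference, just wraps the two arrays
arr = pd.arrays.BooleanArray(data, mask)

# The generic path: infers values and mask from a sequence of scalars,
# with considerably more overhead per call
arr2 = pd.arrays.BooleanArray._from_sequence(
    [True, False, None], dtype="boolean"
)

print(arr[2] is pd.NA)  # the masked position reads back as pd.NA
```

Both produce equivalent arrays; the direct constructor is what makes the per-row calls cheap when it is invoked hundreds of thousands of times.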

Further, after the arrays are constructed, the BlockManager construction can be optimized by passing verify_integrity=False and by knowing that we have all 1D EAs (so there is no need to do block dtype grouping etc).
Another tiny optimization is in BooleanArray(..): avoiding the creation of a new dtype instance every time (this can easily be split off; we might want to do that generally for all primitive dtypes, and perhaps rather in the __init__ of the dtype).

@jorisvandenbossche jorisvandenbossche added Performance Memory or execution speed performance NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Mar 20, 2023
Comment on lines +3618 to +3625
if isinstance(dtype, BaseMaskedDtype):
    data, mask = self._mgr.as_array_masked()
    new_values = [arr_type(data[i], mask[i]) for i in range(self.shape[0])]
else:
    values = self.values
    new_values = [
        arr_type._from_sequence(row, dtype=dtype) for row in values
    ]
jorisvandenbossche (Member Author):

So this is the main change (it brings the largest part of the speed-up for transpose).

Comment on lines +10954 to +10967
# if len(df._mgr) > 0:
# common_dtype = find_common_type(list(df._mgr.get_dtypes()))
# is_masked_ea = isinstance(common_dtype, BaseMaskedDtype)
# is_np = isinstance(common_dtype, np.dtype)
# else:
# common_dtype = None

# if axis == 1 and common_dtype and is_masked_ea:
# data, mask = self._mgr.as_array_masked()
# ea2d = common_dtype.construct_array_type()(data, mask)
# result = ea2d._reduce(name, axis=axis, skipna=skipna, **kwds)
# labels = self._get_agg_axis(axis)
# result = self._constructor_sliced(result, index=labels, copy=False)
# return result
jorisvandenbossche (Member Author):

@rhshadrach this relates to #51923, and making use of the as_array_masked added in this PR could be a potential different way of tackling it.

Manually running one of the benchmarks you mentioned in #51955 (the bottom row of - 8.94±0s 24.3±0.3ms 0.00 stat_ops.FrameOps.time_op('mean', 'Int64', 1)), with the above uncommented, I get:

In [1]: values = np.random.randn(100000, 4)
   ...: df = pd.DataFrame(values.astype(int)).astype("Int64")

In [2]: %timeit df.mean(axis=1)
5.97 s ± 434 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   # <-- main
1.74 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   # <-- PR

Now, this is probably a bit too optimistic, because as_array_masked currently assumes that all columns have the same dtype; with the use of find_common_type here, it would need to handle some extra cases as well. But in general it should also give a way to speed up this specific case (while using the actual masked reduction implementations, which already support 2D with axis=1).

Member:

Now, this is probably a bit too optimistic, because the as_array_masked is currently assuming all the same dtypes, but with the use of find_common_type here, it would need to handle some extra cases as well.

As long as we're not doing inference, the transpose is guaranteed to be a single dtype. This is because two different dtypes in the transpose could only arise from two different dtypes in the rows of the original. So we can determine the common dtype, cast to a single dtype, and then transpose.
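This argument can be checked from the public API: cast to one masked dtype first, then transpose, and every column of the result is guaranteed to carry that dtype. A minimal sketch (not the PR's code path):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": pd.array([1, 2, None], dtype="Int64"),
        "b": pd.array([1.5, 2.5, 3.5], dtype="Float64"),
    }
)

# Determine/choose a common masked dtype, cast, then transpose:
# the transposed frame is homogeneous by construction.
dft = df.astype("Float64").T

print(dft.dtypes.unique())  # every column is Float64
```

With the cast done up front, a helper like as_array_masked only ever has to handle the single-dtype case.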

jorisvandenbossche (Member Author):

Doing a cast beforehand might indeed be the easiest, then as_array_masked doesn't need to handle the case of multiple dtypes itself.

@jbrockmendel (Member):

Is the concern about transpose specifically or more about the axis=1 reductions? If the latter, would a reduce_axis1 approach that avoids the transpose altogether be more robust? It'd be nice to avoid special-casing our internal EAs.

Can any of this be extended eventually to arrow-backed cases?

@rhshadrach (Member):

Is the concern about transpose specifically or more about the axis=1 reductions? If the latter, would a reduce_axis1 approach that avoids the transpose altogether be more robust?

For those ops where there is a simple online algorithm (e.g. mean, sum, stddev, variance, sem, prod, kurtosis), I do think this would be a good option to look into. However there will be ops for which this won't work (median, quantile, rank, mode, nunique).
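For the ops with a simple vectorized form, a row-wise reduction can be computed straight from the 2D data and mask arrays without any transpose. A minimal numpy sketch of an axis=1 mean, assuming the masked-array convention that True in the mask marks a missing value:

```python
import numpy as np

# 2D data plus a boolean mask (True = missing), as a masked array stores them
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
mask = np.array([[False, True, False],
                 [False, False, False]])

# Row-wise mean that skips masked entries, computed without any transpose:
# zero out the missing values, sum per row, divide by the valid count
sums = np.where(mask, 0.0, data).sum(axis=1)
counts = (~mask).sum(axis=1)
means = sums / counts

print(means)  # [2. 5.]
```

Ops like median or quantile need the full set of valid values per row, which is why the same trick does not directly apply to them.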

@jorisvandenbossche (Member Author):

Yes, I also mentioned that in #52016 (comment). We can certainly try to avoid the transpose altogether, but improving transpose itself might also be worth it (plus, as Richard just commented, there are methods where the transpose cannot easily be avoided).
(the main new function here that I added, as_array_masked, could also be used for both, see my inline comment above at #52083 (comment))

Can any of this be extended eventually to arrow-backed cases?

Yes, in general the arrow dtypes will suffer from the same problems for row-based operations (as long as Arrow doesn't add specific functionality for that). I think it could be quite easy to extend the approach here to work for arrow dtypes as well (converting their bitmask to a bytemask).
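The bitmask-to-bytemask conversion mentioned here can be sketched with numpy. Arrow validity bitmasks are LSB-ordered with a set bit meaning "valid", while the masked arrays use one byte per value with True meaning "missing", so the unpacked bits need to be inverted. A sketch under those assumptions:

```python
import numpy as np

# Arrow-style validity bitmask for 3 values: bits 0 and 2 set -> valid,
# bit 1 unset -> missing (LSB bit order, 1 = valid)
bitmask = np.array([0b00000101], dtype=np.uint8)
n_values = 3

# Unpack to one byte per value, trim the padding bits, then invert to get
# a pandas-style mask where True marks a missing value
valid = np.unpackbits(bitmask, bitorder="little")[:n_values].astype(bool)
bytemask = ~valid

print(bytemask)  # [False  True False]
```

The resulting bytemask could then feed the same data/mask fast path that this PR adds for the masked dtypes.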

@rhshadrach rhshadrach mentioned this pull request Apr 5, 2023
Comment on lines +1758 to +1761
# # TODO(CoW) handle case where resulting array is a view
# if len(self.blocks) == 0:
# arr = np.empty(self.shape, dtype=float)
# return arr.transpose()
Member:

@jorisvandenbossche - I don't follow this todo - can you elaborate?

@@ -2544,6 +2548,7 @@ def _from_arrays(
dtype=dtype,
verify_integrity=verify_integrity,
typ=manager,
is_1d_ea_only=is_1d_ea_only,
Member:

arrays_to_mgr already has a consolidate keyword; could we use that instead?

@github-actions (Contributor):

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label May 15, 2023
@mroeschke (Member):

Closing to clear the queue, but feel free to reopen when you have time to revisit

@mroeschke mroeschke closed this Aug 1, 2023
@mroeschke mroeschke added the Mothballed Temporarily-closed PR the author plans to return to label Aug 1, 2023