Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: Performance of DataFrame reduction ops with axis=1 #52689

Closed

Conversation

rhshadrach
Copy link
Member

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Siphoning off the largest impact from #52083.

shape = 250_000, 100
mask = pd.DataFrame(np.random.randint(0, 1, size=shape), dtype=pd.BooleanDtype())
%timeit mask.transpose()
# 4.28 s ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <-- main
# 1.12 s ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <-- PR

@rhshadrach rhshadrach added Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version Reduction Operations sum, mean, min, max, etc. labels Apr 16, 2023
@rhshadrach rhshadrach added this to the 2.0.1 milestone Apr 16, 2023
-------
arr : ndarray
"""
# # TODO(CoW) handle case where resulting array is a view
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche - I don't understand this TODO, can you give more details?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I suppose that with the currently implementation, the resulting arrays are never views but always copies? (since result_data and result_mask are newly created arrays (and then the values are copied in)
In that case I think you can just remove the comment

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that was just copied from the as_array implementation above, then can certainly be removed. Sorry for the confusion!

@jbrockmendel
Copy link
Member

I've been thinking about this and #52083. This is a workaround for lack of 2D support in MaskedArray (not unlike #52261). Maybe its time to revisit actually supporting 2D in MaskedArray?

@rhshadrach
Copy link
Member Author

I've been thinking about this and #52083. This is a workaround for lack of 2D support in MaskedArray (not unlike #52261). Maybe its time to revisit actually supporting 2D in MaskedArray?

Agreed this is a workaround, no opposition to 2D for masked dtypes.

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label May 26, 2023
@datapythonista datapythonista modified the milestones: 2.0.2, 2.0.3 May 26, 2023
@rhshadrach rhshadrach closed this May 27, 2023
@rhshadrach rhshadrach deleted the reduction_ops_perf_transpose branch September 27, 2023 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Reduction Operations sum, mean, min, max, etc. Regression Functionality that used to work in a prior pandas version Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants