-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REGR: ensure DataFrame.select_dtypes() returns a copy #48176
REGR: ensure DataFrame.select_dtypes() returns a copy #48176
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the fix @jorisvandenbossche !
the original patch was for performance this preserves? |
can u show the asv results |
Using the same code snippet as in #42611, I got:
The slowdown compared to main is fully due to the copy. All the other improvements from #42611 still give a 2x improvement compared to before. |
great ty this actually somewhat answers the question of how CoW would impact (asked in another issue) |
if we do end up adding |
pandas/core/frame.py
Outdated
@@ -4706,7 +4706,7 @@ def predicate(arr: ArrayLike) -> bool: | |||
|
|||
return True | |||
|
|||
mgr = self._mgr._get_data_subset(predicate) | |||
mgr = self._mgr._get_data_subset(predicate, copy=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of adding a keyword in the Manager method could we just do a .copy after _get_data_subset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good point. I was originally thinking to add it here to ensure that the manager can be smarter to know what needs to be copied and what not. But since the predicate is based on the block dtype, this can never split any blocks, so I suppose there will never be such a case where the subset might already be a copy. So yes, doing a .copy()
here is indeed much simpler.
There is one difference though: combine
(used by _get_data_subset
) doesn't do consolidation, while copy()
does. So if your original dataframe is not fully consolidated, the current PR keeps more of the performance improvement of #42611. Of course, at the cost of an unconsolidated return value and a potential later consolidation.
Either way is fine for me. Doing a simpler copy()
will also keep it closer to the previous behaviour since iloc
also consolidates I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's do the copy outside of internals
all green, I think all comments addressed. will merge later today if no objections. |
b4 1.5.0rc0 |
This comment was marked as resolved.
This comment was marked as resolved.
…pes() returns a copy) (#48219)
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.This was caused by the performance improvement in #42611, because
_mgr._get_data_subset
doesn't return a copy in contrast to indexing columns withiloc
. This just adds a copy option to_get_data_subset
, to keep all other improvements from #42611 (the performance obviously decreases again because of the additional copy, but it's still faster than before #42611)