Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "DEPR: make_block (#56422)" #56481

Merged
merged 1 commit into from
Jan 10, 2024
Merged

Conversation

jorisvandenbossche
Copy link
Member

This reverts commit b0ffccd.

See the discussion in #56422

@jorisvandenbossche jorisvandenbossche added Internals Related to non-user accessible pandas implementation Compat pandas objects compatability with Numpy or Python functions labels Dec 13, 2023
@jorisvandenbossche jorisvandenbossche added this to the 2.2 milestone Dec 21, 2023
@lithomas1
Copy link
Member

@jorisvandenbossche

Should this go in before the RC?

xref comment here #56422 (comment)

@jorisvandenbossche
Copy link
Member Author

Ideally yes

@jorisvandenbossche jorisvandenbossche added the Blocker Blocking issue or pull request for an upcoming release label Dec 22, 2023
@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Dec 22, 2023

I added an expanded comment on this issue (#56422 (comment)). And to clear, this is not a critical blocker for the RC (just for the final release). Nothing will break for people testing the RC if this is not in, but if they are running tests where they also use pyarrow, they will see a flood of additional warnings (that we can easily avoid by merging this), that will distract of actual relevant deprecation warnings we want them to check for the RC.

@lithomas1
Copy link
Member

cc @pandas-dev/pandas-core for more thoughts on this as this is the last remaining blocker for the release.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jan 8, 2024

I added an expanded comment on this issue (#56422 (comment)). And to clear, this is not a critical blocker for the RC (just for the final release). Nothing will break for people testing the RC if this is not in, but if they are running tests where they also use pyarrow, they will see a flood of additional warnings (that we can easily avoid by merging this), that will distract of actual relevant deprecation warnings we want them to check for the RC.

Is there a way to not issue the warning if coming from pyarrow, but issue the warning if another library uses make_block() ?

I understand that @jbrockmendel would like to "force" pyarrow to not use make_block() , but also am sympathetic to what @jorisvandenbossche says with respect to the performance implications for pyarrow and that there is not a reasonable alternative that doesn't affect pyarrow performance. So I don't think we can deprecate without that alternative.

@jbrockmendel
Copy link
Member

and that there is not a reasonable alternative that doesn't affect pyarrow performance

That is incorrect, several reasonable alternatives have been offered in #56422.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jan 8, 2024

and that there is not a reasonable alternative that doesn't affect pyarrow performance

That is incorrect, several reasonable alternatives have been offered in #56422.

My read of that discussion is that there is debate about whether the 6X performance hit is "reasonable".

@jbrockmendel
Copy link
Member

"6x" is less accurate than "100us"

@bashtage
Copy link
Contributor

bashtage commented Jan 8, 2024

Deprecating doesn't impose the performance penalty - it only imposes the message on users. Once can always improve performance of the blessed paths if this is a real bottle neck, but I do suspect that 35us vs 135 us is going to be noise in almost all cases.

It does seem that the incentive for pyarrow to change their usage is nil until this makes it into a release.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Jan 8, 2024

and that there is not a reasonable alternative that doesn't affect pyarrow performance

That is incorrect, several reasonable alternatives have been offered in #56422.

@jbrockmendel I would recommend that you re-read that discussion then. Yes, there are alternatives, and I also acknowledged that, but only ones (that are zero-copy) that can be used in the future, not right now (note that pyarrow cannot enable optional features like CoW, that's something for the user to do)

To be explicit: the alternative that is 400µs slower (I agree that a slowdown like this will mostly be noise, although have to check the impact of more dtypes/columns) is not possible right now for pyarrow to implement (with public APIs).

@mroeschke
Copy link
Member

I think the current deprecation is OK to leave in for 2.2, but this deprecation should not be enforced (or upgraded to a FutureWarning) without an agreed upon alternative (might be 4.0 timeline at the earliest). There may be other libraries besides pyarrow using this API that either can use an existing alternative or can help comment on what a suitable alternative looks like that satisfies pyarrow + any other library.

@jorisvandenbossche
Copy link
Member Author

Given that there is a lot of ongoing discussion about this (what's the exact alternative, what should be the timeline, etc), can we simply revert this for now so we can have this discussion before committing to certain changes?

This warning will appear in CI for every project that uses pandas and pyarrow (e.g. we already see that in geopandas). So this is annoying a lot of other projects, just because we are committing the deprecation a bit early.

@jbrockmendel
Copy link
Member

jbrockmendel commented Jan 8, 2024

Reverting it now means delaying enforcing it from 2025 to 2026. That is bad.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jan 8, 2024

@jorisvandenbossche wrote:

This warning will appear in CI for every project that uses pandas and pyarrow (e.g. we already see that in geopandas). So this is annoying a lot of other projects, just because we are committing the deprecation a bit early.

@jbrockmendel wrote:

Reverting it now means delaying enforcing it from 2024 to 2025. That is bad.

To me, the timing seems to be the critical difference in opinion. I think there is general agreement that make_block() should be removed at some point, which would require a deprecation warning, followed by removal. The question is when this should be done.

@jorisvandenbossche What timing do you propose for this deprecation and removal?

@jbrockmendel
Copy link
Member

would using a PendingDeprecationWarning address some of the CI issues?

@jorisvandenbossche
Copy link
Member Author

According to our 2-year support window of pyarrow, the earliest moment we can enforce this deprecation would be spring 2026. Even if we are laxer (as we currently are) and only support for 1 year, the earliest moment would be in spring 2025 (because the earliest version of pyarrow that can support this will be in spring 2024).
So the timeline of enforcing this deprecation is >1 year anyway. Given that, personally I wouldn't worry about delaying the initial deprecation a few months.

If we only change Deprecation->FutureWarning in 4.0, what does it matter that the DeprecationWarning is added in 2.2 or 3.0 or 3.1?

Reverting it now means delaying enforcing it from 2024 to 2025. That is bad.

And can you explain why this is "bad"? What we are deprecating here is a small helper function (only for external usage, not used internally) that has very little maintenance overhead (assumption based on the fact it was hardly touched the last three years). Keeping it a year longer doesn't seem much of an issue.

@jorisvandenbossche
Copy link
Member Author

would using a PendingDeprecationWarning address some of the CI issues?

Pytest shows both DeprecationWarning and PendingDeprecationWarning by default, so that would be no difference AFAIK

I think there is general agreement that make_block() should be removed at some point,

I have no problem with deprecating make_block, if there is a clear alternative for the valid use cases we are aware of. But that's the part where the details are still being discussed (the current main alternative is concat+reindex, but one could also imagine a new API to construct a DataFrame from arrays), something that ideally happens before actually doing the deprecation.

@WillAyd
Copy link
Member

WillAyd commented Jan 8, 2024

For the uninformed what huge benefit is there to a downstream library using make_block versus say constructing a dataframe with a dict of arrays?

@jorisvandenbossche
Copy link
Member Author

Zero-copy construction of 2D blocks. We currently don't have a public API that accepts 2D arrays (except for the case of a single dtype for your full DataFrame).

@jbrockmendel
Copy link
Member

Reverting it now means delaying enforcing it from 2024 to 2025. That is bad.

And can you explain why this is "bad"?

That was a typo on my part. Should read "from 2025 to 2026". It is bad because 2026 is a long way away and enforcing the deprecation blocks weaning pyarrow off of constructing Managers which blocks any potential changes in Managers (as demonstrated by ArrayManager never being workable in the pyarrow-touching IO tests)

@jbrockmendel
Copy link
Member

For the uninformed what huge benefit is there to a downstream library using make_block versus say constructing a dataframe with a dict of arrays?

Zero-copy construction of 2D blocks. We currently don't have a public API that accepts 2D arrays (except for the case of a single dtype for your full DataFrame).

Kind of. In many (all?) cases pyarrow already does a copy when converting its 1D arrays to 2D arrays. The reasoning that was given for this years ago was that this prevented post-construction performance hits bc pandas did silent consolidation. But silent consolidation was removed in 2.0, and there has been a lot of very frustrating goalpost-shifting.

@phofl
Copy link
Member

phofl commented Jan 8, 2024

I have no problem with deprecating make_block, if there is a clear alternative for the valid use cases we are aware of. But that's the part where the details are still being discussed (the current main alternative is concat+reindex, but one could also imagine a new API to construct a DataFrame from arrays), something that ideally happens before actually doing the deprecation.

So let's revert the deprecation for 2.2 and then reintroduce for 3.0 so that we can enforce this either in 4.0 (less likely) or 5.0 (more likely because of the support window) at the latest. We have to run some tests to see if concat + reindex is sufficient or if we need a more efficient path that takes a list of arrays directly, but that should be relatively easy I guess?

@jbrockmendel
Copy link
Member

My concern with that plan is that Joris will infinitely re-litigate "clear alternative for the valid use cases" a few months from now. As bashtage said, pyarrow has no incentive to actually change until this is in a release. pyarrow is not being a good citizen by refusing to use public APIs.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jan 8, 2024

My concern with that plan is that Joris will infinitely re-litigate "clear alternative for the valid use cases" a few months from now. As bashtage said, pyarrow has no incentive to actually change until this is in a release. pyarrow is not being a good citizen by refusing to use public APIs.

FWIW, it seems that the people coding that part of pyarrow are @jorisvandenbossche and @wesm so they are representing both communities on this topic.....

@jbrockmendel
Copy link
Member

Signing off for the day. I'll consent to reverting for 2.2 if there is firm commitment to deprecating in 3.0 and enforcing in 4.0. pyarrow shouldn't get to abuse our internals indefinitely over a few hundred microseconds.

@lithomas1
Copy link
Member

merging as we agreed on reverting the deprecation for 2.2 and deprecating again for 3.0

@lithomas1 lithomas1 merged commit 1acf0d6 into main Jan 10, 2024
56 checks passed
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jan 10, 2024
@WillAyd WillAyd deleted the revert-56422-depr-make_block branch January 10, 2024 19:10
@mroeschke mroeschke mentioned this pull request Jan 10, 2024
@lithomas1
Copy link
Member

Looks like I made CI red with this (should've rebased before merging).

Will take a look at the failures.

lithomas1 pushed a commit that referenced this pull request Jan 11, 2024
…") (#56814)

Backport PR #56481: Revert "DEPR: make_block (#56422)"

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocker Blocking issue or pull request for an upcoming release Compat pandas objects compatability with Numpy or Python functions Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants